Contents
2. Deep Convolutional Neural Networks with Spatial Regularization, Volume and Star-shape Priori for Image Segmentation [PDF] Abstract
3. Unconstrained Periocular Recognition: Using Generative Deep Learning Frameworks for Attribute Normalization [PDF] Abstract
5. Joint Encoding of Appearance and Motion Features with Self-supervision for First Person Action Recognition [PDF] Abstract
17. Towards Deep Machine Reasoning: a Prototype-based Deep Neural Network with Decision Tree Inference [PDF] Abstract
23. Weighted Average Precision: Adversarial Example Detection in the Visual Perception of Autonomous Vehicles [PDF] Abstract
27. Dynamic Error-bounded Lossy Compression (EBLC) to Reduce the Bandwidth Requirement for Real-time Vision-based Pedestrian Safety Applications [PDF] Abstract
32. Real-Time Object Detection and Recognition on Low-Compute Humanoid Robots using Deep Learning [PDF] Abstract
37. Driver Drowsiness Detection Model Using Convolutional Neural Networks Techniques for Android Application [PDF] Abstract
40. Unsupervised deep clustering for predictive texture pattern discovery in medical images [PDF] Abstract
44. Uncertainty Estimation for End-To-End Learned Dense Stereo Matching via Probabilistic Deep Learning [PDF] Abstract
45. Distribution Distillation Loss: Generic Approach for Improving Face Recognition from Hard Samples [PDF] Abstract
49. Post-Comparison Mitigation of Demographic Bias in Face Recognition Using Fair Score Normalization [PDF] Abstract
54. From Anchor Generation to Distribution Alignment: Learning a Discriminative Embedding Space for Zero-Shot Recognition [PDF] Abstract
55. UGRWO-Sampling: A modified random walk under-sampling approach based on graphs to imbalanced data classification [PDF] Abstract
56. A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling [PDF] Abstract
57. Segmenting unseen industrial components in a heavy clutter using rgb-d fusion and synthetic data [PDF] Abstract
61. MS-Net: Multi-Site Network for Improving Prostate Segmentation with Heterogeneous MRI Data [PDF] Abstract
62. Weakly Supervised Attention Pyramid Convolutional Neural Network for Fine-Grained Visual Classification [PDF] Abstract
65. Unlabeled Data Deployment for Classification of Diabetic Retinopathy Images Using Knowledge Transfer [PDF] Abstract
75. Ensemble of Deep Convolutional Neural Networks for Automatic Pavement Crack Detection and Measurement [PDF] Abstract
91. Adversarial TCAV -- Robust and Effective Interpretation of Intermediate Layers in Neural Networks [PDF] Abstract
93. Ultra High Fidelity Image Compression with $\ell_\infty$-constrained Encoding and Deep Decoding [PDF] Abstract
95. A Deep Learning Approach to Automate High-Resolution Blood Vessel Reconstruction on Computerized Tomography Images With or Without the Use of Contrast Agent [PDF] Abstract
107. Improving the Adversarial Robustness of Transfer Learning via Noisy Feature Distillation [PDF] Abstract
109. SS-Auto: A Single-Shot, Automatic Structured Weight Pruning Framework of DNNs with Ultra-High Efficiency [PDF] Abstract
Abstracts
1. Upper, Middle and Lower Region Learning for Facial Action Unit Detection [PDF] Back to Contents
Yao Xia
Abstract: Facial action unit (AU) detection is fundamental to facial expression analysis. Because AUs occur only in small areas of the face, region-based learning has been widely recognized as useful for AU detection. Most region-based studies focus on the small region where an AU occurs. Focusing on a specific region helps eliminate the influence of identity, but risks losing information, and it is difficult to strike a balance. In this study, I propose a simple strategy: I divide the face into three large regions (upper, middle, and lower) and group AUs based on where they occur. I propose a new end-to-end deep learning framework named the three-region-based attention network (TRA-Net). After extracting a global feature, TRA-Net uses a hard attention module to extract three feature maps, each of which covers only a specific region. Each region-specific feature map is fed to an independent branch. In each branch, three consecutive soft attention modules extract higher-level features for final AU detection. On the DISFA dataset, this model achieves the highest F1 scores for the detection of AU1, AU2, and AU4, and the highest accuracy in comparison with state-of-the-art methods.
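The hard attention step described above, carving a global feature map into upper, middle, and lower bands, can be sketched as follows. This is an illustrative stand-in, not TRA-Net's actual module: the band boundaries at 1/3 and 2/3 of the height are an assumption, and the paper's attention is learned rather than fixed.

```python
import numpy as np

def hard_region_attention(feat, bands=(1 / 3, 2 / 3)):
    """Split a global feature map into upper/middle/lower region maps.

    feat: (C, H, W) feature array. Each returned map keeps only the rows
    of its horizontal band and zeroes the rest (a hypothetical stand-in
    for TRA-Net's hard attention module)."""
    C, H, W = feat.shape
    cut1, cut2 = int(H * bands[0]), int(H * bands[1])
    regions = []
    for lo, hi in ((0, cut1), (cut1, cut2), (cut2, H)):
        mask = np.zeros((1, H, 1), dtype=feat.dtype)
        mask[:, lo:hi, :] = 1.0
        regions.append(feat * mask)  # zero out rows outside the band
    return regions  # [upper, middle, lower]
```

Each of the three masked maps would then feed its own branch of soft attention modules.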
2. Deep Convolutional Neural Networks with Spatial Regularization, Volume and Star-shape Priori for Image Segmentation [PDF] Back to Contents
Jun Liu, Xiangyue Wang, Xue-cheng Tai
Abstract: We use Deep Convolutional Neural Networks (DCNNs) for image segmentation problems. DCNNs can extract features from natural images well. However, the classification functions in existing CNN architectures are simple and lack the capability to handle important spatial information in the way that many well-known traditional variational models do. Priors such as spatial regularity, volume constraints, and object shape cannot be handled well by existing DCNNs. We propose a novel Soft Threshold Dynamics (STD) framework which can easily integrate many spatial priors of classical variational models into DCNNs for image segmentation. The novelty of our method is to interpret the softmax activation function as a dual variable in a variational problem, so that many spatial priors can be imposed in the dual space. From this viewpoint, we can build an STD-based framework which enables the outputs of DCNNs to satisfy special priors such as spatial regularity, volume constraints, and a star-shape prior. The proposed method is a general mathematical framework and can be applied to any semantic segmentation DCNN. To show the efficiency and accuracy of our method, we applied it to the popular DeepLabV3+ image segmentation network, and the experimental results show that our method works efficiently on data-driven image segmentation DCNNs.
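One of the priors named above, the volume constraint, can be illustrated in the dual (softmax) space. The sketch below alternates a per-class rescaling with a per-pixel normalization, Sinkhorn-style, so that each class's total mass approaches a target volume while every pixel's label distribution stays on the simplex. This is an illustrative stand-in under our own assumptions, not the paper's exact STD scheme.

```python
import numpy as np

def volume_constrained_softmax(logits, target_vol, iters=50):
    """Entropy-regularized labeling with a volume prior.

    logits: (K, N) class scores for N pixels; target_vol: (K,) desired
    total mass per class (summing to N). Alternates per-class scaling
    with per-pixel normalization, a Sinkhorn-style sketch of imposing
    a prior on the dual (softmax) variable."""
    u = np.exp(logits - logits.max(axis=0, keepdims=True))  # positive scores
    for _ in range(iters):
        u *= (target_vol / u.sum(axis=1))[:, None]  # match class volumes
        u /= u.sum(axis=0, keepdims=True)           # each pixel sums to 1
    return u
```

After convergence the output still looks like a softmax labeling, but the class masses respect the volume prior.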
3. Unconstrained Periocular Recognition: Using Generative Deep Learning Frameworks for Attribute Normalization [PDF] Back to Contents
Luiz A. Zanlorensi, Hugo Proença, David Menotti
Abstract: Ocular biometric systems working in unconstrained environments usually face the problem of small within-class compactness caused by the multiple factors that jointly degrade the quality of the obtained data. In this work, we propose an attribute normalization strategy based on deep learning generative frameworks that reduces the variability of the samples used in pairwise comparisons without reducing their discriminability. The proposed method can be seen as a preprocessing step that contributes to data regularization and improves recognition accuracy, while being fully agnostic to the recognition strategy used. As a proof of concept, we consider the "eyeglasses" and "gaze" factors, comparing the performance of five different recognition methods with and without the proposed normalization strategy. We also introduce a new dataset for unconstrained periocular recognition, composed of images acquired by mobile devices and particularly suited to assessing the impact of "wearing eyeglasses" on recognition effectiveness. Our experiments were performed on two different datasets and support the usefulness of our attribute normalization scheme for improving recognition performance.
4. StickyPillars: Robust feature matching on point clouds using Graph Neural Networks [PDF] Back to Contents
Martin Simon, Kai Fischer, Stefan Milz, Christian Tobias Witt, Horst-Michael Gross
Abstract: StickyPillars introduces a sparse feature matching method on point clouds. It is the first approach applying Graph Neural Networks on point clouds to stick points of interest together. The feature estimation and assignment rely on the optimal transport problem, where the cost is based on the neural network itself. We utilize a Graph Neural Network for context aggregation with the aid of multi-head self- and cross-attention. In contrast to image-based feature matching methods, the architecture learns feature extraction in an end-to-end manner. Hence, the approach does not rely on handcrafted features. Our method outperforms state-of-the-art matching algorithms while providing real-time capability.
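The cross-attention context aggregation mentioned above can be sketched in a few lines. This is a minimal sketch under stated assumptions: identity projections stand in for the learned per-head linear maps, and the head count and shapes are illustrative, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, k_feats, num_heads=4):
    """Multi-head cross-attention between two point-cloud descriptor sets.

    q_feats: (N, D), k_feats: (M, D), with D divisible by num_heads.
    Returns (N, D) context-aggregated features; each output row is a
    convex combination of descriptors from the other cloud."""
    N, D = q_feats.shape
    d = D // num_heads
    out = np.empty_like(q_feats)
    for h in range(num_heads):
        q = q_feats[:, h * d:(h + 1) * d]
        k = k_feats[:, h * d:(h + 1) * d]
        attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (N, M) weights
        out[:, h * d:(h + 1) * d] = attn @ k           # aggregate context
    return out
```

Self-attention is the same operation with both arguments taken from the same cloud; the resulting descriptors then feed the optimal-transport assignment.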
5. Joint Encoding of Appearance and Motion Features with Self-supervision for First Person Action Recognition [PDF] Back to Contents
Mirco Planamente, Andrea Bottino, Barbara Caputo
Abstract: Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from a first-person point of view. An open challenge is how to cope with the limited amount of motion information available about the action itself, as opposed to the more widely investigated third-person action recognition scenario. When focusing on manipulation tasks, videos tend to record only parts of the movement, making the understanding of the manipulated objects and their context crucial. Previous works addressed this issue with two-stream architectures: one stream dedicated to modeling the appearance of the objects involved in the action, the other to extracting motion features from optical flow. In this paper, we argue that the features from these two information channels should be learned jointly to better capture the spatio-temporal correlations between them. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion segmentation task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.
6. RePose: Learning Deep Kinematic Priors for Fast Human Pose Estimation [PDF] Back to Contents
Hossam Isack, Christian Haene, Cem Keskin, Sofien Bouaziz, Yuri Boykov, Shahram Izadi, Sameh Khamis
Abstract: We propose a novel efficient and lightweight model for human pose estimation from a single image. Our model is designed to achieve competitive results at a fraction of the number of parameters and computational cost of various state-of-the-art methods. To this end, we explicitly incorporate part-based structural and geometric priors in a hierarchical prediction framework. At the coarsest resolution, and in a manner similar to classical part-based approaches, we leverage the kinematic structure of the human body to propagate convolutional feature updates between the keypoints or body parts. Unlike classical approaches, we adopt end-to-end training to learn this geometric prior through feature updates from data. We then propagate the feature representation at the coarsest resolution up the hierarchy to refine the predicted pose in a coarse-to-fine fashion. The final network effectively models the geometric prior and intuition within a lightweight deep neural network, yielding state-of-the-art results for a model of this size on two standard datasets, Leeds Sports Pose and MPII Human Pose.
7. 6DoF Object Pose Estimation via Differentiable Proxy Voting Loss [PDF] Back to Contents
Xin Yu, Zheyu Zhuang, Piotr Koniusz, Hongdong Li
Abstract: Estimating the 6DoF pose of an object from a single image is very challenging due to occlusions or textureless appearances. Vector-field-based keypoint voting has demonstrated its effectiveness and superiority in tackling these issues. However, direct regression of vector fields neglects that the distances between pixels and keypoints also dramatically affect the deviations of hypotheses. In other words, small errors in direction vectors may generate severely deviated hypotheses when pixels are far away from a keypoint. In this paper, we aim to reduce such errors by incorporating the distances between pixels and keypoints into our objective. To this end, we develop a simple yet effective differentiable proxy voting loss (DPVL) which mimics the hypothesis selection in the voting procedure. By exploiting our voting loss, we are able to train our network in an end-to-end manner. Experiments on widely used datasets, i.e., LINEMOD and Occlusion LINEMOD, show that our DPVL improves pose estimation performance significantly and speeds up training convergence.
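The intuition above, that a pixel's angular error matters more the farther the pixel is from the keypoint, can be made concrete by penalizing the perpendicular distance from the ground-truth keypoint to each pixel's voting line. The sketch below follows that idea; it is a simplified illustration, not the paper's exact loss (DPVL additionally uses a smooth-L1 penalty and operates per keypoint inside the network).

```python
import numpy as np

def proxy_voting_loss(pixels, dirs, keypoint):
    """Proxy voting loss sketch in the spirit of DPVL.

    pixels: (N, 2) pixel coordinates, dirs: (N, 2) predicted direction
    vectors, keypoint: (2,) ground-truth keypoint. The voting line of
    pixel p with unit direction d is p + t * d; its distance to the
    keypoint grows with both the angular error and the pixel-keypoint
    distance, which is exactly the effect the loss is meant to capture."""
    d = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    rel = keypoint - pixels                                 # (N, 2)
    # |rel x d| (2-D cross product) = point-to-line distance for unit d
    dist = np.abs(rel[:, 0] * d[:, 1] - rel[:, 1] * d[:, 0])
    return dist.mean()
```

A pixel whose direction points exactly at the keypoint contributes zero, regardless of how far away it is.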
8. Hierarchical Multi-Process Fusion for Visual Place Recognition [PDF] Back to Contents
Stephen Hausler, Michael Milford
Abstract: Combining multiple complementary techniques has long been regarded as a way to improve performance. In visual localization, multi-sensor fusion, multi-process fusion of a single sensing modality, and even combinations of different localization techniques have been shown to result in improved performance. However, merely fusing together different localization techniques does not account for their varying performance characteristics. In this paper we present a novel hierarchical localization system that explicitly benefits from three varying characteristics of localization techniques: the distribution of their localization hypotheses, their appearance- and viewpoint-invariant properties, and the resulting differences in where in an environment each system works well and where it fails. We show how two techniques deployed hierarchically work better than in parallel fusion, and how combining two different techniques works better than two levels of a single technique, even when the single technique has superior individual performance; we develop two- and three-tier hierarchical structures that progressively improve localization performance. Finally, we develop a stacked hierarchical framework where localization hypotheses from techniques with complementary characteristics are concatenated at each layer, significantly improving retention of the correct hypothesis through to the final localization stage. Using two challenging datasets, we show the proposed system outperforming state-of-the-art techniques.
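The core hierarchical idea, one technique shortlisting candidates for another to decide among, can be sketched as a two-tier pipeline. The function and scorer names below are hypothetical stand-ins for two complementary recognition techniques, and the shortlist size `k` is an assumed parameter; this is not the paper's full stacked framework.

```python
def hierarchical_localization(query, database, rank_coarse, rank_fine, k=20):
    """Two-tier hierarchical place recognition sketch.

    rank_coarse / rank_fine: callables scoring the query against a
    candidate place (higher is better). The coarse tier shortlists the
    top-k candidates; the fine tier, with different failure modes,
    decides among them, so each technique is used where it is strong."""
    coarse = sorted(database, key=lambda p: rank_coarse(query, p), reverse=True)
    shortlist = coarse[:k]  # coarse tier prunes the search space
    return max(shortlist, key=lambda p: rank_fine(query, p))
```

A third tier, or concatenating hypotheses from several complementary scorers at each layer as the paper proposes, extends this pattern in the obvious way.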