Abstracts

1. Embedded Systems and Computer Vision Techniques utilized in Spray Painting Robots: A Review
Soham Shah, Siddhi Vinayak Pandey, Archit Sorathiya, Raj Sheth, Alok Kumar Singh, Jignesh Thaker
Abstract: The advent of the era of machines has limited human interaction, and the presence of machines has increased over the last decade. The demand for more effective, durable and reliable robots has risen sharply as well. This paper covers the embedded-system and computer vision methodologies, techniques and innovations used in the field of spray painting robots. There have been many advances in painting robots used for high-rise buildings, wall painting, road marking, etc. The review focuses on image processing, computational and computer vision techniques that can be applied in such products to increase their efficiency. Image analysis, filtering, enhancement, object detection, edge detection methods, path and localization methods, and the fine-tuning of parameters are discussed in depth for use when developing such products. Dynamic system design, which reduces human interaction, supports environmental sustainability and yields better quality of work, is discussed in detail. Embedded systems, comprising micro-controllers, processors, communication devices, sensors and actuators, and the software that drives them, are explained for end-to-end development and for improving the accuracy and precision of spray painting robots.

2. Goal-GAN: Multimodal Trajectory Prediction Based on Goal Position Estimation
Patrick Dendorfer, Aljoša Ošep, Laura Leal-Taixé
Abstract: In this paper, we present Goal-GAN, an interpretable and end-to-end trainable model for human trajectory prediction. Inspired by human navigation, we model the task of trajectory prediction as an intuitive two-stage process: (i) goal estimation, which predicts the most likely target positions of the agent, followed by a (ii) routing module which estimates a set of plausible trajectories that route towards the estimated goal. We leverage information about the past trajectory and visual context of the scene to estimate a multi-modal probability distribution over the possible goal positions, which is used to sample a potential goal during the inference. The routing is governed by a recurrent neural network that reacts to physical constraints in the nearby surroundings and generates feasible paths that route towards the sampled goal. Our extensive experimental evaluation shows that our method establishes a new state-of-the-art on several benchmarks while being able to generate a realistic and diverse set of trajectories that conform to physical constraints.
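
To make the goal-estimation stage described above more concrete, here is a minimal Python sketch of sampling a goal cell from a multi-modal probability map. The score grid, the softmax normalisation, and the names `softmax` and `sample_goal` are assumptions made for illustration only; the abstract does not say how Goal-GAN parameterises or samples its distribution, and the routing module is not modelled here.

```python
import math
import random

def softmax(scores):
    """Turn raw per-cell scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_goal(score_grid, rng):
    """Sample one (row, col) goal cell from a grid of unnormalised scores."""
    rows, cols = len(score_grid), len(score_grid[0])
    flat = [s for row in score_grid for s in row]
    probs = softmax(flat)
    idx = rng.choices(range(rows * cols), weights=probs, k=1)[0]
    return divmod(idx, cols)  # flat index -> (row, col)

# Hypothetical 3x4 score map with two modes (e.g. two exits of a scene).
scores = [
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 0.0, 0.0, 0.0],
]
rng = random.Random(0)
goals = [sample_goal(scores, rng) for _ in range(200)]
```

Because goals are drawn from the whole distribution rather than taken as its single maximum, repeated sampling can land on either mode, which is what allows a set of diverse trajectories to be generated downstream.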

3. AIM 2020 Challenge on Image Extreme Inpainting
Evangelos Ntavelis, Andrés Romero, Siavash Bigdeli, Radu Timofte
Abstract: This paper reviews the AIM 2020 challenge on extreme image inpainting. This report focuses on proposed solutions and results for two different tracks on extreme image inpainting: classical image inpainting and semantically guided image inpainting. The goal of track 1 is to inpaint a considerably large part of the image using no supervision but the context. Similarly, the goal of track 2 is to inpaint the image by having access to the entire semantic segmentation map of the image to inpaint. The challenge had 88 and 74 participants, respectively. 11 and 6 teams competed in the final phase of the challenge, respectively. This report gauges current solutions and sets a benchmark for future extreme image inpainting methods.

4. Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks
Kun Yuan, Quanquan Li, Dapeng Chen, Aojun Zhou, Junjie Yan
Abstract: One practice of employing deep neural networks is to apply the same architecture to all the input instances. However, a fixed architecture may not be representative enough for data with high diversity. To promote the model capacity, existing approaches usually employ larger convolutional kernels or deeper network structure, which may increase the computational cost. In this paper, we address this issue by proposing the Dynamic Graph Network (DG-Net). The network learns the instance-aware connectivity, which creates different forward paths for different instances. Specifically, the network is initialized as a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent the connection paths. We generate edge weights by a learnable module, the router, and select the edges whose weights are larger than a threshold to adjust the connectivity of the neural network structure. Instead of using the same path of the network, DG-Net aggregates features dynamically in each node, which allows the network to have more representation ability. To facilitate the training, we represent the network connectivity of each sample in an adjacency matrix. The matrix is updated to aggregate features in the forward pass, cached in the memory, and used for gradient computing in the backward pass. We verify the effectiveness of our method with several static architectures, including MobileNetV2, ResNet, ResNeXt, and RegNet. Extensive experiments are performed on ImageNet classification and COCO object detection, which show the effectiveness and generalization ability of our approach.
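
The edge-selection idea in this abstract can be sketched in a few lines of Python. This is only a toy under stated assumptions: features are scalars, `block` is a stand-in ReLU rather than a convolutional block, the router output is given as a hypothetical dictionary `edge_logits` passed through a sigmoid, and the threshold of 0.5 is arbitrary. None of these details come from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def block(v):
    """Stand-in for a convolutional block (here just a ReLU on a scalar)."""
    return max(0.0, v)

def forward(x, num_nodes, edge_logits, threshold=0.5):
    """Run one instance through a complete DAG whose edges are gated.

    edge_logits[(i, j)] with i < j is a hypothetical per-instance router
    output. Returns the node outputs and the per-instance adjacency matrix.
    """
    adjacency = [[0.0] * num_nodes for _ in range(num_nodes)]
    outputs = [x]  # node 0 simply carries the input
    for j in range(1, num_nodes):
        aggregated = 0.0
        for i in range(j):  # every earlier node is a candidate parent
            weight = sigmoid(edge_logits[(i, j)])
            if weight > threshold:  # keep only sufficiently strong edges
                adjacency[i][j] = weight
                aggregated += weight * outputs[i]
        outputs.append(block(aggregated))
    return outputs, adjacency

edge_logits = {(0, 1): 2.0, (0, 2): -1.0, (1, 2): 1.0,
               (0, 3): 0.0, (1, 3): -2.0, (2, 3): 3.0}
outputs, adjacency = forward(1.0, 4, edge_logits)
kept_edges = [(i, j) for i in range(4) for j in range(4) if adjacency[i][j] > 0]
```

A different set of logits would yield a different adjacency matrix and therefore a different forward path, which is the sense in which the connectivity is instance-aware.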

5. Efficient Colon Cancer Grading with Graph Neural Networks
Franziska Lippoldt
Abstract: Dealing with the application of grading colorectal cancer images, this work proposes a three-step pipeline for predicting cancer levels from a histopathology image. The overall model outperforms other state-of-the-art methods on the colorectal cancer grading data set and shows excellent performance on the extended colorectal cancer grading set. The performance improvements can be attributed to two main factors. First, the feature selection and graph augmentation methods described here are spatially aware but independent of overall pixel position. Second, the graph size in terms of nodes becomes stable with respect to the model's prediction and accuracy for sufficiently large models. The graph neural network itself consists of three convolutional blocks and linear layers, which is a rather simple design compared to other networks for this application.

6. Pre-Training by Completing Point Clouds
Hanchen Wang, Qi Liu, Xiangyu Yue, Joan Lasenby, Matthew J. Kusner
Abstract: There has recently been a flurry of exciting advances in deep learning models on point clouds. However, these advances have been hampered by the difficulty of creating labelled point cloud datasets: sparse point clouds often have unclear label identities for certain points, while dense point clouds are time-consuming to annotate. Inspired by mask-based pre-training in the natural language processing community, we propose a novel pre-training mechanism for point clouds. It works by masking occluded points that result from observing the point cloud at different camera views. It then optimizes a completion model that learns how to reconstruct the occluded points, given the partial point cloud. In this way, our method learns a pre-trained representation that can identify the visual constraints inherently embedded in real-world point clouds. We call our method Occlusion Completion (OcCo). We demonstrate that OcCo learns representations that improve generalization on downstream tasks over prior pre-training methods, that transfer to different datasets, that reduce training time, and that improve labelled sample efficiency. Our code and dataset are available at this https URL

7. Semi-Supervised Learning for Multi-Task Scene Understanding by Neural Graph Consensus
Marius Leordeanu, Mihai Pirvu, Dragos Costea, Alina Marcu, Emil Slusanschi, Rahul Sukthankar
Abstract: We address the challenging problem of semi-supervised learning in the context of multiple visual interpretations of the world by finding consensus in a graph of neural networks. Each graph node is a scene interpretation layer, while each edge is a deep net that transforms one layer at one node into another from a different node. During the supervised phase edge networks are trained independently. During the next unsupervised stage edge nets are trained on the pseudo-ground truth provided by consensus among multiple paths that reach the nets' start and end nodes. These paths act as ensemble teachers for any given edge and strong consensus is used for high-confidence supervisory signal. The unsupervised learning process is repeated over several generations, in which each edge becomes a "student" and also part of different ensemble "teachers" for training other students. By optimizing such consensus between different paths, the graph reaches consistency and robustness over multiple interpretations and generations, in the face of unknown labels. We give theoretical justifications of the proposed idea and validate it on a large dataset. We show how prediction of different representations such as depth, semantic segmentation, surface normals and pose from RGB input could be effectively learned through self-supervised consensus in our graph. We also compare to state-of-the-art methods for multi-task and semi-supervised learning and show superior performance.

8. Homography Estimation with Convolutional Neural Networks Under Conditions of Variance
David Niblick, Avinash Kak
Abstract: Planar homography estimation is foundational to many computer vision problems, such as Simultaneous Localization and Mapping (SLAM) and Augmented Reality (AR). However, conditions of high variance confound even the state-of-the-art algorithms. In this report, we analyze the performance of two recently published methods using Convolutional Neural Networks (CNNs) that are meant to replace the more traditional feature-matching based approaches to the estimation of homography. Our evaluation of the CNN based methods focuses particularly on measuring the performance under conditions of significant noise, illumination shift, and occlusion. We also measure the benefits of training CNNs to varying degrees of noise. Additionally, we compare the effect of using color images instead of grayscale images for inputs to CNNs. Finally, we compare the results against baseline feature-matching based homography estimation methods using SIFT, SURF, and ORB. We find that CNNs can be trained to be more robust against noise, but at a small cost to accuracy in the noiseless case. Additionally, CNNs perform significantly better in conditions of extreme variance than their feature-matching based counterparts. With regard to color inputs, we conclude that with no change in the CNN architecture to take advantage of the additional information in the color planes, the difference in performance using color inputs or grayscale inputs is negligible. About the CNNs trained with noise-corrupted inputs, we show that training a CNN to a specific magnitude of noise leads to a "Goldilocks Zone" with regard to the noise levels where that CNN performs best.

9. Hard Negative Mixing for Contrastive Learning
Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, Diane Larlus
Abstract: Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies either at the image or the feature level improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing the memory size, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.
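
As a rough illustration of mixing hard negatives at the feature level, the Python sketch below ranks negatives by their similarity to a query embedding, then forms synthetic negatives as convex combinations of random pairs among the hardest ones. The function name `mix_hard_negatives`, the parameters `top_k` and `num_synthetic`, the uniform mixing coefficient, and the final L2 normalisation are all illustrative assumptions; the abstract does not give the exact procedure.

```python
import math
import random

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mix_hard_negatives(query, negatives, top_k, num_synthetic, rng):
    """Synthesize extra negatives by mixing pairs of the hardest ones."""
    # "Hard" here means most similar to the query (largest dot product).
    hardest = sorted(negatives, key=lambda n: dot(query, n), reverse=True)[:top_k]
    synthetic = []
    for _ in range(num_synthetic):
        a, b = rng.sample(hardest, 2)  # two distinct hard negatives
        alpha = rng.random()           # assumed uniform mixing coefficient
        mixed = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
        synthetic.append(l2_normalize(mixed))  # back onto the unit sphere
    return synthetic

# Hypothetical 8-dimensional unit-norm embeddings.
rng = random.Random(0)
def random_unit():
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(8)])

query = random_unit()
negatives = [random_unit() for _ in range(32)]
synthetic = mix_hard_negatives(query, negatives, top_k=8, num_synthetic=4, rng=rng)
```

The only extra work per query is a sort and a few vector interpolations, which is consistent with the abstract's claim that such mixing can be computed on the fly with little overhead.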

10. DecAug: Augmenting HOI Detection via Decomposition
Yichen Xie, Hao-Shu Fang, Dian Shao, Yong-Lu Li, Cewu Lu
Abstract: Human-object interaction (HOI) detection requires a large amount of annotated data. Current algorithms suffer from insufficient training samples and category imbalance within datasets. To increase data efficiency, in this paper, we propose an efficient and effective data augmentation method called DecAug for HOI detection. Based on our proposed object state similarity metric, object patterns across different HOIs are shared to augment local object appearance features without changing their state. Further, we shift spatial correlation between humans and objects to other feasible configurations with the aid of a pose-guided Gaussian Mixture Model while preserving their interactions. Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets for two advanced models. Specifically, interactions with fewer samples enjoy more notable improvement. Our method can be easily integrated into various HOI detection models with negligible extra computational consumption. Our code will be made publicly available.

11. DIRV: Dense Interaction Region Voting for End-to-End Human-Object Interaction Detection
Hao-Shu Fang, Yichen Xie, Dian Shao, Cewu Lu
Abstract: In recent years, human-object interaction (HOI) detection has achieved impressive advances. However, conventional two-stage methods are usually slow in inference. On the other hand, existing one-stage methods mainly focus on the union regions of interactions, which introduce unnecessary visual information as disturbances to HOI detection. To tackle the problems above, we propose a novel one-stage HOI detection approach DIRV in this paper, based on a new concept called interaction region for the HOI problem. Unlike previous methods, our approach concentrates on the densely sampled interaction regions across different scales for each human-object pair, so as to capture the subtle visual features that are most essential to the interaction. Moreover, in order to compensate for the detection flaws of a single interaction region, we introduce a novel voting strategy that makes full use of those overlapped interaction regions in place of conventional Non-Maximal Suppression (NMS). Extensive experiments on two popular benchmarks, V-COCO and HICO-DET, show that our approach outperforms existing state-of-the-art methods by a large margin with the highest inference speed and lightest network architecture. We achieved 56.1 mAP on V-COCO without additional input. Our code will be made publicly available.

12. RDCNet: Instance segmentation with a minimalist recurrent residual network
Raphael Ortiz, Gustavo de Medeiros, Antoine H.F.M. Peters, Prisca Liberali, Markus Rempfler
Abstract: Instance segmentation is a key step for quantitative microscopy. While several machine learning based methods have been proposed for this problem, most of them rely on computationally complex models that are trained on surrogate tasks. Building on recent developments towards end-to-end trainable instance segmentation, we propose a minimalist recurrent network called recurrent dilated convolutional network (RDCNet), consisting of a shared stacked dilated convolution (sSDC) layer that iteratively refines its output and thereby generates interpretable intermediate predictions. It is light-weight and has few critical hyperparameters, which can be related to physical aspects such as object size or density. We perform a sensitivity analysis of its main parameters and we demonstrate its versatility on 3 tasks with different imaging modalities: nuclear segmentation of H&E slides, of 3D anisotropic stacks from light-sheet fluorescence microscopy and leaf segmentation of top-view images of plants. It achieves state-of-the-art results on 2 of the 3 datasets.

13. Uncertainty driven probabilistic voxel selection for image registration
Boris N. Oreshkin, Tal Arbel
Abstract: This paper presents a novel probabilistic voxel selection strategy for medical image registration in time-sensitive contexts, where the goal is aggressive voxel sampling (e.g. using less than 1% of the total number) while maintaining registration accuracy and low failure rate. We develop a Bayesian framework whereby, first, a voxel sampling probability field (VSPF) is built based on the uncertainty on the transformation parameters. We then describe a practical, multi-scale registration algorithm, where, at each optimization iteration, different voxel subsets are sampled based on the VSPF. The approach maximizes accuracy without committing to a particular fixed subset of voxels. The probabilistic sampling scheme developed is shown to manage the tradeoff between the robustness of traditional random voxel selection (by permitting more exploration) and the accuracy of fixed voxel selection (by permitting a greater proportion of informative voxels).
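
The idea of drawing a different small voxel subset at every optimization iteration from a sampling probability field can be sketched as follows. This is an assumption-laden toy, not the authors' algorithm: the field is a flat list of made-up weights, the function name `sample_voxels` is hypothetical, and the weighted sampling without replacement uses the Efraimidis-Spirakis key method, which is a substitute chosen here because the abstract does not specify a sampling scheme.

```python
import random

def sample_voxels(prob_field, k, rng):
    """Draw k distinct voxel indices, favouring high-probability voxels.

    Efraimidis-Spirakis scheme: give each voxel the key u ** (1 / w) with
    u uniform in [0, 1) and w its weight, then keep the k largest keys.
    """
    keyed = []
    for idx, w in enumerate(prob_field):
        if w > 0:  # zero-weight voxels can never be selected
            keyed.append((rng.random() ** (1.0 / w), idx))
    keyed.sort(reverse=True)
    return [idx for _, idx in keyed[:k]]

# Hypothetical field: 100 "informative" voxels weighted 50x above the rest.
rng = random.Random(0)
field = [50.0] * 100 + [1.0] * 900
# One fresh subset per optimization iteration, each using 1% of the voxels.
subsets = [sample_voxels(field, 10, rng) for _ in range(5)]
informative = sum(1 for s in subsets for i in s if i < 100)
```

Resampling at each iteration keeps some exploration of low-weight voxels, while the weighting concentrates most of the budget on the voxels the field marks as informative, which mirrors the robustness versus accuracy tradeoff the abstract describes.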

14. Group Equivariant Stand-Alone Self-Attention For Vision
David W. Romero, Jean-Baptiste Cordonnier
Abstract: We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the group considered. Since the group acts on the positional encoding directly, group equivariant self-attention networks (GSA-Nets) are steerable by nature. Our experiments on vision benchmarks demonstrate consistent improvements of GSA-Nets over non-equivariant self-attention networks.

15. Taking Modality-free Human Identification as Zero-shot Learning
Zhizhe Liu, Xingxing Zhang, Zhenfeng Zhu, Shuai Zheng, Yao Zhao, Jian Cheng
Abstract: Human identification is an important topic in event detection, person tracking, and public security. There have been numerous methods proposed for human identification, such as face identification, person re-identification, and gait identification. Typically, existing methods predominantly classify a queried image to a specific identity in an image gallery set (I2I). This is seriously limited for the scenario where only a textual description of the query or an attribute gallery set is available in a wide range of video surveillance applications (A2I or I2A). However, very few efforts have been devoted towards modality-free identification, i.e., identifying a query in a gallery set in a scalable way. In this work, we take an initial attempt, and formulate such a novel Modality-Free Human Identification (named MFHI) task as a generic zero-shot learning model in a scalable way. Meanwhile, it is capable of bridging the visual and semantic modalities by learning a discriminative prototype of each identity. In addition, the semantics-guided spatial attention is enforced on visual modality to obtain representations with both high global category-level and local attribute-level discrimination. Finally, we design and conduct an extensive group of experiments on two common challenging identification tasks, including face identification and person re-identification, demonstrating that our method outperforms a wide variety of state-of-the-art methods on modality-free human identification.

16. RISA-Net: Rotation-Invariant Structure-Aware Network for Fine-Grained 3D Shape Retrieval
Rao Fu, Jie Yang, Jiawei Sun, Fang-Lue Zhang, Yu-Kun Lai, Lin Gao
Abstract: Fine-grained 3D shape retrieval aims to retrieve 3D shapes similar to a query shape in a repository with models belonging to the same class, which requires shape descriptors to be capable of representing detailed geometric information to discriminate shapes with globally similar structures. Moreover, 3D objects can be placed with arbitrary position and orientation in real-world applications, which further requires shape descriptors to be robust to rigid transformations. The shape descriptions used in existing 3D shape retrieval systems fail to meet the above two criteria. In this paper, we introduce a novel deep architecture, RISA-Net, which learns rotation invariant 3D shape descriptors that are capable of encoding fine-grained geometric information and structural information, and thus achieve accurate results on the task of fine-grained 3D object retrieval. RISA-Net extracts a set of compact and detailed geometric features part-wisely and discriminatively estimates the contribution of each semantic part to shape representation. Furthermore, our method is able to learn the importance of geometric and structural information of all the parts when generating the final compact latent feature of a 3D shape for fine-grained retrieval. We also build and publish a new 3D shape dataset with sub-class labels for validating the performance of fine-grained 3D shape retrieval methods. Qualitative and quantitative experiments show that our RISA-Net outperforms state-of-the-art methods on the fine-grained object retrieval task, demonstrating its capability in geometric detail extraction. The code and dataset are available at: this https URL.

17. DOTS: Decoupling Operation and Topology in Differentiable Architecture Search
Yu-Chao Gu, Yun Liu, Yi Yang, Yu-Huan Wu, Shao-Ping Lu, Ming-Ming Cheng
Abstract: Differentiable Architecture Search (DARTS) has attracted extensive attention due to its efficiency in searching for cell structures. However, DARTS mainly focuses on the operation search, leaving the cell topology implicitly depending on the searched operation weights. Hence, a problem is raised: can cell topology be well represented by the operation weights? The answer is negative because we observe that the operation weights fail to indicate the performance of cell topology. In this paper, we propose to Decouple the Operation and Topology Search (DOTS), which decouples the cell topology representation from the operation weights to make an explicit topology search. DOTS is achieved by defining an additional cell topology search space besides the original operation search space. Within the DOTS framework, we propose group annealing operation search and edge annealing topology search to bridge the optimization gap between the searched over-parameterized network and the derived child network. DOTS is efficient and only costs 0.2 and 1 GPU-day to search the state-of-the-art cell architectures on CIFAR and ImageNet, respectively. By further searching for the topology of DARTS' searched cell, we can improve DARTS' performance significantly. The code will be publicly available.

18. MGD-GAN: Text-to-Pedestrian generation through Multi-Grained Discrimination
Shengyu Zhang, Donghui Wang, Zhou Zhao, Siliang Tang, Di Xie, Fei Wu
Abstract: In this paper, we investigate the problem of text-to-pedestrian synthesis, which has many potential applications in art, design, and video surveillance. Existing methods for text-to-bird/flower synthesis are still far from solving this fine-grained image generation problem, due to the complex structure and heterogeneous appearance that pedestrians naturally take on. To this end, we propose the Multi-Grained Discrimination enhanced Generative Adversarial Network, which capitalizes on a human-part-based Discriminator (HPD) and a self-cross-attended (SCA) global Discriminator in order to capture the coherence of the complex body structure. A fine-grained word-level attention mechanism is employed in the HPD module to enforce diversified appearance and vivid details. In addition, two pedestrian generation metrics, named Pose Score and Pose Variance, are devised to evaluate the generation quality and diversity, respectively. We conduct extensive experiments and ablation studies on the caption-annotated pedestrian dataset, CUHK Person Description Dataset. The substantial improvement over the various metrics demonstrates the efficacy of MGD-GAN on the text-to-pedestrian synthesis scenario.

19. Image-based underwater 3D reconstruction for Cultural Heritage: from image collection to 3D. Critical steps and considerations
Dimitrios Skarlatos, Panagiotis Agrafiotis
Abstract: Underwater Cultural Heritage (CH) sites are widespread, from ruins along coastlines to shipwrecks in deep water. The documentation and preservation of this heritage is an obligation of mankind, dictated also by international treaties such as the Convention on the Protection of the Underwater Cultural Heritage, which fosters the use of "non-destructive techniques and survey methods in preference over the recovery of objects". However, submerged CH lacks protection and monitoring compared with CH on land, and nowadays recording and documenting it, for digital preservation as well as dissemination through VR to the wider public, is of the utmost importance. At the same time, it is the most difficult to document, owing to inherent restrictions posed by the environment. For creating highly detailed textured 3D models, optical sensors and photogrammetric techniques seem to be the best solution. This chapter discusses critical aspects of all phases of the image-based underwater 3D reconstruction process, from data acquisition and data preparation using colour restoration and colour enhancement algorithms to Structure from Motion (SfM) and Multi-View Stereo (MVS) techniques, to produce an accurate, precise and complete 3D model for a number of applications.
摘要：水下文化遗产（CH）网站被广泛传播;从废墟中海岸线长达深沉船。该遗产的文件和保存是人类的义务，也规定国际条约喜欢上了保护水下文化的Her-itage的公约，其促进使用“非破坏性的技术和勘测甲基-ODS优先在对象的恢复”。然而，淹没CH缺乏保护以及关于土地CH监测和记录今天和记录，对数字保存和传播通过VR公众的广泛，是最重要的。同时，这是最困难的记录它，由于该ENVIRON-换货造成的固有限制。为了打造高质感的详细三维模型，光学传感器和摄影技术似乎是最好的解决办法。本章DIS-cusses基于水下3D reconstruc-重刑处理图像的各个阶段的关键环节，从数据采集和使用彩色restora，重刑和彩色增强算法，从运动（SFM）和多视点立体结构数据准备（MVS）技术以产生一个准确，精确和完整的三维模型用于许多应用。

20. Multiple Infrared Small Targets Detection based on Hierarchical Maximal Entropy Random Walk
Chaoqun Xia, Xiaorun Li, Liaoying Zhao, Shuhan Chen
Abstract: The technique of detecting multiple dim and small targets with low signal-to-clutter ratios (SCR) is very important for infrared search and tracking systems. In this paper, we establish a detection method derived from maximal entropy random walk (MERW) to robustly detect multiple small targets. Initially, we introduce the primal MERW and analyze the feasibility of applying it to small target detection. However, the original weight matrix of the MERW is sensitive to interferences. Therefore, a specific weight matrix is designed for the MERW on the principle of enhancing the characteristics of small targets and suppressing strong clutter. Moreover, the primal MERW has a critical limitation of strong bias towards the most salient small target. To achieve multiple small targets detection, we develop a hierarchical version of the MERW method. Based on the hierarchical MERW (HMERW), we propose a small target detection method as follows. First, a filtering technique is used to smooth the infrared image. Second, an output map is obtained by importing the filtered image into the HMERW. Then, a coefficient map is constructed to fuse the stationary distribution map of the HMERW. Finally, an adaptive threshold is used to segment multiple small targets from the fusion map. Extensive experiments on practical data sets demonstrate that the proposed method is superior to the state-of-the-art methods in terms of target enhancement, background suppression and multiple small targets detection.
摘要：检测与低信号杂波比（SCR）多个弱小目标的技术是红外搜索和跟踪系统非常重要。在本文中，我们建立了从最大熵随机游走（MERW）衍生稳健地检测多个小目标的检测方法。首先，我们介绍了原始MERW并分析其应用到小目标探测的可行性。然而，MERW的原始权重矩阵是对干扰敏感。因此，一个特定的权重矩阵是专为增强对小目标的特性和较强的抑制杂波的原则MERW。此外，原始MERW具有强烈的偏见的最显着的小目标的关键限制。为了实现多个小目标检测，我们开发的MERW方法的分级版本。基于分级MERW（HMERW），我们提出了一个小目标检测方法如下。首先，滤波技术被用来平滑红外图像。第二，输出地图由导入滤波图像进HMERW获得。然后，系数映射被构造以熔合HMERW的静止dirtribution地图。最后，自适应阈值被从融合地图用于分割多个小目标。实际数据集广泛的实验表明，该方法优于国家的最先进的方法在目标增强，背景抑制和多个小目标检测的方面。
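
To make the random-walk step concrete, here is a minimal sketch of the stationary distribution of a plain (non-hierarchical) maximal entropy random walk. It relies on the standard result that, for a symmetric non-negative weight matrix of a connected graph, the MERW stationary distribution is the squared principal eigenvector. The toy graph and the function name `merw_stationary` are illustrative assumptions; the weight matrix designed in the paper and its hierarchical extension are not reproduced here.

```python
import numpy as np

def merw_stationary(weights: np.ndarray) -> np.ndarray:
    """Stationary distribution of a maximal entropy random walk.

    `weights` must be a symmetric, non-negative matrix of a connected graph.
    """
    eigvals, eigvecs = np.linalg.eigh(weights)  # eigenvalues in ascending order
    psi = eigvecs[:, -1]                        # principal eigenvector
    pi = psi ** 2                               # MERW: pi_i = psi_i^2
    return pi / pi.sum()

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pi = merw_stationary(W)
```

On this toy graph each inner node receives about 2.6 times the probability of an end node, compared with 2 times for an ordinary random walk, which illustrates the strong concentration on the most salient region that the abstract describes as a limitation.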

21. Self-Play Reinforcement Learning for Fast Image Retargeting
Nobukatsu Kajiura, Satoshi Kosugi, Xueting Wang, Toshihiko Yamasaki
Abstract: In this study, we address image retargeting, which is a task that adjusts input images to arbitrary sizes. In one of the best-performing methods called MULTIOP, multiple retargeting operators were combined and retargeted images at each stage were generated to find the optimal sequence of operators that minimized the distance between original and retargeted images. The limitation of this method is its tremendous processing time, which severely hinders its practical use. Therefore, the purpose of this study is to find the optimal combination of operators within a reasonable processing time; we propose a method of predicting the optimal operator for each step using a reinforcement learning agent. The technical contributions of this study are as follows. Firstly, we propose a reward based on self-play, which will be insensitive to the large variance in the content-dependent distance measured in MULTIOP. Secondly, we propose to dynamically change the loss weight for each action to prevent the algorithm from falling into a local optimum and from choosing only the most frequently used operator in its training. Our experiments showed that we achieved multi-operator image retargeting with three orders of magnitude less processing time and the same quality as the original multi-operator-based method, which was the best-performing algorithm in retargeting tasks.
摘要：在这项研究中，我们在处理图像重定向，这是一个任务，调整输入图像到任意大小。在称为MULTIOP表现最佳的方法中的一个，多个重新定向运营商合并并重定向产生地发现最小化原始和重定向图像之间的距离运营商的最优序列在每个阶段的图像。这种方法的限制是它的巨大的处理时间，这严重禁止其实际使用。因此，本研究的目的是找到一个合理的处理时间内操作人员的最佳组合;我们提出预测的最优操作者用于使用强化学习代理程序的每个步骤的方法。这项研究的技术成果如下。首先，我们提出了一种基于自我发挥的奖励，这将是不敏感的MULTIOP测量的依赖于内容的距离较大的变化。其次，我们建议动态改变失重每个动作，以防止算法陷入局部最优，在其训练只选择最常用的操作。我们的实验表明，我们实现了多运营商通过图像大小和相同的质量与原始多运营商为基础的方法，这是在重定向任务的最佳执行算法的三个数量较少的处理时间重定向。

22. OpenTraj: Assessing Prediction Complexity in Human Trajectories Datasets
Javad Amirian, Bingqing Zhang, Francisco Valente Castro, Juan Jose Baldelomar, Jean-Bernard Hayet, Julien Pettre
Abstract: Human Trajectory Prediction (HTP) has gained much momentum in recent years and many solutions have been proposed to solve it. Proper benchmarking being a key issue for comparing methods, this paper addresses the question of evaluating how complex a given dataset is with respect to the prediction problem. For assessing a dataset's complexity, we define a series of indicators around three concepts: Trajectory predictability; Trajectory regularity; Context complexity. We compare the most common datasets used in HTP in the light of these indicators and discuss what this may imply for the benchmarking of HTP algorithms. Our source code is released on
摘要：人类轨迹预测（HTP）获得了巨大的势头，在过去几年中，许多解决方案已经提出来解决它。正确的基准是比较方法的一个关键问题，本文地址评估的问题有多么复杂相对于预测问题给定数据集。为了评估一个数据集的复杂性，我们定义了一系列的围绕着三个概念指标：轨迹预测;弹道规律;语境的复杂性。我们比较HTP中最常用的数据集，这些指标的光互相讨论一下这可能意味着对HTP算法的基准。我们的源代码发布

23. Remote Sensing Image Scene Classification with Self-Supervised Paradigm under Limited Labeled Samples
Chao Tao, Ji Qi, Weipeng Lu, Hao Wang, Haifeng Li
Abstract: With the development of deep learning, supervised learning methods perform well in remote sensing images (RSIs) scene classification. However, supervised learning requires a huge amount of annotated data for training. When labeled samples are not sufficient, the most common solution is to fine-tune models pre-trained on a large natural image dataset (e.g. ImageNet). However, this learning paradigm is not a panacea, especially when the target remote sensing images (e.g. multispectral and hyperspectral data) have different imaging mechanisms from RGB natural images. To solve this problem, we introduce a new self-supervised learning (SSL) mechanism to obtain a high-performance pre-training model for RSIs scene classification from large unlabeled data. Experiments on three commonly used RSIs scene classification datasets demonstrated that this new learning paradigm outperforms the traditional dominant ImageNet pre-trained model. Moreover, we analyze the impacts of several factors in SSL on RSIs scene classification tasks, including the choice of self-supervised signals, the domain difference between the source and target dataset, and the amount of pre-training data. The insights distilled from our studies can help to foster the development of SSL in the remote sensing community. Since SSL could learn from massive unlabeled RSIs, which are extremely easy to obtain, it will be a potentially promising way to alleviate dependence on labeled samples and thus efficiently solve many problems, such as global mapping.
摘要：随着深度学习的发展，监督学习方法在遥感图像（RSIS）场景分类表现良好。然而，监督学习需要注释的训练数据的数量巨大。当标记的样品是不够的，最常见的解决方案是将微调用大的自然图像数据集的预训练模型（例如ImageNet）。然而，该学习模式也不是万能的，特别是当目标遥感图像（例如多光谱和高光谱数据）有从RGB自然图像不同的成像机制。为了解决这个问题，我们引入新的自我监督学习（SSL）机制来获得从大标签数据RSIS场景分类高性能前的训练模式。在常用的三种RSIS场景分类数据集的实验表明，这种新的学习模式优于传统训练前主导ImageNet模型。此外，我们分析的几个因素在SSL上RSIS场景分类任务的影响，包括选择自我监督的信号，源和目标数据集之间的域差异，训练前的数据量。从我们的研究，蒸馏除去见解，能够有助于促进SSL的发展，在遥感社区。由于SSL可以从大量的未标记其RSIS是非常容易获得学习，这将是一个潜在的承诺，以减轻标记的样品依赖性方式从而有效解决了很多问题，如全球映射。

24. Rotated Ring, Radial and Depth Wise Separable Radial Convolutions
Wolfgang Fuhl, Enkelejda Kasneci
Abstract: Simple image rotations significantly reduce the accuracy of deep neural networks. Moreover, training with all possible rotations increases the data set, which also increases the training duration. In this work, we address trainable rotation invariant convolutions as well as the construction of nets, since fully connected layers can only be rotation invariant with a one-dimensional input. On the one hand, we show that our approach is rotationally invariant for different models and on different public data sets. We also discuss the influence of purely rotational invariant features on accuracy. The rotationally adaptive convolution models presented in this work are more computationally intensive than normal convolution models. Therefore, we also present a depth wise separable approach with radial convolution. Link to CUDA code this https URL
摘要：简单的图像旋转显著减少深层神经网络的精度。此外，与所有可能的旋转培养增加了数据集，这也增加了训练时间。在这项工作中，我们要解决训练的旋转不变卷积以及网的建设，因为完全连接层只能是旋转不变的一维输入。在一方面，我们表明，我们的做法是针对不同车型和不同的公共数据集旋转不变。我们还讨论了对精度纯粹旋转不变的特性的影响。在这项工作中提出的旋转自适应卷积模型计算量比正常卷积模型密集。因此，我们还提出了一种深度径向卷积明智可分离的做法。链接CUDA代码这个HTTPS URL

25. Weight and Gradient Centralization in Deep Neural Networks
Wolfgang Fuhl, Enkelejda Kasneci
Abstract: Batch normalization is currently the most widely used variant of internal normalization for deep neural networks. Additional work has shown that the normalization of weights and additional conditioning as well as the normalization of gradients further improve the generalization. In this work, we combine several of these methods and thereby increase the generalization of the networks. The advantage of the newer methods compared to the batch normalization is not only increased generalization, but also that these methods only have to be applied during training and, therefore, do not influence the running time during use. Link to CUDA code this https URL
摘要：批标准化是目前深层神经网络内部正常化的最广泛使用的变体。额外的工作表明，重量和附加的调理的正常化以及梯度的归一化进一步提高概括。在这项工作中，我们结合一些的这些方法，从而提高网络的泛化。相比于批标准化的新的方法的优点是不仅增加了推广，而且这些方法只能在训练过程中被应用，因此，在使用过程中不影响运行时间。链接CUDA代码这个HTTPS URL
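
As a rough illustration of what centralizing weights and gradients can look like, the sketch below subtracts the per-output-unit mean from a gradient before a plain SGD step and then re-centres the updated weight. The choice of axes (everything except the first, output, axis), the learning rate, and the function name `centralize` are assumptions based on the common formulation of centralization; the abstract does not spell out the exact combination the authors use.

```python
import numpy as np

def centralize(t: np.ndarray) -> np.ndarray:
    """Subtract, per output unit, the mean over all remaining axes."""
    if t.ndim < 2:            # biases and other vectors are left untouched
        return t
    axes = tuple(range(1, t.ndim))
    return t - t.mean(axis=axes, keepdims=True)

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 3, 3, 3))  # hypothetical conv kernel (out, in, kh, kw)
grad = rng.normal(size=weight.shape)    # hypothetical gradient of the loss

lr = 0.1                                # assumed learning rate
weight = centralize(weight - lr * centralize(grad))
```

Both operations act only on the training update, which matches the abstract's point that such methods add no cost at inference time.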

26. CAPTION: Correction by Analyses, POS-Tagging and Interpretation of Objects using only Nouns
Leonardo Anjoletto Ferreira, Douglas De Rizzo Meneghetti, Paulo Eduardo Santos
Abstract: Recently, Deep Learning (DL) methods have shown excellent performance in image captioning and visual question answering. However, despite their performance, DL methods do not learn the semantics of the words that are being used to describe a scene, making it difficult to spot incorrect words used in captions or to interchange words that have similar meanings. This work proposes a combination of DL methods for object detection and natural language processing to validate image captions. We test our method on the FOIL-COCO data set, since it provides correct and incorrect captions for various images using only objects represented in the MS-COCO image data set. Results show that our method has a good overall performance, in some cases similar to human performance.
摘要：近日，深学习（DL）方法已经显示出图像字幕和视觉答疑出色的表现。然而，尽管他们的表现，DL方法不学习那些被用来描述一个场景的话语义，使其很难被发现在标题或具有类似含义交换字使用不正确的话。这项工作提出了目标检测和自然语言处理，以验证图像的字幕DL方法的组合。我们在箔COCO数据集测试我们的方法，因为它提供了正确和不正确的字幕仅使用对象表示在MS-COCO图像数据集各种图像。结果表明，该方法具有良好的综合性能，在某些情况下，类似于人的表现。

27. Discriminative and Generative Models for Anatomical Shape Analysis on Point Clouds with Deep Neural Networks
Benjamin Gutierrez Becker, Ignacio Sarasua, Christian Wachinger
Abstract: We introduce deep neural networks for the analysis of anatomical shapes that learn a low-dimensional shape representation from the given task, instead of relying on hand-engineered representations. Our framework is modular and consists of several computing blocks that perform fundamental shape processing tasks. The networks operate on unordered point clouds and provide invariance to similarity transformations, avoiding the need to identify point correspondences between shapes. Based on the framework, we assemble a discriminative model for disease classification and age regression, as well as a generative model for the accurate reconstruction of shapes. In particular, we propose a conditional generative model, where the condition vector provides a mechanism to control the generative process. For instance, it makes it possible to assess shape variations specific to a particular diagnosis when the diagnosis is passed as side information. Besides working on single shapes, we introduce an extension for the joint analysis of multiple anatomical structures, where the simultaneous modeling of multiple structures can lead to a more compact encoding and a better understanding of disorders. We demonstrate the advantages of our framework in comprehensive experiments on real and synthetic data. The key insights are that (i) learning a shape representation specific to the given task yields higher performance than alternative shape descriptors, (ii) multi-structure analysis is both more efficient and more accurate than single-structure analysis, and (iii) point clouds generated by our model capture morphological differences associated with Alzheimer's disease, to the point that they can be used to train a discriminative model for disease classification. Our framework naturally scales to the analysis of large datasets, giving it the potential to learn characteristic variations in large populations.
摘要：介绍深层神经网络为学习从给定的任务低维形状表示解剖学形状的分析，而不是依赖于手工设计的表示。我们的架构是模块化的，由若干个计算模块执行基本形状的处理任务。网络上无序点云操作和相似变换提供不变性，避免了需要确定形状之间的对应点。框架基础上，我们对组装疾病分类与年龄回归一个判别模型，以及为形状的accruate重建生成模型。特别是，我们提出了一种有条件生成模型，其中，所述条件向量提供了一种机制来控制生成过程。例如，它能够评估形状变化的特定于特定的诊断，传递，当它作为辅助信息。接下来对单一形状的工作中，我们介绍了多个解剖结构，其中多个结构的同时造型可以导致更紧凑的编码，并更好地了解疾病的联合分析的延伸。我们证明在真实和合成数据进行全面的实验我们的框架的优点。的重要见解是：（ⅰ）学习形状表示特定于给定任务的产率性能高于替代形状描述符，（ⅱ）多结构分析是更为有效和比单结构分析更精确，和（iii）点我们对有关老年性痴呆模型捕获的形态差异，给点产生的云，它们可以被用于训练疾病分类一个判别模型。我们的框架自然地扩展到大型数据集的分析，给它来学习大量人口特征变化的可能性。

28. Deep4Air: A Novel Deep Learning Framework for Airport Airside Surveillance
Phat Thai, Sameer Alam, Nimrod Lilith, Phu N. Tran, Binh Nguyen Thanh
Abstract: An airport runway and taxiway (airside) area is a highly dynamic and complex environment featuring interactions between different types of vehicles (speed and dimension), under varying visibility and traffic conditions. Airport ground movements are deemed safety-critical activities, and safe-separation procedures must be maintained by Air Traffic Controllers (ATCs). Large airports with complicated runway-taxiway systems use advanced ground surveillance systems. However, these systems have inherent limitations and a lack of real-time analytics. In this paper, we propose a novel computer-vision based framework, namely "Deep4Air", which can not only augment the ground surveillance systems via the automated visual monitoring of runways and taxiways for aircraft location, but also provide real-time speed and distance analytics for aircraft on runways and taxiways. The proposed framework includes an adaptive deep neural network for efficiently detecting and tracking aircraft. The experimental results show an average precision of detection and tracking of up to 99.8% on simulated data with validations on surveillance videos from the digital tower at George Bush Intercontinental Airport. The results also demonstrate that "Deep4Air" can locate aircraft positions relative to the airport runway and taxiway infrastructure with high accuracy. Furthermore, aircraft speed and separation distance are monitored in real-time, providing enhanced safety management.
摘要：机场跑道和滑行道（禁区）地区是一个高度动态和复杂的环境特色不同类型的车辆（速度和尺寸）之间的相互作用，变化的知名度和流量条件下。机场地面活动被认为是安全关键活动，安全分离程序必须由空中交通管制（ATCS）来维持。与复杂的跑道，滑行道系统的大型机场使用先进的地面监视系统。然而，这些系统具有固有局限性，缺乏实时分析的。在本文中，我们提出了一个新的计算机视觉基础的框架，即“Deep4Air”，这不仅可以增加通过跑道和滑行道的飞机位置自动可视监控地面监视系统，而且还提供实时的速度和距离分析对跑道和滑行道的飞机。所提出的框架包括用于有效地检测和跟踪飞机的自适应深神经网络。实验结果表明，检测的平均精度，并从乔治布什洲际机场数字塔监控视频验证了跟踪到99.8％的模拟数据。研究结果还表明，“Deep4Air”可以找到相对于机场跑道和滑行道的基础设施高精度的飞机位置。此外，飞机的速度和分隔距离的实时监测，提供增强的安全管理。

29. PrognoseNet: A Generative Probabilistic Framework for Multimodal Position Prediction given Context Information
Thomas Kurbiel, Akash Sachdeva, Kun Zhao, Markus Buehren
Abstract: The ability to predict multiple possible future positions of the ego-vehicle given the surrounding context while also estimating their probabilities is key to safe autonomous driving. Most of the current state-of-the-art Deep Learning approaches are trained on trajectory data to achieve this task. However, trajectory data captured by sensor systems is highly imbalanced, since by far most of the trajectories follow straight lines with an approximately constant velocity. This poses a huge challenge for the task of predicting future positions, which is inherently a regression problem. Current state-of-the-art approaches alleviate this problem only by major preprocessing of the training data, e.g. resampling, clustering into anchors etc. In this paper, we propose an approach which reformulates the prediction problem as a classification task, allowing for powerful tools, e.g. focal loss, to combat the imbalance. To this end, we design a generative probabilistic model consisting of a deep neural network with a Mixture of Gaussian head. A smart choice of the latent variable allows for the reformulation of the log-likelihood function as a combination of a classification problem and a much simplified regression problem. The output of our model is an estimate of the probability density function of future positions, hence allowing for prediction of multiple possible positions while also estimating their probabilities. The proposed approach can easily incorporate context information and does not require any preprocessing of the data.
摘要：预测给周围的情况下，同时估计其概率出自车辆的多种可能的未来位置的能力是关键，安全自动驾驶。目前大多数国家的最先进的深度学习方法都训练有素的轨迹数据来实现这一任务。通过传感器系统捕获然而轨迹数据是高度均衡，因为迄今为止大多数轨迹遵循直线以大致一定的速度。这给预测未来位置的任务，这本身就是一个回归问题的巨大挑战。当前状态的最先进的方法仅通过训练数据的主要预处理，例如缓解这一问题重采样，汇聚到锚等。在本文中，我们提出了一种方法，其重新表述的预测问题作为分类的任务，允许功能强大的工具，例如焦点损失，打击不平衡。为此，我们设计出具有高斯头的混合物组成的深层神经网络的生成概率模型。潜在变量的一个聪明的选择允许对数似然函数的再形成作为分类问题的组合，并且另一个具有更加简化的回归问题。我们的模型的输出是未来位置的概率密度函数的估计，从而允许多个可能位置中的预测，同时还估计它们的概率。所提出的方法可以很容易地将上下文信息，并且不需要任何数据预处理。
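
The split into a classification term and a regression term can be illustrated with a generic Gaussian mixture: once the component index k is treated as known, the log-likelihood decomposes into log π_k (a cross-entropy-style term) plus the log density of component k. The sketch below shows this decomposition next to the full mixture density. It assumes isotropic 2-D Gaussians and uses made-up parameters; the latent variable actually chosen in the paper may differ, and all names here are hypothetical.

```python
import numpy as np

def log_gaussian(y, mu, sigma):
    """Log density of an isotropic Gaussian with mean mu and std sigma at y."""
    d = y.shape[-1]
    sq = np.sum((y - mu) ** 2, axis=-1)
    return -0.5 * sq / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.sum(np.exp(z)))

def complete_log_likelihood(y, k, logits, mus, sigmas):
    """Log-likelihood when the component index k is given."""
    classification = log_softmax(logits)[k]              # which mode?
    regression = log_gaussian(y, mus[k], sigmas[k])      # where within the mode?
    return classification + regression

def mixture_log_density(y, logits, mus, sigmas):
    """Log of the full mixture density, summed over all components."""
    comp = log_softmax(logits) + log_gaussian(y, mus, sigmas)
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))

# Hypothetical network outputs for three modes (straight, left, right)
logits = np.array([2.0, 0.0, -1.0])
mus = np.array([[10.0, 0.0], [8.0, 3.0], [8.0, -3.0]])
sigmas = np.array([0.5, 1.0, 1.0])
y = np.array([9.8, 0.1])                                 # observed future position

per_mode = [complete_log_likelihood(y, k, logits, mus, sigmas) for k in range(3)]
full = mixture_log_density(y, logits, mus, sigmas)
```

Because the classification term is an ordinary log-probability over modes, class-imbalance tools such as focal loss can be applied to it directly, which is the benefit the abstract points to.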

30. Block-wise Image Transformation with Secret Key for Adversarially Robust Defense
MaungMaung AprilPyone, Hitoshi Kiya
Abstract: In this paper, we propose a novel defensive transformation that enables us to maintain a high classification accuracy under the use of both clean images and adversarial examples for adversarially robust defense. The proposed transformation is a block-wise preprocessing technique with a secret key applied to input images. We developed three algorithms to realize the proposed transformation: Pixel Shuffling, Bit Flipping, and FFX Encryption. Experiments were carried out on the CIFAR-10 and ImageNet datasets by using both black-box and white-box attacks with various metrics including adaptive ones. The results show that, for the first time, the proposed defense achieves high accuracy close to that of using clean images even under adaptive attacks. In the best-case scenario, a model trained by using images transformed by FFX Encryption (block size of 4) yielded an accuracy of 92.30% on clean images and 91.48% under PGD attack with a noise distance of 8/255, which is close to the non-robust accuracy (95.45%) for the CIFAR-10 dataset, and it yielded an accuracy of 72.18% on clean images and 71.43% under the same attack, which is also close to the standard accuracy (73.70%) for the ImageNet dataset. Overall, all three proposed algorithms are demonstrated to outperform state-of-the-art defenses including adversarial training whether or not a model is under attack.
摘要：在本文中，我们提出了一种新的防御转变，使我们能够保持使用既干净图像和对抗性的例子一较高下的分类准确度adversarially强大的防御能力。所提出的转变是一个密钥输入图像逐块的预处理技术。我们开发了三种算法来实现所提出的转型：像素置换，位翻转，和FFX加密。实验通过使用具有各种度量，包括自适应那些既黑盒和白盒攻击者的CIFAR-10和ImageNet数据集进行的。结果表明，所提出的辩护实现高精确度接近，即使在首次适应性攻击使用干净的图像。在最好的情况下，模型通过使用由加密FFX（4块尺寸）变换的图像训练，得到的干净图像92.30％和91.48％的准确度下PGD攻击用的二百五十五分之八噪声距离，这是接近到非鲁棒精度（95.45％）为CIFAR-10的数据集，和它产生的干净图像72.18％和71.43％相同的攻击下的精度，这也是接近标准精度（73.70％）为ImageNet数据集。总体而言，这三个建议算法证明国家的最先进的跑赢大市的防御，包括对抗训练模式是否是受到了攻击。
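
A minimal sketch of the Pixel Shuffling idea follows: the image is cut into non-overlapping blocks and the values inside every block are permuted with a permutation derived from a secret key. It assumes that one key-seeded permutation is shared by all blocks and that colour channels are shuffled together with pixel positions; these details, the function name `block_shuffle`, and the use of NumPy's generator as the key-to-permutation mapping are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def block_shuffle(img: np.ndarray, key: int, block: int = 4,
                  inverse: bool = False) -> np.ndarray:
    """Permute the values inside each block x block tile using a keyed permutation."""
    h, w, c = img.shape
    assert h % block == 0 and w % block == 0, "image must tile evenly"
    perm = np.random.default_rng(key).permutation(block * block * c)
    if inverse:
        perm = np.argsort(perm)          # inverse permutation undoes the shuffle
    # Rearrange to (rows of tiles, cols of tiles, values per tile)
    x = img.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(h // block, w // block, -1)
    x = x[..., perm]                     # same keyed permutation for every tile
    # Restore the original (h, w, c) layout
    x = x.reshape(h // block, w // block, block, block, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

img = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)  # toy image
enc = block_shuffle(img, key=42)
dec = block_shuffle(enc, key=42, inverse=True)
```

In the defence described above, the same keyed transformation would be applied to both training and test images, so a model trained on transformed inputs keeps its accuracy while an attacker without the key works against a mismatched input space.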

31. Online Knowledge Distillation via Multi-branch Diversity Enhancement
Zheng Li, Ying Huang, Defang Chen, Tianren Luo, Ning Cai, Zhigeng Pan
Abstract: Knowledge distillation is an effective method to transfer the knowledge from the cumbersome teacher model to the lightweight student model. Online knowledge distillation uses the ensembled prediction results of multiple student models as soft targets to train each student model. However, the homogenization problem will lead to difficulty in further improving model performance. In this work, we propose a new distillation method to enhance the diversity among multiple student models. We introduce Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network by integrating rich semantic information contained in the last block of multiple student models. Furthermore, we use the Classifier Diversification (CD) loss function to strengthen the differences between the student models and deliver a better ensemble result. Extensive experiments proved that our method significantly enhances the diversity among student models and brings better distillation performance. We evaluate our method on three image classification datasets: CIFAR-10/100 and CINIC-10. The results show that our method achieves state-of-the-art performance on these datasets.
摘要：知识蒸馏是从繁琐的老师模型转换的知识传授给学生轻量化模型的有效方法。在线知识蒸馏使用多个学生机型的合奏的预测结果作为软指标，培养每个学生的模式。然而，同质化问题将导致进一步提高模型的性能困难。在这项工作中，我们提出了一种新的蒸馏方法，以提高学生的多种型号的多样性。我们引入特征融合模块（FFM），提高了网络的关注机制通过整合包含在多个学生机型的最后一块丰富的语义信息的性能。此外，我们使用分类多元化（CD）损失函数，以加强学生的模型之间的差异，并提供更好的整体效果。大量的实验证明我们的方法显著提高学生机型的多样性，并带来了更好的性能蒸馏。我们评估我们的三个图像分类数据集的方法：CIFAR-10/100和互联网络信息中心-10。结果表明，我们的方法实现对这些数据集的国家的最先进的性能。
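
The baseline online-distillation step that the abstract starts from, where the ensembled prediction of several students serves as the soft target for each of them, can be sketched as below. Averaging logits to form the ensemble, the temperature value, and the function names are assumptions for illustration; the Feature Fusion Module and the Classifier Diversification loss proposed in the paper are not reproduced.

```python
import numpy as np

def softmax(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_losses(student_logits: np.ndarray, T: float = 3.0) -> np.ndarray:
    """KL(ensemble || student) for each student.

    student_logits has shape (n_students, batch, classes).
    """
    target = softmax(student_logits.mean(axis=0), T)   # ensembled soft target
    probs = softmax(student_logits, T)                 # each student's prediction
    kl = np.sum(target * (np.log(target) - np.log(probs)), axis=-1)
    return (T ** 2) * kl.mean(axis=-1)                 # one loss per student

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5, 10))   # hypothetical: 3 students, 5 samples, 10 classes
losses = distill_losses(logits)
```

If all students produce identical logits, every loss is zero and the soft target carries no extra information, which is one way to see the homogenization problem the paper sets out to address.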

32. A Parallel Down-Up Fusion Network for Salient Object Detection in Optical Remote Sensing Images
Chongyi Li, Runmin Cong, Chunle Guo, Hua Li, Chunjie Zhang, Feng Zheng, Yao Zhao
Abstract: The diverse spatial resolutions, various object types, scales and orientations, and cluttered backgrounds in optical remote sensing images (RSIs) challenge the current salient object detection (SOD) approaches. It is commonly unsatisfactory to directly apply SOD approaches designed for nature scene images (NSIs) to RSIs. In this paper, we propose a novel Parallel Down-up Fusion network (PDF-Net) for SOD in optical RSIs, which takes full advantage of the in-path low- and high-level features and cross-path multi-resolution features to distinguish diversely scaled salient objects and suppress the cluttered backgrounds. To be specific, based on the key observation that salient objects remain salient regardless of image resolution, the PDF-Net applies successive down-sampling to form five parallel paths and perceive the scaled salient objects that commonly exist in optical RSIs. Meanwhile, we adopt dense connections to take advantage of both low- and high-level information in the same path and build up the relations across paths, which explicitly yield strong feature representations. Finally, we fuse the multi-resolution features in the parallel paths to combine the benefits of features at different resolutions, i.e., the high-resolution features contain complete structure and clear details, while the low-resolution features highlight the scaled salient objects. Extensive experiments on the ORSSD dataset demonstrate that the proposed network is superior to the state-of-the-art approaches both qualitatively and quantitatively.
摘要：多样的空间分辨率，各种对象类型，尺度和方向，和杂乱背景的光学遥感图像（RSIS）挑战电流显着对象检测（SOD）方法。直接采用SOD办法设计的自然场景图像（NSIS）来RSIS人们普遍不满意。在本文中，我们提出在光学RSIS，它充分利用了路径中的低和高电平的特性和交叉路径的多分辨率特征的新型并行下 - 上融合网络（PDF-净）为SOD区分不同地规模显着对象和抑制杂乱的背景。具体而言，保持了显着对象仍然是显着的，无论图像的分辨率是一记关键的观察，PDF-网作为连续降采样，形成五个平行的路径和地感知缩放了在光学普遍存在的显着对象RSIS。同时，我们采用了密集的连接，同时利用低和高层次的信息在相同的路径，并建立跨路径的关系，明确产生强大的功能表示。最后，我们融合在平行路径的多分辨率特征，而低分辨率特色突出的缩放突出对象的不同的分辨率，即高分辨率功能，包括完整的结构和清晰的细节特征的好处结合起来。在ORSSD广泛的实验数据集表明，该网络是优于状态的最先进的方法定性和定量。

33. Deep Learning for Earth Image Segmentation based on Imperfect Polyline Labels with Annotation Errors
Zhe Jiang, Marcus Stephen Kirby, Wenchong He, Arpan Man Sainju
Abstract: In recent years, deep learning techniques (e.g., U-Net, DeepLab) have achieved tremendous success in image segmentation. The performance of these models heavily relies on high-quality ground truth segment labels. Unfortunately, in many real-world problems, ground truth segment labels often have geometric annotation errors due to manual annotation mistakes, GPS errors, or visually interpreting background imagery at a coarse resolution. Such location errors will significantly impact the training performance of existing deep learning algorithms. Existing research on label errors either models ground truth errors in label semantics (assuming label locations to be correct) or models label location errors with simple square patch shifting. These methods cannot fully incorporate the geometric properties of label location errors. To fill the gap, this paper proposes a generic learning framework based on the EM algorithm to update deep learning model parameters and infer hidden true label locations simultaneously. Evaluations on a real-world hydrological dataset in the streamline refinement application show that the proposed framework outperforms baseline methods in classification accuracy (reducing the number of false positives by 67% and reducing the number of false negatives by 55%).
摘要：近年来，深学习技术（例如，U型网，DeepLab）都实现了图像分割了巨大的成功。这些机型的性能在很大程度上依赖于高质量的地面实况段标签。不幸的是，在许多现实世界的问题，地面实况片断标签经常有因手工标注错误几何标注错误，GPS的误差，或以粗分辨率可视化解释的背景图像。这样的定位误差会显著影响现有的深度学习算法的训练表现。现有在标签语义标签错误或者车型的实测误差研究（假设标签位置是正确的）或模型标记用简单的方形片移动的位置误差。这些方法不能完全合并标签定位误差的几何性质。填补了国内空白，本文提出了一种基于EM算法更新深度学习模型参数，同时推断出隐藏真实标签位置的通用学习框架。在流线型细化应用表明，在分类精度拟议框架性能优于基准方法（由67％减少误报的数量和55％减少假阴性的数量）现实世界的水文数据集的评估。

34. Contrastive Learning of Medical Visual Representations from Paired Images and Text
Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, Curtis P. Langlotz
Abstract: Learning visual representations of medical images is core to medical image understanding, but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet-initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
摘要：学习医学图像的视觉表现为核心，以医疗图像理解，但其进展一直由手工标记的数据集的规模小了。现有的工作通常依赖于从ImageNet训练前，这是不理想的，由于完全不同的图像特征，或者从医学图像配对的文字报告数据，这是不准确的，很难一概而论基于规则的标签提取转移的权重。我们提出替代无人监督的策略，直接从图像和文本数据的自然发生的配对学习医学可视化表示。我们通过两个模态之间的双向对比目标训练前与成对的文本数据的医用图像编码器的方法是结构域无关的，并且不需要附加的专门的输入。我们通过我们的预训练的权重转移到4个医学图像分类任务和2个零次检索任务，并显示测试我们的方法，我们的方法导致图像表示，在大多数的设置大大超越强大的基线。值得注意的是，在所有4个分类任务，我们的方法只需要10％之多标记的训练数据作为ImageNet初始化配对，以达到更好或相当的性能，展示了出色的数据效率。
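
A bidirectional contrastive objective between paired image and text embeddings can be sketched in an InfoNCE style: each image should score highest against its own text among all texts in the batch, and each text against its own image. The temperature `tau`, the weighting `lam`, the cosine similarity, and the function names are assumptions chosen for illustration; the abstract does not give the exact form or hyperparameters of the authors' loss.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(img: np.ndarray, txt: np.ndarray,
                                   tau: float = 0.1, lam: float = 0.5) -> float:
    """img, txt: (N, D) embeddings where row i of each forms a true pair."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T / tau                       # (N, N) scaled cosine similarities
    idx = np.arange(len(sim))
    img_to_txt = -log_softmax(sim, axis=1)[idx, idx].mean()  # pick the right text
    txt_to_img = -log_softmax(sim, axis=0)[idx, idx].mean()  # pick the right image
    return float(lam * img_to_txt + (1 - lam) * txt_to_img)

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(16, 32))                          # hypothetical image embeddings
txt_emb = img_emb + 0.01 * rng.normal(size=img_emb.shape)    # well-aligned text embeddings

aligned = bidirectional_contrastive_loss(img_emb, txt_emb)
mismatched = bidirectional_contrastive_loss(img_emb, np.roll(txt_emb, 1, axis=0))
```

The loss is small when true pairs are the most similar entries in both directions and grows when the pairing is broken, so no labels beyond the natural image-report pairing are needed.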

35. Smart-Inspect: Micro Scale Localization and Classification of Smartphone Glass Defects for Industrial Automation
M Usman Maqbool Bhutta, Shoaib Aslam, Peng Yun, Jianhao Jiao, Ming Liu
Abstract: The presence of any type of defect on the glass screen of smart devices has a great impact on their quality. We present a robust semi-supervised learning framework for intelligent micro-scaled localization and classification of defects on a 16K pixel image of smartphone glass. Our model features the efficient recognition and labeling of three types of defects: scratches, light leakage due to cracks, and pits. Our method also differentiates between the defects and light reflections due to dust particles and sensor regions, which are classified as non-defect areas. We use a partially labeled dataset to achieve high robustness and excellent classification of defect and non-defect areas as compared to principal components analysis (PCA), multi-resolution and information-fusion-based algorithms. In addition, we incorporated two classifiers at different stages of our inspection framework for labeling and refining the unlabeled defects. We successfully enhanced the inspection depth-limit up to 5 microns. The experimental results show that our method outperforms manual inspection in testing the quality of glass screen samples by identifying defects on samples that have been marked as good by human inspection.
摘要：智能设备的玻璃屏幕上的任何类型的缺陷的存在对他们的质量有很大的影响。我们提出了智能玻璃的16K像素图像上智能微尺度定位和缺陷分类的健壮半监督学习框架。我们的模型具有三种类型的缺陷的有效识别和标记：划痕，漏光由于裂纹和凹坑。我们的方法还区分缺陷和光反射由于灰尘颗粒和传感器区域，其被分类为无缺陷区域之间。我们使用的部分标记的数据集相比，主成分分析（PCA），多分辨率和基于信息融合算法，以实现高耐用性和的缺陷和非缺陷区域优良分类。此外，我们在纳入我们的检查框架的不同阶段两个分类标签和细化未标记的缺陷。我们成功地提高了检查的深度极限可达5微米。实验结果表明，我们的方法优于人工检查在通过识别对已经标记为通过人工检查良好样品缺陷测试玻璃屏样品的质量。

36. Cycle-Consistent Adversarial Autoencoders for Unsupervised Text Style Transfer
Yufang Huang, Wentao Zhu, Deyi Xiong, Yiye Zhang, Changjian Hu, Feiyu Xu
Abstract: Unsupervised text style transfer is full of challenges due to the lack of parallel data and difficulties in content preservation. In this paper, we propose a novel neural approach to unsupervised text style transfer, which we refer to as Cycle-consistent Adversarial autoEncoders (CAE) trained from non-parallel data. CAE consists of three essential components: (1) LSTM autoencoders that encode a text in one style into its latent representation and decode an encoded representation into its original text or a transferred representation into a style-transferred text, (2) adversarial style transfer networks that use an adversarially trained generator to transform a latent representation in one style into a representation in another style, and (3) a cycle-consistent constraint that enhances the capacity of the adversarial style transfer networks in content preservation. The entire CAE with these three components can be trained end-to-end. Extensive experiments and in-depth analyses on two widely-used public datasets consistently validate the effectiveness of proposed CAE in both style transfer and content preservation against several strong baselines in terms of four automatic evaluation metrics and human evaluation.
摘要：无监督的文本样式转移是充满挑战，由于缺乏并行数据和内容保存困难。在本文中，我们提出了无监督文本样式转移，我们称之为非并行数据训练周期一致对抗性自动编码（CAE）一种新型的神经途径。 CAE由三个基本部件组成：（1）LSTM自动编码，在一个风格文本编码成它的潜表示和解码的编码表示成其原始文本或转移表示成式转印文本，（2）对抗式传送网络使用一个adversarially训练发生器在一个风格转换潜表示形式转换为表示在另一种风格，以及（3）一个周期一致的约束增强在保存内容的对抗式传输网络的容量。这三个组件的整个CAE可以训练结束到终端。两种广泛使用的公共数据集大量的实验和深入的分析一致验证提出CAE技术在四个自动评价指标和评价人既条款转让的风格和内容保存对几种强基线的有效性。

37. LiRaNet: End-to-End Trajectory Prediction using Spatio-Temporal Radar Fusion
Meet Shah, Zhiling Huang, Ankit Laddha, Matthew Langford, Blake Barber, Sidney Zhang, Carlos Vallespi-Gonzalez, Raquel Urtasun
Abstract: In this paper, we present LiRaNet, a novel end-to-end trajectory prediction method which utilizes radar sensor information along with widely used lidar and high definition (HD) maps. Automotive radar provides rich, complementary information, allowing for longer range vehicle detection as well as instantaneous radial velocity measurements. However, there are factors that make the fusion of lidar and radar information challenging, such as the relatively low angular resolution of radar measurements, their sparsity and the lack of exact time synchronization with lidar. To overcome these challenges, we propose an efficient spatio-temporal radar feature extraction scheme which achieves state-of-the-art performance on multiple large-scale datasets. Further, by incorporating radar information, we show a 52% reduction in prediction error for objects with high acceleration and a 16% reduction in prediction error for objects at longer range.
摘要：在本文中，我们提出LiRaNet，其利用雷达传感器信息与广泛使用的激光雷达和高清晰度（HD）沿一个新颖的端至端轨迹预测方法映射。汽车雷达提供丰富的，互补的信息，从而允许更长的范围车辆检测以及瞬时径向速度的测量。但是，也有因素，使激光雷达和雷达信息的融合挑战，诸如雷达测量的相对低的角分辨率，其稀疏性以及缺乏与激光雷达精确的时间同步。为了克服这些挑战，我们提出一种实现在多个大型datasets.Further状态的最先进的性能的有效的空间 - 时间的雷达特征提取方案，通过将雷达信息，我们表明在预测误差的减少52％为高的加速度和在较长范围中的对象的预测误差的减少16％的对象。

38. Using ROC and Unlabeled Data for Increasing Low-Shot Transfer Learning Classification Accuracy
Spiridon Kasapis, Geng Zhang, Jonathon Smereka, Nickolas Vlahopoulos
Abstract: One of the most important characteristics of human visual intelligence is the ability to identify unknown objects. The capability to distinguish between an object of which the human mind has no previous experience and a familiar object is innate to every human. In everyday life, within seconds of seeing an "unknown" object, we are able to categorize it as such without any substantial effort. Convolutional Neural Networks, regardless of how they are trained (i.e. in a conventional manner or through transfer learning), can recognize only the classes that they are trained for. When using them for classification, any candidate image will be placed in one of the available classes. We propose a low-shot classifier which can serve as the top layer of any existing CNN whose feature extractor has already been trained. Using a limited amount of labeled data for the type of images which need to be specifically classified along with unlabeled data for all other images, a unique target matrix and a Receiver Operating Characteristic (ROC) curve criterion, we are able to increase identification accuracy by up to 30% for the images that do not belong to any specific classes, while retaining the ability to identify images that belong to the specific classes of interest.
Summary: A low-shot classifier placed on top of a pre-trained CNN feature extractor uses a small labeled set, unlabeled data, a unique target matrix, and an ROC criterion to recognize images that fall outside the classes of interest. Identification accuracy for those images improves by up to 30% while accuracy on the classes of interest is retained.

39. Deep Reinforcement Learning with Mixed Convolutional Network
Yanyu Zhang
Abstract: Recent research has shown that mapping raw pixels from a single front-facing camera directly to steering commands is surprisingly powerful. This paper presents a convolutional neural network (CNN) that plays CarRacing-v0 in OpenAI Gym using imitation learning. The dataset is generated by playing the game manually in Gym, and a data augmentation method expands it to four times its original size. We also read the true speed, four ABS sensors, steering wheel position, and gyroscope for each image, and design a mixed model that combines the sensor input with the image input. After training, this model can automatically detect the boundaries of road features and drive the robot like a human. Compared with AlexNet and VGG16 on the average reward in CarRacing-v0, our model achieves the best overall system performance.
Summary: A CNN trained by imitation learning plays CarRacing-v0 in OpenAI Gym. A mixed model combines the image input with sensor readings (speed, ABS sensors, steering wheel position, gyroscope) and outperforms AlexNet and VGG16 on average reward.

40. Binary Neural Networks for Memory-Efficient and Effective Visual Place Recognition in Changing Environments
Bruno Ferrarini, Michael Milford, Klaus D. McDonald-Maier, Shoaib Ehsan
Abstract: Visual place recognition (VPR) is a robot's ability to determine whether a place was visited before using visual data. While conventional hand-crafted methods for VPR fail under extreme environmental appearance changes, those based on convolutional neural networks (CNNs) achieve state-of-the-art performance but result in model sizes that demand a large amount of memory. Hence, CNN-based approaches are unsuitable for memory-constrained platforms, such as small robots and drones. In this paper, we take a multi-step approach of decreasing the precision of model parameters, combining it with network depth reduction and fewer neurons in the classifier stage to propose a new class of highly compact models that drastically reduce the memory requirements while maintaining state-of-the-art VPR performance, and can be tuned to various platforms and application scenarios. To the best of our knowledge, this is the first attempt to propose binary neural networks for solving the visual place recognition problem effectively under changing conditions and with significantly reduced memory requirements. Our best-performing binary neural network with a minimum number of layers, dubbed FloppyNet, achieves comparable VPR performance when considered against its full precision and deeper counterparts while consuming 99% less memory.
Summary: Binary neural networks for visual place recognition combine reduced parameter precision with shallower networks and fewer classifier neurons. The best model, FloppyNet, matches the VPR performance of full-precision, deeper counterparts while consuming 99% less memory.
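As a rough illustration of why lowering parameter precision saves memory, the sketch below binarizes a handful of weights to their signs and packs eight of them per byte. It is not FloppyNet's training procedure: the helper names (`binarize`, `pack_bits`, `unpack_signs`) and the float32 baseline are assumptions made for this example only.

```python
import struct

def binarize(weights):
    """Map each weight to one bit: 1 encodes +1 (w >= 0), 0 encodes -1 (w < 0)."""
    return [1 if w >= 0 else 0 for w in weights]

def pack_bits(bits):
    """Pack 0/1 values into bytes, eight weights per byte (last byte zero-padded)."""
    packed = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def unpack_signs(packed, count):
    """Recover the +1/-1 weights from the packed representation."""
    return [1.0 if (packed[i // 8] >> (i % 8)) & 1 else -1.0 for i in range(count)]

# Hypothetical weights, stored as float32 in the assumed baseline.
weights = [0.42, -1.3, 0.0, -0.07, 2.5, -0.9, 0.3, -0.1, 0.8, -0.6]
full_precision = struct.pack(f"{len(weights)}f", *weights)
packed = pack_bits(binarize(weights))

saving = 1 - len(packed) / len(full_precision)
print(f"float32: {len(full_precision)} bytes, binary: {len(packed)} bytes, saving: {saving:.1%}")
```

Under this float32 assumption, binarizing weights alone cannot save more than 1 - 1/32, about 96.9%, which is consistent with the abstract's description of combining lower precision with fewer layers and fewer classifier neurons to reach its reported figure.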

41. Learned Dual-View Reflection Removal
Simon Niklaus, Xuaner Cecilia Zhang, Jonathan T. Barron, Neal Wadhwa, Rahul Garg, Feng Liu, Tianfan Xue
Abstract: Traditional reflection removal algorithms either use a single image as input, which suffers from intrinsic ambiguities, or use multiple images from a moving camera, which is inconvenient for users. We instead propose a learning-based dereflection algorithm that uses stereo images as input. This is an effective trade-off between the two extremes: the parallax between two views provides cues to remove reflections, and two views are easy to capture due to the adoption of stereo cameras in smartphones. Our model consists of a learning-based reflection-invariant flow model for dual-view registration, and a learned synthesis model for combining aligned image pairs. Because no dataset for dual-view reflection removal exists, we render a synthetic dataset of dual-views with and without reflections for use in training. Our evaluation on an additional real-world dataset of stereo pairs shows that our algorithm outperforms existing single-image and multi-image dereflection approaches.
Summary: A learning-based method removes reflections using stereo image pairs as input. It combines a reflection-invariant flow model for dual-view registration with a learned synthesis model that merges the aligned images. Trained on a rendered synthetic dataset, it outperforms single-image and multi-image approaches on a real-world stereo dataset.

42. Active Learning for Bayesian 3D Hand Pose Estimation
Razvan Caramalau, Binod Bhattarai, Tae-Kyun Kim
Abstract: We propose a Bayesian approximation to a deep learning architecture for 3D hand pose estimation. Through this framework, we explore and analyse the two types of uncertainties that are influenced either by data or by the learning capability. Furthermore, we draw comparisons against the standard estimator over three popular benchmarks. The first contribution lies in outperforming the baseline while in the second part we address the active learning application. We also show that with a newly proposed acquisition function, our Bayesian 3D hand pose estimator obtains lowest errors with the least amount of data. The underlying code is publicly available at this https URL.
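Summary: A Bayesian approximation of a deep 3D hand pose estimator is used to analyse the uncertainty that stems from the data and from the learning capability. It outperforms the standard estimator on three benchmarks and, with a newly proposed acquisition function, reaches the lowest errors with the least data in active learning.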

43. Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature Aggregation
Sam Sattarzadeh, Mahesh Sudhakar, Anthony Lem, Shervin Mehryar, K. N. Plataniotis, Jongseong Jang, Hyunwoo Kim, Yeonjeong Jeong, Sangmin Lee, Kyunghoon Bae
Abstract: As an emerging field in Machine Learning, Explainable AI (XAI) has been offering remarkable performance in interpreting the decisions made by Convolutional Neural Networks (CNNs). To achieve visual explanations for CNNs, methods based on class activation mapping and randomized input sampling have gained great popularity. However, the attribution methods based on these techniques provide lower resolution and blurry explanation maps that limit their explanation power. To circumvent this issue, visualization based on various layers is sought. In this work, we collect visualization maps from multiple layers of the model based on an attribution-based input sampling technique and aggregate them to reach a fine-grained and complete explanation. We also propose a layer selection strategy that applies to the whole family of CNN-based models, based on which our extraction framework is applied to visualize the last layers of each convolutional block of the model. Moreover, we perform an empirical analysis of the efficacy of derived lower-level information to enhance the represented attributions. Comprehensive experiments conducted on shallow and deep models trained on natural and industrial datasets, using both ground-truth and model-truth based evaluation metrics validate our proposed algorithm by meeting or outperforming the state-of-the-art methods in terms of explanation ability and visual quality, demonstrating that our method shows stability regardless of the size of objects or instances to be explained.
Summary: Visualization maps are collected from multiple layers of a CNN with an attribution-based input sampling technique and aggregated into a fine-grained, complete explanation. A layer selection strategy targets the last layer of each convolutional block. Experiments on natural and industrial datasets show that the method meets or outperforms state-of-the-art explanation methods.

44. Multiscale Detection of Cancerous Tissue in High Resolution Slide Scans
Qingchao Zhang, Coy D. Heldermon, Corey Toler-Franklin
Abstract: We present an algorithm for multi-scale tumor (chimeric cell) detection in high resolution slide scans. The broad range of tumor sizes in our dataset poses a challenge for current Convolutional Neural Networks (CNNs), which often fail when image features are very small (8 pixels). Our approach modifies the effective receptive field at different layers in a CNN so that objects with a broad range of varying scales can be detected in a single forward pass. We define rules for computing adaptive prior anchor boxes which we show are solvable under the equal proportion interval principle. Two mechanisms in our CNN architecture alleviate the effects of non-discriminative features prevalent in our data - a foveal detection algorithm that incorporates a cascade residual-inception module and a deconvolution module with additional context information. When integrated into a Single Shot MultiBox Detector (SSD), these additions permit more accurate detection of small-scale objects. The results permit efficient real-time analysis of medical images in pathology and related biomedical research fields.
Summary: A multi-scale tumor detection algorithm for high-resolution slide scans modifies the effective receptive field at different CNN layers, defines rules for adaptive prior anchor boxes, and adds a foveal detection algorithm with a cascade residual-inception module and a deconvolution module. Integrated into SSD, these additions improve detection of small-scale objects.

45. StreamSoNG: A Soft Streaming Classification Approach
Wenlong Wu, James M. Keller, Jeffrey Dale, James C. Bezdek
Abstract: Examining most streaming clustering algorithms leads to the understanding that they are actually incremental classification models. They model existing and newly discovered structures via summary information that we call footprints. Incoming data is normally assigned crisp labels (into one of the structures) and that structure's footprints are incrementally updated. There is no reason that these assignments need to be crisp. In this paper, we propose a new streaming classification algorithm that uses Neural Gas prototypes as footprints and produces a possibilistic label vector (typicalities) for each incoming vector. These typicalities are generated by a modified possibilistic k-nearest neighbor algorithm. The approach is tested on synthetic and real image datasets with excellent results.
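Summary: Most streaming clustering algorithms are in effect incremental classifiers that assign crisp labels. The proposed streaming classification algorithm instead uses Neural Gas prototypes as footprints and assigns each incoming vector a possibilistic label vector (typicalities) computed by a modified possibilistic k-nearest neighbor algorithm.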

46. Efficient Image Super-Resolution Using Pixel Attention
Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, Chao Dong
Abstract: This work aims at designing a lightweight convolutional neural network for image super resolution (SR). With simplicity in mind, we construct a concise and effective network with a newly proposed pixel attention scheme. Pixel attention (PA) is similar to channel attention and spatial attention in formulation. The difference is that PA produces 3D attention maps instead of a 1D attention vector or a 2D map. This attention scheme introduces fewer additional parameters but generates better SR results. On the basis of PA, we propose two building blocks for the main branch and the reconstruction branch, respectively. The first one, the SC-PA block, has the same structure as the Self-Calibrated convolution but with our PA layer. This block is much more efficient than conventional residual/dense blocks, owing to its two-branch architecture and attention scheme. The second one, the UPA block, combines nearest-neighbor upsampling, convolution and PA layers. It improves the final reconstruction quality with little parameter cost. Our final model, PAN, achieves performance similar to the lightweight networks SRResNet and CARN, but with only 272K parameters (17.92% of SRResNet and 17.09% of CARN). The effectiveness of each proposed component is also validated by an ablation study. The code is available at this https URL.
Summary: PAN is a lightweight super-resolution network built on pixel attention, which produces 3D attention maps rather than a 1D vector or 2D map. Its SC-PA and UPA blocks yield performance similar to SRResNet and CARN with only 272K parameters.
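To make the "3D attention map" idea concrete, here is a minimal sketch in plain Python. The abstract only states that pixel attention yields one coefficient per channel and spatial position; the specific form used here (a 1x1 convolution followed by a sigmoid, multiplied back onto the input) is an assumption for illustration, as are the function and parameter names.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def pixel_attention(x, weight, bias):
    """Rescale a C x H x W feature map (nested lists) with a C x H x W attention map.

    weight is a C x C matrix acting as an assumed 1x1 convolution; bias has length C.
    Channel attention would use one coefficient per channel (1D) and spatial
    attention one per pixel (2D); here every (channel, row, col) entry gets its own.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for o in range(channels):
        for h in range(height):
            for w in range(width):
                logit = bias[o] + sum(weight[o][c] * x[c][h][w] for c in range(channels))
                out[o][h][w] = x[o][h][w] * sigmoid(logit)
    return out

# Hypothetical 2-channel, 2x2 feature map and 1x1 convolution parameters.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[-1.0, 0.0], [2.0, -2.0]]]
conv_weight = [[0.5, -0.25], [0.1, 0.3]]
conv_bias = [0.0, 0.1]
attended = pixel_attention(features, conv_weight, conv_bias)
print(attended)
```

With all-zero weights and bias the sigmoid evaluates to 0.5 everywhere, so the output is simply half the input, which is a convenient sanity check.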

47. Attention-Based Clustering: Learning a Kernel from Context
Samuel Coward, Erik Visse-Martindale, Chithrupa Ramesh
Abstract: In machine learning, no data point stands alone. We believe that context is an underappreciated concept in many machine learning methods. We propose Attention-Based Clustering (ABC), a neural architecture based on the attention mechanism, which is designed to learn latent representations that adapt to context within an input set, and which is inherently agnostic to input sizes and number of clusters. By learning a similarity kernel, our method directly combines with any out-of-the-box kernel-based clustering approach. We present competitive results for clustering Omniglot characters and include analytical evidence of the effectiveness of an attention-based approach for clustering.
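Summary: Attention-Based Clustering (ABC) uses an attention mechanism to learn latent representations that adapt to the context of an input set, and is agnostic to input size and number of clusters. Because it learns a similarity kernel, it plugs into any kernel-based clustering method. Competitive results are reported on Omniglot characters.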

48. Encoded Prior Sliced Wasserstein AutoEncoder for learning latent manifold representations
Sanjukta Krishnagopal, Jacob Bedrossian
Abstract: While variational autoencoders have been successful generative models for a variety of tasks, the use of conventional Gaussian or Gaussian mixture priors is limited in its ability to capture topological or geometric properties of data in the latent representation. In this work, we introduce an Encoded Prior Sliced Wasserstein AutoEncoder (EPSWAE) wherein an additional prior-encoder network learns an unconstrained prior to match the encoded data manifold. The autoencoder and prior-encoder networks are iteratively trained using the Sliced Wasserstein Distance (SWD), which efficiently measures the distance between two arbitrary sampleable distributions without being constrained to a specific form as in the KL divergence, and without requiring expensive adversarial training. Additionally, we enhance the conventional SWD by introducing a nonlinear shearing, i.e., averaging over random nonlinear transformations, to better capture differences between two distributions. The prior is further encouraged to encode the data manifold by use of a structural consistency term that encourages isometry between feature space and latent space. Lastly, interpolation along geodesics on the latent space representation of the data manifold generates samples that lie on the manifold and hence is advantageous compared with standard Euclidean interpolation. To this end, we introduce a graph-based algorithm for identifying network-geodesics in latent space from samples of the prior that maximize the density of samples along the path while minimizing total energy. We apply our framework to 3D-spiral, MNIST, and CelebA datasets, and show that its latent representations and interpolations are comparable to the state of the art on equivalent architectures.
Summary: EPSWAE adds a prior-encoder network that learns an unconstrained prior matching the encoded data manifold, trained with a Sliced Wasserstein Distance enhanced by nonlinear shearing. A structural consistency term and a graph-based geodesic interpolation algorithm are also introduced. Results on 3D-spiral, MNIST, and CelebA are comparable to the state of the art on equivalent architectures.
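The conventional Sliced Wasserstein Distance mentioned in the abstract can be estimated from samples alone, which is why it needs no closed-form density. The sketch below is a plain Monte Carlo estimator using linear random projections; it does not include the paper's nonlinear shearing, and the function name, the equal-sample-count restriction, and the default of 200 projections are choices made for this example.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=200, seed=0):
    """Estimate the sliced 2-Wasserstein distance between two sample sets.

    xs and ys are equally sized lists of d-dimensional points. Each random
    direction reduces the problem to 1D, where optimal transport amounts to
    matching sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a direction uniformly on the unit sphere by normalising a Gaussian vector.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        proj_x = sorted(sum(t * v for t, v in zip(theta, point)) for point in xs)
        proj_y = sorted(sum(t * v for t, v in zip(theta, point)) for point in ys)
        total += sum((a - b) ** 2 for a, b in zip(proj_x, proj_y)) / len(proj_x)
    return math.sqrt(total / n_projections)

# Two hypothetical 2D sample sets: the second is the first shifted by (3, 0).
rng = random.Random(1)
cloud = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(100)]
shifted = [[x + 3.0, y] for x, y in cloud]
print(sliced_wasserstein(cloud, cloud), sliced_wasserstein(cloud, shifted))
```

Sorting dominates the cost of each projection, so the estimate scales as O(n log n) per direction, in contrast to adversarial critics that must be trained.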

49. Understanding the Predictability of Gesture Parameters from Speech and their Perceptual Importance
Ylva Ferstl, Michael Neff, Rachel McDonnell
Abstract: Gesture behavior is a natural part of human conversation. Much work has focused on removing the need for tedious hand-animation to create embodied conversational agents by designing speech-driven gesture generators. However, these generators often work in a black-box manner, assuming a general relationship between input speech and output motion. As their success remains limited, we investigate in more detail how speech may relate to different aspects of gesture motion. We determine a number of parameters characterizing gesture, such as speed and gesture size, and explore their relationship to the speech signal in a two-fold manner. First, we train multiple recurrent networks to predict the gesture parameters from speech to understand how well gesture attributes can be modeled from speech alone. We find that gesture parameters can be partially predicted from speech, with some parameters, such as path length, predicted more accurately than others, such as velocity. Second, we design a perceptual study to assess the importance of each gesture parameter for producing motion that people perceive as appropriate for the speech. Results show that a degradation in any parameter was viewed negatively, but some changes, such as hand shape, are more impactful than others. A video summarization can be found at this https URL.
Summary: Recurrent networks trained to predict gesture parameters from speech show that the parameters are only partially predictable, with path length predicted more accurately than velocity. A perceptual study finds that degrading any parameter is viewed negatively and that some changes, such as hand shape, matter more than others.

50. An Empirical Study of DNNs Robustification Inefficacy in Protecting Visual Recommenders
Vito Walter Anelli, Tommaso Di Noia, Daniele Malitesta, Felice Antonio Merra
Abstract: Visual-based recommender systems (VRSs) enhance recommendation performance by integrating users' feedback with the visual features of product images extracted from a deep neural network (DNN). Recently, human-imperceptible image perturbations, known as adversarial attacks, have been shown to alter VRS recommendation performance, e.g., by pushing or nuking categories of products. However, although adversarial training techniques have proven to successfully robustify DNNs in preserving classification accuracy, to the best of our knowledge two important questions have not yet been investigated: 1) How well can these defensive mechanisms protect VRS performance? 2) What are the reasons behind ineffective/effective defenses? To answer these questions, we define a set of defense and attack settings, as well as recommender models, to empirically investigate the efficacy of defensive mechanisms. The results indicate alarming risks in protecting a VRS through DNN robustification. Our experiments shed light on the importance of visual features in very effective attack scenarios. Given the financial impact of VRSs on many companies, we believe this work might raise the need to investigate how to successfully protect visual-based recommenders. Source code and data are available at https://anonymous.4open.science/r/868f87ca-c8a4-41ba-9af9-20c41de33029/.
Summary: An empirical study of whether adversarially robustified DNNs protect visual-based recommender systems from adversarial image perturbations. The results indicate alarming risks in relying on DNN robustification and highlight the role of visual features in highly effective attacks.

51. A Deep-Unfolded Reference-Based RPCA Network For Video Foreground-Background Separation
Huynh Van Luong, Boris Joukovsky, Yonina C. Eldar, Nikos Deligiannis
Abstract: Deep unfolded neural networks are designed by unrolling the iterations of optimization algorithms. They can be shown to achieve faster convergence and higher accuracy than their optimization counterparts. This paper proposes a new deep-unfolding-based network design for the problem of Robust Principal Component Analysis (RPCA) with application to video foreground-background separation. Unlike existing designs, our approach focuses on modeling the temporal correlation between the sparse representations of consecutive video frames. To this end, we perform the unfolding of an iterative algorithm for solving reweighted $\ell_1$-$\ell_1$ minimization; this unfolding leads to a different proximal operator (a.k.a. different activation function) adaptively learned per neuron. Experimentation using the moving MNIST dataset shows that the proposed network outperforms a recently proposed state-of-the-art RPCA network in the task of video foreground-background separation.
Summary: A deep-unfolded network for RPCA-based video foreground-background separation unrolls an iterative algorithm for reweighted $\ell_1$-$\ell_1$ minimization, which models the temporal correlation between the sparse representations of consecutive frames. It outperforms a recent state-of-the-art RPCA network on the moving MNIST dataset.
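The idea of unrolling an iterative solver into network layers can be illustrated with a simpler, well-known case. The sketch below runs ISTA for plain $\ell_1$-regularized least squares, where each loop iteration corresponds to one layer and soft thresholding acts as the activation function. This is not the paper's reweighted $\ell_1$-$\ell_1$ algorithm or its learned proximal operator; the function names and the toy problem are assumptions for illustration.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def ista(matrix, target, lam, step, n_layers):
    """Minimise 0.5 * ||matrix @ x - target||^2 + lam * ||x||_1 with ISTA.

    Each pass of the loop is one "layer" of the unfolded network. In a learned
    version, matrix, step, and lam would become trainable per-layer parameters.
    """
    rows, cols = len(matrix), len(matrix[0])
    x = [0.0] * cols
    for _ in range(n_layers):
        residual = [sum(matrix[i][j] * x[j] for j in range(cols)) - target[i]
                    for i in range(rows)]
        gradient = [sum(matrix[i][j] * residual[i] for i in range(rows))
                    for j in range(cols)]
        x = [soft_threshold(x[j] - step * gradient[j], step * lam)
             for j in range(cols)]
    return x

# Toy problem with an identity matrix, so the solution is a soft-thresholded target.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sparse_code = ista(identity, [3.0, -0.5, 1.0], lam=1.0, step=1.0, n_layers=5)
print(sparse_code)
```

Replacing the fixed soft threshold with a different, learned shrinkage function per neuron is the kind of change the abstract describes when it refers to a different proximal operator.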

52. Tubular Shape Aware Data Generation for Semantic Segmentation in Medical Imaging
Ilyas Sirazitdinov, Heinrich Schulz, Axel Saalbach, Steffen Renisch, Dmitry V. Dylov
Abstract: Chest X-ray is one of the most widespread examinations of the human body. In interventional radiology, its use is frequently associated with the need to visualize various tube-like objects, such as puncture needles, guiding sheaths, wires, and catheters. Detection and precise localization of these tube-like objects in the X-ray images is, therefore, of utmost value, catalyzing the development of accurate target-specific segmentation algorithms. Similar to the other medical imaging tasks, the manual pixel-wise annotation of the tubes is a resource-consuming process. In this work, we aim to alleviate the lack of the annotated images by using artificial data. Specifically, we present an approach for synthetic data generation of the tube-shaped objects, with a generative adversarial network being regularized with a prior-shape constraint. Our method eliminates the need for paired image-mask data and requires only a weakly-labeled dataset (10-20 images) to reach the accuracy of the fully-supervised models. We report the applicability of the approach for the task of segmenting tubes and catheters in the X-ray images, and the results should also hold for the other imaging modalities.
Summary: A generative adversarial network regularized with a prior-shape constraint synthesizes tube-shaped objects for X-ray segmentation. It needs no paired image-mask data and only 10-20 weakly labeled images to reach the accuracy of fully supervised models.

53. Weight Encode Reconstruction Network for Computed Tomography in a Semi-Case-Wise and Learning-Based Way
Hujie Pan, Xuesong Li, Min Xu
Abstract: Classic algebraic reconstruction technology (ART) for computed tomography requires pre-determined weights of the voxels for projecting pixel values. However, such weights cannot be accurately obtained due to the limitation of the physical understanding and computation resources. In this study, we propose a semi-case-wise learning-based method named Weight Encode Reconstruction Network (WERNet) to tackle the issues mentioned above. The model is trained in a self-supervised manner without the label of a voxel set. It contains two branches, including the voxel weight encoder and the voxel attention part. Using gradient normalization, we are able to co-train the encoder and voxel set in a numerically stable way. With WERNet, the reconstructed result was obtained with a cosine similarity greater than 0.999 with the ground truth. Moreover, the model shows an extraordinary denoising capability compared to the classic ART method. In the generalization test of the model, the encoder is transferable from a voxel set with complex structure to the unseen cases without loss of accuracy.
Summary: WERNet is a semi-case-wise, self-supervised network for computed tomography that learns voxel weights instead of requiring them in advance. It reconstructs with a cosine similarity above 0.999 to the ground truth, denoises better than classic ART, and its encoder transfers to unseen cases.
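For context on the classic baseline, the sketch below implements a Kaczmarz-style ART update with a known weight matrix, together with the cosine similarity used as the quality measure in the abstract. It illustrates the pre-determined weights that ART depends on, not WERNet itself; the tiny 2x2 "volume", the six rays, and all function names are assumptions for this example.

```python
import math

def art_reconstruct(weights, projections, n_sweeps=100, relaxation=1.0):
    """Kaczmarz-style ART: correct the voxel estimate one projection ray at a time.

    weights[i][j] is the pre-determined contribution of voxel j to projected pixel i.
    """
    voxels = [0.0] * len(weights[0])
    for _ in range(n_sweeps):
        for row, measured in zip(weights, projections):
            norm_sq = sum(w * w for w in row)
            if norm_sq == 0.0:
                continue  # a ray that touches no voxel carries no information
            predicted = sum(w * v for w, v in zip(row, voxels))
            step = relaxation * (measured - predicted) / norm_sq
            voxels = [v + step * w for v, w in zip(voxels, row)]
    return voxels

def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

# Hypothetical 2x2 volume flattened to four voxels, probed by six rays:
# two rows, two columns, and two diagonals, each with unit weights.
rays = [[1, 1, 0, 0], [0, 0, 1, 1],
        [1, 0, 1, 0], [0, 1, 0, 1],
        [1, 0, 0, 1], [0, 1, 1, 0]]
truth = [1.0, 2.0, 3.0, 4.0]
measured = [sum(w * v for w, v in zip(row, truth)) for row in rays]
estimate = art_reconstruct(rays, measured)
print(estimate, cosine_similarity(estimate, truth))
```

The reconstruction here is accurate only because the weight matrix is exact; the abstract's motivation is that such weights are hard to obtain in practice.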

54. Morphological segmentation of hyperspectral images
Guillaume Noyel, Jesus Angulo, Dominique Jeulin
Abstract: The present paper develops a general methodology for the morphological segmentation of hyperspectral images, i.e., images with a large number of channels. This approach, based on watershed, is composed of a spectral classification to obtain the markers and a vectorial gradient which gives the spatial information. Several alternative gradients are adapted to the different hyperspectral functions. Data reduction is performed either by Factor Analysis or by model fitting. Image segmentation is done on different spaces: factor space, parameters space, etc. On all these spaces the spatial/spectral segmentation approach is applied, leading to relevant results on the image.
Summary: A watershed-based methodology for segmenting hyperspectral images combines spectral classification, which supplies the markers, with a vectorial gradient, which supplies the spatial information. It is applied after data reduction by Factor Analysis or model fitting.

55. Continuous close-range 3D object pose estimation
Bjarne Grossmann, Francesco Rovida, Volker Krueger
Abstract: In the context of future manufacturing lines, removing fixtures will be a fundamental step to increase the flexibility of autonomous systems in assembly and logistic operations. Vision-based 3D pose estimation is a necessity to accurately handle objects that might not be placed at fixed positions during the robot task execution. Industrial tasks bring multiple challenges for the robust pose estimation of objects such as difficult object properties, tight cycle times and constraints on camera views. In particular, when interacting with objects, we have to work with close-range partial views of objects that pose a new challenge for typical view-based pose estimation methods. In this paper, we present a 3D pose estimation method based on a gradient-ascent particle filter that integrates new observations on-the-fly to improve the pose estimate. Thereby, we can apply this method online during task execution to save valuable cycle time. In contrast to other view-based pose estimation methods, we model potential views in full 6-dimensional space, which allows us to cope with close-range partial object views. We demonstrate the approach on a real assembly task, in which the algorithm usually converges to the correct pose within 10-15 iterations with an average accuracy of less than 8 mm.
Summary: A 3D pose estimation method based on a gradient-ascent particle filter integrates new observations on the fly and models views in full 6-dimensional space to handle close-range partial views. On a real assembly task it usually converges within 10-15 iterations with an average accuracy of less than 8 mm.

56. Goal-Auxiliary Actor-Critic for 6D Robotic Grasping with Point Clouds
Lirui Wang, Yu Xiang, Dieter Fox
Abstract: 6D robotic grasping beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting without considering the dynamics and contacts of objects, which makes them sensitive to grasp synthesis errors. In this work, we propose a novel method for learning closed-loop control policies for 6D robotic grasping using point clouds from an egocentric camera. We combine imitation learning and reinforcement learning in order to grasp unseen objects and handle the continuous 6D action space, where expert demonstrations are obtained from a joint motion and grasp planner. We introduce a goal-auxiliary actor-critic algorithm, which uses grasping goal prediction as an auxiliary task to facilitate policy learning. The supervision on grasping goals can be obtained from the expert planner for known objects or from hindsight goals for unknown objects. Overall, our learned closed-loop policy achieves over 90% success rates on grasping various ShapeNet objects and YCB objects in the simulation. Our video can be found at this https URL.
Summary: A goal-auxiliary actor-critic algorithm learns closed-loop 6D grasping policies from egocentric point clouds by combining imitation learning and reinforcement learning, with grasp goal prediction as an auxiliary task. The policy achieves over 90% success on ShapeNet and YCB objects in simulation.

57. Explainable Online Validation of Machine Learning Models for Practical Applications
Wolfgang Fuhl, Yao Rong, Thomas Motz, Michael Scheidt, Andreas Hartel, Andreas Koch, Enkelejda Kasneci
Abstract: We present a reformulation of the regression and classification, which aims to validate the result of a machine learning algorithm. Our reformulation simplifies the original problem and validates the result of the machine learning algorithm using the training data. Since the validation of machine learning algorithms must always be explainable, we perform our experiments with the kNN algorithm as well as with an algorithm based on conditional probabilities, which is proposed in this work. For the evaluation of our approach, three publicly available data sets were used and three classification and two regression problems were evaluated. The presented algorithm based on conditional probabilities is also online capable and requires only a fraction of memory compared to the kNN algorithm.
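One plausible reading of "validating a result using the training data" with kNN is to check how many of the nearest training samples agree with the model's prediction. The sketch below is only an illustration under that assumption; the function name `knn_support`, the Euclidean distance, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def knn_support(train_points, train_labels, query, predicted_label, k=3):
    """Return the fraction of the k nearest training points that share the predicted label."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    return nearest_labels.count(predicted_label) / k


# Toy training set (hypothetical): two well-separated clusters.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

# A prediction of "a" near the first cluster is fully supported by its neighbours.
support = knn_support(train_points, train_labels, (0.2, 0.2), "a")
print(support)
```

A low support value would flag a prediction for review, and the neighbours themselves serve as the explanation.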

58. Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning
Weili Nie, Zhiding Yu, Lei Mao, Ankit B. Patel, Yuke Zhu, Animashree Anandkumar
Abstract: Humans have an inherent ability to learn novel concepts from only a few samples and generalize these concepts to different situations. Even though today's machine learning models excel with a plethora of training data on standard recognition tasks, a considerable gap exists between machine-level pattern recognition and human-level concept learning. To narrow this gap, the Bongard Problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems. Despite new advances in representation learning and learning to learn, BPs remain a daunting challenge for modern AI. Inspired by the original one hundred BPs, we propose a new benchmark, Bongard-LOGO, for human-level concept learning and reasoning. We develop a program-guided generation technique to produce a large set of human-interpretable visual cognition problems in the action-oriented LOGO language. Our benchmark captures three core properties of human cognition: 1) context-dependent perception, in which the same object may have disparate interpretations given different contexts; 2) analogy-making perception, in which some meaningful concepts are traded off for other meaningful concepts; and 3) perception with a few samples but infinite vocabulary. In experiments, we show that state-of-the-art deep learning methods perform substantially worse than human subjects, implying that they fail to capture core human cognition properties. Finally, we discuss research directions towards a general architecture for visual reasoning to tackle this benchmark.
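To make "action-oriented LOGO language" concrete, here is a minimal sketch of how a LOGO-style action program can be traced into a shape. The action names (`forward`, `left`, `right`) and the function `run_logo` are assumptions for illustration; the benchmark's actual action set and renderer are not described in the abstract.

```python
import math


def run_logo(actions, start=(0.0, 0.0), heading_deg=0.0):
    """Trace ("forward", distance), ("left", degrees), ("right", degrees) actions into a polyline."""
    x, y = start
    points = [(x, y)]
    for name, value in actions:
        if name == "forward":
            # Move along the current heading and record the new vertex.
            x += value * math.cos(math.radians(heading_deg))
            y += value * math.sin(math.radians(heading_deg))
            points.append((x, y))
        elif name == "left":
            heading_deg += value
        elif name == "right":
            heading_deg -= value
        else:
            raise ValueError(f"unknown action: {name}")
    return points


# A unit square expressed as a program: four strokes, each followed by a 90-degree turn.
square_program = [("forward", 1.0), ("left", 90.0)] * 4
square_points = run_logo(square_program)
```

Because shapes are programs, a generator can vary stroke lengths and angles to produce many instances of the same concept.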

59. Cell Complex Neural Networks
Mustafa Hajij, Kyle Istvan, Ghada Zamzami
Abstract: Cell complexes are topological spaces constructed from simple blocks called cells. They generalize graphs, simplicial complexes, and polyhedral complexes, all of which form important domains for practical applications. We propose a general, combinatorial, and unifying construction for performing neural network-type computations on cell complexes. Furthermore, we introduce inter-cellular message passing schemes, that is, message passing schemes on cell complexes that take the topology of the underlying space into account. In particular, our method generalizes many of the most popular types of graph neural networks.

60. BCNN: A Binary CNN with All Matrix Ops Quantized to 1 Bit Precision
Arthur J. Redfern, Lijun Zhu, Molly K. Newquist
Abstract: This paper describes a CNN in which all CNN-style 2D convolution operations that lower to matrix-matrix multiplication are fully binary. The network is derived from a common building block structure that is consistent with a constructive proof outline showing that binary neural networks are universal function approximators. High levels of accuracy on the 2012 ImageNet validation set are achieved with a two-step training procedure, and implementation strategies optimized for binary operands are provided.
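A fully binary matrix multiplication is commonly implemented with XOR and popcount: if both operands take values in {-1, +1} and are packed into bits, the dot product of two length-n vectors is n minus twice the number of differing bits. The sketch below shows that general trick only; it is not the paper's implementation, and the helper names are hypothetical.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int bitmask (a set bit means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors given as packed bitmasks."""
    # XOR marks positions where the signs differ; each contributes -1, the rest +1.
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


def binary_matmul(A, B):
    """Multiply an m x k matrix by a k x n matrix, both with +1/-1 entries."""
    k = len(B)
    rows = [pack_signs(row) for row in A]
    cols = [pack_signs([B[i][j] for i in range(k)]) for j in range(len(B[0]))]
    return [[binary_dot(r, c, k) for c in cols] for r in rows]


A = [[1, -1, 1], [-1, -1, 1]]
B = [[1, 1], [-1, 1], [1, -1]]
print(binary_matmul(A, B))  # [[3, -1], [1, -3]]
```

On hardware, the XOR and bit count map to single instructions over machine words, which is where the speed and memory savings of 1-bit operands come from.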

61. Implicit Rank-Minimizing Autoencoder
Li Jing, Jure Zbontar, Yann LeCun
Abstract: An important component of autoencoders is the method by which the information capacity of the latent representation is minimized or limited. In this work, the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions. By inserting a number of extra linear layers between the encoder and the decoder, the system spontaneously learns representations with a low effective dimension. The model, dubbed Implicit Rank-Minimizing Autoencoder (IRMAE), is simple, deterministic, and learns compact latent spaces. We demonstrate the validity of the method on several image generation and representation learning tasks.
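The structural idea, extra linear layers placed between the encoder and the decoder, can be sketched without any training code. The example below only illustrates where the layers sit and the fact that a stack of linear layers composes into a single matrix; the rank-minimizing effect itself comes from gradient descent training, which is not shown. All names, sizes, and the random initialization are assumptions.

```python
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def make_linear_layers(dim, depth, seed=0):
    """Create `depth` square linear layers with small random weights (no bias, no nonlinearity)."""
    rng = random.Random(seed)
    scale = 1.0 / dim ** 0.5
    return [
        [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(depth)
    ]


def apply_layers(layers, code):
    """Pass the encoder output through each extra linear layer in turn."""
    for W in layers:
        code = matvec(W, code)
    return code


def collapse(layers):
    """Fold the stack into one matrix: W_k ... W_2 W_1."""
    total = layers[0]
    for W in layers[1:]:
        total = matmul(W, total)
    return total


layers = make_linear_layers(dim=4, depth=3)
encoder_output = [1.0, -2.0, 0.5, 3.0]  # stand-in for a real encoder's code
latent = apply_layers(layers, encoder_output)
```

Since the stack collapses to one matrix, the extra layers can be folded away after training.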

62. VAEBM: A Symbiosis between Variational Autoencoders and Energy-based Models
Zhisheng Xiao, Karsten Kreis, Jan Kautz, Arash Vahdat
Abstract: Energy-based models (EBMs) have recently been successful in representing complex distributions of small images. However, sampling from them requires expensive Markov chain Monte Carlo (MCMC) iterations that mix slowly in high dimensional pixel space. Unlike EBMs, variational autoencoders (VAEs) generate samples quickly and are equipped with a latent space that enables fast traversal of the data manifold. However, VAEs tend to assign high probability density to regions in data space outside the actual data distribution and often fail at generating sharp images. In this paper, we propose VAEBM, a symbiotic composition of a VAE and an EBM that offers the best of both worlds. VAEBM captures the overall mode structure of the data distribution using a state-of-the-art VAE and it relies on its EBM component to explicitly exclude non-data-like regions from the model and refine the image samples. Moreover, the VAE component in VAEBM allows us to speed up MCMC updates by reparameterizing them in the VAE's latent space. Our experimental results show that VAEBM outperforms state-of-the-art VAEs and EBMs in generative quality on several benchmark image datasets by a large margin. It can generate high-quality images as large as 256$\times$256 pixels with short MCMC chains. We also demonstrate that VAEBM provides complete mode coverage and performs well in out-of-distribution detection.
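For readers unfamiliar with MCMC sampling from an energy-based model, Langevin dynamics is one common choice: each step follows the negative energy gradient and adds Gaussian noise. The sketch below runs it on a toy one-dimensional energy E(x) = x^2 / 2, whose density is a standard normal. This is a generic illustration, not VAEBM's sampler, which reparameterizes the updates in the VAE's latent space; the step size and chain length are arbitrary assumptions.

```python
import math
import random


def grad_energy(x):
    """Gradient of the toy energy E(x) = x**2 / 2 (density exp(-E) is a standard normal)."""
    return x


def langevin_chain(x0, steps, step_size, rng):
    """Run one unadjusted Langevin chain and return its final state."""
    x = x0
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Drift toward low energy, plus noise scaled by sqrt(step_size).
        x = x - 0.5 * step_size * grad_energy(x) + math.sqrt(step_size) * noise
    return x


def sample(n_chains=2000, steps=200, step_size=0.1, seed=0):
    """Draw one sample per chain, all started far from the mode at x = 3."""
    rng = random.Random(seed)
    return [langevin_chain(3.0, steps, step_size, rng) for _ in range(n_chains)]


samples = sample()
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance should land near 0 and 1. In pixel space, the same procedure needs many more steps to mix, which is the cost the abstract refers to.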

63. Tabular GANs for uneven distribution
Insaf Ashrapov
Abstract: GANs are well known for their success in realistic image generation. However, they can be applied to tabular data generation as well. We review and examine some recent papers about tabular GANs in action. We generate data to bring the training distribution closer to the test distribution. We then compare the performance of a model trained on the initial training dataset with that of a model trained on the training set with GAN-generated data; we also train the model on training data sampled by adversarial training. We show that using a GAN might be an option when the data distribution differs between training and test data.
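"Sampling train by adversarial training" is not spelled out in the abstract. One common technique it may refer to is adversarial validation: fit a classifier to tell training rows from test rows, then keep the training rows it scores as most test-like. The sketch below assumes that reading and uses a single scalar feature with a hand-written logistic regression; the function names, learning rate, and toy values are all hypothetical.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=500):
    """Fit p(y=1 | x) = sigmoid(w*x + b) on scalar features by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def select_test_like(train, test, keep):
    """Keep the `keep` training values the classifier scores as most test-like."""
    xs = train + test
    ys = [0] * len(train) + [1] * len(test)  # label 1 means "comes from the test set"
    w, b = fit_logistic(xs, ys)

    def score(x):
        return 1.0 / (1.0 + math.exp(-(w * x + b)))

    return sorted(train, key=score, reverse=True)[:keep]


# Toy data (hypothetical): the test set is shifted toward larger values.
train_values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [4.0, 5.0, 6.0, 7.0]
selected = select_test_like(train_values, test_values, keep=3)
```

With real tabular data, the same ranking would be produced by any probabilistic classifier over all columns, and the selected rows form a training subset closer to the test distribution.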



[arXiv papers] Computer Vision and Pattern Recognition 2020-10-05

Contents

Abstracts