Contents
3. A Spatially Constrained Deep Convolutional Neural Network for Nerve Fiber Segmentation in Corneal Confocal Microscopic Images using Inaccurate Annotations [PDF] Abstract
5. Characters as Graphs: Recognizing Online Handwritten Chinese Characters via Spatial Graph Convolutional Network [PDF] Abstract
12. Combining multimodal information for Metal Artefact Reduction: An unsupervised deep learning framework [PDF] Abstract
23. Landmark Detection and 3D Face Reconstruction for Caricature using a Nonlinear Parametric Model [PDF] Abstract
31. Airborne LiDAR Point Cloud Classification with Graph Attention Convolution Neural Network [PDF] Abstract
33. DeepSDF x Sim(3): Extending DeepSDF for automatic 3D shape retrieval and similarity transform estimation [PDF] Abstract
39. Desmoking laparoscopy surgery images using an image-to-image translation guided by an embedded dark channel [PDF] Abstract
40. Exploring Racial Bias within Face Recognition via per-subject Adversarially-Enabled Data Augmentation [PDF] Abstract
41. 3D rectification with Visual Sphere perspective: an algebraic alternative for P4P pose estimation [PDF] Abstract
42. MER-GCN: Micro Expression Recognition Based on Relation Modeling with Graph Convolutional Network [PDF] Abstract
43. A Biologically Interpretable Two-stage Deep Neural Network (BIT-DNN) For Hyperspectral Imagery Classification [PDF] Abstract
46. When Residual Learning Meets Dense Aggregation: Rethinking the Aggregation of Deep Neural Networks [PDF] Abstract
47. AD-Cluster: Augmented Discriminative Clustering for Domain Adaptive Person Re-identification [PDF] Abstract
53. An end-to-end CNN framework for polarimetric vision tasks based on polarization-parameter-constructing network [PDF] Abstract
58. Occluded Prohibited Items Detection: An X-ray Security Inspection Benchmark and De-occlusion Attention Module [PDF] Abstract
66. Dynamic Feature Integration for Simultaneous Detection of Salient Object, Edge and Skeleton [PDF] Abstract
70. Accurate Tumor Tissue Region Detection with Accelerated Deep Convolutional Neural Networks [PDF] Abstract
74. Super-Resolution-based Snake Model -- An Unsupervised Method for Large-Scale Building Extraction using Airborne LiDAR Data and Optical Image [PDF] Abstract
75. JL-DCF: Joint Learning and Densely-Cooperative Fusion Framework for RGB-D Salient Object Detection [PDF] Abstract
76. Semi-Supervised Semantic Segmentation via Dynamic Self-Training and Class-Balanced Curriculum [PDF] Abstract
78. Finding Berries: Segmentation and Counting of Cranberries using Point Supervision and Shape Priors [PDF] Abstract
79. BReG-NeXt: Facial Affect Computing Using Adaptive Residual Networks With Bounded Gradient [PDF] Abstract
81. Organ at Risk Segmentation for Head and Neck Cancer using Stratified Learning and Neural Architecture Search [PDF] Abstract
86. Non-Blocking Simultaneous Multithreading: Embracing the Resiliency of Deep Neural Networks [PDF] Abstract
87. GraN: An Efficient Gradient-Norm Based Detector for Adversarial and Misclassified Examples [PDF] Abstract
91. X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions [PDF] Abstract
92. Spectral GUI for Automated Tissue and Lesion Segmentation of T1 Weighted Breast MR Images [PDF] Abstract
95. A fast semi-automatic method for classification and counting the number and types of blood cells in an image [PDF] Abstract
96. Automatic Grading of Knee Osteoarthritis on the Kellgren-Lawrence Scale from Radiographs Using Convolutional Neural Networks [PDF] Abstract
Abstracts
1. Bringing Old Photos Back to Life [PDF] Back to Contents
Ziyu Wan, Bo Zhang, Dongdong Chen, Pan Zhang, Dong Chen, Jing Liao, Fang Wen
Abstract: We propose to restore old photos that suffer from severe degradation through a deep learning approach. Unlike conventional restoration tasks that can be solved through supervised learning, the degradation in real photos is complex and the domain gap between synthetic images and real old photos makes the network fail to generalize. Therefore, we propose a novel triplet domain translation network by leveraging real photos along with massive synthetic image pairs. Specifically, we train two variational autoencoders (VAEs) to respectively transform old photos and clean photos into two latent spaces. The translation between these two latent spaces is learned with synthetic paired data. This translation generalizes well to real photos because the domain gap is closed in the compact latent space. Besides, to address multiple degradations mixed in one old photo, we design a global branch with a partial nonlocal block targeting the structured defects, such as scratches and dust spots, and a local branch targeting the unstructured defects, such as noise and blurriness. The two branches are fused in the latent space, leading to improved capability to restore old photos from multiple defects. The proposed method outperforms state-of-the-art methods in terms of visual quality for old photo restoration.
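To make the pipeline concrete, here is a minimal PyTorch sketch of the triplet idea: encode an old photo with one VAE, translate its latent code, and decode with the clean-photo VAE. All module names and sizes (ConvVAE, mapper, 16 latent channels) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Tiny stand-in VAE: encodes an image to a latent map and decodes it back."""
    def __init__(self, ch=3, z=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(ch, 32, 4, 2, 1), nn.ReLU(),
                                 nn.Conv2d(32, 2 * z, 4, 2, 1))  # mean and log-var
        self.dec = nn.Sequential(nn.ConvTranspose2d(z, 32, 4, 2, 1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, ch, 4, 2, 1), nn.Sigmoid())

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

vae_old, vae_clean = ConvVAE(), ConvVAE()          # two separate latent spaces
mapper = nn.Sequential(nn.Conv2d(16, 16, 3, 1, 1), nn.ReLU(),
                       nn.Conv2d(16, 16, 3, 1, 1))  # learned latent translation

old_photo = torch.rand(1, 3, 64, 64)
z_old = vae_old.encode(old_photo)                  # old-photo latent space
restored = vae_clean.dec(mapper(z_old))            # translate, then decode as clean
print(restored.shape)                              # torch.Size([1, 3, 64, 64])
```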
2. Music Gesture for Visual Sound Separation [PDF] Back to Contents
Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, Antonio Torralba
Abstract: Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance and optical-flow-like motion feature representations, which exhibit limited abilities to find the correlations between audio signals and visual points, especially when separating multiple instruments of the same type, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements upon benchmark metrics for hetero-musical separation tasks (i.e. different instruments); 2) a new ability for effective homo-musical separation for piano, flute, and trumpet duets, which to the best of our knowledge has never been achieved with alternative methods. Project page: this http URL.
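A rough sketch of the two stages, assuming toy dimensions: a graph layer over body keypoints produces a pooled motion feature, which is fused with an audio embedding to drive a separation mask. The adjacency, pooling, and all sizes here are our assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

gcn = nn.Linear(2, 64)                              # per-keypoint embedding
fuse = nn.Sequential(nn.Linear(64 + 128, 128), nn.ReLU(), nn.Linear(128, 256))

kpts = torch.rand(1, 25, 2)                         # 25 body/finger keypoints (x, y)
A = torch.softmax(torch.rand(25, 25), dim=1)        # stand-in contextual adjacency
body = (A @ gcn(kpts)).mean(dim=1)                  # message passing + pooling, (1, 64)
audio = torch.rand(1, 128)                          # mixture spectrogram embedding
mask_logits = fuse(torch.cat([body, audio], dim=1)) # drives the separation mask
print(mask_logits.shape)                            # torch.Size([1, 256])
```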
3. A Spatially Constrained Deep Convolutional Neural Network for Nerve Fiber Segmentation in Corneal Confocal Microscopic Images using Inaccurate Annotations [PDF] Back to Contents
Ning Zhang, Susan Francis, Rayaz Malik, Xin Chen
Abstract: Semantic image segmentation is one of the most important tasks in medical image analysis. Most state-of-the-art deep learning methods require a large number of accurately annotated examples for model training. However, accurate annotation is difficult to obtain, especially in medical applications. In this paper, we propose a spatially constrained deep convolutional neural network (DCNN) to achieve smooth and robust image segmentation using inaccurately annotated labels for training. In our proposed method, image segmentation is formulated as a graph optimization problem that is solved by a DCNN model learning process. The cost function to be optimized consists of a unary term calculated by a cross-entropy measurement and a pairwise term based on enforcing local label consistency. The proposed method has been evaluated on corneal confocal microscopic (CCM) images for nerve fiber segmentation, where accurate annotations are extremely difficult to obtain. Based on both the quantitative results on a synthetic dataset and a qualitative assessment of a real dataset, the proposed method achieves superior performance in producing high-quality segmentation results even with inaccurate labels for training.
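The two-term cost function can be sketched directly. Below is a minimal PyTorch version, assuming a simple pairwise term that penalises disagreement between adjacent pixel predictions; the weight `lam` and the exact form of the consistency term are our assumptions.

```python
import torch
import torch.nn.functional as F

def spatially_constrained_loss(logits, labels, lam=0.1):
    """logits: (B, C, H, W); labels: (B, H, W), possibly inaccurate annotations."""
    unary = F.cross_entropy(logits, labels)                 # unary term
    p = logits.softmax(dim=1)
    # pairwise term: encourage local label consistency between
    # horizontally and vertically adjacent predictions
    pair = (p[..., :, 1:] - p[..., :, :-1]).abs().mean() + \
           (p[..., 1:, :] - p[..., :-1, :]).abs().mean()
    return unary + lam * pair

logits = torch.randn(2, 2, 64, 64, requires_grad=True)
labels = torch.randint(0, 2, (2, 64, 64))
spatially_constrained_loss(logits, labels).backward()
```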
4. Improving correlation method with convolutional neural networks [PDF] Back to Contents
Dmitriy Goncharov, Rostislav Starikov
Abstract: We present a convolutional neural network for the classification of correlation responses obtained by correlation filters. The proposed approach can improve the accuracy of classification, as well as achieve invariance to the image classes and parameters.
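As a concrete illustration, a small CNN classifying 2-D correlation responses might look like the sketch below; the layer choices, input size, and class count are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

corr_cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 10),           # say, 10 image classes
)
response = torch.rand(4, 1, 32, 32)            # correlation-filter outputs
print(corr_cnn(response).shape)                # torch.Size([4, 10])
```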
5. Characters as Graphs: Recognizing Online Handwritten Chinese Characters via Spatial Graph Convolutional Network [PDF] Back to Contents
Ji Gan, Weiqiang Wang, Ke Lu
Abstract: Chinese is one of the most widely used languages in the world, yet online handwritten Chinese character recognition (OLHCCR) remains challenging. To recognize Chinese characters, one popular choice is to adopt the 2D convolutional neural network (2D-CNN) on the extracted feature images, and another one is to employ the recurrent neural network (RNN) or 1D-CNN on the time-series features. Instead of viewing characters as either static images or temporal trajectories, here we propose to represent characters as geometric graphs, retaining both spatial structures and temporal orders. Accordingly, we propose a novel spatial graph convolution network (SGCN) to effectively classify those character graphs for the first time. Specifically, our SGCN incorporates the local neighbourhood information via spatial graph convolutions and further learns the global shape properties with a hierarchical residual structure. Experiments on IAHCC-UCAS2016, ICDAR-2013, and UNIPEN datasets demonstrate that the SGCN can achieve comparable recognition performance with the state-of-the-art methods for character recognition.
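A minimal sketch of a spatial graph convolution over a character represented as a point graph, assuming a k-nearest-neighbour adjacency over trajectory points; the layer form and feature sizes are our assumptions, not the paper's SGCN.

```python
import torch
import torch.nn as nn

def knn_adjacency(xy, k=4):
    """xy: (N, 2) point coordinates -> row-normalised (N, N) adjacency."""
    d = torch.cdist(xy, xy)
    idx = d.topk(k + 1, largest=False).indices          # self + k neighbours
    A = torch.zeros(len(xy), len(xy)).scatter_(1, idx, 1.0)
    return A / A.sum(1, keepdim=True)

class SpatialGraphConv(nn.Module):
    def __init__(self, f_in, f_out):
        super().__init__()
        self.lin = nn.Linear(f_in, f_out)

    def forward(self, x, A):                            # x: (N, f_in)
        return torch.relu(self.lin(A @ x))              # aggregate, then transform

pts = torch.rand(50, 2)                                 # sampled pen trajectory
feats = torch.cat([pts, torch.rand(50, 3)], dim=1)      # coords + extra point cues
out = SpatialGraphConv(5, 32)(feats, knn_adjacency(pts))
print(out.shape)                                        # torch.Size([50, 32])
```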
6. Shape-Oriented Convolution Neural Network for Point Cloud Analysis [PDF] Back to Contents
Chaoyi Zhang, Yang Song, Lina Yao, Weidong Cai
Abstract: Point cloud is a principal data structure adopted for 3D geometric information encoding. Unlike other conventional visual data, such as images and videos, these irregular points describe the complex shape features of 3D objects, which makes shape feature learning an essential component of point cloud analysis. To this end, a shape-oriented message passing scheme dubbed ShapeConv is proposed to focus on the representation learning of the underlying shape formed by each local neighboring point. In addition to this intra-shape relationship learning, ShapeConv is also designed to incorporate the contextual effects from the inter-shape relationship through capturing the long-ranged dependencies between local underlying shapes. This shape-oriented operator is stacked into our hierarchical learning architecture, namely the Shape-Oriented Convolutional Neural Network (SOCNN), developed for point cloud analysis. Extensive experiments have been performed to evaluate its significance in the tasks of point cloud classification and part segmentation.
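The neighbourhood message passing can be sketched as follows: each point gathers the features of its k nearest neighbours together with their relative coordinates (a simple shape cue) and pools over them. All names and sizes are illustrative assumptions, not ShapeConv itself.

```python
import torch
import torch.nn as nn

class LocalShapeConv(nn.Module):
    def __init__(self, f_in, f_out, k=8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(f_in + 3, f_out), nn.ReLU())

    def forward(self, xyz, feats):                  # xyz: (N, 3), feats: (N, f_in)
        # k nearest neighbours (the point itself is included)
        idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices  # (N, k)
        rel = xyz[idx] - xyz[:, None, :]            # relative positions, (N, k, 3)
        msg = torch.cat([feats[idx], rel], dim=-1)  # neighbour features + shape cue
        return self.mlp(msg).max(dim=1).values      # max-pool over the neighbourhood

xyz = torch.rand(1024, 3)
out = LocalShapeConv(3, 64)(xyz, xyz)               # coordinates as first features
print(out.shape)                                    # torch.Size([1024, 64])
```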
7. The Notorious Difficulty of Comparing Human and Machine Perception [PDF] Back to Contents
Christina M. Funke, Judy Borowski, Karolina Stosio, Wieland Brendel, Thomas S. A. Wallis, Matthias Bethge
Abstract: With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed towards comparing information processing in humans and machines. These works have the potential to deepen our understanding of the inner mechanisms of human perception and to improve machine learning. Drawing robust conclusions from comparison studies, however, turns out to be difficult. Here, we highlight common shortcomings that can easily lead to fragile conclusions. First, if a model does achieve high performance on a task similar to humans, its decision-making process is not necessarily human-like. Moreover, further analyses can reveal differences. Second, the performance of neural networks is sensitive to training procedures and architectural details. Thus, generalizing conclusions from specific architectures is difficult. Finally, when comparing humans and machines, equivalent experimental settings are crucial in order to identify innate differences. Addressing these shortcomings alters or refines the conclusions of studies. We show that, despite their ability to solve closed-contour tasks, our neural networks use different decision-making strategies than humans. We further show that there is no fundamental difference between same-different and spatial tasks for common feed-forward neural networks and finally, that neural networks do experience a "recognition gap" on minimal recognizable images. All in all, care has to be taken to not impose our human systematic bias when comparing human and machine perception.
8. Class Distribution Alignment for Adversarial Domain Adaptation [PDF] Back to Contents
Wanqi Yang, Tong Ling, Chengmei Yang, Lei Wang, Yinghuan Shi, Luping Zhou, Ming Yang
Abstract: Most existing unsupervised domain adaptation methods mainly focused on aligning the marginal distributions of samples between the source and target domains. This setting does not sufficiently consider the class distribution information between the two domains, which could adversely affect the reduction of domain gap. To address this issue, we propose a novel approach called Conditional ADversarial Image Translation (CADIT) to explicitly align the class distributions given samples between the two domains. It integrates a discriminative structure-preserving loss and a joint adversarial generation loss. The former effectively prevents undesired label-flipping during the whole process of image translation, while the latter maintains the joint distribution alignment of images and labels. Furthermore, our approach enforces the classification consistency of target domain images before and after adaptation to aid the classifier training in both domains. Extensive experiments were conducted on multiple benchmark datasets including Digits, Faces, Scenes and Office31, showing that our approach achieved superior classification in the target domain when compared to the state-of-the-art methods. Also, both qualitative and quantitative results well supported our motivation that aligning the class distributions can indeed improve domain adaptation.
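To illustrate the two-term objective, here is a heavily simplified generator-side loss: an L1 term standing in for the discriminative structure-preserving loss, plus an adversarial term on discriminator scores over (image, label) pairs. Both the L1 stand-in and the weight `lam` are our assumptions, not CADIT's exact formulation.

```python
import torch
import torch.nn.functional as F

def cadit_style_generator_loss(x_src, x_trans, d_joint_logits, lam=10.0):
    """x_src, x_trans: source and translated images, (B, 3, H, W).
    d_joint_logits: discriminator scores on (translated image, class label)
    pairs; the generator wants them classified as real."""
    structure = F.l1_loss(x_trans, x_src)                 # structure-preserving term
    adv = F.binary_cross_entropy_with_logits(
        d_joint_logits, torch.ones_like(d_joint_logits))  # joint adversarial term
    return adv + lam * structure

x = torch.rand(2, 3, 32, 32)
x_t = torch.rand(2, 3, 32, 32, requires_grad=True)
d_out = torch.randn(2, 1, requires_grad=True)
cadit_style_generator_loss(x, x_t, d_out).backward()
```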
9. Complex-Object Visual Inspection via Multiple Lighting Configurations [PDF] Back to Contents
Maya Aghaei, Matteo Bustreo, Pietro Morerio, Nicolo Carissimi, Alessio Del Bue, Vittorio Murino
Abstract: The design of an automatic visual inspection system is usually performed in two stages. While the first stage consists in selecting the most suitable hardware setup for highlighting most effectively the defects on the surface to be inspected, the second stage concerns the development of algorithmic solutions to exploit the potentials offered by the collected data. In this paper, first, we present a novel illumination setup embedding four illumination configurations to resemble diffused, dark-field, and front lighting techniques. Second, we analyze the contributions brought by deploying the proposed setup in training phase only - mimicking the scenario in which an already developed visual inspection system cannot be modified on the customer site - and in evaluation phase. Along with an exhaustive set of experiments, in this paper, we demonstrate the suitability of the proposed setup for effective illumination of complex-objects, defined as manufactured items with variable surface characteristics that cannot be determined a priori. Moreover, we discuss the importance of multiple light configurations availability during training and their natural boosting effect which, without the need to modify the system design in evaluation phase, lead to improvements in the overall system performance.
10. Deep-COVID: Predicting COVID-19 From Chest X-Ray Images Using Deep Transfer Learning [PDF] Back to Contents
Shervin Minaee, Rahele Kafieh, Milan Sonka, Shakib Yazdani, Ghazaleh Jamalipour Soufi
Abstract: The COVID-19 pandemic is causing a major outbreak in more than 150 countries around the world, having a severe impact on the health and life of many people globally. One of the crucial steps in fighting COVID-19 is the ability to detect infected patients early enough and put them under special care. Detecting this disease from radiography and radiology images is perhaps one of the fastest ways to diagnose patients. Some of the early studies showed specific abnormalities in the chest radiograms of patients infected with COVID-19. Inspired by earlier works, we study the application of deep learning models to detect COVID-19 patients from their chest radiography images. We first prepare a dataset of 5,000 Chest X-rays from the publicly available datasets. Images exhibiting COVID-19 disease presence were identified by a board-certified radiologist. Transfer learning on a subset of 2,000 radiograms was used to train four popular convolutional neural networks, including ResNet18, ResNet50, SqueezeNet, and DenseNet-121, to identify COVID-19 disease in the analyzed chest X-ray images. We evaluated these models on the remaining 3,000 images, and most of these networks achieved a sensitivity rate of 97\% ($\pm$ 5\%), while having a specificity rate of around 90\%. While the achieved performance is very encouraging, further analysis is required on a larger set of COVID-19 images to have a more reliable estimation of accuracy rates. Besides sensitivity and specificity rates, we also present the receiver operating characteristic (ROC), area under the curve (AUC), and confusion matrix of each model. The dataset, model implementations (in PyTorch), and evaluations are all made publicly available for the research community, here: this https URL
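Since the models and code are in PyTorch, the transfer-learning setup can be sketched in a few lines: load an ImageNet-pretrained ResNet18 and fine-tune it as a two-class classifier. Data loading and the train/eval split are omitted; shapes and hyperparameters are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)          # downloads ImageNet weights
model.fc = nn.Linear(model.fc.in_features, 2)     # replace the head: 2 classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

x = torch.rand(8, 3, 224, 224)                    # a batch of chest X-rays
y = torch.randint(0, 2, (8,))                     # 0 = non-COVID, 1 = COVID
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```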
11. Robust Partial Matching for Person Search in the Wild [PDF] Back to Contents
Yingji Zhong, Xiaoyu Wang, Shiliang Zhang
Abstract: Various factors like occlusions, backgrounds, etc., would lead to misaligned detected bounding boxes, e.g., ones covering only portions of human body. This issue is common but overlooked by previous person search works. To alleviate this issue, this paper proposes an Align-to-Part Network (APNet) for person detection and re-Identification (reID). APNet refines detected bounding boxes to cover the estimated holistic body regions, from which discriminative part features can be extracted and aligned. Aligned part features naturally formulate reID as a partial feature matching procedure, where valid part features are selected for similarity computation, while part features on occluded or noisy regions are discarded. This design enhances the robustness of person search to real-world challenges with marginal computation overhead. This paper also contributes a Large-Scale dataset for Person Search in the wild (LSPS), which is by far the largest and the most challenging dataset for person search. Experiments show that APNet brings considerable performance improvement on LSPS. Meanwhile, it achieves competitive performance on existing person search benchmarks like CUHK-SYSU and PRW.
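The partial feature matching idea can be sketched as follows: compare part features only where both images have a valid (non-occluded, in-box) observation of that part. How validity is obtained is the paper's contribution; here it is simply given as a mask, and cosine similarity is our choice of metric.

```python
import torch
import torch.nn.functional as F

def partial_similarity(parts_a, parts_b, valid_a, valid_b):
    """parts_*: (P, D) per-part features; valid_*: (P,) boolean masks."""
    both = valid_a & valid_b                       # parts observable in both images
    if not both.any():
        return torch.tensor(0.0)
    sims = F.cosine_similarity(parts_a[both], parts_b[both], dim=1)
    return sims.mean()                             # average over valid parts only

a, b = torch.rand(6, 256), torch.rand(6, 256)      # 6 body parts, 256-d each
va = torch.tensor([1, 1, 1, 1, 0, 0], dtype=torch.bool)   # lower body not visible
vb = torch.ones(6, dtype=torch.bool)
print(partial_similarity(a, b, va, vb))
```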
12. Combining multimodal information for Metal Artefact Reduction: An unsupervised deep learning framework [PDF] Back to Contents
Marta B.M. Ranzini, Irme Groothuis, Kerstin Kläser, M. Jorge Cardoso, Johann Henckel, Sébastien Ourselin, Alister Hart, Marc Modat
Abstract: Metal artefact reduction (MAR) techniques aim at removing metal-induced noise from clinical images. In Computed Tomography (CT), supervised deep learning approaches have been shown effective but limited in generalisability, as they mostly rely on synthetic data. In Magnetic Resonance Imaging (MRI) instead, no method has yet been introduced to correct the susceptibility artefact, still present even in MAR-specific acquisitions. In this work, we hypothesise that a multimodal approach to MAR would improve both CT and MRI. Given their different artefact appearance, their complementary information can compensate for the corrupted signal in either modality. We thus propose an unsupervised deep learning method for multimodal MAR. We introduce the use of Locally Normalised Cross Correlation as a loss term to encourage the fusion of multimodal information. Experiments show that our approach favours a smoother correction in the CT, while promoting signal recovery in the MRI.
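A minimal sketch of a Locally Normalised Cross Correlation loss term, assuming 2-D single-channel inputs and a square local window; the window size and normalisation details are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def lncc_loss(x, y, win=9, eps=1e-5):
    """x, y: (B, 1, H, W) images. Returns 1 - mean local NCC."""
    kernel = torch.ones(1, 1, win, win) / win ** 2
    local_mean = lambda t: F.conv2d(t, kernel, padding=win // 2)
    mx, my = local_mean(x), local_mean(y)
    cov = local_mean(x * y) - mx * my                    # local covariance
    vx = (local_mean(x * x) - mx * mx).clamp(min=eps)    # local variances
    vy = (local_mean(y * y) - my * my).clamp(min=eps)
    ncc = cov / (vx * vy).sqrt()                         # roughly in [-1, 1]
    return 1.0 - ncc.mean()                              # lower = more correlated

x, y = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(lncc_loss(x, y))
```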
13. How Do Neural Networks Estimate Optical Flow? A Neuropsychology-Inspired Study [PDF] Back to Contents
D. B. de Jong, F. Paredes-Vallés, G. C. H. E. de Croon
Abstract: End-to-end trained convolutional neural networks have led to a breakthrough in optical flow estimation. The most recent advances focus on improving the optical flow estimation by improving the architecture and setting a new benchmark on the publicly available MPI-Sintel dataset. Instead, in this article, we investigate how deep neural networks estimate optical flow. A better understanding of how these networks function is important for (i) assessing their generalization capabilities to unseen inputs, and (ii) suggesting changes to improve their performance. For our investigation, we focus on FlowNetS, as it is the prototype of an encoder-decoder neural network for optical flow estimation. Furthermore, we use a filter identification method that has played a major role in uncovering the motion filters present in animal brains in neuropsychological research. The method shows that the filters in the deepest layer of FlowNetS are sensitive to a variety of motion patterns. Not only do we find translation filters, as demonstrated in animal brains, but thanks to the easier measurements in artificial neural networks, we even unveil dilation, rotation, and occlusion filters. Furthermore, we find similarities in the refinement part of the network and the perceptual filling-in process which occurs in the mammal primary visual cortex.
14. Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking [PDF] Back to Contents
Peiliang Li, Jieqi Shi, Shaojie Shen
Abstract: Directly learning multiple 3D objects motion from sequential images is difficult, while the geometric bundle adjustment lacks the ability to localize the invisible object centroid. To benefit from both the powerful object understanding skill from deep neural network meanwhile tackle precise geometry modeling for consistent trajectory estimation, we propose a joint spatial-temporal optimization-based stereo 3D object tracking method. From the network, we detect corresponding 2D bounding boxes on adjacent images and regress an initial 3D bounding box. Dense object cues (local depth and local coordinates) that associating to the object centroid are then predicted using a region-based network. Considering both the instant localization accuracy and motion consistency, our optimization models the relations between the object centroid and observed cues into a joint spatial-temporal error function. All historic cues will be summarized to contribute to the current estimation by a per-frame marginalization strategy without repeated computation. Quantitative evaluation on the KITTI tracking dataset shows our approach outperforms previous image-based 3D tracking methods by significant margins. We also report extensive results on multiple categories and larger datasets (KITTI raw and Argoverse Tracking) for future benchmarking.
15. A Revised Generative Evaluation of Visual Dialogue [PDF] Back to Contents
Daniela Massiceti, Viveka Kulharia, Puneet K. Dokania, N. Siddharth, Philip H.S. Torr
Abstract: Evaluating Visual Dialogue, the task of answering a sequence of questions relating to a visual input, remains an open research challenge. The current evaluation scheme of the VisDial dataset computes the ranks of ground-truth answers in predefined candidate sets, which Massiceti et al. (2018) show can be susceptible to the exploitation of dataset biases. This scheme also does little to account for the different ways of expressing the same answer--an aspect of language that has been well studied in NLP. We propose a revised evaluation scheme for the VisDial dataset leveraging metrics from the NLP literature to measure consensus between answers generated by the model and a set of relevant answers. We construct these relevant answer sets using a simple and effective semi-supervised method based on correlation, which allows us to automatically extend and scale sparse relevance annotations from humans to the entire dataset. We release these sets and code for the revised evaluation scheme as DenseVisDial, and intend them to be an improvement to the dataset in the face of its existing constraints and design choices.
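The consensus idea can be sketched with plain Python: score a generated answer against a whole set of relevant answers instead of one ground truth. The token-overlap F1 below is a crude stand-in for the NLP metrics the paper uses, and taking the best match over the set is our aggregation choice.

```python
def token_f1(hyp, ref):
    """Crude token-overlap F1, standing in for metrics like METEOR or CIDEr."""
    th, tr = set(hyp.lower().split()), set(ref.lower().split())
    common = len(th & tr)
    if common == 0:
        return 0.0
    p, r = common / len(th), common / len(tr)
    return 2 * p * r / (p + r)

def consensus_score(generated, relevant_answers):
    """Score against a *set* of relevant answers: take the best match."""
    return max(token_f1(generated, ref) for ref in relevant_answers)

print(consensus_score("a brown dog", ["a dog", "the brown puppy", "a cat"]))
# 0.8 -- "a dog" shares both of its tokens with the generated answer
```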
16. Unsupervised Vehicle Counting via Multiple Camera Domain Adaptation [PDF] Back to Contents
Luca Ciampi, Carlos Santiago, Joao Paulo Costeira, Claudio Gennaro, Giuseppe Amato
Abstract: Monitoring vehicle flow in cities is a crucial issue to improve the urban environment and quality of life of citizens. Images are the best sensing modality to perceive and assess the flow of vehicles in large areas. Current technologies for vehicle counting in images hinge on large quantities of annotated data, preventing their scalability to city-scale as new cameras are added to the system. This is a recurrent problem when dealing with physical systems and a key research area in Machine Learning and AI. We propose and discuss a new methodology to design image-based vehicle density estimators with few labeled data via multiple camera domain adaptations.
17. Unsupervised Person Re-identification via Multi-label Classification [PDF] Back to Contents
Dongkai Wang, Shiliang Zhang
Abstract: The challenge of unsupervised person re-identification (ReID) lies in learning discriminative features without true labels. This paper formulates unsupervised person ReID as a multi-label classification task to progressively seek true labels. Our method starts by assigning each person image with a single-class label, then evolves to multi-label classification by leveraging the updated ReID model for label prediction. The label prediction comprises similarity computation and cycle consistency to ensure the quality of predicted labels. To boost the ReID model training efficiency in multi-label classification, we further propose the memory-based multi-label classification loss (MMCL). MMCL works with memory-based non-parametric classifier and integrates multi-label classification and single-label classification in a unified framework. Our label prediction and MMCL work iteratively and substantially boost the ReID performance. Experiments on several large-scale person ReID datasets demonstrate the superiority of our method in unsupervised person ReID. Our method also allows to use labeled person images in other domains. Under this transfer learning setting, our method also achieves state-of-the-art performance.
18. End-to-End Learning for Video Frame Compression with Self-Attention [PDF] 返回目录
Nannan Zou, Honglei Zhang, Francesco Cricri, Hamed R. Tavakoli, Jani Lainema, Emre Aksu, Miska Hannuksela, Esa Rahtu
Abstract: One of the core components of conventional (i.e., non-learned) video codecs is the prediction of a frame from a previously decoded frame by leveraging temporal correlations. In this paper, we propose an end-to-end learned system for compressing video frames. Instead of relying on pixel-space motion (as with optical flow), our system learns deep embeddings of frames and encodes their difference in latent space. At the decoder side, an attention mechanism is designed to attend to the latent space of frames to decide how different parts of the previous and current frame are combined to form the final predicted current frame. Spatially-varying channel allocation is achieved by using importance masks acting on the feature channels. The model is trained to reduce the bitrate by minimizing a loss on importance maps and a loss on the probability output by a context model for arithmetic coding. In our experiments, we show that the proposed system achieves high compression rates and high objective visual quality as measured by MS-SSIM and PSNR. Furthermore, we provide ablation studies in which we highlight the contribution of different components.
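As a rough illustration of the central idea only (coding the difference of deep frame embeddings and blending previous/current latents with a learned attention map), here is a minimal sketch. LatentDiffCodec and all layer sizes are invented for illustration; the entropy model and importance masks are omitted, and this is not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentDiffCodec(nn.Module):
    """Toy sketch: embed both frames, code their latent difference, and
    blend previous/current latents with a learned attention map before
    decoding the predicted current frame."""
    def __init__(self, ch=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Conv2d(3, ch, 5, 2, 2), nn.ReLU(),
                                   nn.Conv2d(ch, ch, 5, 2, 2))
        self.attn = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, 1, 1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, 1, 1), nn.Sigmoid())
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1))

    def forward(self, prev_frame, cur_frame):
        z_prev, z_cur = self.embed(prev_frame), self.embed(cur_frame)
        residual = z_cur - z_prev      # what a real codec would entropy-code
        a = self.attn(torch.cat([z_prev, z_prev + residual], dim=1))
        z_hat = a * (z_prev + residual) + (1 - a) * z_prev  # attention blend
        return self.decode(z_hat)

codec = LatentDiffCodec()
x0, x1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
print(codec(x0, x1).shape)   # torch.Size([1, 3, 64, 64])
```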
19. 4D Deep Learning for Multiple Sclerosis Lesion Activity Segmentation [PDF] 返回目录
Nils Gessert, Marcel Bengs, Julia Krüger, Roland Opfer, Ann-Christin Ostwaldt, Praveena Manogaran, Sven Schippling, Alexander Schlaefer
Abstract: Multiple sclerosis lesion activity segmentation is the task of detecting new and enlarging lesions that appeared between a baseline and a follow-up brain MRI scan. While deep learning methods for single-scan lesion segmentation are common, deep learning approaches for lesion activity have only been proposed recently. Here, a two-path architecture processes two 3D MRI volumes from two time points. In this work, we investigate whether extending this problem to full 4D deep learning using a history of MRI volumes and thus an extended baseline can improve performance. For this purpose, we design a recurrent multi-encoder-decoder architecture for processing 4D data. We find that adding more temporal information is beneficial and our proposed architecture outperforms previous approaches with a lesion-wise true positive rate of 0.84 at a lesion-wise false positive rate of 0.19.
20. CatNet: Class Incremental 3D ConvNets for Lifelong Egocentric Gesture Recognition [PDF] 返回目录
Zhengwei Wang, Qi She, Tejo Chalasani, Aljosa Smolic
Abstract: Egocentric gestures are the most natural form of communication for humans to interact with wearable devices such as VR/AR helmets and glasses. A major issue in such scenarios for real-world applications is that it may easily become necessary to add new gestures to the system; e.g., a proper VR system should allow users to customize gestures incrementally. Traditional deep learning methods require storing all previous class samples in the system and training the model again from scratch by incorporating previous samples and new samples, which costs humongous memory and significantly increases computation over time. In this work, we demonstrate a lifelong 3D convolutional framework - c(C)la(a)ss increment(t)al net(Net)work (CatNet) - which considers temporal information in videos and enables lifelong learning for egocentric gesture video recognition by learning the feature representation of an exemplar set selected from previous class samples. Importantly, we propose a two-stream CatNet, which deploys RGB and depth modalities to train two separate networks. We evaluate CatNets on a publicly available dataset -- the EgoGesture dataset -- and show that CatNets can learn many classes incrementally over a long period of time. Results also demonstrate that the two-stream architecture achieves the best performance on both joint training and class incremental training compared to 3 other one-stream architectures. The codes and pre-trained models used in this work are provided at this https URL.
21. Generative Feature Replay For Class-Incremental Learning [PDF] 返回目录
Xialei Liu, Chenshen Wu, Mikel Menta, Luis Herranz, Bogdan Raducanu, Andrew D. Bagdanov, Shangling Jui, Joost van de Weijer
Abstract: Humans are capable of learning new tasks without forgetting previous ones, while neural networks fail due to catastrophic forgetting between new and previously-learned tasks. We consider a class-incremental setting which means that the task-ID is unknown at inference time. The imbalance between old and new classes typically results in a bias of the network towards the newest ones. This imbalance problem can either be addressed by storing exemplars from previous tasks, or by using image replay methods. However, the latter can only be applied to toy datasets since image generation for complex datasets is a hard problem. We propose a solution to the imbalance problem based on generative feature replay which does not require any exemplars. To do this, we split the network into two parts: a feature extractor and a classifier. To prevent forgetting, we combine generative feature replay in the classifier with feature distillation in the feature extractor. Through feature generation, our method reduces the complexity of generative replay and prevents the imbalance problem. Our approach is computationally efficient and scalable to large datasets. Experiments confirm that our approach achieves state-of-the-art results on CIFAR-100 and ImageNet, while requiring only a fraction of the storage needed for exemplar-based continual learning. Code available at \url{this https URL}.
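A minimal sketch of one training step may make the recipe concrete: replayed features (not images) stand in for the old classes in the classifier loss, while feature distillation against a frozen copy of the extractor counters forgetting. CondFeatureGen and all sizes are hypothetical stand-ins, not the paper's components.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy components (all sizes illustrative)
feat_dim, n_old, n_total = 32, 6, 10
extractor = nn.Sequential(nn.Linear(64, feat_dim), nn.ReLU())
old_extractor = copy.deepcopy(extractor).eval()   # frozen snapshot
classifier = nn.Linear(feat_dim, n_total)

class CondFeatureGen(nn.Module):
    """Hypothetical conditional generator that replays *features*, not images."""
    latent_dim = 16
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_old, 16)
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))
    def forward(self, z, y):
        return self.net(torch.cat([z, self.emb(y)], dim=1))

generator = CondFeatureGen()

# one training step on the new task
x_new = torch.randn(8, 64)
y_new = torch.randint(n_old, n_total, (8,))       # samples of the new classes

feats_new = extractor(x_new)
loss_cls = F.cross_entropy(classifier(feats_new), y_new)

y_old = torch.randint(0, n_old, (8,))             # replayed old classes
feats_old = generator(torch.randn(8, CondFeatureGen.latent_dim), y_old)
loss_replay = F.cross_entropy(classifier(feats_old), y_old)

with torch.no_grad():                             # feature distillation target
    feats_ref = old_extractor(x_new)
loss = loss_cls + loss_replay + F.mse_loss(feats_new, feats_ref)
print(loss.item())
```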
22. LSM: Learning Subspace Minimization for Low-level Vision [PDF] 返回目录
Chengzhou Tang, Lu Yuan, Ping Tan
Abstract: We study the energy minimization problem in low-level vision tasks from a novel perspective. We replace the heuristic regularization term with a learnable subspace constraint, and preserve the data term to exploit domain knowledge derived from the first principle of a task. This learning subspace minimization (LSM) framework unifies the network structures and the parameters for many low-level vision tasks, which allows us to train a single network for multiple tasks simultaneously with completely shared parameters, and even generalizes the trained network to an unseen task as long as its data term can be formulated. We demonstrate our LSM framework on four low-level tasks including interactive image segmentation, video segmentation, stereo matching, and optical flow, and validate the network on various datasets. The experiments show that the proposed LSM generates state-of-the-art results with smaller model size, faster training convergence, and real-time inference.
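For a concrete picture of the core idea, here is a hedged sketch: a quadratic data term ||Ax - b||^2 is kept, but minimization is restricted to a low-dimensional subspace x = Bc. In the paper's setting the basis B would be predicted by a network; a random orthonormal basis stands in for it here.

```python
import torch

def subspace_minimize(A, b, basis):
    """Minimize ||A x - b||^2 over x = basis @ c, i.e. a tiny least-squares
    solve over the subspace coefficients c instead of the full variable x."""
    M = A @ basis                                 # (m, k)
    c = torch.linalg.lstsq(M, b).solution         # (k, 1) coefficients
    return (basis @ c).squeeze(1)                 # lift back to R^n

m, n, k = 80, 100, 8
A, b = torch.randn(m, n), torch.randn(m, 1)
basis = torch.linalg.qr(torch.randn(n, k)).Q      # stand-in for a learned basis
print(subspace_minimize(A, b, basis).shape)       # torch.Size([100])
```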
23. Landmark Detection and 3D Face Reconstruction for Caricature using a Nonlinear Parametric Model [PDF] 返回目录
Juyong Zhang, Hongrui Cai, Yudong Guo, Zhuang Peng
Abstract: Caricature is an artistic abstraction of the human face that distorts or exaggerates certain facial features while still retaining a likeness to the given face. Due to the large diversity of geometric and texture variations, automatic landmark detection and 3D face reconstruction for caricature is a challenging problem and has rarely been studied before. In this paper, we propose the first automatic method for this task via a novel 3D approach. To this end, we first build a dataset with various styles of 2D caricatures and their corresponding 3D shapes, and then build a parametric model on a vertex-based deformation space for the 3D caricature face. Based on the constructed dataset and the nonlinear parametric model, we propose a neural network based method to regress the 3D face shape and orientation from the input 2D caricature image. Ablation studies and comparison with baseline methods demonstrate the effectiveness of our algorithm design, and extensive experimental results demonstrate that our method works well for various caricatures. Our constructed dataset, source code and trained model are available at this https URL.
24. Pose Manipulation with Identity Preservation [PDF] 返回目录
Andrei-Timotei Ardelean, Lucian Mircea Sasu
Abstract: This paper describes a new model which generates images in novel poses, e.g., by altering facial expression and orientation, from just a few instances of a human subject. Unlike previous approaches which require large datasets of a specific person for training, our approach may start from a scarce set of images, even from a single image. To this end, we introduce the Character Adaptive Identity Normalization GAN (CainGAN) which uses spatial characteristic features extracted by an embedder and combined across source images. The identity information is propagated throughout the network by applying conditional normalization. After extensive adversarial training, CainGAN receives figures of faces from a certain individual and produces new ones while preserving the person's identity. Experimental results show that the quality of generated images scales with the size of the input set used during inference. Furthermore, quantitative measurements indicate that CainGAN performs better compared to other methods when training data is limited.
25. VOC-ReID: Vehicle Re-identification based on Vehicle-Orientation-Camera [PDF] 返回目录
Xiangyu Zhu, Zhenbo Luo, Pei Fu, Xiang Ji
Abstract: Vehicle re-identification is a challenging task due to high intra-class variance and small inter-class variance. In this work, we focus on the failure cases caused by similar background and shape. They pose severe bias on similarity, making it easier to neglect fine-grained information. To reduce the bias, we propose an approach named VOC-ReID, taking the triplet vehicle-orientation-camera as a whole and reformulating background/shape similarity as camera/orientation re-identification. At first, we train models for vehicle, orientation and camera re-identification respectively. Then we use orientation and camera similarity as a penalty to obtain the final similarity. Besides, we propose a high-performance baseline boosted by a bag of tricks and weakly supervised data augmentation. Our algorithm achieves second place in vehicle re-identification at the NVIDIA AI City Challenge 2020.
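The final-similarity rule is concrete enough to sketch: subtract orientation and camera similarity, as penalties, from the vehicle-ReID similarity. The feature dictionaries and the penalty weights below are illustrative assumptions, not the released implementation.

```python
import numpy as np

def cosine(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def voc_similarity(q, g, w_orient=0.1, w_cam=0.05):
    """q, g: dicts of per-image features from three hypothetical ReID
    models ('vehicle', 'orient', 'camera'); weights are illustrative."""
    return (cosine(q["vehicle"], g["vehicle"])
            - w_orient * cosine(q["orient"], g["orient"])
            - w_cam * cosine(q["camera"], g["camera"]))

rng = np.random.default_rng(0)
q = {k: rng.normal(size=(5, 64)) for k in ("vehicle", "orient", "camera")}
g = {k: rng.normal(size=(7, 64)) for k in ("vehicle", "orient", "camera")}
print(voc_similarity(q, g).shape)   # (5, 7) query-gallery similarity matrix
```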
26. Cosmetic-Aware Makeup Cleanser [PDF] 返回目录
Yi Li, Huaibo Huang, Junchi Yu, Ran He, Tieniu Tan
Abstract: Face verification aims at determining whether a pair of face images belongs to the same identity. Recent studies have revealed the negative impact of facial makeup on verification performance. With the rapid development of deep generative models, this paper proposes a semantic-aware makeup cleanser (SAMC) to remove facial makeup under different poses and expressions and achieve verification via generation. The intuition lies in the fact that makeup is a combined effect of multiple cosmetics, so tailored treatments should be imposed on different cosmetic regions. To this end, we present both unsupervised and supervised semantic-aware learning strategies in SAMC. At the image level, an unsupervised attention module is jointly learned with the generator to locate cosmetic regions and estimate their degree. At the feature level, we resort to face parsing only in the training phase and design a localized texture loss to serve as a complement and pursue superior synthetic quality. Experimental results on four makeup-related datasets verify that SAMC not only produces appealing de-makeup outputs at a resolution of 256*256, but also facilitates makeup-invariant face verification through image generation.
27. Transformer Reasoning Network for Image-Text Matching and Retrieval [PDF] 返回目录
Nicola Messina, Fabrizio Falchi, Andrea Esuli, Giuseppe Amato
Abstract: Image-text matching is an interesting and fascinating task in modern AI research. Despite the evolution of deep-learning-based image and text processing systems, multi-modal matching remains a challenging problem. In this work, we consider the problem of accurate image-text matching for the task of multi-modal large-scale information retrieval. State-of-the-art results in image-text matching are achieved by interplaying image and text features from two different processing pipelines, usually using mutual attention mechanisms. However, this precludes extracting the separate visual and textual features needed for later indexing steps in large-scale retrieval systems. In this regard, we introduce the Transformer Encoder Reasoning Network (TERN), an architecture built upon one of the modern relationship-aware self-attentive architectures, the Transformer Encoder (TE). This architecture is able to reason separately about the two different modalities and to enforce a final common abstract concept space by sharing the weights of the deeper transformer layers. Thanks to this design, the implemented network is able to produce compact and very rich visual and textual features available for the successive indexing step. Experiments are conducted on the MS-COCO dataset, and we evaluate the results using a discounted cumulative gain metric with relevance computed by exploiting caption similarities, in order to assess possibly non-exact but relevant search results. We demonstrate that on this metric we are able to achieve state-of-the-art results in the image retrieval task. Our code is freely available at this https URL
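A minimal sketch of the two-stream layout described above, assuming standard Transformer Encoder layers: each modality passes through its own TE stack, then through shared deeper TE layers, and the pooled outputs are compact features that can be matched by cosine similarity (and indexed independently). All depths and sizes are illustrative.

```python
import torch
import torch.nn as nn

class TERNSketch(nn.Module):
    """Hedged sketch of the two-stream idea: modality-specific Transformer
    Encoder layers followed by shared deeper layers that enforce a common
    abstract space; mean-pooled outputs act as indexable features."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.visual = nn.TransformerEncoder(layer(), num_layers=2)
        self.textual = nn.TransformerEncoder(layer(), num_layers=2)
        self.shared = nn.TransformerEncoder(layer(), num_layers=2)  # weight sharing

    def forward(self, regions, words):
        v = self.shared(self.visual(regions)).mean(dim=1)   # (B, d)
        t = self.shared(self.textual(words)).mean(dim=1)    # (B, d)
        return nn.functional.cosine_similarity(v, t, dim=1)

model = TERNSketch()
score = model(torch.randn(2, 36, 256), torch.randn(2, 12, 256))
print(score.shape)   # torch.Size([2])
```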
28. Deep Exposure Fusion with Deghosting via Homography Estimation and Attention Learning [PDF] 返回目录
Sheng-Yeh Chen, Yung-Yu Chuang
Abstract: Modern cameras have limited dynamic ranges and often produce images with saturated or dark regions using a single exposure. Although the problem could be addressed by taking multiple images with different exposures, exposure fusion methods need to deal with ghosting artifacts and detail loss caused by camera motion or moving objects. This paper proposes a deep network for exposure fusion. For reducing the potential ghosting problem, our network only takes two images, an underexposed image and an overexposed one. Our network integrates homography estimation for compensating camera motion, an attention mechanism for correcting remaining misalignment and moving pixels, and adversarial learning for alleviating other remaining artifacts. Experiments on real-world photos taken using handheld mobile phones show that the proposed method can generate high-quality images with faithful detail and vivid color rendition in both dark and bright areas.
29. CatSIM: A Categorical Image Similarity Metric [PDF] 返回目录
Geoffrey Z. Thompson, Ranjan Maitra
Abstract: We introduce CatSIM, a new similarity metric for binary and multinary two- and three-dimensional images and volumes. CatSIM uses a structural similarity image quality paradigm and is robust to small perturbations in location, so that structures in similar, but not entirely overlapping, regions of two images are rated higher than under simple matching. The metric can also compare arbitrary regions inside images. CatSIM is evaluated on artificial data sets, image quality assessment surveys, and two imaging applications.
30. Semantic Correspondence via 2D-3D-2D Cycle [PDF] 返回目录
Yang You, Chengkun Li, Yujing Lou, Zhoujun Cheng, Lizhuang Ma, Cewu Lu, Weiming Wang
Abstract: Visual semantic correspondence is an important topic in computer vision and could help machines understand objects in our daily life. However, most previous methods train directly on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D space. In this paper, we propose a new method for predicting semantic correspondences by lifting the problem to the 3D domain and then projecting corresponding 3D models back to the 2D domain with their semantic labels. Our method leverages the advantages of 3D vision and can explicitly reason about object self-occlusion and visibility. We show that our method gives comparable and even superior results on standard semantic benchmarks. We also conduct thorough and detailed experiments to analyze our network components. The code and experiments are publicly available at this https URL.
31. Airborne LiDAR Point Cloud Classification with Graph Attention Convolution Neural Network [PDF] 返回目录
Congcong Wen, Xiang Li, Xiaojing Yao, Ling Peng, Tianhe Chi
Abstract: Airborne light detection and ranging (LiDAR) plays an increasingly significant role in urban planning, topographic mapping, environmental monitoring, power line detection and other fields thanks to its capability to quickly acquire large-scale and high-precision ground information. To achieve point cloud classification, previous studies proposed point cloud deep learning models that can directly process raw point clouds based on PointNet-like architectures, and some recent works proposed graph convolutional neural networks based on the inherent topology of point clouds. However, the above point cloud deep learning models only pay attention to exploring local geometric structures, yet ignore global contextual relationships among all points. In this paper, we present a graph attention convolution neural network (GACNN) that can be directly applied to the classification of unstructured 3D point clouds obtained by airborne LiDAR. Specifically, we first introduce a graph attention convolution module that incorporates global contextual information and local structural features. Based on the proposed graph attention convolution module, we further design an end-to-end encoder-decoder network, named GACNN, to capture multiscale features of the point clouds and therefore enable more accurate airborne point cloud classification. Experiments on the ISPRS 3D labeling dataset show that the proposed model achieves a new state-of-the-art performance in terms of average F1 score (71.5%) and satisfactory overall accuracy (83.2%). Additionally, experiments further conducted on the 2019 Data Fusion Contest Dataset by comparing with other prevalent point cloud deep learning models demonstrate the favorable generalization capability of the proposed model.
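To make the graph attention convolution concrete, here is a hedged, GAT-flavored sketch for point clouds: attention weights over each point's k nearest neighbors are computed from feature differences and coordinate offsets, loosely mirroring the local-plus-contextual design; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class GraphAttentionConv(nn.Module):
    """Sketch of a graph attention convolution: each point aggregates
    neighbor features with attention scores derived from both feature
    differences and local coordinate offsets."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.att = nn.Sequential(nn.Linear(out_dim + 3, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, 1))

    def forward(self, xyz, feats, knn_idx):
        # xyz: (N, 3), feats: (N, C), knn_idx: (N, K) neighbor indices
        h = self.proj(feats)                       # (N, D)
        nbr = h[knn_idx]                           # (N, K, D) neighbor features
        offset = xyz[knn_idx] - xyz[:, None, :]    # (N, K, 3) local geometry
        score = self.att(torch.cat([nbr - h[:, None, :], offset], dim=-1))
        alpha = torch.softmax(score, dim=1)        # (N, K, 1) attention weights
        return (alpha * nbr).sum(dim=1)            # attended aggregation

N, K = 1024, 16
xyz, feats = torch.randn(N, 3), torch.randn(N, 8)
knn_idx = torch.randint(0, N, (N, K))              # stand-in for true k-NN
print(GraphAttentionConv(8, 32)(xyz, feats, knn_idx).shape)  # (1024, 32)
```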
32. Colonoscope tracking method based on shape estimation network [PDF] 返回目录
Masahiro Oda, Holger R. Roth, Takayuki Kitasaka, Kazuhiro Furukawa, Ryoji Miyahara, Yoshiki Hirooka, Nassir Navab, Kensaku Mori
Abstract: This paper presents a colonoscope tracking method utilizing a colon shape estimation method. CT colonography is used as a less-invasive colon diagnosis method. If colonic polyps or early-stage cancers are found, they are removed in a colonoscopic examination. In a colonoscopic examination, understanding where the colonoscope is located in the colon is difficult. A colonoscope navigation system is necessary to reduce the risk of overlooking polyps. We propose a colonoscope tracking method for navigation systems. Previous colonoscope tracking methods caused large tracking errors because they did not consider deformations of the colon during colonoscope insertions. We utilize the shape estimation network (SEN), which estimates the deformed colon shape during colonoscope insertions. The SEN is a neural network containing a long short-term memory (LSTM) layer. To perform colon shape estimation suited to real clinical situations, we trained the SEN using data obtained during colonoscope operations performed by physicians. The proposed tracking method maps the colonoscope tip position to a position in the colon using the estimation results of the SEN. We evaluated the proposed method in a phantom study and confirmed that its tracking errors were small enough to perform navigation in the ascending, transverse, and descending colon.
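As a rough sketch of what an SEN-like model could look like under the description above (an LSTM-bearing network regressing the deformed colon shape from a sequence of colonoscope measurements), with a hypothetical 6-D input per time step and illustrative sizes:

```python
import torch
import torch.nn as nn

class ShapeEstimationNetwork(nn.Module):
    """Hedged sketch: an LSTM consumes a sequence of colonoscope sensor
    readings (here a hypothetical 6-D tip position/orientation per step)
    and regresses a set of 3D points describing the deformed colon shape."""
    def __init__(self, in_dim=6, hidden=128, n_points=50):
        super().__init__()
        self.n_points = n_points
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_points * 3)

    def forward(self, seq):                       # seq: (B, T, in_dim)
        out, _ = self.lstm(seq)                   # use the last time step
        return self.head(out[:, -1]).view(-1, self.n_points, 3)

sen = ShapeEstimationNetwork()
print(sen(torch.randn(2, 30, 6)).shape)           # torch.Size([2, 50, 3])
```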
33. DeepSDF x Sim(3): Extending DeepSDF for automatic 3D shape retrieval and similarity transform estimation [PDF] 返回目录
Oladapo Afolabi, Allen Yang, Shankar S. Sastry
Abstract: Recent advances in computer graphics and computer vision have allowed for the development of neural network based generative models for 3D shapes based on signed distance functions (SDFs). These models are useful for shape representation, retrieval and completion. However, this approach to shape retrieval and completion has been limited by the need for query shapes to be in the same canonical scale and pose as those observed during training, restricting its effectiveness on real-world scenes. In this work, we present a formulation that overcomes this issue by jointly estimating the shape and the similarity transformation parameters. We conduct experiments to demonstrate the effectiveness of this formulation on synthetic and real datasets and report favorable comparisons to strong baselines. Finally, we also emphasize the viability of this approach as a form of data compression useful in augmented reality scenarios.
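A hedged sketch of the joint estimation: given surface samples of an observed shape, optimize a latent code together with Sim(3) parameters (rotation, translation, log-scale) so that the points, mapped into the canonical frame, land on the zero level set of a DeepSDF-style decoder. The decoder below is a randomly initialized stand-in and the parametrization is simplified, not the paper's implementation.

```python
import torch
import torch.nn as nn

# frozen stand-in for a trained DeepSDF-style decoder f(z, x) -> SDF value
decoder = nn.Sequential(nn.Linear(8 + 3, 64), nn.ReLU(), nn.Linear(64, 1))
for p in decoder.parameters():
    p.requires_grad_(False)

def rotation(w):
    """Axis-angle-style vector -> rotation matrix via the matrix exponential."""
    W = torch.zeros(3, 3)
    W[0, 1], W[0, 2], W[1, 2] = -w[2], w[1], -w[0]
    return torch.linalg.matrix_exp(W - W.T)

z = torch.zeros(8, requires_grad=True)        # latent shape code
w = torch.zeros(3, requires_grad=True)        # rotation parameters
t = torch.zeros(3, requires_grad=True)        # translation
log_s = torch.zeros(1, requires_grad=True)    # log scale

pts = torch.randn(64, 3)                      # observed surface samples
opt = torch.optim.Adam([z, w, t, log_s], lr=1e-2)
for _ in range(50):
    # assume x_world = s * R * x_canonical + t; undo it on the observations
    canon = ((pts - t) @ rotation(w)) / log_s.exp()
    sdf = decoder(torch.cat([z.expand(len(pts), 8), canon], dim=1))
    loss = sdf.abs().mean() + 1e-3 * z.norm()  # surface points should have SDF = 0
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```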
34. Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [PDF] 返回目录
Damien Teney, Ehsan Abbasnedjad, Anton van den Hengel
Abstract: One of the primary challenges limiting the applicability of deep learning is its susceptibility to learning spurious correlations rather than the underlying mechanisms of the task of interest. The resulting failure to generalise cannot be addressed by simply using more data from the same distribution. We propose an auxiliary training objective that improves the generalization capabilities of neural networks by leveraging an overlooked supervisory signal found in existing datasets. We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task. We show that such pairs can be identified in a number of existing datasets in computer vision (visual question answering, multi-label image classification) and natural language processing (sentiment analysis, natural language inference). The new training objective orients the gradient of a model's decision function with pairs of counterfactual examples. Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
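One plausible reading of "orienting the gradient with counterfactual pairs" is an auxiliary cosine-alignment term between the input-gradient and the example-to-counterfactual difference vector. The sketch below implements that reading; it is an assumption, not the paper's exact objective, and the weight lam is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_supervised_loss(model, x, x_cf, y, y_cf, lam=0.1):
    """Task loss on both examples plus an auxiliary term pushing the
    input-gradient to align with the counterfactual direction (x_cf - x)."""
    x = x.clone().requires_grad_(True)
    task = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_cf), y_cf)
    g = torch.autograd.grad(task, x, create_graph=True)[0]
    direction = (x_cf - x).detach()
    aux = (1 - F.cosine_similarity(g.flatten(1), direction.flatten(1))).mean()
    return task + lam * aux

model = nn.Sequential(nn.Flatten(), nn.Linear(32, 2))
x, x_cf = torch.randn(4, 32), torch.randn(4, 32)
y, y_cf = torch.zeros(4, dtype=torch.long), torch.ones(4, dtype=torch.long)
print(gradient_supervised_loss(model, x, x_cf, y, y_cf))
```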
35. OSLNet: Deep Small-Sample Classification with an Orthogonal Softmax Layer [PDF] 返回目录
Xiaoxu Li, Dongliang Chang, Zhanyu Ma, Zheng-Hua Tan, Jing-Hao Xue, Jie Cao, Jingyi Yu, Jun Guo
Abstract: A deep neural network of multiple nonlinear layers forms a large function space, which can easily lead to overfitting when it encounters small-sample data. To mitigate overfitting in small-sample classification, learning more discriminative features from small-sample data is becoming a new trend. To this end, this paper aims to find a subspace of neural networks that can facilitate a large decision margin. Specifically, we propose the Orthogonal Softmax Layer (OSL), which makes the weight vectors in the classification layer remain orthogonal during both the training and test processes. The Rademacher complexity of a network using the OSL is only $\frac{1}{K}$ of that of a network using the fully connected classification layer, where $K$ is the number of classes, leading to a tighter generalization error bound. Experimental results demonstrate that the proposed OSL has better performance than the methods used for comparison on four small-sample benchmark datasets, as well as applicability to large-sample datasets. Codes are available at: this https URL.
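A hedged sketch of the OSL idea: a bias-free classification layer whose class weight vectors stay orthogonal through training and testing. Here orthogonality is enforced with PyTorch's orthogonal parametrization rather than the paper's own construction, which is an implementation assumption.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class OrthogonalSoftmaxLayer(nn.Module):
    """Linear classification layer whose class weight vectors are kept
    orthonormal by a parametrization (a stand-in for the paper's scheme)."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        assert n_classes <= feat_dim, "need <= feat_dim orthogonal vectors"
        self.fc = orthogonal(nn.Linear(feat_dim, n_classes, bias=False))

    def forward(self, x):
        return self.fc(x)   # logits; rows of fc.weight remain orthonormal

layer = OrthogonalSoftmaxLayer(64, 10)
W = layer.fc.weight
print(torch.allclose(W @ W.T, torch.eye(10), atol=1e-5))   # True
```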
36. Traffic Lane Detection using FCN [PDF] 返回目录
Shengchang Zhang, Ahmed EI Koubia, Khaled Abdul Karim Mohammed
Abstract: Automatic lane detection is a crucial technology that enables self-driving cars to properly position themselves in multi-lane urban driving environments. However, detecting diverse road markings in various weather conditions is a challenging task for conventional image processing or computer vision techniques. In recent years, the application of Deep Learning and Neural Networks in this area has proven to be very effective. In this project, we designed an encoder-decoder Fully Convolutional Network for lane detection. This model was applied to a real-world large-scale dataset and achieved a level of accuracy that outperformed our baseline model.
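For readers unfamiliar with the encoder-decoder FCN pattern: strided convolutions shrink the image to a compact representation, and transposed convolutions expand it back to a per-pixel lane logit map. The toy PyTorch model below illustrates only this shape of architecture; the depth and channel counts are assumptions, not the authors' network.

```python
import torch.nn as nn

class TinyLaneFCN(nn.Module):
    """Minimal encoder-decoder FCN for binary lane segmentation
    (illustrative sketch; not the authors' exact architecture)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # H/2
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # H/4
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),  # lane logits
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```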
37. Machine Learning based Pallets Detection and Tracking in AGVs [PDF] 返回目录
Shengchang Zhang, Jie Xiang, Weijian Han
Abstract: The use of automated guided vehicles (AGVs) has played a pivotal role in manufacturing and distribution operations, providing reliable and efficient product handling. In this project, we constructed a deep learning-based detection and tracking architecture for pallet detection and position tracking. By using data preprocessing and augmentation techniques and experimenting with hyperparameter tuning, we achieved a 25% reduction in error rate, a 28.5% reduction in false negative rate, and a 20% reduction in training time.
38. ResNeSt: Split-Attention Networks [PDF] 返回目录
Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, Zhi Zhang, Haibin Lin, Yue Sun, Tong He, Jonas Mueller, R. Manmatha, Mu Li, Alexander Smola
Abstract: While image classification models have recently continued to advance, most downstream applications such as object detection and semantic segmentation still employ ResNet variants as the backbone network due to their simple and modular structure. We present a simple and modular Split-Attention block that enables attention across feature-map groups. By stacking these Split-Attention blocks ResNet-style, we obtain a new ResNet variant which we call ResNeSt. Our network preserves the overall ResNet structure so it can be used in downstream tasks straightforwardly without introducing additional computational costs. ResNeSt models outperform other networks with similar model complexities. For example, ResNeSt-50 achieves 81.13% top-1 accuracy on ImageNet using a single crop-size of 224x224, outperforming the previous best ResNet variant by more than 1% in accuracy. This improvement also helps downstream tasks including object detection, instance segmentation and semantic segmentation. For example, by simply replacing the ResNet-50 backbone with ResNeSt-50, we improve the mAP of Faster-RCNN on MS-COCO from 39.3% to 42.3% and the mIoU of DeeplabV3 on ADE20K from 42.1% to 45.1%.
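The essence of the Split-Attention block is a softmax across parallel feature splits, with the attention weights computed from globally pooled context. The sketch below keeps only the radix dimension for clarity; the full block also uses cardinal groups, 1x1 projections and batch normalization, so treat the layer shapes here as assumptions.

```python
import torch
import torch.nn as nn

class SplitAttention(nn.Module):
    """Simplified radix-only Split-Attention (sketch): r parallel branches
    are fused by a per-channel softmax over the branches, computed from
    globally pooled context."""
    def __init__(self, channels, radix=2):
        super().__init__()
        self.radix = radix
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(radix))
        self.fc = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, channels * radix))

    def forward(self, x):
        feats = torch.stack([conv(x) for conv in self.convs], dim=1)  # B,r,C,H,W
        gap = feats.sum(dim=1).mean(dim=(2, 3))                       # B,C context
        attn = self.fc(gap).view(x.size(0), self.radix, -1)           # B,r,C
        attn = torch.softmax(attn, dim=1)                             # over splits
        return (feats * attn.unsqueeze(-1).unsqueeze(-1)).sum(dim=1)  # B,C,H,W
```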
39. Desmoking laparoscopy surgery images using an image-to-image translation guided by an embedded dark channel [PDF] 返回目录
Sebastián Salazar-Colores, Hugo Alberto-Moreno, César Javier Ortiz-Echeverri, Gerardo Flores
Abstract: In laparoscopic surgery, the visibility in the image can be severely degraded by the smoke caused by the $CO_2$ injection and dissection tools, thus reducing the visibility of organs and tissues. This lack of visibility increases the surgery time and even the probability of mistakes made by the surgeon, producing negative consequences for the patient's health. In this paper, a novel computational approach to remove the smoke effects is introduced. The proposed method is based on an image-to-image conditional generative adversarial network in which a dark channel is used as an embedded guide mask. The obtained experimental results are evaluated and compared quantitatively with other desmoking and dehazing state-of-the-art methods using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) index metrics. Based on these metrics, the proposed method is found to improve on the state-of-the-art. Moreover, the processing speed of our method is 92 frames per second, and thus it can be applied in a real-time medical system through an embedded device.
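The embedded guide builds on the classical dark-channel computation: a per-pixel minimum over the color channels followed by a local minimum filter. A minimal NumPy/SciPy version follows; the 15-pixel patch size is a conventional choice rather than necessarily the paper's, and in the proposed method this signal guides the GAN instead of being used standalone.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch=15):
    """Dark channel of an RGB image in [0, 1]: per-pixel minimum over the
    three channels, then a local minimum filter over a patch (the standard
    dark-channel-prior computation)."""
    per_pixel_min = image.min(axis=2)          # (H, W)
    return minimum_filter(per_pixel_min, size=patch)
```

Smoke-free regions have dark channels close to zero, so the map acts as a soft mask indicating where smoke is present.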
40. Exploring Racial Bias within Face Recognition via per-subject Adversarially-Enabled Data Augmentation [PDF] 返回目录
Seyma Yucer, Samet Akçay, Noura Al-Moubayed, Toby P. Breckon
Abstract: Whilst face recognition applications are becoming increasingly prevalent within our daily lives, leading approaches in the field still suffer from performance bias to the detriment of some racial profiles within society. In this study, we propose a novel adversarially-derived data augmentation methodology that aims to enable dataset balance at a per-subject level via the use of image-to-image transformation for the transfer of sensitive racial characteristic facial features. Our aim is to automatically construct a synthesised dataset by transforming facial images across varying racial domains, while still preserving identity-related features, such that racially dependent features subsequently become irrelevant within the determination of subject identity. We construct our experiments on three significant face recognition variants: Softmax, CosFace and ArcFace loss over a common convolutional neural network backbone. In a side-by-side comparison, we show the positive impact our proposed technique can have on the recognition performance for (racial) minority groups within an originally imbalanced training dataset by reducing the pre-race variance in performance.
41. 3D rectification with Visual Sphere perspective: an algebraic alternative for P4P pose estimation [PDF] 返回目录
Jakub Maksymilian Fober
Abstract: The presented algorithm solves the co-planar P4P problem of parallel lines viewed in perspective with an algebraic equation. The introduction of a visual sphere model extends this algorithm to exotic non-linear projections, where the view angle can span to 180° and beyond, a hard limit of the rectilinear perspective used in planar homography. The algorithm can perform full 3D reconstruction of a visible rectangle, including pose estimation and camera orientation. It requires some camera-lens information, such as the angle of view (for rectilinear projection) or a perspective map.
42. MER-GCN: Micro Expression Recognition Based on Relation Modeling with Graph Convolutional Network [PDF] 返回目录
Ling Lo, Hong-Xia Xie, Hong-Han Shuai, Wen-Huang Cheng
Abstract: A Micro-Expression (ME) is a spontaneous, involuntary facial movement that can reveal a true feeling. Recently, an increasing number of studies have paid attention to this field, combining it with deep learning techniques. Action units (AUs) are the fundamental actions reflecting facial muscle movements, and AU detection has been adopted by many studies to classify facial expressions. However, the time-consuming annotation process makes it difficult to correlate combinations of AUs with specific emotion classes. Inspired by the node-relation modelling of Graph Convolutional Networks (GCN), we propose an end-to-end AU-oriented graph classification network, namely MER-GCN, which uses 3D ConvNets to extract AU features and applies GCN layers to discover the dependencies between AU nodes for ME categorization. To our best knowledge, this work is the first end-to-end architecture for Micro-Expression Recognition (MER) using an AU-based GCN. The experimental results show that our approach outperforms CNN-based MER networks.
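The relation-modelling step corresponds to standard GCN propagation over AU nodes, $H' = \mathrm{ReLU}(\hat{A} H W)$ with $\hat{A}$ a normalized AU adjacency. Below is a generic PyTorch layer illustrating this step; the adjacency construction and feature dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Standard graph-convolution propagation H' = ReLU(A_hat @ H @ W),
    shown here to illustrate relation modelling over AU nodes."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # the weight matrix W

    def forward(self, h, a_hat):
        # h: (num_AU_nodes, in_dim); a_hat: normalized (num_AU_nodes, num_AU_nodes)
        return torch.relu(a_hat @ self.linear(h))
```

Stacking a few such layers lets each AU node's embedding absorb information from related AUs before the final expression classifier.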
43. A Biologically Interpretable Two-stage Deep Neural Network (BIT-DNN) For Hyperspectral Imagery Classification [PDF] 返回目录
Yue Shi, Liangxiu Han, Wenjiang Huang, Sheng Chang, Yingying Dong, Darren Dancey, Lianghao Han
Abstract: Spectral-spatial based deep learning models have recently proven to be effective in hyperspectral image (HSI) classification for various earth monitoring applications such as land cover classification and agricultural monitoring. However, due to the "black-box" nature of the model representation, how to explain and interpret the learning process and the model decision remains an open problem. This study proposes an interpretable deep learning model -- a biologically interpretable two-stage deep neural network (BIT-DNN) -- that integrates biochemical and biophysical associated information into the proposed framework, capable of achieving both high accuracy and interpretability on HSI-based classification tasks. The proposed model introduces a two-stage feature learning process. In the first stage, an enhanced interpretable feature block extracts low-level spectral features associated with the biophysical and biochemical attributes of the target entities; in the second stage, an interpretable capsule block extracts and encapsulates the high-level joint spectral-spatial features into featured tensors representing the hierarchical structure of the biophysical and biochemical attributes of the target ground entities, which provides the model with improved classification performance and intrinsic interpretability. We have tested and evaluated the model using two real HSI datasets for crop type recognition and crop disease recognition tasks and compared it with six state-of-the-art machine learning models. The results demonstrate that the proposed model has competitive advantages in terms of both classification accuracy and model interpretability.
44. Uncertainty-Aware Consistency Regularization for Cross-Domain Semantic Segmentation [PDF] 返回目录
Qianyu Zhou, Zhengyang Feng, Guangliang Cheng, Xin Tan, Jianping Shi, Lizhuang Ma
Abstract: Unsupervised domain adaptation (UDA) aims to adapt existing models of the source domain to a new target domain with only unlabeled data. The main challenge for UDA lies in how to reduce the domain gap between the source domain and the target domain. Existing approaches to cross-domain semantic segmentation usually employ a consistency regularization on the target predictions of the student model and teacher model under different perturbations. However, previous works do not consider the reliability of the predicted target samples, which could harm the learning process by generating unreasonable guidance for the student model. In this paper, we propose an uncertainty-aware consistency regularization method to tackle this issue for semantic segmentation. By exploiting the latent uncertainty information of the target samples, more meaningful and reliable knowledge from the teacher model is transferred to the student model. The experimental evaluation shows that the proposed method outperforms the state-of-the-art methods by around $3\% \sim 5\%$ on two domain adaptation benchmarks, i.e. GTAV $\rightarrow$ Cityscapes and SYNTHIA $\rightarrow$ Cityscapes.
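One plausible instantiation of an uncertainty-aware consistency term is to weight the per-pixel student-teacher divergence by the teacher's confidence, so unreliable pixels contribute little guidance. The PyTorch sketch below uses predictive entropy as the uncertainty measure; the paper's exact weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_consistency(student_logits, teacher_logits):
    """Sketch: per-pixel KL between student and teacher predictions,
    down-weighted where the teacher's predictive entropy is high.
    Inputs are (B, C, H, W) segmentation logits."""
    p_t = F.softmax(teacher_logits.detach(), dim=1)
    log_p_t = p_t.clamp_min(1e-8).log()
    log_p_s = F.log_softmax(student_logits, dim=1)
    kl = (p_t * (log_p_t - log_p_s)).sum(dim=1)          # (B, H, W)
    entropy = -(p_t * log_p_t).sum(dim=1)                # teacher uncertainty
    weight = torch.exp(-entropy)                         # confident pixels -> ~1
    return (weight * kl).mean()
```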
45. Graph-Structured Referring Expression Reasoning in The Wild [PDF] 返回目录
Sibei Yang, Guanbin Li, Yizhou Yu
Abstract: Grounding referring expressions aims to locate in an image an object referred to by a natural language expression. The linguistic structure of a referring expression provides a layout of reasoning over the visual contents, and it is often crucial to align and jointly understand the image and the referring expression. In this paper, we propose a scene graph guided modular network (SGMN), which performs reasoning over a semantic graph and a scene graph with neural modules under the guidance of the linguistic structure of the expression. In particular, we model the image as a structured semantic graph, and parse the expression into a language scene graph. The language scene graph not only decodes the linguistic structure of the expression, but also has a consistent representation with the image semantic graph. In addition to exploring structured solutions to grounding referring expressions, we also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning. We automatically generate referring expressions over the scene graphs of images using diverse expression templates and functional programs. This dataset is equipped with real-world visual contents as well as semantically rich expressions with different reasoning layouts. Experimental results show that our SGMN not only significantly outperforms existing state-of-the-art algorithms on the new Ref-Reasoning dataset, but also surpasses state-of-the-art structured methods on commonly used benchmark datasets. It can also provide interpretable visual evidence of reasoning. Data and code are available at this https URL.
46. When Residual Learning Meets Dense Aggregation: Rethinking the Aggregation of Deep Neural Networks [PDF] 返回目录
Zhiyu Zhu, Zhen-Peng Bian, Junhui Hou, Yi Wang, Lap-Pui Chau
Abstract: Various architectures (such as GoogLeNets, ResNets, and DenseNets) have been proposed. However, the existing networks usually suffer from either redundancy of convolutional layers or insufficient utilization of parameters. To handle these challenging issues, we propose Micro-Dense Nets, a novel architecture with global residual learning and local micro-dense aggregations. Specifically, residual learning aims to efficiently retrieve features from different convolutional blocks, while the micro-dense aggregation is able to enhance each block and avoid redundancy of convolutional layers by lessening residual aggregations. Moreover, the proposed micro-dense architecture has two characteristics: pyramidal multi-level feature learning, which can progressively widen the deeper layers in a block, and dimension-cardinality adaptive convolution, which can balance each layer using a linearly increasing dimension cardinality. The experimental results over three datasets (i.e., CIFAR-10, CIFAR-100, and ImageNet-1K) demonstrate that the proposed Micro-Dense Net with only 4M parameters can achieve higher classification accuracy than state-of-the-art networks, while being 12.1$\times$ smaller in terms of the number of parameters. In addition, our micro-dense block can be integrated with neural architecture search based models to boost their performance, validating the advantage of our architecture. We believe our design and findings will be beneficial to the DNN community.
47. AD-Cluster: Augmented Discriminative Clustering for Domain Adaptive Person Re-identification [PDF] 返回目录
Yunpeng Zhai, Shijian Lu, Qixiang Ye, Xuebo Shan, Jie Chen, Rongrong Ji, Yonghong Tian
Abstract: Domain adaptive person re-identification (re-ID) is a challenging task, especially when person identities in target domains are unknown. Existing methods attempt to address this challenge by transferring image styles or aligning feature distributions across domains, whereas the rich unlabeled samples in target domains are not sufficiently exploited. This paper presents a novel augmented discriminative clustering (AD-Cluster) technique that estimates and augments person clusters in target domains and enforces the discrimination ability of re-ID models with the augmented clusters. AD-Cluster is trained by iterative density-based clustering, adaptive sample augmentation, and discriminative feature learning. It learns an image generator and a feature encoder which aim to maximize the intra-cluster diversity in the sample space and minimize the intra-cluster distance in the feature space in an adversarial min-max manner. Finally, AD-Cluster increases the diversity of sample clusters and greatly improves the discrimination capability of re-ID models. Extensive experiments over Market-1501 and DukeMTMC-reID show that AD-Cluster outperforms the state-of-the-art by large margins.
48. TriGAN: Image-to-Image Translation for Multi-Source Domain Adaptation [PDF] 返回目录
Subhankar Roy, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe, Elisa Ricci
Abstract: Most domain adaptation methods consider the problem of transferring knowledge to the target domain from a single source dataset. However, in practical applications, we typically have access to multiple sources. In this paper we propose the first approach for Multi-Source Domain Adaptation (MSDA) based on Generative Adversarial Networks. Our method is inspired by the observation that the appearance of a given image depends on three factors: the domain, the style (characterized in terms of low-level feature variations) and the content. For this reason we propose to project the image features onto a space where only the dependence on the content is kept, and then re-project this invariant representation onto the pixel space using the target domain and style. In this way, new labeled images can be generated which are used to train a final target classifier. We test our approach using common MSDA benchmarks, showing that it outperforms state-of-the-art methods.
49. Lightweight Mask R-CNN for Long-Range Wireless Power Transfer Systems [PDF] 返回目录
Hao Li, Aozhou Wu, Wen Fang, Qingqing Zhang, Mingqing Liu, Qingwen Liu, Wei Chen
Abstract: Resonant Beam Charging (RBC) is a wireless charging technology which supports multi-watt power transfer over meter-level distances. Its safety, mobility and simultaneous charging capability enable RBC to charge multiple mobile devices safely at the same time. To detect the devices that need to be charged, a Mask R-CNN based detection model was proposed in previous work. However, considering the constraints of the RBC system, it is not easy to apply Mask R-CNN on lightweight hardware-embedded devices because of its heavy model and huge computation. Thus, we propose a machine learning detection approach which provides a lighter and faster model based on the traditional Mask R-CNN. The proposed approach makes the object detector much easier to transplant to mobile devices and reduces the burden of hardware computation. By adjusting the structure of the backbone and the head part of Mask R-CNN, we reduce the average detection time from $1.02\mbox{s}$ per image to $0.6132\mbox{s}$, and reduce the model size from $245\mbox{MB}$ to $47.1\mbox{MB}$. The improved model is much more suitable for application in the RBC system.
50. Tensor completion using enhanced multiple modes low-rank prior and total variation [PDF] 返回目录
Haijin Zeng, Xiaozhen Xie, Jifeng Ning
Abstract: In this paper, we propose a novel model to recover a low-rank tensor by simultaneously performing double nuclear norm regularized low-rank matrix factorizations on the all-mode matricizations of the underlying tensor. A block successive upper-bound minimization algorithm is applied to solve the model. Subsequence convergence of our algorithm can be established, and our algorithm converges to the coordinate-wise minimizers under some mild conditions. Several experiments on three types of public data sets show that our algorithm can recover a variety of low-rank tensors from significantly fewer samples than the other tensor completion methods tested.
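The all-mode matricizations are the standard mode-$n$ unfoldings of the tensor; the model regularizes the nuclear norm of each unfolding through factorized surrogates. A minimal NumPy illustration follows (the direct SVD-based nuclear norms shown here are for reference only, since the paper avoids them via factorization):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization (one standard convention): the fibers along
    `mode` become the rows of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Nuclear norm of every mode unfolding -- the quantity being regularized.
t = np.random.rand(4, 5, 6)
for mode in range(t.ndim):
    print(mode, np.linalg.norm(unfold(t, mode), ord='nuc'))
```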
51. Learning to Evaluate Perception Models Using Planner-Centric Metrics [PDF] 返回目录
Jonah Philion, Amlan Kar, Sanja Fidler
Abstract: Variants of accuracy and precision are the gold standard by which the computer vision community measures the progress of perception algorithms. One reason for the ubiquity of these metrics is that they are largely task-agnostic; we in general seek to detect zero false negatives or positives. The downside of these metrics is that, at worst, they penalize all incorrect detections equally without conditioning on the task or scene, and at best, heuristics need to be chosen to ensure that different mistakes count differently. In this paper, we propose a principled metric for 3D object detection specifically for the task of self-driving. The core idea behind our metric is to isolate the task of object detection and measure the impact the produced detections would induce on the downstream task of driving. Without hand-designing it to, we find that our metric penalizes many of the mistakes that other metrics penalize by design. In addition, our metric downweighs detections based on additional factors such as distance from a detection to the ego car and the speed of the detection in intuitive ways that other detection metrics do not. For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time. Our project page including an evaluation server can be found at this https URL.
52. Are we pretraining it right? Digging deeper into visio-linguistic pretraining [PDF] 返回目录
Amanpreet Singh, Vedanuj Goswami, Devi Parikh
Abstract: Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in the literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data of a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work "out of the box" yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-the-art results on downstream tasks without any architectural changes.
53. An end-to-end CNN framework for polarimetric vision tasks based on polarization-parameter-constructing network [PDF] 返回目录
Yong Wang, Qi Liu, Hongyu Zu, Xiao Liu, Ruichao Xie, Feng Wang
Abstract: Pixel-wise operations between polarimetric images are important for processing polarization information. For lack of such operations, polarization information cannot be fully utilized in convolutional neural networks (CNNs). In this paper, a novel end-to-end CNN framework for polarization vision tasks is proposed, which enables networks to take full advantage of polarimetric images. The framework consists of two sub-networks: a polarization-parameter-constructing network (PPCN) and a task network. The PPCN implements pixel-wise operations between images in CNN form with 1x1 convolution kernels. It takes raw polarimetric images as input and outputs polarization-parametric images to the task network so as to complete a vision task. By training them together, the PPCN can learn to provide the most suitable polarization-parametric images for the task network and the dataset. Taking Faster R-CNN as the task network, the experimental results show that, compared with existing methods, the proposed framework achieves much higher mean average precision (mAP) in the object detection task.
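Classical polarization parameters are linear in the per-angle intensities, which is exactly what a 1x1 convolution computes pixel-wise. The sketch below hard-codes the Stokes parameters from 0°/45°/90°/135° polarizer channels as one fixed 1x1 kernel; the channel ordering is an assumption, and the PPCN learns such kernels rather than fixing them.

```python
import torch
import torch.nn as nn

# Stokes parameters as a fixed 1x1 convolution (sketch). Input channels are
# intensities behind polarizers at 0, 45, 90, 135 degrees, in that order.
stokes = nn.Conv2d(in_channels=4, out_channels=3, kernel_size=1, bias=False)
with torch.no_grad():
    stokes.weight.copy_(torch.tensor(
        [[0.5, 0.5, 0.5, 0.5],    # S0: total intensity
         [1.0, 0.0, -1.0, 0.0],   # S1 = I0 - I90
         [0.0, 1.0, 0.0, -1.0]]   # S2 = I45 - I135
    ).view(3, 4, 1, 1))

x = torch.rand(1, 4, 64, 64)      # a dummy 4-angle polarimetric image
s0, s1, s2 = stokes(x).unbind(dim=1)
dolp = torch.sqrt(s1**2 + s2**2) / s0.clamp_min(1e-6)  # degree of linear polarization
```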
54. Adaptive Attention Span in Computer Vision [PDF] 返回目录
Jerrod Parker, Shakti Kumar, Joe Roussy
Abstract: Recent developments in Transformers for language modeling have opened new areas of research in computer vision. Results from late 2019 showed vast performance increases in both object detection and recognition when convolutions are replaced by local self-attention kernels. Models using local self-attention kernels were also shown to have fewer parameters and FLOPs compared to equivalent architectures that only use convolutions. In this work we propose a novel method for learning the local self-attention kernel size. We then compare its performance to fixed-size local attention and convolution kernels. The code for all our experiments and models is available at this https URL.
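The learnable-span idea can be sketched with the soft masking mechanism from adaptive attention span in language models (Sukhbaatar et al., 2019), which this work adapts to vision. The sketch below is illustrative only; the ramp length, initialization, and the way the mask enters the attention computation are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    # A learnable span z softly masks attention weights by query-key distance,
    # so the effective local kernel size is trained by gradient descent.
    class SoftSpanMask(nn.Module):
        def __init__(self, init_span=8.0, ramp=4.0):
            super().__init__()
            self.z = nn.Parameter(torch.tensor(init_span))  # learnable span
            self.ramp = ramp

        def forward(self, attn_logits, dist):
            # dist[i, j] = |i - j|; weights fade linearly to zero beyond z.
            mask = torch.clamp((self.ramp + self.z - dist) / self.ramp, 0.0, 1.0)
            w = torch.exp(attn_logits) * mask
            return w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    n = 16
    dist = (torch.arange(n)[:, None] - torch.arange(n)[None, :]).abs().float()
    attn = SoftSpanMask()(torch.randn(n, n), dist)  # rows sum to 1 within the span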
55. Attention, please: A Spatio-temporal Transformer for 3D Human Motion Prediction [PDF] 返回目录
Emre Aksan, Peng Cao, Manuel Kaufmann, Otmar Hilliges
Abstract: In this paper, we propose a novel architecture for the task of 3D human motion modelling. We argue that the problem can be interpreted as a generative modelling task: A network learns the conditional synthesis of human poses where the model is conditioned on a seed sequence. Our focus lies on the generation of plausible future developments over longer time horizons, whereas previous work considered shorter time frames of up to 1 second. To mitigate the issue of convergence to a static pose, we propose a novel architecture that leverages the recently proposed self-attention concept. The task of 3D motion prediction is inherently spatio-temporal and thus the proposed model learns high dimensional joint embeddings followed by a decoupled temporal and spatial self-attention mechanism. The two attention blocks operate in parallel to aggregate the most informative components of the sequence to update the joint representation. This allows the model to access past information directly and to capture spatio-temporal dependencies explicitly. We show empirically that this reduces error accumulation over time and allows for the generation of perceptually plausible motion sequences over long time horizons as well as accurate short-term predictions. An accompanying video is available at this https URL.
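The decoupled temporal and spatial self-attention can be pictured as two attention blocks sharing one joint-embedding tensor: one attends over frames per joint, the other over joints per frame, and their outputs are aggregated. The PyTorch sketch below uses assumed shapes and a plain residual sum; it illustrates the decoupling, not the paper's full architecture.

    import torch
    import torch.nn as nn

    class DecoupledAttention(nn.Module):
        def __init__(self, dim=64, heads=4):
            super().__init__()
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):                                # x: (B, T, J, D)
            B, T, J, D = x.shape
            xt = x.permute(0, 2, 1, 3).reshape(B * J, T, D)  # attend over time
            t_out, _ = self.temporal(xt, xt, xt)
            xs = x.reshape(B * T, J, D)                      # attend over joints
            s_out, _ = self.spatial(xs, xs, xs)
            t_out = t_out.reshape(B, J, T, D).permute(0, 2, 1, 3)
            s_out = s_out.reshape(B, T, J, D)
            return x + t_out + s_out  # parallel blocks update the joint representation

    x = torch.rand(2, 10, 21, 64)    # batch, frames, joints, feature dim
    print(DecoupledAttention()(x).shape)  # torch.Size([2, 10, 21, 64])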
56. A Large Dataset of Historical Japanese Documents with Complex Layouts [PDF] 返回目录
Zejiang Shen, Kaixuan Zhang, Melissa Dell
Abstract: Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exist for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts. A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models. And we demonstrate the usefulness of the dataset on real-world document digitization tasks. The dataset is available at this https URL.
57. Dual Embedding Expansion for Vehicle Re-identification [PDF] 返回目录
Clint Sebastian, Raffaele Imbriaco, Egor Bondarev, Peter H.N. de With
Abstract: Vehicle re-identification plays a crucial role in the management of transportation infrastructure and traffic flow. However, this is a challenging task due to the large view-point variations in appearance, environmental and instance-related factors. Modern systems deploy CNNs to produce unique representations from the images of each vehicle instance. Most work focuses on leveraging new losses and network architectures to improve the descriptiveness of these representations. In contrast, our work concentrates on re-ranking and embedding expansion techniques. We propose an efficient approach for combining the outputs of multiple models at various scales while exploiting tracklet and neighbor information, called dual embedding expansion (DEx). Additionally, a comparative study of several common image retrieval techniques is presented in the context of vehicle re-ID. Our system yields competitive performance in the 2020 NVIDIA AI City Challenge with promising results. We demonstrate that DEx when combined with other re-ranking techniques, can produce an even larger gain without any additional attribute labels or manual supervision.
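Embedding expansion in general replaces each descriptor with an aggregate over its nearest neighbors so that retrieval becomes more robust. The numpy sketch below shows generic alpha-QE-style neighbor averaging under assumed shapes; the paper's DEx additionally exploits tracklet information and multiple model scales, which are omitted here.

    import numpy as np

    def expand_embeddings(emb, k=5):
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = emb @ emb.T                          # cosine similarities
        idx = np.argsort(-sims, axis=1)[:, :k + 1]  # self + k nearest neighbors
        expanded = emb[idx].mean(axis=1)            # aggregate neighbor information
        return expanded / np.linalg.norm(expanded, axis=1, keepdims=True)

    e = np.random.rand(10, 32)
    print(expand_embeddings(e).shape)  # (10, 32)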
58. Occluded Prohibited Items Detection: An X-ray Security Inspection Benchmark and De-occlusion Attention Module [PDF] 返回目录
Yanlu Wei, Renshuai Tao, Zhangjie Wu, Yuqing Ma, Libo Zhang, Xianglong Liu
Abstract: Object detection has taken advantage of the advances in deep convolutional networks to bring about significant progress in recent years. Though promising results have been achieved in a multitude of situations, like pedestrian detection, autonomous driving, etc., the detection of prohibited items in X-ray images for security inspection has received less attention. Meanwhile, security inspection often deals with baggage or suitcases where objects are randomly stacked and heavily overlapped with each other, resulting in unsatisfactory performance in detecting prohibited items in X-ray images due to the variety in scale, viewpoint, and style in these images. In this work, first, we propose an attention mechanism named the De-occlusion Attention Module (DOAM) to deal with the problem of detecting prohibited items that are partially occluded in X-ray images. DOAM hybridizes two attention sub-modules, EAM and RAM, which focus on different information respectively. Second, we present a well-directed dataset whose images are annotated manually by professional inspectors from Beijing Capital International Airport. Our dataset, named Occluded Prohibited Items X-ray (OPIXray), which focuses on the detection of occluded prohibited items in X-ray images, consists of 8885 X-ray images of 5 categories of cutters, considering the fact that cutters are the most common prohibited items. We evaluate our method on the OPIXray dataset and compare it to several baselines, including popular methods for detection and attention mechanisms. Compared to these baselines, our method has a better ability to detect the objects of interest. We also verify the ability of our method to deal with occlusion by dividing the testing set into three subsets according to different occlusion levels, and the results show that our method achieves better performance at higher levels of occlusion.
59. A Deep Learning Approach to Object Affordance Segmentation [PDF] 返回目录
Spyridon Thermos, Petros Daras, Gerasimos Potamianos
Abstract: Learning to understand and infer object functionalities is an important step towards robust visual intelligence. Significant research efforts have recently focused on segmenting the object parts that enable specific types of human-object interaction, the so-called "object affordances". However, most works treat it as a static semantic segmentation problem, focusing solely on object appearance and relying on strong supervision and object detection. In this paper, we propose a novel approach that exploits the spatio-temporal nature of human-object interaction for affordance segmentation. In particular, we design an autoencoder that is trained using ground-truth labels of only the last frame of the sequence, and is able to infer pixel-wise affordance labels in both videos and static images. Our model surpasses the need for object labels and bounding boxes by using a soft-attention mechanism that enables the implicit localization of the interaction hotspot. For evaluation purposes, we introduce the SOR3D-AFF corpus, which consists of human-object interaction sequences and supports 9 types of affordances in terms of pixel-wise annotation, covering typical manipulations of tool-like objects. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF, while being able to predict affordances for similar unseen objects in two affordance image-only datasets.
60. Motion Segmentation using Frequency Domain Transformer Networks [PDF] 返回目录
Hafez Farazi, Sven Behnke
Abstract: Self-supervised prediction is a powerful mechanism to learn representations that capture the underlying structure of the data. Despite recent progress, the self-supervised video prediction task is still challenging. One of the critical factors that make the task hard is motion segmentation, which is segmenting individual objects and the background and estimating their motion separately. In video prediction, the shape, appearance, and transformation of each object should be understood only by predicting the next frame in pixel space. To address this task, we propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately while simultaneously estimating and predicting the foreground motion using Frequency Domain Transformer Networks. Experimental evaluations show that this yields interpretable representations and that our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.
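A background fact that frequency-domain motion models rely on is the Fourier shift theorem: a translation in image space is a phase shift in the frequency domain, so motion can be represented and predicted by manipulating phases. The snippet below demonstrates the theorem itself; it is not the paper's network.

    import numpy as np

    def fourier_shift(img, dy, dx):
        # Multiplying the spectrum by a linear phase shifts the image by (dy, dx).
        H, W = img.shape
        fy = np.fft.fftfreq(H)[:, None]
        fx = np.fft.fftfreq(W)[None, :]
        phase = np.exp(-2j * np.pi * (fy * dy + fx * dx))
        return np.real(np.fft.ifft2(np.fft.fft2(img) * phase))

    img = np.zeros((8, 8)); img[2, 2] = 1.0
    shifted = fourier_shift(img, 1, 3)
    print(np.unravel_index(shifted.argmax(), shifted.shape))  # (3, 5)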
61. Single-step Adversarial training with Dropout Scheduling [PDF] 返回目录
Vivek B.S., R. Venkatesh Babu
Abstract: Deep learning models have shown impressive performance across a spectrum of computer vision applications including medical diagnosis and autonomous driving. One of the major concerns that these models face is their susceptibility to adversarial attacks. Realizing the importance of this issue, more researchers are working towards developing robust models that are less affected by adversarial attacks. Adversarial training method shows promising results in this direction. In adversarial training regime, models are trained with mini-batches augmented with adversarial samples. Fast and simple methods (e.g., single-step gradient ascent) are used for generating adversarial samples, in order to reduce computational complexity. It is shown that models trained using single-step adversarial training method (adversarial samples are generated using non-iterative method) are pseudo robust. Further, this pseudo robustness of models is attributed to the gradient masking effect. However, existing works fail to explain when and why gradient masking effect occurs during single-step adversarial training. In this work, (i) we show that models trained using single-step adversarial training method learn to prevent the generation of single-step adversaries, and this is due to over-fitting of the model during the initial stages of training, and (ii) to mitigate this effect, we propose a single-step adversarial training method with dropout scheduling. Unlike models trained using existing single-step adversarial training methods, models trained using the proposed single-step adversarial training method are robust against both single-step and multi-step adversarial attacks, and the performance is on par with models trained using computationally expensive multi-step adversarial training methods, in white-box and black-box settings.
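A single-step adversarial training iteration can be sketched as follows: generate an FGSM adversary with one gradient step, then train on it, while the dropout probability decays over training. The epsilon, the linear decay schedule, and the placement of dropout are assumptions for illustration; the paper's exact schedule may differ.

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps=8 / 255):
        # One-step (non-iterative) adversarial example generation.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad = torch.autograd.grad(loss, x)[0]
        return (x + eps * grad.sign()).clamp(0, 1).detach()

    def train_step(model, x, y, opt, epoch, total_epochs):
        # Assumed schedule: decay dropout from 0.5 to 0 as training progresses.
        for m in model.modules():
            if isinstance(m, torch.nn.Dropout):
                m.p = 0.5 * max(0.0, 1.0 - epoch / total_epochs)
        x_adv = fgsm(model, x, y)
        opt.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        opt.step()
        return loss.item()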
62. Halluci-Net: Scene Completion by Exploiting Object Co-occurrence Relationships [PDF] 返回目录
Kuldeep Kulkarni, Tejas Gokhale, Rajhans Singh, Pavan Turaga, Aswin Sankaranarayanan
Abstract: We address the new problem of complex scene completion from sparse label maps. We use a two-stage deep network based method, called `Halluci-Net', that uses object co-occurrence relationships to produce a dense and complete label map. The generated dense label map is fed into a state-of-the-art image synthesis method to obtain the final image. The proposed method is evaluated on the Cityscapes dataset and it outperforms a single-stage baseline method on various performance metrics like Fréchet Inception Distance (FID), semantic segmentation accuracy, and similarity in object co-occurrences. In addition to this, we show qualitative results on a subset of ADE20K dataset containing bedroom images.
63. Underwater image enhancement with Image Colorfulness Measure [PDF] 返回目录
Hui Li, Xi Yang, ZhenMing Li, TianLun Zhang
Abstract: Due to the absorption and scattering effects of water, underwater images tend to suffer from many severe problems, such as low contrast, grayed-out colors, and blurred content. To improve the visual quality of underwater images, we propose a novel enhancement model, which is a trainable end-to-end neural model. Two parts constitute the overall model. The first is a non-parametric layer for preliminary color correction; the second consists of parametric layers for self-adaptive refinement, namely the channel-wise linear shift. For better detail, contrast, and colorfulness, the enhancement network is jointly optimized by pixel-level and characteristic-level training criteria. Through extensive experiments on natural underwater scenes, we show that the proposed method achieves high-quality enhancement results.
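The channel-wise linear shift reads naturally as a learnable per-channel affine map applied after the preliminary color correction. The sketch below shows that assumed form (y_c = a_c * x_c + b_c); the actual refinement layers in the paper may be more elaborate.

    import torch
    import torch.nn as nn

    class ChannelLinearShift(nn.Module):
        def __init__(self, channels=3):
            super().__init__()
            self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))   # a_c
            self.shift = nn.Parameter(torch.zeros(1, channels, 1, 1))  # b_c

        def forward(self, x):          # x: (B, C, H, W), color-corrected image
            return self.scale * x + self.shift

    print(ChannelLinearShift()(torch.rand(1, 3, 8, 8)).shape)  # (1, 3, 8, 8)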
64. Feathers dataset for Fine-Grained Visual Categorization [PDF] 返回目录
Alina Belko, Konstantin Dobratulin, Andrey Kuznetsov
Abstract: This paper introduces a novel dataset, FeatherV1, containing 28,272 images of feathers categorized by 595 bird species. It was created to enable taxonomic identification of bird species from a single feather, which can be applied in amateur and professional ornithology. FeatherV1 is the first publicly available bird plumage dataset for machine learning, and it can raise interest in a new task in the fine-grained visual recognition domain. The latest version of the dataset can be downloaded at this https URL. We also present feather classification results: we selected several deep learning architectures (DenseNet-based) and compared their categorical cross-entropy values on the provided dataset.
65. DAPnet: A double self-attention convolutional network for segmentation of point clouds [PDF] 返回目录
Li Chen, Zewei Xu, Yongjian Fu, Haozhe Huang, Shaowen Wang, Haifeng Li
Abstract: LiDAR point clouds have a complex structure, and their 3D semantic labeling is a challenging task. Existing methods adopt data transformations without fully exploring contextual features, which makes them less efficient and accurate. In this study, we propose a double self-attention convolutional network, called DAPnet, that combines geometric and contextual features to generate better segmentation results. The double self-attention module, comprising a point attention module and a group attention module, originates from the self-attention mechanism and extracts contextual features of terrestrial objects with various shapes and scales. The contextual features extracted by these modules represent long-range dependencies in the data and are beneficial for reducing the scale diversity of point cloud objects. The point attention module selectively enhances features by modeling the interdependencies of neighboring points, while the group attention module emphasizes interdependent groups of points. We evaluate our method on the ISPRS 3D Semantic Labeling Contest dataset and find that our model outperforms the benchmark by 85.2% with an overall accuracy of 90.7%. The improvements over powerline and car are 7.5% and 13%. Through an ablation comparison, we find that the point attention module contributes more to the overall improvement of the model than the group attention module, and that incorporating the double self-attention module yields an average 7% improvement in per-class accuracy. Moreover, adopting the double self-attention module consumes a similar training time for model convergence as the variant without the attention module. The experimental results show the effectiveness and efficiency of DAPnet for the segmentation of LiDAR point clouds. The source code is available at this https URL.
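The point attention module can be pictured as standard self-attention over per-point features, so that each point is updated from all others according to learned pairwise weights. The sketch below is a generic point-wise self-attention block under assumed dimensions and residual form, not DAPnet's exact module.

    import torch
    import torch.nn as nn

    class PointAttention(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

        def forward(self, feats):                 # feats: (B, N, D) point features
            attn = torch.softmax(
                self.q(feats) @ self.k(feats).transpose(1, 2) / feats.shape[-1] ** 0.5,
                dim=-1)                           # (B, N, N) point-to-point weights
            return feats + attn @ self.v(feats)   # long-range context, residual add

    print(PointAttention()(torch.rand(2, 1024, 64)).shape)  # torch.Size([2, 1024, 64])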
66. Dynamic Feature Integration for Simultaneous Detection of Salient Object, Edge and Skeleton [PDF] 返回目录
Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng
Abstract: In this paper, we solve three low-level pixel-wise vision problems, including salient object segmentation, edge detection, and skeleton extraction, within a unified framework. We first show some similarities shared by these tasks and then demonstrate how they can be leveraged for developing a unified framework that can be trained end-to-end. In particular, we introduce a selective integration module that allows each task to dynamically choose features at different levels from the shared backbone based on its own characteristics. Furthermore, we design a task-adaptive attention module, aiming at intelligently allocating information for different tasks according to the image content priors. To evaluate the performance of our proposed network on these tasks, we conduct exhaustive experiments on multiple representative datasets. We will show that though these tasks are naturally quite different, our network can work well on all of them and even perform better than current single-purpose state-of-the-art methods. In addition, we also conduct adequate ablation analyses that provide a full understanding of the design principles of the proposed framework. To facilitate future research, source code will be released.
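The selective integration idea can be sketched as each task predicting gates over the shared backbone's feature levels and summing the gated levels into its own representation. The sketch below is a simplified, assumed form (global-pooling-based gates, equal-sized levels), not the paper's module.

    import torch
    import torch.nn as nn

    class SelectiveIntegration(nn.Module):
        def __init__(self, num_levels=4, dim=64):
            super().__init__()
            self.gate = nn.Linear(num_levels * dim, num_levels)

        def forward(self, feats):  # feats: list of (B, D, H, W), one per level
            pooled = torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)
            w = torch.softmax(self.gate(pooled), dim=1)        # (B, num_levels)
            return sum(w[:, i, None, None, None] * f for i, f in enumerate(feats))

    feats = [torch.rand(2, 64, 16, 16) for _ in range(4)]
    print(SelectiveIntegration()(feats).shape)  # torch.Size([2, 64, 16, 16])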
67. BiFNet: Bidirectional Fusion Network for Road Segmentation [PDF] 返回目录
Haoran Li, Yaran Chen, Qichao Zhang, Dongbin Zhao
Abstract: Multi-sensor fusion-based road segmentation plays an important role in intelligent driving systems since it provides a drivable area. Existing mainstream fusion methods mainly perform feature fusion in the image space domain, which causes perspective compression of the road and degrades performance on distant road regions. Considering that the bird's eye view (BEV) of the LiDAR preserves the spatial structure in the horizontal plane, this paper proposes a bidirectional fusion network (BiFNet) to fuse the image and the BEV of the point cloud. The network consists of two modules: 1) a dense space transformation module, which solves the mutual conversion between the camera image space and the BEV space; 2) a context-based feature fusion module, which fuses information from the different sensors based on the scenes from corresponding features. This method achieves competitive results on the KITTI dataset.
68. On the Synergies between Machine Learning and Stereo: a Survey [PDF] 返回目录
Matteo Poggi, Fabio Tosi, Konstantinos Batsos, Philippos Mordohai, Stefano Mattoccia
Abstract: Stereo matching is one of the longest-standing problems in computer vision with close to 40 years of studies and research. Throughout the years the paradigm has shifted from local, pixel-level decision to various forms of discrete and continuous optimization to data-driven, learning-based methods. Recently, the rise of machine learning and the rapid proliferation of deep learning enhanced stereo matching with new exciting trends and applications unthinkable until a few years ago. Interestingly, the relationship between these two worlds is two-way. While machine, and especially deep, learning advanced the state-of-the-art in stereo matching, stereo matching enabled new ground-breaking methodologies such as self-supervised monocular depth estimation based on deep neural networks. In this paper, we review recent research in the field of learning-based depth estimation from images highlighting the synergies, the successes achieved so far and the open challenges the community is going to face in the immediate future.
69. Learning to Dehaze From Realistic Scene with A Fast Physics Based Dehazing Network [PDF] 返回目录
Ruoteng Li, Xiaoyi Zhang, Shaodi You, Yu Li
Abstract: Dehazing has long been one of the popular computer vision research topics. A real-time method with reliable performance is highly desired for many applications such as autonomous driving. In recent years, while learning-based methods require datasets containing pairs of hazy images and clean ground-truth references, it has been generally impossible to capture this kind of data in the real world. Many existing works sidestep this difficulty by generating hazy images from depth on common RGBD datasets using the haze imaging model. However, there is still a gap between such synthetic datasets and real hazy images, as large datasets with high-quality depth are mostly indoor, and depth maps for outdoor scenes are imprecise. In this paper, we complement the existing datasets with a new, large, and diverse dehazing dataset containing real outdoor scenes from HD 3D videos. We select a large number of high-quality frames of real outdoor scenes and render haze on them using depth from stereo. Our dataset is more realistic than existing ones, and we demonstrate that using it greatly improves dehazing performance on real scenes. In addition to the dataset, inspired by the physics model, we also propose a light and reliable dehazing network. Our approach outperforms other methods by a large margin and becomes the new state-of-the-art method. Moreover, the light design of the network enables our method to run at real-time speed, much faster than other methods.
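The haze imaging model the abstract refers to is the standard atmospheric scattering model: I(x) = J(x) t(x) + A (1 - t(x)), with transmission t(x) = exp(-beta d(x)), where J is the clear image, d the depth, A the airlight, and beta the scattering coefficient. The snippet below renders haze from depth with that model; the airlight and beta values are illustrative.

    import numpy as np

    def render_haze(clear, depth, airlight=0.9, beta=1.2):
        t = np.exp(-beta * depth)[..., None]      # transmission from depth
        return clear * t + airlight * (1.0 - t)   # hazy image I = J*t + A*(1-t)

    J = np.random.rand(4, 4, 3)     # clear image
    d = np.random.rand(4, 4)        # depth map
    print(render_haze(J, d).shape)  # (4, 4, 3)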
70. Accurate Tumor Tissue Region Detection with Accelerated Deep Convolutional Neural Networks [PDF] 返回目录
Gabriel Tjio, Xulei Yang, Jia Mei Hong, Sum Thai Wong, Vanessa Ding, Andre Choo, Yi Su
Abstract: Manual annotation of pathology slides for cancer diagnosis is laborious and repetitive. Therefore, much effort has been devoted to developing computer vision solutions. Our approach, FLASH, is based on a Deep Convolutional Neural Network (DCNN) architecture. It reduces computational costs and is faster than typical deep learning approaches by two orders of magnitude, making high-throughput processing a possibility. In computer vision approaches using deep learning methods, the input image is subdivided into patches which are separately passed through the neural network. Features extracted from these patches are used by the classifier to annotate the corresponding region. Our approach aggregates all the extracted features into a single matrix before passing them to the classifier. Previously, features were extracted from overlapping patches; aggregating the features eliminates the need to process overlapping patches, which reduces the computations required. DCNN and FLASH demonstrate high sensitivity (~0.96), good precision (~0.78), and high F1 scores (~0.84). The average time taken to process each sample for FLASH and DCNN is 96.6 seconds and 9489.20 seconds, respectively. Our approach was approximately 100 times faster than the original DCNN approach while preserving high accuracy and precision.
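The aggregation step is essentially a vectorization argument: rather than pushing each patch feature through the classifier separately, all features are stacked into one matrix and classified in a single pass, which yields identical outputs. The toy numpy comparison below illustrates this with a linear classifier and assumed shapes.

    import numpy as np

    def classify(W, b, feats):
        return feats @ W + b      # simple linear classifier

    W, b = np.random.rand(128, 2), np.zeros(2)
    patch_feats = np.random.rand(1000, 128)

    per_patch = np.stack([classify(W, b, f[None]) for f in patch_feats]).squeeze(1)
    batched = classify(W, b, patch_feats)   # one aggregated matrix, one pass
    assert np.allclose(per_patch, batched)  # same predictions, far fewer calls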
71. Color Image Segmentation using Adaptive Particle Swarm Optimization and Fuzzy C-means [PDF] 返回目录
Narayana Reddy A, Ranjita Das
Abstract: Segmentation partitions an image into regions containing pixels with similar attributes. A standard non-contextual variant of the Fuzzy C-means clustering algorithm (FCM) is generally used in image segmentation because of its simplicity. However, FCM has disadvantages: it depends on the initial guess of the number of clusters and is highly sensitive to noise, so satisfactory visual segments cannot be obtained with FCM alone. Particle Swarm Optimization (PSO) belongs to the class of evolutionary algorithms and has good convergence speed and fewer parameters compared to Genetic Algorithms (GAs). An optimized version of PSO can be combined with FCM to act as a proper initializer for the algorithm, thereby reducing its sensitivity to the initial guess. A hybrid PSO algorithm named Adaptive Particle Swarm Optimization (APSO), which improves the calculation of various hyperparameters such as inertia weight and learning factors over standard PSO using insights from swarm behaviour, can be used to improve cluster quality. This paper presents a new image segmentation algorithm called Adaptive Particle Swarm Optimization and Fuzzy C-means Clustering Algorithm (APSOF), which is based on Adaptive Particle Swarm Optimization (APSO) and Fuzzy C-means clustering. Experimental results show that APSOF has an edge over FCM in correctly identifying the optimum cluster centers, thereby leading to accurate classification of the image pixels. Hence, APSOF has superior performance in comparison with classic Particle Swarm Optimization (PSO) and the Fuzzy C-means clustering algorithm (FCM) for image segmentation.
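A minimal sketch of the two-stage idea follows, under the assumption that PSO minimizes a clustering objective to seed the FCM centers. The actual APSO additionally adapts the inertia weight and learning factors from swarm behaviour, which is only gestured at here by a decaying inertia schedule; all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
pixels = np.concatenate([rng.normal(m, 5, 300) for m in (40, 120, 200)])  # toy intensities
K, m = 3, 2.0   # number of clusters, fuzzifier

def fitness(centers):
    # Sum of min squared distances: a common PSO fitness for clustering.
    d = (pixels[None, :] - centers[:, None]) ** 2
    return d.min(axis=0).sum()

# --- PSO stage: search for good initial cluster centers -------------------
n_particles, iters = 20, 50
pos = rng.uniform(0, 255, (n_particles, K))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()
for t in range(iters):
    w = 0.9 - 0.5 * t / iters         # linearly decaying inertia weight
    c1 = c2 = 1.5                     # learning factors (APSO adapts these)
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0, 255)
    f = np.array([fitness(p) for p in pos])
    better = f < pbest_f
    pbest[better], pbest_f[better] = pos[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

# --- FCM stage: refine memberships from the PSO-found centers -------------
centers = np.sort(gbest)
for _ in range(30):
    d = np.abs(pixels[None, :] - centers[:, None]) + 1e-9      # (K, N)
    u = 1.0 / (d ** (2 / (m - 1)) * (1.0 / d ** (2 / (m - 1))).sum(axis=0))
    centers = (u ** m @ pixels) / (u ** m).sum(axis=1)
print("cluster centers:", np.round(centers, 1))
```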
72. Moire Image Restoration using Multi Level Hyper Vision Net [PDF] 返回目录
D.Sabari Nathan, M.Parisa Beham, S. M. Md Mansoor Roomi
Abstract: A moire pattern in an image results from high-frequency scene content interacting with the image sensor's colour filter array and appears after demosaicing. These moire patterns arise in natural images of scenes with high-frequency content, and they can also vary strongly with a minimal change in camera direction or positioning. The moire pattern thus degrades the quality of photographs. An important issue in demoireing is that moire patterns have dynamic structure with varying colors and forms. These challenges make demoireing more difficult than many other image restoration tasks. Motivated by these challenges, a multilevel hyper vision net is proposed to remove the moire pattern and improve image quality. As a key aspect, the network incorporates a residual channel attention block that extracts and adaptively fuses hierarchical features from all the layers efficiently. The proposed algorithm has been tested on the NTIRE 2020 challenge dataset, achieving a Peak Signal-to-Noise Ratio (PSNR) of 36.85 and a Structural Similarity (SSIM) index of 0.98.
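The residual channel attention block mentioned in the abstract can be sketched as follows. This is a generic squeeze-gate-residual construction; the reduction ratio and exact layout are assumptions rather than the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_channel_attention(x, w1, w2):
    """x: feature map (C, H, W). Gates channels by global statistics,
    then adds the input back through a residual connection."""
    squeeze = x.mean(axis=(1, 2))                 # global average pool -> (C,)
    hidden = np.maximum(0.0, w1 @ squeeze)        # channel-reduction FC + ReLU
    gate = sigmoid(w2 @ hidden)                   # per-channel weights in (0, 1)
    return x + gate[:, None, None] * x            # residual: x + attention(x)

C, r = 16, 4                                      # channels, reduction ratio
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
features = rng.standard_normal((C, 32, 32))
out = residual_channel_attention(features, w1, w2)
print(out.shape)                                  # (16, 32, 32)
```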
73. Effect of Text Color on Word Embeddings [PDF] 返回目录
Masaya Ikoma, Brian Kenji Iwana, Seiichi Uchida
Abstract: In natural scenes and documents, we can find the correlation between a text and its color. For instance, the word, "hot", is often printed in red, while "cold" is often in blue. This correlation can be thought of as a feature that represents the semantic difference between the words. Based on this observation, we propose the idea of using text color for word embeddings. While text-only word embeddings (e.g. word2vec) have been extremely successful, they often represent antonyms as similar since they are often interchangeable in sentences. In this paper, we try two tasks to verify the usefulness of text color in understanding the meanings of words, especially in identifying synonyms and antonyms. First, we quantify the color distribution of words from the book cover images and analyze the correlation between the color and meaning of the word. Second, we try to retrain word embeddings with the color distribution of words as a constraint. By observing the changes in the word embeddings of synonyms and antonyms before and after re-training, we aim to understand the kind of words that have positive or negative effects in their word embeddings when incorporating text color information.
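The retraining constraint in the second task can be pictured as a regularized objective: stay close to the pretrained vectors while pulling together words whose color distributions are similar. The specific loss below is an assumption for illustration (the abstract does not publish its form), the color histograms are synthetic, and the gradient is analytic for this quadratic loss.

```python
import numpy as np

rng = np.random.default_rng(3)
V, D = 5, 8                                  # vocabulary size, embedding dim
E0 = rng.standard_normal((V, D))             # pretrained embeddings (e.g. word2vec)
E = E0.copy()

# Hypothetical per-word color histograms (e.g. from book-cover images).
colors = rng.random((V, 3))
colors /= colors.sum(axis=1, keepdims=True)

# Similarity of color distributions controls how strongly two words attract.
sim = np.exp(-np.square(colors[:, None] - colors[None, :]).sum(-1))

lam, lr = 0.1, 0.05
for _ in range(200):
    # L = ||E - E0||^2 + lam * sum_ij sim_ij * ||e_i - e_j||^2
    grad = 2 * (E - E0)
    for i in range(V):
        for j in range(V):
            grad[i] += 4 * lam * sim[i, j] * (E[i] - E[j])  # sim is symmetric
    E -= lr * grad
print("mean shift from pretrained:", np.linalg.norm(E - E0, axis=1).mean())
```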
74. Super-Resolution-based Snake Model -- An Unsupervised Method for Large-Scale Building Extraction using Airborne LiDAR Data and Optical Image [PDF] 返回目录
Thanh Huy Nguyen, Sylvie Daniel, Didier Gueriot, Christophe Sintes, Jean-Marc Le Caillec
Abstract: Automatic extraction of buildings in urban and residential scenes has become a subject of growing interest in the domain of photogrammetry and remote sensing, particularly since the mid-1990s. The active contour model, colloquially known as the snake model, has been studied for extracting buildings from aerial and satellite imagery. However, this task is still very challenging due to the complexity of building size and shape and of the surrounding environment. This complexity is a major obstacle to reliable large-scale building extraction, since the involved prior information and assumptions on buildings, such as shape, size, and color, cannot be generalized over large areas. This paper presents an efficient snake model to overcome this challenge, called the Super-Resolution-based Snake Model (SRSM). The SRSM operates on high-resolution LiDAR-based elevation images -- called z-images -- generated by a super-resolution process applied to LiDAR data. The involved balloon force model is also improved to shrink or inflate adaptively, instead of inflating the snake continuously. This method is applicable at large scales, such as city scale and even larger, while having a high level of automation and not requiring any prior knowledge or training data from the urban scenes (hence unsupervised). It achieves high overall accuracy when tested on various datasets. For instance, the proposed SRSM yields an average area-based Quality of 86.57% and object-based Quality of 81.60% on the ISPRS Vaihingen benchmark datasets. Compared to other methods using this benchmark dataset, this level of accuracy is highly desirable even for a supervised method. Similarly desirable outcomes are obtained when carrying out the proposed SRSM on the whole City of Quebec (total area of 656 km2), yielding an area-based Quality of 62.37% and an object-based Quality of 63.21%.
75. JL-DCF: Joint Learning and Densely-Cooperative Fusion Framework for RGB-D Salient Object Detection [PDF] 返回目录
Keren Fu, Deng-Ping Fan, Ge-Peng Ji, Qijun Zhao
Abstract: This paper proposes a novel joint learning and densely-cooperative fusion (JL-DCF) architecture for RGB-D salient object detection. Existing models usually treat RGB and depth as independent information and design separate networks for feature extraction from each. Such schemes can easily be constrained by a limited amount of training data or over-reliance on an elaborately-designed training process. In contrast, our JL-DCF learns from both RGB and depth inputs through a Siamese network. To this end, we propose two effective components: joint learning (JL), and densely-cooperative fusion (DCF). The JL module provides robust saliency feature learning, while the latter is introduced for complementary feature discovery. Comprehensive experiments on four popular metrics show that the designed framework yields a robust RGB-D saliency detector with good generalization. As a result, JL-DCF significantly advances the top-1 D3Net model by an average of ~1.9% (S-measure) across six challenging datasets, showing that the proposed framework offers a potential solution for real-world applications and could provide more insight into the cross-modality complementarity task. The code will be available at this https URL.
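The Siamese ingredient is simple to illustrate: one set of weights processes both modalities, and the two streams are then fused. The sketch below uses a single shared 3x3 kernel and an ad-hoc fusion by addition and concatenation; the real JL and DCF modules are deep and densely connected, so treat this only as the weight-sharing principle, with all names illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def conv3x3(x, k):
    """Valid 3x3 convolution plus ReLU on a single-channel image."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return np.maximum(out, 0.0)

kernel = rng.standard_normal((3, 3)) * 0.1  # ONE set of weights for both inputs

rgb_gray = rng.random((16, 16))             # stand-in for an RGB image
depth = rng.random((16, 16))                # stand-in for its depth map

f_rgb = conv3x3(rgb_gray, kernel)           # Siamese: same kernel twice
f_depth = conv3x3(depth, kernel)

# Cooperative-fusion sketch: combine complementary cues from the two
# modalities (here simply element-wise sum and product, concatenated).
fused = np.concatenate([f_rgb + f_depth, f_rgb * f_depth], axis=0)
print(f_rgb.shape, fused.shape)             # (14, 14) (28, 14)
```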
76. Semi-Supervised Semantic Segmentation via Dynamic Self-Training and Class-Balanced Curriculum [PDF] 返回目录
Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, Xin Tan, Jianping Shi, Lizhuang Ma
Abstract: In this work, we propose a novel and concise approach for semi-supervised semantic segmentation. The major challenge of this task lies in how to exploit unlabeled data efficiently and thoroughly. Previous state-of-the-art methods utilize unlabeled data through GAN-based self-training or consistency regularization. However, these methods either suffer from noisy self-supervision and class imbalance, resulting in a low unlabeled-data utilization rate, or do not consider the apparent link between self-training and consistency regularization. Our method, Dynamic Self-Training and Class-Balanced Curriculum (DST-CBC), exploits inter-model disagreement via prediction confidence to construct a dynamic loss that is robust against pseudo-label noise, enabling it to extend pseudo labeling to a class-balanced curriculum learning process. We further show that our method implicitly includes consistency regularization. Thus, DST-CBC not only exploits unlabeled data efficiently, but also thoroughly utilizes $all$ unlabeled data. Without using adversarial training or any kind of modification to the network architecture, DST-CBC outperforms existing methods on different datasets across all labeled ratios, bringing semi-supervised learning yet another step closer to matching the performance of fully-supervised learning for semantic segmentation. Our code and data splits are available at: this https URL .
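One way to picture the confidence-driven, class-balanced ingredients is the pseudo-labeling sketch below: a per-class confidence quantile selects the same fraction of pixels from every class (a curriculum would raise this fraction over training), and the cross-entropy on selected pixels is weighted by confidence so noisier pseudo labels count less. The exact DST-CBC loss differs; this is an illustrative reconstruction.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

logits = rng.standard_normal((1000, 4))       # unlabeled pixels, 4 classes
probs = softmax(logits)
conf = probs.max(axis=1)                      # prediction confidence
pseudo = probs.argmax(axis=1)                 # pseudo labels

# Class-balanced selection: per-class confidence threshold chosen so the
# SAME fraction of pixels is kept for every class (top 30% here).
keep = np.zeros(len(pseudo), dtype=bool)
for c in range(4):
    idx = np.where(pseudo == c)[0]
    if len(idx):
        thr = np.quantile(conf[idx], 0.70)
        keep[idx] = conf[idx] >= thr

# Dynamic loss: cross-entropy on kept pixels, down-weighted by confidence
# so noisier pseudo labels contribute less to the gradient.
ce = -np.log(probs[np.arange(len(pseudo)), pseudo] + 1e-9)
loss = np.sum(conf[keep] * ce[keep]) / max(keep.sum(), 1)
print(f"selected {keep.mean():.0%} of pixels, loss = {loss:.4f}")
```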
77. ImagePairs: Realistic Super Resolution Dataset via Beam Splitter Camera Rig [PDF] 返回目录
Hamid Reza Vaezi Joze, Ilya Zharkov, Karlton Powell, Carl Ringler, Luming Liang, Andy Roulston, Moshe Lutz, Vivek Pradeep
Abstract: Super resolution is the problem of recovering a high-resolution image from single or multiple low-resolution images of the same scene. It is an ill-posed problem since high-frequency visual details of the scene are completely lost in low-resolution images. To overcome this, many machine learning approaches have been proposed that aim to train a model to recover the lost details in new scenes. Such approaches include the recent successful effort in utilizing deep learning techniques to solve the super resolution problem. Data itself plays a significant role in the machine learning process, especially for deep learning approaches, which are data hungry; the process of gathering data can therefore be as vital as the machine learning technique used. Herein, we propose a new data acquisition technique for gathering a real image dataset which could be used as an input for super resolution, noise cancellation and quality enhancement techniques. We use a beam splitter to capture the same scene with a low-resolution camera and a high-resolution camera. Since we also release the raw images, this large-scale dataset could be used for other tasks such as ISP generation. Unlike the current small-scale datasets used for these tasks, our proposed dataset includes 11,421 pairs of low-resolution and high-resolution images of diverse scenes. To our knowledge this is the most complete dataset for super resolution, ISP and image quality enhancement. The benchmarking results show how the new dataset can be successfully used to significantly improve the quality of real-world image super resolution.
78. Finding Berries: Segmentation and Counting of Cranberries using Point Supervision and Shape Priors [PDF] 返回目录
Peri Akiva, Kristin Dana, Peter Oudemans, Michael Mars
Abstract: Precision agriculture has become a key factor for increasing crop yields by providing essential information to decision makers. In this work, we present a deep learning method for simultaneous segmentation and counting of cranberries to aid in yield estimation and sun exposure predictions. Notably, supervision is done using low cost center point annotations. The approach, named Triple-S Network, incorporates a three-part loss with shape priors to promote better fitting to objects of known shape typical in agricultural scenes. Our results improve overall segmentation performance by more than 6.74% and counting results by 22.91% when compared to state-of-the-art. To train and evaluate the network, we have collected the CRanberry Aerial Imagery Dataset (CRAID), the largest dataset of aerial drone imagery from cranberry fields. This dataset will be made publicly available.
79. BReG-NeXt: Facial Affect Computing Using Adaptive Residual Networks With Bounded Gradient [PDF] 返回目录
Behzad Hasani, Pooran Singh Negi, Mohammad H. Mahoor
Abstract: This paper introduces BReG-NeXt, a residual-based network architecture that uses a function with a bounded derivative, instead of a simple shortcut path (a.k.a. identity mapping), in the residual units for automatic recognition of facial expressions based on the categorical and dimensional models of affect. Compared to ResNet, our proposed adaptive complex mapping results in a shallower network with fewer training parameters and floating point operations per second (FLOPs). Adding trainable parameters to the bypass function further improves fitting and training of the network, and hence recognition of subtle facial expressions such as contempt, with a higher accuracy. We conducted comprehensive experiments on the categorical and dimensional models of affect on the challenging in-the-wild databases of AffectNet, FER2013, and Affect-in-Wild. Our experimental results show that our adaptive complex mapping approach outperforms the original ResNet, which consists of a simple identity mapping, as well as other state-of-the-art methods for Facial Expression Recognition (FER). Various metrics are reported on both affect models to provide a comprehensive evaluation of our method. In the categorical model, BReG-NeXt-50, with only 3.1M training parameters and 15 MFLOPs, achieves 68.50% and 71.53% accuracy on the AffectNet and FER2013 databases, respectively. In the dimensional model, BReG-NeXt achieves RMSE values of 0.2577 and 0.2882 on the AffectNet and Affect-in-Wild databases, respectively.
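The core architectural change is easy to state in code: replace the identity shortcut y = x + F(x) with y = f(x) + F(x), where f has a bounded derivative. The function x / sqrt(1 + x^2) used below is one convenient choice with derivative in (0, 1]; it is an illustrative assumption, not necessarily the mapping BReG-NeXt uses, and the residual branch is a toy linear-ReLU stand-in.

```python
import numpy as np

def bounded_shortcut(x):
    """Shortcut with bounded derivative: d/dx = (1 + x^2)^(-3/2) in (0, 1]."""
    return x / np.sqrt(1.0 + x ** 2)

def residual_unit(x, w, bounded=True):
    """y = f(x) + F(x): f is the shortcut, F a toy linear-ReLU residual branch."""
    branch = np.maximum(0.0, x @ w)
    shortcut = bounded_shortcut(x) if bounded else x   # ResNet uses identity
    return shortcut + branch

rng = np.random.default_rng(6)
x = rng.standard_normal((2, 8))
w = rng.standard_normal((8, 8)) * 0.1
print(residual_unit(x, w, bounded=True).shape)         # (2, 8)

# The gradient through this bounded shortcut never exceeds 1 and decays for
# large activations, which is the kind of property the paper exploits:
grads = (1.0 + np.linspace(-5, 5, 5) ** 2) ** -1.5
print(np.round(grads, 3))
```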
80. Adversarial Attack on Deep Learning-Based Splice Localization [PDF] 返回目录
Andras Rozsa, Zheng Zhong, Terrance E. Boult
Abstract: Regarding image forensics, researchers have proposed various approaches to detect and/or localize manipulations, such as splices. Recent best-performing image-forensics algorithms greatly benefit from the application of deep learning, but such tools can be vulnerable to adversarial attacks. Because most of the proposed adversarial example generation techniques can be used only on end-to-end classifiers, the adversarial robustness of image-forensics methods that utilize deep learning only for feature extraction has not been studied yet. Using a novel algorithm capable of directly adjusting the underlying representations of patches, we demonstrate on three non-end-to-end deep learning-based splice localization tools that hiding manipulations of images is feasible via adversarial attacks. While the tested image-forensics methods, EXIF-SC, SpliceRadar, and Noiseprint, rely on feature extractors that were trained on different surrogate tasks, we find that the formed adversarial perturbations can be transferable among them with respect to the deterioration of their localization performance.
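The attack's core step, adjusting a patch so its representation moves toward a target, reduces to gradient descent on the input. The sketch below uses a toy linear feature extractor so the gradient is analytic; the actual method backpropagates through deep extractors such as those of EXIF-SC or Noiseprint, and all variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
d_in, d_feat = 64, 16
W = rng.standard_normal((d_feat, d_in)) * 0.2   # toy linear feature extractor

spliced = rng.random(d_in)                       # patch that currently "looks" spliced
authentic = rng.random(d_in)
target = W @ authentic                           # features we want to imitate

x = spliced.copy()
lr = 0.05
for _ in range(300):
    # L(x) = ||W x - target||^2 ; analytic gradient: 2 W^T (W x - target)
    grad = 2 * W.T @ (W @ x - target)
    x = np.clip(x - lr * grad, 0.0, 1.0)         # stay a valid image patch

print("initial feature distance:", np.linalg.norm(W @ spliced - target).round(3))
print("after attack:            ", np.linalg.norm(W @ x - target).round(3))
print("pixel change (L2):       ", np.linalg.norm(x - spliced).round(3))
```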
81. Organ at Risk Segmentation for Head and Neck Cancer using Stratified Learning and Neural Architecture Search [PDF] 返回目录
Dazhou Guo, Dakai Jin, Zhuotun Zhu, Tsung-Ying Ho, Adam P. Harrison, Chun-Hung Chao, Jing Xiao, Alan Yuille, Chien-Yu Lin, Le Lu
Abstract: OAR segmentation is a critical step in radiotherapy of head and neck (H&N) cancer, where inconsistencies across radiation oncologists and prohibitive labor costs motivate automated approaches. However, leading methods use standard fully convolutional network workflows that are challenged when the number of OARs becomes large, e.g. > 40. For such scenarios, insights can be gained from the stratification approaches seen in manual clinical OAR delineation. This is the goal of our work, where we introduce stratified organ at risk segmentation (SOARS), an approach that stratifies OARs into anchor, mid-level, and small & hard (S&H) categories. SOARS stratifies across two dimensions. The first dimension is that distinct processing pipelines are used for each OAR category. In particular, inspired by clinical practices, anchor OARs are used to guide the mid-level and S&H categories. The second dimension is that distinct network architectures are used to manage the significant contrast, size, and anatomy variations between different OARs. We use differentiable neural architecture search (NAS), allowing the network to choose among 2D, 3D or Pseudo-3D convolutions. Extensive 4-fold cross-validation on 142 H&N cancer patients with 42 manually labeled OARs, the most comprehensive OAR dataset to date, demonstrates that both pipeline- and NAS-stratification significantly improve quantitative performance over the state-of-the-art (from 69.52% to 73.68% in absolute Dice scores). Thus, SOARS provides a powerful and principled means to manage the highly complex segmentation space of OARs.
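The differentiable NAS ingredient can be sketched in the DARTS style: each searchable layer outputs a softmax-weighted mixture of candidate operations, and the architecture weights `alpha` are trained by gradient descent, after which the strongest op is kept. The candidate ops below are cheap stand-ins for the 2D, 3D and Pseudo-3D convolutions named in the abstract; the whole snippet is illustrative, not the paper's search space.

```python
import numpy as np

rng = np.random.default_rng(8)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-ins for three candidate ops on a (D, H, W) volume. A real
# implementation would use 2D, 3D and pseudo-3D (2D + 1D) convolutions.
def op_2d(x):  return x - x.mean(axis=(1, 2), keepdims=True)   # slice-wise
def op_3d(x):  return x - x.mean()                              # volumetric
def op_p3d(x): return op_2d(x) - x.mean(axis=0, keepdims=True)  # 2D then 1D

alpha = np.zeros(3)                      # architecture parameters (learnable)
x = rng.standard_normal((4, 8, 8))

# Differentiable mixed op: output = sum_k softmax(alpha)_k * op_k(x).
# Training would update alpha by gradient descent; at the end the layer
# keeps only argmax(alpha), the op the search "chose" for this category.
weights = softmax(alpha)
mixed = sum(w * op(x) for w, op in zip(weights, (op_2d, op_3d, op_p3d)))
print(mixed.shape, "chosen op:", ["2D", "3D", "P3D"][int(alpha.argmax())])
```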
82. Multi-Modal Face Anti-Spoofing Based on Central Difference Networks [PDF] 返回目录
Zitong Yu, Yunxiao Qin, Xiaobai Li, Zezheng Wang, Chenxu Zhao, Zhen Lei, Guoying Zhao
Abstract: Face anti-spoofing (FAS) plays a vital role in securing face recognition systems from presentation attacks. Existing multi-modal FAS methods rely on stacked vanilla convolutions, which are weak at describing detailed intrinsic information from modalities and easily become ineffective when the domain shifts (e.g., cross attack and cross ethnicity). In this paper, we extend the central difference convolutional networks (CDCN) \cite{yu2020searching} to a multi-modal version, intending to capture intrinsic spoofing patterns among three modalities (RGB, depth and infrared). Meanwhile, we also give an elaborate study of single-modal based CDCN. Our approach won first place in "Track Multi-Modal" as well as second place in "Track Single-Modal (RGB)" of the ChaLearn Face Anti-spoofing Attack Detection Challenge@CVPR2020 \cite{liu2020cross}. Our final submission obtains 1.02$\pm$0.59\% and 4.84$\pm$1.79\% ACER in "Track Multi-Modal" and "Track Single-Modal (RGB)", respectively. The codes are available at {this https URL}.
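Central difference convolution, the building block of the cited CDCN, blends a vanilla convolution with one computed on differences to the center pixel: y(p0) = (1 - theta) * sum_p w(p) x(p0 + p) + theta * sum_p w(p) (x(p0 + p) - x(p0)), which simplifies to the vanilla response minus theta * x(p0) * sum(w). A direct single-channel sketch (sizes and theta illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

def central_difference_conv(x, w, theta=0.7):
    """y(p0) = (1-theta) * sum_p w(p) x(p0+p)               (vanilla term)
               + theta   * sum_p w(p) (x(p0+p) - x(p0))     (gradient term)
    = vanilla_conv(p0) - theta * x(p0) * sum(w)."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    wsum = w.sum()
    for i in range(H - 2):
        for j in range(W - 2):
            vanilla = np.sum(x[i:i + 3, j:j + 3] * w)
            center = x[i + 1, j + 1]
            out[i, j] = vanilla - theta * center * wsum
    return out

img = rng.random((16, 16))
kernel = rng.standard_normal((3, 3)) * 0.1
print(central_difference_conv(img, kernel, theta=0.7).shape)   # (14, 14)
```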
83. Knowledge-Based Visual Question Answering in Videos [PDF] 返回目录
Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima
Abstract: We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require experience obtained from viewing the series in order to be answered. Second, we propose a video understanding model that combines the visual and textual video content with specific knowledge about the show. Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
84. Computer Vision for COVID-19 Control: A Survey [PDF] 返回目录
Anwaar Ulhaq, Asim Khan, Douglas Gomes, Manoranjan Pau
Abstract: The global spread of the COVID-19 pandemic has triggered an urgent need to contribute to the fight against an immense threat to the human population. Computer vision, as a subfield of artificial intelligence, has enjoyed recent success in solving various complex problems in health care and has the potential to contribute to the fight against COVID-19. In response to this call, computer vision researchers are putting their knowledge base to the test to devise effective ways to counter the COVID-19 challenge and serve the global community. New contributions are being shared with every passing day. This motivated us to review the recent work, collect information about available research resources, and indicate future research directions. We want to make this available to the computer vision research community to save their precious time. This survey paper is intended to provide a preliminary review of the available literature on the computer vision fight against the COVID-19 pandemic.
85. Quantization Guided JPEG Artifact Correction [PDF] 返回目录
Max Ehrlich, Ser-Nam Lim, Larry Davis, Abhinav Shrivastava
Abstract: The JPEG image compression algorithm is the most popular method of image compression because of its ability to achieve large compression ratios. However, to achieve such high compression, information is lost. For aggressive quantization settings, this leads to a noticeable reduction in image quality. Artifact correction has been studied in the context of deep neural networks for some time, but the current state-of-the-art methods require a different model to be trained for each quality setting, greatly limiting their practical application. We solve this problem by creating a novel architecture which is parameterized by the JPEG file's quantization matrix. This allows our single model to achieve state-of-the-art performance over models trained for specific quality settings.
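One plausible way to parameterize a network by the quantization matrix is conditioning: embed the 8x8 matrix into a vector and use it to scale and shift feature channels (FiLM-style modulation). The mechanism below is an assumption about how such conditioning could work, not the paper's published architecture, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)

C = 8                                         # feature channels
W_embed = rng.standard_normal((2 * C, 64)) * 0.05

def correct(features, q_matrix):
    """Condition the restoration features on the 8x8 quantization matrix
    by predicting a per-channel scale and shift from it."""
    q = q_matrix.reshape(64) / 255.0          # normalize the q-matrix
    params = W_embed @ q                      # (2C,) conditioning vector
    scale, shift = 1.0 + params[:C], params[C:]
    return features * scale[:, None, None] + shift[:, None, None]

# A stand-in quantization matrix for a mid-quality JPEG.
q_matrix = rng.integers(10, 60, size=(8, 8)).astype(float)
features = rng.standard_normal((C, 32, 32))
print(correct(features, q_matrix).shape)      # one model, any quality setting
```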
86. Non-Blocking Simultaneous Multithreading: Embracing the Resiliency of Deep Neural Networks [PDF] 返回目录
Gil Shomron, Uri Weiser
Abstract: Deep neural networks (DNNs) are known for their inability to utilize underlying hardware resources due to hardware susceptibility to sparse activations and weights. Even in finer granularities, many of the non-zero values hold a portion of zero-valued bits that may cause inefficiencies when executed on hardware. Inspired by conventional CPU simultaneous multithreading (SMT) that increases computer resource utilization by sharing them across several threads, we propose non-blocking SMT (NB-SMT) designated for DNN accelerators. Like conventional SMT, NB-SMT shares hardware resources among several execution flows. Yet, unlike SMT, NB-SMT is non-blocking, as it handles structural hazards by exploiting the algorithmic resiliency of DNNs. Instead of opportunistically dispatching instructions while they wait in a reservation station for available hardware, NB-SMT temporarily reduces the computation precision to accommodate all threads at once, enabling a non-blocking operation. We demonstrate NB-SMT applicability using SySMT, an NB-SMT-enabled output-stationary systolic array (OS-SA). Compared with a conventional OS-SA, a 2-threaded SySMT consumes 1.4x the area and delivers 2x speedup with 33% energy savings and less than 1% accuracy degradation of state-of-the-art CNNs with ImageNet. A 4-threaded SySMT consumes 2.5x the area and delivers, for example, 3.4x speedup and 39% energy savings with 1% accuracy degradation of 40%-pruned ResNet-18.
87. GraN: An Efficient Gradient-Norm Based Detector for Adversarial and Misclassified Examples [PDF] 返回目录
Julia Lust, Alexandru Paul Condurache
Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples and other data perturbations. Especially in safety critical applications of DNNs, it is therefore crucial to detect misclassified samples. The current state-of-the-art detection methods require either significantly more runtime or more parameters than the original network itself. This paper therefore proposes GraN, a time and parameter-efficient method that is easily adaptable to any DNN. GraN is based on the layer-wise norm of the DNN's gradient regarding the loss of the current input-output combination, which can be computed via backpropagation. GraN achieves state-of-the-art performance on numerous problem set-ups.
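The score itself is simple to reproduce. Below is a minimal PyTorch sketch of a GraN-style feature vector, assuming the layer-wise L2 norm of the gradient of the loss taken at the network's own prediction; the exact norm, smoothing, and detector head in the paper may differ.

```python
# Minimal GraN-style features: layer-wise norms of d(loss)/d(parameters),
# with the network's own prediction used as the label (no ground truth needed).
import torch
import torch.nn.functional as F

def gran_features(model, x):
    model.zero_grad()
    logits = model(x.unsqueeze(0))
    loss = F.cross_entropy(logits, logits.argmax(dim=1))  # loss at predicted class
    loss.backward()
    return torch.stack([p.grad.norm() for p in model.parameters()])

model = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 10))
feats = gran_features(model, torch.randn(784))
print(feats)  # one norm per parameter tensor; a small logistic-regression head
              # trained on these would flag adversarial / misclassified inputs
```

One backward pass per input is all this costs, which is where the claimed time and parameter efficiency comes from.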
88. Invariant Integration in Deep Convolutional Feature Space [PDF] 返回目录
Matthias Rath, Alexandru Paul Condurache
Abstract: In this contribution, we show how to incorporate prior knowledge into a deep neural network architecture in a principled manner. We enforce feature space invariances using a novel layer based on invariant integration. This allows us to construct a complete feature space invariant to finite transformation groups. We apply our proposed layer to explicitly insert invariance properties for vision-related classification tasks, demonstrate our approach for the case of rotation invariance, and report state-of-the-art performance on the Rotated-MNIST dataset. Our method is especially beneficial when training with limited data.
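A generic sketch of invariant integration over a finite group may help: averaging a function of the feature map (a monomial, say) over all group elements, here the four 90-degree rotations, yields an exactly group-invariant output. The paper's specific choice of monomials and layer placement is not reproduced here.

```python
# Invariant integration over the group of 90-degree rotations:
# f_inv(x) = (1/|G|) * sum over g in G of f(g . x), which is invariant because
# rotating x merely permutes the summands.
import torch

class RotationInvariantIntegration(torch.nn.Module):
    def __init__(self, func=lambda t: t ** 2):   # func: e.g. a monomial
        super().__init__()
        self.func = func

    def forward(self, x):                        # x: (N, C, H, W)
        rots = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
        return sum(self.func(r) for r in rots) / 4.0

layer = RotationInvariantIntegration()
x = torch.randn(1, 8, 16, 16)
# Invariance check: a rotated input gives exactly the same output.
print(torch.allclose(layer(x), layer(torch.rot90(x, 1, dims=(2, 3)))))
```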
89. Spatial Action Maps for Mobile Manipulation [PDF] 返回目录
Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Johnny Lee, Szymon Rusinkiewicz, Thomas Funkhouser
Abstract: This paper proposes a new action representation for learning to perform complex mobile manipulation tasks. In a typical deep Q-learning setup, a convolutional neural network (ConvNet) is trained to map from an image representing the current state (e.g., a birds-eye view of a SLAM reconstruction of the scene) to predicted Q-values for a small set of steering command actions (step forward, turn right, turn left, etc.). Instead, we propose an action representation in the same domain as the state: "spatial action maps." In our proposal, the set of possible actions is represented by pixels of an image, where each pixel represents a trajectory to the corresponding scene location along a shortest path through obstacles of the partially reconstructed scene. A significant advantage of this approach is that the spatial position of each state-action value prediction represents a local milestone (local end-point) for the agent's policy, which may be easily recognizable in local visual patterns of the state image. A second advantage is that atomic actions can perform long-range plans (follow the shortest path to a point on the other side of the scene), and thus it is simpler to learn complex behaviors with a deep Q-network. A third advantage is that we can use a fully convolutional network (FCN) with skip connections to learn the mapping from state images to pixel-aligned action images efficiently. During experiments with a robot that learns to push objects to a goal location, we find that policies learned with this proposed action representation achieve significantly better performance than traditional alternatives.
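The representation is straightforward to sketch: a fully convolutional network emits one Q-value per pixel, and the greedy policy picks the argmax pixel as the local milestone to steer toward. The tiny FCN below is an illustrative stand-in, not the paper's architecture.

```python
# Spatial action map sketch: state image in, pixel-aligned Q-value map out;
# the greedy action is the argmax pixel (a scene location to move toward).
import torch

class TinyFCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 1, 1))           # one Q-value per pixel

    def forward(self, state):                    # state: (N, 3, H, W)
        return self.net(state).squeeze(1)        # (N, H, W) Q-value map

q_net = TinyFCN()
state = torch.randn(1, 3, 64, 64)                # e.g. bird's-eye SLAM map
q_map = q_net(state)
idx = q_map.view(1, -1).argmax(dim=1).item()
row, col = divmod(idx, q_map.shape[-1])
print("greedy action: steer toward pixel", (row, col))
```

Because the action space and the state share the same pixel grid, the usual FCN-with-skip-connections machinery applies directly, which is the third advantage the abstract highlights.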
90. Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense [PDF] 返回目录
Yixin Zhu, Tao Gao, Lifeng Fan, Siyuan Huang, Mark Edmonds, Hangxin Liu, Feng Gao, Chi Zhang, Siyuan Qi, Ying Nian Wu, Joshua B. Tenenbaum, Song-Chun Zhu
Abstract: Recent progress in deep learning is essentially based on a "big data for small tasks" paradigm, under which massive amounts of data are used to train a classifier for a single narrow task. In this paper, we call for a shift that flips this paradigm upside down. Specifically, we propose a "small data for big tasks" paradigm, wherein a single artificial intelligence (AI) system is challenged to develop "common sense", enabling it to solve a wide range of tasks with little training data. We illustrate the potential power of this new paradigm by reviewing models of common sense that synthesize recent breakthroughs in both machine and human vision. We identify functionality, physics, intent, causality, and utility (FPICU) as the five core domains of cognitive AI with humanlike common sense. When taken as a unified concept, FPICU is concerned with the questions of "why" and "how", beyond the dominant "what" and "where" framework for understanding vision. They are invisible in terms of pixels but nevertheless drive the creation, maintenance, and development of visual scenes. We therefore coin them the "dark matter" of vision. Just as our universe cannot be understood by merely studying observable matter, we argue that vision cannot be understood without studying FPICU. We demonstrate the power of this perspective to develop cognitive AI systems with humanlike common sense by showing how to observe and apply FPICU with little training data to solve a wide range of challenging tasks, including tool use, planning, utility inference, and social learning. In summary, we argue that the next generation of AI must embrace "dark" humanlike common sense for solving novel tasks.
91. X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions [PDF] 返回目录
Michael Danielczuk, Anelia Angelova, Vincent Vanhoucke, Ken Goldberg
Abstract: For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at this http URL .
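One way to read the selection rule is sketched below: given a predicted occupancy distribution over possible target locations, grasp the visible object whose mask covers the most occupancy mass, thereby shrinking the support of the distribution. The trained occupancy network itself is not reproduced, and the mask and heatmap inputs are assumed.

```python
# Schematic X-Ray-style selection rule (my reading of the abstract): remove
# the visible object that accounts for the most predicted occupancy mass.
import numpy as np

def pick_object_to_grasp(occupancy, object_masks):
    """occupancy: (H, W) distribution over where the occluded target may be.
    object_masks: list of boolean (H, W) masks of visible objects."""
    scores = [occupancy[m].sum() for m in object_masks]
    return int(np.argmax(scores))

occ = np.random.rand(64, 64); occ /= occ.sum()   # stand-in network output
masks = [np.zeros((64, 64), bool) for _ in range(3)]
masks[0][10:20, 10:20] = True
masks[1][30:50, 30:50] = True                    # largest coverage -> likely pick
masks[2][0:5, 0:5] = True
print("grasp object", pick_object_to_grasp(occ, masks))
```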
92. Spectral GUI for Automated Tissue and Lesion Segmentation of T1 Weighted Breast MR Images [PDF] 返回目录
Prajval Koul
Abstract: We present Spectral GUI, a multiplatform breast MR image analysis tool designed to facilitate the segmentation of fibroglandular tissues and lesions in T1 weighted breast MR images via a graphical user interface (GUI). Spectral GUI uses the spectrum loft method [1] for breast MR image segmentation. It is not only interactive but also robust and expeditious. Being devoid of any machine learning algorithm, it shows exceptionally high execution speed with minimal overhead. The accuracy of the results has been measured using both performance metrics and expert entailment. The validity and applicability of the tool are discussed in the paper, along with a crisp contrast with traditional machine learning principles, establishing it as a competent tool in the field of image analysis.
93. Autonomous task planning and situation awareness in robotic surgery [PDF] 返回目录
Michele Ginesi, Daniele Meli, Andrea Roberti, Nicola Sansonetto, Paolo Fiorini
Abstract: The use of robots in minimally invasive surgery has improved the quality of standard surgical procedures. So far, only the automation of simple surgical actions has been investigated by researchers, while the execution of structured tasks requiring reasoning on the environment and a choice among multiple actions is still managed by human surgeons. In this paper, we propose a framework to implement surgical task automation. The framework consists of a task-level reasoning module based on answer set programming, a low-level motion planning module based on dynamic movement primitives, and a situation awareness module. The logic-based reasoning module generates explainable plans and is able to recover from failure conditions, which are identified and explained by the situation awareness module interfacing to a human supervisor, for enhanced safety. Dynamic movement primitives make it possible to replicate the dexterity of surgeons and to adapt to obstacles and changes in the environment. The framework is validated on different versions of the standard surgical training peg-and-ring task.
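As a taste of the low-level layer, here is a minimal one-dimensional discrete dynamic movement primitive in its standard transformation-system form; the gains, basis functions, and forcing weights are illustrative placeholders rather than the paper's values.

```python
# Standard 1-D discrete DMP: a critically damped spring toward the goal g,
# shaped by a learned forcing term f(x) that fades with the canonical phase x.
import numpy as np

def dmp_rollout(y0, g, w, tau=1.0, dt=0.01, alpha=25.0, beta=25.0 / 4, alpha_x=3.0):
    n = len(w)
    centers = np.exp(-alpha_x * np.linspace(0, 1, n))  # basis centers in phase
    widths = n ** 1.5 / centers
    y, z, x, traj = y0, 0.0, 1.0, []
    for _ in range(int(1.0 / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        f = x * (g - y0) * (psi @ w) / (psi.sum() + 1e-10)  # forcing term
        z += dt / tau * (alpha * (beta * (g - y) - z) + f)  # transformation system
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                      # canonical system
        traj.append(y)
    return np.array(traj)

traj = dmp_rollout(y0=0.0, g=1.0, w=np.zeros(10))  # zero forcing: smooth reach
print(traj[-1])  # converges toward the goal g = 1.0
```

Non-zero weights w, fit from demonstrations, are what let the primitive reproduce a surgeon's trajectory shape while the spring term guarantees convergence to the (possibly moved) goal.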
94. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation [PDF] 返回目录
Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, Jian Wu
Abstract: Recently, a growing interest has been seen in deep learning-based semantic segmentation. UNet, a deep learning network with an encoder-decoder architecture, is widely used in medical image segmentation. Combining multi-scale features is one of the important factors for accurate segmentation. UNet++ was developed as a modified UNet by designing an architecture with nested and dense skip connections. However, it does not explore sufficient information from full scales, and there is still large room for improvement. In this paper, we propose a novel UNet 3+, which takes advantage of full-scale skip connections and deep supervision. The full-scale skip connections incorporate low-level details with high-level semantics from feature maps at different scales, while the deep supervision learns hierarchical representations from the full-scale aggregated feature maps. The proposed method is especially beneficial for organs that appear at varying scales. In addition to accuracy improvements, the proposed UNet 3+ can reduce the network parameters to improve computational efficiency. We further propose a hybrid loss function and devise a classification-guided module to enhance the organ boundary and reduce over-segmentation in non-organ images, yielding more accurate segmentation results. The effectiveness of the proposed method is demonstrated on two datasets. The code is available at: this http URL
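A sketch of one full-scale skip connection may clarify the aggregation: a decoder stage resizes feature maps from all encoder and decoder scales to its own resolution, passes each through its own convolution, and fuses the concatenation. Channel counts are illustrative, and bilinear resizing stands in for the paper's max-pooling on the downsampling paths.

```python
# One UNet 3+-style decoder stage: aggregate *all* scales, each via its own
# conv, then fuse; the paper max-pools high-resolution encoder maps down,
# whereas this sketch uses bilinear resizing in both directions for brevity.
import torch
import torch.nn.functional as F

class FullScaleDecoderBlock(torch.nn.Module):
    def __init__(self, in_channels, out_channels=64):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            [torch.nn.Conv2d(c, out_channels, 3, padding=1) for c in in_channels])
        fused = out_channels * len(in_channels)
        self.fuse = torch.nn.Conv2d(fused, fused, 3, padding=1)

    def forward(self, feats, target_hw):
        resized = [F.interpolate(f, size=target_hw, mode="bilinear",
                                 align_corners=False) for f in feats]
        branches = [conv(f) for conv, f in zip(self.convs, resized)]
        return torch.relu(self.fuse(torch.cat(branches, dim=1)))

# Four encoder maps (1/1 ... 1/8 resolution) plus one coarser decoder map:
feats = [torch.randn(1, c, s, s) for c, s in [(64, 128), (128, 64),
                                              (256, 32), (512, 16), (320, 32)]]
block = FullScaleDecoderBlock([64, 128, 256, 512, 320])
print(block(feats, target_hw=(64, 64)).shape)
```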
95. A fast semi-automatic method for classification and counting the number and types of blood cells in an image [PDF] 返回目录
Hamed Sadeghi, Shahram Shirani, David W. Capson
Abstract: A novel, fast, semi-automatic method for segmenting, locating, and counting blood cells in an image is proposed. In this method, thresholding is used to separate the nucleus from the other parts. We also use the Hough transform for circles to locate the centers of white cells. Locating and counting of red cells is performed using template matching. We use local-maxima detection, labeling, and mean-value computation to shrink each region obtained from the Hough transform or template matching to a single pixel that represents the region's location. The proposed method is very fast and computes the number and location of white cells accurately. It is also capable of locating and counting the red cells with a small error.
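The stages named in the abstract map directly onto standard OpenCV calls, as in the rough sketch below; all thresholds, radii, the response cutoff, and the template patch are placeholders that would need tuning per dataset, and the random input stands in for a real smear image.

```python
# Rough OpenCV sketch of the pipeline stages: thresholding for nuclei,
# circular Hough transform for white-cell centers, template matching for
# red cells. All parameters here are illustrative placeholders.
import cv2
import numpy as np

# Stand-in for a real smear image; replace with cv2.imread of your data.
gray = (np.random.rand(256, 256) * 255).astype(np.uint8)

# 1) Thresholding separates the (dark-stained) nuclei from everything else.
_, nuclei = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# 2) A circular Hough transform locates white-cell centers.
wbc = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1.2, minDist=30,
                       param1=100, param2=30, minRadius=15, maxRadius=40)

# 3) Template matching finds red cells; thresholded maxima of the response
#    map give one representative pixel per cell region.
template = gray[:24, :24]                          # placeholder template patch
resp = cv2.matchTemplate(gray, template, cv2.TM_CCOEFF_NORMED)
peaks = np.argwhere(resp > 0.7)
print("WBC circles:", 0 if wbc is None else len(wbc[0]),
      "| RBC candidates:", len(peaks))
```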
96. Automatic Grading of Knee Osteoarthritis on the Kellgren-Lawrence Scale from Radiographs Using Convolutional Neural Networks [PDF] 返回目录
Sudeep Kondal, Viraj Kulkarni, Ashrika Gaikwad, Amit Kharat, Aniruddha Pant
Abstract: The severity of knee osteoarthritis is graded using the 5-point Kellgren-Lawrence (KL) scale where healthy knees are assigned grade 0, and the subsequent grades 1-4 represent increasing severity of the affliction. Although several methods have been proposed in recent years to develop models that can automatically predict the KL grade from a given radiograph, most models have been developed and evaluated on datasets not sourced from India. These models fail to perform well on the radiographs of Indian patients. In this paper, we propose a novel method using convolutional neural networks to automatically grade knee radiographs on the KL scale. Our method works in two connected stages: in the first stage, an object detection model segments individual knees from the rest of the image; in the second stage, a regression model automatically grades each knee separately on the KL scale. We train our model using the publicly available Osteoarthritis Initiative (OAI) dataset and demonstrate that fine-tuning the model before evaluating it on a dataset from a private hospital significantly improves the mean absolute error from 1.09 (95% CI: 1.03-1.15) to 0.28 (95% CI: 0.25-0.32). Additionally, we compare classification and regression models built for the same task and demonstrate that regression outperforms classification.
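The two-stage pipeline can be sketched with generic torchvision stand-ins (untrained here, unlike the authors' models): a detector crops each knee, then a CNN regresses a continuous score that is clamped and rounded to a KL grade.

```python
# Schematic two-stage KL grading: detect each knee, crop, regress a
# continuous score, round to a 0-4 grade. Generic, untrained stand-in models.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=2)  # knee vs. background
regressor = torchvision.models.resnet18(weights=None)
regressor.fc = torch.nn.Linear(regressor.fc.in_features, 1)  # continuous score
detector.eval(); regressor.eval()

xray = torch.rand(3, 512, 512)                 # stand-in for a radiograph
with torch.no_grad():
    boxes = detector([xray])[0]["boxes"]       # stage 1: locate each knee
    for box in boxes[:2]:                      # at most two knees per film
        x0, y0, x1, y1 = box.int().tolist()
        if x1 - x0 < 2 or y1 - y0 < 2:
            continue                           # skip degenerate detections
        crop = xray[:, y0:y1, x0:x1].unsqueeze(0)
        crop = torch.nn.functional.interpolate(
            crop, size=(224, 224), mode="bilinear", align_corners=False)
        score = regressor(crop).item()         # stage 2: regress a KL score
        print("KL grade:", int(round(min(max(score, 0.0), 4.0))))
```

Treating the grade as a regression target, as the abstract argues, lets the model exploit the ordinal structure of the KL scale that a plain classifier ignores.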
97. FedNAS: Federated Deep Learning via Neural Architecture Search [PDF] 返回目录
Chaoyang He, Murali Annavaram, Salman Avestimehr
Abstract: Federated Learning (FL) has been proved to be an effective learning framework when data cannot be centralized due to privacy, communication costs, and regulatory restrictions. When training deep learning models under an FL setting, people employ the predefined model architecture discovered in the centralized environment. However, this predefined architecture may not be the optimal choice because it may not fit data with a non-identical and independent distribution (non-IID). Thus, we advocate automating federated learning (AutoFL) to improve model accuracy and reduce the manual design effort. We specifically study AutoFL via Neural Architecture Search (NAS), which can automate the design process. We propose a Federated NAS (FedNAS) algorithm to help scattered workers collaboratively search for a better architecture with higher accuracy. We also build a system based on FedNAS. Our experiments on a non-IID dataset show that the architecture searched by FedNAS can outperform the manually predefined architecture.
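A condensed sketch of one plausible round structure follows: each client locally updates both the model weights and differentiable architecture parameters (as in gradient-based NAS), and the server averages both across clients. Whether FedNAS averages exactly these quantities is my reading of the abstract, and the local search is reduced here to a single gradient step on a toy "searchable" operator mix.

```python
# Toy federated NAS round: clients update weights w and architecture
# parameters alpha locally; the server averages both (FedAvg-style).
import torch

def local_step(w, alpha, data):
    x, y = data
    logits = x @ w * torch.sigmoid(alpha)      # stand-in "searchable" op mix
    loss = torch.nn.functional.mse_loss(logits, y)
    gw, ga = torch.autograd.grad(loss, [w, alpha])
    return w - 0.1 * gw, alpha - 0.1 * ga

def fednas_round(w, alpha, client_data):
    results = [local_step(w.clone().requires_grad_(),
                          alpha.clone().requires_grad_(), d)
               for d in client_data]
    w_new = torch.stack([r[0] for r in results]).mean(0)   # FedAvg on weights
    a_new = torch.stack([r[1] for r in results]).mean(0)   # ... and on alpha
    return w_new.detach(), a_new.detach()

w, alpha = torch.randn(4, 1), torch.zeros(1)
clients = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
for _ in range(5):
    w, alpha = fednas_round(w, alpha, clients)
print(alpha)  # architecture parameter updated locally, averaged globally
```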
98. Fitting the Search Space of Weight-sharing NAS with Graph Convolutional Networks [PDF] 返回目录
Xin Chen, Lingxi Xie, Jun Wu, Longhui Wei, Yuhui Xu, Qi Tian
Abstract: Neural architecture search has attracted wide attention in both academia and industry. To accelerate it, researchers have proposed weight-sharing methods which first train a super-network to reuse computation among different operators, from which exponentially many sub-networks can be sampled and efficiently evaluated. These methods enjoy great advantages in terms of computational costs, but the sampled sub-networks are not guaranteed to be estimated precisely unless an individual training process is carried out. This paper attributes such inaccuracy to the inevitable mismatch between assembled network layers, which adds a random error term to each estimation. We alleviate this issue by training a graph convolutional network to fit the performance of sampled sub-networks, so that the impact of random errors becomes minimal. With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates, which consequently leads to better performance of the final architecture. In addition, our approach enjoys the flexibility of being used under different hardware constraints, since the graph convolutional network provides an efficient lookup table of the performance of architectures over the entire search space.
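A minimal predictor of this kind is easy to sketch: encode a sampled sub-network as a DAG with one-hot operator features per node and regress its accuracy with a two-layer graph convolution using the propagation rule H' = ReLU(Â H W). The encoding and network below are generic, not the paper's exact design.

```python
# Generic GCN performance predictor for NAS: adjacency matrix + one-hot
# operator features in, scalar accuracy estimate out.
import torch

class GCNPredictor(torch.nn.Module):
    def __init__(self, num_ops, hidden=32):
        super().__init__()
        self.w1 = torch.nn.Linear(num_ops, hidden)
        self.w2 = torch.nn.Linear(hidden, hidden)
        self.out = torch.nn.Linear(hidden, 1)

    def forward(self, adj, ops):                # adj: (n, n), ops: (n, num_ops)
        a_hat = adj + torch.eye(adj.shape[0])   # add self-loops
        a_hat = a_hat / a_hat.sum(1, keepdim=True)  # row-normalize
        h = torch.relu(a_hat @ self.w1(ops))
        h = torch.relu(a_hat @ self.w2(h))
        return self.out(h.mean(0))              # graph-level accuracy estimate

pred = GCNPredictor(num_ops=5)
adj = torch.tensor([[0., 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]])
ops = torch.nn.functional.one_hot(torch.tensor([0, 2, 3, 1]), 5).float()
print(pred(adj, ops))  # trained against measured sub-network accuracies
```

Once trained on (architecture, measured accuracy) pairs from the super-network, such a predictor can rank any candidate in the space without further evaluation, which is the lookup-table property the abstract mentions.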