Abstracts

1. Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction
Bingbin Liu, Ehsan Adeli, Zhangjie Cao, Kuan-Hui Lee, Abhijeet Shenoi, Adrien Gaidon, Juan Carlos Niebles
Abstract: Reasoning over visual data is a desirable capability for robotics and vision-based applications. Such reasoning enables forecasting of the next events or actions in videos. In recent years, various models have been developed based on convolution operations for prediction or forecasting, but they lack the ability to reason over spatiotemporal data and infer the relationships of different objects in the scene. In this paper, we present a framework based on graph convolution to uncover the spatiotemporal relationships in the scene for reasoning about pedestrian intent. A scene graph is built on top of segmented object instances within and across video frames. Pedestrian intent, defined as the future action of crossing or not-crossing the street, is a crucial piece of information for autonomous vehicles to navigate safely and more smoothly. We approach the problem of intent prediction from two different perspectives and anticipate the intention-to-cross within both pedestrian-centric and location-centric scenarios. In addition, we introduce a new dataset designed specifically for autonomous-driving scenarios in areas with dense pedestrian populations: the Stanford-TRI Intent Prediction (STIP) dataset. Our experiments on STIP and another benchmark dataset show that our graph modeling framework is able to predict the intention-to-cross of the pedestrians with an accuracy of 79.10% on STIP and 79.28% on the Joint Attention for Autonomous Driving (JAAD) dataset, up to one second earlier than when the actual crossing happens. These results outperform the baseline and previous work. Please refer to this http URL for the dataset and code.
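To make the idea of graph convolution over a scene graph concrete, here is a minimal sketch of one propagation step, ReLU(Â·H·W), in plain Python. It is not the authors' model: the three-node scene, the feature vectors, the identity weight matrix, and the row-normalized adjacency with self-loops are all assumptions chosen for illustration.

```python
# Toy graph-convolution step over a tiny scene graph (illustrative only).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Add self-loops, then divide each row by its sum (row normalization)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def graph_conv(adj, features, weights):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    a_hat = normalize_adjacency(adj)
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]

# Hypothetical scene: node 0 = pedestrian, node 1 = crosswalk, node 2 = car.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
node_features = [[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]]
identity_weights = [[1.0, 0.0],
                    [0.0, 1.0]]  # assumed weights; a real layer would learn these

updated = graph_conv(adjacency, node_features, identity_weights)
```

After one step, each node's features are an average of its own features and those of its neighbors, which is how relational context (for example, a pedestrian next to a crosswalk) can enter a node's representation.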

2. Strategy to Increase the Safety of a DNN-based Perception for HAD Systems
Timo Sämann, Peter Schlicht, Fabian Hüger
Abstract: Safety is one of the most important development goals for highly automated driving (HAD) systems. This applies in particular to the perception function driven by Deep Neural Networks (DNNs). For these, large parts of the traditional safety processes and requirements are not fully applicable or sufficient. The aim of this paper is to present a framework for the description and mitigation of DNN insufficiencies and the derivation of relevant safety mechanisms to increase the safety of DNNs. To assess the effectiveness of these safety mechanisms, we present a categorization scheme for evaluation metrics.

3. Deep Learning-Based Feature Extraction in Iris Recognition: Use Existing Models, Fine-tune or Train From Scratch?
Aidan Boyd, Adam Czajka, Kevin Bowyer
Abstract: Modern deep learning techniques can be employed to generate effective feature extractors for the task of iris recognition. The question arises: should we train such structures from scratch on a relatively large iris image dataset, or is it better to fine-tune the existing models to adapt them to a new domain? In this work we explore five different sets of weights for the popular ResNet-50 architecture to find out whether iris-specific feature extractors perform better than models trained for non-iris tasks. Features are extracted from each convolutional layer and the classification accuracy achieved by a Support Vector Machine is measured on a dataset that is disjoint from the samples used in training of the ResNet-50 model. We show that the optimal training strategy is to fine-tune an off-the-shelf set of weights to the iris recognition domain. This approach results in greater accuracy than both off-the-shelf weights and a model trained from scratch. The winning, fine-tuned approach also shows an increase in performance when compared to previous work, in which only off-the-shelf (not fine-tuned) models were used in iris feature extraction. We make the best-performing ResNet-50 model, fine-tuned with more than 360,000 iris images, publicly available along with this paper.

4. Automatic Shortcut Removal for Self-Supervised Representation Learning
Matthias Minderer, Olivier Bachem, Neil Houlsby, Michael Tschannen
Abstract: In self-supervised visual representation learning, a feature extractor is trained on a "pretext task" for which labels can be generated cheaply. A central challenge in this approach is that the feature extractor quickly learns to exploit low-level visual features such as color aberrations or watermarks and then fails to learn useful semantic representations. Much work has gone into identifying such "shortcut" features and hand-designing schemes to reduce their effect. Here, we propose a general framework for removing shortcut features automatically. Our key assumption is that those features which are the first to be exploited for solving the pretext task may also be the most vulnerable to an adversary trained to make the task harder. We show that this assumption holds across common pretext tasks and datasets by training a "lens" network to make small image changes that maximally reduce performance in the pretext task. Representations learned with the modified images outperform those learned without in all tested cases. Additionally, the modifications made by the lens reveal how the choice of pretext task and dataset affects the features learned by self-supervision.

5. Object 6D Pose Estimation with Non-local Attention
Jianhan Mei, Henghui Ding, Xudong Jiang
Abstract: In this paper, we address the challenging task of estimating 6D object pose from a single RGB image. Motivated by deep-learning-based object detection methods, we propose a concise and efficient network that integrates 6D object pose parameter estimation into the object detection framework. Furthermore, to make the estimation more robust to occlusion, a non-local self-attention module is introduced. The experimental results show that the proposed method reaches state-of-the-art performance on the YCB-video and the Linemod datasets.

6. Roto-Translation Equivariant Convolutional Networks: Application to Histopathology Image Analysis
Maxime W. Lafarge, Erik J. Bekkers, Josien P.W. Pluim, Remco Duits, Mitko Veta
Abstract: Rotation-invariance is a desired property of machine-learning models for medical image analysis and in particular for computational pathology applications. We propose a framework to encode the geometric structure of the special Euclidean motion group SE(2) in convolutional networks to yield translation and rotation equivariance via the introduction of SE(2)-group convolution layers. This structure enables models to learn feature representations with a discretized orientation dimension that guarantees that their outputs are invariant under a discrete set of rotations. Conventional approaches for rotation invariance rely mostly on data augmentation, but this does not guarantee the robustness of the output when the input is rotated. Moreover, conventionally trained CNNs may require test-time rotation augmentation to reach their full capability. This study is focused on histopathology image analysis applications for which it is desirable that the arbitrary global orientation information of the imaged tissues is not captured by the machine learning models. The proposed framework is evaluated on three different histopathology image analysis tasks (mitosis detection, nuclei segmentation and tumor classification). We present a comparative analysis for each problem and show that a consistent increase in performance can be achieved when using the proposed framework.
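The claim that outputs can be made invariant under a discrete set of rotations can be illustrated with a much simpler construction than SE(2)-group convolution layers: pooling an orientation-sensitive response over the four 90-degree rotations of the input. The sketch below is only that simplified idea, not the paper's layers; the 2x2 kernel and patch are made-up values.

```python
# Pooling over the four 90-degree rotations yields a rotation-invariant output.

def rot90(grid):
    """Rotate a square grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def response(grid, kernel):
    """Orientation-sensitive response: element-wise product summed over the grid."""
    return sum(g * k
               for grid_row, kernel_row in zip(grid, kernel)
               for g, k in zip(grid_row, kernel_row))

def pooled_response(grid, kernel):
    """Maximum response over the four 90-degree rotations of the input."""
    responses = []
    current = grid
    for _ in range(4):
        responses.append(response(current, kernel))
        current = rot90(current)
    return max(responses)

kernel = [[1, 2], [3, 4]]      # hypothetical filter
patch = [[5, 0], [0, 1]]       # hypothetical image patch
rotated_patch = rot90(patch)

raw_before = response(patch, kernel)             # depends on orientation
raw_after = response(rotated_patch, kernel)
pooled_before = pooled_response(patch, kernel)   # does not depend on orientation
pooled_after = pooled_response(rotated_patch, kernel)
```

The raw response changes when the patch is rotated, while the pooled response does not, because rotating the input only permutes the set of four responses being pooled.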

7. A survey on Semi-, Self- and Unsupervised Techniques in Image Classification
Lars Schmarje, Monty Santarossa, Simon-Martin Schröder, Reinhard Koch
Abstract: While deep learning strategies achieve outstanding results in computer vision tasks, one issue remains. The current strategies rely heavily on a huge amount of labeled data. In many real-world problems it is not feasible to create such an amount of labeled training data. Therefore, researchers try to incorporate unlabeled data into the training process to reach equal results with fewer labels. Due to a lot of concurrent research, it is difficult to keep track of recent developments. In this survey we provide an overview of often used techniques and methods in image classification with fewer labels. We compare 21 methods. In our analysis we identify three major trends. 1. State-of-the-art methods are scalable to real-world applications based on their accuracy. 2. The degree of supervision which is needed to achieve comparable results to the usage of all labels is decreasing. 3. All methods share common techniques while only a few methods combine these techniques to achieve better performance. Based on these three trends we discover future research opportunities.

8. Photorealistic Lip Sync with Adversarial Temporal Convolutional Networks
Ruobing Zheng, Zhou Zhu, Bo Song, Changjiang Ji
Abstract: Lip sync has emerged as a promising technique to generate mouth movements on a talking head. However, synthesizing a clear, accurate and human-like performance is still challenging. In this paper, we present a novel lip-sync solution for producing a high-quality and photorealistic talking head from speech. We focus on capturing the specific lip movement and talking style of the target person. We model the seq-to-seq mapping from audio signals to mouth features by two adversarial temporal convolutional networks. Experiments show our model outperforms traditional RNN-based baselines in both accuracy and speed. We also propose an image-to-image translation-based approach for generating high-resolution photoreal face appearance from synthetic facial maps. This fully-trainable framework not only avoids the cumbersome steps like candidate-frame selection in graphics-based rendering methods but also solves some existing issues in recent neural network-based solutions. Our work will benefit related applications such as conversational agent, virtual anchor, tele-presence and gaming.

9. Bi-directional Dermoscopic Feature Learning and Multi-scale Consistent Decision Fusion for Skin Lesion Segmentation
Xiaohong Wang, Xudong Jiang, Henghui Ding, Jun Liu
Abstract: Accurate segmentation of skin lesion from dermoscopic images is a crucial part of computer-aided diagnosis of melanoma. It is challenging due to the fact that dermoscopic images from different patients have non-negligible lesion variation, which causes difficulties in anatomical structure learning and consistent skin lesion delineation. In this paper, we propose a novel bi-directional dermoscopic feature learning (biDFL) framework to model the complex correlation between skin lesions and their informative context. By controlling feature information passing through two complementary directions, a substantially rich and discriminative feature representation is achieved. Specifically, we place biDFL module on the top of a CNN network to enhance high-level parsing performance. Furthermore, we propose a multi-scale consistent decision fusion (mCDF) that is capable of selectively focusing on the informative decisions generated from multiple classification layers. By analysis of the consistency of the decision at each position, mCDF automatically adjusts the reliability of decisions and thus allows a more insightful skin lesion delineation. The comprehensive experimental results show the effectiveness of the proposed method on skin lesion segmentation, achieving state-of-the-art performance consistently on two publicly available dermoscopic image databases.

10. Neural Network Compression Framework for fast model inference
Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, Yury Gorbachev
Abstract: In this work we present a new framework for neural network compression with fine-tuning, which we call Neural Network Compression Framework (NNCF). It leverages recent advances of various network compression methods and implements some of them, such as sparsity, quantization, and binarization. These methods allow getting more hardware-friendly models which can be efficiently run on general-purpose hardware computation units (CPU, GPU) or special Deep Learning accelerators. We show that the developed methods can be successfully applied to a wide range of models to accelerate the inference time while keeping the original accuracy. The framework can be used within the training samples, which are supplied with it, or as a standalone package that can be seamlessly integrated into the existing training code with minimal adaptations. Currently, a PyTorch version of NNCF is available as a part of OpenVINO Training Extensions at this https URL

11. Stroke Constrained Attention Network for Online Handwritten Mathematical Expression Recognition
Jiaming Wang, Jun Du, Jianshu Zhang
Abstract: In this paper, we propose a novel stroke constrained attention network (SCAN) which treats the stroke as the basic unit for encoder-decoder based online handwritten mathematical expression recognition (HMER). Unlike previous methods which use trace points or image pixels as basic units, SCAN makes full use of stroke-level information for better alignment and representation. The proposed SCAN can be adopted in both single-modal (online or offline) and multi-modal HMER. For single-modal HMER, SCAN first employs a CNN-GRU encoder to extract point-level features from input traces in online mode and employs a CNN encoder to extract pixel-level features from input images in offline mode, then uses stroke constrained information to convert them into online and offline stroke-level features. Using stroke-level features can explicitly group points or pixels belonging to the same stroke, and therefore reduces the difficulty of symbol segmentation and recognition via the decoder with attention mechanism. For multi-modal HMER, other than fusing multi-modal information in the decoder, SCAN can also fuse multi-modal information in the encoder by utilizing the stroke based alignments between online and offline modalities. The encoder fusion is a better way of combining multi-modal information as it implements the information interaction one step before the decoder fusion so that the advantages of multiple modalities can be exploited earlier and more adequately when training the encoder-decoder model. Evaluated on a benchmark published by the CROHME competition, the proposed SCAN achieves state-of-the-art performance.

12. Focus on Semantic Consistency for Cross-domain Crowd Understanding
Tao Han, Junyu Gao, Yuan Yuan, Qi Wang
Abstract: For pixel-level crowd understanding, data collection and annotation are time-consuming and laborious. Some domain adaptation algorithms try to ease this burden by training models with synthetic data, and the results in some recent works have proved the feasibility. However, we found that a mass of estimation errors in the background areas impedes the performance of the existing methods. In this paper, we propose a domain adaptation method to eliminate them. Based on semantic consistency, that is, a similar distribution in the deep-layer features of synthetic and real-world crowd areas, we first introduce a semantic extractor to effectively distinguish crowd and background in high-level semantic information. Besides, to further enhance the adapted model, we adopt adversarial learning to align features in the semantic space. Experiments on three representative real datasets show that the proposed domain adaptation scheme achieves the state-of-the-art for cross-domain counting problems.

13. KaoKore: A Pre-modern Japanese Art Facial Expression Dataset
Yingtao Tian, Chikahiko Suzuki, Tarin Clanuwat, Mikel Bober-Irizar, Alex Lamb, Asanobu Kitamoto
Abstract: From classifying handwritten digits to generating strings of text, the datasets which have received long-time focus from the machine learning community vary greatly in their subject matter. This has motivated a renewed interest in building datasets which are socially and culturally relevant, so that algorithmic research may have a more direct and immediate impact on society. One such area is history and the humanities, where better and relevant machine learning models can accelerate research across various fields. To this end, newly released benchmarks and models have been proposed for transcribing historical Japanese cursive writing, yet the use of machine learning for historical Japanese artworks as a whole remains largely uncharted. To bridge this gap, in this work we propose a new dataset KaoKore which consists of faces extracted from pre-modern Japanese artwork. We demonstrate its value as both a dataset for image classification as well as a creative and artistic dataset, which we explore using generative models. Dataset available at this https URL

14. Captioning Images Taken by People Who Are Blind
Danna Gurari, Yinan Zhao, Meng Zhang, Nilavra Bhattacharya
Abstract: While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly-available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind that are each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly-share the dataset with captioning challenge instructions at this https URL

15. Learning Object Scale With Click Supervision for Object Detection
Liao Zhang, Yan Yan, Lin Cheng, Hanzi Wang
Abstract: Weakly-supervised object detection has recently attracted increasing attention since it only requires image-level annotations. However, the performance obtained by existing methods is still far from being satisfactory compared with fully-supervised object detection methods. To achieve a good trade-off between annotation cost and object detection performance, we propose a simple yet effective method which incorporates CNN visualization with click supervision to generate the pseudo ground-truths (i.e., bounding boxes). These pseudo ground-truths can be used to train a fully-supervised detector. To estimate the object scale, we firstly adopt a proposal selection algorithm to preserve high-quality proposals, and then generate Class Activation Maps (CAMs) for these preserved proposals by the proposed CNN visualization algorithm called Spatial Attention CAM. Finally, we fuse these CAMs together to generate pseudo ground-truths and train a fully-supervised object detector with these ground-truths. Experimental results on the PASCAL VOC 2007 and VOC 2012 datasets show that the proposed method can obtain much higher accuracy for estimating the object scale, compared with the state-of-the-art image-level based methods and the center-click based method.

16. Fast and Regularized Reconstruction of Building Façades from Street-View Images using Binary Integer Programming
Han Hu, Libin Wang, Yulin Ding, Qing Zhu
Abstract: Regularized arrangement of primitives on building façades to aligned locations and consistent sizes is important for structured reconstruction of urban environments. Mixed integer linear programming has been used to solve the problem; however, it is extremely time-consuming even for state-of-the-art commercial solvers. Aiming to alleviate this issue, we cast the problem into binary integer programming, which omits the requirements for real-valued parameters and is more efficient to solve. Firstly, the bounding boxes of the primitives are detected using the YOLOv3 architecture in real-time. Secondly, the coordinates of the upper left corners and the sizes of the bounding boxes are automatically clustered in a binary integer programming optimization, which jointly considers the geometric fitness, regularity and additional constraints; this step does not require a priori knowledge, such as the number of clusters or pre-defined grammars. Finally, the regularized bounding boxes can be directly used to guide the façade reconstruction in an interactive environment. Experimental evaluations have revealed that the accuracies for the extraction of primitives are above 0.85, which is sufficient for the following 3D reconstruction. The proposed approach takes only about 10% to 20% of the runtime of the previous approach and reduces the diversity of the bounding boxes to about 20% to 50%.
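The flavor of clustering coordinates with 0/1 decision variables can be shown on a toy problem. The sketch below is an assumption-laden illustration, not the paper's formulation: each candidate position gets a binary "keep" variable, every coordinate snaps to its nearest kept candidate, and the cost is the total snapping distance plus a hypothetical `penalty` per kept candidate. A tiny instance is solved by brute-force enumeration instead of a real integer programming solver.

```python
from itertools import product

def regularize(values, penalty):
    """Brute-force a tiny 0/1 program: choose which candidate positions to keep."""
    candidates = sorted(set(values))
    best_cost, best_centers = None, None
    # Enumerate every assignment of the binary "keep" variables.
    for selection in product([0, 1], repeat=len(candidates)):
        centers = [c for c, keep in zip(candidates, selection) if keep]
        if not centers:
            continue  # every value needs at least one center to snap to
        # Geometric fitness: distance from each value to its nearest kept center.
        fit = sum(min(abs(v - c) for c in centers) for v in values)
        # Regularity: fewer distinct positions is better.
        cost = fit + penalty * len(centers)
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    snapped = [min(best_centers, key=lambda c: abs(v - c)) for v in values]
    return best_centers, snapped, best_cost

# Hypothetical x-coordinates of window bounding boxes on a façade.
x_coords = [10, 10, 11, 29, 30, 31]
centers, snapped, cost = regularize(x_coords, penalty=3)
```

Here the six noisy coordinates collapse onto two aligned positions. Enumeration grows exponentially with the number of candidates, so anything beyond a toy instance would need a proper solver.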

17. Do We Really Need to Access the Source Data? Source Hypothesis Transfer for Unsupervised Domain Adaptation
Jian Liang, Dapeng Hu, Jiashi Feng
Abstract: Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned from a labeled source dataset to solve similar tasks in a new unlabeled domain. Prior UDA methods typically require access to the source data when learning to adapt the model, making them risky and inefficient for decentralized private data. In this work we tackle a novel setting where only a trained source model is available and investigate how we can effectively utilize such a model without source data to solve UDA problems. To this end, we propose a simple yet generic representation learning framework, named Source HypOthesis Transfer (SHOT). Specifically, SHOT freezes the classifier module (hypothesis) of the source model and learns the target-specific feature extraction module by exploiting both information maximization and self-supervised pseudo-labeling to implicitly align representations from the target domains to the source hypothesis. In this way, the learned target model can directly predict the labels of target data. We further investigate several techniques to refine the network architecture to parameterize the source model for better transfer performance. To verify its versatility, we evaluate SHOT in a variety of adaptation cases including closed-set, partial-set, and open-set domain adaptation. Experiments indicate that SHOT yields state-of-the-art results among multiple domain adaptation benchmarks.
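The information maximization objective mentioned above can be sketched in a few lines. The version below uses a common form of the idea, assumed here rather than taken from the abstract: minimize the average entropy of each prediction (confident outputs) while maximizing the entropy of the batch-averaged prediction (diverse outputs). The function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def information_maximization_loss(batch_logits):
    """Mean per-sample entropy minus entropy of the mean prediction."""
    probs = [softmax(logits) for logits in batch_logits]
    n = len(probs)
    mean_entropy = sum(entropy(p) for p in probs) / n
    num_classes = len(probs[0])
    mean_probs = [sum(p[k] for p in probs) / n for k in range(num_classes)]
    return mean_entropy - entropy(mean_probs)

# Hypothetical two-class batches of unlabeled target predictions.
confident_and_diverse = [[10.0, 0.0], [0.0, 10.0]]
confident_but_collapsed = [[10.0, 0.0], [10.0, 0.0]]
uncertain = [[0.0, 0.0], [0.0, 0.0]]

loss_diverse = information_maximization_loss(confident_and_diverse)
loss_collapsed = information_maximization_loss(confident_but_collapsed)
loss_uncertain = information_maximization_loss(uncertain)
```

The loss is lowest when predictions are both confident and spread across classes; predicting one class for everything or staying uniformly uncertain both score worse, which is why no target labels are needed to compute it.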

18. Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching
Tianlang Chen, Jiebo Luo
Abstract: Existing image-text matching approaches typically infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image. However, they ignore the connections between the objects that are semantically related. These objects may collectively determine whether the image corresponds to a text or not. To address this problem, we propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN). In particular, given an input image-text pair, our model reorders the image objects based on the positions of their most related words in the text. In the same way as extracting the hidden features from word embeddings, the model leverages RNN to extract high-level object features from the reordered object inputs. We validate that the high-level object features contain useful joint information of semantically related objects, which benefit the retrieval task. To compute the image-text similarity, we incorporate a Multi-attention Cross Matching Model into DP-RNN. It aggregates the affinity between objects and words with cross-modality guided attention and self-attention. Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset. Extensive experiments demonstrate the effectiveness of our model.

19. Modelling response to trypophobia trigger using intermediate layers of ImageNet networks
Piotr Woźnicki, Michał Kuźba, Piotr Migdał
Abstract: In this paper, we approach the problem of detecting trypophobia triggers using convolutional neural networks. We show that standard architectures such as VGG or ResNet are capable of recognizing trypophobia patterns. We also conduct experiments to analyze the nature of this phenomenon. To do that, we dissect the network, decreasing the number of its layers and parameters. We show that even significantly reduced networks have accuracy above 91% and focus their attention on the trypophobia patterns, as presented in the visual explanations.

20. Revisiting Training Strategies and Generalization Performance in Deep Metric Learning
Karsten Roth, Timo Milbich, Samarth Sinha, Prateek Gupta, Bjoern Ommer, Joseph Paul Cohen
Abstract: Deep Metric Learning (DML) is arguably one of the most influential lines of research for learning visual similarities, with many approaches proposed every year. Although the field benefits from the rapid progress, the divergence in training protocols, architectures, and parameter choices makes an unbiased comparison difficult. To provide a consistent reference point, we revisit the most widely used DML objective functions and conduct a study of the crucial parameter choices as well as the commonly neglected mini-batch sampling process. Based on our analysis, we uncover a correlation between the embedding space compression and the generalization performance of DML models. Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets.
摘要：深度量学习（DML）可以说是研究最有影响力的线学习视觉相似之处很多提出一个每年接近。虽然从快速进步领域的好处，在培训协议，架构和参数选择的分歧做出公正的比较困难。为了提供一个一致的参考点，我们重温最广泛使用的DML目标函数和进行关键参数选择的研究，以及常用忽视小批量采样过程。根据我们的分析，我们发现嵌入空间压缩和DML模型的泛化性能之间的相关性。利用这些见解，我们提出了一个简单而有效的训练正规化可靠地提高排名为基础的DML模型对各种标准的基准数据集的性能。

21. SD-GAN: Structural and Denoising GAN reveals facial parts under occlusion
Samik Banerjee, Sukhendu Das
Abstract: Certain facial parts are salient (unique) in appearance and contribute substantially to the holistic recognition of a subject. Occlusion of these salient parts deteriorates the performance of face recognition algorithms. In this paper, we propose a generative model to reconstruct the missing parts of the face which are under occlusion. The proposed generative model (SD-GAN) reconstructs a face while preserving the illumination variation and identity of the face. A novel adversarial training algorithm has been designed for a bimodal mutually exclusive Generative Adversarial Network (GAN) model, for faster convergence. A novel adversarial "structural" loss function is also proposed, comprising two components: a holistic and a local loss, characterized by SSIM and patch-wise MSE. Ablation studies on real and synthetically occluded face datasets reveal that our proposed technique outperforms the competing methods by a considerable margin, even for boosting the performance of Face Recognition.
摘要：某些面部部分在外观上显着的（唯一的），其基本上与整体识别的对象的贡献。这些突出的部分阻塞恶化的人脸识别算法的性能。在本文中，我们提出了一个生成模型重建它们遮挡情况下，面对残缺的部分。所提出的生成模型（SD-GAN）重构面保留面的照明变化和身份。一种新的对抗训练算法已被设计为双峰互斥剖成对抗性网络（GAN）的模型，更快的收敛。一种新的对抗式“结构的”损失函数还提出，其包括两个部分：一个整体和局部损失，其特征在于，SSIM和patch-明智MSE。真实和综合遮挡脸部数据集消融的研究表明，我们所提出的技术通过一个相当幅度优于竞争的方法，即使是提高人脸识别的性能。

22. Cooperative LIDAR Object Detection via Feature Sharing in Deep Networks
Ehsan Emad Marvasti, Arash Raftari, Amir Emad Marvasti, Yaser P. Fallah, Rui Guo, HongSheng Lu
Abstract: The recent advancements in communication and computational systems have led to significant improvement of situational awareness in connected and autonomous vehicles. Computationally efficient neural networks and high speed wireless vehicular networks have been some of the main contributors to this improvement. However, scalability and reliability issues caused by inherent limitations of sensory and communication systems are still challenging problems. In this paper, we aim to mitigate the effects of these limitations by introducing the concept of feature sharing for cooperative object detection (FS-COD). In our proposed approach, a better understanding of the environment is achieved by sharing partially processed data between cooperative vehicles while maintaining a balance between computation and communication load. This approach is different from current methods of map sharing, or sharing of raw data, which are not scalable. The performance of the proposed approach is verified through experiments on the Volony dataset. It is shown that the proposed approach has significant performance superiority over the conventional single-vehicle object detection approaches.
摘要：在通信和计算系统中的最新进展已经导致在连接和自动驾驶车辆的态势感知显著的改善。计算效率的神经网络和高速无线车载网络中已经有了一些主要贡献者的这一改进。然而，因感觉和通信系统的固有局限性可扩展性和可靠性等问题仍然具有挑战性的问题。在本文中，我们的目标是通过引入的特征共享的概念用于协作的对象检测（FS-COD），以减轻这些限制的影响。在我们提出的方法，更好地理解环境是由协作车辆之间共享部分处理的数据，同时保持计算和通信负载之间的平衡来实现的。这种方法是从地图电流共享的方法，或原始数据的共享其是不可扩展的不同。该方法的性能是通过对Volony数据集实验验证。结果表明，所提出的方法具有优于常规单车辆的物体检测方法显著性能的优越性。

23. Residual-Sparse Fuzzy $C$-Means Clustering Incorporating Morphological Reconstruction and Wavelet frames
Cong Wang, Witold Pedrycz, ZhiWu Li, MengChu Zhou, Jun Zhao
Abstract: Instead of directly utilizing an observed image including some outliers, noise or intensity inhomogeneity, the use of its ideal value (e.g. noise-free image) has a favorable impact on clustering. Hence, the accurate estimation of the residual (e.g. unknown noise) between the observed image and its ideal value is an important task. To do so, we propose an $\ell_0$ regularization-based Fuzzy $C$-Means (FCM) algorithm incorporating a morphological reconstruction operation and a tight wavelet frame transform. To achieve a sound trade-off between detail preservation and noise suppression, morphological reconstruction is used to filter an observed image. By combining the observed and filtered images, a weighted sum image is generated. Since a tight wavelet frame system has sparse representations of an image, it is employed to decompose the weighted sum image, thus forming its corresponding feature set. Taking it as data for clustering, we present an improved FCM algorithm by imposing an $\ell_0$ regularization term on the residual between the feature set and its ideal value, which implies that the favorable estimation of the residual is obtained and the ideal value participates in clustering. Spatial information is also introduced into clustering since it is naturally encountered in image segmentation. Furthermore, it makes the estimation of the residual more reliable. To further enhance the segmentation effects of the improved FCM algorithm, we also employ the morphological reconstruction to smoothen the labels generated by clustering. Finally, based on the prototypes and smoothed labels, the segmented image is reconstructed by using a tight wavelet frame reconstruction operation. Experimental results reported for synthetic, medical, and color images show that the proposed algorithm is effective and efficient, and outperforms other algorithms.
摘要：而不是直接利用的观察图像包括一些离群值，噪声或强度不均匀性的，利用它的理想值（例如无噪声图像）的对聚类产生有利的影响。因此，残留的（例如未知噪声）的观察图像和它的理想值之间的准确的估计是一项重要任务。要做到这一点，我们提出了一个基于正则$ \ $ ell_0模糊$ C $ -Means（FCM）结合形态学重建操作算法和从紧的小波变换框架。为了实现细节保留和噪声抑制之间的声音权衡，形态重构用于过滤的观察图像。通过组合所观察到的和滤波后的图像，将生成的加权和图像。由于小波紧框架系统具有图像的稀疏表示，它被用来分解加权和图像，从而形成其相应的功能集。以它作为聚类的数据，我们通过对特征组和它的理想值，这意味着得到的残余的有利估计和理想值参与之间的残余强加一个$ \ ell_0 $正则化项呈现的改进的FCM算法在集群。空间信息也被引入到集群，因为它在图像分割自然遇到。此外，它使剩余的估计更加可靠。为了进一步增强改进的FCM算法的分割的影响，我们还采用了形态重构以平滑由聚类生成的标签。最后，基于该原型和平滑标签，所述分割的图像，通过使用小波紧帧重构操作重建。实验结果报道为合成的，医疗，和彩色图像显示，该算法是有效和高效率，并且优于其它算法。
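
The abstract above builds on Fuzzy $C$-Means (FCM). As a point of reference, the following is a minimal sketch of the standard FCM iteration on one-dimensional values (for example, gray levels). It is not the residual-sparse algorithm of the paper: the $\ell_0$ regularization, morphological reconstruction, and wavelet frames are all omitted, and the function names, the deterministic initialization, and the default fuzzifier `m=2.0` are assumptions made for illustration.

```python
def _memberships(points, centers, m, eps):
    """Standard FCM membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [abs(x - c) + eps for c in centers]  # eps avoids division by zero
        rows.append([1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists])
    return rows


def fcm_1d(points, n_clusters=2, m=2.0, n_iter=50, eps=1e-9):
    """Plain fuzzy c-means on scalar data; returns (centers, memberships)."""
    lo, hi = min(points), max(points)
    # Hypothetical deterministic initialization: spread centers over the data range.
    centers = [lo + (hi - lo) * (k + 0.5) / n_clusters for k in range(n_clusters)]
    for _ in range(n_iter):
        u = _memberships(points, centers, m, eps)
        # Center update: mean of the points weighted by membership^m.
        centers = [
            sum((row[k] ** m) * x for row, x in zip(u, points))
            / sum(row[k] ** m for row in u)
            for k in range(n_clusters)
        ]
    # Recompute memberships so that they match the final centers.
    return centers, _memberships(points, centers, m, eps)
```

Each row of the returned membership matrix sums to one, which is the soft-assignment property that the paper's regularized variant also relies on.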

24. Table-Top Scene Analysis Using Knowledge-Supervised MCMC
Ziyuan Liu, Dong Chen, Kai M. Wurm, Georg von Wichert
Abstract: In this paper, we propose a probabilistic method to generate abstract scene graphs for table-top scenes from 6D object pose estimates. We explicitly make use of task-specific context knowledge by encoding this knowledge as descriptive rules in Markov logic networks. Our approach to generating scene graphs is probabilistic: uncertainty in the object poses is addressed by a probabilistic sensor model that is embedded in a data-driven MCMC process. We apply Markov logic inference to reason about hidden objects and to detect false estimates of object poses. The effectiveness of our approach is demonstrated and evaluated in real-world experiments.
摘要：在本文中，我们提出了一个概率的方法来产生从6D对象姿势估计桌面场景抽象的场景图。通过编码这方面的知识，如马氏逻辑网络中描述的规则，我们明确地利用任务specfic背景知识。我们以生成场景图的方法是概率性：不确定性中的对象的姿势是由嵌入在数据驱动MCMC处理的概率测量模型解决。我们应用马氏逻辑推理推理隐藏的对象，并检测对象姿势的错误估计。我们的方法的有效性证明，并在现实世界中的实验评估。

25. A Generalizable Knowledge Framework for Semantic Indoor Mapping Based on Markov Logic Networks and Data Driven MCMC
Ziyuan Liu, Georg von Wichert
Abstract: In this paper, we propose a generalizable knowledge framework for data abstraction, i.e. finding a compact abstract model for input data using predefined abstract terms. Based on these abstract terms, intelligent autonomous systems, such as a robot, should be able to make inferences according to a specific knowledge base, so that they can better handle the complexity and uncertainty of the real world. We propose to realize this framework by combining Markov logic networks (MLNs) and data-driven MCMC sampling, because the former are a powerful tool for modelling uncertain knowledge and the latter provides an efficient way to draw samples from unknown complex distributions. Furthermore, we show in detail how to adapt this framework to a certain task, in particular, semantic robot mapping. Based on MLNs, we formulate task-specific context knowledge as descriptive soft rules. Experiments on real-world data and simulated data confirm the usefulness of our framework.
摘要：在本文中，我们提出了一个概括性的知识框架，数据抽象，即发现使用预定义抽象的术语输入数据紧凑抽象模型。基于这些抽象的术语，智能自治系统，如机器人，应该能够根据具体的知识基础，使推理，使他们能更好地处理现实世界的复杂性和不确定性。我们建议结合马尔科夫逻辑网络（MLNS）和驱动MCMC采样数据，实现这个框架，因为前者是不确定的知识建模的有力工具，而后者提供了一种有效的方式来吸取未知复杂分布样本。此外，我们还详细显示了如何将这个框架适应特定的任务，特别是语义机器人映射。基于MLNS，我们制定任务的特定背景知识，描述性的软规则。对现实世界的数据的实验和模拟数据证实了我们的框架的有效性。

26. JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset
Abhijeet Shenoi, Mihir Patel, JunYoung Gwak, Patrick Goebel, Amir Sadeghian, Hamid Rezatofighi, Roberto Martin-Martin, Silvio Savarese
Abstract: An autonomous navigating agent needs to perceive and track the motion of objects and other agents in its surroundings to achieve robust and safe motion planning and execution. While autonomous navigation requires a multi-object tracking (MOT) system to provide 3D information, most research has been done in 2D MOT from RGB videos. In this work we present JRMOT, a novel 3D MOT system that integrates information from 2D RGB images and 3D point clouds into a real-time performing framework. Our system leverages advancements in neural-network based re-identification as well as 2D and 3D detection and descriptors. We incorporate this into a joint probabilistic data-association framework within a multi-modal recursive Kalman architecture to achieve online, real-time 3D MOT. As part of our work, we release the JRDB dataset, a novel large scale 2D+3D dataset and benchmark annotated with over 2 million boxes and 3500 time consistent 2D+3D trajectories across 54 indoor and outdoor scenes. The dataset contains over 60 minutes of data including 360 degree cylindrical RGB video and 3D pointclouds. The presented 3D MOT system demonstrates state-of-the-art performance against competing methods on the popular 2D tracking KITTI benchmark and serves as a competitive 3D tracking baseline for our dataset and benchmark.
摘要：自主导航代理需要感知和跟踪对象和其他代理的运动在它的周围，实现稳健和安全的运动规划和执行。虽然自主导航需要一个多目标追踪（MOT）系统，提供3D信息，大多数的研究已经从RGB视频2D MOT完成。在这项工作中，我们目前JRMOT，一种新颖的3D MOT系统，从2D RGB图像和三维点云信息集成到一个实时的执行框架。我们的系统利用神经网络基于重新鉴定的进展，以及2D和3D检测和描述符。我们的多模态递归卡尔曼架构中整合到这个联合概率数据关联框架来实现在线，实时3D MOT。作为我们工作的一部分，我们发布JRDB数据集，一个新的大型2D + 3D数据集和基准超过200万箱和整个54个室内和室外场景3500个时间一致的2D + 3D轨迹进行注释。该数据集包含在60分钟内的数据，包括360度圆柱RGB视频和3D点云。所提出的3D MOT系统证明了对抗上流行的二维跟踪KITTI基准竞争法国家的最先进的性能，并作为我们的数据集和基准具有竞争力的3D跟踪基准。

27. MonoLayout: Amodal scene layout from a single image
Kaustubh Mani, Swapnil Daga, Shubhika Garg, N. Sai Shankar, Krishna Murthy Jatavallabhula, K. Madhava Krishna
Abstract: In this paper, we address the novel, highly challenging problem of estimating the layout of a complex urban driving scenario. Given a single color image captured from a driving platform, we aim to predict the bird's-eye view layout of the road and other traffic participants. The estimated layout should reason beyond what is visible in the image, and compensate for the loss of 3D information due to projection. We dub this problem amodal scene layout estimation, which involves "hallucinating" scene layout even for parts of the world that are occluded in the image. To this end, we present MonoLayout, a deep neural network for real-time amodal scene layout estimation from a single image. We represent scene layout as a multi-channel semantic occupancy grid, and leverage adversarial feature learning to hallucinate plausible completions for occluded image parts. Due to the lack of fair baseline methods, we extend several state-of-the-art approaches for road-layout estimation and vehicle occupancy estimation in bird's-eye view to the amodal setup for rigorous evaluation. By leveraging temporal sensor fusion to generate training labels, we significantly outperform current art over a number of datasets. On the KITTI and Argoverse datasets, we outperform all baselines by a significant margin. We also make all our annotations and code publicly available. A video abstract of this paper is available at this https URL.
摘要：在本文中，我们解决估计复杂的城市驾驶情形的布局新颖，极具挑战性的问题。鉴于从驾驶平台捕获的单色图像时，我们的目标是预测道路和其他交通参与者的鸟瞰视图布局。估计布局应理性超出了图像中可见，并赔偿3D信息，由于投影损失。我们配音这个问题amodal场景布置估计，其中涉及对于在图像中闭塞的世界，甚至部分“幻觉”场景布置。为此，我们提出MonoLayout，从单一的图像实时amodal场景布置估计的深层神经网络。我们代表现场布置的多通道语义网格占用，并充分利用对抗性特征的学习产生幻觉的遮挡图像部分合理的完成。由于缺乏公平基线方法，我们延长几个国家的最先进的在鸟瞰到amodal设置了严格的评估道路布局估计和车辆占用估计方法。通过利用时间传感器融合以生成训练标签，我们显著优于在数数据集的现有技术。在KITTI和Argoverse数据集，我们通过一个显著保证金胜过所有的基线。我们也使我们的所有注释和代码公开。本文的视频摘要可用此HTTPS URL。

28. SynFi: Automatic Synthetic Fingerprint Generation
M. Sadegh Riazi, Seyed M. Chavoshian, Farinaz Koushanfar
Abstract: Authentication and identification methods based on human fingerprints are ubiquitous in several systems ranging from government organizations to consumer products. The performance and reliability of such systems directly rely on the volume of data on which they have been verified. Unfortunately, a large volume of fingerprint databases is not publicly available due to many privacy and security concerns. In this paper, we introduce a new approach to automatically generate high-fidelity synthetic fingerprints at scale. Our approach relies on (i) Generative Adversarial Networks to estimate the probability distribution of human fingerprints and (ii) Super-Resolution methods to synthesize fine-grained textures. We rigorously test our system and show that our methodology is the first to generate fingerprints that are computationally indistinguishable from real ones, a task that prior art could not accomplish.
摘要：基于人的指纹身份验证和识别方法在几个系统，从政府机构到消费者的产品无处不在。这种系统的性能和可靠性直接依赖于它们已被验证的数据量上。不幸的是，大容量指纹数据库是不公开的，由于很多隐私和安全问题。在本文中，我们介绍一种新的方法来大规模自动生成高保真合成指纹。我们的方法依赖于（I）剖成对抗性网络对人类的指纹及（ii）超分辨率方法的概率分布估计合成细粒度纹理。我们严格的测试我们的系统，并表明我们的方法是首先生成从以假乱真，任务是现有技术无法实现的计算区分指纹。

29. A Bayes-Optimal View on Adversarial Examples
Eitan Richardson, Yair Weiss
Abstract: The ability to fool modern CNN classifiers with tiny perturbations of the input has led to the development of a large number of candidate defenses and often conflicting explanations. In this paper, we argue for examining adversarial examples from the perspective of Bayes-Optimal classification. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and derive analytic conditions on the distributions so that the optimal classifier is either robust or vulnerable. By training different classifiers on these datasets (for which the "gold standard" optimal classifiers are known), we can disentangle the possible sources of vulnerability and avoid the accuracy-robustness tradeoff that may occur in commonly used datasets. Our results show that even when the optimal classifier is robust, standard CNN training consistently learns a vulnerable classifier. At the same time, for exactly the same training data, RBF SVMs consistently learn a robust classifier. The same trend is observed in experiments with real images.
摘要：愚弄现代CNN分类与输入的微小扰动导致了大量候选防御，并经常互相冲突的解释发展的能力。在本文中，我们认为从贝叶斯最优分类的角度审查对抗的例子。我们构建的量，贝叶斯分类器优化可以有效地计算并导出上分布的分析条件，使得最优分类器或者是健壮或脆弱的逼真的图像数据集。通过对这些数据集（对于其中的“金标准”最佳分类器是公知的）训练不同的分类器，我们可以解开漏洞的可能来源，并避免可能出现在通常使用的数据集，该精度鲁棒性的权衡。我们的研究结果显示，即使在最佳的分类器是强大的，标准的CNN一贯的培训学习脆弱的分类。与此同时，对于完全相同的训练数据，RBF支持向量机一贯学习强大的分类。同样的趋势在真实图像实验中观察到。

30. Deep Learning Estimation of Multi-Tissue Constrained Spherical Deconvolution with Limited Single Shell DW-MRI
Vishwesh Nath, Sudhir K. Pathak, Kurt G. Schilling, Walt Schneider, Bennett A. Landman
Abstract: Diffusion-weighted magnetic resonance imaging (DW-MRI) is the only non-invasive approach for estimation of intra-voxel tissue microarchitecture and reconstruction of in vivo neural pathways for the human brain. With improvement in accelerated MRI acquisition technologies, DW-MRI protocols that make use of multiple levels of diffusion sensitization have gained popularity. A well-known advanced method for reconstruction of white matter microstructure that uses multi-shell data is multi-tissue constrained spherical deconvolution (MT-CSD). MT-CSD substantially improves the resolution of intra-voxel structure over the traditional single shell version, constrained spherical deconvolution (CSD). Herein, we explore the possibility of using deep learning on single shell data (using the b=1000 s/mm2 from the Human Connectome Project (HCP)) to estimate the information content captured by 8th order MT-CSD using the full three shell data (b=1000, 2000, and 3000 s/mm2 from HCP). Briefly, we examine two network architectures: 1.) Sequential network of fully connected dense layers with a residual block in the middle (ResDNN), 2.) Patch based convolutional neural network with a residual block (ResCNN). For both networks an additional output block for estimation of voxel fraction was used with a modified loss function. Each approach was compared against the baseline of using MT-CSD on all data on 15 subjects from the HCP divided into 5 training, 2 validation, and 8 testing subjects with a total of 6.7 million voxels. The fiber orientation distribution function (fODF) can be recovered with high correlation (0.77 vs 0.74 and 0.65) as compared to the ground truth of MT-CSD, which was derived from the multi-shell DW-MRI acquisitions. Source code and models have been made publicly available.
摘要：弥散加权磁共振成像（DW-MRI）是用于帧内体素的组织微架构的估计和对人类大脑在体内的神经通路的重建唯一的非侵入性的方法。随着加速MRI采集技术的改进，DW-MRI方案，使利用扩散敏感的多层次的获得了普及。对于白质组织重建，使用多壳数据是多组织的公知方法先进约束球形去卷积（MT-CSD）。 MT-CSD显着改善内部体素的结构相对于传统的单壳版本分辨率，约束球形去卷积（CSD）。在此，我们探讨使用单壳数据深度学习（使用从人类连接组项目（HCP）的B = 1000秒/平方毫米）来估计使用足三个壳数据由8阶MT-CSD捕获的信息内容的可能性（b = 1000，2000，和3000秒/从HCP平方毫米）。简要地说，我们检查两个网络架构：1）与在中间（ResDNN），2。）修补基于卷积神经与残余块（ResCNN）网络中的残余块完全连接致密层的顺序网络。对于这两个网络用改进的损失函数用于体素部分的估计的额外的输出块。每一种方法是使用对关于从HCP分成5训练，2验证和8名测试受试者具有总共670万点的体素15名受试者的所有数据MT-CSD的基线比较。纤维取向分布函数（fODF）可以以高的相关性（0.77 VS 0.74和0.65）相比，MT-CST，将其从多壳DW-MRI采集衍生的基础事实作为回收。源代码和模型已经被公布于众。

31. Pruning untrained neural networks: Principles and Analysis
Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, Yee Whye Teh
Abstract: Overparameterized neural networks display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient neural networks to be able to use machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained neural networks (e.g. LeCun et al. (1990) and Hassibi et al. (1993)), recent work by Lee et al. (2018) showed promising results where pruning is performed at initialization. However, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, these procedures do not prevent one layer being fully pruned. In this paper we provide a comprehensive theoretical analysis of pruning at initialization and training sparse architectures. This analysis allows us to propose novel principled approaches which we validate experimentally on a variety of network architectures. We particularly show that we can prune up to 99.9% of the weights while keeping the model trainable.
摘要：Overparameterized神经网络显示国家的本领域的性能。然而，越来越需要小型，节能，神经网络能够与计算资源有限的设备使用机器学习应用。一种流行的方法包括使用修剪技术。虽然这些技术历来集中在修剪预训练神经网络（例如LeCun等人（1990）和Hassabi等人（1993）），最近由Lee等人的工作。（2018）显示，有前途的地方修剪在初始化执行结果。然而，这样的程序仍然不能令人满意的所得到的修剪网络可以是很难培养，并且例如，这些程序并不防止层与层之间充分修剪。在本文中，我们提供的初始化，修剪和培育稀疏架构进行了全面的理论分析。这种分析使我们能够提出我们通过实验验证多种网络架构的新原则的方法。我们特别表明，我们可以修剪到权重的99.9％，同时保持模型训练的。

32. Disentangled Speech Embeddings using Cross-modal Self-supervision
Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
Abstract: The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads "in the wild", and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
摘要：本文的目的是学习的发言者身份的表示，而没有进入手动注释的数据。要做到这一点，我们开发了一个自我监督学习的目标是利用面孔和音频视频之间的天然跨模式同步。我们的做法背后的关键思想是，以梳理出---没有注释---的语言内容和演讲人的身份表示。我们构建了一个两流体系结构，其中：（1）股低级特征常见的两种表示;和（2）提供了一种自然的机制明确地解开这些因素，提供了潜在的更大的推广到的内容和身份的新组合，并最终产生扬声器身份表示是更稳健。我们培养的野生”名嘴'的大型视听数据集我们的方法，并通过评估标准说话人识别性能了解到扬声器陈述证明其疗效。

33. Bimodal Distribution Removal and Genetic Algorithm in Neural Network for Breast Cancer Diagnosis
Ke Quan
Abstract: Diagnosis of breast cancer has been well studied in the past. Multiple linear programming models have been devised to approximate the relationship between cell features and tumour malignancy. However, these models are less capable of handling non-linear correlations. Neural networks, by contrast, are powerful in processing complex non-linear correlations. It is thus beneficial to approach this cancer diagnosis problem with a model based on a neural network. In particular, introducing bias into the neural network training process is deemed an important means of increasing training efficiency. Among a number of popular methods proposed for introducing artificial bias, Bimodal Distribution Removal (BDR) offers ideal efficiency improvements and is fairly simple to implement. However, this paper examines the effectiveness of BDR on the target cancer diagnosis classification problem and shows that the BDR process in fact negatively impacts classification performance. In addition, this paper explores the genetic algorithm as an efficient tool for feature selection, which produces significantly better results compared to a baseline model without any feature selection in place.
摘要：乳腺癌的诊断已经得到很好的研究中来。多个线性规划模型已经被设计来近似细胞特征和肿瘤恶性之间的关系。然而，这些模型是在处理非线性关系能力较差。神经网络代替是在处理复杂的非线性相关性的强大。因此肯定有益基于神经网络模型来处理这个癌症诊断问题。特别地，引入偏压神经网络的训练过程被视为提高训练效率的重要手段。出了一些在执行引入人工偏置，双峰分布去除（BDR）呈现理想的效率提高的结果，公平简单通俗提出的方法。然而，本文考察BDR的针对目标癌症诊断分类问题，并说明BDR过程其实负面影响分类性能的有效性。此外，本文还探讨了遗传算法作为特征选择的有效工具，并产生显著更好的效果比较基准模型，没有采取任何适当的特征选择

34. An empirical study of Conv-TasNet
Berkan Kadioglu, Michael Horgan, Xiaoyu Liu, Jordi Pons, Dan Darcy, Vivek Kumar
Abstract: Conv-TasNet is a recently proposed waveform-based deep neural network that achieves state-of-the-art performance in speech source separation. Its architecture consists of a learnable encoder/decoder and a separator that operates on top of this learned space. Various improvements have been proposed to Conv-TasNet. However, they mostly focus on the separator, leaving its encoder/decoder as a (shallow) linear operator. In this paper, we conduct an empirical study of Conv-TasNet and propose an enhancement to the encoder/decoder that is based on a (deep) non-linear variant of it. In addition, we experiment with the larger and more diverse LibriTTS dataset and investigate the generalization capabilities of the studied models when trained on a much larger dataset. We propose cross-dataset evaluation that includes assessing separations from the WSJ0-2mix, LibriTTS and VCTK databases. Our results show that enhancements to the encoder/decoder can improve average SI-SNR performance by more than 1 dB. Furthermore, we offer insights into the generalization capabilities of Conv-TasNet and the potential value of improvements to the encoder/decoder.
摘要：转化率，TasNet是实现语音源分离状态的最先进的性能最近提出的基于波形的深层神经网络。它的架构是由一个可以学习的编码器/解码器，并且在这里学到的空间的顶部运行的分隔符。各种改进已经提出了转化率，TasNet。然而，它们主要集中在分离器，留下其编码器/解码器作为（浅）线性算子。在本文中，我们进行转化率-TasNet的实证研究提出的增强编码器/解码器是基于它的一个（深）的非线性变体。此外，我们尝试用更大和更多样化LibriTTS在更大的数据集训练的时候数据集和调查研究模型的推广能力。我们提出的跨数据集的评估，其中包括从WSJ0-2mix，LibriTTS和VCTK数据库评估分离。我们的研究结果表明，增强了编码器/解码器可以提高超过1dB平均SI-SNR性能。此外，我们还提供见解转化率，TasNet和改进的编码器/解码器的潜在价值的概括能力。
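
The abstract above reports gains in SI-SNR (scale-invariant signal-to-noise ratio). As a reminder of what that metric measures, here is a minimal sketch using the commonly used definition: both signals are made zero-mean, the estimate is projected onto the target, and the energy of the projection is compared with the energy of the remainder. The function name and the `eps` stabilizer are assumptions for illustration; this is not code from the paper.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)
    e_mean = sum(estimate) / n
    t_mean = sum(target) / n
    e = [x - e_mean for x in estimate]  # zero-mean estimate
    t = [x - t_mean for x in target]    # zero-mean target
    # Project the estimate onto the target: s_target = (<e, t> / ||t||^2) * t
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    e_noise = [a - p for a, p in zip(e, s_target)]
    signal_energy = sum(p * p for p in s_target)
    noise_energy = sum(q * q for q in e_noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)
```

Because of the projection step, rescaling the estimate by a constant leaves the score essentially unchanged, which is why the metric suits separation models whose output gain is arbitrary.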

35. Unsupervised Multi-Class Domain Adaptation: Theory, Algorithms, and Practice
Yabin Zhang, Bin Deng, Hui Tang, Lei Zhang, Kui Jia
Abstract: In this paper, we study the formalism of unsupervised multi-class domain adaptation (multi-class UDA), which underlies some recent algorithms whose learning objectives are only motivated empirically. A Multi-Class Scoring Disagreement (MCSD) divergence is presented by aggregating the absolute margin violations in multi-class classification; the proposed MCSD is able to fully characterize the relations between any pair of multi-class scoring hypotheses. By using MCSD as a measure of domain distance, we develop a new domain adaptation bound for multi-class UDA as well as its data-dependent, probably approximately correct bound, which naturally suggest adversarial learning objectives to align conditional feature distributions across the source and target domains. Consequently, an algorithmic framework of Multi-class Domain-adversarial learning Networks (McDalNets) is developed, whose different instantiations via surrogate learning objectives either coincide with or resemble a few recently popular methods, thus (partially) underscoring their practical effectiveness. Based on the same theory of multi-class UDA, we also introduce a new algorithm of Domain-Symmetric Networks (SymmNets), which features a novel adversarial strategy of domain confusion and discrimination. SymmNets afford simple extensions that work equally well under the problem settings of either closed set, partial, or open set UDA. We conduct careful empirical studies to compare different algorithms of McDalNets and our newly introduced SymmNets. Experiments verify our theoretical analysis and show the efficacy of our proposed SymmNets. We make our implementation codes publicly available.
摘要：在本文中，我们研究了无监督多类领域适应性（多级UDA），这underlies最近的一些算法，其学习目标仅是出于经验的形式主义。多类别得分不一致（MCSD）发散通过聚集绝对余量违反多类分类呈现;所提出的MCSD是能够充分体现任何一组多级得分假设之间的关系。通过MCSD作为域距离的测量，我们开发必将为多类UDA以及它的数据依赖，可能近似正确绑定，在源这自然意味着敌对的学习目标对齐有条件的功能分布和一个新的领域适应性目标域。因此，多级域对抗性学习网络（McDalNets）的算法框架得以确立，通过替代学习目标或者重合，其不同实例有或类似几个最近流行的方法，从而（部分地）强调它们的实际效果。基于我们的多级UDA的同样的理论，我们也介绍域的对称网络（SymmNets），这是由域的混乱和歧视的一种新的对抗策略精选的新算法。 SymmNets提供简单的扩展，它的工作同样在任一闭集，部分或开集UDA的问题设置。我们进行认真的实证研究比较McDalNets和我们新推出的SymmNets不同的算法。实验结果验证了我们的理论分析和展示我们提出SymmNets的功效。我们使我们的实现代码公开。

36. Unsupervised Domain Adaptation via Discriminative Manifold Embedding and Alignment
You-Wei Luo, Chuan-Xian Ren, Pengfei Ge, Ke-Kun Huang, Yu-Feng Yu
Abstract: Unsupervised domain adaptation is effective in leveraging the rich information from the source domain for the unsupervised target domain. Though deep learning and adversarial strategies have made an important breakthrough in the adaptability of features, there are two issues to be further explored. First, the hard-assigned pseudo labels on the target domain are risky to the intrinsic data structure. Second, the batch-wise training manner in deep learning limits the description of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability consistently. As to the first problem, this method establishes a probabilistic discriminant criterion on the target domain via soft labels. Further, this criterion is extended to a global approximation scheme for the second issue; such approximation is also memory-saving. The manifold metric alignment is exploited to be compatible with the embedding space. A theoretical error bound is derived to facilitate the alignment. Extensive experiments have been conducted to investigate the proposal, and results of the comparison study manifest the superiority of the consistent manifold learning framework.
摘要：无监督领域适应性是有效地利用从源域到无监督目标域的丰富信息。虽然深学习和对抗性的策略使功能适应性的一个重要突破，有两个问题有待进一步探讨。首先，在目标域硬分配的假标签是有风险的内在数据结构。其次，在深度学习间歇式培训的方式限制了全球结构的描述。在本文中，黎曼流形学习框架，提出了实现可转让性和识别性一致。作为第一个问题，这种方法建立经由软标签目标域中的概率的判别标准。此外，该标准被推广到了第二个问题，一个全球性的近似方案;这种近似也是存储器节约。歧管度量对准被利用以与嵌入空间兼容。结合的理论误差导出以便于对准。大量的实验已经进行了调查对比研究清单一致流形学习框架的优越性的建议和结果。

37. A Novel Framework for Selection of GANs for an Application
Tanya Motwani, Manojkumar Parmar
Abstract: The Generative Adversarial Network (GAN) is a current focal point of research. The body of knowledge is fragmented, leading to a trial-and-error method when selecting an appropriate GAN for a given scenario. We provide a comprehensive summary of the evolution of GANs starting from their inception, addressing issues like mode collapse, vanishing gradient, unstable training and non-convergence. We also provide a comparison of various GANs from the application point of view, covering their behaviour and implementation details. We propose a novel framework to identify candidate GANs for a specific use case based on architecture, loss, regularization and divergence. We also discuss application of the framework using an example, and we demonstrate a significant reduction in search space. This efficient way to determine potential GANs lowers unit economics of AI development for organizations.
摘要：创成对抗性网络（GAN）是研究当前的焦点。知识的身体支离破碎，导致试错法，而选择合适的GAN对于给定的情景。我们提供甘斯的进化全面总结了自成立以来的寻址模式一样崩溃的问题出发，消失梯度，不稳定的培训和非收敛。我们还提供各种甘斯从来看，它的行为和实施细则的应用点的比较。我们提出了一个新的框架，以确定基于架构的损失，正规化和发散具体的使用情况下，候选人甘斯。我们还讨论了使用一个例子框架的应用，我们证明搜索空间显著减少。这种高效的方式来确定潜在甘斯降低AI开发单位为经济组织。

38. Boosting Adversarial Training with Hypersphere Embedding
Tianyu Pang, Xiao Yang, Yinpeng Dong, Kun Xu, Hang Su, Jun Zhu
Abstract: Adversarial training (AT) is one of the most effective defenses to improve the adversarial robustness of deep learning models. In order to promote the reliability of the adversarially trained models, we propose to boost AT via incorporating hypersphere embedding (HE), which can regularize the adversarial features onto compact hypersphere manifolds. We formally demonstrate that AT and HE are well coupled, which tunes up the learning dynamics of AT from several aspects. We comprehensively validate the effectiveness and universality of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In experiments, we evaluate our methods on the CIFAR-10 and ImageNet datasets, and verify that integrating HE can consistently enhance the performance of the models trained by each AT framework with little extra computation.
摘要：对抗性训练（AT）是最有效的防御系统，以提高深学习模式的对抗稳健性的一个。为了促进adversarially训练的模型的可靠性，提出了通过引入超球面包埋（HE），其可正规化对抗特征到紧凑超球面的歧管，以提高AT。我们正式表明，AT和HE很好耦合，这从几个方面曲调了AT的学习动力。我们全面通过嵌入流行的AT框架，包括PGD-AT，ALP，和行业，以及在FreeAT和FastAT策略验证HE的有效性和普遍性。在实验中，我们评估的CIFAR-10和ImageNet数据集的方法，并验证集成HE可以持续提升与一些额外的计算每个AT框架训练模型的性能。
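
The abstract above describes regularizing features onto a hypersphere. One common way to realize such an embedding is to L2-normalize both the feature vector and each class weight vector, so that every logit becomes a scaled cosine similarity. The sketch below illustrates that idea only; the additive margin form, the default `scale=15.0`, and all function names are assumptions and may differ from the formulation used in the paper.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Rescale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def cosine_logits(feature, class_weights, scale=15.0, margin=0.0, label=None):
    """Logits computed on the unit hypersphere: scale * cos(angle to each class).

    If `label` is given, `margin` is subtracted from the true-class cosine
    (a hypothetical additive-margin variant, used here for illustration).
    """
    f = l2_normalize(feature)
    logits = []
    for k, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(f, l2_normalize(w)))
        if label is not None and k == label:
            cos -= margin
        logits.append(scale * cos)
    return logits
```

With this parameterization the logits depend only on the direction of the feature vector, not on its norm, so a perturbation cannot change the prediction simply by inflating the feature magnitude.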

39. Cross-stained Segmentation from Renal Biopsy Images Using Multi-level Adversarial Learning
Ke Mei, Chuang Zhu, Lei Jiang, Jun Liu, Yuanyuan Qiao
Abstract: Segmentation from renal pathological images is a key step in automatically analyzing renal histological characteristics. However, the performance of models varies significantly across different types of stained datasets due to appearance variations. In this paper, we design a robust and flexible model for cross-stained segmentation. It is a novel multi-level deep adversarial network architecture that consists of three sub-networks: (i) a segmentation network; (ii) a pair of multi-level mirrored discriminators for guiding the segmentation network to extract domain-invariant features; (iii) a shape discriminator that is utilized to further distinguish the output of the segmentation network from the ground truth. Experimental results on glomeruli segmentation from renal biopsy images indicate that our network is able to improve segmentation performance on the target type of stained images and use unlabeled data to achieve accuracy similar to that obtained with labeled data. In addition, this method can be easily applied to other tasks.
摘要：分割从肾脏病理图像是在自动分析肾的组织学特征的关键步骤。然而，模型的性能显著不同类型染色的数据集，由于外观的变化而变化。在本文中，我们设计了交叉染色分割一个强大的和灵活的模式。它是一种新颖的多级深对抗网络架构，可包括三个子网络：（ⅰ）一分割网络; （ⅱ）一对多级的镜像用于引导分割网络来提取域不变特征鉴别器; （ⅲ）被利用来进一步识别分割网络的输出和所述地面真值的形状鉴别器。从肾活检图像分割肾小球实验结果表明，我们的网络能够提高染色图像的目标类型划分的性能和使用无标签数据来实现类似的准确性标签的数据。此外，该方法可以很容易地应用到其他的任务。

40. Deep Fusion of Local and Non-Local Features for Precision Landslide Recognition [PDF] Back to contents
Qing Zhu, Lin Chen, Han Hu, Binzhi Xu, Yeting Zhang, Haifeng Li
Abstract: Precision mapping of landslide inventory is crucial for hazard mitigation. Most landslides co-exist with other confusing geological features, and the presence of such areas can only be inferred unambiguously at a large scale. In addition, local information is important for preserving object boundaries. To address this, this paper proposes an effective approach that fuses local and non-local features to surmount the contextual problem. Building upon the U-Net architecture, which is widely adopted in the remote sensing community, we utilize two additional modules. The first uses dilated convolution and the corresponding atrous spatial pyramid pooling, which enlarge the receptive field without sacrificing spatial resolution or increasing memory usage. The second uses a scale attention mechanism to guide the up-sampling of features from the coarse level by a learned weight map. In implementation, the computational overhead relative to the original U-Net was only a few convolutional layers. Experimental evaluations revealed that the proposed method outperformed state-of-the-art general-purpose semantic segmentation approaches. Furthermore, ablation studies have shown that the two models afforded extensive enhancements in landslide-recognition performance.
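The claim that dilated convolution enlarges the receptive field without extra parameters can be checked on a one-dimensional toy case. The sketch below is a generic illustration written for this digest, not the authors' module: the same three kernel taps cover 3 input samples with dilation 1 and 5 samples with dilation 2.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1

def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation whose taps are `dilation` samples apart."""
    span = receptive_field(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # three weights in both cases

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
```

Atrous spatial pyramid pooling applies several such dilation rates in parallel and combines the results, which is how a network can mix narrow and wide context at the same resolution.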

41. AdvMS: A Multi-source Multi-cost Defense Against Adversarial Attacks [PDF] Back to contents
Xiao Wang, Siyue Wang, Pin-Yu Chen, Xue Lin, Peter Chin
Abstract: Designing effective defenses against adversarial attacks is a crucial topic, as deep neural networks have proliferated rapidly in many security-critical domains such as malware detection and self-driving cars. Conventional defense methods, although shown to be promising, are largely limited by their single-source single-cost nature: the robustness gain tends to plateau as the defenses are made increasingly stronger, while the cost tends to amplify. In this paper, we study principles of designing multi-source and multi-cost schemes where defense performance is boosted from multiple defending components. Based on this motivation, we propose a multi-source and multi-cost defense scheme, Adversarially Trained Model Switching (AdvMS), that inherits advantages from two leading schemes: adversarial training and random model switching. We show that the multi-source nature of AdvMS mitigates the performance plateauing issue, and that its multi-cost nature enables improving robustness at a flexible and adjustable combination of costs over different factors, which can better suit specific restrictions and needs in practice.
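Random model switching, one of the two ingredients named above, can be sketched as a wrapper that picks one model at random for every query. This is only a minimal illustration under stated assumptions: the two threshold functions are hypothetical stand-ins for adversarially trained networks, and `make_switching_predictor` is not a name from the paper.

```python
import random

def make_switching_predictor(models, rng=None):
    """Return a predictor that answers each query with a randomly chosen model."""
    rng = rng or random.Random()

    def predict(x):
        model = rng.choice(models)  # an attacker cannot know which model will answer
        return model(x)

    return predict

# Hypothetical stand-ins for separately (adversarially) trained classifiers.
models = [
    lambda x: int(x > 0.3),
    lambda x: int(x > 0.7),
]

predict = make_switching_predictor(models, rng=random.Random(0))
outputs = [predict(0.5) for _ in range(200)]
```

For the borderline input 0.5 the two stand-in models disagree, so repeated queries return both labels. That variability is what makes a perturbation tuned against one fixed model less reliable.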

42. Fine tuning U-Net for ultrasound image segmentation: which layers? [PDF] Back to contents
Mina Amiri, Rupert Brooks, Hassan Rivaz
Abstract: Fine-tuning a network that has been trained on a large dataset is an alternative to full training that helps overcome the problem of scarce and expensive data in medical applications. While the shallow layers of the network are usually kept unchanged, deeper layers are modified according to the new dataset. This approach may not work for ultrasound images because of their drastically different appearance. In this study, we investigated the effect of fine-tuning different layers of a U-Net, pre-trained on segmentation of natural images, for breast ultrasound image segmentation. Tuning the contracting part and fixing the expanding part gave substantially better results than fixing the contracting part and tuning the expanding part. Furthermore, we showed that starting to fine-tune the U-Net from the shallow layers and gradually including more layers leads to better performance than fine-tuning the network from the deep layers and moving back to the shallow layers. We did not observe the same results on segmentation of X-ray images, which have different salient features compared to ultrasound; for ultrasound, it may therefore be more appropriate to fine-tune the shallow layers rather than the deep layers. Shallow layers learn lower-level features (including speckle pattern, and probably the noise and artifact properties), which are critical for automatic segmentation in this modality.
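The two fine-tuning orders compared above can be written down as a schedule of which blocks are trainable at each stage. The sketch below is framework-free and purely illustrative: the block names are hypothetical, and in a real training loop each stage would mark only the listed blocks as trainable and freeze the rest.

```python
def unfreeze_schedule(blocks, shallow_first=True):
    """List the trainable blocks at each stage, adding one block per stage.

    `blocks` must be ordered from the shallowest to the deepest layer group.
    """
    order = list(blocks) if shallow_first else list(reversed(blocks))
    return [order[:k] for k in range(1, len(order) + 1)]

# Hypothetical U-Net layer groups, shallow (contracting) to deep (expanding).
blocks = ["enc1", "enc2", "enc3", "dec3", "dec2", "dec1"]

shallow_to_deep = unfreeze_schedule(blocks, shallow_first=True)
deep_to_shallow = unfreeze_schedule(blocks, shallow_first=False)
```

Here `shallow_to_deep[0]` trains only `enc1`, and the final stage trains every block. According to the abstract, this ordering worked better for breast ultrasound than the reverse one.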

43. Interactive Natural Language-based Person Search [PDF] Back to contents
Vikram Shree, Wei-Lun Chao, Mark Campbell
Abstract: In this work, we consider the problem of searching for people in an unconstrained environment using natural language descriptions. Specifically, we study how to systematically design an algorithm to effectively acquire descriptions from humans. An algorithm is proposed that adapts models used for visual and language understanding to search for a person of interest (POI) in a principled way, achieving promising results without the need to re-design another complicated model. We then investigate an iterative question-answering (QA) strategy that enables robots to request additional information about the POI's appearance from the user. To this end, we introduce a greedy algorithm to rank questions in terms of their significance, and equip the algorithm with the capability to dynamically adjust the length of human-robot interaction according to the model's uncertainty. Our approach is validated not only on benchmark datasets but also on a mobile robot moving in a dynamic and crowded environment.

44. T-Net: A Template-Supervised Network for Task-specific Feature Extraction in Biomedical Image Analysis [PDF] Back to contents
Weinan Song, Yuan Liang, Kun Wang, Lei He
Abstract: Existing deep learning methods depend on an encoder-decoder structure to learn feature representations from segmentation annotations in biomedical image analysis. However, the effectiveness of feature extraction under this structure decreases because of the indirect optimization process, limited training data size, and simplex supervision method. In this paper, we propose a template-supervised network, T-Net, for task-specific feature extraction. Specifically, we first obtain templates from pixel-level annotations by down-sampling binary masks of recognition targets according to specific tasks. Then, we directly train the encoding network under the supervision of the derived task-specific templates. Finally, we combine the resulting encoding network with a posterior network for the specific task, e.g. an up-sampling network for segmentation or a region proposal network for detection. Extensive experiments on three public datasets (BraTS-17, MoNuSeg and IDRiD) show that T-Net achieves results competitive with state-of-the-art methods and performance superior to an encoder-decoder based network. To the best of our knowledge, this is the first in-depth study to improve feature extraction by directly supervising the encoding network and by applying task-specific supervision in biomedical image analysis.
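The template construction step, down-sampling a binary mask of the recognition target, might look like the sketch below. The pooling rule is an assumption made for illustration (a coarse cell is 1 if any pixel inside it is 1); the abstract does not state which rule the authors use, and `downsample_mask` is a hypothetical name.

```python
def downsample_mask(mask, factor):
    """Down-sample a binary mask by `factor` using block-wise 'any' pooling.

    Assumption: a coarse cell is marked 1 when at least one pixel in the
    corresponding factor x factor block of the full-resolution mask is 1.
    """
    rows, cols = len(mask), len(mask[0])
    template = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [mask[i][j]
                     for i in range(r, min(r + factor, rows))
                     for j in range(c, min(c + factor, cols))]
            coarse_row.append(1 if any(block) else 0)
        template.append(coarse_row)
    return template

mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
template = downsample_mask(mask, factor=2)
```

A template produced this way has the same spatial size as a coarse encoder feature map, so in principle it could supervise the encoder output directly rather than through a decoder.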

45. Algorithm-hardware Co-design for Deformable Convolution [PDF] Back to contents
Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer, John Wawrzynek
Abstract: FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all the spatial locations in an image, while dynamic deformable convolutions may access arbitrary pixels in the image and the access pattern is input-dependent and varies per spatial location. These properties lead to inefficient memory accesses of inputs with existing hardware. In this work, we first investigate the overhead of the deformable convolution on embedded FPGA SoCs, and then show the accuracy-latency tradeoffs for a set of algorithm modifications including full versus depthwise, fixed-shape, and limited-range. These modifications benefit the energy efficiency for embedded devices in general as they reduce the compute complexity. We then build an efficient object detection network with modified deformable convolutions and quantize the network using state-of-the-art quantization methods. We implement a unified hardware engine on FPGA to support all the operations in the network. Preliminary experiments show that little accuracy is compromised and speedup can be achieved with our co-design optimization for the deformable convolution.
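The difference between a fixed sampling grid and an input-dependent one can be seen in a one-dimensional simplification. In the sketch below the offsets are passed in by hand, whereas a real deformable convolution predicts them from the input and interpolates bilinearly in two dimensions. All names are hypothetical.

```python
import math

def sample_linear(signal, pos):
    """Linearly interpolate `signal` at a fractional position (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1 - frac) * at(lo) + frac * at(lo + 1)

def deformable_conv1d(signal, kernel, offsets):
    """1-D convolution in which offsets[p][k] shifts tap k at output position p.

    With all offsets equal to zero this reduces to a regular, zero-padded
    convolution (cross-correlation) over a fixed grid.
    """
    half = len(kernel) // 2
    out = []
    for p in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            pos = p + (k - half) + offsets[p][k]  # regular grid plus learned shift
            acc += w * sample_linear(signal, pos)
        out.append(acc)
    return out

signal = [0.0, 10.0, 20.0, 30.0]
kernel = [0.0, 1.0, 0.0]  # identity kernel, so the output shows where we sampled

no_shift = [[0.0, 0.0, 0.0] for _ in signal]
half_shift = [[0.0, 0.5, 0.0] for _ in signal]

regular = deformable_conv1d(signal, kernel, no_shift)
shifted = deformable_conv1d(signal, kernel, half_shift)
```

Each output position reads memory at locations that depend on the offsets. This is the irregular access pattern that the abstract identifies as inefficient on existing hardware, and the fixed-shape and limited-range modifications constrain it.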

Note: the Chinese text in the original is a machine translation.


[arXiv papers] Computer Vision and Pattern Recognition 2020-02-21

Contents

Abstracts