Abstracts

1. Atlas: End-to-End 3D Scene Reconstruction from Posed Images
Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, Andrew Rabinovich
Abstract: We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant computation. This approach is evaluated on the Scannet dataset where we significantly outperform state-of-the-art baselines (deep multiview stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor since no previous work attempts the problem with only RGB input.
Summary: Atlas regresses a truncated signed distance function (TSDF) directly from a set of posed RGB images instead of going through intermediate depth maps. A 2D CNN extracts per-image features that are back-projected and accumulated into a voxel volume, and a 3D CNN refines them and predicts the TSDF values. On Scannet the method outperforms deep multiview stereo followed by traditional TSDF fusion.
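The back-projection and accumulation step can be illustrated with a minimal sketch. It assumes a pinhole camera, nearest-pixel sampling, scalar features, and a plain average over views. The helper names `project` and `accumulate` are hypothetical, and none of these choices are confirmed by the abstract.

```python
def project(point, K, pose):
    """Project a 3D world point into pixel coordinates.

    K is (fx, fy, cx, cy); pose is a world-to-camera 3x4 matrix [R | t]
    given as three rows of four numbers.
    """
    x, y, z = point
    cam = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in pose]
    if cam[2] <= 0:  # point lies behind the camera
        return None
    fx, fy, cx, cy = K
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return int(round(u)), int(round(v))


def accumulate(voxels, views):
    """Average the 2D features seen by each voxel across all views.

    voxels: list of (x, y, z) voxel centres.
    views:  list of (feature_map, K, pose); feature_map[v][u] is a float.
    Returns one averaged feature per voxel (0.0 if never observed).
    """
    sums = [0.0] * len(voxels)
    counts = [0] * len(voxels)
    for fmap, K, pose in views:
        h, w = len(fmap), len(fmap[0])
        for i, p in enumerate(voxels):
            uv = project(p, K, pose)
            if uv is None:
                continue
            u, v = uv
            if 0 <= u < w and 0 <= v < h:
                sums[i] += fmap[v][u]
                counts[i] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

In the described pipeline the accumulated volume would then be passed to a 3D CNN; that refinement stage is not sketched here.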

2. Learning Dynamic Routing for Semantic Segmentation
Yanwei Li, Lin Song, Yukang Chen, Zeming Li, Xiangyu Zhang, Xingang Wang, Jian Sun
Abstract: Recently, numerous handcrafted and searched networks have been applied for semantic segmentation. However, previous works intend to handle inputs with various scales in pre-defined static architectures, such as FCN, U-Net, and DeepLab series. This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing. The proposed framework generates data-dependent routes, adapting to the scale distribution of each image. To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly. In addition, the computational cost can be further reduced in an end-to-end manner by giving budget constraints to the gating function. We further relax the network level routing space to support multi-path propagations and skip-connections in each forward, bringing substantial network capacity. To demonstrate the superiority of the dynamic property, we compare with several static architectures, which can be modeled as special cases in the routing space. Extensive experiments are conducted on Cityscapes and PASCAL VOC 2012 to illustrate the effectiveness of the dynamic framework. Code is available at this https URL.
Summary: Dynamic routing alleviates scale variance in semantic segmentation by generating data-dependent routes that adapt to the scale distribution of each image. A differentiable soft conditional gate selects scale-transform paths on the fly, and budget constraints on the gate further reduce computational cost. Experiments are run on Cityscapes and PASCAL VOC 2012.

3. Inherent Adversarial Robustness of Deep Spiking Neural Networks: Effects of Discrete Input Encoding and Non-Linear Activations
Saima Sharmin, Nitin Rathi, Priyadarshini Panda, Kaushik Roy
Abstract: In the recent quest for trustworthy neural networks, we present Spiking Neural Network (SNN) as a potential candidate for inherent robustness against adversarial attacks. In this work, we demonstrate that accuracy degradation is less severe in SNNs than in their non-spiking counterparts for CIFAR10 and CIFAR100 datasets on deep VGG architectures. We attribute this robustness to two fundamental characteristics of SNNs and analyze their effects. First, we exhibit that input discretization introduced by the Poisson encoder improves adversarial robustness with reduced number of timesteps. Second, we quantify the amount of adversarial accuracy with increased leak rate in Leaky-Integrate-Fire (LIF) neurons. Our results suggest that SNNs trained with LIF neurons and smaller number of timesteps are more robust than the ones with IF (Integrate-Fire) neurons and larger number of timesteps. We overcome the bottleneck of creating gradient-based adversarial inputs in temporal domain by proposing a technique for crafting attacks from SNN.
Summary: Spiking neural networks (SNNs) are presented as inherently more robust to adversarial attacks than their non-spiking counterparts, shown with deep VGG architectures on CIFAR10 and CIFAR100. The robustness is attributed to the input discretization of the Poisson encoder, which helps more as the number of timesteps is reduced, and to the leak rate of Leaky-Integrate-Fire (LIF) neurons.
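The two SNN ingredients named in this entry, Poisson input encoding and leaky integration, can be sketched in a few lines. This is an illustrative toy under stated assumptions: a single neuron, a hard reset to zero after each spike, and a multiplicative leak factor. The function names are hypothetical and the paper's actual neuron model may differ.

```python
import random


def poisson_encode(intensity, timesteps, rng):
    """Turn an intensity in [0, 1] into a binary spike train.

    At every timestep a spike is emitted with probability `intensity`.
    """
    return [1 if rng.random() < intensity else 0 for _ in range(timesteps)]


def lif_spike_count(inputs, leak=1.0, threshold=1.0):
    """Count output spikes of one neuron driven by `inputs`.

    leak=1.0 gives an Integrate-Fire (IF) neuron; leak<1.0 makes the
    membrane potential decay each step (Leaky-Integrate-Fire).
    """
    v = 0.0
    spikes = 0
    for x in inputs:
        v = leak * v + x
        if v >= threshold:
            spikes += 1
            v = 0.0  # hard reset (an assumption of this sketch)
    return spikes
```

With a constant sub-threshold input, the IF neuron eventually fires while a sufficiently leaky neuron never does. This suggests how leak can filter out small input perturbations.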

4. Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows
Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, Bill Freeman, Rahul Sukthankar, Cristian Sminchisescu
Abstract: Monocular 3D human pose and shape estimation is challenging due to the many degrees of freedom of the human body and the difficulty of acquiring training data for large-scale supervised learning in complex visual scenes. In this paper we present practical semi-supervised and self-supervised models that support training and good generalization in real-world images and video. Our formulation is based on kinematic latent normalizing flow representations and dynamics, as well as differentiable, semantic body part alignment loss functions that support self-supervised learning. In extensive experiments using 3D motion capture datasets like CMU, Human3.6M, 3DPW, or AMASS, as well as image repositories like COCO, we show that the proposed methods outperform the state of the art, supporting the practical construction of an accurate family of models based on large-scale training with diverse and incompletely labeled image and video data.
Summary: The paper presents practical semi-supervised and self-supervised models for monocular 3D human pose and shape estimation, built on kinematic latent normalizing flow representations and differentiable semantic body-part alignment losses. Experiments on CMU, Human3.6M, 3DPW, AMASS, and COCO show that the methods outperform the state of the art.

5. Neural Contours: Learning to Draw Lines from 3D Shapes
Difan Liu, Mohamed Nabail, Aaron Hertzmann, Evangelos Kalogerakis
Abstract: This paper introduces a method for learning to generate line drawings from 3D models. Our architecture incorporates a differentiable module operating on geometric features of the 3D model, and an image-based module operating on view-based shape representations. At test time, geometric and view-based reasoning are combined with the help of a neural module to create a line drawing. The model is trained on a large number of crowdsourced comparisons of line drawings. Experiments demonstrate that our method achieves significant improvements in line drawing over the state-of-the-art when evaluated on standard benchmarks, resulting in drawings that are comparable to those produced by experienced human artists.
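Summary: Neural Contours learns to generate line drawings from 3D models by combining a differentiable module operating on geometric features with an image-based module operating on view-based shape representations. The model is trained on a large number of crowdsourced comparisons of line drawings and produces results comparable to those of experienced human artists.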

6. Adversarial Attacks on Monocular Depth Estimation
Ziqi Zhang, Xinge Zhu, Yingwei Li, Xiangqun Chen, Yao Guo
Abstract: Recent advances in deep learning have brought exceptional performance on many computer vision tasks such as semantic segmentation and depth estimation. However, the vulnerability of deep neural networks to adversarial examples has caused grave concerns for real-world deployment. In this paper, we present, to the best of our knowledge, the first systematic study of adversarial attacks on monocular depth estimation, an important task of 3D scene understanding in scenarios such as autonomous driving and robot navigation. In order to understand the impact of adversarial attacks on depth estimation, we first define a taxonomy of different attack scenarios for depth estimation, including non-targeted attacks, targeted attacks and universal attacks. We then adapt several state-of-the-art attack methods for classification to the field of depth estimation. In addition, multi-task attacks are introduced to further improve the attack performance for universal attacks. Experimental results show that it is possible to generate significant errors on depth estimation. In particular, we demonstrate that our methods can conduct targeted attacks on given objects (such as a car), resulting in depth estimates 3-4x away from the ground truth (e.g., from 20m to 80m).
Summary: This is, to the authors' knowledge, the first systematic study of adversarial attacks on monocular depth estimation, covering non-targeted, targeted, and universal attacks. Targeted attacks on given objects such as a car can push the depth estimate 3-4x away from the ground truth (e.g., from 20m to 80m).

7. Robust Medical Instrument Segmentation Challenge 2019
Tobias Ross, Annika Reinke, Peter M. Full, Martin Wagner, Hannes Kenngott, Martin Apitz, Hellena Hempe, Diana Mindroc Filimon, Patrick Scholz, Thuy Nuong Tran, Pierangela Bruno, Pablo Arbeláez, Gui-Bin Bian, Sebastian Bodenstedt, Jon Lindström Bolmgren, Laura Bravo-Sánchez, Hua-Bin Chen, Cristina González, Dong Guo, Pål Halvorsen, Pheng-Ann Heng, Enes Hosgor, Zeng-Guang Hou, Fabian Isensee, Debesh Jha, Tingting Jiang, Yueming Jin, Kadir Kirtac, Sabrina Kletz, Stefan Leger, Zhixuan Li, Klaus H. Maier-Hein, Zhen-Liang Ni, Michael A. Riegler, Klaus Schoeffmann, Ruohua Shi, Stefanie Speidel, Michael Stenzel, Isabell Twick, Gutai Wang, Jiacheng Wang, Liansheng Wang, Lu Wang, Yujie Zhang, Yan-Jie Zhou, Lei Zhu, Manuel Wiesenfarth, Annette Kopp-Schneider, Beat P. Müller-Stich
Abstract: Intraoperative tracking of laparoscopic instruments is often a prerequisite for computer and robotic-assisted interventions. While numerous methods for detecting, segmenting and tracking of medical instruments based on endoscopic video images have been proposed in the literature, key limitations remain to be addressed: Firstly, robustness, that is, the reliable performance of state-of-the-art methods when run on challenging images (e.g. in the presence of blood, smoke or motion artifacts). Secondly, generalization; algorithms trained for a specific intervention in a specific hospital should generalize to other interventions or institutions. In an effort to promote solutions for these limitations, we organized the Robust Medical Instrument Segmentation (ROBUST-MIS) challenge as an international benchmarking competition with a specific focus on the robustness and generalization capabilities of algorithms. For the first time in the field of endoscopic image processing, our challenge included a task on binary segmentation and also addressed multi-instance detection and segmentation. The challenge was based on a surgical data set comprising 10,040 annotated images acquired from a total of 30 surgical procedures from three different types of surgery. The validation of the competing methods for the three tasks (binary segmentation, multi-instance detection and multi-instance segmentation) was performed in three different stages with an increasing domain gap between the training and the test data. The results confirm the initial hypothesis, namely that algorithm performance degrades with an increasing domain gap. While the average detection and segmentation quality of the best-performing algorithms is high, future research should concentrate on detection and segmentation of small, crossing, moving and transparent instrument(s) (parts).
Summary: The ROBUST-MIS challenge benchmarked the robustness and generalization of medical instrument segmentation on 10,040 annotated images from 30 surgical procedures across three types of surgery. Validation in three stages with an increasing domain gap confirmed that algorithm performance degrades as the gap grows.

8. Accurate Optimization of Weighted Nuclear Norm for Non-Rigid Structure from Motion
José Pedro Iglesias, Carl Olsson, Marcus Valtonen Örnhag
Abstract: Fitting a matrix of a given rank to data in a least squares sense can be done very effectively using 2nd order methods such as Levenberg-Marquardt by explicitly optimizing over a bilinear parameterization of the matrix. In contrast, when applying more general singular value penalties, such as weighted nuclear norm priors, direct optimization over the elements of the matrix is typically used. Due to non-differentiability of the resulting objective function, first order sub-gradient or splitting methods are predominantly used. While these offer rapid iterations, it is well known that they become inefficient near the minimum due to zig-zagging, and in practice one is therefore often forced to settle for an approximate solution. In this paper we show that more accurate results can in many cases be achieved with 2nd order methods. Our main result shows how to construct bilinear formulations, for a general class of regularizers including weighted nuclear norm penalties, that are provably equivalent to the original problems. With these formulations the regularizing function becomes twice differentiable and 2nd order methods can be applied. We show experimentally, on a number of structure from motion problems, that our approach outperforms state-of-the-art methods.
Summary: For singular value penalties such as weighted nuclear norm priors, the paper constructs bilinear formulations that are provably equivalent to the original problems and make the regularizer twice differentiable. This allows 2nd order methods such as Levenberg-Marquardt to be applied, and the approach outperforms state-of-the-art methods on several structure from motion problems.

9. Cross-domain Object Detection through Coarse-to-Fine Feature Adaptation
Yangtao Zheng, Di Huang, Songtao Liu, Yunhong Wang
Abstract: Recent years have witnessed great progress in deep learning based object detection. However, due to the domain shift problem, applying off-the-shelf detectors to an unseen domain leads to significant performance drop. To address such an issue, this paper proposes a novel coarse-to-fine feature adaptation approach to cross-domain object detection. At the coarse-grained stage, different from the rough image-level or instance-level feature alignment used in the literature, foreground regions are extracted by adopting the attention mechanism, and aligned according to their marginal distributions via multi-layer adversarial learning in the common feature space. At the fine-grained stage, we conduct conditional distribution alignment of foregrounds by minimizing the distance of global prototypes with the same category but from different domains. Thanks to this coarse-to-fine feature adaptation, domain knowledge in foreground regions can be effectively transferred. Extensive experiments are carried out in various cross-domain detection scenarios. The results are state-of-the-art, which demonstrate the broad applicability and effectiveness of the proposed approach.
Summary: The paper proposes a coarse-to-fine feature adaptation approach for cross-domain object detection. The coarse stage extracts foreground regions with an attention mechanism and aligns their marginal distributions via multi-layer adversarial learning. The fine stage aligns conditional distributions by minimizing the distance between global prototypes of the same category from different domains.

10. Sample-Specific Output Constraints for Neural Networks
Mathis Brosowsky, Olaf Dünkel, Daniel Slieter, Marius Zöllner
Abstract: Neural networks reach state-of-the-art performance in a variety of learning tasks. However, a lack of understanding of the decision-making process makes them appear as a black box. We address this and propose ConstraintNet, a neural network with the capability to constrain the output space in each forward pass via an additional input. The prediction of ConstraintNet is provably within the specified domain. This enables ConstraintNet to exclude unintended or even hazardous outputs explicitly whereas the final prediction is still learned from data. We focus on constraints in the form of convex polytopes and show the generalization to further classes of constraints. ConstraintNet can be constructed easily by modifying existing neural network architectures. We highlight that ConstraintNet is end-to-end trainable with no overhead in the forward and backward pass. For illustration purposes, we model ConstraintNet by modifying a CNN and construct constraints for facial landmark prediction tasks. Furthermore, we demonstrate the application to a follow object controller for vehicles as a safety-critical application. We submitted an approach and system for the generation of safety-critical outputs of an entity based on ConstraintNet at the German Patent and Trademark Office with the official registration mark DE10 2019 119 739.
Summary: ConstraintNet is a neural network that constrains its output space in each forward pass via an additional input, so that unintended or hazardous outputs are excluded explicitly while the final prediction is still learned from data. The paper focuses on constraints in the form of convex polytopes and demonstrates facial landmark prediction and a follow object controller for vehicles.
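One way to guarantee that an output stays inside a convex polytope is to predict softmax weights over the polytope's vertices and return the weighted sum. The sketch below illustrates that idea only. Treating the vertices as the "additional input" is an assumption, the abstract does not confirm that ConstraintNet is built this way, and `constrained_output` is a hypothetical name.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]


def constrained_output(logits, vertices):
    """Map unconstrained logits to a point inside a convex polytope.

    vertices: list of polytope corner points (each a list of floats).
    The output is a convex combination of the vertices, so it cannot
    leave the polytope regardless of the logits.
    """
    w = softmax(logits)
    dim = len(vertices[0])
    return [sum(wi * v[d] for wi, v in zip(w, vertices)) for d in range(dim)]
```

Because the vertices are passed in per call, a different constraint region can be supplied for every sample without retraining the part of the network that produces the logits.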

11. Multi-Person Pose Estimation with Enhanced Feature Aggregation and Selection
Xixia Xu, Qi Zou, Xue Lin
Abstract: We propose a novel Enhanced Feature Aggregation and Selection network (EFASNet) for multi-person 2D human pose estimation. Due to enhanced feature representation, our method can well handle crowded, cluttered and occluded scenes. More specifically, a Feature Aggregation and Selection Module (FASM), which constructs hierarchical multi-scale feature aggregation and makes the aggregated features discriminative, is proposed to get more accurate fine-grained representation, leading to more precise joint locations. Then, we perform a simple Feature Fusion (FF) strategy which effectively fuses high-resolution spatial features and low-resolution semantic features to obtain more reliable context information for well-estimated joints. Finally, we build a Dense Upsampling Convolution (DUC) module to generate more precise prediction, which can recover missing joint details that are usually unavailable in common upsampling process. As a result, the predicted keypoint heatmaps are more accurate. Comprehensive experiments demonstrate that the proposed approach outperforms the state-of-the-art methods and achieves the superior performance over three benchmark datasets: the recent big dataset CrowdPose, the COCO keypoint detection dataset and the MPII Human Pose dataset. Our code will be released upon acceptance.
Summary: EFASNet targets multi-person 2D pose estimation in crowded, cluttered, and occluded scenes. It combines a Feature Aggregation and Selection Module (FASM), a simple Feature Fusion (FF) strategy, and a Dense Upsampling Convolution (DUC) module, and is evaluated on CrowdPose, COCO keypoint detection, and MPII Human Pose.

12. Spatial Pyramid Based Graph Reasoning for Semantic Segmentation
Xia Li, Yibo Yang, Qijie Zhao, Tiancheng Shen, Zhouchen Lin, Hong Liu
Abstract: The convolution operation suffers from a limited receptive field, while global modeling is fundamental to dense prediction tasks, such as semantic segmentation. In this paper, we apply graph convolution to the semantic segmentation task and propose an improved Laplacian. The graph reasoning is directly performed in the original feature space organized as a spatial pyramid. Different from existing methods, our Laplacian is data-dependent and we introduce an attention diagonal matrix to learn a better distance metric. It gets rid of projecting and re-projecting processes, which makes our proposed method a light-weight module that can be easily plugged into current computer vision architectures. More importantly, performing graph reasoning directly in the feature space retains spatial relationships and makes it possible for the spatial pyramid to explore multiple long-range contextual patterns from different scales. Experiments on Cityscapes, COCO Stuff, PASCAL Context and PASCAL VOC demonstrate the effectiveness of our proposed methods on semantic segmentation. We achieve comparable performance with advantages in computational and memory overhead.
Summary: Graph convolution with an improved, data-dependent Laplacian is applied to semantic segmentation, with graph reasoning performed directly in the original feature space organized as a spatial pyramid. Avoiding projection and re-projection makes the module light-weight. Experiments cover Cityscapes, COCO Stuff, PASCAL Context, and PASCAL VOC.

13. Learning Better Lossless Compression Using Lossy Compression
Fabian Mentzer, Luc Van Gool, Michael Tschannen
Abstract: We leverage the powerful lossy image compression algorithm BPG to build a lossless image compression system. Specifically, the original image is first decomposed into the lossy reconstruction obtained after compressing it with BPG and the corresponding residual. We then model the distribution of the residual with a convolutional neural network-based probabilistic model that is conditioned on the BPG reconstruction, and combine it with entropy coding to losslessly encode the residual. Finally, the image is stored using the concatenation of the bitstreams produced by BPG and the learned residual coder. The resulting compression system achieves state-of-the-art performance in learned lossless full-resolution image compression, outperforming previous learned approaches as well as PNG, WebP, and JPEG2000.
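Summary: A lossless image compression system is built on the lossy codec BPG: the image is decomposed into the BPG reconstruction and a residual, and the residual is entropy-coded under a convolutional probabilistic model conditioned on the reconstruction. The system outperforms previous learned approaches as well as PNG, WebP, and JPEG2000.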
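The decomposition into a lossy reconstruction plus a residual can be shown with a toy example. Here a coarse quantizer stands in for BPG and an empirical entropy estimate stands in for the learned probabilistic model. Both substitutions are assumptions made only to keep the sketch self-contained, and all function names are hypothetical.

```python
import math
from collections import Counter


def lossy_codec(pixels, step=16):
    """Stand-in for BPG: coarse quantisation of 8-bit pixel values."""
    return [(p // step) * step + step // 2 for p in pixels]


def entropy_bits(symbols):
    """Empirical entropy in bits per symbol (a lower bound for a coder)."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def encode(pixels):
    """Split an image into a lossy reconstruction and an exact residual."""
    recon = lossy_codec(pixels)
    residual = [p - r for p, r in zip(pixels, recon)]
    return recon, residual


def decode(recon, residual):
    """Recover the original pixels exactly from the two parts."""
    return [r + d for r, d in zip(recon, residual)]
```

The residual has a much narrower range than the original pixels, which is what makes it cheap to entropy-code. In the described system, a better predictor of the residual distribution translates directly into fewer bits.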

14. Deep Soft Procrustes for Markerless Volumetric Sensor Alignment
Vladimiros Sterzentsenko, Alexandros Doumanoglou, Spyridon Thermos, Nikolaos Zioulis, Dimitrios Zarpalas, Petros Daras
Abstract: With the advent of consumer grade depth sensors, low-cost volumetric capture systems are easier to deploy. Their wider adoption though depends on their usability and by extension on the practicality of spatially aligning multiple sensors. Most existing alignment approaches employ visual patterns, e.g. checkerboards, or markers and require high user involvement and technical knowledge. More user-friendly and easier-to-use approaches rely on markerless methods that exploit geometric patterns of a physical structure. However, current SoA approaches are bounded by restrictions in the placement and the number of sensors. In this work, we improve markerless data-driven correspondence estimation to achieve more robust and flexible multi-sensor spatial alignment. In particular, we incorporate geometric constraints in an end-to-end manner into a typical segmentation based model and bridge the intermediate dense classification task with the targeted pose estimation one. This is accomplished by a soft, differentiable procrustes analysis that regularizes the segmentation and achieves higher extrinsic calibration performance in expanded sensor placement configurations, while being unrestricted by the number of sensors of the volumetric capture system. Our model is experimentally shown to achieve similar results with marker-based methods and outperform the markerless ones, while also being robust to the pose variations of the calibration structure. Code and pretrained models are available at this https URL.
Summary: Markerless, data-driven correspondence estimation is improved for multi-sensor spatial alignment of volumetric capture systems. A soft, differentiable Procrustes analysis regularizes the segmentation model and bridges dense classification with pose estimation, giving results similar to marker-based methods without restricting the number of sensors.
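Procrustes analysis finds the rigid transform that best aligns two sets of corresponding points. The sketch below is a 2D, closed-form simplification with per-point weights playing the role of soft correspondence confidences. The method described above works on 3D sensor data, so this is an illustration of the principle rather than the authors' formulation, and `weighted_procrustes_2d` is a hypothetical name.

```python
import math


def weighted_procrustes_2d(src, dst, weights):
    """Rigid 2D alignment: find angle and translation with dst ≈ R·src + t.

    weights are soft correspondence confidences (non-negative, not all zero).
    """
    total = sum(weights)
    # Weighted centroids of both point sets
    cs = [sum(w * p[k] for w, p in zip(weights, src)) / total for k in (0, 1)]
    cd = [sum(w * q[k] for w, q in zip(weights, dst)) / total for k in (0, 1)]
    dot = cross = 0.0
    for w, p, q in zip(weights, src, dst):
        px, py = p[0] - cs[0], p[1] - cs[1]
        qx, qy = q[0] - cd[0], q[1] - cd[1]
        dot += w * (px * qx + py * qy)
        cross += w * (px * qy - py * qx)
    # The optimal rotation angle follows from the weighted dot/cross sums
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    t = (cd[0] - (c * cs[0] - s * cs[1]), cd[1] - (s * cs[0] + c * cs[1]))
    return theta, t
```

Every operation here is differentiable with respect to the weights, which is the property that lets a pose error be propagated back into the network that predicts the correspondences.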

15. Balanced Alignment for Face Recognition: A Joint Learning Approach
Huawei Wei, Peng Lu, Yichen Wei
Abstract: Face alignment is crucial for face recognition and has been widely adopted. However, current practice is too simple and under-explored. There lacks an understanding of how important face alignment is and how it should be performed, for recognition. This work studies these problems and makes two contributions. First, it provides an in-depth and quantitative study of how alignment strength affects recognition accuracy. Our results show that excessive alignment is harmful and an optimal balanced point of alignment is in need. To strike the balance, our second contribution is a novel joint learning approach where alignment learning is controllable with respect to its strength and driven by recognition. Our proposed method is validated by comprehensive experiments on several benchmarks, especially the challenging ones with large pose.
Summary: The paper studies quantitatively how face alignment strength affects recognition accuracy and finds that excessive alignment is harmful, so a balanced point is needed. It then proposes a joint learning approach in which alignment learning is controllable in strength and driven by recognition.

16. SOLOv2: Dynamic, Faster and Stronger
Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, Chunhua Shen
Abstract: In this work, we aim at building a simple, direct, and fast instance segmentation framework with strong performance. We follow the principle of the SOLO method of Wang et al. "SOLO: segmenting objects by locations". Importantly, we take one step further by dynamically learning the mask head of the object segmenter such that the mask head is conditioned on the location. Specifically, the mask branch is decoupled into a mask kernel branch and mask feature branch, which are responsible for learning the convolution kernel and the convolved features respectively. Moreover, we propose Matrix NMS (non maximum suppression) to significantly reduce the inference time overhead due to NMS of masks. Our Matrix NMS performs NMS with parallel matrix operations in one shot, and yields better results. We demonstrate a simple direct instance segmentation system, outperforming a few state-of-the-art methods in both speed and accuracy. A light-weight version of SOLOv2 executes at 31.3 FPS and yields 37.1% AP. Moreover, our state-of-the-art results in object detection (from our mask byproduct) and panoptic segmentation show the potential to serve as a new strong baseline for many instance-level recognition tasks besides instance segmentation. Code is available at: this https URL
Summary: SOLOv2 extends SOLO by dynamically learning the mask head, decoupled into a mask kernel branch and a mask feature branch, so that the head is conditioned on location. Matrix NMS performs non-maximum suppression with parallel matrix operations in one shot. A light-weight version runs at 31.3 FPS with 37.1% AP.
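The idea behind Matrix NMS is to replace sequential suppression with a single pass that decays every score according to its overlap with higher-scoring masks. The sketch below uses a Gaussian decay and represents masks as sets of pixel indices. The exact decay formula is recalled from the publicly available description rather than taken from the abstract, so treat it as an assumption. The loops are written out for clarity instead of as true matrix operations.

```python
import math


def mask_iou(a, b):
    """IoU of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def matrix_nms(scores, masks, sigma=0.5):
    """Return decayed scores; inputs must be sorted by descending score."""
    n = len(scores)
    # Upper-triangular IoU matrix: iou[i][j] is only filled for i < j
    iou = [[mask_iou(masks[i], masks[j]) if i < j else 0.0
            for j in range(n)] for i in range(n)]
    # cmax[i]: largest overlap of mask i with any higher-scoring mask
    cmax = [max(iou[k][i] for k in range(n)) for i in range(n)]
    out = []
    for j in range(n):
        # A suppressor i counts less if it is itself heavily overlapped
        decay = min(
            math.exp(-(iou[i][j] ** 2 - cmax[i] ** 2) / sigma)
            for i in range(n)
        )
        out.append(scores[j] * decay)
    return out
```

Unlike hard NMS, no mask is removed outright. A duplicate keeps a strongly reduced score, and a later score threshold decides what is kept.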

17. GeoGraph: Learning graph-based multi-view object detection with geometric cues end-to-end
Ahmed Samy Nassar, Sébastien Lefèvre, Jan D. Wegner
Abstract: In this paper we propose an end-to-end learnable approach that detects static urban objects from multiple views, re-identifies instances, and finally assigns a geographic position per object. Our method relies on a Graph Neural Network (GNN) to detect all objects and output their geographic positions given images and approximate camera poses as input. Our GNN simultaneously models relative pose and image evidence, and is further able to deal with an arbitrary number of input views. Our method is robust to occlusion, similar appearance of neighboring objects, and severe changes in viewpoint by jointly reasoning about visual image appearance and relative pose. Experimental evaluation on two challenging, large-scale datasets and comparison with state-of-the-art methods show significant and systematic improvements both in accuracy and efficiency, with 2-6% gain in detection and re-ID average precision as well as 8x reduction of training time.
Summary: GeoGraph is an end-to-end learnable approach that detects static urban objects from multiple views, re-identifies instances, and assigns a geographic position to each object using a Graph Neural Network (GNN). It reports a 2-6% gain in detection and re-ID average precision and an 8x reduction of training time.

18. EPSNet: Efficient Panoptic Segmentation Network with Cross-layer Attention Fusion
Chia-Yuan Chang, Shuo-En Chang, Pei-Yung Hsiao, Li-Chen Fu
Abstract: Panoptic segmentation is a scene parsing task which unifies semantic segmentation and instance segmentation into one single task. However, current state-of-the-art studies have paid little attention to inference time. In this work, we propose an Efficient Panoptic Segmentation Network (EPSNet) to tackle the panoptic segmentation tasks with fast inference speed. Basically, EPSNet generates masks based on a simple linear combination of prototype masks and mask coefficients. The light-weight network branches for instance segmentation and semantic segmentation only need to predict mask coefficients and produce masks with the shared prototypes predicted by the prototype network branch. Furthermore, to enhance the quality of shared prototypes, we adopt a module called "cross-layer attention fusion module", which aggregates the multi-scale features with an attention mechanism, helping them capture the long-range dependencies between each other. To validate the proposed work, we have conducted various experiments on the challenging COCO panoptic dataset, which achieve highly promising performance with significantly faster inference speed (53ms on GPU).
Summary: EPSNet is an efficient panoptic segmentation network that generates masks as a simple linear combination of shared prototype masks and predicted mask coefficients. A cross-layer attention fusion module improves the prototypes, and inference takes 53ms on GPU on the COCO panoptic dataset.
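The "linear combination of prototype masks and mask coefficients" can be made concrete with a small sketch. Each output pixel is the coefficient-weighted sum of the prototype values at that pixel. The sigmoid and the 0.5 threshold applied afterwards are assumptions borrowed from common practice, not details given in the abstract, and `assemble_mask` is a hypothetical name.

```python
import math


def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Combine K prototype masks (each HxW) with K coefficients.

    Returns a binary HxW mask. The sigmoid and the 0.5 threshold are
    assumptions of this sketch, not details given in the abstract.
    """
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # Linear combination of the K prototype values at this pixel
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask
```

Since the prototypes are shared, each instance or semantic class only needs K numbers to define its mask, which is where the efficiency of this design comes from.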

19. Depth Edge Guided CNNs for Sparse Depth Upsampling
Yi Guo, Ji Liu
Abstract: Guided sparse depth upsampling aims to upsample an irregularly sampled sparse depth map when an aligned high-resolution color image is given as guidance. Many neural networks have been designed for this task. However, they often ignore the structural difference between the depth and the color image, resulting in obvious artifacts such as texture copy and depth blur in the upsampled depth. Inspired by the normalized convolution operation, we propose a guided convolutional layer to recover dense depth from a sparse and irregular depth image with a depth edge image as guidance. Our novel guided network can prevent the depth value from crossing the depth edge to facilitate upsampling. We further design a convolution network based on the proposed convolutional layer to combine the advantages of different algorithms and achieve better performance. We conduct comprehensive experiments to verify our method on real-world indoor and synthetic outdoor datasets. Our method produces strong results. It outperforms state-of-the-art methods on the Virtual KITTI dataset and the Middlebury dataset. It also presents strong generalization capability under different 3D point densities, various lighting and weather conditions.
Summary: Inspired by normalized convolution, a guided convolutional layer recovers dense depth from sparse, irregular depth samples using a depth edge image as guidance, preventing depth values from crossing depth edges. The method outperforms state-of-the-art methods on the Virtual KITTI and Middlebury datasets.
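Normalized convolution, the operation this entry cites as inspiration, divides the filtered signal by the filtered validity mask so that missing samples do not drag the result toward zero. The 1D sketch below shows only that base operation. The depth-edge guidance that the paper adds on top is not modelled, and `normalized_conv1d` is a hypothetical name.

```python
def normalized_conv1d(values, mask, kernel):
    """Normalized convolution on a sparse 1D signal.

    values: samples (ignored where mask is 0); mask: 1 = valid, 0 = missing.
    kernel: odd-length list of non-negative weights.
    Each output is the kernel-weighted mean of the valid neighbours only.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(values)):
        num = den = 0.0
        for k, wgt in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(values) and mask[j]:
                num += wgt * values[j]
                den += wgt
        # No valid neighbour inside the window: leave the output empty (0.0)
        out.append(num / den if den > 0 else 0.0)
    return out
```

A guided variant could, for example, zero the weight of any neighbour that lies across a depth edge, which would keep foreground and background depths from being averaged together.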

20. Multi-Plateau Ensemble for Endoscopic Artefact Segmentation and Detection
Suyog Jadhav, Udbhav Bamba, Arnav Chavan, Rishabh Tiwari, Aryan Raj
Abstract: The endoscopic artefact detection challenge consists of 1) artefact detection, 2) semantic segmentation, and 3) out-of-sample generalisation. For the semantic segmentation task, we propose a multi-plateau ensemble of FPN (Feature Pyramid Network) with EfficientNet as feature extractor/encoder. For the object detection task, we used a three-model ensemble of RetinaNet with a Resnet50 backbone and FasterRCNN (FPN + DC5) with a Resnext101 backbone. A PyTorch implementation of our approach to the problem is available at this https URL.

21. Efficient Crowd Counting via Structured Knowledge Transfer
Lingbo Liu, Jiaqi Chen, Hefeng Wu, Tianshui Chen, Guanbin Li, Liang Lin
Abstract: Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications. However, most previous works relied on heavy backbone networks and required prohibitive runtimes, which would seriously restrict their deployment scopes and cause poor scalability. To liberate these crowd counting models, we propose a novel Structured Knowledge Transfer (SKT) framework integrating two complementary transfer modules, which can generate a lightweight but still highly effective student network by fully exploiting the structured knowledge of a well-trained teacher network. Specifically, an Intra-Layer Pattern Transfer sequentially distills the knowledge embedded in single-layer features of the teacher network to guide feature learning of the student network. Simultaneously, an Inter-Layer Relation Transfer densely distills the cross-layer correlation knowledge of the teacher to regularize the student's feature evolution. In this way, our student network can learn compact and knowledgeable features, yielding high efficiency and competitive performance. Extensive evaluations on three benchmarks well demonstrate the knowledge transfer effectiveness of our SKT for extensive crowd counting models. In particular, only having one-sixteenth of the parameters and computation cost of original models, our distilled VGG-based models obtain at least 6.5$\times$ speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.

22. Illumination-based Transformations Improve Skin Lesion Segmentation in Dermoscopic Images
Kumar Abhishek, Ghassan Hamarneh, Mark S. Drew
Abstract: The semantic segmentation of skin lesions is an important and common initial task in the computer aided diagnosis of dermoscopic images. Although deep learning-based approaches have considerably improved the segmentation accuracy, there is still room for improvement by addressing the major challenges, such as variations in lesion shape, size, color and varying levels of contrast. In this work, we propose the first deep semantic segmentation framework for dermoscopic images which incorporates, along with the original RGB images, information extracted using the physics of skin illumination and imaging. In particular, we incorporate information from specific color bands, illumination invariant grayscale images, and shading-attenuated images. We evaluate our method on three datasets: the ISBI ISIC 2017 Skin Lesion Segmentation Challenge dataset, the DermoFit Image Library, and the PH2 dataset and observe improvements of 12.02%, 4.30%, and 8.86% respectively in the mean Jaccard index over a baseline model trained only with RGB images.

23. ASLFeat: Learning Local Features of Accurate Shape and Localization
Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, Long Quan
Abstract: This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors. First, the ability to estimate the local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, while shape-awareness is crucial to acquire stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three light-weight yet effective modifications to mitigate the above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformation. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is extensively conducted across a variety of practical scenarios. We report state-of-the-art results that demonstrate the superiority of our methods.

24. Linguistically Driven Graph Capsule Network for Visual Question Reasoning
Qingxing Cao, Xiaodan Liang, Keze Wang, Liang Lin
Abstract: Recently, studies of visual question answering have explored various architectures of end-to-end networks and achieved promising results on both natural and synthetic datasets, which require explicitly compositional reasoning. However, it has been argued that these black-box approaches lack interpretability of results, and thus cannot perform well on generalization tasks due to overfitting the dataset bias. In this work, we aim to combine the benefits of both sides and overcome their limitations to achieve an end-to-end interpretable structural reasoning for general images without the requirement of layout annotations. Inspired by the property of a capsule network that can carve a tree structure inside a regular convolutional neural network (CNN), we propose a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network", where the compositional process is guided by the linguistic parse tree. Specifically, we bind each capsule in the lowest layer to bridge the linguistic embedding of a single word in the original question with visual evidence and then route them to the same capsule if they are siblings in the parse tree. This compositional process is achieved by performing inference on a linguistically driven conditional random field (CRF) and is performed across multiple graph capsule layers, which results in a compositional reasoning process inside a CNN. Experiments on the CLEVR dataset, CLEVR compositional generation test, and FigureQA dataset demonstrate the effectiveness and composition generalization ability of our end-to-end model.

25. Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning
Thiago M. Paixao, Rodrigo F. Berriel, Maria C. S. Boeres, Alessando L. Koerich, Claudine Badue, Alberto F. De Souza, Thiago Oliveira-Santos
Abstract: The reconstruction of shredded documents consists in arranging the pieces of paper (shreds) in order to reassemble the original aspect of such documents. This task is particularly relevant for supporting forensic investigation as documents may contain criminal evidence. As an alternative to the laborious and time-consuming manual process, several researchers have been investigating ways to perform automatic digital reconstruction. A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds, notably for binary text documents. In this context, deep learning has enabled great progress for accurate reconstructions in the domain of mechanically-shredded documents. A sensitive issue, however, is that current deep model solutions require an inference whenever a pair of shreds has to be evaluated. This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly (rather than quadratically) with the number of shreds. Instead of predicting compatibility directly, deep models are leveraged to asymmetrically project the raw shred content onto a common metric space in which distance is proportional to the compatibility. Experimental results show that our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds (20 mixed shredded-pages from different documents).
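The scaling argument in the abstract (model inferences grow linearly with the number of shreds, while pairwise compatibility is read off as a distance in a shared metric space) can be sketched as below. The functions `embed_right` and `embed_left` are hypothetical stand-ins for the two learned projections; here they simply copy a shred's boundary column. Treating a smaller distance as a better fit is also an assumption of this sketch.

```python
import math


def embed_right(shred):
    """Hypothetical stand-in for the learned projection of a shred's right side."""
    return [row[-1] for row in shred]


def embed_left(shred):
    """Hypothetical stand-in for the learned projection of a shred's left side."""
    return [row[0] for row in shred]


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_costs(shreds):
    # One pass over the shreds per projection: O(n) "model inferences".
    rights = [embed_right(s) for s in shreds]
    lefts = [embed_left(s) for s in shreds]
    n = len(shreds)
    # The O(n^2) part is only cheap distance arithmetic on cached embeddings.
    return [
        [distance(rights[i], lefts[j]) if i != j else math.inf for j in range(n)]
        for i in range(n)
    ]


# Three toy shreds, each a 2x2 grid of pixel intensities.
shreds = [
    [[0.0, 0.2], [1.0, 0.8]],  # right edge (0.2, 0.8)
    [[0.9, 0.5], [0.1, 0.5]],  # left edge (0.9, 0.1)
    [[0.2, 0.9], [0.8, 0.1]],  # left edge (0.2, 0.8), right edge (0.9, 0.1)
]
costs = pairwise_costs(shreds)
# For each shred, the index of the shred that fits best on its right.
best_right_neighbor = [
    min(range(len(shreds)), key=lambda j: costs[i][j]) for i in range(len(shreds))
]
```

A model that scores pairs directly would need one forward pass per ordered pair; caching per-shred embeddings moves the quadratic work out of the network.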

26. Architectural Resilience to Foreground-and-Background Adversarial Noise
Carl Cheng, Evan Hu
Abstract: Adversarial attacks in the form of imperceptible perturbations of normal images have been extensively studied, and for every new defense methodology created, multiple adversarial attacks are found to counteract it. In particular, a popular style of attack, exemplified in recent years by DeepFool and Carlini-Wagner, relies solely on white-box scenarios in which full access to the predictive model and its weights is required. In this work, we instead propose distinct model-agnostic benchmark perturbations of images in order to investigate the resilience and robustness of different network architectures. Results empirically determine that increasing depth within most types of Convolutional Neural Networks typically improves model resilience towards general attacks, with improvement steadily decreasing as the model becomes deeper. Additionally, we find that a notable difference in adversarial robustness exists between residual architectures with skip connections and non-residual architectures of similar complexity. Our findings provide direction for future understanding of the effect of residual connections and depth on network robustness.

27. Additive Angular Margin for Few Shot Learning to Classify Clinical Endoscopy Images
Sharib Ali, Binod Bhattaria, Tae-Kyun Kim, Jens Rittscher
Abstract: Endoscopy is a widely used imaging modality to diagnose and treat diseases in hollow organs, for example the gastrointestinal tract, the kidney and the liver. However, varied modalities and the use of different imaging protocols at various clinical centers impose significant challenges when generalising deep learning models. Moreover, the assembly of large datasets from different clinical centers can introduce a huge label bias that renders any learnt model unusable. Also, when using a new modality or in the presence of images with rare patterns, a bulk amount of similar image data and their corresponding labels is required for training these models. In this work, we propose to use a few-shot learning approach that requires less training data and can be used to predict label classes of test samples from an unseen dataset. We propose a novel additive angular margin metric in the framework of a prototypical network in the few-shot learning setting. We compare our approach to several established methods on a large cohort of multi-center, multi-organ, and multi-modal endoscopy data. The proposed algorithm outperforms existing state-of-the-art methods.
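As a rough illustration of an additive angular margin on top of prototype-based classification, the sketch below scores a query embedding against class prototypes by cosine similarity and, during training, adds a margin to the angle of the true class. The function names, the margin of 0.2 radians, and the scale of 10 are assumptions; the abstract does not give the exact formulation used in the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_prototype(embeddings):
    """Prototype = mean of the support embeddings of one class."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]


def margin_logits(query, prototypes, true_class, margin=0.2, scale=10.0):
    """Scaled cosine logits with an additive angular margin on the true class."""
    logits = []
    for k, proto in enumerate(prototypes):
        cos_theta = max(-1.0, min(1.0, cosine(query, proto)))
        theta = math.acos(cos_theta)
        if k == true_class:
            # Make the true class harder to satisfy, which pushes embeddings
            # of the same class closer together in angle during training.
            theta += margin
        logits.append(scale * math.cos(theta))
    return logits


prototypes = [
    class_prototype([[1.0, 0.0], [1.0, 0.0]]),
    class_prototype([[0.0, 1.0], [0.0, 1.0]]),
]
query = [1.0, 0.0]
with_margin = margin_logits(query, prototypes, true_class=0)
without_margin = margin_logits(query, prototypes, true_class=0, margin=0.0)
```

At test time the margin would normally be dropped, so prediction reduces to picking the prototype with the highest cosine similarity.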

28. Dynamic ReLU
Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, Zicheng Liu
Abstract: Rectified linear units (ReLU) are commonly used in deep neural networks. So far ReLU and its generalizations (either non-parametric or parametric) are static, performing identically for all input samples. In this paper, we propose Dynamic ReLU (DY-ReLU), a dynamic rectifier whose parameters are input-dependent as a hyper function over all input elements. The key insight is that DY-ReLU encodes the global context into the hyper function and adapts the piecewise linear activation function accordingly. Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability, especially for light-weight neural networks. By simply using DY-ReLU for MobileNetV2, the top-1 accuracy on ImageNet classification is boosted from 72.0% to 76.2% with only 5% additional FLOPs.
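The idea of an input-dependent piecewise linear activation can be sketched as a maximum over a few linear pieces whose slopes and intercepts are produced from the whole input. The `hyper_function` below is a hypothetical stand-in (it only looks at the mean of the input); the learned hyper function in the paper is not specified in the abstract.

```python
import math


def hyper_function(x):
    """Hypothetical stand-in: map the global context of x to two linear pieces."""
    context = sum(x) / len(x)
    # With zero context this reduces to the static ReLU, max(1*x, 0*x).
    slopes = [1.0, 0.1 * math.tanh(context)]
    intercepts = [0.0, 0.0]
    return slopes, intercepts


def dynamic_relu(x):
    """y_i = max_k (a_k(x) * x_i + b_k(x)), with a_k and b_k depending on all of x."""
    slopes, intercepts = hyper_function(x)
    return [max(a * v + b for a, b in zip(slopes, intercepts)) for v in x]


# Zero-mean input: behaves like the ordinary ReLU.
static_like = dynamic_relu([-2.0, 0.0, 2.0])
# Positive-mean input: the second piece gets a small positive slope (leaky behaviour).
adapted = dynamic_relu([-2.0, 1.0, 4.0])
```

The same element value (-2.0) is mapped differently in the two calls, which is the "dynamic" property the abstract contrasts with static ReLU variants.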

29. High Performance Sequence-to-Sequence Model for Streaming Speech Recognition
Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stueker, Alex Waibel
Abstract: Recently, sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., when the complete audio data is available at the start of processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real-time and with a low word-based latency, these models face several challenges. For many techniques, the whole audio sequence to be decoded needs to be available at the start of the processing, e.g., for the attention mechanism or for the bidirectional LSTM (BLSTM). In this paper we propose several techniques to mitigate these problems. We introduce an additional loss function controlling the uncertainty of the attention mechanism, a modified beam search identifying partial, stable hypotheses, ways of working with BLSTM in the encoder, and the use of chunked BLSTM. Our experiments show that with the right combination of these techniques it is possible to perform run-on speech recognition with a low word-based latency without sacrificing performance in terms of word error rate.

30. Self-Supervised 2D Image to 3D Shape Translation with Disentangled Representations
Berk Kaya, Radu Timofte
Abstract: We present a framework to translate between 2D image views and 3D object shapes. Recent progress in deep learning enabled us to learn structure-aware representations from a scene. However, the existing literature assumes that pairs of images and 3D shapes are available for training in full supervision. In this paper, we propose SIST, a Self-supervised Image to Shape Translation framework that fulfills three tasks: (i) reconstructing the 3D shape from a single image; (ii) learning disentangled representations for shape, appearance and viewpoint; and (iii) generating a realistic RGB image from these independent factors. In contrast to the existing approaches, our method does not require image-shape pairs for training. Instead, it uses unpaired image and shape datasets from the same object class and jointly trains image generator and shape reconstruction networks. Our translation method achieves promising results, comparable in quantitative and qualitative terms to the state-of-the-art achieved by fully-supervised methods.

31. A Better Variant of Self-Critical Sequence Training
Ruotian Luo
Abstract: In this work, we present a simple yet better variant of Self-Critical Sequence Training. We make a simple change in the choice of baseline function in the REINFORCE algorithm. The new baseline can bring better performance with no extra cost, compared to the greedy decoding baseline.
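To make the role of the baseline concrete: in REINFORCE with a baseline, each sampled sequence's log-probability gradient is weighted by its reward minus the baseline. The sketch below contrasts the greedy decoding baseline named in the abstract with one alternative, the mean reward of the other samples drawn for the same input. The abstract does not say which baseline the paper adopts, so the leave-one-out choice here is an assumption, as are the function names and toy rewards.

```python
def advantages_greedy(sample_rewards, greedy_reward):
    """Self-critical weights: sampled reward minus the greedy-decoded reward."""
    return [r - greedy_reward for r in sample_rewards]


def advantages_leave_one_out(sample_rewards):
    """Assumed alternative: baseline is the mean reward of the other samples."""
    n = len(sample_rewards)
    total = sum(sample_rewards)
    return [r - (total - r) / (n - 1) for r in sample_rewards]


# Toy rewards (for example, caption scores) of three sampled sequences for one input.
rewards = [1.0, 2.0, 3.0]
greedy_weights = advantages_greedy(rewards, greedy_reward=2.0)
loo_weights = advantages_leave_one_out(rewards)
```

A baseline built from samples that are already being scored needs no separate greedy decoding pass, which would fit the "no extra cost" remark in the abstract.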

32. The Instantaneous Accuracy: a Novel Metric for the Problem of Online Human Behaviour Recognition in Untrimmed Videos
Marcos Baptista Rios, Roberto J. Lopez-Sastre, Fabian Caba Heilbron, Jan van Gemert
Abstract: The problem of Online Human Behaviour Recognition in untrimmed videos, aka Online Action Detection (OAD), needs to be revisited. Unlike traditional offline action detection approaches, where the evaluation metrics are clear and well established, in the OAD setting we find few works and no consensus on the evaluation protocols to be used. In this paper we introduce a novel online metric, the Instantaneous Accuracy (IA), that exhibits an online nature, solving most of the limitations of the previous (offline) metrics. We conduct a thorough experimental evaluation on the TVSeries dataset, comparing the performance of various baseline methods to the state of the art. Our results confirm the problems of previous evaluation protocols, and suggest that an IA-based protocol is more adequate to the online scenario for human behaviour understanding. Code for the metric is available at this https URL.

33. Curved Buildings Reconstruction from Airborne LiDAR Data by Matching and Deforming Geometric Primitives
Jingwei Song, Shaobo Xia, Jun Wang, Dong Chen
Abstract: Airborne LiDAR (Light Detection and Ranging) data is widely applied in building reconstruction, with studies reporting success in typical buildings. However, the reconstruction of curved buildings remains an open research problem. To this end, we propose a new framework for curved building reconstruction via assembling and deforming geometric primitives. The input LiDAR point cloud is first converted into contours where individual buildings are identified. After recognizing geometric units (primitives) from building contours, we get initial models by matching basic geometric primitives to these primitives. To polish the assembled models, we employ a warping field for model refinements. Specifically, an embedded deformation (ED) graph is constructed via downsampling the initial model. Then, the point-to-model displacements are minimized by adjusting node parameters in the ED graph based on our objective function. The presented framework is validated on several highly curved buildings collected by various LiDAR in different cities. The experimental results, as well as accuracy comparison, demonstrate the advantage and effectiveness of our method. The new insight attributes to an efficient reconstruction manner. Moreover, we prove that the primitive-based framework significantly reduces the data storage to 10-20 percent of classical mesh models.

34. Ensembles of Deep Neural Networks for Action Recognition in Still Images
Sina Mohammadi, Sina Ghofrani Majelan, Shahriar B. Shokouhi
Abstract: Despite the fact that notable improvements have been made recently in the field of feature extraction and classification, human action recognition is still challenging, especially in images, in which, unlike videos, there is no motion. Thus, the methods proposed for recognizing human actions in videos cannot be applied to still images. A big challenge in action recognition in still images is the lack of large enough datasets, which is problematic for training deep Convolutional Neural Networks (CNNs) due to the overfitting issue. In this paper, by taking advantage of pre-trained CNNs, we employ the transfer learning technique to tackle the lack of massive labeled action recognition datasets. Furthermore, since the last layer of the CNN has class-specific information, we apply an attention mechanism on the output feature maps of the CNN to extract more discriminative and powerful features for classification of human actions. Moreover, we use eight different pre-trained CNNs in our framework and investigate their performance on the Stanford 40 dataset. Finally, we propose using the Ensemble Learning technique to enhance the overall accuracy of action classification by combining the predictions of multiple models. The best setting of our method is able to achieve 93.17% accuracy on the Stanford 40 dataset.
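Combining the predictions of multiple models can be as simple as averaging their class probabilities and taking the arg max. The sketch below shows that scheme; unweighted averaging is an assumption, since the abstract does not state which combination rule the authors use, and the probabilities are toy values.

```python
def ensemble_predict(per_model_probs):
    """Average class probabilities over models and return (class index, averaged probs)."""
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    averaged = [
        sum(probs[c] for probs in per_model_probs) / num_models
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=lambda c: averaged[c])
    return best, averaged


# Toy softmax outputs of three fine-tuned CNNs for one image and two action classes.
per_model_probs = [
    [0.6, 0.4],  # model 0 alone would predict class 0
    [0.3, 0.7],
    [0.2, 0.8],
]
predicted_class, averaged_probs = ensemble_predict(per_model_probs)
```

Here one model disagrees with the other two, and the averaged prediction follows the more confident majority.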

35. Low Latency ASR for Simultaneous Speech Translation
Thai Son Nguyen, Jan Niehues, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Muller, Matthias Sperber, Sebastian Stueker, Alex Waibel
Abstract: User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We have therefore worked on several techniques for reducing the latency of both components, the automatic speech recognition and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we focused on word latency. We used it to analyze the performance of our current system and to identify opportunities for improvements. In order to minimize the latency we combined run-on decoding with a technique for identifying stable partial hypotheses when stream decoding and a protocol for dynamic output update that allows revising the most recent parts of the transcription. This combination reduces the latency at word level, where the words are final and will never be updated again in the future, from 18.1s to 1.1s without sacrificing performance in terms of word error rate.

36. COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest Radiography Images
Linda Wang, Alexander Wong
Abstract: The COVID-19 pandemic continues to have a devastating effect on the health and well-being of the global population. A critical step in the fight against COVID-19 is effective screening of infected patients, with one of the key screening approaches being radiological imaging using chest radiography. Motivated by this, a number of artificial intelligence (AI) systems based on deep learning have been proposed and results have been shown to be quite promising in terms of accuracy in detecting patients infected with COVID-19 using chest radiography images. However, to the best of the authors' knowledge, these developed AI systems have been closed source and unavailable to the research community for deeper understanding and extension, and unavailable for public access and use. Therefore, in this study we introduce COVID-Net, a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest radiography images that is open source and available to the general public. We also describe the chest radiography dataset leveraged to train COVID-Net, which we will refer to as COVIDx and which comprises 5941 posteroanterior chest radiography images across 2839 patient cases from two open access data repositories. Furthermore, we investigate how COVID-Net makes predictions using an explainability method in an attempt to gain deeper insights into critical factors associated with COVID cases, which can aid clinicians in improved screening. By no means a production-ready solution, the hope is that the open access COVID-Net, along with the description on constructing the open source COVIDx dataset, will be leveraged and built upon by both researchers and citizen data scientists alike to accelerate the development of highly accurate yet practical deep learning solutions for detecting COVID-19 cases and accelerate treatment of those who need it the most.

37. Progressive Domain-Independent Feature Decomposition Network for Zero-Shot Sketch-Based Image Retrieval
Xinxun Xu, Cheng Deng, Muli Yang, Hao Wang
Abstract: Zero-shot sketch-based image retrieval (ZS-SBIR) is a specific cross-modal retrieval task for searching natural images given free-hand sketches under the zero-shot scenario. Most existing methods solve this problem by simultaneously projecting visual features and semantic supervision into a low-dimensional common space for efficient retrieval. However, such low-dimensional projection destroys the completeness of semantic knowledge in the original semantic space, so that it is unable to transfer useful knowledge well when learning semantics from different modalities. Moreover, the domain information and semantic information are entangled in visual features, which is not conducive to cross-modal matching since it hinders the reduction of the domain gap between sketch and image. In this paper, we propose a Progressive Domain-independent Feature Decomposition (PDFD) network for ZS-SBIR. Specifically, with the supervision of original semantic knowledge, PDFD decomposes visual features into domain features and semantic ones, and then the semantic features are projected into common space as retrieval features for ZS-SBIR. The progressive projection strategy maintains strong semantic supervision. Besides, to guarantee that the retrieval features capture clean and complete semantic information, a cross-reconstruction loss is introduced to encourage any combination of retrieval features and domain features to reconstruct the visual features. Extensive experiments demonstrate the superiority of our PDFD over state-of-the-art competitors.

38. Large-Scale Screening of COVID-19 from Community Acquired Pneumonia using Infection Size-Aware Classification
Feng Shi, Liming Xia, Fei Shan, Dijia Wu, Ying Wei, Huan Yuan, Huiting Jiang, Yaozong Gao, He Sui, Dinggang Shen
Abstract: The worldwide spread of coronavirus disease (COVID-19) has become a threatening risk for global public health. It is of great importance to rapidly and accurately screen patients with COVID-19 from community acquired pneumonia (CAP). In this study, a total of 1658 patients with COVID-19 and 1027 patients with CAP underwent thin-section CT. All images were preprocessed to obtain the segmentations of both infections and lung fields, which were used to extract location-specific features. An infection Size Aware Random Forest method (iSARF) was proposed, in which subjects were automatically categorized into groups with different ranges of infected lesion sizes, followed by random forests in each group for classification. Experimental results show that the proposed method yielded a sensitivity of 0.907, a specificity of 0.833, and an accuracy of 0.879 under five-fold cross-validation. Large performance margins against comparison methods were achieved especially for the cases with infection size in the medium range, from 0.01% to 10%. The further inclusion of Radiomics features shows slight improvement. It is anticipated that our proposed framework could assist clinical decision making.

39. Visual Question Answering for Cultural Heritage
Pietro Bongini, Federico Becattini, Andrew D. Bagdanov, Alberto Del Bimbo
Abstract: Technology and the fruition of cultural heritage are becoming increasingly entwined, especially with the advent of smart audio guides, virtual and augmented reality, and interactive installations. Machine learning and computer vision are important components of this ongoing integration, enabling new interaction modalities between user and museum. Nonetheless, the most frequent way of interacting with paintings and statues still remains taking pictures. Yet images alone can only convey the aesthetics of the artwork, lacking information that is often required to fully understand and appreciate it. Usually this additional knowledge comes both from the artwork itself (and therefore the image depicting it) and from an external source of knowledge, such as an information sheet. While the former can be inferred by computer vision algorithms, the latter needs more structured data to pair visual content with relevant information. Regardless of its source, this information still must be effectively transmitted to the user. A popular emerging trend in computer vision is Visual Question Answering (VQA), in which users can interact with a neural network by posing questions in natural language and receiving answers about the visual content. We believe that this will be the evolution of smart audio guides for museum visits and simple image browsing on personal smartphones. This will turn the classic audio guide into a smart personal instructor with which the visitor can interact by asking for explanations focused on specific interests. The advantages are twofold: on the one hand the cognitive burden of the visitor will decrease, limiting the flow of information to what the user actually wants to hear; and on the other hand it proposes the most natural way of interacting with a guide, favoring engagement.
摘要：技术和文化遗产的成果正变得越来越交织在一起，尤其是在智能语音导览，虚拟和增强现实和交互设备的出现。机器学习和计算机视觉此正在进行整合的重要组成部分，使用户和博物馆之间的新的交互方式。尽管如此，与绘画和雕像交互最频繁的方式仍然是拍照。然而，单独的图像只能传达出作品的美感，缺乏的是充分了解和欣赏它这往往是所需的信息。通常这个附加知识从艺术品本身自带两者（并且因此图像描绘它），并从知识，外部源诸如信息片。虽然前者可通过计算机视觉算法来推断，后者需要更结构化的数据进行配对可视内容相关的信息。无论其来源，这些信息仍然必须有效地传递给用户。在计算机视觉一种流行的新兴的趋势是视觉答疑（VQA），用户可以在里面冒充自然语言问题，并接收关于视觉内容的答案与神经网络交互。我们相信，这将是参观博物馆和个人的智能手机简单的图像浏览功能的智能语音导游的演变。这会变成经典的音频引导到智能个人教练与访问者可以通过询问的解释集中在特定的利益互动。优点有两方面：一方面是游客的认知负担将减少，限制信息，以用户实际希望听到的流量;而在另一方面，它提出了指导互动，有利于参与的最自然的方式。

40. Universal Differentiable Renderer for Implicit Neural Representations
Lior Yariv, Matan Atzmon, Yaron Lipman
Abstract: The goal of this work is to learn implicit 3D shape representation with 2D supervision (i.e., a collection of images). To that end we introduce the Universal Differentiable Renderer (UDR), a neural network architecture that can provably approximate reflected light from an implicit neural representation of a 3D surface, under a wide set of reflectance properties and lighting conditions. Experimenting with the task of multiview 3D reconstruction, we find that our model improves upon the baselines in the accuracy of the reconstructed 3D geometry and in rendering from unseen viewing directions.
摘要：这项工作的目的是学习与2D监督（即图像的集合）隐含的3D形状表示。为此，我们引入通用可微渲染器（UDR），其可以可证明近似反射从3D表面的隐含神经表示光的神经网络结构，在宽组反射性能和照明条件的。用多视点3D重建任务试验，我们发现我们的模型中时的准确性基线提高重建的三维几何和看不见的观察方向呈现。

41. Review of data analysis in vision inspection of power lines with an in-depth discussion of deep learning technology
Xinyu Liu, Xiren Miao, Hao Jiang, Jing Chen
Abstract: The widespread popularity of unmanned aerial vehicles enables an immense amount of power line inspection data to be collected. How to employ massive inspection data, especially visible images, to maintain the reliability, safety, and sustainability of power transmission is a pressing issue. To date, substantial work has been conducted on the analysis of power line inspection data. With the aim of providing a comprehensive overview for researchers who are interested in developing a deep-learning-based analysis system for power line inspection data, this paper conducts a thorough review of the current literature and identifies the challenges for future research. Following the typical procedure of inspection data analysis, we categorize current works in this area into component detection and fault diagnosis. For each aspect, the techniques and methodologies adopted in the literature are summarized. Some valuable information is also included, such as data descriptions and method performance. Further, an in-depth discussion of existing deep-learning-related analysis methods in power line inspection is provided. Finally, we conclude the paper with several research trends for the future of this area, such as data quality problems, small object detection, embedded application, and evaluation baseline.
摘要：无人机的广泛普及使得收集电源线检测数据的巨大数额。如何使用大量的检测数据尤其是可见光图像上从而保持可靠性，安全性和动力传输的可持续性是一个迫切的问题。到目前为止，大量的工程已在电力线检测数据的分析进行。随着提供谁有兴趣开发用于电源线的检查数据基于深学习分析系统，研究人员全面了解的目的，本文进行目前的文献，并确定了今后的研究面临的挑战进行了全面审查。以下检查数据分析的典型过程中我们将在这方面现在的作品到成分检测和故障诊断。对于每个方面，在文献中采用的技术和方法进行了总结。一些有价值的信息也被包括诸如数据描述和方法的性能。此外，现有的电力线检查深学习相关的分析方法的深入讨论，提出了。最后，我们的结论纸与几个研究趋势这一领域的未来，如数据质量问题，小物件检测，嵌入式应用和评价基准。

42. Modal Regression based Structured Low-rank Matrix Recovery for Multi-view Learning
Jiamiao Xu, Fangzhao Wang, Qinmu Peng, Xinge You, Shuo Wang, Xiao-Yuan Jing, C. L. Philip Chen
Abstract: Low-rank Multi-view Subspace Learning (LMvSL) has shown great potential in cross-view classification in recent years. Despite their empirical success, existing LMvSL-based methods cannot handle view discrepancy and discriminancy well at the same time, which leads to performance degradation when there is a large discrepancy among multi-view data. To circumvent this drawback, motivated by block-diagonal representation learning, we propose Structured Low-rank Matrix Recovery (SLMR), a unique method of effectively removing view discrepancy and improving discriminancy through the recovery of a structured low-rank matrix. Furthermore, recent low-rank modeling provides a satisfactory solution for data contaminated by noise that follows a predefined distribution, such as a Gaussian or Laplacian distribution. However, these models are not practical, since complicated noise in practice may violate those assumptions and the distribution is generally unknown in advance. To alleviate this limitation, modal regression is elegantly incorporated into the framework of SLMR (termed MR-SLMR). Unlike previous LMvSL-based methods, our MR-SLMR can handle any zero-mode noise variable, which covers a wide range of noise, such as Gaussian noise, random noise and outliers. The alternating direction method of multipliers (ADMM) framework and half-quadratic theory are used to efficiently optimize MR-SLMR. Experimental results on four public databases demonstrate the superiority of MR-SLMR and its robustness to complicated noise.
摘要：低等级的多视角子空间学习（LMvSL）在最近几年中所示的交叉视角分类的巨大潜力。尽管他们的经验的成功，现有的基于LMvSL方法不能很好的处理视图差异和discriminancy同时，这从而导致性能下降时，有多视角数据中大的出入。为了避免这个缺点，通过块对角表示学习动机，我们建议结构化低秩矩阵恢复（SLMR），有效地除去视图差异和通过构造低秩矩阵的恢复改善discriminancy的独特方法。此外，最近的低秩建模提供了一种令人满意的解决方案，以通过噪声分布的预定义的假设，如高斯或拉普拉斯分布污染地址数据。然而，这些模型是不实际的，因为在实践中复杂的噪声可能会违反这些假设和分布一般是未知的提前。为了减轻这样的限制，模态回归典雅并入SLMR的框架（术语它MR-SLMR）。从前面的LMvSL为基础的方法不同，我们的MR-SLMR可以处理包含宽范围噪声，例如高斯噪声，随机噪声和异常值的任何零模式噪声变量。的乘法器（ADMM）框架和半二次理论的交替方向法被用来有效地优化MR-SLMR。四个公共数据库的实验结果表明，MR-SLMR的优越性，它的坚固性，以复杂的噪音。

43. Exploring Bottom-up and Top-down Cues with Attentive Learning for Webly Supervised Object Detection
Zhonghua Wu, Qingyi Tao, Guosheng Lin, Jianfei Cai
Abstract: Fully supervised object detection has achieved great success in recent years. However, abundant bounding-box annotations are needed for training a detector for novel classes. To reduce the human labeling effort, we propose a novel webly supervised object detection (WebSOD) method for novel classes which only requires web images without further annotations. Our proposed method combines bottom-up and top-down cues for novel class detection. Within our approach, we introduce a bottom-up mechanism based on a well-trained fully supervised object detector (i.e. Faster RCNN) as an object region estimator for web images by recognizing the common objectness shared by base and novel classes. With the estimated regions on the web images, we then utilize the top-down attention cues as the guidance for region classification. Furthermore, we propose a residual feature refinement (RFR) block to tackle the domain mismatch between the web domain and the target domain. We demonstrate our proposed method on the PASCAL VOC dataset with three different novel/base splits. Without any target-domain novel-class images and annotations, our proposed webly supervised object detection model is able to achieve promising performance for novel classes. Moreover, we also conduct transfer learning experiments on the large-scale ILSVRC 2013 detection dataset and achieve state-of-the-art performance.
摘要：全监督对象检测近几年取得了巨大的成功。然而，需要训练检测器，用于新种类的丰富的边框注释。为了减少人工标记的努力，提出了新颖的类的新的webly监督对象检测（WebSOD）方法，该方法仅需要不经进一步注释网页图像。我们提出的方法结合了自下而上和自上而下的线索新型类检测。在我们的方法中，我们介绍，作为通过识别由基和新颖的类共享的公共客观性的对象区域估计器，用于Web图像基于训练有素完全监控对象检测器上（即，更快的RCNN）自底向上的机制。随着对Web图像估计的地区，我们则利用自上而下的注意线索作为区域分类指导。此外，我们提出了一个残余特征细化（RFR）模块，以解决网络域名和目标域之间的域不匹配。我们证明了我们提出了三种不同的新/基分裂PASCAL VOC数据集的方法。如果没有任何目标域小说类的图像和注释，我们提出的webly监督对象检测模型能够实现新型类有前途的性能。此外，我们还对大规模ILSVRC 2013检测数据集进行迁移学习实验，实现国家的最先进的性能。

44. Mission-Aware Spatio-Temporal Deep Learning Model for UAS Instantaneous Density Prediction
Ziyi Zhao, Zhao Jin, Wentian Bai, Wentan Bai, Carlos Caicedo, M. Cenk Gursoy, Qinru Qiu
Abstract: The number of daily sUAS operations in uncontrolled low altitude airspace is expected to reach into the millions in a few years. Therefore, UAS density prediction has become an emerging and challenging problem. In this paper, a deep learning-based UAS instantaneous density prediction model is presented. The model takes two types of data as input: 1) the historical density generated from the historical data, and 2) the future sUAS mission information. The architecture of our model contains four components: Historical Density Formulation module, UAS Mission Translation module, Mission Feature Extraction module, and Density Map Projection module. The training and testing data are generated by a Python-based simulator which is inspired by the multi-agent air traffic resource usage simulator (MATRUS) framework. The quality of prediction is measured by the correlation score and the Area Under the Receiver Operating Characteristics (AUROC) between the predicted value and simulated value. The experimental results demonstrate outstanding performance of the deep learning-based UAS density predictor. Compared to the baseline models, for a simplified traffic scenario where no-fly zones and the safe distance among sUASs are not considered, our model improves the prediction accuracy by more than 15.2% and its correlation score reaches 0.947. In a more realistic scenario, where no-fly zone avoidance and the safe distance among sUASs are maintained using the A* routing algorithm, our model can still achieve a correlation score of 0.823. Meanwhile, the AUROC can reach 0.951 for the hot spot prediction.
摘要：在不受控制的低空空域日常SUAS操作的数量预计在几年内达到走入千家万户。因此，UAS密度的预测已经成为一个新兴的和具有挑战性的问题。在本文中，深学习型UAS瞬时浓度预测模型。该模型采用两种类型的数据作为输入：1）从历史数据生成历史密度，以及2）在未来SUAS任务信息。我们的模型的体系结构包括四个组成部分：历史密度配制模块，UAS任务转化模块，使命特征提取模块，和密度地图投影模块。在训练和测试数据由基于蟒模拟器其由多代理的空中交通资源使用模拟器（MATRUS）框架生成的启发。预测的质量是由预测值和模拟值之间的相关性得分下面积接受者操作特性（AUROC）测量。实验结果表明，深基础的学习-UAS密度预测的出色表现。相较于基准模型，简化业务场景sUASs中禁飞区和安全距离不认为，我们的模型超过15.2％，提高了预测精度及其相关分数达到0.947。在更现实的情况，其中禁飞区避免和sUASs之间的安全距离是使用A *路由算法维持，我们的模型仍然可以达到0.823相关性得分。同时，AUROC可以为热点预测达到0.951。
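
The two evaluation measures named in this abstract can be illustrated with a minimal sketch. It assumes that the "correlation score" is a Pearson correlation and that hot spots are cells whose simulated density exceeds a threshold; both are assumptions for illustration, and the density values and the 0.5 threshold are hypothetical, not taken from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical flattened density maps (one value per grid cell).
simulated = [0.0, 0.1, 0.4, 0.9, 0.2, 0.7]
predicted = [0.05, 0.15, 0.35, 0.8, 0.3, 0.6]
# Hypothetical hot-spot rule: simulated density above 0.5.
hot = [1 if d > 0.5 else 0 for d in simulated]

print("correlation:", round(pearson(simulated, predicted), 3))
print("AUROC:", auroc(predicted, hot))
```

The pairwise AUROC formulation is quadratic in the number of cells, which is fine for a toy example; a rank-based implementation would be preferable for full-size density maps.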

45. HDF: Hybrid Deep Features for Scene Image Representation
Chiranjibi Sitaula, Yong Xiang, Anish Basnet, Sunil Aryal, Xuequan Lu
Abstract: Nowadays it is common to take features extracted from pre-trained deep learning models as image representations, which have achieved promising classification performance. Existing methods usually consider either object-based features or scene-based features only. However, both types of features are important for complex images like scene images, as they can complement each other. In this paper, we propose a novel type of features, hybrid deep features, for scene images. Specifically, we exploit both object-based and scene-based features at two levels: part image level (i.e., parts of an image) and whole image level (i.e., a whole image), which produces a total of four types of deep features. Regarding the part image level, we also propose two new slicing techniques to extract part-based features. Finally, we aggregate these four types of deep features via the concatenation operator. We demonstrate the effectiveness of our hybrid deep features on three commonly used scene datasets (MIT-67, Scene-15, and Event-8), in terms of the scene image classification task. Extensive comparisons show that our introduced features can produce state-of-the-art classification accuracies which are more consistent and stable than the results of existing features across all datasets.
摘要：如今，它是普遍采取的预先训练的深度学习模型提取作为已经实现承诺的分类性能图像表示的特征。现有的方法通常只考虑任何基于对象的特征或基于场景的功能。然而，这两种类型的特点是对于像场景图像复杂的图像很重要，因为他们可以互相补充。在本文中，我们提出了一种新类型的特征 - 混合型深的特点，对场景图像。具体来说，我们利用在两个级别都基于对象的和场景为基础的特征：部分图像电平（即，图像的部分）和整个图像电平（即，整个图像），产生四种类型的深的总数特征。关于部分图像的水平，我们也提出了两种新的切片技术来提取基于零件特征。最后，我们通过连接符聚合这四种类型的深层特征。我们证明我们的混合深功能效果上常用的三种场景的数据集（MIT-67，场景-15，和事件-8），在场景图像分类任务的条件。广泛的比较表明，我们引入的功能可以产生国家的最先进的分类准确度这比在所有数据集现有功能的结果更加一致和稳定。

46. Lifespan Age Transformation Synthesis
Roy Or-El, Soumyadip Sengupta, Ohad Fried, Eli Shechtman, Ira Kemelmacher-Shlizerman
Abstract: We address the problem of single photo age progression and regression: the prediction of how a person might look in the future, or how they looked in the past. Most existing aging methods are limited to changing the texture, overlooking transformations in head shape that occur during the human aging and growth process. This limits the applicability of previous methods to aging of adults to slightly older adults, and application of those methods to photos of children does not produce quality results. We propose a novel multi-domain image-to-image generative adversarial network architecture, whose learned latent space models a continuous bi-directional aging process. The network is trained on the FFHQ dataset, which we labeled for ages, gender, and semantic segmentation. Fixed age classes are used as anchors to approximate continuous age transformation. Our framework can predict a full head portrait for ages 0-70 from a single photo, modifying both texture and shape of the head. We demonstrate results on a wide variety of photos and datasets, and show significant improvement over the state of the art.
摘要：针对单张照片年龄进展和一个人会是什么样的未来，以及它们如何看了过去回归预测的问题。大多数现有的老化方法不限于改变质地，俯瞰头形状时的人的衰老和生长过程中发生的转换。这限制了以前的方法是否适用于成年人的老龄化略有老年人，以及这些方法的应用，以孩子的照片不会产生高质量的结果。我们提出了一个新颖的多域图像 - 图像生成对抗性的网络架构，其了解到潜在空间模型的连续双向老化过程。该网络在FFHQ数据集，我们标示的年龄，性别和语义分割训练。固定年龄阶层作为锚近似连续年龄转型。我们的架构可以从一个单一的照片预测一个完整的头像为0-70岁，修改的头两个纹理和形状。我们证明在各种各样的照片和数据集的结果，并显示在现有技术的改进显著。

47. Monocular Depth Prediction Through Continuous 3D Loss
Minghan Zhu, Maani Ghaffari, Yuanxin Zhong, Pingping Lu, Zhong Cao, Ryan M. Eustice, Huei Peng
Abstract: This paper reports a new continuous 3D loss function for learning depth from monocular images. The dense depth prediction from a monocular image is supervised using sparse LIDAR points, exploiting available data from camera-LIDAR sensor suites during training. Currently, accurate and affordable range sensors are not available. Stereo cameras measure depth inaccurately, while LIDARs measure it sparsely and at high cost. In contrast to the current point-to-point loss evaluation approach, the proposed 3D loss treats point clouds as continuous objects; therefore, it overcomes the lack of dense ground truth depth due to the sparsity of LIDAR measurements. Experimental evaluations show that the proposed method achieves accurate depth measurement with consistent 3D geometric structures through a monocular camera.
摘要：本文报道了用于从单目图像深度学习一个新的连续3D损失函数。从单眼图像的稠密深度预测是使用稀疏LIDAR点监管，训练期间利用从摄像机LIDAR传感器套件可用数据。目前，准确和负担得起的范围传感器不可用。立体相机和激光雷达测量深度要么不准确或疏/昂贵。与此相反的当前点至点损失评估方法，所提出的3D损失对待点云作为连续对象;因此，它克服了缺乏密集的地面实况深度由于激光雷达测量结果的稀疏性。实验评估表明，所提出的方法通过单眼照相机实现准确的深度测量具有一致的三维几何结构。

48. Learning 3D Part Assembly from a Single Image
Yichen Li, Kaichun Mo, Lin Shao, Minhyuk Sung, Leonidas Guibas
Abstract: Autonomous assembly is a crucial capability for robots in many applications. For this task, several problems such as obstacle avoidance, motion planning, and actuator control have been extensively studied in robotics. However, when it comes to task specification, the space of possibilities remains underexplored. Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learning-based solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object. Multiple challenges exist in this setting, including handling ambiguity among parts (e.g., slats in a chair back and leg stretchers) and 3D pose prediction for parts and part subassemblies, whether visible or occluded. We address these issues by proposing a two-module pipeline that leverages strong 2D-3D correspondences and assembly-oriented graph message-passing to infer part relationships. In experiments with a PartNet-based synthetic benchmark, we demonstrate the effectiveness of our framework as compared with three baseline approaches.
摘要：自主组件用于许多应用机器人的关键能力。对于这个任务，有几个问题，如避障，运动规划和执行器控制已被广泛应用在机器人的研究。然而，当涉及到的任务规范，可能性的空间仍然勘探不足。为此目的，我们组件引入新的问题，单图像引导3D部件，具有learningbased溶液一起。我们从给定的一套完整的零件和一个单一的形象描述了整个组装的组装对象研究的家具设置了这个问题。多重挑战在此设置存在，包括处理部分之间歧义（例如，在椅子背部和腿部担架板条）和三维姿态预测部件和子组件的一部分，无论是可见光或遮挡。我们通过提出一个双模块流水线，它利用强大的2D-3D对应和装配导向图的消息传递来推断部分关系解决这些问题。在具有PartNet基合成基准测试实验中，我们证明了我们框架的有效性相比有三个基线的方法。

49. Video-based Person Re-Identification using Gated Convolutional Recurrent Neural Networks
Yang Feng, Yu Wang, Jiebo Luo
Abstract: Deep neural networks have been successfully applied to solving the video-based person re-identification problem with impressive results reported. The existing networks for person re-id are designed to extract discriminative features that preserve the identity information. Usually, whole video frames are fed into the neural networks and all the regions in a frame are equally treated. This may be a suboptimal choice because many regions, e.g., background regions in the video, are not related to the person. Furthermore, the person of interest may be occluded by another person or something else. These unrelated regions may hinder person re-identification. In this paper, we introduce a novel gating mechanism to deep neural networks. Our gating mechanism will learn which regions are helpful for person re-identification and let these regions pass the gate. The unrelated background regions or occluding regions are filtered out by the gate. In each frame, the color channels and optical flow channels provide quite different information. To better leverage such information, we generate one gate using the color channels and another gate using the optical flow channels. These two gates are combined to provide a more reliable gate with a novel fusion method. Experimental results on two major datasets demonstrate the performance improvements due to the proposed gating mechanism.
摘要：深层神经网络已被成功应用于解决与令人印象深刻的结果基于视频的人重新鉴定问题的报道。对于人重新编号现有网络的设计能够判别特征，同时保留身份信息。通常，整个视频帧被馈送到神经网络和帧中的所有的区域被平等对待。这可能是次优的选择，因为很多地区，例如，在视频背景的区域，不相关的人。此外，感兴趣的人可以由其他人或其他什么东西来遮挡。这些不相干区域可能会阻碍人重新鉴定。在本文中，我们介绍了一种新的控制机制，以深层神经网络。我们的门控机制，将了解哪些地区是人重新鉴定乐于助人，让这些地区通过大门。无关的背景区域或封闭区域由栅极过滤掉。在每个帧中，颜色通道和光学流动通道提供完全不同的信息。为了更好地利用这些信息，我们使用颜色信道和使用该光学流动通道另一栅极产生一个栅极。这两个栅极被组合，以提供更可靠的栅极与一个新的融合方法。两大数据集实验结果表明，性能的提升，由于提出的门控机制。

50. Cross-modal Deep Face Normals with Deactivable Skip Connections
Victoria Fernandez Abrevaya, Adnane Boukhayma, Philip H. S. Torr, Edmond Boyer
Abstract: We present an approach for estimating surface normals from in-the-wild color images of faces. While data-driven strategies have been proposed for single face images, limited available ground truth data makes this problem difficult. To alleviate this issue, we propose a method that can leverage all available image and normal data, whether paired or not, thanks to a novel cross-modal learning architecture. In particular, we enable additional training with single modality data, either color or normal, by using two encoder-decoder networks with a shared latent space. The proposed architecture also enables face details to be transferred between the image and normal domains, given paired data, through skip connections between the image encoder and normal decoder. Core to our approach is a novel module that we call deactivable skip connections, which allows integrating both the auto-encoded and image-to-normal branches within the same architecture that can be trained end-to-end. This allows learning of a rich latent space that can accurately capture the normal information. We compare against state-of-the-art methods and show that our approach can achieve significant improvements, both quantitative and qualitative, with natural face images.
摘要：我们提出用于从在最野生色面的图像估计曲面法线的方法。虽然数据驱动的战略已经提出了单人脸图像，有限的可用地面真实数据，使得这个问题很难。为了缓解这一问题，我们提出可以利用所有可用的形象和正常的数据，无论是配对与否，这要归功于一种新型的跨模态学习结构的方法。特别是，我们能够与单一模态数据，彩色或正常的，额外的培训通过使用两个编码解码器的网络与共享潜在空间。所提出的架构还使面部细节的图像和正常域之间进行传输，给定的配对数据，通过图像编码器和解码器正常之间跳跃的连接。核心我们的做法是一种新的模块，我们称之为deactivable跳过连接，从而使得可训练的端至端相同的架构中集成两个自动编码和图像到正常的分支。这使得能够准确地捕捉正常的信息丰富的潜在空间学习。我们比较反对国家的最先进的方法，并表明我们的方法可以达到显著的改善，定量和定性的，自然的人脸图像。

51. Multi-Task Learning Enhanced Single Image De-Raining
YuLong Fan, Rong Chen, Bo Li
Abstract: Rain removal in images is an important task in the computer vision field and is attracting the attention of more and more people. In this paper, we address the non-trivial issue of removing the visual effect of rain streaks from a single image. Differing from existing work, our method combines various semantic constraint tasks in a proposed multi-task regression model for rain removal. These tasks reinforce the model's capabilities in terms of content, edge awareness, and local texture similarity, respectively. To further improve the performance of multi-task learning, we also present two simple but powerful dynamic weighting algorithms. The proposed multi-task enhanced network (MENET) is a powerful convolutional neural network based on U-Net for rain removal research, with a specific focus on utilizing multiple task constraints and exploiting the synergy among them to improve the model's rain removal capacity. It is noteworthy that the adaptive weighting scheme further improves network capability. We conduct several experiments on synthetic and real rain images, and achieve superior rain removal performance over several selected state-of-the-art (SOTA) approaches. The overall effect of our method is impressive, even in the decomposition of heavy rain and rain streak accumulation. The source code and some results can be found at this https URL.
摘要：雨去除图像是计算机视觉和归档吸引了越来越多的人关注的一项重要任务。在本文中，我们来从一个图像去除雨水条纹的视觉效果的不平凡的问题。从现有的工作不同的是，我们的方法结合拟议中的多任务回归模型为雨去除各种语义约束的任务。这些任务加强分别从内容，边缘感知和局部纹理相似模型的能力。为了进一步提高多任务学习，我们也存在两个简单但功能强大的动态加权算法的性能。所提出的多任务增强型网络（MENET）是一种基于掌中下雨去除研究了强大的卷积神经网络，特别专注于利用多任务的约束，并利用它们之间的协同效应，有利于模型的雨去除能力。值得注意的是，所述自适应加权方案，进一步导致改善的网络能力。我们对合成和真实图像雨进行多次实验，并在几个选定的国家的最先进的实现更佳的雨水去除性能（SOTA）方法。我们的方法的整体效果令人印象深刻，即使在大雨和小雨连胜积累等源代码的分解和一些成果，可以发现：该HTTPS URL。

52. Geometrically Mappable Image Features
Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
Abstract: Vision-based localization of an agent in a map is an important problem in robotics and computer vision. In that context, localization by learning matchable image features is gaining popularity due to recent advances in machine learning. Features that uniquely describe the visual contents of images have a wide range of applications, including image retrieval and understanding. In this work, we propose a method that learns image features targeted for image-retrieval-based localization. Retrieval-based localization has several benefits, such as easy maintenance and quick computation. However, the state-of-the-art features only provide visual similarity scores which do not explicitly reveal the geometric distance between query and retrieved images. Knowing this distance is highly desirable for accurate localization, especially when the reference images are sparsely distributed in the scene. Therefore, we propose a novel loss function for learning image features which are both visually representative and geometrically relatable. This is achieved by guiding the learning process such that the feature and geometric distances between images are directly proportional. In our experiments we show that our features not only offer significantly better localization accuracy, but also allow estimating the trajectory of a query sequence in the absence of the reference images.
摘要：在地图代理的基于视觉定位是机器人和计算机视觉的一个重要问题。在这方面，本地化通过学习可匹配的图像特征的逐渐流行，由于在机器学习的最新进展。唯一地描述图像的视觉内容的特征有着广泛的应用，包括图像检索和理解。在这项工作中，我们提出了一种方法，可以学习图像的功能为，基于图像的检索定位。基于内容的检索，定位有几个好处，如易于维护和快速计算。然而，国家的最先进的功能只提供视觉相似性得分，其没有明确地揭示了查询和检索的图像之间的几何距离。知道这个距离是准确定位非常可取的，特别是当参考图像稀疏地分布在场景中。因此，我们提出了一个新颖的损失函数的学习图像特征这两者都是视觉上的代表和几何听上去很像。这是通过引导学习过程，使得图像之间的特征和几何距离成正比实现。在我们的实验表明，我们的功能，不仅提供了更好的显著的定位精度，而且还允许估计没有参考图像的查询序列的轨迹。
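
The idea of keeping feature distances proportional to geometric distances can be written down as a simple pairwise penalty. The sketch below is only one plausible form, not the loss defined in the paper: the function name `proportionality_loss`, the scale factor `alpha`, and the use of Euclidean distances on camera positions are all assumptions made for illustration.

```python
from math import dist  # available from Python 3.8

def proportionality_loss(features, positions, alpha=1.0):
    """Mean squared gap between feature distance and alpha * geometric distance, over all pairs."""
    n = len(features)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_feat = dist(features[i], features[j])    # distance in feature space
            d_geo = dist(positions[i], positions[j])   # distance between camera positions
            total += (d_feat - alpha * d_geo) ** 2
            pairs += 1
    return total / pairs

# Hypothetical camera positions and 2D "features" that are exactly 2x the positions.
positions = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
features = [(0.0, 0.0), (6.0, 8.0), (12.0, 16.0)]

print(proportionality_loss(features, positions, alpha=2.0))  # proportional with factor 2
print(proportionality_loss(features, positions, alpha=1.0))  # wrong factor gives a positive loss
```

When the loss is zero, a feature distance between a query and a retrieved image can be divided by `alpha` to read off an estimate of their geometric distance, which is the property the abstract describes as desirable for sparse reference sets.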

53. BiCANet: Bi-directional Contextual Aggregating Network for Image Semantic Segmentation
Quan Zhou, Dechun Cong, Bin Kang, Xiaofu Wu, Baoyu Zheng, Huimin Lu, Longin Jan Latecki
Abstract: Exploring contextual information in convolutional neural networks (CNNs) has gained substantial attention in recent years for semantic segmentation. This paper introduces a Bi-directional Contextual Aggregating Network, called BiCANet, for semantic segmentation. Unlike previous approaches that encode context in feature space, BiCANet aggregates contextual cues from a categorical perspective, and mainly consists of three parts: contextual condensed projection block (CCPB), bi-directional context interaction block (BCIB), and multi-scale contextual fusion block (MCFB). More specifically, CCPB learns a category-based mapping through a split-transform-merge architecture, which condenses contextual cues with different receptive fields from intermediate layers. BCIB, on the other hand, employs dense skip connections to enhance the class-level context exchange. Finally, MCFB integrates multi-scale contextual cues by investigating short- and long-range spatial dependencies. To evaluate BiCANet, we have conducted extensive experiments on three semantic segmentation datasets: PASCAL VOC 2012, Cityscapes, and ADE20K. The experimental results demonstrate that BiCANet outperforms recent state-of-the-art networks without any post-processing techniques. In particular, BiCANet achieves mIoU scores of 86.7%, 82.4% and 38.66% on the PASCAL VOC 2012, Cityscapes and ADE20K test sets, respectively.
摘要：卷积神经网络（细胞神经网络）探讨上下文信息已经获得了大量的关注在近几年的语义分割。本文介绍了一种双向语境集结网名为BiCANet，对于语义分割。不同于以前的方法是，在特征空间中的编码上下文，BiCANet从绝对角度来看，这主要是由三个部分组成聚集上下文线索：上下文冷凝投影块（CCPB），双向上下文交互块（BCIB），和多尺度上下文融合块（MCFB）。更具体地，CCPB获知通过分裂变换合并架构，冷凝来自中间层不同的感受域上下文线索基于类别的映射。 BCIB，在另一方面，采用致密跳过连接以提高类级上下文交换。最后，MCFB通过调查短期和长程空间的依赖集成多尺度上下文线索。 PASCAL VOC 2012，风情，并ADE20K：评价BiCANet，我们已经在三个语义分割数据集进行了广泛的实验。实验结果表明，BiCANet性能优于最近状态的最先进的网络而没有任何后处理技术。特别地，BiCANet分别达到上PASCAL VOC 2012，风情和ADE20K测试集的得分米欧86.7％，82.4％和38.66％。
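
For readers unfamiliar with the mIoU metric reported in this abstract, here is a minimal sketch of how it can be computed on flattened label maps. It is a simplification: benchmark toolkits typically accumulate a confusion matrix over the whole dataset and handle ignore labels, which this example does not. The label values are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Hypothetical per-pixel class labels (flattened).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU here is 1/3, 2/3 and 1/2, so the mean is 0.5.
print(mean_iou(pred, target, num_classes=3))
```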

54. A level set representation method for N-dimensional convex shape and applications
Lingfeng Li, Shousheng Luo, Xue-Cheng Tai, Jiang Yang
Abstract: In this work, we present a new efficient method for convex shape representation that is independent of the dimension of the objects concerned, using level-set approaches. A convexity prior is very useful for object completion in computer vision. It is a very challenging task to design an efficient method for representing high-dimensional convex objects. In this paper, we prove that the convexity of the considered object is equivalent to the convexity of the associated signed distance function. Then, the second order condition of convex functions is used to characterize the shape convexity equivalently. We apply this new method to two applications: object segmentation with a convexity prior and the convex hull problem (especially with outliers). For both applications, the involved problems can be written as a general optimization problem with three constraints. An efficient algorithm based on the alternating direction method of multipliers is presented for the optimization problem. Numerical experiments are conducted to verify the effectiveness and efficiency of the proposed representation method and algorithm.
摘要：在这项工作中，我们提出了一个凸起的形状表示，其是无论关注的对象的尺寸的，采用水平集接近新有效的方法。凸之前是一种在计算机视觉对象完成是非常有用的。这是一个非常具有挑战性的任务以设计为高维凸对象表示的有效方法。在本文中，我们证明了考虑对象的凸度相当于相关的符号距离函数的凸性。然后，凸函数的二阶条件用于等价地表征形状凸度。与凸事先和凸包问题（尤其是与异常值）对象分割：我们这种新方法应用到两个应用程序。对于这两种应用，所涉及的问题都可以写成一般的优化问题三个约束。基于乘法器的交替方向法有效的算法提出了一种用于优化问题。数值实验，以验证所提出的表示方法和算法的有效性和效率。
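
The equivalence stated in this abstract (a shape is convex exactly when its signed distance function is convex, which can be tested through the second order condition that the Hessian is positive semidefinite) can be checked numerically on toy shapes. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it works in 2D, uses the sign convention "negative inside", estimates the Hessian by central finite differences, and the shapes, step size and test points are hypothetical.

```python
from math import hypot, sqrt

def disk_sdf(x, y, r=1.0):
    """Signed distance to a disk of radius r centred at the origin (negative inside)."""
    return hypot(x, y) - r

def two_disks_sdf(x, y, r=0.5):
    """Distance to the union of two separated disks (a non-convex shape); exact outside the union."""
    return min(hypot(x + 1.0, y), hypot(x - 1.0, y)) - r

def min_hessian_eig(f, x, y, h=1e-3):
    """Smallest eigenvalue of the 2x2 finite-difference Hessian of f at (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    return (fxx + fyy) / 2 - sqrt(((fxx - fyy) / 2) ** 2 + fxy**2)

# Convex shape: the smallest eigenvalue stays non-negative up to numerical error.
print(min_hessian_eig(disk_sdf, 2.0, 1.0))
# Non-convex shape: midway between the disks the second order condition fails clearly.
print(min_hessian_eig(two_disks_sdf, 0.0, 1.0))
```

The second test point sits on the crease where the nearest disk switches, so the finite-difference Hessian there has a large negative eigenvalue; this is the kind of violation a convexity constraint on the level-set function is meant to rule out.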

55. Cooling-Shrinking Attack: Blinding the Tracker with Imperceptible Noises
Bin Yan, Dong Wang, Huchuan Lu, Xiaoyun Yang
Abstract: Adversarial attacks on CNNs aim to deceive models into misbehaving by adding imperceptible perturbations to images. Studying such attacks helps to understand neural networks more deeply and to improve the robustness of deep learning models. Although several works have focused on attacking image classifiers and object detectors, an effective and efficient method for attacking single object trackers of any target in a model-free way remains lacking. In this paper, a cooling-shrinking attack method is proposed to deceive state-of-the-art SiameseRPN-based trackers. An effective and efficient perturbation generator is trained with a carefully designed adversarial loss, which can simultaneously cool hot regions where the target exists on the heatmaps and force the predicted bounding box to shrink, making the tracked target invisible to trackers. Numerous experiments on OTB100, VOT2018, and LaSOT datasets show that our method can effectively fool the state-of-the-art SiameseRPN++ tracker by adding small perturbations to the template or the search regions. Besides, our method has good transferability and is able to deceive other top-performance trackers such as DaSiamRPN, DaSiamRPN-UpdateNet, and DiMP. The source codes are available at this https URL.
摘要：在加入不易察觉的扰动图像欺骗车型胡作非为的CNN目标对抗性攻击。这一特征有利于深入了解神经网络，并提高深度学习模型的鲁棒性。虽然几部作品都集中在攻击图像分类和对象检测器，攻击任何目标的单个对象跟踪器在无模型的方式仍然缺乏一种有效的方法。在本文中，冷却收缩的攻击方法，提出了国家的最先进的基于朦SiameseRPN跟踪器。有效和高效的扰动发电机被训练以精心设计的对抗性损失，可以同时冷却，其中目标上的热图中存在热区，并迫使所述预测边界框缩小，使得被跟踪的目标不可见的跟踪器。在OTB100，VOT2018和LaSOT数据集无数次的实验表明，该方法可以有效地通过增加小扰动模板或搜索区域傻瓜国家的最先进的SiameseRPN ++跟踪。此外，我们的方法具有良好的可转移性和能够欺骗其他顶级性能跟踪，如DaSiamRPN，DaSiamRPN-UpdateNet和DIMP。源代码可在此HTTPS URL。

56. Single-shot autofocusing of microscopy images using deep learning
Yilin Luo, Luzhe Huang, Yair Rivenson, Aydogan Ozcan
Abstract: We demonstrate a deep learning-based offline autofocusing method, termed Deep-R, that is trained to rapidly and blindly autofocus a single-shot microscopy image of a specimen that is acquired at an arbitrary out-of-focus plane. We illustrate the efficacy of Deep-R using various tissue sections that were imaged using fluorescence and brightfield microscopy modalities and demonstrate snapshot autofocusing under different scenarios, such as a uniform axial defocus as well as a sample tilt within the field-of-view. Our results reveal that Deep-R is significantly faster when compared with standard online algorithmic autofocusing methods. This deep learning-based blind autofocusing framework opens up new opportunities for rapid microscopic imaging of large sample areas, also reducing the photon dose on the sample.
摘要：我们表现出深基础的学习，离线自动对焦方法，被称为深-R，被训练快速和盲目的自动对焦是在任意外的焦平面获取的样本的单次显微图像。我们说明了诸如均匀的轴向的散焦以及该领域的视图内的样品倾斜使用用荧光和明视场显微术模态成像和在不同情况下表现出快照自动聚焦各种组织切片，深-R的效力。我们的研究结果表明，深-R与标准的在线算法自动对焦方法相比显著快。这种深层基础的学习盲自动对焦框架，开辟了大样本领域快速显微成像新的机会，也减少了样品上的光子剂量。

57. Topological Sweep for Multi-Target Detection of Geostationary Space Objects
Daqi Liu, Bo Chen, Tat-Jun Chin, Mark Rutten
Abstract: Conducting surveillance of the Earth's orbit is a key task towards achieving space situational awareness (SSA). Our work focuses on the optical detection of man-made objects (e.g., satellites, space debris) in Geostationary orbit (GEO), which is home to major space assets such as telecommunications and navigational satellites. GEO object detection is challenging due to the distance of the targets, which appear as small dim points among a clutter of bright stars. In this paper, we propose a novel multi-target detection technique based on topological sweep, to find GEO objects from a short sequence of optical images. Our topological sweep technique exploits the geometric duality that underpins the approximately linear trajectory of target objects across the input sequence, to extract the targets from significant clutter and noise. Unlike standard multi-target methods, our algorithm deterministically solves a combinatorial problem to ensure high-recall rates without requiring accurate initializations. The usage of geometric duality also yields an algorithm that is computationally efficient and suitable for online processing.
摘要：开展地球轨道的监控是实现太空态势感知（SSA）的一项关键任务。我们的工作重点是对地静止轨道（GEO），这是家里主要的太空资产，如通信和导航卫星人造物体（例如，卫星，空间碎片）的光学检测。 GEO物体检测具有挑战性，由于目标，这似乎群星璀璨的杂波中弱小点的距离。在本文中，我们提出了一种新颖的多目标检测技术基于拓扑扫描，以找到GEO从光学图像的短序列的对象。我们的拓扑扫描技术利用几何二元性即支撑着整个输入序列目标对象的近似线性的运动轨迹，以从显著杂波和噪声提取目标。与标准的多目标的方法，我们的算法确定性解决一个组合问题，以确保高召回率，而不需要精确的初始化。几何二元性的使用也产生了一种算法，在计算上是有效的，适合于在线处理。
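
The geometric duality mentioned in this abstract can be illustrated with the standard point-line transform, in which a point (a, b) maps to the line y = a·x − b. Under this transform, points lying on a common line y = m·x + c become dual lines that all pass through the single point (m, −c), so a roughly linear target track shows up as a cluster of intersections in the dual plane. The sketch below shows only this transform; the choice of convention and the sample track are assumptions, and it does not implement the topological sweep itself.

```python
from fractions import Fraction

def dual_line(point):
    """Map point (a, b) to its dual line y = a*x - b, returned as (slope, intercept)."""
    a, b = point
    return (Fraction(a), Fraction(-b))

def intersect(l1, l2):
    """Intersection point of two non-parallel lines given as (slope, intercept)."""
    (m1, c1), (m2, c2) = l1, l2
    x = (c2 - c1) / (m1 - m2)
    return (x, m1 * x + c1)

# Hypothetical detections of one target across frames, all on the line y = 2x + 1.
track = [(0, 1), (1, 3), (3, 7)]
lines = [dual_line(p) for p in track]

# Every pair of dual lines meets at the same point (m, -c) = (2, -1).
print(intersect(lines[0], lines[1]))
print(intersect(lines[0], lines[2]))
print(intersect(lines[1], lines[2]))
```

Exact rational arithmetic via `Fraction` is used here so that the common intersection point is reproduced exactly; with real, noisy detections the intersections would only be approximately coincident.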

58. Who2com: Collaborative Perception via Learnable Handshake Communication
Yen-Cheng Liu, Junjiao Tian, Chih-Yao Ma, Nathan Glaser, Chia-Wen Kuo, Zsolt Kira
Abstract: In this paper, we propose the problem of collaborative perception, where robots can combine their local observations with those of neighboring agents in a learnable way to improve accuracy on a perception task. Unlike existing work in robotics and multi-agent reinforcement learning, we formulate the problem as one where learned information must be shared across a set of agents in a bandwidth-sensitive manner to optimize for scene understanding tasks such as semantic segmentation. Inspired by networking communication protocols, we propose a multi-stage handshake communication mechanism where the neural network can learn to compress relevant information needed for each stage. Specifically, a target agent with degraded sensor data sends a compressed request, the other agents respond with matching scores, and the target agent determines who to connect with (i.e., receive information from). We additionally develop the AirSim-CP dataset and metrics based on the AirSim simulator where a group of aerial robots perceive diverse landscapes, such as roads, grasslands, buildings, etc. We show that for the semantic segmentation task, our handshake communication method significantly improves accuracy by approximately 20% over decentralized baselines, and is comparable to centralized ones using a quarter of the bandwidth.

59. Monocular Real-time Hand Shape and Motion Capture using Multi-modal Data
Yuxiao Zhou, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Christian Theobalt, Feng Xu
Abstract: We present a novel method for monocular hand shape and pose estimation at unprecedented runtime performance of 100fps and at state-of-the-art accuracy. This is enabled by a new learning based architecture designed such that it can make use of all the sources of available hand training data: image data with either 2D or 3D annotations, as well as stand-alone 3D animations without corresponding image data. It features a 3D hand joint detection module and an inverse kinematics module which regresses not only 3D joint positions but also maps them to joint rotations in a single feed-forward pass. This output makes the method more directly usable for applications in computer vision and graphics compared to only regressing 3D joint positions. We demonstrate that our architectural design leads to a significant quantitative and qualitative improvement over the state of the art on several challenging benchmarks. Our model is publicly available for future research.

60. Appearance Fusion of Multiple Cues for Video Co-localization
Koteswar Rao Jerripothula
Abstract: This work addresses a problem named video co-localization that aims at localizing the objects in videos jointly. Although there are numerous cues available for this purpose, for example, saliency, motion, and joint, their robust fusion can be quite challenging at times due to their spatial inconsistencies. To overcome this, in this paper, we propose a novel appearance fusion method where we fuse appearance models derived from these cues rather than spatially fusing their maps. In this method, we evaluate the cues in terms of their reliability and consensus to guide the appearance fusion process. We also develop a novel joint cue relying on topological hierarchy. We utilize the final fusion results to produce a few candidate bounding boxes and for subsequent optimal selection among them while considering the spatiotemporal constraints. The proposed method achieves promising results on the YouTube Objects dataset.

61. A MEMS-based Foveating LIDAR to enable Real-time Adaptive Depth Sensing
Francesco Pittaluga, Zaid Tasneem, Justin Folden, Brevin Tilmon, Ayan Chakrabarti, Sanjeev J. Koppal
Abstract: Most active depth sensors sample their visual field using a fixed pattern, decided by accuracy, speed and cost trade-offs, rather than scene content. However, a number of recent works have demonstrated that adapting measurement patterns to scene content can offer significantly better trade-offs. We propose a hardware LIDAR design that allows flexible real-time measurements according to dynamically specified measurement patterns. Our flexible depth sensor design consists of a controllable scanning LIDAR that can foveate, or increase resolution in regions of interest, and that can fully leverage the power of adaptive depth sensing. We describe our optical setup and calibration, which enables fast sparse depth measurements using a scanning MEMS (micro-electro mechanical) mirror. We validate the efficacy of our prototype LIDAR design by testing on over 75 static and dynamic scenes spanning a range of environments. We also show CNN-based depth-map completion of sparse measurements obtained by our sensor. Our experiments show that our sensor can realize adaptive depth sensing systems.

62. Fast Symmetric Diffeomorphic Image Registration with Convolutional Neural Networks
Tony C.W. Mok, Albert C.S. Chung
Abstract: Diffeomorphic deformable image registration is crucial in many medical image studies, as it offers unique, special properties including topology preservation and invertibility of the transformation. Recent deep learning-based deformable image registration methods achieve fast image registration by leveraging a convolutional neural network (CNN) to learn the spatial transformation from the synthetic ground truth or the similarity metric. However, these approaches often ignore the topology preservation of the transformation and the smoothness of the transformation which is enforced by a global smoothing energy function alone. Moreover, deep learning-based approaches often estimate the displacement field directly, which cannot guarantee the existence of the inverse transformation. In this paper, we present a novel, efficient unsupervised symmetric image registration method which maximizes the similarity between images within the space of diffeomorphic maps and estimates both forward and inverse transformations simultaneously. We evaluate our method on 3D image registration with a large scale brain image dataset. Our method achieves state-of-the-art registration accuracy and running time while maintaining desirable diffeomorphic properties.

63. A Robotic 3D Perception System for Operating Room Environment Awareness
Zhaoshuo Li, Amirreza Shaban, Jean-Gabriel Simard, Dinesh Rabindran, Simon DiMaio, Omid Mohareri
Abstract: Purpose: We describe a 3D multi-view perception system for the da Vinci surgical system to enable Operating room (OR) scene understanding and context awareness. Methods: Our proposed system is comprised of four Time-of-Flight (ToF) cameras rigidly attached to strategic locations on the daVinci Xi patient side cart (PSC). The cameras are registered to the robot's kinematic chain by performing a one-time calibration routine and therefore, information from all cameras can be fused and represented in one common coordinate frame. Based on this architecture, a multi-view 3D scene semantic segmentation algorithm is created to enable recognition of common and salient objects/equipment and surgical activities in a da Vinci OR. Our proposed 3D semantic segmentation method has been trained and validated on a novel densely annotated dataset that has been captured from clinical scenarios. Results: The results show that our proposed architecture has acceptable registration error ($3.3\%\pm1.4\%$ of object-camera distance) and can robustly improve scene segmentation performance (mean Intersection Over Union - mIOU) for less frequently appearing classes ($\ge 0.013$) compared to a single-view method. Conclusion: We present the first dynamic multi-view perception system with a novel segmentation architecture, which can be used as a building block technology for applications such as surgical workflow analysis, automation of surgical sub-tasks and advanced guidance systems.

64. Do Public Datasets Assure Unbiased Comparisons for Registration Evaluation?
Jie Luo, Guangshen Ma, Sarah Frisken, Parikshit Juvekar, Nazim Haouchine, Zhe Xu, Yiming Xiao, Alexandra Golby, Patrick Codd, Masashi Sugiyama, William Wells III
Abstract: With the increasing availability of new image registration approaches, an unbiased evaluation is becoming more needed so that clinicians can choose the most suitable approaches for their applications. Current evaluations typically use landmarks in manually annotated datasets. As a result, the quality of annotations is crucial for unbiased comparisons. Even though most data providers claim to have quality control over their datasets, an objective third-party screening can be reassuring for intended users. In this study, we use the variogram to screen the manually annotated landmarks in two datasets used to benchmark registration in image-guided neurosurgeries. The variogram provides an intuitive 2D representation of the spatial characteristics of annotated landmarks. Using variograms, we identified potentially problematic cases and had them examined by experienced radiologists. We found that (1) a small number of annotations may have fiducial localization errors; (2) the landmark distribution for some cases is not ideal to offer fair comparisons. If unresolved, both findings could incur bias in registration evaluation.

65. ROAM: Random Layer Mixup for Semi-Supervised Learning in Medical Imaging
Tariq Bdair, Nassir Navab, Shadi Albarqouni
Abstract: Medical image segmentation is one of the major challenges addressed by machine learning methods. Yet deep learning methods depend heavily on large amounts of annotated data, which is time-consuming and costly to obtain. Semi-supervised learning methods approach this problem by leveraging an abundant amount of unlabeled data along with a small amount of labeled data in the training process. Recently, the MixUp regularizer [32] has been successfully introduced to semi-supervised learning methods, showing superior performance [3]. MixUp augments the model with new data points through linear interpolation of the data in the input space. In this paper, we argue that this option is limited; instead, we propose ROAM, a random layer mixup, which encourages the network to be less confident for interpolated data points at a randomly selected space. It thereby avoids over-fitting and enhances the generalization ability. We validate our method on publicly available datasets on whole-brain image segmentation (MALC), achieving state-of-the-art results in fully supervised (89.8%) and semi-supervised (87.2%) settings with relative improvements of up to 2.75% and 16.73%, respectively.
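The input-space interpolation that the abstract attributes to MixUp can be sketched as follows. This is a minimal illustration of plain MixUp, not of ROAM's random-layer variant. Drawing the mixing weight from a Beta(alpha, alpha) distribution follows the usual MixUp recipe, and the value alpha = 0.4, the function name, and the list-based inputs are assumptions made for the example.

```python
import random


def mixup(x_a, x_b, y_a, y_b, alpha=0.4, rng=random):
    """Linearly interpolate two inputs and their one-hot labels with one shared weight."""
    # Assumption: the mixing weight lam is drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Hypothetical two-feature inputs with one-hot labels for two classes.
x_mix, y_mix, lam = mixup([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0],
                          rng=random.Random(0))
```

Because the same weight is applied to inputs and labels, the mixed label stays a valid distribution over classes. Per the abstract, ROAM applies the interpolation at a randomly selected space rather than only at the input.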

66. Coronavirus (COVID-19) Classification using CT Images by Machine Learning Methods
Mucahid Barstugan, Umut Ozkaya, Saban Ozturk
Abstract: This study presents early phase detection of Coronavirus (COVID-19), as named by the World Health Organization (WHO), using machine learning methods. The detection process was implemented on abdominal Computed Tomography (CT) images. Expert radiologists observed from CT images that COVID-19 behaves differently from other viral pneumonia. Therefore, clinical experts specify that the COVID-19 virus needs to be diagnosed in the early phase. For detection of COVID-19, four different datasets were formed by taking patches of size 16x16, 32x32, 48x48, and 64x64 from 150 CT images. A feature extraction process was applied to the patches to increase the classification performance. Grey Level Co-occurrence Matrix (GLCM), Local Directional Pattern (LDP), Grey Level Run Length Matrix (GLRLM), Grey-Level Size Zone Matrix (GLSZM), and Discrete Wavelet Transform (DWT) algorithms were used as feature extraction methods. Support Vector Machines (SVM) classified the extracted features. 2-fold, 5-fold and 10-fold cross-validations were implemented during the classification process. Sensitivity, specificity, accuracy, precision, and F-score metrics were used to evaluate the classification performance. The best classification accuracy, 99.68%, was obtained with 10-fold cross-validation and the GLSZM feature extraction method.
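The five evaluation metrics named in the abstract can all be computed from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    precision = ratio(tp, tp + fp)
    f_score = ratio(2.0 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
```

In a k-fold setting such as the 2-, 5- and 10-fold cross-validation described above, these values would typically be computed per fold and then averaged.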

67. Deep Unfolding Network for Image Super-Resolution
Kai Zhang, Luc Van Gool, Radu Timofte
Abstract: Learning-based single image super-resolution (SISR) methods are continuously showing superior effectiveness and efficiency over traditional model-based methods, largely due to the end-to-end training. However, different from model-based methods that can handle the SISR problem with different scale factors, blur kernels and noise levels under a unified MAP (maximum a posteriori) framework, learning-based methods generally lack such flexibility. To address this issue, this paper proposes an end-to-end trainable unfolding network which leverages both learning-based methods and model-based methods. Specifically, by unfolding the MAP inference via a half-quadratic splitting algorithm, a fixed number of iterations consisting of alternately solving a data subproblem and a prior subproblem can be obtained. The two subproblems then can be solved with neural modules, resulting in an end-to-end trainable, iterative network. As a result, the proposed network inherits the flexibility of model-based methods to super-resolve blurry, noisy images for different scale factors via a single model, while maintaining the advantages of learning-based methods. Extensive experiments demonstrate the superiority of the proposed deep unfolding network in terms of flexibility, effectiveness and also generalizability.

68. Learning a Probabilistic Strategy for Computational Imaging Sensor Selection
He Sun, Adrian V. Dalca, Katherine L. Bouman
Abstract: Optimized sensing is important for computational imaging in low-resource environments, when images must be recovered from severely limited measurements. In this paper, we propose a physics-constrained, fully differentiable, autoencoder that learns a probabilistic sensor-sampling strategy for optimized sensor design. The proposed method learns a system's preferred sampling distribution that characterizes the correlations between different sensor selections as a binary, fully-connected Ising model. The learned probabilistic model is achieved by using a Gibbs sampling inspired network architecture, and is trained end-to-end with a reconstruction network for efficient co-design. The proposed framework is applicable to sensor selection problems in a variety of computational imaging applications. In this paper, we demonstrate the approach in the context of a very-long-baseline-interferometry (VLBI) array design task, where sensor correlations and atmospheric noise present unique challenges. We demonstrate results broadly consistent with expectation, and draw attention to particular structures preferred in the telescope array geometry that can be leveraged to plan future observations and design array expansions.

69. Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models
Alessandro Berlati, Oliver Scheel, Luigi Di Stefano, Federico Tombari
Abstract: Ambiguity is inherently present in many machine learning tasks, but it is seldom accounted for, especially in sequential models, as most output only a single prediction. In this work we propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data, which is of special importance because multiple futures are often equally likely. Our approach can be applied to the most common recurrent architectures and can be used with any loss function. Additionally, we introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties and coincides with our intuitive understanding of correctness in the presence of multiple labels. We test our method in several experiments and across diverse tasks dealing with time series data, such as trajectory forecasting and maneuver prediction, achieving promising results.

70. A Developmental Neuro-Robotics Approach for Boosting the Recognition of Handwritten Digits
Alessandro Di Nuovo
Abstract: Developmental psychology and neuroimaging research identified a close link between numbers and fingers, which can boost the initial number knowledge in children. Recent evidence shows that a simulation of the children's embodied strategies can improve the machine intelligence too. This article explores the application of embodied strategies to convolutional neural network models in the context of developmental neuro-robotics, where the training information is likely to be gradually acquired while operating rather than being abundant and fully available as the classical machine learning scenarios. The experimental analyses show that the proprioceptive information from the robot fingers can improve network accuracy in the recognition of handwritten Arabic digits when training examples and epochs are few. This result is comparable to brain imaging and longitudinal studies with young children. In conclusion, these findings also support the relevance of the embodiment in the case of artificial agents' training and show a possible way for the humanization of the learning process, where the robotic body can express the internal processes of artificial intelligence making it more understandable for humans.

71. Attention U-Net Based Adversarial Architectures for Chest X-ray Lung Segmentation
Gusztáv Gaál, Balázs Maga, András Lukács
Abstract: Chest X-ray is the most common test among medical imaging modalities. It is applied for the detection and differentiation of, among others, lung cancer, tuberculosis, and pneumonia, the last being of particular importance due to COVID-19. Integrating computer-aided detection methods into the radiologist's diagnostic pipeline greatly reduces doctors' workload while increasing reliability and supporting quantitative analysis. Here we present a novel deep learning approach for lung segmentation, a basic but arduous task in the diagnostic pipeline. Our method uses state-of-the-art fully convolutional neural networks in conjunction with an adversarial critic model. It generalized well to CXR images of unseen datasets with different patient profiles, achieving a final DSC of 97.5% on the JSRT dataset.
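The DSC reported above is read here as the Dice similarity coefficient, the usual overlap score for segmentation masks: twice the size of the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows. The function name, the flat-list mask format, and the convention of returning 1.0 for two empty masks are assumptions made for the example.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Hypothetical 4-pixel masks: one overlapping pixel, sizes 2 and 1.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A score of 1.0 means the predicted and reference lung masks coincide exactly, and 0.0 means they do not overlap at all.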

72. RobustGCNs: Robust Norm Graph Convolutional Networks in the Presence of Node Missing Data and Large Noises
Bo Jiang, Ziyan Zhang
Abstract: Graph Convolutional Networks (GCNs) have been widely studied for attribute graph data representation and learning. In many applications, graph node attributes/features may contain various kinds of noise, such as gross corruption, outliers and missing values. Existing graph convolutions (GCs) generally focus on feature propagation on structured graphs, which i) fails to handle graph data with missing values and ii) is often susceptible to large feature errors/noise and outliers. To address this issue, in this paper, we propose to incorporate a robust norm feature learning mechanism into graph convolution and present Robust Graph Convolutions (RGCs) for graph data in the presence of feature noise and missing values. Our RGCs are based on the interpretation of GCs from a propagation function aspect of 'data reconstruction on graph'. Based on this interpretation, we derive our RGCs by introducing robust norm based propagation functions into GCs. Finally, we incorporate the derived RGCs into an end-to-end network architecture and propose RobustGCNs for graph data learning. Experimental results on several noisy datasets demonstrate the effectiveness and robustness of the proposed RobustGCNs.

73. Bridge the Domain Gap Between Ultra-wide-field and Traditional Fundus Images via Adversarial Domain Adaptation
Lie Ju, Xin Wang, Quan Zhou, Hu Zhu, Mehrtash Harandi, Paul Bonnington, Tom Drummond, Zongyuan Ge
Abstract: For decades, advances in retinal imaging technology have enabled effective diagnosis and management of retinal disease using fundus cameras. Recently, ultra-wide-field (UWF) fundus imaging by Optos camera is gradually put into use because of its broader insights on fundus for some lesions that are not typically seen in traditional fundus images. Research on traditional fundus images is an active topic but studies on UWF fundus images are few. One of the most important reasons is that UWF fundus images are hard to obtain. In this paper, for the first time, we explore domain adaptation from the traditional fundus to UWF fundus images. We propose a flexible framework to bridge the domain gap between two domains and co-train a UWF fundus diagnosis model by pseudo-labelling and adversarial learning. We design a regularisation technique to regulate the domain adaptation. Also, we apply MixUp to overcome the over-fitting issue from incorrect generated pseudo-labels. Our experimental results on either single or both domains demonstrate that the proposed method can well adapt and transfer the knowledge from traditional fundus images to UWF fundus images and improve the performance of retinal disease recognition.

74. Understanding the robustness of deep neural network classifiers for breast cancer screening
Witold Oleszkiewicz, Taro Makino, Stanisław Jastrzębski, Tomasz Trzciński, Linda Moy, Kyunghyun Cho, Laura Heacock, Krzysztof J. Geras
Abstract: Deep neural networks (DNNs) show promise in breast cancer screening, but their robustness to input perturbations must be better understood before they can be clinically implemented. There exists extensive literature on this subject in the context of natural images that can potentially be built upon. However, it cannot be assumed that conclusions about robustness will transfer from natural images to mammogram images, due to significant differences between the two image modalities. In order to determine whether conclusions will transfer, we measure the sensitivity of a radiologist-level screening mammogram image classifier to four commonly studied input perturbations that natural image classifiers are sensitive to. We find that mammogram image classifiers are also sensitive to these perturbations, which suggests that we can build on the existing literature. We also perform a detailed analysis on the effects of low-pass filtering, and find that it degrades the visibility of clinically meaningful features called microcalcifications. Since low-pass filtering removes semantically meaningful information that is predictive of breast cancer, we argue that it is undesirable for mammogram image classifiers to be invariant to it. This is in contrast to natural images, where we do not want DNNs to be sensitive to low-pass filtering due to its tendency to remove information that is human-incomprehensible.

75. One-Shot Informed Robotic Visual Search in the Wild
Karim Koreitem, Florian Shkurti, Travis Manderson, Wei-Di Chang, Juan Camilo Gamboa Higuera, Gregory Dudek
Abstract: We consider the task of underwater robot navigation for the purpose of collecting scientifically-relevant video data for environmental monitoring. The majority of field robots that currently perform monitoring tasks in unstructured natural environments navigate via path-tracking a pre-specified sequence of waypoints. Although this navigation method is often necessary, it is limiting because the robot does not have a model of what the scientist deems to be relevant visual observations. Thus, the robot can neither visually search for particular types of objects, nor focus its attention on parts of the scene that might be more relevant than the pre-specified waypoints and viewpoints. In this paper we propose a method that enables informed visual navigation via a learned visual similarity operator that guides the robot's visual search towards parts of the scene that look like an exemplar image, which is given by the user as a high-level specification for data collection. We propose and evaluate a weakly-supervised video representation learning method that outperforms ImageNet embeddings for similarity tasks in the underwater domain. We also demonstrate the deployment of this similarity operator during informed visual navigation in collaborative environmental monitoring scenarios, in large-scale field trials, where the robot and a human scientist jointly search for relevant visual content.

76. HierTrain: Fast Hierarchical Edge AI Learning with Hybrid Parallelism in Mobile-Edge-Cloud Computing
Deyin Liu, Xu Chen, Zhi Zhou, Qing Ling
Abstract: Nowadays, deep neural networks (DNNs) are the core enablers for many emerging edge AI applications. Conventional approaches to training DNNs are generally implemented at central servers or cloud centers for centralized learning, which is typically time-consuming and resource-demanding due to the transmission of a large amount of data samples from the device to the remote cloud. To overcome these disadvantages, we consider accelerating the learning process of DNNs on the Mobile-Edge-Cloud Computing (MECC) paradigm. In this paper, we propose HierTrain, a hierarchical edge AI learning framework, which efficiently deploys the DNN training task over the hierarchical MECC architecture. We develop a novel hybrid parallelism method, which is the key to HierTrain, to adaptively assign the DNN model layers and the data samples across the three levels of edge device, edge server and cloud center. We then formulate the problem of scheduling the DNN training tasks at both layer-granularity and sample-granularity. Solving this optimization problem enables us to achieve the minimum training time. We further implement a hardware prototype consisting of an edge device, an edge server and a cloud server, and conduct extensive experiments on it. Experimental results demonstrate that HierTrain can achieve up to 6.9x speedup compared to the cloud-based hierarchical training approach.

77. TanhExp: A Smooth Activation Function with High Convergence Speed for Lightweight Neural Networks
Xinyu Liu, Xiaoguang Di
Abstract: Lightweight or mobile neural networks used for real-time computer vision tasks contain fewer parameters than normal networks, which leads to constrained performance. In this work, we propose a novel activation function named Tanh Exponential Activation Function (TanhExp), which can significantly improve the performance of these networks on image classification tasks. TanhExp is defined as $f(x) = x\tanh(e^x)$. We demonstrate the simplicity, efficiency, and robustness of TanhExp on various datasets and network models, where TanhExp outperforms its counterparts in both convergence speed and accuracy. Its behaviour also remains stable even with noise added and the dataset altered. We show that, without increasing the size of the network, the capacity of lightweight neural networks can be enhanced by TanhExp with only a few training epochs and no extra parameters added.
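The definition given in the abstract, f(x) = x * tanh(e^x), translates directly into a scalar function. The sketch below is an illustration using only the Python standard library, not the authors' implementation. The function name and the cutoff of 20.0 are assumptions, with the cutoff added only to avoid floating-point overflow in the exponential.

```python
import math


def tanh_exp(x: float) -> float:
    """TanhExp activation as stated in the abstract: f(x) = x * tanh(e^x)."""
    # For large x, math.exp(x) would overflow, but tanh(e^x) already rounds to 1.0
    # in double precision there, so f(x) equals x.
    if x > 20.0:
        return x
    return x * math.tanh(math.exp(x))


# Evaluate the activation at a few hypothetical sample inputs.
samples = [tanh_exp(v) for v in (-1.0, 0.0, 1.0, 1000.0)]
```

For positive inputs the function quickly approaches the identity, while negative inputs are damped towards zero, since e^x and therefore tanh(e^x) shrink as x decreases.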
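Summary: Lightweight or mobile neural networks used for real-time computer vision tasks have fewer parameters than normal networks, which constrains their performance. The authors propose a novel activation function, the Tanh Exponential Activation Function (TanhExp), that significantly improves the performance of these networks on image classification. TanhExp is defined as f(x) = x tanh(e^x). Its simplicity, efficiency, and robustness are demonstrated on various datasets and network models, where it outperforms its counterparts in both convergence speed and accuracy. Its behaviour remains stable even when noise is added and the dataset is altered. Without increasing the size of the network, TanhExp enhances the capacity of lightweight neural networks with only a few training epochs and no extra parameters.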
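
The TanhExp definition quoted in the abstract above, f(x) = x tanh(e^x), can be written out directly. The following is a minimal scalar sketch in plain Python, not the authors' implementation; the function name `tanhexp` and the overflow cutoff of 20.0 are assumptions made for illustration.

```python
import math


def tanhexp(x: float) -> float:
    """Scalar TanhExp activation: f(x) = x * tanh(e^x).

    Illustrative sketch only; the name and the cutoff below are assumptions.
    """
    # math.exp overflows for very large x, but tanh(e^x) already rounds to
    # 1.0 in double precision well before that, so f(x) equals x there.
    if x > 20.0:
        return x
    return x * math.tanh(math.exp(x))
```

For example, `tanhexp(0.0)` is 0.0, large positive inputs pass through almost unchanged, and large negative inputs are squashed toward zero because e^x vanishes.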

78. AQPDCITY Dataset: Picture-Based PM2.5 Monitoring in the Urban Area of Big Cities [PDF] Back to contents
Yonghui Zhang, Ke Gu
Abstract: Since particulate matter (PM) is closely related to people's lives and health, it has become one of the most important indicators in air quality monitoring around the world. But the existing sensor-based methods for PM monitoring have notable disadvantages, such as low-density monitoring stations and demanding monitoring conditions. It is highly desirable to devise a method that can obtain the PM concentration at any location in time for subsequent air quality control. Prior work indicates that the PM concentration can be monitored using ubiquitous photos. To further investigate this issue, we gathered 1,500 photos in big cities to establish a new AQPDCITY dataset. Experiments checking nine state-of-the-art methods on this dataset show that those methods perform poorly on the AQPDCITY dataset.
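Summary: Because particulate matter (PM) is closely related to people's lives and health, it has become one of the most important indicators in air quality monitoring worldwide. Existing sensor-based methods for PM monitoring have notable disadvantages, such as low-density monitoring stations and demanding monitoring conditions. A method that can obtain the PM concentration at any location in time for subsequent air quality control is therefore highly desirable. Prior work indicates that the PM concentration can be monitored using ubiquitous photos. To investigate further, the authors gathered 1,500 photos in big cities to build the new AQPDCITY dataset. Experiments with nine state-of-the-art methods show that those methods perform poorly on it.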

79. On Information Plane Analyses of Neural Network Classifiers -- A Review [PDF] Back to contents
Bernhard C. Geiger
Abstract: We review the current literature concerned with information plane analyses of neural network classifiers. While the underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, empirical evidence was found to be both supporting and conflicting. We review this evidence together with a detailed analysis of how the respective information quantities were estimated. Our analysis suggests that compression visualized in information planes is not information-theoretic, but is rather compatible with geometric compression of the activations.
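Summary: The author reviews the current literature on information plane analyses of neural network classifiers. The underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, but the empirical evidence was found to be both supporting and conflicting. This evidence is reviewed together with a detailed analysis of how the respective information quantities were estimated. The analysis suggests that the compression visualized in information planes is not information-theoretic, but is rather compatible with geometric compression of the activations.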

80. Non-Adversarial Video Synthesis with Learned Priors [PDF] Back to contents
Abhishek Aich, Akash Gupta, Rameswar Panda, Rakib Hyder, M. Salman Asif, Amit K. Roy-Chowdhury
Abstract: Most of the existing works in video synthesis focus on generating videos using adversarial learning. Despite their success, these methods often require an input reference frame or fail to generate diverse videos from the given data distribution, with little to no uniformity in the quality of videos that can be generated. Unlike these methods, we focus on the problem of generating videos from latent noise vectors, without any reference input frames. To this end, we develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network and a generator through non-adversarial learning. Optimizing for the input latent space along with the network weights allows us to generate videos in a controlled environment, i.e., we can faithfully generate all videos the model has seen during the learning process as well as new unseen videos. Extensive experiments on three challenging and diverse datasets demonstrate that our approach generates superior quality videos compared to the existing state-of-the-art methods.
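Summary: Most existing work in video synthesis focuses on generating videos using adversarial learning. Despite their success, these methods often require an input reference frame or fail to generate diverse videos from the given data distribution, with little to no uniformity in the quality of the generated videos. In contrast, the authors focus on generating videos from latent noise vectors, without any reference input frames. They develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network, and a generator through non-adversarial learning. Optimizing the input latent space along with the network weights allows videos to be generated in a controlled environment: the model can faithfully reproduce all videos seen during learning as well as generate new unseen videos. Extensive experiments on three challenging and diverse datasets show that the approach generates higher-quality videos than existing state-of-the-art methods.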

81. Adversarial Continual Learning [PDF] Back to contents
Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach
Abstract: Continual learning aims to learn new tasks without forgetting previously learned ones. We hypothesize that representations learned to solve each task in a sequence have a shared structure while containing some task-specific properties. We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks. Our model combines architecture growth to prevent forgetting of task-specific skills and an experience replay approach to preserve shared skills. We demonstrate our hybrid approach is effective in avoiding forgetting and show it is superior to both architecture-based and memory-based approaches on class-incremental learning of a single dataset as well as a sequence of multiple datasets in image classification. Our code is available at this https URL.
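Summary: Continual learning aims to learn new tasks without forgetting previously learned ones. The authors hypothesize that the representations learned to solve each task in a sequence share a common structure while also containing some task-specific properties. They show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns disjoint representations for the task-invariant and task-specific features required to solve a sequence of tasks. The model combines architecture growth, to prevent forgetting of task-specific skills, with an experience replay approach, to preserve shared skills. The hybrid approach is effective at avoiding forgetting and outperforms both architecture-based and memory-based approaches on class-incremental learning of a single dataset as well as a sequence of multiple datasets in image classification. The code is available at this https URL.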

82. Adversarial Robustness on In- and Out-Distribution Improves Explainability [PDF] Back to contents
Maximilian Augustin, Alexander Meinke, Matthias Hein
Abstract: Neural networks have led to major improvements in image classification but suffer from non-robustness to adversarial changes, unreliable uncertainty estimates on out-distribution samples, and inscrutable black-box decisions. In this work we propose RATIO, a training procedure for Robustness via Adversarial Training on In- and Out-distribution, which leads to robust models with reliable and robust confidence estimates on the out-distribution. RATIO has similar generative properties to adversarial training so that visual counterfactuals produce class-specific features. While adversarial training comes at the price of lower clean accuracy, RATIO achieves state-of-the-art $l_2$-adversarial robustness on CIFAR10 and maintains better clean accuracy.
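Summary: Neural networks have led to major improvements in image classification but suffer from non-robustness to adversarial changes, unreliable uncertainty estimates on out-distribution samples, and inscrutable black-box decisions. The authors propose RATIO, a training procedure for Robustness via Adversarial Training on In- and Out-distribution, which yields robust models with reliable and robust confidence estimates on the out-distribution. RATIO has generative properties similar to those of adversarial training, so that visual counterfactuals produce class-specific features. While adversarial training comes at the price of lower clean accuracy, RATIO achieves state-of-the-art $l_2$-adversarial robustness on CIFAR10 and maintains better clean accuracy.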

Note: the summaries originate from machine translation of the abstracts.


[arXiv papers] Computer Vision and Pattern Recognition 2020-03-24

Contents

Abstracts