[arXiv Papers] Computer Vision and Pattern Recognition 2020-04-01

Contents

1. Optical Non-Line-of-Sight Physics-based 3D Human Pose Estimation [PDF] Abstract
2. Probabilistic Pixel-Adaptive Refinement Networks [PDF] Abstract
3. TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting [PDF] Abstract
4. UniformAugment: A Search-free Probabilistic Data Augmentation Approach [PDF] Abstract
5. How Useful is Self-Supervised Pretraining for Visual Tasks? [PDF] Abstract
6. Du^2Net: Learning Depth Estimation from Dual-Cameras and Dual-Pixels [PDF] Abstract
7. Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation [PDF] Abstract
8. SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [PDF] Abstract
9. DPGN: Distribution Propagation Graph Network for Few-shot Learning [PDF] Abstract
10. Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [PDF] Abstract
11. Real-Time Semantic Segmentation via Auto Depth, Downsampling Joint Decision and Feature Aggregation [PDF] Abstract
12. DISIR: Deep Image Segmentation with Interactive Refinement [PDF] Abstract
13. Attention-based Assisted Excitation for Salient Object Segmentation [PDF] Abstract
14. GAST-Net: Graph Attention Spatio-temporal Convolutional Networks for 3D Human Pose Estimation in Video [PDF] Abstract
15. Recognizing Characters in Art History Using Deep Learning [PDF] Abstract
16. Pix2Shape -- Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation [PDF] Abstract
17. Look-into-Object: Self-supervised Structure Modeling for Object Recognition [PDF] Abstract
18. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [PDF] Abstract
19. Real-Time Camera Pose Estimation for Sports Fields [PDF] Abstract
20. Learning Cross-domain Semantic-Visual Relation for Transductive Zero-Shot Learning [PDF] Abstract
21. X-Linear Attention Networks for Image Captioning [PDF] Abstract
22. Long Short-Term Relation Networks for Video Action Detection [PDF] Abstract
23. Inverting Gradients -- How easy is it to break privacy in federated learning? [PDF] Abstract
24. 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [PDF] Abstract
25. Prediction Confidence from Neighbors [PDF] Abstract
26. Distance in Latent Space as Novelty Measure [PDF] Abstract
27. SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings [PDF] Abstract
28. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation [PDF] Abstract
29. BANet: Bidirectional Aggregation Network with Occlusion Handling for Panoptic Segmentation [PDF] Abstract
30. Distilled Semantics for Comprehensive Scene Understanding from Videos [PDF] Abstract
31. Learning Human-Object Interaction Detection using Interaction Points [PDF] Abstract
32. SK-Net: Deep Learning on Point Cloud via End-to-end Discovery of Spatial Keypoints [PDF] Abstract
33. FaceScape: a Large-scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction [PDF] Abstract
34. Fashion Meets Computer Vision: A Survey [PDF] Abstract
35. DeepLPF: Deep Local Parametric Filters for Image Enhancement [PDF] Abstract
36. Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [PDF] Abstract
37. Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [PDF] Abstract
38. FGN: Fully Guided Network for Few-Shot Instance Segmentation [PDF] Abstract
39. Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume [PDF] Abstract
40. Segmenting Transparent Objects in the Wild [PDF] Abstract
41. A Simple Class Decision Balancing for Incremental Learning [PDF] Abstract
42. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [PDF] Abstract
43. EvolveGraph: Heterogeneous Multi-Agent Multi-Modal Trajectory Prediction with Evolving Interaction Graphs [PDF] Abstract
44. Y-net: Multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing [PDF] Abstract
45. Proxy Anchor Loss for Deep Metric Learning [PDF] Abstract
46. Attention-based Multi-modal Fusion Network for Semantic Scene Completion [PDF] Abstract
47. Learning Oracle Attention for High-fidelity Face Completion [PDF] Abstract
48. Edge Guided GANs with Semantic Preserving for Semantic Image Synthesis [PDF] Abstract
49. TITAN: Future Forecast using Action Priors [PDF] Abstract
50. MUXConv: Information Multiplexing in Convolutional Neural Networks [PDF] Abstract
51. RetinaTrack: Online Single Stage Joint Detection and Tracking [PDF] Abstract
52. 3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation [PDF] Abstract
53. Semi-supervised Learning for Few-shot Image-to-Image Translation [PDF] Abstract
54. Can Deep Learning Recognize Subtle Human Activities? [PDF] Abstract
55. AvatarMe: Realistically Renderable 3D Facial Reconstruction "in-the-wild" [PDF] Abstract
56. ActGAN: Flexible and Efficient One-shot Face Reenactment [PDF] Abstract
57. Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions [PDF] Abstract
58. Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [PDF] Abstract
59. Co-occurrence of deep convolutional features for image search [PDF] Abstract
60. When to Use Convolutional Neural Networks for Inverse Problems [PDF] Abstract
61. Domain Balancing: Face Recognition on Long-Tailed Domains [PDF] Abstract
62. Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction [PDF] Abstract
63. Understanding the impact of mistakes on background regions in crowd counting [PDF] Abstract
64. Combining detection and tracking for human pose estimation in videos [PDF] Abstract
65. Certifiable Relative Pose Estimation [PDF] Abstract
66. COVID-ResNet: A Deep Learning Framework for Screening of COVID19 from Radiographs [PDF] Abstract
67. Automated Methods for Detection and Classification Pneumonia based on X-Ray Images Using Deep Learning [PDF] Abstract
68. Learning from Small Data Through Sampling an Implicit Conditional Generative Latent Optimization Model [PDF] Abstract
69. Radiologist-level stroke classification on non-contrast CT scans with Deep U-Net [PDF] Abstract
70. Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation [PDF] Abstract
71. MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose Multi-Task Learning [PDF] Abstract
72. Supervised Raw Video Denoising with a Benchmark Dataset on Dynamic Scenes [PDF] Abstract
73. A Thorough Comparison Study on Adversarial Attacks and Defenses for Common Thorax Disease Classification in Chest X-rays [PDF] Abstract
74. Regularizing Class-wise Predictions via Self-knowledge Distillation [PDF] Abstract
75. Recurrent Neural Networks with Longitudinal Pooling and Consistency Regularization [PDF] Abstract
76. Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction [PDF] Abstract
77. Lesion Conditional Image Generation for Improved Segmentation of Intracranial Hemorrhage from CT Images [PDF] Abstract
78. Dataless Model Selection with the Deep Frame Potential [PDF] Abstract
79. COVID-CT-Dataset: A CT Scan Dataset about COVID-19 [PDF] Abstract
80. Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network [PDF] Abstract
81. Non-dimensional Star-Identification [PDF] Abstract

Abstracts

1. Optical Non-Line-of-Sight Physics-based 3D Human Pose Estimation [PDF] Back to Contents
  Mariko Isogawa, Ye Yuan, Matthew O'Toole, Kris Kitani
Abstract: We describe a method for 3D human pose estimation from transient images (i.e., a 3D spatio-temporal histogram of photons) acquired by an optical non-line-of-sight (NLOS) imaging system. Our method can perceive 3D human pose by 'looking around corners' through the use of light indirectly reflected by the environment. We bring together a diverse set of technologies from NLOS imaging, human pose estimation and deep reinforcement learning to construct an end-to-end data processing pipeline that converts a raw stream of photon measurements into a full 3D human pose sequence estimate. Our contributions are the design of a data representation process, which includes (1) a learnable inverse point spread function (PSF) to convert raw transient images into a deep feature vector; (2) a neural humanoid control policy conditioned on the transient image feature and learned from interactions with a physics simulator; and (3) a data synthesis and augmentation strategy based on depth data that can be transferred to a real-world NLOS imaging system. Our preliminary experiments suggest that our method is able to generalize to real-world NLOS measurements to estimate physically-valid 3D human poses.

2. Probabilistic Pixel-Adaptive Refinement Networks [PDF] Back to Contents
  Anne S. Wannenwetsch, Stefan Roth
Abstract: Encoder-decoder networks have found widespread use in various dense prediction tasks. However, the strong reduction of spatial resolution in the encoder leads to a loss of location information as well as boundary artifacts. To address this, image-adaptive post-processing methods have shown beneficial by leveraging the high-resolution input image(s) as guidance data. We extend such approaches by considering an important orthogonal source of information: the network's confidence in its own predictions. We introduce probabilistic pixel-adaptive convolutions (PPACs), which not only depend on image guidance data for filtering, but also respect the reliability of per-pixel predictions. As such, PPACs allow for image-adaptive smoothing and simultaneously propagating pixels of high confidence into less reliable regions, while respecting object boundaries. We demonstrate their utility in refinement networks for optical flow and semantic segmentation, where PPACs lead to a clear reduction in boundary artifacts. Moreover, our proposed refinement step is able to substantially improve the accuracy on various widely used benchmarks.
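
To make the filtering idea concrete, here is a minimal PyTorch sketch of a pixel-adaptive smoothing step in the spirit of PPACs; the Gaussian affinity, kernel size, and normalization below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ppac_like_filter(pred, guide, conf, k=5):
    """Sketch: smooth `pred` with weights from guidance affinity * confidence.
    pred: (B, C, H, W) predictions; guide: (B, Cg, H, W) guidance image;
    conf: (B, 1, H, W) per-pixel reliability in [0, 1]."""
    B, C, H, W = pred.shape
    pad = k // 2
    pred_n = F.unfold(pred, k, padding=pad).view(B, C, k * k, H, W)
    guide_n = F.unfold(guide, k, padding=pad).view(B, guide.shape[1], k * k, H, W)
    conf_n = F.unfold(conf, k, padding=pad).view(B, 1, k * k, H, W)
    # Gaussian affinity between each pixel's guidance feature and its neighbours
    diff = guide_n - guide.unsqueeze(2)               # (B, Cg, k*k, H, W)
    affinity = torch.exp(-0.5 * (diff ** 2).sum(dim=1, keepdim=True))
    weights = affinity * conf_n                       # confident pixels dominate
    weights = weights / weights.sum(dim=2, keepdim=True).clamp(min=1e-8)
    return (pred_n * weights).sum(dim=2)              # (B, C, H, W)
```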

3. TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting [PDF] Back to Contents
  Zhuoqian Yang, Wentao Zhu, Wayne Wu, Chen Qian, Qiang Zhou, Bolei Zhou, Chen Change Loy
Abstract: We present a lightweight video motion retargeting approach TransMoMo that is capable of transferring motion of a person in a source video realistically to another video of a target person. Without using any paired data for supervision, the proposed method can be trained in an unsupervised manner by exploiting invariance properties of three orthogonal factors of variation including motion, structure, and view-angle. Specifically, with loss functions carefully derived based on invariance, we train an auto-encoder to disentangle the latent representations of such factors given the source and target video clips. This allows us to selectively transfer motion extracted from the source video seamlessly to the target video in spite of structural and view-angle disparities between the source and the target. The relaxed assumption of paired data allows our method to be trained on a vast amount of videos needless of manual annotation of source-target pairing, leading to improved robustness against large structural variations and extreme motion in videos. We demonstrate the effectiveness of our method over the state-of-the-art methods. Code, model and data are publicly available on our project page (this https URL).

4. UniformAugment: A Search-free Probabilistic Data Augmentation Approach [PDF] Back to Contents
  Tom Ching LingChen, Ava Khonsari, Amirreza Lashkari, Mina Rafi Nazari, Jaspreet Singh Sambee, Mario A. Nascimento
Abstract: Augmenting training datasets has been shown to improve the learning effectiveness for several computer vision tasks. A good augmentation produces an augmented dataset that adds variability while retaining the statistical properties of the original dataset. Some techniques, such as AutoAugment and Fast AutoAugment, have introduced a search phase to find a set of suitable augmentation policies for a given model and dataset. This comes at the cost of great computational overhead, adding up to several thousand GPU hours. More recently RandAugment was proposed to substantially speedup the search phase by approximating the search space by a couple of hyperparameters, but still incurring non-negligible cost for tuning those. In this paper we show that, under the assumption that the augmentation space is approximately distribution invariant, a uniform sampling over the continuous space of augmentation transformations is sufficient to train highly effective models. Based on that result we propose UniformAugment, an automated data augmentation approach that completely avoids a search phase. In addition to discussing the theoretical underpinning supporting our approach, we also use the standard datasets, as well as established models for image classification, to show that UniformAugment's effectiveness is comparable to the aforementioned methods, while still being highly efficient by virtue of not requiring any search.
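
The core recipe is simple enough to sketch. Below is a hypothetical Python rendition of UniformAugment-style sampling; the operation list and magnitude ranges are illustrative assumptions, and only the uniform sampling of ops, apply-probabilities, and magnitudes reflects the paper's idea.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# A handful of example ops; the real operation set and ranges are assumptions.
OPS = [
    lambda img, m: ImageEnhance.Brightness(img).enhance(1.0 + 0.9 * m),
    lambda img, m: ImageEnhance.Contrast(img).enhance(1.0 + 0.9 * m),
    lambda img, m: img.rotate(30 * m),
    lambda img, m: ImageOps.solarize(img, int(255 * (1.0 - abs(m)))),
]

def uniform_augment(img: Image.Image, num_ops: int = 2) -> Image.Image:
    """Search-free augmentation: ops, apply-probabilities, and magnitudes are
    all drawn uniformly at random for every image, with no tuned policy."""
    for _ in range(num_ops):
        op = random.choice(OPS)
        p = random.random()                  # apply-probability ~ U(0, 1)
        if random.random() < p:
            m = random.uniform(-1.0, 1.0)    # signed magnitude ~ U(-1, 1)
            img = op(img, m)
    return img
```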

5. How Useful is Self-Supervised Pretraining for Visual Tasks? [PDF] Back to Contents
  Alejandro Newell, Jia Deng
Abstract: Recent advances have spurred incredible progress in self-supervised pretraining for vision. We investigate what factors may play a role in the utility of these pretraining methods for practitioners. To do this, we evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. We prepare a suite of synthetic data that enables an endless supply of annotated images as well as full control over dataset difficulty. Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows as well as how the utility changes as a function of the downstream task and the properties of the training data. We also find that linear evaluation does not correlate with finetuning performance. Code and data are available at this https URL.
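
As a reference point for the evaluation protocols compared above, here is a minimal PyTorch sketch of linear evaluation (a linear probe on a frozen backbone); the helper name and dimensions are assumed.

```python
import torch.nn as nn

def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    """Minimal sketch of linear evaluation: freeze the pretrained backbone and
    train only a linear head on its features. Finetuning would instead leave
    all parameters trainable; the paper finds the two protocols rank
    pretraining methods differently."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()                          # keep batch-norm statistics fixed
    return nn.Sequential(backbone, nn.Flatten(), nn.Linear(feat_dim, num_classes))
```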

6. Du^2Net: Learning Depth Estimation from Dual-Cameras and Dual-Pixels [PDF] Back to Contents
  Yinda Zhang, Neal Wadhwa, Sergio Orts-Escolano, Christian Häne, Sean Fanello, Rahul Garg
Abstract: Computational stereo has reached a high level of accuracy, but degrades in the presence of occlusions, repeated textures, and correspondence errors along edges. We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. Our network uses a novel architecture to fuse these two sources of information and can overcome the above-mentioned limitations of pure binocular stereo matching. Our method provides a dense depth map with sharp edges, which is crucial for computational photography applications like synthetic shallow-depth-of-field or 3D Photos. Additionally, we avoid the inherent ambiguity due to the aperture problem in stereo cameras by designing the stereo baseline to be orthogonal to the dual-pixel baseline. We present experiments and comparisons with state-of-the-art approaches to show that our method offers a substantial improvement over previous works.

7. Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation [PDF] Back to Contents
  Felix Yu, Zhiwei Deng, Karthik Narasimhan, Olga Russakovsky
Abstract: In the Vision-and-Language Navigation (VLN) task, an agent with egocentric vision navigates to a destination given natural language instructions. The act of manually annotating these instructions is timely and expensive, such that many existing approaches automatically generate additional samples to improve agent performance. However, these approaches still have difficulty generalizing their performance to new environments. In this work, we investigate the popular Room-to-Room (R2R) VLN benchmark and discover that what is important is not only the amount of data you synthesize, but also how you do it. We find that shortest path sampling, which is used by both the R2R benchmark and existing augmentation methods, encode biases in the action space of the agent which we dub as action priors. We then show that these action priors offer one explanation toward the poor generalization of existing works. To mitigate such priors, we propose a path sampling method based on random walks to augment the data. By training with this augmentation strategy, our agent is able to generalize better to unknown environments compared to the baseline, significantly improving model performance in the process.
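
A random-walk sampler of the kind proposed is straightforward to sketch. The following Python function is a hypothetical illustration over a navigation graph given as an adjacency dict; the backtracking heuristic and length cap are assumptions.

```python
import random

def random_walk_path(graph, start, goal, max_len=15):
    """Hypothetical sketch of random-walk path sampling (graph maps a node to
    its neighbours). Unlike shortest-path sampling, the walk meanders, so the
    augmented data does not bake shortest-path action priors into the agent."""
    path = [start]
    while path[-1] != goal and len(path) < max_len:
        neighbours = graph[path[-1]]
        # discourage immediate backtracking when an alternative exists
        options = [n for n in neighbours if len(path) < 2 or n != path[-2]]
        path.append(random.choice(options or neighbours))
    return path if path[-1] == goal else None  # keep only walks that reach the goal
```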

8. SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [PDF] Back to Contents
  Mohsen Fayyaz, Juergen Gall
Abstract: Temporal action segmentation is a topic of increasing interest, however, annotating each frame in a video is cumbersome and costly. Weakly supervised approaches therefore aim at learning temporal action segmentation from videos that are only weakly labeled. In this work, we assume that for each training video only the list of actions is given that occur in the video, but not when, how often, and in which order they occur. In order to address this task, we propose an approach that can be trained end-to-end on such data. The approach divides the video into smaller temporal regions and predicts for each region the action label and its length. In addition, the network estimates the action labels for each frame. By measuring how consistent the frame-wise predictions are with respect to the temporal regions and the annotated action labels, the network learns to divide a video into class-consistent regions. We evaluate our approach on three datasets where the approach achieves state-of-the-art results.

9. DPGN: Distribution Propagation Graph Network for Few-shot Learning [PDF] Back to Contents
  Ling Yang, Liangliang Li, Zilun Zhang, Zhou, Erjin Zhou, Yu Liu
Abstract: We extend instance-level relation modeling to explicitly model the distribution-level relation of one example to all other examples in a 1-vs-N manner. We propose a novel approach named distribution propagation graph network (DPGN) for few-shot learning. It conveys both the distribution-level relations and instance-level relations in each few-shot learning task. To combine the distribution-level relations and instance-level relations for all examples, we construct a dual complete graph network which consists of a point graph and a distribution graph with each node standing for an example. Equipped with dual graph architecture, DPGN propagates label information from labeled examples to unlabeled examples within several update generations. In extensive experiments on few-shot learning benchmarks, DPGN outperforms state-of-the-art results by a large margin of 5% to 12% under supervised settings and 7% to 13% under semi-supervised settings.

10. Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data [PDF] Back to Contents
  Washington Ramos, Michel Silva, Edson Araujo, Leandro Soriano Marcolino, Erickson Nascimento
Abstract: The rapid increase in the amount of published visual data and the limited time of users bring the demand for processing untrimmed videos to produce shorter versions that convey the same information. Despite the remarkable progress that has been made by summarization methods, most of them can only select a few frames or skims, which creates visual gaps and breaks the video context. In this paper, we present a novel methodology based on a reinforcement learning formulation to accelerate instructional videos. Our approach can adaptively select frames that are not relevant to convey the information without creating gaps in the final video. Our agent is textually and visually oriented to select which frames to remove to shrink the input video. Additionally, we propose a novel network, called Visually-guided Document Attention Network (VDAN), able to generate a highly discriminative embedding space to represent both textual and visual data. Our experiments show that our method achieves the best performance in terms of F1 Score and coverage at the video segment level.

11. Real-Time Semantic Segmentation via Auto Depth, Downsampling Joint Decision and Feature Aggregation [PDF] Back to Contents
  Peng Sun, Jiaxiang Wu, Songyuan Li, Peiwen Lin, Junzhou Huang, Xi Li
Abstract: To satisfy the stringent requirements on computational resources in the field of real-time semantic segmentation, most approaches focus on the hand-crafted design of light-weight segmentation networks. Recently, Neural Architecture Search (NAS) has been used to search for the optimal building blocks of networks automatically, but the network depth, downsampling strategy, and feature aggregation way are still set in advance by trial and error. In this paper, we propose a joint search framework, called AutoRTNet, to automate the design of these strategies. Specifically, we propose hyper-cells to jointly decide the network depth and downsampling strategy, and an aggregation cell to achieve automatic multi-scale feature aggregation. Experimental results show that AutoRTNet achieves 73.9% mIoU on the Cityscapes test set and 110.0 FPS on an NVIDIA TitanXP GPU card with 768x1536 input images.

12. DISIR: Deep Image Segmentation with Interactive Refinement [PDF] Back to Contents
  Gaston Lenczner, Bertrand Le Saux, Nicola Luminari, Adrien Chan Hon Tong, Guy Le Besnerais
Abstract: This paper presents an interactive approach for multi-class segmentation of aerial images. Precisely, it is based on a deep neural network which exploits both RGB images and annotations. Starting from an initial output based on the image only, our network then interactively refines this segmentation map using a concatenation of the image and user annotations. Importantly, user annotations modify the inputs of the network - not its weights - enabling a fast and smooth process. Through experiments on two public aerial datasets, we show that user annotations are extremely rewarding: each click corrects roughly 5000 pixels. We analyze the impact of different aspects of our framework such as the representation of the annotations, the volume of training data or the network architecture. Code is available at this https URL.
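
The annotations-as-inputs mechanism can be sketched in a few lines of PyTorch; the tensor layout and helper name below are assumptions, not the released interface.

```python
import torch

def refine_with_clicks(net, image, clicks):
    """Sketch of click-based refinement: user annotations enter as extra input
    channels (one binary map per class), so a correction changes the network's
    inputs rather than its weights and only costs a forward pass.
    image: (B, 3, H, W); clicks: (B, num_classes, H, W), 1 at clicked pixels."""
    x = torch.cat([image, clicks], dim=1)
    with torch.no_grad():
        logits = net(x)
    return logits.argmax(dim=1)              # refined segmentation map
```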

13. Attention-based Assisted Excitation for Salient Object Segmentation [PDF] Back to Contents
  Saeed Masoudnia, Melika Kheirieh, Abdol-Hossein Vahabie, Babak Nadjar-Araabi
Abstract: Visual attention brings significant progress for Convolution Neural Networks (CNNs) in various applications. In this paper, object-based attention in human visual cortex inspires us to introduce a mechanism for modification of activations in feature maps of CNNs. In this mechanism, the activations of object locations are excited in feature maps. This mechanism is specifically inspired by gain additive attention modulation in object-based attention in brain. It facilitates figure-ground segregation in the visual cortex. Similar to brain, we use the idea to address two challenges in salient object segmentation: object interior parts and concise boundaries. We implemented it based on U-net model using different architectures in the encoder parts, including AlexNet, VGG, and ResNet. The proposed method was examined on three benchmark datasets: HKU-IS, MSRB, and PASCAL-S. Experimental results showed that the inspired idea could significantly improve the results in terms of mean absolute error and F-measure. The results also showed that our proposed method better captured not only the boundary but also the object interior. Thus, it can tackle the mentioned challenges.

14. GAST-Net: Graph Attention Spatio-temporal Convolutional Networks for 3D Human Pose Estimation in Video [PDF] Back to Contents
  Junfa Liu, Yisheng Guang, Juan Rojas
Abstract: 3D pose estimation in video can benefit greatly from both temporal and spatial information. Occlusions and depth ambiguities remain outstanding problems. In this work, we study how to learn the kinematic constraints of the human skeleton by modeling additional spatial information through attention and interleaving it in a synergistic way with temporal models. We contribute a graph attention spatio-temporal convolutional network (GAST-Net) that makes full use of spatio-temporal information and mitigates the problems of occlusion and depth ambiguities. We also contribute attention mechanisms that learn inter-joint relations that are easily visualizable. GAST-Net comprises of interleaved temporal convolutional and graph attention blocks. We use dilated temporal convolution networks (TCNs) to model long-term patterns. More critically, graph attention blocks encode local and global representations through novel convolutional kernels that express human skeletal symmetrical structure and adaptively extract global semantics over time. GAST-Net outperforms SOTA by approximately 10% for mean per-joint position error for ground-truth labels on Human3.6M and achieves competitive results on HumanEva-I.
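
As one concrete ingredient, the long-term temporal modeling via dilated TCNs might look like the following PyTorch sketch; channel count, depth, and dilation schedule are assumptions.

```python
import torch.nn as nn

def dilated_tcn(channels: int = 256, depth: int = 4) -> nn.Module:
    """Minimal sketch of a dilated temporal-convolution stack: dilation grows
    by 3x per layer, so the temporal receptive field over the 2D-pose sequence
    grows exponentially with depth."""
    layers = []
    for i in range(depth):
        d = 3 ** i
        layers += [
            nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)
```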

15. Recognizing Characters in Art History Using Deep Learning [PDF] Back to Contents
  Prathmesh Madhu, Ronak Kosti, Lara Mührenberg, Peter Bell, Andreas Maier, Vincent Christlein
Abstract: In the field of Art History, images of artworks and their contexts are core to understanding the underlying semantic information. However, the highly complex and sophisticated representation of these artworks makes it difficult, even for the experts, to analyze the scene. From the computer vision perspective, the task of analyzing such artworks can be divided into sub-problems by taking a bottom-up approach. In this paper, we focus on the problem of recognizing the characters in Art History. From the iconography of the Annunciation of the Lord (Figure 1), we consider the representation of the main protagonists, Mary and Gabriel, across different artworks and styles. We investigate and present the findings of training a character classifier on features extracted from their face images. The limitations of this method, and the inherent ambiguity in the representation of Gabriel, motivated us to consider their bodies (a bigger context) to analyze in order to recognize the characters. Convolutional Neural Networks (CNN) trained on the bodies of Mary and Gabriel are able to learn person related features and ultimately improve the performance of character recognition. We introduce a new technique that generates more data with similar styles, effectively creating data in the similar domain. We present experiments and analysis on three different models and show that the model trained on domain related data gives the best performance for recognizing character. Additionally, we analyze the localized image regions for the network predictions. Code is open-sourced and available at this https URL, and the published peer-reviewed article is available at this https URL.

16. Pix2Shape -- Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation [PDF] Back to Contents
  Sai Rajeswar, Fahim Mannan, Florian Golemo, Jérôme Parent-Lévesque, David Vazquez, Derek Nowrouzezahrai, Aaron Courville
Abstract: We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground-truth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3d spatial reasoning. Qualitative and quantitative evaluation demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.

17. Look-into-Object: Self-supervised Structure Modeling for Object Recognition [PDF] Back to Contents
  Mohan Zhou, Yalong Bai, Wei Zhang, Tiejun Zhao, Tao Mei
Abstract: Most object recognition approaches predominantly focus on learning discriminative visual patterns while overlooking the holistic object structure. Though important, structure modeling usually requires significant manual annotations and therefore is labor-intensive. In this paper, we propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions into the traditional framework. We show the recognition backbone can be substantially enhanced for more robust representation learning, without any cost of extra annotation and inference speed. Specifically, we first propose an object-extent learning module for localizing the object according to the visual patterns shared among the instances in the same category. We then design a spatial context learning module for modeling the internal structures of the object, through predicting the relative positions within the extent. These two modules can be easily plugged into any backbone networks during training and detached at inference time. Extensive experiments show that our look-into-object approach (LIO) achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft). We also show that this learning paradigm is highly generalizable to other tasks such as object detection and segmentation (MS COCO). Project page: this https URL.

18. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [PDF] Back to Contents
  Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, Wanli Ouyang
Abstract: Spatial-temporal graphs have been widely used by skeleton-based action recognition algorithms to model human action dynamics. To capture robust movement patterns from these graphs, long-range and multi-scale context aggregation and spatial-temporal dependency modeling are critical aspects of a powerful feature extractor. However, existing methods have limitations in achieving (1) unbiased long-range joint relationship modeling under multi-scale operators and (2) unobstructed cross-spacetime information flow for capturing complex spatial-temporal dependencies. In this work, we present (1) a simple method to disentangle multi-scale graph convolutions and (2) a unified spatial-temporal graph convolutional operator named G3D. The proposed multi-scale aggregation scheme disentangles the importance of nodes in different neighborhoods for effective long-range modeling. The proposed G3D module leverages dense cross-spacetime edges as skip connections for direct information propagation across the spatial-temporal graph. By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400.
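
The disentangling idea can be illustrated with a short NumPy sketch that builds one adjacency mask per hop distance; the exact construction below is an assumption in the spirit of the paper, not its implementation.

```python
import numpy as np

def disentangled_adjacencies(A: np.ndarray, max_k: int):
    """Sketch of disentangled multi-scale aggregation: instead of powers A^k
    (where nearby joints dominate every scale), scale k keeps only joint pairs
    at *exactly* k hops. A is a binary skeleton adjacency without self-loops."""
    n = A.shape[0]
    reach_prev = np.eye(n, dtype=bool)            # within k-1 hops
    reach_curr = reach_prev | (A > 0)             # within k hops
    masks = []
    for _ in range(max_k):
        masks.append((reach_curr & ~reach_prev).astype(np.float32))
        reach_prev = reach_curr
        reach_curr = reach_curr | ((reach_curr.astype(int) @ A) > 0)
    return masks                                  # masks[k-1][i, j] = 1 iff dist(i, j) == k
```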

19. Real-Time Camera Pose Estimation for Sports Fields [PDF] Back to Contents
  Leonardo Citraro, Pablo Márquez-Neila, Stefano Savarè, Vivek Jayaram, Charles Dubout, Félix Renaut, Andrés Hasfura, Horesh Ben Shitrit, Pascal Fua
Abstract: Given an image sequence featuring a portion of a sports field filmed by a moving and uncalibrated camera, such as the one of the smartphones, our goal is to compute automatically in real time the focal length and extrinsic camera parameters for each image in the sequence without using a priori knowledges of the position and orientation of the camera. To this end, we propose a novel framework that combines accurate localization and robust identification of specific keypoints in the image by using a fully convolutional deep architecture. Our algorithm exploits both the field lines and the players' image locations, assuming their ground plane positions to be given, to achieve accuracy and robustness that is beyond the current state of the art. We will demonstrate its effectiveness on challenging soccer, basketball, and volleyball benchmark datasets.

20. Learning Cross-domain Semantic-Visual Relation for Transductive Zero-Shot Learning [PDF] Back to Contents
  Jianyang Zhang, Fengmao Lv, Guowu Yang, Lei Feng, Yufeng Yu, Lixin Duan
Abstract: Zero-Shot Learning (ZSL) aims to learn recognition models for recognizing new classes without labeled data. In this work, we propose a novel approach dubbed Transferrable Semantic-Visual Relation (TSVR) to facilitate the cross-category transfer in transductive ZSL. Our approach draws on an intriguing insight connecting two challenging problems, i.e. domain adaptation and zero-shot learning. Domain adaptation aims to transfer knowledge across two different domains (i.e., source domain and target domain) that share the identical task/label space. For ZSL, the source and target domains have different tasks/label spaces. Hence, ZSL is usually considered as a more difficult transfer setting compared with domain adaptation. Although the existing ZSL approaches use semantic attributes of categories to bridge the source and target domains, their performances are far from satisfactory due to the large domain gap between different categories. In contrast, our method directly transforms ZSL into a domain adaptation task through redrawing ZSL as predicting the similarity/dissimilarity labels for the pairs of semantic attributes and visual features. For this redrawn domain adaptation problem, we propose to use a domain-specific batch normalization component to reduce the domain discrepancy of semantic-visual pairs. Experimental results over diverse ZSL benchmarks clearly demonstrate the superiority of our method.

21. X-Linear Attention Networks for Image Captioning [PDF] Back to Contents
  Yingwei Pan, Ting Yao, Yehao Li, Tao Mei
Abstract: Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd-order interactions across multi-modal inputs. Nevertheless, there has not been evidence in support of building such interactions concurrently with attention mechanism for image captioning. In this paper, we introduce a unified attention block -- X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2nd-order interactions between the input single-modal or multi-modal features. Higher and even infinity order feature interactions are readily modeled through stacking multiple X-Linear attention blocks and equipping the block with Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed as X-LAN) that novelly integrates X-Linear attention block(s) into image encoder and sentence decoder of image captioning model to leverage higher order intra- and inter-modal interactions. The experiments on COCO benchmark demonstrate that our X-LAN obtains to-date the best published CIDEr performance of 132.0% on COCO Karpathy test split. When further endowing Transformer with X-Linear attention blocks, CIDEr is boosted up to 132.8%. Source code is available at this https URL.
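
A bilinear attention block in this spirit can be sketched in PyTorch as follows; the layer sizes, non-linearities, and pooling choices are assumptions, not the paper's exact X-Linear block.

```python
import torch
import torch.nn as nn

class XLinearSketch(nn.Module):
    """Minimal sketch of an X-Linear-style block: an element-wise product of
    embedded query and keys forms 2nd-order interactions that drive both a
    spatial (over regions) and a channel-wise attention."""
    def __init__(self, d: int):
        super().__init__()
        self.wq = nn.Linear(d, d)
        self.wk = nn.Linear(d, d)
        self.wv = nn.Linear(d, d)
        self.spatial = nn.Linear(d, 1)
        self.channel = nn.Linear(d, d)

    def forward(self, q, kv):
        # q: (B, d) query; kv: (B, N, d) region features
        joint = torch.relu(self.wq(q)).unsqueeze(1) * torch.relu(self.wk(kv))
        beta_s = torch.softmax(self.spatial(joint), dim=1)        # (B, N, 1)
        beta_c = torch.sigmoid(self.channel(joint.mean(dim=1)))   # (B, d)
        v = torch.relu(self.wv(kv))
        return beta_c * (beta_s * v).sum(dim=1)                   # attended feature
```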

22. Long Short-Term Relation Networks for Video Action Detection [PDF] Back to Contents
  Dong Li, Ting Yao, Zhaofan Qiu, Houqiang Li, Tao Mei
Abstract: It has been well recognized that modeling human-object or object-object relations would be helpful for detection task. Nevertheless, the problem is not trivial especially when exploring the interactions between human actor, object and scene (collectively as human-context) to boost video action detectors. The difficulty originates from the aspect that reliable relations in a video should depend on not only short-term human-context relation in the present clip but also the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present a new Long Short-Term Relation Networks, dubbed as LSTR, that novelly aggregates and propagates relation to augment features for video action detection. Technically, Region Proposal Networks (RPN) is remoulded to first produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through spatio-temporal attention mechanism and reasons long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN) in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported when comparing to state-of-the-art methods.

23. Inverting Gradients -- How easy is it to break privacy in federated learning? [PDF] Back to Contents
  Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, Michael Moeller
Abstract: The idea of federated learning is to collaboratively train a neural network on a server. Each user receives the current weights of the network and in turns sends parameter updates (gradients) based on local data. This protocol has been designed not only to train neural networks data-efficiently, but also to provide privacy benefits for users, as their input data remains on device and only parameter gradients are shared. In this paper we show that sharing parameter gradients is by no means secure: By exploiting a cosine similarity loss along with optimization methods from adversarial attacks, we are able to faithfully reconstruct images at high resolution from the knowledge of their parameter gradients, and demonstrate that such a break of privacy is possible even for trained deep networks. Moreover, we analyze the effects of architecture as well as parameters on the difficulty of reconstructing the input image, prove that any input to a fully connected layer can be reconstructed analytically independent of the remaining architecture, and show numerically that even averaging gradients over several iterations or several images does not protect the user's privacy in federated learning applications in computer vision.
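
The heart of the attack is a gradient-direction matching objective, sketched below in PyTorch; the paper additionally uses image regularisation and a signed-gradient optimiser, which this minimal version omits.

```python
import torch

def cosine_gradient_loss(dummy_grads, true_grads):
    """Sketch of the attack objective: maximise the cosine similarity between
    the gradients produced by a dummy input and the gradients intercepted from
    a user, i.e. match gradient *direction* rather than magnitude."""
    dot = sum((dg * tg).sum() for dg, tg in zip(dummy_grads, true_grads))
    norm_d = torch.sqrt(sum((dg ** 2).sum() for dg in dummy_grads))
    norm_t = torch.sqrt(sum((tg ** 2).sum() for tg in true_grads))
    return 1.0 - dot / (norm_d * norm_t + 1e-12)

# Attack loop (sketch): optimise dummy_x so that
# torch.autograd.grad(loss(model(dummy_x), y), model.parameters(), create_graph=True)
# minimises cosine_gradient_loss against the intercepted gradients.
```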

24. 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [PDF] Back to Contents
  Xiaokang Chen, Kwan-Yee Lin, Chen Qian, Gang Zeng, Hongsheng Li
Abstract: The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation. Since the computational cost generally increases explosively along with the growth of voxel resolution, most current state-of-the-arts have to tailor their framework into a low-resolution representation with the sacrifice of detail prediction. Thus, voxel resolution becomes one of the crucial difficulties that lead to the performance bottleneck. In this paper, we propose to devise a new geometry-based strategy to embed depth information with low-resolution voxel representation, which could still be able to encode sufficient geometric information, e.g., room layout, object's sizes and shapes, to infer the invisible areas of the scene with well structure-preserving details. To this end, we first propose a novel 3D sketch-aware feature embedding to explicitly encode geometric information effectively and efficiently. With the 3D sketch in hand, we further devise a simple yet effective semantic scene completion framework that incorporates a light-weight 3D Sketch Hallucination module to guide the inference of occupancy and the semantic labels via a semi-supervised structure prior learning strategy. We demonstrate that our proposed geometric embedding works better than the depth feature learning from habitual SSC frameworks. Our final model surpasses state-of-the-arts consistently on three public benchmarks, which only requires 3D volumes of 60 x 36 x 60 resolution for both input and output. The code and the supplementary material will be available at this https URL.

25. Prediction Confidence from Neighbors [PDF] Back to Contents
  Mark Philip Philipsen, Thomas Baltzer Moeslund
Abstract: The inability of Machine Learning (ML) models to successfully extrapolate correct predictions from out-of-distribution (OoD) samples is a major hindrance to the application of ML in critical applications. Until the generalization ability of ML methods is improved it is necessary to keep humans in the loop. The need for human supervision can only be reduced if it is possible to determine a level of confidence in predictions, which can be used to either ask for human assistance or to abstain from making predictions. We show that feature space distance is a meaningful measure that can provide confidence in predictions. The distance between unseen samples and nearby training samples proves to be correlated to the prediction error of unseen samples. Depending on the acceptable degree of error, predictions can either be trusted or rejected based on the distance to training samples. Additionally, a novelty threshold can be used to decide whether a sample is worth adding to the training set. This enables earlier and safer deployment of models in critical applications and is vital for deploying models under ever-changing conditions.
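
The confidence measure lends itself to a short sketch with scikit-learn; k and the mean aggregation below are assumptions.

```python
from sklearn.neighbors import NearestNeighbors

def neighbor_confidence(train_feats, test_feats, k=5):
    """Sketch: mean feature-space distance to the k nearest training samples,
    negated so that larger scores mean 'closer to the training data', hence
    more trustworthy predictions."""
    index = NearestNeighbors(n_neighbors=k).fit(train_feats)
    dist, _ = index.kneighbors(test_feats)
    return -dist.mean(axis=1)

# Usage: abstain (or ask a human) whenever the score falls below a threshold
# chosen for the acceptable error level, e.g. mask = scores >= tau.
```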

26. Distance in Latent Space as Novelty Measure [PDF] Back to Contents
  Mark Philip Philipsen, Thomas Baltzer Moeslund
Abstract: Deep Learning performs well when training data densely covers the experience space. For complex problems this makes data collection prohibitively expensive. We propose to intelligently select samples when constructing data sets in order to best utilize the available labeling budget. The selection methodology is based on the presumption that two dissimilar samples are worth more than two similar samples in a data set. Similarity is measured based on the Euclidean distance between samples in the latent space produced by a DNN. By using a self-supervised method to construct the latent space, it is ensured that the space fits the data well and that any upfront labeling effort can be avoided. The result is more efficient, diverse, and balanced data set, which produce equal or superior results with fewer labeled examples.
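
One way to turn "dissimilar samples are worth more" into a selection rule is greedy farthest-point selection in the latent space, sketched below; the greedy strategy itself is an illustrative assumption, not necessarily the paper's procedure.

```python
import numpy as np

def farthest_point_selection(latents: np.ndarray, budget: int):
    """Sketch of dissimilarity-driven sample selection in a self-supervised
    latent space: each pick is the sample farthest from everything chosen so
    far, so the labeling budget goes to mutually dissimilar examples."""
    chosen = [0]                                           # arbitrary seed
    d = np.linalg.norm(latents - latents[0], axis=1)       # distance to chosen set
    while len(chosen) < budget:
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(latents - latents[nxt], axis=1))
    return chosen
```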

27. SPARE3D: A Dataset for SPAtial REasoning on Three-View Line Drawings [PDF] Back to Contents
  Wenyu Han, Siyuan Xiang, Chenhui Liu, Ruoyu Wang, Chen Feng
Abstract: Spatial reasoning is an important component of human intelligence. We can imagine the shapes of 3D objects and reason about their spatial relations by merely looking at their three-view line drawings in 2D, with different levels of competence. Can deep networks be trained to perform spatial reasoning tasks? How can we measure their "spatial intelligence"? To answer these questions, we present the SPARE3D dataset. Based on cognitive science and psychometrics, SPARE3D contains three types of 2D-3D reasoning tasks on view consistency, camera pose, and shape generation, with increasing difficulty. We then design a method to automatically generate a large number of challenging questions with ground truth answers for each task. They are used to provide supervision for training our baseline models using state-of-the-art architectures like ResNet. Our experiments show that although convolutional networks have achieved superhuman performance in many visual learning tasks, their spatial reasoning performance in SPARE3D is almost equal to random guesses. We hope SPARE3D can stimulate new problem formulations and network designs for spatial reasoning to empower intelligent robots to operate effectively in the 3D world via 2D sensors. The dataset and code are available at this https URL.

28. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation [PDF] Back to Contents
  Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Hassan Foroosh
Abstract: The requirement of fine-grained perception by autonomous driving systems has resulted in recently increased research in the online semantic segmentation of single-scan LiDAR. Emerging datasets and technological advancements have enabled researchers to benchmark this problem and improve the applicable semantic segmentation algorithms. Still, online semantic segmentation of LiDAR scans in autonomous driving applications remains challenging due to three reasons: (1) the need for near-real-time latency with limited hardware, (2) points are distributed unevenly across space, and (3) an increasing number of more fine-grained semantic classes. The combination of the aforementioned challenges motivates us to propose a new LiDAR-specific, KNN-free segmentation algorithm - PolarNet. Instead of using common spherical or bird's-eye-view projection, our polar bird's-eye-view representation balances the points per grid and thus indirectly redistributes the network's attention over the long-tailed points distribution over the radial axis in polar coordination. We find that our encoding scheme greatly increases the mIoU in three drastically different real urban LiDAR single-scan segmentation datasets while retaining ultra low latency and near real-time throughput.
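
The polar grid assignment at the core of the representation can be sketched in NumPy; the bin counts and range below are assumptions.

```python
import numpy as np

def polar_bev_indices(points, num_r=64, num_theta=512, r_max=50.0):
    """Sketch of a polar bird's-eye-view grid assignment: binning LiDAR points
    by radius and azimuth instead of x/y counteracts the sensor's radial
    density falloff, so cells carry more balanced point counts."""
    x, y = points[:, 0], points[:, 1]
    r = np.clip(np.hypot(x, y), 0.0, r_max - 1e-6)
    theta = np.arctan2(y, x)                                   # in [-pi, pi)
    r_idx = (r / r_max * num_r).astype(np.int64)               # radial bin
    t_idx = ((theta + np.pi) / (2 * np.pi) * num_theta).astype(np.int64) % num_theta
    return r_idx, t_idx
```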

29. BANet: Bidirectional Aggregation Network with Occlusion Handling for Panoptic Segmentation [PDF] Back to Contents
  Yifeng Chen, Guangchen Lin, Songyuan Li, Bourahla Omar, Yiming Wu, Fangfang Wang, Junyi Feng, Mingliang Xu, Xi Li
Abstract: Panoptic segmentation aims to perform instance segmentation for foreground instances and semantic segmentation for background stuff simultaneously. The typical top-down pipeline concentrates on two key issues: 1) how to effectively model the intrinsic interaction between semantic segmentation and instance segmentation, and 2) how to properly handle occlusion for panoptic segmentation. Intuitively, the complementarity between semantic segmentation and instance segmentation can be leveraged to improve the performance. Besides, we notice that using detection/mask scores is insufficient for resolving the occlusion problem. Motivated by these observations, we propose a novel deep panoptic segmentation scheme based on a bidirectional learning pipeline. Moreover, we introduce a plug-and-play occlusion handling algorithm to deal with the occlusion between different object instances. The experimental results on COCO panoptic benchmark validate the effectiveness of our proposed method. Codes will be released soon at this https URL.

30. Distilled Semantics for Comprehensive Scene Understanding from Videos [PDF] 返回目录
  Fabio Tosi, Filippo Aleotti, Pierluigi Zama Ramirez, Matteo Poggi, Samuele Salti, Luigi Di Stefano, Stefano Mattoccia
Abstract: Whole understanding of the surroundings is paramount to autonomous systems. Recent works have shown that deep neural networks can learn geometry (depth) and motion (optical flow) from a monocular video without any explicit supervision from ground truth annotations, particularly hard to source for these two tasks. In this paper, we take an additional step toward holistic scene understanding with monocular cameras by learning depth and motion alongside with semantics, with supervision for the latter provided by a pre-trained network distilling proxy ground truth images. We address the three tasks jointly by a) a novel training protocol based on knowledge distillation and self-supervision and b) a compact network architecture which enables efficient scene understanding on both power hungry GPUs and low-power embedded platforms. We thoroughly assess the performance of our framework and show that it yields state-of-the-art results for monocular depth estimation, optical flow and motion segmentation.
摘要: Whole understanding of the surroundings is paramount for autonomous systems. Deep networks can learn geometry (depth) and motion (optical flow) from monocular video without explicit ground-truth supervision, which is particularly hard to source for these two tasks. We take a further step toward holistic monocular scene understanding by learning depth and motion alongside semantics, supervising the latter with proxy ground-truth images distilled from a pre-trained network. The three tasks are addressed jointly via (a) a novel training protocol based on knowledge distillation and self-supervision and (b) a compact network architecture that enables efficient scene understanding on both power-hungry GPUs and low-power embedded platforms. A thorough assessment shows state-of-the-art results for monocular depth estimation, optical flow, and motion segmentation.

31. Learning Human-Object Interaction Detection using Interaction Points [PDF] 返回目录
  Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, Jian Sun
Abstract: Understanding interactions between humans and objects is one of the fundamental problems in visual classification and an essential step towards detailed scene understanding. Human-object interaction (HOI) detection strives to localize both the human and an object as well as the identification of complex interactions between them. Most existing HOI detection approaches are instance-centric where interactions between all possible human-object pairs are predicted based on appearance features and coarse spatial information. We argue that appearance features alone are insufficient to capture complex human-object interactions. In this paper, we therefore propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs. Our network predicts interaction points, which directly localize and classify the inter-action. Paired with the densely predicted interaction vectors, the interactions are associated with human and object detections to obtain final predictions. To the best of our knowledge, we are the first to propose an approach where HOI detection is posed as a keypoint detection and grouping problem. Experiments are performed on two popular benchmarks: V-COCO and HICO-DET. Our approach sets a new state-of-the-art on both datasets. Code is available at this https URL.
摘要: Understanding interactions between humans and objects is a fundamental problem in visual classification and an essential step toward detailed scene understanding. Human-object interaction (HOI) detection must localize both the human and the object and identify the complex interactions between them. Most existing approaches are instance-centric, predicting interactions for all possible human-object pairs from appearance features and coarse spatial information; we argue appearance alone cannot capture complex interactions. We propose a fully convolutional approach that directly detects the interactions between human-object pairs: the network predicts interaction points, which directly localize and classify the interaction, and pairs them with densely predicted interaction vectors and the human/object detections to obtain final predictions. To our knowledge, this is the first formulation of HOI detection as a keypoint detection and grouping problem. The approach sets a new state of the art on both the V-COCO and HICO-DET benchmarks. Code is available at this HTTPS URL.

32. SK-Net: Deep Learning on Point Cloud via End-to-end Discovery of Spatial Keypoints [PDF] 返回目录
  Weikun Wu, Yan Zhang, David Wang, Yunqi Lei
Abstract: Since the PointNet was proposed, deep learning on point cloud has been the concentration of intense 3D research. However, existing point-based methods usually are not adequate to extract the local features and the spatial pattern of a point cloud for further shape understanding. This paper presents an end-to-end framework, SK-Net, to jointly optimize the inference of spatial keypoint with the learning of feature representation of a point cloud for a specific point cloud task. One key process of SK-Net is the generation of spatial keypoints (Skeypoints). It is jointly conducted by two proposed regulating losses and a task objective function without knowledge of Skeypoint location annotations and proposals. Specifically, our Skeypoints are not sensitive to the location consistency but are acutely aware of shape. Another key process of SK-Net is the extraction of the local structure of Skeypoints (detail feature) and the local spatial pattern of normalized Skeypoints (pattern feature). This process generates a comprehensive representation, pattern-detail (PD) feature, which comprises the local detail information of a point cloud and reveals its spatial pattern through the part district reconstruction on normalized Skeypoints. Consequently, our network is prompted to effectively understand the correlation between different regions of a point cloud and integrate contextual information of the point cloud. In point cloud tasks, such as classification and segmentation, our proposed method performs better than or comparable with the state-of-the-art approaches. We also present an ablation study to demonstrate the advantages of SK-Net.
摘要: Since PointNet was proposed, deep learning on point clouds has been a focus of intense 3D research, yet existing point-based methods are usually inadequate at extracting the local features and the spatial pattern of a point cloud for further shape understanding. SK-Net is an end-to-end framework that jointly optimizes the inference of spatial keypoints (Skeypoints) with the learning of point-cloud feature representations for a specific task. Skeypoint generation is driven by two proposed regulating losses plus the task objective, without any keypoint location annotations or proposals; the resulting Skeypoints are insensitive to location consistency but acutely aware of shape. SK-Net then extracts the local structure of the Skeypoints (detail feature) and the local spatial pattern of normalized Skeypoints (pattern feature), yielding a comprehensive pattern-detail (PD) feature that captures local detail and reveals the cloud's spatial pattern through part-district reconstruction on the normalized Skeypoints. This prompts the network to relate different regions of the point cloud and integrate its contextual information. On point cloud tasks such as classification and segmentation, the method performs better than or comparably to state-of-the-art approaches; an ablation study demonstrates the advantages of SK-Net.

33. FaceScape: a Large-scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction [PDF] 返回目录
  Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, Xun Cao
Abstract: In this paper, we present a large-scale detailed 3D face dataset, FaceScape, and propose a novel algorithm that is able to predict elaborate riggable 3D face models from a single image input. FaceScape dataset provides 18,760 textured 3D faces, captured from 938 subjects and each with 20 specific expressions. The 3D models contain the pore-level facial geometry that is also processed to be topologically uniformed. These fine 3D facial models can be represented as a 3D morphable model for rough shapes and displacement maps for detailed geometry. Taking advantage of the large-scale and high-accuracy dataset, a novel algorithm is further proposed to learn the expression-specific dynamic details using a deep neural network. The learned relationship serves as the foundation of our 3D face prediction system from a single image input. Different than the previous methods, our predicted 3D models are riggable with highly detailed geometry under different expressions. The unprecedented dataset and code will be released to public for research purpose.
摘要: We present FaceScape, a large-scale detailed 3D face dataset, and a novel algorithm that predicts elaborate riggable 3D face models from a single image. FaceScape provides 18,760 textured 3D faces captured from 938 subjects, each with 20 specific expressions. The models contain pore-level facial geometry, processed to be topologically uniform, and can be represented as a 3D morphable model for rough shape plus displacement maps for detailed geometry. Taking advantage of the large, high-accuracy dataset, a deep neural network learns expression-specific dynamic details, which serve as the foundation of our single-image 3D face prediction system. Unlike previous methods, the predicted models are riggable with highly detailed geometry under different expressions. The unprecedented dataset and code will be released to the public for research purposes.
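
The rough-shape representation above is a 3D morphable model with displacement maps on top. The sketch below shows only the generic linear-3DMM arithmetic with random bases and hypothetical dimensions; FaceScape's actual bases and the exact coupling of identity and expression are defined in the paper.

```python
import numpy as np

# Hypothetical sizes: V vertices, k_id identity and k_exp expression coefficients.
V, k_id, k_exp = 5000, 50, 20
rng = np.random.default_rng(0)
mean_shape = rng.standard_normal((V, 3))
id_basis = rng.standard_normal((V, 3, k_id))    # identity (shape) directions
exp_basis = rng.standard_normal((V, 3, k_exp))  # expression directions

def rough_face(id_coeff, exp_coeff):
    """Rough shape = mean + linear identity/expression offsets; pore-level
    geometry would then be added via a displacement map over this mesh."""
    return mean_shape + id_basis @ id_coeff + exp_basis @ exp_coeff

verts = rough_face(0.1 * rng.standard_normal(k_id), 0.1 * rng.standard_normal(k_exp))
print(verts.shape)  # (5000, 3)
```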

34. Fashion Meets Computer Vision: A Survey [PDF] 返回目录
  Wen-Huang Cheng, Sijie Song, Chieh-Yun Chen, Shintami Chusnul Hidayati, Jiaying Liu
Abstract: Fashion is the way we present ourselves to the world and has become one of the world's largest industries. Fashion, mainly conveyed by vision, has thus attracted much attention from computer vision researchers in recent years. Given the rapid development, this paper provides a comprehensive survey of more than 200 major fashion-related works covering four main aspects for enabling intelligent fashion: (1) Fashion detection includes landmark detection, fashion parsing, and item retrieval, (2) Fashion analysis contains attribute recognition, style learning, and popularity prediction, (3) Fashion synthesis involves style transfer, pose transformation, and physical simulation, and (4) Fashion recommendation comprises fashion compatibility, outfit matching, and hairstyle suggestion. For each task, the benchmark datasets and the evaluation protocols are summarized. Furthermore, we highlight promising directions for future research.
摘要: Fashion is the way we present ourselves to the world and has become one of the world's largest industries; since it is mainly conveyed by vision, it has attracted much attention from computer vision researchers in recent years. Given this rapid development, the paper surveys more than 200 major fashion-related works covering four main aspects of intelligent fashion: (1) fashion detection, including landmark detection, fashion parsing, and item retrieval; (2) fashion analysis, containing attribute recognition, style learning, and popularity prediction; (3) fashion synthesis, involving style transfer, pose transformation, and physical simulation; and (4) fashion recommendation, comprising fashion compatibility, outfit matching, and hairstyle suggestion. For each task, the benchmark datasets and evaluation protocols are summarized, and promising directions for future research are highlighted.

35. DeepLPF: Deep Local Parametric Filters for Image Enhancement [PDF] 返回目录
  Sean Moran, Pierre Marza, Steven McDonagh, Sarah Parisot, Gregory Slabaugh
Abstract: Digital artists often improve the aesthetic quality of digital photographs through manual retouching. Beyond global adjustments, professional image editing programs provide local adjustment tools operating on specific parts of an image. Options include parametric (graduated, radial filters) and unconstrained brush tools. These highly expressive tools enable a diverse set of local image enhancements. However, their use can be time consuming, and requires artistic capability. State-of-the-art automated image enhancement approaches typically focus on learning pixel-level or global enhancements. The former can be noisy and lack interpretability, while the latter can fail to capture fine-grained adjustments. In this paper, we introduce a novel approach to automatically enhance images using learned spatially local filters of three different types (Elliptical Filter, Graduated Filter, Polynomial Filter). We introduce a deep neural network, dubbed Deep Local Parametric Filters (DeepLPF), which regresses the parameters of these spatially localized filters that are then automatically applied to enhance the image. DeepLPF provides a natural form of model regularization and enables interpretable, intuitive adjustments that lead to visually pleasing results. We report on multiple benchmarks and show that DeepLPF produces state-of-the-art performance on two variants of the MIT-Adobe-5K dataset, often using a fraction of the parameters required for competing methods.
摘要: Digital artists improve the aesthetic quality of photographs through manual retouching. Beyond global adjustments, professional editing programs provide local adjustment tools, both parametric (graduated and radial filters) and unconstrained brushes; these are highly expressive but time-consuming and require artistic skill. Automated enhancement approaches typically learn either pixel-level edits, which can be noisy and lack interpretability, or global edits, which fail to capture fine-grained adjustments. We instead enhance images automatically with learned, spatially local filters of three types (elliptical, graduated, polynomial). A deep network, Deep Local Parametric Filters (DeepLPF), regresses the parameters of these spatially localized filters, which are then applied to enhance the image. DeepLPF provides a natural form of model regularization and interpretable, intuitive adjustments that lead to visually pleasing results, achieving state-of-the-art performance on two variants of the MIT-Adobe-5K dataset, often with a fraction of the parameters required by competing methods.
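
To make the "graduated filter" concrete, here is one plausible parametric form; the parameter names (offset, slope, strength) are illustrative, not DeepLPF's actual parameterization. A clipped linear ramp like this is differentiable in its parameters, which is what allows a network to regress them.

```python
import numpy as np

def graduated_filter(img, offset=0.5, slope=1.0, strength=0.3):
    """Apply a vertical graduated exposure filter to an HxWx3 image in [0, 1].

    A linear ramp (clipped to [0, 1]) blends full adjustment at the top of
    the transition band into no adjustment below it; 'offset' and 'slope'
    place the band, 'strength' scales the exposure change.
    """
    h, w, _ = img.shape
    y = np.linspace(0.0, 1.0, h)[:, None]             # normalized row position
    ramp = np.clip((offset - y) * slope + 0.5, 0.0, 1.0)
    gain = 1.0 + strength * ramp                      # per-row multiplicative gain
    return np.clip(img * gain[..., None], 0.0, 1.0)

out = graduated_filter(np.random.rand(256, 384, 3))
```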

36. Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [PDF] 返回目录
  Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen
Abstract: Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, only resorting to pre-trained word embedding models is far from enough. A desired model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and obviously facilitates the performances on two VQA tasks that require reading scene texts.
摘要: Answering questions that require reading text in an image is challenging for current models. One key difficulty is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams; resorting to pre-trained word embeddings alone is far from enough. A desirable model should use the rich information in the image's multiple modalities to interpret scene text, e.g., prominent text on a bottle is most likely the brand. Following this idea, we propose the Multi-Modal Graph Neural Network (MM-GNN) for VQA. It represents an image as a graph of three sub-graphs depicting the visual, semantic, and numeric modalities, and introduces three aggregators that guide message passing from one graph to another to exploit multimodal context and refine node features for the downstream question-answering module. Experiments show MM-GNN represents scene text better and clearly improves performance on two VQA tasks that require reading scene text.

37. Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [PDF] 返回目录
  Dongdong Wang, Yandong Li, Liqiang Wang, Boqing Gong
Abstract: We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner. Progress on this problem can significantly reduce the dependence on large-scale datasets for learning high-performing visual recognition models. There are two major challenges. One is that the number of queries into the teacher model should be minimized to save computational and/or financial costs. The other is that the number of images used for the knowledge distillation should be small; otherwise, it violates our expectation of reducing the dependence on large-scale datasets. To tackle these challenges, we propose an approach that blends mixup and active learning. The former effectively augments the few unlabeled images by a big pool of synthetic images sampled from the convex hull of the original images, and the latter actively chooses from the pool hard examples for the student neural network and query their labels from the teacher model. We validate our approach with extensive experiments.
摘要: We study how to train a student network for visual recognition by distilling knowledge from a blackbox teacher in a data-efficient manner; progress here can significantly reduce the dependence on large-scale datasets for learning high-performing visual recognition models. There are two major challenges: the number of queries to the teacher should be minimized to save computational and/or financial cost, and the number of images used for distillation should be small, or the goal of reducing data dependence is defeated. We propose an approach that blends mixup and active learning: mixup augments the few unlabeled images into a big pool of synthetic images sampled from their convex hull, and active learning picks hard examples from that pool for the student and queries their labels from the teacher. Extensive experiments validate the approach.
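
A sketch of the two ingredients under assumed shapes: mixup synthesizes a large pool from convex combinations of a few unlabeled images, and the active step keeps only the examples the current student is least confident about before spending teacher queries on them. `student` and `blackbox_teacher` are hypothetical callables, not names from the paper.

```python
import torch

def mixup_pool(images, n_mix):
    """Synthesize n_mix convex combinations of randomly paired images (N, C, H, W)."""
    i = torch.randint(len(images), (n_mix,))
    j = torch.randint(len(images), (n_mix,))
    lam = torch.rand(n_mix, 1, 1, 1)                  # per-example mixing coefficient
    return lam * images[i] + (1 - lam) * images[j]

def select_hard(student, pool, k):
    """Actively pick the k pool images the student is least confident about."""
    with torch.no_grad():
        probs = student(pool).softmax(dim=1)
    confidence = probs.max(dim=1).values
    return pool[confidence.topk(k, largest=False).indices]

# hard = select_hard(student, mixup_pool(unlabeled_images, 10000), 512)
# soft_labels = blackbox_teacher(hard)   # the only (costly) teacher queries
# ...then train the student on (hard, soft_labels)
```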

38. FGN: Fully Guided Network for Few-Shot Instance Segmentation [PDF] 返回目录
  Zhibo Fan, Jin-Gang Yu, Zhihao Liang, Jiarong Ou, Changxin Gao, Gui-Song Xia, Yuanqing Li
Abstract: Few-shot instance segmentation (FSIS) conjoins the few-shot learning paradigm with general instance segmentation, which provides a possible way of tackling instance segmentation in the lack of abundant labeled data for training. This paper presents a Fully Guided Network (FGN) for few-shot instance segmentation. FGN perceives FSIS as a guided model where a so-called support set is encoded and utilized to guide the predictions of a base instance segmentation network (i.e., Mask R-CNN), critical to which is the guidance mechanism. In this view, FGN introduces different guidance mechanisms into the various key components in Mask R-CNN, including Attention-Guided RPN, Relation-Guided Detector, and Attention-Guided FCN, in order to make full use of the guidance effect from the support set and adapt better to the inter-class generalization. Experiments on public datasets demonstrate that our proposed FGN can outperform the state-of-the-art methods.
摘要: Few-shot instance segmentation (FSIS) conjoins the few-shot learning paradigm with general instance segmentation, offering a possible way to tackle instance segmentation without abundant labeled training data. We present the Fully Guided Network (FGN), which treats FSIS as a guided model: an encoded support set guides the predictions of a base instance segmentation network (Mask R-CNN), with the guidance mechanism being the critical part. FGN introduces different guidance mechanisms into the key components of Mask R-CNN, namely an Attention-Guided RPN, a Relation-Guided Detector, and an Attention-Guided FCN, to make full use of the support set and adapt better to inter-class generalization. Experiments on public datasets show FGN outperforms state-of-the-art methods.

39. Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume [PDF] 返回目录
  Adrian Johnston, Gustavo Carneiro
Abstract: Monocular depth estimation has become one of the most studied applications in computer vision, where the most accurate approaches are based on fully supervised learning models. However, the acquisition of accurate and large ground truth data sets to model these fully supervised methods is a major challenge for the further development of the area. Self-supervised methods trained with monocular videos constitute one the most promising approaches to mitigate the challenge mentioned above due to the wide-spread availability of training data. Consequently, they have been intensively studied, where the main ideas explored consist of different types of model architectures, loss functions, and occlusion masks to address non-rigid motion. In this paper, we propose two new ideas to improve self-supervised monocular trained depth estimation: 1) self-attention, and 2) discrete disparity prediction. Compared with the usual localised convolution operation, self-attention can explore a more general contextual information that allows the inference of similar disparity values at non-contiguous regions of the image. Discrete disparity prediction has been shown by fully supervised methods to provide a more robust and sharper depth estimation than the more common continuous disparity prediction, besides enabling the estimation of depth uncertainty. We show that the extension of the state-of-the-art self-supervised monocular trained depth estimator Monodepth2 with these two ideas allows us to design a model that produces the best results in the field in KITTI 2015 and Make3D, closing the gap with respect self-supervised stereo training and fully supervised approaches.
摘要: Monocular depth estimation is one of the most studied applications in computer vision, and the most accurate approaches are fully supervised, yet acquiring accurate, large-scale ground-truth depth is a major obstacle to further progress. Self-supervised methods trained on monocular video are the most promising alternative thanks to the widespread availability of training data, and have been studied intensively through different model architectures, loss functions, and occlusion masks for non-rigid motion. We propose two further ideas: (1) self-attention, which, unlike localized convolution, can exploit more general context to infer similar disparity values at non-contiguous image regions; and (2) discrete disparity prediction, shown by fully supervised methods to be more robust and sharper than continuous prediction while also enabling estimation of depth uncertainty. Extending the state-of-the-art self-supervised monocular estimator Monodepth2 with these two ideas yields the best results in the field on KITTI 2015 and Make3D, closing the gap with self-supervised stereo training and fully supervised approaches.
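
Discrete disparity prediction replaces direct regression with a distribution over disparity bins. A common readout, sketched below with assumed shapes and bin range, is a soft-argmax; the entropy of the same distribution provides the kind of per-pixel uncertainty estimate the abstract mentions.

```python
import torch

def disparity_from_volume(logits, max_disp=0.3):
    """Soft-argmax over a discrete disparity volume.

    logits: (B, D, H, W) scores over D disparity bins. The expectation over
    bin values gives a sub-bin disparity; the distribution's entropy serves
    as a simple uncertainty map.
    """
    probs = logits.softmax(dim=1)                                    # (B, D, H, W)
    bins = torch.linspace(0, max_disp, logits.shape[1],
                          device=logits.device).view(1, -1, 1, 1)
    disparity = (probs * bins).sum(dim=1)                            # expected disparity
    uncertainty = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)  # entropy
    return disparity, uncertainty

d, u = disparity_from_volume(torch.randn(2, 64, 24, 32))
```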

40. Segmenting Transparent Objects in the Wild [PDF] 返回目录
  Enze Xie, Wenjia Wang, Wenhai Wang, Mingyu Ding, Chunhua Shen, Ping Luo
Abstract: Transparent objects such as windows and bottles made by glass widely exist in the real world. Segmenting transparent objects is challenging because these objects have diverse appearance inherited from the image background, making them had similar appearance with their surroundings. Besides the technical difficulty of this task, only a few previous datasets were specially designed and collected to explore this task and most of the existing datasets have major drawbacks. They either possess limited sample size such as merely a thousand of images without manual annotations, or they generate all images by using computer graphics method (i.e. not real image). To address this important problem, this work proposes a large-scale dataset for transparent object segmentation, named Trans10K, consisting of 10,428 images of real scenarios with carefully manual annotations, which are 10 times larger than the existing datasets. The transparent objects in Trans10K are extremely challenging due to high diversity in scale, viewpoint and occlusion as shown in Fig. 1. To evaluate the effectiveness of Trans10K, we propose a novel boundary-aware segmentation method, termed TransLab, which exploits boundary as the clue to improve segmentation of transparent objects. Extensive experiments and ablation studies demonstrate the effectiveness of Trans10K and validate the practicality of learning object boundary in TransLab. For example, TransLab significantly outperforms 20 recent object segmentation methods based on deep learning, showing that this task is largely unsolved. We believe that both Trans10K and TransLab have important contributions to both the academia and industry, facilitating future researches and applications.
摘要: Transparent objects such as glass windows and bottles are ubiquitous in the real world, but segmenting them is challenging: they inherit diverse appearance from the image background and thus resemble their surroundings. Moreover, only a few prior datasets target this task, and most have major drawbacks, either limited size (e.g., about a thousand images without manual annotation) or purely synthetic imagery. We propose Trans10K, a large-scale transparent object segmentation dataset of 10,428 real-world images with careful manual annotations, 10x larger than existing datasets and extremely challenging due to high diversity in scale, viewpoint, and occlusion. To evaluate it, we propose TransLab, a boundary-aware segmentation method that exploits the boundary as a clue for segmenting transparent objects. Extensive experiments and ablations demonstrate the effectiveness of Trans10K and the practicality of learning object boundaries: TransLab significantly outperforms 20 recent deep-learning segmentation methods, showing the task is largely unsolved. We believe Trans10K and TransLab will contribute to both academia and industry, facilitating future research and applications.

41. A Simple Class Decision Balancing for Incremental Learning [PDF] 返回目录
  Hongjoon Ahn, Taesup Moon
Abstract: Class incremental learning (CIL) problem, in which a learning agent continuously learns new classes from incrementally arriving training data batches, has gained much attention recently in AI and computer vision community due to both fundamental and practical perspectives of the problem. For mitigating the main difficulty of deep neural network(DNN)-based CIL, the catastrophic forgetting, recent work showed that a simple fine-tuning (FT) based schemes can outperform the earlier attempts of using knowledge distillation, particularly when a small-sized exemplar-memory for storing samples from the previously learned classes is allowed. The core limitation of the vanilla FT, however, is the severe classification score bias between the new and previously learned classes, and several state-of-the-art methods proposed to rectify the bias via additional post-processing of the scores. In this paper, we propose two simple modifications for the vanilla FT, separated softmax (SS) layer and ratio-preserving (RP) mini-batches for SGD updates. Our scheme, dubbed as SS-IL, is shown to give much more balanced class decisions, have much less biased scores, and outperform strong state-of-the-art baselines on several large-scale benchmark datasets, without any sophisticated post-processing of the scores. We also give several novel analyses our and baseline methods, confirming the effectiveness of our approach in CIL.
摘要: In class incremental learning (CIL), an agent continuously learns new classes from incrementally arriving training batches; the problem has recently gained much attention in the AI and computer vision communities for both fundamental and practical reasons. To mitigate catastrophic forgetting, the main difficulty of DNN-based CIL, recent work showed that simple fine-tuning (FT) schemes can outperform earlier knowledge-distillation attempts, particularly when a small exemplar memory storing samples from previously learned classes is allowed. The core limitation of vanilla FT, however, is a severe classification-score bias between new and previously learned classes, which state-of-the-art methods rectify with additional post-processing of the scores. We propose two simple modifications to vanilla FT: a separated softmax (SS) layer and ratio-preserving (RP) mini-batches for SGD updates. Our scheme, dubbed SS-IL, yields much more balanced class decisions and far less biased scores, and outperforms strong state-of-the-art baselines on several large-scale benchmark datasets without any sophisticated score post-processing. Several novel analyses of our method and the baselines confirm its effectiveness.
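
The paper's exact SS formulation and the ratio-preserving mini-batches are not reproduced here; the sketch below only illustrates the separated-softmax idea as described: normalizing old-class and new-class logits separately, so that large new-class logits cannot suppress old-class probabilities (and gradients) during fine-tuning.

```python
import torch
import torch.nn.functional as F

def separated_softmax_loss(logits, targets, n_old):
    """Cross-entropy with the softmax computed separately over old- and
    new-class logit blocks. logits: (B, n_old + n_new); targets: (B,).
    A sketch of the separated-softmax idea; details differ in the paper.
    """
    old_mask = targets < n_old
    loss = logits.new_zeros(())
    if old_mask.any():       # exemplar-memory samples: softmax over old classes only
        loss = loss + F.cross_entropy(logits[old_mask, :n_old], targets[old_mask])
    if (~old_mask).any():    # new-task samples: softmax over new classes only
        loss = loss + F.cross_entropy(logits[~old_mask, n_old:],
                                      targets[~old_mask] - n_old)
    return loss

l = separated_softmax_loss(torch.randn(8, 110), torch.randint(110, (8,)), n_old=100)
```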

42. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [PDF] 返回目录
  Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles
Abstract: Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
摘要: Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions from scene-level or object-level information but do not explicitly model object interactions, so they often fail to make visually grounded predictions and are sensitive to spurious correlations. We propose a spatio-temporal graph model for video captioning that exploits object interactions in space and time, building interpretable links that provide explicit visual grounding. To avoid the unstable performance caused by a variable number of objects, we further propose an object-aware knowledge distillation mechanism in which local object information regularizes global scene features. Extensive experiments on two benchmarks show competitive performance with interpretable predictions.

43. EvolveGraph: Heterogeneous Multi-Agent Multi-Modal Trajectory Prediction with Evolving Interaction Graphs [PDF] 返回目录
  Jiachen Li, Fan Yang, Masayoshi Tomizuka, Chiho Choi
Abstract: Multi-agent interacting systems are prevalent in the world, from pure physical systems to complicated social dynamic systems. In many applications, effective understanding of the environment and accurate trajectory prediction of interactive agents play a significant role in downstream tasks, such as decision and planning. In this paper, we propose a generic trajectory forecasting framework (named EvolveGraph) with explicit interaction modeling via a latent interaction graph among multiple heterogeneous, interactive agents. Considering the uncertainty and the possibility of different future behaviors, the model is designed to provide multi-modal prediction hypotheses. Since the interactions may be time-varying even with abrupt changes, and different modalities may have different interactions, we address the necessity and effectiveness of adaptively evolving the interaction graph and provide an effective solution. We also introduce a double-stage training pipeline which not only improves training efficiency and accelerates convergence, but also enhances model performance in terms of prediction error. The proposed framework is evaluated on multiple public benchmark datasets in various areas for trajectory prediction, where the agents cover on-road vehicles, pedestrians, cyclists and sports players. The experimental results illustrate that our approach achieves state-of-the-art performance in terms of prediction accuracy.
摘要: Multi-agent interacting systems are prevalent in the world, from purely physical systems to complicated social dynamic systems, and in many applications effective environment understanding and accurate trajectory prediction of interactive agents are vital for downstream tasks such as decision making and planning. We propose EvolveGraph, a generic trajectory forecasting framework with explicit interaction modeling via a latent interaction graph among multiple heterogeneous, interactive agents; considering uncertainty over future behaviors, it produces multi-modal prediction hypotheses. Since interactions may be time-varying, even with abrupt changes, and different modalities may interact differently, we show the necessity and effectiveness of adaptively evolving the interaction graph and provide an effective solution, plus a two-stage training pipeline that improves training efficiency, accelerates convergence, and reduces prediction error. Evaluated on multiple public benchmarks covering on-road vehicles, pedestrians, cyclists, and sports players, the approach achieves state-of-the-art prediction accuracy.

44. Y-net: Multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing [PDF] 返回目录
  Hao-Hsiang Yang, Chao-Han Huck Yang, Yi-Chang James Tsai
Abstract: Single image dehazing is the ill-posed two-dimensional signal reconstruction problem. Recently, deep convolutional neural networks (CNN) have been successfully used in many computer vision problems. In this paper, we propose a Y-net that is named for its structure. This network reconstructs clear images by aggregating multi-scale features maps. Additionally, we propose a Wavelet Structure SIMilarity (W-SSIM) loss function in the training step. In the proposed loss function, discrete wavelet transforms are applied repeatedly to divide the image into differently sized patches with different frequencies and scales. The proposed loss function is the accumulation of SSIM loss of various patches with respective ratios. Extensive experimental results demonstrate that the proposed Y-net with the W-SSIM loss function restores high-quality clear images and outperforms state-of-the-art algorithms. Code and models are available at this https URL.
摘要: Single-image dehazing is an ill-posed two-dimensional signal reconstruction problem. We propose a Y-net, named for its structure, that reconstructs clear images by aggregating multi-scale feature maps, together with a Wavelet Structure SIMilarity (W-SSIM) loss for training: discrete wavelet transforms are applied repeatedly to divide the image into differently sized patches with different frequencies and scales, and the loss accumulates the SSIM losses of the patches with respective ratios. Extensive experiments show that the Y-net with W-SSIM loss restores high-quality clear images and outperforms state-of-the-art algorithms. Code and models are available at this HTTPS URL.
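
A simplified sketch of the W-SSIM idea, under stated shortcuts: a one-level Haar-style split (recursed on the low-pass band) in place of a full DWT, a global-statistics SSIM in place of the usual windowed SSIM, and made-up ratio weights. The paper's exact transform, window, and ratios differ.

```python
import torch

def haar_split(x):
    """Unnormalized one-level Haar-style analysis of (B, C, H, W), H and W even."""
    a, b = x[..., 0::2, :], x[..., 1::2, :]
    lo, hi = (a + b) / 2, (a - b) / 2
    ll, lh = (lo[..., 0::2] + lo[..., 1::2]) / 2, (lo[..., 0::2] - lo[..., 1::2]) / 2
    hl, hh = (hi[..., 0::2] + hi[..., 1::2]) / 2, (hi[..., 0::2] - hi[..., 1::2]) / 2
    return ll, lh, hl, hh

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Simplified SSIM using global (not sliding-window) statistics."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

def w_ssim_loss(pred, target, ratios=(0.5, 0.3, 0.2)):
    """Accumulate SSIM losses over wavelet sub-bands across several scales."""
    loss, x, y = 0.0, pred, target
    for r in ratios:
        bx, by = haar_split(x), haar_split(y)
        loss = loss + r * sum(1 - ssim_global(u, v) for u, v in zip(bx, by)) / 4
        x, y = bx[0], by[0]          # recurse into the low-pass band
    return loss

l = w_ssim_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```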

45. Proxy Anchor Loss for Deep Metric Learning [PDF] 返回目录
  Sungyeon Kim, Dongwon Kim, Minsu Cho, Suha Kwak
Abstract: Existing metric learning losses can be categorized into two classes: pair-based and proxy-based losses. The former class can leverage fine-grained semantic relations between data points, but slows convergence in general due to its high training complexity. In contrast, the latter class enables fast and reliable convergence, but cannot consider the rich data-to-data relations. This paper presents a new proxy-based loss that takes advantages of both pair- and proxy-based methods and overcomes their limitations. Thanks to the use of proxies, our loss boosts the speed of convergence and is robust against noisy labels and outliers. At the same time, it allows embedding vectors of data to interact with each other in its gradients to exploit data-to-data relations. Our method is evaluated on four public benchmarks, where a standard network trained with our loss achieves state-of-the-art performance and most quickly converges.
摘要: Existing metric learning losses fall into two classes. Pair-based losses can leverage fine-grained data-to-data relations but generally converge slowly due to high training complexity; proxy-based losses converge fast and reliably but cannot consider the rich data-to-data relations. We present a new proxy-based loss that takes the advantages of both and overcomes their limitations: thanks to proxies, it boosts convergence speed and is robust to noisy labels and outliers, while its gradients still let embedding vectors interact with each other, exploiting data-to-data relations. On four public benchmarks, a standard network trained with our loss achieves state-of-the-art performance and converges most quickly.
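
For reference, the published Proxy-Anchor formulation can be sketched as below; the hyper-parameters (alpha=32, delta=0.1) are commonly reported defaults and the sampling details are omitted. Each class proxy acts as an anchor that log-sum-exps over its positives in the batch and over all negatives, so harder examples receive larger gradients.

```python
import torch
import torch.nn.functional as F

class ProxyAnchorLoss(torch.nn.Module):
    """Sketch of the Proxy-Anchor loss: proxies are pulled toward their
    positive embeddings and pushed from all negatives, with per-example
    gradient weights determined by hardness via log-sum-exp."""
    def __init__(self, n_classes, dim, alpha=32.0, delta=0.1):
        super().__init__()
        self.proxies = torch.nn.Parameter(torch.randn(n_classes, dim))
        self.alpha, self.delta = alpha, delta

    def forward(self, embeddings, labels):
        sim = F.normalize(embeddings) @ F.normalize(self.proxies).t()   # (B, C) cosine
        pos = F.one_hot(labels, self.proxies.shape[0]).float()          # (B, C)
        with_pos = pos.sum(0) > 0                                       # proxies with positives
        pos_term = torch.log1p((pos * torch.exp(-self.alpha * (sim - self.delta))).sum(0))
        neg_term = torch.log1p(((1 - pos) * torch.exp(self.alpha * (sim + self.delta))).sum(0))
        return pos_term[with_pos].mean() + neg_term.mean()

loss_fn = ProxyAnchorLoss(n_classes=100, dim=128)
loss = loss_fn(torch.randn(32, 128), torch.randint(100, (32,)))
```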

46. Attention-based Multi-modal Fusion Network for Semantic Scene Completion [PDF] 返回目录
  Siqi Li, Changqing Zou, Yipeng Li, Xibin Zhao, Yue Gao
Abstract: This paper presents an end-to-end 3D convolutional network named attention-based multi-modal fusion network (AMFNet) for the semantic scene completion (SSC) task of inferring the occupancy and semantic labels of a volumetric 3D scene from single-view RGB-D images. Compared with previous methods which use only the semantic features extracted from RGB-D images, the proposed AMFNet learns to perform effective 3D scene completion and semantic segmentation simultaneously via leveraging the experience of inferring 2D semantic segmentation from RGB-D images as well as the reliable depth cues in spatial dimension. It is achieved by employing a multi-modal fusion architecture boosted from 2D semantic segmentation and a 3D semantic completion network empowered by residual attention blocks. We validate our method on both the synthetic SUNCG-RGBD dataset and the real NYUv2 dataset and the results show that our method respectively achieves the gains of 2.5% and 2.6% on the synthetic SUNCG-RGBD dataset and the real NYUv2 dataset against the state-of-the-art method.
摘要: We present AMFNet, an end-to-end 3D convolutional attention-based multi-modal fusion network for the semantic scene completion (SSC) task of inferring the occupancy and semantic labels of a volumetric 3D scene from single-view RGB-D images. Unlike previous methods that use only the semantic features extracted from RGB-D images, AMFNet performs 3D scene completion and semantic segmentation simultaneously, leveraging experience from inferring 2D semantic segmentation from RGB-D images as well as reliable depth cues in the spatial dimension. This is achieved with a multi-modal fusion architecture boosted from 2D semantic segmentation and a 3D semantic completion network empowered by residual attention blocks. On the synthetic SUNCG-RGBD dataset and the real NYUv2 dataset, our method gains 2.5% and 2.6%, respectively, over the state-of-the-art method.

47. Learning Oracle Attention for High-fidelity Face Completion [PDF] 返回目录
  Tong Zhou, Changxing Ding, Shaowen Lin, Xinchao Wang, Dacheng Tao
Abstract: High-fidelity face completion is a challenging task due to the rich and subtle facial textures involved. What makes it more complicated is the correlations between different facial components, for example, the symmetry in texture and structure between both eyes. While recent works adopted the attention mechanism to learn the contextual relations among elements of the face, they have largely overlooked the disastrous impacts of inaccurate attention scores; in addition, they fail to pay sufficient attention to key facial components, the completion results of which largely determine the authenticity of a face image. Accordingly, in this paper, we design a comprehensive framework for face completion based on the U-Net structure. Specifically, we propose a dual spatial attention module to efficiently learn the correlations between facial textures at multiple scales; moreover, we provide an oracle supervision signal to the attention module to ensure that the obtained attention scores are reasonable. Furthermore, we take the location of the facial components as prior knowledge and impose a multi-discriminator on these regions, with which the fidelity of facial components is significantly promoted. Extensive experiments on two high-resolution face datasets including CelebA-HQ and Flickr-Faces-HQ demonstrate that the proposed approach outperforms state-of-the-art methods by large margins.
摘要: High-fidelity face completion is challenging because facial textures are rich and subtle and facial components are correlated, e.g., texture and structure are symmetric between the two eyes. Recent works adopt attention to learn contextual relations among face elements but largely overlook the disastrous impact of inaccurate attention scores, and pay insufficient attention to key facial components whose completion largely determines the authenticity of a face image. We design a comprehensive U-Net-based framework: a dual spatial attention module efficiently learns correlations between facial textures at multiple scales, and an oracle supervision signal on the attention module ensures the obtained attention scores are reasonable. We further take facial-component locations as prior knowledge and impose a multi-discriminator on those regions, significantly promoting component fidelity. Extensive experiments on two high-resolution face datasets, CelebA-HQ and Flickr-Faces-HQ, show the approach outperforms state-of-the-art methods by large margins.

48. Edge Guided GANs with Semantic Preserving for Semantic Image Synthesis [PDF] 返回目录
  Hao Tang, Xiaojuan Qi, Dan Xu, Philip H. S. Torr, Nicu Sebe
Abstract: We propose a novel Edge guided Generative Adversarial Network (EdgeGAN) for photo-realistic image synthesis from semantic layouts. Although considerable improvement has been achieved, the quality of synthesized images is far from satisfactory due to two largely unresolved challenges. First, the semantic labels do not provide detailed structural information, making it difficult to synthesize local details and structures. Second, the widely adopted CNN operations such as convolution, down-sampling and normalization usually cause spatial resolution loss and thus are unable to fully preserve the original semantic information, leading to semantically inconsistent results (e.g., missing small objects). To tackle the first challenge, we propose to use the edge as an intermediate representation which is further adopted to guide image generation via a proposed attention guided edge transfer module. Edge information is produced by a convolutional generator and introduces detailed structure information. Further, to preserve the semantic information, we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout. Extensive experiments on two challenging datasets show that the proposed EdgeGAN can generate significantly better results than state-of-the-art methods. The source code and trained models are available at this https URL.
摘要: We propose EdgeGAN, an edge-guided generative adversarial network for photo-realistic image synthesis from semantic layouts. Despite considerable progress, synthesis quality is still unsatisfactory due to two largely unresolved challenges: semantic labels provide no detailed structural information, making local details and structures hard to synthesize, and widely adopted CNN operations (convolution, down-sampling, normalization) cause spatial resolution loss and thus cannot fully preserve semantic information, leading to semantically inconsistent results such as missing small objects. To tackle the first challenge, EdgeGAN uses the edge as an intermediate representation, produced by a convolutional generator to introduce detailed structure, and guides image generation via a proposed attention-guided edge transfer module. To preserve semantics, a further module selectively highlights class-dependent feature maps according to the original semantic layout. Extensive experiments on two challenging datasets show significantly better results than state-of-the-art methods. Source code and trained models are available at this HTTPS URL.

49. TITAN: Future Forecast using Action Priors [PDF] 返回目录
  Srikanth Malla, Behzad Dariush, Chiho Choi
Abstract: We consider the problem of predicting the future trajectory of scene agents from egocentric views obtained from a moving platform. This problem is important in a variety of domains, particularly for autonomous systems making reactive or strategic decisions in navigation. In an attempt to address this problem, we introduce TITAN (Trajectory Inference using Targeted Action priors Network), a new model that incorporates prior positions, actions, and context to forecast future trajectory of agents and future ego-motion. In the absence of an appropriate dataset for this task, we created the TITAN dataset that consists of 700 labeled video-clips (with odometry) captured from a moving vehicle on highly interactive urban traffic scenes in Tokyo. Our dataset includes 50 labels including vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes that are organized hierarchically corresponding to atomic, simple/complex-contextual, transportive, and communicative actions. To evaluate our model, we conducted extensive experiments on the TITAN dataset, revealing significant performance improvement against baselines and state-of-the-art algorithms. We also report promising results from our Agent Importance Mechanism (AIM), a module which provides insight into assessment of perceived risk by calculating the relative influence of each agent on the future ego-trajectory. The dataset is available at this https URL
摘要: We consider predicting the future trajectories of scene agents from egocentric views obtained from a moving platform, a problem important in many domains, particularly for autonomous systems making reactive or strategic navigation decisions. We introduce TITAN (Trajectory Inference using Targeted Action priors Network), a new model that incorporates prior positions, actions, and context to forecast the future trajectories of agents and future ego-motion. Lacking an appropriate dataset, we created the TITAN dataset: 700 labeled video clips (with odometry) captured from a moving vehicle in highly interactive urban traffic scenes in Tokyo, with 50 labels covering vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes organized hierarchically into atomic, simple/complex-contextual, transportive, and communicative actions. Extensive experiments on TITAN reveal significant improvement over baselines and state-of-the-art algorithms. We also report promising results from our Agent Importance Mechanism (AIM), a module that provides insight into perceived risk by computing each agent's relative influence on the future ego-trajectory. The dataset is available at this HTTPS URL.

50. MUXConv: Information Multiplexing in Convolutional Neural Networks [PDF] 返回目录
  Zhichao Lu, Kalyanmoy Deb, Vishnu Naresh Boddeti
Abstract: Convolutional neural networks have witnessed remarkable improvements in computational efficiency in recent years. A key driving force has been the idea of trading-off model expressivity and efficiency through a combination of $1\times 1$ and depth-wise separable convolutions in lieu of a standard convolutional layer. The price of the efficiency, however, is the sub-optimal flow of information across space and channels in the network. To overcome this limitation, we present MUXConv, a layer that is designed to increase the flow of information by progressively multiplexing channel and spatial information in the network, while mitigating computational complexity. Furthermore, to demonstrate the effectiveness of MUXConv, we integrate it within an efficient multi-objective evolutionary algorithm to search for the optimal model hyper-parameters while simultaneously optimizing accuracy, compactness, and computational efficiency. On ImageNet, the resulting models, dubbed MUXNets, match the performance (75.3% top-1 accuracy) and multiply-add operations (218M) of MobileNetV3 while being 1.6$\times$ more compact, and outperform other mobile models in all the three criteria. MUXNet also performs well under transfer learning and when adapted to object detection. On the ChestX-Ray 14 benchmark, its accuracy is comparable to the state-of-the-art while being $3.3\times$ more compact and $14\times$ more efficient. Similarly, detection on PASCAL VOC 2007 is 1.2% more accurate, 28% faster and 6% more compact compared to MobileNetV2. Code is available from this https URL
摘要: CNNs have seen remarkable gains in computational efficiency in recent years, driven largely by trading off expressivity and efficiency through combinations of $1\times 1$ and depth-wise separable convolutions in lieu of standard convolutional layers; the price is sub-optimal information flow across space and channels. MUXConv is a layer that increases this flow by progressively multiplexing channel and spatial information in the network while mitigating computational complexity. To demonstrate its effectiveness, we integrate it into an efficient multi-objective evolutionary search over model hyper-parameters that jointly optimizes accuracy, compactness, and computational efficiency. On ImageNet, the resulting MUXNets match MobileNetV3's performance (75.3% top-1 accuracy) and multiply-adds (218M) while being 1.6x more compact, and outperform other mobile models on all three criteria. MUXNet also performs well under transfer learning and object detection: on ChestX-Ray 14 it matches state-of-the-art accuracy while being 3.3x more compact and 14x more efficient, and on PASCAL VOC 2007 detection it is 1.2% more accurate, 28% faster, and 6% more compact than MobileNetV2. Code is available from this HTTPS URL.
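
The efficiency/expressivity trade-off the abstract refers to is easy to quantify. The block below is the standard depthwise-separable baseline, not MUXConv itself (whose channel/spatial multiplexing is described in the paper); the parameter arithmetic shows why such factorizations dominate mobile architectures.

```python
import torch.nn as nn

def separable_block(c_in, c_out, stride=1):
    """Depthwise 3x3 + pointwise 1x1: the building block whose information
    bottleneck MUXConv is designed to alleviate."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False),  # depthwise
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

std = 3 * 3 * 64 * 128          # weights of a standard 3x3 conv, 64 -> 128 channels
sep = 3 * 3 * 64 + 64 * 128     # depthwise + pointwise weights
print(std / sep)                # ~8x fewer parameters (and multiply-adds per position)
```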

51. RetinaTrack: Online Single Stage Joint Detection and Tracking [PDF] 返回目录
  Zhichao Lu, Vivek Rathod, Ronny Votel, Jonathan Huang
Abstract: Traditionally multi-object tracking and object detection are performed using separate systems with most prior works focusing exclusively on one of these aspects over the other. Tracking systems clearly benefit from having access to accurate detections, however and there is ample evidence in literature that detectors can benefit from tracking which, for example, can help to smooth predictions over time. In this paper we focus on the tracking-by-detection paradigm for autonomous driving where both tasks are mission critical. We propose a conceptually simple and efficient joint model of detection and tracking, called RetinaTrack, which modifies the popular single stage RetinaNet approach such that it is amenable to instance-level embedding training. We show, via evaluations on the Waymo Open Dataset, that we outperform a recent state of the art tracking algorithm while requiring significantly less computation. We believe that our simple yet effective approach can serve as a strong baseline for future work in this area.
摘要: Multi-object tracking and object detection are traditionally performed by separate systems, with most prior work focusing exclusively on one aspect over the other. Tracking systems clearly benefit from accurate detections, and there is ample evidence that detectors can benefit from tracking, which, for example, can help smooth predictions over time. We focus on the tracking-by-detection paradigm for autonomous driving, where both tasks are mission critical, and propose RetinaTrack, a conceptually simple and efficient joint model of detection and tracking that modifies the popular single-stage RetinaNet so that it is amenable to instance-level embedding training. Evaluations on the Waymo Open Dataset show we outperform a recent state-of-the-art tracking algorithm while requiring significantly less computation. We believe this simple yet effective approach can serve as a strong baseline for future work in the area.

52. 3D-MPA: Multi Proposal Aggregation for 3D Semantic Instance Segmentation [PDF] 返回目录
  Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, Matthias Nießner
Abstract: We present 3D-MPA, a method for instance segmentation on 3D point clouds. Given an input point cloud, we propose an object-centric approach where each point votes for its object center. We sample object proposals from the predicted object centers. Then, we learn proposal features from grouped point features that voted for the same object center. A graph convolutional network introduces inter-proposal relations, providing higher-level feature learning in addition to the lower-level point features. Each proposal comprises a semantic label, a set of associated points over which we define a foreground-background mask, an objectness score and aggregation features. Previous works usually perform non-maximum-suppression (NMS) over proposals to obtain the final object detections or semantic instances. However, NMS can discard potentially correct predictions. Instead, our approach keeps all proposals and groups them together based on the learned aggregation features. We show that grouping proposals improves over NMS and outperforms previous state-of-the-art methods on the tasks of 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.
摘要: We present 3D-MPA, a method for instance segmentation on 3D point clouds. Given an input point cloud, it takes an object-centric approach in which each point votes for its object center; object proposals are sampled from the predicted centers, and proposal features are learned from the grouped point features that voted for the same center. A graph convolutional network introduces inter-proposal relations, providing higher-level feature learning in addition to the lower-level point features. Each proposal comprises a semantic label, a set of associated points with a foreground-background mask, an objectness score, and aggregation features. Previous works usually apply non-maximum suppression (NMS) over proposals, which can discard potentially correct predictions; instead, we keep all proposals and group them using the learned aggregation features. Proposal grouping improves over NMS and outperforms previous state-of-the-art methods on 3D object detection and semantic instance segmentation on the ScanNetV2 benchmark and the S3DIS dataset.

53. Semi-supervised Learning for Few-shot Image-to-Image Translation [PDF] 返回目录
  Yaxing Wang, Salman Khan, Abel Gonzalez-Garcia, Joost van de Weijer, Fahad Shahbaz Khan
Abstract: In the last few years, unpaired image-to-image translation has witnessed remarkable progress. Although the latest methods are able to generate realistic images, they crucially rely on a large number of labeled images. Recently, some methods have tackled the challenging setting of few-shot image-to-image translation, reducing the labeled data requirements for the target domain during inference. In this work, we go one step further and reduce the amount of required labeled data also from the source domain during training. To do so, we propose applying semi-supervised learning via a noise-tolerant pseudo-labeling procedure. We also apply a cycle consistency constraint to further exploit the information from unlabeled images, either from the same dataset or external. Additionally, we propose several structural modifications to facilitate the image translation task under these circumstances. Our semi-supervised method for few-shot image translation, called SEMIT, achieves excellent results on four different datasets using as little as 10% of the source labels, and matches the performance of the main fully-supervised competitor using only 20% labeled data. Our code and models are made public at this https URL.
摘要: Unpaired image-to-image translation has progressed remarkably in the last few years, but the latest methods crucially rely on many labeled images; recent few-shot approaches reduce the labeled-data requirements for the target domain at inference time. We go one step further and also reduce the labeled data required from the source domain during training, applying semi-supervised learning via a noise-tolerant pseudo-labeling procedure plus a cycle consistency constraint that further exploits unlabeled images, from the same dataset or external, along with several structural modifications for this setting. Our semi-supervised few-shot translation method, SEMIT, achieves excellent results on four datasets using as little as 10% of the source labels, and matches the main fully supervised competitor using only 20% labeled data. Code and models are public at this HTTPS URL.

54. Can Deep Learning Recognize Subtle Human Activities? [PDF] 返回目录
  Vincent Jacquot, Zhuofan Ying, Gabriel Kreiman
Abstract: Deep Learning has driven recent and exciting progress in computer vision, instilling the belief that these algorithms could solve any visual task. Yet, datasets commonly used to train and test computer vision algorithms have pervasive confounding factors. Such biases make it difficult to truly estimate the performance of those algorithms and how well computer vision models can extrapolate outside the distribution in which they were trained. In this work, we propose a new action classification challenge that is performed well by humans, but poorly by state-of-the-art Deep Learning models. As a proof-of-principle, we consider three exemplary tasks: drinking, reading, and sitting. The best accuracies reached using state-of-the-art computer vision models were 61.7%, 62.8%, and 76.8%, respectively, while human participants scored above 90% accuracy on the three tasks. We propose a rigorous method to reduce confounds when creating datasets, and when comparing human versus computer vision performance. Source code and datasets are publicly available.
摘要: Deep learning has driven recent, exciting progress in computer vision, instilling the belief that these algorithms could solve any visual task. Yet the datasets commonly used to train and test computer vision algorithms have pervasive confounding factors, making it difficult to truly estimate algorithm performance and how well models extrapolate outside the distribution they were trained on. We propose a new action classification challenge that humans perform well but state-of-the-art deep learning models perform poorly. As a proof of principle we consider three exemplary tasks: drinking, reading, and sitting. The best accuracies using state-of-the-art computer vision models were 61.7%, 62.8%, and 76.8%, respectively, while human participants scored above 90% on all three tasks. We also propose a rigorous method to reduce confounds when creating datasets and when comparing human versus computer vision performance. Source code and datasets are publicly available.

55. AvatarMe: Realistically Renderable 3D Facial Reconstruction "in-the-wild" [PDF] 返回目录
  Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, Stefanos Zafeiriou
Abstract: Over the last years, with the advent of Generative Adversarial Networks (GANs), many face analysis tasks have accomplished astounding performance, with applications including, but not limited to, face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce high-resolution photorealistic 3D faces from "in-the-wild" images and this can be attributed to the: (a) scarcity of available data for training, and (b) lack of robust methodologies that can successfully be applied on very high-resolution data. In this paper, we introduce AvatarMe, the first method that is able to reconstruct photorealistic 3D faces from a single "in-the-wild" image with an increasing level of detail. To achieve this, we capture a large dataset of facial shape and reflectance and build on a state-of-the-art 3D texture and shape reconstruction method and successively refine its results, while generating the per-pixel diffuse and specular components that are required for realistic rendering. As we demonstrate in a series of qualitative and quantitative experiments, AvatarMe outperforms the existing arts by a significant margin and reconstructs authentic, 4K by 6K-resolution 3D faces from a single low-resolution image that, for the first time, bridges the uncanny valley.
摘要: In recent years, with the advent of Generative Adversarial Networks (GANs), many face analysis tasks have achieved astounding performance, including face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to our knowledge no method produces high-resolution photorealistic 3D faces from "in-the-wild" images, owing to (a) the scarcity of available training data and (b) the lack of robust methodologies that can be applied to very high-resolution data. We introduce AvatarMe, the first method able to reconstruct photorealistic 3D faces from a single "in-the-wild" image with increasing levels of detail. We capture a large dataset of facial shape and reflectance, build on a state-of-the-art 3D texture and shape reconstruction method, and successively refine its results while generating the per-pixel diffuse and specular components required for realistic rendering. In a series of qualitative and quantitative experiments, AvatarMe outperforms existing methods by a significant margin and reconstructs authentic 4K-by-6K-resolution 3D faces from a single low-resolution image, for the first time bridging the uncanny valley.

56. ActGAN: Flexible and Efficient One-shot Face Reenactment [PDF] 返回目录
  Ivan Kosarevych, Marian Petruk, Markian Kostiv, Orest Kupyn, Mykola Maksymenko, Volodymyr Budzan
Abstract: This paper introduces ActGAN - a novel end-to-end generative adversarial network (GAN) for one-shot face reenactment. Given two images, the goal is to transfer the facial expression of the source actor onto a target person in a photo-realistic fashion. While existing methods require target identity to be predefined, we address this problem by introducing a "many-to-many" approach, which allows arbitrary persons both for source and target without additional retraining. To this end, we employ the Feature Pyramid Network (FPN) as a core generator building block - the first application of FPN in face reenactment, producing finer results. We also introduce a solution to preserve a person's identity between synthesized and target person by adopting the state-of-the-art approach in deep face recognition domain. The architecture readily supports reenactment in different scenarios: "many-to-many", "one-to-one", "one-to-another" in terms of expression accuracy, identity preservation, and overall image quality. We demonstrate that ActGAN achieves competitive performance against recent works concerning visual quality.
摘要: ActGAN is an end-to-end generative adversarial network (GAN) for one-shot face reenactment: given two images, it transfers the facial expression of the source actor onto a target person in a photo-realistic fashion. While existing methods require the target identity to be predefined, we address this with a "many-to-many" approach that handles arbitrary source and target persons without additional retraining. We employ the Feature Pyramid Network (FPN) as the core generator building block, the first application of FPN to face reenactment, producing finer results, and preserve the synthesized person's identity by adopting a state-of-the-art deep face recognition approach. The architecture readily supports "many-to-many", "one-to-one", and "one-to-another" reenactment in terms of expression accuracy, identity preservation, and overall image quality, achieving competitive visual quality against recent works.

57. Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions [PDF] 返回目录
  Matheus Gadelha, Aruni RoyChowdhury, Gopal Sharma, Evangelos Kalogerakis, Liangliang Cao, Erik Learned-Miller, Rui Wang, Subhransu Maji
Abstract: The problems of shape classification and part segmentation from 3D point clouds have garnered increasing attention in the last few years. But both of these problems suffer from relatively small training sets, creating the need for statistically efficient methods to learn 3D shape representations. In this work, we investigate the use of Approximate Convex Decompositions (ACD) as a self-supervisory signal for label-efficient learning of point cloud representations. Decomposing a 3D shape into simpler constituent parts or primitives is a fundamental problem in geometrical shape processing. There has been extensive work on such decompositions, where the criterion for simplicity of a constituent shape is often defined in terms of convexity for solid primitives. In this paper, we show that using the results of ACD to approximate a ground truth segmentation provides excellent self-supervision for learning 3D point cloud representations that are highly effective on downstream tasks. We report improvements over the state-of-theart in unsupervised representation learning on the ModelNet40 shape classification dataset and significant gains in few-shot part segmentation on the ShapeNetPart dataset. Code available at this https URL
摘要: Shape classification and part segmentation from 3D point clouds have garnered increasing attention in recent years, but both suffer from relatively small training sets, creating a need for statistically efficient ways to learn 3D shape representations. We investigate Approximate Convex Decomposition (ACD), which decomposes a 3D shape into simpler constituent parts whose simplicity criterion is convexity, as a self-supervisory signal for label-efficient learning of point-cloud representations. Using ACD results to approximate a ground-truth segmentation provides excellent self-supervision for representations that are highly effective on downstream tasks: we improve over the state of the art in unsupervised representation learning on ModelNet40 shape classification and obtain significant gains in few-shot part segmentation on ShapeNetPart. Code is available at this HTTPS URL.

58. Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation [PDF] 返回目录
  Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden
Abstract: Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in translation requires gloss level tokenization in order to work. We introduce a novel transformer based architecture that jointly learns Continuous Sign Language Recognition and Translation while being trainable in an end-to-end manner. This is achieved by using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach does not require any ground-truth timing information, simultaneously solving two co-dependant sequence-to-sequence learning problems and leads to significant performance gains. We evaluate the recognition and translation performances of our approaches on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset. We report state-of-the-art sign language recognition and translation results achieved by our Sign Language Transformers. Our translation networks outperform both sign video to spoken language and gloss to spoken language translation models, in some cases more than doubling the performance (9.58 vs. 21.80 BLEU-4 Score). We also share new baseline translation results using transformer networks for several other text-to-text sign language translation tasks.
摘要: Prior work on sign language translation has shown that a mid-level sign gloss representation (effectively recognizing the individual signs) drastically improves translation performance; in fact, the current state of the art requires gloss-level tokenization to work. We introduce a transformer-based architecture that jointly learns continuous sign language recognition and translation while being trainable end-to-end, using a Connectionist Temporal Classification (CTC) loss to bind the recognition and translation problems into a single unified architecture. This joint approach requires no ground-truth timing information and simultaneously solves two co-dependent sequence-to-sequence learning problems, leading to significant performance gains. On the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, our Sign Language Transformers achieve state-of-the-art recognition and translation results, outperforming both sign-video-to-spoken-language and gloss-to-spoken-language translation models, in some cases more than doubling performance (9.58 vs. 21.80 BLEU-4). We also share new baseline transformer results for several other text-to-text sign language translation tasks.
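
The joint objective can be sketched as a CTC recognition loss on frame-level gloss logits plus a cross-entropy translation loss on decoded spoken-language tokens. All shapes and vocabulary sizes below are illustrative, and in the real model the two heads share an encoder rather than using independent random tensors.

```python
import torch
import torch.nn as nn

T, B, n_gloss, n_words, L = 120, 4, 1000, 3000, 20   # illustrative sizes

gloss_logits = torch.randn(T, B, n_gloss).log_softmax(-1)  # frame-level gloss scores
word_logits = torch.randn(B, L, n_words)                   # decoder token scores
gloss_targets = torch.randint(1, n_gloss, (B, 10))         # 0 is the CTC blank
word_targets = torch.randint(0, n_words, (B, L))

# CTC aligns unsegmented gloss targets to frames, so no timing labels are needed.
ctc = nn.CTCLoss(blank=0)(gloss_logits, gloss_targets,
                          torch.full((B,), T, dtype=torch.long),
                          torch.full((B,), 10, dtype=torch.long))
xent = nn.CrossEntropyLoss()(word_logits.reshape(-1, n_words),
                             word_targets.reshape(-1))
loss = ctc + xent        # recognition and translation trained jointly, end to end
```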

59. Co-occurrence of deep convolutional features for image search [PDF] 返回目录
  J.I.Forcen, Miguel Pagola, Edurne Barrenechea, Humberto Bustince
Abstract: Image search can be tackled using deep features from pre-trained Convolutional Neural Networks (CNN). The feature map from the last convolutional layer of a CNN encodes descriptive information from which a discriminative global descriptor can be obtained. We propose a new representation of co-occurrences from deep convolutional features to extract additional relevant information from this last convolutional layer. Combining this co-occurrence map with the feature map, we achieve an improved image representation. We present two different methods to get the co-occurrence representation, the first one based on direct aggregation of activations, and the second one, based on a trainable co-occurrence representation. The image descriptors derived from our methodology improve the performance in very well-known image retrieval datasets as we prove in the experiments.
摘要: Image search can be tackled with deep features from pre-trained CNNs: the feature map of the last convolutional layer encodes descriptive information from which a discriminative global descriptor can be obtained. We propose a new representation of co-occurrences of deep convolutional features that extracts additional relevant information from this last layer; combined with the feature map, it yields an improved image representation. We present two ways to obtain it, one based on direct aggregation of activations and one based on a trainable co-occurrence representation. The resulting image descriptors improve performance on well-known image retrieval datasets, as our experiments show.
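
One plausible instantiation of the direct-aggregation variant (the paper's exact definition and normalization may differ): channel-pair co-activation sums over the spatial positions of the last convolutional feature map, L2-normalized into a global descriptor that complements the usual pooled per-channel vector.

```python
import torch

def cooccurrence_descriptor(feat, eps=1e-8):
    """Direct-aggregation co-occurrence of CNN activations.

    feat: (B, C, H, W) last-conv feature map. Returns a (B, C*C) descriptor
    measuring how strongly each pair of channels fires at the same spatial
    locations.
    """
    b, c, h, w = feat.shape
    a = feat.clamp_min(0).reshape(b, c, h * w)     # non-negative activations
    cooc = a @ a.transpose(1, 2)                   # (B, C, C) pairwise co-activation
    cooc = cooc / (cooc.flatten(1).norm(dim=1, keepdim=True)[:, None] + eps)
    return cooc.flatten(1)

desc = cooccurrence_descriptor(torch.randn(2, 256, 7, 7))
print(desc.shape)   # torch.Size([2, 65536])
```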

60. When to Use Convolutional Neural Networks for Inverse Problems [PDF] 返回目录
  Nathaniel Chodosh, Simon Lucey
Abstract: Reconstruction tasks in computer vision aim fundamentally to recover an undetermined signal from a set of noisy measurements. Examples include super-resolution, image denoising, and non-rigid structure from motion, all of which have seen recent advancements through deep learning. However, earlier work made extensive use of sparse signal reconstruction frameworks (e.g convolutional sparse coding). While this work was ultimately surpassed by deep learning, it rested on a much more developed theoretical framework. Recent work by Papyan et. al provides a bridge between the two approaches by showing how a convolutional neural network (CNN) can be viewed as an approximate solution to a convolutional sparse coding (CSC) problem. In this work we argue that for some types of inverse problems the CNN approximation breaks down leading to poor performance. We argue that for these types of problems the CSC approach should be used instead and validate this argument with empirical evidence. Specifically we identify JPEG artifact reduction and non-rigid trajectory reconstruction as challenging inverse problems for CNNs and demonstrate state of the art performance on them using a CSC method. Furthermore, we offer some practical improvements to this model and its application, and also show how insights from the CSC model can be used to make CNNs effective in tasks where their naive application fails.
摘要: Reconstruction tasks in computer vision fundamentally aim to recover an undetermined signal from noisy measurements, e.g., super-resolution, image denoising, and non-rigid structure from motion, all recently advanced by deep learning. Earlier work made extensive use of sparse signal reconstruction frameworks such as convolutional sparse coding (CSC); although ultimately surpassed by deep learning, they rest on a much more developed theoretical framework. Recent work by Papyan et al. bridges the two approaches by showing how a CNN can be viewed as an approximate solution to a CSC problem. We argue that for some types of inverse problems this CNN approximation breaks down, leading to poor performance, and that the CSC approach should be used instead, validating the argument empirically: we identify JPEG artifact reduction and non-rigid trajectory reconstruction as challenging inverse problems for CNNs and demonstrate state-of-the-art performance on them using a CSC method. We also offer practical improvements to this model and its application, and show how insights from the CSC model can make CNNs effective in tasks where their naive application fails.

61. Domain Balancing: Face Recognition on Long-Tailed Domains [PDF] 返回目录
  Dong Cao, Xiangyu Zhu, Xingyu Huang, Jianzhu Guo, Zhen Lei
Abstract: Long-tailed problem has been an important topic in face recognition task. However, existing methods only concentrate on the long-tailed distribution of classes. Differently, we devote to the long-tailed domain distribution problem, which refers to the fact that a small number of domains frequently appear while other domains far less existing. The key challenge of the problem is that domain labels are too complicated (related to race, age, pose, illumination, etc.) and inaccessible in real applications. In this paper, we propose a novel Domain Balancing (DB) mechanism to handle this problem. Specifically, we first propose a Domain Frequency Indicator (DFI) to judge whether a sample is from head domains or tail domains. Secondly, we formulate a light-weighted Residual Balancing Mapping (RBM) block to balance the domain distribution by adjusting the network according to DFI. Finally, we propose a Domain Balancing Margin (DBM) in the loss function to further optimize the feature space of the tail domains to improve generalization. Extensive analysis and experiments on several face recognition benchmarks demonstrate that the proposed method effectively enhances the generalization capacities and achieves superior performance.
摘要: The long-tailed problem is an important topic in face recognition, but existing methods concentrate only on the long-tailed distribution of classes. We instead address the long-tailed domain distribution problem: a small number of domains appear frequently while other domains are far rarer, and the key challenge is that domain labels are too complicated (related to race, age, pose, illumination, etc.) and inaccessible in real applications. We propose a Domain Balancing (DB) mechanism: a Domain Frequency Indicator (DFI) judges whether a sample comes from head or tail domains; a light-weight Residual Balancing Mapping (RBM) block balances the domain distribution by adjusting the network according to the DFI; and a Domain Balancing Margin (DBM) in the loss function further optimizes the feature space of the tail domains to improve generalization. Extensive analysis and experiments on several face recognition benchmarks show the method effectively enhances generalization capacity and achieves superior performance.

62. Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction [PDF]
  Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, MingXiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, Sijia Mei, Yunhui Liu, Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Philippe Weinzaepfel, Romain Brégier, Gregory Rogez, Vincent Lepetit, Tae-Kyun Kim
Abstract: In this work, we study how well different types of approaches generalise in the task of 3D hand pose estimation under hand-object interaction and single-hand scenarios. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is high-dimensional, it is inherently not feasible to cover the whole space densely, despite recent efforts in collecting large-scale training datasets. This sampling problem is even more severe when hands are interacting with objects and/or inputs are RGB rather than depth images, as RGB images also vary with lighting conditions and colors. To address these issues, we designed a public challenge to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set. More exactly, our challenge is designed (a) to evaluate the influence of both depth and color modalities on 3D hand pose estimation, in the presence or absence of objects; (b) to assess the generalisation abilities with respect to four main axes: shapes, articulations, viewpoints, and objects; (c) to explore the use of a synthetic hand model to fill the gaps of current datasets. Through the challenge, the overall accuracy has dramatically improved over the baseline, especially on extrapolation tasks, from 27mm to 13mm mean joint error. Our analyses highlight the impacts of: data pre-processing, ensemble approaches, the use of the MANO model, and different HPE methods/backbones.

63. Understanding the impact of mistakes on background regions in crowd counting [PDF]
  Davide Modolo, Bing Shuai, Rahul Rama Varior, Joseph Tighe
Abstract: Every crowd counting researcher has likely observed their model output false positive predictions on image regions not containing any person. But how often do these mistakes happen? Are our models negatively affected by this? In this paper we analyze this problem in depth. In order to understand its magnitude, we present an extensive analysis on five of the most important crowd counting datasets. We present this analysis in two parts. First, we quantify the number of mistakes made by popular crowd counting approaches. Our results show that (i) mistakes on background are substantial and responsible for 18-49% of the total error, (ii) models do not generalize well to different kinds of backgrounds and perform poorly on images containing only background, and (iii) models make many more mistakes than those captured by the standard Mean Absolute Error (MAE) metric, as counting on background compensates considerably for misses on foreground. Second, we quantify the performance change gained by helping the model better deal with this problem. We enrich a typical crowd counting network with a segmentation branch trained to suppress background predictions. This simple addition (i) reduces background error by 10-83%, (ii) reduces foreground error by up to 26%, and (iii) improves overall crowd counting performance by up to 20%. Compared against the literature, this simple technique achieves very competitive results on all datasets, on par with the state of the art, showing the importance of tackling the background problem.
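A minimal sketch of the segmentation-gated counting head the abstract describes, assuming a generic dense feature backbone; the module names, channel width, and multiplicative gating are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: a crowd-counting head with an auxiliary segmentation branch that
# suppresses background. Train density_head with an MSE loss against the
# ground-truth density map and seg_head with a BCE loss against a
# foreground/background mask (losses omitted here for brevity).
import torch
import torch.nn as nn

class CountingWithSegBranch(nn.Module):
    def __init__(self, backbone, feat_ch=512):
        super().__init__()
        self.backbone = backbone                      # any dense feature extractor
        self.density_head = nn.Conv2d(feat_ch, 1, 1)  # per-pixel crowd density
        self.seg_head = nn.Conv2d(feat_ch, 1, 1)      # foreground probability

    def forward(self, x):
        f = self.backbone(x)
        density = torch.relu(self.density_head(f))
        fg_prob = torch.sigmoid(self.seg_head(f))
        # Gate the density map so background pixels contribute ~0 to the count.
        return density * fg_prob, fg_prob
```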

64. Combining detection and tracking for human pose estimation in videos [PDF]
  Manchen Wang, Joseph Tighe, Davide Modolo
Abstract: We propose a novel top-down approach that tackles the problem of multi-person human pose estimation and tracking in videos. In contrast to existing top-down approaches, our method is not limited by the performance of its person detector and can predict the poses of person instances the detector fails to localize. It achieves this capability by propagating known person locations forward and backward in time and searching for poses in those regions. Our approach consists of three components: (i) a Clip Tracking Network that performs body joint detection and tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline that merges the fixed-length tracklets produced by the Clip Tracking Network into arbitrary-length tracks; and (iii) a Spatial-Temporal Merging procedure that refines the joint locations based on spatial and temporal smoothing terms. Thanks to the precision of our Clip Tracking Network and our merging procedure, our approach produces very accurate joint predictions and can fix common mistakes in hard scenarios, such as heavily entangled people. Our approach achieves state-of-the-art results on both joint detection and tracking, on both the PoseTrack 2017 and 2018 datasets, against all top-down and bottom-up approaches.

65. Certifiable Relative Pose Estimation [PDF]
  Mercedes Garcia-Salguero, Jesus Briales, Javier Gonzalez-Jimenez
Abstract: In this paper we present the first fast optimality certifier for the non-minimal version of the Relative Pose problem for calibrated cameras from epipolar constraints. The proposed certifier is based on Lagrangian duality and relies on a novel closed-form expression for dual points. We also leverage an efficient solver that performs local optimization on the manifold of the original problem's non-convex domain. The optimality of the solution is then checked via our novel fast certifier. Extensive experiments demonstrate that, despite its simplicity, this certifiable solver performs excellently on synthetic data, repeatedly attaining the (certified a posteriori) optimal solution, and shows satisfactory performance on real data.

66. COVID-ResNet: A Deep Learning Framework for Screening of COVID19 from Radiographs [PDF]
  Muhammad Farooq, Abdul Hafeez
Abstract: In the last few months, the novel COVID19 pandemic has spread all over the world. Due to its easy transmission, developing techniques to accurately and easily identify the presence of COVID19 and distinguish it from other forms of flu and pneumonia is crucial. Recent research has shown that the chest X-rays of patients suffering from COVID19 depict certain abnormalities in the radiography. However, those approaches are closed source and not made available to the research community for reproducibility and gaining deeper insight. The goal of this work is to build open source and open access datasets and present an accurate Convolutional Neural Network framework for differentiating COVID19 cases from other pneumonia cases. Our work utilizes state-of-the-art training techniques, including progressive resizing, cyclical learning rate finding, and discriminative learning rates, to train fast and accurate residual neural networks. Using these techniques, we show state-of-the-art results on the open-access COVID-19 dataset. This work presents a 3-step technique to fine-tune a pre-trained ResNet-50 architecture to improve model performance and reduce training time. We call it COVIDResNet. This is achieved through progressive resizing of input images to 128x128x3, 224x224x3, and 229x229x3 pixels and fine-tuning the network at each stage. This approach, along with automatic learning rate selection, enabled us to achieve state-of-the-art accuracy of 96.23% (across all classes) on the COVIDx dataset with only 41 epochs. This work presents a computationally efficient and highly accurate model for multi-class classification of three different infection types along with normal individuals. This model can help in the early screening of COVID19 cases and help reduce the burden on healthcare systems.
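A minimal sketch of the 3-stage progressive-resizing fine-tune described above, assuming a torchvision ResNet-50. The `make_loader` helper, the learning rates, the class count, and the per-stage split of the 41 total epochs are all illustrative assumptions, not values from the paper.

```python
# Progressive resizing: fine-tune the same network at increasing input
# resolutions (128 -> 224 -> 229), as the abstract describes.
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, 4)   # class count is assumed
criterion = torch.nn.CrossEntropyLoss()

for size, lr, epochs in [(128, 1e-3, 10), (224, 3e-4, 10), (229, 1e-4, 21)]:
    loader = make_loader(image_size=size)              # hypothetical data helper
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```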

67. Automated Methods for Detection and Classification Pneumonia based on X-Ray Images Using Deep Learning [PDF]
  Khalid El Asnaoui, Youness Chawki, Ali Idri
Abstract: Recently, researchers, specialists, and companies around the world have been rolling out deep learning and image processing-based systems that can rapidly process hundreds of X-ray and computed tomography (CT) images to accelerate the diagnosis of pneumonia such as that caused by SARS or COVID-19, and aid in its containment. Medical image analysis is one of the most promising research areas; it provides facilities for the diagnosis of, and decision-making about, a number of diseases such as MERS and COVID-19. In this paper, we present a comparison of recent Deep Convolutional Neural Network (DCNN) architectures for the automatic binary classification of pneumonia images, based on fine-tuned versions of VGG16, VGG19, DenseNet201, Inception_ResNet_V2, Inception_V3, Resnet50, MobileNet_V2 and Xception. The proposed work has been tested using a chest X-Ray & CT dataset containing 5856 images (4273 pneumonia and 1583 normal). As a result, we can conclude that the fine-tuned versions of Resnet50, MobileNet_V2 and Inception_Resnet_V2 show highly satisfactory performance, with training and validation accuracy above 96%. In contrast, CNN, Xception, VGG16, VGG19, Inception_V3 and DenseNet201 display lower performance (above 84% accuracy).

68. Learning from Small Data Through Sampling an Implicit Conditional Generative Latent Optimization Model [PDF]
  Idan Azuri, Daphna Weinshall
Abstract: We revisit the long-standing problem of learning from small samples. In recent years, major efforts have been invested into generating new samples from a small set of training data points. Some approaches use classical transformations; others synthesize new examples. Our approach belongs to the second category. We propose a new model based on conditional Generative Latent Optimization (cGLO). Our model learns to synthesize completely new samples for every class just by interpolating between samples in the latent space. The proposed method samples the learned latent space using spherical interpolation (slerp) and generates a new sample using the trained generator. Our empirical results show that the newly sampled set is diverse enough to improve image classification over the state of the art when training on small samples of CIFAR-100 and CUB-200.
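The interpolation step itself is the standard spherical linear interpolation (slerp) formula; the sketch below is a textbook implementation with our own variable names, not code from the paper.

```python
# Slerp between two latent codes z0 and z1; sampling t in (0, 1) for pairs
# drawn from the same class yields new in-class latent codes.
import numpy as np

def slerp(z0, z1, t):
    z0n, z1n = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))  # angle between codes
    if np.isclose(omega, 0.0):               # nearly parallel: fall back to lerp
        return (1.0 - t) * z0 + t * z1
    return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# new_image = generator(slerp(z_a, z_b, 0.5))  # z_a, z_b from the same class
```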

69. Radiologist-level stroke classification on non-contrast CT scans with Deep U-Net [PDF]
  Manvel Avetisian, Vladimir Kokh, Alex Tuzhilin, Dmitry Umerenkov
Abstract: Segmentation of ischemic stroke and intracranial hemorrhage on computed tomography is essential for the investigation and treatment of stroke. In this paper, we modified the U-Net CNN architecture for the stroke identification problem using non-contrast CT. We applied the proposed DL model to historical patient data and also conducted clinical experiments involving ten experienced radiologists. Our model achieved strong results on historical data and significantly outperformed seven radiologists out of ten, while being on par with the remaining three.

70. Pathological Retinal Region Segmentation From OCT Images Using Geometric Relation Based Augmentation [PDF]
  Dwarikanath Mahapatra, Behzad Bozorgtabar, Jean-Philippe Thiran, Ling Shao
Abstract: Medical image segmentation is an important task for computer aided diagnosis. Pixelwise manual annotations of large datasets require high expertise and are time consuming. Conventional data augmentations have limited benefit because they do not fully represent the underlying distribution of the training set, thus affecting model robustness when tested on images captured from different sources. Prior work leverages synthetic images for data augmentation while ignoring the interleaved geometric relationship between different anatomical labels. We propose improvements over previous GAN-based medical image synthesis methods by jointly encoding the intrinsic relationship of geometry and shape. Sampling the latent space variables yields diverse images generated from a base image and improves robustness. Given the augmented images generated by our method, we train the segmentation network to enhance the segmentation performance of retinal optical coherence tomography (OCT) images. The proposed method outperforms state-of-the-art segmentation methods on the public RETOUCH dataset, which contains images captured with different acquisition procedures. Ablation studies and visual analysis also demonstrate the benefits of integrating geometry and diversity.

71. MTL-NAS: Task-Agnostic Neural Architecture Search towards General-Purpose Multi-Task Learning [PDF]
  Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, Wei Liu
Abstract: We propose to incorporate neural architecture search (NAS) into general-purpose multi-task learning (GP-MTL). Existing NAS methods typically define different search spaces according to different tasks. In order to adapt to different task combinations (i.e., task sets), we disentangle the GP-MTL networks into single-task backbones (optionally encoding the task priors), and a hierarchical and layerwise feature sharing/fusing scheme across them. This enables us to design a novel and general task-agnostic search space, which inserts cross-task edges (i.e., feature fusion connections) into fixed single-task network backbones. Moreover, we also propose a novel single-shot gradient-based search algorithm that closes the performance gap between the searched architectures and the final evaluation architecture. This is realized with a minimum entropy regularization on the architecture weights during the search phase, which makes the architecture weights converge to near-discrete values and therefore yields a single model. As a result, our searched model can be directly used for evaluation without (re-)training from scratch. We perform extensive experiments using different single-task backbones on various task sets, demonstrating the promising performance obtained by exploiting the hierarchical and layerwise features, as well as the desirable generalizability to different i) task sets and ii) single-task backbones. The code of our paper is available at this https URL.
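The minimum-entropy idea can be sketched in a few lines: penalizing the entropy of the softmaxed architecture weights pushes them toward near-discrete values. This is a generic illustration with our own naming; the paper's actual parameterization of cross-task edges is more involved.

```python
# Entropy regularizer over relaxed architecture weights: low entropy means
# each edge concentrates its mass on a single choice (near-discrete).
import torch

def entropy_regularizer(alpha_logits):
    """alpha_logits: (num_edges, num_choices) architecture parameters."""
    p = torch.softmax(alpha_logits, dim=-1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1)   # per-edge entropy
    return entropy.mean()

# total_loss = task_loss + lambda_ent * entropy_regularizer(alpha_logits)
```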

72. Supervised Raw Video Denoising with a Benchmark Dataset on Dynamic Scenes [PDF]
  Huanjing Yue, Cong Cao, Lei Liao, Ronghe Chu, Jingyu Yang
Abstract: In recent years, supervised learning strategies for denoising real noisy images have been emerging and have achieved promising results. In contrast, realistic noise removal for raw noisy videos is rarely studied due to the lack of noisy-clean pairs for dynamic scenes. Clean video frames for dynamic scenes cannot be captured with a long-exposure shutter or by averaging multiple shots, as is done for static images. In this paper, we solve this problem by creating motions for controllable objects, such as toys, and capturing each static moment multiple times to generate clean video frames. In this way, we construct a dataset with 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600. To our knowledge, this is the first dynamic video dataset with noisy-clean pairs. Correspondingly, we propose a raw video denoising network (RViDeNet) that explores the temporal, spatial, and channel correlations of video frames. Since the raw video has Bayer patterns, we pack it into four sub-sequences, i.e., RGBG sequences, which are denoised by the proposed RViDeNet separately and finally fused into a clean video. In addition, our network not only outputs a raw denoising result, but also the sRGB result obtained by going through an image signal processing (ISP) module, which enables users to generate the sRGB result with their favourite ISPs. Experimental results demonstrate that our method outperforms state-of-the-art video and raw image denoising algorithms on both indoor and outdoor videos.
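The packing step can be made concrete with a few lines of array slicing; the sketch below assumes an RGGB mosaic layout, which is an assumption on our part since the actual sensor phase order may differ.

```python
# Pack a Bayer-mosaicked raw video into four half-resolution colour planes
# (the RGBG sub-sequences mentioned in the abstract), assuming RGGB layout.
import numpy as np

def pack_bayer(raw):
    """raw: (T, H, W) Bayer frames -> (T, 4, H//2, W//2) packed planes."""
    r  = raw[:, 0::2, 0::2]   # red sites
    g1 = raw[:, 0::2, 1::2]   # first green phase
    g2 = raw[:, 1::2, 0::2]   # second green phase
    b  = raw[:, 1::2, 1::2]   # blue sites
    return np.stack([r, g1, b, g2], axis=1)   # R, G, B, G channel order
```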

73. A Thorough Comparison Study on Adversarial Attacks and Defenses for Common Thorax Disease Classification in Chest X-rays [PDF]
  Chendi Rao, Jiezhang Cao, Runhao Zeng, Qi Chen, Huazhu Fu, Yanwu Xu, Mingkui Tan
Abstract: Recently, deep neural networks (DNNs) have made great progress on automated diagnosis with chest X-ray images. However, DNNs are vulnerable to adversarial examples, which may cause misdiagnoses when DNN-based methods are applied to disease detection. To date, there are few comprehensive studies exploring the influence of attack and defense methods on disease detection, especially for the multi-label classification problem. In this paper, we aim to review various adversarial attack and defense methods on chest X-rays. First, the motivations and the mathematical representations of attack and defense methods are introduced in detail. Second, we evaluate the influence of several state-of-the-art attack and defense methods for common thorax disease classification in chest X-rays. We found that the attack and defense methods perform poorly under excessive iterations and large perturbations. To address this, we propose a new defense method that is robust to different degrees of perturbation. This study could provide new insights into methodological development for the community.

74. Regularizing Class-wise Predictions via Self-knowledge Distillation [PDF]
  Sukmin Yun, Jongjin Park, Kimin Lee, Jinwoo Shin
Abstract: Deep neural networks with millions of parameters may suffer from poor generalization due to overfitting. To mitigate the issue, we propose a new regularization method that penalizes discrepancies between the predictive distributions of similar samples. In particular, we distill the predictive distribution between different samples of the same label during training. This regularizes the dark knowledge (i.e., the knowledge on wrong predictions) of a single network (i.e., a self-knowledge distillation) by forcing it to produce more meaningful and consistent predictions in a class-wise manner. Consequently, it mitigates overconfident predictions and reduces intra-class variations. Our experimental results on various image classification tasks demonstrate that this simple yet powerful method can significantly improve not only the generalization ability but also the calibration performance of modern convolutional neural networks.
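The core penalty can be sketched as a KL divergence between the softened predictions of two samples sharing a label; the temperature value and the detached target are common distillation conventions and are assumptions here, not necessarily the paper's exact settings.

```python
# Class-wise self-distillation sketch: match the softened prediction of one
# sample to that of another sample with the same label.
import torch.nn.functional as F

def class_wise_self_kd(logits_a, logits_b, T=4.0):
    """logits_a, logits_b: logits of two batches paired by identical labels."""
    target = F.softmax(logits_b.detach() / T, dim=1)   # no gradient to target
    log_p = F.log_softmax(logits_a / T, dim=1)
    return F.kl_div(log_p, target, reduction="batchmean") * (T ** 2)
```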

75. Recurrent Neural Networks with Longitudinal Pooling and Consistency Regularization [PDF]
  Jiahong Ouyang, Qingyu Zhao, Edith V Sullivan, Adolf Pfefferbaum, Susan F. Tapert, Ehsan Adeli, Kilian M Pohl
Abstract: Most neurological diseases are characterized by gradual deterioration of brain structure and function. To identify the impact of such diseases, studies have been acquiring large longitudinal MRI datasets and applying deep learning to predict diagnosis labels. These learning models apply Convolutional Neural Networks (CNNs) to extract informative features from each time point of the longitudinal MRI and Recurrent Neural Networks (RNNs) to classify each time point based on those features. However, they neglect the progressive nature of the disease, which may result in clinically implausible predictions across visits. In this paper, we propose a framework that injects the features extracted by CNNs at each time point into RNN cells, accounting for the dependencies across different time points in the longitudinal data. On the feature level, we propose a novel longitudinal pooling layer to couple the features of a visit with those of proceeding ones. On the prediction level, we add a consistency regularization to the classification objective in line with the nature of the disease progression across visits. We evaluate the proposed method on longitudinal structural MRIs from three neuroimaging datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI, N=404), a dataset composed of 274 healthy controls and 329 patients with Alcohol Use Disorder (AUD), and 255 youths from the National Consortium on Alcohol and NeuroDevelopment in Adolescence (NCANDA). All three experiments show that our method is superior to the widely used methods. The code is available at this https URL.

76. Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction [PDF]
  Zitao Chen, Guanpeng Li, Karthik Pattabiraman
Abstract: With the emerging adoption of deep neural networks (DNNs) in the HPC domain, the reliability of DNNs is also growing in importance. As prior studies demonstrate the vulnerability of DNNs to hardware transient faults (i.e., soft errors), there is a compelling need for an efficient technique to protect DNNs from soft errors. While the inherent resilience of DNNs can tolerate some transient faults (which would not affect the system's output), prior work has found there are critical faults that cause safety violations (e.g., misclassification). In this work, we exploit the inherent resilience of DNNs to protect them from critical faults. In particular, we propose Ranger, an automated technique to selectively restrict the ranges of values in particular DNN layers, which dampens the large deviations typically caused by critical faults into smaller ones. Such reduced deviations can usually be tolerated by the inherent resilience of DNNs. Ranger can be integrated into existing DNNs without retraining, and with minimal effort. Our evaluation on 8 DNNs (including two used in self-driving car applications) demonstrates that Ranger achieves significant resilience boosting without degrading the accuracy of the model, while incurring negligible overheads.
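Range restriction is simple to illustrate: clamp a protected layer's activations to bounds profiled on fault-free runs, so a bit-flip-induced outlier is damped before it propagates. The bound values and the wrapping mechanism below are illustrative assumptions, not the paper's instrumentation.

```python
# Sketch of range restriction: clamp activations to a profiled value range.
import torch
import torch.nn as nn

class RangeRestrict(nn.Module):
    def __init__(self, low, high):
        super().__init__()
        self.low, self.high = low, high   # bounds profiled on fault-free inputs

    def forward(self, x):
        return torch.clamp(x, self.low, self.high)

# e.g., protect an existing layer (bounds here are made up for illustration):
# model.layer3 = nn.Sequential(model.layer3, RangeRestrict(0.0, 20.0))
```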

77. Lesion Conditional Image Generation for Improved Segmentation of Intracranial Hemorrhage from CT Images [PDF]
  Manohar Karki, Junghwan Cho
Abstract: Data augmentation can effectively resolve a scarcity of images when training machine-learning algorithms and can make models more robust to unseen images. We present a lesion conditional Generative Adversarial Network, LcGAN, to generate synthetic Computed Tomography (CT) images for data augmentation. A lesion conditional image (a segmented mask) is an input to both the generator and the discriminator of the LcGAN during training. The trained model generates contextual CT images based on input masks. We quantify the quality of the images using a fully convolutional network (FCN) score and blurriness. We also train another classification network to select better synthetic images. These synthetic CT images are then used to augment the training data of our hemorrhagic lesion segmentation network. By applying this augmentation method to 2.5%, 10% and 25% of the original data, segmentation improved by 12.8%, 6% and 1.6%, respectively.

78. Dataless Model Selection with the Deep Frame Potential [PDF]
  Calvin Murdock, Simon Lucey
Abstract: Choosing a deep neural network architecture is a fundamental problem in applications that require balancing performance and parameter efficiency. Standard approaches rely on ad-hoc engineering or computationally expensive validation on a specific dataset. We instead attempt to quantify networks by their intrinsic capacity for unique and robust representations, enabling efficient architecture comparisons without requiring any data. Building upon theoretical connections between deep learning and sparse approximation, we propose the deep frame potential: a measure of coherence that is approximately related to representation stability but has minimizers that depend only on network structure. This provides a framework for jointly quantifying the contributions of architectural hyper-parameters such as depth, width, and skip connections. We validate its use as a criterion for model selection and demonstrate correlation with generalization error on a variety of common residual and densely connected network architectures.
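As a toy illustration of the coherence quantity involved, the sketch below computes a per-layer frame potential: the mean squared inner product between normalized atoms of a weight matrix. How the paper aggregates and normalizes this across an entire deep network is more involved; this only conveys the basic measure.

```python
# Per-layer frame potential sketch: lower values mean more mutually
# incoherent atoms, i.e., greater capacity for unique representations.
import torch
import torch.nn.functional as F

def frame_potential(W):
    """W: (out_features, in_features) weight matrix; rows are treated as atoms."""
    atoms = F.normalize(W, dim=1)          # unit-norm atoms
    gram = atoms @ atoms.t()               # pairwise atom correlations
    return (gram ** 2).mean()
```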

79. COVID-CT-Dataset: A CT Scan Dataset about COVID-19 [PDF]
  Jinyu Zhao, Yichen Zhang, Xuehai He, Pengtao Xie
Abstract: CT scans are promising in providing accurate, fast, and cheap screening and testing for COVID-19. In this paper, we build a publicly available COVID-CT dataset, containing 275 CT scans that are positive for COVID-19, to foster the research and development of deep learning methods that predict whether a person is affected by COVID-19 based on an analysis of his/her CT scans. We train a deep convolutional neural network on this dataset and achieve an F1 of 0.85, a promising result that nonetheless leaves room for further improvement. The data and code are available at this https URL.

80. Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network [PDF]
  Asmaa Abbas, Mohammed M. Abdelsamea, Mohamed Medhat Gaber
Abstract: Chest X-ray is the first imaging technique that plays an important role in the diagnosis of COVID-19 disease. Due to the high availability of large-scale annotated image datasets, great success has been achieved using convolutional neural networks (CNNs) for image recognition and classification. However, due to the limited availability of annotated medical images, the classification of medical images remains the biggest challenge in medical diagnosis. Transfer learning offers an effective mechanism that can provide a promising solution by transferring knowledge from generic object recognition tasks to domain-specific tasks. In this paper, we validate and adopt our previously developed CNN, called Decompose, Transfer, and Compose (DeTraC), for the classification of COVID-19 chest X-ray images. DeTraC can deal with any irregularities in the image dataset by investigating its class boundaries using a class decomposition mechanism. The experimental results show the capability of DeTraC in detecting COVID-19 cases from a comprehensive image dataset collected from several hospitals around the world. A high accuracy of 95.12% (with a sensitivity of 97.91%, a specificity of 91.87%, and a precision of 93.36%) was achieved by DeTraC in distinguishing COVID-19 X-ray images from normal and severe acute respiratory syndrome cases.

81. Non-dimensional Star-Identification [PDF]
  Carl Leake, David Arnas, Daniele Mortari
Abstract: This study introduces a new "Non-Dimensional" star identification algorithm to reliably identify the stars observed by a wide field-of-view star tracker when the focal length and optical axis offset values are known only with poor accuracy. This algorithm is particularly suited to complement nominal lost-in-space algorithms when they fail star identification due to focal length and/or optical axis offset deviations from their nominal operational ranges. These deviations may be caused, for example, by launch vibrations or thermal variations in orbit. The algorithm's performance is compared to that of the Pyramid algorithm in terms of accuracy, speed, and robustness. These comparisons highlight the clear advantages that a combined approach of these methodologies provides.
