
[arXiv Papers] Computer Vision and Pattern Recognition 2020-02-28

Contents

1. Measures to Evaluate Generative Adversarial Networks Based on Direct Analysis of Generated Images [PDF] Abstract
2. Visual Camera Re-Localization from RGB and RGB-D Images Using DSAC [PDF] Abstract
3. Semantically-Guided Representation Learning for Self-Supervised Monocular Depth [PDF] Abstract
4. 2D Convolutional Neural Networks for 3D Digital Breast Tomosynthesis Classification [PDF] Abstract
5. Blurry Video Frame Interpolation [PDF] Abstract
6. The Mertens Unrolled Network (MU-Net): A High Dynamic Range Fusion Neural Network for Through the Windshield Driver Recognition [PDF] Abstract
7. ZoomCount: A Zooming Mechanism for Crowd Counting in Static Images [PDF] Abstract
8. Learning Representations by Predicting Bags of Visual Words [PDF] Abstract
9. Meta-Transfer Learning for Zero-Shot Super-Resolution [PDF] Abstract
10. Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image [PDF] Abstract
11. Visual Commonsense R-CNN [PDF] Abstract
12. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization [PDF] Abstract
13. Evolving Losses for Unsupervised Video Representation Learning [PDF] Abstract
14. Domain Decluttering: Simplifying Images to Mitigate Synthetic-Real Domain Shift and Improve Depth Estimation [PDF] Abstract
15. Deep Slow Motion Video Reconstruction with Hybrid Imaging System [PDF] Abstract
16. The Data Representativeness Criterion: Predicting the Performance of Supervised Classification Based on Data Set Similarity [PDF] Abstract
17. Action Quality Assessment using Siamese Network-Based Deep Metric Learning [PDF] Abstract
18. Reducing Geographic Performance Differential for Face Recognition [PDF] Abstract
19. XSepConv: Extremely Separated Convolution [PDF] Abstract
20. Attention-guided Chained Context Aggregation for Semantic Segmentation [PDF] Abstract
21. Multiple Discrimination and Pairwise CNN for View-based 3D Object Retrieval [PDF] Abstract
22. Unbiased Scene Graph Generation from Biased Training [PDF] Abstract
23. Features for Ground Texture Based Localization -- A Survey [PDF] Abstract
24. Weakly supervised discriminative feature learning with state information for person identification [PDF] Abstract
25. Auto-Encoding Twin-Bottleneck Hashing [PDF] Abstract
26. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction [PDF] Abstract
27. Set-Constrained Viterbi for Set-Supervised Action Segmentation [PDF] Abstract
28. RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference [PDF] Abstract
29. Unshuffling Data for Improved Generalization [PDF] Abstract
30. Hierarchical Memory Decoding for Video Captioning [PDF] Abstract
31. Defense-PointNet: Protecting PointNet Against Adversarial Attacks [PDF] Abstract
32. GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering [PDF] Abstract
33. Towards Universal Representation Learning for Deep Face Recognition [PDF] Abstract
34. Joint Unsupervised Learning of Optical Flow and Egomotion with Bi-Level Optimization [PDF] Abstract
35. Learning to Shade Hand-drawn Sketches [PDF] Abstract
36. On Leveraging Pretrained GANs for Limited-Data Generation [PDF] Abstract
37. Rethinking the Hyperparameters for Fine-tuning [PDF] Abstract
38. Hallucinative Topological Memory for Zero-Shot Visual Planning [PDF] Abstract
39. Coronary Wall Segmentation in CCTA Scans via a Hybrid Net with Contours Regularization [PDF] Abstract
40. Opportunities of a Machine Learning-based Decision Support System for Stroke Rehabilitation Assessment [PDF] Abstract
41. Optimization of Graph Total Variation via Active-Set-based Combinatorial Reconditioning [PDF] Abstract
42. Multi-source Domain Adaptation in the Deep Learning Era: A Systematic Survey [PDF] Abstract
43. Infinitely Wide Graph Convolutional Networks: Semi-supervised Learning via Gaussian Processes [PDF] Abstract
44. A Comprehensive Approach to Unsupervised Embedding Learning based on AND Algorithm [PDF] Abstract
45. Multi-Cycle-Consistent Adversarial Networks for CT Image Denoising [PDF] Abstract
46. Two-stage breast mass detection and segmentation system towards automated high-resolution full mammogram analysis [PDF] Abstract
47. Understanding and Enhancing Mixed Sample Data Augmentation [PDF] Abstract
48. Transductive Few-shot Learning with Meta-Learned Confidence [PDF] Abstract
49. Face Verification Using 60 GHz 802.11 waveforms [PDF] Abstract
50. Supervised Dimensionality Reduction and Visualization using Centroid-encoder [PDF] Abstract
51. Segmentation-based Method combined with Dynamic Programming for Brain Midline Delineation [PDF] Abstract
52. Max-Affine Spline Insights into Deep Generative Networks [PDF] Abstract
53. A Proto-Object Based Dynamic Visual Saliency Model with an FPGA Implementation [PDF] Abstract
54. Gradient Boosted Flows [PDF] Abstract
55. BBAND Index: A No-Reference Banding Artifact Predictor [PDF] Abstract
56. Kernel Bi-Linear Modeling for Reconstructing Data on Manifolds: The Dynamic-MRI Case [PDF] Abstract
57. Comparison of Multi-Class and Binary Classification Machine Learning Models in Identifying Strong Gravitational Lenses [PDF] Abstract
58. Analysis of diversity-accuracy tradeoff in image captioning [PDF] Abstract
59. Improving Robustness of Deep-Learning-Based Image Reconstruction [PDF] Abstract

Abstracts

1. Measures to Evaluate Generative Adversarial Networks Based on Direct Analysis of Generated Images [PDF] Back to Contents
  Shuyue Guan, Murray H. Loew
Abstract: The Generative Adversarial Network (GAN) is a state-of-the-art technique in the field of deep learning. A number of recent papers address the theory and applications of GANs in various fields of image processing. Fewer studies, however, have directly evaluated GAN outputs. Those that have been conducted focused on using classification performance (e.g., Inception Score) and statistical metrics (e.g., Fréchet Inception Distance). Here, we consider a fundamental way to evaluate GANs by directly analyzing the images they generate, instead of using them as inputs to other classifiers. We characterize the performance of a GAN as an image generator according to three aspects: 1) Creativity: non-duplication of the real images. 2) Inheritance: generated images should have the same style, which retains key features of the real images. 3) Diversity: generated images are different from each other. A GAN should not generate a few different images repeatedly. Based on these three aspects of ideal GANs, we have designed two measures, the Creativity-Inheritance-Diversity (CID) index and the Likeness Score (LS), to evaluate GAN performance, and have applied them to evaluate three typical GANs. We compared our proposed measures with three commonly used GAN evaluation methods: Inception Score (IS), Fréchet Inception Distance (FID) and 1-Nearest Neighbor classifier (1NNC). In addition, we discuss how these evaluations could help us deepen our understanding of GANs and improve their performance.
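
As a loose illustration of what "direct analysis of generated images" can look like, here is a minimal sketch of creativity and diversity checks computed purely in image space. The raw-pixel Euclidean distance and the duplication threshold are assumptions of this sketch, not the paper's actual CID or LS definitions.

import numpy as np

def creativity_and_diversity(generated, real, dup_threshold=1e-2):
    """Toy image-space checks in the spirit of the paper's three criteria.

    generated, real: float arrays of shape (N, H*W), pixel values in [0, 1].
    Uses raw Euclidean distances; the actual CID/LS measures are more refined.
    """
    # Creativity: a generated image should not duplicate any real image.
    d_gen_real = np.linalg.norm(generated[:, None, :] - real[None, :, :], axis=-1)
    creativity = np.mean(d_gen_real.min(axis=1) > dup_threshold)

    # Diversity: generated images should differ from each other.
    d_gen_gen = np.linalg.norm(generated[:, None, :] - generated[None, :, :], axis=-1)
    n = len(generated)
    diversity = d_gen_gen[np.triu_indices(n, k=1)].mean()
    return creativity, diversity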

2. Visual Camera Re-Localization from RGB and RGB-D Images Using DSAC [PDF] Back to Contents
  Eric Brachmann, Carsten Rother
Abstract: We describe a learning-based system that estimates the camera position and orientation from a single input image relative to a known environment. The system is flexible w.r.t. the amount of information available at test and at training time, catering to different applications. Input images can be RGB-D or RGB, and a 3D model of the environment can be utilized for training but is not necessary. In the minimal case, our system requires only RGB images and ground truth poses at training time, and it requires only a single RGB image at test time. The framework consists of a deep neural network and fully differentiable pose optimization. The neural network predicts so called scene coordinates, i.e. dense correspondences between the input image and 3D scene space of the environment. The pose optimization implements robust fitting of pose parameters using differentiable RANSAC (DSAC) to facilitate end-to-end training. The system, an extension of DSAC++ and referred to as DSAC*, achieves state-of-the-art accuracy on various public datasets for RGB-based re-localization, and competitive accuracy for RGB-D based re-localization.

3. Semantically-Guided Representation Learning for Self-Supervised Monocular Depth [PDF] Back to Contents
  Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, Adrien Gaidon
Abstract: Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage more directly this semantic structure to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach, we propose a new architecture leveraging fixed pretrained semantic segmentation networks to guide self-supervised representation learning via pixel-adaptive convolutions. Furthermore, we propose a two-stage training process to overcome a common semantic bias on dynamic objects via resampling. Our method improves upon the state of the art for self-supervised monocular depth prediction over all pixels, fine-grained details, and per semantic categories.

4. 2D Convolutional Neural Networks for 3D Digital Breast Tomosynthesis Classification [PDF] Back to Contents
  Yu Zhang, Xiaoqin Wang, Hunter Blanton, Gongbo Liang, Xin Xing, Nathan Jacobs
Abstract: Automated methods for breast cancer detection have focused on 2D mammography and have largely ignored 3D digital breast tomosynthesis (DBT), which is frequently used in clinical practice. The two key challenges in developing automated methods for DBT classification are handling the variable number of slices and retaining slice-to-slice changes. We propose a novel deep 2D convolutional neural network (CNN) architecture for DBT classification that simultaneously overcomes both challenges. Our approach operates on the full volume, regardless of the number of slices, and allows the use of pre-trained 2D CNNs for feature extraction, which is important given the limited amount of annotated training data. In an extensive evaluation on a real-world clinical dataset, our approach achieves 0.854 auROC, which is 28.80% higher than approaches based on 3D CNNs. We also find that these improvements are stable across a range of model configurations.
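
The abstract's two challenges (a variable number of slices, plus reuse of pre-trained 2D CNNs) suggest a per-slice feature extractor followed by a slice-invariant pooling step. Below is a hedged PyTorch sketch of that idea; the ResNet-18 backbone and max-pooling over slices are illustrative choices, not necessarily the paper's architecture.

import torch
import torch.nn as nn
import torchvision.models as models

class SliceAggregator(nn.Module):
    """Per-slice 2D features pooled over a variable number of DBT slices."""
    def __init__(self, num_classes=2):
        super().__init__()
        # Pre-trained 2D backbone (torchvision >= 0.13 weights API).
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # (B, 512, 1, 1)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, volume):                 # volume: (num_slices, 3, H, W)
        f = self.features(volume).flatten(1)   # (num_slices, 512)
        pooled, _ = f.max(dim=0)               # slice-count-invariant aggregation
        return self.classifier(pooled)

logits = SliceAggregator()(torch.randn(27, 3, 224, 224))  # any slice count works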

5. Blurry Video Frame Interpolation [PDF] Back to Contents
  Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, Zhiyong Gao
Abstract: Existing works reduce motion blur and up-convert frame rate through two separate ways, including frame deblurring and frame interpolation. However, few studies have approached the joint video enhancement problem, namely synthesizing high-frame-rate clear results from low-frame-rate blurry inputs. In this paper, we propose a blurry video frame interpolation method to reduce motion blur and up-convert frame rate simultaneously. Specifically, we develop a pyramid module to cyclically synthesize clear intermediate frames. The pyramid module features adjustable spatial receptive field and temporal scope, thus contributing to controllable computational complexity and restoration ability. Besides, we propose an inter-pyramid recurrent module to connect sequential models to exploit the temporal relationship. The pyramid module integrates a recurrent module, thus can iteratively synthesize temporally smooth results without significantly increasing the model size. Extensive experimental results demonstrate that our method performs favorably against state-of-the-art methods.

6. The Mertens Unrolled Network (MU-Net): A High Dynamic Range Fusion Neural Network for Through the Windshield Driver Recognition [PDF] Back to Contents
  Max Ruby, David S. Bolme, Joel Brogan, David Cornett III, Baldemar Delgado, Gavin Jager, Christi Johnson, Jose Martinez-Mendoza, Hector Santos-Villalobos, Nisha Srinivas
Abstract: Face recognition of vehicle occupants through windshields in unconstrained environments poses a number of unique challenges ranging from glare, poor illumination, driver pose and motion blur. In this paper, we further develop the hardware and software components of a custom vehicle imaging system to better overcome these challenges. After the build out of a physical prototype system that performs High Dynamic Range (HDR) imaging, we collect a small dataset of through-windshield image captures of known drivers. We then re-formulate the classical Mertens-Kautz-Van Reeth HDR fusion algorithm as a pre-initialized neural network, which we name the Mertens Unrolled Network (MU-Net), for the purpose of fine-tuning the HDR output of through-windshield images. Reconstructed faces from this novel HDR method are then evaluated and compared against other traditional and experimental HDR methods in a pre-trained state-of-the-art (SOTA) facial recognition pipeline, verifying the efficacy of our approach.
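
For context, the classical Mertens-Kautz-Van Reeth exposure fusion that MU-Net unrolls is available in OpenCV. A minimal usage sketch (the file names are placeholders):

import cv2

# Classical Mertens exposure fusion over a bracketed exposure stack -- the
# algorithm MU-Net re-formulates as a pre-initialized neural network.
exposures = [cv2.imread(p) for p in ("dark.png", "mid.png", "bright.png")]
fused = cv2.createMergeMertens().process(exposures)   # float32, roughly in [0, 1]
cv2.imwrite("fused.png", (fused * 255).clip(0, 255).astype("uint8"))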

7. ZoomCount: A Zooming Mechanism for Crowd Counting in Static Images [PDF] Back to Contents
  Usman Sajid, Hasan Sajid, Hongcheng Wang, Guanghui Wang
Abstract: This paper proposes a novel approach for crowd counting in low to high density scenarios in static images. Current approaches cannot handle huge crowd diversity well and thus perform poorly in extreme cases, where the crowd density in different regions of an image is either too low or too high, leading to crowd underestimation or overestimation. The proposed solution is based on the observation that detecting and handling such extreme cases in a specialized way leads to better crowd estimation. Additionally, existing methods find it hard to differentiate between the actual crowd and the cluttered background regions, resulting in further count overestimation. To address these issues, we propose a simple yet effective modular approach, where an input image is first subdivided into fixed-size patches and then fed to a four-way classification module labeling each image patch as low, medium, high-dense or no-crowd. This module also provides a count for each label, which is then analyzed via a specifically devised novel decision module to decide whether the image belongs to any of the two extreme cases (very low or very high density) or a normal case. Images, specified as high- or low-density extreme or a normal case, pass through dedicated zooming or normal patch-making blocks respectively before routing to the regressor in the form of fixed-size patches for crowd estimate. Extensive experimental evaluations demonstrate that the proposed approach outperforms the state-of-the-art methods on four benchmarks under most of the evaluation criteria.

8. Learning Representations by Predicting Bags of Visual Words [PDF] Back to Contents
  Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, Matthieu Cord
Abstract: Self-supervised representation learning targets to learn convnet-based image representations from unlabeled data. Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions that encode discrete visual concepts, here called visual words. To build such discrete representations, we quantize the feature maps of a first pre-trained self-supervised convnet, over a k-means based vocabulary. Then, as a self-supervised task, we train another convnet to predict the histogram of visual words of an image (i.e., its Bag-of-Words representation) given as input a perturbed version of that image. The proposed task forces the convnet to learn perturbation-invariant and context-aware image features, useful for downstream image understanding tasks. We extensively evaluate our method and demonstrate very strong empirical results, e.g., our pre-trained self-supervised representations transfer better on detection task and similarly on classification over classes "unseen" during pre-training, when compared to the supervised case. This also shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
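
A hedged sketch of the target-construction step described above: spatially dense features from a frozen, pre-trained convnet are quantized over a k-means vocabulary and counted into per-image Bag-of-Words histograms. The vocabulary size and normalization are assumptions of this sketch.

import numpy as np
from sklearn.cluster import KMeans

def bow_targets(feature_maps, k=128):
    """Build Bag-of-Words prediction targets from convnet feature maps.

    feature_maps: (N, C, H, W) array from a frozen, pre-trained convnet.
    Returns the fitted vocabulary and per-image normalized word histograms.
    """
    n, c, h, w = feature_maps.shape
    descriptors = feature_maps.transpose(0, 2, 3, 1).reshape(-1, c)  # (N*H*W, C)
    vocab = KMeans(n_clusters=k, n_init=10).fit(descriptors)
    words = vocab.labels_.reshape(n, h * w)
    hists = np.stack([np.bincount(row, minlength=k) for row in words])
    return vocab, hists / hists.sum(axis=1, keepdims=True)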

9. Meta-Transfer Learning for Zero-Shot Super-Resolution [PDF] Back to Contents
  Jae Woong Soh, Sunwoo Cho, Nam Ik Cho
Abstract: Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance based on the external dataset, they cannot exploit internal information within a specific image. Another problem is that they are applicable only to the specific conditions of data under which they were supervised. For instance, the low-resolution (LR) image should be a "bicubic" downsampled noise-free image from a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, it requires thousands of gradient updates, i.e., long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Precisely, it is based on finding a generic initial parameter that is suitable for internal learning. Thus, we can exploit both external and internal information, where one single gradient update can yield quite considerable results. (See Figure 1). With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
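
For intuition, the internal (zero-shot) learning step that MZSR accelerates looks roughly like the following: starting from the meta-learned initialization, the network is trained at test time to map a further-downscaled copy of the input back to the input itself. The optimizer, step count and learning rate below are illustrative assumptions.

import torch
import torch.nn.functional as F

def zero_shot_adapt(model, lr_img, scale=2, steps=5, lr=1e-2):
    """Test-time internal learning in the spirit of ZSSR/MZSR.

    lr_img: (1, C, H, W) tensor with H, W divisible by `scale`; `model` is
    assumed to upscale its input by `scale`. MZSR's point is that a
    meta-learned initialization makes very few such steps sufficient.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    # Purely internal supervision: downscaled copy -> original test image.
    child = F.interpolate(lr_img, scale_factor=1.0 / scale, mode="bicubic")
    for _ in range(steps):
        opt.zero_grad()
        loss = F.l1_loss(model(child), lr_img)
        loss.backward()
        opt.step()
    return model(lr_img)  # the adapted model now super-resolves the input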

10. Total3DUnderstanding: Joint Layout, Object Pose and Mesh Reconstruction for Indoor Scenes from a Single Image [PDF] Back to Contents
  Yinyu Nie, Xiaoguang Han, Shihui Guo, Yujian Zheng, Jian Chang, Jian Jun Zhang
Abstract: Semantic reconstruction of indoor scenes refers to both scene understanding and object reconstruction. Existing works either address one part of this problem or focus on independent objects. In this paper, we bridge the gap between understanding and reconstruction, and propose an end-to-end solution to jointly reconstruct room layout, object bounding boxes and meshes from a single image. Instead of separately resolving scene understanding and object reconstruction, our method builds upon a holistic scene context and proposes a coarse-to-fine hierarchy with three components: 1. room layout with camera pose; 2. 3D object bounding boxes; 3. object meshes. We argue that understanding the context of each component can assist the task of parsing the others, which enables joint understanding and reconstruction. The experiments on the SUN RGB-D and Pix3D datasets demonstrate that our method consistently outperforms existing methods in indoor layout estimation, 3D object detection and mesh reconstruction.

11. Visual Commonsense R-CNN [PDF] Back to Contents
  Tan Wang, Jianqiang Huang, Hanwang Zhang, Qianru Sun
Abstract: We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an improved visual region encoder for high-level tasks such as captioning and VQA. Given a set of detected object regions in an image (e.g., using Faster R-CNN), like any other unsupervised feature learning methods (e.g., word2vec), the proxy training objective of VC R-CNN is to predict the contextual objects of a region. However, they are fundamentally different: the prediction of VC R-CNN is by using causal intervention: P(Y|do(X)), while others are by using the conventional likelihood: P(Y|X). This is also the core reason why VC R-CNN can learn "sense-making" knowledge, such as that a chair can be sat on, and not just "common" co-occurrences, such as that a chair is likely to exist if a table is observed. We extensively apply VC R-CNN features in prevailing models of three popular tasks: Image Captioning, VQA, and VCR, and observe consistent performance boosts across all the methods and tasks, achieving many new state-of-the-art results. Code and features are available at this https URL.
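
As background on the intervention notation, the standard backdoor adjustment computes $P(Y|do(X)) = \sum_{z} P(Y|X,z)\,P(z)$, whereas the conventional likelihood is $P(Y|X) = \sum_{z} P(Y|X,z)\,P(z|X)$: the confounder $z$ is weighted by its prior $P(z)$ instead of $P(z|X)$, which cuts the spurious correlation flowing through $z$. Whether VC R-CNN realizes the intervention exactly this way is not stated in the abstract, so treat this as background rather than the paper's formula.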

12. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization [PDF] Back to Contents
  Zhedong Zheng, Yunchao Wei, Yi Yang
Abstract: We consider the problem of cross-view geo-localization. The primary challenge of this task is to learn the robust feature against large viewpoint changes. Existing benchmarks can help, but are limited in the number of viewpoints. Image pairs, containing two viewpoints, e.g., satellite and ground, are usually provided, which may compromise the feature learning. Besides phone cameras and satellites, in this paper, we argue that drones could serve as the third platform to deal with the geo-localization problem. In contrast to the traditional ground-view images, drone-view images meet fewer obstacles, e.g., trees, and could provide a comprehensive view when flying around the target place. To verify the effectiveness of the drone platform, we introduce a new multi-view multi-source benchmark for drone-based geo-localization, named University-1652. University-1652 contains data from three platforms, i.e., synthetic drones, satellites and ground cameras of 1,652 university buildings around the world. To our knowledge, University-1652 is the first drone-based geo-localization dataset and enables two new tasks, i.e., drone-view target localization and drone navigation. As the name implies, drone-view target localization intends to predict the location of the target place via drone-view images. On the other hand, given a satellite-view query image, drone navigation is to drive the drone to the area of interest in the query. We use this dataset to analyze a variety of off-the-shelf CNN features and propose a strong CNN baseline on this challenging dataset. The experiments show that University-1652 helps the model to learn the viewpoint-invariant features and also has good generalization ability in the real-world scenario.

13. Evolving Losses for Unsupervised Video Representation Learning [PDF] Back to Contents
  AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
Abstract: We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution by using an evolutionary search algorithm to automatically find optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Thirdly, we propose an unsupervised representation evaluation metric using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces similar results to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets.
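
A hedged sketch of the Zipf-based constraint described above: the size distribution of clusters over a large unlabeled set is compared against a Zipf prior. The exponent and the use of KL divergence are assumptions of this sketch, not necessarily the paper's exact formulation.

import numpy as np
from scipy.stats import entropy

def zipf_constraint(cluster_sizes, s=1.0):
    """Unsupervised evaluation signal: cluster sizes over an unlabeled set
    should roughly follow Zipf's law; lower divergence is better."""
    sizes = np.sort(np.asarray(cluster_sizes, dtype=float))[::-1]
    empirical = sizes / sizes.sum()
    ranks = np.arange(1, len(sizes) + 1)
    zipf = ranks ** (-s)
    zipf /= zipf.sum()
    return entropy(empirical, zipf)   # KL(empirical || zipf)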

14. Domain Decluttering: Simplifying Images to Mitigate Synthetic-Real Domain Shift and Improve Depth Estimation [PDF] Back to Contents
  Yunhan Zhao, Shu Kong, Daeyun Shin, Charless Fowlkes
Abstract: Leveraging synthetically rendered data offers great potential to improve monocular depth estimation, but closing the synthetic-real domain gap is a non-trivial and important task. While much recent work has focused on unsupervised domain adaptation, we consider a more realistic scenario where a large amount of synthetic training data is supplemented by a small set of real images with ground-truth. In this setting we find that existing domain translation approaches are difficult to train and offer little advantage over simple baselines that use a mix of real and synthetic data. A key failure mode is that real-world images contain novel objects and clutter not present in synthetic training. This high-level domain shift isn't handled by existing image translation models. Based on these observations, we develop an attentional module that learns to identify and remove (hard) out-of-domain regions in real images in order to improve depth prediction for a model trained primarily on synthetic data. We carry out extensive experiments to validate our attend-remove-complete approach (ARC) and find that it significantly outperforms state-of-the-art domain adaptation methods for depth prediction. Visualizing the removed regions provides interpretable insights into the synthetic-real domain gap.

15. Deep Slow Motion Video Reconstruction with Hybrid Imaging System [PDF] Back to Contents
  Avinash Paliwal, Nima Khademi Kalantari
Abstract: Slow motion videos are becoming increasingly popular, but capturing high-resolution videos at extremely high frame rates requires professional high-speed cameras. To mitigate this problem, current techniques increase the frame rate of standard videos through frame interpolation by assuming linear motion between the existing frames. While this assumption holds true for simple cases with small motion, in challenging cases the motion is usually complex and this assumption is no longer valid. Therefore, they typically produce results with unnatural motion in these challenging cases. In this paper, we address this problem using two video streams as the input; an auxiliary video with high frame rate and low spatial resolution, providing temporal information, in addition to the standard main video with low frame rate and high spatial resolution. We propose a two-stage deep learning system consisting of alignment and appearance estimation that reconstructs high resolution slow motion video from the hybrid video input. For alignment, we propose to use a set of pre-trained and trainable convolutional neural networks (CNNs) to compute the flows between the missing frame and the two existing frames of the main video by utilizing the content of the auxiliary video frames. We then warp the existing frames using the flows to produce a set of aligned frames. For appearance estimation, we propose to combine the aligned and auxiliary frames using a context and occlusion aware CNN. We train our model on a set of synthetically generated hybrid videos and show high-quality results on a wide range of test scenes. We further demonstrate the practicality of our approach by showing the performance of our system on two real dual camera setups with small baseline.

16. The Data Representativeness Criterion: Predicting the Performance of Supervised Classification Based on Data Set Similarity [PDF] Back to Contents
  Evelien Schat, Rens van de Schoot, Wouter M. Kouw, Duco Veen, Adriënne M. Mendrik
Abstract: In a broad range of fields it may be desirable to reuse a supervised classification algorithm and apply it to a new data set. However, generalization of such an algorithm, and thus achieving a similar classification performance, is only possible when the training data used to build the algorithm is similar to the new unseen data one wishes to apply it to. It is often unknown in advance how an algorithm will perform on new unseen data, which is a crucial reason for not deploying an algorithm at all. Therefore, tools are needed to measure the similarity of data sets. In this paper, we propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set. We present a proof of principle, to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm. We compared a number of magnetic resonance imaging (MRI) data sets, ranging from subtle to severe differences in acquisition parameters. Results indicate that, based on the similarity of data sets, the DRC is able to give an indication as to when the performance of a supervised classifier decreases. The strictness of the DRC can be set by the user, depending on what one considers to be an acceptable underperformance.

17. Action Quality Assessment using Siamese Network-Based Deep Metric Learning [PDF] Back to Contents
  Hiteshi Jain, Gaurav Harit, Avinash Sharma
Abstract: Automated vision-based score estimation models can be used as an alternate opinion to avoid judgment bias. In past works, the score estimation models were learned by regressing the video representations to the ground truth scores provided by the judges. However, such regression-based solutions lack interpretability in terms of giving reasons for the awarded score. One solution to make the scores more explicable is to compare the given action video with a reference video. This would capture the temporal variations w.r.t. the reference video and map those variations to the final score. In this work, we propose a new action scoring system as a two-phase system: (1) a Deep Metric Learning Module that learns the similarity between any two action videos based on their ground truth scores given by the judges; (2) a Score Estimation Module that uses the first module to find the resemblance of a video to a reference video in order to give the assessment score. The proposed scoring model has been tested for Olympics Diving and Gymnastic vaults and the model outperforms the existing state-of-the-art scoring models.

18. Reducing Geographic Performance Differential for Face Recognition [PDF] Back to Contents
  Martins Bruveris, Jochem Gietema, Pouria Mortazavian, Mohan Mahadevan
Abstract: As face recognition algorithms become more accurate and get deployed more widely, it becomes increasingly important to ensure that the algorithms work equally well for everyone. We study geographic performance differentials, i.e., differences in false acceptance and false rejection rates across different countries, when comparing selfies against photos from ID documents. We show how to mitigate geographic performance differentials using sampling strategies despite large imbalances in the dataset. Using vanilla domain adaptation strategies to fine-tune a face recognition CNN on domain-specific doc-selfie data improves the performance of the model on such data, but, in the presence of imbalanced training data, also significantly increases the demographic bias. We then show how to mitigate this effect by employing sampling strategies to balance the training procedure.
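
One simple instance of the kind of sampling strategy the abstract refers to is inverse-frequency oversampling of under-represented groups. A sketch with PyTorch's WeightedRandomSampler; the weighting scheme is an assumption, not necessarily the paper's exact strategy.

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, region_labels, batch_size=64):
    """Oversample under-represented regions so each is seen at roughly the
    same rate during training."""
    labels = torch.as_tensor(region_labels)
    counts = torch.bincount(labels).float()
    weights = (1.0 / counts)[labels]            # per-sample weight
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)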

19. XSepConv: Extremely Separated Convolution [PDF] Back to Contents
  Jiarong Chen, Zongqing Lu, Jing-Hao Xue, Qingmin Liao
Abstract: Depthwise convolution has gradually become an indispensable operation for modern efficient neural networks and larger kernel sizes ($\ge5$) have been applied to it recently. In this paper, we propose a novel extremely separated convolutional block (XSepConv), which fuses spatially separable convolutions into depthwise convolution to further reduce both the computational cost and parameter size of large kernels. Furthermore, an extra $2\times2$ depthwise convolution coupled with improved symmetric padding strategy is employed to compensate for the side effect brought by spatially separable convolutions. XSepConv is designed to be an efficient alternative to vanilla depthwise convolution with large kernel sizes. To verify this, we use XSepConv for the state-of-the-art architecture MobileNetV3-Small and carry out extensive experiments on four highly competitive benchmark datasets (CIFAR-10, CIFAR-100, SVHN and Tiny-ImageNet) to demonstrate that XSepConv can indeed strike a better trade-off between accuracy and efficiency.
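
The block is described concretely enough to sketch: a $k\times1$ and a $1\times k$ depthwise convolution replace a $k\times k$ depthwise kernel, and an extra $2\times2$ depthwise convolution with asymmetric padding (a $2\times2$ kernel has no center pixel) compensates for the separation. The ordering and normalization details below are assumptions of this sketch.

import torch
import torch.nn as nn

class XSepConvSketch(nn.Module):
    """Sketch of an extremely separated depthwise convolution block."""
    def __init__(self, channels, k=5):
        super().__init__()
        self.pad = nn.ZeroPad2d((0, 1, 0, 1))   # asymmetric padding for 2x2
        self.dw2x2 = nn.Conv2d(channels, channels, 2, groups=channels)
        self.dw_kx1 = nn.Conv2d(channels, channels, (k, 1),
                                padding=(k // 2, 0), groups=channels)
        self.dw_1xk = nn.Conv2d(channels, channels, (1, k),
                                padding=(0, k // 2), groups=channels)

    def forward(self, x):
        x = self.dw2x2(self.pad(x))             # spatial size preserved
        return self.dw_1xk(self.dw_kx1(x))      # k x 1 then 1 x k depthwise

y = XSepConvSketch(32)(torch.randn(1, 32, 56, 56))   # -> (1, 32, 56, 56)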

20. Attention-guided Chained Context Aggregation for Semantic Segmentation [PDF] Back to Contents
  Quan Tang, Fagui Liu, Jun Jiang, Yu Zhang
Abstract: Recent breakthroughs in semantic segmentation methods based on Fully Convolutional Networks (FCNs) have aroused great research interest. One of the critical issues is how to aggregate multi-scale contextual information effectively to obtain reliable results. To address this problem, we propose a novel paradigm called the Chained Context Aggregation Module (CAM). CAM gains features of various spatial scales through chain-connected ladder-style information flows. The features are then guided by Flow Guidance Connections to interact and fuse in a two-stage process, which we refer to as pre-fusion and re-fusion. We further adopt attention models in CAM to productively recombine and select those fused features to refine performance. Based on these developments, we construct the Chained Context Aggregation Network (CANet), which employs a two-step decoder to recover precise spatial details of prediction maps. We conduct extensive experiments on three challenging datasets, including Pascal VOC 2012, CamVid and SUN-RGBD. The results show that our CANet achieves state-of-the-art performance. Code will be made available upon publication of this paper.

21. Multiple Discrimination and Pairwise CNN for View-based 3D Object Retrieval [PDF] Back to Contents
  Z. Gao, K.X. Xue, S.H. Wan
Abstract: With the rapid development and wide application of computer, camera, network and hardware technology, 3D object (or model) retrieval has attracted widespread attention and has become a hot research topic in the computer vision domain. Deep learning features have been proven to outperform hand-crafted features in 3D object retrieval. However, most existing networks do not take into account the impact of multi-view image selection on network training, and using contrastive loss alone only forces same-class samples to be as close as possible. In this work, a novel solution named Multi-view Discrimination and Pairwise CNN (MDPCNN) for 3D object retrieval is proposed to tackle these issues. It can take multiple batches and multiple views as input simultaneously by adding a Slice layer and a Concat layer. Furthermore, a highly discriminative network is obtained by training on samples that are not easily classified by clustering. Lastly, we deploy the contrastive-center loss and contrastive loss as the optimization objective, which yields better intra-class compactness and inter-class separability. Large-scale experiments show that the proposed MDPCNN achieves a significant performance gain over the state-of-the-art algorithms in 3D object retrieval.

22. Unbiased Scene Graph Generation from Biased Training [PDF] Back to Contents
  Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, Hanwang Zhang
Abstract: Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse relations such as human walk on / sit on / lay on beach into human on beach. Given such SGG, down-stream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between the good and bad bias, e.g., a good context prior (e.g., person read book rather than eat) and a bad long-tailed bias (e.g., behind/in front of collapsed to near). In this paper, we present a novel SGG framework based on causal inference rather than the conventional likelihood. We first build a causal graph for SGG, and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect of the bad bias, which should be removed. In particular, we use the Total Direct Effect as the proposed final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied in the community seeking unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observe significant improvements over the previous state-of-the-art methods.
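
As background on the Total Direct Effect score (our notation, not verbatim from the paper): $\mathrm{TDE} = Y_{x}(u) - Y_{\bar{x},\,z_{x}}(u)$, i.e., the predicate logit under the factual visual input $x$ minus the logit under a counterfactual where $x$ is wiped out ($\bar{x}$) while the mediating context $z$ keeps the value it took under $x$. Subtracting the counterfactual term removes the context-only, biased part of the prediction.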

23. Features for Ground Texture Based Localization -- A Survey [PDF] Back to Contents
  Jan Fabian Schmid, Stephan F. Simon, Rudolf Mester
Abstract: Ground texture based vehicle localization using feature-based methods is a promising approach to achieve infrastructure-free high-accuracy localization. In this paper, we provide the first extensive evaluation of available feature extraction methods for this task, using separately taken image pairs as well as synthetic transformations. We identify AKAZE, SURF and CenSurE as the best-performing keypoint detectors, and find pairings of CenSurE with the ORB, BRIEF and LATCH feature descriptors to achieve the greatest success rates for incremental localization, while SIFT stands out when considering severe synthetic transformations such as might occur during absolute localization.
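
A hedged OpenCV sketch of the strongest pairing reported, CenSurE keypoints with ORB descriptors (CenSurE is exposed as the StarDetector in opencv-contrib; the file names are placeholders):

import cv2

img1 = cv2.imread("ground_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("ground_b.png", cv2.IMREAD_GRAYSCALE)

# CenSurE keypoints (OpenCV StarDetector, requires opencv-contrib-python)
# paired with ORB descriptors, then brute-force Hamming matching.
detector = cv2.xfeatures2d.StarDetector_create()
orb = cv2.ORB_create()
kp1, des1 = orb.compute(img1, detector.detect(img1))
kp2, des2 = orb.compute(img2, detector.detect(img2))

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)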

24. Weakly supervised discriminative feature learning with state information for person identification [PDF] Back to Contents
  Hong-Xing Yu, Wei-Shi Zheng
Abstract: Unsupervised learning of identity-discriminative visual features is appealing in real-world tasks where manual labelling is costly. However, the images of an identity can be visually discrepant when the images are taken under different states, e.g., different camera views and poses. This visual discrepancy leads to great difficulty in unsupervised discriminative learning. Fortunately, in real-world tasks we can often know the states without human annotation, e.g., we can easily have the camera view labels in person re-identification and facial pose labels in face recognition. In this work we propose utilizing the state information as weak supervision to address the visual discrepancy caused by different states. We formulate a simple pseudo label model and utilize the state information in an attempt to refine the assigned pseudo labels via weakly supervised decision boundary rectification and weakly supervised feature drift regularization. We evaluate our model on unsupervised person re-identification and pose-invariant face recognition. Despite the simplicity of our method, it outperforms the state-of-the-art results on the Duke-reID, MultiPIE and CFP datasets with a standard ResNet-50 backbone. We also find that our model performs comparably with standard supervised fine-tuning results on the three datasets. Code is available at this https URL

25. Auto-Encoding Twin-Bottleneck Hashing [PDF] Back to Contents
  Yuming Shen, Jie Qin, Jiaxin Chen, Mengyang Yu, Li Liu, Fan Zhu, Fumin Shen, Ling Shao
Abstract: Conventional unsupervised hashing methods usually take advantage of similarity graphs, which are either pre-computed in the high-dimensional space or obtained from random anchor points. On the one hand, existing methods uncouple the procedures of hash function learning and graph construction. On the other hand, graphs empirically built upon original data could introduce biased prior knowledge of data relevance, leading to sub-optimal retrieval performance. In this paper, we tackle the above problems by proposing an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder. Specifically, we introduce into our framework twin bottlenecks (i.e., latent variables) that exchange crucial information collaboratively. One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes. The auto-encoding learning objective literally rewards the code-driven graph to learn an optimal encoder. Moreover, the proposed model can be simply optimized by gradient descent without violating the binary constraints. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.

26. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction [PDF] Back to Contents
  Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, Christian Claudel
Abstract: Better machine understanding of pedestrian behaviors enables faster progress in modeling interactions between agents such as autonomous vehicles and humans. Pedestrian trajectories are influenced not only by the pedestrian itself but also by interaction with surrounding objects. Previous methods modeled these interactions by using a variety of aggregation methods that integrate different learned pedestrian states. We propose the Social Spatio-Temporal Graph Convolutional Neural Network (Social-STGCNN), which substitutes the need for aggregation methods by modeling the interactions as a graph. Our results show an improvement over the state of the art by 20% on the Final Displacement Error (FDE) and an improvement on the Average Displacement Error (ADE), with 8.5 times fewer parameters and up to 48 times faster inference speed than previously reported methods. In addition, our model is data efficient, and exceeds the previous state of the art on the ADE metric with only 20% of the training data. We propose a kernel function to embed the social interactions between pedestrians within the adjacency matrix. Through qualitative analysis, we show that our model inherits social behaviors that can be expected between pedestrian trajectories.
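
A sketch of the kernel-based adjacency mentioned in the abstract, using an inverse-Euclidean-distance kernel on pedestrian positions at one time step; treat the exact kernel and normalization as assumptions of this sketch.

import numpy as np

def kernel_adjacency(positions, eps=1e-8):
    """Weighted adjacency for one frame: pedestrians are nodes, edge weights
    come from a kernel on their pairwise distance (closer pedestrians
    interact more strongly).

    positions: (N, 2) array of pedestrian (x, y) locations at one time step.
    """
    diff = positions[:, None, :] - positions[None, :, :]        # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1)
    a = np.where(dist > eps, 1.0 / np.maximum(dist, eps), 0.0)  # inverse distance
    np.fill_diagonal(a, 0.0)                                    # no self-loops
    return a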

27. Set-Constrained Viterbi for Set-Supervised Action Segmentation [PDF] Back to Contents
  Jun Li, Sinisa Todorovic
Abstract: This paper is about weakly supervised action segmentation, where ground truth specifies only a set of actions present in a training video. This problem is more challenging than the standard weakly supervised setting where the temporal ordering of actions is provided. Prior work typically uses a classifier that independently labels video frames for generating the pseudo ground truth, and multiple instance learning for training the classifier. We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths, and by explicitly training the HMM on a Viterbi-based loss. Our first contribution is the formulation of a new set-constrained Viterbi algorithm (SCV). Given a video, the SCV generates the MAP action segmentation that satisfies the ground truth. This prediction is used as a framewise pseudo ground truth in our HMM training. Our second contribution is a new regularization of learning by a n-pair loss that regularizes the feature affinity of training videos sharing the same action classes. Evaluation on action segmentation and alignment on the Breakfast, MPII Cooking2, Hollywood Extended datasets demonstrates our significant performance improvement for the two tasks over prior work.

28. RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference [PDF] Back to Contents
  Oindrila Saha, Aditya Kusupati, Harsha Vardhan Simhadri, Manik Varma, Prateek Jain
Abstract: Pooling operators are key components in most Convolutional Neural Networks (CNNs) as they serve to downsample images, aggregate feature information, and increase receptive field. However, standard pooling operators reduce the feature size gradually to avoid significant loss in information via gross aggregation. Consequently, CNN architectures tend to be deep, computationally expensive and challenging to deploy on RAM constrained devices. We introduce RNNPool, a novel pooling operator based on Recurrent Neural Networks (RNNs) that efficiently aggregates features over large patches of an image and rapidly downsamples them. Our empirical evaluation indicates that an RNNPool layer(s) can effectively replace multiple blocks in a variety of architectures such as MobileNets (Sandler et al., 2018) and DenseNet (Huang et al., 2017), and can be used for several vision tasks like image classification and face detection. That is, RNNPool can significantly decrease computational complexity and peak RAM usage for inference, while retaining comparable accuracy. Further, we use RNNPool to construct a novel real-time face detection method that achieves state-of-the-art MAP within the computational budget afforded by a tiny Cortex M4 microcontroller with ~256 KB RAM.

29. Unshuffling Data for Improved Generalization [PDF] Back to Contents
  Damien Teney, Ehsan Abbasnejad, Anton van den Hengel
Abstract: The inability to generalize to test data outside the distribution of a training set is at the core of practical limits of machine learning. We show that mixing and shuffling training examples when training deep neural networks is not an optimal practice. On the contrary, partitioning the training data into non-i.i.d. subsets can guide the model to use reliable statistical patterns, while ignoring spurious correlations in the training data. We demonstrate multiple use cases where these subsets are built using unsupervised clustering, prior knowledge, or other meta-data available in existing datasets. The approach is supported by recent results on a causal view of generalization, it is simple to apply, and demonstrably improves generalization. Applied to the task of visual question answering, we obtain state-of-the-art performance on VQA-CP. We also show improvement over data augmentation using equivalent questions on GQA. Finally, we show a small improvement when training a model simultaneously on VQA v2 and Visual Genome, treating them as two distinct environments rather than one aggregated training set.
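
A hedged sketch of one of the partitioning strategies mentioned above, building non-i.i.d. training "environments" by unsupervised clustering:

import numpy as np
from sklearn.cluster import KMeans

def make_environments(features, n_envs=4):
    """Partition training data into non-i.i.d. subsets ("environments") by
    clustering example features; each environment is then treated as a
    separate distribution during training instead of one shuffled pool."""
    assignments = KMeans(n_clusters=n_envs, n_init=10).fit_predict(features)
    return [np.flatnonzero(assignments == e) for e in range(n_envs)]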

30. Hierarchical Memory Decoding for Video Captioning [PDF] Back to Contents
  Aming Wu, Yahong Han
Abstract: Recent advances in video captioning often employ a recurrent neural network (RNN) as the decoder. However, RNNs are prone to diluting long-term information. Recent works have demonstrated that memory networks (MemNet) have the advantage of storing long-term information. Yet, as the decoder, they have not been well exploited for video captioning. The reason partially comes from the difficulty of sequence decoding with MemNet. Instead of the common practice, i.e., sequence decoding with RNN, in this paper we devise a novel memory decoder for video captioning. Concretely, after obtaining representations of each frame through a pre-trained network, we first fuse the visual and lexical information. Then, at each time step, we construct a multi-layer MemNet-based decoder, i.e., in each layer, we employ a memory set to store previous information and an attention mechanism to select the information related to the current input. This decoder thus avoids the dilution of long-term information, and the multi-layer architecture helps capture dependencies between frames and word sequences. Experimental results show that even without the encoding network, our decoder still obtains competitive performance and outperforms an RNN decoder. Furthermore, compared with a one-layer RNN decoder, our decoder has fewer parameters.

31. Defense-PointNet: Protecting PointNet Against Adversarial Attacks [PDF]
  Yu Zhang, Gongbo Liang, Tawfiq Salem, Nathan Jacobs
Abstract: Despite remarkable performance across a broad range of tasks, neural networks have been shown to be vulnerable to adversarial attacks. Many works focus on adversarial attacks and defenses on 2D images, but few focus on 3D point clouds. In this paper, our goal is to enhance the adversarial robustness of PointNet, which is one of the most widely used models for 3D point clouds. We apply the fast gradient sign attack method (FGSM) on 3D point clouds and find that FGSM can be used to generate not only adversarial images but also adversarial point clouds. To minimize the vulnerability of PointNet to adversarial attacks, we propose Defense-PointNet. We compare our model with two baseline approaches and show that Defense-PointNet significantly improves the robustness of the network against adversarial samples.
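
For reference, the FGSM step itself carries over to point clouds essentially unchanged (a minimal sketch assuming PyTorch; 'model' is any point-cloud classifier such as PointNet returning class logits):

    import torch
    import torch.nn.functional as F

    def fgsm_point_cloud(model, points, labels, eps=0.01):
        # points: (B, N, 3) batch of point clouds; labels: (B,) class ids.
        points = points.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(points), labels)
        loss.backward()
        # Shift every point along the sign of its gradient, as with pixels.
        return (points + eps * points.grad.sign()).detach()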

32. GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering [PDF]
  Chuang Niu, Jun Zhang, Ge Wang, Jimin Liang
Abstract: Deep clustering has achieved state-of-the-art results via joint representation learning and clustering, but still has an inferior performance for the real scene images, e.g., those in ImageNet. With such images, deep clustering methods face several challenges, including extracting discriminative features, avoiding trivial solutions, capturing semantic information, and performing on large-size image datasets. To address these problems, here we propose a self-supervised attention network for image clustering (AttentionCluster). Rather than extracting intermediate features first and then performing the traditional clustering algorithm, AttentionCluster directly outputs semantic cluster labels that are more discriminative than intermediate features and does not need further post-processing. To train the AttentionCluster in a completely unsupervised manner, we design four learning tasks with the constraints of transformation invariance, separability maximization, entropy analysis, and attention mapping. Specifically, the transformation invariance and separability maximization tasks learn the relationships between sample pairs. The entropy analysis task aims to avoid trivial solutions. To capture the object-oriented semantics, we design a self-supervised attention mechanism that includes a parameterized attention module and a soft-attention loss. All the guiding signals for clustering are self-generated during the training process. Moreover, we develop a two-step learning algorithm that is training-friendly and memory-efficient for processing large-size images. Extensive experiments demonstrate the superiority of our proposed method in comparison with the state-of-the-art image clustering benchmarks.

33. Towards Universal Representation Learning for Deep Face Recognition [PDF]
  Yichun Shi, Xiang Yu, Kihyuk Sohn, Manmohan Chandraker, Anil K. Jain
Abstract: Recognizing faces in the wild is extremely hard as they appear with all kinds of variations. Traditional methods either train with specifically annotated variation data from target domains, or introduce unlabeled target variation data to adapt from the training data. Instead, we propose a universal representation learning framework that can deal with larger variations unseen in the given training data without leveraging target domain knowledge. We first synthesize training data with some semantically meaningful variations, such as low resolution, occlusion, and head pose. However, directly training on the augmented data does not converge well, since the newly introduced samples are mostly hard examples. We propose to split the feature embedding into multiple sub-embeddings and associate different confidence values with each sub-embedding to smooth the training procedure. The sub-embeddings are further decorrelated by regularizing a variation classification loss and a variation adversarial loss on different partitions of them. Experiments show that our method achieves top performance on general face recognition datasets such as LFW and MegaFace, while performing significantly better on extreme benchmarks such as TinyFace and IJB-S.

34. Joint Unsupervised Learning of Optical Flow and Egomotion with Bi-Level Optimization [PDF]
  Shihao Jiang, Dylan Campbell, Miaomiao Liu, Stephen Gould, Richard Hartley
Abstract: We address the problem of joint optical flow and camera motion estimation in rigid scenes by incorporating geometric constraints into an unsupervised deep learning framework. Unlike existing approaches which rely on brightness constancy and local smoothness for optical flow estimation, we exploit the global relationship between optical flow and camera motion using epipolar geometry. In particular, we formulate the prediction of optical flow and camera motion as a bi-level optimization problem, consisting of an upper-level problem to estimate the flow that conforms to the predicted camera motion, and a lower-level problem to estimate the camera motion given the predicted optical flow. We use implicit differentiation to enable back-propagation through the lower-level geometric optimization layer independent of its implementation, allowing end-to-end training of the network. With globally-enforced geometric constraints, we are able to improve the quality of the estimated optical flow in challenging scenarios and obtain better camera motion estimates compared to other unsupervised learning methods.

35. Learning to Shade Hand-drawn Sketches [PDF]
  Qingyuan Zheng, Zhuoru Li, Adam Bargteil
Abstract: We present a fully automatic method to generate detailed and accurate artistic shadows from pairs of line drawing sketches and lighting directions. We also contribute a new dataset of one thousand examples of pairs of line drawings and shadows that are tagged with lighting directions. Remarkably, the generated shadows quickly communicate the underlying 3D structure of the sketched scene. Consequently, the shadows generated by our approach can be used directly or as an excellent starting point for artists. We demonstrate that the deep learning network we propose takes a hand-drawn sketch, builds a 3D model in latent space, and renders the resulting shadows. The generated shadows respect the hand-drawn lines and underlying 3D space and contain sophisticated and accurate details, such as self-shadowing effects. Moreover, the generated shadows contain artistic effects, such as rim lighting or halos appearing from back lighting, that would be achievable with traditional 3D rendering methods.

36. On Leveraging Pretrained GANs for Limited-Data Generation [PDF]
  Miaoyun Zhao, Yulai Cong, Lawrence Carin
Abstract: Recent work has shown that GANs can generate highly realistic images that are indistinguishable to humans. Of particular interest here is the empirical observation that most generated images are not contained in the training datasets, indicating potential generalization with GANs. That generalizability makes it appealing to exploit GANs to help applications with limited available data, e.g., to augment training data to alleviate overfitting. To better facilitate training a GAN on limited data, we propose to leverage already-available GAN models pretrained on large-scale datasets (like ImageNet) to introduce additional common knowledge (which may not exist within the limited data), following the transfer learning idea. Specifically, using natural image generation tasks as an example, we reveal that low-level filters (those close to observations) of both the generator and discriminator of pretrained GANs can be transferred to help the target limited-data generation. For better adaptation of the transferred filters to the target domain, we introduce a new technique named adaptive filter modulation (AdaFM), which provides boosted performance over baseline methods. Unifying the transferred filters and the introduced techniques, we present our method and conduct extensive experiments to demonstrate its training efficiency and better performance on limited-data generation.
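
A hedged sketch of the AdaFM idea as we read it: keep a pretrained convolution kernel frozen and learn only a small per-filter scale and shift (the exact parameterization in the paper may differ; PyTorch assumed):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaFMConv(nn.Module):
        # Modulates a frozen pretrained weight W (out, in, k, k) with learned
        # per-(out, in) scale gamma and shift beta; only gamma/beta train.
        def __init__(self, pretrained_weight):
            super().__init__()
            self.register_buffer("weight", pretrained_weight)  # frozen
            out_c, in_c = pretrained_weight.shape[:2]
            self.gamma = nn.Parameter(torch.ones(out_c, in_c, 1, 1))
            self.beta = nn.Parameter(torch.zeros(out_c, in_c, 1, 1))

        def forward(self, x):
            return F.conv2d(x, self.gamma * self.weight + self.beta, padding=1)

Because only the modulation parameters are trained, the number of trainable weights per layer drops dramatically, which is what makes limited-data training feasible.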

37. Rethinking the Hyperparameters for Fine-tuning [PDF]
  Hao Li, Pratik Chaudhari, Hao Yang, Michael Lam, Avinash Ravichandran, Rahul Bhotika, Stefano Soatto
Abstract: Fine-tuning from pre-trained ImageNet models has become the de-facto standard for various computer vision tasks. Current practices for fine-tuning typically involve selecting an ad-hoc choice of hyperparameters and keeping them fixed to values normally used for training from scratch. This paper re-examines several common practices of setting hyperparameters for fine-tuning. Our findings are based on extensive empirical evaluation of fine-tuning on various transfer learning benchmarks. (1) While prior works have thoroughly investigated learning rate and batch size, momentum for fine-tuning is a relatively unexplored parameter. We find that the value of momentum also affects fine-tuning performance and connect it with previous theoretical findings. (2) Optimal hyperparameters for fine-tuning, in particular the effective learning rate, are not only dataset-dependent but also sensitive to the similarity between the source domain and target domain. This is in contrast to hyperparameters for training from scratch. (3) Reference-based regularization that keeps models close to the initial model does not necessarily apply to "dissimilar" datasets. Our findings challenge common practices of fine-tuning and encourage deep learning practitioners to rethink the hyperparameters for fine-tuning.
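
For background on point (1), a classical relation (our addition, consistent with the paper's notion of an effective learning rate) is that SGD with learning rate $\eta$ and momentum $m$ behaves, near steady state, like plain SGD with

$$ \eta_{\text{eff}} = \frac{\eta}{1 - m}, $$

so, e.g., $\eta = 0.01$ with $m = 0.9$ acts roughly like $\eta_{\text{eff}} = 0.1$; momentum and learning rate therefore cannot be tuned independently.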

38. Hallucinative Topological Memory for Zero-Shot Visual Planning [PDF]
  Kara Liu, Thanard Kurutach, Christine Tung, Pieter Abbeel, Aviv Tamar
Abstract: In visual planning (VP), an agent learns to plan goal-directed behavior from observations of a dynamical system obtained offline, e.g., images obtained from self-supervised robot interaction. Most previous works on VP approached the problem by planning in a learned latent space, resulting in low-quality visual plans and difficult training algorithms. Here, instead, we propose a simple VP method that plans directly in image space and displays competitive performance. We build on the semi-parametric topological memory (SPTM) method: image samples are treated as nodes in a graph, the graph connectivity is learned from image sequence data, and planning can be performed using conventional graph search methods. We propose two modifications to SPTM. First, we train an energy-based graph connectivity function using contrastive predictive coding that admits stable training. Second, to allow zero-shot planning in new domains, we learn a conditional VAE model that generates images given a context of the domain, and use these hallucinated samples for building the connectivity graph and planning. We show that this simple approach significantly outperforms state-of-the-art VP methods in terms of both plan interpretability and success rate when using the plan to guide a trajectory-following controller. Interestingly, our method can pick up non-trivial visual properties of objects, such as their geometry, and account for them in its plans.

39. Coronary Wall Segmentation in CCTA Scans via a Hybrid Net with Contours Regularization [PDF]
  Kaikai Huang, Antonio Tejero-de-Pablos, Hiroaki Yamane, Yusuke Kurose, Junichi Iho, Youji Tokunaga, Makoto Horie, Keisuke Nishizawa, Yusaku Hayashi, Yasushi Koyama, Tatsuya Harada
Abstract: Providing closed and well-connected boundaries of the coronary artery is essential to assist cardiologists in the diagnosis of coronary artery disease (CAD). Recently, several deep-learning-based methods have been proposed for boundary detection and segmentation in medical images. However, when applied to coronary wall detection, they tend to produce disconnected and inaccurate boundaries. In this paper, we propose a novel boundary detection method for coronary arteries that focuses on the continuity and connectivity of the boundaries. In order to model the spatial continuity of consecutive images, our hybrid architecture takes a volume (i.e., a segment of the coronary artery) as input and detects the boundary of the target slice (i.e., the central slice of the segment). Then, to ensure closed boundaries, we propose a contour-constrained weighted Hausdorff distance loss. We evaluate our method on a dataset of coronary CT angiography scans with curved planar reconstruction (CCTA-CPR) of the arteries (i.e., cross-sections) from 34 patients. Experimental results show that our method can produce smooth closed boundaries, outperforming the state-of-the-art accuracy.

40. Opportunities of a Machine Learning-based Decision Support System for Stroke Rehabilitation Assessment [PDF]
  Min Hun Lee, Daniel P. Siewiorek, Asim Smailagic, Alexandre Bernardino, Sergi Bermúdez i Badia
Abstract: Rehabilitation assessment is critical to determine an adequate intervention for a patient. However, the current practices of assessment mainly rely on a therapist's experience, and assessment is infrequently executed due to the limited availability of therapists. In this paper, we identified the needs of therapists to assess patients' functional abilities (e.g. an alternative perspective on assessment with quantitative information on patients' exercise motions). As a result, we developed an intelligent decision support system that can identify salient features of assessment using reinforcement learning to assess the quality of motion and summarize patient-specific analysis. We evaluated this system with seven therapists using a dataset of 15 patients performing three exercises. The evaluation demonstrates that our system is preferred over a traditional system without analysis, while presenting more useful information and significantly increasing agreement with therapists' evaluations from 0.6600 to 0.7108 in F1-score ($p < 0.05$). We discuss the importance of presenting contextually relevant and salient information, and of adaptation, in developing a human-machine collaborative decision-making system.

41. Optimization of Graph Total Variation via Active-Set-based Combinatorial Reconditioning [PDF]
  Zhenzhang Ye, Thomas Möllenhoff, Tao Wu, Daniel Cremers
Abstract: Structured convex optimization on weighted graphs finds numerous applications in machine learning and computer vision. In this work, we propose a novel adaptive preconditioning strategy for proximal algorithms on this problem class. Our preconditioner is driven by a sharp analysis of the local linear convergence rate depending on the "active set" at the current iterate. We show that nested-forest decomposition of the inactive edges yields a guaranteed local linear convergence rate. Further, we propose a practical greedy heuristic which realizes such nested decompositions and show in several numerical experiments that our reconditioning strategy, when applied to proximal gradient or primal-dual hybrid gradient algorithm, achieves competitive performances. Our results suggest that local convergence analysis can serve as a guideline for selecting variable metrics in proximal algorithms.
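
For concreteness, the problem class can be written as (a standard formulation in our notation, not necessarily the paper's exact one):

$$ \min_{x \in \mathbb{R}^{|V|}} \; f(x) + \lambda \sum_{(i,j) \in E} w_{ij} \, |x_i - x_j|, $$

where $f$ is a smooth convex data term on the nodes of a weighted graph $G = (V, E)$ and the sum is the graph total variation; the "active set" at an iterate consists of the edges with $x_i = x_j$, whose complement (the inactive edges) is what the nested-forest decomposition operates on.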

42. Multi-source Domain Adaptation in the Deep Learning Era: A Systematic Survey [PDF]
  Sicheng Zhao, Bo Li, Colorado Reed, Pengfei Xu, Kurt Keutzer
Abstract: In many practical applications, it is often difficult and expensive to obtain enough large-scale labeled data to train deep neural networks to their full capability. Therefore, transferring the learned knowledge from a separate, labeled source domain to an unlabeled or sparsely labeled target domain becomes an appealing alternative. However, direct transfer often results in significant performance decay due to domain shift. Domain adaptation (DA) addresses this problem by minimizing the impact of domain shift between the source and target domains. Multi-source domain adaptation (MDA) is a powerful extension in which the labeled data may be collected from multiple sources with different distributions. Due to the success of DA methods and the prevalence of multi-source data, MDA has attracted increasing attention in both academia and industry. In this survey, we define various MDA strategies and summarize available datasets for evaluation. We also compare modern MDA methods in the deep learning era, including latent space transformation and intermediate domain generation. Finally, we discuss future research directions for MDA.

43. Infinitely Wide Graph Convolutional Networks: Semi-supervised Learning via Gaussian Processes [PDF]
  Jilin Hu, Jianbing Shen, Bin Yang, Ling Shao
Abstract: Graph convolutional neural networks (GCNs) have recently demonstrated promising results on graph-based semi-supervised classification, but little work has been done to explore their theoretical properties. Recently, several deep neural networks, e.g., fully connected and convolutional neural networks, with infinite hidden units have been proved to be equivalent to Gaussian processes (GPs). To exploit both the powerful representational capacity of GCNs and the great expressive power of GPs, we investigate similar properties of infinitely wide GCNs. More specifically, we propose a GP regression model via GCNs (GPGC) for graph-based semi-supervised learning. In the process, we formulate the kernel matrix computation of GPGC in an iterative analytical form. Finally, we derive a conditional distribution for the labels of unobserved nodes based on the graph structure, the labels of the observed nodes, and the feature matrix of all the nodes. We conduct extensive experiments to evaluate the semi-supervised classification performance of GPGC and demonstrate that it outperforms other state-of-the-art methods by a clear margin on all the datasets while being efficient.

44. A Comprehensive Approach to Unsupervised Embedding Learning based on AND Algorithm [PDF]
  Sungwon Han, Yizhan Xu, Sungwon Park, Meeyoung Cha, Cheng-Te Li
Abstract: Unsupervised embedding learning aims to extract good representation from data without the need for any manual labels, which has been a critical challenge in many supervised learning tasks. This paper proposes a new unsupervised embedding approach, called Super-AND, which extends the current state-of-the-art model. Super-AND has its unique set of losses that can gather similar samples nearby within a low-density space while keeping invariant features intact against data augmentation. Super-AND outperforms all existing approaches and achieves an accuracy of 89.2% on the image classification task for CIFAR-10. We discuss the practical implications of this method in assisting semi-supervised tasks.

45. Multi-Cycle-Consistent Adversarial Networks for CT Image Denoising [PDF]
  Jinglan Liu, Yukun Ding, Jinjun Xiong, Qianjun Jia, Meiping Huang, Jian Zhuang, Bike Xie, Chun-Chen Liu, Yiyu Shi
Abstract: CT image denoising can be treated as an image-to-image translation task where the goal is to learn the transform between a source domain $X$ (noisy images) and a target domain $Y$ (clean images). Recently, cycle-consistent adversarial denoising network (CCADN) has achieved state-of-the-art results by enforcing cycle-consistent loss without the need of paired training data. Our detailed analysis of CCADN raises a number of interesting questions. For example, if the noise is large leading to significant difference between domain $X$ and domain $Y$, can we bridge $X$ and $Y$ with an intermediate domain $Z$ such that both the denoising process between $X$ and $Z$ and that between $Z$ and $Y$ are easier to learn? As such intermediate domains lead to multiple cycles, how do we best enforce cycle-consistency? Driven by these questions, we propose a multi-cycle-consistent adversarial network (MCCAN) that builds intermediate domains and enforces both local and global cycle-consistency. The global cycle-consistency couples all generators together to model the whole denoising process, while the local cycle-consistency imposes effective supervision on the process between adjacent domains. Experiments show that both local and global cycle-consistency are important for the success of MCCAN, which outperforms the state-of-the-art.
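
As a hedged illustration of the objective shape (our notation, not necessarily the paper's), with generators $G_{XZ}, G_{ZX}, G_{ZY}, G_{YZ}$ bridging $X \to Z \to Y$, the local and global cycle-consistency terms could read

$$ \mathcal{L}_{\text{local}} = \mathbb{E}\,\| G_{ZX}(G_{XZ}(x)) - x \|_1 + \mathbb{E}\,\| G_{YZ}(G_{ZY}(z)) - z \|_1, \qquad \mathcal{L}_{\text{global}} = \mathbb{E}\,\| G_{ZX}(G_{YZ}(G_{ZY}(G_{XZ}(x)))) - x \|_1, $$

where the local terms supervise adjacent domain pairs and the global term closes the full $X \to Z \to Y \to Z \to X$ chain.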

46. Two-stage breast mass detection and segmentation system towards automated high-resolution full mammogram analysis [PDF]
  Yutong Yan, Pierre-Henri Conze, Gwenolé Quellec, Mathieu Lamard, Béatrice Cochener, Gouenou Coatrieux
Abstract: Mammography is the primary imaging modality used for early detection and diagnosis of breast cancer. Mammography analysis mainly refers to the extraction of regions of interest around tumors, followed by a segmentation step, which is essential for further classification of benign or malignant tumors. Breast masses are the most important findings among breast abnormalities. However, manual delineation of masses from native mammograms is a time-consuming and error-prone task. An integrated computer-aided diagnosis system to assist radiologists in automatically detecting and segmenting breast masses is therefore urgently needed. We propose a fully-automated approach that guides accurate mass segmentation from full mammograms at high resolution through a detection stage. First, mass detection is performed by an efficient deep learning approach, You-Only-Look-Once, extended by integrating multi-scale predictions to improve automatic candidate selection. Second, a convolutional encoder-decoder network using nested and dense skip connections is employed to finely delineate candidate masses. Unlike most previous studies based on segmentation from regions, our framework handles mass segmentation from native full mammograms without user intervention. Trained on the INbreast and DDSM-CBIS public datasets, the pipeline achieves an overall average Dice of 80.44% on high-resolution INbreast test images, outperforming state-of-the-art methods. Our system shows promising accuracy as an automatic full-image mass segmentation system. The comprehensive evaluation provided for both detection and segmentation stages reveals strong robustness to the diversity of size, shape, and appearance of breast masses, towards better computer-aided diagnosis.

47. Understanding and Enhancing Mixed Sample Data Augmentation [PDF]
  Ethan Harris, Antonia Marcu, Matthew Painter, Mahesan Niranjan, Adam Prügel-Bennett, Jonathon Hare
Abstract: Mixed Sample Data Augmentation (MSDA) has received increasing attention in recent years, with many successful variants such as MixUp and CutMix. Following insight on the efficacy of CutMix in particular, we propose FMix, an MSDA that uses binary masks obtained by applying a threshold to low frequency images sampled from Fourier space. FMix improves performance over MixUp and CutMix for a number of state-of-the-art models across a range of data sets and problem settings. We go on to analyse MixUp, CutMix, and FMix from an information theoretic perspective, characterising learned models in terms of how they progressively compress the input with depth. Ultimately, our analyses allow us to decouple two complementary properties of augmentations, and present a unified framework for reasoning about MSDA. Code for all experiments is available at this https URL.
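
A hedged numpy sketch of the mask-sampling step (parameter names and the decay schedule below are illustrative, not the paper's exact choices): sample white noise in Fourier space, attenuate high frequencies, invert, and threshold at the desired mixing ratio:

    import numpy as np

    def sample_fmix_mask(h, w, lam=0.5, decay=3.0, rng=np.random.default_rng()):
        freq = np.fft.fft2(rng.standard_normal((h, w)))
        fy = np.fft.fftfreq(h)[:, None]
        fx = np.fft.fftfreq(w)[None, :]
        dist = np.sqrt(fy**2 + fx**2)
        dist[0, 0] = dist[0, 1]                # avoid dividing by zero at DC
        low_pass = np.real(np.fft.ifft2(freq / dist**decay))
        # Keep the top lam fraction of pixels as the binary mixing mask.
        return (low_pass > np.quantile(low_pass, 1 - lam)).astype(np.float32)

The mask (and one minus it) then weights the two images being mixed, analogously to CutMix's rectangle but with organic low-frequency shapes.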

48. Transductive Few-shot Learning with Meta-Learned Confidence [PDF]
  Seong Min Kye, Hae Beom Lee, Hoirin Kim, Sung Ju Hwang
Abstract: We propose a novel transductive inference framework for metric-based meta-learning models, which updates the prototype of each class with the confidence-weighted average of all the support and query samples. However, a caveat here is that the model confidence may be unreliable, which could lead to incorrect prediction in the transductive setting. To tackle this issue, we further propose to meta-learn to assign correct confidence scores to unlabeled queries. Specifically, we meta-learn the parameters of the distance-metric, such that the model can improve its transductive inference performance on unseen tasks with the generated confidence scores. We also consider various types of uncertainties to further enhance the reliability of the meta-learned confidence. We combine our transductive meta-learning scheme, Meta-Confidence Transduction (MCT) with a novel dense classifier, Dense Feature Matching Network (DFMN), which performs both instance-level and feature-level classification without global average pooling and validate it on four benchmark datasets. Our model achieves state-of-the-art results on all datasets, outperforming existing state-of-the-art models by 11.11% and 7.68% on miniImageNet and tieredImageNet dataset respectively. Further qualitative analysis confirms that this impressive performance gain is indeed due to its ability to assign high confidence to instances with the correct labels.
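
The prototype update can be summarized as a confidence-weighted mean (a sketch in our notation; the confidence weights themselves are meta-learned rather than fixed):

$$ p_c = \frac{\sum_{i \in S_c \cup Q} w_{i,c} \, f_\theta(x_i)}{\sum_{i \in S_c \cup Q} w_{i,c}}, $$

where $S_c$ is the support set of class $c$, $Q$ the unlabeled query set, $f_\theta$ the embedding network, and $w_{i,c}$ the meta-learned confidence that sample $i$ belongs to class $c$ (equal to 1 for support samples of the class).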

49. Face Verification Using 60 GHz 802.11 waveforms [PDF]
  Eran Hof, Amichai Sanderovich, Evyatar Hemo
Abstract: Verification of identity based on the human face radar signature in mmWave is studied. The chipset for 802.11ad/y networking, which is capable of operating in a radar mode, is used. A dataset with the faces of 200 different persons was collected for the testing. Our preliminary study shows promising results for the application of an autoencoder to the setup at hand.

50. Supervised Dimensionality Reduction and Visualization using Centroid-encoder [PDF]
  Tomojit Ghosh, Michael Kirby
Abstract: Visualizing high-dimensional data is an essential task in Data Science and Machine Learning. The Centroid-Encoder (CE) method is similar to the autoencoder but incorporates label information to keep objects of a class close together in the reduced visualization space. CE exploits nonlinearity and labels to encode high variance in low dimensions while capturing the global structure of the data. We present a detailed analysis of the method using a wide variety of data sets and compare it with other supervised dimension reduction techniques, including NCA, nonlinear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. We empirically show that centroid-encoder outperforms most of these techniques. We also show that when the data variance is spread across multiple modalities, centroid-encoder extracts a significant amount of information from the data in low-dimensional space. This key feature establishes its value as a tool for data visualization.
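
A minimal sketch of the training objective as described: reconstruct each sample not to itself (as an autoencoder would) but to the centroid of its class (PyTorch-style; names are ours):

    import torch

    def centroid_encoder_loss(encoder, decoder, x, y, centroids):
        # centroids: (num_classes, input_dim) per-class means of the
        # training inputs; y: (B,) integer class labels for the batch x.
        recon = decoder(encoder(x))
        return ((recon - centroids[y]) ** 2).sum(dim=1).mean()

Pulling every sample of a class toward a common target is what keeps classes compact in the low-dimensional bottleneck used for visualization.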

51. Segmentation-based Method combined with Dynamic Programming for Brain Midline Delineation [PDF]
  Shen Wang, Kongming Liang, Chengwei Pan, Chuyang Ye, Xiuli Li, Feng Liu, Yizhou Yu, Yizhou Wang
Abstract: Midline-related pathological image features are crucial for evaluating the severity of brain compression caused by stroke or traumatic brain injury (TBI). Automated midline delineation not only improves the assessment and clinical decision making for patients with stroke symptoms or head trauma but also reduces the time of diagnosis. Nevertheless, most previous methods model the midline by localizing anatomical points, which are hard to detect or even missing in severe cases. In this paper, we formulate brain midline delineation as a segmentation task and propose a three-stage framework. The proposed framework first aligns an input CT image into the standard space. Then, the aligned image is processed by a midline detection network (MD-Net) integrated with a CoordConv Layer and a Cascade AtrousConv Module to obtain the probability map. Finally, we formulate the optimal midline selection as a pathfinding problem to address the discontinuity of midline delineation. Experimental results show that our proposed framework can achieve superior performance on one in-house dataset and one public dataset.
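
For reference, a CoordConv layer simply appends normalized coordinate channels to its input before a standard convolution, letting the network reason about absolute position (a generic sketch assuming PyTorch, not the paper's exact configuration):

    import torch
    import torch.nn as nn

    class CoordConv2d(nn.Module):
        def __init__(self, in_channels, out_channels, **conv_kwargs):
            super().__init__()
            self.conv = nn.Conv2d(in_channels + 2, out_channels, **conv_kwargs)

        def forward(self, x):
            b, _, h, w = x.shape
            # Two extra channels hold the normalized (y, x) pixel coordinates.
            ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
            xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
            return self.conv(torch.cat([x, ys, xs], dim=1))

Absolute position is a useful prior here because the midline, by definition, lies near the center of an aligned CT slice.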

52. Max-Affine Spline Insights into Deep Generative Networks [PDF]
  Randall Balestriero, Sebastien Paris, Richard Baraniuk
Abstract: We connect a large class of Generative Deep Networks (GDNs) with spline operators in order to derive their properties, limitations, and new opportunities. By characterizing the latent space partition, dimension, and angularity of the generated manifold, we relate the manifold dimension and approximation error to the sample size. The manifold-per-region affine subspace defines a local coordinate basis; we provide necessary and sufficient conditions relating those basis vectors with disentanglement. We also derive the output probability density mapped onto the generated manifold in terms of the latent space density, which enables the computation of key statistics such as its Shannon entropy. This finding also enables the computation of the GDN likelihood, which provides a new mechanism for model comparison as well as a quality measure for (generated) samples under the learned distribution. We demonstrate how low-entropy and/or multimodal distributions are not naturally modeled by GDNs and are a cause of training instabilities.

53. A Proto-Object Based Dynamic Visual Saliency Model with an FPGA Implementation [PDF]
  Jamal Lottier Molin, Chetan Singh Thakur, Ralph Etienne-Cummings, Ernst Niebur
Abstract: The ability to attend to salient regions of a visual scene is an innate and necessary preprocessing step for both biological and engineered systems performing high-level visual tasks (e.g. object detection, tracking, and classification). Computational efficiency, in regard to processing bandwidth and speed, is improved by only devoting computational resources to salient regions of the visual stimuli. In this paper, we first present a biologically-plausible, bottom-up, dynamic visual saliency model based on the notion of proto-objects. This is achieved by incorporating the temporal characteristics of the visual stimulus into the model, similarly to the manner in which early stages of the human visual system extract temporal information. This model outperforms state-of-the-art dynamic visual saliency models in predicting human eye fixations on a commonly-used video dataset with associated eye tracking data. Secondly, for this model to have practical applications, it must be capable of performing its computations in real-time under low-power, small-size, and lightweight constraints. To address this, we introduce a Field-Programmable Gate Array implementation of the model on an Opal Kelly 7350 Kintex-7 board. This novel hardware implementation allows for processing of up to 23.35 frames per second running on a 100 MHz clock, a more than 26x speedup over the software implementation.

54. Gradient Boosted Flows [PDF]
  Robert Giaquinto, Arindam Banerjee
Abstract: Normalizing flows (NF) are a powerful framework for approximating posteriors. By mapping a simple base density through invertible transformations, flows provide an exact method of density evaluation and sampling. The trend in the normalizing flow literature has been to devise deeper, more complex transformations to achieve greater flexibility. We propose an alternative: Gradient Boosted Flows (GBF) model a variational posterior by successively adding new NF components via gradient boosting, so that each new NF component is fit to the residuals of the previously trained components. The GBF formulation results in a variational posterior that is a mixture model whose flexibility increases as more components are added. Moreover, GBFs offer a wider, rather than deeper, approach that can be incorporated to improve the results of many existing NFs. We demonstrate the effectiveness of this technique for density estimation and, by coupling GBF with a variational autoencoder, for generative modeling of images.

55. BBAND Index: A No-Reference Banding Artifact Predictor [PDF]
  Zhengzhong Tu, Jessie Lin, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Abstract: Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND index). BBAND is inspired by human visual models. The proposed detector can generate a pixel-wise banding visibility map and output a banding severity score at both the frame and video levels. Experimental results show that our proposed method outperforms state-of-the-art banding detection algorithms and delivers better consistency with subjective evaluations.

56. Kernel Bi-Linear Modeling for Reconstructing Data on Manifolds: The Dynamic-MRI Case [PDF]
  Gaurav N.Shetty, Konstantinos Slavakis, Ukash Nakarmi, Gesualdo Scutari, Leslie Ying
Abstract: This paper establishes a kernel-based framework for reconstructing data on manifolds, tailored to fit the dynamic-(d)MRI-data recovery problem. The proposed methodology exploits simple tangent-space geometries of manifolds in reproducing kernel Hilbert spaces and follows classical kernel-approximation arguments to form the data-recovery task as a bi-linear inverse problem. Departing from mainstream approaches, the proposed methodology uses no training data, employs no graph Laplacian matrix to penalize the optimization task, uses no costly (kernel) pre-imaging step to map feature points back to the input space, and utilizes complex-valued kernel functions to account for k-space data. The framework is validated on synthetically generated dMRI data, where comparisons against state-of-the-art schemes highlight the rich potential of the proposed approach in data-recovery problems.

57. Comparison of Multi-Class and Binary Classification Machine Learning Models in Identifying Strong Gravitational Lenses [PDF]
  Hossen Teimoorinia, Robert D. Toyonaga, Sebastien Fabbro, Connor Bottrell
Abstract: Typically, binary classification lens-finding schemes are used to discriminate between lens candidates and non-lenses. However, these models often suffer from substantial false-positive classifications. Such false positives frequently occur due to images containing objects such as crowded sources, galaxies with arms, and also images with a central source and smaller surrounding sources. Therefore, a model might confuse the stated circumstances with an Einstein ring. It has been proposed that by allowing such commonly misclassified image types to constitute their own classes, machine learning models will more easily be able to learn the difference between images that contain real lenses, and images that contain lens imposters. Using Hubble Space Telescope (HST) images, in the F814W filter, we compare the usage of binary and multi-class classification models applied to the lens finding task. From our findings, we conclude there is not a significant benefit to using the multi-class model over a binary model. We will also present the results of a simple lens search using a multi-class machine learning model, and potential new lens candidates.

58. Analysis of diversity-accuracy tradeoff in image captioning [PDF]
  Ruotian Luo, Gregory Shakhnarovich
Abstract: We investigate the effect of different model architectures, training objectives, hyperparameter settings, and decoding procedures on the diversity of automatically generated image captions. Our results show that: (1) simple decoding by naive sampling, coupled with a low temperature, is a competitive and fast method to produce diverse and accurate caption sets; (2) training with a CIDEr-based reward using reinforcement learning harms the diversity properties of the resulting generator, which cannot be mitigated by manipulating decoding parameters. In addition, we propose a new metric, AllSPICE, for evaluating both the accuracy and diversity of a set of captions with a single value.

59. Improving Robustness of Deep-Learning-Based Image Reconstruction [PDF]
  Ankit Raj, Yoram Bresler, Bo Li
Abstract: Deep-learning-based methods for different applications have been shown vulnerable to adversarial examples. These examples make deployment of such models in safety-critical tasks questionable. Use of deep neural networks as inverse problem solvers has generated much excitement for medical imaging including CT and MRI, but recently a similar vulnerability has also been demonstrated for these tasks. We show that for such inverse problem solvers, one should analyze and study the effect of adversaries in the measurement-space, instead of the signal-space as in previous work. In this paper, we propose to modify the training strategy of end-to-end deep-learning-based inverse problem solvers to improve robustness. We introduce an auxiliary network to generate adversarial examples, which is used in a min-max formulation to build robust image reconstruction networks. Theoretically, we show for a linear reconstruction scheme the min-max formulation results in a singular-value(s) filter regularized solution, which suppresses the effect of adversarial examples occurring because of ill-conditioning in the measurement matrix. We find that a linear network using the proposed min-max learning scheme indeed converges to the same solution. In addition, for non-linear Compressed Sensing (CS) reconstruction using deep networks, we show significant improvement in robustness using the proposed approach over other methods. We complement the theory by experiments for CS on two different datasets and evaluate the effect of increasing perturbations on trained networks. We find the behavior for ill-conditioned and well-conditioned measurement matrices to be qualitatively different.
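
In our notation (a sketch of the general shape rather than the paper's exact objective), measurement-space adversarial training of a reconstruction network $f_\theta$ with forward operator $A$ solves

$$ \min_{\theta} \; \mathbb{E}_{x} \max_{\|\delta\| \le \epsilon} \; \big\| f_\theta(Ax + \delta) - x \big\|_2^2, $$

where the inner maximization perturbs the measurements $Ax$ rather than the signal $x$, matching the paper's argument that adversaries should be analyzed in measurement space; in the proposed method the inner maximization is approximated by an auxiliary attacker network.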