
【arXiv Papers】 Computer Vision and Pattern Recognition 2020-03-17

Contents

1. Scan2Plan: Efficient Floorplan Generation from 3D Scans of Indoor Scenes [PDF] Abstract
2. Deep Adaptive Semantic Logic (DASL): Compiling Declarative Knowledge into Deep Neural Networks [PDF] Abstract
3. Complexity of Shapes Embedded in ${\mathbb Z^n}$ with a Bias Towards Squares [PDF] Abstract
4. Learning Shape Representations for Clothing Variations in Person Re-Identification [PDF] Abstract
5. G-LBM: Generative Low-dimensional Background Model Estimation from Video Sequences [PDF] Abstract
6. RSVQA: Visual Question Answering for Remote Sensing Data [PDF] Abstract
7. Resolution Adaptive Networks for Efficient Inference [PDF] Abstract
8. Domain Adaptive Ensemble Learning [PDF] Abstract
9. clDice -- a Topology-Preserving Loss Function for Tubular Structure Segmentation [PDF] Abstract
10. Context-Transformer: Tackling Object Confusion for Few-Shot Detection [PDF] Abstract
11. MVLoc: Multimodal Variational Geometry-Aware Learning for Visual Localization [PDF] Abstract
12. Towards Ground Truth Evaluation of Visual Explanations [PDF] Abstract
13. Neural Pose Transfer by Spatially Adaptive Instance Normalization [PDF] Abstract
14. A Rotation-Invariant Framework for Deep Point Cloud Analysis [PDF] Abstract
15. Joint COCO and Mapillary Workshop at ICCV 2019 Keypoint Detection Challenge Track Technical Report: Distribution-Aware Coordinate Representation for Human Pose Estimation [PDF] Abstract
16. FragNet: Writer Identification using Deep Fragment Networks [PDF] Abstract
17. Refinements in Motion and Appearance for Online Multi-Object Tracking [PDF] Abstract
18. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction [PDF] Abstract
19. Discriminative Feature and Dictionary Learning with Part-aware Model for Vehicle Re-identification [PDF] Abstract
20. Stochastic Frequency Masking to Improve Super-Resolution and Denoising Networks [PDF] Abstract
21. Minimal Solvers for Indoor UAV Positioning [PDF] Abstract
22. Synthesizing human-like sketches from natural images using a conditional convolutional decoder [PDF] Abstract
23. PS-RCNN: Detecting Secondary Human Instances in a Crowd via Primary Object Suppression [PDF] Abstract
24. LT-Net: Label Transfer by Learning Reversible Voxel-wise Correspondence for One-shot Medical Image Segmentation [PDF] Abstract
25. Adapting Object Detectors with Conditional Domain Normalization [PDF] Abstract
26. Self-Supervised Discovering of Causal Features: Towards Interpretable Reinforcement Learning [PDF] Abstract
27. On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location [PDF] Abstract
28. Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [PDF] Abstract
29. Gated Texture CNN for Efficient and Configurable Image Denoising [PDF] Abstract
30. Extended Feature Pyramid Network for Small Object Detection [PDF] Abstract
31. Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution [PDF] Abstract
32. AVR: Attention based Salient Visual Relationship Detection [PDF] Abstract
33. Any-Shot Object Detection [PDF] Abstract
34. ReLaText: Exploiting Visual Relationships for Arbitrary-Shaped Scene Text Detection with Graph Convolutional Networks [PDF] Abstract
35. Multi-Drone based Single Object Tracking with Agent Sharing Network [PDF] Abstract
36. House-GAN: Relational Generative Adversarial Networks for Graph-constrained House Layout Generation [PDF] Abstract
37. A CNN-Based Blind Denoising Method for Endoscopic Images [PDF] Abstract
38. TACO: Trash Annotations in Context for Litter Detection [PDF] Abstract
39. Vec2Face: Unveil Human Faces from their Blackbox Features in Face Recognition [PDF] Abstract
40. Frustratingly Simple Few-Shot Object Detection [PDF] Abstract
41. Camera Trace Erasing [PDF] Abstract
42. Scene Completeness-Aware Lidar Depth Completion for Driving Scenario [PDF] Abstract
43. Hyperspectral-Multispectral Image Fusion with Weighted LASSO [PDF] Abstract
44. Self-Constructing Graph Convolutional Networks for Semantic Labeling [PDF] Abstract
45. Evaluation of Rounding Functions in Nearest-Neighbor Interpolation [PDF] Abstract
46. Night-time Semantic Segmentation with a Large Real Dataset [PDF] Abstract
47. Guidance and Evaluation: Semantic-Aware Image Inpainting for Mixed Scenes [PDF] Abstract
48. Multistage Curvilinear Coordinate Transform Based Document Image Dewarping using a Novel Quality Estimator [PDF] Abstract
49. Deep Affinity Net: Instance Segmentation via Affinity [PDF] Abstract
50. SF-Net: Single-Frame Supervision for Temporal Action Localization [PDF] Abstract
51. 3D-CariGAN: An End-to-End Solution to 3D Caricature Generation from Face Photos [PDF] Abstract
52. Energy-based Periodicity Mining with Deep Features for Action Repetition Counting in Unconstrained Videos [PDF] Abstract
53. FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images [PDF] Abstract
54. VCNet: A Robust Approach to Blind Image Inpainting [PDF] Abstract
55. OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features [PDF] Abstract
56. StarNet: towards weakly supervised few-shot detection and explainable few-shot classification [PDF] Abstract
57. Performance Evaluation of Advanced Deep Learning Architectures for Offline Handwritten Character Recognition [PDF] Abstract
58. Learning Enriched Features for Real Image Restoration and Enhancement [PDF] Abstract
59. GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling [PDF] Abstract
60. Self-trained Deep Ordinal Regression for End-to-End Video Anomaly Detection [PDF] Abstract
61. DeepEMD: Few-Shot Image Classification with Differentiable Earth Mover's Distance and Structured Classifiers [PDF] Abstract
62. Siamese Box Adaptive Network for Visual Tracking [PDF] Abstract
63. Channel Pruning Guided by Classification Loss and Feature Importance [PDF] Abstract
64. MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps [PDF] Abstract
65. Learning 2D-3D Correspondences To Solve The Blind Perspective-n-Point Problem [PDF] Abstract
66. A Novel Learnable Gradient Descent Type Algorithm for Non-convex Non-smooth Inverse Problems [PDF] Abstract
67. Beyond without Forgetting: Multi-Task Learning for Classification with Disjoint Datasets [PDF] Abstract
68. Vision-Dialog Navigation by Exploring Cross-modal Memory [PDF] Abstract
69. A model of figure ground organization incorporating local and global cues [PDF] Abstract
70. NoiseRank: Unsupervised Label Noise Reduction with Dependence Models [PDF] Abstract
71. Class Conditional Alignment for Partial Domain Adaptation [PDF] Abstract
72. Emotions Don't Lie: A Deepfake Detection Method using Audio-Visual Affective Cues [PDF] Abstract
73. Identifying Individual Dogs in Social Media Images [PDF] Abstract
74. TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification [PDF] Abstract
75. Structured Domain Adaptation for Unsupervised Person Re-identification [PDF] Abstract
76. Fast Depth Estimation for View Synthesis [PDF] Abstract
77. Large-Scale Optimal Transport via Adversarial Training with Cycle-Consistency [PDF] Abstract
78. Non-Local Part-Aware Point Cloud Denoising [PDF] Abstract
79. Monocular Depth Estimation Based On Deep Learning: An Overview [PDF] Abstract
80. Medical Image Enhancement Using Histogram Processing and Feature Extraction for Cancer Classification [PDF] Abstract
81. Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition [PDF] Abstract
82. Collaborative Motion Prediction via Neural Motion Message Passing [PDF] Abstract
83. Image-to-image Neural Network for Addition and Subtraction of a Pair of Not Very Large Numbers [PDF] Abstract
84. From W-Net to CDGAN: Bi-temporal Change Detection via Deep Learning Techniques [PDF] Abstract
85. Counterfactual Samples Synthesizing for Robust Visual Question Answering [PDF] Abstract
86. Efficient Backbone Search for Scene Text Recognition [PDF] Abstract
87. Dynamic Divide-and-Conquer Adversarial Training for Robust Semantic Segmentation [PDF] Abstract
88. OccuSeg: Occupancy-aware 3D Instance Segmentation [PDF] Abstract
89. An End-to-End Geometric Deficiency Elimination Algorithm for 3D Meshes [PDF] Abstract
90. Learning Reinforced Agents with Counterfactual Simulation for Medical Automatic Diagnosis [PDF] Abstract
91. Instant recovery of shape from spectrum via latent space connections [PDF] Abstract
92. Symmetry Detection of Occluded Point Cloud Using Deep Learning [PDF] Abstract
93. Explainable Deep Classification Models for Domain Generalization [PDF] Abstract
94. Recurrent convolutional neural networks for mandible segmentation from computed tomography [PDF] Abstract
95. Self-supervised Single-view 3D Reconstruction via Semantic Consistency [PDF] Abstract
96. Inducing Optimal Attribute Representations for Conditional GANs [PDF] Abstract
97. GeoDA: a geometric framework for black-box adversarial attacks [PDF] Abstract
98. The GraphNet Zoo: A Plug-and-Play Framework for Deep Semi-Supervised Classification [PDF] Abstract
99. Mutual Information Maximization for Effective Lip Reading [PDF] Abstract
100. Learning Unbiased Representations via Mutual Information Backpropagation [PDF] Abstract
101. Active Depth Estimation: Stability Analysis and its Applications [PDF] Abstract
102. Anomalous Instance Detection in Deep Learning: A Survey [PDF] Abstract
103. Toward Adversarial Robustness via Semi-supervised Robust Training [PDF] Abstract
104. OmniTact: A Multi-Directional High Resolution Touch Sensor [PDF] Abstract
105. Output Diversified Initialization for Adversarial Attacks [PDF] Abstract
106. Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks [PDF] Abstract
107. Towards Privacy Protection by Generating Adversarial Identity Masks [PDF] Abstract
108. Experimenting with Convolutional Neural Network Architectures for the automatic characterization of Solitary Pulmonary Nodules' malignancy rating [PDF] Abstract
109. A proto-object based audiovisual saliency map [PDF] Abstract
110. Interactive Neural Style Transfer with Artists [PDF] Abstract
111. Investigating Generalization in Neural Networks under Optimally Evolved Training Perturbations [PDF] Abstract
112. Rapid Whole Slide Imaging via Learning-based Two-shot Virtual Autofocusing [PDF] Abstract
113. VarMixup: Exploiting the Latent Space for Robust Training and Inference [PDF] Abstract
114. Boundary Guidance Hierarchical Network for Real-Time Tongue Segmentation [PDF] Abstract
115. Leveraging Vision and Kinematics Data to Improve Realism of Biomechanic Soft-tissue Simulation for Robotic Surgery [PDF] Abstract
116. A Privacy-Preserving DNN Pruning and Mobile Acceleration Framework [PDF] Abstract

Abstracts

1. Scan2Plan: Efficient Floorplan Generation from 3D Scans of Indoor Scenes [PDF] Back to Contents
  Ameya Phalak, Vijay Badrinarayanan, Andrew Rabinovich
Abstract: We introduce Scan2Plan, a novel approach for accurate estimation of a floorplan from a 3D scan of the structural elements of indoor environments. The proposed method incorporates a two-stage approach where the initial stage clusters an unordered point cloud representation of the scene into room instances and wall instances using a deep neural network based voting approach. The subsequent stage estimates a closed perimeter, parameterized by a simple polygon, for each individual room by finding the shortest path along the predicted room and wall keypoints. The final floorplan is simply an assembly of all such room perimeters in the global co-ordinate system. The Scan2Plan pipeline produces accurate floorplans for complex layouts, is highly parallelizable and extremely efficient compared to existing methods. The voting module is trained only on synthetic data and evaluated on publicly available Structured3D and BKE datasets to demonstrate excellent qualitative and quantitative results outperforming state-of-the-art techniques.

2. Deep Adaptive Semantic Logic (DASL): Compiling Declarative Knowledge into Deep Neural Networks [PDF] Back to Contents
  Karan Sikka, Andrew Silberfarb, John Byrnes, Indranil Sur, Ed Chow, Ajay Divakaran, Richard Rohwer
Abstract: We introduce Deep Adaptive Semantic Logic (DASL), a novel framework for automating the generation of deep neural networks that incorporates user-provided formal knowledge to improve learning from data. We provide formal semantics that demonstrate that our knowledge representation captures all of first order logic and that finite sampling from infinite domains converges to correct truth values. DASL's representation improves on prior neural-symbolic work by avoiding vanishing gradients, allowing deeper logical structure, and enabling richer interactions between the knowledge and learning components. We illustrate DASL through a toy problem in which we add structure to an image classification problem and demonstrate that knowledge of that structure reduces data requirements by a factor of $1000$. We then evaluate DASL on a visual relationship detection task and demonstrate that the addition of commonsense knowledge improves performance by $10.7\%$ in a data scarce setting.

3. Complexity of Shapes Embedded in ${\mathbb Z^n}$ with a Bias Towards Squares [PDF] Back to Contents
  M. Ferhat Arslan, Sibel Tari
Abstract: Shape complexity is a hard-to-quantify quality, mainly due to its relative nature. Biased by Euclidean thinking, circles are commonly considered the simplest shapes. However, their constructions as digital images are only approximations of the ideal form. Consequently, complexity orders computed in reference to circles are unstable. Unlike circles, which lose their circleness in digital images, squares retain their qualities. Hence, we consider squares (hypercubes in $\mathbb Z^n$) to be the simplest shapes relative to which complexity orders are constructed. Using the connection between the $L^\infty$ norm and squares, we effectively encode squareness-adapted simplification, through which we obtain a multi-scale complexity measure, where scale determines the level of interest in the boundary. The emergent scale above which the effect of a boundary feature (appendage) disappears is related to the ratio of the contacting width of the appendage to that of the main body. We discuss what zero complexity implies in terms of information repetition and constructibility, and what kinds of shapes in addition to squares have zero complexity.
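
The $L^\infty$-squares connection referenced above can be made concrete: the chessboard distance is exactly the $L^\infty$ metric on $\mathbb Z^2$, and its balls (hence the level sets of its distance transform inside a shape) are squares. A minimal sketch, where the disk and the sizes are arbitrary choices for demonstration:

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

# A binary shape: a filled disk embedded in Z^2.
h, w = 101, 101
yy, xx = np.mgrid[:h, :w]
shape = ((yy - 50) ** 2 + (xx - 50) ** 2) <= 40 ** 2

# Chessboard metric = L-infinity distance; its balls are squares, so the
# level sets {dist == k} of this transform are square-shaped contours.
dist = distance_transform_cdt(shape, metric='chessboard')
print(dist.max())  # inner "square radius" of the disk
```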

4. Learning Shape Representations for Clothing Variations in Person Re-Identification [PDF] Back to Contents
  Yu-Jhe Li, Zhengyi Luo, Xinshuo Weng, Kris M. Kitani
Abstract: Person re-identification (re-ID) aims to recognize instances of the same person contained in multiple images taken across different cameras. Existing methods for re-ID tend to rely heavily on the assumption that both query and gallery images of the same person have the same clothing. Unfortunately, this assumption may not hold for datasets captured over long periods of time (e.g., weeks, months or years). To tackle the re-ID problem in the context of clothing changes, we propose a novel representation learning model which is able to generate a body shape feature representation without being affected by clothing color or patterns. We call our model the Color Agnostic Shape Extraction Network (CASE-Net). CASE-Net learns a representation of identity that depends only on body shape via adversarial learning and feature disentanglement. Due to the lack of large-scale re-ID datasets which contain clothing changes for the same person, we propose two synthetic datasets for evaluation. We create a rendered dataset SMPL-reID with different clothes patterns and a synthesized dataset Div-Market with different clothing color to simulate two types of clothing changes. The quantitative and qualitative results across 5 datasets (SMPL-reID, Div-Market, two benchmark re-ID datasets, a cross-modality re-ID dataset) confirm the robustness and superiority of our approach against several state-of-the-art approaches.

5. G-LBM: Generative Low-dimensional Background Model Estimation from Video Sequences [PDF] Back to Contents
  Behnaz Rezaei, Amirreza Farnoosh, Sarah Ostadabbas
Abstract: In this paper, we propose a computationally tractable and theoretically supported non-linear low-dimensional generative model to represent real-world data in the presence of noise and sparse outliers. The non-linear low-dimensional manifold discovery of data is done through describing a joint distribution over observations, and their low-dimensional representations (i.e. manifold coordinates). Our model, called generative low-dimensional background model (G-LBM), admits variational operations on the distribution of the manifold coordinates and simultaneously generates a low-rank structure of the latent manifold given the data. Therefore, our probabilistic model contains the intuition of non-probabilistic low-dimensional manifold learning. G-LBM selects the intrinsic dimensionality of the underlying manifold of the observations, and its probabilistic nature models the noise in the observation data. G-LBM has direct application in background scene model estimation from video sequences and we have evaluated its performance on the SBMnet-2016 and BMC2012 datasets, where it achieved performance higher than or comparable to other state-of-the-art methods while being agnostic to the background scenes in the videos. Besides, in challenges such as camera jitter and background motion, G-LBM is able to robustly estimate the background by effectively modeling the uncertainties in video observations in these scenarios.

6. RSVQA: Visual Question Answering for Remote Sensing Data [PDF] Back to Contents
  Sylvain Lobry, Diego Marcos, Jesse Murray, Devis Tuia
Abstract: This paper introduces the task of visual question answering for remote sensing data (RSVQA). Remote sensing images contain a wealth of information which can be useful for a wide range of tasks including land cover classification, object counting or detection. However, most of the available methodologies are task-specific, thus inhibiting generic and easy access to the information contained in remote sensing data. As a consequence, accurate remote sensing product generation still requires expert knowledge. With RSVQA, we propose a system to extract information from remote sensing data that is accessible to every user: we use questions formulated in natural language and use them to interact with the images. With the system, images can be queried to obtain high level information specific to the image content or relational dependencies between objects visible in the images. Using an automatic method introduced in this article, we built two datasets (using low and high resolution data) of image/question/answer triplets. The information required to build the questions and answers is queried from OpenStreetMap (OSM). The datasets can be used to train (when using supervised methods) and evaluate models to solve the RSVQA task. We report the results obtained by applying a model based on Convolutional Neural Networks (CNNs) for the visual part and on a Recurrent Neural Network (RNN) for the natural language part to this task. The model is trained on the two datasets, yielding promising results in both cases.
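
As a rough sketch of the CNN+RNN pipeline described above: a visual feature from a CNN, a question feature from an RNN, a simple fusion, and an answer classifier. The backbones, dimensions, and point-wise fusion below are placeholder assumptions, not necessarily the authors' exact configuration:

```python
import torch
import torch.nn as nn

class ToyRSVQA(nn.Module):
    """Minimal VQA skeleton: CNN for the image, GRU for the question,
    element-wise fusion, then an MLP over candidate answers."""
    def __init__(self, vocab_size=1000, num_answers=100, d=256):
        super().__init__()
        self.cnn = nn.Sequential(  # stand-in for a pretrained backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d))
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                        nn.Linear(d, num_answers))

    def forward(self, image, question_tokens):
        v = self.cnn(image)                      # (B, d) visual feature
        _, q = self.rnn(self.embed(question_tokens))
        fused = v * q.squeeze(0)                 # point-wise fusion
        return self.classifier(fused)            # answer logits

logits = ToyRSVQA()(torch.randn(2, 3, 64, 64),
                    torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 100])
```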

7. Resolution Adaptive Networks for Efficient Inference [PDF] Back to Contents
  Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang
Abstract: Recently, adaptive inference is gaining increasing attention due to its high computational efficiency. Different from existing works, which mainly exploit architecture redundancy for adaptive network design, in this paper, we focus on spatial redundancy of input samples, and propose a novel Resolution Adaptive Network (RANet). Our motivation is that low-resolution representations can be sufficient for classifying "easy" samples containing canonical objects, while high-resolution features are crucial for recognizing some "hard" ones. In RANet, input images are first routed to a lightweight sub-network that efficiently extracts coarse feature maps, and samples with highly confident predictions will exit early from the sub-network. The high-resolution paths are only activated for those "hard" samples whose previous predictions are unreliable. By adaptively processing the features in varying resolutions, the proposed RANet can significantly improve its computational efficiency. Experiments on three classification benchmark tasks (CIFAR-10, CIFAR-100 and ImageNet) demonstrate the effectiveness of the proposed model in both the anytime prediction setting and the budgeted batch classification setting.
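
The early-exit idea, deciding per sample whether a cheap low-resolution prediction is confident enough to stop, can be sketched as follows. The threshold and the two sub-networks are placeholders, not the RANet architecture itself:

```python
import torch
import torch.nn.functional as F

def adaptive_predict(x, light_net, heavy_net, threshold=0.9):
    """Inference-time sketch: run the lightweight low-resolution path
    first; fall back to the expensive high-resolution path only for
    samples whose softmax confidence is below the threshold."""
    logits = light_net(F.interpolate(x, scale_factor=0.5))
    conf, _ = F.softmax(logits, dim=1).max(dim=1)
    hard = conf < threshold          # samples whose prediction we distrust
    if hard.any():
        logits[hard] = heavy_net(x[hard])
    return logits
```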

8. Domain Adaptive Ensemble Learning [PDF] Back to Contents
  Kaiyang Zhou, Yongxin Yang, Yu Qiao, Tao Xiang
Abstract: The problem of generalizing deep neural networks from multiple source domains to a target one is studied under two settings: When unlabeled target data is available, it is a multi-source unsupervised domain adaptation (UDA) problem, otherwise a domain generalization (DG) problem. We propose a unified framework termed domain adaptive ensemble learning (DAEL) to address both problems. A DAEL model is composed of a CNN feature extractor shared across domains and multiple classifier heads each trained to specialize in a particular source domain. Each such classifier is an expert to its own domain and a non-expert to others. DAEL aims to learn these experts collaboratively so that when forming an ensemble, they can leverage complementary information from each other to be more effective for an unseen target domain. To this end, each source domain is used in turn as a pseudo-target-domain with its own expert providing supervision signal to the ensemble of non-experts learned from the other sources. For unlabeled target data under the UDA setting where real expert does not exist, DAEL uses pseudo-label to supervise the ensemble learning. Extensive experiments on three multi-source UDA datasets and two DG datasets show that DAEL improves the state-of-the-art on both problems, often by significant margins. The code is released at \url{this https URL}.
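
The shared-extractor/per-domain-head layout described above can be sketched as below. This is a simplified structural sketch only; the collaborative expert/non-expert training and the pseudo-label supervision are omitted, and all names are illustrative:

```python
import torch
import torch.nn as nn

class DomainEnsemble(nn.Module):
    """Shared CNN feature extractor with one classifier head per source
    domain; at test time the heads are ensembled for the unseen target."""
    def __init__(self, backbone, feat_dim, num_classes, num_domains):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_domains))

    def forward(self, x, domain=None):
        f = self.backbone(x)
        if domain is not None:          # train the expert of that domain
            return self.heads[domain](f)
        # average all experts' predictions for an unseen target domain
        return torch.stack([h(f) for h in self.heads]).mean(dim=0)
```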

9. clDice -- a Topology-Preserving Loss Function for Tubular Structure Segmentation [PDF] Back to Contents
  Suprosanna Shit, Johannes C. Paetzold, Anjany Sekuboyina, Andrey Zhylka, Ivan Ezhov, Alexander Unger, Josien P. W. Pluim, Giles Tetteh, Bjoern H. Menze
Abstract: Accurate segmentation of tubular, network-like structures, such as vessels, neurons, or roads, is relevant to many fields of research. For such structures, the topology is their most important characteristic, e.g. preserving connectedness: in case of vascular networks, missing a connected vessel entirely alters the blood-flow dynamics. We introduce a novel similarity measure termed clDice, which is calculated on the intersection of the segmentation masks and their (morphological) skeletons. Crucially, we theoretically prove that clDice guarantees topological correctness for binary 2D and 3D segmentation. Extending this, we propose a computationally efficient, differentiable soft-clDice as a loss function for training arbitrary neural segmentation networks. We benchmark the soft-clDice loss for segmentation on four public datasets (2D and 3D). Training on soft-clDice leads to segmentation with more accurate connectivity information, higher graph similarity, and better volumetric scores.
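
The metric has a compact form: writing $S_P$, $S_G$ for the skeletons of the prediction $P$ and ground truth $G$, topology precision is $|S_P \cap G| / |S_P|$, topology sensitivity is $|S_G \cap P| / |S_G|$, and clDice is their harmonic mean. A NumPy sketch of the hard (non-differentiable) 2D metric:

```python
import numpy as np
from skimage.morphology import skeletonize

def cl_dice(pred, gt, eps=1e-8):
    """clDice for binary 2D masks: harmonic mean of topology precision
    (fraction of the predicted skeleton inside the ground truth) and
    topology sensitivity (fraction of the ground-truth skeleton inside
    the prediction)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    s_pred, s_gt = skeletonize(pred), skeletonize(gt)
    tprec = (s_pred & gt).sum() / (s_pred.sum() + eps)
    tsens = (s_gt & pred).sum() / (s_gt.sum() + eps)
    return 2 * tprec * tsens / (tprec + tsens + eps)
```

The trainable soft-clDice loss replaces the hard skeletonization above with a differentiable, morphology-like soft-skeleton operator.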

10. Context-Transformer: Tackling Object Confusion for Few-Shot Detection [PDF] Back to Contents
  Ze Yang, Yali Wang, Xianyu Chen, Jianzhuang Liu, Yu Qiao
Abstract: Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors. A popular approach to handle this problem is transfer learning, i.e., fine-tuning a detector pretrained on a source-domain benchmark. However, such transferred detector often fails to recognize new objects in the target domain, due to low data diversity of training samples. To tackle this problem, we propose a novel Context-Transformer within a concise deep transfer framework. Specifically, Context-Transformer can effectively leverage source-domain object knowledge as guidance, and automatically exploit contexts from only a few training images in the target domain. Subsequently, it can adaptively integrate these relational clues to enhance the discriminative power of detector, in order to reduce object confusion in few-shot scenarios. Moreover, Context-Transformer is flexibly embedded in the popular SSD-style detectors, which makes it a plug-and-play module for end-to-end few-shot learning. Finally, we evaluate Context-Transformer on the challenging settings of few-shot detection and incremental few-shot detection. The experimental results show that, our framework outperforms the recent state-of-the-art approaches.

11. MVLoc: Multimodal Variational Geometry-Aware Learning for Visual Localization [PDF] Back to Contents
  Rui Zhou, Changhao Chen, Bing Wang, Andrew Markham, Niki Trigoni
Abstract: Recent learning-based research has achieved impressive results in the field of single-shot camera relocalization. However, how best to fuse multiple modalities, for example, image and depth, and how to deal with degraded or missing input are less well studied. In particular, we note that previous approaches towards deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of the naive approaches to feature space fusion through summation or concatenation which do not take into account the different strengths of each modality, specifically appearance for images and structure for depth. To address this, we propose an end-to-end framework to fuse different sensor inputs through a variational Product-of-Experts (PoE) joint encoder followed by attention-based fusion. Unlike prior work which draws a single sample from the joint encoder, we show how accuracy can be increased through importance sampling and reparameterization of the latent space. Our model is extensively evaluated on RGB-D datasets, outperforming existing baselines by a large margin.
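
For reference, when each modality-specific expert is a diagonal Gaussian over the latent code, a Product-of-Experts joint has a closed form: precisions add and the mean is the precision-weighted average. A small sketch of fusing two such encodings (the two-expert example and shapes are illustrative):

```python
import numpy as np

def gaussian_poe(mus, vars_):
    """Product of independent Gaussian experts N(mu_i, var_i): the joint
    is Gaussian with precision = sum of precisions and mean equal to the
    precision-weighted average of the expert means."""
    precisions = [1.0 / v for v in vars_]
    var = 1.0 / np.sum(precisions, axis=0)
    mu = var * np.sum([p * m for p, m in zip(precisions, mus)], axis=0)
    return mu, var

# e.g. fusing latent codes predicted from an RGB and a depth encoder
mu, var = gaussian_poe([np.zeros(8), np.ones(8)],
                       [np.full(8, 1.0), np.full(8, 4.0)])
print(mu[:3], var[:3])  # mean is pulled toward the more certain expert
```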

12. Towards Ground Truth Evaluation of Visual Explanations [PDF] Back to Contents
  Ahmed Osman, Leila Arras, Wojciech Samek
Abstract: Several methods have been proposed to explain the decisions of neural networks in the visual domain via saliency heatmaps (aka relevances/feature importance scores). Thus far, these methods were mainly validated on real-world images, using either pixel perturbation experiments or bounding box localization accuracies. In the present work, we propose instead to evaluate explanations in a restricted and controlled setup using a synthetic dataset of rendered 3D shapes. To this end, we generate a CLEVR-alike visual question answering benchmark with around 40,000 questions, where the ground truth pixel coordinates of relevant objects are known, which allows us to validate explanations in a fair and transparent way. We further introduce two straightforward metrics to evaluate explanations in this setup, and compare their outcomes to standard pixel perturbation using a Relation Network model and three decomposition-based explanation methods: Gradient x Input, Integrated Gradients and Layer-wise Relevance Propagation. Among the tested methods, Layer-wise Relevance Propagation was shown to perform best, followed by Integrated Gradients. More generally, we expect the release of our dataset and code to support the development and comparison of methods on a well-defined common ground.
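
One simple metric of this flavor is the share of positive relevance that falls inside the ground-truth object mask. The sketch below is an assumed example of such a mask-based score, not necessarily either of the two metrics the paper proposes:

```python
import numpy as np

def relevance_mass_inside(heatmap, mask):
    """Share of (positive) saliency landing on ground-truth pixels;
    1.0 means the explanation points only at the relevant object."""
    r = np.clip(heatmap, 0, None)
    return float((r * mask).sum() / (r.sum() + 1e-12))
```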

13. Neural Pose Transfer by Spatially Adaptive Instance Normalization [PDF] Back to Contents
  Jiashun Wang, Chao Wen, Yanwei Fu, Haitao Lin, Tianyun Zou, Xiangyang Xue, Yinda Zhang
Abstract: Pose transfer has been studied for decades, in which the pose of a source mesh is applied to a target mesh. Particularly in this paper, we are interested in transferring the pose of the source human mesh to deform the target human mesh, while the source and target meshes may have different identity information. Traditional studies assume that paired source and target meshes exist with point-wise correspondences of user-annotated landmarks/mesh points, which requires heavy labelling efforts. On the other hand, the generalization ability of deep models is limited when the source and target meshes have different identities. To break this limitation, we propose the first neural pose transfer model that solves pose transfer via the latest technique for image style transfer, leveraging the newly proposed component -- spatially adaptive instance normalization. Our model does not require any correspondences between the source and target meshes. Extensive experiments show that the proposed model can effectively transfer deformation from source to target meshes, and has good generalization ability to deal with unseen identities or poses of meshes. Code is available at this https URL .
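
The building block is adaptive instance normalization made spatially varying: per-vertex scale and shift are predicted from the conditioning input instead of being global. A minimal sketch over per-vertex features; the 1x1-conv parameter predictors and sizes are simplified stand-ins, not the paper's exact block:

```python
import torch
import torch.nn as nn

class SPAdaIN(nn.Module):
    """Spatially adaptive instance normalization over per-vertex
    features (B, C, N): normalize x per instance/channel, then apply a
    scale and shift predicted pointwise from the conditioning input."""
    def __init__(self, channels, cond_channels):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.gamma = nn.Conv1d(cond_channels, channels, 1)
        self.beta = nn.Conv1d(cond_channels, channels, 1)

    def forward(self, x, cond):
        return self.gamma(cond) * self.norm(x) + self.beta(cond)

out = SPAdaIN(64, 3)(torch.randn(2, 64, 1024), torch.randn(2, 3, 1024))
print(out.shape)  # torch.Size([2, 64, 1024])
```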

14. A Rotation-Invariant Framework for Deep Point Cloud Analysis [PDF] Back to Contents
  Xianzhi Li, Ruihui Li, Guangyong Chen, Chi-Wing Fu, Daniel Cohen-Or, Pheng-Ann Heng
Abstract: Recently, many deep neural networks were designed to process 3D point clouds, but a common drawback is that rotation invariance is not ensured, leading to poor generalization to arbitrary orientations. In this paper, we introduce a new low-level purely rotation-invariant representation to replace common 3D Cartesian coordinates as the network inputs. Also, we present a network architecture to embed these representations into features, encoding local relations between points and their neighbors, and the global shape structure. To alleviate inevitable global information loss caused by the rotation-invariant representations, we further introduce a region relation convolution to encode local and non-local information. We evaluate our method on multiple point cloud analysis tasks, including shape classification, part segmentation, and shape retrieval. Experimental results show that our method achieves consistent, and also the best performance, on inputs at arbitrary orientations, compared with the state-of-the-arts.
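
Purely rotation-invariant low-level inputs are usually built from distances and angles, which no rotation can change. A toy example of such features with a numerical check; this particular feature set is illustrative, not the paper's exact representation:

```python
import numpy as np

def rotation_invariant_edge_features(p, q, centroid):
    """Features of a point pair (p, q) that depend only on distances and
    an angle, hence are unchanged by any global rotation."""
    d_pq = np.linalg.norm(q - p)
    d_p = np.linalg.norm(p - centroid)
    d_q = np.linalg.norm(q - centroid)
    cos_angle = np.dot(p - centroid, q - centroid) / (d_p * d_q + 1e-12)
    return np.array([d_pq, d_p, d_q, cos_angle])

rng = np.random.default_rng(0)
pts = rng.normal(size=(10, 3))
f = rotation_invariant_edge_features(pts[0], pts[1], pts.mean(axis=0))
# applying a random orthogonal transform leaves the features unchanged
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
f_rot = rotation_invariant_edge_features(pts[0] @ Q.T, pts[1] @ Q.T,
                                         (pts @ Q.T).mean(axis=0))
print(np.allclose(f, f_rot))  # True
```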

15. Joint COCO and Mapillary Workshop at ICCV 2019 Keypoint Detection Challenge Track Technical Report: Distribution-Aware Coordinate Representation for Human Pose Estimation [PDF] Back to Contents
  Hanbin Dai, Liangbo Zhou, Feng Zhang, Zhengyu Zhang, Hong Hu, Xiatian Zhu, Mao Ye
Abstract: In this paper, we focus on the coordinate representation in human pose estimation. While being the standard choice, heatmap based representation has not been systematically investigated. We found that the process of coordinate decoding (i.e. transforming the predicted heatmaps to the coordinates) is surprisingly significant for human pose estimation performance, which nevertheless was not recognised before. In light of the discovered importance, we further probe the design limitations of the standard coordinate decoding method and propose a principled distribution-aware decoding method. Meanwhile, we improve the standard coordinate encoding process (i.e. transforming ground-truth coordinates to heatmaps) by generating accurate heatmap distributions for unbiased model training. Taking them together, we formulate a novel Distribution-Aware coordinate Representation for Keypoint (DARK) method. Serving as a model-agnostic plug-in, DARK significantly improves the performance of a variety of state-of-the-art human pose estimation models. Extensive experiments show that DARK yields the best results on COCO keypoint detection challenge, validating the usefulness and effectiveness of our novel coordinate representation idea. The project page containing more details is at this https URL
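
The distribution-aware decoding step can be sketched as one Newton step on the log-heatmap around the integer argmax, which is exact when the heatmap is a Gaussian. This is a simplified single-keypoint version; the full method also smooths/modulates the predicted heatmap first:

```python
import numpy as np

def dark_decode(heatmap):
    """Sub-pixel keypoint decoding: refine the argmax m of a heatmap by
    one Newton step on log h, i.e. mu = m - H^{-1} g, where g and H are
    the finite-difference gradient and Hessian of log h at m."""
    rows, cols = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if not (0 < x < cols - 1 and 0 < y < rows - 1):
        return np.array([x, y], dtype=float)   # border peak: no refinement
    L = np.log(np.maximum(heatmap, 1e-10))
    gx = 0.5 * (L[y, x + 1] - L[y, x - 1])
    gy = 0.5 * (L[y + 1, x] - L[y - 1, x])
    hxx = L[y, x + 1] - 2 * L[y, x] + L[y, x - 1]
    hyy = L[y + 1, x] - 2 * L[y, x] + L[y - 1, x]
    hxy = 0.25 * (L[y + 1, x + 1] - L[y + 1, x - 1]
                  - L[y - 1, x + 1] + L[y - 1, x - 1])
    hess = np.array([[hxx, hxy], [hxy, hyy]])
    if abs(np.linalg.det(hess)) < 1e-12:
        return np.array([x, y], dtype=float)
    offset = -np.linalg.solve(hess, np.array([gx, gy]))
    return np.array([x, y], dtype=float) + offset
```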

16. FragNet: Writer Identification using Deep Fragment Networks [PDF] Back to Contents
  Sheng He, Lambert Schomaker
Abstract: Writer identification based on a small amount of text is a challenging problem. In this paper, we propose a new benchmark study for writer identification based on word or text block images which approximately contain one word. In order to extract powerful features on these word images, a deep neural network, named FragNet, is proposed. The FragNet has two pathways: feature pyramid which is used to extract feature maps and fragment pathway which is trained to predict the writer identity based on fragments extracted from the input image and the feature maps on the feature pyramid. We conduct experiments on four benchmark datasets, which show that our proposed method can generate efficient and robust deep representations for writer identification based on both word and page images.

17. Refinements in Motion and Appearance for Online Multi-Object Tracking [PDF] Back to Contents
  Piao Huang, Shoudong Han, Jun Zhao, Donghaisheng Liu, Hongwei Wang, En Yu, Alex ChiChung Kot
Abstract: Modern multi-object tracking (MOT) system usually involves separated modules, such as motion model for location and appearance model for data association. However, the compatible problems within both motion and appearance models are always ignored. In this paper, a general architecture named as MIF is presented by seamlessly blending the Motion integration, three-dimensional(3D) Integral image and adaptive appearance feature Fusion. Since the uncertain pedestrian and camera motions are usually handled separately, the integrated motion model is designed using our defined intension of camera motion. Specifically, a 3D integral image based spatial blocking method is presented to efficiently cut useless connections between trajectories and candidates with spatial constraints. Then the appearance model and visibility prediction are jointly built. Considering scale, pose and visibility, the appearance features are adaptively fused to overcome the feature misalignment problem. Our MIF based tracker (MIFT) achieves the state-of-the-art accuracy with 60.1 MOTA on both MOT16&17 challenges.

18. GraphTCN: Spatio-Temporal Interaction Modeling for Human Trajectory Prediction [PDF] Back to Contents
  Chengxin Wang, Shaofeng Cai, Gary Tan
Abstract: Trajectory prediction is a fundamental and challenging task of forecasting the future paths of agents in autonomous applications with multi-agent interaction, where agents need to predict the future movements of their neighbors to avoid collisions. To respond to the environment both timely and precisely, high efficiency and accuracy are required in the prediction. Conventional approaches, e.g., LSTM-based models, incur considerable computation cost in prediction, especially for long-sequence prediction. To support more efficient and accurate trajectory prediction, we instead propose a novel CNN-based spatial-temporal graph framework, GraphTCN, which captures the spatial and temporal interactions in an input-aware manner. The spatial interaction between agents at each time step is captured with an edge graph attention network (EGAT), and the temporal interaction across time steps is modeled with a modified gated convolutional network (CNN). In contrast to conventional models, both the spatial and temporal modeling in GraphTCN are computed within each local time window. Therefore, GraphTCN can be executed in parallel for much higher efficiency, and meanwhile with accuracy comparable to best-performing approaches. Experimental results confirm that GraphTCN achieves noticeably better performance in terms of both efficiency and accuracy compared with state-of-the-art methods on various trajectory prediction benchmark datasets.

19. Discriminative Feature and Dictionary Learning with Part-aware Model for Vehicle Re-identification [PDF] Back to Contents
  Huibing Wang, Jinjia Peng, Guangqi Jiang, Fengqiang Xu, Xianping Fu
Abstract: With the development of smart cities, urban surveillance video analysis will play a further significant role in intelligent transportation systems. Identifying the same target vehicle in large datasets from non-overlapping cameras should be highlighted, which has grown into a hot topic in promoting intelligent transportation systems. However, vehicle re-identification (re-ID) is a challenging task, since vehicles of the same design or manufacturer show similar appearance. To fill these gaps, we tackle this challenge by proposing a Triplet Center Loss based Part-aware Model (TCPM) that leverages the discriminative features in part details of vehicles to refine the accuracy of vehicle re-identification. TCPM's part discovery partitions the vehicle along horizontal and vertical directions to strengthen the details of the vehicle and reinforce the internal consistency of its parts. In addition, to eliminate intra-class differences in local regions of the vehicle, we propose external memory modules to emphasize the consistency of each part to learn the discriminating features, which forms a global dictionary over all categories in the dataset. In TCPM, the triplet-center loss is introduced to ensure that each part of the extracted vehicle features has intra-class consistency and inter-class separability. Experimental results show that our proposed TCPM outperforms the existing state-of-the-art methods by a large margin on the benchmark datasets VehicleID and VeRi-776.
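
For reference, the triplet-center loss that TCPM builds on (He et al., 2018) pulls each feature $f_i$ toward its own class center $c_{y_i}$ while keeping a margin $m$ from the nearest other center:

```latex
\mathcal{L}_{tc} = \sum_{i=1}^{N} \max\!\Big( D(f_i, c_{y_i}) + m - \min_{j \neq y_i} D(f_i, c_j),\; 0 \Big)
```

where $D$ is a distance (typically squared Euclidean), so the learned features gain both intra-class compactness and inter-class separability.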

20. Stochastic Frequency Masking to Improve Super-Resolution and Denoising Networks [PDF] Back to Contents
  Majed El Helou, Ruofan Zhou, Sabine Süsstrunk
Abstract: Super-resolution and denoising are ill-posed yet fundamental image restoration tasks. In blind settings, the degradation kernel or the noise level are unknown. This makes restoration even more challenging, notably for learning-based methods, as they tend to overfit to the degradation seen during training. We present an analysis, in the frequency domain, of degradation-kernel overfitting in super-resolution and introduce a conditional learning perspective that extends to both super-resolution and denoising. Building on our formulation, we propose a stochastic frequency masking of images used in training to regularize the networks and address the overfitting problem. Our technique improves state-of-the-art methods on blind super-resolution with different synthetic kernels, real super-resolution, blind Gaussian denoising, and real-image denoising.
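
The central operation, masking a random band of spatial frequencies of training images, can be sketched with NumPy. The band selection below is a simple random annulus in FFT space; the paper's exact masking policy may differ:

```python
import numpy as np

def stochastic_frequency_mask(img, rng):
    """Zero out a random annulus of spatial frequencies of a grayscale
    image: FFT -> mask a frequency band -> inverse FFT."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2) / np.hypot(h / 2, w / 2)
    lo = rng.uniform(0.0, 0.9)
    hi = lo + rng.uniform(0.05, 0.1)       # band width, ~5-10% of radius
    f[(radius >= lo) & (radius < hi)] = 0  # drop that frequency band
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

masked = stochastic_frequency_mask(np.random.rand(64, 64),
                                   np.random.default_rng(0))
```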

21. Minimal Solvers for Indoor UAV Positioning [PDF] Back to Contents
  Marcus Valtonen Örnhag, Patrik Persson, Mårten Wadenbäck, Kalle Åström, Anders Heyden
Abstract: In this paper we consider a collection of relative pose problems which arise naturally in applications for visual indoor UAV navigation. We focus on cases where additional information from an onboard IMU is available and thus provides a partial extrinsic calibration through the gravitational vector. The solvers are designed for a partially calibrated camera, for a variety of realistic indoor scenarios, which makes it possible to navigate using images of the ground floor. Current state-of-the-art solvers use more general assumptions, such as using arbitrary planar structures; however, these solvers do not yield adequate reconstructions for real scenes, nor do they perform fast enough to be incorporated in real-time systems. We show that the proposed solvers enjoy better numerical stability, are faster, and require fewer point correspondences, compared to state-of-the-art solvers. These properties are vital components for robust navigation in real-time systems, and we demonstrate on both synthetic and real data that our method outperforms other methods, and yields superior motion estimation.

22. Synthesizing human-like sketches from natural images using a conditional convolutional decoder [PDF] Back to Contents
  Moritz Kampelmühler, Axel Pinz
Abstract: Humans are able to precisely communicate diverse concepts by employing sketches, a highly reduced and abstract shape based representation of visual content. We propose, for the first time, a fully convolutional end-to-end architecture that is able to synthesize human-like sketches of objects in natural images with potentially cluttered background. To enable an architecture to learn this highly abstract mapping, we employ the following key components: (1) a fully convolutional encoder-decoder structure, (2) a perceptual similarity loss function operating in an abstract feature space and (3) conditioning of the decoder on the label of the object that shall be sketched. Given the combination of these architectural concepts, we can train our structure in an end-to-end supervised fashion on a collection of sketch-image pairs. The generated sketches of our architecture can be classified with 85.6% Top-5 accuracy and we verify their visual quality via a user study. We find that deep features as a perceptual similarity metric enable image translation with large domain gaps and our findings further show that convolutional neural networks trained on image classification tasks implicitly learn to encode shape information. Code is available under this https URL

23. PS-RCNN: Detecting Secondary Human Instances in a Crowd via Primary Object Suppression [PDF] Back to Contents
  Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, Osamu Yoshie
Abstract: Detecting human bodies in highly crowded scenes is a challenging problem. Two main reasons lead to this problem: 1) weak visual cues of heavily occluded instances can hardly provide sufficient information for accurate detection; 2) heavily occluded instances are more easily suppressed by Non-Maximum Suppression (NMS). To address these two issues, we introduce a variant of two-stage detectors called PS-RCNN. PS-RCNN first detects slightly occluded or non-occluded objects with an R-CNN module (referred to as P-RCNN), and then suppresses the detected instances with human-shaped masks so that the features of heavily occluded instances can stand out. After that, PS-RCNN utilizes another R-CNN module specialized in heavily occluded human detection (referred to as S-RCNN) to detect the remaining objects missed by P-RCNN. The final results are the ensemble of the outputs from these two R-CNNs. Moreover, we introduce a High Resolution RoI Align (HRRA) module to retain as much of the fine-grained features of the visible parts of heavily occluded humans as possible. Our PS-RCNN significantly improves recall and AP by 4.49% and 2.92% respectively on CrowdHuman, compared to the baseline. Similar improvements on Widerperson are also achieved by the PS-RCNN.

24. LT-Net: Label Transfer by Learning Reversible Voxel-wise Correspondence for One-shot Medical Image Segmentation [PDF] Back to Contents
  Shuxin Wang, Shilei Cao, Dong Wei, Renzhen Wang, Kai Ma, Liansheng Wang, Deyu Meng, Yefeng Zheng
Abstract: We introduce a one-shot segmentation method to alleviate the burden of manual annotation for medical images. The main idea is to treat one-shot segmentation as a classical atlas-based segmentation problem, where voxel-wise correspondence from the atlas to the unlabelled data is learned. Subsequently, segmentation label of the atlas can be transferred to the unlabelled data with the learned correspondence. However, since ground truth correspondence between images is usually unavailable, the learning system must be well-supervised to avoid mode collapse and convergence failure. To overcome this difficulty, we resort to the forward-backward consistency, which is widely used in correspondence problems, and additionally learn the backward correspondences from the warped atlases back to the original atlas. This cycle-correspondence learning design enables a variety of extra, cycle-consistency-based supervision signals to make the training process stable, while also boost the performance. We demonstrate the superiority of our method over both deep learning-based one-shot segmentation methods and a classical multi-atlas segmentation method via thorough experiments.

25. Adapting Object Detectors with Conditional Domain Normalization [PDF] Back to Contents
  Peng Su, Kun Wang, Xingyu Zeng, Shixiang Tang, Dapeng Chen, Di Qiu, Xiaogang Wang
Abstract: Real-world object detectors are often challenged by the domain gaps between different datasets. In this work, we present the Conditional Domain Normalization (CDN) to bridge the domain gap. CDN is designed to encode different domain inputs into a shared latent space, where the features from different domains carry the same domain attribute. To achieve this, we first disentangle the domain-specific attribute out of the semantic features from one domain via a domain embedding module, which learns a domain-vector to characterize the corresponding domain attribute information. Then this domain-vector is used to encode the features from another domain through a conditional normalization, resulting in different domains' features carrying the same domain attribute. We incorporate CDN into various convolution stages of an object detector to adaptively address the domain shifts of different level's representation. In contrast to existing adaptation works that conduct domain confusion learning on semantic features to remove domain-specific factors, CDN aligns different domain distributions by modulating the semantic features of one domain conditioned on the learned domain-vector of another domain. Extensive experiments show that CDN outperforms existing methods remarkably on both real-to-real and synthetic-to-real adaptation benchmarks, including 2D image detection and 3D point cloud detection.
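
The core mechanism, re-modulating instance-normalized features with affine parameters generated from a learned domain embedding, can be sketched as below. The layer sizes and the embedding interface are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ConditionalDomainNorm(nn.Module):
    """Instance-normalize features, then scale/shift them with gamma and
    beta predicted from a learned domain embedding, so features from
    different domains are driven to carry the same domain attribute."""
    def __init__(self, channels, num_domains, embed_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.domain_embed = nn.Embedding(num_domains, embed_dim)
        self.to_affine = nn.Linear(embed_dim, 2 * channels)

    def forward(self, x, domain_id):
        e = self.domain_embed(domain_id)              # (B, embed_dim)
        gamma, beta = self.to_affine(e).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return (1 + gamma) * self.norm(x) + beta

y = ConditionalDomainNorm(32, num_domains=2)(
    torch.randn(4, 32, 16, 16), torch.tensor([0, 0, 1, 1]))
```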

26. Self-Supervised Discovering of Causal Features: Towards Interpretable Reinforcement Learning [PDF] Back to Contents
  Wenjie Shi, Zhuoyuan Wang, Shiji Song, Gao Huang
Abstract: Deep reinforcement learning (RL) has recently led to many breakthroughs on a range of complex control tasks. However, the agent's decision-making process is generally not transparent. The lack of interpretability hinders the applicability of RL in safety-critical scenarios. While several methods have attempted to interpret vision-based RL, most come without detailed explanation for the agent's behaviour. In this paper, we propose a self-supervised interpretable framework, which can discover causal features to enable easy interpretation of RL agents even for non-experts. Specifically, a self-supervised interpretable network (SSINet) is employed to produce fine-grained attention masks for highlighting task-relevant information, which constitutes most evidence for the agent's decisions. We verify and evaluate our method on several Atari 2600 games as well as Duckietown, which is a challenging self-driving car simulator environment. The results show that our method renders causal explanations and empirical evidences about how the agent makes decisions and why the agent performs well or badly, especially when transferred to novel scenes. Overall, our method provides valuable insight into the internal decision-making process of vision-based RL. In addition, our method does not use any external labelled data, and thus demonstrates the possibility to learn high-quality mask through a self-supervised manner, which may shed light on new paradigms for label-free vision learning such as self-supervised segmentation and detection.

27. On Translation Invariance in CNNs: Convolutional Layers can Exploit Absolute Spatial Location [PDF] Back to Contents
  Osman Semih Kayhan, Jan C. van Gemert
Abstract: In this paper we challenge the common assumption that convolutional layers in modern CNNs are translation invariant. We show that CNNs can and will exploit the absolute spatial location by learning filters that respond exclusively to particular absolute locations by exploiting image boundary effects. Because modern CNNs filters have a huge receptive field, these boundary effects operate even far from the image boundary, allowing the network to exploit absolute spatial location all over the image. We give a simple solution to remove spatial location encoding which improves translation invariance and thus gives a stronger visual inductive bias which particularly benefits small data sets. We broadly demonstrate these benefits on several architectures and various applications such as image classification, patch matching, and two video classification datasets.
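
A quick way to see the boundary effect: a zero-padded convolution responds differently to the same pattern placed at a corner versus the interior, since part of the kernel support falls into the padding, so even a globally pooled feature can encode absolute position. A toy demonstration (shapes and values are arbitrary, not the paper's experiment):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

corner = torch.zeros(1, 1, 8, 8); corner[..., 0, 0] = 1.0  # pattern at border
center = torch.zeros(1, 1, 8, 8); center[..., 4, 4] = 1.0  # same pattern inside

# With zero padding, the pooled responses differ by position, so the
# network can distinguish identical content at different locations.
print(conv(corner).sum().item(), conv(center).sum().item())
```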

28. Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [PDF] Back to Contents
  Yijun Song, Jingwen Wang, Lin Ma, Zhou Yu, Jun Yu
Abstract: The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query. Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios. In this work we present an effective weakly-supervised model, named as Multi-Level Attentional Reconstruction Network (MARN), which only relies on video-sentence pairs during the training stage. The proposed method leverages the idea of attentional reconstruction and directly scores the candidate segments with the learnt proposal-level attentions. Moreover, another branch learning clip-level attention is exploited to refine the proposals at both the training and testing stage. We develop a novel proposal sampling mechanism to leverage intra-proposal information for learning better proposal representation and adopt 2D convolution to exploit inter-proposal clues for learning reliable attention map. Experiments on Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our MARN over the existing weakly-supervised methods.

29. Gated Texture CNN for Efficient and Configurable Image Denoising [PDF] 返回目录
  Kaito Imai, Takamichi Miyata
Abstract: Convolutional neural network (CNN)-based image denoising methods typically estimate the noise component contained in a noisy input image and restore a clean image by subtracting the estimated noise from the input. However, previous denoising methods tend to remove high-frequency information (e.g., textures) from the input. This is caused by the intermediate feature maps of the CNN containing texture information. A straightforward approach to this problem is stacking numerous layers, which leads to a high computational cost. To achieve high performance and computational efficiency, we propose a gated texture CNN (GTCNN), which is designed to carefully exclude the texture information from each intermediate feature map of the CNN by incorporating gating mechanisms. Our GTCNN achieves state-of-the-art performance with 4.8 times fewer parameters than previous state-of-the-art methods. Furthermore, the GTCNN allows us to interactively control the texture strength in the output image without any additional modules, training, or computational costs.
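
The gating mechanism itself is simple. A minimal sketch (assuming PyTorch; GatedTextureBlock is a hypothetical name, not the paper's exact module): a learned sigmoid gate softly excludes texture responses from an intermediate feature map.

    import torch
    import torch.nn as nn

    class GatedTextureBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Conv2d(channels, channels, 3, padding=1)
            self.gate = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, feat):
            g = torch.sigmoid(self.gate(feat))  # per-pixel, per-channel gate in (0, 1)
            return self.body(feat) * g          # texture suppressed where g -> 0

    x = torch.randn(1, 64, 32, 32)
    print(GatedTextureBlock(64)(x).shape)       # torch.Size([1, 64, 32, 32])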

30. Extended Feature Pyramid Network for Small Object Detection [PDF] 返回目录
  Chunfang Deng, Mengmeng Wang, Liang Liu, Yong Liu
Abstract: Small object detection remains an unsolved challenge because it is hard to extract information of small objects with only a few pixels. While scale-level corresponding detection in feature pyramid networks alleviates this problem, we find that feature coupling across various scales still impairs the performance of small objects. In this paper, we propose extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we design a foreground-background-balanced loss function to alleviate area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient on both computation and memory, and yields state-of-the-art results on the small traffic-sign dataset Tsinghua-Tencent 100K and the small category of the general object detection dataset MS COCO.

31. Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution [PDF] 返回目录
  Yong Guo, Jian Chen, Jingdong Wang, Qi Chen, Jiezhang Cao, Zeshuai Deng, Yanwu Xu, Mingkui Tan
Abstract: Deep neural networks have exhibited promising performance in image super-resolution (SR) by learning a nonlinear mapping function from low-resolution (LR) images to high-resolution (HR) images. However, there are two underlying limitations to existing SR methods. First, learning the mapping function from LR to HR images is typically an ill-posed problem, because there exist infinite HR images that can be downsampled to the same LR image. As a result, the space of the possible functions can be extremely large, which makes it hard to find a good solution. Second, the paired LR-HR data may be unavailable in real-world applications and the underlying degradation method is often unknown. For such a more general case, existing SR models often incur the adaptation problem and yield poor performance. To address the above issues, we propose a dual regression scheme by introducing an additional constraint on LR data to reduce the space of the possible functions. Specifically, besides the mapping from LR to HR images, we learn an additional dual regression mapping that estimates the down-sampling kernel and reconstructs LR images, which forms a closed loop to provide additional supervision. More critically, since the dual regression process does not depend on HR images, we can directly learn from LR images. In this sense, we can easily adapt SR models to real-world data, e.g., raw video frames from YouTube. Extensive experiments with paired training data and unpaired real-world data demonstrate our superiority over existing methods.
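
The closed-loop objective is compact enough to sketch. Below is a minimal illustration (assuming PyTorch; the L1 losses, the weighting and the network stubs are assumptions, not the paper's exact training code). Note that the dual term needs no HR ground truth, which is what allows adaptation to unpaired real-world LR data.

    import torch.nn.functional as F

    def dual_regression_loss(primal_net, dual_net, lr, hr=None, lam=0.1):
        sr = primal_net(lr)                        # primal mapping: LR -> HR
        loss = lam * F.l1_loss(dual_net(sr), lr)   # dual mapping: HR -> LR closes the loop
        if hr is not None:                         # paired supervision, when available
            loss = loss + F.l1_loss(sr, hr)
        return loss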

32. AVR: Attention based Salient Visual Relationship Detection [PDF] 返回目录
  Jianming Lv, Qinzhe Xiao, Jiajie Zhong
Abstract: Visual relationship detection aims to locate objects in images and recognize the relationships between objects. Traditional methods treat all observed relationships in an image equally, which causes a relatively poor performance in the detection tasks on complex images with abundant visual objects and various relationships. To address this problem, we propose an attention based model, namely AVR, to achieve salient visual relationships based on both local and global context of the relationships. Specifically, AVR recognizes relationships and measures the attention on the relationships in the local context of an input image by fusing the visual features, semantic and spatial information of the relationships. AVR then applies the attention to assign important relationships with larger salient weights for effective information filtering. Furthermore, AVR is integrated with the priori knowledge in the global context of image datasets to improve the precision of relationship prediction, where the context is modeled as a heterogeneous graph to measure the priori probability of relationships based on the random walk algorithm. Comprehensive experiments are conducted to demonstrate the effectiveness of AVR in several real-world image datasets, and the results show that AVR outperforms state-of-the-art visual relationship detection methods significantly by up to $87.5\%$ in terms of recall.

33. Any-Shot Object Detection [PDF] 返回目录
  Shafin Rahman, Salman Khan, Nick Barnes, Fahad Shahbaz Khan
Abstract: Previous work on novel object detection considers zero or few-shot settings where none or few examples of each category are available for training. In real world scenarios, it is less practical to expect that 'all' the novel classes are either unseen or have few examples. Here, we propose a more realistic setting termed 'Any-shot detection', where totally unseen and few-shot categories can simultaneously co-occur during inference. Any-shot detection offers unique challenges compared to conventional novel object detection, such as a high imbalance between unseen, few-shot and seen object classes, susceptibility to forgetting base training while learning novel classes, and distinguishing novel classes from the background. To address these challenges, we propose a unified any-shot detection model, that can concurrently learn to detect both zero-shot and few-shot object classes. Our core idea is to use class semantics as prototypes for object detection, a formulation that naturally minimizes knowledge forgetting and mitigates the class imbalance in the label space. Besides, we propose a rebalanced loss function that emphasizes difficult few-shot cases but avoids overfitting on the novel classes to allow detection of totally unseen classes. Without bells and whistles, our framework can also be used solely for Zero-shot detection and Few-shot detection tasks. We report extensive experiments on Pascal VOC and MS-COCO datasets where our approach is shown to provide significant improvements.
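
The "class semantics as prototypes" formulation can be sketched as a cosine classifier (assuming PyTorch; the projection of region features into the semantic space is taken as given). Adding a totally unseen class then only requires appending its word vector.

    import torch.nn.functional as F

    def prototype_scores(region_feats, class_semantics):
        # region_feats: (N, D) region embeddings projected into the semantic space
        # class_semantics: (C, D) word vectors, one per seen or unseen class
        f = F.normalize(region_feats, dim=1)
        p = F.normalize(class_semantics, dim=1)
        return f @ p.t()   # (N, C) cosine-similarity logits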

34. ReLaText: Exploiting Visual Relationships for Arbitrary-Shaped Scene Text Detection with Graph Convolutional Networks [PDF] 返回目录
  Chixiang Ma, Lei Sun, Zhuoyao Zhong, Qiang Huo
Abstract: We introduce a new arbitrary-shaped text detection approach named ReLaText by formulating text detection as a visual relationship detection problem. To demonstrate the effectiveness of this new formulation, we start from using a "link" relationship to address the challenging text-line grouping problem firstly. The key idea is to decompose text detection into two subproblems, namely detection of text primitives and prediction of link relationships between nearby text primitive pairs. Specifically, an anchor-free region proposal network based text detector is first used to detect text primitives of different scales from different feature maps of a feature pyramid network, from which a text primitive graph is constructed by linking each pair of nearby text primitives detected from a same feature map with an edge. Then, a Graph Convolutional Network (GCN) based link relationship prediction module is used to prune wrongly-linked edges in the text primitive graph to generate a number of disjoint subgraphs, each representing a detected text instance. As GCN can effectively leverage context information to improve link prediction accuracy, our GCN based text-line grouping approach can achieve better text detection accuracy than previous text-line grouping methods, especially when dealing with text instances with large inter-character or very small inter-line spacings. Consequently, the proposed ReLaText achieves state-of-the-art performance on five public text detection benchmarks, namely RCTW-17, MSRA-TD500, Total-Text, CTW1500 and DAST1500.

35. Multi-Drone based Single Object Tracking with Agent Sharing Network [PDF] 返回目录
  Pengfei Zhu, Jiayu Zheng, Dawei Du, Longyin Wen, Yiming Sun, Qinghua Hu
Abstract: Drones equipped with cameras can dynamically track a target in the air from a broader view compared with static cameras or moving sensors on the ground. However, it is still challenging to accurately track the target using a single drone due to several factors such as appearance variations and severe occlusions. In this paper, we collect a new Multi-Drone single Object Tracking (MDOT) dataset that consists of 92 groups of video clips with 113,918 high resolution frames taken by two drones and 63 groups of video clips with 145,875 high resolution frames taken by three drones. Besides, two evaluation metrics are specially designed for multi-drone single object tracking, i.e. automatic fusion score (AFS) and ideal fusion score (IFS). Moreover, an agent sharing network (ASNet) is proposed by self-supervised template sharing and view-aware fusion of the target from multiple drones, which can improve the tracking accuracy significantly compared with single drone tracking. Extensive experiments on MDOT show that our ASNet significantly outperforms recent state-of-the-art trackers.

36. House-GAN: Relational Generative Adversarial Networks for Graph-constrained House Layout Generation [PDF] 返回目录
  Nelson Nauata, Kai-Hung Chang, Chin-Yi Cheng, Greg Mori, Yasutaka Furukawa
Abstract: This paper proposes a novel graph-constrained generative adversarial network, whose generator and discriminator are built upon relational architecture. The main idea is to encode the constraint into the graph structure of its relational networks. We have demonstrated the proposed architecture for a new house layout generation problem, whose task is to take an architectural constraint as a graph (i.e., the number and types of rooms with their spatial adjacency) and produce a set of axis-aligned bounding boxes of rooms. We measure the quality of generated house layouts with the three metrics: the realism, the diversity, and the compatibility with the input graph constraint. Our qualitative and quantitative evaluations over 117,000 real floorplan images demonstrate that the proposed approach outperforms existing methods and baselines. We will publicly share all our code and data.

37. A CNN-Based Blind Denoising Method for Endoscopic Images [PDF] 返回目录
  Shaofeng Zou, Mingzhu Long, Xuyang Wang, Xiang Xie, Guolin Li, Zhihua Wang
Abstract: The quality of images captured by wireless capsule endoscopy (WCE) is key for doctors to diagnose diseases of the gastrointestinal (GI) tract. However, there exist many low-quality endoscopic images due to the limited illumination and complex environment in the GI tract. After an enhancement process, the severe noise becomes an unacceptable problem. The noise varies with different cameras, GI tract environments and image enhancement, and the noise model is hard to obtain. This paper proposes a convolutional blind denoising network for endoscopic images. We apply the Deep Image Prior (DIP) method to reconstruct a clean image iteratively using a noisy image, without a specific noise model or ground truth. Then we design a blind image quality assessment network based on MobileNet to estimate the quality of the reconstructed images. The estimated quality is used to stop the iterative operation in the DIP method. The number of iterations is reduced by about 36% by using transfer learning in our DIP process. Experimental results on endoscopic images and real-world noisy images demonstrate the superiority of our proposed method over the state-of-the-art methods in terms of visual quality and quantitative metrics.

38. TACO: Trash Annotations in Context for Litter Detection [PDF] 返回目录
  Pedro F Proença, Pedro Simões
Abstract: TACO is an open image dataset for litter detection and segmentation, which is growing through crowdsourcing. Firstly, this paper describes this dataset and the tools developed to support it. Secondly, we report instance segmentation performance using Mask R-CNN on the current version of TACO. Despite its small size (1500 images and 4784 annotations), our results are promising on this challenging problem. However, to achieve satisfactory trash detection in the wild for deployment, TACO still needs much more manual annotations. These can be contributed using: this http URL

39. Vec2Face: Unveil Human Faces from their Blackbox Features in Face Recognition [PDF] 返回目录
  Chi Nhan Duong, Thanh-Dat Truong, Kha Gia Quach, Hung Bui, Kaushik Roy, Khoa Luu
Abstract: Unveiling the face image of a subject given his/her high-level representations extracted from a blackbox Face Recognition engine is extremely challenging. This is because of the limited information accessible from that engine, including its structure and uninterpretable extracted features. This paper presents a novel generative structure with Bijective Metric Learning, namely Bijective Generative Adversarial Networks in a Distillation framework (DiBiGAN), for synthesizing faces of an identity given that person's features. In order to effectively address this problem, this work firstly introduces a bijective metric so that the distance measurement and metric learning process can be directly adopted in the image domain for an image reconstruction task. Secondly, a distillation process is introduced to maximize the information exploited from the blackbox face recognition engine. Then a Feature-Conditional Generator Structure with an Exponential Weighting Strategy is presented for a more robust generator that can synthesize realistic faces with ID preservation. Results on several benchmarking datasets including CelebA, LFW, AgeDB, CFP-FP against matching engines have demonstrated the effectiveness of DiBiGAN on both image realism and ID preservation properties.

40. Frustratingly Simple Few-Shot Object Detection [PDF] 返回目录
  Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, Fisher Yu
Abstract: Detecting rare objects from a few examples is an emerging problem. Prior works show meta-learning is a promising approach. But, fine-tuning techniques have drawn scant attention. We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 2~20 points on current benchmarks and sometimes even doubles the accuracy of the prior methods. However, the high variance in the few samples often leads to the unreliability of existing benchmarks. We revise the evaluation protocols by sampling multiple groups of training examples to obtain stable comparisons and build new benchmarks based on three datasets: PASCAL VOC, COCO and LVIS. Again, our fine-tuning approach establishes a new state of the art on the revised benchmarks. The code as well as the pretrained models are available at this https URL.
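
The recipe is short enough to write out. A minimal sketch (assuming a recent torchvision as a stand-in; the paper's actual codebase may differ): freeze the whole detector and fine-tune only the final box classifier and regressor on the novel classes.

    import torch
    import torchvision

    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    for p in model.parameters():
        p.requires_grad = False                          # freeze backbone, RPN, RoI head ...
    for p in model.roi_heads.box_predictor.parameters():
        p.requires_grad = True                           # ... except the last-layer predictor

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)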

41. Camera Trace Erasing [PDF] 返回目录
  Chang Chen, Zhiwei Xiong, Xiaoming Liu, Feng Wu
Abstract: Camera trace is a unique noise produced in digital imaging process. Most existing forensic methods analyze camera trace to identify image origins. In this paper, we address a new low-level vision problem, camera trace erasing, to reveal the weakness of trace-based forensic methods. A comprehensive investigation on existing anti-forensic methods reveals that it is non-trivial to effectively erase camera trace while avoiding the destruction of content signal. To reconcile these two demands, we propose Siamese Trace Erasing (SiamTE), in which a novel hybrid loss is designed on the basis of Siamese architecture for network training. Specifically, we propose embedded similarity, truncated fidelity, and cross identity to form the hybrid loss. Compared with existing anti-forensic methods, SiamTE has a clear advantage for camera trace erasing, which is demonstrated in three representative tasks. Code and dataset are available at this https URL.

42. Scene Completeness-Aware Lidar Depth Completion for Driving Scenario [PDF] 返回目录
  Cho-Ying Wu, Ulrich Neumann
Abstract: In this paper we propose Scene Completeness-Aware Depth Completion (SADC) to complete raw lidar scans into dense depth maps with fine whole scene structures. Recent sparse depth completion for lidar only focuses on the lower parts of scenes and produces irregular estimations on the upper parts, because existing datasets such as KITTI do not provide groundtruth for upper areas. These areas are considered less important because they are usually sky or trees and of less interest for scene understanding. However, we argue that in several driving scenarios such as large trucks or cars with loads, objects could extend to upper parts of scenes, and thus depth maps with structured upper scene estimation are important for RGBD algorithms. SADC leverages stereo cameras, which have better scene completeness, and lidars, which are more precise, to perform sparse depth completion. To our knowledge, we are the first to focus on scene completeness of sparse depth completion. We validate our SADC on both depth estimation precision and scene completeness on KITTI. Moreover, SADC only adds a small extra computational cost upon base methods of stereo matching and lidar completion in terms of runtime and model size.

43. Hyperspectral-Multispectral Image Fusion with Weighted LASSO [PDF] 返回目录
  Nguyen Tran, Rupali Mankar, David Mayerich, Zhu Han
Abstract: Spectral imaging enables spatially-resolved identification of materials in remote sensing, biomedicine, and astronomy. However, acquisition times require balancing spectral and spatial resolution with signal-to-noise. Hyperspectral imaging provides superior material specificity, while multispectral images are faster to collect at greater fidelity. We propose an approach for fusing hyperspectral and multispectral images to provide high-quality hyperspectral output. The proposed optimization leverages the least absolute shrinkage and selection operator (LASSO) to perform variable selection and regularization. Computational time is reduced by applying the alternating direction method of multipliers (ADMM), as well as initializing the fusion image by estimating it using maximum a posteriori (MAP) based on Hardie's method. We demonstrate that the proposed sparse fusion and reconstruction provides quantitatively superior results when compared to existing methods on publicly available images. Finally, we show how the proposed method can be practically applied in biomedical infrared spectroscopic microscopy.
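
The sparsity-inducing core of a weighted LASSO is the weighted soft-thresholding operator, which is the proximal step that ADMM applies in each iteration. A minimal NumPy sketch (the weight vector w and threshold lam are illustrative values):

    import numpy as np

    def soft_threshold(x, w, lam):
        # elementwise prox of lam * sum_i w_i * |x_i|
        return np.sign(x) * np.maximum(np.abs(x) - lam * w, 0.0)

    x = np.array([0.9, -0.2, 0.05])
    print(soft_threshold(x, w=np.ones(3), lam=0.1))   # [ 0.8 -0.1  0. ]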

44. Self-Constructing Graph Convolutional Networks for Semantic Labeling [PDF] 返回目录
  Qinghui Liu, Michael Kampffmeyer, Robert Jenssen, Arnt-Børre Salberg
Abstract: Graph Neural Networks (GNNs) have received increasing attention in many fields. However, due to the lack of prior graphs, their use for semantic labeling has been limited. Here, we propose a novel architecture called the Self-Constructing Graph (SCG), which makes use of learnable latent variables to generate embeddings and to self-construct the underlying graphs directly from the input features without relying on manually built prior knowledge graphs. SCG can automatically obtain optimized non-local context graphs from complex-shaped objects in aerial imagery. We optimize SCG via an adaptive diagonal enhancement method and a variational lower bound that consists of a customized graph reconstruction term and a Kullback-Leibler divergence regularization term. We demonstrate the effectiveness and flexibility of the proposed SCG on the publicly available ISPRS Vaihingen dataset and our model SCG-Net achieves competitive results in terms of F1-score with much fewer parameters and at a lower computational cost compared to related pure-CNN based work. Our code will be made public soon.
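
The self-construction step can be sketched in a few lines (assuming PyTorch; the linear projection and row normalization are simplifications of the paper's variational formulation): node embeddings are predicted from the input features and the adjacency comes from their inner products, so no prior graph is needed.

    import torch

    def self_construct_graph(node_feats, proj):
        z = proj(node_feats)         # (N, d) learned latent node embeddings
        adj = torch.relu(z @ z.t())  # (N, N) non-negative self-constructed adjacency
        return adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize

    proj = torch.nn.Linear(64, 16)
    print(self_construct_graph(torch.randn(100, 64), proj).shape)  # torch.Size([100, 100])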

45. Evaluation of Rounding Functions in Nearest-Neighbor Interpolation [PDF] 返回目录
  Olivier Rukundo
Abstract: This paper evaluates three rounding functions for nearest neighbor (NN) image interpolation. The evaluated rounding functions are selected among the five rounding rules defined by the IEEE 754-2008 standard. Both full- and no-reference image quality assessment (IQA) metrics evaluate interpolation image quality objectively to extract the number of achieved occurrences over targeted occurrences. The targeted occurrence indicates the optimally achievable number, which is directly proportional to the number of sample images, IQA metrics, and scaling ratios. The concept of inferential statistical analysis is applied to deduce, from a small number of images, the behavior of each rounding function on a bigger number of images. Considering a number of images bigger than five, the inferential analysis demonstrated that, at a 95% confidence level, the ceil function could achieve 83.75% of targeted occurrences with an 8 to 11% margin of error, while the floor and round functions could only achieve 22.5% and 32.5% of targeted occurrences, respectively, with the same margin of error.
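
The three evaluated variants differ only in the rounding rule that maps an output coordinate back to a source pixel. A minimal NumPy sketch of nearest-neighbor upscaling with a swappable rounding function (the clamping to the image border is an implementation detail, not from the paper):

    import numpy as np

    def nn_interpolate(img, scale, rounding=np.floor):
        h, w = img.shape
        out = np.empty((int(h * scale), int(w * scale)), dtype=img.dtype)
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                src_i = min(int(rounding(i / scale)), h - 1)
                src_j = min(int(rounding(j / scale)), w - 1)
                out[i, j] = img[src_i, src_j]
        return out

    img = np.arange(9).reshape(3, 3)
    print(nn_interpolate(img, 2, rounding=np.ceil))   # compare with np.floor / np.round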

46. Night-time Semantic Segmentation with a Large Real Dataset [PDF] 返回目录
  Xin Tan, Yiheng Zhang, Ying Cao, Lizhuang Ma, Rynson W.H. Lau
Abstract: Although huge progress has been made on semantic segmentation in recent years, most existing works assume that the input images are captured in day-time with good lighting conditions. In this work, we aim to address the semantic segmentation problem of night-time scenes, which has two main challenges: 1) labeled night-time data are scarce, and 2) over- and under-exposures may co-occur in the input night-time images and are not explicitly modeled in existing semantic segmentation pipelines. To tackle the scarcity of night-time data, we collect a novel labeled dataset (named NightCity) of 4,297 real night-time images with ground truth pixel-level semantic annotations. To our knowledge, NightCity is the largest dataset for night-time semantic segmentation. In addition, we also propose an exposure-aware framework to address the night-time segmentation problem through augmenting the segmentation process with explicitly learned exposure features. Extensive experiments show that training on NightCity can significantly improve the performance of night-time semantic segmentation and that our exposure-aware model outperforms the state-of-the-art segmentation methods, yielding top performances on our benchmark dataset.

47. Guidance and Evaluation: Semantic-Aware Image Inpainting for Mixed Scenes [PDF] 返回目录
  Liang Liao, Jing Xiao, Zheng Wang, Chia-wen Lin, Shin'ichi Satoh
Abstract: Completing a corrupted image with correct structures and reasonable textures for a mixed scene remains an elusive challenge. Since the missing hole in a mixed scene of a corrupted image often contains various semantic information, conventional two-stage approaches utilizing structural information often lead to the problem of unreliable structural prediction and ambiguous image texture generation. In this paper, we propose a Semantic Guidance and Evaluation Network (SGE-Net) to iteratively update the structural priors and the inpainted image in an interplay framework of semantics extraction and image inpainting. It utilizes semantic segmentation map as guidance in each scale of inpainting, under which location-dependent inferences are re-evaluated, and, accordingly, poorly-inferred regions are refined in subsequent scales. Extensive experiments on real-world images of mixed scenes demonstrated the superiority of our proposed method over state-of-the-art approaches, in terms of clear boundaries and photo-realistic textures.

48. Multistage Curvilinear Coordinate Transform Based Document Image Dewarping using a Novel Quality Estimator [PDF] 返回目录
  Tanmoy Dasgupta, Nibaran Das, Mita Nasipuri
Abstract: The present work demonstrates a fast and improved technique for dewarping nonlinearly warped document images. The images are first dewarped at the page level by estimating optimum inverse projections using curvilinear homography. The quality of the process is then estimated by evaluating a set of metrics related to the characteristics of the text lines and rectilinear objects, measuring parallelism, orthogonality, etc. These are designed specifically to estimate the quality of the dewarping process without the need of any ground truth. If the quality is estimated to be unsatisfactory, the page-level dewarping process is repeated with finer approximations. This is followed by a line-level dewarping process that makes granular corrections to the warps in individual text lines. The methodology has been tested on the CBDAR 2007 / IUPR 2011 document image dewarping dataset and is seen to yield the best OCR accuracy in the shortest amount of time to date. The usefulness of the methodology has also been evaluated on the DocUNet 2018 dataset with some minor tweaks, and is seen to produce comparable results.

49. Deep Affinity Net: Instance Segmentation via Affinity [PDF] 返回目录
  Xingqian Xu, Mang Tik Chiu, Thomas S. Huang, Honghui Shi
Abstract: Most of the modern instance segmentation approaches fall into two categories: region-based approaches in which object bounding boxes are detected first and later used in cropping and segmenting instances; and keypoint-based approaches in which individual instances are represented by a set of keypoints followed by a dense pixel clustering around those keypoints. Despite the maturity of these two paradigms, we would like to report an alternative affinity-based paradigm where instances are segmented based on densely predicted affinities and graph partitioning algorithms. Such affinity-based approaches indicate that high-level graph features other than regions or keypoints can be directly applied in the instance segmentation task. In this work, we propose Deep Affinity Net, an effective affinity-based approach accompanied with a new graph partitioning algorithm Cascade-GAEC. Without bells and whistles, our end-to-end model results in 32.4% AP on Cityscapes val and 27.5% AP on test. It achieves the best single-shot result as well as the fastest running time among all affinity-based models. It also outperforms the region-based method Mask R-CNN.

50. SF-Net: Single-Frame Supervision for Temporal Action Localization [PDF] 返回目录
  Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, Zheng Shou
Abstract: In this paper, we study an intermediate form of supervision, i.e., single-frame supervision, for temporal action localization (TAL). To obtain the single-frame supervision, the annotators are asked to identify only a single frame within the temporal window of an action. This can significantly reduce the labor cost of obtaining full supervision which requires annotating the action boundary. Compared to the weak supervision that only annotates the video-level label, the single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead. To make full use of such single-frame supervision, we propose a unified system called SF-Net. First, we propose to predict an actionness score for each video frame. Along with a typical category score, the actionness score can provide comprehensive information about the occurrence of a potential action and aid the temporal boundary refinement during inference. Second, we mine pseudo action and background frames based on the single-frame annotations. We identify pseudo action frames by adaptively expanding each annotated single frame to its nearby, contextual frames and we mine pseudo background frames from all the unannotated frames across multiple videos. Together with the ground-truth labeled frames, these pseudo-labeled frames are further used for training the classifier.
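
The frame-mining step can be approximated with a simple expansion rule. A minimal NumPy sketch (the threshold-based growing below is a simplification of SF-Net's adaptive expansion, not the paper's exact procedure): each annotated single frame is grown into neighboring frames while the predicted actionness stays high, and the result serves as pseudo ground truth.

    import numpy as np

    def mine_pseudo_action_frames(actionness, annotated, thresh=0.5):
        # actionness: (T,) per-frame scores; annotated: indices of labeled frames
        labels = np.zeros(len(actionness), dtype=bool)
        for t in annotated:
            lo = t
            while lo > 0 and actionness[lo - 1] > thresh:
                lo -= 1
            hi = t
            while hi < len(actionness) - 1 and actionness[hi + 1] > thresh:
                hi += 1
            labels[lo:hi + 1] = True
        return labels

    scores = np.array([0.1, 0.7, 0.9, 0.8, 0.2, 0.1])
    print(mine_pseudo_action_frames(scores, annotated=[2]))  # frames 1..3 marked True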

51. 3D-CariGAN: An End-to-End Solution to 3D Caricature Generation from Face Photos [PDF] 返回目录
  Zipeng Ye, Ran Yi, Minjing Yu, Juyong Zhang, Yu-Kun Lai, Yong-jin Liu
Abstract: Caricature is a kind of artistic style of human faces that attracts considerable research in computer vision. So far all existing 3D caricature generation methods require some information related to caricature as input, e.g., a caricature sketch or 2D caricature. However, this kind of input is difficult for non-professional users to provide. In this paper, we propose an end-to-end deep neural network model to generate high-quality 3D caricature with a simple face photo as input. The most challenging issue in our system is that the source domain of face photos (characterized by 2D normal faces) is significantly different from the target domain of 3D caricatures (characterized by 3D exaggerated face shapes and texture). To address this challenge, we (1) build a large dataset of 6,100 3D caricature meshes and use it to establish a PCA model in the 3D caricature shape space and (2) detect landmarks in the input face photo and use them to set up correspondence between 2D caricature and 3D caricature shape. Our system can automatically generate high-quality 3D caricatures. In many situations, users want to control the output by a simple and intuitive way, so we further introduce a simple-to-use interactive control with three horizontal and one vertical lines. Experiments and user studies show that our system is easy to use and can generate high-quality 3D caricatures.

52. Energy-based Periodicity Mining with Deep Features for Action Repetition Counting in Unconstrained Videos [PDF] 返回目录
  Jianqin Yin, Yanchun Wu, Huaping Liu, Yonghao Dang, Zhiyi Liu, Jun Liu
Abstract: Action repetition counting is to estimate the occurrence times of the repetitive motion in one action, which is a relatively new, important but challenging measurement problem. To solve this problem, we propose a new method that is superior to the traditional ways in two aspects: it requires no preprocessing and is applicable to actions of arbitrary periodicity. Requiring no preprocessing makes our method convenient for real applications; handling arbitrary periodicity makes our model more suitable for actual circumstances. In terms of methodology, firstly, we analyze the movement patterns of the repetitive actions based on the spatial and temporal features of actions extracted by deep ConvNets; secondly, the Principal Component Analysis algorithm is used to generate the intuitive periodic information from the chaotic high-dimensional deep features; thirdly, the periodicity is mined based on the high-energy rule using the Fourier transform; finally, the inverse Fourier transform with a multi-stage threshold filter is proposed to improve the quality of the mined periodicity, and peak detection is introduced to finish the repetition counting. Our work features are two-fold: 1) an important insight that deep features extracted for action recognition can well model the self-similarity periodicity of the repetitive action; 2) a high-energy based periodicity mining rule using deep features, which can process arbitrary actions without preprocessing. Experimental results show that our method achieves comparable results on the public datasets YT Segments and QUVA.
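
The PCA-plus-Fourier pipeline is straightforward to sketch in NumPy (the peak picking below is simplified to taking the single highest-energy frequency bin): per-frame deep features are projected onto their first principal component, and the dominant frequency of that 1-D signal gives the repetition count.

    import numpy as np

    def count_repetitions(feats):
        # feats: (T, D) per-frame deep features from a ConvNet
        x = feats - feats.mean(axis=0)
        pc = np.linalg.svd(x, full_matrices=False)[2][0]   # first principal direction
        signal = x @ pc                                    # (T,) periodic 1-D signal
        spectrum = np.abs(np.fft.rfft(signal))
        spectrum[0] = 0.0                                  # drop the DC term
        return int(np.argmax(spectrum))                    # cycles over the T frames

    T, D = 200, 512
    feats = np.outer(np.sin(2 * np.pi * 5 * np.arange(T) / T), np.ones(D))
    print(count_repetitions(feats))                        # -> 5 repetitions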

53. FGSD: A Dataset for Fine-Grained Ship Detection in High Resolution Satellite Images [PDF] 返回目录
  Kaiyan Chen, Ming Wu, Jiaming Liu, Chuang Zhang
Abstract: Ship detection using high-resolution remote sensing images is an important task, which contributes to sea surface regulation. The complex background and special visual angle make ship detection rely on high-quality datasets to a certain extent. However, there are few works giving both precise classification and accurate location of ships in existing ship detection datasets. To further promote the research of ship detection, we introduce a new fine-grained ship detection dataset, named FGSD. The dataset collects high-resolution remote sensing images containing ship samples from multiple large ports around the world. Ship samples were finely categorized and annotated with both horizontal and rotating bounding boxes. To further detail the information of the dataset, we put forward a new representation method for ships' orientation. For future research, the dock, as a new class, is annotated in the dataset. Besides, rich information about the images is provided in FGSD, including the source port, resolution and the corresponding GoogleEarth resolution level of each image. As far as we know, FGSD is the most comprehensive ship detection dataset currently and it will be available soon. Some baselines for FGSD are also provided in this paper.

54. VCNet: A Robust Approach to Blind Image Inpainting [PDF] 返回目录
  Yi Wang, Ying-Cong Chen, Xin Tao, Jiaya Jia
Abstract: Blind inpainting is a task to automatically complete visual contents without specifying masks for missing areas in an image. Previous works assume missing region patterns are known, limiting its application scope. In this paper, we relax the assumption by defining a new blind inpainting setting, making training a blind inpainting neural system robust against various unknown missing region patterns. Specifically, we propose a two-stage visual consistency network (VCN), meant to estimate where to fill (via masks) and generate what to fill. In this procedure, the unavoidable potential mask prediction errors lead to severe artifacts in the subsequent repairing. To address it, our VCN predicts semantically inconsistent regions first, making mask prediction more tractable. Then it repairs these estimated missing regions using a new spatial normalization, enabling VCN to be robust to the mask prediction errors. In this way, semantically convincing and visually compelling content is thus generated. Extensive experiments are conducted, showing our method is effective and robust in blind image inpainting. And our VCN allows for a wide spectrum of applications.

55. OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features [PDF] 返回目录
  Anton Osokin, Denis Sumin, Vasily Lomakin
Abstract: In this paper, we consider the task of one-shot object detection, which consists in detecting objects defined by a single demonstration. Differently from the standard object detection, the classes of objects used for training and testing do not overlap. We build a one-stage system that performs localization and recognition jointly. We use dense correlation matching of learned local features to find correspondences, a feed-forward geometric transformation model to align features and bilinear resampling of the correlation tensor to compute the detection score of the aligned features. All the components are differentiable, which allows end-to-end training. Experimental evaluation on several challenging domains (retail products, 3D objects, buildings and logos) shows that our method can detect unseen classes (e.g., toothpaste when trained on groceries) and outperforms several baselines by a significant margin. Our code is available online: this https URL .
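
The dense correlation step is the heart of the matching. A minimal sketch (assuming PyTorch; feature extraction and the subsequent alignment stage are omitted): every local feature of the class exemplar is correlated against every location of the input feature map.

    import torch
    import torch.nn.functional as F

    def dense_correlation(input_feat, class_feat):
        # input_feat: (C, H, W) features of the search image
        # class_feat: (C, h, w) features of the single class exemplar
        C, H, W = input_feat.shape
        q = F.normalize(class_feat.reshape(C, -1), dim=0)  # (C, h*w)
        k = F.normalize(input_feat.reshape(C, -1), dim=0)  # (C, H*W)
        corr = q.t() @ k                                   # all-pairs cosine similarity
        return corr.reshape(class_feat.shape[1], class_feat.shape[2], H, W)

    corr = dense_correlation(torch.randn(64, 32, 32), torch.randn(64, 8, 8))
    print(corr.shape)   # torch.Size([8, 8, 32, 32])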

56. StarNet: towards weakly supervised few-shot detection and explainable few-shot classification [PDF] 返回目录
  Leonid Karlinsky, Joseph Shtok, Amit Alfassy, Moshe Lichtenstein, Sivan Harary, Eli Schwartz, Sivan Doveh, Prasanna Sattigeri, Rogerio Feris, Alexander Bronstein, Raja Giryes
Abstract: In this paper, we propose a new few-shot learning method called StarNet, which is an end-to-end trainable non-parametric star-model few-shot classifier. While being meta-trained using only image-level class labels, StarNet learns not only to predict the class labels for each query image of a few-shot task, but also to localize (via a heatmap) what it believes to be the key image regions supporting its prediction, thus effectively detecting the instances of the novel categories. The localization is enabled by the StarNet's ability to find large, arbitrarily shaped, semantically matching regions between all pairs of support and query images of a few-shot task. We evaluate StarNet on multiple few-shot classification benchmarks attaining significant state-of-the-art improvement on the CUB and ImageNetLOC-FS, and smaller improvements on other benchmarks. At the same time, in many cases, StarNet provides plausible explanations for its class label predictions, by highlighting the correctly paired novel category instances on the query and on its best matching support (for the predicted class). In addition, we test the proposed approach on the previously unexplored and challenging task of Weakly Supervised Few-Shot Object Detection (WS-FSOD), obtaining significant improvements over the baselines.

57. Performance Evaluation of Advanced Deep Learning Architectures for Offline Handwritten Character Recognition [PDF] 返回目录
  Moazam Soomro, Muhammad Ali Farooq, Rana Hammad Raza
Abstract: This paper presents a hand-written character recognition comparison and performance evaluation for robust and precise classification of different hand-written characters. The system utilizes an advanced multilayer deep neural network by collecting features from raw pixel values. The hidden layers stack deep hierarchies of non-linear features, since learning complex features from conventional neural networks is very challenging. Two state-of-the-art deep learning architectures were used, which include the Caffe AlexNet and GoogleNet models in NVIDIA DIGITS. The frameworks were trained and tested on two different datasets to incorporate diversity and complexity. One of them is the publicly available dataset Chars74K, comprising 7,705 characters with upper and lowercase English alphabets, along with numerical digits. The other dataset, created locally, consists of 4,320 characters. The local dataset consists of 62 classes and was created by 40 subjects. It also contains upper and lowercase English alphabets, along with numerical digits. The overall dataset is divided in the ratio of 80% for the training and 20% for the testing phase. The time required for the training phase is approximately 90 minutes. For the validation part, the results obtained were compared with the groundtruth. The accuracy level achieved with AlexNet was 77.77% and 88.89% with GoogleNet. The higher accuracy level of GoogleNet is due to its unique combination of inception modules, each including pooling, convolutions at various scales and concatenation procedures.

58. Learning Enriched Features for Real Image Restoration and Enhancement [PDF] 返回目录
  Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao
Abstract: With the goal of recovering high-quality image content from its degraded version, image restoration enjoys numerous applications, such as in surveillance, computational photography, medical imaging, and remote sensing. Recently, convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration task. Existing CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatially precise but contextually less robust results are achieved, while in the latter case, semantically reliable but spatially less accurate outputs are generated. In this paper, we present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network, and receiving strong contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing several key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) spatial and channel attention mechanisms for capturing contextual information, and (d) attention based multi-scale feature aggregation. In a nutshell, our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on five real image benchmark datasets demonstrate that our method, named as MIRNet, achieves state-of-the-art results for a variety of image processing tasks, including image denoising, super-resolution and image enhancement.

59. GMM-UNIT: Unsupervised Multi-Domain and Multi-Modal Image-to-Image Translation via Attribute Gaussian Mixture Modeling [PDF] 返回目录
  Yahui Liu, Marco De Nadai, Jian Yao, Nicu Sebe, Bruno Lepri, Xavier Alameda-Pineda
Abstract: Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images. Recent studies have shown remarkable success for multiple domains but they suffer from two main limitations: they are either built from several two-domain mappings that are required to be learned independently, or they generate low-diversity results, a problem known as model collapse. To overcome these limitations, we propose a method named GMM-UNIT, which is based on a content-attribute disentangled representation where the attribute space is fitted with a GMM. Each GMM component represents a domain, and this simple assumption has two prominent advantages. First, it can be easily extended to most multi-domain and multi-modal image-to-image translation tasks. Second, the continuous domain encoding allows for interpolation between domains and for extrapolation to unseen domains and translations. Additionally, we show how GMM-UNIT can be constrained down to different methods in the literature, meaning that GMM-UNIT is a unifying framework for unsupervised image-to-image translation.
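
The attribute model is easy to sketch: each domain corresponds to one Gaussian component, so translating into domain k means sampling the attribute code from component k. A minimal NumPy illustration with diagonal covariances (the decoder and content encoder are omitted, and the parameters here are placeholders):

    import numpy as np

    def sample_attribute(means, stds, k):
        # means, stds: (K, d) per-domain parameters of the attribute GMM
        return means[k] + stds[k] * np.random.randn(means.shape[1])

    means, stds = np.random.randn(3, 8), np.ones((3, 8))
    z_attr = sample_attribute(means, stds, k=1)   # attribute code for domain 1
    z_interp = 0.5 * means[0] + 0.5 * means[2]    # interpolation between two domains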

60. Self-trained Deep Ordinal Regression for End-to-End Video Anomaly Detection [PDF] 返回目录
  Guansong Pang, Cheng Yan, Chunhua Shen, Anton van den Hengel, Xiao Bai
Abstract: Video anomaly detection is of critical practical importance to a variety of real applications because it allows human attention to be focused on events that are likely to be of interest, in spite of an otherwise overwhelming volume of video. We show that applying self-trained deep ordinal regression to video anomaly detection overcomes two key limitations of existing methods, namely, 1) being highly dependent on manually labeled normal training data; and 2) sub-optimal feature learning. By formulating a surrogate two-class ordinal regression task we devise an end-to-end trainable video anomaly detection approach that enables joint representation learning and anomaly scoring without manually labeled normal/abnormal data. Experiments on eight real-world video scenes show that our proposed method outperforms state-of-the-art methods that require no labeled training data by a substantial margin, and enables easy and accurate localization of the identified anomalies. Furthermore, we demonstrate that our method offers effective human-in-the-loop anomaly detection which can be critical in applications where anomalies are rare and the false-negative cost is high.

61. DeepEMD: Few-Shot Image Classification with Differentiable Earth Mover's Distance and Structured Classifiers [PDF] 返回目录
  Chi Zhang, Yujun Cai, Guosheng Lin, Chunhua Shen
Abstract: In this paper, we address the few-shot classification task from a new perspective of optimal matching between image regions. We adopt the Earth Mover's Distance (EMD) as a metric to compute a structural distance between dense image representations to determine image relevance. The EMD generates the optimal matching flows between structural elements that have the minimum matching cost, which is used to represent the image distance for classification. To generate the important weights of elements in the EMD formulation, we design a cross-reference mechanism, which can effectively minimize the impact caused by the cluttered background and large intra-class appearance variations. To handle k-shot classification, we propose to learn a structured fully connected layer that can directly classify dense image representations with the EMD. Based on the implicit function theorem, the EMD can be inserted as a layer into the network for end-to-end training. We conduct comprehensive experiments to validate our algorithm and we set new state-of-the-art performance on four popular few-shot classification benchmarks, namely miniImageNet, tieredImageNet, Fewshot-CIFAR100 (FC100) and Caltech-UCSD Birds-200-2011 (CUB).
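For intuition, here is a non-differentiable sketch of the matching step with uniform weights, in which case the transport problem reduces to an assignment (the paper's cross-reference weighting and the implicit-function-theorem layer are omitted):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_image_distance(f1, f2):
    """f1, f2: (N, D) L2-normalized local embeddings (e.g. flattened conv features).
    With uniform weights and equal set sizes, optimal transport is an assignment."""
    cost = 1.0 - f1 @ f2.T                    # cosine matching cost, (N, N)
    rows, cols = linear_sum_assignment(cost)  # optimal matching flows
    return cost[rows, cols].mean()            # image distance for classification

rng = np.random.default_rng(0)
f1 = rng.normal(size=(25, 64)); f1 /= np.linalg.norm(f1, axis=1, keepdims=True)
f2 = rng.normal(size=(25, 64)); f2 /= np.linalg.norm(f2, axis=1, keepdims=True)
print(emd_image_distance(f1, f2))
```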

62. Siamese Box Adaptive Network for Visual Tracking [PDF] 返回目录
  Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, Rongrong Ji
Abstract: Most of the existing trackers usually rely on either a multi-scale searching scheme or pre-defined anchor boxes to accurately estimate the scale and aspect ratio of a target. Unfortunately, they typically call for tedious and heuristic configurations. To address this issue, we propose a simple yet effective visual tracking framework (named Siamese Box Adaptive Network, SiamBAN) by exploiting the expressive power of the fully convolutional network (FCN). SiamBAN views the visual tracking problem as a parallel classification and regression problem, and thus directly classifies objects and regresses their bounding boxes in a unified FCN. The no-prior box design avoids hyper-parameters associated with the candidate boxes, making SiamBAN more flexible and general. Extensive experiments on visual tracking benchmarks including VOT2018, VOT2019, OTB100, NFS, UAV123, and LaSOT demonstrate that SiamBAN achieves state-of-the-art performance and runs at 40 FPS, confirming its effectiveness and efficiency. The code will be available at this https URL.

63. Channel Pruning Guided by Classification Loss and Feature Importance [PDF] 返回目录
  Jinyang Guo, Wanli Ouyang, Dong Xu
Abstract: In this work, we propose a new layer-by-layer channel pruning method called Channel Pruning guided by classification Loss and feature Importance (CPLI). In contrast to the existing layer-by-layer channel pruning approaches that only consider how to reconstruct the features from the next layer, our approach additionally takes the classification loss into account in the channel pruning process. We also observe that some reconstructed features will be removed at the next pruning stage. So it is unnecessary to reconstruct these features. To this end, we propose a new strategy to suppress the influence of unimportant features (i.e., the features that will be removed at the next pruning stage). Our comprehensive experiments on three benchmark datasets, i.e., CIFAR-10, ImageNet, and UCF-101, demonstrate the effectiveness of our CPLI method.
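A toy scoring rule in this spirit, combining a Taylor-style sensitivity of the classification loss with a precomputed feature-importance term, could be sketched as below; this is an illustration, not the paper's exact criterion:

```python
import torch
import torch.nn.functional as F

def channel_scores(feat, recon_err, logits, labels):
    """feat: (B, C, H, W) activations of the layer being pruned (part of the graph);
    recon_err: (C,) precomputed next-layer reconstruction error per channel
    (a hypothetical stand-in for the feature-importance term)."""
    loss = F.cross_entropy(logits, labels)
    grad = torch.autograd.grad(loss, feat, retain_graph=True)[0]
    cls_term = (grad * feat).abs().mean(dim=(0, 2, 3))  # loss sensitivity per channel
    return cls_term + recon_err

def prune_mask(scores, keep_ratio=0.5):
    keep = torch.topk(scores, max(1, int(len(scores) * keep_ratio))).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[keep] = True
    return mask            # False channels are removed at this pruning stage
```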

64. MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird's Eye View Maps [PDF] 返回目录
  Pengxiang Wu, Siheng Chen, Dimitris Metaxas
Abstract: The ability to reliably perceive the environmental states, particularly the existence of objects and their motion behavior, is crucial for autonomous driving. In this work, we propose an efficient deep model, called MotionNet, to jointly perform perception and motion prediction from 3D point clouds. MotionNet takes a sequence of LiDAR sweeps as input and outputs a bird's eye view (BEV) map, which encodes the object category and motion information in each grid cell. The backbone of MotionNet is a novel spatio-temporal pyramid network, which extracts deep spatial and temporal features in a hierarchical fashion. To enforce the smoothness of predictions over both space and time, the training of MotionNet is further regularized with novel spatial and temporal consistency losses. Extensive experiments show that the proposed method overall outperforms the state-of-the-arts, including the latest scene-flow- and 3D-object-detection-based methods. This indicates the potential value of the proposed method serving as a backup to the bounding-box-based system, and providing complementary information to the motion planner in autonomous driving. Code is available at this https URL.

65. Learning 2D-3D Correspondences To Solve The Blind Perspective-n-Point Problem [PDF] 返回目录
  Liu Liu, Dylan Campbell, Hongdong Li, Dingfu Zhou, Xibin Song, Ruigang Yang
Abstract: Conventional absolute camera pose estimation via a Perspective-n-Point (PnP) solver often assumes that the correspondences between 2D image pixels and 3D points are given. When the correspondences between 2D and 3D points are not known a priori, the task becomes the much more challenging blind PnP problem. This paper proposes a deep CNN model which simultaneously solves for both the 6-DoF absolute camera pose and 2D--3D correspondences. Our model comprises three neural modules connected in sequence. First, a two-stream PointNet-inspired network is applied directly to both the 2D image keypoints and the 3D scene points in order to extract discriminative point-wise features harnessing both local and contextual information. Second, a global feature matching module is employed to estimate a matchability matrix among all 2D--3D pairs. Third, the obtained matchability matrix is fed into a classification module to disambiguate inlier matches. The entire network is trained end-to-end, followed by a robust model fitting (P3P-RANSAC) at test time only to recover the 6-DoF camera pose. Extensive tests on both real and simulated data have shown that our method substantially outperforms existing approaches, and is capable of processing thousands of points a second with state-of-the-art accuracy.
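As a loose illustration of the overall pipeline (score all 2D-3D pairs, keep confident matches, then fit the pose with P3P plus RANSAC), assuming L2-normalized learned descriptors; the threshold and shapes are arbitrary, and at least four confident matches are needed:

```python
import numpy as np
import cv2

def blind_pnp(feat2d, feat3d, pts2d, pts3d, K, thresh=0.9):
    """feat2d: (M, D), feat3d: (N, D) point-wise descriptors;
    pts2d: (M, 2) image keypoints; pts3d: (N, 3) scene points; K: 3x3 intrinsics."""
    match = feat2d @ feat3d.T                    # matchability matrix, (M, N)
    j = match.argmax(axis=1)                     # best 3D candidate per keypoint
    sel = match[np.arange(len(j)), j] > thresh   # keep confident (inlier) matches
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d[j[sel]].astype(np.float64), pts2d[sel].astype(np.float64),
        K.astype(np.float64), None, flags=cv2.SOLVEPNP_P3P)
    return rvec, tvec                            # 6-DoF pose (rotation, translation)
```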

66. A Novel Learnable Gradient Descent Type Algorithm for Non-convex Non-smooth Inverse Problems [PDF] 返回目录
  Qingchao Zhang, Xiaojing Ye, Hongcheng Liu, Yunmei Chen
Abstract: Optimization algorithms for solving nonconvex inverse problems have attracted significant interest recently. However, existing methods require the nonconvex regularization to be smooth or simple to ensure convergence. In this paper, we propose a novel gradient descent type algorithm, by leveraging the idea of residual learning and Nesterov's smoothing technique, to solve inverse problems consisting of general nonconvex and nonsmooth regularization with provable convergence. Moreover, we develop a neural network architecture imitating this algorithm to learn the nonlinear sparsity transformation adaptively from training data, which also inherits the convergence to accommodate the general nonconvex structure of this learned transformation. Numerical results demonstrate that the proposed network outperforms the state-of-the-art methods on a variety of different image reconstruction problems in terms of efficiency and accuracy.

67. Beyond without Forgetting: Multi-Task Learning for Classification with Disjoint Datasets [PDF] 返回目录
  Yan Hong, Li Niu, Jianfu Zhang, Liqing Zhang
Abstract: Multi-task Learning (MTL) for classification with disjoint datasets aims to explore MTL when one task only has one labeled dataset. In existing methods, for each task, the unlabeled datasets are not fully exploited to facilitate this task. Inspired by semi-supervised learning, we use unlabeled datasets with pseudo labels to facilitate each task. However, there are two major issues: 1) the pseudo labels are very noisy; 2) the unlabeled datasets and the labeled dataset for each task have considerable data distribution mismatch. To address these issues, we propose our MTL with Selective Augmentation (MTL-SA) method to select the training samples in unlabeled datasets with confident pseudo labels and data distributions close to the labeled dataset. Then, we use the selected training samples to add information and use the remaining training samples to preserve information. Extensive experiments on face-centric and human-centric applications demonstrate the effectiveness of our MTL-SA method.
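The selection rule can be caricatured in a few lines: keep unlabeled samples that are both confidently pseudo-labeled and close to the labeled distribution, here crudely measured by distance to the labeled feature mean (the thresholds are arbitrary):

```python
import numpy as np

def select_augmenting_samples(probs, feats, labeled_mean, tau=0.9, radius=1.0):
    """probs: (N, C) softmax outputs on unlabeled data; feats: (N, D) features;
    labeled_mean: (D,) mean feature of the labeled dataset (a crude proxy for
    distribution mismatch; tau and radius are illustrative thresholds)."""
    confident = probs.max(axis=1) > tau
    close = np.linalg.norm(feats - labeled_mean, axis=1) < radius
    add_info = confident & close        # samples used to add information
    return add_info, ~add_info          # the rest only preserve information
```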

68. Vision-Dialog Navigation by Exploring Cross-modal Memory [PDF] 返回目录
  Yi Zhu, Fengda Zhu, Zhaohuan Zhan, Bingqian Lin, Jianbin Jiao, Xiaojun Chang, Xiaodan Liang
Abstract: Vision-dialog navigation, posed as a new holy-grail task in the vision-language discipline, targets learning an agent endowed with the capability of constantly conversing for help in natural language and navigating according to human responses. Besides the common challenges faced in visual language navigation, vision-dialog navigation also requires handling well the language intentions of a series of questions about the temporal context from dialogue history and co-reasoning over both dialogs and visual scenes. In this paper, we propose the Cross-modal Memory Network (CMN) for remembering and understanding the rich information relevant to historical navigation actions. Our CMN consists of two memory modules, the language memory module (L-mem) and the visual memory module (V-mem). Specifically, L-mem learns latent relationships between the current language interaction and a dialog history by employing a multi-head attention mechanism. V-mem learns to associate the current visual views and the cross-modal memory about the previous navigation actions. The cross-modal memory is generated via a vision-to-language attention and a language-to-vision attention. Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to explore the memory about the decision making of historical navigation actions for the current step. Experiments on the CVDN dataset show that our CMN outperforms the previous state-of-the-art model by a significant margin on both seen and unseen environments.

69. A model of figure ground organization incorporating local and global cues [PDF] 返回目录
  Sudarshan Ramenahalli
Abstract: Figure Ground Organization (FGO) -- inferring spatial depth ordering of objects in a visual scene -- involves determining which side of an occlusion boundary is figure (closer to the observer) and which is ground (further away from the observer). A combination of global cues, like convexity, and local cues, like T-junctions, is involved in this process. We present a biologically motivated, feed-forward computational model of FGO incorporating convexity, surroundedness and parallelism as global cues, and Spectral Anisotropy (SA) and T-junctions as local cues. While SA is computed in a biologically plausible manner, the inclusion of T-junctions is biologically motivated. The model consists of three independent feature channels, Color, Intensity and Orientation, but SA and T-junctions are introduced only in the Orientation channel as these properties are specific to that feature of objects. We study the effect of adding each local cue independently, and both of them simultaneously, to the model with no local cues. We evaluate model performance based on figure-ground classification accuracy (FGCA) at every border location using the BSDS 300 figure-ground dataset. Each local cue, when added alone, gives a statistically significant improvement in the FGCA of the model, suggesting its usefulness as an independent FGO cue. The model with both local cues achieves higher FGCA than the models with individual cues, indicating SA and T-junctions are not mutually contradictory. Compared to the model with no local cues, the feed-forward model with both local cues achieves a $\geq 8.78$% improvement in terms of FGCA.

70. NoiseRank: Unsupervised Label Noise Reduction with Dependence Models [PDF] 返回目录
  Karishma Sharma, Pinar Donmez, Enming Luo, Yan Liu, I. Zeki Yalniz
Abstract: Label noise is increasingly prevalent in datasets acquired from noisy channels. Existing approaches that detect and remove label noise generally rely on some form of supervision, which is not scalable and error-prone. In this paper, we propose NoiseRank, for unsupervised label noise reduction using Markov Random Fields (MRF). We construct a dependence model to estimate the posterior probability of an instance being incorrectly labeled given the dataset, and rank instances based on their estimated probabilities. Our method 1) Does not require supervision from ground-truth labels, or priors on label or noise distribution. 2) It is interpretable by design, enabling transparency in label noise removal. 3) It is agnostic to classifier architecture/optimization framework and content modality. These advantages enable wide applicability in real noise settings, unlike prior works constrained by one or more conditions. NoiseRank improves state-of-the-art classification on Food101-N (~20% noise), and is effective on high noise Clothing-1M (~40% noise).
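A much-simplified stand-in for the dependence model, which still conveys the ranking idea, scores each instance by how strongly its nearest neighbors' labels disagree with its own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def noise_ranking(feats, labels, k=10):
    """feats: (N, D) instance embeddings; labels: (N,) observed labels.
    Returns indices sorted from most to least likely mislabeled, plus scores."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(feats)
    _, idx = nn.kneighbors(feats)                 # idx[:, 0] is the point itself
    disagreement = (labels[idx[:, 1:]] != labels[:, None]).mean(axis=1)
    return np.argsort(-disagreement), disagreement
```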

71. Class Conditional Alignment for Partial Domain Adaptation [PDF] 返回目录
  Mohsen Kheirandishfard, Fariba Zohrizadeh, Farhad Kamangar
Abstract: Adversarial adaptation models have demonstrated significant progress towards transferring knowledge from a labeled source dataset to an unlabeled target dataset. Partial domain adaptation (PDA) investigates the scenarios in which the source domain is large and diverse, and the target label space is a subset of the source label space. The main purpose of PDA is to identify the shared classes between the domains and promote learning transferable knowledge from these classes. In this paper, we propose a multi-class adversarial architecture for PDA. The proposed approach jointly aligns the marginal and class-conditional distributions in the shared label space by minimaxing a novel multi-class adversarial loss function. Furthermore, we incorporate effective regularization terms to encourage selecting the most relevant subset of source domain classes. In the absence of target labels, the proposed approach is able to effectively learn domain-invariant feature representations, which in turn can enhance the classification performance in the target domain. Comprehensive experiments on three benchmark datasets Office-31, Office-Home, and Caltech-Office corroborate the effectiveness of the proposed approach in addressing different partial transfer learning tasks.

72. Emotions Don't Lie: A Deepfake Detection Method using Audio-Visual Affective Cues [PDF] 返回目录
  Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha
Abstract: We present a learning-based multimodal method for detecting real and deepfake videos. To maximize information for learning, we extract and analyze the similarity between the two audio and visual modalities from within the same video. Additionally, we extract and compare affective cues corresponding to emotion from the two modalities within a video to infer whether the input video is "real" or "fake". We propose a deep learning network, inspired by the Siamese network architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale, audio-visual deepfake detection datasets, DeepFake-TIMIT Dataset and DFDC. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets, respectively.
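The Siamese/triplet idea can be sketched as follows, with hypothetical per-modality embedding networks: matched audio and visual affect embeddings of a real video are pulled together, while embeddings from a fake video are pushed away:

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def affect_consistency_loss(vis_real, aud_real, vis_fake):
    """vis_*/aud_*: (B, D) embeddings from the visual and audio branches."""
    return triplet(vis_real, aud_real, vis_fake)

def realness_score(vis, aud):
    """At test time, higher audio-visual agreement suggests a real video."""
    return torch.cosine_similarity(vis, aud, dim=-1)
```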

73. Identifying Individual Dogs in Social Media Images [PDF] 返回目录
  Djordje Batic, Dubravko Culibrk
Abstract: We present the results of an initial study focused on developing a visual AI solution able to recognize individual dogs in unconstrained (wild) images occurring on social media. The work described here is part of a joint project done with Pet2Net, a social network focused on pets and their owners. In order to detect and recognize individual dogs we combine transfer learning and object detection approaches on Inception v3 and SSD Inception v2 architectures, respectively, and evaluate the proposed pipeline using a new data set containing real data that the users uploaded to the Pet2Net platform. We show that it can achieve 94.59% accuracy in identifying individual dogs. Our approach has been designed with simplicity in mind and the goal of easy deployment on all the images uploaded to the Pet2Net platform. A purely visual approach to identifying dogs in images will enhance Pet2Net features aimed at finding lost dogs, as well as form the basis of future work focused on identifying social relationships between dogs, which cannot be inferred from other data collected by the platform.
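An analogous transfer-learning setup is easy to reproduce; the sketch below uses torchvision's Inception v3 rather than the paper's TensorFlow pipeline, and omits the SSD detector:

```python
import torch.nn as nn
from torchvision import models

def build_dog_identifier(n_dogs):
    net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    for p in net.parameters():
        p.requires_grad = False                      # freeze the pretrained backbone
    net.fc = nn.Linear(net.fc.in_features, n_dogs)   # new identity head
    net.AuxLogits.fc = nn.Linear(net.AuxLogits.fc.in_features, n_dogs)
    return net   # train only the two new linear layers, on 299x299 crops
```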

74. TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification [PDF] 返回目录
  Moshe Lichtenstein, Prasanna Sattigeri, Rogerio Feris, Raja Giryes, Leonid Karlinsky
Abstract: The field of Few-Shot Learning (FSL), or learning from very few (typically $1$ or $5$) examples per novel class (unseen during training), has received a lot of attention and significant performance advances in the recent literature. While a number of techniques have been proposed for FSL, several factors have emerged as most important for FSL performance, awarding SOTA even to the simplest of techniques. These are: the backbone architecture (bigger is better), type of pre-training on the base classes (meta-training vs regular multi-class, currently regular wins), quantity and diversity of the base classes set (the more the merrier, resulting in richer and better adaptive features), and the use of self-supervised tasks during pre-training (serving as a proxy for increasing the diversity of the base set). In this paper we propose yet another simple technique that is important for few-shot learning performance - a search for a compact feature sub-space that is discriminative for a given few-shot test task. We show that the Task-Adaptive Feature Sub-Space Learning (TAFSSL) can significantly boost the performance in FSL scenarios when some additional unlabeled data accompanies the novel few-shot task, be it either the set of unlabeled queries (transductive FSL) or some additional set of unlabeled data samples (semi-supervised FSL). Specifically, we show that on the challenging miniImageNet and tieredImageNet benchmarks, TAFSSL can improve the current state-of-the-art in both transductive and semi-supervised FSL settings by more than $5\%$, while increasing the benefit of using unlabeled data in FSL to above $10\%$ performance gain.
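The core of a PCA-based variant fits in a dozen lines: fit a low-dimensional sub-space on all features of the task (support plus unlabeled queries), then classify by nearest centroid in that sub-space; the dimensions below are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

def tafssl_predict(support_x, support_y, query_x, dim=5):
    """support_x: (Ns, D) labeled features; query_x: (Nq, D) unlabeled features."""
    pca = PCA(n_components=dim).fit(np.vstack([support_x, query_x]))  # task-adaptive
    s, q = pca.transform(support_x), pca.transform(query_x)
    classes = np.unique(support_y)
    centroids = np.stack([s[support_y == c].mean(axis=0) for c in classes])
    dist = ((q[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[dist.argmin(axis=1)]           # nearest-centroid prediction
```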

75. Structured Domain Adaptation for Unsupervised Person Re-identification [PDF] 返回目录
  Yixiao Ge, Feng Zhu, Rui Zhao, Hongsheng Li
Abstract: Unsupervised domain adaptation (UDA) aims at adapting the model trained on a labeled source-domain dataset to another target-domain dataset without any annotation. The task of UDA for the open-set person re-identification (re-ID) is even more challenging as the identities (classes) have no overlap between the two domains. Existing UDA methods for person re-ID have the following limitations. 1) Pseudo-label-based methods achieve state-of-the-art performances but ignore the complex relations between two domains' images, along with the valuable source-domain annotations. 2) Domain translation-based methods cannot achieve competitive performances as the domain translation is not properly regularized to generate informative enough training samples that well maintain inter-sample relations. To tackle the above challenges, we propose an end-to-end structured domain adaptation framework that consists of a novel structured domain-translation network and two domain-specific person image encoders. The structured domain-translation network can effectively transform the source-domain images into the target domain while well preserving the original intra- and inter-identity relations. The target-domain encoder could then be trained using both source-to-target translated images with valuable ground-truth labels and target-domain images with pseudo labels. Importantly, the domain-translation network and target-domain encoder are jointly optimized, improving each other towards the overall objective, i.e. to achieve optimal re-ID performances on the target domain. Our proposed framework outperforms state-of-the-art methods on multiple UDA tasks of person re-ID.

76. Fast Depth Estimation for View Synthesis [PDF] 返回目录
  Nantheera Anantrasirichai, Majid Geravand, David Braendler, David R. Bull
Abstract: Disparity/depth estimation from sequences of stereo images is an important element in 3D vision. Owing to occlusions, imperfect settings and homogeneous luminance, accurate estimation of depth remains a challenging problem. Targeting view synthesis, we propose a novel learning-based framework making use of dilated convolution, densely connected convolutional modules, a compact decoder and skip connections. The network is shallow but dense, so it is fast and accurate. Two additional contributions - a non-linear adjustment of the depth resolution and the introduction of a projection loss - lead to reduction of the estimation error by up to 20% and 25%, respectively. The results show that our network outperforms state-of-the-art methods with an average improvement in accuracy of depth estimation and view synthesis by approximately 45% and 34%, respectively. Where our method generates comparable quality of estimated depth, it performs 10 times faster than those methods.

77. Large-Scale Optimal Transport via Adversarial Training with Cycle-Consistency [PDF] 返回目录
  Guansong Lu, Zhiming Zhou, Jian Shen, Cheng Chen, Weinan Zhang, Yong Yu
Abstract: Recent advances in large-scale optimal transport have greatly extended its application scenarios in machine learning. However, existing methods either do not explicitly learn the transport map or do not support general cost functions. In this paper, we propose an end-to-end approach for large-scale optimal transport, which directly solves for the transport map and is compatible with general cost functions. It models the transport map via stochastic neural networks and enforces the constraint on the marginal distributions via adversarial training. The proposed framework can be further extended towards learning the Monge map or an optimal bijection via adopting cycle-consistency constraint(s). We verify the effectiveness of the proposed method and demonstrate its superior performance against existing methods with large-scale real-world applications, including domain adaptation, image-to-image translation, and color transfer.

78. Non-Local Part-Aware Point Cloud Denoising [PDF] 返回目录
  Chao Huang, Ruihui Li, Xianzhi Li, Chi-Wing Fu
Abstract: This paper presents a novel non-local part-aware deep neural network to denoise point clouds by exploring the inherent non-local self-similarity in 3D objects and scenes. Different from existing works that explore small local patches, we design the non-local learning unit (NLU) customized with a graph attention module to adaptively capture non-local semantically-related features over the entire point cloud. To enhance the denoising performance, we cascade a series of NLUs to progressively distill the noise features from the noisy inputs. Further, besides the conventional surface reconstruction loss, we formulate a semantic part loss to regularize the predictions towards the relevant parts and enable denoising in a part-aware manner. Lastly, we performed extensive experiments to evaluate our method, both quantitatively and qualitatively, and demonstrate its superiority over the state-of-the-arts on both synthetic and real-scanned noisy inputs.

79. Monocular Depth Estimation Based On Deep Learning: An Overview [PDF] 返回目录
  Chaoqiang Zhao, Qiyu Sun, Chongzhen Zhang, Yang Tang, Feng Qian
Abstract: Depth information is important for autonomous systems to perceive environments and estimate their own state. Traditional depth estimation methods, like structure from motion and stereo vision matching, are built on feature correspondences of multiple viewpoints. Meanwhile, the predicted depth maps are sparse. Inferring depth information from a single image (monocular depth estimation) is an ill-posed problem. With the rapid development of deep neural networks, monocular depth estimation based on deep learning has been widely studied recently and achieved promising performance in accuracy. Meanwhile, dense depth maps are estimated from single images by deep neural networks in an end-to-end manner. In order to improve the accuracy of depth estimation, different kinds of network frameworks, loss functions and training strategies are proposed subsequently. Therefore, we survey the current monocular depth estimation methods based on deep learning in this review. Initially, we conclude several widely used datasets and evaluation indicators in deep learning-based depth estimation. Furthermore, we review some representative existing methods according to different training manners: supervised, unsupervised and semi-supervised. Finally, we discuss the challenges and provide some ideas for future researches in monocular depth estimation.

80. Medical Image Enhancement Using Histogram Processing and Feature Extraction for Cancer Classification [PDF] 返回目录
  Sakshi Patel, Bharath K P, Rajesh Kumar Muthu
Abstract: MRI (Magnetic Resonance Imaging) is a technique used to analyze and diagnose the problem defined by images like cancer or tumor in a brain. Physicians require good contrast images for better treatment purpose as it contains maximum information of the disease. MRI images are low contrast images which make diagnoses difficult; hence better localization of image pixels is required. Histogram Equalization techniques help to enhance the image so that it gives an improved visual quality and a well defined problem. The contrast and brightness is enhanced in such a way that it does not lose its original information and the brightness is preserved. We compare the different equalization techniques in this paper; the techniques are critically studied and elaborated. They are also tabulated to compare various parameters present in the image. In addition we have also segmented and extracted the tumor part out of the brain using K-means algorithm. For classification and feature extraction the method used is Support Vector Machine (SVM). The main goal of this research work is to help the medical field with a light of image processing.
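A minimal OpenCV version of the pipeline (global equalization, CLAHE as a brightness-preserving alternative, and K-means on intensities as a crude stand-in for the tumor extraction step); the file name and cluster count are illustrative:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

img = cv2.imread('mri_slice.png', cv2.IMREAD_GRAYSCALE)   # illustrative path

# Contrast enhancement: global equalization spreads intensities over the full
# range, while CLAHE limits amplification so brightness is better preserved.
eq = cv2.equalizeHist(img)
enhanced = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)

# Simple segmentation step: cluster pixel intensities, take the brightest cluster.
km = KMeans(n_clusters=3, n_init=10).fit(enhanced.reshape(-1, 1).astype(np.float32))
bright = int(km.cluster_centers_.argmax())
tumor_mask = (km.labels_.reshape(img.shape) == bright).astype(np.uint8) * 255
cv2.imwrite('tumor_mask.png', tumor_mask)
```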

81. Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition [PDF] 返回目录
  Canjie Luo, Yuanzhi Zhu, Lianwen Jin, Yongpan Wang
Abstract: Handwritten text and scene text suffer from various shapes and distorted patterns. Thus training a robust recognition model requires a large amount of data to cover diversity as much as possible. In contrast to data collection and annotation, data augmentation is a low cost way. In this paper, we propose a new method for text image augmentation. Different from traditional augmentation methods such as rotation, scaling and perspective transformation, our proposed augmentation method is designed to learn proper and efficient data augmentation which is more effective and specific for training a robust recognizer. By using a set of custom fiducial points, the proposed augmentation method is flexible and controllable. Furthermore, we bridge the gap between the isolated processes of data augmentation and network optimization by joint learning. An agent network learns from the output of the recognition network and controls the fiducial points to generate more proper training samples for the recognition network. Extensive experiments on various benchmarks, including regular scene text, irregular scene text and handwritten text, show that the proposed augmentation and the joint learning methods significantly boost the performance of the recognition networks. A general toolkit for geometric augmentation is available.

82. Collaborative Motion Prediction via Neural Motion Message Passing [PDF] 返回目录
  Yue Hu, Siheng Chen, Ya Zhang, Xiao Gu
Abstract: Motion prediction is essential and challenging for autonomous vehicles and social robots. One challenge of motion prediction is to model the interaction among traffic actors, which could cooperate with each other to avoid collisions or form groups. To address this challenge, we propose neural motion message passing (NMMP) to explicitly model the interaction and learn representations for directed interactions between actors. Based on the proposed NMMP, we design the motion prediction systems for two settings: the pedestrian setting and the joint pedestrian and vehicle setting. Both systems share a common pattern: we use an individual branch to model the behavior of a single actor and an interactive branch to model the interaction between actors, while with different wrappers to handle the varied input formats and characteristics. The experimental results show that both systems outperform the previous state-of-the-art methods on several existing benchmarks. Besides, we provide interpretability for interaction learning.
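One round of such message passing between actor embeddings can be written compactly; the layer sizes and the dense all-pairs interaction graph are simplifying assumptions:

```python
import torch
import torch.nn as nn

class MotionMessagePassing(nn.Module):
    """One round of neural message passing between N interacting actors."""
    def __init__(self, d):
        super().__init__()
        self.edge = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # per-pair message
        self.node = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())  # node update

    def forward(self, h):                        # h: (N, d) actor embeddings
        n = h.size(0)
        pair = torch.cat([h[:, None].expand(n, n, -1),    # receiver i
                          h[None, :].expand(n, n, -1)],   # sender j
                         dim=-1)
        msg = self.edge(pair)                    # (N, N, d) directed messages
        agg = msg.sum(dim=1) - self.edge(torch.cat([h, h], -1))  # drop self-message
        return self.node(torch.cat([h, agg], dim=-1))
```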

83. Image-to-image Neural Network for Addition and Subtraction of a Pair of Not Very Large Numbers [PDF] 返回目录
  Vladimir Ivashkin
Abstract: Looking back at the history of calculators, one can see that they become less functional and more computationally expensive over time. A modern calculator runs on a personal computer and is drawn at 60 fps only to help us click a few digits with a mouse pointer. A search engine is often used as a calculator, which means that nowadays we need the Internet just to add two numbers. In this paper, we propose to go further and train a convolutional neural network that takes an image of a simple mathematical expression and generates an image of an answer. This neural calculator works only with pairs of double-digit numbers and supports only addition and subtraction. Also, sometimes it makes mistakes. We promise that the proposed calculator is a small step for man, but one giant leap for mankind.

84. From W-Net to CDGAN: Bi-temporal Change Detection via Deep Learning Techniques [PDF] 返回目录
  Bin Hou, Qingjie Liu, Heng Wang, Yunhong Wang
Abstract: Traditional change detection methods usually follow the image differencing, change feature extraction and classification framework, and their performance is limited by such simple image domain differencing and also the hand-crafted features. Recently, the success of deep convolutional neural networks (CNNs) has widely spread across the whole field of computer vision for their powerful representation abilities. In this paper, we therefore address the remote sensing image change detection problem with deep learning techniques. We firstly propose an end-to-end dual-branch architecture, termed as the W-Net, with each branch taking as input one of the two bi-temporal images as in the traditional change detection models. In this way, CNN features with more powerful representative abilities can be obtained to boost the final detection performance. Also, W-Net performs differencing in the feature domain rather than in the traditional image domain, which greatly alleviates loss of useful information for determining the changes. Furthermore, by reformulating change detection as an image translation problem, we apply the recently popular Generative Adversarial Network (GAN) in which our W-Net serves as the Generator, leading to a new GAN architecture for change detection which we call CDGAN. To train our networks and also facilitate future research, we construct a large scale dataset by collecting images from Google Earth and provide carefully manually annotated ground truths. Experiments show that our proposed methods can provide fine-grained change detection results superior to the existing state-of-the-art baselines.

85. Counterfactual Samples Synthesizing for Robust Visual Question Answering [PDF] 返回目录
  Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, Yueting Zhuang
Abstract: Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to the test set with different QA distributions. To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, due to the complexity of the design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions. 2) question-sensitive: the model should be sensitive to the linguistic variations in questions. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. The CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both the visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. Particularly, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.

86. Efficient Backbone Search for Scene Text Recognition [PDF] 返回目录
  Hui Zhang, Quanming Yao, Mingkun Yang, Yongchao Xu, Xiang Bai
Abstract: Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes. The community has paid increasing attention to boost the performance by improving the pre-processing image module, like rectification and deblurring, or the sequence translator. However, another critical module, i.e., the feature sequence extractor, has not been extensively explored. In this work, inspired by the success of neural architecture search (NAS), which can identify better architectures than human-designed ones, we propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance. First, we design a domain-specific search space for STR, which contains both choices on operations and constraints on the downsampling path. Then, we propose a two-step search algorithm, which decouples operations and downsampling path, for an efficient search in the given space. Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks with much fewer FLOPS and model parameters.

87. Dynamic Divide-and-Conquer Adversarial Training for Robust Semantic Segmentation [PDF] 返回目录
  Xiaogang Xu, Hengshuang Zhao, Jiaya Jia
Abstract: Adversarial training is promising for improving the robustness of deep neural networks towards adversarial perturbation, especially on classification tasks. The effect of this type of training on semantic segmentation, in contrast, has only just begun to be explored. We make the initial attempt to explore the defense strategy on semantic segmentation by formulating a general adversarial training procedure that can perform decently on both adversarial and clean samples. We propose a dynamic divide-and-conquer adversarial training (DDC-AT) strategy to enhance the defense effect, by setting additional branches in the target model during training, and dealing with pixels with diverse properties towards adversarial perturbation. Our dynamical division mechanism divides pixels into multiple branches automatically, achieved by unsupervised learning. Note all these additional branches can be abandoned during inference and thus leave no extra parameter and computation cost. Extensive experiments with various segmentation models are conducted on the PASCAL VOC 2012 and Cityscapes datasets, in which DDC-AT yields satisfying performance under both white- and black-box attacks.

88. OccuSeg: Occupancy-aware 3D Instance Segmentation [PDF] 返回目录
  Lei Han, Tian Zheng, Lan Xu, Lu Fang
Abstract: 3D instance segmentation, with a variety of applications in robotics and augmented reality, is in large demand these days. Unlike 2D images, which are projective observations of the environment, 3D models provide metric reconstruction of the scenes without occlusion or scale ambiguity. In this paper, we define the "3D occupancy size" as the number of voxels occupied by each instance. It owns the advantage of robustness in prediction, on which basis OccuSeg, an occupancy-aware 3D instance segmentation scheme, is proposed. Our multi-task learning produces both an occupancy signal and embedding representations, where the training of spatial and feature embeddings varies with their difference in scale-awareness. Our clustering scheme benefits from the reliable comparison between the predicted occupancy size and the clustered occupancy size, which encourages hard samples being correctly clustered and avoids over-segmentation. The proposed approach achieves state-of-the-art performance on 3 real-world datasets, i.e. ScanNetV2, S3DIS and SceneNN, while maintaining high efficiency.

89. An End-to-End Geometric Deficiency Elimination Algorithm for 3D Meshes [PDF] 返回目录
  Bingtao Ma, Hongsen Liu, Liangliang Nan, Yang Cong
Abstract: The 3D mesh is an important representation of geometric data. In the generation of mesh data, geometric deficiencies (e.g., duplicate elements, degenerate faces, isolated vertices, self-intersection, and inner faces) are unavoidable and may violate the topology structure of an object. In this paper, we propose an effective and efficient geometric deficiency elimination algorithm for 3D meshes. Specifically, duplicate elements can be eliminated by assessing the occurrence times of vertices or faces, degenerate faces can be removed according to the outer product of two edges; since isolated vertices do not appear in any face vertices, they can be deleted directly; self-intersecting faces are detected using an AABB tree and remeshed afterward; by simulating whether multiple random rays that shoot from a face can reach infinity, we can judge whether the surface is an inner face, then decide to delete it or not. Experiments on ModelNet40 dataset illustrate that our method can eliminate the deficiencies of the 3D mesh thoroughly.
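The simpler deficiencies translate almost directly into NumPy (duplicate faces, zero-area faces via the cross product of two edges, and isolated vertices); self-intersection via an AABB tree and inner-face removal via ray casting are omitted here:

```python
import numpy as np

def clean_mesh(vertices, faces, eps=1e-12):
    """vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices."""
    # 1) duplicate faces: the same vertex set occurring more than once
    #    (sorting also merges orientation-flipped duplicates)
    _, keep = np.unique(np.sort(faces, axis=1), axis=0, return_index=True)
    faces = faces[np.sort(keep)]
    # 2) degenerate faces: two edges whose cross product has (near-)zero norm
    e1 = vertices[faces[:, 1]] - vertices[faces[:, 0]]
    e2 = vertices[faces[:, 2]] - vertices[faces[:, 0]]
    faces = faces[np.linalg.norm(np.cross(e1, e2), axis=1) > eps]
    # 3) isolated vertices: never referenced by any remaining face
    used = np.unique(faces)
    remap = -np.ones(len(vertices), dtype=np.int64)
    remap[used] = np.arange(len(used))
    return vertices[used], remap[faces]
```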

90. Learning Reinforced Agents with Counterfactual Simulation for Medical Automatic Diagnosis [PDF] 返回目录
  Junfan Lin, Ziliang Chen, Xiaodan Liang, Keze Wang, Liang Lin
Abstract: Medical automatic diagnosis (MAD) aims to learn an agent that mimics the behavior of a human doctor, i.e. inquiring symptoms and informing diseases. Due to medical ethics concerns, it is impractical to directly apply reinforcement learning techniques to solving MAD, e.g., training a reinforced agent with the human patient. Developing a patient simulator by using the collected patient-doctor dialogue records has been proposed as a promising approach to MAD. However, most of these existing works overlook the causal relationship between patient symptoms and disease diagnoses. For example, these simulators simply generate the ``not-sure'' response to the inquiry (i.e., symptom) that was not observed in one dialogue record. As a result, the MAD agent is usually trained without exploiting the counterfactual reasoning beyond the factual observations. To address this problem, this paper presents a propensity-based patient simulator (PBPS), which is capable of facilitating the training of MAD agents by generating informative counterfactual answers along with the disease diagnosis. Specifically, our PBPS estimates the propensity score of each record with the patient-doctor dialogue reasoning, and can thus generate the counterfactual answers by searching across records. That is, the unrecorded symptom for one patient can be found in the records of other patients according to the propensity score matching. A progressive assurance agent (P2A) can be thus trained with PBPS, which includes two separate yet cooperative branches accounting for the execution of symptom-inquiry and disease-diagnosis actions, respectively. The disease-diagnosis predicts the confidence of disease and drives the symptom-inquiry in terms of enhancing the confidence, and the two branches are jointly optimized with benefiting from each other.

91. Instant recovery of shape from spectrum via latent space connections [PDF] 返回目录
  Riccardo Marin, Arianna Rampini, Umberto Castellani, Emanuele Rodolà, Maks Ovsjanikov, Simone Melzi
Abstract: We introduce the first learning-based method for recovering shapes from Laplacian spectra. Given an auto-encoder, our model takes the form of a cycle-consistent module to map latent vectors to sequences of eigenvalues. This module provides an efficient and effective linkage between spectrum and geometry of a given shape. Our data-driven approach replaces the need for ad-hoc regularizers required by prior methods, while providing more accurate results at a fraction of the computational cost. Our learning model applies without modifications across different dimensions (2D and 3D shapes alike), representations (meshes, contours and point clouds), as well as across different shape classes, and admits arbitrary resolution of the input spectrum without affecting complexity. The increased flexibility allows us to provide a proxy to differentiable eigendecomposition and to address notoriously difficult tasks in 3D vision and geometry processing within a unified framework, including shape generation from spectrum, mesh super-resolution, shape exploration, style transfer, spectrum estimation from point clouds, segmentation transfer and point-to-point matching.

92. Symmetry Detection of Occluded Point Cloud Using Deep Learning [PDF] 返回目录
  Zhelun Wu, Hongyan Jiang, Siyun He
Abstract: Symmetry detection has been a classical problem in computer graphics, much of it addressed using traditional geometric methods. In recent years, however, we have witnessed the rise of deep learning change the landscape of computer graphics. In this paper, we aim to solve the symmetry detection of the occluded point cloud in a deep-learning fashion. To the best of our knowledge, we are the first to utilize deep learning to tackle such a problem. In this deep learning framework, double supervisions are employed: points on the symmetry plane and normal vectors help us pinpoint the symmetry plane. We conducted experiments on the YCB-Video dataset and demonstrate the efficacy of our method.

93. Explainable Deep Classification Models for Domain Generalization [PDF] 返回目录
  Andrea Zunino, Sarah Adel Bargal, Riccardo Volpi, Mehrnoosh Sameki, Jianming Zhang, Stan Sclaroff, Vittorio Murino, Kate Saenko
Abstract: Conventionally, AI models are thought to trade off explainability for lower accuracy. We develop a training strategy that not only leads to a more explainable AI system for object classification, but as a consequence, suffers no perceptible accuracy degradation. Explanations are defined as regions of visual evidence upon which a deep classification network makes a decision. This is represented in the form of a saliency map conveying how much each pixel contributed to the network's decision. Our training strategy enforces a periodic saliency-based feedback to encourage the model to focus on the image regions that directly correspond to the ground-truth object. We quantify explainability using an automated metric, and using human judgement. We propose explainability as a means for bridging the visual-semantic gap between different domains where model explanations are used as a means of disentangling domain-specific information from otherwise relevant features. We demonstrate that this leads to improved generalization to new domains without hindering performance on the original domain.

94. Recurrent convolutional neural networks for mandible segmentation from computed tomography [PDF] 返回目录
  Bingjiang Qiu, Jiapan Guo, Joep Kraeima, Haye H. Glas, Ronald J. H. Borra, Max J. H. Witjes, Peter M. A. van Ooijen
Abstract: Recently, accurate mandible segmentation in CT scans based on deep learning methods has attracted much attention. However, there still exist two major challenges, namely, metal artifacts among mandibles and large variations in shape or size among individuals. To address these two challenges, we propose a recurrent segmentation convolutional neural network (RSegCNN) that embeds segmentation convolutional neural network (SegCNN) into the recurrent neural network (RNN) for robust and accurate segmentation of the mandible. Such a design of the system takes into account the similarity and continuity of the mandible shapes captured in adjacent image slices in CT scans. The RSegCNN infers the mandible information based on the recurrent structure with the embedded encoder-decoder segmentation (SegCNN) components. The recurrent structure guides the system to exploit relevant and important information from adjacent slices, while the SegCNN component focuses on the mandible shapes from a single CT slice. We conducted extensive experiments to evaluate the proposed RSegCNN on two head and neck CT datasets. The experimental results show that the RSegCNN is significantly better than the state-of-the-art models for accurate mandible segmentation.
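A minimal sketch of the recurrent idea, assuming a generic encoder-decoder and a simple hidden-state recurrence over slices; the authors' RSegCNN is more elaborate than this stand-in.

```python
import torch
import torch.nn as nn

class RecurrentSliceSegmenter(nn.Module):
    """Recurrent segmentation over adjacent CT slices (sketch).

    `seg_net` stands in for the embedded encoder-decoder (SegCNN) and is
    assumed to map (B, 1 + hidden_ch, H, W) -> (B, hidden_ch, H, W)."""
    def __init__(self, seg_net, hidden_ch=16):
        super().__init__()
        self.seg_net = seg_net
        self.head = nn.Conv2d(hidden_ch, 1, kernel_size=1)

    def forward(self, volume):                       # volume: (B, S, H, W)
        b, s, h, w = volume.shape
        hidden = volume.new_zeros(b, self.head.in_channels, h, w)
        masks = []
        for i in range(s):                           # slice-by-slice recurrence
            x = torch.cat([volume[:, i:i + 1], hidden], dim=1)
            hidden = self.seg_net(x)                 # features informed by the previous slice
            masks.append(torch.sigmoid(self.head(hidden)))
        return torch.stack(masks, dim=1)             # (B, S, 1, H, W)
```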

95. Self-supervised Single-view 3D Reconstruction via Semantic Consistency [PDF] 返回目录
  Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, Varun Jampani, Ming-Hsuan Yang, Jan Kautz
Abstract: We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object from a collection of 2D images and silhouettes. The proposed method does not require 3D supervision, manually annotated keypoints, multi-view images of an object, or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging part segmentations of a large collection of category-specific images, learned in a self-supervised manner, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during joint prediction of shape and camera pose of an object, along with texture. To the best of our knowledge, we are the first to try to solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus our model can easily generalize to various object categories without such labels, e.g., horses, penguins, etc. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably to, if not better than, existing category-specific reconstruction methods learned with supervision.

96. Inducing Optimal Attribute Representations for Conditional GANs [PDF] 返回目录
  Binod Bhattarai, Tae-Kyun Kim
Abstract: Conditional GANs are widely used in translating an image from one category to another. Meaningful conditions for GANs provide greater flexibility and control over the nature of the target-domain synthetic data. Existing conditional GANs commonly encode target-domain label information as hard-coded categorical vectors of 0s and 1s. The major drawback of such representations is the inability to encode the high-order semantic information of target categories and their relative dependencies. We propose a novel end-to-end learning framework with Graph Convolutional Networks to learn the attribute representations on which the generator is conditioned. The GAN losses, i.e. the discriminator and attribute-classification losses, are fed back to the graph, resulting in synthetic images that are more natural and clearer in attributes. Moreover, prior works apply conditions only on the generator side of GANs, not on the discriminator side. We apply the conditions to the discriminator side as well, via multi-task learning. We enhance four state-of-the-art cGAN architectures: StarGAN, StarGAN-JNT, AttGAN and STGAN. Our extensive qualitative and quantitative evaluations on the challenging face-attribute manipulation datasets CelebA, LFWA, and RaFD show that the cGANs enhanced by our methods outperform their counterparts and other conditioning methods by a large margin, in terms of both target-attribute recognition rates and quality measures such as PSNR and SSIM.

97. GeoDA: a geometric framework for black-box adversarial attacks [PDF] 返回目录
  Ali Rahmati, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, Huaiyu Dai
Abstract: Adversarial examples are carefully perturbed images that fool image classifiers. We propose a geometric framework to generate adversarial examples in one of the most challenging black-box settings, where the adversary can only issue a small number of queries, each of them returning the top-$1$ label of the classifier. Our framework is based on the observation that the decision boundary of deep networks usually has a small mean curvature in the vicinity of data samples. We propose an effective iterative algorithm to generate query-efficient black-box perturbations with small $\ell_p$ norms for $p \ge 1$, which is confirmed via experimental evaluations on state-of-the-art natural image classifiers. Moreover, for $p=2$, we theoretically show that our algorithm actually converges to the minimal $\ell_2$-perturbation when the curvature of the decision boundary is bounded. We also obtain the optimal distribution of the queries over the iterations of the algorithm. Finally, experimental results confirm that our principled black-box attack algorithm performs better than state-of-the-art algorithms, as it generates smaller perturbations with a reduced number of queries.
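To make the geometric intuition concrete, here is a hedged NumPy sketch of estimating the local normal of the decision boundary from top-1 label queries; it illustrates the idea of exploiting low boundary curvature, not the authors' exact estimator.

```python
import numpy as np

def estimate_boundary_normal(query_top1, x_boundary, true_label,
                             n_queries=100, sigma=0.02):
    """Estimate the local decision-boundary normal (sketch).

    query_top1(x) returns the classifier's top-1 label for input x.
    x_boundary is a point close to the boundary for the attacked image.
    """
    normal = np.zeros_like(x_boundary)
    for _ in range(n_queries):
        z = np.random.randn(*x_boundary.shape)
        # Directions whose label flips lie on the adversarial side: accumulate
        # the signed directions to approximate the boundary normal.
        sign = -1.0 if query_top1(x_boundary + sigma * z) == true_label else 1.0
        normal += sign * z
    return normal / (np.linalg.norm(normal) + 1e-12)
```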

98. The GraphNet Zoo: A Plug-and-Play Framework for Deep Semi-Supervised Classification [PDF] 返回目录
  Marianne de Vriendt, Philip Sellars, Angelica I Aviles-Rivero
Abstract: We consider the problem of classifying a medical image dataset when we have a limited amount of labels. This is a very common yet challenging setting, as labelled data is expensive, time-consuming to collect, and may require expert knowledge. The current go-to of deep supervised learning is unable to cope with such a problem setup. However, using semi-supervised learning, one can produce accurate classifications using a significantly reduced amount of labelled data. Therefore, semi-supervised learning is perfectly suited for medical image classification. However, there has been almost no uptake of semi-supervised methods in the medical domain. In this work, we propose a plug-and-play framework for deep semi-supervised classification focusing on graph-based approaches; to our knowledge, this is the first time that an approach with minimal labels has been demonstrated at such an unprecedented scale. We introduce the concept of hybrid models by defining a classifier as a combination between a model-based functional and a deep net. We demonstrate, through extensive numerical comparisons, that our approach readily competes with fully-supervised state-of-the-art techniques for the applications of Malaria Cells, Mammograms and Chest X-ray classification, whilst using far fewer labels.
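As a concrete illustration of the graph-based ingredient, here is a standard label-propagation sketch in NumPy (Zhou et al.-style normalized propagation); it is a generic stand-in, not the paper's hybrid model.

```python
import numpy as np

def label_propagation(W, labels, n_classes, alpha=0.99, n_iters=50):
    """Graph-based semi-supervised classification (generic sketch).

    W:      (N, N) symmetric affinity matrix (e.g. from a kNN graph on deep features).
    labels: (N,) int array, -1 for unlabelled nodes.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # normalized affinity
    Y = np.zeros((len(labels), n_classes))
    Y[labels >= 0, labels[labels >= 0]] = 1.0           # one-hot seed labels
    F_mat = Y.copy()
    for _ in range(n_iters):                            # F <- alpha*S*F + (1-alpha)*Y
        F_mat = alpha * S @ F_mat + (1 - alpha) * Y
    return F_mat.argmax(axis=1)
```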

99. Mutual Information Maximization for Effective Lip Reading [PDF] 返回目录
  Xing Zhao, Shuang Yang, Shiguang Shan, Xilin Chen
Abstract: Lip reading has received increasing research interest in recent years, due to the rapid development of deep learning and its widespread potential applications. One key point for obtaining good performance in the lip reading task is how effectively the representation can capture the lip movement information while resisting the noise resulting from changes in pose, lighting conditions, the speaker's appearance, and so on. Towards this target, we propose to introduce mutual information constraints at both the local feature level and the global sequence level, to strengthen the relation of the features with the speech content. On the one hand, we constrain the features generated at each time step to carry a strong relation with the speech content by imposing a local mutual information maximization constraint (LMIM), which improves the model's ability to discover fine-grained lip movements and the fine-grained differences among words with similar pronunciation, such as ``spend'' and ``spending''. On the other hand, we introduce a mutual information maximization constraint at the global sequence level (GMIM), to make the model pay more attention to discriminative key frames related to the speech content, and less to the various noises that appear during speaking. By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading. To verify this method, we evaluate it on two large-scale benchmarks. We perform a detailed analysis and comparison of several aspects, including comparison of LMIM and GMIM with the baseline, visualization of the learned representation, and so on. The results not only prove the effectiveness of the proposed method, but also report new state-of-the-art performance on both benchmarks.
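One common differentiable surrogate for such an MI-maximization constraint is an InfoNCE-style lower bound; the sketch below is a stand-in for the LMIM term under assumed feature shapes, and the authors' estimator may differ.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(features, content, temperature=0.1):
    """InfoNCE-style lower bound on the MI between per-step lip features
    and speech-content embeddings (sketch).

    features: (B, D) lip-movement features at a time step.
    content:  (B, D) matching content embeddings (positives on the diagonal).
    """
    f = F.normalize(features, dim=-1)
    c = F.normalize(content, dim=-1)
    logits = f @ c.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(f.size(0), device=f.device)
    # Maximizing MI == minimizing this cross-entropy over in-batch negatives.
    return -F.cross_entropy(logits, targets)
```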

100. Learning Unbiased Representations via Mutual Information Backpropagation [PDF] 返回目录
  Ruggero Ragonesi, Riccardo Volpi, Jacopo Cavazza, Vittorio Murino
Abstract: We are interested in learning data-driven representations that can generalize well, even when trained on inherently biased data. In particular, we face the case where some attributes (bias) of the data, if learned by the model, can severely compromise its generalization properties. We tackle this problem through the lens of information theory, leveraging recent findings for a differentiable estimation of mutual information. We propose a novel end-to-end optimization strategy, which simultaneously estimates and minimizes the mutual information between the learned representation and the data attributes. When applied on standard benchmarks, our model shows comparable or superior classification performance with respect to state-of-the-art approaches. Moreover, our method is general enough to be applicable to the problem of ``algorithmic fairness'', with competitive results.
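A minimal sketch of the core mechanism, assuming a MINE-style (Donsker-Varadhan) statistics network as the differentiable MI estimator; the paper's exact estimator and optimization schedule may differ. The encoder producing z would be trained to minimize this estimate while the statistics network maximizes it.

```python
import math
import torch
import torch.nn as nn

class MINEEstimator(nn.Module):
    """MINE-style statistics network giving a differentiable MI estimate
    between a representation z and a bias attribute b (sketch)."""
    def __init__(self, z_dim, b_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + b_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z, b):
        joint = self.net(torch.cat([z, b], dim=-1)).mean()
        b_shuffled = b[torch.randperm(b.size(0))]        # break the (z, b) pairing
        marginal = self.net(torch.cat([z, b_shuffled], dim=-1)).squeeze(-1)
        # Donsker-Varadhan lower bound: E_joint[T] - log E_marginal[exp(T)]
        return joint - (torch.logsumexp(marginal, dim=0) - math.log(b.size(0)))
```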

101. Active Depth Estimation: Stability Analysis and its Applications [PDF] 返回目录
  Romulo T. Rodrigues, Pedro Miraldo, Dimos V. Dimarogonas, A. Pedro Aguiar
Abstract: Recovering the 3D structure of the surrounding environment is an essential task in any vision-controlled Structure-from-Motion (SfM) scheme. This paper focuses on the theoretical properties of SfM known as incremental active depth estimation. The term incremental refers to estimating the 3D structure of the scene over a chronological sequence of image frames; active means that the camera actuation is chosen so as to improve estimation performance. Starting from a known depth estimation filter, this paper presents a stability analysis of the filter in terms of the control inputs of the camera. By analyzing the convergence of the estimator using Lyapunov theory, we relax the constraints on the projection of the 3D point in the image plane compared to previous results. Nonetheless, our method is capable of dealing with the cameras' limited field-of-view constraints. The main results are validated through experiments with simulated data.

102. Anomalous Instance Detection in Deep Learning: A Survey [PDF] 返回目录
  Saikiran Bulusu, Bhavya Kailkhura, Bo Li, Pramod K. Varshney, Dawn Song
Abstract: Deep Learning (DL) is vulnerable to out-of-distribution and adversarial examples resulting in incorrect outputs. To make DL more robust, several post-hoc anomaly detection techniques to detect (and discard) these anomalous samples have been proposed in the recent past. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection for DL-based applications. We provide a taxonomy of existing techniques based on their underlying assumptions and adopted approaches. We discuss the various techniques in each of the categories and provide the relative strengths and weaknesses of the approaches. Our goal in this survey is to provide an easier yet better understanding of the techniques belonging to the different categories in which research has been done on this topic. Finally, we highlight the unsolved research challenges in applying anomaly detection techniques to DL systems and present some high-impact future research directions.

103. Toward Adversarial Robustness via Semi-supervised Robust Training [PDF] 返回目录
  Yiming Li, Baoyuan Wu, Yan Feng, Yanbo Fan, Yong Jiang, Zhifeng Li, Shutao Xia
Abstract: Adversarial examples have been shown to be a severe threat to deep neural networks (DNNs). One of the most effective adversarial defense methods is adversarial training (AT), which minimizes the adversarial risk $R_{adv}$, encouraging both the benign example $x$ and its adversarially perturbed neighborhood within the $\ell_{p}$-ball to be predicted as the ground-truth label. In this work, we propose a novel defense method, robust training (RT), which jointly minimizes two separate risks ($R_{stand}$ and $R_{rob}$), defined with respect to the benign example and its neighborhood respectively. The motivation is to explicitly and jointly enhance accuracy and adversarial robustness. We prove that $R_{adv}$ is upper-bounded by $R_{stand} + R_{rob}$, which implies that RT has a similar effect to AT. Intuitively, minimizing the standard risk enforces that the benign example is correctly predicted, while robust risk minimization encourages the predictions of the neighboring examples to be consistent with the prediction of the benign example. Besides, since $R_{rob}$ is independent of the ground-truth label, RT naturally extends to a semi-supervised mode ($i.e.$, SRT), further enhancing adversarial robustness. Moreover, we extend the $\ell_{p}$-bounded neighborhood to a general case, which covers different types of perturbations, such as pixel-wise ($i.e.$, $x + \delta$) or spatial perturbations ($i.e.$, $Ax + b$). Extensive experiments on benchmark datasets not only verify the superiority of the proposed SRT method over state-of-the-art methods for defending against pixel-wise or spatial perturbations separately, but also demonstrate its robustness to both perturbations simultaneously. The code for reproducing the main results is available at \url{this https URL}.
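A minimal PyTorch sketch of the two-risk objective follows. For brevity it uses a single random neighbour instead of a worst-case one, and a KL consistency term as $R_{rob}$; these are assumptions for illustration, but the key point, that the robustness term needs no labels and can therefore consume unlabelled data, carries over.

```python
import torch
import torch.nn.functional as F

def robust_training_loss(model, x_lab, y_lab, x_unlab, eps=8 / 255, lam=1.0):
    """Jointly minimize a standard risk and a label-free robustness risk (sketch)."""
    # R_stand: correct prediction on the benign labelled examples.
    r_stand = F.cross_entropy(model(x_lab), y_lab)

    def r_rob(x):
        delta = torch.empty_like(x).uniform_(-eps, eps)   # one random neighbour
        p_clean = F.softmax(model(x), dim=1).detach()
        logp_adv = F.log_softmax(model(x + delta), dim=1)
        # Prediction on the neighbour should match the benign prediction.
        return F.kl_div(logp_adv, p_clean, reduction="batchmean")

    # The robustness term needs no labels, so unlabelled data folds in (SRT).
    return r_stand + lam * (r_rob(x_lab) + r_rob(x_unlab))
```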

104. OmniTact: A Multi-Directional High Resolution Touch Sensor [PDF] 返回目录
  Akhil Padmanabha, Frederik Ebert, Stephen Tian, Roberto Calandra, Chelsea Finn, Sergey Levine
Abstract: Incorporating touch as a sensing modality for robots can enable finer and more robust manipulation skills. Existing tactile sensors are either flat, have small sensitive fields or only provide low-resolution signals. In this paper, we introduce OmniTact, a multi-directional high-resolution tactile sensor. OmniTact is designed to be used as a fingertip for robotic manipulation with robotic hands, and uses multiple micro-cameras to detect multi-directional deformations of a gel-based skin. This provides a rich signal from which a variety of different contact state variables can be inferred using modern image processing and computer vision methods. We evaluate the capabilities of OmniTact on a challenging robotic control task that requires inserting an electrical connector into an outlet, as well as a state estimation problem that is representative of those typically encountered in dexterous robotic manipulation, where the goal is to infer the angle of contact of a curved finger pressing against an object. Both tasks are performed using only touch sensing and deep convolutional neural networks to process images from the sensor's cameras. We compare with a state-of-the-art tactile sensor that is only sensitive on one side, as well as a state-of-the-art multi-directional tactile sensor, and find that OmniTact's combination of high-resolution and multi-directional sensing is crucial for reliably inserting the electrical connector and allows for higher accuracy in the state estimation task. Videos and supplementary material can be found at this https URL

105. Output Diversified Initialization for Adversarial Attacks [PDF] 返回目录
  Yusuke Tashiro, Yang Song, Stefano Ermon
Abstract: Adversarial examples are often constructed by iteratively refining a randomly perturbed input. To improve diversity and thus also the success rates of attacks, we propose Output Diversified Initialization (ODI), a novel random initialization strategy that can be combined with most existing white-box adversarial attacks. Instead of using uniform perturbations in the input space, we seek diversity in the output logits space of the target model. Empirically, we demonstrate that existing $\ell_\infty$ and $\ell_2$ adversarial attacks with ODI become much more efficient on several datasets including MNIST, CIFAR-10 and ImageNet, reducing the accuracy of recently proposed defense models by 1--17\%. Moreover, PGD attack with ODI outperforms current state-of-the-art attacks against robust models, while also being roughly 50 times faster on CIFAR-10. The code is available on this https URL.
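A short sketch of the initialization idea, assuming an $\ell_\infty$ threat model and treating the number of ODI steps and the step size as free hyper-parameters.

```python
import torch

def odi_init(model, x, eps, n_steps=2, step_size=None):
    """Output Diversified Initialization (sketch): before running PGD,
    push the *logits* along a random direction w instead of perturbing
    the input uniformly."""
    step_size = step_size if step_size is not None else eps
    w = torch.empty(model(x).shape[-1], device=x.device).uniform_(-1, 1)
    x_adv = x.clone()
    for _ in range(n_steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = (model(x_adv) * w).sum()             # diversify along w in logit space
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)    # stay inside the l_inf ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv                                    # starting point for PGD
```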

106. Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks [PDF] 返回目录
  Amir Rahimi, Amirreza Shaban, Ching-An Cheng, Byron Boots, Richard Hartley
Abstract: Predicting calibrated confidence scores for multi-class deep networks is important for avoiding rare but costly mistakes. A common approach is to learn a post-hoc calibration function that transforms the output of the original network into calibrated confidence scores while maintaining the network's accuracy. However, previous post-hoc calibration techniques work only with simple calibration functions, potentially lacking sufficient representation to calibrate the complex function landscape of deep networks. In this work, we aim to learn general post-hoc calibration functions that can preserve the top-k predictions of any deep network. We call this family of functions intra order-preserving functions. We propose a new neural network architecture that represents a class of intra order-preserving functions by combining common neural network components. Additionally, we introduce order-invariant and diagonal sub-families, which can act as regularization for better generalization when the training data size is small. We show the effectiveness of the proposed method across a wide range of datasets and classifiers. Our method outperforms state-of-the-art post-hoc calibration methods, namely temperature scaling and Dirichlet calibration, in multiple settings.
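To illustrate what an intra order-preserving function can look like, here is a hedged PyTorch sketch: sort the logits, generate strictly positive gaps so that the sorted order (and hence every top-k prediction) is preserved, then unsort. The inner network `g` is an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraOrderPreservingCalibrator(nn.Module):
    """Calibration map that cannot change top-k predictions (sketch)."""
    def __init__(self, n_classes, hidden=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(n_classes, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_classes))

    def forward(self, logits):
        sorted_l, idx = torch.sort(logits, dim=-1, descending=True)
        inc = F.softplus(self.g(sorted_l))          # strictly positive gaps
        # Build a strictly decreasing sequence: keep the top logit, then
        # subtract cumulative gaps, so the original ranking is preserved.
        offsets = torch.cumsum(
            torch.cat([torch.zeros_like(inc[..., :1]), inc[..., :-1]], dim=-1),
            dim=-1)
        out_sorted = sorted_l[..., :1] - offsets
        out = torch.empty_like(out_sorted)
        out.scatter_(-1, idx, out_sorted)           # undo the sort
        return out
```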

107. Towards Privacy Protection by Generating Adversarial Identity Masks [PDF] 返回目录
  Xiao Yang, Yinpeng Dong, Tianyu Pang, Jun Zhu, Hang Su
Abstract: As billions of personal data items such as photos are shared through social media and networks, the privacy and security of data have drawn increasing attention. Several attempts have been made to alleviate the leakage of identity information with the aid of image obfuscation techniques. However, most of the present results are either perceptually unsatisfactory or ineffective against real-world recognition systems. In this paper, we argue that an algorithm for privacy protection must block the ability of automatic inference of the identity and, at the same time, make the resultant image natural from the users' point of view. To achieve this, we propose a targeted identity-protection iterative method (TIP-IM), which can generate natural face images by adding adversarial identity masks to conceal one's identity against a recognition system. Extensive experiments on various state-of-the-art face recognition models demonstrate the effectiveness of our proposed method in alleviating the identity leakage of face images, without sacrificing the visual quality of the protected images.
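A rough PyTorch sketch of the underlying mechanism (an iterative, norm-bounded perturbation pushing the face embedding toward a different target identity); TIP-IM additionally optimizes the naturalness of the result, which is omitted here.

```python
import torch
import torch.nn.functional as F

def identity_mask(face_model, x, target_emb, eps=8 / 255, steps=40, alpha=1 / 255):
    """Targeted identity-protection perturbation (sketch).

    target_emb: embedding of a *different* identity (assumed L2-normalized).
    """
    x_adv = x.clone()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        emb = F.normalize(face_model(x_adv), dim=-1)
        loss = F.cosine_similarity(emb, target_emb, dim=-1).mean()
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()   # ascend toward the target id
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # imperceptibility budget
        x_adv = x_adv.clamp(0, 1)
    return x_adv
```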

108. Experimenting with Convolutional Neural Network Architectures for the automatic characterization of Solitary Pulmonary Nodules' malignancy rating [PDF] 返回目录
  Ioannis D. Apostolopoulos
Abstract: Lung cancer is the most common cause of cancer-related death worldwide. Early and automatic diagnosis of Solitary Pulmonary Nodules (SPN) in Computed Tomography (CT) chest scans can provide early treatment as well as liberate doctors from time-consuming procedures. Deep learning has proven to be a popular and influential method in many medical imaging diagnosis areas. In this study, we consider the problem of diagnostic classification between benign and malignant lung nodules in CT images derived from a PET/CT scanner. More specifically, we intend to develop experimental Convolutional Neural Network (CNN) architectures and conduct experiments, by tuning their parameters, to investigate their behavior and to define the optimal setup for accurate classification. For the experiments, we utilize PET/CT images obtained from the Laboratory of Nuclear Medicine of the University of Patras, and the publicly available database called Lung Image Database Consortium Image Collection (LIDC-IDRI). Furthermore, we apply simple data augmentation to generate new instances and to inspect the performance of the developed networks. Classification accuracies of 91% and 93% are achieved on the PET/CT dataset and on a selection of nodule images from the LIDC-IDRI dataset, respectively. The results demonstrate that CNNs are a trustworthy method for nodule classification. Also, the experiments confirm that data augmentation enhances the robustness of the CNNs.

109. A proto-object based audiovisual saliency map [PDF] 返回目录
  Sudarshan Ramenahalli
Abstract: The natural environment and our interaction with it are essentially multisensory: we may deploy visual, tactile and/or auditory senses to perceive, learn and interact with our environment. Our objective in this study is to develop a scene analysis algorithm using multisensory information, specifically vision and audio. We develop a proto-object based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes. A specialized audiovisual camera with a $360^\circ$ field of view, capable of locating sound direction, is used to collect spatiotemporally aligned audiovisual data. We demonstrate that the performance of the proto-object based audiovisual saliency map in detecting and localizing salient objects/events is in agreement with human judgment. In addition, the proto-object based AVSM, which we compute as a linear combination of visual and auditory feature conspicuity maps, captures a higher number of valid salient events compared to unisensory saliency maps. Such an algorithm can be useful in surveillance, robotic navigation, video compression and related applications.

110. Interactive Neural Style Transfer with Artists [PDF] 返回目录
  Thomas Kerdreux, Louis Thiry, Erwan Kerdreux
Abstract: We present interactive painting processes in which a painter and various neural style transfer algorithms interact on a real canvas. Understanding what these algorithms' outputs achieve is then paramount to describing the creative agency in our interactive experiments. We gather a set of paired painting-picture images and present a new evaluation methodology based on the predictivity of neural style transfer algorithms. We point out some algorithms' instabilities and show that they can be used to enlarge the diversity and pleasing oddity of the images synthesized by the numerous existing neural style transfer algorithms. This diversity of images was perceived as a source of inspiration for human painters, portraying the machine as a computational catalyst.

111. Investigating Generalization in Neural Networks under Optimally Evolved Training Perturbations [PDF] 返回目录
  Subhajit Chaudhury, Toshihiko Yamasaki
Abstract: In this paper, we study the generalization properties of neural networks under input perturbations and show that minimal training-data corruption by a few pixel modifications can cause drastic overfitting. We propose an evolutionary algorithm to search for optimal pixel perturbations, using a novel cost function, inspired by the domain adaptation literature, that explicitly maximizes the generalization gap and the domain divergence between clean and corrupted images. Our method outperforms previous pixel-based data distribution shift methods on state-of-the-art Convolutional Neural Network (CNN) architectures. Interestingly, we find that the choice of optimizer plays an important role in generalization robustness, due to the empirical observation that SGD is resilient to such training-data corruption, unlike adaptive optimization techniques (ADAM). Our source code is available at this https URL.

112. Rapid Whole Slide Imaging via Learning-based Two-shot Virtual Autofocusing [PDF] 返回目录
  Qiang Li, Xianming Liu, Kaige Han, Cheng Guo, Xiangyang Ji, Xiaolin Wu
Abstract: Whole slide imaging (WSI) is an emerging technology for digital pathology. The autofocusing process is the main factor influencing the performance of WSI. Traditional autofocusing methods either are time-consuming due to repetitive mechanical motions, or require additional hardware and thus are not compatible with current WSI systems. In this paper, we propose the concept of \textit{virtual autofocusing}, which does not rely on mechanical adjustment to conduct refocusing but instead recovers in-focus images in an offline, learning-based manner. Given the initial focal position, we perform only two-shot imaging, whereas traditional methods commonly need to shoot as many as 21 images per tile during scanning. Considering that the two captured out-of-focus images retain pieces of partial information about the underlying in-focus image, we propose a U-Net-inspired deep neural network based approach for fusing them into a recovered in-focus image. The proposed scheme enables fast scanning of tissue slides, allowing high-throughput generation of digital pathology images. Experimental results demonstrate that our scheme achieves satisfactory refocusing performance.

113. VarMixup: Exploiting the Latent Space for Robust Training and Inference [PDF] 返回目录
  Puneet Mangla, Vedant Singh, Shreyas Jayant Havaldar, Vineeth N Balasubramanian
Abstract: The vulnerability of Deep Neural Networks (DNNs) to adversarial attacks has led to the development of many defense approaches. Among them, Adversarial Training (AT) is a popular and widely used approach for training adversarially robust models. Mixup Training (MT), a recent popular training algorithm, improves the generalization performance of models by introducing globally linear behavior in between training examples. Although still in its early phase, we observe a shift in trend from exploiting Mixup for generalization towards exploiting it for adversarial robustness. It has been shown that Mixup-trained models improve the robustness of models, but only passively. A recent approach, Mixup Inference (MI), proposes an inference principle for Mixup-trained models to counter adversarial examples at inference time by mixing the input with other random clean samples. In this work, we propose a new approach - \textit{VarMixup (Variational Mixup)} - to better sample mixup images by using the latent manifold underlying the data. Our experiments on CIFAR-10, CIFAR-100, SVHN and Tiny-Imagenet demonstrate that \textit{VarMixup} beats state-of-the-art AT techniques without training the model adversarially. Additionally, we conduct ablations showing that models trained on \textit{VarMixup} samples are also robust to various input corruptions/perturbations, have low calibration error, and are transferable.
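A minimal sketch of mixing in a learned latent space, assuming a pre-trained VAE with `encode`/`decode` methods; the paper's exact construction (including how labels are mixed) may differ in detail.

```python
import torch

def var_mixup_batch(vae, x1, x2, alpha=1.0):
    """Mix two image batches on the latent manifold instead of in pixel space
    (sketch). `vae.encode`/`vae.decode` are assumed interfaces."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    z1, z2 = vae.encode(x1), vae.encode(x2)          # latent codes
    z_mix = lam * z1 + (1 - lam) * z2                # linear path on the manifold
    x_mix = vae.decode(z_mix)
    return x_mix, lam                                # mix labels with the same lam
```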

114. Boundary Guidance Hierarchical Network for Real-Time Tongue Segmentation [PDF] 返回目录
  Xinyi Zeng, Qian Zhang, Jia Chen, Guixu Zhang, Aimin Zhou, Yiqin Wang
Abstract: Automated tongue segmentation in tongue images is a challenging task for two reasons: 1) there are many pathological details on the tongue surface, which affect the extraction of the boundary; 2) the shapes of tongues captured from different persons (with different diseases) vary greatly. To deal with this challenge, a novel end-to-end Boundary Guidance Hierarchical Network (BGHNet) with a new hybrid loss is proposed in this paper. In the new approach, a Context Feature Encoder Module (CFEM) is first built upon the bottom-up pathway to counteract the shrinkage of the receptive field. Secondly, a novel Hierarchical Recurrent Feature Fusion Module (HRFFM) is adopted to progressively and hierarchically refine object maps, recovering image details by integrating local context information. Finally, the proposed hybrid loss, defined over a four-level hierarchy (pixel, patch, map and boundary), guides the network to effectively segment the tongue regions with accurate tongue boundaries. BGHNet is applied to a set of tongue images. The experimental results suggest that the proposed approach achieves the latest state-of-the-art tongue segmentation performance. Meanwhile, the lightweight network contains only 15.45M parameters and requires only 11.22 GFLOPs.

115. Leveraging Vision and Kinematics Data to Improve Realism of Biomechanic Soft-tissue Simulation for Robotic Surgery [PDF] 返回目录
  Jie Ying Wu, Peter Kazanzides, Mathias Unberath
Abstract: Purpose: Surgical simulations play an increasingly important role in surgeon education and in developing algorithms that enable robots to perform surgical subtasks. To model anatomy, Finite Element Method (FEM) simulations have been held as the gold standard for calculating accurate soft-tissue deformation. Unfortunately, their accuracy is highly dependent on the simulation parameters, which can be difficult to obtain. Methods: In this work, we investigate how live data acquired during any robotic endoscopic surgical procedure may be used to correct for inaccurate FEM simulation results. Since FEMs are calculated from initial parameters and cannot directly incorporate observations, we propose to add a correction factor that accounts for the discrepancy between simulation and observations. We train a network to predict this correction factor. Results: To evaluate our method, we use an open-source da Vinci Surgical System to probe a soft-tissue phantom and replay the interaction in simulation. We train the network to correct for the difference between the predicted mesh position and the measured point cloud. This results in a 15-30% improvement in the mean distance, demonstrating the effectiveness of our approach across a large range of simulation parameters. Conclusion: We show a first step towards a framework that synergistically combines the benefits of model-based simulation and real-time observations. It corrects discrepancies between simulation and the scene that result from inaccurate modeling parameters. This can provide a more accurate simulation environment for surgeons and better data with which to train algorithms.

116. A Privacy-Preserving DNN Pruning and Mobile Acceleration Framework [PDF] 返回目录
  Zheng Zhan, Yifan Gong, Zhengang Li, Pu Zhao, Xiaolong Ma, Wei Niu, Xiaolin Xu, Bin Ren, Yanzhi Wang, Xue Lin
Abstract: To facilitate the deployment of deep neural networks (DNNs) on resource-constrained computing systems, DNN model compression methods have been proposed. However, previous methods mainly focus on reducing the model size and/or improving hardware performance, without considering the data privacy requirement. This paper proposes a privacy-preserving model compression framework that formulates a privacy-preserving DNN weight pruning problem and develops an ADMM-based solution to support different weight pruning schemes. We consider the case where the system designer performs weight pruning on a pre-trained model provided by the client, while the client cannot share her confidential training dataset. To mitigate the non-availability of the training dataset, the system designer distills the knowledge of the pre-trained model into a pruned model using only randomly generated synthetic data. The client's effort is then reduced to performing a retraining process using her confidential training dataset, which is similar to the DNN training process, with the help of the mask function from the system designer. Both algorithmic and hardware experiments validate the effectiveness of the proposed framework.
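A minimal sketch of the data-free step, assuming Gaussian noise as the synthetic input distribution and plain temperature-scaled distillation; the ADMM pruning itself is omitted.

```python
import torch
import torch.nn.functional as F

def distill_on_synthetic(teacher, student, input_shape, steps=1000, T=4.0, lr=1e-3):
    """Fit the pruned (student) model to the pre-trained (teacher) model
    without ever touching the client's data (sketch)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(32, *input_shape)                # synthetic batch
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                        F.softmax(t_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        opt.zero_grad()
        loss.backward()
        opt.step()
```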
