Table of Contents
6. LaNet: Real-time Lane Identification by Learning Road Surface Characteristics from Accelerometer Data [PDF] Abstract
7. Computer Vision and Abnormal Patient Gait Assessment a Comparison of Machine Learning Models [PDF] Abstract
11. Mapping individual differences in cortical architecture using multi-view representation learning [PDF] Abstract
14. The P-DESTRE: A Fully Annotated Dataset for Pedestrian Detection, Tracking, Re-Identification and Search from Aerial Devices [PDF] Abstract
20. Detection and skeletonization of single neurons and tracer injections using topological methods [PDF] Abstract
22. Semantic Segmentation of highly class imbalanced fully labelled 3D volumetric biomedical images and unsupervised Domain Adaptation of the pre-trained Segmentation Network to segment another fully unlabelled Biomedical 3D Image stack [PDF] Abstract
27. COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images [PDF] Abstract
46. Detecting the Saliency of Remote Sensing Images Based on Sparse Representation of Contrast-weighted Atoms [PDF] Abstract
49. Hyper-spectral NIR and MIR data and optimal wavebands for detecting of apple trees diseases [PDF] Abstract
51. Feature Super-Resolution Based Facial Expression Recognition for Multi-scale Low-Resolution Faces [PDF] Abstract
53. Light Field Spatial Super-resolution via Deep Combinatorial Geometry Embedding and Structural Consistency Regularization [PDF] Abstract
61. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation [PDF] Abstract
64. Adversarial-Prediction Guided Multi-task Adaptation for Semantic Segmentation of Electron Microscopy Images [PDF] Abstract
68. Attentive One-Dimensional Heatmap Regression for Facial Landmark Detection and Tracking [PDF] Abstract
75. FAIRS -- Soft Focus Generator and Attention for Robust Object Segmentation from Extreme Points [PDF] Abstract
78. SimAug: Learning Robust Representations from 3D Simulation for Pedestrian Trajectory Prediction in Unseen Cameras [PDF] Abstract
81. Cross-domain Face Presentation Attack Detection via Multi-domain Disentangled Representation Learning [PDF] Abstract
84. Understanding (Non-)Robust Feature Disentanglement and the Relationship Between Low- and High-Dimensional Adversarial Attacks [PDF] Abstract
91. Google Landmarks Dataset v2 -- A Large-Scale Benchmark for Instance-Level Recognition and Retrieval [PDF] Abstract
97. Application of Structural Similarity Analysis of Visually Salient Areas and Hierarchical Clustering in the Screening of Similar Wireless Capsule Endoscopic Images [PDF] Abstract
99. Investigating Image Applications Based on Spatial-Frequency Transform and Deep Learning Techniques [PDF] Abstract
100. Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation and Diagnosis for COVID-19 [PDF] Abstract
104. Dynamic Decision Boundary for One-class Classifiers applied to non-uniformly Sampled Data [PDF] Abstract
106. CondenseUNet: A Memory-Efficient Condensely-Connected Architecture for Bi-ventricular Blood Pool and Myocardium Segmentation [PDF] Abstract
109. Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System [PDF] Abstract
120. Theoretical Insights into the Use of Structural Similarity Index In Generative Models and Inferential Autoencoders [PDF] Abstract

Abstracts
1. DualSDF: Semantic Shape Manipulation using a Two-Level Representation [PDF] Back to Contents
Zekun Hao, Hadar Averbuch-Elor, Noah Snavely, Serge Belongie
Abstract: We are seeing a Cambrian explosion of 3D shape representations for use in machine learning. Some representations seek high expressive power in capturing high-resolution detail. Other approaches seek to represent shapes as compositions of simple parts, which are intuitive for people to understand and easy to edit and manipulate. However, it is difficult to achieve both fidelity and interpretability in the same representation. We propose DualSDF, a representation expressing shapes at two levels of granularity, one capturing fine details and the other representing an abstracted proxy shape using simple and semantically consistent shape primitives. To achieve a tight coupling between the two representations, we use a variational objective over a shared latent space. Our two-level model gives rise to a new shape manipulation technique in which a user can interactively manipulate the coarse proxy shape and see the changes instantly mirrored in the high-resolution shape. Moreover, our model actively augments and guides the manipulation towards producing semantically meaningful shapes, making complex manipulations possible with minimal user input.
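To make the two-level idea concrete, the sketch below shows the kind of coarse proxy such a representation can use: a shape expressed as a union of simple primitives whose signed distance is the minimum over per-primitive distances. The sphere primitives, the helper name coarse_sdf, and all numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def coarse_sdf(points, centers, radii):
    """Signed distance from query points to a union of sphere primitives.

    points:  (N, 3) query locations
    centers: (K, 3) sphere centers
    radii:   (K,)   sphere radii
    """
    # Distance from every query point to every sphere surface.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1) - radii[None, :]
    # The union of primitives is the minimum over per-primitive distances.
    return d.min(axis=1)

# Toy proxy made of two spheres; negative values are inside the shape.
centers = np.array([[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
radii = np.array([0.4, 0.4])
queries = np.array([[0.0, 0.0, 0.0], [-0.5, 0.0, 0.0]])
print(coarse_sdf(queries, centers, radii))  # [ 0.1 -0.4]
```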
2. Rethinking Spatially-Adaptive Normalization [PDF] Back to Contents
Zhentao Tan, Dongdong Chen, Qi Chu, Menglei Chai, Jing Liao, Mingming He, Lu Yuan, Nenghai Yu
Abstract: Spatially-adaptive normalization has recently been remarkably successful in conditional semantic image synthesis, which modulates the normalized activation with spatially-varying transformations learned from semantic layouts, to preserve the semantic information from being washed away. Despite its impressive performance, a more thorough understanding of the true advantages inside the box is still highly demanded, to help reduce the significant computation and parameter overheads introduced by these new structures. In this paper, from a return-on-investment point of view, we present a deep analysis of the effectiveness of SPADE and observe that its advantages actually come mainly from its semantic-awareness rather than the spatial-adaptiveness. Inspired by this point, we propose class-adaptive normalization (CLADE), a lightweight variant that is not adaptive to spatial positions or layouts. Benefiting from this design, CLADE greatly reduces the computation cost while still being able to preserve the semantic information during the generation. Extensive experiments on multiple challenging datasets demonstrate that the resulting fidelity is on par with SPADE while the overhead is much cheaper. Taking the generator for the ADE20k dataset as an example, the extra parameter and computation cost introduced by CLADE are only 4.57% and 0.07%, while those of SPADE are 39.21% and 234.73%, respectively.
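The contrast between spatially-adaptive and class-adaptive modulation can be sketched in a few lines. Below is a hedged illustration of a CLADE-like layer: instead of predicting spatially-varying modulation from the layout with a conv network (as SPADE does), it looks up one (gamma, beta) pair per semantic class. The class name CLADELikeNorm and all sizes are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CLADELikeNorm(nn.Module):
    """Class-adaptive normalization sketch: per-class modulation parameters
    are looked up from an embedding table rather than predicted spatially."""

    def __init__(self, num_features, num_classes):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # One (gamma, beta) pair per semantic class -- a simple embedding table.
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)

    def forward(self, x, segmap):
        # x: (N, C, H, W) activations; segmap: (N, H, W) integer class labels.
        x = self.norm(x)
        gamma = self.gamma(segmap).permute(0, 3, 1, 2)  # (N, C, H, W)
        beta = self.beta(segmap).permute(0, 3, 1, 2)
        return x * (1 + gamma) + beta

x = torch.randn(2, 64, 32, 32)
segmap = torch.randint(0, 151, (2, 32, 32))  # e.g. 151 ADE20k classes
print(CLADELikeNorm(64, 151)(x, segmap).shape)  # torch.Size([2, 64, 32, 32])
```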
3. There and Back Again: Revisiting Backpropagation Saliency Methods [PDF] Back to Contents
Sylvestre-Alvise Rebuffi, Ruth Fong, Xu Ji, Andrea Vedaldi
Abstract: Saliency methods seek to explain the predictions of a model by producing an importance map across each input sample. A popular class of such methods is based on backpropagating a signal and analyzing the resulting gradient. Despite much research on such methods, relatively little work has been done to clarify the differences between such methods as well as the desiderata of these techniques. Thus, there is a need for rigorously understanding the relationships between different methods as well as their failure modes. In this work, we conduct a thorough analysis of backpropagation-based saliency methods and propose a single framework under which several such methods can be unified. As a result of our study, we make three additional contributions. First, we use our framework to propose NormGrad, a novel saliency method based on the spatial contribution of gradients of convolutional weights. Second, we combine saliency maps at different layers to test the ability of saliency methods to extract complementary information at different network levels (e.g., trading off spatial resolution and distinctiveness) and we explain why some methods fail at specific layers (e.g., Grad-CAM anywhere besides the last convolutional layer). Third, we introduce a class-sensitivity metric and a meta-learning inspired paradigm applicable to any saliency method for improving sensitivity to the output class being explained.
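A minimal sketch of a NormGrad-style saliency computation, under the assumption that the simplest variant reduces to combining the spatial norms of activations and backpropagated gradients at a chosen layer; the helper norm_grad_saliency and the choice of layer are illustrative, not the authors' exact formulation.

```python
import torch
import torchvision.models as models

def norm_grad_saliency(model, layer, image, target_class):
    """At each spatial location of `layer`, combine the norm of the forward
    activation with the norm of the backpropagated gradient."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image)[0, target_class]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    a, g = acts[0], grads[0]                    # (1, C, H, W) each
    saliency = a.norm(dim=1) * g.norm(dim=1)    # (1, H, W)
    return saliency.detach()

model = models.resnet18(weights=None).eval()
image = torch.randn(1, 3, 224, 224)
sal = norm_grad_saliency(model, model.layer4, image, target_class=0)
print(sal.shape)  # torch.Size([1, 7, 7])
```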
4. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments [PDF] Back to Contents
Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee
Abstract: We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings as well as single-modality baselines. While some of these techniques transfer, we find significantly lower absolute performance in the continuous setting -- suggesting that performance in prior 'navigation-graph' settings may be inflated by the strong implicit assumptions.
5. Optical Flow Estimation in the Deep Learning Age [PDF] Back to Contents
Junhwa Hur, Stefan Roth
Abstract: Akin to many subareas of computer vision, the recent advances in deep learning have also significantly influenced the literature on optical flow. Previously, the literature had been dominated by classical energy-based models, which formulate optical flow estimation as an energy minimization problem. However, as the practical benefits of Convolutional Neural Networks (CNNs) over conventional methods have become apparent in numerous areas of computer vision and beyond, they have also seen increased adoption in the context of motion estimation to the point where the current state of the art in terms of accuracy is set by CNN approaches. We first review this transition as well as the developments from early work to the current state of CNNs for optical flow estimation. Alongside, we discuss some of their technical details and compare them to recapitulate which technical contribution led to the most significant accuracy improvements. Then we provide an overview of the various optical flow approaches introduced in the deep learning age, including those based on alternative learning paradigms (e.g., unsupervised and semi-supervised methods) as well as the extension to the multi-frame case, which is able to yield further accuracy improvements.
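For reference, the classical energy that such methods minimize can be written in the Horn-Schunck form, with a brightness-constancy data term balanced against a smoothness term by a weight lambda:

```latex
% Classical energy-based formulation (Horn-Schunck form): the flow (u, v)
% minimizes a brightness-constancy data term plus a lambda-weighted
% smoothness term over the image domain Omega.
E(u, v) = \int_{\Omega} \big( I_x u + I_y v + I_t \big)^2
        + \lambda \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \, dx \, dy
```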
6. LaNet: Real-time Lane Identification by Learning Road Surface Characteristics from Accelerometer Data [PDF] Back to Contents
Madhumitha Harishankar, Jun Han, Sai Vineeth Kalluru Srinivas, Faisal Alqarni, Shi Su, Shijia Pan, Hae Young Noh, Pei Zhang, Marco Gruteser, Patrick Tague
Abstract: The resolution of GPS measurements, especially in urban areas, is insufficient for identifying a vehicle's lane. In this work, we develop a deep LSTM neural network model LaNet that determines the lane vehicles are on by periodically classifying accelerometer samples collected by vehicles as they drive in real time. Our key finding is that even adjacent patches of road surfaces contain characteristics that are sufficiently unique to differentiate between lanes, i.e., roads inherently exhibit differing bumps, cracks, potholes, and surface unevenness. Cars can capture this road surface information as they drive using inexpensive, easy-to-install accelerometers that increasingly come fitted in cars and can be accessed via the CAN-bus. We collect an aggregate of 60 km of driving data and synthesize more based on it, capturing factors such as variable driving speed, vehicle suspensions, and accelerometer noise. Our formulated LSTM-based deep learning model, LaNet, learns lane-specific sequences of road surface events (bumps, cracks etc.) and yields 100% lane classification accuracy with 200 meters of driving data, achieving over 90% with just 100 m (corresponding to roughly one minute of driving). We design the LaNet model to be practical for use in real-time lane classification and show with extensive experiments that LaNet yields high classification accuracy even on smooth roads, on large multi-lane roads, and on drives with frequent lane changes. Since different road surfaces have different inherent characteristics or entropy, we excavate our neural network model and discover a mechanism to easily characterize the achievable classification accuracies in a road over various driving distances by training the model just once. We present LaNet as a low-cost, easily deployable and highly accurate way to achieve fine-grained lane identification.
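A minimal sketch of the kind of LSTM classifier described above, mapping a window of 3-axis accelerometer samples to lane logits; the class name LaneLSTM, layer sizes, and window length are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LaneLSTM(nn.Module):
    """Classify a window of accelerometer readings into one of several lanes."""

    def __init__(self, num_lanes, in_dim=3, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_lanes)

    def forward(self, x):
        # x: (batch, time, 3) -- a window of (ax, ay, az) accelerometer samples.
        _, (h, _) = self.lstm(x)
        return self.head(h[-1])  # logits over lanes

model = LaneLSTM(num_lanes=4)
window = torch.randn(8, 512, 3)  # 8 windows of 512 samples each
print(model(window).shape)  # torch.Size([8, 4])
```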
7. Computer Vision and Abnormal Patient Gait Assessment a Comparison of Machine Learning Models [PDF] Back to Contents
Jasmin Hundall, Benson A. Babu
Abstract: Abnormal gait and its associated falls and complications carry high patient morbidity and mortality. Computer vision detects and predicts patient gait abnormalities, assesses fall risk, and serves as a clinical decision support tool for physicians. This paper performs a systematic review of how computer vision and machine learning models perform abnormal patient gait assessment. Computer vision is beneficial in gait analysis; it helps capture the patient's posture. Several studies suggest the use of different machine learning algorithms such as SVM, ANN, K-Star, Random Forest, and KNN, among others, to perform classification on the features extracted to study patient gait abnormalities.
8. DAISI: Database for AI Surgical Instruction [PDF] Back to Contents
Edgar Rojas-Muñoz, Kyle Couperus, Juan Wachs
Abstract: Telementoring surgeons as they perform surgery can be essential in the treatment of patients when in situ expertise is not available. Nonetheless, expert mentors are often unavailable to provide trainees with real-time medical guidance. When mentors are unavailable, a fallback autonomous mechanism should provide medical practitioners with the required guidance. However, AI/autonomous mentoring in medicine has been limited by the availability of generalizable prediction models, and surgical procedures datasets to train those models with. This work presents the initial steps towards the development of an intelligent artificial system for autonomous medical mentoring. Specifically, we present the first Database for AI Surgical Instruction (DAISI). DAISI leverages on images and instructions to provide step-by-step demonstrations of how to perform procedures from various medical disciplines. The dataset was acquired from real surgical procedures and data from academic textbooks. We used DAISI to train an encoder-decoder neural network capable of predicting medical instructions given a current view of the surgery. Afterwards, the instructions predicted by the network were evaluated using cumulative BLEU scores and input from expert physicians. According to the BLEU scores, the predicted and ground truth instructions were as high as 67% similar. Additionally, expert physicians subjectively assessed the algorithm using a Likert scale, and considered that the predicted descriptions were related to the images. This work provides a baseline for AI algorithms to assist in autonomous medical mentoring.
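For reference, cumulative BLEU of the kind used in this evaluation can be computed with NLTK as below; the instruction sentences are invented placeholders, not DAISI data.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Cumulative BLEU compares a predicted instruction against a reference.
reference = [["insert", "the", "trocar", "under", "direct", "vision"]]
predicted = ["insert", "the", "trocar", "under", "vision"]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # Cumulative BLEU-n averages uniform n-gram precisions up to order n.
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(reference, predicted, weights=weights, smoothing_function=smooth)
    print(f"cumulative BLEU-{n}: {score:.3f}")
```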
9. High-Dimensional Data Set Simplification by Laplace-Beltrami Operator [PDF] Back to Contents
Chenkai Xu, Hongwei Lin
Abstract: With the development of the Internet and other digital technologies, the speed of data generation has become considerably faster than the speed of data processing. Because big data typically contain massive redundant information, it is possible to significantly simplify a big data set while maintaining the key information it contains. In this paper, we develop a big data simplification method based on the eigenvalues and eigenfunctions of the Laplace-Beltrami operator (LBO). Specifically, given a data set that can be considered as an unorganized data point set in high-dimensional space, a discrete LBO defined on the big data set is constructed and its eigenvalues and eigenvectors are calculated. Then, the local extremum and the saddle points of the eigenfunctions are proposed to be the feature points of a data set in high-dimensional space, constituting a simplified data set. Moreover, we develop feature point detection methods for the functions defined on an unorganized data point set in high-dimensional space, and devise metrics for measuring the fidelity of the simplified data set to the original set. Finally, examples and applications are demonstrated to validate the efficiency and effectiveness of the proposed methods, demonstrating that data set simplification is a method for processing a maximum-sized data set using a limited data processing capability.
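A hedged sketch of this pipeline, with a k-NN graph Laplacian standing in for the discrete Laplace-Beltrami operator: compute low-frequency eigenvectors, then take local extrema of an eigenfunction as feature points. Graph construction, parameter choices, and the extremum test are illustrative assumptions (the paper also uses saddle points, which are omitted here).

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))            # unorganized points in high-dim space

W = kneighbors_graph(X, n_neighbors=8, mode='distance')
W = 0.5 * (W + W.T)                        # symmetrize the k-NN graph
L = laplacian(W, normed=True)

# Low-frequency eigenfunctions of the Laplacian capture coarse structure.
vals, vecs = eigsh(L, k=6, which='SM')
phi = vecs[:, 1]                           # first non-trivial eigenfunction

# Feature points: local maxima of the eigenfunction over the graph.
neighbors = W.tolil().rows
is_max = np.array([all(phi[i] >= phi[j] for j in nbrs)
                   for i, nbrs in enumerate(neighbors)])
print("local maxima found:", int(is_max.sum()))
```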
10. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects [PDF] Back to Contents
Zewen Li, Wenjie Yang, Shouheng Peng, Fan Liu
Abstract: The Convolutional Neural Network (CNN) is one of the most significant networks in the deep learning field. Since CNNs have made impressive achievements in many areas, including but not limited to computer vision and natural language processing, they have attracted much attention from both industry and academia in the past few years. The existing reviews mainly focus on the applications of CNNs in different scenarios without considering CNNs from a general perspective, and some novel ideas proposed recently are not covered. In this review, we aim to provide novel ideas and prospects in this fast-growing field as much as possible. Besides, not only two-dimensional convolution but also one-dimensional and multi-dimensional ones are involved. First, this review starts with a brief introduction to the history of CNNs. Second, we provide an overview of CNNs. Third, classic and advanced CNN models are introduced, especially the key points that make them reach state-of-the-art results. Fourth, through experimental analysis, we draw some conclusions and provide several rules of thumb for function selection. Fifth, the applications of one-dimensional, two-dimensional, and multi-dimensional convolution are covered. Finally, some open issues and promising directions for CNNs are discussed to serve as guidelines for future work.
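The three convolution families the review covers differ only in the dimensionality of the sliding window, as the PyTorch snippet below illustrates.

```python
import torch
import torch.nn as nn

# One-, two-, and three-dimensional convolutions side by side.
conv1d = nn.Conv1d(16, 32, kernel_size=3, padding=1)  # e.g. audio, sensor streams
conv2d = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # e.g. images
conv3d = nn.Conv3d(1, 32, kernel_size=3, padding=1)   # e.g. video, volumetric scans

print(conv1d(torch.randn(4, 16, 100)).shape)        # torch.Size([4, 32, 100])
print(conv2d(torch.randn(4, 3, 64, 64)).shape)      # torch.Size([4, 32, 64, 64])
print(conv3d(torch.randn(4, 1, 16, 64, 64)).shape)  # torch.Size([4, 32, 16, 64, 64])
```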
11. Mapping individual differences in cortical architecture using multi-view representation learning [PDF] Back to Contents
Akrem Sellami, François-Xavier Dupé, Bastien Cagna, Hachem Kadri, Stéphane Ayache, Thierry Artières, Sylvain Takerkart
Abstract: In neuroscience, understanding inter-individual differences has recently emerged as a major challenge, for which functional magnetic resonance imaging (fMRI) has proven invaluable. For this, neuroscientists rely on basic methods such as univariate linear correlations between single brain features and a score that quantifies either the severity of a disease or the subject's performance in a cognitive task. However, to this date, task-fMRI and resting-state fMRI have been exploited separately for this question, because of the lack of methods to effectively combine them. In this paper, we introduce a novel machine learning method which allows combining the activation-and connectivity-based information respectively measured through these two fMRI protocols to identify markers of individual differences in the functional organization of the brain. It combines a multi-view deep autoencoder which is designed to fuse the two fMRI modalities into a joint representation space within which a predictive model is trained to guess a scalar score that characterizes the patient. Our experimental results demonstrate the ability of the proposed method to outperform competitive approaches and to produce interpretable and biologically plausible results.
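A minimal sketch of such a multi-view autoencoder: two encoders map task-fMRI and resting-state features into a shared latent code, from which both views are reconstructed and a scalar score is predicted. All names, dimensions, and the averaging fusion are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MultiViewAE(nn.Module):
    """Fuse two fMRI-derived feature vectors into one latent code and
    predict a scalar behavioral/clinical score from it."""

    def __init__(self, d_task, d_rest, d_latent=32):
        super().__init__()
        self.enc_task = nn.Sequential(nn.Linear(d_task, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.enc_rest = nn.Sequential(nn.Linear(d_rest, 128), nn.ReLU(), nn.Linear(128, d_latent))
        self.dec_task = nn.Linear(d_latent, d_task)
        self.dec_rest = nn.Linear(d_latent, d_rest)
        self.score = nn.Linear(d_latent, 1)

    def forward(self, x_task, x_rest):
        # Fuse the two views by averaging their latent codes (one simple choice).
        z = 0.5 * (self.enc_task(x_task) + self.enc_rest(x_rest))
        return self.dec_task(z), self.dec_rest(z), self.score(z).squeeze(-1)

model = MultiViewAE(d_task=200, d_rest=400)
rec_t, rec_r, y_hat = model(torch.randn(16, 200), torch.randn(16, 400))
print(rec_t.shape, rec_r.shape, y_hat.shape)
```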
12. Deformable 3D Convolution for Video Super-Resolution [PDF] Back to Contents
Xinyi Ying, Longguang Wang, Yingqian Wang, Weidong Sheng, Wei An, Yulan Guo
Abstract: The spatio-temporal information among video sequences is significant for video super-resolution (SR). However, the spatio-temporal information cannot be fully used by existing video SR methods since spatial feature extraction and temporal motion compensation are usually performed sequentially. In this paper, we propose a deformable 3D convolution network (D3Dnet) to incorporate spatio-temporal information from both spatial and temporal dimensions for video SR. Specifically, we introduce deformable 3D convolutions (D3D) to integrate 2D spatial deformable convolutions with 3D convolutions (C3D), obtaining both superior spatio-temporal modeling capability and motion-aware modeling flexibility. Extensive experiments have demonstrated the effectiveness of our proposed D3D in exploiting spatio-temporal information. Comparative results show that our network outperforms the state-of-the-art methods. Code is available at: this https URL.
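A heavily simplified sketch of the underlying idea, assuming one learned offset per voxel (the real D3D learns offsets per kernel tap): predict an offset field, resample the feature volume at the offset locations, then apply a regular 3D convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformable3d(nn.Module):
    """Illustration only: warp features by a learned per-voxel offset,
    then convolve. Not the paper's per-tap deformable 3D convolution."""

    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv3d(channels, 3, kernel_size=3, padding=1)
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        n, c, d, h, w = x.shape
        # Identity sampling grid in normalized [-1, 1] coordinates.
        zs = torch.linspace(-1, 1, d)
        ys = torch.linspace(-1, 1, h)
        xs = torch.linspace(-1, 1, w)
        grid = torch.stack(torch.meshgrid(zs, ys, xs, indexing='ij'), dim=-1)
        grid = grid[..., [2, 1, 0]]  # grid_sample expects (x, y, z) order
        grid = grid.unsqueeze(0).expand(n, -1, -1, -1, -1).to(x)
        # Predicted offsets, scaled down so sampling stays near each voxel.
        off = 0.1 * torch.tanh(self.offset(x)).permute(0, 2, 3, 4, 1)
        sampled = F.grid_sample(x, grid + off, align_corners=True)
        return self.conv(sampled)

x = torch.randn(2, 8, 4, 16, 16)
print(SimpleDeformable3d(8)(x).shape)  # torch.Size([2, 8, 4, 16, 16])
```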
13. Self-Supervised Scene De-occlusion [PDF] Back to Contents
Xiaohang Zhan, Xingang Pan, Bo Dai, Ziwei Liu, Dahua Lin, Chen Change Loy
Abstract: Natural scene understanding is a challenging task, particularly when encountering images of multiple objects that are partially occluded. This obstacle arises from varying object ordering and positioning. Existing scene understanding paradigms are able to parse only the visible parts, resulting in incomplete and unstructured scene interpretation. In this paper, we investigate the problem of scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects. We make the first attempt to address the problem through a novel and unified framework that recovers hidden scene structures without ordering and amodal annotations as supervisions. This is achieved via Partial Completion Network (PCNet)-mask (M) and -content (C), that learn to recover fractions of object masks and contents, respectively, in a self-supervised manner. Based on PCNet-M and PCNet-C, we devise a novel inference scheme to accomplish scene de-occlusion, via progressive ordering recovery, amodal completion and content completion. Extensive experiments on real-world scenes demonstrate the superior performance of our approach to other alternatives. Remarkably, our approach that is trained in a self-supervised manner achieves comparable results to fully-supervised methods. The proposed scene de-occlusion framework benefits many applications, including high-quality and controllable image manipulation and scene recomposition (see Fig. 1), as well as the conversion of existing modal mask annotations to amodal mask annotations.
14. The P-DESTRE: A Fully Annotated Dataset for Pedestrian Detection, Tracking, Re-Identification and Search from Aerial Devices [PDF] Back to Contents
S.V. Aruna Kumar, Ehsan Yaghoubi, Abhijit Das, B.S. Harish, Hugo Proença
Abstract: Over the last decades, the world has been witnessing growing threats to the security in urban spaces, which has augmented the relevance given to visual surveillance solutions able to detect, track and identify persons of interest in crowds. In particular, unmanned aerial vehicles (UAVs) are a potential tool for this kind of analysis, as they provide a cheap way for data collection, cover large and difficult-to-reach areas, while reducing human staff demands. In this context, all the available datasets are exclusively suitable for the pedestrian re-identification problem, in which the multi-camera views per ID are taken on a single day, and allow the use of clothing appearance features for identification purposes. Accordingly, the main contributions of this paper are two-fold: 1) we announce the UAV-based P-DESTRE dataset, which is the first of its kind to provide consistent ID annotations across multiple days, making it suitable for the extremely challenging problem of person search, i.e., where no clothing information can be reliably used. Apart from this feature, the P-DESTRE annotations enable the research on UAV-based pedestrian detection, tracking, re-identification and soft biometric solutions; and 2) we compare the results attained by state-of-the-art pedestrian detection, tracking, re-identification and search techniques in well-known surveillance datasets, to the effectiveness obtained by the same techniques in the P-DESTRE data. Such a comparison makes it possible to identify the most problematic data degradation factors of UAV-based data for each task, and can be used as baselines for subsequent advances in this kind of technology. The dataset and the full details of the empirical evaluation carried out are freely available at this http URL.
15. SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds [PDF] Back to Contents
Xinge Zhu, Yuexin Ma, Tai Wang, Yan Xu, Jianping Shi, Dahua Lin
Abstract: Multi-class 3D object detection aims to localize and classify objects of multiple categories from point clouds. Due to the nature of point clouds, i.e., unstructured, sparse and noisy, some features benefitting multi-class discrimination are underexploited, such as shape information. In this paper, we propose a novel 3D shape signature to explore the shape information from point clouds. By incorporating operations of symmetry, convex hull and Chebyshev fitting, the proposed shape signature is not only compact and effective but also robust to noise, which serves as a soft constraint to improve the feature capability of multi-class discrimination. Based on the proposed shape signature, we develop the shape signature networks (SSN) for 3D object detection, which consist of a pyramid feature encoding part, shape-aware grouping heads and an explicit shape encoding objective. Experiments show that the proposed method performs remarkably better than existing methods on two large-scale datasets. Furthermore, our shape signature can act as a plug-and-play component, and an ablation study shows its effectiveness and good scalability.
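Two of the ingredients named above, convex hull extraction and Chebyshev fitting, can be sketched on a 2D (bird's-eye-view) point set as follows; the radius-versus-angle parameterization and all parameters are illustrative assumptions, not the paper's exact signature construction.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2)) * np.array([2.0, 1.0])  # elongated (x, y) cluster

hull = ConvexHull(pts)
boundary = pts[hull.vertices]  # hull boundary points in order

# Express the boundary as radius vs. polar angle around the centroid,
# then fit a low-order Chebyshev series as a compact shape descriptor.
center = boundary.mean(axis=0)
rel = boundary - center
theta = np.arctan2(rel[:, 1], rel[:, 0])
radius = np.linalg.norm(rel, axis=1)
order = np.argsort(theta)
coeffs = np.polynomial.chebyshev.chebfit(theta[order], radius[order], deg=6)
print("Chebyshev shape descriptor:", np.round(coeffs, 3))
```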
16. Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [PDF] 返回目录
Zhengsu Chen, Jianwei Niu, Lingxi Xie, Xuefeng Liu, Longhui Wei, Qi Tian
Abstract: Automatically designing computationally efficient neural networks has received much attention in recent years. Existing approaches either utilize network pruning or leverage network architecture search methods. This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs, so that under each network configuration, one can estimate the FLOPs utilization ratio (FUR) for each layer and use it to determine whether to increase or decrease the number of channels on the layer. Note that FUR, like the gradient of a non-linear function, is accurate only in a small neighborhood of the current network. Hence, we design an iterative mechanism so that the initial network undergoes a number of steps, each of which has a small 'adjusting rate' to control the changes to the network. The computational overhead of the entire search process is reasonable, i.e., comparable to that of re-training the final model from scratch. Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach, which consistently outperforms the pruning counterpart. The code is available at this https URL.
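The iterative adjustment loop lends itself to a compact sketch. The one below grows layers whose FLOPs utilization ratio is above the mean and shrinks the rest, with a small adjusting rate bounding each step; `estimate_fur` is a hypothetical callable standing in for whatever FUR estimator the paper uses.

```python
# Minimal sketch of the iterative "network adjustment" loop; the growth
# rule and bookkeeping are stand-ins, not the paper's implementation.
def adjust_channels(channels, estimate_fur, steps=10, adjust_rate=0.1):
    """channels: list of channel counts, one per layer.
    estimate_fur: callable mapping the current configuration to a
    per-layer FLOPs utilization ratio (higher = better-used FLOPs)."""
    for _ in range(steps):
        fur = estimate_fur(channels)               # one value per layer
        mean_fur = sum(fur) / len(fur)
        new_channels = []
        for c, u in zip(channels, fur):
            # Grow layers that use their FLOPs well, shrink the rest.
            # The small adjust_rate keeps each step inside the
            # neighborhood where the FUR estimate can be trusted.
            scale = 1.0 + adjust_rate * (1.0 if u > mean_fur else -1.0)
            new_channels.append(max(1, int(round(c * scale))))
        channels = new_channels
    return channels

# Toy usage with a fake FUR estimator.
print(adjust_channels([64, 128, 256], lambda ch: [1.0, 0.5, 0.8], steps=3))
```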
17. Guiding Monocular Depth Estimation Using Depth-Attention Volume [PDF] 返回目录
Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, Janne Heikkila
Abstract: Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations. In recent works, those priors have been learned in an end-to-end manner from large datasets by using deep neural networks. In this paper, we propose guiding depth estimation to favor planar structures that are ubiquitous especially in indoor environments. This is achieved by incorporating a non-local coplanarity constraint to the network with a novel attention mechanism called depth-attention volume (DAV). Experiments on two popular indoor datasets, namely NYU-Depth-v2 and ScanNet, show that our method achieves state-of-the-art depth estimation results while using only a fraction of the number of parameters needed by the competing methods.
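As a rough illustration of what a depth-attention volume could look like, the toy block below computes pairwise attention between all spatial positions of a feature map, letting, for instance, coplanar regions share depth evidence. Shapes, scaling and names are illustrative assumptions, not the paper's DAV module.

```python
# Toy non-local attention over spatial positions, in the spirit of a
# "depth-attention volume". Illustrative only.
import torch

def depth_attention(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) encoder features -> non-locally attended features."""
    b, c, h, w = feat.shape
    q = feat.flatten(2).transpose(1, 2)             # (B, HW, C) queries
    k = feat.flatten(2)                             # (B, C, HW) keys
    attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (B, HW, HW) attention "volume"
    v = feat.flatten(2).transpose(1, 2)             # (B, HW, C) values
    out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
    return out

print(depth_attention(torch.randn(1, 8, 16, 16)).shape)  # torch.Size([1, 8, 16, 16])
```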
18. Towards Detection of Sheep Onboard a UAV [PDF] 返回目录
Farah Sarwar, Anthony Griffin, Saeed Ur Rehman, Timotius Pasang
Abstract: In this work we consider the task of detecting sheep onboard an unmanned aerial vehicle (UAV) flying at an altitude of 80 m. At this height, the sheep are relatively small, only about 15 pixels across. Although deep learning strategies have gained enormous popularity in the last decade and are now extensively used for object detection in many fields, state-of-the-art detectors perform poorly in the case of smaller objects. We develop a novel dataset of UAV imagery of sheep and consider a variety of object detectors to determine which is the most suitable for our task in terms of both accuracy and speed. Our findings indicate that a UNet detector using the weighted Hausdorff distance as a loss function during training is an excellent option for detection of sheep onboard a UAV.
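For readers unfamiliar with the loss, here is a simplified, hedged sketch of a weighted Hausdorff distance between a predicted probability map and ground-truth point locations, in the spirit of Ribera et al.; the soft-minimum formulation and constants are our assumptions, not necessarily the exact loss used here.

```python
# Simplified weighted Hausdorff distance between a detection probability
# map and ground-truth points. Constants and the soft-min are assumptions.
import torch

def weighted_hausdorff(prob, gt_points, alpha=-9.0, eps=1e-3):
    """prob: (H, W) per-pixel detection probabilities in [0, 1];
    gt_points: (M, 2) ground-truth (row, col) locations."""
    h, w = prob.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([ys, xs], dim=-1).float().reshape(-1, 2)   # (HW, 2)
    p = prob.reshape(-1)                                          # (HW,)
    d = torch.cdist(grid, gt_points.float())                      # (HW, M)
    # Term 1: every confident pixel should lie near some GT point.
    t1 = (p * d.min(dim=1).values).sum() / (p.sum() + eps)
    # Term 2: every GT point should lie near some confident pixel;
    # low-probability pixels are pushed away, then a soft minimum
    # (generalized mean with a large negative exponent) runs over pixels.
    d_weighted = (d + (1.0 - p).unsqueeze(1) * d.max()).clamp(min=eps)
    soft_min = d_weighted.pow(alpha).mean(dim=0).pow(1.0 / alpha)  # (M,)
    return t1 + soft_min.mean()

prob = torch.rand(32, 32)
gt = torch.tensor([[8.0, 8.0], [20.0, 25.0]])
print(weighted_hausdorff(prob, gt))
```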
19. Efficient Deep Representation Learning by Adaptive Latent Space Sampling [PDF] 返回目录
Yuanhan Mo, Shuo Wang, Chengliang Dai, Rui Zhou, Wenjia Bai, Yike Guo
Abstract: Supervised deep learning requires a large number of training samples with annotations (e.g. label classes for classification tasks, pixel- or voxel-wise label maps for segmentation tasks), which are expensive and time-consuming to obtain. During the training of a deep neural network, the annotated samples are fed into the network in mini-batches, where they are often regarded as being of equal importance. However, some of the samples may become less informative during training, as the magnitude of the gradient starts to vanish for these samples. Meanwhile, other samples of higher utility or hardness may be in greater demand for the training process to proceed, and require more exploitation. To address the challenges of expensive annotations and loss of sample informativeness, here we propose a novel training framework which adaptively selects informative samples that are fed to the training process. The adaptive selection or sampling is performed based on a hardness-aware strategy in the latent space constructed by a generative model. To evaluate the proposed training framework, we perform experiments on three different datasets, including MNIST and CIFAR-10 for the image classification task and the medical image dataset IVUS for a biophysical simulation task. On all three datasets, the proposed framework outperforms a random sampling method, which demonstrates the effectiveness of the proposed framework.
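A minimal sketch of hardness-aware sampling, assuming per-sample hardness scores (e.g. current training losses) are available for latent codes drawn from the generative model; the softmax weighting and temperature are illustrative choices, not the paper's exact strategy.

```python
# Hardness-aware batch sampling over latent codes. All names are
# illustrative assumptions.
import numpy as np

def sample_batch(latents, hardness, batch_size, temperature=1.0, rng=None):
    """latents: (N, D) codes from a generative model; hardness: (N,)
    per-sample difficulty, e.g. the current per-sample training loss."""
    rng = rng or np.random.default_rng()
    logits = hardness / temperature
    probs = np.exp(logits - logits.max())       # numerically stable softmax
    probs /= probs.sum()
    idx = rng.choice(len(latents), size=batch_size, replace=False, p=probs)
    return latents[idx]

batch = sample_batch(np.random.randn(1000, 32), np.random.rand(1000), 64)
print(batch.shape)  # (64, 32)
```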
20. Detection and skeletonization of single neurons and tracer injections using topological methods [PDF] 返回目录
Dingkang Wang, Lucas Magee, Bing-Xing Huo, Samik Banerjee, Xu Li, Jaikishan Jayakumar, Meng Kuan Lin, Keerthi Ram, Suyi Wang, Yusu Wang, Partha P. Mitra
Abstract: Neuroscientific data analysis has traditionally relied on linear algebra and stochastic process theory. However, the tree-like shapes of neurons cannot be described easily as points in a vector space (the subtraction of two neuronal shapes is not a meaningful operation), and methods from computational topology are better suited to their analysis. Here we introduce methods from Discrete Morse (DM) Theory to extract the tree-skeletons of individual neurons from volumetric brain image data, and to summarize collections of neurons labelled by tracer injections. Since individual neurons are topologically trees, it is sensible to summarize the collection of neurons using a consensus tree-shape that provides a richer information summary than the traditional regional 'connectivity matrix' approach. The conceptually elegant DM approach lacks hand-tuned parameters and captures global properties of the data as opposed to previous approaches which are inherently local. For individual skeletonization of sparsely labelled neurons we obtain substantial performance gains over state-of-the-art non-topological methods (over 10% improvements in precision and faster proofreading). The consensus-tree summary of tracer injections incorporates the regional connectivity matrix information, but in addition captures the collective collateral branching patterns of the set of neurons connected to the injection site, and provides a bridge between single-neuron morphology and tracer-injection data.
21. Temporally Coherent Embeddings for Self-Supervised Video Representation Learning [PDF] 返回目录
Joshua Knights, Anthony Vanderkop, Daniel Ward, Olivia Mackenzie-Ross, Peyman Moghadam
Abstract: This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning. The proposed method exploits inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space, rather than indirectly learning it through ranking or predictive pretext tasks. In the same way that high-level visual information in the world changes smoothly, we believe that nearby frames in learned representations should demonstrate similar properties. Using this assumption, we train the TCE model to encode videos such that adjacent frames exist close to each other and videos are separated from one another. Using TCE we learn robust representations from large quantities of unlabeled video data. We evaluate our self-supervised trained TCE model by adding a classification layer and finetuning the learned representation on the downstream task of video action recognition on the UCF101 dataset. We obtain 68.7% accuracy and outperform the state-of-the-art self-supervised methods despite using a significantly smaller dataset for pre-training. Notably, we demonstrate results competitive with more complex 3D-CNN based networks while training with a 2D-CNN network backbone on action recognition tasks.
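The temporal-coherence objective can be sketched as a triplet-style loss: embeddings of adjacent frames are pulled together while frames from other videos are pushed apart. The margin formulation below is our assumption; the actual TCE loss may differ.

```python
# Triplet-style temporal coherence loss: adjacent frames close, frames
# from other videos far. A hedged sketch, not TCE's exact objective.
import torch
import torch.nn.functional as F

def temporal_coherence_loss(anchor, positive, negative, margin=1.0):
    """anchor/positive: embeddings of adjacent frames of one video;
    negative: embeddings of frames from other videos; all (B, D)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

z = lambda: F.normalize(torch.randn(8, 128), dim=1)
print(temporal_coherence_loss(z(), z(), z()))
```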
22. Semantic Segmentation of highly class imbalanced fully labelled 3D volumetric biomedical images and unsupervised Domain Adaptation of the pre-trained Segmentation Network to segment another fully unlabelled Biomedical 3D Image stack [PDF] 返回目录
Shreya Roy, Anirban Chakraborty
Abstract: The goal of our work is to perform pixel-label semantic segmentation on 3D biomedical volumetric data. Manual annotation is always difficult for a large bio-medical dataset. So, we consider two cases where one dataset is fully labelled and the other dataset is assumed to be fully unlabelled. We first perform semantic segmentation on the fully labelled isotropic biomedical source data (FIBSEM) and try to incorporate the trained model for segmenting the target unlabelled dataset (SNEMI3D), which shares some similarities with the source dataset in the context of different types of cellular bodies and other cellular components, although the cellular components vary in size and shape. So in this paper, we propose a novel approach in the context of unsupervised domain adaptation while classifying each pixel of the target volumetric data into cell boundary and cell body. Also, we propose a novel approach that assigns non-uniform weights to different pixels in the training images while performing pixel-level semantic segmentation, given the corresponding pixel-wise label maps along with the original training images in the source domain. We use an entropy map or a distance-transform matrix derived from the given ground-truth label map, which helps to overcome the class imbalance problem in medical image data where the cell boundaries are extremely thin and hence extremely prone to being misclassified as non-boundary.
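One common way to realize the distance-transform weighting mentioned above is to give pixels near a thin boundary exponentially larger loss weights, as in the sketch below; the exact weighting formula and constants are assumptions for illustration.

```python
# Distance-transform pixel weighting against thin-boundary class
# imbalance: pixels near a boundary get large loss weights. The formula
# and constants are illustrative assumptions.
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_weights(label_map, w0=10.0, sigma=5.0):
    """label_map: (H, W) binary mask, 1 = boundary, 0 = cell body."""
    dist = distance_transform_edt(label_map == 0)   # distance to nearest boundary
    return 1.0 + w0 * np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

labels = np.zeros((64, 64), dtype=np.uint8)
labels[32, :] = 1                                   # a thin boundary line
w = boundary_weights(labels)
print(w.max(), w.min())                             # ~11 near the line, ~1 far away
```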
23. Eisen: a python package for solid deep learning [PDF] 返回目录
Frank Mancolo
Abstract: Eisen is an open source python package making the implementation of deep learning methods easy. It is specifically tailored to medical image analysis and computer vision tasks, but its flexibility allows extension to any application. Eisen is based on PyTorch and it follows the same architecture of other packages belonging to the PyTorch ecosystem. This simplifies its use and allows it to be compatible with modules provided by other packages. Eisen implements multiple dataset loading methods, I/O for various data formats, data manipulation and transformation, full implementation of training, validation and test loops, implementation of losses and network architectures, automatic export of training artifacts, summaries and logs, visual experiment building, command line interface and more. Furthermore, it is open to user contributions by the community. Documentation, examples and code can be downloaded from this http URL.
24. Reconfigurable Voxels: A New Representation for LiDAR-Based Point Clouds [PDF] 返回目录
Tai Wang, Xinge Zhu, Dahua Lin
Abstract: LiDAR is an important method for autonomous driving systems to sense the environment. The point clouds obtained by LiDAR typically exhibit sparse and irregular distribution, thus posing great challenges to the detection of 3D objects, especially those that are small and distant. To tackle this difficulty, we propose Reconfigurable Voxels, a new approach to constructing representations from 3D point clouds. Specifically, we devise a biased random walk scheme, which adaptively covers each neighborhood with a fixed number of voxels based on the local spatial distribution and produces a representation by integrating the points in the chosen neighbors. We found empirically that this approach effectively improves the stability of voxel features, especially for sparse regions. Experimental results on multiple benchmarks, including nuScenes, Lyft, and KITTI, show that this new representation can remarkably improve the detection performance for small and distant objects, without incurring noticeable overhead costs.
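A loose 2D illustration of the biased random walk idea: starting from a seed voxel, repeatedly step to a neighbor with probability proportional to its point count, until a fixed number of voxels has been collected for the neighborhood. This is a toy version on a bird's-eye-view grid, not the paper's exact scheme.

```python
# Toy biased random walk over a 2D voxel grid: walks drift toward dense
# voxels, gathering an adaptive fixed-size neighborhood. Illustrative only.
import numpy as np

def biased_walk(counts, seed, n_voxels=8, rng=None):
    """counts: (H, W) points-per-voxel (BEV); seed: (r, c) start voxel."""
    rng = rng or np.random.default_rng()
    h, w = counts.shape
    chosen, cur = {seed}, seed
    while len(chosen) < n_voxels:
        r, c = cur
        nbrs = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < h and 0 <= c + dc < w]
        weights = np.array([counts[n] + 1e-3 for n in nbrs])  # bias to dense voxels
        cur = nbrs[rng.choice(len(nbrs), p=weights / weights.sum())]
        chosen.add(cur)
    return sorted(chosen)

print(biased_walk(np.random.poisson(2.0, (16, 16)), (8, 8)))
```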
25. A Morphable Face Albedo Model [PDF] 返回目录
William A.P. Smith, Alassane Seck, Hannah Dee, Bernard Tiddeman, Joshua Tenenbaum, Bernhard Egger
Abstract: In this paper, we bring together two divergent strands of research: photometric face capture and statistical 3D face appearance modelling. We propose a novel lightstage capture and processing pipeline for acquiring ear-to-ear, truly intrinsic diffuse and specular albedo maps that fully factor out the effects of illumination, camera and geometry. Using this pipeline, we capture a dataset of 50 scans and combine them with the only existing publicly available albedo dataset (3DRFE) of 23 scans. This allows us to build the first morphable face albedo model. We believe this is the first statistical analysis of the variability of facial specular albedo maps. This model can be used as a plug-in replacement for the texture model of the Basel Face Model, and we make our new albedo model publicly available. We ensure careful spectral calibration such that our model is built in a linear sRGB space, suitable for inverse rendering of images taken by typical cameras. We demonstrate our model in a state-of-the-art analysis-by-synthesis 3DMM fitting pipeline, are the first to integrate specular map estimation, and outperform the Basel Face Model in albedo reconstruction.
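At its core, a linear morphable albedo model expresses each albedo map as a mean plus a weighted sum of principal components. The sketch below builds such a model from flattened albedo maps via PCA; it is a stand-in under that assumption, not the released model's API.

```python
# PCA-style linear morphable albedo model: mean + components * coeffs.
# A hedged stand-in for the statistical model described above.
import numpy as np

class MorphableAlbedo:
    def __init__(self, albedo_maps):
        """albedo_maps: (N, P) flattened linear-sRGB albedo maps."""
        self.mean = albedo_maps.mean(axis=0)
        centered = albedo_maps - self.mean
        # Principal components via SVD of the centered data matrix.
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        self.basis = vt                               # (N, P) components
        self.stddev = s / np.sqrt(len(albedo_maps) - 1)

    def synthesize(self, coeffs):
        """coeffs: (K,) standard-normal coefficients -> flattened albedo map."""
        k = len(coeffs)
        return self.mean + (coeffs * self.stddev[:k]) @ self.basis[:k]

model = MorphableAlbedo(np.random.rand(50, 300))
print(model.synthesize(np.random.randn(10)).shape)    # (300,)
```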
26. Sub-Instruction Aware Vision-and-Language Navigation [PDF] 返回目录
Yicong Hong, Cristian Rodriguez-Opazo, Qi Wu, Stephen Gould
Abstract: Vision-and-language navigation requires an agent to navigate through a real 3D environment following a given natural language instruction. Despite significant advances, few previous works are able to fully utilize the strong correspondence between the visual and textual sequences. Meanwhile, due to the lack of intermediate supervision, the agent's performance at following each part of the instruction remains untrackable during navigation. In this work, we focus on the granularity of the visual and language sequences as well as the trackability of agents through the completion of instruction. We provide agents with fine-grained annotations during training and find that they are able to follow the instruction better and have a higher chance of reaching the target at test time. We enrich the previous dataset with sub-instructions and their corresponding paths. To make use of this data, we propose an effective sub-instruction attention and shifting modules that attend and select a single sub-instruction at each time-step. We implement our sub-instruction modules in four state-of-the-art agents, compare with their baseline model, and show that our proposed method improves the performance of all four agents.
27. COVID-CAPS: A Capsule Network-based Framework for Identification of COVID-19 cases from X-ray Images [PDF] 返回目录
Parnian Afshar, Shahin Heidarian, Farnoosh Naderkhani, Anastasia Oikonomou, Konstantinos N. Plataniotis, Arash Mohammadi
Abstract: Novel Coronavirus disease (COVID-19) has abruptly and undoubtedly changed the world as we know it at the end of the 2nd decade of the 21st century. COVID-19 is extremely contagious and quickly spreading globally, making its early diagnosis of paramount importance. Early diagnosis of COVID-19 enables health care professionals and government authorities to break the chain of transmission and flatten the epidemic curve. The common type of COVID-19 diagnosis test, however, requires specific equipment and has relatively low sensitivity and a high false-negative rate. Computed tomography (CT) scans and X-ray images, on the other hand, reveal specific manifestations associated with this disease. Overlap with other lung infections makes human-centered diagnosis of COVID-19 challenging. Consequently, there has been an urgent surge of interest in developing Deep Neural Network (DNN)-based diagnosis solutions, mainly based on Convolutional Neural Networks (CNNs), to facilitate identification of positive COVID-19 cases. CNNs, however, are prone to lose spatial information between image instances and require large datasets. This paper presents an alternative modeling framework based on Capsule Networks, referred to as COVID-CAPS, that is capable of handling small datasets, which is of significant importance due to the sudden and rapid emergence of COVID-19. Our initial results based on a dataset of X-ray images show that COVID-CAPS has an advantage over previous CNN-based models. COVID-CAPS achieved an Accuracy of 95.7%, Sensitivity of 90%, Specificity of 95.8%, and Area Under the Curve (AUC) of 0.97, while having far fewer trainable parameters than its counterparts.
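For context, capsule networks are typically trained with the margin loss of Sabour et al. (2017), shown below; whether COVID-CAPS uses exactly these margins and weighting is our assumption.

```python
# Standard capsule-network margin loss (Sabour et al., 2017), shown as
# background for the capsule framework above.
import torch

def margin_loss(v_norm, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """v_norm: (B, K) capsule output lengths in [0, 1];
    target: (B, K) one-hot class labels."""
    pos = target * torch.clamp(m_pos - v_norm, min=0).pow(2)
    neg = lam * (1 - target) * torch.clamp(v_norm - m_neg, min=0).pow(2)
    return (pos + neg).sum(dim=1).mean()

v = torch.sigmoid(torch.randn(4, 2))          # fake capsule lengths
t = torch.eye(2)[torch.tensor([0, 1, 0, 1])]  # one-hot labels
print(margin_loss(v, t))
```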
28. Finding Your (3D) Center: 3D Object Detection Using a Learned Loss [PDF] 返回目录
David Griffiths, Jan Boehm, Tobias Ritschel
Abstract: Massive semantic labeling is readily available for 2D images, but much harder to achieve for 3D scenes. Objects in 3D repositories like ShapeNet are labeled, but regrettably only in isolation, so without context. 3D scenes can be acquired by range scanners on a city-level scale, but with far fewer semantic labels. Addressing this disparity, we introduce a new optimization procedure which allows training for 3D detection with raw 3D scans while using as little as 5% of the object labels, and still achieves comparable performance. Our optimization uses two networks. A scene network maps an entire 3D scene to a set of 3D object centers. As we assume the scene not to be labeled with centers, no classic loss, such as chamfer distance, can be used to train it. Instead, we use another network to emulate the loss. This loss network is trained on a small labeled subset and maps a non-centered 3D object in the presence of distractions to its own center. This function is very similar to, and hence can be used instead of, the gradient the supervised loss would have. Our evaluation documents competitive fidelity at a much lower level of supervision, and higher quality at comparable supervision. Supplementary material can be found at: this https URL.
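A toy sketch of the two-network training signal described above: a frozen loss network, previously trained on the small labeled subset to point predictions toward object centers, supplies the gradient for the scene network on unlabeled scans. All architectures and shapes are placeholders, not the paper's.

```python
# Toy scene-network / loss-network setup: the frozen loss network emulates
# the supervised gradient for the scene network. Placeholder architectures.
import torch
import torch.nn as nn

scene_net = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 3))
loss_net = nn.Sequential(nn.Linear(1024 + 3, 64), nn.ReLU(), nn.Linear(64, 3))
for p in loss_net.parameters():            # the loss network stays frozen here
    p.requires_grad_(False)

def train_step(scan_feat, opt):
    """scan_feat: (B, 1024) features of a raw, unlabeled 3D scan."""
    centers = scene_net(scan_feat)                         # predicted 3D centers
    # The loss network maps (scene, predicted center) to a correction
    # vector; the scene network learns to drive that correction to zero.
    correction = loss_net(torch.cat([scan_feat, centers], dim=1))
    loss = correction.pow(2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

opt = torch.optim.Adam(scene_net.parameters(), lr=1e-3)
print(train_step(torch.randn(4, 1024), opt))
```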
29. Light3DPose: Real-time Multi-Person 3D Pose Estimation from Multiple Views [PDF] 返回目录
Alessio Elmi, Davide Mazzini, Pietro Tortella
Abstract: We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views. Our architecture, leveraging the recently proposed unprojection layer, aggregates feature maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene. Such an intermediate representation is then elaborated by a fully-convolutional volumetric network and a decoding stage to extract 3D skeletons with sub-voxel accuracy. Our method achieves state-of-the-art MPJPE on the CMU Panoptic dataset using a few unseen views and obtains competitive results even with a single input view. We also assess the transfer learning capabilities of the model by testing it against the publicly available Shelf dataset, obtaining good performance metrics. The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene. Furthermore, even though the computational burden of the 2D part scales linearly with the number of input views, the overall architecture is able to exploit a very lightweight 2D backbone which is orders of magnitude faster than the volumetric counterpart, resulting in fast inference time. The system can run at 6 FPS, processing up to 10 camera views on a single 1080Ti GPU.
30. Attribute Mix: Semantic Data Augmentation for Fine Grained Recognition [PDF] 返回目录
Hao Li, Xiaopeng Zhang, Hongkai Xiong, Qi Tian
Abstract: Collecting fine-grained labels usually requires expert-level domain knowledge and is prohibitively expensive to scale up. In this paper, we propose Attribute Mix, a data augmentation strategy at the attribute level to expand the fine-grained samples. The principle is that attribute features are shared among fine-grained sub-categories and can be seamlessly transferred among images. Toward this goal, we propose an automatic attribute mining approach to discover attributes that belong to the same super-category, and Attribute Mix is operated by mixing semantically meaningful attribute features from two images. Attribute Mix is a simple but effective data augmentation strategy that can significantly improve the recognition performance without increasing the inference budget. Furthermore, since attributes can be shared among images from the same super-category, we further enrich the training samples with attribute-level labels using images from the generic domain. Experiments on widely used fine-grained benchmarks demonstrate the effectiveness of our proposed method. Specifically, without any bells and whistles, we achieve accuracies of $90.2\%$, $93.1\%$ and $94.9\%$ on CUB-200-2011, FGVC-Aircraft and Stanford Cars, respectively.
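A toy version of attribute-level mixing: copy an attribute region from one image into another and mix the one-hot labels by the area kept from each image. The mask is assumed given here; in the paper it would come from the automatic attribute-mining step.

```python
# Toy attribute-level mix: paste a masked attribute region from image B
# into image A and mix labels by area. Mask discovery is assumed given.
import numpy as np

def attribute_mix(img_a, label_a, img_b, label_b, mask_b):
    """img_*: (H, W, 3); label_*: (num_classes,) one-hot;
    mask_b: (H, W) binary mask of an attribute region in img_b."""
    mixed = img_a.copy()
    sel = mask_b.astype(bool)
    mixed[sel] = img_b[sel]
    lam = 1.0 - mask_b.mean()                 # fraction of img_a kept
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

a, b = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
la, lb = np.array([1.0, 0.0]), np.array([0.0, 1.0])
m = np.zeros((8, 8)); m[:4] = 1               # top half is the attribute region
img, lab = attribute_mix(a, la, b, lb, m)
print(lab)                                    # [0.5 0.5]
```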
31. A Local-to-Global Approach to Multi-modal Movie Scene Segmentation [PDF] 返回目录
Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin
Abstract: Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging -- compared to the videos studied in conventional vision problems, e.g. action recognition, as scenes in movies usually contain much richer temporal structures and more complex semantic information. Towards this goal, we scale up the scene segmentation task by building a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies. We further propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie. This framework is able to distill complex semantics from hierarchical temporal structures over a long movie, providing top-down guidance for scene segmentation. Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods. We also found that pretraining on our MovieScenes can bring significant improvements to the existing approaches.
32. Appearance Shock Grammar for Fast Medial Axis Extraction from Real Images [PDF] 返回目录
Charles-Olivier Dufresne Camaro, Morteza Rezanejad, Stavros Tsogkas, Kaleem Siddiqi, Sven Dickinson
Abstract: We combine ideas from shock graph theory with more recent appearance-based methods for medial axis extraction from complex natural scenes, improving upon the present best unsupervised method, in terms of efficiency and performance. We make the following specific contributions: i) we extend the shock graph representation to the domain of real images, by generalizing the shock type definitions using local, appearance-based criteria; ii) we then use the rules of a Shock Grammar to guide our search for medial points, drastically reducing run time when compared to other methods, which exhaustively consider all points in the input image; iii) we remove the need for typical post-processing steps including thinning, non-maximum suppression, and grouping, by adhering to the Shock Grammar rules while deriving the medial axis solution; iv) finally, we raise some fundamental concerns with the evaluation scheme used in previous work and propose a more appropriate alternative for assessing the performance of medial axis extraction from scenes. Our experiments on the BMAX500 and SK-LARGE datasets demonstrate the effectiveness of our approach. We outperform the present state-of-the-art, excelling particularly in the high-precision regime, while running an order of magnitude faster and requiring no post-processing.
33. SHOP-VRB: A Visual Reasoning Benchmark for Object Perception [PDF] 返回目录
Michal Nazarczuk, Krystian Mikolajczyk
Abstract: In this paper we present an approach and a benchmark for visual reasoning in robotics applications, in particular small object grasping and manipulation. The approach and benchmark are focused on inferring object properties from visual and text data. The benchmark concerns small household objects with their properties, functionality, and natural language descriptions, as well as question-answer pairs for visual reasoning queries along with their corresponding scene semantic representations. We also present a method for generating synthetic data, which allows the benchmark to be extended to other objects or scenes, and propose an evaluation protocol that is more challenging than those of existing datasets. We propose a reasoning system based on symbolic program execution: a disentangled representation of the visual and textual inputs is obtained and used to execute symbolic programs that represent the 'reasoning process' of the algorithm. We perform a set of experiments on the proposed benchmark and compare our results to those of state-of-the-art methods. These results expose shortcomings of the existing benchmarks that may lead to misleading conclusions on the actual performance of visual reasoning systems.
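To make symbolic program execution concrete, here is a minimal Python sketch (our illustration, not the SHOP-VRB implementation; the object attributes and operator names are invented):

scene = [  # a toy scene semantic representation
    {"name": "mug",    "material": "ceramic", "movable": True},
    {"name": "kettle", "material": "metal",   "movable": True},
    {"name": "shelf",  "material": "wood",    "movable": False},
]

def filter_attr(objs, key, value):
    # Keep objects whose attribute `key` equals `value`.
    return [o for o in objs if o.get(key) == value]

# Program for the query "How many movable metal objects are there?"
program = [("filter", "material", "metal"),
           ("filter", "movable", True),
           ("count",)]

state = scene
for op, *args in program:
    if op == "filter":
        state = filter_attr(state, *args)
    elif op == "count":
        state = len(state)

print(state)  # -> 1 (the kettle)

Executing the program step by step over the scene representation is what makes the 'reasoning process' inspectable.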
34. Geometrically Principled Connections in Graph Neural Networks [PDF] 返回目录
Shunwang Gong, Mehdi Bahri, Michael M. Bronstein, Stefanos Zafeiriou
Abstract: Graph convolution operators bring the advantages of deep learning to a variety of graph and mesh processing tasks previously deemed out of reach. With their continued success comes the desire to design more powerful architectures, often by adapting existing deep learning techniques to non-Euclidean data. In this paper, we argue geometry should remain the primary driving force behind innovation in the emerging field of geometric deep learning. We relate graph neural networks to widely successful computer graphics and data approximation models: radial basis functions (RBFs). We conjecture that, like RBFs, graph convolution layers would benefit from the addition of simple functions to the powerful convolution kernels. We introduce affine skip connections, a novel building block formed by combining a fully connected layer with any graph convolution operator. We experimentally demonstrate the effectiveness of our technique and show that the improved performance is not merely a consequence of the increased number of parameters. Operators equipped with the affine skip connection markedly outperform their base performance on every task we evaluated, i.e., shape reconstruction, dense shape correspondence, and graph classification. We hope our simple and effective approach will serve as a solid baseline and help ease future research in graph neural networks.
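As a rough sketch of the affine skip connection (the abstract specifies a fully connected layer combined with any graph convolution operator; the parallel-sum form and the PyTorch phrasing below are our assumptions, not the authors' code):

import torch
import torch.nn as nn

class AffineSkipGraphConv(nn.Module):
    # Wraps any graph convolution f(x, adj) with a parallel fully
    # connected (affine) branch and sums the two outputs.
    def __init__(self, graph_conv, in_features, out_features):
        super().__init__()
        self.graph_conv = graph_conv
        self.affine = nn.Linear(in_features, out_features)

    def forward(self, x, adj):
        # x: (num_nodes, in_features); adj: whatever graph_conv expects
        return self.graph_conv(x, adj) + self.affine(x)

# Toy graph convolution (x' = A x W) to exercise the wrapper:
class ToyGraphConv(nn.Module):
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Linear(in_f, out_f, bias=False)

    def forward(self, x, adj):
        return adj @ self.weight(x)

layer = AffineSkipGraphConv(ToyGraphConv(16, 32), 16, 32)
out = layer(torch.randn(10, 16), torch.eye(10))  # (10, 32)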
35. Fair Latency-Aware Metric for real-time video segmentation networks [PDF] 返回目录
Evann Courdier, Francois Fleuret
Abstract: As supervised semantic segmentation reaches satisfying results, many recent papers have focused on making segmentation network architectures faster, smaller and more efficient. In particular, studies often aim to reach the stage at which they can claim to be "real-time". Achieving this goal is especially relevant in the context of real-time video operations for autonomous vehicles and robots, or medical imaging during surgery. So far, the common metric used for assessing these methods is the same as the one used for image segmentation without a time constraint: mean Intersection over Union (mIoU). In this paper, we argue that this metric is not relevant enough for real-time video, as it does not take into account the processing time (latency) of the network. We propose a similar but more relevant metric for video-segmentation networks, called FLAME, that compares the output segmentation of the network with the ground-truth segmentation of the video frame that is current when the network finishes processing. We perform experiments to compare a few networks using this metric and propose a simple addition to network training that enhances results according to that metric.
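A hedged sketch of how such a latency-aware score could be computed (our reading of the idea, not FLAME's exact definition; the helper names are invented):

import numpy as np

def miou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def latency_aware_miou(preds, gts, latency_s, fps, num_classes):
    # Score the prediction made from frame t against the ground truth of
    # the frame that is current when the network finishes: t + delay.
    delay = int(round(latency_s * fps))
    scores = [miou(preds[t], gts[t + delay], num_classes)
              for t in range(len(preds) - delay)]
    return float(np.mean(scores))

A slow network is thus penalized automatically, because its output is compared against a frame that has already moved on.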
36. GANSpace: Discovering Interpretable GAN Controls [PDF] 返回目录
Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, Sylvain Paris
Abstract: This paper describes a simple technique to analyze Generative Adversarial Networks (GANs) and create interpretable controls for image synthesis, such as change of viewpoint, aging, lighting, and time of day. We identify important latent directions based on Principal Components Analysis (PCA) applied in activation space. Then, we show that interpretable edits can be defined based on layer-wise application of these edit directions. Moreover, we show that BigGAN can be controlled with layer-wise inputs in a StyleGAN-like manner. A user may identify a large number of interpretable controls with these mechanisms. We demonstrate results on GANs from various datasets.
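A minimal sketch of the PCA step (assuming access to a generator's early-layer activations for random latent draws; sample_activations is a placeholder, not a real API):

import numpy as np
from sklearn.decomposition import PCA

def discover_directions(sample_activations, n_samples=10000, n_components=20):
    # sample_activations(n) -> (n, d) array of early-layer activations
    # for n random latent draws (placeholder for a real GAN hook).
    acts = sample_activations(n_samples)
    pca = PCA(n_components=n_components)
    pca.fit(acts)
    return pca.components_  # (n_components, d) candidate edit directions

def edit(activation, directions, component=0, strength=2.0):
    # Move an activation along one principal direction; applying this only
    # at selected layers gives the layer-wise edits described above.
    return activation + strength * directions[component]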
37. Exploration of Input Patterns for Enhancing the Performance of Liquid State Machines [PDF] 返回目录
Shasha Guo, Lianhua Qu, Lei Wang, Xulong Tang, Shuo Tian, Shiming Li, Weixia Xu
Abstract: Spiking Neural Networks (SNNs) have gained increasing attention for their low power consumption. But training SNNs is challenging. The Liquid State Machine (LSM), as a major type of reservoir computing, has been widely recognized for its low training cost among SNNs. Exploring LSM topologies for enhanced performance often requires hyper-parameter search, which is both resource-expensive and time-consuming. We explore the influence of input scale reduction on the LSM instead. There are two main reasons for studying input reduction for LSMs. One is that the input dimension of large images requires efficient processing. The other is that input exploration is generally more economical than architecture search. To mitigate the difficulty of effectively dealing with the huge input spaces of the LSM, and to find out whether input reduction can enhance LSM performance, we explore several input patterns, namely fullscale, scanline, chessboard, and patch. Several datasets have been used to evaluate the performance of the proposed input patterns, including two spatial image datasets and one spatio-temporal image database. The experimental results show that the reduced input under the chessboard pattern improves accuracy by up to 5%, and reduces execution time by up to 50% with up to 75% less input storage than the fullscale input pattern for the LSM.
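The four input patterns can be illustrated with a small numpy sketch (our rendering of the pattern names; the exact reductions used in the paper may differ):

import numpy as np

def reduce_input(img, pattern):
    h, w = img.shape
    if pattern == "fullscale":   # every pixel
        return img.flatten()
    if pattern == "scanline":    # every other row
        return img[::2, :].flatten()
    if pattern == "chessboard":  # alternating pixels (keeps ~50%)
        mask = (np.add.outer(np.arange(h), np.arange(w)) % 2) == 0
        return img[mask]
    if pattern == "patch":       # central crop
        return img[h // 4: 3 * h // 4, w // 4: 3 * w // 4].flatten()
    raise ValueError(pattern)

x = np.random.rand(28, 28)
print({p: reduce_input(x, p).size
       for p in ("fullscale", "scanline", "chessboard", "patch")})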
38. Cascaded Deep Video Deblurring Using Temporal Sharpness Prior [PDF] 返回目录
Jinshan Pan, Haoran Bai, Jinhui Tang
Abstract: We present a simple and effective deep convolutional neural network (CNN) model for video deblurring. The proposed algorithm mainly consists of optical flow estimation from intermediate latent frames and latent frame restoration steps. It first develops a deep CNN model to estimate optical flow from intermediate latent frames and then restores the latent frames based on the estimated optical flow. To better explore the temporal information in videos, we develop a temporal sharpness prior that constrains the deep CNN model and aids the latent frame restoration. We develop an effective cascaded training approach and jointly train the proposed CNN model in an end-to-end manner. We show that exploring the domain knowledge of video deblurring makes the deep CNN model more compact and efficient. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods on the benchmark datasets as well as real-world videos.
39. Image-based phenotyping of diverse Rice (Oryza Sativa L.) Genotypes [PDF] 返回目录
Mukesh Kumar Vishal, Dipesh Tamboli, Abhijeet Patil, Rohit Saluja, Biplab Banerjee, Amit Sethi, Dhandapani Raju, Sudhir Kumar, R N Sahoo, Viswanathan Chinnusamy, J Adinarayana
Abstract: Development of either drought-resistant or drought-tolerant varieties in rice (Oryza sativa L.), especially for high yield in the context of climate change, is a crucial task across the world. The need for high-yielding rice varieties is a prime concern for developing nations like India, China, and other Asian-African countries where rice is a primary staple food. The present investigation is carried out for discriminating drought-tolerant and susceptible genotypes. A total of 150 genotypes were grown under controlled conditions for evaluation at the High Throughput Plant Phenomics facility, Nanaji Deshmukh Plant Phenomics Centre, Indian Council of Agricultural Research-Indian Agricultural Research Institute, New Delhi. A subset of 10 genotypes is taken out of the 150 for the current investigation. To discriminate between the genotypes, we considered features such as the number of leaves per plant, the convex hull of a plant (formed by joining the tips of the leaves) and its area, the number of leaves per unit convex hull of a plant, and canopy spread, i.e. the vertical and horizontal spread of a plant. We trained the You Only Look Once (YOLO) deep learning algorithm for leaf-tip detection and to estimate the number of leaves in a rice plant. With this proposed framework, we screened the genotypes based on the selected traits. These genotypes were further grouped among different groupings of drought-tolerant and drought-susceptible genotypes using the Ward method of clustering.
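Given detected leaf-tip coordinates, the hull-based traits can be computed as in this illustrative scipy snippet (the trait set and point format are our assumptions):

import numpy as np
from scipy.spatial import ConvexHull

def hull_traits(leaf_tips):
    # leaf_tips: (n, 2) array of detected leaf-tip (x, y) coordinates
    hull = ConvexHull(leaf_tips)
    n_leaves = len(leaf_tips)
    return {
        "num_leaves": n_leaves,
        "hull_area": hull.volume,  # for 2D inputs, .volume is the area
        "leaves_per_unit_hull": n_leaves / hull.volume,
        "horizontal_spread": float(np.ptp(leaf_tips[:, 0])),
        "vertical_spread": float(np.ptp(leaf_tips[:, 1])),
    }

tips = np.random.rand(12, 2) * 100  # stand-in for YOLO leaf-tip detections
print(hull_traits(tips))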
40. A Generalized Multi-Task Learning Approach to Stereo DSM Filtering in Urban Areas [PDF] 返回目录
Lukas Liebel, Kesnia Bittner, Marco Körner
Abstract: City models and height maps of urban areas serve as a valuable data source for numerous applications, such as disaster management or city planning. While this information is not globally available, it can be substituted by digital surface models (DSMs), automatically produced from inexpensive satellite imagery. However, stereo DSMs often suffer from noise and blur. Furthermore, they are heavily distorted by vegetation, which is of lesser relevance for most applications. Such basic models can be filtered by convolutional neural networks (CNNs), trained on labels derived from digital elevation models (DEMs) and 3D city models, in order to obtain a refined DSM. We propose a modular multi-task learning concept that consolidates existing approaches into a generalized framework. Our encoder-decoder models with shared encoders and multiple task-specific decoders leverage roof type classification as a secondary task and multiple objectives including a conditional adversarial term. The contributing single-objective losses are automatically weighted in the final multi-task loss function based on learned uncertainty estimates. We evaluated the performance of specific instances of this family of network architectures. Our method consistently outperforms the state of the art on common data, both quantitatively and qualitatively, and generalizes well to a new dataset of an independent study area.
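The learned-uncertainty weighting can be sketched along the lines of the well-known recipe of Kendall et al. (a minimal illustration; the paper's exact formulation may differ):

import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # Combines task losses L_i as sum_i exp(-s_i) * L_i + s_i, where
    # s_i = log(sigma_i^2) is a learned per-task log-variance.
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros(())
        for s, loss in zip(self.log_vars, task_losses):
            total = total + torch.exp(-s) * loss + s
        return total

# e.g. combined = UncertaintyWeightedLoss(3)([dsm_loss, roof_loss, adv_loss])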
41. On-device Filtering of Social Media Images for Efficient Storage [PDF] 返回目录
Dhruval Jain, DP Mohanty, Sanjeev Roy, Naresh Purre, Sukumar Moharana
Abstract: Artificially crafted images such as memes, seasonal greetings, etc., are flooding social media platforms today. These eventually occupy a lot of a smartphone's internal memory, and it becomes cumbersome for the user to go through hundreds of images and delete these synthetic images. To address this, we propose a novel method based on Convolutional Neural Networks (CNNs) for the on-device filtering of social media images, classifying these synthetic images and allowing the user to delete them in one go. The custom model uses depthwise separable convolution layers to achieve low inference time on smartphones. We have done an extensive evaluation of our model on various camera image datasets to cover most aspects of images captured by a camera. Various sorts of synthetic social media images have also been tested. The proposed solution achieves an accuracy of 98.25% on the Places-365 dataset and 95.81% on the synthetic image dataset that we prepared, which contains 30K instances.
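A depthwise separable convolution factorizes a standard convolution into a per-channel spatial convolution followed by a 1x1 pointwise convolution; a minimal PyTorch block (illustrative, not the paper's exact architecture):

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1x1 convolution mixes channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

For a 3x3 kernel this needs roughly 8-9x fewer multiply-adds than a standard convolution with the same channel counts, which is what makes it attractive for on-device inference.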
42. Vanishing Point Guided Natural Image Stitching [PDF] 返回目录
Kai Chen, Jian Yao, Jingmin Tu, Yahui Liu, Yinxuan Li, Li Li
Abstract: Recently, work on improving the naturalness of stitched images has gained increasingly broad attention. Previous methods suffer from failures of severe projective distortion and unnatural rotation, especially when the number of involved images is large or the images cover a very wide field of view. In this paper, we propose a novel natural image stitching method that takes into account the guidance of vanishing points to tackle the aforementioned failures. Inspired by the vital observation that mutually orthogonal vanishing points in a Manhattan world can provide really useful orientation clues, we design a scheme to effectively estimate a prior of image similarity. Using this estimated prior as a global similarity constraint, we feed it into a popular mesh deformation framework to achieve impressive natural stitching performance. Compared with other existing methods, including APAP, SPHP, AANAP, and GSP, our method achieves state-of-the-art performance in both quantitative and qualitative experiments on natural image stitching.
43. Robust 3D Self-portraits in Seconds [PDF] 返回目录
Zhe Li, Tao Yu, Chuanyu Pan, Zerong Zheng, Yebin Liu
Abstract: In this paper, we propose an efficient method for robust 3D self-portraits using a single RGBD camera. Benefiting from the proposed PIFusion and lightweight bundle adjustment algorithm, our method can generate detailed 3D self-portraits in seconds and shows the ability to handle subjects wearing extremely loose clothes. To achieve highly efficient and robust reconstruction, we propose PIFusion, which combines learning-based 3D recovery with volumetric non-rigid fusion to generate accurate sparse partial scans of the subject. Moreover, a non-rigid volumetric deformation method is proposed to continuously refine the learned shape prior. Finally, a lightweight bundle adjustment algorithm is proposed to guarantee that all the partial scans can not only "loop" with each other but also remain consistent with the selected live key observations. The results and experiments show that the proposed method achieves more robust and efficient 3D self-portraits compared with state-of-the-art methods.
44. Class Anchor Clustering: a Distance-based Loss for Training Open Set Classifiers [PDF] 返回目录
Dimity Miller, Niko Sünderhauf, Michael Milford, Feras Dayoub
Abstract: Existing open set classifiers distinguish between known and unknown inputs by measuring distance in a network's logit space, assuming that known inputs cluster closer to the training data than unknown inputs. However, this approach is typically applied post-hoc to networks trained with cross-entropy loss, which neither guarantees nor encourages the hoped-for clustering behaviour. To overcome this limitation, we introduce Class Anchor Clustering (CAC) loss. CAC is an entirely distance-based loss that explicitly encourages training data to form tight clusters around class-dependent anchor points in the logit space. We show that an open set classifier trained with CAC loss outperforms all state-of-the-art techniques on the challenging TinyImageNet dataset, achieving a 2.4% performance increase in AUROC. In addition, our approach outperforms other state-of-the-art distance-based approaches on a number of further relevant datasets. We will make the code for CAC publicly available.
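A rough sketch of a distance-to-anchor loss in the spirit of CAC (the published loss has a specific form and anchor placement; this is only a minimal illustration):

import torch
import torch.nn.functional as F

def anchor_clustering_loss(logits, labels, anchors):
    # logits: (B, d) network outputs; labels: (B,) class ids;
    # anchors: (C, d) fixed class-dependent anchor points in logit space.
    dists = torch.cdist(logits, anchors)             # (B, C) distances
    pull = dists[torch.arange(len(labels)), labels]  # to own class anchor
    # Cross-entropy over negated distances: being close to the right
    # anchor and far from the others lowers this term.
    push = F.cross_entropy(-dists, labels, reduction="none")
    return (pull + push).mean()

At test time, an input whose minimum anchor distance exceeds a threshold can be flagged as unknown.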
45. Deep Space-Time Video Upsampling Networks [PDF] 返回目录
Jaeyeon Kang, Younghyun Jo, Seoung Wug Oh, Peter Vajda, Seon Joo Kim
Abstract: Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems, and their performance has recently been improving through the incorporation of deep learning. In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. One solution is to run VSR and FI, one by one, independently. This is highly inefficient, as heavy deep neural networks (DNNs) are involved in each solution. To this end, we propose an end-to-end DNN framework for space-time video upsampling by efficiently merging VSR and FI into a joint framework. In our framework, a novel weighting scheme is proposed to fuse input frames effectively without explicit motion compensation, for efficient processing of videos. Our framework yields better results both quantitatively and qualitatively, while reducing the computation time (7x faster) and the number of parameters (30%) compared to baselines.
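One plausible reading of such a weighting scheme, sketched in PyTorch (purely illustrative; the paper's scheme may differ):

import torch.nn as nn

class FrameFusion(nn.Module):
    # Predicts a weight map per input frame, softmax-normalizes the maps
    # across frames, and blends the frame features accordingly; no flow
    # or explicit motion compensation is involved.
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feats):  # feats: (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        logits = self.weight_net(feats.reshape(b * t, c, h, w))
        weights = logits.reshape(b, t, 1, h, w).softmax(dim=1)
        return (weights * feats).sum(dim=1)  # fused (B, C, H, W)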
46. Detecting the Saliency of Remote Sensing Images Based on Sparse Representation of Contrast-weighted Atoms [PDF] 返回目录
Zhou Huang, Huai-Xin Chen, Yun-Zhi Yang, Chang-Yin Wang, Bi-Yuan Liu
Abstract: Object detection is an important task in remote sensing (RS) image analysis. To reduce the computational complexity of redundant information and improve the efficiency of image processing, visual saliency models are gradually being applied in this field. In this paper, a novel saliency detection method is proposed by exploring the sparse representation (SR) of learning-based contrast-weighted atoms (LCWA). Specifically, this paper uses the proposed LCWA atom learning formula on positive and negative samples to construct a saliency dictionary, and on non-saliency atoms to construct a discriminant dictionary. An online discriminant dictionary learning algorithm is proposed to solve the atom learning formula. Then, we measure saliency by combining the coefficients of the SR and reconstruction errors. Furthermore, under the proposed joint saliency measure, a variety of saliency maps are generated by the discriminant dictionary. Finally, a fusion method based on global gradient optimisation is proposed to integrate multiple saliency maps. Experimental results show that the proposed method significantly outperforms current state-of-the-art methods under six evaluation measures.
47. AutoToon: Automatic Geometric Warping for Face Cartoon Generation [PDF] 返回目录
Julia Gong, Yannick Hold-Geoffroy, Jingwan Lu
Abstract: Caricature, a type of exaggerated artistic portrait, amplifies the distinctive, yet nuanced traits of human faces. This task is typically left to artists, as it has proven difficult to capture subjects' unique characteristics well using automated methods. Recent development of deep end-to-end methods has achieved promising results in capturing style and higher-level exaggerations. However, a key part of caricatures, face warping, has remained challenging for these systems. In this work, we propose AutoToon, the first supervised deep learning method that yields high-quality warps for the warping component of caricatures. Completely disentangled from style, it can be paired with any stylization method to create diverse caricatures. In contrast to prior art, we leverage an SENet and spatial transformer module and train directly on artist warping fields, applying losses both prior to and after warping. As shown by our user studies, we achieve appealing exaggerations that amplify distinguishing features of the face while preserving facial detail.
48. Steering Self-Supervised Feature Learning Beyond Local Pixel Statistics [PDF] 返回目录
Simon Jenni, Hailin Jin, Paolo Favaro
Abstract: We introduce a novel principle for self-supervised feature learning based on the discrimination of specific transformations of an image. We argue that the generalization capability of learned features depends on what image neighborhood size is sufficient to discriminate different image transformations: the larger the required neighborhood size, the more global the image statistics that the feature can describe. An accurate description of global image statistics makes it possible to better represent the shape and configuration of objects and their context, which ultimately generalizes better to new tasks such as object classification and detection. This suggests a criterion for choosing and designing image transformations. Based on this criterion, we introduce a novel image transformation that we call limited context inpainting (LCI). This transformation inpaints an image patch conditioned only on a small rectangular pixel boundary (the limited context). Because of the limited boundary information, the inpainter can learn to match local pixel statistics, but is unlikely to match the global statistics of the image. We claim that the same principle can be used to justify the performance of transformations such as image rotations and warping. Indeed, we demonstrate experimentally that learning to discriminate transformations such as LCI, image warping and rotations yields features with state-of-the-art generalization capabilities on several datasets such as Pascal VOC, STL-10, CelebA, and ImageNet. Remarkably, our trained features achieve a performance on Places on par with features trained through supervised learning with ImageNet labels.
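The masking at the heart of LCI can be sketched in a few lines of numpy (the inpainting network is omitted; the boundary width and the noise stand-in for the inpainter output are our assumptions):

import numpy as np

def lci_patch(patch, border=4, rng=np.random):
    # Keep only a thin pixel boundary of `patch` (H, W, C); the interior,
    # which the inpainter would synthesize from the boundary alone, is
    # replaced here by random noise as a stand-in.
    out = patch.copy()
    h, w, c = patch.shape
    out[border:h - border, border:w - border] = \
        rng.random(size=(h - 2 * border, w - 2 * border, c))
    return out

A self-supervised classifier is then trained to tell such patches apart from other transformations (warping, rotations) of the same image.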
49. Hyper-spectral NIR and MIR data and optimal wavebands for detecting of apple trees diseases [PDF] 返回目录
Dmitrii Shadrin, Mariia Pukalchik, Anastasia Uryasheva, Nikita Rodichenko, Dzmitry Tsetserukou
Abstract: Plant diseases can lead to dramatic losses in yield and quality of food, becoming a high-priority problem for farmers. Apple scab, moniliasis, and powdery mildew are the most significant apple tree diseases worldwide and may cause yield losses of between 50% and 60% annually; they are controlled by fungicide use at huge financial and time expense. This research proposes a modern approach for analysing spectral data in the Near-Infrared and Mid-Infrared ranges of apple tree diseases at different stages. Using the obtained spectra, we found optimal spectral bands for detecting particular diseases and discriminating them from other diseases and from healthy trees. The proposed instrument will provide farmers with accurate, real-time information on the different stages of apple tree diseases, enabling more effective timing and selection of fungicide application and resulting in better control and increased yield. The obtained dataset, as well as Matlab scripts for processing the data and finding optimal spectral bands, are available via the link: this https URL
50. EfficientPS: Efficient Panoptic Segmentation [PDF] 返回目录
Rohit Mohan, Abhinav Valada
Abstract: Understanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.
51. Feature Super-Resolution Based Facial Expression Recognition for Multi-scale Low-Resolution Faces [PDF] 返回目录
Wei Jing, Feng Tian, Jizhong Zhang, Kuo-Ming Chao, Zhenxin Hong, Xu Liu
Abstract: Facial Expression Recognition (FER) on low-resolution images is necessary for applications such as group expression recognition in crowd scenarios (stations, classrooms, etc.). Classifying a small facial image into the right expression category is still a challenging task. The main cause of this problem is the loss of discriminative features due to reduced resolution. Super-resolution methods are often used to enhance low-resolution images, but their benefit to the FER task is limited on images of very low resolution. In this work, inspired by feature super-resolution methods for object detection, we propose a novel generative adversarial network-based feature-level super-resolution method for robust facial expression recognition (FSR-FER). In particular, a pre-trained FER model is employed as a feature extractor, and a generator network G and a discriminator network D are trained with features extracted from images of low resolution and of the original high resolution. The generator network G tries to transform features of low-resolution images into more discriminative ones by making them closer to those of the corresponding high-resolution images. For better classification performance, we also propose an effective classification-aware loss re-weighting strategy based on the classification probability calculated by a fixed FER model, which makes our model focus more on samples that are easily misclassified. Experimental results on the Real-World Affective Faces (RAF) database demonstrate that our method achieves satisfying results across various down-sampling factors with a single model and has better performance on low-resolution images than methods that apply image super-resolution and expression recognition separately.
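The classification-aware re-weighting admits a compact sketch. Assuming (our reading, not the paper's exact formula) that each sample's loss is scaled by how poorly the fixed FER model classifies it, a focal-style weight does the job:

```python
import torch
import torch.nn.functional as F

def reweighted_ce(student_logits, labels, fixed_fer_logits, gamma=1.0):
    """Classification-aware loss re-weighting (sketch; exact form assumed).

    Samples the fixed FER model already classifies confidently receive a
    small weight; easily misclassified samples receive a large one.
    """
    with torch.no_grad():
        p_correct = F.softmax(fixed_fer_logits, dim=1).gather(
            1, labels.unsqueeze(1)).squeeze(1)      # prob. of the true class
        weights = (1.0 - p_correct) ** gamma        # focal-style weighting
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * ce).mean()
```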
52. Structural-analogy from a Single Image Pair [PDF] 返回目录
Sagie Benaim, Ron Mokady, Amit Bermano, Daniel Cohen-Or, Lior Wolf
Abstract: The task of unsupervised image-to-image translation has seen substantial advancements in recent years through the use of deep neural networks. Typically, the proposed solutions learn the characterizing distribution of two large, unpaired collections of images, and are able to alter the appearance of a given image, while keeping its geometry intact. In this paper, we explore the capabilities of neural networks to understand image structure given only a single pair of images, A and B. We seek to generate images that are structurally aligned: that is, to generate an image that keeps the appearance and style of B, but has a structural arrangement that corresponds to A. The key idea is to map between image patches at different scales. This enables controlling the granularity at which analogies are produced, which determines the conceptual distinction between style and content. In addition to structural alignment, our method can be used to generate high quality imagery in other conditional generation tasks utilizing images A and B only: guided image synthesis, style and texture transfer, text translation as well as video translation. Our code and additional results are available in this https URL.
53. Light Field Spatial Super-resolution via Deep Combinatorial Geometry Embedding and Structural Consistency Regularization [PDF] 返回目录
Jing Jin, Junhui Hou, Jie Chen, Sam Kwong
Abstract: Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution as the limited sampling resources have to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high-dimensionality characteristic and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited as they fail to thoroughly explore the coherence among LF views and are insufficient in accurately preserving the parallax structure of the scene. In this paper, we propose a novel learning-based LF spatial SR framework, in which each view of an LF image is first individually super-resolved by exploring the complementary information among views with combinatorial geometry embedding. For accurate preservation of the parallax structure among the reconstructed views, a regularization network trained over a structure-aware loss function is subsequently appended to enforce correct parallax relationships over the intermediate estimation. Our proposed approach is evaluated over datasets with a large number of testing images including both synthetic and real-world scenes. Experimental results demonstrate the advantage of our approach over state-of-the-art methods, i.e., our method not only improves the average PSNR by more than 1.0 dB but also preserves more accurate parallax details, at a lower computational cost.
54. Deep Multimodal Feature Encoding for Video Ordering [PDF] 返回目录
Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen
Abstract: True understanding of videos comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the "Large Scale Movie Description Challenge". We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the performance of many applications.
55. Confident Coreset for Active Learning in Medical Image Analysis [PDF] 返回目录
Seong Tae Kim, Farrukh Mushtaq, Nassir Navab
Abstract: Recent advances in deep learning have resulted in great successes in various applications. Although semi-supervised or unsupervised learning methods have been widely investigated, the performance of deep neural networks depends highly on the annotated data. The problem is that the budget for annotation is usually limited, owing to the long annotation time and high annotation cost of medical data. Active learning is one solution to this problem: an active learner is designed to indicate which samples need to be annotated to effectively train a target model. In this paper, we propose a novel active learning method, confident coreset, which considers both uncertainty and distribution to effectively select informative samples. Through comparative experiments on two medical image analysis tasks, we show that our method outperforms other active learning methods.
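The abstract only says that uncertainty and distribution are combined, so the following greedy selection is one plausible reading, not the paper's algorithm: a k-center (coreset) farthest-point rule whose score is scaled by a per-sample uncertainty estimate.

```python
import numpy as np

def confident_coreset(features, uncertainty, budget):
    """Greedy selection mixing coverage (k-center) with uncertainty (sketch).

    features: (n, d) feature matrix; uncertainty: (n,) scores in [0, 1];
    budget: number of samples to send for annotation.
    """
    selected = [int(np.argmax(uncertainty))]          # seed with most uncertain
    d = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        score = d * uncertainty                       # far from coreset AND uncertain
        nxt = int(np.argmax(score))
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(features - features[nxt], axis=1))
    return selected
```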
56. Clustering based Contrastive Learning for Improving Face Representations [PDF] 返回目录
Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen
Abstract: A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features. We demonstrate our method on the challenging task of learning representations for video face clustering. Through several ablation studies, we analyze the impact of creating pair-wise positive and negative labels from different sources. Experiments on three challenging video face clustering datasets: BBT-0101, BF-0502, and ACCIO show that CCL achieves a new state-of-the-art on all datasets.
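One way to picture CCL's use of cluster assignments is a supervised-contrastive loss in which cluster ids act as pseudo-labels; the sketch below takes that reading (the paper's exact objective and its video constraints are not reproduced here).

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(embeddings, cluster_ids, tau=0.1):
    """Contrastive loss with cluster assignments as pseudo-labels (sketch).

    embeddings: (n, d) face features; cluster_ids: (n,) labels produced by
    an off-line clustering step.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / tau                              # (n, n) similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))       # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log = torch.where(pos, log_prob, torch.zeros_like(log_prob))
    pos_counts = pos.sum(dim=1).clamp(min=1)
    # maximise log-likelihood of same-cluster (positive) pairs
    return -(pos_log.sum(dim=1) / pos_counts).mean()
```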
57. Iterative Context-Aware Graph Inference for Visual Dialog [PDF] 返回目录
Dan Guo, Hui Wang, Hanwang Zhang, Zheng-Jun Zha, Meng Wang
Abstract: Visual dialog is a challenging task that requires the comprehension of the semantic dependencies among implicit visual and textual contexts. This task can be cast as relation inference in a graphical model with sparse contexts and an unknown graph structure (relation descriptor), and how to model the underlying context-aware relation inference is critical. To this end, we propose a novel Context-Aware Graph (CAG) neural network. Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations. The graph structure (relations in dialog) is iteratively updated using an adaptive top-$K$ message passing mechanism. Specifically, in every message passing step, each node selects the $K$ most relevant nodes, and only receives messages from them. Then, after the update, we impose graph attention on all the nodes to get the final graph embedding and infer the answer. In CAG, each node has dynamic relations in the graph (the $K$ related neighbor nodes differ from node to node), and only the most relevant nodes contribute to the context-aware relational graph inference. Experimental results on the VisDial v0.9 and v1.0 datasets show that CAG outperforms comparative methods. Visualization results further validate the interpretability of our method.
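The adaptive top-$K$ message passing step can be sketched directly. Relevance is computed here as plain dot-product similarity, which is our simplification; in CAG the relevance and the graph update are learned.

```python
import torch
import torch.nn.functional as F

def topk_message_passing(node_feats, k):
    """One round of top-K message passing (sketch of the idea in CAG).

    node_feats: (n, d) node features. Each node attends only to its K
    most relevant neighbours (assumed: dot-product relevance).
    """
    scores = node_feats @ node_feats.t()                  # (n, n) relevance
    scores.fill_diagonal_(float("-inf"))                  # no self message
    topv, topi = scores.topk(k, dim=1)                    # keep K best per node
    attn = F.softmax(topv, dim=1)                         # (n, k) weights
    messages = (attn.unsqueeze(-1) * node_feats[topi]).sum(dim=1)
    return node_feats + messages                          # residual update
```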
58. Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [PDF] 返回目录
Edoardo Remelli, Shangchen Han, Sina Honari, Pascal Fua, Robert Wang
Abstract: We present a lightweight solution to recover 3D pose from multi-view images captured with spatially calibrated cameras. Building upon recent advances in interpretable representation learning, we exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points. This allows us to reason effectively about 3D pose across different views without using compute-intensive volumetric grids. Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections, that can be simply lifted to 3D via a differentiable Direct Linear Transform (DLT) layer. In order to do it efficiently, we propose a novel implementation of DLT that is orders of magnitude faster on GPU architectures than standard SVD-based triangulation methods. We evaluate our approach on two large-scale human pose datasets (H36M and Total Capture): our method outperforms or performs comparably to the state-of-the-art volumetric methods, while, unlike them, yielding real-time performance.
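The DLT layer itself is standard enough to sketch. The PyTorch version below triangulates one landmark from V calibrated views by solving the homogeneous system with an SVD; gradients flow through `torch.linalg.svd`, which is what makes the layer differentiable. The paper's contribution is a much faster GPU formulation of this solve, which the sketch does not reproduce.

```python
import torch

def dlt_triangulate(points_2d, proj_mats):
    """Differentiable DLT triangulation of one landmark (sketch).

    points_2d: (V, 2) detections in V views; proj_mats: (V, 3, 4) camera
    projection matrices. Solves A x = 0 for the homogeneous 3D point.
    """
    u = points_2d[:, 0:1]                       # (V, 1)
    v = points_2d[:, 1:2]
    rows = torch.cat([u * proj_mats[:, 2] - proj_mats[:, 0],
                      v * proj_mats[:, 2] - proj_mats[:, 1]], dim=0)  # (2V, 4)
    _, _, vh = torch.linalg.svd(rows)
    x = vh[-1]                                  # null-space direction
    return x[:3] / x[3]                         # de-homogenise to (3,)
```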
59. Comparative Analysis of Multiple Deep CNN Models for Waste Classification [PDF] 返回目录
Dipesh Gyawali, Alok Regmi, Aatish Shakya, Ashish Gautam, Surendra Shrestha
Abstract: Waste is wealth in the wrong place. Our research focuses on analyzing possibilities for automatic waste sorting and collecting in a way that helps with the subsequent recycling process. Various approaches are practiced for managing waste, but they are not efficient and require human intervention. Automatic waste segregation would fill this gap. The project tested well-known Deep Learning network architectures for waste classification on a dataset combining our own collection with Trash Net. A convolutional neural network is used for image classification. Hardware built in the form of a dustbin is used to segregate the wastes into different compartments. By removing the human effort of segregating waste products, the study saves precious time and introduces automation in the area of waste management. Municipal solid waste is a huge, renewable source of energy, and the situation is a win-win for government, society and industrialists alike. With fine-tuning of the ResNet18 network, the best validation accuracy was found to be 87.8%.
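Since the best model is a fine-tuned ResNet18, the setup is easy to illustrate with torchvision. The class count, learning rate and optimizer below are placeholders, not the authors' settings (Trash Net ships six categories, which we assume here).

```python
import torch
import torch.nn as nn
from torchvision import models

# Fine-tune an ImageNet-pretrained ResNet18 for N waste categories.
NUM_CLASSES = 6  # assumed: glass, paper, cardboard, plastic, metal, trash

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of waste images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```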
60. DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation [PDF] 返回目录
Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, Huazhong Yang
Abstract: Budgeted pruning is the problem of pruning under resource constraints. In budgeted pruning, how to distribute the resources across layers (i.e., sparsity allocation) is the key problem. Traditional methods solve it by discretely searching for the layer-wise pruning ratios, which lacks efficiency. In this paper, we propose Differentiable Sparsity Allocation (DSA), an efficient end-to-end budgeted pruning flow. Utilizing a novel differentiable pruning process, DSA finds the layer-wise pruning ratios with gradient-based optimization. It allocates sparsity in a continuous space, which is more efficient than methods based on discrete evaluation and search. Furthermore, DSA can work in a pruning-from-scratch manner, whereas traditional budgeted pruning methods are applied to pre-trained models. Experimental results on CIFAR-10 and ImageNet show that DSA achieves superior performance over current iterative budgeted pruning methods, while shortening the time cost of the overall pruning process by at least 1.5x.
61. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation [PDF] 返回目录
Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, Nong Sang
Abstract: The low-level details and high-level semantics are both essential to the semantic segmentation task. However, to speed up the model inference, current approaches almost always sacrifice the low-level details, which leads to a considerable accuracy decrease. We propose to treat these spatial details and categorical semantics separately to achieve high accuracy and high efficiency for realtime semantic segmentation. To this end, we propose an efficient and effective architecture with a good trade-off between speed and accuracy, termed Bilateral Segmentation Network (BiSeNet V2). This architecture involves: (i) a Detail Branch, with wide channels and shallow layers to capture low-level details and generate high-resolution feature representation; (ii) a Semantic Branch, with narrow channels and deep layers to obtain high-level semantic context. The Semantic Branch is lightweight due to reducing the channel capacity and a fast-downsampling strategy. Furthermore, we design a Guided Aggregation Layer to enhance mutual connections and fuse both types of feature representation. Besides, a booster training strategy is designed to improve the segmentation performance without any extra inference cost. Extensive quantitative and qualitative evaluations demonstrate that the proposed architecture performs favourably against a few state-of-the-art real-time semantic segmentation approaches. Specifically, for a 2,048x1,024 input, we achieve 72.6% Mean IoU on the Cityscapes test set with a speed of 156 FPS on one NVIDIA GeForce GTX 1080 Ti card, which is significantly faster than existing methods, yet we achieve better segmentation accuracy.
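A toy two-branch module makes the Detail/Semantic split concrete. The block below only mimics the shape of the design (wide-and-shallow vs. narrow-and-deep with fast downsampling, then a crude sigmoid-gated fusion); it is not the paper's Guided Aggregation Layer or its exact blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TwoBranchSeg(nn.Module):
    """Toy two-branch segmenter in the spirit of BiSeNet V2 (assumed layout)."""
    def __init__(self, num_classes):
        super().__init__()
        # Detail branch: wide channels, shallow, output stride 8.
        self.detail = nn.Sequential(conv_bn_relu(3, 64, 2),
                                    conv_bn_relu(64, 64, 2),
                                    conv_bn_relu(64, 128, 2))
        # Semantic branch: narrow channels, deeper, fast downsampling to stride 32.
        self.semantic = nn.Sequential(conv_bn_relu(3, 16, 4),
                                      conv_bn_relu(16, 32, 2),
                                      conv_bn_relu(32, 64, 2),
                                      conv_bn_relu(64, 128, 2))
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        d = self.detail(x)                                   # (B, 128, H/8, W/8)
        s = self.semantic(x)                                 # (B, 128, H/32, W/32)
        s = F.interpolate(s, size=d.shape[-2:], mode="bilinear", align_corners=False)
        fused = d * torch.sigmoid(s) + s                     # crude guided aggregation
        out = self.head(fused)
        return F.interpolate(out, scale_factor=8, mode="bilinear", align_corners=False)
```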
62. Flow2Stereo: Effective Self-Supervised Learning of Optical Flow and Stereo Matching [PDF] 返回目录
Pengpeng Liu, Irwin King, Michael Lyu, Jia Xu
Abstract: In this paper, we propose a unified method to jointly learn optical flow and stereo matching. Our first intuition is that stereo matching can be modeled as a special case of optical flow, and we can leverage the 3D geometry behind stereoscopic videos to guide the learning of these two forms of correspondences. We then enroll this knowledge into the state-of-the-art self-supervised learning framework, and train one single network to estimate both flow and stereo. Second, we unveil the bottlenecks in prior self-supervised learning approaches, and propose to create a new set of challenging proxy tasks to boost performance. These two insights yield a single model that achieves the highest accuracy among all existing unsupervised flow and stereo methods on the KITTI 2012 and 2015 benchmarks. More remarkably, our self-supervised method even outperforms several state-of-the-art fully supervised methods, including PWC-Net and FlowNet2 on KITTI 2012.
63. A Discriminator Improves Unconditional Text Generation without Updating the Generator [PDF] 返回目录
Xingyuan Chen, Ping Cai, Peng Jin, Hongjun Wang, Xingyu Dai, Jiajun Chen
Abstract: We propose a novel mechanism to improve a text generator with a discriminator, which is trained to estimate the probability that a sample comes from real or generated data. In contrast to recent discrete language generative adversarial networks (GANs), which update the parameters of the generator directly, our method only retains generated samples that the discriminator determines to come from real data with relatively high probability. This not only retains valuable information but also avoids the mode collapse introduced by GANs. The new mechanism is conceptually simple and experimentally powerful. To the best of our knowledge, this is the first method that improves neural language models (LMs) trained with maximum likelihood estimation (MLE) by using a discriminator. Experimental results show that our mechanism improves both RNN-based and Transformer-based LMs when measuring sample quality and sample diversity simultaneously at different softmax temperatures (a previously noted deficit of language GANs). Further, by recursively adding more discriminators, more powerful generators are created.
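The core selection mechanism reduces to a filter over generated samples. A minimal sketch, assuming the discriminator outputs P(real) per sample; the threshold is illustrative and the generator itself is never updated:

```python
import torch

@torch.no_grad()
def filter_generated(samples, discriminator, threshold=0.5):
    """Keep only generated samples the discriminator scores as likely real.

    samples: a batch of generated sequences (any tensor the discriminator
    accepts); discriminator: callable returning (n,) probabilities P(real).
    """
    p_real = discriminator(samples)
    keep = p_real > threshold
    return samples[keep], p_real[keep]
```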
64. Adversarial-Prediction Guided Multi-task Adaptation for Semantic Segmentation of Electron Microscopy Images [PDF] 返回目录
Jiajin Yi, Zhimin Yuan, Jialin Peng
Abstract: Semantic segmentation is an essential step for electron microscopy (EM) image analysis. Although supervised models have achieved significant progress, the need for labor intensive pixel-wise annotation is a major limitation. To complicate matters further, supervised learning models may not generalize well on a novel dataset due to domain shift. In this study, we introduce an adversarial-prediction guided multi-task network to learn the adaptation of a well-trained model for use on a novel unlabeled target domain. Since no label is available on target domain, we learn an encoding representation not only for the supervised segmentation on source domain but also for unsupervised reconstruction of the target data. To improve the discriminative ability with geometrical cues, we further guide the representation learning by multi-level adversarial learning in semantic prediction space. Comparisons and ablation study on public benchmark demonstrated state-of-the-art performance and effectiveness of our approach.
65. Neuron Linear Transformation: Modeling the Domain Shift for Crowd Counting [PDF] 返回目录
Qi Wang, Tao Han, Junyu Gao, Yuan Yuan
Abstract: Cross-domain crowd counting (CDCC) is a hot topic due to its importance in public safety. The purpose of CDCC is to reduce the domain shift between the source and target domain. Recently, typical methods attempt to extract domain-invariant features via image translation and adversarial learning. When it comes to specific tasks, we find that the final manifestation of the task gap lies in the parameters of the model, and that the domain shift can evidently be represented by differences in model weights. To describe the domain gap directly at the parameter level, we propose a Neuron Linear Transformation (NLT) method, where NLT is exploited to learn the shift at the neuron level and then transfer the source model to the target model. Specifically, for a specific neuron of a source model, NLT exploits a few labeled target samples to learn a group of parameters, which updates the target neuron via a linear transformation. Extensive experiments and analysis on six real-world datasets validate that NLT achieves top performance compared with other domain adaptation methods. An ablation study also shows that the NLT is robust and more effective compared with supervised and fine-tuning training. Furthermore, we will release the code after the paper is accepted.
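The neuron-level linear transformation is simple to sketch for a fully connected layer: freeze the source weights and learn one scale and one shift per output neuron, so w_t = a * w_s + b. The exact parameterization in the paper may differ; this shows the idea only.

```python
import torch
import torch.nn as nn

class NeuronLinearTransform(nn.Module):
    """Per-neuron linear transform of frozen source weights (sketch)."""
    def __init__(self, source_linear: nn.Linear):
        super().__init__()
        self.w_s = source_linear.weight.detach()        # frozen source weights
        self.b_s = source_linear.bias.detach()
        out_dim = self.w_s.shape[0]
        self.scale = nn.Parameter(torch.ones(out_dim, 1))   # 'a' per neuron
        self.shift = nn.Parameter(torch.zeros(out_dim, 1))  # 'b' per neuron

    def forward(self, x):
        w_t = self.scale * self.w_s + self.shift        # (out, in) target weights
        b_t = self.scale.squeeze(1) * self.b_s + self.shift.squeeze(1)
        return nn.functional.linear(x, w_t, b_t)
```

Only `scale` and `shift` receive gradients from the few labeled target samples, which keeps the number of adapted parameters tiny compared with fine-tuning the full layer.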
66. Deep Homography Estimation for Dynamic Scenes [PDF] 返回目录
Hoang Le, Feng Liu, Shu Zhang, Aseem Agarwala
Abstract: Homography estimation is an important step in many computer vision problems. Recently, deep neural network methods have shown to be favorable for this problem when compared to traditional methods. However, these new methods do not consider dynamic content in input images. They train neural networks with only image pairs that can be perfectly aligned using homographies. This paper investigates and discusses how to design and train a deep neural network that handles dynamic scenes. We first collect a large video dataset with dynamic content. We then develop a multi-scale neural network and show that when properly trained using our new dataset, this neural network can already handle dynamic scenes to some extent. To estimate a homography of a dynamic scene in a more principled way, we need to identify the dynamic content. Since dynamic content detection and homography estimation are two tightly coupled tasks, we follow the multi-task learning principles and augment our multi-scale network such that it jointly estimates the dynamics masks and homographies. Our experiments show that our method can robustly estimate homography for challenging scenarios with dynamic scenes, blur artifacts, or lack of textures.
67. Anisotropic Convolutional Networks for 3D Semantic Scene Completion [PDF] 返回目录
Jie Li, Kai Han, Peng Wang, Yu Liu, Xia Yuan
Abstract: As a voxel-wise labeling task, semantic scene completion (SSC) tries to simultaneously infer the occupancy and semantic labels for a scene from a single depth and/or RGB image. The key challenge for SSC is how to effectively take advantage of the 3D context to model various objects or stuffs with severe variations in shapes, layouts and visibility. To handle such variations, we propose a novel module called anisotropic convolution, which offers flexibility and power impossible for competing methods such as standard 3D convolution and some of its variants. In contrast to the standard 3D convolution that is limited to a fixed 3D receptive field, our module is capable of modeling the dimensional anisotropy voxel-wise. The basic idea is to enable an anisotropic 3D receptive field by decomposing a 3D convolution into three consecutive 1D convolutions, where the kernel size for each 1D convolution is adaptively determined on the fly. By stacking multiple such anisotropic convolution modules, the voxel-wise modeling capability can be further enhanced while maintaining a controllable number of model parameters. Extensive experiments on two SSC benchmarks, NYU-Depth-v2 and NYUCAD, show the superior performance of the proposed method. Our code is available at this https URL
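The decomposition into three 1D convolutions is concrete enough for a minimal module. The sketch fixes the kernel size per axis, whereas the paper selects it adaptively on the fly:

```python
import torch.nn as nn

class Anisotropic3dConv(nn.Module):
    """Decompose a 3D conv into three 1D convs along D, H, W (sketch).

    The paper additionally learns the kernel size per axis at runtime;
    here the sizes are fixed to keep the sketch simple.
    """
    def __init__(self, channels, k=3):
        super().__init__()
        p = k // 2
        self.conv_d = nn.Conv3d(channels, channels, (k, 1, 1), padding=(p, 0, 0))
        self.conv_h = nn.Conv3d(channels, channels, (1, k, 1), padding=(0, p, 0))
        self.conv_w = nn.Conv3d(channels, channels, (1, 1, k), padding=(0, 0, p))

    def forward(self, x):                  # x: (B, C, D, H, W)
        return self.conv_w(self.conv_h(self.conv_d(x)))
```

Three k×1×1 kernels cost 3k weights per channel pair instead of k³ for a full 3D kernel, which is where both the parameter saving and the per-axis flexibility come from.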
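The decomposition at the heart of the module is concrete enough to sketch. The snippet below implements three consecutive 1D convolutions, one per spatial axis, each choosing among candidate kernel sizes through a learned, input-dependent softmax gate. The gating is our guess at "adaptively determined on the fly"; all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Axis1DConv(nn.Module):
    """One axis of the decomposition: parallel 1D convolutions with
    different kernel sizes, mixed by a learned, input-dependent softmax.
    This approximates 'kernel size determined on the fly'; the paper's
    exact selection mechanism may differ."""
    def __init__(self, channels, axis, sizes=(3, 5, 7)):
        super().__init__()
        # The kernel has a 1D extent only along `axis` (0=D, 1=H, 2=W).
        def k(s):
            shape = [1, 1, 1]; shape[axis] = s
            return tuple(shape)
        def p(s):
            pad = [0, 0, 0]; pad[axis] = s // 2
            return tuple(pad)
        self.branches = nn.ModuleList(
            [nn.Conv3d(channels, channels, k(s), padding=p(s)) for s in sizes])
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(channels, len(sizes)))

    def forward(self, x):
        w = F.softmax(self.gate(x), dim=1)            # (B, num_sizes)
        outs = torch.stack([b(x) for b in self.branches], dim=1)
        return (w[:, :, None, None, None, None] * outs).sum(dim=1)

class AnisotropicConv(nn.Module):
    """Three consecutive 1D convolutions, one per spatial axis."""
    def __init__(self, channels):
        super().__init__()
        self.axes = nn.Sequential(*[Axis1DConv(channels, a) for a in range(3)])
    def forward(self, x):
        return F.relu(self.axes(x) + x)               # residual, as is common

m = AnisotropicConv(16)
print(m(torch.randn(1, 16, 8, 8, 8)).shape)  # torch.Size([1, 16, 8, 8, 8])
```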
68. Attentive One-Dimensional Heatmap Regression for Facial Landmark Detection and Tracking [PDF] 返回目录
Shi Yin, Shangfei Wang, Xiaoping Chen, Enhong Chen
Abstract: Although heatmap regression is considered a state-of-the-art method to locate facial landmarks, it suffers from huge spatial complexity and is prone to quantization error. To address this, we propose a novel attentive one-dimensional heatmap regression method for facial landmark localization. First, we predict two groups of 1D heatmaps to represent the marginal distributions of the x and y coordinates. These 1D heatmaps reduce spatial complexity significantly compared to current heatmap regression methods, which use 2D heatmaps to represent the joint distributions of x and y coordinates. With much lower spatial complexity, the proposed method can output high-resolution 1D heatmaps despite limited GPU memory, significantly alleviating the quantization error. Second, a co-attention mechanism is adopted to model the inherent spatial patterns existing in x and y coordinates, and therefore the joint distributions on the x and y axes are also captured. Third, based on the 1D heatmap structures, we propose a facial landmark detector capturing spatial patterns for landmark detection on an image; and a tracker further capturing temporal patterns with a temporal refinement mechanism for landmark tracking. Experimental results on four benchmark databases demonstrate the superiority of our method.
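The 1D marginal-heatmap encoding and its sub-pixel decoding are easy to make precise. A small sketch, assuming Gaussian targets and an expectation-based decoder (the paper's exact target and decoding functions may differ):

```python
import torch

def encode_1d_heatmaps(landmarks, width, height, sigma=2.0):
    """Render each (x, y) landmark as two 1D Gaussians, i.e. the marginal
    distributions over the x and y axes, instead of one W x H 2D map.
    Storage drops from O(W*H) to O(W+H) per landmark, which is what lets
    the maps stay high-resolution under a fixed memory budget."""
    xs = torch.arange(width, dtype=torch.float32)
    ys = torch.arange(height, dtype=torch.float32)
    hx = torch.exp(-(xs[None, :] - landmarks[:, 0:1]) ** 2 / (2 * sigma ** 2))
    hy = torch.exp(-(ys[None, :] - landmarks[:, 1:2]) ** 2 / (2 * sigma ** 2))
    return hx, hy                                    # (N, W), (N, H)

def decode_expectation(hx, hy):
    """Sub-pixel decoding: the expected coordinate under each normalized
    1D map, which avoids the quantization error of a hard argmax."""
    px = hx / hx.sum(dim=1, keepdim=True)
    py = hy / hy.sum(dim=1, keepdim=True)
    x = (px * torch.arange(hx.shape[1], dtype=torch.float32)).sum(dim=1)
    y = (py * torch.arange(hy.shape[1], dtype=torch.float32)).sum(dim=1)
    return torch.stack([x, y], dim=1)                # (N, 2)

lm = torch.tensor([[30.0, 40.0], [100.0, 60.0]])
hx, hy = encode_1d_heatmaps(lm, width=256, height=256)
print(decode_expectation(hx, hy))                    # recovers the landmarks
```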
69. Learning and Recognizing Archeological Features from LiDAR Data [PDF] 返回目录
Conrad M Albrecht, Chris Fisher, Marcus Freitag, Hendrik F Hamann, Sharathchandra Pankanti, Florencia Pezzutti, Francesca Rossi
Abstract: We present a remote sensing pipeline that processes LiDAR (Light Detection And Ranging) data through machine and deep learning for archeological feature detection on big geo-spatial data platforms such as IBM PAIRS Geoscope. Today, archeologists are overwhelmed by the task of visually surveying huge amounts of (raw) LiDAR data in order to identify areas of interest for inspection on the ground. We showcase a software system pipeline that yields significant savings in expert productivity while missing only a small fraction of the artifacts. Our work employs artificial neural networks in conjunction with an efficient spatial segmentation procedure based on domain knowledge. Data processing is constrained by a limited amount of training labels and by noisy LiDAR signals due to vegetation cover and the decay of ancient structures. We aim at identifying geo-spatial areas with archeological artifacts in a supervised fashion, allowing the domain expert to flexibly tune parameters based on her needs.
70. Deeply Aligned Adaptation for Cross-domain Object Detection [PDF] 返回目录
Minghao Fu, Zhenshan Xie, Wen Li, Lixin Duan
Abstract: Cross-domain object detection has recently attracted more and more attention for real-world applications, since it helps build robust detectors adapting well to new environments. In this work, we propose an end-to-end solution based on Faster R-CNN, where ground-truth annotations are available for source images (e.g., cartoon) but not for target ones (e.g., watercolor) during training. Motivated by the observation that the transferabilities of different neural network layers differ from each other, we propose to apply a number of domain alignment strategies to different layers of Faster R-CNN, where the alignment strength is gradually reduced from low to higher layers. Moreover, after obtaining region proposals in our network, we develop a foreground-background aware alignment module to further reduce the domain mismatch by separately aligning features of the foreground and background regions from the source and target domains. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed approach.
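Layer-wise alignment with decreasing strength can be sketched with gradient reversal layers, a standard tool for adversarial domain alignment. The per-layer weights and the domain classifier below are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, -lambda * grad on
    the backward pass, so the feature extractor is trained to confuse the
    domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class LayerAligner(nn.Module):
    """A per-layer domain classifier behind a gradient reversal layer.
    The alignment strength `lam` is set higher for low layers and lower
    for high layers, mirroring the decreasing-strength schedule described
    in the abstract; the classifier itself is an illustrative stub."""
    def __init__(self, channels, lam):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 2))
    def forward(self, feat):
        return self.classifier(GradReverse.apply(feat, self.lam))

# Stronger alignment on earlier (more transferable) layers:
aligners = nn.ModuleList([
    LayerAligner(256, lam=1.0),    # e.g. conv3 features
    LayerAligner(512, lam=0.5),    # e.g. conv4 features
    LayerAligner(1024, lam=0.1),   # e.g. conv5 features
])
domain_logits = aligners[0](torch.randn(4, 256, 38, 50))
print(domain_logits.shape)         # torch.Size([4, 2])
```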
71. Any-Shot Sequential Anomaly Detection in Surveillance Videos [PDF] 返回目录
Keval Doshi, Yasin Yilmaz
Abstract: Anomaly detection in surveillance videos has been recently gaining attention. Even though the performance of state-of-the-art methods on publicly available data sets has been competitive, they demand a massive amount of training data. Also, they lack a concrete approach for continuously updating the trained model once new data is available. Furthermore, online decision making is an important but mostly neglected factor in this domain. Motivated by these research gaps, we propose an online anomaly detection method for surveillance videos using transfer learning and any-shot learning, which in turn significantly reduces the training complexity and provides a mechanism that can detect anomalies using only a few labeled nominal examples. Our proposed algorithm leverages the feature extraction power of neural network-based models for transfer learning and the any-shot learning capability of statistical detection methods.
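The statistical, any-shot side of the pipeline can be illustrated with a k-nearest-neighbor score over features from a pretrained extractor. A toy sketch with random stand-in features; the paper's actual detector and feature backbone may differ:

```python
import numpy as np

def knn_anomaly_scores(train_feats, test_feats, k=5):
    """Any-shot statistical detector sketch: score each test feature by
    its mean distance to the k nearest labeled nominal examples. In the
    paper's pipeline the features would come from a pretrained CNN via
    transfer learning; here they are random stand-ins."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, :k]
    return knn.mean(axis=1)                 # higher = more anomalous

rng = np.random.default_rng(0)
nominal = rng.normal(0, 1, size=(40, 128))       # a few labeled nominal clips
normal_test = rng.normal(0, 1, size=(5, 128))
anomalous_test = rng.normal(4, 1, size=(5, 128))
scores = knn_anomaly_scores(nominal, np.vstack([normal_test, anomalous_test]))
print(scores.round(2))   # the last five scores should be clearly larger
```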
72. ReADS: A Rectified Attentional Double Supervised Network for Scene Text Recognition [PDF] 返回目录
Qi Song, Qianyi Jiang, Nan Li, Rui Zhang, Xiaolin Wei
Abstract: In recent years, scene text recognition has commonly been cast as a sequence-to-sequence problem. Connectionist Temporal Classification (CTC) and attentional sequence recognition (Attn) are two prevailing approaches to this problem, though each can fail in certain scenarios. CTC concentrates on individual characters but is weak at modeling semantic dependencies in text. Attn-based methods model contextual semantics better but tend to overfit on limited training data. In this paper, we elaborately design a Rectified Attentional Double Supervised Network (ReADS) for general scene text recognition. To overcome the weaknesses of CTC and Attn, both are applied in our method, but with different modules in two supervised branches that complement each other. Moreover, effective spatial and channel attention mechanisms are introduced to suppress background noise and extract valid foreground information. Finally, a simple rectification network is applied to rectify irregular text. ReADS can be trained end-to-end, requiring only word-level annotations. Extensive experiments on various benchmarks verify the effectiveness of ReADS, which achieves state-of-the-art performance.
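The double supervision reduces to summing a CTC loss on one branch with a cross-entropy loss on the attentional branch. A minimal sketch, with the branch weighting and padding conventions as assumptions:

```python
import torch
import torch.nn.functional as F

def double_supervised_loss(ctc_logits, attn_logits, targets, target_lens,
                           input_lens, blank=0, alpha=1.0):
    """Sketch of two-branch supervision: a CTC loss on one branch plus a
    cross-entropy loss on the attentional decoder's outputs. The weight
    `alpha` and the padding conventions are assumptions, not the paper's."""
    # CTC branch: logits are (T, B, num_classes)
    log_probs = F.log_softmax(ctc_logits, dim=2)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=blank)
    # Attention branch: logits are (B, L, num_classes), targets (B, L)
    attn = F.cross_entropy(attn_logits.flatten(0, 1), targets.flatten())
    return ctc + alpha * attn

T, B, L, C = 20, 2, 5, 40
loss = double_supervised_loss(
    ctc_logits=torch.randn(T, B, C),
    attn_logits=torch.randn(B, L, C),
    targets=torch.randint(1, C, (B, L)),       # class 0 reserved for blank
    target_lens=torch.full((B,), L, dtype=torch.long),
    input_lens=torch.full((B,), T, dtype=torch.long))
print(loss)
```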
73. gDLS*: Generalized Pose-and-Scale Estimation Given Scale and Gravity Priors [PDF] 返回目录
Victor Fragoso, Joseph DeGol, Gang Hua
Abstract: Many real-world applications in augmented reality (AR), 3D mapping, and robotics require both fast and accurate estimation of camera poses and scales from multiple images captured by multiple cameras or a single moving camera. Achieving high speed and maintaining high accuracy in a pose-and-scale estimator are often conflicting goals. To simultaneously achieve both, we exploit a priori knowledge about the solution space. We present gDLS*, a generalized-camera-model pose-and-scale estimator that utilizes rotation and scale priors. gDLS* allows an application to flexibly weigh the contribution of each prior, which is important since priors often come from noisy sensors. Compared to state-of-the-art generalized-pose-and-scale estimators (e.g., gDLS), our experiments on both synthetic and real data consistently demonstrate that gDLS* accelerates the estimation process and improves scale and pose accuracy.
74. ObjectNet Dataset: Reanalysis and Correction [PDF] 返回目录
Ali Borji
Abstract: Recently, Barbu et al. introduced a dataset called ObjectNet which includes objects in daily life situations. They showed a dramatic performance drop of state-of-the-art object recognition models on this dataset. Given the importance and implications of their results for the generalization ability of deep models, we take a second look at their findings. We highlight a major problem with their work: object recognizers are applied to scenes containing multiple objects rather than to isolated objects. The latter yields around a 20-30% performance gain using our code. Compared with the results reported in the ObjectNet paper, we observe that around 10-15% of the performance loss can be recovered, without any test-time data augmentation. In accordance with Barbu et al.'s conclusions, however, we also conclude that deep models suffer drastically on this dataset. Thus, we believe that ObjectNet remains a challenging dataset for testing the generalization power of models beyond the datasets on which they have been trained.
75. FAIRS -- Soft Focus Generator and Attention for Robust Object Segmentation from Extreme Points [PDF] 返回目录
Ahmed H. Shahin, Prateek Munjal, Ling Shao, Shadab Khan
Abstract: Semantic segmentation from user inputs has been actively studied to facilitate interactive segmentation for data annotation and other applications. Recent studies have shown that extreme points can be effectively used to encode user inputs. A heat map generated from the extreme points can be appended to the RGB image and input to the model for training. In this study, we present FAIRS -- a new approach to generate object segmentation from user inputs in the form of extreme points and corrective clicks. We propose a novel approach for effectively encoding the user input from extreme points and corrective clicks, in a novel and scalable manner that allows the network to work with a variable number of clicks, including corrective clicks for output refinement. We also integrate a dual attention module with our approach to increase the efficacy of the model in preferentially attending to the objects. We demonstrate that these additions help achieve significant improvements over state-of-the-art in dense object segmentation from user inputs, on multiple large-scale datasets. Through experiments, we demonstrate our method's ability to generate high-quality training data as well as its scalability in incorporating extreme points, guiding clicks, and corrective clicks in a principled manner.
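Encoding user clicks as an extra input channel is the standard mechanic behind such interactive methods. Below is a sketch using a plain Gaussian heatmap channel; the paper's "soft focus" encoding differs in its details:

```python
import numpy as np

def extreme_points_heatmap(points, height, width, sigma=10.0):
    """Encode user clicks (e.g., an object's four extreme points) as a
    single heatmap channel to concatenate with the RGB image, giving a
    4-channel network input. Gaussian encoding is one common choice; the
    paper proposes its own 'soft focus' variant."""
    yy, xx = np.mgrid[0:height, 0:width].astype(np.float32)
    hm = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        g = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)                 # overlay all clicks
    return hm

rgb = np.random.rand(256, 256, 3).astype(np.float32)
clicks = [(40, 128), (210, 128), (128, 30), (128, 220)]  # left/right/top/bottom
hm = extreme_points_heatmap(clicks, 256, 256)
net_input = np.concatenate([rgb, hm[..., None]], axis=2)
print(net_input.shape)                          # (256, 256, 4)
```

Corrective clicks can be folded into the same channel (or an additional one) using the same encoding, which is what lets the network accept a variable number of clicks.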
76. Optimization of Image Embeddings for Few Shot Learning [PDF] 返回目录
Arvind Srinivasan, Aprameya Bharadwaj, Manasa Sathyan, S Natarajan
Abstract: In this paper we improve the image embeddings generated by the graph neural network solution for few-shot learning. We propose alternate architectures for existing networks such as Inception-Net, U-Net, Attention U-Net, and Squeeze-Net to generate embeddings and increase the accuracy of the models. We improve the quality of the generated embeddings at the cost of the time taken to generate them. The proposed implementations outperform existing state-of-the-art methods for 1-shot and 5-shot learning on the Omniglot dataset. The experiments involved a test set and a training set with no classes in common. The results for 5-way and 10-way/20-way tests have been tabulated.
77. It Is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction [PDF] 返回目录
Karttikeya Mangalam, Harshayu Girase, Shreyas Agarwal, Kuan-Hui Lee, Ehsan Adeli, Jitendra Malik, Adrien Gaidon
Abstract: Human trajectory forecasting with multiple socially interacting agents is of critical importance for autonomous navigation in human environments, e.g., for self-driving cars and social robots. In this work, we present Predicted Endpoint Conditioned Network (PECNet) for flexible human trajectory prediction. PECNet infers distant trajectory endpoints to assist in long-range multi-modal trajectory prediction. A novel non-local social pooling layer enables PECNet to infer diverse yet socially compliant trajectories. Additionally, we present a simple "truncation-trick" for improving few-shot multi-modal trajectory prediction performance. We show that PECNet improves state-of-the-art performance on the Stanford Drone trajectory prediction benchmark by ~19.5% and on the ETH/UCY benchmark by ~40.8%.
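The "truncation trick" borrowed from GAN sampling can be sketched directly: draw latent codes from a truncated normal before decoding candidate endpoints. All interfaces below (`decoder`, `context`, dimensions) are hypothetical, not PECNet's:

```python
import torch

def sample_endpoints(decoder, context, k=20, trunc=1.5, latent_dim=16):
    """Sketch of a truncation trick for few-shot multi-modal prediction:
    sample latent codes from a truncated normal (resampling values outside
    +/- trunc), then decode k candidate endpoints. Truncation trades
    sample diversity for sample quality."""
    z = torch.randn(k, latent_dim)
    while (z.abs() > trunc).any():             # simple rejection resampling
        bad = z.abs() > trunc
        z[bad] = torch.randn(int(bad.sum()))
    return decoder(torch.cat([context.expand(k, -1), z], dim=1))

decoder = torch.nn.Linear(32 + 16, 2)          # stand-in endpoint decoder
endpoints = sample_endpoints(decoder, torch.randn(1, 32))
print(endpoints.shape)                         # torch.Size([20, 2])
```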
78. SimAug: Learning Robust Representations from 3D Simulation for Pedestrian Trajectory Prediction in Unseen Cameras [PDF] 返回目录
Junwei Liang, Lu Jiang, Alexander Hauptmann
Abstract: This paper focuses on the problem of predicting future trajectories of people in unseen scenarios and camera views. We propose a method to efficiently utilize multi-view 3D simulation data for training. Our approach finds the hardest camera view to mix up with adversarial data from the original camera view in training, thus enabling the model to learn robust representations that can generalize to unseen camera views. We refer to our method as SimAug. We show that SimAug achieves best results on three out-of-domain real-world benchmarks, as well as getting state-of-the-art in the Stanford Drone and the VIRAT/ActEV dataset with in-domain training data. We will release our models and code.
79. Fine grained classification for multi-source land cover mapping [PDF] 返回目录
Yawogan Jean Eudes Gbodjo, Dino Ienco, Louise Leroux, Roberto Interdonato, Raffaelle Gaetano
Abstract: Nowadays, there is general agreement on the need to better characterize agricultural monitoring systems in response to global changes. Timely and accurate land use/land cover mapping can support this vision by providing useful information at fine scale. Here, a deep learning approach is proposed to deal with multi-source land cover mapping at the object level. The approach is based on an extension of a Recurrent Neural Network enriched via an attention mechanism dedicated to the multi-temporal data context. Moreover, a new hierarchical pretraining strategy designed to exploit specific domain knowledge, available through the hierarchical relationships among land cover classes, is introduced. Experiments carried out on Reunion Island, a French overseas department, demonstrate the significance of the proposal compared to standard remote sensing approaches for land cover mapping.
80. Neural Architecture Search for Lightweight Non-Local Networks [PDF] 返回目录
Yingwei Li, Xiaojie Jin, Jieru Mei, Xiaochen Lian, Linjie Yang, Cihang Xie, Qihang Yu, Yuyin Zhou, Song Bai, Alan Yuille
Abstract: Non-Local (NL) blocks have been widely studied in various vision tasks. However, it has been rarely explored to embed the NL blocks in mobile neural networks, mainly due to the following challenges: 1) NL blocks generally have heavy computation cost which makes it difficult to be applied in applications where computational resources are limited, and 2) it is an open problem to discover an optimal configuration to embed NL blocks into mobile neural networks. We propose AutoNL to overcome the above two obstacles. Firstly, we propose a Lightweight Non-Local (LightNL) block by squeezing the transformation operations and incorporating compact features. With the novel design choices, the proposed LightNL block is 400x computationally cheaper than its conventional counterpart without sacrificing the performance. Secondly, by relaxing the structure of the LightNL block to be differentiable during training, we propose an efficient neural architecture search algorithm to learn an optimal configuration of LightNL blocks in an end-to-end manner. Notably, using only 32 GPU hours, the searched AutoNL model achieves 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs), significantly outperforming previous mobile models including MobileNetV2 (+5.7%), FBNet (+2.8%) and MnasNet (+2.1%). Code and models are available at this https URL.
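To make "squeezing the transformation operations" concrete, the sketch below drops the usual theta/phi 1x1 projections of a non-local block and downsamples the keys and values. It illustrates the cost-cutting direction only, not the searched AutoNL configuration:

```python
import torch
import torch.nn as nn

class LightNLBlock(nn.Module):
    """A non-local block with the usual theta/phi 1x1 transforms squeezed
    away: the affinity matrix is computed directly on (downsampled) input
    features, leaving a single output projection. A rough sketch of the
    'cheaper NL' idea; the paper's exact block differs."""
    def __init__(self, channels, pool=2):
        super().__init__()
        self.pool = nn.AvgPool2d(pool)         # shrink the key/value set
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = x.flatten(2)                       # (B, C, HW)
        kv = self.pool(x).flatten(2)           # (B, C, hw), hw < HW
        attn = torch.softmax(q.transpose(1, 2) @ kv / c ** 0.5, dim=2)
        y = (attn @ kv.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return x + self.out(y)                 # residual connection

blk = LightNLBlock(32)
print(blk(torch.randn(1, 32, 28, 28)).shape)   # torch.Size([1, 32, 28, 28])
```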
81. Cross-domain Face Presentation Attack Detection via Multi-domain Disentangled Representation Learning [PDF] 返回目录
Guoqing Wang, Hu Han, Shiguang Shan, Xilin Chen
Abstract: Face presentation attack detection (PAD) is an urgent problem to be solved in face recognition systems. Conventional approaches usually assume that testing and training are within the same domain; as a result, they may not generalize well to unseen scenarios, because the representations learned for PAD may overfit to the subjects in the training set. In light of this, we propose an efficient disentangled representation learning approach for cross-domain face PAD. Our approach consists of disentangled representation learning (DR-Net) and multi-domain learning (MD-Net). DR-Net learns a pair of encoders via generative models that can disentangle PAD-informative features from subject-discriminative features. The disentangled features from different domains are fed to MD-Net, which learns domain-independent features for the final cross-domain face PAD task. Extensive experiments on several public datasets validate the effectiveness of the proposed approach for cross-domain PAD.
82. Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild [PDF] 返回目录
Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael Bronstein, Stefanos Zafeiriou
Abstract: We introduce a simple and effective network architecture for monocular 3D hand pose estimation consisting of an image encoder followed by a mesh convolutional decoder that is trained through a direct 3D hand mesh reconstruction loss. We train our network by gathering a large-scale dataset of hand action in YouTube videos and use it as a source of weak supervision. Our weakly-supervised mesh-convolutions-based system largely outperforms state-of-the-art methods, even halving the errors on the in-the-wild benchmark. The dataset and additional resources are available at this https URL.
83. Optical Flow in Dense Foggy Scenes using Semi-Supervised Learning [PDF] 返回目录
Wending Yan, Aashish Sharma, Robby T. Tan
Abstract: In dense foggy scenes, existing optical flow methods are erroneous. This is due to the degradation caused by dense fog particles that break the optical flow basic assumptions such as brightness and gradient constancy. To address the problem, we introduce a semi-supervised deep learning technique that employs real fog images without optical flow ground-truths in the training process. Our network integrates the domain transformation and optical flow networks in one framework. Initially, given a pair of synthetic fog images, its corresponding clean images and optical flow ground-truths, in one training batch we train our network in a supervised manner. Subsequently, given a pair of real fog images and a pair of clean images that are not corresponding to each other (unpaired), in the next training batch, we train our network in an unsupervised manner. We then alternate the training of synthetic and real data iteratively. We use real data without ground-truths, since to have ground-truths in such conditions is intractable, and also to avoid the overfitting problem of synthetic data training, where the knowledge learned on synthetic data cannot be generalized to real data testing. Together with the network architecture design, we propose a new training strategy that combines supervised synthetic-data training and unsupervised real-data training. Experimental results show that our method is effective and outperforms the state-of-the-art methods in estimating optical flow in dense foggy scenes.
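The alternating schedule is simple to write down. A training-loop sketch, with the loaders and loss terms as placeholders for the paper's supervised (synthetic fog with flow ground truth) and unsupervised (real fog, no labels) objectives:

```python
from itertools import cycle

def train_alternating(net, sup_loader, unsup_loader, sup_loss_fn,
                      unsup_loss_fn, opt, steps=1000):
    """Sketch of the alternation: one supervised batch (synthetic fog
    pairs with optical flow ground truth), then one unsupervised batch
    (real fog images without labels, trained through e.g. consistency
    losses), repeated. All loaders and loss functions are placeholders
    standing in for the paper's actual terms."""
    sup_iter, unsup_iter = cycle(sup_loader), cycle(unsup_loader)
    for step in range(steps):
        if step % 2 == 0:
            frames, flow_gt = next(sup_iter)
            loss = sup_loss_fn(net(frames), flow_gt)       # supervised step
        else:
            real_frames, clean_frames = next(unsup_iter)
            loss = unsup_loss_fn(net, real_frames, clean_frames)  # unsup. step
        opt.zero_grad()
        loss.backward()
        opt.step()
```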
84. Understanding (Non-)Robust Feature Disentanglement and the Relationship Between Low- and High-Dimensional Adversarial Attacks [PDF] 返回目录
Zuowen Wang, Leo Horne
Abstract: Recent work has put forth the hypothesis that adversarial vulnerabilities in neural networks are due to them overusing "non-robust features" inherent in the training data. We show empirically that for PGD attacks, there is a training stage where neural networks start heavily relying on non-robust features to boost natural accuracy. We also propose a mechanism that reduces vulnerability to PGD-style attacks by mixing a certain amount of images containing mostly "robust features" into each training batch, and then show that robust accuracy is improved while natural accuracy is not substantially hurt. We show that training on "robust features" provides boosts in robust accuracy across various architectures and for different attacks. Finally, we demonstrate empirically that these "robust features" do not induce spatial invariance.
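The batch-mixing mechanism can be sketched in a few lines. The mixing fraction below is an arbitrary placeholder, not the paper's setting:

```python
import torch

def mix_in_robust(batch, robust_pool, fraction=0.25):
    """Sketch of the mixing mechanism: replace a fraction of each training
    batch with images drawn from a pool of images containing mostly
    'robust features' (e.g., a precomputed robustified dataset). The 25%
    default is illustrative only."""
    n = int(len(batch) * fraction)
    mixed = batch.clone()
    pick = torch.randperm(len(robust_pool))[:n]
    mixed[:n] = robust_pool[pick]
    return mixed

batch = torch.randn(32, 3, 32, 32)          # ordinary training images
robust_pool = torch.randn(500, 3, 32, 32)   # stand-in robust-feature images
print(mix_in_robust(batch, robust_pool).shape)  # torch.Size([32, 3, 32, 32])
```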
85. A Simple Baseline for Multi-Object Tracking [PDF] 返回目录
Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, Wenyu Liu
Abstract: There has been remarkable progress on object detection and re-identification in recent years, which are the core components of multi-object tracking. However, little attention has been paid to accomplishing the two tasks in a single network to improve inference speed. The initial attempts along this path ended up with degraded results, mainly because the re-identification branch is not appropriately learned. In this work, we study the essential reasons behind the failure and accordingly present a simple baseline to address the problems. It remarkably outperforms the state of the art on the public datasets at 30 fps. We hope this baseline can inspire and help evaluate new ideas in this field. The code and the pre-trained models will be released. Code is available at this https URL.
86. Deblurring by Realistic Blurring [PDF] 返回目录
Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Bjorn Stenger, Wei Liu, Hongdong Li
Abstract: Existing deep learning methods for image deblurring typically train models using pairs of sharp images and their blurred counterparts. However, synthetically blurred images do not necessarily model the genuine blurring process of real-world scenarios with sufficient accuracy. To address this problem, we propose a new method which combines two GAN models, i.e., a learning-to-Blur GAN (BGAN) and a learning-to-DeBlur GAN (DBGAN), in order to learn a better model for image deblurring by primarily learning how to blur images. The first model, BGAN, learns how to blur sharp images using unpaired sharp and blurry image sets, and then guides the second model, DBGAN, to learn how to correctly deblur such images. In order to reduce the discrepancy between real and synthesized blur, a relativistic blur loss is leveraged. As an additional contribution, this paper also introduces a Real-World Blurred Image (RWBI) dataset including diverse blurry images. Our experiments show that the proposed method achieves consistently superior quantitative performance as well as higher perceptual quality on both the newly proposed dataset and the public GOPRO dataset.
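A relativistic loss scores real images as more realistic than fakes on average, rather than absolutely real. Below is a sketch of the averaged (RaGAN-style) variant; whether the paper uses exactly this form is an assumption, but the relativistic idea is the point:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    """Relativistic average discriminator loss: real samples should score
    higher than the mean fake score, and vice versa."""
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    return (F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
            + F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel)))

def relativistic_g_loss(d_real, d_fake):
    """Generator side: the same relative scores with the targets flipped."""
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    return (F.binary_cross_entropy_with_logits(real_rel, torch.zeros_like(real_rel))
            + F.binary_cross_entropy_with_logits(fake_rel, torch.ones_like(fake_rel)))

d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)   # discriminator logits
print(relativistic_d_loss(d_real, d_fake), relativistic_g_loss(d_real, d_fake))
```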
87. Pixel Consensus Voting for Panoptic Segmentation [PDF] 返回目录
Haochen Wang, Ruotian Luo, Michael Maire, Greg Shakhnarovich
Abstract: The core of our approach, Pixel Consensus Voting, is a framework for instance segmentation based on the Generalized Hough transform. Pixels cast discretized, probabilistic votes for the likely regions that contain instance centroids. At the detected peaks that emerge in the voting heatmap, backprojection is applied to collect pixels and produce instance masks. Unlike a sliding window detector that densely enumerates object proposals, our method detects instances as a result of the consensus among pixel-wise votes. We implement vote aggregation and backprojection using native operators of a convolutional neural network. The discretization of centroid voting reduces the training of instance segmentation to pixel labeling, analogous and complementary to FCN-style semantic segmentation, leading to an efficient and unified architecture that jointly models things and stuff. We demonstrate the effectiveness of our pipeline on COCO and Cityscapes Panoptic Segmentation and obtain competitive results. Code will be open-sourced.
88. Multi-Variate Temporal GAN for Large Scale Video Generation [PDF] 返回目录
Andres Muñoz, Mohammadreza Zolfaghari, Max Argus, Thomas Brox
Abstract: In this paper, we present a network architecture for video generation that models spatio-temporal consistency without resorting to costly 3D architectures. In particular, we elaborate on the components of noise generation, sequence generation, and frame generation. The architecture facilitates the information exchange between neighboring time points, which improves the temporal consistency of the generated frames both at the structural level and the detailed level. The approach achieves state-of-the-art quantitative performance, as measured by the inception score, on the UCF-101 dataset, which is in line with a qualitative inspection of the generated videos. We also introduce a new quantitative measure that uses downstream tasks for evaluation.
89. Group Based Deep Shared Feature Learning for Fine-grained Image Classification [PDF] 返回目录
Xuelu Li, Vishal Monga
Abstract: Fine-grained image classification has emerged as a significant challenge because objects in such images have small inter-class visual differences but large variations in pose, lighting, and viewpoints, etc. Most existing work focuses on highly customized feature extraction via deep network architectures which have been shown to deliver state of the art performance. Given that images from distinct classes in fine-grained classification share significant features of interest, we present a new deep network architecture that explicitly models shared features and removes their effect to achieve enhanced classification results. Our modeling of shared features is based on a new group based learning wherein existing classes are divided into groups and multiple shared feature patterns are discovered (learned). We call this framework Group based deep Shared Feature Learning (GSFL) and the resulting learned network GSFL-Net. Specifically, the proposed GSFL-Net develops a specially designed autoencoder which is constrained by a newly proposed Feature Expression Loss to decompose a set of features into their constituent shared and discriminative components. During inference, only the discriminative feature component is used to accomplish the classification task. A key benefit of our specialized autoencoder is that it is versatile and can be combined with state-of-the-art fine-grained feature extraction models and trained together with them to improve their performance directly. Experiments on benchmark datasets show that GSFL-Net can enhance classification accuracy over the state of the art with a more interpretable architecture.
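A rough PyTorch sketch of the shared/discriminative decomposition idea follows. The module names, the linear encoders, and the simple "shared-feature agreement" surrogate term are assumptions for illustration; the paper's Feature Expression Loss and group structure are more elaborate.

```python
# Sketch: decompose a backbone feature f into a shared part s and a
# discriminative part d, reconstruct f = s + d, and classify on d alone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDiscriminativeAE(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.enc_shared = nn.Linear(feat_dim, feat_dim)
        self.enc_disc = nn.Linear(feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, f: torch.Tensor):
        s, d = self.enc_shared(f), self.enc_disc(f)
        return s + d, self.classifier(d), s  # reconstruction, logits, shared part

def gsfl_style_loss(f, recon, logits, labels, s, alpha=1.0, beta=0.1):
    ce = F.cross_entropy(logits, labels)       # inference uses only d
    rec = F.mse_loss(recon, f)                 # s + d should reconstruct f
    agree = ((s - s.mean(dim=0, keepdim=True)) ** 2).mean()  # toy: shared parts agree
    return ce + alpha * rec + beta * agree
```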
90. TimeGate: Conditional Gating of Segments in Long-range Activities [PDF] 返回目录
Noureldien Hussein, Mihir Jain, Babak Ehteshami Bejnordi
Abstract: When recognizing a long-range activity, exploring the entire video is exhaustive and computationally expensive, as it can span up to a few minutes. Thus, it is of great importance to sample only the salient parts of the video. We propose TimeGate, along with a novel conditional gating module, for sampling the most representative segments from the long-range activity. TimeGate has two novelties that address the shortcomings of previous sampling methods, such as SCSampler. First, it enables a differentiable sampling of segments. Thus, TimeGate can be fitted with modern CNNs and trained end-to-end as a single and unified model. Second, the sampling is conditioned on both the segments and their context. Consequently, TimeGate is better suited for long-range activities, where the importance of a segment heavily depends on the video context. TimeGate reduces the computation of existing CNNs on three benchmarks for long-range activities: Charades, Breakfast and MultiThumos. In particular, TimeGate reduces the computation of I3D by 50% while maintaining the classification accuracy.
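One common way to make segment selection differentiable is Gumbel-Softmax gating. The sketch below conditions a keep/drop gate on each segment feature together with a video-level context vector, which mirrors the two ideas in the abstract; it is an illustrative stand-in, not the authors' exact module.

```python
# Sketch: conditional, differentiable segment gating with Gumbel-Softmax.
# The concatenated (segment, context) conditioning is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 2)   # logits for (drop, keep)

    def forward(self, seg_feats: torch.Tensor, tau: float = 1.0):
        # seg_feats: (B, T, D); context = mean over all segments of the video
        ctx = seg_feats.mean(dim=1, keepdim=True).expand_as(seg_feats)
        logits = self.score(torch.cat([seg_feats, ctx], dim=-1))
        # hard=True gives discrete keep/drop decisions with a soft gradient
        gates = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1:]  # (B, T, 1)
        return seg_feats * gates   # dropped segments contribute nothing downstream
```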
91. Google Landmarks Dataset v2 -- A Large-Scale Benchmark for Instance-Level Recognition and Retrieval [PDF] 返回目录
Tobias Weyand, Andre Araujo, Bingyi Cao, Jack Sim
Abstract: While image retrieval and instance recognition techniques are progressing rapidly, there is a need for challenging datasets to accurately measure their performance -- while posing novel challenges that are relevant for practical applications. We introduce the Google Landmarks Dataset v2 (GLDv2), a new benchmark for large-scale, fine-grained instance recognition and image retrieval in the domain of human-made and natural landmarks. GLDv2 is the largest such dataset to date by a large margin, including over 5M images and 200k distinct instance labels. Its test set consists of 118k images with ground truth annotations for both the retrieval and recognition tasks. The ground truth construction involved over 800 hours of human annotator work. Our new dataset has several challenging properties inspired by real world applications that previous datasets did not consider: An extremely long-tailed class distribution, a large fraction of out-of-domain test photos and large intra-class variability. The dataset is sourced from Wikimedia Commons, the world's largest crowdsourced collection of landmark photos. We provide baseline results for both recognition and retrieval tasks based on state-of-the-art methods as well as competitive results from a public challenge. We further demonstrate the suitability of the dataset for transfer learning by showing that image embeddings trained on it achieve competitive retrieval performance on independent datasets. The dataset images, ground-truth and metric scoring code are available at this https URL.
92. SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation [PDF] 返回目录
Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka
Abstract: LiDAR point-cloud segmentation is an important problem for many applications. For large-scale point cloud segmentation, the de facto method is to project a 3D point cloud to get a 2D LiDAR image and use convolutions to process it. Despite the similarity between regular RGB and LiDAR images, we discover that the feature distribution of LiDAR images changes drastically at different image locations. Using standard convolutions to process such LiDAR images is problematic, as convolution filters pick up local features that are only active in specific regions in the image. As a result, the capacity of the network is under-utilized and the segmentation performance decreases. To fix this, we propose Spatially-Adaptive Convolution (SAC) to adopt different filters for different locations according to the input image. SAC can be computed efficiently since it can be implemented as a series of element-wise multiplications, im2col, and standard convolution. It is a general framework such that several previous methods can be seen as special cases of SAC. Using SAC, we build SqueezeSegV3 for LiDAR point-cloud segmentation and outperform all previous published methods by at least 3.7% mIoU on the SemanticKITTI benchmark with comparable inference speed.
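In its simplest variant, SAC can be sketched as an attention map computed from the raw input that re-weights features element-wise before a standard convolution, matching the element-wise-multiplication form the abstract describes. The PyTorch layer below is a minimal illustration under the assumption that the raw input and the feature map share a channel count; the paper's variants are more general.

```python
# Sketch of a spatially-adaptive convolution layer: the effective filter varies
# by location because a per-pixel attention map (from the raw LiDAR image)
# scales features before a shared standard convolution.
import torch
import torch.nn as nn

class SACLayer(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(in_ch, in_ch, 7, padding=3),
                                  nn.Sigmoid())
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, feats: torch.Tensor, raw_input: torch.Tensor):
        a = self.attn(raw_input)        # location-dependent weights from the input
        return self.conv(feats * a)     # element-wise multiply, then standard conv
```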
93. Temporally Distributed Networks for Fast Video Segmentation [PDF] 返回目录
Ping Hu, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Stan Sclaroff, Federico Perazzi
Abstract: We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore, at each time step, we only need to perform a lightweight computation to extract a sub-features group from a single sub-network. The full features used for segmentation are then recomposed by application of a novel attention propagation module that compensates for geometry deformation between frames. A grouped knowledge distillation loss is also introduced to further improve the representation power at both full and sub-feature levels. Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
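A minimal sketch of the temporal distribution idea: m shallow sub-networks run round-robin over frames, and the last m sub-feature maps are recomposed into one full feature. Channel concatenation below stands in for the paper's attention propagation module, which also compensates for motion between frames; all names are illustrative.

```python
# Sketch: each frame pays for only one shallow sub-network pass; the full
# feature is recomposed from a rolling buffer of the last m sub-features.
import torch
import torch.nn as nn

class TDFeatures(nn.Module):
    def __init__(self, sub_nets: nn.ModuleList):
        super().__init__()
        self.sub_nets = sub_nets   # m lightweight feature extractors
        self.buffer = []           # sub-features from the last m frames

    def forward(self, frame: torch.Tensor, t: int) -> torch.Tensor:
        m = len(self.sub_nets)
        sub = self.sub_nets[t % m](frame)   # one cheap pass per time step
        if not self.buffer:
            self.buffer = [sub] * (m - 1)   # warm start on the first frame
        self.buffer = (self.buffer + [sub])[-m:]
        # Recompose the full feature; TDNet uses attention propagation here.
        return torch.cat(self.buffer, dim=1)
```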
94. Self-Supervised Viewpoint Learning From Image Collections [PDF] 返回目录
Siva Karthik Mustikovela, Varun Jampani, Shalini De Mello, Sifei Liu, Umar Iqbal, Carsten Rother, Jan Kautz
Abstract: Training deep neural networks to estimate the viewpoint of objects requires large labeled training datasets. However, manually labeling viewpoints is notoriously hard, error-prone, and time-consuming. On the other hand, it is relatively easy to mine many unlabelled images of an object category from the internet, e.g., of cars or faces. We seek to answer the research question of whether such unlabeled collections of in-the-wild images can be successfully utilized to train viewpoint estimation networks for general object categories purely via self-supervision. Self-supervision here refers to the fact that the only true supervisory signal that the network has is the input image itself. We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner with a generative network, along with symmetry and adversarial constraints to successfully supervise our viewpoint estimation network. We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains. Our work opens up further research in self-supervised viewpoint learning and serves as a robust baseline for it. We open-source our code at this https URL.
95. Privacy-Preserving Eye Videos using Rubber Sheet Model [PDF] 返回目录
Aayush K Chaudhary, Jeff B. Pelz
Abstract: Video-based eye trackers estimate gaze based on eye images/videos. As security and privacy concerns loom over technological advancements, tackling such challenges is crucial. We present a new approach to handle privacy issues in eye videos by replacing the current identifiable iris texture with a different iris template in the video capture pipeline based on the Rubber Sheet Model. We extend to image blending and median-value representations to demonstrate that videos can be manipulated without significantly degrading segmentation and pupil detection accuracy.
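The Rubber Sheet Model here refers to Daugman's normalization, which unrolls the iris annulus between the pupil and limbus boundaries onto a fixed rectangular grid; texture replacement then happens in this normalized domain. The NumPy sketch below shows the unrolling step with nearest-neighbor sampling; the circle parameters are assumed to come from an upstream segmentation step.

```python
# Sketch of rubber-sheet normalization: map the annulus between a pupil circle
# and an iris (limbus) circle onto an out_h x out_w rectangle.
import numpy as np

def rubber_sheet(img, pupil, iris, out_h=64, out_w=256):
    """img: 2-D grayscale array; pupil, iris: (cx, cy, r) circles."""
    thetas = np.linspace(0, 2 * np.pi, out_w, endpoint=False)
    radii = np.linspace(0, 1, out_h)
    # Boundary points of each ray on the pupil and iris circles.
    px = pupil[0] + pupil[2] * np.cos(thetas)
    py = pupil[1] + pupil[2] * np.sin(thetas)
    ix = iris[0] + iris[2] * np.cos(thetas)
    iy = iris[1] + iris[2] * np.sin(thetas)
    # Linearly interpolate between the two boundaries along each ray.
    xs = (1 - radii[:, None]) * px[None, :] + radii[:, None] * ix[None, :]
    ys = (1 - radii[:, None]) * py[None, :] + radii[:, None] * iy[None, :]
    return img[ys.round().astype(int).clip(0, img.shape[0] - 1),
               xs.round().astype(int).clip(0, img.shape[1] - 1)]
```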
96. Lossless Image Compression through Super-Resolution [PDF] 返回目录
Sheng Cao, Chao-Yuan Wu, Philipp Krähenbühl
Abstract: We introduce a simple and efficient lossless image compression algorithm. We store a low resolution version of an image as raw pixels, followed by several iterations of lossless super-resolution. For lossless super-resolution, we predict the probability of a high-resolution image, conditioned on the low-resolution input, and use entropy coding to compress this super-resolution operator. Super-Resolution based Compression (SReC) is able to achieve state-of-the-art compression rates with practical runtimes on large datasets. Code is available online at this https URL.
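The core accounting behind the method: a network predicts a per-pixel distribution over intensities conditioned on the lower-resolution image, and the entropy coder pays -log2 p bits for each actual high-resolution pixel. The NumPy snippet below computes that ideal code length; a real codec would drive an arithmetic coder with the same probabilities.

```python
# Sketch: ideal code length (in bits) of high-resolution pixels under a
# predicted distribution p(hr | lr). The probability tensor is a placeholder
# for the model's output at one super-resolution step.
import numpy as np

def code_length_bits(hr: np.ndarray, probs: np.ndarray) -> float:
    """hr: (H, W) integer pixels in [0, 255]; probs: (H, W, 256) per-pixel
    distributions conditioned on the low-resolution input."""
    p = np.take_along_axis(probs, hr[..., None], axis=-1).squeeze(-1)
    return float(-np.log2(np.clip(p, 1e-12, 1.0)).sum())
```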
97. Application of Structural Similarity Analysis of Visually Salient Areas and Hierarchical Clustering in the Screening of Similar Wireless Capsule Endoscopic Images [PDF] 返回目录
Rui Nie, Huan Yang, Hejuan Peng, Wenbin Luo, Weiya Fan, Jie Zhang, Jing Liao, Fang Huang, Yufeng Xiao
Abstract: Small intestinal capsule endoscopy is the mainstream method for inspecting small intestinal lesions, but a single small intestinal capsule endoscopy will produce 60,000 - 120,000 images, the majority of which are similar and have no diagnostic value. It takes 2 - 3 hours for doctors to identify lesions from these images. This is time-consuming and increases the probability of misdiagnosis and missed diagnosis, since doctors are likely to experience visual fatigue while focusing on a large number of similar images for an extended period of time. In order to solve these problems, we proposed a similar wireless capsule endoscope (WCE) image screening method based on structural similarity analysis and the hierarchical clustering of visually salient sub-image blocks. The similarity clustering of images was automatically identified by hierarchical clustering based on the hue, saturation, value (HSV) spatial color characteristics of the images, and the keyframe images were extracted based on the structural similarity of the visually salient sub-image blocks, in order to accurately identify and screen out similar small intestinal capsule endoscopic images. Subsequently, the proposed method was applied to the capsule endoscope imaging workstation. After screening out similar images in the complete data gathered by the Type I OMOM Small Intestinal Capsule Endoscope from 52 cases covering 17 common types of small intestinal lesions, we obtained a lesion recall of 100% and an average similar image reduction ratio of 76%. With similar images screened out, the average play time of the OMOM image workstation was 18 minutes, which greatly reduced the time spent by doctors viewing the images.
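A simplified sketch of the pipeline under stated assumptions: frames are described by HSV color histograms, clustered hierarchically, and one keyframe is kept per cluster. Function names and thresholds are illustrative; the paper additionally refines keyframe selection with the structural similarity of visually salient sub-image blocks rather than keeping the first frame of each cluster.

```python
# Sketch: HSV-histogram hierarchical clustering of WCE frames, one keyframe
# kept per cluster. The distance threshold is an assumed parameter.
import numpy as np
import cv2
from scipy.cluster.hierarchy import linkage, fcluster

def screen_similar(frames, dist_threshold=0.5):
    hists = []
    for f in frames:                       # f: BGR uint8 image
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1, 2], None, [8, 8, 8],
                         [0, 180, 0, 256, 0, 256])
        hists.append(cv2.normalize(h, None).flatten())
    Z = linkage(np.stack(hists), method="average")
    labels = fcluster(Z, t=dist_threshold, criterion="distance")
    keyframes = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        keyframes.append(frames[idx[0]])   # simplest pick; SSIM on salient
    return keyframes                       # sub-blocks could refine this choice
```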
98. Adaptive Partial Scanning Transmission Electron Microscopy with Reinforcement Learning [PDF] 返回目录
Jeffrey M. Ede
Abstract: Compressed sensing is applied to scanning transmission electron microscopy to decrease electron dose and scan time. However, established methods use static sampling strategies that do not adapt to samples. We have extended recurrent deterministic policy gradients to train deep LSTMs and differentiable neural computers to adaptively sample scan path segments. Recurrent agents cooperate with a convolutional generator to complete partial scans. We show that our approach outperforms established algorithms based on spiral scans, and we expect our results to be generalizable to other scan systems. Source code, pretrained models and training data are available at this https URL.
99. Investigating Image Applications Based on Spatial-Frequency Transform and Deep Learning Techniques [PDF] 返回目录
Qinkai Zheng, Han Qiu, Gerard Memmi, Isabelle Bloch
Abstract: This is the report for the PRIM project in Telecom Paris. This report is about applications based on spatial-frequency transforms and deep learning techniques, and covers two main works. The first work is an enhanced JPEG compression method based on deep learning: we propose a novel method to substantially enhance JPEG compression by transmitting less image data at the sender's end. At the receiver's end, we propose a DC recovery algorithm together with a deep residual learning framework to recover images with high quality. The second work is about adversarial example defenses based on signal processing. We propose a wavelet extension method to extend image data features, which makes it more difficult to generate adversarial examples. We further adopt wavelet denoising to reduce the influence of adversarial perturbations. With intensive experiments, we demonstrate that both works are effective in their application scenarios.
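The denoising half of the defense can be sketched with PyWavelets: soft-thresholding the detail coefficients of a 2D wavelet decomposition attenuates small, high-frequency adversarial perturbations. The threshold rule and parameters below are common defaults, not the report's exact settings.

```python
# Sketch: wavelet denoising as input purification, assuming a grayscale image
# scaled to [0, 1]. Only detail coefficients are soft-thresholded.
import numpy as np
import pywt

def wavelet_denoise(img: np.ndarray, wavelet: str = "db4", sigma: float = 0.02):
    coeffs = pywt.wavedec2(img, wavelet, level=2)
    thresh = sigma * np.sqrt(2 * np.log(img.size))   # universal threshold
    denoised = [coeffs[0]] + [
        tuple(pywt.threshold(c, thresh, mode="soft") for c in detail)
        for detail in coeffs[1:]
    ]
    return pywt.waverec2(denoised, wavelet)
```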
100. Review of Artificial Intelligence Techniques in Imaging Data Acquisition, Segmentation and Diagnosis for COVID-19 [PDF] 返回目录
Feng Shi, Jun Wang, Jun Shi, Ziyan Wu, Qian Wang, Zhenyu Tang, Kelei He, Yinghuan Shi, Dinggang Shen
Abstract: The pandemic of coronavirus disease 2019 (COVID-19) is spreading all over the world. Medical imaging such as X-ray and computed tomography (CT) plays an essential role in the global fight against COVID-19, whereas the recently emerging artificial intelligence (AI) technologies further strengthen the power of the imaging tools and help medical specialists. We hereby review the rapid responses in the community of medical imaging (empowered by AI) toward COVID-19. For example, AI-empowered image acquisition can significantly help automate the scanning procedure and also reshape the workflow with minimal contact to patients, providing the best protection to the imaging technicians. Also, AI can improve work efficiency by accurate delineation of infections in X-ray and CT images, facilitating subsequent quantification. Moreover, the computer-aided platforms help radiologists make clinical decisions, i.e., for disease diagnosis, tracking, and prognosis. In this review paper, we thus cover the entire pipeline of medical imaging and analysis techniques involved with COVID-19, including image acquisition, segmentation, diagnosis, and follow-up. We particularly focus on the integration of AI with X-ray and CT, both of which are widely used in the frontline hospitals, in order to depict the latest progress of medical imaging and radiology fighting against COVID-19.
101. Coronavirus Detection and Analysis on Chest CT with Deep Learning [PDF] 返回目录
Ophir Gozes, Maayan Frid-Adar, Nimrod Sagie, Huangqi Zhang, Wenbin Ji, Hayit Greenspan
Abstract: The outbreak of the novel coronavirus, officially declared a global pandemic, has a severe impact on our daily lives. As of this writing there are approximately 197,188 confirmed cases of which 80,881 are in "Mainland China" with 7,949 deaths, a mortality rate of 3.4%. In order to support radiologists in this overwhelming challenge, we develop a deep learning based algorithm that can detect, localize and quantify severity of COVID-19 manifestation from chest CT scans. The algorithm is comprised of a pipeline of image processing algorithms which includes lung segmentation, 2D slice classification and fine grain localization. In order to further understand the manifestations of the disease, we perform unsupervised clustering of abnormal slices. We present our results on a dataset comprised of 110 confirmed COVID-19 patients from Zhejiang province, China.
102. Vocoder-Based Speech Synthesis from Silent Videos [PDF] 返回目录
Daniel Michelsanti, Olga Slizovskaia, Gloria Haro, Emilia Gómez, Zheng-Hua Tan, Jesper Jensen
Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence determines an extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches.
103. Automatic Right Ventricle Segmentation using Multi-Label Fusion in Cardiac MRI [PDF] 返回目录
Maria A. Zuluaga, M. Jorge Cardoso, Sébastien Ourselin
Abstract: Accurate segmentation of the right ventricle (RV) is a crucial step in the assessment of the ventricular structure and function. Yet, due to its complex anatomy and motion, segmentation of the RV has not been studied as extensively as that of the left ventricle. This paper presents a fully automatic method for the segmentation of the RV in cardiac magnetic resonance images (MRI). The method uses a coarse-to-fine segmentation strategy in combination with a multi-atlas propagation segmentation framework. Based on a cross correlation metric, our method selects the best atlases for propagation, allowing the refinement of the segmentation at each iteration of the propagation. The proposed method was evaluated on 32 cardiac MRI datasets provided by the RV Segmentation Challenge in Cardiac MRI.
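A minimal NumPy sketch of one selection-and-fusion iteration under stated assumptions: the atlases are already registered to the target image, atlas selection uses normalized cross correlation, and fusion is a majority vote over the propagated binary RV masks.

```python
# Sketch: select the k atlases most similar to the target by normalized cross
# correlation, then fuse their propagated labels by majority vote.
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def fuse_labels(target, atlas_images, atlas_labels, k=5):
    scores = [ncc(target, im) for im in atlas_images]
    best = np.argsort(scores)[-k:]                     # top-k most similar atlases
    votes = np.stack([atlas_labels[i] for i in best])  # (k, H, W) binary RV masks
    return (votes.mean(axis=0) > 0.5).astype(np.uint8) # majority vote
```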
104. Dynamic Decision Boundary for One-class Classifiers applied to non-uniformly Sampled Data [PDF] 返回目录
Riccardo La Grassa, Ignazio Gallo, Nicola Landro
Abstract: A typical issue in Pattern Recognition is non-uniformly sampled data, which degrades the general performance and the capability of machine learning algorithms to make accurate predictions. Generally, the data is considered non-uniformly sampled when, in a specific area of the data space, there are not enough samples, leading to misclassification problems. This issue undermines the goal of one-class classifiers, decreasing their performance. In this paper, we propose a one-class classifier based on the minimum spanning tree with a dynamic decision boundary (OCdmst) that makes good predictions even when the data is non-uniformly sampled. To prove the effectiveness and robustness of our approach, we compare it with the most recent one-class classifiers, reaching the state of the art in most cases.
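A toy sketch of an MST-based one-class classifier in the spirit of OCdmst: build the minimum spanning tree over target-class training points and accept a test point whose distance to the training set falls below a boundary derived from the MST edge lengths. The fixed-quantile boundary and nearest-vertex distance below are simplifying stand-ins for the paper's dynamic decision boundary.

```python
# Sketch: MST one-class classification with a boundary set from a quantile of
# the tree's edge lengths. The quantile q is an assumed hyperparameter.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def fit_predict(X_train: np.ndarray, X_test: np.ndarray, q: float = 0.95):
    D = cdist(X_train, X_train)
    mst = minimum_spanning_tree(D).toarray()
    theta = np.quantile(mst[mst > 0], q)        # boundary from MST edge lengths
    d = cdist(X_test, X_train).min(axis=1)      # distance to nearest training point
    return (d <= theta).astype(int)             # 1 = target class, 0 = outlier
```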
105. Game of Learning Bloch Equation Simulations for MR Fingerprinting [PDF] 返回目录
Mingrui Yang, Yun Jiang, Dan Ma, Bhairav B. Mehta, Mark A. Griswold
Abstract: Purpose: This work proposes a novel approach to efficiently generate MR fingerprints for MR fingerprinting (MRF) problems based on the unsupervised deep learning model generative adversarial networks (GAN). Methods: The GAN model is adopted and modified for better convergence and performance, resulting in an MRF specific model named GAN-MRF. The GAN-MRF model is trained, validated, and tested using different MRF fingerprints simulated from the Bloch equations with certain MRF sequence. The performance and robustness of the model are further tested by using in vivo data collected on a 3 Tesla scanner from a healthy volunteer together with MRF dictionaries with different sizes. T1, T2 maps are generated and compared quantitatively. Results: The validation and testing curves for the GAN-MRF model show no evidence of high bias or high variance problems. The sample MRF fingerprints generated from the trained GAN-MRF model agree well with the benchmark fingerprints simulated from the Bloch equations. The in vivo T1, T2 maps generated from the GAN-MRF fingerprints are in good agreement with those generated from the Bloch simulated fingerprints, showing good performance and robustness of the proposed GAN-MRF model. Moreover, the MRF dictionary generation time is reduced from hours to sub-second for the testing dictionary. Conclusion: The GAN-MRF model enables a fast and accurate generation of the MRF fingerprints. It significantly reduces the MRF dictionary generation process and opens the door for real-time applications and sequence optimization problems.
106. CondenseUNet: A Memory-Efficient Condensely-Connected Architecture for Bi-ventricular Blood Pool and Myocardium Segmentation [PDF] 返回目录
S. M. Kamrul Hasan, Cristian A. Linte
Abstract: With the advent of Cardiac Cine Magnetic Resonance (CMR) Imaging, there has been a paradigm shift in medical technology, thanks to its capability of imaging different structures within the heart without ionizing radiation. However, it is very challenging to conduct pre-operative planning of minimally invasive cardiac procedures without accurate segmentation and identification of the left ventricle (LV), right ventricle (RV) blood-pool, and LV-myocardium. Manual segmentation of those structures, nevertheless, is time-consuming and often prone to error and biased outcomes. Hence, automatic and computationally efficient segmentation techniques are paramount. In this work, we propose a novel memory-efficient Convolutional Neural Network (CNN) architecture as a modification of both CondenseNet and DenseNet for ventricular blood-pool segmentation by introducing a bottleneck block and an upsampling path. Our experiments show that the proposed architecture runs on the Automated Cardiac Diagnosis Challenge (ACDC) dataset using half (50%) of the memory requirement of DenseNet and one-twelfth (~ 8%) of the memory requirements of U-Net, while still maintaining excellent accuracy of cardiac segmentation. We validated the framework on the ACDC dataset featuring one healthy and four pathology groups whose heart images were acquired throughout the cardiac cycle and achieved mean Dice scores of 96.78% (LV blood-pool), 93.46% (RV blood-pool) and 90.1% (LV-Myocardium). These results are promising and promote the proposed methods as a competitive tool for cardiac image segmentation and clinical parameter estimation that has the potential to provide fast and accurate results, as needed for pre-procedural planning and/or pre-operative applications.
107. Long-tail learning with attributes [PDF] 返回目录
Dvir Samuel, Yuval Atzmon, Gal Chechik
Abstract: Learning to classify images with unbalanced class distributions is challenged by two effects: it is hard to learn tail classes that have few samples, and it is hard to adapt a single model to both richly-sampled and poorly-sampled classes. To address few-shot learning of tail classes, it is useful to fuse additional information in the form of semantic attributes and to classify based on multi-modal information. Unfortunately, as we show below, unbalanced data leads to a "familiarity bias", where classifiers favor sample-rich classes. This bias and the lack of calibrated predictions make it hard to correctly fuse information from multiple modalities like vision and attributes. Here we describe DRAGON, a novel modular architecture for long-tail learning designed to address these biases and fuse multi-modal information in the face of unbalanced data. Our architecture is based on three classifiers: a vision expert, a semantic attribute expert that excels on the tail classes, and a debias-and-fuse module to combine their predictions. We present the first benchmark for long-tail learning with attributes and use it to evaluate DRAGON. DRAGON outperforms state-of-the-art long-tail learning models and Generalized Few-Shot-Learning with attributes (GFSL-a) models. DRAGON also obtains SoTA results on some existing benchmarks for single-modality GFSL.
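A minimal sketch of the debias-and-fuse idea: mix the two experts' class scores with a weight that depends on class frequency, so head classes lean on the vision expert and tail classes on the attribute expert. The fusion rule and shapes here are assumptions, not the paper's exact module:

    import torch
    import torch.nn as nn

    class DebiasAndFuse(nn.Module):
        """Mixes two experts' class scores with a per-class weight learned
        from (log) class frequency, damping the familiarity bias toward
        head classes and trusting the attribute expert more on the tail."""
        def __init__(self, class_counts: torch.Tensor):
            super().__init__()
            self.register_buffer("log_freq", class_counts.float().log()[None, :])
            self.gate = nn.Linear(1, 1)   # frequency -> mixing weight

        def forward(self, vis_logits, attr_logits):
            lam = torch.sigmoid(self.gate(self.log_freq.unsqueeze(-1))).squeeze(-1)
            return lam * vis_logits + (1.0 - lam) * attr_logits

    counts = torch.tensor([500, 120, 30, 5])          # head classes -> tail classes
    fuse = DebiasAndFuse(counts)
    out = fuse(torch.randn(8, 4), torch.randn(8, 4))  # fused scores, (8, 4)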
108. Approximate Manifold Defense Against Multiple Adversarial Perturbations [PDF] 返回目录
Jay Nandy, Wynne Hsu, Mong Li Lee
Abstract: Existing defenses against adversarial attacks are typically tailored to a specific perturbation type. Using adversarial training to defend against multiple types of perturbation requires expensive adversarial examples from different perturbation types at each training step. In contrast, manifold-based defense incorporates a generative network to project an input sample onto the clean data manifold. This approach eliminates the need to generate expensive adversarial examples while achieving robustness against multiple perturbation types. However, the success of this approach relies on whether the generative network can capture the complete clean data manifold, which remains an open problem for complex input domains. In this work, we devise an approximate manifold defense mechanism, called RBF-CNN, for image classification. Instead of capturing the complete data manifold, we use an RBF layer to learn the density of small image patches. RBF-CNN also utilizes a reconstruction layer that mitigates any minor adversarial perturbations. Further, incorporating our proposed reconstruction process into training improves the adversarial robustness of our RBF-CNN models. Experiment results on the MNIST and CIFAR-10 datasets indicate that RBF-CNN offers robustness to multiple perturbations without the need for expensive adversarial training.
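The central component is the RBF layer that models the density of small image patches. A minimal version (extract patches, compare them to learned centers under a Gaussian kernel) might look like this; the patch size and number of centers are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RBFLayer(nn.Module):
        """Responds with exp(-||patch - c_k||^2 / (2 sigma^2)) for each learned
        patch center c_k, i.e., a soft density estimate over local patches."""
        def __init__(self, num_centers: int = 64, patch: int = 3, channels: int = 1):
            super().__init__()
            self.patch = patch
            self.centers = nn.Parameter(
                torch.randn(num_centers, channels * patch * patch) * 0.1)
            self.log_sigma = nn.Parameter(torch.zeros(()))

        def forward(self, x):
            b, _, h, w = x.shape
            # Every patch as a vector: (B, L, C*p*p) with L = H*W at stride 1.
            patches = F.unfold(x, self.patch, padding=self.patch // 2).transpose(1, 2)
            centers = self.centers.unsqueeze(0).expand(b, -1, -1)
            d2 = torch.cdist(patches, centers) ** 2              # (B, L, K)
            resp = torch.exp(-d2 / (2 * self.log_sigma.exp() ** 2))
            return resp.transpose(1, 2).reshape(b, -1, h, w)     # K response maps

    feats = RBFLayer()(torch.rand(2, 1, 28, 28))   # (2, 64, 28, 28)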
109. Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System [PDF] 返回目录
Gwenaelle Cunha Sergio, Minho Lee
Abstract: Generating music with emotion similar to that of an input video is a very relevant issue nowadays. Video content creators and automatic movie directors benefit from maintaining their viewers engaged, which can be facilitated by producing novel material eliciting stronger emotions in them. Moreover, there's currently a demand for more empathetic computers to aid humans in applications such as augmenting the perception ability of visually and/or hearing impaired people. Current approaches overlook the video's emotional characteristics in the music generation step, only consider static images instead of videos, are unable to generate novel music, and require a high level of human effort and skills. In this study, we propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion from its visual features and a deep Long Short-Term Memory Recurrent Neural Network to generate its corresponding audio signals with similar emotional inkling. The former is able to appropriately model emotions due to its fuzzy properties, and the latter is able to model data with dynamic time properties well due to the availability of the previous hidden state information. The novelty of our proposed method lies in the extraction of visual emotional features in order to transform them into audio signals with corresponding emotional aspects for users. Quantitative experiments show low mean absolute errors of 0.217 and 0.255 in the Lindsey and DEAP datasets respectively, and similar global features in the spectrograms. This indicates that our model is able to appropriately perform domain transformation between visual and audio features. Based on experimental results, our model can effectively generate audio that matches the scene eliciting a similar emotion from the viewer in both datasets, and music generated by our model is also chosen more often.
110. DeepFLASH: An Efficient Network for Learning-based Medical Image Registration [PDF] 返回目录
Jian Wang, Miaomiao Zhang
Abstract: This paper presents DeepFLASH, a novel network with efficient training and inference for learning-based medical image registration. In contrast to existing approaches that learn spatial transformations from training data in the high dimensional imaging space, we develop a new registration network entirely in a low dimensional bandlimited space. This dramatically reduces the computational cost and memory footprint of an expensive training and inference. To achieve this goal, we first introduce complex-valued operations and representations of neural architectures that provide key components for learning-based registration models. We then construct an explicit loss function of transformation fields fully characterized in a bandlimited space with much fewer parameterizations. Experimental results show that our method is significantly faster than the state-of-the-art deep learning based image registration methods, while producing equally accurate alignment. We demonstrate our algorithm in two different applications of image registration: 2D synthetic data and 3D real brain magnetic resonance (MR) images. Our code is available at this https URL.
111. Feature Quantization Improves GAN Training [PDF] 返回目录
Yang Zhao, Chunyuan Li, Ping Yu, Jianfeng Gao, Changyou Chen
Abstract: The instability in GAN training has been a long-standing problem despite remarkable research efforts. We identify that instability issues stem from difficulties of performing feature matching with mini-batch statistics, due to a fragile balance between the fixed target distribution and the progressively generated distribution. In this work, we propose Feature Quantization (FQ) for the discriminator, to embed both true and fake data samples into a shared discrete space. The quantized values of FQ are constructed as an evolving dictionary, which is consistent with feature statistics of the recent distribution history. Hence, FQ implicitly enables robust feature matching in a compact space. Our method can be easily plugged into existing GAN models, with little computational overhead in training. We apply FQ to 3 representative GAN models on 9 benchmarks: BigGAN for image generation, StyleGAN for face synthesis, and U-GAT-IT for unsupervised image-to-image translation. Extensive experimental results show that the proposed FQ-GAN can improve the FID scores of baseline methods by a large margin on a variety of tasks, achieving new state-of-the-art performance.
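The quantizer can be pictured as a VQ-style codebook with an exponential-moving-average update, so the dictionary tracks recent feature statistics; the paper's exact construction may differ from this sketch:

    import torch

    class FeatureQuantizer:
        """Nearest-codeword quantization with an exponential-moving-average
        dictionary, so the codebook tracks the statistics of recently seen
        discriminator features; real and fake samples share this discrete space."""
        def __init__(self, num_codes: int = 128, dim: int = 64, decay: float = 0.99):
            self.codes = torch.randn(num_codes, dim)
            self.decay = decay

        def __call__(self, feats: torch.Tensor) -> torch.Tensor:
            idx = torch.cdist(feats, self.codes).argmin(dim=1)   # nearest codeword
            with torch.no_grad():
                # EMA update: pull each used codeword toward the mean of the
                # features assigned to it in this batch.
                for k in idx.unique():
                    mean_k = feats[idx == k].mean(dim=0)
                    self.codes[k] = (self.decay * self.codes[k]
                                     + (1 - self.decay) * mean_k)
            return self.codes[idx]                               # quantized features

    fq = FeatureQuantizer()
    q = fq(torch.randn(32, 64))   # (32, 64), rows drawn from the codebook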
112. Arbitrary Scale Super-Resolution for Brain MRI Images [PDF] 返回目录
Chuan Tan, Jin Zhu, Pietro Lio'
Abstract: Recent attempts at Super-Resolution for medical images used deep learning techniques such as Generative Adversarial Networks (GANs) to achieve perceptually realistic single image Super-Resolution. Yet, they are constrained by their inability to generalise to different scale factors. This involves high storage and energy costs as every integer scale factor involves a separate neural network. A recent paper has proposed a novel meta-learning technique that uses a Weight Prediction Network to enable Super-Resolution on arbitrary scale factors using only a single neural network. In this paper, we propose a new network that combines that technique with SRGAN, a state-of-the-art GAN-based architecture, to achieve arbitrary scale, high fidelity Super-Resolution for medical images. By using this network to perform arbitrary scale magnifications on images from the Multimodal Brain Tumor Segmentation Challenge (BraTS) dataset, we demonstrate that it is able to outperform traditional interpolation methods by up to 20$\%$ on SSIM scores whilst retaining generalisability on brain MRI images. We show that performance across scales is not compromised, and that it is able to achieve competitive results with other state-of-the-art methods such as EDSR whilst being fifty times smaller than them. Combining efficiency, performance, and generalisability, this can hopefully become a new foundation for tackling Super-Resolution on medical images.
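The borrowed ingredient is the Weight Prediction Network: a small net maps the requested scale factor to the weights of an upsampling layer, so one model serves all scales. A toy version follows; the real method conditions on more than the scalar scale, and the layer sizes are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MetaUpscale(nn.Module):
        """Predicts the weights of the final reconstruction conv from the
        requested scale factor, so one network serves arbitrary scales."""
        def __init__(self, in_ch: int = 64, out_ch: int = 3):
            super().__init__()
            self.in_ch, self.out_ch = in_ch, out_ch
            self.predict = nn.Sequential(
                nn.Linear(1, 128), nn.ReLU(),
                nn.Linear(128, out_ch * in_ch * 3 * 3),
            )

        def forward(self, feats: torch.Tensor, scale: float) -> torch.Tensor:
            s = torch.tensor([[scale]], dtype=feats.dtype, device=feats.device)
            kernel = self.predict(s).view(self.out_ch, self.in_ch, 3, 3)
            h, w = feats.shape[-2:]
            up = F.interpolate(feats, size=(int(h * scale), int(w * scale)),
                               mode="bilinear", align_corners=False)
            return F.conv2d(up, kernel, padding=1)   # scale-specific filters

    sr = MetaUpscale()
    out = sr(torch.randn(1, 64, 32, 32), scale=2.7)  # (1, 3, 86, 86)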
113. Finding Covid-19 from Chest X-rays using Deep Learning on a Small Dataset [PDF] 返回目录
Lawrence O. Hall, Rahul Paul, Dmitry B. Goldgof, Gregory M. Goldgof
Abstract: Testing for COVID-19 has been unable to keep up with the demand. Further, the false negative rate is projected to be as high as 30%, and test results can take some time to obtain. X-ray machines are widely available and provide images for diagnosis quickly. This paper explores how useful chest X-ray images can be in diagnosing COVID-19 disease. We have obtained 122 chest X-rays of COVID-19 and over 4,000 chest X-rays of viral and bacterial pneumonia. A pretrained deep convolutional neural network has been tuned on 102 COVID-19 cases and 102 other pneumonia cases in a 10-fold cross validation. All 102 COVID-19 cases were correctly classified, with 8 false positives, resulting in an AUC of 0.997. On a test set of 20 unseen COVID-19 cases, all were correctly classified, and more than 95% of 4171 other pneumonia examples were correctly classified. This study has flaws, most critically a lack of information about where in the disease process the COVID-19 cases were, and the small data set size. More COVID-19 case images will enable a better answer to the question of how useful chest X-rays can be for diagnosing COVID-19 (so please send them).
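Procedurally this is standard transfer learning: a pretrained CNN re-headed for two classes and fine-tuned under 10-fold cross-validation. A skeleton of that protocol, with placeholder data and an assumed backbone (the paper does not specify one here):

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.model_selection import StratifiedKFold
    from torchvision import models

    # Placeholder data: (N, 3, 224, 224) preprocessed chest X-rays;
    # y: 0 = other pneumonia, 1 = COVID-19 (102 cases each, as in the paper).
    X = torch.randn(204, 3, 224, 224)
    y = np.array([0] * 102 + [1] * 102)

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for fold, (tr, te) in enumerate(skf.split(np.zeros(len(y)), y)):
        net = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        net.fc = nn.Linear(net.fc.in_features, 2)   # new 2-class head
        opt = torch.optim.Adam(net.parameters(), lr=1e-4)
        net.train()
        for _ in range(5):                          # a few fine-tuning epochs
            opt.zero_grad()
            loss = nn.functional.cross_entropy(net(X[tr]), torch.as_tensor(y[tr]))
            loss.backward()
            opt.step()
        net.eval()
        with torch.no_grad():
            acc = (net(X[te]).argmax(1).numpy() == y[te]).mean()
        print(f"fold {fold}: accuracy {acc:.3f}")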
114. Quantum Medical Imaging Algorithms [PDF] 返回目录
Bobak Toussi Kiani, Agnes Villanyi, Seth Lloyd
Abstract: A central task in medical imaging is the reconstruction of an image or function from data collected by medical devices (e.g., CT, MRI, and PET scanners). We provide quantum algorithms for image reconstruction that can offer exponential speedup over classical counterparts when data is fed into the algorithm as a quantum state. Since outputs of our algorithms are stored in quantum states, individual pixels of reconstructed images may not be efficiently accessed classically; instead, we discuss various methods to extract information from quantum outputs using a variety of quantum post-processing algorithms.
115. Generating Rationales in Visual Question Answering [PDF] 返回目录
Hammad A. Ayyubi, Md. Mehrab Tanjim, Julian J. McAuley, Garrison W. Cottrell
Abstract: Despite recent advances in Visual Question Answering (VQA), it remains a challenge to determine how much success can be attributed to sound reasoning and comprehension ability. We seek to investigate this question by proposing a new task of rationale generation. Essentially, we task a VQA model with generating rationales for the answers it predicts. We use data from the Visual Commonsense Reasoning (VCR) task, as it contains ground-truth rationales along with visual questions and answers. We first investigate commonsense understanding in one of the leading VCR models, ViLBERT, by generating rationales from pretrained weights using a state-of-the-art language model, GPT-2. Next, we seek to jointly train ViLBERT with GPT-2 in an end-to-end fashion with the dual task of predicting the answer in VQA and generating rationales. We show that this kind of training injects commonsense understanding in the VQA model through quantitative and qualitative evaluation metrics.
116. Segmentation for Classification of Screening Pancreatic Neuroendocrine Tumors [PDF] 返回目录
Zhuotun Zhu, Yongyi Lu, Wei Shen, Elliot K. Fishman, Alan L. Yuille
Abstract: This work presents comprehensive results on detecting, at an early stage, pancreatic neuroendocrine tumors (PNETs), a group of endocrine tumors arising in the pancreas and the second most common type of pancreatic cancer, by checking abdominal CT scans. To the best of our knowledge, this task has not previously been studied as a computational task. To provide radiologists with tumor locations, we adopt a segmentation framework to classify CT volumes by checking whether a sufficient number of voxels is segmented as tumor. To quantitatively analyze our method, we collect and voxel-wise label a new abdominal CT dataset containing $376$ cases with both arterial and venous phases available for each case, in which $228$ cases were diagnosed with PNETs while the remaining $148$ cases are normal; to the best of our knowledge, this is currently the largest dataset for PNETs. In order to incorporate the rich knowledge of radiologists into our framework, we also annotate the dilated pancreatic duct, which is regarded as a sign of high risk for pancreatic cancer. Quantitatively, our approach outperforms state-of-the-art segmentation networks and achieves a sensitivity of $89.47\%$ at a specificity of $81.08\%$, which indicates a potential direction for achieving clinical impact in cancer diagnosis through earlier tumor detection.
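The case-level rule described above is simple once the segmenter has run: flag a scan as positive when enough voxels are labeled tumor. A sketch, with a hypothetical threshold (the paper tunes the operating point to reach the reported sensitivity and specificity):

    import numpy as np

    def classify_case(tumor_prob: np.ndarray, prob_thresh: float = 0.5,
                      min_voxels: int = 100) -> bool:
        """Screening by segmentation: a CT volume is flagged as PNET-positive
        when the number of voxels segmented as tumor exceeds a threshold."""
        tumor_mask = tumor_prob >= prob_thresh          # voxelwise decision
        return int(tumor_mask.sum()) >= min_voxels      # case-level decision

    vol = np.random.rand(64, 256, 256)                  # placeholder probability map
    print(classify_case(vol))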
117. Attention-Guided Version of 2D UNet for Automatic Brain Tumor Segmentation [PDF] 返回目录
Mehrdad Noori, Ali Bahri, Karim Mohammadi
Abstract: Gliomas are the most common and aggressive brain tumors, and in their highest grade they lead to a short life expectancy. Therefore, treatment assessment is a key stage in enhancing the quality of patients' lives. Recently, deep convolutional neural networks (DCNNs) have achieved remarkable performance in brain tumor segmentation, but the task remains difficult owing to the highly varying intensity and appearance of gliomas. Most existing methods, especially UNet-based networks, integrate low-level and high-level features in a naive way, which may confuse the model. Moreover, most approaches employ 3D architectures to benefit from the 3D contextual information of input images. These architectures have more parameters and higher computational complexity than 2D architectures. On the other hand, using 2D models forgoes the 3D contextual information of input images. To address these issues, we design a low-parameter network based on a 2D UNet in which we employ two techniques. The first is an attention mechanism, applied after the concatenation of low-level and high-level features; it prevents confusion in the model by weighting each channel adaptively. The second is Multi-View Fusion; by adopting it, we can benefit from the 3D contextual information of input images despite using a 2D model. Experimental results demonstrate that our method performs favorably against 2017 and 2018 state-of-the-art methods.
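The first technique, channel-wise attention applied right after concatenating low- and high-level features, can be sketched as a squeeze-and-excitation style gate; the paper's exact attention design may differ:

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Adaptively re-weights each channel of the concatenated
        skip + decoder features, so the model is not confused by a
        naive low/high-level merge."""
        def __init__(self, channels: int, reduction: int = 8):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                 # squeeze: global context
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
            )

        def forward(self, low, high):
            x = torch.cat([low, high], dim=1)            # naive merge ...
            return x * self.gate(x)                      # ... then per-channel gating

    att = ChannelAttention(128)
    y = att(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))  # (1, 128, 56, 56)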
118. Volumetric Attention for 3D Medical Image Segmentation and Detection [PDF] 返回目录
Xudong Wang, Shizhong Han, Yunqiang Chen, Dashan Gao, Nuno Vasconcelos
Abstract: A volumetric attention (VA) module for 3D medical image segmentation and detection is proposed. VA attention is inspired by recent advances in video processing; it enables 2.5D networks to leverage context information along the z direction, and allows the use of pretrained 2D detection models when training data is limited, as is often the case for medical applications. Its integration into Mask R-CNN is shown to enable state-of-the-art performance on the Liver Tumor Segmentation (LiTS) Challenge, outperforming the previous challenge winner by 3.9 points and achieving top performance on the LiTS leaderboard at the time of paper submission. Detection experiments on the DeepLesion dataset also show that adding VA to existing object detectors enables a sensitivity of 69.1 at 0.5 false positives per image, outperforming the best published results by 6.6 points.
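As a rough illustration of attention along z, a 2.5D network can score neighboring slices and pool them into one context feature. This is only the flavor of the idea; the paper's VA module is integrated into Mask R-CNN and is more elaborate:

    import torch
    import torch.nn as nn

    class SliceAttention(nn.Module):
        """Scores each neighboring slice and pools them along z, giving a
        2D (2.5D) network cross-slice context. Illustrative only."""
        def __init__(self, channels: int):
            super().__init__()
            self.score = nn.Conv2d(channels, 1, kernel_size=1)

        def forward(self, slices):                   # (B, Z, C, H, W)
            b, z, c, h, w = slices.shape
            s = self.score(slices.reshape(b * z, c, h, w)).reshape(b, z, 1, h, w)
            attn = torch.softmax(s, dim=1)           # attention weights along z
            return (attn * slices).sum(dim=1)        # (B, C, H, W) context map

    ctx = SliceAttention(16)(torch.randn(2, 5, 16, 32, 32))   # fuse 5 slices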
119. Empirical Evaluation of PRNU Fingerprint Variation for Mismatched Imaging Pipelines [PDF] 返回目录
Sharad Joshi, Pawel Korus, Nitin Khanna, Nasir Memon
Abstract: We assess the variability of PRNU-based camera fingerprints with mismatched imaging pipelines (e.g., different camera ISP or digital darkroom software). We show that camera fingerprints exhibit non-negligible variations in this setup, which may lead to unexpected degradation of detection statistics in real-world use-cases. We tested 13 different pipelines, including standard digital darkroom software and recent neural networks. We observed that correlation between fingerprints from mismatched pipelines drops on average to 0.38, and the PCE detection statistic drops by over 40%. The degradation in error rates is strongest for small patches commonly used in photo manipulation detection, and when neural networks are used for photo development. At a fixed 0.5% FPR setting, the TPR drops by 17 ppt (percentage points) for 128 px and 256 px patches.
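The statistic being tracked is the normalized correlation between two PRNU fingerprint estimates (the PCE detector is omitted here). A toy illustration of the reported effect, with synthetic fingerprints standing in for real sensor noise:

    import numpy as np

    def ncc(f1: np.ndarray, f2: np.ndarray) -> float:
        """Normalized cross-correlation between two fingerprint estimates."""
        a = f1 - f1.mean()
        b = f2 - f2.mean()
        return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

    # A mismatched pipeline perturbs the fingerprint, so correlation with
    # the reference drops (synthetic noise fields, not real PRNU data).
    rng = np.random.default_rng(0)
    reference = rng.standard_normal((512, 512))
    same_pipeline = reference + 0.3 * rng.standard_normal((512, 512))
    mismatched = 0.5 * reference + rng.standard_normal((512, 512))
    print(ncc(reference, same_pipeline), ncc(reference, mismatched))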
120. Theoretical Insights into the Use of Structural Similarity Index In Generative Models and Inferential Autoencoders [PDF] 返回目录
Benyamin Ghojogh, Fakhri Karray, Mark Crowley
Abstract: Generative models and inferential autoencoders mostly make use of the $\ell_2$ norm in their optimization objectives. In order to generate perceptually better images, this short paper theoretically discusses how to use the Structural Similarity Index (SSIM) in generative models and inferential autoencoders. We first review SSIM, SSIM distance metrics, and the SSIM kernel. We show that the SSIM kernel is a universal kernel and thus can be used in unconditional and conditional generated moment matching networks. Then, we explain how to use SSIM distance in variational and adversarial autoencoders and in unconditional and conditional Generative Adversarial Networks (GANs). Finally, we propose to use SSIM distance rather than the $\ell_2$ norm in least squares GAN.
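For the practical takeaway, swapping the $\ell_2$ term for an SSIM-based distance, one common choice is DSSIM = (1 - SSIM)/2; the paper derives its own SSIM distance metrics and kernel, so this sketch is only illustrative:

    import numpy as np
    from skimage.metrics import structural_similarity as ssim

    def dssim(x: np.ndarray, y: np.ndarray) -> float:
        """Structural dissimilarity: 0 for identical images, 1 at SSIM = -1.
        A perceptual stand-in for the l2 term in reconstruction objectives."""
        return (1.0 - ssim(x, y, data_range=1.0)) / 2.0

    rng = np.random.default_rng(0)
    x = rng.random((64, 64))
    print(dssim(x, x))                                   # 0.0
    print(dssim(x, np.clip(x + 0.1 * rng.standard_normal(x.shape), 0, 1)))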
121. Weighted Fisher Discriminant Analysis in the Input and Feature Spaces [PDF] 返回目录
Benyamin Ghojogh, Milad Sikaroudi, H.R. Tizhoosh, Fakhri Karray, Mark Crowley
Abstract: Fisher Discriminant Analysis (FDA) is a subspace learning method which minimizes and maximizes the intra- and inter-class scatters of data, respectively. Although in FDA all pairs of classes are treated the same way, some classes are closer to each other than others. Weighted FDA assigns weights to the pairs of classes to address this shortcoming. In this paper, we propose a cosine-weighted FDA as well as an automatically weighted FDA in which the weights are found automatically. We also propose a weighted FDA in the feature space, establishing a weighted kernel FDA for both existing and newly proposed weights. Our experiments on the ORL face recognition dataset show the effectiveness of the proposed weighting schemes.
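The weighting idea amounts to one change in standard FDA: the between-class scatter becomes a weighted sum over class pairs. A numpy sketch with cosine-similarity weights between class means, which is an assumed stand-in for the paper's cosine weighting:

    import numpy as np
    from scipy.linalg import eigh

    def weighted_fda(X: np.ndarray, y: np.ndarray, dim: int = 2) -> np.ndarray:
        """FDA with a weighted between-class scatter: class pairs are weighted
        (here by cosine similarity of their means, an illustrative choice)
        instead of all contributing equally."""
        classes = np.unique(y)
        means = np.array([X[y == c].mean(axis=0) for c in classes])
        counts = np.array([(y == c).sum() for c in classes])

        # Within-class scatter, unweighted as in standard FDA.
        Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum() for c in classes)

        # Weighted between-class scatter over class pairs.
        Sb = np.zeros((X.shape[1], X.shape[1]))
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                w = means[i] @ means[j] / (
                    np.linalg.norm(means[i]) * np.linalg.norm(means[j]))
                d = (means[i] - means[j])[:, None]
                Sb += counts[i] * counts[j] * w * (d @ d.T)

        # Generalized eigenproblem Sb v = lambda Sw v; keep top directions.
        vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(X.shape[1]))
        return vecs[:, np.argsort(vals)[::-1][:dim]]

    X, y = np.random.randn(300, 5), np.repeat([0, 1, 2], 100)
    W = weighted_fda(X, y)   # (5, 2) projection matrix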
122. TensorFI: A Flexible Fault Injection Framework for TensorFlow Applications [PDF] 返回目录
Zitao Chen, Niranjhana Narayanan, Bo Fang, Guanpeng Li, Karthik Pattabiraman, Nathan DeBardeleben
Abstract: As machine learning (ML) has seen increasing adoption in safety-critical domains (e.g., autonomous vehicles), the reliability of ML systems has also grown in importance. While prior studies have proposed techniques to enable efficient error-resilience techniques (e.g., selective instruction duplication), a fundamental requirement for realizing these techniques is a detailed understanding of the application's resilience. In this work, we present TensorFI, a high-level fault injection (FI) framework for TensorFlow-based applications. TensorFI is able to inject both hardware and software faults in general TensorFlow programs. TensorFI is a configurable FI tool that is flexible, easy to use, and portable. It can be integrated into existing TensorFlow programs to assess their resilience for different fault types (e.g., faults in particular operators). We use TensorFI to evaluate the resilience of 12 ML programs, including DNNs used in the autonomous vehicle domain. Our tool is publicly available at this https URL.
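TensorFI's own API is not shown in the abstract; the core operation it automates, flipping a bit in an intermediate tensor to emulate a transient hardware fault, is easy to illustrate in isolation:

    import numpy as np

    def inject_bitflip(t: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        """Flip one random bit of one random float32 element, emulating a
        transient hardware fault in an intermediate tensor. (Illustration of
        the idea only; TensorFI instruments TensorFlow ops directly.)"""
        out = t.astype(np.float32).copy()
        flat = out.view(np.uint32).reshape(-1)   # reinterpret bits, same buffer
        idx = rng.integers(flat.size)
        bit = rng.integers(32)
        flat[idx] ^= np.uint32(1) << np.uint32(bit)
        return out

    rng = np.random.default_rng(1)
    acts = np.ones((2, 3), dtype=np.float32)
    print(inject_bitflip(acts, rng))   # one element corrupted, possibly to a huge value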
123. Complex-Valued Convolutional Neural Networks for MRI Reconstruction [PDF] 返回目录
Elizabeth K. Cole, Joseph Y. Cheng, John M. Pauly, Shreyas S. Vasanawala
Abstract: Many real-world signal sources are complex-valued, having real and imaginary components. However, the vast majority of existing deep learning platforms and network architectures do not support the use of complex-valued data. MRI data is inherently complex-valued, so existing approaches discard the richer algebraic structure of the complex data. In this work, we investigate end-to-end complex-valued convolutional neural networks - specifically, for image reconstruction in lieu of two-channel real-valued networks. We apply this to magnetic resonance imaging reconstruction for the purpose of accelerating scan times and determine the performance of various promising complex-valued activation functions. We find that complex-valued CNNs with complex-valued convolutions provide superior reconstructions compared to real-valued convolutions with the same number of trainable parameters, over a variety of network architectures and datasets.
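The complex-valued convolution at the heart of the paper can be written with real convolutions via (a + ib)(c + id) = (ac - bd) + i(ad + bc). A minimal PyTorch version that spells the arithmetic out:

    import torch
    import torch.nn as nn

    class ComplexConv2d(nn.Module):
        """Complex convolution built from two real convolutions following
        the rule (a + ib) * (c + id) = (ac - bd) + i(ad + bc)."""
        def __init__(self, in_ch, out_ch, k=3, padding=1):
            super().__init__()
            self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=padding)  # real kernel
            self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=padding)  # imag kernel

        def forward(self, x_r, x_i):
            out_r = self.conv_r(x_r) - self.conv_i(x_i)
            out_i = self.conv_r(x_i) + self.conv_i(x_r)
            return out_r, out_i

    # MRI data split into real and imaginary channels:
    xr, xi = torch.randn(1, 2, 64, 64), torch.randn(1, 2, 64, 64)
    yr, yi = ComplexConv2d(2, 8)(xr, xi)   # (1, 8, 64, 64) each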
124. Unsupervised Domain Adaptation with Progressive Domain Augmentation [PDF] 返回目录
Kevin Hua, Yuhong Guo
Abstract: Domain adaptation aims to exploit a label-rich source domain for learning classifiers in a different, label-scarce target domain. It is particularly challenging when there are significant divergences between the two domains. In this paper, we propose a novel unsupervised domain adaptation method based on progressive domain augmentation. The proposed method generates virtual intermediate domains via domain interpolation, progressively augments the source domain, and bridges the source-target domain divergence by conducting multiple subspace alignment on the Grassmann manifold. We conduct experiments on multiple domain adaptation tasks and the results show that the proposed method achieves state-of-the-art performance.
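The alignment step has a classic closed form (subspace alignment in the style of Fernando et al.): map the source PCA basis onto the target one. The paper applies such alignment repeatedly across interpolated intermediate domains; the single-step operation looks like:

    import numpy as np

    def subspace_align(Xs: np.ndarray, Xt: np.ndarray, d: int = 10):
        """One subspace-alignment step: project source data through the source
        basis aligned onto the target basis (points on the Grassmann manifold
        of d-dimensional subspaces)."""
        # Top-d PCA bases (as columns) of source and target.
        Ps = np.linalg.svd(Xs - Xs.mean(0), full_matrices=False)[2][:d].T
        Pt = np.linalg.svd(Xt - Xt.mean(0), full_matrices=False)[2][:d].T
        M = Ps.T @ Pt                       # alignment matrix
        Zs = (Xs - Xs.mean(0)) @ Ps @ M     # aligned source features
        Zt = (Xt - Xt.mean(0)) @ Pt         # target features in its own subspace
        return Zs, Zt

    Xs, Xt = np.random.randn(200, 50), np.random.randn(150, 50) + 0.5
    Zs, Zt = subspace_align(Xs, Xt)         # both (n, 10), now comparable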