Abstracts

1. Long-term Human Motion Prediction with Scene Context
Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, Jitendra Malik
Abstract: Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment -- imagine how hard it is to navigate a new room with lights off. Existing works on predicting human motion do not pay attention to the scene context and thus struggle in long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a diverse synthetic dataset with clean annotations. In both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods.

2. Human Trajectory Forecasting in Crowds: A Deep Learning Perspective
Parth Kothari, Sven Kreiss, Alexandre Alahi
Abstract: Over the past few decades, human trajectory forecasting has been a field of active research owing to its numerous real-world applications: evacuation situation analysis, traffic operations, and deployment of social robots in crowded environments, to name a few. In this work, we cast the problem of human trajectory forecasting as learning a representation of human social interactions. Early works handcrafted this representation based on domain knowledge. However, social interactions in crowded environments are not only diverse but often subtle. Recently, deep learning methods have outperformed their handcrafted counterparts, as they learn about human-human interactions in a more generic, data-driven fashion. In this work, we present an in-depth analysis of existing deep learning based methods for modelling social interactions. Based on our analysis, we propose a simple yet powerful method for effectively capturing these social interactions. To objectively compare the performance of these interaction-based forecasting models, we develop a large-scale interaction-centric benchmark, TrajNet++, a significant yet missing component in the field of human trajectory forecasting. We propose novel performance metrics that evaluate the ability of a model to output socially acceptable trajectories. Experiments on TrajNet++ validate the need for our proposed metrics, and our method outperforms competitive baselines on both real-world and synthetic datasets.

3. Scribble-based Domain Adaptation via Co-segmentation
Reuben Dorent, Samuel Joutard, Jonathan Shapey, Sotirios Bisdas, Neil Kitchen, Robert Bradford, Shakeel Saeed, Marc Modat, Sebastien Ourselin, Tom Vercauteren
Abstract: Although deep convolutional networks have reached state-of-the-art performance in many medical image segmentation tasks, they have typically demonstrated poor generalisation capability. To be able to generalise from one domain (e.g. one imaging modality) to another, domain adaptation has to be performed. While supervised methods may lead to good performance, they require fully annotated additional data, which may not be an option in practice. In contrast, unsupervised methods do not need additional annotations but are usually unstable and hard to train. In this work, we propose a novel weakly-supervised method. Instead of requiring detailed but time-consuming annotations, scribbles on the target domain are used to perform domain adaptation. This paper introduces a new formulation of domain adaptation based on structured learning and co-segmentation. Our method is easy to train, thanks to the introduction of a regularised loss. The framework is validated on Vestibular Schwannoma segmentation (T1 to T2 scans). Our proposed method outperforms unsupervised approaches and achieves comparable performance to a fully-supervised approach.

4. Can GAN Generated Morphs Threaten Face Recognition Systems Equally as Landmark Based Morphs? -- Vulnerability and Detection
Sushma Venkatesh, Haoyu Zhang, Raghavendra Ramachandra, Kiran Raja, Naser Damer, Christoph Busch
Abstract: The primary objective of face morphing is to combine face images of different data subjects (e.g. a malicious actor and an accomplice) to generate a face image that can be equally verified against both contributing data subjects. In this paper, we propose a new framework for generating face morphs using a newer Generative Adversarial Network (GAN), StyleGAN. In contrast to earlier works, we generate realistic morphs of both high quality and high resolution (1024×1024 pixels). With the newly created morphing dataset of 2500 morphed face images, we pose a critical question in this work: can GAN generated morphs threaten Face Recognition Systems (FRS) equally as landmark based morphs? Seeking an answer, we benchmark the vulnerability of a Commercial-Off-The-Shelf (COTS) FRS and a deep learning-based FRS (ArcFace). This work also benchmarks the detection of GAN generated morphs against that of landmark based morphs using established Morphing Attack Detection (MAD) schemes.

5. SaADB: A Self-attention Guided ADB Network for Person Re-identification
Bo Jiang, Sheng Wang, Xiao Wang, Aihua Zheng
Abstract: Recently, the Batch DropBlock network (BDB) has demonstrated its effectiveness on person image representation and the re-ID task via feature erasing. However, BDB drops the features randomly, which may lead to sub-optimal results. In this paper, we propose a novel Self-attention guided Adaptive DropBlock network (SaADB) for person re-ID which can adaptively erase the most discriminative regions. Specifically, SaADB first obtains a self-attention map by channel-wise pooling and returns a drop mask by thresholding the self-attention map. Then, the input features and the self-attention guided drop mask are multiplied to generate the dropped feature maps. Meanwhile, we utilize spatial and channel attention to learn a better feature map and iteratively train with the feature dropping module for person re-ID. Experiments on several benchmark datasets demonstrate that the proposed SaADB significantly beats the prevalent competitors in person re-ID.
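
The three steps named in the SaADB abstract (channel-wise pooling, thresholding, multiplication) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `self_attention_drop`, the use of mean pooling, the `ratio` parameter, and the rule "erase positions whose attention reaches `ratio` times the peak" are all hypothetical choices, and the activations are assumed non-negative (for example, post-ReLU).

```python
def self_attention_drop(features, ratio=0.7):
    """Erase the most strongly attended spatial positions of a feature map.

    features: list of C channels, each an H x W nested list of floats.
    ratio:    assumed hyperparameter; positions whose attention is at least
              ratio * max(attention) are dropped.
    Returns (mask, dropped), where mask is H x W with 0.0 at erased positions.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])

    # Channel-wise average pooling gives one H x W self-attention map
    # (mean pooling is an assumption; the abstract does not say which pooling).
    attention = [
        [sum(features[c][i][j] for c in range(channels)) / channels
         for j in range(width)]
        for i in range(height)
    ]

    # Threshold relative to the strongest response (assumed rule).
    peak = max(max(row) for row in attention)
    threshold = ratio * peak

    # Drop mask: 0 where attention is high (most discriminative), 1 elsewhere.
    mask = [[0.0 if a >= threshold else 1.0 for a in row] for row in attention]

    # Multiply the input features by the mask, shared across all channels.
    dropped = [
        [[features[c][i][j] * mask[i][j] for j in range(width)]
         for i in range(height)]
        for c in range(channels)
    ]
    return mask, dropped
```

Compared with random dropping, the mask here depends on the input itself, so the erased block follows the regions the network currently relies on most.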

6. HKR For Handwritten Kazakh & Russian Database
Daniyar Nurseitov, Kairat Bostanbekov, Daniyar Kurmankhojayev, Anel Alimova, Abdelrahman Abdallah
Abstract: In this paper, we present a new Russian and Kazakh database (about 95% Russian and 5% Kazakh words/sentences) for offline handwriting recognition. A few pre-processing and segmentation procedures have been developed together with the database. Both languages are written in Cyrillic and share the same 33 characters. Besides these characters, the Kazakh alphabet contains 9 additional specific characters. The dataset is a collection of forms. The sources of all the forms were generated with LaTeX, and the forms were subsequently filled out by hand by different persons. The database consists of more than 1400 filled forms. There are approximately 63000 sentences and more than 650000 symbols, produced by approximately 200 different writers. It can serve researchers working on handwriting recognition with deep and machine learning.

7. Single Shot Video Object Detector
Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, Tao Mei
Abstract: Single shot detectors, which are potentially faster and simpler than two-stage detectors, tend to be more applicable to object detection in videos. Nevertheless, the extension of such object detectors from image to video is not trivial, especially when appearance deterioration exists in videos, e.g., motion blur or occlusion. A valid question is how to explore temporal coherence across frames for boosting detection. In this paper, we propose to address the problem by enhancing per-frame features through aggregation of neighboring frames. Specifically, we present Single Shot Video Object Detector (SSVD), a new architecture that integrates feature aggregation into a one-stage detector for object detection in videos. Technically, SSVD takes Feature Pyramid Network (FPN) as the backbone network to produce multi-scale features. Unlike the existing feature aggregation methods, SSVD, on one hand, estimates the motion and aggregates the nearby features along the motion path, and on the other, hallucinates features by directly sampling features from the adjacent frames in a two-stream structure. Extensive experiments are conducted on the ImageNet VID dataset, and competitive results are reported in comparison with state-of-the-art approaches. More remarkably, for 448×448 input, SSVD achieves 79.2% mAP on ImageNet VID, processing one frame in 85 ms on an Nvidia Titan X Pascal GPU. The code is available at this https URL.

8. Meta Corrupted Pixels Mining for Medical Image Segmentation
Jixin Wang, Sanping Zhou, Chaowei Fang, Le Wang, Jinjun Wang
Abstract: Deep neural networks have achieved satisfactory performance in many medical image analysis tasks. However, training a deep neural network requires a large number of samples with high-quality annotations. In medical image segmentation, it is very laborious and expensive to acquire precise pixel-level annotations. Aiming at training deep segmentation models on datasets with possibly corrupted annotations, we propose a novel Meta Corrupted Pixels Mining (MCPM) method based on a simple meta mask network. Our method automatically estimates a weighting map to evaluate the importance of every pixel in the learning of the segmentation network. The meta mask network, which takes the loss value map of the predicted segmentation results as input, is capable of identifying corrupted layers and allocating small weights to them. An alternative algorithm is adopted to train the segmentation network and the meta mask network simultaneously. Extensive experimental results on the LIDC-IDRI and LiTS datasets show that our method outperforms state-of-the-art approaches devised for coping with corrupted annotations.

9. Distance-Geometric Graph Convolutional Network (DG-GCN)
Daniel T. Chang
Abstract: The distance-geometric graph representation adopts a unified scheme (distance) for representing the geometry of three-dimensional (3D) graphs. It is invariant to rotation and translation of the graph, and it reflects pair-wise node interactions and their generally local nature. To facilitate the incorporation of geometry in deep learning on 3D graphs, we propose a message-passing graph convolutional network based on the distance-geometric graph representation: DG-GCN (distance-geometric graph convolution network). It utilizes continuous-filter convolutional layers, with filter-generating networks, that enable learning of filter weights from distances, thereby incorporating the geometry of 3D graphs in graph convolutions. Our results for the ESOL and FreeSolv datasets show major improvement over those of standard graph convolutions. They also show significant improvement over those of geometric graph convolutions employing edge weight / edge distance power laws. Our work demonstrates the utility and value of DG-GCN for end-to-end deep learning on 3D graphs.
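
To make the idea of a continuous-filter convolution concrete, the sketch below generates a per-feature filter from the distance between two nodes and uses it to weight the neighbour's features. It is a minimal illustration, not the DG-GCN architecture: the radial-basis expansion, the single linear layer used as the "filter-generating network", the `cutoff` radius, and all function names are assumptions introduced here.

```python
import math


def rbf_expand(distance, centers, gamma=10.0):
    """Expand a scalar distance into Gaussian radial-basis features."""
    return [math.exp(-gamma * (distance - c) ** 2) for c in centers]


def generate_filter(distance, centers, weights):
    """Hypothetical filter-generating network: one linear layer on RBF features.

    weights: len(centers) x feature_dim matrix (would be learned in practice).
    """
    basis = rbf_expand(distance, centers)
    dim = len(weights[0])
    return [sum(basis[k] * weights[k][f] for k in range(len(centers)))
            for f in range(dim)]


def cfconv(node_features, positions, centers, weights, cutoff=5.0):
    """One continuous-filter convolution: out_i = sum_j x_j * W(d_ij)."""
    n = len(node_features)
    dim = len(node_features[0])
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(positions[i], positions[j])
            if d > cutoff:  # interactions are treated as local
                continue
            filt = generate_filter(d, centers, weights)
            for f in range(dim):
                out[i][f] += node_features[j][f] * filt[f]
    return out
```

Because the layer sees the coordinates only through pairwise distances, rotating or translating the whole graph leaves its output unchanged, which is the invariance the abstract refers to.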

10. Hierarchical nucleation in deep neural networks
Diego Doimo, Aldo Glielmo, Alessandro Laio, Alessio Ansuini
Abstract: Deep convolutional networks (DCNs) learn meaningful representations where data that share the same abstract characteristics are positioned closer and closer. Understanding these representations and how they are generated is of unquestioned practical and theoretical interest. In this work we study the evolution of the probability density of the ImageNet dataset across the hidden layers in some state-of-the-art DCNs. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant for classification. In subsequent layers density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. Density peaks corresponding to single categories appear only close to the output and via a very sharp transition which resembles the nucleation process of a heterogeneous liquid. This process leaves a footprint in the probability density of the output layer where the topography of the peaks allows reconstructing the semantic relationships of the categories.

11. AutoAssign: Differentiable Label Assignment for Dense Object Detection
Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, Jian Sun
Abstract: In this paper, we propose an anchor-free object detector with a fully differentiable label assignment strategy, named AutoAssign. It automatically determines positive/negative samples by generating positive and negative weight maps to modify each location's prediction dynamically. Specifically, we present a center weighting module to adjust the category-specific prior distributions and a confidence weighting module to adapt the specific assignment strategy of each instance. The entire label assignment process is differentiable and requires no additional modification to transfer to different datasets and tasks. Extensive experiments on MS COCO show that our method steadily surpasses other best sampling strategies by ~1% AP with various backbones. Moreover, our best model achieves 52.1% AP, outperforming all existing one-stage detectors. Besides, experiments on other datasets, e.g., PASCAL VOC, Objects365, and WiderFace, demonstrate the broad applicability of AutoAssign.

12. Wasserstein Generative Models for Patch-based Texture Synthesis
Antoine Houdard, Arthur Leclaire, Nicolas Papadakis, Julien Rabin
Abstract: In this paper, we propose a framework to train a generative model for texture image synthesis from a single example. To do so, we exploit the local representation of images via the space of patches, that is, square sub-images of fixed size (e.g. 4×4). Our main contribution is to consider optimal transport to enforce the multiscale patch distribution of generated images, which leads to two different formulations. First, a pixel-based optimization method is proposed, relying on discrete optimal transport. We show that it is related to a well-known texture optimization framework based on iterated patch nearest-neighbor projections, while avoiding some of its shortcomings. Second, in a semi-discrete setting, we exploit the differential properties of Wasserstein distances to learn a fully convolutional network for texture generation. Once estimated, this network produces realistic and arbitrarily large texture samples in real time. The two formulations result in non-convex concave problems that can be optimized efficiently, with convergence properties and improved stability compared to adversarial approaches, without relying on any regularization. By directly dealing with the patch distribution of synthesized images, we also overcome limitations of state-of-the-art techniques, such as patch aggregation issues that usually lead to low frequency artifacts (e.g. blurring) in traditional patch-based approaches, or statistical inconsistencies (e.g. color or patterns) in learning approaches.

13. Are spoofs from latent fingerprints a real threat for the best state-of-art liveness detectors?
Roberto Casula, Giulia Orrù, Daniele Angioni, Xiaoyi Feng, Gian Luca Marcialis, Fabio Roli
Abstract: We investigated the threat level of realistic attacks using latent fingerprints against sensors equipped with state-of-the-art liveness detectors and against fingerprint verification systems which integrate such liveness algorithms. To the best of our knowledge, only one previous investigation has been done with spoofs from latent prints. In this paper, we focus on using snapshot pictures of latent fingerprints. These pictures provide molds that allow, after some digital processing, the fabrication of high-quality spoofs. Taking a snapshot picture is much simpler than developing fingerprints left on a surface with magnetic powders and lifting the trace with tape. What we are interested in here is a preliminary evaluation of the extent to which attacks of this kind can be considered a real threat for state-of-the-art fingerprint liveness detectors and verification systems. To this aim, we collected a novel data set of live and spoof images fabricated with snapshot pictures of latent fingerprints. This data set provides a set of attacks under the most favourable conditions. We refer to this method and the related data set as "ScreenSpoof". Then, we used it to test the performance of the best liveness detection algorithms, namely, the three winners of the LivDet competition. Reported results point out that the ScreenSpoof method is a threat of the same level, in terms of detection and verification errors, as attacks using spoofs fabricated with the full consent of the victim. We think that this is a notable result, never reported in previous work.

14. Re-thinking Co-Salient Object Detection
Fan Deng-Ping, Li Tengpeng, Lin Zheng, Ji Ge-Peng, Zhang Dingwen, Cheng Ming-Ming, Fu Huazhu, Shen Jianbing
Abstract: In this paper, we conduct a comprehensive study on the co-salient object detection (CoSOD) problem for images. CoSOD is an emerging and rapidly growing extension of salient object detection (SOD), which aims to detect the co-occurring salient objects in a group of images. However, existing CoSOD datasets often have a serious data bias, assuming that each group of images contains salient objects of similar visual appearances. Because of this bias, the effectiveness of models trained on existing datasets under such ideal settings can be impaired in real-life situations, where similarities are usually semantic or conceptual. To tackle this issue, we first introduce a new benchmark, called CoSOD3k in the wild, which requires a large amount of semantic context, making it more challenging than existing CoSOD datasets. Our CoSOD3k consists of 3,316 high-quality, elaborately selected images divided into 160 groups with hierarchical annotations. The images span a wide range of categories, shapes, object sizes, and backgrounds. Second, we integrate the existing SOD techniques to build a unified, trainable CoSOD framework, which is long overdue in this field. Specifically, we propose a novel CoEG-Net that augments our prior model EGNet with a co-attention projection strategy to enable fast common information learning. CoEG-Net fully leverages previous large-scale SOD datasets and significantly improves the model scalability and stability. Third, we comprehensively summarize 34 cutting-edge algorithms, benchmarking 16 of them over three challenging CoSOD datasets (iCoSeg, CoSal2015, and our CoSOD3k), and reporting more detailed (i.e., group-level) performance analysis. Finally, we discuss the challenges and future works of CoSOD. We hope that our study will give a strong boost to growth in the CoSOD community.

15. C2G-Net: Exploiting Morphological Properties for Image Classification
Laurin Herbsthofer, Barbara Prietl, Martina Tomberger, Thomas Pieber, Pablo López-García
Abstract: In this paper we propose C2G-Net, a pipeline for image classification that exploits the morphological properties of images containing a large number of similar objects, such as biological cells. C2G-Net consists of two components: (1) Cell2Grid, an image compression algorithm that identifies objects using segmentation and arranges them on a grid, and (2) DeepLNiNo, a CNN architecture with fewer than 10,000 trainable parameters aimed at facilitating model interpretability. To test the performance of C2G-Net, we used multiplex immunohistochemistry images for predicting relapse risk in colon cancer. Compared to conventional CNN architectures trained on raw images, C2G-Net achieved similar prediction accuracy while training time was reduced by 85%, and its model is easier to interpret.

16. Location Sensitive Image Retrieval and Tagging
Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas
Abstract: People from different parts of the globe describe objects and concepts in distinct manners. Visual appearance can thus vary across different geographic locations, which makes location relevant contextual information when analysing visual data. In this work, we address the task of image retrieval related to a given tag conditioned on a certain location on Earth. We present LocSens, a model that learns to rank triplets of images, tags and coordinates by plausibility, and two training strategies to balance the location influence in the final ranking. LocSens learns to fuse textual and location information of multimodal queries to retrieve related images at different levels of location granularity, and successfully utilizes location information to improve image tagging.

17. Learning and Reasoning with the Graph Structure Representation in Robotic Surgery
Mobarakol Islam, Lalithkumar Seenivasan, Lim Chwee Ming, Hongliang Ren
Abstract: Learning to infer graph representations and performing spatial reasoning in a complex surgical environment can play a vital role in surgical scene understanding in robotic surgery. For this purpose, we develop an approach to generate the scene graph and predict surgical interactions between instruments and the surgical region of interest (ROI) during robot-assisted surgery. We design an attention link function and integrate it with a graph parsing network to recognize the surgical interactions. To embed each node with corresponding neighbouring node features, we further incorporate SageConv into the network. The scene graph generation and active edge classification mostly depend on the embedding or feature extraction of node and edge features from complex image representation. Here, we empirically evaluate feature extraction methods that employ a label smoothing weighted loss. Smoothing the hard label can avoid over-confident predictions by the model and enhances the feature representation learned by the penultimate layer. To obtain the graph scene labels, we annotate the bounding boxes and the instrument-ROI interactions on the robotic scene segmentation challenge 2018 dataset together with an experienced clinical expert in robotic surgery and employ them to evaluate our propositions.
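
The abstract's remark that smoothing the hard label discourages over-confident predictions can be checked with a small calculation. The sketch below uses plain uniform label smoothing, which is an assumption: the paper's "label smoothing weighted loss" may weight classes differently, and the helper names and the value `eps=0.1` are illustrative only.

```python
import math


def smooth_labels(target, num_classes, eps=0.1):
    """Uniform label smoothing: move a mass eps from the hard label to all classes."""
    off = eps / num_classes
    return [(1.0 - eps) + off if k == target else off
            for k in range(num_classes)]


def cross_entropy(probs, target_dist):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)


# A very confident prediction for class 1 out of 4 classes.
confident = [0.01, 0.97, 0.01, 0.01]
hard = [0.0, 1.0, 0.0, 0.0]
soft = smooth_labels(1, 4, eps=0.1)

hard_loss = cross_entropy(confident, hard)   # small: confidence is rewarded
soft_loss = cross_entropy(confident, soft)   # larger: near-zero classes are penalised
```

With the smoothed target, pushing the non-target probabilities toward zero increases the loss, so the minimum is no longer at a one-hot prediction.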

18. SpinalNet: Deep Neural Network with Gradual Input
H M Dipu Kabir, Moloud Abdar, Seyed Mohammad Jafar Jalali, Abbas Khosravi, Amir F Atiya, Saeid Nahavandi, Dipti Srinivasan
Abstract: Deep neural networks (DNNs) have achieved state-of-the-art performance in numerous fields. However, DNNs need high computation times, and people always expect better performance with lower computation. Therefore, we study the human somatosensory system and design a neural network (SpinalNet) to achieve higher accuracy with lower computation time. This paper aims to present the SpinalNet. Hidden layers of the proposed SpinalNet consist of three parts: 1) input row, 2) intermediate row, and 3) output row. The intermediate row of the SpinalNet usually contains a small number of neurons. Input segmentation enables each hidden layer to receive a part of the input and the outputs of the previous layer. Therefore, the number of incoming weights in a hidden layer is significantly lower than in traditional DNNs. As the network directly contributes to outputs in each layer, the vanishing gradient problem of DNNs does not exist. We integrate the SpinalNet as the fully-connected layer of the convolutional neural network (CNN), residual neural network (ResNet), Dense Convolutional Network (DenseNet), and Visual Geometry Group (VGG) network. We observe a significant error reduction with lower computation in most situations. We have achieved state-of-the-art performance on the QMNIST, Kuzushiji-MNIST, and EMNIST (digits) datasets. Scripts of the proposed SpinalNet are available at the following link: this https URL

19. Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts
Marzieh Heidari, Mehdi Ghatee, Ahmad Nickabadi, Arash Pourhasan Nezhad
Abstract: With great advances in vision and natural language processing, the generation of image captions has become a need. In a recent paper, Mathews, Xie and He [1] introduced a new model to generate styled captions by separating semantics and style. Continuing this work, we develop a new captioning model that includes an image encoder to extract the features, a mixture of recurrent networks to embed the set of extracted features into a set of words, and a sentence generator that combines the obtained words into a stylized sentence. The resulting system, entitled Mixture of Recurrent Experts (MoRE), uses a new training algorithm that derives a singular value decomposition (SVD) from the weighting matrices of Recurrent Neural Networks (RNNs) to increase the diversity of captions. Each decomposition step depends on a distinctive factor based on the number of RNNs in MoRE. Since the sentence generator used provides a stylized language corpus without paired images, our captioning model can do the same. Besides, the styled and diverse captions are extracted without training on a densely labeled or styled dataset. To validate this captioning model, we use Microsoft COCO, which is a standard factual image caption dataset. We show that the proposed captioning model can generate diverse and stylized image captions without the necessity of extra labeling. The results also show better descriptions in terms of content accuracy.

20. GOLD-NAS: Gradual, One-Level, Differentiable
Kaifeng Bi, Lingxi Xie, Xin Chen, Longhui Wei, Qi Tian
Abstract: There is a large literature on neural architecture search, but most existing work makes use of heuristic rules that largely constrain the search flexibility. In this paper, we first relax these manually designed constraints and enlarge the search space to contain more than $10^{160}$ candidates. In the new space, most existing differentiable search methods can fail dramatically. We then propose a novel algorithm named Gradual One-Level Differentiable Neural Architecture Search (GOLD-NAS), which introduces a variable resource constraint to one-level optimization so that weak operators are gradually pruned out of the super-network. On standard image classification benchmarks, GOLD-NAS can find a series of Pareto-optimal architectures within a single search procedure. Most of the discovered architectures have never been studied before, yet they achieve a good tradeoff between recognition accuracy and model complexity. We believe the new space and search algorithm can advance research on differentiable NAS.

21. DAM: Deliberation, Abandon and Memory Networks for Generating Detailed and Non-repetitive Responses in Visual Dialogue
Xiaoze Jiang, Jing Yu, Yajing Sun, Zengchang Qin, Zihao Zhu, Yue Hu, Qi Wu
Abstract: The Visual Dialogue task requires an agent to engage in a conversation with a human about an image. The ability to generate detailed and non-repetitive responses is crucial for the agent to achieve human-like conversation. In this paper, we propose a novel generative decoding architecture to generate high-quality responses, which moves away from decoding the whole encoded semantics towards a design that advocates both transparency and flexibility. In this architecture, word generation is decomposed into a series of attention-based information selection steps, performed by the novel recurrent Deliberation, Abandon and Memory (DAM) module. Each DAM module performs an adaptive combination of the response-level semantics captured from the encoder and the word-level semantics specifically selected for generating each word. Therefore, the responses contain more detailed and non-repetitive descriptions while maintaining semantic accuracy. Furthermore, DAM can flexibly cooperate with existing visual dialogue encoders and adapt to their structures by constraining the information selection mode in DAM. We apply DAM to three typical encoders and verify the performance on the VisDial v1.0 dataset. Experimental results show that the proposed models achieve new state-of-the-art performance with high-quality responses. The code is available at this https URL.

22. Learning to Generate Novel Domains for Domain Generalization
Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, Tao Xiang
Abstract: This paper focuses on domain generalization (DG), the task of learning from multiple source domains a model that generalizes well to unseen domains. A main challenge for DG is that the available source domains often exhibit limited diversity, hampering the model's ability to learn to generalize. We therefore employ a data generator to synthesize data from pseudo-novel domains to augment the source domains. This explicitly increases the diversity of available training domains and leads to a more generalizable model. To train the generator, we model the distribution divergence between source and synthesized pseudo-novel domains using optimal transport, and maximize the divergence. To ensure that semantics are preserved in the synthesized data, we further impose cycle-consistency and classification losses on the generator. Our method, L2A-OT (Learning to Augment by Optimal Transport), outperforms current state-of-the-art DG methods on four benchmark datasets.

23. Automatic Ischemic Stroke Lesion Segmentation from Computed Tomography Perfusion Images by Image Synthesis and Attention-Based Deep Neural Networks
Guotai Wang, Tao Song, Qiang Dong, Mei Cui, Ning Huang, Shaoting Zhang
Abstract: Ischemic stroke lesion segmentation from Computed Tomography Perfusion (CTP) images is important for accurate diagnosis of stroke in acute care units. However, it is challenged by low image contrast and resolution of the perfusion parameter maps, in addition to the complex appearance of the lesion. To deal with this problem, we propose a novel framework based on synthesized pseudo Diffusion-Weighted Imaging (DWI) from perfusion parameter maps to obtain better image quality for more accurate segmentation. Our framework consists of three components based on Convolutional Neural Networks (CNNs) and is trained end-to-end. First, a feature extractor is used to obtain both a low-level and high-level compact representation of the raw spatiotemporal Computed Tomography Angiography (CTA) images. Second, a pseudo DWI generator takes as input the concatenation of CTP perfusion parameter maps and our extracted features to obtain the synthesized pseudo DWI. To achieve better synthesis quality, we propose a hybrid loss function that pays more attention to lesion regions and encourages high-level contextual consistency. Finally, we segment the lesion region from the synthesized pseudo DWI, where the segmentation network is based on switchable normalization and channel calibration for better performance. Experimental results showed that our framework achieved the top performance on ISLES 2018 challenge and: 1) our method using synthesized pseudo DWI outperformed methods segmenting the lesion from perfusion parameter maps directly; 2) the feature extractor exploiting additional spatiotemporal CTA images led to better synthesized pseudo DWI quality and higher segmentation accuracy; and 3) the proposed loss functions and network structure improved the pseudo DWI synthesis and lesion segmentation performance.

24. LabelEnc: A New Intermediate Supervision Method for Object Detection
Miao Hao, Yitao Liu, Xiangyu Zhang, Jian Sun
Abstract: In this paper we propose a new intermediate supervision method, named LabelEnc, to boost the training of object detection systems. The key idea is to introduce a novel label encoding function that maps the ground-truth labels into a latent embedding, which acts as auxiliary intermediate supervision for the detection backbone during training. Our approach mainly involves a two-step training procedure. First, we optimize the label encoding function via an AutoEncoder defined in the label space, approximating the "desired" intermediate representations for the target object detector. Second, taking advantage of the learned label encoding function, we introduce a new auxiliary loss attached to the detection backbones, thus benefiting the performance of the derived detector. Experiments show that our method improves a variety of detection systems by around 2% on the COCO dataset, for both one-stage and two-stage frameworks. Moreover, the auxiliary structures exist only during training, i.e. the method is completely cost-free at inference time.

25. Spectral Graph-based Features for Recognition of Handwritten Characters: A Case Study on Handwritten Devanagari Numerals
Mohammad Idrees Bhat, B. Sharada
Abstract: Interpreting different writing styles, unconstrained cursiveness and the relationships between different primitive parts is an essential and challenging task in the recognition of handwritten characters. Because feature representation is inadequate, appropriate interpretation and description of handwritten characters remains difficult. Although existing research on handwritten characters is extensive, it is still a challenge to obtain an effective representation of characters in feature space. In this paper, we attempt to circumvent these problems by proposing an approach that exploits a robust graph representation and the spectral graph embedding concept to characterise and effectively represent handwritten characters, taking into account writing styles, cursiveness and relationships. To corroborate the efficacy of the proposed method, extensive experiments were carried out on the standard handwritten numeral dataset of the Computer Vision Pattern Recognition Unit of the Indian Statistical Institute, Kolkata. The experimental results demonstrate promising findings, which can be used in future studies.

26. Single Storage Semi-Global Matching for Real Time Depth Processing
Prathmesh Sawant, Yashwant Temburu, Mandar Datar, Imran Ahmed, Vinayak Shriniwas, Sachin Patkar
Abstract: Depth-map estimation is a key computation in computer vision and robotics. One of the most popular approaches is to compute the disparity map of images obtained from a stereo camera. The Semi Global Matching (SGM) method is a popular choice, offering good accuracy with reasonable computation time. To use such compute-intensive algorithms in real-time applications such as autonomous aerial vehicles or aids for the blind, acceleration using a GPU or FPGA is necessary. In this paper, we show the design and implementation of a stereo-vision system based on an FPGA implementation of More Global Matching (MGM), a variant of SGM. We use 4 paths but store a single cumulative cost value for the corresponding pixel. Our stereo-vision prototype uses a Zedboard containing an ARM-based Zynq-SoC, a ZED stereo camera / ELP stereo camera / Intel RealSense D435i, and VGA for visualization. The power consumption attributed to the custom FPGA-based acceleration of the disparity map computation required for the depth map is just 0.72 watt. The update rate of the disparity map is a realistic 10.5 fps.
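
The Semi Global Matching step named in entry 26 can be illustrated with a minimal sketch of the standard SGM cost-aggregation recurrence along a single left-to-right path. This is not the paper's MGM variant or its FPGA design. The function names, the toy cost volume, and the penalties `p1` and `p2` are assumptions chosen for illustration.

```python
def aggregate_path(cost, p1, p2):
    """Aggregate matching costs along one left-to-right path.

    cost[x][d] is the matching cost of pixel x at disparity d.
    p1 penalizes a disparity change of 1, p2 penalizes larger jumps.
    """
    agg = [list(cost[0])]  # the first pixel has no predecessor
    ndisp = len(cost[0])
    for x in range(1, len(cost)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(ndisp):
            candidates = [prev[d], prev_min + p2]
            if d > 0:
                candidates.append(prev[d - 1] + p1)
            if d + 1 < ndisp:
                candidates.append(prev[d + 1] + p1)
            # Subtracting prev_min keeps the aggregated values bounded.
            row.append(cost[x][d] + min(candidates) - prev_min)
        agg.append(row)
    return agg


def winner_take_all(agg):
    """Pick the disparity with the lowest aggregated cost for each pixel."""
    return [row.index(min(row)) for row in agg]


# Toy cost volume (hypothetical values): 3 pixels, 3 disparity candidates.
toy_cost = [[0, 5, 5], [5, 0, 5], [5, 5, 0]]
toy_agg = aggregate_path(toy_cost, p1=1, p2=4)
toy_disparity = winner_take_all(toy_agg)
```

A full SGM implementation would sum such aggregated costs over several path directions before the winner-take-all step.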

27. Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu
Abstract: Dynamic skeletal data, represented as the 2D/3D coordinates of human joints, has been widely studied for human action recognition due to its high-level semantic information and environmental robustness. However, previous methods rely heavily on hand-crafted traversal rules or graph topologies to draw dependencies between the joints, which limits their performance and generalizability. In this work, we present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition. It involves solely attention blocks, allowing spatial-temporal dependencies between joints to be modeled without knowing their positions or mutual connections. Specifically, to meet the specific requirements of skeletal data, three techniques are proposed for building attention blocks, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization. Besides, from the data aspect, we introduce a skeletal data decoupling technique to emphasize the specific characteristics of space/time and different motion scales, resulting in a more comprehensive understanding of human actions. To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition, namely, SHREC, DHG, NTU-60 and NTU-120, where DSTA-Net achieves state-of-the-art performance on all of them.

28. RGBT Salient Object Detection: A Large-scale Dataset and Benchmark
Zhengzheng Tu, Yan Ma, Zhun Li, Chenglong Li, Jieming Xu, Yongtao Liu
Abstract: Salient object detection in complex scenes and environments is a challenging research topic. Most works focus on RGB-based salient object detection, which limits its performance in real-life applications when confronted with adverse conditions such as dark environments and complex backgrounds. Taking advantage of RGB and thermal infrared images has recently become a new research direction for detecting salient objects in complex scenes, as thermal infrared spectrum imaging provides complementary information and has been applied to many computer vision tasks. However, current research on RGBT salient object detection is limited by the lack of a large-scale dataset and comprehensive benchmark. This work contributes such an RGBT image dataset named VT5000, including 5000 spatially aligned RGBT image pairs with ground truth annotations. VT5000 has 11 challenges collected in different scenes and environments for exploring the robustness of algorithms. With this dataset, we propose a powerful baseline approach, which extracts multi-level features within each modality and aggregates these features of all modalities with the attention mechanism, for accurate RGBT salient object detection. Extensive experiments show that the proposed baseline approach outperforms the state-of-the-art methods on the VT5000 dataset and two other public datasets. In addition, we carry out a comprehensive analysis of different algorithms for RGBT salient object detection on the VT5000 dataset, draw several valuable conclusions and provide some potential research directions for RGBT salient object detection.

29. Learning Model-Blind Temporal Denoisers without Ground Truths
Bichuan Guo, Jiangtao Wen, Zhen Xia, Shan Liu, Yuxing Han
Abstract: Denoisers trained with synthetic data often fail to cope with the diversity of unknown noises, giving way to methods that can adapt to existing noise without knowing its ground truth. The previous image-based method leads to noise overfitting if directly applied to video denoisers, and manages temporal information inadequately, especially in terms of occlusion and lighting variation, which considerably hinders its denoising performance. In this paper, we propose a general framework for video denoising networks that successfully addresses these challenges. A novel twin sampler assembles training data by decoupling inputs from targets without altering semantics, which not only effectively solves the noise overfitting problem, but also generates better occlusion masks efficiently by checking optical flow consistency. An online denoising scheme and a warping loss regularizer are employed for better temporal alignment. Lighting variation is quantified based on the local similarity of aligned frames. Our method consistently outperforms the prior art by 0.6-3.2dB PSNR on multiple noises, datasets and network architectures. State-of-the-art results on reducing model-blind video noises are achieved. Extensive ablation studies are conducted to demonstrate the significance of each technical component.
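
Entry 29 reports its gains in dB of PSNR. As a reminder of what that unit measures, here is a minimal sketch of the standard PSNR definition, 10 * log10(peak^2 / MSE). The function name, the flat-list image representation, and the 8-bit peak value of 255 are assumptions for illustration and are not taken from the paper.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat lists of pixel intensities (an assumed layout).
    """
    if not reference or len(reference) != len(estimate):
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 4-pixel example: every pixel is off by 1, so MSE = 1.
clean = [10, 20, 30, 40]
noisy = [11, 19, 31, 39]
example_db = psnr(clean, noisy)
```

Because the scale is logarithmic, a gain of 3 dB corresponds to roughly halving the mean squared error.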

30. Calibrated BatchNorm: Improving Robustness Against Noisy Weights in Neural Networks
Li-Huang Tsai, Shih-Chieh Chang, Yu-Ting Chen, Jia-Yu Pan, Wei Wei, Da-Cheng Juan
Abstract: Analog computing hardware has gradually received more attention from researchers for accelerating neural network computations in recent years. However, analog accelerators often suffer from undesirable intrinsic noise caused by the physical components, making it challenging for neural networks to reach the performance they achieve on digital hardware. We hypothesize that the performance drop of noisy neural networks is due to distribution shifts in the network activations. In this paper, we propose to recalculate the statistics of the batch normalization layers to calibrate the biased distributions during the inference phase. Without needing to know the attributes of the noise beforehand, our approach is able to align the distributions of the activations under the variational noise inherent in analog environments. To validate our assumptions, we conduct quantitative experiments and apply our methods to several computer vision tasks, including classification, object detection, and semantic segmentation. The results demonstrate the effectiveness of the approach in achieving noise-agnostic robust networks and advance the development of analog computing devices in the field of neural networks.
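
The idea in entry 30 of recalculating batch normalization statistics under noisy weights can be sketched on a single linear unit. This is a toy illustration, not the authors' implementation: the multiplicative Gaussian noise model, the weights, the batch size, and all function names are assumptions, and the same batch is used both to calibrate and to evaluate for brevity.

```python
import random
from statistics import fmean, pvariance


def linear(batch, weights):
    """Apply a single-output linear unit to each input vector."""
    return [sum(w * x for w, x in zip(weights, row)) for row in batch]


def bn_stats(activations):
    """Mean and population variance of one feature over a batch."""
    return fmean(activations), pvariance(activations)


def batch_norm(activations, mean, var, eps=1e-5):
    """Normalize activations with the given statistics (no affine part)."""
    scale = (var + eps) ** 0.5
    return [(a - mean) / scale for a in activations]


rng = random.Random(0)
batch = [[rng.gauss(1.0, 1.0) for _ in range(4)] for _ in range(256)]
clean_weights = [0.5, -1.0, 2.0, 0.3]
# Assumed noise model: each weight is scaled by an independent Gaussian factor.
noisy_weights = [w * (1.0 + rng.gauss(0.0, 0.3)) for w in clean_weights]

# Statistics stored at training time come from the clean weights.
stored_mean, stored_var = bn_stats(linear(batch, clean_weights))

# At inference the noisy weights shift the activation distribution.
noisy_acts = linear(batch, noisy_weights)
stale_out = batch_norm(noisy_acts, stored_mean, stored_var)

# Calibration: recompute the statistics from activations of the noisy network.
calib_mean, calib_var = bn_stats(noisy_acts)
calibrated_out = batch_norm(noisy_acts, calib_mean, calib_var)
```

Comparing `fmean(stale_out)` and `pvariance(stale_out)` with the calibrated values shows how far the stored statistics have drifted; the calibrated output is zero-mean and unit-variance by construction.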

31. Extracting the fundamental diagram from aerial footage
Rafael Makrigiorgis, Panayiotis Kolios, Stelios Timotheou, Theocharis Theocharides, Christos G. Panayiotou
Abstract: Efficient traffic monitoring plays a fundamental role in successfully tackling congestion in transportation networks. Congestion is strongly correlated with two measurable characteristics, the demand and the network density, which impact the overall system behavior. At large, this system behavior is characterized through the fundamental diagram of a road segment, a region or the network. In this paper we devise an innovative way to obtain the fundamental diagram through aerial footage obtained from drone platforms. The derived methodology consists of 3 phases: vehicle detection, vehicle tracking and traffic state estimation. We elaborate on the algorithms developed for each of the 3 phases and demonstrate the applicability of the results in a real-world setting.

32. AnchorFace: An Anchor-based Facial Landmark Detector Across Large Poses
Zixuan Xu, Banghuai Li, Miao Geng, Ye Yuan, Gang Yu
Abstract: Facial landmark localization aims to detect predefined points on human faces, and the topic has advanced rapidly with the recent development of neural network based methods. However, it remains a challenging task when dealing with faces in unconstrained scenarios, especially with large pose variations. In this paper, we target the problem of facial landmark localization across large poses and address this task with a split-and-aggregate strategy. To split the search space, we propose a set of anchor templates as references for regression, which effectively address the large variations of face poses. Based on the prediction of each anchor template, we propose to aggregate the results, which can reduce the landmark uncertainty due to the large poses. Overall, our proposed approach, named AnchorFace, obtains state-of-the-art results with extremely efficient inference speed on four challenging benchmarks, i.e. the AFLW, 300W, Menpo, and WFLW datasets. Code will be released for reproduction.

33. Non-image Data Classification with Convolutional Neural Networks
Anuraganand Sharma, Dinesh Kumar
Abstract: Convolutional Neural Networks (CNNs) are among the most popular deep learning algorithms and are mostly used for image classification, natural language processing, and time series forecasting. Their ability to extract and recognize fine features has led to state-of-the-art performance. A CNN is designed to work on a set of 2-D matrices whose elements show some correlation with neighboring elements, as in image data. Conversely, data examples represented as a set of 1-D vectors -- apart from time series data -- cannot be used with CNNs, but only with other Artificial Neural Networks (ANNs). We propose some novel data-wrangling preprocessing methods that transform a 1-D data vector into a 2-D graphical image with appropriate correlations among the fields, so that it can be processed by a CNN. To our knowledge, this work is novel in its non-image to image data transformation for non-time series data. The transformed data processed with a CNN using VGGnet-16 shows competitive classification accuracy compared to the canonical ANN approach, with high potential for further improvements.

34. Semi-Supervised Crowd Counting via Self-Training on Surrogate Tasks
Yan Liu, Lingqiao Liu, Peng Wang, Pingping Zhang, Yinjie Lei
Abstract: Most existing crowd counting systems rely on the availability of object location annotations, which can be expensive to obtain. To reduce the annotation cost, one attractive solution is to leverage a large number of unlabeled images to build a crowd counting model in a semi-supervised fashion. This paper tackles the semi-supervised crowd counting problem from the perspective of feature learning. Our key idea is to leverage the unlabeled images to train a generic feature extractor rather than the entire network of a crowd counter. The rationale of this design is that learning the feature extractor can be more reliable and robust towards the inevitable noisy supervision generated from the unlabeled data. Also, on top of a good feature extractor, it is possible to build a density map regressor with far fewer density map annotations. Specifically, we propose a novel semi-supervised crowd counting method built upon two innovative components: (1) a set of inter-related binary segmentation tasks is derived from the original density map regression task as the surrogate prediction target; (2) the surrogate target predictors are learned from both labeled and unlabeled data by utilizing a proposed self-training scheme which fully exploits the underlying constraints of these binary segmentation tasks. Through experiments, we show that the proposed method is superior to the existing semi-supervised crowd counting method and other representative baselines.
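
One way to picture the "inter-related binary segmentation tasks" of entry 34 is to threshold a density map at several levels. The sketch below assumes that construction (the abstract does not spell it out), along with hypothetical function names, thresholds, and a toy density map. It shows the nesting constraint such masks satisfy: a pixel above a high threshold is also above every lower one.

```python
def surrogate_masks(density, thresholds):
    """Turn one density map into one binary mask per threshold (ascending)."""
    return [
        [[1 if value > t else 0 for value in row] for row in density]
        for t in sorted(thresholds)
    ]


def is_nested(lower, higher):
    """Check that every pixel set in `higher` is also set in `lower`."""
    return all(
        hi <= lo
        for row_lo, row_hi in zip(lower, higher)
        for lo, hi in zip(row_lo, row_hi)
    )


# Toy 2x3 density map (hypothetical values).
toy_density = [[0.0, 0.2, 0.9], [0.1, 0.5, 1.4]]
masks = surrogate_masks(toy_density, thresholds=[0.1, 0.4, 1.0])
```

A constraint of this kind can be checked on predictions for unlabeled images, which is the sort of signal a self-training scheme could exploit.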

35. ReMOTS: Refining Multi-Object Tracking and Segmentation
Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, Yang Wu
Abstract: We aim to improve the performance of Multiple Object Tracking and Segmentation (MOTS) by refinement. However, refining MOTS results remains challenging, which can be attributed to two factors: appearance features are not adapted to the target videos, and it is difficult to find proper thresholds to discriminate them. To tackle this issue, we propose a Refining MOTS (i.e., ReMOTS) framework. ReMOTS mainly takes four steps to refine MOTS results from the data association perspective. (1) Training the appearance encoder using predicted masks. (2) Associating observations across adjacent frames to form short-term tracklets. (3) Training the appearance encoder using short-term tracklets as reliable pseudo labels. (4) Merging short-term tracklets to long-term tracklets utilizing adopted appearance features and thresholds that are automatically obtained from statistical information. Using ReMOTS, we reached 1st place on CVPR 2020 MOTS Challenge 1, with an sMOTSA score of 69.9.

36. Learning to Count in the Crowd from Limited Labeled Data
Vishwanath A. Sindagi, Rajeev Yasarla, Deepak Sam Babu, R. Venkatesh Babu, Vishal M. Patel
Abstract: Recent crowd counting approaches have achieved excellent performance. However, they are essentially based on the fully supervised paradigm and require a large number of annotated samples. Obtaining annotations is an expensive and labour-intensive process. In this work, we focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples while leveraging a large pool of unlabeled data. Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimating pseudo-ground truth for the unlabeled data, which is then used as supervision for training the network. The proposed method is shown to be effective under reduced data (semi-supervised) settings for several datasets such as ShanghaiTech, UCF-QNRF, WorldExpo and UCSD. Furthermore, we demonstrate that the proposed method can be leveraged to enable the network to learn to count from a synthetic dataset while generalizing better to real-world datasets (synthetic-to-real transfer).

37. Spatial Semantic Embedding Network: Fast 3D Instance Segmentation with Deep Metric Learning
Dongsu Zhang, Junha Chun, Sang Kyun Cha, Young Min Kim
Abstract: We propose spatial semantic embedding network (SSEN), a simple, yet efficient algorithm for 3D instance segmentation using deep metric learning. The raw 3D reconstruction of an indoor environment suffers from occlusions, noise, and is produced without any meaningful distinction between individual entities. For high-level intelligent tasks from a large scale scene, 3D instance segmentation recognizes individual instances of objects. We approach the instance segmentation by simply learning the correct embedding space that maps individual instances of objects into distinct clusters that reflect both spatial and semantic information. Unlike previous approaches that require complex pre-processing or post-processing, our implementation is compact and fast with competitive performance, maintaining scalability on large scenes with high resolution voxels. We demonstrate the state-of-the-art performance of our algorithm in the ScanNet 3D instance segmentation benchmark on AP score.

38. Self domain adapted network
Yufan He, Aaron Carass, Lianrui Zuo, Blake E. Dewey, Jerry L. Prince
Abstract: Domain shift is a major problem for deploying deep networks in clinical practice. Network performance drops significantly with (target) images obtained differently than its (source) training data. Due to a lack of target label data, most work has focused on unsupervised domain adaptation (UDA). Current UDA methods need both source and target data to train models which perform image translation (harmonization) or learn domain-invariant features. However, training a model for each target domain is time consuming and computationally expensive, even infeasible when target domain data are scarce or source data are unavailable due to data privacy. In this paper, we propose a novel self domain adapted network (SDA-Net) that can rapidly adapt itself to a single test subject at the testing stage, without using extra data or training a UDA model. The SDA-Net consists of three parts: adaptors, task model, and auto-encoders. The latter two are pre-trained offline on labeled source images. The task model performs tasks like synthesis, segmentation, or classification, which may suffer from the domain shift problem. At the testing stage, the adaptors are trained to transform the input test image and features to reduce the domain shift as measured by the auto-encoders, and thus perform domain adaptation. We validated our method on retinal layer segmentation from different OCT scanners and T1 to T2 synthesis with T1 from different MRI scanners and with different imaging parameters. Results show that our SDA-Net, with a single test subject and a short amount of time for self adaptation at the testing stage, can achieve significant improvements.
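Summary: Domain shift degrades deep networks in clinical practice when target images are acquired differently from the source training data, and current unsupervised domain adaptation (UDA) methods need both source and target data to train a model for each target domain. The proposed self domain adapted network (SDA-Net) consists of adaptors, a task model, and auto-encoders; the latter two are pre-trained offline on labeled source images. At the testing stage, the adaptors are trained on a single test subject to reduce the domain shift as measured by the auto-encoders, without extra data or a UDA model. On retinal layer segmentation across OCT scanners and T1-to-T2 synthesis across MRI scanners and imaging parameters, SDA-Net achieves significant improvements after a short self-adaptation.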

39. Discretization-Aware Architecture Search [PDF]
Yunjie Tian, Chang Liu, Lingxi Xie, Jianbin Jiao, Qixiang Ye
Abstract: The search cost of neural architecture search (NAS) has been largely reduced by weight-sharing methods. These methods optimize a super-network with all possible edges and operations, and determine the optimal sub-network by discretization, i.e., pruning off weak candidates. The discretization process, performed on either operations or edges, incurs significant inaccuracy and thus the quality of the final architecture is not guaranteed. This paper presents discretization-aware architecture search (DA²S), whose core idea is to add a loss term that pushes the super-network towards the configuration of the desired topology, so that the accuracy loss brought by discretization is largely alleviated. Experiments on standard image classification benchmarks demonstrate the superiority of our approach, in particular under imbalanced target network configurations that were not studied before.
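Summary: Weight-sharing methods have largely reduced the search cost of neural architecture search (NAS), but the discretization step that prunes weak operations or edges from the super-network incurs significant inaccuracy, so the quality of the final architecture is not guaranteed. Discretization-aware architecture search (DA²S) adds a loss term that pushes the super-network towards the configuration of the desired topology, which largely alleviates the accuracy loss brought by discretization. Experiments on standard image classification benchmarks show the advantage of the approach, particularly under imbalanced target network configurations that had not been studied before.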
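
The abstract above does not say which loss term DA²S uses. As a rough sketch only, one plausible way to push softmax-weighted candidates towards a discrete choice is to penalize the entropy of the weights; the entropy form, the `weight` value, and the function names below are assumptions for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of architecture logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; it is zero for a one-hot distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def discretization_penalty(logits, weight=0.1):
    """Hypothetical extra loss term: small when one candidate clearly dominates."""
    return weight * entropy(softmax(logits))

undecided = discretization_penalty([0.0, 0.0, 0.0, 0.0])  # uniform weights, largest penalty
decided = discretization_penalty([10.0, 0.0, 0.0, 0.0])   # near one-hot, penalty close to zero
```

Adding such a term to the training loss would make pruning the weak candidates change the network less, which is the effect the abstract describes.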

40. Learning Embeddings for Image Clustering: An Empirical Study of Triplet Loss Approaches [PDF]
Kalun Ho, Janis Keuper, Franz-Josef Pfreundt, Margret Keuper
Abstract: In this work, we evaluate two different image clustering objectives, k-means clustering and correlation clustering, in the context of Triplet Loss induced feature space embeddings. Specifically, we train a convolutional neural network to learn discriminative features by optimizing two popular versions of the Triplet Loss in order to study their clustering properties under the assumption of noisy labels. Additionally, we propose a new, simple Triplet Loss formulation, which shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. We evaluate all three Triplet loss formulations for K-means and correlation clustering on the CIFAR-10 image classification dataset.
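Summary: This work evaluates two image clustering objectives, k-means clustering and correlation clustering, on feature-space embeddings induced by the Triplet Loss. A convolutional neural network is trained with two popular versions of the Triplet Loss to study their clustering properties under the assumption of noisy labels. A new, simple Triplet Loss formulation is also proposed; it shows desirable properties with respect to formal clustering objectives and outperforms the existing methods. All three formulations are evaluated for k-means and correlation clustering on the CIFAR-10 image classification dataset.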
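
For readers unfamiliar with the objective, here is a minimal sketch of the standard margin-based triplet loss that studies like this one build on. It is not the new formulation proposed by the authors, which the abstract does not specify; the Euclidean distance and the margin of 1.0 are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge form: max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Positive is close and negative is far: the triplet is already satisfied.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Positive is far and negative is close: the triplet is violated.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so only violated triplets contribute gradient.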

41. Text Recognition -- Real World Data and Where to Find Them [PDF]
Klára Janoušková, Jiri Matas, Lluis Gomez, Dimosthenis Karatzas
Abstract: We present a method for exploiting weakly annotated images to improve text extraction pipelines. The approach exploits an arbitrary existing end-to-end text recognition system to obtain text region proposals and their, possibly erroneous, transcriptions. A process that includes imprecise transcription to annotation matching and edit distance guided neighbourhood search produces nearly error-free, localised instances of scene text, which we treat as pseudo ground truth used for training. We apply the method to two weakly-annotated datasets and show that the process consistently improves the accuracy of a state of the art recognition model across different benchmark datasets (image domains) as well as providing a significant performance boost on the same dataset.
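Summary: The paper presents a method for exploiting weakly annotated images to improve text extraction pipelines. An arbitrary existing end-to-end text recognition system provides text region proposals and their possibly erroneous transcriptions; imprecise transcription-to-annotation matching and an edit-distance-guided neighbourhood search then yield nearly error-free, localised instances of scene text, which serve as pseudo ground truth for training. Applied to two weakly annotated datasets, the process consistently improves the accuracy of a state-of-the-art recognition model across different benchmark datasets (image domains) and gives a significant performance boost on the same dataset.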

42. Wasserstein Distances for Stereo Disparity Estimation [PDF]
Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao
Abstract: Existing approaches to depth or disparity estimation output a distribution over a set of pre-defined discrete values. This leads to inaccurate results when the true depth or disparity does not match any of these values. The fact that this distribution is usually learned indirectly through a regression loss causes further problems in ambiguous regions around object boundaries. We address these issues using a new neural network architecture that is capable of outputting arbitrary depth values, and a new loss function that is derived from the Wasserstein distance between the true and the predicted distributions. We validate our approach on a variety of tasks, including stereo disparity and depth estimation, and the downstream 3D object detection. Our approach drastically reduces the error in ambiguous regions, especially around object boundaries that greatly affect the localization of objects in 3D, achieving the state-of-the-art in 3D object detection for autonomous driving.
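Summary: Existing depth and disparity estimators output a distribution over a set of pre-defined discrete values, which gives inaccurate results when the true value matches none of them, and learning that distribution indirectly through a regression loss causes further problems in ambiguous regions around object boundaries. The paper proposes a neural network architecture that can output arbitrary depth values and a loss function derived from the Wasserstein distance between the true and the predicted distributions. Validated on stereo disparity estimation, depth estimation, and downstream 3D object detection, the approach drastically reduces the error in ambiguous regions, especially around object boundaries, and achieves state-of-the-art 3D object detection for autonomous driving.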
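
To make the distance itself concrete, the sketch below computes the 1-Wasserstein distance between two histograms over the same sorted set of candidate disparities, using the cumulative-distribution form. It illustrates only the distance, not the authors' loss or network; the function name and the example values are assumptions.

```python
from itertools import accumulate

def wasserstein_1d(p, q, support):
    """W1 distance between histograms p and q defined on the same sorted support."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    total = 0.0
    # Integrate |CDF_p - CDF_q| over each gap between neighbouring support points.
    for i in range(len(support) - 1):
        total += abs(cdf_p[i] - cdf_q[i]) * (support[i + 1] - support[i])
    return total

disparities = [0.0, 1.0, 2.0, 3.0]
truth = [0.0, 0.0, 1.0, 0.0]  # all mass at disparity 2
near = [0.0, 0.0, 0.0, 1.0]   # all mass at disparity 3
far = [1.0, 0.0, 0.0, 0.0]    # all mass at disparity 0

d_near = wasserstein_1d(truth, near, disparities)
d_far = wasserstein_1d(truth, far, disparities)
```

Unlike a per-bin classification loss, this distance grows with how far the predicted mass sits from the true value, so a prediction one bin away is penalized less than one two bins away.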

43. Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation [PDF]
Jiayi Wang, Franziska Mueller, Florian Bernard, Christian Theobalt
Abstract: We propose to use a model-based generative loss for training hand pose estimators on depth images based on a volumetric hand model. This additional loss allows training of a hand pose estimator that accurately infers the entire set of 21 hand keypoints while only using supervision for 6 easy-to-annotate keypoints (fingertips and wrist). We show that our partially-supervised method achieves results that are comparable to those of fully-supervised methods which enforce articulation consistency. Moreover, for the first time we demonstrate that such an approach can be used to train on datasets that have erroneous annotations, i.e. "ground truth" with notable measurement errors, while obtaining predictions that explain the depth images better than the given "ground truth".
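Summary: The paper proposes a model-based generative loss, built on a volumetric hand model, for training hand pose estimators on depth images. With this additional loss, an estimator can accurately infer the entire set of 21 hand keypoints while using supervision for only 6 easy-to-annotate keypoints (fingertips and wrist). The partially supervised method achieves results comparable to those of fully supervised methods that enforce articulation consistency. The authors also show, for the first time, that such an approach can train on datasets with erroneous annotations, i.e. "ground truth" with notable measurement errors, and still obtain predictions that explain the depth images better than the given "ground truth".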

44. VPN: Learning Video-Pose Embedding for Activities of Daily Living [PDF]
Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat
Abstract: In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. Therefore, ADL may look very similar, and one often needs to look at their fine-grained details to distinguish them. Because the recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The 2 key components of this VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler to provide joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset (NTU-RGB+D 120), its subset (NTU-RGB+D 60), a challenging real-world human activity dataset (Toyota Smarthome), and a small-scale human-object interaction dataset (Northwestern UCLA).
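Summary: The paper addresses the spatio-temporal aspect of recognizing Activities of Daily Living (ADL), which exhibit subtle spatio-temporal patterns and similar visual patterns that vary with time. Because recent spatio-temporal 3D ConvNets are too rigid to capture these subtle patterns, the authors propose the Video-Pose Network (VPN), whose two key components are a spatial embedding, which projects 3D poses and RGB cues into a common semantic space, and an attention network. The attention network provides an end-to-end learnable pose backbone that exploits the topology of the human body and a coupler that provides joint spatio-temporal attention weights across a video. VPN outperforms state-of-the-art results for action classification on NTU-RGB+D 120, its subset NTU-RGB+D 60, Toyota Smarthome, and Northwestern UCLA.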

45. Learning to Segment Anatomical Structures Accurately from One Exemplar [PDF]
Yuhang Lu, Kang Zheng, Weijian Li, Yirui Wang, Adam P. Harrison, Chihung Lin, Song Wang, Jing Xiao, Le Lu, Chang-Fu Kuo, Shun Miao
Abstract: Accurate segmentation of critical anatomical structures is at the core of medical image analysis. The main bottleneck lies in gathering the requisite expert-labeled image annotations in a scalable manner. Methods that can produce accurate anatomical structure segmentation without using a large amount of fully annotated training images are highly desirable. In this work, we propose the Contour Transformer Network (CTN), a one-shot anatomy segmentor including a naturally built-in human-in-the-loop mechanism. Segmentation is formulated by learning a contour evolution behavior process based on graph convolutional networks (GCN). Training of our CTN model requires only one labeled image exemplar and leverages additional unlabeled data through newly introduced loss functions that measure the global shape and appearance consistency of contours. We demonstrate that our one-shot learning method significantly outperforms non-learning-based methods and performs competitively with the state-of-the-art fully supervised deep learning approaches. With minimal human-in-the-loop editing feedback, the segmentation performance can be further improved and tailored towards the observer's desired outcomes. This can facilitate clinician-designed imaging-based biomarker assessments (to support personalized quantitative clinical diagnosis) and outperforms fully supervised baselines.
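Summary: Accurate segmentation of critical anatomical structures is at the core of medical image analysis, and the main bottleneck is gathering expert-labeled annotations in a scalable manner. The paper proposes the Contour Transformer Network (CTN), a one-shot anatomy segmentor with a naturally built-in human-in-the-loop mechanism, which formulates segmentation as learning a contour evolution process based on graph convolutional networks (GCN). Training requires only one labeled image exemplar and leverages additional unlabeled data through new loss functions that measure the global shape and appearance consistency of contours. The method significantly outperforms non-learning-based methods and performs competitively with state-of-the-art fully supervised deep learning approaches; with minimal human-in-the-loop editing feedback, the segmentation can be further improved and outperforms fully supervised baselines.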

46. Labeling of Multilingual Breast MRI Reports [PDF]
Chen-Han Tsai, Nahum Kiryati, Eli Konen, Miri Sklair-Levy, Arnaldo Mayer
Abstract: Medical reports are an essential medium in recording a patient's condition throughout a clinical trial. They contain valuable information that can be extracted to generate a large labeled dataset needed for the development of clinical tools. However, the majority of medical reports are stored in an unregularized format, and a trained human annotator (typically a doctor) must manually assess and label each case, resulting in an expensive and time consuming procedure. In this work, we present a framework for developing a multilingual breast MRI report classifier using a custom-built language representation called LAMBR. Our proposed method overcomes practical challenges faced in clinical settings, and we demonstrate improved performance in extracting labels from medical reports when compared with conventional approaches.
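Summary: Medical reports are an essential medium for recording a patient's condition throughout a clinical trial, and they contain valuable information that can be extracted to build the large labeled datasets needed to develop clinical tools. Most reports, however, are stored in an unregularized format, so a trained human annotator (typically a doctor) must assess and label each case manually, which is expensive and time consuming. The paper presents a framework for developing a multilingual breast MRI report classifier using a custom-built language representation called LAMBR. The method overcomes practical challenges faced in clinical settings and improves label extraction from medical reports compared with conventional approaches.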

47. Determination of the most representative descriptor among a set of feature vectors for the same object [PDF]
Dmitry Pozdnyakov
Abstract: Using the face recognition problem as an example, the present study considers an approach for estimating the most representative descriptor among a set of feature vectors for the same face. The estimation is based on robust calculation of the mode-median mixture vector of the set, used as the descriptor, by applying the Welsch/Leclerc loss function in the case of very sparse filling of the feature space with feature vectors.
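Summary: Using face recognition as an example, the study considers how to estimate the most representative descriptor among a set of feature vectors for the same face. The descriptor is the mode-median mixture vector of the set, computed robustly by applying the Welsch/Leclerc loss function for the case in which feature vectors fill the feature space very sparsely.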
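
The abstract gives no algorithmic detail, so the following is only a sketch of one common way to obtain a robust location estimate under a Welsch-type loss: iteratively reweighted averaging with weights exp(-r^2 / (2 c^2)). The initialization from the mean, the scale `c = 1.0`, the fixed iteration count, and the function name are all assumptions, and this is not claimed to be the author's mode-median mixture procedure.

```python
import math

def welsch_estimate(vectors, scale=1.0, iterations=50):
    """Robust location estimate under a Welsch-type loss via reweighted averaging."""
    dim = len(vectors[0])
    # Start from the plain mean, which outliers can pull far from the inliers.
    est = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    for _ in range(iterations):
        weights = []
        for v in vectors:
            sq_dist = sum((v[d] - est[d]) ** 2 for d in range(dim))
            # Far-away vectors get exponentially small weight.
            weights.append(math.exp(-sq_dist / (2.0 * scale ** 2)))
        total = sum(weights)
        est = [sum(w * v[d] for w, v in zip(weights, vectors)) / total
               for d in range(dim)]
    return est

# Three similar descriptors of one face plus one gross outlier.
descriptors = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0]]
robust = welsch_estimate(descriptors)
plain_mean = [sum(v[d] for v in descriptors) / len(descriptors) for d in range(2)]
```

In this toy example the plain mean is dragged towards the outlier, while the reweighted estimate stays with the three similar descriptors.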

48. Deep Learning for Apple Diseases: Classification and Identification [PDF]
Asif Iqbal Khan, SMK Quadri, Saba Banday
Abstract: Diseases and pests cause huge economic loss to the apple industry every year. The identification of various apple diseases is challenging for farmers, as the symptoms produced by different diseases may be very similar and may be present simultaneously. This paper is an attempt to provide timely and accurate detection and identification of apple diseases. In this study, we propose a deep learning based approach for identification and classification of apple diseases. The first part of the study is dataset creation, which includes data collection and data labelling. Next, we train a Convolutional Neural Network (CNN) model on the prepared dataset for automatic classification of apple diseases. CNNs are end-to-end learning algorithms which perform automatic feature extraction and learn complex features directly from raw images, making them suitable for a wide variety of tasks such as image classification, object detection, and segmentation. We applied transfer learning to initialize the parameters of the proposed deep model. Data augmentation techniques like rotation, translation, reflection and scaling were also applied to prevent overfitting. The proposed CNN model obtained encouraging results, reaching around 97.18% accuracy on our prepared dataset. The results validate that the proposed method is effective in classifying various types of apple diseases and can be used as a practical tool by farmers.
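Summary: Diseases and pests cause huge economic loss to the apple industry every year, and identifying apple diseases is hard for farmers because different diseases can produce very similar symptoms and can be present simultaneously. The paper proposes a deep learning approach: the authors first create a dataset through data collection and labelling, then train a Convolutional Neural Network (CNN) on it for automatic classification of apple diseases. Transfer learning initializes the model parameters, and data augmentation (rotation, translation, reflection, and scaling) is applied to prevent overfitting. The model reaches around 97.18% accuracy on the prepared dataset, and the authors conclude that it can serve farmers as a practical tool.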

49. See, Hear, Explore: Curiosity via Audio-Visual Association [PDF]
Victoria Dean, Shubham Tulsiani, Abhinav Gupta
Abstract: Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model. However, predicting the future is an inherently difficult task which can be ill-posed in the face of stochasticity. In this paper, we introduce an alternative form of curiosity that rewards novel associations between different senses. Our approach exploits multiple modalities to provide a stronger signal for more efficient exploration. Our method is inspired by the fact that, for humans, both sight and sound play a critical role in exploration. We present results on several Atari environments and Habitat (a photorealistic navigation simulator), showing the benefits of using an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards. For videos and code, see this https URL.
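Summary: Exploration is one of the core challenges in reinforcement learning. A common formulation of curiosity-driven exploration uses the difference between the real future and the future predicted by a learned model, but predicting the future is inherently difficult and can be ill-posed in the face of stochasticity. The paper introduces an alternative form of curiosity that rewards novel associations between different senses, inspired by the critical role that both sight and sound play in human exploration. Results on several Atari environments and Habitat (a photorealistic navigation simulator) show the benefits of an audio-visual association model for intrinsically guiding learning agents in the absence of external rewards. For videos and code, see this https URL.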

50. Segmentation of Pulmonary Opacification in Chest CT Scans of COVID-19 Patients [PDF]
Keegan Lensink, Issam Laradji, Marco Law, Paolo Emilio Barbano, Savvas Nicolaou, William Parker, Eldad Haber
Abstract: The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has rapidly spread into a global pandemic. A form of pneumonia, presenting as opacities within a patient's lungs, is the most common presentation associated with this virus, and great attention has gone into how these changes relate to patient morbidity and mortality. In this work we provide open source models for the segmentation of patterns of pulmonary opacification on chest Computed Tomography (CT) scans which have been correlated with various stages and severities of infection. We have collected 663 chest CT scans of COVID-19 patients from healthcare centers around the world, and created pixel-wise segmentation labels for nearly 25,000 slices that segment 6 different patterns of pulmonary opacification. We provide open source implementations and pre-trained weights for multiple segmentation models trained on our dataset. Our best model achieves an opacity Intersection-Over-Union score of 0.76 on our test set, demonstrates successful domain adaptation, and predicts the volume of opacification within 1.7% of expert radiologists. Additionally, we present an analysis of the inter-observer variability inherent to this task, and propose methods for appropriate probabilistic approaches.
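Summary: SARS-CoV-2 has rapidly spread into a global pandemic, and pneumonia presenting as opacities within the lungs is the most common presentation associated with the virus. The authors provide open source models for segmenting patterns of pulmonary opacification on chest Computed Tomography (CT) scans. They collected 663 chest CT scans of COVID-19 patients from healthcare centers around the world and created pixel-wise segmentation labels for nearly 25,000 slices covering 6 patterns of pulmonary opacification. The best model achieves an opacity Intersection-Over-Union score of 0.76 on the test set, demonstrates successful domain adaptation, and predicts the volume of opacification within 1.7% of expert radiologists. The paper also analyzes the inter-observer variability inherent to the task and proposes methods for appropriate probabilistic approaches.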
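
The two figures quoted above, Intersection-Over-Union and relative volume error, can be computed as in the sketch below. It assumes binary masks flattened to lists of 0/1 with one entry per voxel and equal voxel size; the function names and the toy masks are illustrative and do not come from the authors' code.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0

def relative_volume_error(pred, truth):
    """Absolute difference in segmented volume, relative to the reference volume."""
    pred_volume = sum(pred)
    truth_volume = sum(truth)
    return abs(pred_volume - truth_volume) / truth_volume

predicted = [1, 1, 1, 0]
reference = [1, 1, 0, 0]
score = iou(predicted, reference)
volume_error = relative_volume_error(predicted, reference)
```

The two measures answer different questions: IoU checks whether the right voxels were labeled, while the volume error only checks that the total amount of opacification matches.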

51. What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets [PDF]
Jianing Yang, Yuying Zhu, Yongxin Wang, Ruitao Yi, Amir Zadeh, Louis-Philippe Morency
Abstract: Question answering biases in video QA datasets can mislead multimodal models into overfitting to QA artifacts and jeopardize the models' ability to generalize. Understanding how strong these QA biases are and where they come from helps the community measure progress more accurately and provides researchers with insights to debug their models. In this paper, we analyze QA biases in popular video question answering datasets and discover that pretrained language models can answer 37-48% of questions correctly without using any multimodal context information, far exceeding the 20% random guess baseline for 5-choose-1 multiple-choice questions. Our ablation study shows that biases can come from annotators and types of questions. Specifically, annotators that have been seen during training are better predicted by the model, and reasoning, abstract questions incur more biases than factual, direct questions. We also show empirically that using annotator-non-overlapping train-test splits can reduce QA biases for video QA datasets.
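Summary: Question answering biases in video QA datasets can lead multimodal models to overfit to QA artifacts and jeopardize their ability to generalize. Analyzing popular video question answering datasets, the authors find that pretrained language models can answer 37-48% of questions correctly without any multimodal context information, far exceeding the 20% random guess baseline for 5-choose-1 multiple-choice questions. An ablation study shows that biases can come from annotators and question types: annotators seen during training are better predicted by the model, and reasoning, abstract questions incur more biases than factual, direct questions. The authors also show empirically that annotator-non-overlapping train-test splits can reduce QA biases for video QA datasets.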
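
An annotator-non-overlapping split, as mentioned above, can be sketched as follows: annotators rather than individual questions are assigned to train or test, so no annotator contributes to both sides. The `annotator` field name, the 20% test fraction, and the seed are assumptions for illustration; the abstract does not describe the authors' exact procedure.

```python
import random

def annotator_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that no annotator appears in both train and test."""
    annotators = sorted({s["annotator"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(annotators)
    # Hold out whole annotators, at least one.
    n_test = max(1, round(len(annotators) * test_fraction))
    test_ids = set(annotators[:n_test])
    train = [s for s in samples if s["annotator"] not in test_ids]
    test = [s for s in samples if s["annotator"] in test_ids]
    return train, test

# Ten hypothetical questions written by five annotators, two each.
samples = [{"question": f"q{i}", "annotator": f"a{i % 5}"} for i in range(10)]
train, test = annotator_disjoint_split(samples)
```

Because the model never sees the held-out annotators' writing style during training, it cannot score well on the test split merely by recognizing who wrote a question.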

52. Monitoring Browsing Behavior of Customers in Retail Stores via RFID Imaging [PDF]
Kamran Ali, Alex X. Liu, Eugene Chai, Karthik Sundaresan
Abstract: In this paper, we propose to use commercial off-the-shelf (COTS) monostatic RFID devices (i.e., devices that use a single antenna at a time for both transmitting and receiving RFID signals to and from the tags) to monitor the browsing activity of customers in front of display items in places such as retail stores. To this end, we propose TagSee, a multi-person imaging system based on monostatic RFID imaging. TagSee is based on the insight that when customers are browsing the items on a shelf, they stand between the tags deployed along the boundaries of the shelf and the reader, which changes the multi-paths that the RFID signals travel along, so that both the RSS and phase values of the RFID signals that the reader receives change. Based on these variations observed by the reader, TagSee constructs a coarse-grained image of the customers. Afterwards, TagSee identifies the items that are being browsed by the customers by analyzing the constructed images. The key novelty of this paper lies in monitoring the browsing behavior of multiple customers in front of display items by constructing coarse-grained images via robust, analytical-model-driven, deep learning based RFID imaging. To achieve this, we first mathematically formulate the problem of imaging humans using monostatic RFID devices and derive an approximate analytical imaging model that correlates the variations caused by human obstructions in the RFID signals. Based on this model, we then develop a deep learning framework to robustly image customers with high accuracy. We implement the TagSee scheme using an Impinj Speedway R420 reader and SMARTRAC DogBone RFID tags. TagSee can achieve a TPR of more than ~90% and an FPR of less than ~10% in multi-person scenarios using training data from just 3-4 users.
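Summary: The paper proposes TagSee, a multi-person imaging system that uses commercial off-the-shelf (COTS) monostatic RFID devices to monitor customers browsing display items in places such as retail stores. Customers standing between the reader and the tags deployed along the shelf boundaries change the multi-paths of the RFID signals, so the RSS and phase values received by the reader change; from these variations TagSee constructs a coarse-grained image of the customers and identifies the items being browsed. The authors mathematically formulate the imaging problem, derive an approximate analytical imaging model, and build a deep learning framework on it. Implemented with an Impinj Speedway R420 reader and SMARTRAC DogBone RFID tags, TagSee achieves a TPR of more than ~90% and an FPR of less than ~10% in multi-person scenarios using training data from just 3-4 users.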

53. Instance Segmentation for Whole Slide Imaging: End-to-End or Detect-Then-Segment [PDF]
Aadarsh Jha, Haichun Yang, Ruining Deng, Meghan E. Kapp, Agnes B. Fogo, Yuankai Huo
Abstract: Automatic instance segmentation of glomeruli within kidney Whole Slide Imaging (WSI) is essential for clinical research in renal pathology. In computer vision, end-to-end instance segmentation methods (e.g., Mask-RCNN) have shown their advantages relative to detect-then-segment approaches by performing complementary detection and segmentation tasks simultaneously. As a result, the end-to-end Mask-RCNN approach has been the de facto standard method in recent glomerular segmentation studies, where downsampling and patch-based techniques are used to properly evaluate the high-resolution images from WSI (e.g., >10,000x10,000 pixels at 40x). However, in high-resolution WSI, a single glomerulus itself can be more than 1,000x1,000 pixels in original resolution, which yields significant information loss when the corresponding feature maps are downsampled via the Mask-RCNN pipeline. In this paper, we assess whether the end-to-end instance segmentation framework is optimal for high-resolution WSI objects by comparing Mask-RCNN with our proposed detect-then-segment framework. Beyond such a comparison, we also comprehensively evaluate the performance of our detect-then-segment pipeline through: 1) two of the most prevalent segmentation backbones (U-Net and DeepLab_v3); 2) six different image resolutions (from 512x512 to 28x28); and 3) two different color spaces (RGB and LAB). Our detect-then-segment pipeline, with the DeepLab_v3 segmentation framework operating on previously detected glomeruli of 512x512 resolution, achieved a 0.953 dice similarity coefficient (DSC), compared with a 0.902 DSC from the end-to-end Mask-RCNN pipeline. Further, we found that neither the RGB nor the LAB color space yields better performance than the other in the context of a detect-then-segment framework. The detect-then-segment pipeline achieved better segmentation performance than the end-to-end method.
Summary: Automatic instance segmentation of glomeruli in kidney Whole Slide Imaging (WSI) is essential for clinical research in renal pathology, and the end-to-end Mask-RCNN approach has been the de facto standard in recent glomerular segmentation studies. Because a single glomerulus can exceed 1,000x1,000 pixels at original resolution, downsampling feature maps in the Mask-RCNN pipeline loses significant information. The paper compares Mask-RCNN with a proposed detect-then-segment framework, which is evaluated with two segmentation backbones (U-Net and DeepLab_v3), six image resolutions (from 512x512 to 28x28), and two color spaces (RGB and LAB). The detect-then-segment pipeline with DeepLab_v3 on 512x512 detected glomeruli achieves a 0.953 dice similarity coefficient (DSC), compared with 0.902 for the end-to-end Mask-RCNN pipeline, and neither color space outperforms the other.
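
For reference, the dice similarity coefficient (DSC) used to compare the two pipelines is twice the overlap divided by the sum of the two mask sizes. The sketch below assumes binary masks flattened to lists of 0/1; the function name and the toy masks are illustrative only.

```python
def dice(pred, truth):
    """Dice similarity coefficient of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 2.0 * inter / size if size else 1.0

predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
dsc = dice(predicted, reference)
```

A DSC of 1.0 means the predicted and reference masks coincide, and 0.0 means they do not overlap at all.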

54. A Vision-based Social Distance and Critical Density Detection System for COVID-19 [PDF]
Dongfang Yang, Ekim Yurtsever, Vishnu Renganathan, Keith A. Redmill, Ümit Özgüner
Abstract: Social distancing has been proven as an effective measure against the spread of the infectious COronaVIrus Disease 2019 (COVID-19). However, individuals are not used to tracking the required 6-feet (2-meters) distance between themselves and their surroundings. An active surveillance system capable of detecting distances between individuals and warning them can slow down the spread of the deadly disease. Furthermore, measuring social density in a region of interest (ROI) and modulating inflow can decrease the chance of social distance violations. On the other hand, recording data and labeling individuals who do not follow the measures will breach individuals' rights in free societies. Here we propose an Artificial Intelligence (AI) based real-time social distance detection and warning system considering four important ethical factors: (1) the system should never record/cache data, (2) the warnings should not target the individuals, (3) no human supervisor should be in the detection/warning loop, and (4) the code should be open-source and accessible to the public. Against this backdrop, we propose using a monocular camera and deep learning-based real-time object detectors to measure social distancing. If a violation is detected, a non-intrusive audio-visual warning signal is emitted without targeting the individual who breached the social distance measure. Also, if the social density is over a critical value, the system sends a control signal to modulate inflow into the ROI. We tested the proposed method across real-world datasets to measure its generality and performance. The proposed method is ready for deployment, and our code is open-sourced.
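Summary: Social distancing has proven effective against the spread of COVID-19, but individuals are not used to tracking the required 6-feet (2-meters) distance between themselves and their surroundings. The paper proposes an AI-based real-time social distance detection and warning system that respects four ethical factors: it never records or caches data, its warnings do not target individuals, no human supervisor is in the detection/warning loop, and its code is open-source and accessible to the public. The system uses a monocular camera and deep learning-based real-time object detectors; when a violation is detected it emits a non-intrusive audio-visual warning, and when the social density exceeds a critical value it sends a control signal to modulate inflow into the region of interest (ROI). The method was tested across real-world datasets, and the code is open-sourced.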
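
The two checks described above, pairwise distance and critical density, can be sketched as below. The sketch assumes that detected people have already been projected to ground-plane coordinates in meters, a step the abstract does not detail; the function names, the example area, and the critical density value are hypothetical.

```python
import math
from itertools import combinations

def find_violations(positions, min_distance=2.0):
    """Return index pairs of people standing closer than min_distance meters."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(positions), 2)
            if math.dist(a, b) < min_distance]

def density_exceeded(positions, area_m2, critical_density):
    """True when people per square meter in the ROI exceed the critical value."""
    return len(positions) / area_m2 > critical_density

# Hypothetical ground-plane positions (meters) for three detected people.
people = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
violations = find_violations(people)
too_dense = density_exceeded(people, area_m2=10.0, critical_density=0.25)
```

In keeping with the ethical factors listed in the abstract, such a check only needs the positions in the current frame and returns anonymous index pairs, so nothing has to be stored or tied to an identity.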

55. Light Field Image Super-Resolution Using Deformable Convolution [PDF]
Yingqian Wang, Jungang Yang, Longguang Wang, Xinyi Ying, Tianhao Wu, Wei An, Yulan Guo
Abstract: Light field (LF) cameras can record scenes from multiple perspectives, and thus introduce beneficial angular information for image super-resolution (SR). However, it is challenging to incorporate angular information due to disparities among LF images. In this paper, we propose a deformable convolution network (i.e., LF-DFnet) to handle the disparity problem for LF image SR. Specifically, we design an angular deformable alignment module (ADAM) for feature-level alignment. Based on ADAM, we further propose a collect-and-distribute approach to perform bidirectional alignment between the center-view feature and each side-view feature. Using our approach, angular information can be well incorporated and encoded into features of each view, which benefits the SR reconstruction of all LF images. Moreover, we develop a baseline-adjustable LF dataset to evaluate SR performance under different disparities. Experiments on both public and our self-developed datasets have demonstrated the superiority of our method. Our LF-DFnet can generate high-resolution images with more faithful details and achieve state-of-the-art reconstruction accuracy. Besides, our LF-DFnet is more robust to disparity variations, which has not been well addressed in literature.
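Summary: Light field (LF) cameras record scenes from multiple perspectives and thus provide angular information that benefits image super-resolution (SR), but disparities among LF images make this information hard to incorporate. The paper proposes a deformable convolution network, LF-DFnet, with an angular deformable alignment module (ADAM) for feature-level alignment and a collect-and-distribute approach that performs bidirectional alignment between the center-view feature and each side-view feature. The authors also develop a baseline-adjustable LF dataset to evaluate SR performance under different disparities. Experiments on public and self-developed datasets show that LF-DFnet generates high-resolution images with more faithful details, achieves state-of-the-art reconstruction accuracy, and is more robust to disparity variations.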

56. 3D Topology Transformation with Generative Adversarial Networks [PDF]
Luca Stornaiuolo, Nima Dehmamy, Albert-László Barabási, Mauro Martino
Abstract: Generation and transformation of images and videos using artificial intelligence have flourished over the past few years. Yet, there are only a few works aiming to produce creative 3D shapes, such as sculptures. Here we show a novel 3D-to-3D topology transformation method using Generative Adversarial Networks (GAN). We use a modified pix2pix GAN, which we call Vox2Vox, to transform the volumetric style of a 3D object while retaining the original object shape. In particular, we show how to transform 3D models into two new volumetric topologies: the 3D Network and the Ghirigoro. We describe how to use our approach to construct customized 3D representations. We believe that the generated 3D shapes are novel and inspirational. Finally, we compare the results of our approach with those of a baseline algorithm that directly converts the 3D shapes without using our GAN.
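Summary: Generating and transforming images and videos with artificial intelligence has flourished over the past few years, yet only a few works aim to produce creative 3D shapes such as sculptures. The paper presents a 3D-to-3D topology transformation method based on Generative Adversarial Networks (GAN): a modified pix2pix GAN, called Vox2Vox, transforms the volumetric style of a 3D object while retaining the original object shape. The authors show how to transform 3D models into two new volumetric topologies, the 3D Network and the Ghirigoro, describe how to construct customized 3D representations, and compare the results with a baseline algorithm that converts the 3D shapes directly without the GAN.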

57. Imitation Learning Approach for AI Driving Olympics Trained on Real-world and Simulation Data Simultaneously [PDF]
Mikita Sazanovich, Konstantin Chaika, Kirill Krinkin, Aleksei Shpilman
Abstract: In this paper, we describe our winning approach to solving the Lane Following Challenge at the AI Driving Olympics Competition through imitation learning on a mixed set of simulation and real-world data. AI Driving Olympics is a two-stage competition: at stage one, algorithms compete in a simulated environment with the best ones advancing to a real-world final. One of the main problems that participants encounter during the competition is that algorithms trained for the best performance in simulated environments do not hold up in a real-world environment and vice versa. Classic control algorithms also do not translate well between tasks since most of them have to be tuned to specific driving conditions such as lighting, road type, camera position, etc. To overcome this problem, we employed the imitation learning algorithm and trained it on a dataset collected from both simulation and the real world, forcing our model to perform equally well in all environments.
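Summary: The paper describes the winning approach to the Lane Following Challenge at the AI Driving Olympics, a two-stage competition in which algorithms first compete in a simulated environment and the best ones advance to a real-world final. Algorithms trained for the best performance in simulation often do not hold up in the real world and vice versa, and classic control algorithms must be tuned to specific driving conditions such as lighting, road type, and camera position. To overcome this, the authors train an imitation learning algorithm on a dataset collected from both simulation and the real world, forcing the model to perform equally well in all environments.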

58. Unsupervised CT Metal Artifact Learning using Attention-guided beta-CycleGAN [PDF]
Junghyun Lee, Jawook Gu, Jong Chul Ye
Abstract: Metal artifact reduction (MAR) is one of the most important research topics in computed tomography (CT). With the advance of deep learning technology for image reconstruction, various deep learning methods have also been suggested for metal artifact removal, among which supervised learning methods are the most popular. However, matched non-metal and metal image pairs are difficult to obtain in real CT acquisition. Recently, a promising unsupervised learning method for MAR was proposed using feature disentanglement, but the resulting network architecture is complicated and has difficulty handling large clinical images. To address this, here we propose a much simpler and much more effective unsupervised MAR method for CT. The proposed method is based on a novel beta-cycleGAN architecture derived from the optimal transport theory for appropriate feature space disentanglement. Another important contribution is to show that the attention mechanism is the key element for effectively removing the metal artifacts. Specifically, by adding the convolutional block attention module (CBAM) layers with a proper disentanglement parameter, experimental results confirm that we can obtain improved MAR that preserves the detailed texture of the original image.

59. Optical Navigation in Unstructured Dynamic Railroad Environments
Darius Burschka, Christian Robl, Sebastian Ohrendorf-Weiss
Abstract: We present an approach for optical navigation in unstructured, dynamic railroad environments. We propose a way to estimate the train motion solely from observations of the planar track bed. Occasional significant occlusions during the operation of the train limit the available observation to this difficult-to-track, repetitive area. This approach is a step towards replacing the expensive train management infrastructure with local intelligence on the train for SmartRail 4.0. We derive our approach for robust estimation of translation and rotation in these difficult environments and provide experimental validation of the approach on real rail scenarios.

60. Hierarchical and Unsupervised Graph Representation Learning with Loukas's Coarsening
Louis Béthune, Yacouba Kaloga, Pierre Borgnat, Aurélien Garivier, Amaury Habrard
Abstract: We propose a novel algorithm for unsupervised graph representation learning with attributed graphs. It combines three advantages that address some current limitations of the literature: i) The model is inductive: it can embed new graphs without re-training in the presence of new data; ii) The method takes into account both micro-structures and macro-structures by looking at the attributed graphs at different scales; iii) The model is end-to-end differentiable: it is a building block that can be plugged into deep learning pipelines and allows for back-propagation. We show that combining a coarsening method that has strong theoretical guarantees with mutual information maximization suffices to produce high-quality embeddings. We evaluate them on classification tasks with common benchmarks from the literature. We show that our algorithm is competitive with the state of the art among unsupervised graph representation learning methods.

61. RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr
Xingjian Li, Haoyi Xiong, Haozhe An, Chengzhong Xu, Dejing Dou
Abstract: Fine-tuning a deep convolutional neural network (CNN) using a pre-trained model helps transfer knowledge learned from larger datasets to the target task. While the accuracy can be largely improved even when the training dataset is small, the transfer learning outcome is usually constrained by the pre-trained model, with CNN weights staying close to it (Liu et al., 2019), as the backpropagation here brings smaller updates to deeper CNN layers. In this work, we propose RIFLE - a simple yet effective strategy that deepens backpropagation in transfer learning settings by periodically Re-Initializing the Fully-connected LayEr randomly from scratch during the fine-tuning procedure. RIFLE brings meaningful updates to the weights of deep CNN layers and improves low-level feature learning, while the effects of randomization readily converge over the course of the overall learning procedure. The experiments show that the use of RIFLE significantly improves deep transfer learning accuracy on a wide range of datasets, outperforming known tricks for a similar purpose, such as Dropout, DropConnect, StochasticDepth, Disturb Label and Cyclic Learning Rate, under the same settings with 0.5%-2% higher testing accuracy. Empirical cases and ablation studies further indicate that RIFLE brings meaningful updates to deep CNN layers with improved accuracy.
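
The periodic re-initialization described in the RIFLE abstract can be sketched as a training-loop skeleton. This is only an illustration under stated assumptions: the abstract says the fully-connected layer is re-initialized "periodically" but gives no schedule, so the evenly spaced reset epochs, the Gaussian initializer, and the names `reinit_epochs`, `random_fc`, and `fine_tune_with_rifle` are all hypothetical. The actual gradient updates are left as a comment.

```python
import random


def reinit_epochs(total_epochs, num_reinits):
    """Evenly spaced epochs at which the FC layer is reset (assumed schedule)."""
    period = max(1, total_epochs // (num_reinits + 1))
    return [period * k for k in range(1, num_reinits + 1) if period * k < total_epochs]


def random_fc(n_out, n_in, rng, scale=0.01):
    """Fresh random weights for a fully-connected layer of shape (n_out, n_in)."""
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]


def fine_tune_with_rifle(total_epochs=12, num_reinits=3, seed=0):
    """Return the epochs at which the classifier head was re-initialized."""
    rng = random.Random(seed)
    fc = random_fc(2, 4, rng)          # stand-in for the classifier head
    resets = set(reinit_epochs(total_epochs, num_reinits))
    reset_log = []
    for epoch in range(total_epochs):
        if epoch in resets:
            fc = random_fc(2, 4, rng)  # discard the head, keep the backbone weights
            reset_log.append(epoch)
        # One epoch of ordinary fine-tuning (backbone + head) would run here.
    return reset_log
```

With the defaults, the head is reset at epochs 3, 6, and 9, so the final quarter of training runs without further resets.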

62. Structured (De)composable Representations Trained with Neural Networks
Graham Spinks, Marie-Francine Moens
Abstract: The paper proposes a novel technique for representing templates and instances of concept classes. A template representation refers to the generic representation that captures the characteristics of an entire class. The proposed technique uses end-to-end deep learning to learn structured and composable representations from input images and discrete labels. The obtained representations are based on distance estimates between the distributions given by the class label and those given by contextual information, which are modeled as environments. We prove that the representations have a clear structure that allows them to be decomposed into factors that represent classes and environments. We evaluate our novel technique on classification and retrieval tasks involving different modalities (visual and language data).

63. Divide-and-Rule: Self-Supervised Learning for Survival Analysis in Colorectal Cancer
Christian Abbet, Inti Zlobec, Behzad Bozorgtabar, Jean-Philippe Thiran
Abstract: With the long-term rapid increase in the incidence of colorectal cancer (CRC), there is an urgent clinical need to improve risk stratification. The conventional pathology report is usually limited to only a few histopathological features. However, most of the tumor microenvironments used to describe patterns of aggressive tumor behavior are ignored. In this work, we aim to learn histopathological patterns within cancerous tissue regions that can be used to improve prognostic stratification for colorectal cancer. To do so, we propose a self-supervised learning method that jointly learns a representation of tissue regions as well as a clustering metric to obtain their underlying patterns. These histopathological patterns are then used to represent the interaction between complex tissues and predict clinical outcomes directly. We furthermore show that the proposed approach can benefit from linear predictors to avoid overfitting in patient outcome predictions. To this end, we introduce a new well-characterized clinicopathological dataset, including a retrospective collective of 374 patients, with their survival time and treatment information. Histomorphological clusters obtained by our method are evaluated by training survival models. The experimental results demonstrate statistically significant patient stratification, and our approach outperformed the state-of-the-art deep clustering methods.

64. Lossless CNN Channel Pruning via Gradient Resetting and Convolutional Re-parameterization
Xiaohan Ding, Tianxiang Hao, Ji Liu, Jungong Han, Yuchen Guo, Guiguang Ding
Abstract: Channel pruning (a.k.a. filter pruning) aims to slim down a convolutional neural network (CNN) by reducing the width (i.e., number of output channels) of convolutional layers. However, as a CNN's representational capacity depends on the width, doing so tends to degrade the performance. A traditional learning-based channel pruning paradigm applies a penalty on parameters to improve the robustness to pruning, but such a penalty may degrade the performance even before pruning. Inspired by neurobiology research on the independence of remembering and forgetting, we propose to re-parameterize a CNN into remembering parts and forgetting parts, where the former learn to maintain the performance and the latter learn for efficiency. By training the re-parameterized model using regular SGD on the former but a novel update rule with penalty gradients on the latter, we achieve structured sparsity, enabling us to equivalently convert the re-parameterized model into the original architecture with narrower layers. With our method, we can slim down a standard ResNet-50 with 76.15% top-1 accuracy on ImageNet to a narrower one with only 43.9% FLOPs and no accuracy drop. Code and models are released at this https URL.
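
To make the link between layer width and FLOPs in the channel pruning abstract concrete, here is a minimal sketch. It uses the standard multiply-accumulate count for a 2D convolution and a simple norm threshold to decide which channels survive; it does not implement the paper's re-parameterization or update rule. The function names, the threshold value, and the two-layer example are hypothetical.

```python
def conv_flops(h_out, w_out, kernel, c_in, c_out):
    """Multiply-accumulate count of one 2D convolution layer (bias ignored)."""
    return h_out * w_out * kernel * kernel * c_in * c_out


def surviving_channels(filter_norms, threshold=1e-5):
    """Indices of output channels whose filter norm is above the threshold."""
    return [i for i, norm in enumerate(filter_norms) if norm > threshold]


def pruned_flops_ratio(h, w, kernel, c_in, c_mid, c_out, kept_mid):
    """FLOPs of two stacked conv layers after narrowing the middle width, relative to before."""
    before = conv_flops(h, w, kernel, c_in, c_mid) + conv_flops(h, w, kernel, c_mid, c_out)
    after = conv_flops(h, w, kernel, c_in, kept_mid) + conv_flops(h, w, kernel, kept_mid, c_out)
    return after / before
```

Removing an output channel saves work twice: once in the layer that produces it and once in the next layer that consumes it, which is why halving the middle width of two stacked layers halves their combined cost.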

65. Automatic lesion detection, segmentation and characterization via 3D multiscale morphological sifting in breast MRI
Hang Min, Darryl McClymont, Shekhar S. Chandra, Stuart Crozier, Andrew P. Bradley
Abstract: Previous studies on computer aided detection/diagnosis (CAD) in 4D breast magnetic resonance imaging (MRI) regard lesion detection, segmentation and characterization as separate tasks, and typically require users to manually select 2D MRI slices or regions of interest as the input. In this work, we present a breast MRI CAD system that can handle 4D multimodal breast MRI data, and integrate lesion detection, segmentation and characterization with no user intervention. The proposed CAD system consists of three major stages: region candidate generation, feature extraction and region candidate classification. Breast lesions are firstly extracted as region candidates using the novel 3D multiscale morphological sifting (MMS). The 3D MMS, which uses linear structuring elements to extract lesion-like patterns, can segment lesions from breast images accurately and efficiently. Analytical features are then extracted from all available 4D multimodal breast MRI sequences, including T1-, T2-weighted and DCE sequences, to represent the signal intensity, texture, morphological and enhancement kinetic characteristics of the region candidates. The region candidates are lastly classified as lesion or normal tissue by the random under-sampling boost (RUSboost), and as malignant or benign lesion by the random forest. Evaluated on a breast MRI dataset which contains a total of 117 cases with 95 malignant and 46 benign lesions, the proposed system achieves a true positive rate (TPR) of 0.90 at 3.19 false positives per patient (FPP) for lesion detection and a TPR of 0.91 at a FPP of 2.95 for identifying malignant lesions without any user intervention. The average dice similarity index (DSI) is 0.72 for lesion segmentation. Compared with previously proposed systems evaluated on the same breast MRI dataset, the proposed CAD system achieves a favourable performance in breast lesion detection and characterization.
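
The evaluation quantities quoted in the breast MRI abstract (TPR, false positives per patient, Dice similarity index) have standard definitions, sketched below. The set-of-voxel-coordinates representation and the function names are assumptions made for illustration; the numbers in the tests are toy values, not the paper's data.

```python
def dice_similarity(pred, truth):
    """Dice similarity index 2|A & B| / (|A| + |B|) between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # two empty segmentations agree trivially
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def true_positive_rate(true_pos, false_neg):
    """Fraction of real lesions that were detected."""
    return true_pos / (true_pos + false_neg)


def false_positives_per_patient(false_pos, num_patients):
    """Average number of spurious detections per patient."""
    return false_pos / num_patients
```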

66. Regional Image Perturbation Reduces $L_p$ Norms of Adversarial Examples While Maintaining Model-to-model Transferability
Utku Ozbulak, Jonathan Peck, Wesley De Neve, Bart Goossens, Yvan Saeys, Arnout Van Messem
Abstract: Regional adversarial attacks often rely on complicated methods for generating adversarial perturbations, making it hard to compare their efficacy against well-known attacks. In this study, we show that effective regional perturbations can be generated without resorting to complex methods. We develop a very simple regional adversarial perturbation attack method using cross-entropy sign, one of the most commonly used losses in adversarial machine learning. Our experiments on ImageNet with multiple models reveal that, on average, $76\%$ of the generated adversarial examples maintain model-to-model transferability when the perturbation is applied to local image regions. Depending on the selected region, these localized adversarial examples require significantly less $L_p$ norm distortion (for $p \in \{0, 2, \infty\}$) compared to their non-local counterparts. These localized attacks therefore have the potential to undermine defenses that claim robustness under the aforementioned norms.
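
The effect of restricting a sign-based perturbation to an image region on the $L_0$, $L_2$ and $L_\infty$ norms can be illustrated with a small sketch. It assumes that "cross-entropy sign" means taking the sign of the loss gradient with respect to the input, as in fast-gradient-sign attacks; the gradient here is a hand-written toy array rather than one computed from a model, and all names are hypothetical.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def regional_sign_perturbation(grad, region, eps):
    """eps * sign(grad) inside the half-open box region=(r0, r1, c0, c1), zero elsewhere."""
    r0, r1, c0, c1 = region
    return [
        [eps * sign(g) if r0 <= r < r1 and c0 <= c < c1 else 0.0
         for c, g in enumerate(row)]
        for r, row in enumerate(grad)
    ]


def lp_norms(delta):
    """Return the (L0, L2, Linf) norms of a 2D perturbation."""
    flat = [v for row in delta for v in row]
    l0 = sum(1 for v in flat if v != 0)
    l2 = math.sqrt(sum(v * v for v in flat))
    linf = max(abs(v) for v in flat)
    return l0, l2, linf
```

In this toy setting, confining the perturbation to a quarter of the image cuts $L_0$ by a factor of four and $L_2$ by a factor of two, while $L_\infty$ stays at `eps` because the per-pixel step size is unchanged.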

67. Dual Mixup Regularized Learning for Adversarial Domain Adaptation
Yuan Wu, Diana Inkpen, Ahmed El-Roby
Abstract: Recent advances in unsupervised domain adaptation (UDA) rely on adversarial learning to disentangle the explanatory and transferable features for domain adaptation. However, there are two issues with the existing methods. First, the discriminability of the latent space cannot be fully guaranteed without considering the class-aware information in the target domain. Second, samples from the source and target domains alone are not sufficient for domain-invariant feature extraction in the latent space. In order to alleviate the above issues, we propose a dual mixup regularized learning (DMRL) method for UDA, which not only guides the classifier in enhancing consistent predictions in-between samples, but also enriches the intrinsic structures of the latent space. The DMRL jointly conducts category and domain mixup regularizations at the pixel level to improve the effectiveness of models. A series of empirical studies on four domain adaptation benchmarks demonstrates that our approach can achieve state-of-the-art performance.
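
The two mixup regularizers named in the DMRL abstract both build on the standard mixup operation, a convex combination of two inputs. The sketch below shows that operation applied to flattened pixel vectors; how DMRL wires the mixed samples into its losses is not described in the abstract, so the label conventions (soft class labels, source = 1 and target = 0 for the domain label) and the function names are assumptions.

```python
import random


def mixup(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def category_mixup(x_i, y_i, x_j, y_j, lam):
    """Mix two samples and their one-hot class labels with the same coefficient."""
    return mixup(x_i, x_j, lam), mixup(y_i, y_j, lam)


def domain_mixup(x_src, x_tgt, lam):
    """Mix a source and a target sample; the soft domain label is lam (source=1, target=0)."""
    return mixup(x_src, x_tgt, lam), lam


def sample_lambda(alpha, rng):
    """Mixing coefficient drawn from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)
```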

68. Multi-image Super Resolution of Remotely Sensed Images using Residual Feature Attention Deep Neural Networks
Francesco Salvetti, Vittorio Mazzia, Aleem Khaliq, Marcello Chiaberge
Abstract: Convolutional Neural Networks (CNNs) have consistently achieved state-of-the-art results in image Super-Resolution (SR), representing an exceptional opportunity for the remote sensing field to extract further information and knowledge from captured data. However, most of the works published in the literature so far have focused on the Single-Image Super-Resolution problem. At present, satellite-based remote sensing platforms offer huge data availability with high temporal resolution and low spatial resolution. In this context, the presented research proposes a novel residual attention model (RAMS) that efficiently tackles the multi-image super-resolution task, simultaneously exploiting spatial and temporal correlations to combine multiple images. We introduce the mechanism of visual feature attention with 3D convolutions in order to obtain an aware data fusion and information extraction of the multiple low-resolution images, transcending limitations of the local region of convolutional operations. Moreover, having multiple inputs of the same scene, our representation learning network makes extensive use of nested residual connections to let redundant low-frequency signals flow through and focus the computation on more important high-frequency components. Extensive experimentation and evaluations against other available solutions, either for single or multi-image super-resolution, have demonstrated that the proposed deep learning-based solution can be considered state-of-the-art for Multi-Image Super-Resolution for remote sensing applications.

69. Scalable, Proposal-free Instance Segmentation Network for 3D Pixel Clustering and Particle Trajectory Reconstruction in Liquid Argon Time Projection Chambers
Dae Heun Koh, Pierre Côte de Soux, Laura Dominé, François Drielsma, Ran Itay, Qing Lin, Kazuhiro Terao, Ka Vang Tsang, Tracy Usher
Abstract: Liquid Argon Time Projection Chambers (LArTPCs) are high resolution particle imaging detectors, employed by accelerator-based neutrino oscillation experiments for high precision physics measurements. While images of particle trajectories are intuitive to analyze for physicists, the development of a high quality, automated data reconstruction chain remains challenging. One of the most critical reconstruction steps is particle clustering: the task of grouping 3D image pixels into different particle instances that share the same particle type. In this paper, we propose the first scalable deep learning algorithm for particle clustering in LArTPC data using sparse convolutional neural networks (SCNN). Building on previous works on SCNNs and proposal free instance segmentation, we build an end-to-end trainable instance segmentation network that learns an embedding of the image pixels to perform point cloud clustering in a transformed space. We benchmark the performance of our algorithm on PILArNet, a public 3D particle imaging dataset, with respect to common clustering evaluation metrics. 3D pixels were successfully clustered into individual particle trajectories with 90% of them having an adjusted Rand index score greater than 92% with a mean pixel clustering efficiency and purity above 96%. This work contributes to the development of an end-to-end optimizable full data reconstruction chain for LArTPCs, in particular pixel-based 3D imaging detectors including the near detector of the Deep Underground Neutrino Experiment. Our algorithm is made available in the open access repository, and we share our Singularity software container, which can be used to reproduce our work on the dataset.

70. Kernel Stein Generative Modeling
Wei-Cheng Chang, Chun-Liang Li, Youssef Mroueh, Yiming Yang
Abstract: We are interested in gradient-based Explicit Generative Modeling where samples can be derived from iterative gradient updates based on an estimate of the score function of the data distribution. Recent advances in Stochastic Gradient Langevin Dynamics (SGLD) demonstrate impressive results with energy-based models on high-dimensional and complex data distributions. Stein Variational Gradient Descent (SVGD) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate a given distribution, based on functional gradient descent that decreases the KL divergence. SVGD has promising results on several Bayesian inference applications. However, applying SVGD to high-dimensional problems is still under-explored. The goal of this work is to study high-dimensional inference with SVGD. We first identify key challenges in practical kernel SVGD inference in high dimensions. We propose noise conditional kernel SVGD (NCK-SVGD), which works in tandem with the recently introduced Noise Conditional Score Network estimator. NCK is crucial for successful inference with SVGD in high dimensions, as it adapts the kernel to the noise level of the score estimate. As we anneal the noise, NCK-SVGD targets the real data distribution. We then extend the annealed SVGD with an entropic regularization. We show that this offers flexible control between sample quality and diversity, and verify it empirically by precision and recall evaluations. NCK-SVGD produces samples comparable to GANs and annealed SGLD on computer vision benchmarks, including MNIST and CIFAR-10.
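
The basic SVGD particle update that the abstract builds on can be sketched in one dimension. This is plain SVGD with a fixed-bandwidth RBF kernel and an analytic score function, not the paper's NCK-SVGD: there is no learned score network, no noise-conditional kernel, and no annealing. The step size, bandwidth, and function names are assumptions chosen for the toy example.

```python
import math


def svgd_step(particles, score, step=0.1, bandwidth=1.0):
    """One SVGD update of 1D particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            grad_k = -(xj - xi) / bandwidth ** 2 * k  # derivative of k w.r.t. xj
            phi += k * score(xj) + grad_k             # attraction + repulsion
        updated.append(xi + step * phi / n)
    return updated


def run_svgd(particles, score, steps=1000):
    """Apply svgd_step repeatedly and return the final particle positions."""
    for _ in range(steps):
        particles = svgd_step(particles, score)
    return particles
```

For a standard normal target the score is `lambda x: -x`. The kernel-weighted score term pulls particles toward high-density regions, while the kernel gradient term pushes them apart so they do not collapse onto the mode.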

71. Guided Fine-Tuning for Large-Scale Material Transfer
Valentin Deschaintre, George Drettakis, Adrien Bousseau
Abstract: We present a method to transfer the appearance of one or a few exemplar SVBRDFs to a target image representing similar materials. Our solution is extremely simple: we fine-tune a deep appearance-capture network on the provided exemplars, such that it learns to extract similar SVBRDF values from the target image. We introduce two novel material capture and design workflows that demonstrate the strength of this simple approach. Our first workflow makes it possible to produce plausible SVBRDFs of large-scale objects from only a few pictures. Specifically, users only need to take a single picture of a large surface and a few close-up flash pictures of some of its details. We use existing methods to extract SVBRDF parameters from the close-ups, and our method to transfer these parameters to the entire surface, enabling the lightweight capture of surfaces several meters wide such as murals, floors and furniture. In our second workflow, we provide a powerful way for users to create large SVBRDFs from internet pictures by transferring the appearance of existing, pre-designed SVBRDFs. By selecting different exemplars, users can control the materials assigned to the target image, greatly enhancing the creative possibilities offered by deep appearance capture.

72. Benefitting from Bicubically Down-Sampled Images for Learning Real-World Image Super-Resolution
Mohammad Saeed Rad, Thomas Yu, Claudiu Musat, Hazim Kemal Ekenel, Behzad Bozorgtabar, Jean-Philippe Thiran
Abstract: Super-resolution (SR) has traditionally been based on pairs of high-resolution images (HR) and their low-resolution (LR) counterparts obtained artificially with bicubic downsampling. However, in real-world SR, there is a large variety of realistic image degradations and analytically modeling these realistic degradations can prove quite difficult. In this work, we propose to handle real-world SR by splitting this ill-posed problem into two comparatively more well-posed steps. First, we train a network to transform real LR images to the space of bicubically downsampled images in a supervised manner, by using both real LR/HR pairs and synthetic pairs. Second, we take a generic SR network trained on bicubically downsampled images to super-resolve the transformed LR image. The first step of the pipeline addresses the problem by registering the large variety of degraded images to a common, well understood space of images. The second step then leverages the already impressive performance of SR on bicubically downsampled images, sidestepping the issues of end-to-end training on datasets with many different image degradations. We demonstrate the effectiveness of our proposed method by comparing it to recent methods in real-world SR and show that our proposed approach outperforms the state-of-the-art works in terms of both qualitative and quantitative results, as well as results of an extensive user study conducted on several real image datasets.

73. Metric-Guided Prototype Learning
Vivien Sainte Fare Garnot, Loic Landrieu
Abstract: Not all errors are created equal. This is especially true for many key machine learning applications. In the case of classification tasks, the hierarchy of errors can be summarized in the form of a cost matrix, which assesses the gravity of confusing each pair of classes. When certain conditions are met, this matrix defines a metric, which we use in a new and versatile classification layer to model the disparity of errors. Our method relies on jointly learning a feature-extracting network and a set of class representations, or prototypes, which incorporate the error metric into their relative arrangement. Our approach allows for consistent improvement of the network's prediction with regard to the cost matrix. Furthermore, when the induced metric contains insight on the data structure, our approach improves the overall precision. Experiments on three different tasks and public datasets -- from agricultural time series classification to depth image semantic segmentation -- validate our approach.
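
The abstract does not spell out the conditions under which a cost matrix defines a metric. A natural reading is the usual metric axioms on a finite set of classes: zero cost on the diagonal, strictly positive cost between distinct classes, symmetry, and the triangle inequality. The check below is a sketch under that assumption; the function name `is_metric` and the tolerance are hypothetical.

```python
def is_metric(cost, tol=1e-12):
    """Check the metric axioms for a square cost matrix given as a list of lists."""
    n = len(cost)
    for i in range(n):
        if abs(cost[i][i]) > tol:
            return False                      # a class costs nothing against itself
        for j in range(n):
            if i != j and cost[i][j] <= 0:
                return False                  # distinct classes must be separated
            if abs(cost[i][j] - cost[j][i]) > tol:
                return False                  # symmetry
            for k in range(n):
                if cost[i][j] > cost[i][k] + cost[k][j] + tol:
                    return False              # triangle inequality
    return True
```

For example, `[[0, 1, 2], [1, 0, 1], [2, 1, 0]]` passes, while raising the cost between the first and last class to 5 breaks the triangle inequality because the detour through the middle class costs only 2.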

74. Robust Technique for Representative Volume Element Identification in Noisy Microtomography Images of Porous Materials Based on Pores Morphology and Their Spatial Distribution
Maxim Grigoriev, Anvar Khafizov, Vladislav Kokhan, Viktor Asadchikov
Abstract: Microtomography is a powerful method for investigating materials. It makes it possible to obtain the physical properties of porous media non-destructively, which is useful in studies. One application is the calculation of porosity, pore sizes, surface area, and other parameters of metal-ceramic (cermet) membranes, which are widely used in the filtration industry. The microtomography approach is efficient because all of these parameters are calculated simultaneously, in contrast to conventional techniques. Nevertheless, calculations on Micro-CT reconstructed images are time-consuming, so a representative volume element should be chosen to speed them up. This research sheds light on representative elementary volume identification without consideration of any physical parameters such as porosity. Thus, the volume element can be found even in noisy and grayscale images. The proposed method is flexible and does not overestimate the volume size in the case of anisotropic samples. The obtained volume element can be used to compute the domain's physical characteristics if the image is filtered and binarized, or to select optimal filtering parameters for a denoising procedure.



[arXiv Papers] Computer Vision and Pattern Recognition 2020-07-08

Contents

Abstracts