[arXiv Papers] Computer Vision and Pattern Recognition 2020-08-28

Contents

1. One Shot 3D Photography [PDF] Abstract
2. Reducing Drift in Structure from Motion using Extended Features [PDF] Abstract
3. CenterHMR: a Bottom-up Single-shot Method for Multi-person 3D Mesh Recovery from a Single Image [PDF] Abstract
4. DeepFake Detection Based on the Discrepancy Between the Face and its Context [PDF] Abstract
5. Random Style Transfer based Domain Generalization Networks Integrating Shape and Spatial Information [PDF] Abstract
6. Instance Adaptive Self-Training for Unsupervised Domain Adaptation [PDF] Abstract
7. Learning Condition Invariant Features for Retrieval-Based Localization from 1M Images [PDF] Abstract
8. Properties Of Winning Tickets On Skin Lesion Classification [PDF] Abstract
9. Siamese Network for RGB-D Salient Object Detection and Beyond [PDF] Abstract
10. MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [PDF] Abstract
11. DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis [PDF] Abstract
12. Minimal Adversarial Examples for Deep Learning on 3D Point Clouds [PDF] Abstract
13. Compensation Tracker: Data Association Method for Lost Object [PDF] Abstract
14. Inner Eye Canthus Localization for Human Body Temperature Screening [PDF] Abstract
15. How semantic and geometric information mutually reinforce each other in ToF object localization [PDF] Abstract
16. A Flexible Selection Scheme for Minimum-Effort Transfer Learning [PDF] Abstract
17. Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events [PDF] Abstract
18. Edge and Identity Preserving Network for Face Super-Resolution [PDF] Abstract
19. Visual Question Answering on Image Sets [PDF] Abstract
20. Multi-task deep CNN model for no-reference image quality assessment on smartphone camera photos [PDF] Abstract
21. Surgical Skill Assessment on In-Vivo Clinical Data via the Clearness of Operating Field [PDF] Abstract
22. Unsupervised Surgical Instrument Segmentation via Anchor Generation and Semantic Diffusion [PDF] Abstract
23. Moderately supervised learning: definition and framework [PDF] Abstract
24. Attribute-guided image generation from layout [PDF] Abstract
25. Fingerprint Feature Extraction by Combining Texture, Minutiae, and Frequency Spectrum Using Multi-Task CNN [PDF] Abstract
26. Domain Adaptation Through Task Distillation [PDF] Abstract
27. Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving [PDF] Abstract
28. Pose-Guided High-Resolution Appearance Transfer via Progressive Training [PDF] Abstract
29. Webly Supervised Image Classification with Self-Contained Confidence [PDF] Abstract
30. A Self-Reasoning Framework for Anomaly Detection Using Video-Level Labels [PDF] Abstract
31. Crossing-Domain Generative Adversarial Networks for Unsupervised Multi-Domain Image-to-Image Translation [PDF] Abstract
32. Adversarial Dual Distinct Classifiers for Unsupervised Domain Adaptation [PDF] Abstract
33. Adaptively-Accumulated Knowledge Transfer for Partial Domain Adaptation [PDF] Abstract
34. Deep Learning for 2D grapevine bud detection [PDF] Abstract
35. Tabular Structure Detection from Document Images for Resource Constrained Devices Using A Row Based Similarity Measure [PDF] Abstract
36. Deep learning-based computer vision to recognize and classify suturing gestures in robot-assisted surgery [PDF] Abstract
37. Expressive Telepresence via Modular Codec Avatars [PDF] Abstract
38. Visual Concept Reasoning Networks [PDF] Abstract
39. Measurement-driven Security Analysis of Imperceptible Impersonation Attacks [PDF] Abstract
40. Learning Global Structure Consistency for Robust Object Tracking [PDF] Abstract
41. Large Scale Photometric Bundle Adjustment [PDF] Abstract
42. Self-Supervised Human Activity Recognition by Augmenting Generative Adversarial Networks [PDF] Abstract
43. learn2learn: A Library for Meta-Learning Research [PDF] Abstract
44. New Normal: Cooperative Paradigm for Covid-19 Timely Detection and Containment using Internet of Things and Deep Learning [PDF] Abstract
45. Propensity-to-Pay: Machine Learning for Estimating Prediction Uncertainty [PDF] Abstract
46. Meta-Learning with Shared Amortized Variational Inference [PDF] Abstract
47. A survey on applications of augmented, mixed and virtual reality for nature and environment [PDF] Abstract
48. Mixed Noise Removal with Pareto Prior [PDF] Abstract
49. Unsupervised MRI Super-Resolution using Deep External Learning and Guided Residual Dense Network with Multimodal Image Priors [PDF] Abstract
50. Lymph Node Gross Tumor Volume Detection and Segmentation via Distance-based Gating using 3D CT/PET Imaging in Radiotherapy [PDF] Abstract
51. Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra [PDF] Abstract
52. DeepPrognosis: Preoperative Prediction of Pancreatic Cancer Survival and Surgical Margin via Contrast-Enhanced CT Imaging [PDF] Abstract
53. Domain-Adversarial Learning for Multi-Centre, Multi-Vendor, and Multi-Disease Cardiac MR Image Segmentation [PDF] Abstract
54. Appropriateness of Performance Indices for Imbalanced Data Classification: An Analysis [PDF] Abstract

Abstracts

1. One Shot 3D Photography [PDF] Back to Contents
  Johannes Kopf, Kevin Matzen, Suhib Alsisan, Ocean Quigley, Francis Ge, Yangming Chong, Josh Patterson, Jan-Michael Frahm, Shu Wu, Matthew Yu, Peizhao Zhang, Peter Vajda, Ayush Saraf, Michael Cohen
Abstract: 3D photography is a new medium that allows viewers to more fully experience a captured moment. In this work, we refer to a 3D photo as one that displays parallax induced by moving the viewpoint (as opposed to a stereo pair with a fixed viewpoint). 3D photos are static in time, like traditional photos, but are displayed with interactive parallax on mobile or desktop screens, as well as on Virtual Reality devices, where viewing it also includes stereo. We present an end-to-end system for creating and viewing 3D photos, and the algorithmic and design choices therein. Our 3D photos are captured in a single shot and processed directly on a mobile device. The method starts by estimating depth from the 2D input image using a new monocular depth estimation network that is optimized for mobile devices. It performs competitively to the state-of-the-art, but has lower latency and peak memory consumption and uses an order of magnitude fewer parameters. The resulting depth is lifted to a layered depth image, and new geometry is synthesized in parallax regions. We synthesize color texture and structures in the parallax regions as well, using an inpainting network, also optimized for mobile devices, on the LDI directly. Finally, we convert the result into a mesh-based representation that can be efficiently transmitted and rendered even on low-end devices and over poor network connections. Altogether, the processing takes just a few seconds on a mobile device, and the result can be instantly viewed and shared. We perform extensive quantitative evaluation to validate our system and compare its new components against the current state-of-the-art.

2. Reducing Drift in Structure from Motion using Extended Features [PDF] Back to Contents
  Aleksander Holynski, David Geraghty, Jan-Michael Frahm, Chris Sweeney, Richard Szeliski
Abstract: Low-frequency long-range errors (drift) are an endemic problem in 3D structure from motion, and can often hamper reasonable reconstructions of the scene. In this paper, we present a method to dramatically reduce scale and positional drift by using extended structural features such as planes and vanishing points. Unlike traditional feature matches, our extended features are able to span non-overlapping input images, and hence provide long-range constraints on the scale and shape of the reconstruction. We add these features as additional constraints to a state-of-the-art global structure from motion algorithm and demonstrate that the added constraints enable the reconstruction of particularly drift-prone sequences such as long, low field-of-view videos without inertial measurements. Additionally, we provide an analysis of the drift-reducing capabilities of these constraints by evaluating on a synthetic dataset. Our structural features are able to significantly reduce drift for scenes that contain long-spanning man-made structures, such as aligned rows of windows or planar building facades.

3. CenterHMR: a Bottom-up Single-shot Method for Multi-person 3D Mesh Recovery from a Single Image [PDF] Back to Contents
  Yu Sun, Qian Bao, Wu Liu, Yili Fu, Tao Mei
Abstract: In this paper, we propose a method to recover multi-person 3D mesh from a single image. Existing methods follow a multi-stage detection-based pipeline, where the 3D mesh of each person is regressed from the cropped image patch. They suffer from the high complexity of the multi-stage process and the ambiguity of the image-level features. For example, it is hard for them to estimate multi-person 3D mesh from the inseparable crowded cases. Instead, in this paper, we present a novel bottom-up single-shot method, Center-based Human Mesh Recovery network (CenterHMR). The model is trained to simultaneously predict two maps, which represent the location of each human body center and the corresponding parameter vector of 3D human mesh at each center. This explicit center-based representation guarantees the pixel-level feature encoding. Besides, the 3D mesh result of each person is estimated from the features centered at the visible body parts, which improves the robustness under occlusion. CenterHMR surpasses previous methods on multi-person in-the-wild benchmark 3DPW and occlusion dataset 3DOH50K. Besides, CenterHMR achieved 2nd place in the ECCV 2020 3DPW Challenge. The code is released on this https URL.
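To make the single-shot decoding concrete, here is a minimal sketch of how a center heatmap plus a parameter map can be turned into per-person mesh parameters (the tensor names, 3x3 non-maximum suppression, and thresholds are assumptions for illustration, not the authors' released code):

    import torch
    import torch.nn.functional as F

    def decode_center_maps(center_map, param_map, topk=64, thresh=0.3):
        """center_map: (1, H, W) body-center heatmap; param_map: (C, H, W)
        per-pixel 3D mesh parameter vectors. Returns one parameter vector
        per detected person center."""
        # Keep only local maxima of the heatmap (3x3 non-maximum suppression).
        pooled = F.max_pool2d(center_map.unsqueeze(0), 3, stride=1, padding=1)[0]
        peaks = center_map * (center_map == pooled).float()
        scores, idx = peaks.flatten().topk(topk)
        idx = idx[scores > thresh]
        # Gather the parameter vector stored at each surviving center.
        C, H, W = param_map.shape
        params = param_map.reshape(C, H * W)[:, idx].t()    # (num_people, C)
        centers = torch.stack((idx // W, idx % W), dim=1)   # (row, col)
        return centers, params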

4. DeepFake Detection Based on the Discrepancy Between the Face and its Context [PDF] Back to Contents
  Yuval Nirkin, Lior Wolf, Yosi Keller, Tal Hassner
Abstract: We propose a method for detecting face swapping and other identity manipulations in single images. Face swapping methods, such as DeepFake, manipulate the face region, aiming to adjust the face to the appearance of its context, while leaving the context unchanged. We show that this modus operandi produces discrepancies between the two regions. These discrepancies offer exploitable telltale signs of manipulation. Our approach involves two networks: (i) a face identification network that considers the face region bounded by a tight semantic segmentation, and (ii) a context recognition network that considers the face context (e.g., hair, ears, neck). We describe a method which uses the recognition signals from our two networks to detect such discrepancies, providing a complementary detection signal that improves conventional real vs. fake classifiers commonly used for detecting fake images. Our method achieves state of the art results on the FaceForensics++, Celeb-DF-v2, and DFDC benchmarks for face manipulation detection, and even generalizes to detect fakes produced by unseen methods.

5. Random Style Transfer based Domain Generalization Networks Integrating Shape and Spatial Information [PDF] Back to Contents
  Lei Li, Veronika A. Zimmer, Wangbin Ding, Fuping Wu, Liqin Huang, Julia A. Schnabel, Xiahai Zhuang
Abstract: Deep learning (DL)-based models have demonstrated good performance in medical image segmentation. However, the models trained on a known dataset often fail when performed on an unseen dataset collected from different centers, vendors and disease populations. In this work, we present a random style transfer network to tackle the domain generalization problem for multi-vendor and center cardiac image segmentation. Style transfer is used to generate training data with a wider distribution/heterogeneity, namely domain augmentation. As the target domain could be unknown, we randomly generate a modality vector for the target modality in the style transfer stage, to simulate the domain shift for unknown domains. The model can be trained in a semi-supervised manner by simultaneously optimizing a supervised segmentation and an unsupervised style translation objective. Besides, the framework incorporates the spatial information and shape prior of the target by introducing two regularization terms. We evaluated the proposed framework on 40 subjects from the M&Ms challenge 2020, and obtained promising performance in the segmentation for data from unknown vendors and centers.

6. Instance Adaptive Self-Training for Unsupervised Domain Adaptation [PDF] Back to Contents
  Ke Mei, Chuang Zhu, Jiaqi Zou, Shanghang Zhang
Abstract: The divergence between labeled training data and unlabeled testing data is a significant challenge for recent deep learning models. Unsupervised domain adaptation (UDA) attempts to solve such a problem. Recent works show that self-training is a powerful approach to UDA. However, existing methods have difficulty in balancing scalability and performance. In this paper, we propose an instance adaptive self-training framework for UDA on the task of semantic segmentation. To effectively improve the quality of pseudo-labels, we develop a novel pseudo-label generation strategy with an instance adaptive selector. Besides, we propose the region-guided regularization to smooth the pseudo-label region and sharpen the non-pseudo-label region. Our method is so concise and efficient that it is easy to be generalized to other unsupervised domain adaptation methods. Experiments on 'GTA5 to Cityscapes' and 'SYNTHIA to Cityscapes' demonstrate the superior performance of our approach compared with the state-of-the-art methods.
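A minimal sketch of confidence-thresholded pseudo-label generation with an instance-adaptive threshold (the paper's selector is more elaborate; the percentile rule here is an assumption used only to show the idea):

    import numpy as np

    def pseudo_labels(probs, base_thresh=0.9, percentile=80):
        """probs: (H, W, num_classes) softmax output for one target image.
        A pixel keeps its predicted class only if its confidence passes an
        image-adaptive threshold; the rest get the ignore index 255."""
        conf = probs.max(axis=-1)
        labels = probs.argmax(axis=-1)
        # Easy images keep a high bar; hard images lower it so that some
        # pixels are still selected for self-training.
        adaptive = min(base_thresh, np.percentile(conf, percentile))
        labels[conf < adaptive] = 255
        return labels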

7. Learning Condition Invariant Features for Retrieval-Based Localization from 1M Images [PDF] Back to Contents
  Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool
Abstract: Image features for retrieval-based localization must be invariant to dynamic objects (e.g. cars) as well as seasonal and daytime changes. Such invariances are, up to some extent, learnable with existing methods using triplet-like losses, given a large number of diverse training images. However, due to the high algorithmic training complexity, there exists insufficient comparison between different loss functions on large datasets. In this paper, we train and evaluate several localization methods on three different benchmark datasets, including Oxford RobotCar with over one million images. This large scale evaluation yields valuable insights into the generalizability and performance of retrieval-based localization. Based on our findings, we develop a novel method for learning more accurate and better generalizing localization features. It consists of two main contributions: (i) a feature volume-based loss function, and (ii) hard positive and pairwise negative mining. On the challenging Oxford RobotCar night condition, our method outperforms the well-known triplet loss by 24.4% in localization accuracy within 5m.
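For reference, the triplet-style baseline with hard positive and negative mining that the paper builds on can be sketched as follows (a generic formulation, not the proposed feature-volume loss):

    import torch

    def hard_triplet_loss(anchor, positives, negatives, margin=0.5):
        """anchor: (D,); positives: (P, D); negatives: (N, D).
        Uses the hardest (farthest) positive and hardest (closest) negative."""
        d_pos = (positives - anchor).norm(dim=1).max()
        d_neg = (negatives - anchor).norm(dim=1).min()
        return torch.clamp(d_pos - d_neg + margin, min=0.0)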

8. Properties Of Winning Tickets On Skin Lesion Classification [PDF] Back to Contents
  Sherin Muckatira
Abstract: Skin cancer affects a large population every year -- automated skin cancer detection algorithms can thus greatly help clinicians. Prior efforts involving deep learning models have high detection accuracy. However, most of the models have a large number of parameters, with some works even using an ensemble of models to achieve good accuracy. In this paper, we investigate a recently proposed pruning technique called Lottery Ticket Hypothesis. We find that iterative pruning of the network resulted in improved accuracy, compared to that of the unpruned network, implying that -- the lottery ticket hypothesis can be applied to the problem of skin cancer detection and this hypothesis can result in a smaller network for inference. We also examine the accuracy across sub-groups -- created by gender and age -- and it was found that some sub-groups show a larger increase in accuracy than others.
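Iterative magnitude pruning as used in lottery-ticket experiments can be sketched as follows (a simplified loop for a generic PyTorch model; the round count, pruning fraction, and the train_fn hook are illustrative assumptions):

    import copy
    import torch

    def find_winning_ticket(model, train_fn, rounds=5, prune_frac=0.2):
        """Repeatedly train, prune the smallest surviving weights per layer,
        and rewind the remaining weights to their initial values."""
        init_state = copy.deepcopy(model.state_dict())
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
        for _ in range(rounds):
            train_fn(model, masks)            # train with the masks applied
            for name, param in model.named_parameters():
                alive = param[masks[name].bool()].abs()
                thresh = alive.quantile(prune_frac)
                masks[name] *= (param.abs() >= thresh).float()
            model.load_state_dict(init_state) # rewind to initialization...
            with torch.no_grad():
                for name, param in model.named_parameters():
                    param *= masks[name]      # ...and re-apply the masks
        return model, masks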

9. Siamese Network for RGB-D Salient Object Detection and Beyond [PDF] Back to Contents
  Keren Fu, Deng-Ping Fan, Ge-Peng Ji, Qijun Zhao, Jianbing Shen, Ce Zhu
Abstract: Existing RGB-D salient object detection (SOD) models usually treat RGB and depth as independent information and design separate networks for feature extraction from each. Such schemes can easily be constrained by a limited amount of training data or over-reliance on an elaborately designed training process. Inspired by the observation that RGB and depth modalities actually present certain commonality in distinguishing salient objects, a novel joint learning and densely cooperative fusion (JL-DCF) architecture is designed to learn from both RGB and depth inputs through a shared network backbone, known as the Siamese architecture. In this paper, we propose two effective components: joint learning (JL), and densely cooperative fusion (DCF). The JL module provides robust saliency feature learning by exploiting cross-modal commonality via a Siamese network, while the DCF module is introduced for complementary feature discovery. Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector with good generalization. As a result, JL-DCF significantly advances the state-of-the-art models by an average of ~2.0% (F-measure) across seven challenging datasets. In addition, we show that JL-DCF is readily applicable to other related multi-modal detection tasks, including RGB-T (thermal infrared) SOD and video SOD (VSOD), achieving comparable or even better performance against state-of-the-art methods. This further confirms that the proposed framework could offer a potential solution for various applications and provide more insight into the cross-modal complementarity task. The code will be available at this https URL

10. MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation [PDF] Back to Contents
  Benlin Liu, Yongming Rao, Jiwen Lu, Jie Zhou, Cho-jui Hsieh
Abstract: Knowledge Distillation (KD) has been one of the most popular methods to learn a compact model. However, it still suffers from high demand in time and computational resources caused by the sequential training pipeline. Furthermore, the soft targets from deeper models do not often serve as good cues for the shallower models due to the gap of compatibility. In this work, we consider these two problems at the same time. Specifically, we propose that better soft targets with higher compatibility can be generated by using a label generator to fuse the feature maps from deeper stages in a top-down manner, and we can employ the meta-learning technique to optimize this label generator. Utilizing the soft targets learned from the intermediate feature maps of the model, we can achieve better self-boosting of the network in comparison with the state-of-the-art. The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012. We test various network architectures to show the generalizability of our MetaDistiller. The experimental results on two datasets strongly demonstrate the effectiveness of our method.
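The distillation objective itself is the familiar soft-target loss; with the label generator's fused output abstracted into soft_targets (an assumed name), a minimal version reads:

    import torch.nn.functional as F

    def distillation_loss(student_logits, soft_targets, labels, T=4.0, alpha=0.7):
        """KL divergence to the meta-learned soft targets plus the usual
        cross-entropy to the ground-truth labels (T and alpha illustrative)."""
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(soft_targets / T, dim=1),
                      reduction="batchmean") * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce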

11. DMD: A Large-Scale Multi-Modal Driver Monitoring Dataset for Attention and Alertness Analysis [PDF] Back to Contents
  Juan Diego Ortega, Neslihan Kose, Paola Cañas, Min-An Chao, Alexander Unnervik, Marcos Nieto, Oihana Otaegui, Luis Salgado
Abstract: Vision is the richest and most cost-effective technology for Driver Monitoring Systems (DMS), especially after the recent success of Deep Learning (DL) methods. The lack of sufficiently large and comprehensive datasets is currently a bottleneck for the progress of DMS development, crucial for the transition of automated driving from SAE Level-2 to SAE Level-3. In this paper, we introduce the Driver Monitoring Dataset (DMD), an extensive dataset which includes real and simulated driving scenarios: distraction, gaze allocation, drowsiness, hands-wheel interaction and context data, in 41 hours of RGB, depth and IR videos from 3 cameras capturing face, body and hands of 37 drivers. A comparison with existing similar datasets is included, which shows the DMD is more extensive, diverse, and multi-purpose. The usage of the DMD is illustrated by extracting a subset of it, the dBehaviourMD dataset, containing 13 distraction activities, prepared to be used in DL training processes. Furthermore, we propose a robust and real-time driver behaviour recognition system targeting a real-world application that can run on cost-efficient CPU-only platforms, based on the dBehaviourMD. Its performance is evaluated with different types of fusion strategies, which all reach enhanced accuracy still providing real-time response.

12. Minimal Adversarial Examples for Deep Learning on 3D Point Clouds [PDF] Back to Contents
  Jaeyeon Kim, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
Abstract: With the recent developments of convolutional neural networks, deep learning for 3D point clouds has shown significant progress in various 3D scene understanding tasks including 3D object recognition. In a safety-critical environment, it is however not well understood how such neural networks are vulnerable to adversarial examples. In this work, we explore adversarial attacks for point cloud-based neural networks with a focus on real-world data. We propose a general formulation for adversarial point cloud generation via $\ell_0$-norm optimisation. Our method generates adversarial examples by attacking the classification ability of the point cloud-based networks while considering the perceptibility of the examples and ensuring the minimum level of point manipulations. The proposed method is general and can be realised in different attack strategies. Experimental results show that our method achieves the state-of-the-art performance with 80\% of attack success rate while manipulating only about 4\% of the points. We also found that compared with synthetic data, real-world point cloud classification is more vulnerable to attacks.
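The spirit of an $\ell_0$-constrained attack, i.e. perturbing as few points as possible, can be illustrated with a simple gradient-saliency heuristic (the paper's optimisation is more principled; the point budget, step size, and model interface below are assumptions):

    import torch
    import torch.nn.functional as F

    def sparse_point_attack(model, points, label, budget=50, eps=0.05, steps=10):
        """points: (N, 3); label: scalar long tensor. Perturb only the
        `budget` points whose gradients most affect the classification
        loss, an l0-style constraint on the perturbation."""
        pts = points.clone().requires_grad_(True)
        loss = F.cross_entropy(model(pts.unsqueeze(0)), label.unsqueeze(0))
        loss.backward()
        idx = pts.grad.norm(dim=1).topk(budget).indices  # most salient points
        adv = points.detach().clone()
        for _ in range(steps):
            adv_var = adv.clone().requires_grad_(True)
            loss = F.cross_entropy(model(adv_var.unsqueeze(0)), label.unsqueeze(0))
            loss.backward()
            with torch.no_grad():
                adv[idx] += eps * adv_var.grad[idx].sign()  # move only top-k
        return adv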

13. Compensation Tracker: Data Association Method for Lost Object [PDF] Back to Contents
  Zhibo Zou, Junjie Huang, Ping Luo
Abstract: At present, the main research direction in multi-object tracking is detection-based tracking. Although detection-based tracking models can achieve good results, they are highly dependent on detector performance: tracking results degrade when the detector misses objects or produces false detections. Therefore, to address the problem of missed detections, this paper designs a compensation tracker based on a Kalman filter and forecast correction. Experiments show that with the proposed compensation tracker, evaluation metrics improve to varying degrees on the MOT Challenge data sets. In particular, multi-object tracking accuracy reaches 66% on the 2020 dense-scenario datasets. This shows that the proposed method can effectively improve tracking performance.
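The compensation idea rests on the standard constant-velocity Kalman prediction: when a track loses its detection, its state keeps being propagated so it can be re-associated once the detector recovers the object. A minimal predict step (illustrative state layout and noise model):

    import numpy as np

    def kalman_predict(x, P, dt=1.0, q=1e-2):
        """x = [cx, cy, vx, vy] is a lost track's state; P its covariance.
        Propagate one frame ahead under a constant-velocity motion model."""
        F = np.array([[1, 0, dt, 0],
                      [0, 1, 0, dt],
                      [0, 0, 1,  0],
                      [0, 0, 0,  1]], dtype=float)
        Q = q * np.eye(4)        # process noise, assumed isotropic
        return F @ x, F @ P @ F.T + Q

New detections can then be matched against the predicted positions, for example by IoU or Mahalanobis distance, to recover objects the detector temporarily missed.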

14. Inner Eye Canthus Localization for Human Body Temperature Screening [PDF] Back to Contents
  Claudio Ferrari, Lorenzo Berlincioni, Marco Bertini, Alberto Del Bimbo
Abstract: In this paper, we propose an automatic approach for localizing the inner eye canthus in thermal face images. We first coarsely detect 5 facial keypoints corresponding to the center of the eyes, the nosetip and the ears. Then we compute a sparse 2D-3D points correspondence using a 3D Morphable Face Model (3DMM). This correspondence is used to project the entire 3D face onto the image, and subsequently locate the inner eye canthus. Detecting this location allows to obtain the most precise body temperature measurement for a person using a thermal camera. We evaluated the approach on a thermal face dataset provided with manually annotated landmarks. However, such manual annotations are normally conceived to identify facial parts such as eyes, nose and mouth, and are not specifically tailored for localizing the eye canthus region. As additional contribution, we enrich the original dataset by using the annotated landmarks to deform and project the 3DMM onto the images. Then, by manually selecting a small region corresponding to the eye canthus, we enrich the dataset with additional annotations. By using the manual landmarks, we ensure the correctness of the 3DMM projection, which can be used as ground-truth for future evaluations. Moreover, we supply the dataset with the 3D head poses and per-point visibility masks for detecting self-occlusions. The data will be publicly released.

15. How semantic and geometric information mutually reinforce each other in ToF object localization [PDF] Back to Contents
  Antoine Vanderschueren, Victor Joos, Christophe De Vleeschouwer
Abstract: We propose a novel approach to localize a 3D object from the intensity and depth information images provided by a Time-of-Flight (ToF) sensor. Our method uses two CNNs. The first one uses raw depth and intensity images as input, to segment the floor pixels, from which the extrinsic parameters of the camera are estimated. The second CNN is in charge of segmenting the object-of-interest. As a main innovation, it exploits the calibration estimated from the prediction of the first CNN to represent the geometric depth information in a coordinate system that is attached to the ground, and is thus independent of the camera elevation. In practice, both the height of pixels with respect to the ground, and the orientation of normals to the point cloud are provided as input to the second CNN. Given the segmentation predicted by the second CNN, the object is localized based on point cloud alignment with a reference model. Our experiments demonstrate that our proposed two-step approach improves segmentation and localization accuracy by a significant margin compared to a conventional CNN architecture, ignoring calibration and height maps, but also compared to PointNet++.

16. A Flexible Selection Scheme for Minimum-Effort Transfer Learning [PDF] Back to Contents
  Amelie Royer, Christoph H. Lampert
Abstract: Fine-tuning is a popular way of exploiting knowledge contained in a pre-trained convolutional network for a new visual recognition task. However, the orthogonal setting of transferring knowledge from a pretrained network to a visually different yet semantically close source is rarely considered: This commonly happens with real-life data, which is not necessarily as clean as the training source (noise, geometric transformations, different modalities, etc.). To tackle such scenarios, we introduce a new, generalized form of fine-tuning, called flex-tuning, in which any individual unit (e.g. layer) of a network can be tuned, and the most promising one is chosen automatically. In order to make the method appealing for practical use, we propose two lightweight and faster selection procedures that prove to be good approximations in practice. We study these selection criteria empirically across a variety of domain shifts and data scarcity scenarios, and show that fine-tuning individual units, despite its simplicity, yields very good results as an adaptation technique. As it turns out, in contrast to common practice, rather than the last fully-connected unit it is best to tune an intermediate or early one in many domain-shift scenarios, which is accurately detected by flex-tuning.
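The automatic choice of which unit to tune can be made concrete with a small search loop (a sketch assuming a PyTorch-style model whose candidate units are named submodules; quick_train and evaluate stand in for the paper's lightweight selection procedures and are hypothetical):

    def flex_tune(model_fn, unit_names, quick_train, evaluate):
        """Try tuning each candidate unit in isolation and keep the best.
        model_fn() returns a fresh pretrained model; all identifiers here
        are illustrative."""
        best = None
        for unit_name in unit_names:
            model = model_fn()
            for p in model.parameters():
                p.requires_grad = False              # freeze everything...
            for p in getattr(model, unit_name).parameters():
                p.requires_grad = True               # ...except one unit
            quick_train(model)
            score = evaluate(model)
            if best is None or score > best[0]:
                best = (score, unit_name, model)
        return best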

17. Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events [PDF] Back to Contents
  Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, Marius Kloft
Abstract: As a vital topic in media content interpretation, video anomaly detection (VAD) has made fruitful progress via deep neural network (DNN). However, existing methods usually follow a reconstruction or frame prediction routine. They suffer from two gaps: (1) They cannot localize video activities in a both precise and comprehensive manner. (2) They lack sufficient abilities to utilize high-level semantics and temporal context information. Inspired by frequently-used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. Appearance and motion are exploited as mutually complimentary cues to localize regions of interest (RoIs). A normalized spatio-temporal cube (STC) is built from each RoI as a video event, which lays the foundation of VEC and serves as a basic processing unit. Second, we encourage DNN to capture high-level semantics by solving a visual cloze test. To build such a visual cloze test, a certain patch of STC is erased to yield an incomplete event (IE). The DNN learns to restore the original video event from the IE by inferring the missing patch. Third, to incorporate richer motion dynamics, another DNN is trained to infer erased patches' optical flow. Finally, two ensemble strategies using different types of IE and modalities are proposed to boost VAD performance, so as to fully exploit the temporal context and modality information for VAD. VEC can consistently outperform state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks. Our codes and results can be verified at this http URL.

18. Edge and Identity Preserving Network for Face Super-Resolution [PDF] Back to Contents
  Jonghyun Kim, Gen Li, Inyong Yun, Cheolkon Jung, Joongkyu Kim
Abstract: Face super-resolution has become an indispensable part in security problems such as video surveillance and identification system, but the distortion in facial components is a main obstacle to overcoming the problems. To alleviate it, most state-of-the-art methods have utilized facial priors by using deep networks. These methods require extra labels, longer training time, and larger computation memory. Thus, we propose a novel Edge and Identity Preserving Network for Face Super-Resolution Network, named as EIPNet, which minimizes the distortion by utilizing a lightweight edge block and identity information. Specifically, the edge block extracts perceptual edge information and concatenates it to original feature maps in multiple scales. This structure progressively provides edge information in reconstruction procedure to aggregate local and global structural information. Moreover, we define an identity loss function to preserve identification of super-resolved images. The identity loss function compares feature distributions between super-resolved images and target images to solve unlabeled classification problem. In addition, we propose a Luminance-Chrominance Error (LCE) to expand usage of image representation domain. The LCE method not only reduces the dependency of color information by dividing brightness and color components but also facilitates our network to reflect differences between Super-Resolution (SR) and High-Resolution (HR) images in multiple domains (RGB and YUV). The proposed methods facilitate our super-resolution network to elaborately restore facial components and generate enhanced 8x scaled super-resolution images with a lightweight network structure.
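The Luminance-Chrominance Error can be written down directly: convert the super-resolved and ground-truth images to YUV and penalize the luminance and chrominance differences separately (BT.601 coefficients; the L1 form and weighting are assumptions, and the paper's exact formulation may differ):

    import torch

    def rgb_to_yuv(img):
        """img: (B, 3, H, W) in [0, 1]; BT.601 conversion."""
        r, g, b = img[:, 0], img[:, 1], img[:, 2]
        y = 0.299 * r + 0.587 * g + 0.114 * b
        u = -0.147 * r - 0.289 * g + 0.436 * b
        v = 0.615 * r - 0.515 * g - 0.100 * b
        return y, torch.stack((u, v), dim=1)

    def lce_loss(sr, hr, w_lum=1.0, w_chr=1.0):
        """Separate luminance (Y) and chrominance (UV) reconstruction terms."""
        y_sr, uv_sr = rgb_to_yuv(sr)
        y_hr, uv_hr = rgb_to_yuv(hr)
        return (w_lum * (y_sr - y_hr).abs().mean()
                + w_chr * (uv_sr - uv_hr).abs().mean())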

19. Visual Question Answering on Image Sets [PDF] Back to Contents
  Ankan Bansal, Yuting Zhang, Rama Chellappa
Abstract: We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images. The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set. To enable research in this new topic, we introduce two ISVQA datasets - indoor and outdoor scenes. They simulate the real-world scenarios of indoor image collections and multiple car-mounted cameras, respectively. The indoor-scene dataset contains 91,479 human annotated questions for 48,138 image sets, and the outdoor-scene dataset has 49,617 questions for 12,746 image sets. We analyze the properties of the two datasets, including question-and-answer distributions, types of questions, biases in dataset, and question-image dependencies. We also build new baseline models to investigate new research challenges in ISVQA.

20. Multi-task deep CNN model for no-reference image quality assessment on smartphone camera photos [PDF] Back to Contents
  Chen-Hsiu Huang, Ja-Ling Wu
Abstract: Smartphone is the most successful consumer electronic product in today's mobile social network era. The smartphone camera quality and its image post-processing capability is the dominant factor that impacts consumer's buying decision. However, the quality evaluation of photos taken from smartphones remains a labor-intensive work and relies on professional photographers and experts. As an extension of the prior CNN-based NR-IQA approach, we propose a multi-task deep CNN model with scene type detection as an auxiliary task. With the shared model parameters in the convolution layer, the learned feature maps could become more scene-relevant and enhance the performance. The evaluation result shows improved SROCC performance compared to traditional NR-IQA methods and single task CNN-based models.
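The multi-task arrangement is a shared backbone with two heads; a minimal sketch (the ResNet-18 backbone, head sizes, and number of scene types are assumptions):

    import torch.nn as nn
    import torchvision.models as models

    class MultiTaskIQA(nn.Module):
        """A quality-score regressor plus an auxiliary scene classifier on
        shared convolutional features, which pushes the learned feature
        maps to be scene-relevant."""
        def __init__(self, num_scene_types=9):
            super().__init__()
            backbone = models.resnet18(pretrained=True)
            self.features = nn.Sequential(*list(backbone.children())[:-1])
            self.quality_head = nn.Linear(512, 1)              # MOS regression
            self.scene_head = nn.Linear(512, num_scene_types)  # auxiliary task

        def forward(self, x):
            f = self.features(x).flatten(1)
            return self.quality_head(f), self.scene_head(f)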

21. Surgical Skill Assessment on In-Vivo Clinical Data via the Clearness of Operating Field [PDF] Back to Contents
  Daochang Liu, Tingting Jiang, Yizhou Wang, Rulin Miao, Fei Shan, Ziyu Li
Abstract: Surgical skill assessment is important for surgery training and quality control. Prior works on this task largely focus on basic surgical tasks such as suturing and knot tying performed in simulation settings. In contrast, surgical skill assessment is studied in this paper on a real clinical dataset, which consists of fifty-seven in-vivo laparoscopic surgeries and corresponding skill scores annotated by six surgeons. From analyses on this dataset, the clearness of operating field (COF) is identified as a good proxy for overall surgical skills, given its strong correlation with overall skills and high inter-annotator consistency. Then an objective and automated framework based on neural network is proposed to predict surgical skills through the proxy of COF. The neural network is jointly trained with a supervised regression loss and an unsupervised rank loss. In experiments, the proposed method achieves 0.55 Spearman's correlation with the ground truth of overall technical skill, which is even comparable with the human performance of junior surgeons.
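The joint objective pairs a regression term with a ranking term; one common shape for such a combination is sketched below (the paper's rank loss is unsupervised and its ordering signal is derived differently, so the `sign` variable here is purely an assumption for illustration):

    import torch
    import torch.nn.functional as F

    def skill_loss(pred_a, pred_b, score_a, score_b, margin=0.1, w_rank=0.5):
        """Regress each video's skill score and additionally require that
        the predicted ordering of a pair agrees with the target ordering."""
        reg = F.mse_loss(pred_a, score_a) + F.mse_loss(pred_b, score_b)
        sign = torch.sign(score_a - score_b)       # which sample ranks higher
        rank = F.relu(margin - sign * (pred_a - pred_b)).mean()
        return reg + w_rank * rank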

22. Unsupervised Surgical Instrument Segmentation via Anchor Generation and Semantic Diffusion [PDF] Back to Contents
  Daochang Liu, Yuhui Wei, Tingting Jiang, Yizhou Wang, Rulin Miao, Fei Shan, Ziyu Li
Abstract: Surgical instrument segmentation is a key component in developing context-aware operating rooms. Existing works on this task heavily rely on the supervision of a large amount of labeled data, which involve laborious and expensive human efforts. In contrast, a more affordable unsupervised approach is developed in this paper. To train our model, we first generate anchors as pseudo labels for instruments and background tissues respectively by fusing coarse handcrafted cues. Then a semantic diffusion loss is proposed to resolve the ambiguity in the generated anchors via the feature correlation between adjacent video frames. In the experiments on the binary instrument segmentation task of the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge dataset, the proposed method achieves 0.71 IoU and 0.81 Dice score without using a single manual annotation, which is promising to show the potential of unsupervised learning for surgical tool segmentation.

23. Moderately supervised learning: definition and framework [PDF] Back to Contents
  Yongquan Yang, Zhongxi Zheng
Abstract: Supervised learning (SL) has achieved remarkable success in numerous artificial intelligence applications. In the current literature, by referring to the properties of the ground-truth labels prepared for a training data set, SL is roughly categorized as fully supervised learning (FSL) and weakly supervised learning (WSL). However, solutions for various FSL tasks have shown that the given ground-truth labels are not always learnable, and the target transformation from the given ground-truth labels to learnable targets can significantly affect the performance of the final FSL solutions. Without considering the properties of the target transformation from the given ground-truth labels to learnable targets, the roughness of the FSL category conceals some details that can be critical to building the optimal solutions for some specific FSL tasks. Thus, it is desirable to reveal these details. This article attempts to achieve this goal by expanding the categorization of FSL and investigating the subtype that plays the central role in FSL. Taking into consideration the properties of the target transformation from the given ground-truth labels to learnable targets, we first categorize FSL into three narrower subtypes. Then, we focus on the subtype moderately supervised learning (MSL). MSL concerns the situation where the given ground-truth labels are ideal, but due to the simplicity in annotation of the given ground-truth labels, careful designs are required to transform the given ground-truth labels into learnable targets. From the perspectives of definition and framework, we comprehensively illustrate MSL to reveal what details are concealed by the roughness of the FSL category. Finally, discussions on the revealed details suggest that MSL should be given more attention.

24. Attribute-guided image generation from layout [PDF] Back to Contents
  Ke Ma, Bo Zhao, Leonid Sigal
Abstract: Recent approaches have achieved great success in image generation from structured inputs, e.g., semantic segmentation, scene graph or layout. Although these methods allow specification of objects and their locations at image-level, they lack the fidelity and semantic control to specify visual appearance of these objects at an instance-level. To address this limitation, we propose a new image generation method that enables instance-level attribute control. Specifically, the input to our attribute-guided generative model is a tuple that contains: (1) object bounding boxes, (2) object categories and (3) an (optional) set of attributes for each object. The output is a generated image where the requested objects are in the desired locations and have prescribed attributes. Several losses work collaboratively to encourage accurate, consistent and diverse image generation. Experiments on Visual Genome dataset demonstrate our model's capacity to control object-level attributes in generated images, and validate plausibility of disentangled object-attribute representation in the image generation from layout task. Also, the generated images from our model have higher resolution, object classification accuracy and consistency, as compared to the previous state-of-the-art.

25. Fingerprint Feature Extraction by Combining Texture, Minutiae, and Frequency Spectrum Using Multi-Task CNN [PDF] Back to Contents
  Ai Takahashi, Yoshinori Koda, Koichi Ito, Takafumi Aoki
Abstract: Although most fingerprint matching methods utilize minutia points and/or texture of fingerprint images as fingerprint features, the frequency spectrum is also a useful feature since a fingerprint is composed of ridge patterns with its inherent frequency band. We propose a novel CNN-based method for extracting fingerprint features from texture, minutiae, and frequency spectrum. In order to extract effective texture features from local regions around the minutiae, the minutia attention module is introduced to the proposed method. We also propose new data augmentation methods, which takes into account the characteristics of fingerprint images to increase the number of images during training since we use only a public dataset in training, which includes a few fingerprint classes. Through a set of experiments using FVC2004 DB1 and DB2, we demonstrated that the proposed method exhibits the efficient performance on fingerprint verification compared with a commercial fingerprint matching software and the conventional method.
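The frequency-spectrum input mentioned here is straightforward to compute; a short numpy sketch (log scaling is an illustrative choice):

    import numpy as np

    def frequency_spectrum(patch):
        """patch: (H, W) grayscale fingerprint region. Returns the centered
        log-magnitude spectrum; ridge patterns appear as a ring whose radius
        corresponds to the dominant ridge frequency."""
        f = np.fft.fftshift(np.fft.fft2(patch))
        return np.log1p(np.abs(f))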

26. Domain Adaptation Through Task Distillation [PDF] Back to Contents
  Brady Zhou, Nimit Kalra, Philipp Krähenbühl
Abstract: Deep networks devour millions of precisely annotated images to build their complex and powerful representations. Unfortunately, tasks like autonomous driving have virtually no real-world training data. Repeatedly crashing a car into a tree is simply too expensive. The commonly prescribed solution is simple: learn a representation in simulation and transfer it to the real world. However, this transfer is challenging since simulated and real-world visual experiences vary dramatically. Our core observation is that for certain tasks, such as image recognition, datasets are plentiful. They exist in any interesting domain, simulated or real, and are easy to label and extend. We use these recognition datasets to link up a source and target domain to transfer models between them in a task distillation framework. Our method can successfully transfer navigation policies between drastically different simulators: ViZDoom, SuperTuxKart, and CARLA. Furthermore, it shows promising results on standard domain adaptation benchmarks.

27. Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving [PDF] Back to Contents
  Sudeep Fadadu, Shreyash Pandey, Darshan Hegde, Yi Shi, Fang-Chieh Chou, Nemanja Djuric, Carlos Vallespi-Gonzalez
Abstract: We present an end-to-end method for object detection and trajectory prediction utilizing multi-view representations of LiDAR returns. Our method builds on a state-of-the-art Bird's-Eye View (BEV) network that fuses voxelized features from a sequence of historical LiDAR data as well as rasterized high-definition map to perform detection and prediction tasks. We extend the BEV network with additional LiDAR Range-View (RV) features that use the raw LiDAR information in its native, non-quantized representation. The RV feature map is projected into BEV and fused with the BEV features computed from LiDAR and high-definition map. The fused features are then further processed to output the final detections and trajectories, within a single end-to-end trainable network. In addition, using this framework the RV fusion of LiDAR and camera is performed in a straightforward and computational efficient manner. The proposed approach improves the state-of-the-art on proprietary large-scale real-world data collected by a fleet of self-driving vehicles, as well as on the public nuScenes data set.

28. Pose-Guided High-Resolution Appearance Transfer via Progressive Training [PDF] Back to Contents
  Ji Liu, Heshan Liu, Mang-Tik Chiu, Yu-Wing Tai, Chi-Keung Tang
Abstract: We propose a novel pose-guided appearance transfer network for transferring a given reference appearance to a target pose in unprecedented image resolution (1024 * 1024), given respectively an image of the reference and target person. No 3D model is used. Instead, our network utilizes dense local descriptors including local perceptual loss and local discriminators to refine details, which is trained progressively in a coarse-to-fine manner to produce the high-resolution output to faithfully preserve complex appearance of garment textures and geometry, while hallucinating seamlessly the transferred appearances including those with dis-occlusion. Our progressive encoder-decoder architecture can learn the reference appearance inherent in the input image at multiple scales. Extensive experimental results on the Human3.6M dataset, the DeepFashion dataset, and our dataset collected from YouTube show that our model produces high-quality images, which can be further utilized in useful applications such as garment transfer between people and pose-guided human video generation.

29. Webly Supervised Image Classification with Self-Contained Confidence [PDF] Back to Contents
  Jingkang Yang, Litong Feng, Weirong Chen, Xiaopeng Yan, Huabin Zheng, Ping Luo, Wayne Zhang
Abstract: This paper focuses on webly supervised learning (WSL), where datasets are built by crawling samples from the Internet and directly using search queries as web labels. Although WSL benefits from fast and low-cost data collection, noises in web labels hinder better performance of the image classification model. To alleviate this problem, in recent works, self-label supervised loss $\mathcal{L}_s$ is utilized together with webly supervised loss $\mathcal{L}_w$. $\mathcal{L}_s$ relies on pseudo labels predicted by the model itself. Since the correctness of the web label or pseudo label is usually on a case-by-case basis for each web sample, it is desirable to adjust the balance between $\mathcal{L}_s$ and $\mathcal{L}_w$ on sample level. Inspired by the ability of Deep Neural Networks (DNNs) in confidence prediction, we introduce Self-Contained Confidence (SCC) by adapting model uncertainty for WSL setting, and use it to sample-wisely balance $\mathcal{L}_s$ and $\mathcal{L}_w$. Therefore, a simple yet effective WSL framework is proposed. A series of SCC-friendly regularization approaches are investigated, among which the proposed graph-enhanced mixup is the most effective method to provide high-quality confidence to enhance our framework. The proposed WSL framework has achieved the state-of-the-art results on two large-scale WSL datasets, WebVision-1000 and Food101-N. Code is available at this https URL.
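The sample-wise balancing reduces to a per-sample convex combination of the two losses driven by the self-contained confidence (the paper's full framework adds more machinery; this shows only the basic shape):

    import torch.nn.functional as F

    def scc_loss(logits, web_labels, pseudo_labels, confidence):
        """confidence: (B,) values in [0, 1]. High confidence trusts the
        self-label loss L_s; low confidence falls back to the webly
        supervised loss L_w."""
        l_s = F.cross_entropy(logits, pseudo_labels, reduction="none")
        l_w = F.cross_entropy(logits, web_labels, reduction="none")
        return (confidence * l_s + (1.0 - confidence) * l_w).mean()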

30. A Self-Reasoning Framework for Anomaly Detection Using Video-Level Labels [PDF] Back to Contents
  Muhammad Zaigham Zaheer, Arif Mahmood, Hochul Shin, Seung-Ik Lee
Abstract: Anomalous event detection in surveillance videos is a challenging and practical research problem among image and video processing community. Compared to the frame-level annotations of anomalous events, obtaining video-level annotations is quite fast and cheap though such high-level labels may contain significant noise. More specifically, an anomalous labeled video may actually contain anomaly only in a short duration while the rest of the video frames may be normal. In the current work, we propose a weakly supervised anomaly detection framework based on deep neural networks which is trained in a self-reasoning fashion using only video-level labels. To carry out the self-reasoning based training, we generate pseudo labels by using binary clustering of spatio-temporal video features which helps in mitigating the noise present in the labels of anomalous videos. Our proposed formulation encourages both the main network and the clustering to complement each other in achieving the goal of more accurate anomaly detection. The proposed framework has been evaluated on publicly available real-world anomaly detection datasets including UCF-crime, ShanghaiTech and UCSD Ped2. The experiments demonstrate superiority of our proposed framework over the current state-of-the-art methods.
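Pseudo-label generation by binary clustering can be sketched with k-means over per-segment features (scikit-learn here; the feature extractor and the minority-cluster rule are assumptions for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_pseudo_labels(features, video_is_anomalous):
        """features: (num_segments, D) spatio-temporal features of one video.
        For a video labeled anomalous at video level, split its segments into
        two clusters and treat the minority cluster as the anomalous part."""
        if not video_is_anomalous:
            return np.zeros(len(features), dtype=int)
        assign = KMeans(n_clusters=2, n_init=10).fit_predict(features)
        anomalous = np.argmin(np.bincount(assign))   # minority cluster
        return (assign == anomalous).astype(int)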

31. Crossing-Domain Generative Adversarial Networks for Unsupervised Multi-Domain Image-to-Image Translation [PDF] Back to Contents
  Xuewen Yang, Dongliang Xie, Xin Wang
Abstract: State-of-the-art techniques in Generative Adversarial Networks (GANs) have shown remarkable success in image-to-image translation from a peer domain X to a domain Y using paired image data. However, obtaining abundant paired data is a non-trivial and expensive process in the majority of applications. When there is a need to translate images across n domains, if the training is performed between every two domains, the complexity of the training grows quadratically. Moreover, training with data from only two domains at a time cannot benefit from the data of other domains, which prevents the extraction of more useful features and hinders the progress of this research area. In this work, we propose a general framework for unsupervised image-to-image translation across multiple domains, which can translate images from domain X to any other domain without requiring direct training between the two domains involved in the translation. A byproduct of the framework is the reduction of computing time and computing resources, since it needs less time than training the domains in pairs as is done in state-of-the-art works. Our proposed framework consists of a pair of encoders along with a pair of GANs which learn high-level features across different domains, from which diverse and realistic samples are generated. Our framework shows competitive results on many image-to-image tasks compared with state-of-the-art techniques.

32. Adversarial Dual Distinct Classifiers for Unsupervised Domain Adaptation [PDF] Back to Contents
  Taotao Jing, Zhengming Ding
Abstract: Unsupervised Domain Adaptation (UDA) attempts to recognize unlabeled target samples by building a learning model from a differently-distributed labeled source domain. Conventional UDA concentrates on extracting domain-invariant features through deep adversarial networks. However, most such approaches seek to match the feature distributions of the different domains without considering the task-specific decision boundaries across the various classes. In this paper, we propose a novel Adversarial Dual Distinct Classifiers Network (AD$^2$CN) to align the source and target domain data distributions while matching the task-specific category boundaries. To be specific, a domain-invariant feature generator is exploited to embed the source and target data into a latent common space with the guidance of discriminative cross-domain alignment. Moreover, we naturally design two classifiers with different structures to identify the unlabeled target samples under the supervision of the labeled source domain data. Such dual distinct classifiers with various architectures can capture diverse knowledge of the target data structure from different perspectives. Extensive experimental results on several cross-domain visual benchmarks prove the model's effectiveness by comparing it with other state-of-the-art UDA methods.
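For illustration, a minimal sketch of two structurally distinct heads over a shared domain-invariant feature generator; the particular head structures and layer sizes are our assumptions:

```python
import torch
import torch.nn as nn

class DualClassifiers(nn.Module):
    def __init__(self, in_dim, feat_dim, num_classes):
        super().__init__()
        # Shared domain-invariant feature generator.
        self.generator = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Head 1: a plain MLP classifier.
        self.mlp_head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, num_classes))
        # Head 2: a prototype classifier scoring by distance to class centers.
        self.prototypes = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, x):
        z = self.generator(x)
        return self.mlp_head(z), -torch.cdist(z, self.prototypes)
```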

33. Adaptively-Accumulated Knowledge Transfer for Partial Domain Adaptation [PDF] Back to Contents
  Taotao Jing, Haifeng Xia, Zhengming Ding
Abstract: Partial domain adaptation (PDA) attracts appealing attention as it deals with the realistic and challenging problem in which the source domain label space subsumes that of the target domain. Most conventional domain adaptation (DA) efforts concentrate on learning domain-invariant features to mitigate the distribution disparity across domains. For PDA, however, it is also crucial to explicitly alleviate the negative influence caused by the irrelevant source-domain categories. In this work, we propose an Adaptively-Accumulated Knowledge Transfer framework (A$^2$KT) to align the relevant categories across the two domains for effective domain adaptation. Specifically, an adaptively-accumulated mechanism is explored to gradually filter out the most confident target samples and their corresponding source categories, promoting positive transfer with more knowledge across the two domains. Moreover, a dual distinct classifier architecture consisting of a prototype classifier and a multilayer perceptron classifier is built to capture intrinsic data distribution knowledge across domains from various perspectives. By maximizing the inter-class center-wise discrepancy and minimizing the intra-class sample-wise compactness, the proposed model is able to obtain more domain-invariant and task-specific discriminative representations of the shared-category data. Comprehensive experiments on several partial domain adaptation benchmarks demonstrate the effectiveness of our proposed model compared with state-of-the-art PDA methods.
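A minimal sketch of the accumulation step, assuming a fixed confidence threshold; the paper's adaptive schedule and exact criterion are not specified in the abstract, so all names and values here are illustrative:

```python
import torch

def accumulate_confident_targets(target_probs, threshold=0.9):
    # Keep target samples whose top-class probability clears the threshold,
    # and record which source categories those samples vote for; categories
    # never voted for can be down-weighted as irrelevant.
    conf, preds = target_probs.max(dim=1)
    keep = conf >= threshold
    relevant_classes = torch.unique(preds[keep])
    return keep, preds[keep], relevant_classes
```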

34. Deep Learning for 2D grapevine bud detection [PDF] Back to Contents
  Wenceslao Villegas Marset, Diego Sebastián Pérez, Carlos Ariel Díaz, Facundo Bromberg
Abstract: In viticulture, visual inspection of the plant is a necessary task for measuring relevant variables. In many cases, these visual inspections are susceptible to automation through computer vision methods. Bud detection is one such visual task, central for the measurement of important variables such as bud sunlight exposure, autonomous pruning, bud counting, type-of-bud classification, bud geometric characterization, internode length, bud area, and bud development stage, among others. This paper presents a computer method for grapevine bud detection based on a Fully Convolutional Network with a MobileNet architecture (FCN-MN). To validate its performance, this architecture was compared on the detection task with a strong method for bud detection, the scanning-windows-with-patch-classifier method, showing improvements over three aspects of detection: segmentation, correspondence identification and localization. With its best configuration parameters, the present approach showed a detection precision of $95.6\%$, a detection recall of $93.6\%$, and a mean Dice measure of $89.1\%$ for correct detections (i.e., detections whose mask overlaps the true bud), with false alarms (i.e., detections not overlapping the true bud) being small and nearby, as shown by a mean pixel area of only $8\%$ of the area of a true bud and a distance (between mass centers) of $1.1$ true bud diameters. We conclude by discussing how these results for FCN-MN would produce sufficiently accurate measurements of the variables bud number, bud area, and internode length, suggesting good performance in a practical setup.
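For reference, the mean Dice measure of $89.1\%$ is the standard mask-overlap score between a detection and the true bud:

```python
import numpy as np

def dice(pred_mask, true_mask):
    # Dice = 2 * |A ∩ B| / (|A| + |B|) over boolean masks.
    inter = np.logical_and(pred_mask, true_mask).sum()
    denom = pred_mask.sum() + true_mask.sum()
    return 2.0 * inter / denom if denom else 0.0
```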

35. Tabular Structure Detection from Document Images for Resource Constrained Devices Using A Row Based Similarity Measure [PDF] Back to Contents
  Soumyadeep Dey, Jayanta Mukhopadhyay, Shamik Sural
Abstract: Tabular structures are used to present crucial information in a structured and crisp manner. Detection of such regions is of great importance for proper understanding of a document. Tabular structures can be of various layouts and types. Therefore, detection of these regions is a hard problem. Most of the existing techniques detect tables from a document image by using prior knowledge of the structures of the tables. However, these methods are not applicable for generalized tabular structures. In this work, we propose a similarity measure to find similarities between pairs of rows in a tabular structure. This similarity measure is utilized to identify a tabular region. Since the tabular regions are detected exploiting the similarities among all rows, the method is inherently independent of layouts of the tabular regions present in the training data. Moreover, the proposed similarity measure can be used to identify tabular regions without using large sets of parameters associated with recent deep learning based methods. Thus, the proposed method can easily be used with resource constrained devices such as mobile devices without much of an overhead.
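The abstract does not spell out the similarity measure itself; purely as an illustration of a cheap, low-parameter row comparison in the same spirit:

```python
import numpy as np

def row_similarity(strip_a, strip_b):
    # Jaccard overlap of binarized ink profiles of two row strips:
    # a lightweight comparison suited to resource-constrained devices.
    a, b = np.asarray(strip_a, dtype=bool), np.asarray(strip_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union
```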

36. Deep learning-based computer vision to recognize and classify suturing gestures in robot-assisted surgery [PDF] Back to Contents
  Francisco Luongo, Ryan Hakim, Jessica H. Nguyen, Animashree Anandkumar, Andrew J Hung
Abstract: Our previous work classified a taxonomy of suturing gestures during a vesicourethral anastomosis of robotic radical prostatectomy in association with tissue tears and patient outcomes. Herein, we train deep learning-based computer vision (CV) to automate the identification and classification of suturing gestures for needle driving attempts. Using two independent raters, we manually annotated live suturing video clips to label timepoints and gestures. Identification (2395 videos) and classification (511 videos) datasets were compiled to train CV models to produce two- and five-class label predictions, respectively. Networks were trained on inputs of raw RGB pixels as well as optical flow for each frame. Each model was trained on 80/20 train/test splits. In this study, all models were able to reliably predict both the presence of a gesture (identification, AUC: 0.88) and the type of gesture (classification, AUC: 0.87) at levels significantly above chance. For both the gesture identification and classification datasets, we observed no effect of the recurrent classification model choice (LSTM vs. convLSTM) on performance. Our results demonstrate CV's ability to recognize features that not only identify the action of suturing but also distinguish between different classifications of suturing gestures. This demonstrates the potential to utilize deep learning CV towards future automation of surgical skill assessment.

37. Expressive Telepresence via Modular Codec Avatars [PDF] Back to Contents
  Hang Chu, Shugao Ma, Fernando De la Torre, Sanja Fidler, Yaser Sheikh
Abstract: VR telepresence consists of interacting with another human in a virtual space represented by an avatar. Today most avatars are cartoon-like, but soon the technology will allow video-realistic ones. This paper moves in that direction and presents Modular Codec Avatars (MCA), a method to generate hyper-realistic faces driven by the cameras in the VR headset. MCA extends traditional Codec Avatars (CA) by replacing the holistic models with a learned modular representation. It is important to note that traditional person-specific CAs are learned from few training samples and typically lack robustness as well as expressiveness when transferring facial expressions. MCAs solve these issues by learning a modulated adaptive blending of different facial components as well as an exemplar-based latent alignment. We demonstrate that MCA achieves improved expressiveness and robustness with respect to CA in a variety of real-world datasets and practical scenarios. Finally, we showcase new applications in VR telepresence enabled by the proposed model.

38. Visual Concept Reasoning Networks [PDF] Back to Contents
  Taesup Kim, Sungwoong Kim, Yoshua Bengio
Abstract: A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. It approximates sparsely connected networks by explicitly defining multiple branches to simultaneously learn representations with different visual concepts or properties. Dependencies or interactions between these representations are typically defined by dense and local operations, however, without any adaptiveness or high-level reasoning. In this work, we propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts. We associate each branch with a visual concept and derive a compact concept state by selecting a few local descriptors through an attention module. These concept states are then updated by graph-based interaction and used to adaptively modulate the local descriptors. We describe our proposed model by split-transform-attend-interact-modulate-merge stages, which are implemented by opting for a highly modularized architecture. Extensive experiments on visual recognition tasks such as image classification, semantic segmentation, object detection, scene recognition, and action recognition show that our proposed model, VCRNet, consistently improves performance while increasing the number of parameters by less than 1%.
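A minimal sketch of the attention step that pools a branch's local descriptors into one compact concept state; the learned per-branch query vector is our assumption:

```python
import torch
import torch.nn.functional as F

def concept_state(descriptors, query):
    # descriptors: (N, D) local descriptors from one branch; query: (D,).
    weights = F.softmax(descriptors @ query, dim=0)          # (N,)
    return (weights.unsqueeze(1) * descriptors).sum(dim=0)   # (D,)
```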

39. Measurement-driven Security Analysis of Imperceptible Impersonation Attacks [PDF] Back to Contents
  Shasha Li, Karim Khalil, Rameswar Panda, Chengyu Song, Srikanth V. Krishnamurthy, Amit K. Roy-Chowdhury, Ananthram Swami
Abstract: The emergence of the Internet of Things (IoT) brings about new security challenges at the intersection of cyber and physical spaces. One prime example is the vulnerability of Face Recognition (FR) based access control in IoT systems. While previous research has shown that Deep Neural Network (DNN)-based FR systems (FRS) are potentially susceptible to imperceptible impersonation attacks, the potency of such attacks in a wide set of scenarios has not been thoroughly investigated. In this paper, we present the first systematic, wide-ranging measurement study of the exploitability of DNN-based FR systems using a large-scale dataset. We find that arbitrary impersonation attacks, wherein an arbitrary attacker impersonates an arbitrary target, are hard if imperceptibility is an auxiliary goal. Specifically, we show that factors such as skin color, gender, and age impact the ability to carry out an attack on a specific target victim, to different extents. We also study the feasibility of constructing universal attacks that are robust to different poses or views of the attacker's face. Our results show that finding a universal perturbation is a much harder problem from the attacker's perspective. Finally, we find that the perturbed images do not generalize well across different DNN models. This suggests security countermeasures that can dramatically reduce the exploitability of DNN-based FR systems.

40. Learning Global Structure Consistency for Robust Object Tracking [PDF] Back to Contents
  Bi Li, Chengquan Zhang, Zhibin Hong, Xu Tang, Jingtuo Liu, Junyu Han, Errui Ding, Wenyu Liu
Abstract: Fast appearance variations and the distractions of similar objects are two of the most challenging problems in visual object tracking. Unlike many existing trackers that focus on modeling only the target, in this work we consider the \emph{transient variations of the whole scene}. The key insight is that the object correspondence and spatial layout of the whole scene are consistent (i.e., global structure consistency) in consecutive frames, which helps to disambiguate the target from distractors. Moreover, modeling transient variations enables localizing the target under fast variations. Specifically, we propose an effective and efficient short-term model that learns to exploit the global structure consistency in a short time and thus can handle fast variations and distractors. Since short-term modeling falls short of handling occlusion and out-of-view cases, we adopt the long-short term paradigm and use a long-term model that corrects the short-term model when it drifts away from the target or when the target is not present. These two components are carefully combined to achieve the balance of stability and plasticity during tracking. We empirically verify that the proposed tracker can tackle the two challenging scenarios and validate it on large-scale benchmarks. Remarkably, our tracker improves state-of-the-art performance on VOT2018 from 0.440 to 0.460, GOT-10k from 0.611 to 0.640, and NFS from 0.619 to 0.629.

41. Large Scale Photometric Bundle Adjustment [PDF] Back to Contents
  Oliver J. Woodford, Edward Rosten
Abstract: Direct methods have shown promise on visual odometry and SLAM, leading to greater accuracy and robustness over feature-based methods. However, offline 3-d reconstruction from internet images has not yet benefited from a joint, photometric optimization over dense geometry and camera parameters. Issues such as the lack of brightness constancy, and the sheer volume of data, make this a more challenging task. This work presents a framework for jointly optimizing millions of scene points and hundreds of camera poses and intrinsics, using a photometric cost that is invariant to local lighting changes. The improvement in metric reconstruction accuracy that it confers over feature-based bundle adjustment is demonstrated on the large-scale Tanks & Temples benchmark. We further demonstrate qualitative reconstruction improvements on an internet photo collection, with challenging diversity in lighting and camera intrinsics.
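One common way to make a photometric cost invariant to local (affine) lighting changes is to compare zero-mean, unit-norm patches; a sketch under that assumption, not necessarily the paper's exact cost:

```python
import numpy as np

def photometric_residual(patch_ref, patch_tgt, eps=1e-8):
    # Subtracting the mean cancels a local bias; normalizing cancels a
    # local gain, so affine lighting changes drop out of the residual.
    a = patch_ref - patch_ref.mean()
    b = patch_tgt - patch_tgt.mean()
    a = a / (np.linalg.norm(a) + eps)
    b = b / (np.linalg.norm(b) + eps)
    return a - b  # residual vector fed to the bundle adjustment solver
```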

42. Self-Supervised Human Activity Recognition by Augmenting Generative Adversarial Networks [PDF] Back to Contents
  Mohammad Zaki Zadeh, Ashwin Ramesh Babu, Ashish Jaiswal, Fillia Makedon
Abstract: This article proposes a novel approach for augmenting a generative adversarial network (GAN) with a self-supervised task in order to improve its ability to encode video representations that are useful in downstream tasks such as human activity recognition. In the proposed method, input video frames are randomly transformed by different spatial transformations, such as rotation, translation and shearing, or temporal transformations such as shuffling the temporal order of frames. The discriminator is then encouraged to predict the applied transformation by introducing an auxiliary loss. Subsequently, results prove the superiority of the proposed method over baseline methods for providing a useful representation of videos used in human activity recognition, performed on datasets such as KTH, UCF101 and Ball-Drop. The Ball-Drop dataset is specifically designed for measuring executive functions in children through physically and cognitively demanding tasks. Using features from the proposed method instead of the baseline methods increased the top-1 classification accuracy by more than 4%. Moreover, an ablation study was performed to investigate the contribution of different transformations to the downstream task.
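A minimal sketch of the discriminator objective with the auxiliary transformation-prediction loss; the adversarial form, the transformation label set, and the weighting are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(real_logit, fake_logit, transform_logits,
                       transform_ids, aux_weight=1.0):
    # Standard real/fake terms plus an auxiliary head that must recognize
    # which transformation (rotation, translation, shearing, frame
    # shuffling, ...) was applied to the input clip.
    adv = (F.binary_cross_entropy_with_logits(real_logit,
                                              torch.ones_like(real_logit))
           + F.binary_cross_entropy_with_logits(fake_logit,
                                                torch.zeros_like(fake_logit)))
    aux = F.cross_entropy(transform_logits, transform_ids)
    return adv + aux_weight * aux
```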

43. learn2learn: A Library for Meta-Learning Research [PDF] Back to Contents
  Sébastien M. R. Arnold, Praateek Mahajan, Debajyoti Datta, Ian Bunner, Konstantinos Saitas Zarkias
Abstract: Meta-learning researchers face two fundamental issues in their empirical work: prototyping and reproducibility. Researchers are prone to make mistakes when prototyping new algorithms and tasks because modern meta-learning methods rely on unconventional functionalities of machine learning frameworks. In turn, reproducing existing results becomes a tedious endeavour -- a situation exacerbated by the lack of standardized implementations and benchmarks. As a result, researchers spend inordinate amounts of time on implementing software rather than understanding and developing new ideas. This manuscript introduces learn2learn, a library for meta-learning research focused on solving those prototyping and reproducibility issues. learn2learn provides low-level routines common across a wide-range of meta-learning techniques (e.g. meta-descent, meta-reinforcement learning, few-shot learning), and builds standardized interfaces to algorithms and benchmarks on top of them. In releasing learn2learn under a free and open source license, we hope to foster a community around standardized software for meta-learning research.
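For flavor, a minimal MAML-style inner/outer loop written against learn2learn's documented high-level API; the model, data, and hyperparameters are dummy placeholders:

```python
import torch
import torch.nn.functional as F
import learn2learn as l2l

model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 2))
maml = l2l.algorithms.MAML(model, lr=0.1)           # inner-loop step size
opt = torch.optim.Adam(maml.parameters(), lr=1e-3)  # outer-loop optimizer

xs, ys = torch.randn(5, 10), torch.randint(0, 2, (5,))  # support set
xq, yq = torch.randn(5, 10), torch.randint(0, 2, (5,))  # query set

learner = maml.clone()                            # differentiable task copy
learner.adapt(F.cross_entropy(learner(xs), ys))   # inner-loop update
meta_loss = F.cross_entropy(learner(xq), yq)      # evaluate the adaptation
opt.zero_grad()
meta_loss.backward()
opt.step()
```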

44. New Normal: Cooperative Paradigm for Covid-19 Timely Detection and Containment using Internet of Things and Deep Learning [PDF] Back to Contents
  Farooque Hassan Kumbhar, Syed Ali Hassan, Soo Young Shin
Abstract: The spread of the novel coronavirus (COVID-19) has caused trillions of dollars in damage to governments and health authorities by affecting the global economies. The purpose of this study is to introduce a connected smart paradigm that not only detects the possible spread of viruses but also helps to restart businesses/economies and resume social life. We propose a connected Internet of Things (IoT) based paradigm that makes use of object detection based on convolutional neural networks (CNNs), smart wearables and connected e-health to avoid current and future outbreaks. First, connected surveillance cameras feed a continuous video stream to a server where we detect the inter-object distance to identify any social distancing violations. A violation activates area-based monitoring of active smartphone users and their current state of illness. In case a confirmed patient or a person with severe symptoms is present, the system tracks exposed and infected people and appropriate measures are put into action. We evaluated the proposed scheme for social distancing violation detection using YOLO (you only look once) v2 and v3, and for infection spread tracing using Python simulation.
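A minimal sketch of the violation check on YOLO person detections, using box centroids and a pixel threshold; a calibrated pixel-to-metre mapping is omitted here:

```python
import itertools
import numpy as np

def distancing_violations(person_boxes, min_dist):
    # person_boxes: list of (x1, y1, x2, y2); flag pairs whose centroids
    # are closer than min_dist pixels.
    centroids = [((x1 + x2) / 2.0, (y1 + y2) / 2.0)
                 for x1, y1, x2, y2 in person_boxes]
    return [(i, j)
            for (i, a), (j, b) in itertools.combinations(enumerate(centroids), 2)
            if np.hypot(a[0] - b[0], a[1] - b[1]) < min_dist]
```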

45. Propensity-to-Pay: Machine Learning for Estimating Prediction Uncertainty [PDF] Back to Contents
  Md Abul Bashar, Astin-Walmsley Kieren, Heath Kerina, Richi Nayak
Abstract: Predicting a customer's propensity-to-pay at an early point in the revenue cycle can provide organisations many opportunities to improve the customer experience, reduce hardship and reduce the risk of impaired cash flow and occurrence of bad debt. With the advancements in data science, machine learning techniques can be used to build models that accurately predict a customer's propensity-to-pay. Creating effective machine learning models without access to large and detailed datasets presents some significant challenges. This paper presents a case study, conducted on a dataset from an energy organisation, to explore the uncertainty around the creation of machine learning models that are able to predict residential customers entering financial hardship, which then reduces their ability to pay energy bills. Incorrect predictions can result in inefficient resource allocation and vulnerable customers not being proactively identified. This study investigates machine learning models' ability to consider different contexts and estimate the uncertainty in their predictions. Seven models from four families of machine learning algorithms are investigated for their novel utilisation. A novel concept of applying a Bayesian Neural Network to the binary classification problem of propensity-to-pay energy bills is proposed and explored for deployment.
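A minimal sketch of uncertainty estimation by repeated stochastic forward passes; whether this matches the paper's exact Bayesian inference scheme is an assumption on our part:

```python
import torch

@torch.no_grad()
def propensity_with_uncertainty(stochastic_net, x, n_samples=50):
    # stochastic_net must draw fresh weights (or dropout masks) per call;
    # the spread of the sampled propensity-to-pay scores serves as the
    # prediction uncertainty.
    draws = torch.stack([torch.sigmoid(stochastic_net(x))
                         for _ in range(n_samples)])
    return draws.mean(dim=0), draws.std(dim=0)
```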

46. Meta-Learning with Shared Amortized Variational Inference [PDF] Back to Contents
  Ekaterina Iakovleva, Jakob Verbeek, Karteek Alahari
Abstract: We propose a novel amortized variational inference scheme for an empirical Bayes meta-learning model, where model parameters are treated as latent variables. We learn the prior distribution over model parameters conditioned on limited training data using a variational autoencoder approach. Our framework proposes sharing the same amortized inference network between the conditional prior and variational posterior distributions over the model parameters. While the posterior leverages both the labeled support and query data, the conditional prior is based only on the labeled support data. We show that in earlier work, relying on Monte-Carlo approximation, the conditional prior collapses to a Dirac delta function. In contrast, our variational approach prevents this collapse and preserves uncertainty over the model parameters. We evaluate our approach on the miniImageNet, CIFAR-FS and FC100 datasets, and present results demonstrating its advantages over previous work.
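One standard empirical-Bayes form consistent with this description (support set $S$, query set $Q$, latent model parameters $\theta$), though not necessarily the paper's exact objective, is the per-task ELBO

$$\mathcal{L} = \mathbb{E}_{q(\theta \mid S, Q)}\big[\log p(Y_Q \mid X_Q, \theta)\big] - \mathrm{KL}\big(q(\theta \mid S, Q) \,\|\, p(\theta \mid S)\big),$$

where a single shared amortized network parameterizes both the variational posterior $q(\theta \mid S, Q)$ and the conditional prior $p(\theta \mid S)$.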

47. A survey on applications of augmented, mixed and virtual reality for nature and environment [PDF] Back to Contents
  Jason Rambach, Gergana Lilligreen, Alexander Schäfer, Ramya Bankanal, Alexander Wiebel, Didier Stricker
Abstract: Augmented reality (AR), virtual reality (VR) and mixed reality (MR) are technologies of great potential due to the engaging and enriching experiences they are capable of providing. Their use is rapidly increasing in diverse fields such as medicine, manufacturing or entertainment. However, the possibilities that AR, VR and MR offer in the area of environmental applications are not yet widely explored. In this paper we present the outcome of a survey meant to discover and classify existing AR/VR/MR applications that can benefit the environment or increase awareness on environmental issues. We performed an exhaustive search over several online publication access platforms and past proceedings of major conferences in the fields of AR/VR/MR. Identified relevant papers were filtered based on novelty, technical soundness, impact and topic relevance, and classified into different categories. Referring to the selected papers, we discuss how the applications of each category are contributing to environmental protection, preservation and sensitization purposes. We further analyse these approaches as well as possible future directions in the scope of existing and upcoming AR/VR/MR enabling technologies.

48. Mixed Noise Removal with Pareto Prior [PDF] Back to Contents
  Zhou Liu, Lei Yu, Gui-Song Xia, Hong Sun
Abstract: Denoising images contaminated by a mixture of additive white Gaussian noise (AWGN) and impulse noise (IN) is an essential but challenging problem. The presence of impulsive disturbances inevitably affects the distribution of noises and thus largely degrades the performance of traditional AWGN denoisers. Existing methods compensate for the effects of IN by introducing a weighting matrix, which, however, lacks a proper prior and is thus hard to estimate accurately. To address this problem, we exploit the Pareto distribution as the prior of the weighting matrix, based on which an accurate and robust weight estimator is proposed for mixed noise removal. In particular, a relatively small portion of pixels are assumed to be contaminated with IN; these should have weights with small values and then be penalized out. This phenomenon can be properly described by the Pareto distribution of type 1. Therefore, armed with the Pareto distribution, we formulate the problem of mixed noise removal in the Bayesian framework, where the nonlocal self-similarity prior is further exploited by adopting nonlocal low-rank approximation. Compared to existing methods, the proposed method can estimate the weighting matrix adaptively, accurately and robustly for different levels of noise, thus boosting the denoising performance. Experimental results on widely used image datasets demonstrate the superiority of our proposed method over the state of the art.
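For reference, the type-1 Pareto density over a weight $w$ with scale $w_m > 0$ and shape $\alpha > 0$ is

$$p(w \mid \alpha, w_m) = \frac{\alpha\, w_m^{\alpha}}{w^{\alpha + 1}}, \qquad w \ge w_m,$$

a monotonically decreasing density on $[w_m, \infty)$ with a heavy upper tail; how the paper maps the pixel weights onto this support is not detailed in the abstract.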

49. Unsupervised MRI Super-Resolution using Deep External Learning and Guided Residual Dense Network with Multimodal Image Priors [PDF] Back to Contents
  Yutaro Iwamoto, Kyohei Takeda, Yinhao Li, Akihiko Shiino, Yen-Wei Chen
Abstract: Deep learning techniques have led to state-of-the-art single image super-resolution (SISR) with natural images. Pairs of high-resolution (HR) and low-resolution (LR) images are used to train the deep learning model (mapping function). These techniques have also been applied to medical image super-resolution (SR). Compared with natural images, medical images have several unique characteristics. First, there are no HR images for training in real clinical applications because of the limitations of imaging systems and clinical requirements. Second, other modal HR images are available (e.g., HR T1-weighted images are available for enhancing LR T2-weighted images). In this paper, we propose an unsupervised SISR technique based on simple prior knowledge of the human anatomy; this technique does not require HR images for training. Furthermore, we present a guided residual dense network, which incorporates a residual dense network with a guided deep convolutional neural network for enhancing the resolution of LR images by referring to different HR images of the same subject. Experiments on a publicly available brain MRI database showed that our proposed method achieves better performance than the state-of-the-art methods.

50. Lymph Node Gross Tumor Volume Detection and Segmentation via Distance-based Gating using 3D CT/PET Imaging in Radiotherapy [PDF] Back to Contents
  Zhuotun Zhu, Dakai Jin, Ke Yan, Tsung-Ying Ho, Xianghua Ye, Dazhou Guo, Chun-Hung Chao, Jing Xiao, Alan Yuille, Le Lu
Abstract: Finding, identifying and segmenting suspicious, cancer-metastasized lymph nodes from 3D multi-modality imaging is a clinical task of paramount importance. In radiotherapy, these nodes are referred to as the Lymph Node Gross Tumor Volume (GTVLN). Determining and delineating the spread of GTVLN is essential in defining the corresponding resection and irradiation regions for the downstream workflows of surgical resection and radiotherapy of various cancers. In this work, we propose an effective distance-based gating approach to simulate and simplify the high-level reasoning protocols conducted by radiation oncologists, in a divide-and-conquer manner. GTVLN is divided into two subgroups, tumor-proximal and tumor-distal, by means of binary or soft distance gating. This is motivated by the observation that each category can have distinct, though overlapping, distributions of appearance, size and other LN characteristics. A novel multi-branch detection-by-segmentation network is trained with each branch specializing in learning one GTVLN category's features, and the outputs from the multiple branches are fused in inference. The proposed method is evaluated on an in-house dataset of $141$ esophageal cancer patients with both PET and CT imaging modalities. Our results validate significant improvements of the mean recall, from $72.5\%$ to $78.2\%$, as compared to previous state-of-the-art work. The highest achieved GTVLN recall of $82.5\%$ at $20\%$ precision is clinically relevant and valuable, since human observers tend to have low sensitivity (around $80\%$ for the most experienced radiation oncologists, as reported in the literature).
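A minimal sketch of the soft variant of the distance gating; the cutoff and softness values are illustrative, not the paper's:

```python
import torch

def soft_distance_gate(node_tumor_dist, d0=50.0, tau=10.0):
    # Soft-gate lymph node candidates into tumor-proximal vs tumor-distal
    # branches by their 3D distance to the primary tumor (e.g., in mm).
    w_proximal = torch.sigmoid((d0 - node_tumor_dist) / tau)
    return w_proximal, 1.0 - w_proximal  # per-candidate branch weights
```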

51. Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra [PDF] Back to Contents
  Vardan Papyan
Abstract: Numerous researchers recently applied empirical spectral analysis to the study of modern deep learning classifiers. We identify and discuss an important formal class/cross-class structure and show how it lies at the origin of the many visually striking features observed in deepnet spectra, some of which were reported in recent articles, while others are unveiled here for the first time. These include spectral outliers, "spikes", and small but distinct continuous distributions, "bumps", often seen beyond the edge of a "main bulk". The significance of the cross-class structure is illustrated in three ways: (i) we prove that the ratio of outliers to bulk in the spectrum of the Fisher information matrix is predictive of misclassification, in the context of multinomial logistic regression; (ii) we demonstrate how, gradually with depth, a network is able to separate class-distinctive information from class variability, all while orthogonalizing the class-distinctive information; and (iii) we propose a correction to KFAC, a well-known second-order optimization algorithm for training deepnets.

52. DeepPrognosis: Preoperative Prediction of Pancreatic Cancer Survival and Surgical Margin via Contrast-Enhanced CT Imaging [PDF] Back to Contents
  Jiawen Yao, Yu Shi, Le Lu, Jing Xiao, Ling Zhang
Abstract: Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers and carries a dismal prognosis. Surgery remains the best chance of a potential cure for patients who are eligible for initial resection of PDAC. However, outcomes vary significantly even among resected patients of the same stage who received similar treatments. Accurate preoperative prognosis of resectable PDACs for personalized treatment is thus highly desired. Nevertheless, there are no automated methods yet that fully exploit contrast-enhanced computed tomography (CE-CT) imaging for PDAC. Tumor attenuation changes across different CT phases can reflect the tumor's internal stromal fractions and the vascularization of individual tumors, which may impact clinical outcomes. In this work, we propose a novel deep neural network for the survival prediction of resectable PDAC patients, named 3D Contrast-Enhanced Convolutional Long Short-Term Memory network (CE-ConvLSTM), which can derive the tumor attenuation signatures or patterns from CE-CT imaging studies. We present a multi-task CNN to accomplish both tasks of outcome and margin prediction, where the network benefits from learning the tumor resection margin related features to improve survival prediction. The proposed framework can improve prediction performance compared with existing state-of-the-art survival analysis approaches. The tumor signature built from our model evidently adds value when combined with the existing clinical staging system.

53. Domain-Adversarial Learning for Multi-Centre, Multi-Vendor, and Multi-Disease Cardiac MR Image Segmentation [PDF] Back to Contents
  Cian M. Scannell, Amedeo Chiribiri, Mitko Veta
Abstract: Cine cardiac magnetic resonance (CMR) has become the gold standard for the non-invasive evaluation of cardiac function. In particular, it allows the accurate quantification of functional parameters, including the chamber volumes and ejection fraction. Deep learning has shown the potential to automate the requisite cardiac structure segmentation. However, the lack of robustness of deep learning models has hindered their widespread clinical adoption. Due to differences in the data characteristics, neural networks trained on data from a specific scanner are not guaranteed to generalise well to data acquired at a different centre or with a different scanner. In this work, we propose a principled solution to the problem of this domain shift. Domain-adversarial learning is used to train a domain-invariant 2D U-Net using labelled and unlabelled data. This approach is evaluated on both seen and unseen domains from the M\&Ms challenge dataset, and the domain-adversarial approach shows improved performance compared to standard training. Additionally, we show that the domain information cannot be recovered from the learned features.
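Domain-adversarial training of this kind typically hinges on a gradient reversal layer between the feature extractor and the domain discriminator; a standard PyTorch version (the abstract does not state the paper's exact mechanism):

```python
import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the
    # backward pass, so the segmentation encoder learns features that
    # fool the domain discriminator.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```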

54. Appropriateness of Performance Indices for Imbalanced Data Classification: An Analysis [PDF] Back to Contents
  Sankha Subhra Mullick, Shounak Datta, Sourish Gunesh Dhekane, Swagatam Das
Abstract: Indices quantifying the performance of classifiers under class imbalance often suffer from distortions depending on the constitution of the test set or the class-specific classification accuracy, creating difficulties in assessing the merit of the classifier. We identify two fundamental conditions that a performance index must satisfy to be resilient, respectively, to changes in the number of test instances from each class and in the number of classes in the test set. In light of these conditions, under the effect of class imbalance, we theoretically analyze four indices commonly used for evaluating binary classifiers and five popular indices for multi-class classifiers. For indices violating any of the conditions, we also suggest remedial modifications and normalization. We further investigate the capability of the indices to retain information about the classification performance over all the classes, even when the classifier exhibits extreme performance on some classes. Simulation studies are performed on high-dimensional deep representations of a subset of the ImageNet dataset using four state-of-the-art classifiers tailored for handling class imbalance. Finally, based on our theoretical findings and empirical evidence, we recommend the appropriate indices that should be used to evaluate the performance of classifiers in the presence of class imbalance.
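A concrete illustration of the distortion at stake: under a 95:5 test split, plain accuracy rewards the trivial majority-class predictor, while an index that averages per-class recalls does not.

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # always predict the majority class

accuracy = (y_pred == y_true).mean()                               # 0.95
per_class_recall = [(y_pred[y_true == c] == c).mean() for c in (0, 1)]
balanced_accuracy = float(np.mean(per_class_recall))               # 0.50
```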

Note: The cover image is a word cloud of the paper titles.