Abstracts

1. Understanding Human Hands in Contact at Internet Scale
Dandan Shan, Jiaqi Geng, Michelle Shu, David F. Fouhey
Abstract: Hands are the central means by which humans manipulate their world, and being able to reliably extract hand state information from Internet videos of humans using their hands has the potential to pave the way to systems that can learn from petabytes of video data. This paper proposes steps towards this by inferring a rich representation of hands engaged in interaction that includes hand location, side, contact state, and a box around the object in contact. To support this effort, we gather a large-scale dataset of hands in contact with objects consisting of 131 days of footage as well as a 100K annotated hand-contact video frame dataset. The model learned on this dataset can serve as a foundation for hand-contact understanding in videos. We quantitatively evaluate it both on its own and in service of predicting and learning from 3D meshes of human hands.

2. Disentangled Non-Local Neural Networks
Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, Han Hu
Abstract: The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, such as semantic segmentation on Cityscapes, ADE20K and PASCAL Context, object detection on COCO, and action recognition on Kinetics.
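The two-term split described in this abstract can be illustrated with a small sketch. It assumes that "whitened" means subtracting the mean query and mean key before taking dot products, that the unary term is a score computed from each key alone, and that each term gets its own softmax before the two are summed. The function names and the `unary_w` weight vector are hypothetical, and the paper's exact formulation may differ.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def disentangled_attention(queries, keys, unary_w):
    """Return attn[i][j] = pairwise[i][j] + unary[j] (assumed decoupled form)."""
    mu_q, mu_k = mean_vec(queries), mean_vec(keys)
    # Whitening (assumed): remove the mean so only relative differences remain.
    q_c = [[x - m for x, m in zip(q, mu_q)] for q in queries]
    k_c = [[x - m for x, m in zip(k, mu_k)] for k in keys]
    # Unary term: one saliency score per key position, shared by every query.
    unary = softmax([dot(unary_w, k) for k in keys])
    attn = []
    for q in q_c:
        # Pairwise term: depends on both the query pixel and the key pixel.
        pairwise = softmax([dot(q, k) for k in k_c])
        attn.append([p + u for p, u in zip(pairwise, unary)])
    return attn

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = disentangled_attention(features, features, unary_w=[0.5, -0.5])
```

Because each term is normalized separately, every row of `attention` sums to 2 in this sketch, and the unary contribution is identical for all query positions.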

3. VirTex: Learning Visual Representations from Textual Annotations
Karan Desai, Justin Johnson
Abstract: The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.

4. Quasi-Dense Instance Similarity Learning
Jiangmiao Pang, Linlu Qiu, Haofeng Chen, Qi Li, Trevor Darrell, Fisher Yu
Abstract: Similarity metrics for instances have drawn much attention, due to their importance for computer vision problems such as object tracking. However, existing methods regard object similarity learning as a post-hoc stage after object detection and only use sparse ground truth matching as the training objective. This process ignores the majority of the regions on the images. In this paper, we present a simple yet effective quasi-dense matching method to learn instance similarity from hundreds of region proposals in a pair of images. In the resulting feature space, a simple nearest neighbor search can distinguish different instances without bells and whistles. When applied to joint object detection and tracking, our method can outperform existing methods without using location or motion heuristics, yielding almost 10 points higher MOTA on BDD100K and Waymo tracking datasets. Our method is also competitive on one-shot object detection, which further shows the effectiveness of quasi-dense matching for category-level metric learning. The code will be available at this https URL.
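The "simple nearest neighbor search" in the learned feature space might look like the sketch below. The abstract does not specify the similarity measure or any threshold, so cosine similarity and the `min_sim` cutoff are assumptions, and `match_instances` is a hypothetical helper rather than the authors' code.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def match_instances(prev_embeds, curr_embeds, min_sim=0.5):
    """Assign each current detection to its most similar previous instance.

    prev_embeds: dict mapping track id -> embedding from the previous frame
    curr_embeds: dict mapping detection id -> embedding from the current frame
    Returns a dict mapping detection id -> track id, or None if nothing is
    similar enough (assumed threshold).
    """
    matches = {}
    for det_id, emb in curr_embeds.items():
        best_id, best_sim = None, min_sim
        for track_id, ref in prev_embeds.items():
            sim = cosine(emb, ref)
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        matches[det_id] = best_id
    return matches

previous = {"car": [1.0, 0.0], "person": [0.0, 1.0]}
current = {0: [0.9, 0.1], 1: [0.1, 0.8], 2: [-1.0, -1.0]}
assignments = match_instances(previous, current)
```

No location or motion cues enter the matching here; only appearance embeddings are compared, which mirrors the claim in the abstract.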

5. Robust Multi-object Matching via Iterative Reweighting of the Graph Connection Laplacian
Yunpeng Shi, Shaohan Li, Gilad Lerman
Abstract: We propose an efficient and robust iterative solution to the multi-object matching problem. We first clarify serious limitations of current methods as well as the inappropriateness of the standard iteratively reweighted least squares procedure. In view of these limitations, we propose a novel and more reliable iterative reweighting strategy that incorporates information from higher-order neighborhoods by exploiting the graph connection Laplacian. We demonstrate the superior performance of our procedure over state-of-the-art methods using both synthetic and real datasets.

6. Exploring Weaknesses of VQA Models through Attribution Driven Insights
Shaunak Halbe
Abstract: Deep neural networks have been used successfully for Visual Question Answering (VQA) over the past few years, owing to the availability of relevant large-scale datasets. However, these datasets are created in artificial settings and rarely reflect real-world scenarios. Recent research effectively applies these VQA models to answering visual questions for the blind. Despite achieving high accuracy, these models appear to be susceptible to variation in input questions. We analyze popular VQA models through the lens of attribution (the input's influence on predictions) to gain valuable insights. Further, we use these insights to craft adversarial attacks that inflict significant damage on these systems with negligible change in the meaning of the input questions. We believe this will enhance the development of systems that are more robust to possible variations in inputs when deployed to assist the visually impaired.

7. Privacy-Preserving Visual Feature Descriptors through Adversarial Affine Subspace Embedding
Mihai Dusmanu, Johannes L. Schönberger, Sudipta N. Sinha, Marc Pollefeys
Abstract: Many computer vision systems require users to upload image features to the cloud for processing and storage. Such features can be exploited to recover sensitive information about the scene or subjects, e.g., by reconstructing the appearance of the original image. To address this privacy concern, we propose a new privacy-preserving feature representation. The core idea of our work is to drop constraints from each feature descriptor by embedding it within an affine subspace containing the original feature as well as one or more adversarial feature samples. Feature matching on the privacy-preserving representation is enabled based on the notion of subspace-to-subspace distance. We experimentally demonstrate the effectiveness of our method and its high practical relevance for applications such as crowd-sourced 3D scene reconstruction and face authentication. Compared to the original features, our approach has only marginal impact on performance but makes it significantly more difficult for an adversary to recover private information.
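To make the subspace-to-subspace matching idea concrete, here is a minimal sketch for the simplest case: each descriptor is hidden in a one-dimensional affine subspace (a line) through the original descriptor and a single adversarial sample. Restricting to lines, and the helper names `make_line` and `line_distance`, are assumptions for illustration; the abstract allows one or more adversarial samples and does not give the distance formula.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_line(descriptor, adversarial):
    # Represent the line by a point on it and a direction vector.
    direction = [a - d for a, d in zip(adversarial, descriptor)]
    return descriptor, direction

def line_distance(line1, line2, eps=1e-12):
    """Minimum Euclidean distance between two lines in R^n."""
    (p1, u1), (p2, u2) = line1, line2
    w = [x - y for x, y in zip(p1, p2)]
    a, b, c = dot(u1, u1), dot(u1, u2), dot(u2, u2)
    d, e = dot(u1, w), dot(u2, w)
    denom = a * c - b * b
    if denom < eps:
        # Parallel lines: fix s = 0 and project onto the second line.
        s, t = 0.0, (e / c if c > eps else 0.0)
    else:
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
    diff = [wi + s * x - t * y for wi, x, y in zip(w, u1, u2)]
    return math.sqrt(dot(diff, diff))

# Two embeddings of the same descriptor with different adversarial samples.
shared = [1.0, 2.0, 3.0]
line_a = make_line(shared, [4.0, 0.0, 1.0])
line_b = make_line(shared, [0.0, 5.0, 2.0])
same_feature_distance = line_distance(line_a, line_b)
```

Lines built from the same original descriptor intersect, so their distance is zero even though the original point is not revealed by either line alone. Lines built from different descriptors are generally some positive distance apart.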

8. SLIC-UAV: A Method for monitoring recovery in tropical restoration projects through identification of signature species using UAVs
Jonathan Williams, Carola-Bibiane Schönlieb, Tom Swinfield, Bambang Irawan, Eva Achmad, Muhammad Zudhi, Habibi, Elva Gemita, David A. Coomes
Abstract: Logged forests cover four million square kilometres of the tropics and restoring these forests is essential if we are to avoid the worst impacts of climate change, yet monitoring recovery is challenging. Tracking the abundance of visually identifiable, early-successional species enables successional status and thereby restoration progress to be evaluated. Here we present a new pipeline, SLIC-UAV, for processing Unmanned Aerial Vehicle (UAV) imagery to map early-successional species in tropical forests. The pipeline is novel because it comprises: (a) a time-efficient approach for labelling crowns from UAV imagery; (b) machine learning of species based on spectral and textural features within individual tree crowns, and (c) automatic segmentation of orthomosaiced UAV imagery into 'superpixels', using Simple Linear Iterative Clustering (SLIC). Creating superpixels reduces the dataset's dimensionality and focuses prediction onto clusters of pixels, greatly improving accuracy. To demonstrate SLIC-UAV, support vector machines and random forests were used to predict the species of hand-labelled crowns in a restoration concession in Indonesia. Random forests were most accurate at discriminating species for whole crowns, with accuracy ranging from 79.3% when mapping five common species, to 90.5% when mapping the three most visually-distinctive species. In contrast, support vector machines proved better for labelling automatically segmented superpixels, with accuracy ranging from 74.3% to 91.7% for the same species. Models were extended to map species across 100 hectares of forest. The study demonstrates the power of SLIC-UAV for mapping characteristic early-successional tree species as an indicator of successional stage within tropical forest restoration areas. Continued effort is needed to develop easy-to-implement and low-cost technology to improve the affordability of project management.

9. MatchGAN: A Self-Supervised Semi-Supervised Conditional Generative Adversarial Network
Jiaze Sun, Binod Bhattarai, Tae-Kyun Kim
Abstract: We propose a novel self-supervised semi-supervised learning approach for conditional Generative Adversarial Networks (GANs). Unlike previous self-supervised learning approaches which define pretext tasks by performing augmentations on the image space such as applying geometric transformations or predicting relationships between image patches, our approach leverages the label space. We train our network to learn the distribution of the source domain using the few labelled examples available by uniformly sampling source labels and assigning them as target labels for unlabelled examples from the same distribution. The translated images on the side of the generator are then grouped into positive and negative pairs by comparing their corresponding target labels, which are then used to optimise an auxiliary triplet objective on the discriminator's side. We tested our method on two challenging benchmarks, CelebA and RaFD, and evaluated the results using standard metrics including Frechet Inception Distance, Inception Score, and Attribute Classification Rate. Extensive empirical evaluation demonstrates the effectiveness of our proposed method over competitive baselines and existing methods. In particular, our method is able to surpass the baseline with only 20% of the labelled examples used to train the baseline.
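The pairing step described in this abstract, where translated images are grouped by comparing target labels and then fed to a triplet objective, can be sketched as follows. The standard margin-based triplet loss on squared Euclidean distance is used here as an assumption; the abstract does not state the exact loss, and `build_triplets` and the string labels are hypothetical.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Standard hinge form: pull the positive closer than the negative by a margin.
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def build_triplets(target_labels):
    """Enumerate (anchor, positive, negative) index triples.

    Two samples form a positive pair if their target labels are equal and a
    negative pair otherwise.
    """
    triplets = []
    n = len(target_labels)
    for i in range(n):
        for j in range(n):
            if i == j or target_labels[i] != target_labels[j]:
                continue
            for k in range(n):
                if target_labels[k] != target_labels[i]:
                    triplets.append((i, j, k))
    return triplets

# Hypothetical discriminator features for three translated images.
labels = ["smile", "smile", "neutral"]
features = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
triplets = build_triplets(labels)
mean_loss = sum(
    triplet_loss(features[i], features[j], features[k]) for i, j, k in triplets
) / len(triplets)
```

The supervision here comes only from the sampled target labels, not from annotations of the unlabelled images, which is the sense in which the approach leverages the label space.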

10. Improving Deep Metric Learning with Virtual Classes and Examples Mining
Pierre Jacob, David Picard, Aymeric Histace, Edouard Klein
Abstract: In deep metric learning, the training procedure relies on sampling informative tuples. However, as the training procedure progresses, it becomes nearly impossible to sample relevant hard negative examples without proper mining strategies or generation-based methods. Recent work on hard negative generation has shown great promise for solving the mining problem. However, this generation process is difficult to tune and often leads to incorrectly labelled examples. To tackle this issue, we introduce MIRAGE, a generation-based method that relies on virtual classes entirely composed of generated examples that act as buffer areas between the training classes. We empirically show that virtual classes significantly improve the results on popular datasets (Cub-200-2011, Cars-196 and Stanford Online Products) compared to other generation methods.

11. What makes instance discrimination good for transfer learning?
Nanxuan Zhao, Zhirong Wu, Rynson W.H. Lau, Stephen Lin
Abstract: Unsupervised visual pretraining based on the instance discrimination pretext task has shown significant progress. Notably, in the recent work of MoCo, unsupervised pretraining has been shown to surpass its supervised counterpart when finetuning for downstream applications such as object detection on PASCAL VOC. It comes as a surprise that image annotations would be better left unused for transfer learning. In this work, we investigate the following problems: What makes instance discrimination pretraining good for transfer learning? What knowledge is actually learned and transferred from unsupervised pretraining? From this understanding of unsupervised pretraining, can we make supervised pretraining great again? Our findings are threefold. First, what truly matters for this detection transfer is low-level and mid-level representations, not high-level representations. Second, the intra-category invariance enforced by the traditional supervised model weakens transferability by increasing task misalignment. Finally, supervised pretraining can be strengthened by following an exemplar-based approach without explicit constraints among the instances within the same category.

12. Spectral Image Segmentation with Global Appearance Modeling
Jeova F. S. Rocha Neto, Pedro F. Felzenszwalb
Abstract: We introduce a new spectral method for image segmentation that incorporates long range relationships for global appearance modeling. The approach combines two different graphs, one is a sparse graph that captures spatial relationships between nearby pixels and another is a dense graph that captures pairwise similarity between all pairs of pixels. We extend the spectral method for Normalized Cuts to this setting by combining the transition matrices of Markov chains associated with each graph. We also derive an efficient method that uses importance sampling for sparsifying the dense graph of appearance relationships. This leads to a practical algorithm for segmenting high-resolution images. The resulting method can segment challenging images without any filtering or pre-processing.
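One simple way to combine the transition matrices of two Markov chains is a convex mixture, sketched below. The abstract only says the matrices are combined, so the mixing weight `lam` and the mixture form are assumptions, and the helper names are hypothetical. Each weight matrix is row-normalized into a transition matrix first.

```python
def row_normalize(weights):
    """Turn a non-negative affinity matrix W into a transition matrix D^-1 W."""
    transition = []
    for row in weights:
        total = sum(row)
        if total <= 0:
            raise ValueError("every node needs at least one positive edge")
        transition.append([w / total for w in row])
    return transition

def combine_chains(w_sparse, w_dense, lam=0.5):
    """Mix the spatial (sparse) and appearance (dense) random walks.

    lam is an assumed mixing weight: with probability lam the walker follows
    the sparse spatial graph, otherwise the dense appearance graph.
    """
    p_sparse = row_normalize(w_sparse)
    p_dense = row_normalize(w_dense)
    return [
        [lam * s + (1.0 - lam) * d for s, d in zip(rs, rd)]
        for rs, rd in zip(p_sparse, p_dense)
    ]

# Three pixels in a row: spatial edges link neighbours only, while
# appearance affinities are defined for every pair.
spatial = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
appearance = [[1.0, 2.0, 1.0], [2.0, 1.0, 2.0], [1.0, 2.0, 1.0]]
mixed = combine_chains(spatial, appearance, lam=0.5)
```

A convex mixture of row-stochastic matrices is itself row-stochastic, so `mixed` is again a valid transition matrix whose eigenvectors can be analyzed as in Normalized Cuts.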

13. Transferring and Regularizing Prediction for Semantic Segmentation
Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, Tao Mei
Abstract: Semantic segmentation often requires a large set of images with pixel-level annotations. Given the extremely high cost of expert labeling, recent research has shown that models trained on photo-realistic synthetic data (e.g., computer games) with computer-generated annotations can be adapted to real images. Despite this progress, without constraining the prediction on real images, the models will easily overfit on synthetic data due to severe domain mismatch. In this paper, we exploit the intrinsic properties of semantic segmentation in a novel way to alleviate this problem for model transfer. Specifically, we present a Regularizer of Prediction Transfer (RPT) that imposes the intrinsic properties as constraints to regularize model transfer in an unsupervised fashion. These constraints include patch-level, cluster-level and context-level semantic prediction consistencies at different levels of image formation. As the transfer is label-free and data-driven, the robustness of prediction is addressed by selectively involving a subset of image regions for model regularization. Extensive experiments are conducted to verify the proposal of RPT on the transfer of models trained on GTA5 and SYNTHIA (synthetic data) to the Cityscapes dataset (urban street scenes). RPT shows consistent improvements when injecting the constraints on several neural networks for semantic segmentation. More remarkably, when integrating RPT into the adversarial-based segmentation framework, we report the best results to date: mIoU of 53.2%/51.7% when transferring from GTA5/SYNTHIA to Cityscapes, respectively.
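For readers unfamiliar with the mIoU figures quoted above, the metric averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below works on flattened label lists; skipping classes that appear in neither map is one common convention and is an assumption here, as benchmark toolkits differ in such details.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps (assumed convention)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Four pixels, two classes: class 0 has IoU 1/2 and class 1 has IoU 2/3.
predicted = [0, 0, 1, 1]
ground_truth = [0, 1, 1, 1]
score = mean_iou(predicted, ground_truth, num_classes=2)
```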

14. Learning a Unified Sample Weighting Network for Object Detection
Qi Cai, Yingwei Pan, Yu Wang, Jingen Liu, Ting Yao, Tao Mei
Abstract: Region sampling or weighting is critically important to the success of modern region-based object detectors. Unlike some previous works, which only focus on "hard" samples when optimizing the objective function, we argue that sample weighting should be data-dependent and task-dependent. The importance of a sample for the objective function optimization is determined by its uncertainties with respect to both the object classification and bounding box regression tasks. To this end, we devise a general loss function to cover most region-based object detectors with various sampling strategies, and then based on it we propose a unified sample weighting network to predict a sample's task weights. Our framework is simple yet effective. It leverages the samples' uncertainty distributions on classification loss, regression loss, IoU, and probability score to predict sample weights. Our approach has several advantages: (i) It jointly learns sample weights for both classification and regression tasks, which differentiates it from most previous work. (ii) It is a data-driven process, so it avoids some manual parameter tuning. (iii) It can be effortlessly plugged into most object detectors and achieves noticeable performance improvements without affecting their inference time. Our approach has been thoroughly evaluated with recent object detection frameworks and it can consistently boost the detection accuracy. Code has been made available at this https URL.

15. Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation
Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, Tao Mei
Abstract: Unsupervised domain adaptation has received significant attention in recent years. Most existing works tackle the closed-set scenario, assuming that the source and target domains share exactly the same categories. In practice, nevertheless, a target domain often contains samples of classes unseen in the source domain (i.e., an unknown class). The extension of domain adaptation from the closed-set to such an open-set situation is not trivial, since the target samples in the unknown class are not expected to align with the source. In this paper, we address this problem by augmenting the state-of-the-art domain adaptation technique, Self-Ensembling, with category-agnostic clusters in the target domain. Specifically, we present Self-Ensembling with Category-agnostic Clusters (SE-CC) -- a novel architecture that steers domain adaptation with the additional guidance of category-agnostic clusters that are specific to the target domain. This clustering information provides domain-specific visual cues, facilitating the generalization of Self-Ensembling for both closed-set and open-set scenarios. Technically, clustering is first performed over all the unlabeled target samples to obtain the category-agnostic clusters, which reveal the underlying data space structure peculiar to the target domain. A clustering branch is used to ensure that the learnt representation preserves such underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample. Furthermore, SE-CC enhances the learnt representation with mutual information maximization. Extensive experiments are conducted on the Office and VisDA datasets for both open-set and closed-set domain adaptation, and superior results are reported when compared with state-of-the-art approaches.

16. Attentive WaveBlock: Complementarity-enhanced Mutual Networks for Unsupervised Domain Adaptation in Person Re-identification
Wenhao Wang, Fang Zhao, Shengcai Liao, Ling Shao
Abstract: Unsupervised domain adaptation (UDA) for person re-identification is challenging because of the huge gap between the source and target domains. A typical self-training method is to use pseudo-labels generated by clustering algorithms to iteratively optimize the model on the target domain. However, a drawback of this is that noisy pseudo-labels generally cause trouble in learning. To address this problem, a mutual learning method using dual networks has been developed to produce reliable soft labels. However, as the two neural networks gradually converge, their complementarity is weakened and they likely become biased towards the same kind of noise. In this paper, we propose a novel light-weight module, the Attentive WaveBlock (AWB), which can be integrated into the dual networks of mutual learning to enhance the complementarity and further suppress noise in the pseudo-labels. Specifically, we first introduce a parameter-free module, the WaveBlock, which creates a difference between two networks by waving blocks of feature maps differently. Then, an attention mechanism is leveraged to enlarge the difference created and discover more complementary features. Furthermore, two kinds of combination strategies, i.e. pre-attention and post-attention, are explored. Experiments demonstrate that the proposed method achieves state-of-the-art performance with significant improvements of 9.4%, 5.9%, 7.4%, and 7.7% in mAP on Duke-to-Market, Market-to-Duke, Duke-to-MSMT, and Market-to-MSMT UDA tasks, respectively.

17. Rethinking the Truly Unsupervised Image-to-Image Translation
Kyungjune Baek, Yunjey Choi, Youngjung Uh, Jaejun Yoo, Hyunjung Shim
Abstract: Every recent image-to-image translation model uses either image-level (i.e. input-output pairs) or set-level (i.e. domain labels) supervision at minimum. However, even the set-level supervision can be a serious bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., neither paired images nor domain labels. To this end, we propose the truly unsupervised image-to-image translation method (TUNIT) that simultaneously learns to separate image domains via an information-theoretic approach and generate corresponding images using the estimated domain labels. Experimental results on various datasets show that the proposed method successfully separates domains and translates images across those domains. In addition, our model outperforms existing set-level supervised methods under a semi-supervised setting, where a subset of domain labels is provided. The source code is available at this https URL

18. Protecting Against Image Translation Deepfakes by Leaking Universal Perturbations from Black-Box Neural Networks
Nataniel Ruiz, Sarah Adel Bargal, Stan Sclaroff
Abstract: In this work, we develop efficient disruptions of black-box image translation deepfake generation systems. We are the first to demonstrate black-box deepfake generation disruption by presenting image translation formulations of attacks initially proposed for classification models. Nevertheless, a naive adaptation of classification black-box attacks results in a prohibitive number of queries for image translation systems in the real world. We present a frustratingly simple yet highly effective algorithm, Leaking Universal Perturbations (LUP), that significantly reduces the number of queries needed to attack an image. LUP consists of two phases: (1) a short leaking phase where we attack the network using traditional black-box attacks and gather information on successful attacks on a small dataset and (2) an exploitation phase where we leverage said information to subsequently attack the network with improved efficiency. Our attack reduces the total number of queries necessary to attack GANimation and StarGAN by 30%.

19. Minimum Potential Energy of Point Cloud for Robust Global Registration
Zijie Wu, Yaonan Wang, Qing Zhu, Jianxu Mao, Haotian Wu, Mingtao Feng, Ajmal Mian
Abstract: In this paper, we propose a novel minimum gravitational potential energy (MPE)-based algorithm for global point set registration. Feature descriptor extraction algorithms have emerged as the standard approach to aligning point sets over the past few decades. However, alignment can be difficult when the point set suffers from raw point data problems such as noise (Gaussian and uniform). Unlike most existing point set registration methods, which usually extract descriptors to find correspondences between point sets, our proposed MPE alignment method is able to handle large-scale raw data offsets without depending on traditional descriptor extraction, whether for local or global registration methods. We decompose the solution into a globally optimal convex approximation and a fast descent process to a local minimum. The approximation step of the proposed MPE approach consists of two main steps. First, following the construction of the force traction operator, we can simply compute the position of the potential energy minimum. Second, having found the MPE point, we propose a new theory that employs two flags to observe the status of the registration procedure. The fast descent process to the minimum that we employ is the iterative closest point algorithm; it can achieve the global minimum. We demonstrate the performance of the proposed algorithm on synthetic data as well as on real data. The proposed method outperforms other global methods in terms of efficiency, accuracy, and noise resistance.

20. Morphing Attack Detection -- Database, Evaluation Platform and Benchmarking [PDF]
Kiran Raja, Matteo Ferrara, Annalisa Franco, Luuk Spreeuwers, Illias Batskos, Florens de Wit, Marta Gomez-Barrero, Ulrich Scherhag, Daniel Fischer, Sushma Venkatesh, Jag Mohan Singh, Guoqiang Li, Loïc Bergeron, Sergey Isadskiy, Raghavendra Ramachandra, Christian Rathgeb, Dinusha Frings, Uwe Seidel, Fons Knopjes, Raymond Veldhuis, Davide Maltoni, Christoph Busch
Abstract: Morphing attacks pose a severe threat to Face Recognition Systems (FRS). Despite the number of advancements reported in recent works, we note serious open issues that are not addressed. Morphing Attack Detection (MAD) algorithms are often prone to generalization challenges as they are database dependent. The existing databases, mostly of semi-public nature, lack diversity in terms of ethnicity, morphing processes and post-processing pipelines. Further, they do not reflect a realistic operational scenario for Automated Border Control (ABC) and do not provide a basis to test MAD on unseen data in order to benchmark the robustness of algorithms. In this work, we present a new sequestered dataset for facilitating the advancement of MAD, where the algorithms can be tested on unseen data in an effort to better generalize. The newly constructed dataset consists of facial images from 150 subjects of various ethnicities, age groups and both genders. In order to challenge the existing MAD algorithms, the morphed images are created from the subjects with careful subject pre-selection and further post-processed to remove the morphing artifacts. The images are also printed and scanned to remove all digital cues and to simulate a realistic challenge for MAD algorithms. Further, we present a new online evaluation platform to test algorithms on sequestered data. With the platform we can benchmark the morph detection performance and study the generalization ability. This work also presents a detailed analysis on various subsets of sequestered data and outlines open challenges for future directions in MAD research.

21. Convolutional neural networks compression with low rank and sparse tensor decompositions [PDF]
Pavel Kaloshin
Abstract: Convolutional neural networks show outstanding results in a variety of computer vision tasks. However, neural network architecture design usually faces a trade-off between model performance and computational/memory complexity. For some real-world applications, it is crucial to develop models that are fast and light enough to run on edge systems and mobile devices. However, many modern architectures that demonstrate good performance do not satisfy inference time and storage limitation requirements. Thus arises the problem of neural network compression: obtaining a smaller and faster model that is on par with the initial one. In this work, we consider a neural network compression method based on tensor decompositions. Namely, we propose to approximate the convolutional layer weight with a tensor that can be represented as a sum of low-rank and sparse components. The motivation for such an approximation is the assumption that low-rank and sparse terms eliminate two different types of redundancy and thus yield a better compression rate. An efficient CPU implementation of the proposed method has been developed. Our algorithm has demonstrated up to 3.5x CPU layer speedup and 11x layer size reduction when compressing the Resnet50 architecture for the image classification task.
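
The low-rank plus sparse idea in this abstract can be illustrated on a single matrix. The sketch below is not the author's tensor algorithm: it assumes the convolutional weight has already been reshaped to a 2D matrix, takes one truncated SVD for the low-rank part, and keeps the largest residual entries as the sparse part. The function name, the layer shape, and the rank and sparsity budgets are all hypothetical.

```python
import numpy as np

def low_rank_plus_sparse(weight, rank, num_nonzero):
    """Approximate a 2D weight matrix as L + S (one SVD step, one thresholding step)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    residual = weight - low_rank
    sparse = np.zeros_like(weight)
    if num_nonzero > 0:
        # Keep only the largest-magnitude residual entries.
        flat_idx = np.argsort(np.abs(residual), axis=None)[-num_nonzero:]
        idx = np.unravel_index(flat_idx, weight.shape)
        sparse[idx] = residual[idx]
    return low_rank, sparse

rng = np.random.default_rng(0)
# Hypothetical layer: 64 filters over 4 * 3 * 3 inputs, rank-4 structure plus outliers.
w = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 36))
outliers = rng.choice(w.size, size=20, replace=False)
w.flat[outliers] += 10.0

low_rank, sparse = low_rank_plus_sparse(w, rank=4, num_nonzero=20)
err_low_rank = np.linalg.norm(w - low_rank)
err_both = np.linalg.norm(w - low_rank - sparse)
```

Comparing `err_low_rank` with `err_both` shows how much of the residual the sparse term absorbs for a given budget; storage is roughly `rank * (rows + cols)` values plus the stored nonzeros.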

22. CoMIR: Contrastive Multimodal Image Representation for Registration [PDF]
Nicolas Pielawski, Elisabeth Wetzer, Johan Öfverstedt, Jiahao Lu, Carolina Wählby, Joakim Lindblad, Nataša Sladoje
Abstract: We propose contrastive coding to learn shared, dense image representations, referred to as CoMIRs (Contrastive Multimodal Image Representations). CoMIRs enable the registration of multimodal images where existing registration methods often fail due to a lack of sufficiently similar image structures. CoMIRs reduce the multimodal registration problem to a monomodal one in which general intensity-based, as well as feature-based, registration algorithms can be applied. The method involves training one neural network per modality on aligned images, using a contrastive loss based on noise-contrastive estimation (InfoNCE). Unlike other contrastive coding methods, used for e.g. classification, our approach generates image-like representations that contain the information shared between modalities. We introduce a novel, hyperparameter-free modification to InfoNCE, to enforce rotational equivariance of the learnt representations, a property essential to the registration task. We assess the extent of achieved rotational equivariance and the stability of the representations with respect to weight initialization, training set, and hyperparameter settings, on a remote sensing dataset of RGB and near-infrared images. We evaluate the learnt representations through registration of a biomedical dataset of bright-field and second-harmonic generation microscopy images; two modalities with very little apparent correlation. The proposed approach based on CoMIRs significantly outperforms registration of representations created by GAN-based image-to-image translation, as well as a state-of-the-art, application-specific method which takes additional knowledge about the data into account. Code is available at: this https URL.
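
For readers unfamiliar with InfoNCE, a minimal sketch of the plain loss on paired embeddings follows. It does not include the rotational-equivariance modification the abstract describes, and the cosine similarity, the temperature value, and the function name are assumptions made for illustration only.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE loss for N paired embeddings; row i of z_a matches row i of z_b."""
    # L2-normalize so the dot product is a cosine similarity (an assumption).
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned_loss = info_nce(anchors, anchors + 0.01 * rng.normal(size=(8, 16)))
shuffled_loss = info_nce(anchors, rng.normal(size=(8, 16)))
```

The loss is low when each embedding is most similar to its own partner (`aligned_loss`) and high when the pairing carries no information (`shuffled_loss`).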

23. A Deep Learning Framework for Recognizing both Static and Dynamic Gestures [PDF]
Osama Mazhar, Sofiane Ramdani, Andrea Cherubini
Abstract: Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures, using simple RGB vision (without depth sensing). This feature makes it suitable for inexpensive human-machine interaction (HMI). We rely on a spatial attention-based strategy, which employs SaDNet, our proposed Static and Dynamic gestures Network. From the image of the human upper body, we estimate the person's depth, along with the regions of interest around the hands. The Convolutional Neural Networks in SaDNet are fine-tuned on a background-substituted hand gestures dataset. They are used to detect 10 static gestures for each hand and to obtain hand image embeddings from the last fully connected layer, which are subsequently fused with the augmented pose vector and then passed to stacked Long Short-Term Memory blocks. Thus, human-centered frame-wise information from the augmented pose vector and the left/right hand image embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments we show that the proposed approach surpasses the state-of-the-art results on the large-scale Chalearn 2016 dataset. Moreover, we transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the obtained results also outscore the state of the art on this dataset.

24. Hypernetwork-Based Augmentation [PDF]
Chih-Yang Chen, Che-Han Chang, Edward Y. Chang
Abstract: Data augmentation is an effective technique to improve the generalization of deep neural networks. Recently, AutoAugment proposed a well-designed search space and a search algorithm that automatically finds augmentation policies in a data-driven manner. However, AutoAugment is computationally intensive. In this paper, we propose an efficient gradient-based search algorithm, called Hypernetwork-Based Augmentation (HBA), which simultaneously learns model parameters and augmentation hyperparameters in a single training. Our HBA uses a hypernetwork to approximate a population-based training algorithm, which enables us to tune augmentation hyperparameters by gradient descent. Besides, we introduce a weight sharing strategy that simplifies our hypernetwork architecture and speeds up our search algorithm. We conduct experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet. Our results demonstrate that HBA is significantly faster than state-of-the-art methods while achieving competitive accuracy.

25. RTEX: A novel methodology for Ranking, Tagging, and Explanatory diagnostic captioning of radiography exams [PDF]
Vasiliki Kougia, John Pavlopoulos, Panagiotis Papapetrou, Max Gordon
Abstract: This paper introduces RTEx, a novel methodology for a) ranking radiography exams based on their probability to contain an abnormality, b) generating abnormality tags for abnormal exams, and c) providing a diagnostic explanation in natural language for each abnormal exam. The task of ranking radiography exams is an important first step for practitioners who want to identify and prioritize those radiography exams that are more likely to contain abnormalities, for example, to avoid mistakes due to tiredness or to manage heavy workload (e.g., during a pandemic). We used two publicly available datasets to assess our methodology and demonstrate that for the task of ranking it outperforms its competitors in terms of NDCG@k. For each abnormal radiography exam RTEx generates a set of abnormality tags alongside an explanatory diagnostic text to explain the tags and guide the medical expert. Our tagging component outperforms two strong competitor methods in terms of F1. Moreover, the diagnostic captioning component of RTEx, which exploits the already extracted tags to constrain the captioning process, outperforms all competitors with respect to clinical precision and recall.
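
The ranking metric named above, NDCG@k, can be computed as in the sketch below. It uses the common log2 discount with linear gain; the abstract does not say which variant the authors use, so treat that choice, the function names, and the toy labels as assumptions.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking of five exams: 1 = abnormal, 0 = normal.
ranked_labels = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranked_labels, k=3)
```

A ranking that places every abnormal exam ahead of every normal one scores 1.0; pushing abnormal exams further down the list lowers the score.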

26. Fast Coherent Point Drift [PDF]
Xiang-Wei Feng, Da-Zheng Feng, Yun Zhu
Abstract: Nonrigid point set registration is widely applied in computer vision and pattern recognition tasks. Coherent point drift (CPD) is a classical method for nonrigid point set registration. However, to solve for the spatial transformation functions, CPD has to invert an M×M matrix in every iteration, with time complexity O(M^3). By introducing a simple corresponding constraint, we develop a fast implementation of CPD. The main advantage of our method is that it avoids the matrix inversion. Before the iterations begin, our method requires a single eigenvalue decomposition of an M×M matrix. Once the iterations begin, our method only needs to update a diagonal matrix, with linear computational complexity, and perform a matrix multiplication with time complexity of approximately O(M^2) in each iteration. Besides, our method can be further accelerated by low-rank matrix approximation. Experimental results on 3D point cloud data show that our method can significantly reduce the computational burden of the registration process while keeping accuracy comparable to CPD.
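
The "decompose once, then only update a diagonal" pattern can be sketched as follows. This is an illustration under a strong assumption: the per-iteration system is taken to have the form (G + c I) X = B with a scalar c that changes between iterations, which is the author-independent textbook case and not necessarily the exact system the paper derives. The helper name `make_solver` and the toy data are hypothetical.

```python
import numpy as np

def make_solver(kernel):
    """Eigendecompose a symmetric M x M kernel once: O(M^3), paid before iterating."""
    eigvals, eigvecs = np.linalg.eigh(kernel)

    def solve(shift, rhs):
        # Solves (kernel + shift * I) X = rhs without a fresh inversion.
        # Updating the diagonal is O(M); the products are O(M^2) per column of rhs.
        return eigvecs @ ((eigvecs.T @ rhs) / (eigvals + shift)[:, None])

    return solve

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))
# Gaussian kernel built from pairwise squared distances between points.
sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
kernel = np.exp(-sq_dist / 2.0)
rhs = rng.normal(size=(50, 3))

solve = make_solver(kernel)
fast = solve(0.5, rhs)
direct = np.linalg.solve(kernel + 0.5 * np.eye(50), rhs)
```

Calling `solve` again with a different `shift` reuses the stored eigenvectors, which is where the per-iteration saving over a direct solve comes from.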

27. Privacy-Aware Activity Classification from First Person Office Videos [PDF]
Partho Ghosh, Md. Abrar Istiak, Nayeeb Rashid, Ahsan Habib Akash, Ridwan Abrar, Ankan Ghosh Dastider, Asif Shahriyar Sushmit, Taufiq Hasan
Abstract: With the advent of wearable body-cameras, human activity classification from First-Person Videos (FPV) has become a topic of increasing importance for various applications, including life-logging, law enforcement, sports, workplace, and healthcare. One of the challenging aspects of FPV is its exposure to potentially sensitive objects within the user's field of view. In this work, we developed a privacy-aware activity classification system focusing on office videos. We utilized a Mask-RCNN with an Inception-ResNet hybrid as a feature extractor for detecting, and then blurring out, sensitive objects (e.g., digital screens, human faces, paper) in the videos. For activity classification, we incorporate an ensemble of Recurrent Neural Networks (RNNs) with ResNet, ResNext, and DenseNet based feature extractors. The proposed system was trained and evaluated on the FPV office video dataset, which includes 18 classes and was made available through the IEEE Video and Image Processing (VIP) Cup 2019 competition. On the original unprotected FPVs, the proposed activity classifier ensemble reached an accuracy of 85.078% with precision, recall, and F1 scores of 0.88, 0.85, and 0.86, respectively. On privacy-protected videos, performance degraded slightly, with accuracy, precision, recall, and F1 scores of 73.68%, 0.79, 0.75, and 0.74, respectively. The presented system won the 3rd prize in the IEEE VIP Cup 2019 competition.

28. CLEval: Character-Level Evaluation for Text Detection and Recognition Tasks [PDF]
Youngmin Baek, Daehyun Nam, Sungrae Park, Junyeop Lee, Seung Shin, Jeonghun Baek, Chae Young Lee, Hwalsuk Lee
Abstract: Despite the recent success of text detection and recognition methods, existing evaluation metrics fail to provide a fair and reliable comparison among those methods. In addition, there exists no end-to-end evaluation metric that takes characteristics of OCR tasks into account. Previous end-to-end metrics contain cascaded errors from the binary scoring process applied in both detection and recognition tasks. Ignoring partially correct results creates a gap between quantitative and qualitative analysis, and prevents fine-grained assessment. Based on the fact that characters are the key elements of text, we propose a Character-Level Evaluation metric (CLEval). In CLEval, the instance matching process handles split and merge detection cases, and the scoring process conducts character-level evaluation. By aggregating character-level scores, the CLEval metric provides a fine-grained evaluation of end-to-end results composed of detection and recognition, as well as individual evaluations for each module from the end-performance perspective. We believe that our metrics can play a key role in developing and analyzing state-of-the-art text detection and recognition methods. The evaluation code is publicly available at this https URL.

29. Fall Detector Adapted to Nursing Home Needs through an Optical-Flow based CNN [PDF]
Alexy Carlier, Paul Peyramaure, Ketty Favre, Muriel Pressigout
Abstract: Fall detection in specialized homes for the elderly is challenging. Vision-based fall detection solutions have a significant advantage over sensor-based ones as they do not instrument the resident, who may suffer from mental diseases. This work is part of a project intended to deploy fall detection solutions in nursing homes. The proposed solution, based on Deep Learning, is built on a Convolutional Neural Network (CNN) trained to maximize a sensitivity-based metric. This work presents the requirements from the medical side and how they impact the tuning of a CNN. Results highlight the importance of the temporal aspect of a fall. Therefore, a custom metric adapted to this use case and an implementation of a decision-making process are proposed in order to best meet the medical teams' requirements. Clinical relevance: this work presents a fall detection solution able to detect 86.2% of falls while producing only 11.6% false alarms on average on the considered databases.

30. An Edge Information and Mask Shrinking Based Image Inpainting Approach [PDF]
Huali Xu, Xiangdong Su, Meng Wang, Xiang Hao, Guanglai Gao
Abstract: In the image inpainting task, the ability to repair both high-frequency and low-frequency information in the missing regions has a substantial influence on the quality of the restored image. However, existing inpainting methods usually fail to consider high-frequency and low-frequency information simultaneously. To solve this problem, this paper proposes an edge information and mask shrinking based image inpainting approach, which consists of two models. The first model is an edge generation model used to generate complete edge information from the damaged image, and the second model is an image completion model used to fix the missing regions with the generated edge information and the valid contents of the damaged image. The mask shrinking strategy is employed in the image completion model to track the areas to be repaired. The proposed approach is evaluated qualitatively and quantitatively on the Places2 dataset. The results show that our approach outperforms state-of-the-art methods.

31. Large-Scale Adversarial Training for Vision-and-Language Representation Learning [PDF]
Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu
Abstract: We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

32. JIT-Masker: Efficient Online Distillation for Background Matting [PDF]
Jo Chuang, Qian Dong
Abstract: We design a real-time portrait matting pipeline for everyday use, particularly for "virtual backgrounds" in video conferences. Existing segmentation and matting methods prioritize accuracy and quality over throughput and efficiency, and our pipeline enables trading off a controllable amount of accuracy for better throughput by leveraging online distillation on the input video stream. We construct our own dataset of simulated video calls in various scenarios, and show that our approach delivers a 5x speedup over a saliency detection based pipeline in a non-GPU accelerated setting while delivering higher quality results. We demonstrate that an online distillation approach can feasibly work as part of a general, consumer level product as a "virtual background" tool. Our public implementation is at this https URL.

33. Telling Left from Right: Learning Spatial Correspondence between Sight and Sound [PDF]
Karren Yang, Bryan Russell, Justin Salamon
Abstract: Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360 degree videos with ambisonic audio.
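
The pretext task described here, deciding whether the left and right audio channels were swapped, needs only self-generated labels. The sketch below shows how such training examples could be produced; the frame representation as (left, right) tuples, the 50% flip probability, and the function name are assumptions, and the paired video frames and the classifier itself are omitted.

```python
import random

def make_flip_example(stereo_frames, rng):
    """Return (frames, label): label is 1 if left/right channels were swapped, else 0."""
    flipped = rng.random() < 0.5
    if flipped:
        frames = [(right, left) for left, right in stereo_frames]
    else:
        frames = list(stereo_frames)
    return frames, int(flipped)

rng = random.Random(0)
clip = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3)]  # hypothetical (left, right) samples
examples = [make_flip_example(clip, rng) for _ in range(6)]
```

Because the label comes from the flip itself, no human annotation is required; a model can only solve the task by relating the audio channels to where the sound source appears in the video.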

34. MOMS with Events: Multi-Object Motion Segmentation With Monocular Event Cameras [PDF]
Chethan M. Parameshwara, Nitin J. Sanket, Arjun Gupta, Cornelia Fermuller, Yiannis Aloimonos
Abstract: Segmentation of moving objects in dynamic scenes is a key process in scene understanding for both navigation and video recognition tasks. Without prior knowledge of the object structure and motion, the problem is very challenging due to the plethora of motion parameters to be estimated while being agnostic to motion blur and occlusions. Event sensors, because of their high temporal resolution and lack of motion blur, seem well suited for addressing this problem. We propose a solution to multi-object motion segmentation that combines classical optimization methods with deep learning and does not require prior knowledge of the 3D motion or of the number and structure of objects. Using the events within a time interval, the method estimates and compensates for the global rigid motion. Then it segments the scene into multiple motions by iteratively fitting and merging models using input tracked feature regions via alignment based on temporal gradients and contrast measures. The approach was successfully evaluated on challenging real-world and synthetic scenarios from the EV-IMO, EED, and MOD datasets. It outperforms the state-of-the-art detection rate by as much as 12%, achieving new state-of-the-art average detection rates of 77.06%, 94.2% and 82.35% on the aforementioned datasets.

35. Image Deconvolution via Noise-Tolerant Self-Supervised Inversion [PDF]
Hirofumi Kobayashi, Ahmet Can Solak, Joshua Batson, Loic A. Royer
Abstract: We propose a general framework for solving inverse problems in the presence of noise that requires no signal prior, no noise estimate, and no clean training data. We only require that the forward model be available and that the noise be statistically independent across measurement dimensions. We build upon the theory of $\mathcal{J}$-invariant functions (Batson & Royer 2019, arXiv:1901.11365) and show how self-supervised denoising à la Noise2Self is a special case of learning a noise-tolerant pseudo-inverse of the identity. We demonstrate our approach by showing how a convolutional neural network can be taught in a self-supervised manner to deconvolve images and surpass in image quality classical inversion schemes such as Lucy-Richardson deconvolution.

36. Kalman Filter Based Multiple Person Head Tracking [PDF]
Mohib Ullah, Maqsood Mahmud, Habib Ullah, Kashif Ahmad, Ali Shariq Imran, Faouzi Alaya Cheikh
Abstract: For multi-target tracking, target representation plays a crucial role in performance. State-of-the-art approaches rely on deep learning-based visual representations that give optimal performance at the cost of high computational complexity. In this paper, we come up with a simple yet effective target representation for human tracking. Our inspiration comes from the fact that the human body goes through severe deformation and inter/intra occlusion over the passage of time. So, instead of tracking the whole body, we track a relatively rigid part, the head, to follow a person over an extended period of time. Hence, we follow the tracking-by-detection paradigm and generate target hypotheses from only the spatial locations of heads in every frame. After head localization, a Kalman filter with a constant velocity motion model is instantiated for each target and follows the temporal evolution of the targets in the scene. To associate the targets across consecutive frames, combinatorial optimization is used that associates the corresponding targets in a greedy fashion. Results are evaluated qualitatively on four challenging video surveillance datasets, and promising results have been achieved.
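
The two building blocks named in this abstract, a constant-velocity Kalman filter per target and greedy association between frames, can be sketched as follows. The state layout, the noise variances, the Euclidean distance cost, and the distance gate are all assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter over the state [x, y, vx, vy] with position-only measurements."""

    def __init__(self, position, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.x = np.array([position[0], position[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = dt   # position += velocity * dt
        self.H = np.eye(2, 4)              # observe x and y only
        self.Q = np.eye(4) * process_var
        self.R = np.eye(2) * meas_var

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, measurement):
        innovation = np.asarray(measurement, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P

def greedy_associate(predictions, detections, max_dist=50.0):
    """Match tracks to detections by repeatedly taking the closest remaining pair."""
    pairs = sorted(
        (np.linalg.norm(np.asarray(p) - np.asarray(d)), ti, di)
        for ti, p in enumerate(predictions)
        for di, d in enumerate(detections)
    )
    matches, used_t, used_d = {}, set(), set()
    for dist, ti, di in pairs:
        if dist <= max_dist and ti not in used_t and di not in used_d:
            matches[ti] = di
            used_t.add(ti)
            used_d.add(di)
    return matches

# Hypothetical head moving 2 px per frame along x.
track = ConstantVelocityKalman(position=(0.0, 0.0))
for step in range(1, 21):
    track.predict()
    track.update((2.0 * step, 0.0))
predicted = track.predict()
matches = greedy_associate([predicted, (100.0, 100.0)], [(101.0, 99.0), (42.0, 0.0)])
```

Greedy matching is cheaper than an optimal assignment but can pick a locally closest pair that blocks a better global pairing; the distance gate limits the damage by leaving implausible pairs unmatched.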

37. Dance Revolution: Long Sequence Dance Generation with Music via Curriculum Learning [PDF]
Ruozi Huang, Huang Hu, Wei Wu, Kei Sawada, Mi Zhang
Abstract: Dancing to music has been an innate human ability since ancient times. In artificial intelligence research, however, synthesizing dance movements (complex human motion) from music is a challenging problem, which suffers from the high spatial-temporal complexity of human motion dynamics modeling. Besides, the consistency of dance and music in terms of style, rhythm and beat also needs to be taken into account. Existing works focus on short-term dance generation with music, e.g. less than 30 seconds. In this paper, we propose a novel seq2seq architecture for long sequence dance generation with music, which consists of a transformer based music encoder and a recurrent structure based dance decoder. By restricting the receptive field of self-attention, our encoder can efficiently process long musical sequences, reducing its memory requirements from quadratic to linear in the sequence length. To further alleviate the error accumulation in human motion synthesis, we introduce a dynamic auto-condition training strategy as a new curriculum learning method to facilitate long-term dance generation. Extensive experiments demonstrate that our proposed approach significantly outperforms existing methods on both automatic metrics and human evaluation. Additionally, we provide a demo video showing that our approach can generate minute-length dance sequences that are smooth, natural-looking, diverse, style-consistent and beat-matched with the music. The demo video is now available at this https URL.
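
To make the memory argument concrete, here is a minimal sketch of self-attention with a restricted receptive field. It is a simplification under stated assumptions: queries, keys and values are all the raw input (no learned projections, single head), and the symmetric window size is a hypothetical parameter, so this is not the paper's encoder.

```python
import numpy as np

def local_self_attention(x, window):
    """Self-attention where position i attends only to positions within `window` of i."""
    n, d = x.shape
    out = np.empty_like(x)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = x[lo:hi] @ x[i] / np.sqrt(d)   # at most 2 * window + 1 scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ x[lo:hi]
    return out

def full_self_attention(x):
    """Reference version that materializes the full n x n score matrix."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
sequence = rng.normal(size=(32, 8))
local_out = local_self_attention(sequence, window=4)
wide_out = local_self_attention(sequence, window=32)
```

With a fixed window the number of scores grows as n * (2 * window + 1) instead of n * n, and a window that covers the whole sequence (`wide_out`) reproduces full attention.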

38. Continual Learning for Affective Computing [PDF]
Nikhil Churamani
Abstract: Real-world applications require affect perception models to be sensitive to individual differences in expression. As each user is different and expresses differently, these models need to personalise towards each individual to adequately capture their expressions and thus model their affective state. Despite high performance on benchmarks, current approaches fall short in such adaptation. In this dissertation, we propose the use of continual learning for affective computing as a paradigm for developing personalised affect perception.

39. Autonomous Driving with Deep Learning: A Survey of State-of-Art Technologies
Yu Huang, Yue Chen
Abstract: Since the DARPA Grand Challenges (rural) in 2004/05 and the Urban Challenge in 2007, autonomous driving has been the most active field of AI applications. Almost at the same time, deep learning achieved breakthroughs through the work of several pioneers; three of them (also called the fathers of deep learning), Hinton, Bengio and LeCun, won the ACM Turing Award in 2019. This is a survey of autonomous driving technologies with deep learning methods. We investigate the major fields of self-driving systems, such as perception, mapping and localization, prediction, planning and control, simulation, V2X and safety. Due to limited space, we focus the analysis on several key areas, i.e. 2D and 3D object detection in perception, depth estimation from cameras, multiple-sensor fusion at the data, feature and task level respectively, and behavior modelling and prediction of vehicle driving and pedestrian trajectories.

40. DivNoising: Diversity Denoising with Fully Convolutional Variational Autoencoders
Mangal Prakash, Alexander Krull, Florian Jug
Abstract: Deep learning based methods have emerged as the indisputable leaders for virtually all image restoration tasks. Especially in the domain of microscopy images, various content-aware image restoration (CARE) approaches are now used to improve the interpretability of acquired data. But there are limitations to what can be restored in corrupted images, and any given method needs to make a sensible compromise between many possible clean signals when predicting a restored image. Here, we propose DivNoising, a denoising approach based on fully convolutional variational autoencoders, which overcomes this problem by predicting a whole distribution of denoised images. Our method is unsupervised, requiring only noisy images and a description of the imaging noise, which can be measured or bootstrapped from noisy data. If desired, consensus predictions can be inferred from a set of DivNoising predictions, leading to results that are competitive with other unsupervised methods and, on occasion, even with the supervised state of the art. DivNoising samples from the posterior enable a plethora of useful applications. We (i) discuss how optical character recognition (OCR) applications could benefit from diverse predictions on ambiguous data, and (ii) show in detail how instance cell segmentation gains performance when using diverse DivNoising predictions.
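
One simple way to form a consensus prediction from a set of sampled denoised images is a per-pixel mean. This is a minimal sketch under that assumption; the abstract does not say which consensus rule the authors use, and `consensus_mean` is a hypothetical helper, not part of any DivNoising code.

```python
def consensus_mean(samples):
    """Per-pixel mean over a list of equally sized 2D images (lists of rows)."""
    count = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    return [
        # Average the same pixel position across every sampled image.
        [sum(img[r][c] for img in samples) / count for c in range(width)]
        for r in range(height)
    ]
```

Averaging trades the diversity of individual samples for a single smooth estimate, which is why the diverse samples themselves remain useful for downstream tasks such as segmentation.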

41. Joint Training of Variational Auto-Encoder and Latent Energy-Based Model
Tian Han, Erik Nijkamp, Linqi Zhou, Bo Pang, Song-Chun Zhu, Ying Nian Wu
Abstract: This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of VAE and latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image. The objective function has an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.

42. Map3D: Registration Based Multi-Object Tracking on 3D Serial Whole Slide Images
Ruining Deng, Haichun Yang, Aadarsh Jha, Yuzhe Lu, Peng Chu, Agnes Fogo, Yuankai Huo
Abstract: There has been a long pursuit of precise and reproducible glomerular quantification in renal pathology to benefit both research and practice. When digitizing biopsy tissue samples using whole slide imaging (WSI), a set of serial sections from the same tissue can be acquired as a stack of images, similar to frames in a video. In radiology, a stack of images (e.g., computed tomography) is naturally used to provide 3D context for organs, tissues, and tumors. In pathology, it is appealing to do a similar 3D assessment of glomeruli using a stack of serial WSI sections. However, the 3D identification and association of large-scale glomeruli in renal pathology is challenging due to large tissue deformation, missing tissues, and artifacts from WSI. Therefore, existing 3D quantitative assessments of glomeruli are still largely performed with manual or semi-automated methods, leading to high labor costs, low-throughput processing, and inter-observer variability. In this paper, we propose a novel Multi-Object Association for Pathology in 3D (Map3D) method for automatically identifying and associating large-scale cross-sections of 3D objects from routine serial sectioning and WSI. The innovations of the Map3D method are three-fold: (1) the large-scale glomerular association is formulated from a new multi-object tracking (MOT) perspective; (2) quality-aware whole-series registration is proposed to not only provide affinity estimation but also offer automatic kidney-wise quality assurance (QA) for registration; (3) a dual-path association method is proposed to tackle large deformation, missing tissues, and artifacts during tracking. To the best of our knowledge, the Map3D method is the first approach that enables automatic and large-scale glomerular association across 3D serial sectioning using WSI.

43. Towards Robust Fine-grained Recognition by Maximal Separation of Discriminative Features
Krishna Kanth Nakka, Mathieu Salzmann
Abstract: Adversarial attacks have been widely studied for general classification tasks, but remain unexplored in the context of fine-grained recognition, where the inter-class similarities facilitate the attacker's task. In this paper, we identify the proximity of the latent representations of different classes in fine-grained recognition networks as a key factor to the success of adversarial attacks. We therefore introduce an attention-based regularization mechanism that maximally separates the discriminative latent features of different classes while minimizing the contribution of the non-discriminative regions to the final class prediction. As evidenced by our experiments, this allows us to significantly improve robustness to adversarial attacks, to the point of matching or even surpassing that of adversarial training, but without requiring access to adversarial samples.

44. Revisiting visual-inertial structure from motion for odometry and SLAM initialization
Georgios Evangelidis, Branislav Micusik
Abstract: In this paper, an efficient closed-form solution for state initialization in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) is presented. Unlike the state of the art, we do not derive linear equations from triangulating pairs of point observations. Instead, we build on a direct triangulation of the unknown 3D point paired with each of its observations. We show and validate the high impact of such a simple difference. The resulting linear system has a simpler structure, and the solution through analytic elimination only requires solving a $6 \times 6$ linear system (or $9 \times 9$ when accelerometer bias is included). In addition, all the observations of every scene point are jointly related, thereby leading to a less biased and more robust solution. The proposed formulation reduces velocity and point reconstruction error by up to 50 percent compared to the standard closed-form solver. Apart from the inherent efficiency, fewer iterations are needed by any further non-linear refinement thanks to better parameter initialization. In this context, we provide the analytic Jacobians for a non-linear optimizer that optionally refines the initial parameters. The superior performance of the proposed solver is established by quantitative comparisons with the state-of-the-art solver.

45. Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty
Miguel Monteiro, Loïc Le Folgoc, Daniel Coelho de Castro, Nick Pawlowski, Bernardo Marques, Konstantinos Kamnitsas, Mark van der Wilk, Ben Glocker
Abstract: In image segmentation, there is often more than one plausible solution for a given input. In medical imaging, for example, experts will often disagree about the exact location of object boundaries. Estimating this inherent uncertainty and predicting multiple plausible hypotheses is of great interest in many applications, yet this ability is lacking in most current deep learning methods. In this paper, we introduce stochastic segmentation networks (SSNs), an efficient probabilistic method for modelling aleatoric uncertainty with any image segmentation network architecture. In contrast to approaches that produce pixel-wise estimates, SSNs model joint distributions over entire label maps and thus can generate multiple spatially coherent hypotheses for a single image. By using a low-rank multivariate normal distribution over the logit space to model the probability of the label map given the image, we obtain a spatially consistent probability distribution that can be efficiently computed by a neural network without any changes to the underlying architecture. We tested our method on the segmentation of real-world medical data, including lung nodules in 2D CT and brain tumours in 3D multimodal MRI scans. SSNs outperform state-of-the-art for modelling correlated uncertainty in ambiguous images while being much simpler, more flexible, and more efficient.
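
Sampling from a low-rank multivariate normal over logits can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the standard low-rank-plus-diagonal parameterisation, in which the covariance is `factor @ factor.T + diag(diag)`, and it treats the logits as one flat vector. The names `sample_logits`, `factor`, and `diag` are hypothetical.

```python
import math
import random


def sample_logits(mean, factor, diag, rng):
    """Draw one logit vector from N(mean, factor @ factor.T + diag(diag)).

    mean and diag have length n; factor is an n x rank list of rows.
    """
    rank = len(factor[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(rank)]         # shared low-rank noise
    eps = [rng.gauss(0.0, 1.0) for _ in range(len(mean))]  # independent noise
    return [
        m + sum(p * zk for p, zk in zip(row, z)) + math.sqrt(d) * e
        for m, row, d, e in zip(mean, factor, diag, eps)
    ]


def softmax(logits):
    """Turn a list of class logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]
```

Because every entry shares the same `z`, entries whose rows in `factor` are similar move together across samples, which is what makes the sampled label maps spatially coherent. Only `n * rank + n` covariance parameters are needed instead of `n * n`.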

46. Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning
Qing Li, Siyuan Huang, Yining Hong, Yixin Chen, Ying Nian Wu, Song-Chun Zhu
Abstract: The goal of neural-symbolic computation is to integrate the connectionist and symbolist paradigms. Prior methods learn the neural-symbolic models using reinforcement learning (RL) approaches, which ignore the error propagation in the symbolic reasoning module and thus converge slowly with sparse rewards. In this paper, we address these issues and close the loop of neural-symbolic learning by (1) introducing the grammar model as a symbolic prior to bridge neural perception and symbolic reasoning, and (2) proposing a novel back-search algorithm which mimics the top-down human-like learning procedure to propagate the error through the symbolic reasoning module efficiently. We further interpret the proposed learning framework as maximum likelihood estimation using Markov chain Monte Carlo sampling and the back-search algorithm as a Metropolis-Hastings sampler. The experiments are conducted on two weakly-supervised neural-symbolic tasks: (1) handwritten formula recognition on the newly introduced HWF dataset; (2) visual question answering on the CLEVR dataset. The results show that our approach significantly outperforms the RL methods in terms of performance, convergence speed, and data efficiency. Our code and data are released at this https URL.

47. Interpretable Visualizations with Differentiating Embedding Networks
Isaac Robinson
Abstract: We present a visualization algorithm based on a novel unsupervised Siamese neural network training regime and loss function, called Differentiating Embedding Networks (DEN). The Siamese neural network finds differentiating or similar features between specific pairs of samples in a dataset, and uses these features to embed the dataset in a lower dimensional space where it can be visualized. Unlike existing visualization algorithms such as UMAP or $t$-SNE, DEN is parametric, meaning it can be interpreted by techniques such as SHAP. To interpret DEN, we create an end-to-end parametric clustering algorithm on top of the visualization, and then leverage SHAP scores to determine which features in the sample space are important for understanding the structures shown in the visualization based on the clusters found. We compare DEN visualizations with existing techniques on a variety of datasets, including image and scRNA-seq data. We then show that our clustering algorithm performs similarly to the state of the art despite not having prior knowledge of the number of clusters, and sets a new state of the art on FashionMNIST. Finally, we demonstrate finding differentiating features of a dataset. Code available at this https URL

48. Real-Time Video Inference on Edge Devices via Adaptive Model Streaming
Mehrdad Khani, Pouya Hamadanian, Arash Nasr-Esfahany, Mohammad Alizadeh
Abstract: Real-time video inference on compute-limited edge devices like mobile phones and drones is challenging due to the high computation cost of Deep Neural Network models. In this paper we propose Adaptive Model Streaming (AMS), a cloud-assisted approach to real-time video inference on edge devices. The key idea in AMS is to use online learning to continually adapt a lightweight model running on an edge device to boost its performance on the video scenes in real-time. The model is trained in a cloud server and is periodically sent to the edge device. We discuss the challenges of online learning for video and present a practical design that takes into account the edge device, cloud server, and network bandwidth resource limitations. On the task of video semantic segmentation, our experimental results show 5.1--17.0 percent mean Intersection-over-Union improvement compared to a pre-trained model on several real-world videos. Our prototype can perform video segmentation at 30 frames-per-second with 40 milliseconds camera-to-label latency on a Samsung Galaxy S10+ mobile phone, using less than 400Kbps uplink and downlink bandwidth on the device.

49. Diagnosis and Analysis of Celiac Disease and Environmental Enteropathy on Biopsy Images using Deep Learning Approaches
Kamran Kowsari
Abstract: Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. Both conditions require a tissue biopsy for diagnosis, and a major challenge of interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is the striking histopathologic overlap between them. In the current study, we propose four diagnosis techniques for these diseases and address their limitations and advantages. First, the diagnosis between CD, EE, and Normal biopsies is considered, but the main challenge with this diagnosis technique is the staining problem. The dataset used in this research is collected from different centers with different staining standards. To solve this problem, we use color balancing in order to train our model with a varying range of colors. The Random Multimodel Deep Learning (RMDL) architecture has been used as another approach to mitigate the effects of the staining problem. RMDL combines different architectures and structures of deep learning, and the final output of the model is based on the majority vote. CD is a chronic autoimmune disease that affects the small intestine in genetically predisposed children and adults. Typically, CD rapidly progresses from Marsh I to IIIa. Marsh III is sub-divided into IIIa (partial villus atrophy), Marsh IIIb (subtotal villous atrophy), and Marsh IIIc (total villus atrophy) to explain the spectrum of villus atrophy along with crypt hypertrophy and increased intraepithelial lymphocytes. In the second part of this study, we propose two ways of diagnosing different stages of CD. Finally, in the third part of this study, these two steps are combined as Hierarchical Medical Image Classification (HMIC) to obtain a model that diagnoses the disease data hierarchically.
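
The majority-vote step that produces the ensemble's final output can be sketched in a few lines. This is a generic illustration, not the RMDL code: `majority_vote` is a hypothetical helper, and the tie-breaking rule (the label encountered first wins) is an assumption, since the abstract does not describe how ties are handled.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model labels; predictions[m][i] is model m's label for sample i."""
    votes_per_sample = zip(*predictions)  # regroup the labels sample by sample
    # most_common(1) returns the top label; ties go to the label seen first.
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]
```

For example, three models voting `["CD", "CD", "EE"]` on one biopsy image yield `"CD"`.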

50. A Primer on Large Intelligent Surface (LIS) for Wireless Sensing in an Industrial Setting
Cristian J. Vaca-Rubio, Pablo Ramirez-Espinosa, Robin Jess Williams, Kimmo Kansanen, Zheng-Hua Tan, Elisabeth de Carvalho, Petar Popovski
Abstract: One of the beyond-5G developments that is often highlighted is the integration of wireless communication and radio sensing. This paper addresses the potential of communication-sensing integration of Large Intelligent Surfaces (LIS) in an exemplary Industry 4.0 scenario. Besides the potential for high throughput and efficient multiplexing of wireless links, a LIS can offer a high-resolution rendering of the propagation environment. This is because, in an indoor setting, it can be placed in proximity to the sensed phenomena, while the high resolution is offered by densely spaced tiny antennas deployed over a large area. By treating a LIS as a radio image of the environment, we develop sensing techniques that leverage the tools of image processing and computer vision combined with machine learning. We test these methods for a scenario where we need to detect whether an industrial robot deviates from a predefined route. The results show that the LIS-based sensing offers high precision and has a high application potential in indoor industrial environments.

51. MMA Regularization: Decorrelating Weights of Neural Networks by Maximizing the Minimal Angles
Zhennan Wang, Canqun Xiang, Wenbin Zou, Chen Xu
Abstract: The strong correlation between neurons or filters can significantly weaken the generalization ability of neural networks. Inspired by the well-known Tammes problem, we propose a novel diversity regularization method to address this issue, which makes the normalized weight vectors of neurons or filters distributed on a hypersphere as uniformly as possible, through maximizing the minimal pairwise angles (MMA). This method can easily exert its effect by plugging the MMA regularization term into the loss function with negligible computational overhead. The MMA regularization is simple, efficient, and effective. Therefore, it can be used as a basic regularization method in neural network training. Extensive experiments demonstrate that MMA regularization is able to enhance the generalization ability of various modern models and achieves considerable performance improvements on CIFAR100 and TinyImageNet datasets. In addition, experiments on face verification show that MMA regularization is also effective for feature learning.
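
A regularization term that rewards large minimal pairwise angles can be sketched as below. This is a minimal sketch under stated assumptions: the loss is taken to be the negative mean, over weight vectors, of each vector's smallest angle to any other vector, which may differ in detail from the paper's exact formulation. `mma_loss` is a hypothetical name.

```python
import math


def mma_loss(weights):
    """Negative mean of each row's minimal angle (radians) to any other row."""
    unit = []
    for row in weights:
        norm = math.sqrt(sum(v * v for v in row))
        unit.append([v / norm for v in row])  # project onto the unit hypersphere
    min_angles = []
    for i, a in enumerate(unit):
        # The smallest angle corresponds to the largest cosine similarity.
        max_cos = max(
            sum(x * y for x, y in zip(a, b))
            for j, b in enumerate(unit) if j != i
        )
        # Clamp to [-1, 1] to guard acos against rounding error.
        min_angles.append(math.acos(max(-1.0, min(1.0, max_cos))))
    return -sum(min_angles) / len(min_angles)
```

Minimizing this value pushes the closest pair of weight vectors apart, so adding it to the task loss with a small coefficient discourages correlated filters.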

52. Fully-automated deep learning slice-based muscle estimation from CT images for sarcopenia assessment
Fahdi Kanavati, Shah Islam, Zohaib Arain, Eric O. Aboagye, Andrea Rockall
Abstract: Objective: To demonstrate the effectiveness of using a deep learning-based approach for a fully automated slice-based measurement of muscle mass for assessing sarcopenia on CT scans of the abdomen without any case exclusion criteria. Materials and Methods: This retrospective study was conducted using a collection of publicly and privately available CT images (n = 1070). The method consisted of two stages: slice detection from a CT volume and single-slice CT segmentation. Both stages used Fully Convolutional Neural Networks (FCNN) and were based on a UNet-like architecture. Input data consisted of CT volumes with a variety of fields of view. The output consisted of a segmented muscle mass on a CT slice at the level of the L3 vertebra. The muscle mass is segmented into erector spinae, psoas, and rectus abdominis muscle groups. The output was tested against manual ground-truth segmentation by an expert annotator. Results: 3-fold cross validation was used to evaluate the proposed method. The slice detection cross validation error was 1.41 ± 5.02 (in slices). The segmentation cross validation Dice overlaps were 0.97 ± 0.02, 0.95 ± 0.04, 0.94 ± 0.04 for erector spinae, psoas, and rectus abdominis, respectively, and 0.96 ± 0.02 for the combined muscle mass. Conclusion: A deep learning approach to detect CT slices and segment muscle mass to perform slice-based analysis of sarcopenia is an effective and promising approach. The use of FCNN to accurately and efficiently detect a slice in CT volumes with a variety of fields of view, occlusions, and slice thicknesses was demonstrated.
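
The Dice overlap reported in the results is twice the size of the intersection of two masks divided by the sum of their sizes. The sketch below computes it for flattened binary masks; `dice_overlap` is a hypothetical helper, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_overlap(pred, truth):
    """Dice coefficient 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly; this also avoids dividing by zero.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum
```

A value of 0.97 therefore means the predicted and manual muscle masks are nearly identical in both extent and position.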

53. Interpreting CNN for Low Complexity Learned Sub-pixel Motion Compensation in Video Coding
Luka Murn, Saverio Blasi, Alan F. Smeaton, Noel E. O'Connor, Marta Mrak
Abstract: Deep learning has shown great potential in image and video compression tasks. However, it brings bit savings at the cost of significant increases in coding complexity, which limits its potential for implementation within practical applications. In this paper, a novel neural network-based tool is presented which improves the interpolation of reference samples needed for fractional precision motion compensation. Contrary to previous efforts, the proposed approach focuses on complexity reduction achieved by interpreting the interpolation filters learned by the networks. When the approach is implemented in the Versatile Video Coding (VVC) test model, up to 4.5% BD-rate saving for individual sequences is achieved compared with the baseline VVC, while the complexity of the learned interpolation is significantly reduced compared to the application of the full neural network.

54. TensorFlow with user friendly Graphical Framework for object detection API
Heemoon Yoon, Sang-Hee Lee, Mira Park
Abstract: TensorFlow is an open-source framework for deep learning dataflow and contains application programming interfaces (APIs) for voice analysis, natural language processing, and computer vision. In particular, the TensorFlow object detection API in the computer vision field has been widely applied to technologies in agriculture, engineering, and medicine, but the barrier to entry remains high for amateurs and beginners in the information technology (IT) field because the framework is used through a command-line interface (CLI) and code. Therefore, this work aims to develop a user-friendly graphical framework for the object detection API on TensorFlow, called TensorFlow Graphical Framework (TF-GraF). TF-GraF provides independent virtual environments according to user accounts on the server side and, on the client side, execution of data preprocessing, training, and evaluation without a CLI. Furthermore, hyperparameter setting, real-time observation of the training process, object visualization of test images, and metrics evaluation of test data can also be operated via TF-GraF. In particular, TF-GraF supports flexible model selection of SSD, Faster-RCNN, RFCN, and Mask-RCNN, including convolutional neural networks (Inceptions and ResNets), through a GUI environment. Consequently, TF-GraF allows anyone, even without any previous knowledge of deep learning frameworks, to design, train and deploy machine intelligence models without coding. Since TF-GraF takes care of setting and configuration, it allows anyone to use deep learning technology for their project without spending time installing complex software and environments.

55. Adversarial Attack Vulnerability of Medical Image Analysis Systems: Unexplored Factors
Suzanne C. Wetstein, Cristina González-Gonzalo, Gerda Bortsova, Bart Liefers, Florian Dubost, Ioannis Katramados, Laurens Hogeweg, Bram van Ginneken, Josien P.W. Pluim, Marleen de Bruijne, Clara I. Sánchez, Mitko Veta
Abstract: Adversarial attacks are considered a potentially serious security threat for machine learning systems. Medical image analysis (MedIA) systems have recently been argued to be particularly vulnerable to adversarial attacks due to strong financial incentives. In this paper, we study several previously unexplored factors affecting adversarial attack vulnerability of deep learning MedIA systems in three medical domains: ophthalmology, radiology and pathology. Firstly, we study the effect of varying the degree of adversarial perturbation on the attack performance and its visual perceptibility. Secondly, we study how pre-training on a public dataset (ImageNet) affects the models' vulnerability to attacks. Thirdly, we study the influence of data and model architecture disparity between target and attacker models. Our experiments show that the degree of perturbation significantly affects both performance and human perceptibility of attacks. Pre-training may dramatically increase the transfer of adversarial examples; the larger the performance gain achieved by pre-training, the larger the transfer. Finally, disparity in data and/or model architecture between target and attacker models substantially decreases the success of attacks. We believe that these factors should be considered when designing cybersecurity-critical MedIA systems, as well as kept in mind when evaluating their vulnerability to adversarial attacks.

56. DSU-net: Dense SegU-net for automatic head-and-neck tumor segmentation in MR images
Pin Tang, Chen Zu, Mei Hong, Rui Yan, Xingchen Peng, Jianghong Xiao, Xi Wu, Jiliu Zhou, Luping Zhou, Yan Wang
Abstract: Precise and accurate segmentation of the most common head-and-neck tumor, nasopharyngeal carcinoma (NPC), in MRI sheds light on treatment and regulatory decision making. However, the large variations in the lesion size and shape of NPC, boundary ambiguity, and the limited number of available annotated samples make NPC segmentation in MRI a challenging task. In this paper, we propose a Dense SegU-net (DSU-net) framework for automatic NPC segmentation in MRI. Our contribution is threefold. First, unlike the traditional decoder in U-net, which uses upconvolution for upsampling, we argue that the restoration from low-resolution features to high-resolution output should be capable of preserving information significant for precise boundary localization. Hence, we use unpooling to upsample and propose SegU-net. Second, to combat the potential vanishing-gradient problem, we introduce dense blocks, which can facilitate feature propagation and reuse. Third, using only cross entropy (CE) as the loss function may cause problems such as misprediction; we therefore propose a loss function comprising both CE loss and Dice loss to train the network. Quantitative and qualitative comparisons are carried out extensively on in-house datasets, and the experimental results show that our proposed architecture outperforms the existing state-of-the-art segmentation networks.
Summary: DSU-net targets automatic segmentation of nasopharyngeal carcinoma (NPC) in MRI. It upsamples with unpooling rather than upconvolution (SegU-net), adds dense blocks to counter vanishing gradients, and trains with a combined cross-entropy and Dice loss. On in-house datasets it outperforms existing state-of-the-art segmentation networks.
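
The combined CE and Dice loss mentioned in the abstract can be sketched as follows. This is a minimal illustration for binary masks in plain Python, not the authors' implementation: the equal weighting, the `dice_weight` parameter, and the smoothing constant are assumptions, since the abstract does not specify them.

```python
import math


def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross entropy over a flattened mask."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + s) / (|P| + |T| + s)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * overlap + smooth) / (denom + smooth)


def combined_loss(probs, targets, dice_weight=1.0):
    # dice_weight is a hypothetical knob; the abstract gives no weighting.
    return cross_entropy(probs, targets) + dice_weight * dice_loss(probs, targets)


target = [1, 0, 1, 0]
good = [1.0, 0.0, 1.0, 0.0]  # prediction matches the mask
bad = [0.0, 1.0, 0.0, 1.0]   # prediction is inverted
print(combined_loss(good, target), combined_loss(bad, target))
```

CE penalizes each pixel independently, while Dice measures region overlap, which is why the two are often paired when the foreground is small.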

57. W-net: Simultaneous segmentation of multi-anatomical retinal structures using a multi-task deep neural network [PDF] Back to contents
Hongwei Zhao, Chengtao Peng, Lei Liu, Bin Li
Abstract: Segmentation of multiple anatomical structures is of great importance in medical image analysis. In this study, we proposed a $\mathcal{W}$-net to simultaneously segment both the optic disc (OD) and the exudates in retinal images based on the multi-task learning (MTL) scheme. We introduced a class-balanced loss and a multi-task weighted loss to alleviate the imbalance problem and to improve the robustness and generalization of the $\mathcal{W}$-net. We demonstrated the effectiveness of our approach with five-fold cross-validation experiments on two public datasets, e_ophtha_EX and DiaRetDb1. We achieved F1-scores of 94.76% and 95.73% for OD segmentation, and 92.80% and 94.14% for exudate segmentation. To further demonstrate the generalization of the proposed method, we applied the trained model to the DRIONS-DB dataset for OD segmentation and to the MESSIDOR dataset for exudate segmentation. Our results demonstrated that, by choosing the optimal weights of each task, the MTL-based $\mathcal{W}$-net outperformed separate models trained individually on each task. Code and pre-trained models will be available at this https URL.
Summary: $\mathcal{W}$-net is a multi-task network that segments the optic disc (OD) and exudates in retinal images simultaneously, using a class-balanced loss and a multi-task weighted loss. In five-fold cross-validation on e_ophtha_EX and DiaRetDb1 it reaches F1-scores of 94.76% and 95.73% for OD and 92.80% and 94.14% for exudates, and with suitable task weights it outperforms separately trained single-task models.
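
To make the two loss ideas concrete, here is a small sketch of inverse-frequency class weights and a weighted sum of task losses. Both are common choices assumed here for illustration; the abstract does not give the paper's exact formulation, and the function names are hypothetical.

```python
def class_balanced_weights(labels, num_classes=2):
    """Inverse-frequency weight per class (assumed scheme, not the paper's)."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    # A class absent from the labels gets weight 0 to avoid division by zero.
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def multi_task_loss(task_losses, task_weights):
    """Weighted sum of per-task losses, e.g. OD and exudate segmentation."""
    return sum(w * loss for w, loss in zip(task_weights, task_losses))


# 9 background pixels and 1 lesion pixel: the rare class is up-weighted.
weights = class_balanced_weights([0] * 9 + [1])
print(weights)
print(multi_task_loss([0.2, 0.6], [0.5, 0.5]))
```

Exudates occupy few pixels compared with background, so up-weighting the rare class keeps it from being ignored during training.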

58. Unsupervised Learning of 3D Point Set Registration [PDF] Back to contents
Lingjing Wang, Xiang Li, Yi Fang
Abstract: Point cloud registration is the process of aligning a pair of point sets by searching for a geometric transformation. Recent works leverage the power of deep learning for registering a pair of point sets. However, deep learning models often require a large number of ground truth labels for training. Moreover, for a pair of source and target point sets, existing deep learning mechanisms require explicitly designed encoders to extract both deep spatial features from unstructured point clouds and their spatial correlation representation, which is further fed to a decoder to regress the desired geometric transformation for point set alignment. To further enhance deep learning models for point set registration, this paper proposes Deep-3DAligner, a novel unsupervised registration framework based on a newly introduced deep Spatial Correlation Representation (SCR) feature. The SCR feature describes the geometric essence of the spatial correlation between source and target point sets in an encoding-free manner. More specifically, our method starts by optimizing a randomly initialized latent SCR feature, which is then decoded to a geometric transformation (i.e., rotation and translation) to align source and target point sets. Our Deep-3DAligner jointly updates the SCR feature and the weights of the transformation decoder to minimize an unsupervised alignment loss. We conducted experiments on the ModelNet40 dataset to validate the performance of our unsupervised Deep-3DAligner for point set registration. The results demonstrated that, even without ground truth or any assumption of a direct correspondence between source and target point sets for training, our proposed approach achieved performance comparable to the most recent supervised state-of-the-art approaches.
Summary: Deep-3DAligner is an unsupervised framework for point set registration built on a Spatial Correlation Representation (SCR) feature, which captures the spatial correlation between source and target point sets without an encoder. A randomly initialized latent SCR feature is decoded into a rotation and translation, and the feature and decoder weights are updated jointly to minimize an unsupervised alignment loss. On ModelNet40 the method performs comparably to recent supervised state-of-the-art approaches, without ground-truth labels or assumed point correspondences.
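
An unsupervised alignment loss of the kind described needs no point correspondences. The sketch below uses the Chamfer distance in 2D as a stand-in; the abstract does not name the loss, so Chamfer, the 2D simplification, and the function names are all assumptions, and the SCR feature and decoder are not modeled.

```python
import math


def rigid_transform(points, theta, tx, ty):
    """Rotate 2D points by theta (radians), then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def chamfer_distance(a, b):
    """Symmetric mean squared nearest-neighbor distance between two sets."""
    def one_way(src, dst):
        return sum(
            min((sx - dx) ** 2 + (sy - dy) ** 2 for dx, dy in dst)
            for sx, sy in src
        ) / len(src)
    return one_way(a, b) + one_way(b, a)


source = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
target = rigid_transform(source, 0.5, 1.0, -2.0)

# The loss is positive before alignment and zero at the true transformation.
print(chamfer_distance(source, target))
print(chamfer_distance(rigid_transform(source, 0.5, 1.0, -2.0), target))
```

In a registration pipeline, the transformation parameters would come from a decoder and be updated by gradient descent on this loss rather than being supplied by hand.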

59. COVID-19-CT-CXR: a freely accessible and weakly labeled chest X-ray and CT image collection on COVID-19 from biomedical literature [PDF] Back to contents
Yifan Peng, Yu-Xing Tang, Sungwon Lee, Yingying Zhu, Ronald M. Summers, Zhiyong Lu
Abstract: The latest threat to global health is the COVID-19 outbreak. Although there exist large datasets of chest X-rays (CXR) and computed tomography (CT) scans, few COVID-19 image collections are currently available due to patient privacy. At the same time, there is a rapid growth of COVID-19-relevant articles in the biomedical literature. Here, we present COVID-19-CT-CXR, a public database of COVID-19 CXR and CT images, which are automatically extracted from COVID-19-relevant articles from the PubMed Central Open Access (PMC-OA) Subset. We extracted figures, associated captions, and relevant figure descriptions in the article and separated compound figures into subfigures. We also designed a deep-learning model to distinguish them from other figure types and to classify them accordingly. The final database includes 1,327 CT and 263 CXR images (as of May 9, 2020) with their relevant text. To demonstrate the utility of COVID-19-CT-CXR, we conducted four case studies. (1) We show that COVID-19-CT-CXR, when used as additional training data, is able to contribute to improved DL performance for the classification of COVID-19 and non-COVID-19 CT. (2) We collected CT images of influenza and trained a DL baseline to distinguish a diagnosis of COVID-19, influenza, or normal or other types of diseases on CT. (3) We trained an unsupervised one-class classifier from non-COVID-19 CXR and performed anomaly detection to detect COVID-19 CXR. (4) From text-mined captions and figure descriptions, we compared clinical symptoms and clinical findings of COVID-19 vs. those of influenza to demonstrate the disease differences in the scientific publications. We believe that our work is complementary to existing resources and hope that it will contribute to medical image analysis of the COVID-19 pandemic. The dataset, code, and DL models are publicly available at this https URL.
Summary: COVID-19-CT-CXR is a public database of COVID-19 chest X-ray (CXR) and CT images extracted automatically, together with captions and figure descriptions, from COVID-19-relevant articles in the PubMed Central Open Access (PMC-OA) Subset. As of May 9, 2020 it contains 1,327 CT and 263 CXR images with their relevant text. Four case studies demonstrate its utility: additional training data for COVID-19 CT classification, a baseline that separates COVID-19, influenza, and normal or other diseases on CT, one-class anomaly detection on CXR, and a text-mined comparison of COVID-19 and influenza symptoms and findings. The dataset, code, and DL models are publicly available at this https URL.

60. Enabling Nonlinear Manifold Projection Reduced-Order Models by Extending Convolutional Neural Networks to Unstructured Data [PDF] Back to contents
John Tencer, Kevin Potter
Abstract: We propose a nonlinear manifold learning technique based on deep autoencoders that is appropriate for model order reduction of physical systems in complex geometries. Convolutional neural networks have proven to be highly advantageous for systems demonstrating a slow-decaying Kolmogorov n-width. However, these networks are restricted to data on structured meshes. Unstructured meshes are often required for performing analyses of real systems with complex geometry. Our custom graph convolution operators based on the available differential operators for a given spatial discretization effectively extend the application space of these deep autoencoders to systems with arbitrarily complex geometry that can only be efficiently discretized using unstructured meshes. We propose sets of convolution operators based on the spatial derivative operators for the underlying spatial discretization, making the method particularly well suited to data arising from the solution of partial differential equations. We demonstrate the method using examples from heat transfer and fluid mechanics and show better than an order of magnitude improvement in accuracy over linear subspace methods.
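Summary: A nonlinear manifold learning technique based on deep autoencoders is proposed for model order reduction of physical systems in complex geometries. Custom graph convolution operators, built from the spatial derivative operators of the underlying discretization, extend convolutional autoencoders from structured to unstructured meshes. Examples from heat transfer and fluid mechanics show better than an order of magnitude improvement in accuracy over linear subspace methods.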
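
The idea of building a convolution from the discretization's derivative operators can be illustrated on a non-uniform 1D mesh. This is a toy sketch under stated assumptions: the finite-difference operator, the two-weight filter `w0 * u + w1 * D u`, and all names are illustrative and are not taken from the paper.

```python
def derivative_operator(xs):
    """Dense first-derivative matrix on sorted, non-uniformly spaced nodes."""
    n = len(xs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)  # one-sided at the ends
        h = xs[hi] - xs[lo]
        D[i][hi] += 1.0 / h
        D[i][lo] -= 1.0 / h
    return D


def apply_operator(M, u):
    """Matrix-vector product for a dense list-of-lists matrix."""
    return [sum(m * v for m, v in zip(row, u)) for row in M]


def operator_convolution(u, D, weights):
    """Filter y = w0 * u + w1 * (D u); the weights would be learned."""
    w0, w1 = weights
    Du = apply_operator(D, u)
    return [w0 * a + w1 * b for a, b in zip(u, Du)]


xs = [0.0, 0.1, 0.4, 1.0]            # irregular node spacing
u = [2.0 * x + 1.0 for x in xs]      # linear field with slope 2
D = derivative_operator(xs)
print(operator_convolution(u, D, (0.0, 1.0)))  # approximately the slope at every node
```

Because the stencil is derived from node positions rather than a fixed pixel grid, the same filter definition applies to any mesh for which a derivative operator is available.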

61. Deterministic Gaussian Averaged Neural Networks [PDF] Back to contents
Ryan Campbell, Chris Finlay, Adam M Oberman
Abstract: We present a deterministic method to compute the Gaussian average of neural networks used in regression and classification. Our method is based on an equivalence between training with a particular regularized loss, and the expected values of Gaussian averages. We use this equivalence to certify models which perform well on clean data but are not robust to adversarial perturbations. In terms of certified accuracy and adversarial robustness, our method is comparable to known stochastic methods such as randomized smoothing, but requires only a single model evaluation during inference.
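Summary: A deterministic method computes the Gaussian average of neural networks used in regression and classification, based on an equivalence between training with a particular regularized loss and the expected values of Gaussian averages. The equivalence is used to certify models that perform well on clean data but are not robust to adversarial perturbations. Certified accuracy and adversarial robustness are comparable to stochastic methods such as randomized smoothing, with only a single model evaluation at inference.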
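
For context, the stochastic baseline that the paper compares against estimates the Gaussian average by sampling. The sketch below shows that Monte Carlo estimate on a scalar function; it is not the paper's deterministic method, and the function name and sample count are assumptions chosen for illustration.

```python
import random


def gaussian_average(f, x, sigma, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x + sigma * z)] with z ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(num_samples):
        total += f(x + sigma * rng.gauss(0.0, 1.0))
    return total / num_samples


# For f(v) = v^2 the exact Gaussian average is x^2 + sigma^2 = 1.25 here.
estimate = gaussian_average(lambda v: v * v, x=1.0, sigma=0.5)
print(estimate)
```

Each estimate costs `num_samples` evaluations of `f`, which is the inference overhead that a single-evaluation deterministic method avoids.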

62. Disease Detection from Lung X-ray Images based on Hybrid Deep Learning [PDF] Back to contents
Subrato Bharati, Prajoy Podder, M. Rubaiyat Hossain Mondal
Abstract: Lung disease can be considered the second most common type of disease for men and women. Many people die every year of lung diseases such as lung cancer, asthma, and CPD (chronic pulmonary disease). Early detection of lung cancer can lessen the probability of death. In this paper, a chest X-ray image dataset is used to diagnose and analyze lung disease. For binary classification, some important criteria are selected: precision, recall, F-beta score, and accuracy. The fusion of AI and cancer diagnosis is attracting huge interest as a cancer diagnostic tool. Recently, deep-learning-based AI such as the convolutional neural network (CNN) can be successfully applied to disease classification and prediction. This paper mainly focuses on the performance of a vanilla neural network, a CNN, a fusion of CNN and the Visual Geometry Group based neural network (VGG), a fusion of CNN, VGG, and STN, and finally a capsule network. A basic CNN normally performs poorly on rotated, tilted, or otherwise abnormally oriented images. Hybrid systems are therefore presented to improve accuracy while keeping training time low. All models have been implemented on two groups of datasets: the full dataset and a sample dataset. A comparative analysis is therefore presented in this paper. Some visualizations of the attributes of the dataset are also shown.
Summary: Using a chest X-ray image dataset, the paper compares a vanilla neural network, a CNN, a CNN+VGG fusion, a CNN+VGG+STN fusion, and a capsule network for lung disease detection, evaluated by precision, recall, F-beta score, and accuracy. The hybrid models aim to improve accuracy on rotated or tilted images while keeping training time low. All models are run on both the full dataset and a sample dataset.
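
The four evaluation criteria named in the abstract have standard definitions for binary classification, sketched here in plain Python. The function name and the example labels are made up for illustration and are not data from the paper.

```python
def binary_metrics(y_true, y_pred, beta=1.0):
    """Precision, recall, F-beta, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    # F-beta weights recall beta times as heavily as precision.
    f_beta = (1.0 + b2) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_beta": f_beta, "accuracy": accuracy}


# Illustrative labels only: 1 = disease present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

With `beta > 1` the score favors recall, which is often preferred in screening settings where a missed case costs more than a false alarm.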

Note: the summaries are machine-translated!


[arXiv papers] Computer Vision and Pattern Recognition 2020-06-12
