Abstracts

1. Unsupervised Transfer Learning with Self-Supervised Remedy
Jiabo Huang, Shaogang Gong
Abstract: Generalising deep networks to novel domains without manual labels is a challenge for deep learning. The problem is intrinsically difficult because the distribution of imagery data in novel domains changes unpredictably. Pre-learned knowledge does not transfer well without strong assumptions about the learned and the novel domains. Different methods have been studied to address the underlying problem under different assumptions, ranging from domain adaptation to zero-shot and few-shot learning. In this work, we address this problem by transfer clustering, which aims to learn a discriminative latent space of the unlabelled target data in a novel domain by knowledge transfer from labelled related domains. Specifically, we leverage relative (pairwise) imagery information, which is freely available and intrinsic to a target domain, to model the target-domain image distribution characteristics, together with the prior knowledge learned from related labelled domains, to enable more discriminative clustering of unlabelled target data. Our method mitigates non-transferable prior knowledge by self-supervision, benefiting from both transfer and self-supervised learning. Extensive experiments on four datasets for image clustering tasks show the superiority of our model over state-of-the-art transfer clustering techniques. We further demonstrate its competitive transferability on four zero-shot learning benchmarks.

2. ResKD: Residual-Guided Knowledge Distillation
Xuewei Li, Songyuan Li, Bourahla Omar, Xi Li
Abstract: Knowledge distillation has emerged as a promising technique for compressing neural networks. Due to the capacity gap between a heavy teacher and a lightweight student, there is a significant performance gap between them. In this paper, we see knowledge distillation in a fresh light, using the knowledge gap between a teacher and a student as guidance to train a lighter-weight student called the res-student. The combination of a normal student and a res-student becomes a new student. Such a residual-guided process can be repeated. Experimental results show that we achieve competitive results on the CIFAR-10/100, Tiny-ImageNet, and ImageNet datasets.
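
The residual-guided step described in this abstract can be illustrated with plain logit arithmetic. This is only a sketch under stated assumptions: it assumes the knowledge gap is the difference between teacher and student logits and that the new student sums the two outputs. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def knowledge_gap(teacher_logits, student_logits):
    """Residual the res-student is trained to reproduce (assumed: logit difference)."""
    return [t - s for t, s in zip(teacher_logits, student_logits)]


def combine(student_logits, res_logits):
    """New student = normal student + res-student (assumed: element-wise sum)."""
    return [s + r for s, r in zip(student_logits, res_logits)]


# Hypothetical logits for one sample over three classes.
teacher = [4.0, 1.0, -2.0]
student = [3.0, 1.5, -1.0]

gap = knowledge_gap(teacher, student)            # training target for the res-student
res_student = [0.75, -0.25, -0.5]                # an imperfect fit of that gap
new_student = combine(student, res_student)      # output of the combined student
remaining = knowledge_gap(teacher, new_student)  # gap left for a further round

print(gap, new_student, remaining)
```

Because the process can be repeated, `remaining` would serve as the target for the next res-student in this sketch.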

3. Multimodal Future Localization and Emergence Prediction for Objects in Egocentric View with a Reachability Prior
Osama Makansi, Özgün Cicek, Kevin Buchicchio, Thomas Brox
Abstract: In this paper, we investigate the problem of anticipating future dynamics, particularly the future location of other vehicles and pedestrians, in the view of a moving vehicle. We approach two fundamental challenges: (1) the partial visibility due to the egocentric view with a single RGB camera and considerable field-of-view change due to the egomotion of the vehicle; (2) the multimodality of the distribution of future states. In contrast to many previous works, we do not assume structural knowledge from maps. We rather estimate a reachability prior for certain classes of objects from the semantic map of the present image and propagate it into the future using the planned egomotion. Experiments show that the reachability prior combined with multi-hypotheses learning improves multimodal prediction of the future location of tracked objects and, for the first time, the emergence of new objects. We also demonstrate promising zero-shot transfer to unseen datasets. Source code is available at this https URL.

4. Unstructured Road Vanishing Point Detection Using the Convolutional Neural Network and Heatmap Regression
Yin-Bo Liu, Ming Zeng, Qing-Hao Meng
Abstract: Unstructured road vanishing point (VP) detection is a challenging problem, especially in the field of autonomous driving. In this paper, we proposed a novel solution combining the convolutional neural network (CNN) and heatmap regression to detect unstructured road VP. The proposed algorithm firstly adopts a lightweight backbone, i.e., depthwise convolution modified HRNet, to extract hierarchical features of the unstructured road image. Then, three advanced strategies, i.e., multi-scale supervised learning, heatmap super-resolution, and coordinate regression techniques are utilized to achieve fast and high-precision unstructured road VP detection. The empirical results on Kong's dataset show that our proposed approach enjoys the highest detection accuracy compared with state-of-the-art methods under various conditions in real-time, achieving the highest speed of 33 fps.
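
Heatmap regression, as mentioned in this abstract, predicts a score map whose peak marks the vanishing point. The sketch below shows only the generic decoding step (peak search followed by rescaling to image pixels). The `stride` value, the function name, and the example heatmap are assumptions for illustration; the paper's coordinate-regression refinement is not reproduced here.

```python
def decode_vanishing_point(heatmap, stride=4):
    """Return (x, y) image coordinates of the heatmap peak.

    heatmap: 2D list of scores.
    stride: assumed downsampling factor between input image and heatmap.
    """
    best_score, best_x, best_y = float("-inf"), 0, 0
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_score, best_x, best_y = score, x, y
    # Map the centre of the peak cell back to input-image pixels.
    return ((best_x + 0.5) * stride, (best_y + 0.5) * stride)


# Hypothetical 3x4 heatmap with its peak at column 2, row 1.
heatmap = [
    [0.0, 0.1, 0.2, 0.1],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.1, 0.3, 0.1],
]
vp = decode_vanishing_point(heatmap)
print(vp)
```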

5. Semantic Graph-enhanced Visual Network for Zero-shot Learning
Yang Hu, Guihua Wen, Adriane Chapman, Pei Yang, Mingnan Luo, Yingxue Xu, Dan Dai, Wendy Hall
Abstract: Zero-shot learning uses semantic attributes to connect the search space of unseen objects. In recent years, although deep convolutional networks have brought powerful visual modeling capabilities to the ZSL task, their visual features suffer from severe pattern inertia and lack a representation of semantic relationships, which leads to severe bias and ambiguity. In response, we propose the Graph-based Visual-Semantic Entanglement Network, which conducts graph modeling of visual features that are mapped to semantic attributes by using a knowledge graph. It contains several novel designs: 1. it establishes a multi-path entangled network with the convolutional neural network (CNN) and the graph convolutional network (GCN), which feeds the visual features from the CNN into the GCN to model the implicit semantic relations, after which the GCN feeds the graph-modeled information back to the CNN features; 2. it uses attribute word vectors as the target for the graph semantic modeling of the GCN, which forms a self-consistent regression for graph modeling and supervises the GCN to learn more personalized attribute relations; 3. it fuses and supplements the hierarchical visual-semantic features refined by graph modeling into the visual embedding. By promoting the semantic linkage modeling of visual features, our method outperforms state-of-the-art approaches on multiple representative ZSL datasets: AwA2, CUB, and SUN.

6. SoftFlow: Probabilistic Framework for Normalizing Flow on Manifolds
Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Joun Yeop Lee, Nam Soo Kim
Abstract: Flow-based generative models are composed of invertible transformations between two random variables of the same dimension. Therefore, flow-based models cannot be adequately trained if the dimension of the data distribution does not match that of the underlying target distribution. In this paper, we propose SoftFlow, a probabilistic framework for training normalizing flows on manifolds. To sidestep the dimension mismatch problem, SoftFlow estimates a conditional distribution of the perturbed input data instead of learning the data distribution directly. We experimentally show that SoftFlow can capture the innate structure of the manifold data and generate high-quality samples unlike the conventional flow-based models. Furthermore, we apply the proposed framework to 3D point clouds to alleviate the difficulty of forming thin structures for flow-based models. The proposed model for 3D point clouds, namely SoftPointFlow, can estimate the distribution of various shapes more accurately and achieves state-of-the-art performance in point cloud generation.

7. Incorporating Image Gradients as Secondary Input Associated with Input Image to Improve the Performance of the CNN Model
Vijay Pandey, Shashi Bhushan Jha
Abstract: The CNN is a very popular neural network architecture today. It is the most widely used tool for vision-related tasks, where it extracts the important features from a given image. A CNN works as a filter that extracts these features using convolution operations in distinct layers. In existing CNN architectures, only a single form of the given input is fed to the network during training. In this paper, a new architecture is proposed in which the given input is passed to the network in more than one form simultaneously, with the layers shared by both forms of input. We incorporate the image gradient as a second form of the input associated with the original input image and allow both inputs to flow through the network using the same number of parameters, improving the performance of the model through better generalization. Applied to a diverse set of datasets such as MNIST, CIFAR10 and CIFAR100, the proposed CNN architecture shows superior results compared to the benchmark CNN architecture that considers inputs in a single form.
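
The second input form described in this abstract is an image gradient. The abstract does not say which gradient operator is used, so the sketch below assumes simple central differences with replicated borders on a grayscale image. The function names and the example image are hypothetical.

```python
import math


def image_gradients(img):
    """Return (gx, gy) central-difference gradients of a 2D grayscale image.

    Border pixels reuse their nearest neighbour (replicated edges).
    """
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            xl, xr = max(x - 1, 0), min(x + 1, w - 1)
            yu, yd = max(y - 1, 0), min(y + 1, h - 1)
            gx[y][x] = (img[y][xr] - img[y][xl]) / 2.0
            gy[y][x] = (img[yd][x] - img[yu][x]) / 2.0
    return gx, gy


def gradient_magnitude(img):
    """Per-pixel gradient magnitude, usable as a second input form."""
    gx, gy = image_gradients(img)
    return [[math.hypot(a, b) for a, b in zip(rx, ry)] for rx, ry in zip(gx, gy)]


# Hypothetical image with a vertical edge between columns 1 and 2.
img = [[0.0, 0.0, 10.0, 10.0] for _ in range(3)]
mag = gradient_magnitude(img)
print(mag[0])
```

In this sketch the original image and `mag` would both be passed through the same shared layers, which is why the parameter count stays unchanged.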

8. Person Re-identification in the 3D Space
Zhedong Zheng, Yi Yang
Abstract: People live in a 3D world. However, existing works on person re-identification (re-id) mostly consider the representation learning in a 2D space, intrinsically limiting the understanding of people. In this work, we address this limitation by exploring the prior knowledge of the 3D body structure. Specifically, we project 2D images to a 3D space and introduce a novel Omni-scale Graph Network (OG-Net) to learn the representation from sparse 3D points. With the help of 3D geometry information, we can learn a new type of deep re-id feature free from noisy variants, such as scale and viewpoint. To our knowledge, we are among the first attempts to conduct person re-identification in the 3D space. Extensive experiments show that the proposed method achieves competitive results on three popular large-scale person re-id datasets, and has good scalability to unseen datasets.

9. FibeR-CNN: Expanding Mask R-CNN to Improve Image-Based Fiber Analysis
Max Frei, Frank Einar Kruis
Abstract: Fiber-shaped materials (e.g. carbon nanotubes) are of great relevance, due to their unique properties but also the health risk they can pose. Unfortunately, image-based analysis of fibers still involves manual annotation, which is a time-consuming and costly process. We therefore propose the use of region-based convolutional neural networks (R-CNNs) to automate this task. Mask R-CNN, the most widely used R-CNN for semantic segmentation tasks, is prone to errors when it comes to the analysis of fiber-shaped objects. Therefore, a new architecture - FibeR-CNN - is introduced and validated. FibeR-CNN combines two established R-CNN architectures (Mask and Keypoint R-CNN) and adds additional network heads for the prediction of fiber widths and lengths. As a result, FibeR-CNN is able to surpass the mean average precision of Mask R-CNN by 33 % (11 percentage points) on a novel test data set of fiber images. Source code available online.

10. Learning 3D-3D Correspondences for One-shot Partial-to-partial Registration
Zheng Dang, Fei Wang, Mathieu Salzmann
Abstract: While 3D-3D registration is traditionally tackled by optimization-based methods, recent work has shown that learning-based techniques could achieve faster and more robust results. In this context, however, only PRNet can handle the partial-to-partial registration scenario. Unfortunately, this is achieved at the cost of relying on an iterative procedure, with a complex network architecture. Here, we show that learning-based partial-to-partial registration can be achieved in a one-shot manner, jointly reducing network complexity and increasing registration accuracy. To this end, we propose an Optimal Transport layer able to account for occluded points thanks to the use of outlier bins. The resulting OPRNet framework outperforms the state of the art on standard benchmarks, demonstrating better robustness and generalization ability than existing techniques.

11. More Information Supervised Probabilistic Deep Face Embedding Learning
Ying Huang, Shangfeng Qiu, Wenwei Zhang, Xianghui Luo, Jinzhuo Wang
Abstract: Research using margin-based comparison losses demonstrates the effectiveness of penalizing the distance between face features and their corresponding class centers. Despite their popularity and excellent performance, these losses do not explicitly encourage generic embedding learning for an open-set recognition problem. In this paper, we analyse the margin-based softmax loss from a probability view. From this perspective, we propose two general principles for designing new margin loss functions: 1) monotonic decreasing and 2) margin probability penalty. Unlike methods optimized with a single comparison metric, we provide a new perspective that treats open-set face recognition as a problem of information transmission, in which the generalization capability of the face embedding is gained with more clean information. An auto-encoder architecture called Linear-Auto-TS-Encoder (LATSE) is proposed to corroborate this finding. Extensive experiments on several benchmarks demonstrate that LATSE helps the face embedding gain more generalization capability and boosts the single-model performance with an open training dataset to more than 99% on the MegaFace test.

12. Action Recognition with Deep Multiple Aggregation Networks
Ahmed Mazari, Hichem Sahbi
Abstract: Most of the current action recognition algorithms are based on deep networks which stack multiple convolutional, pooling and fully connected layers. While convolutional and fully connected operations have been widely studied in the literature, the design of pooling operations that handle action recognition, with different sources of temporal granularity in action categories, has comparatively received less attention, and existing solutions rely mainly on max or averaging operations. The latter are clearly powerless to fully exhibit the actual temporal granularity of action categories and thereby constitute a bottleneck in classification performances. In this paper, we introduce a novel hierarchical pooling design that captures different levels of temporal granularity in action recognition. Our design principle is coarse-to-fine and achieved using a tree-structured network; as we traverse this network top-down, pooling operations are getting less invariant but timely more resolute and well localized. Learning the combination of operations in this network -- which best fits a given ground-truth -- is obtained by solving a constrained minimization problem whose solution corresponds to the distribution of weights that capture the contribution of each level (and thereby temporal granularity) in the global hierarchical pooling process. Besides being principled and well grounded, the proposed hierarchical pooling is also video-length and resolution agnostic. Extensive experiments conducted on the challenging UCF-101, HMDB-51 and JHMDB-21 databases corroborate all these statements.

13. Deep hierarchical pooling design for cross-granularity action recognition
Ahmed Mazari, Hichem Sahbi
Abstract: In this paper, we introduce a novel hierarchical aggregation design that captures different levels of temporal granularity in action recognition. Our design principle is coarse-to-fine and achieved using a tree-structured network; as we traverse this network top-down, pooling operations are getting less invariant but timely more resolute and well localized. Learning the combination of operations in this network -- which best fits a given ground-truth -- is obtained by solving a constrained minimization problem whose solution corresponds to the distribution of weights that capture the contribution of each level (and thereby temporal granularity) in the global hierarchical pooling process. Besides being principled and well grounded, the proposed hierarchical pooling is also video-length agnostic and resilient to misalignments in actions. Extensive experiments conducted on the challenging UCF-101 database corroborate these statements.

14. Continual Representation Learning for Biometric Identification
Bo Zhao, Shixiang Tang, Dapeng Chen, Hakan Bilen, Rui Zhao
Abstract: With the explosion of digital data in recent years, continuously learning new tasks from a stream of data without forgetting previously acquired knowledge has become increasingly important. In this paper, we propose a new continual learning (CL) setting, namely "continual representation learning", which focuses on learning better representations in a continuous way. We also provide two large-scale multi-step benchmarks for biometric identification, where the visual appearance of different classes is highly relevant. In contrast to requiring the model to recognize more learned classes, we aim to learn a feature representation that generalizes better not only to previously unseen images but also to unseen classes/identities. For the new setting, we propose a novel approach that performs knowledge distillation over a large number of identities by applying neighbourhood selection and consistency relaxation strategies to improve the scalability and flexibility of the continual learning model. We demonstrate that existing CL methods can improve the representation in the new setting, and our method achieves better results than the competitors.

15. Novel Adaptive Binary Search Strategy-First Hybrid Pyramid- and Clustering-Based CNN Filter Pruning Method without Parameters Setting
Kuo-Liang Chung, Yu-Lun Chang, Bo-Wei Tsai
Abstract: Pruning redundant filters in CNN models has received growing attention. In this paper, we propose an adaptive binary search-first hybrid pyramid- and clustering-based (ABSHPC-based) method for pruning filters automatically. In our method, for each convolutional layer, initially a hybrid pyramid data structure is constructed to store the hierarchical information of each filter. Given a tolerant accuracy loss, without parameters setting, we begin from the last convolutional layer to the first layer; for each considered layer with less or equal pruning rate relative to its previous layer, our ABSHPC-based process is applied to optimally partition all filters to clusters, where each cluster is thus represented by the filter with the median root mean of the hybrid pyramid, leading to maximal removal of redundant filters. Based on the practical dataset and the CNN models, with higher accuracy, the thorough experimental results demonstrated the significant parameters and floating-point operations reduction merits of the proposed filter pruning method relative to the state-of-the-art methods.

16. Passive Batch Injection Training Technique: Boosting Network Performance by Injecting Mini-Batches from a different Data Distribution
Pravendra Singh, Pratik Mazumder, Vinay P. Namboodiri
Abstract: This work presents a novel training technique for deep neural networks that makes use of additional data from a distribution that is different from that of the original input data. This technique aims to reduce overfitting and improve the generalization performance of the network. Our proposed technique, namely Passive Batch Injection Training Technique (PBITT), even reduces the level of overfitting in networks that already use the standard techniques for reducing overfitting such as $L_2$ regularization and batch normalization, resulting in significant accuracy improvements. Passive Batch Injection Training Technique (PBITT) introduces a few passive mini-batches into the training process that contain data from a distribution that is different from the input data distribution. This technique does not increase the number of parameters in the final model and also does not increase the inference (test) time but still improves the performance of deep CNNs. To the best of our knowledge, this is the first work that makes use of different data distribution to aid the training of convolutional neural networks (CNNs). We thoroughly evaluate the proposed approach on standard architectures: VGG, ResNet, and WideResNet, and on several popular datasets: CIFAR-10, CIFAR-100, SVHN, and ImageNet. We observe consistent accuracy improvement by using the proposed technique. We also show experimentally that the model trained by our technique generalizes well to other tasks such as object detection on the MS-COCO dataset using Faster R-CNN. We present extensive ablations to validate the proposed approach. Our approach improves the accuracy of VGG-16 by a significant margin of 2.1% over the CIFAR-100 dataset.

17. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection
Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, Jian Yang
Abstract: One-stage detector basically formulates object detection as dense classification and localization. The classification is usually optimized by Focal Loss and the box location is commonly learned under Dirac delta distribution. A recent trend for one-stage detectors is to introduce an individual prediction branch to estimate the quality of localization, where the predicted quality facilitates the classification to improve detection performance. This paper delves into the representations of the above three fundamental elements: quality estimation, classification and localization. Two problems are discovered in existing practices, including (1) the inconsistent usage of the quality estimation and classification between training and inference and (2) the inflexible Dirac delta distribution for localization when there is ambiguity and uncertainty in complex scenes. To address the problems, we design new representations for these elements. Specifically, we merge the quality estimation into the class prediction vector to form a joint representation of localization quality and classification, and use a vector to represent arbitrary distribution of box locations. The improved representations eliminate the inconsistency risk and accurately depict the flexible distribution in real data, but contain continuous labels, which is beyond the scope of Focal Loss. We then propose Generalized Focal Loss (GFL) that generalizes Focal Loss from its discrete form to the continuous version for successful optimization. On COCO test-dev, GFL achieves 45.0% AP using ResNet-101 backbone, surpassing state-of-the-art SAPD (43.5%) and ATSS (43.6%) with higher or comparable inference speed, under the same backbone and training settings. Notably, our best model can achieve a single-model single-scale AP of 48.2%, at 10 FPS on a single 2080Ti GPU. Code and models are available at this https URL.
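
The abstract states that GFL extends Focal Loss from discrete labels to continuous ones but gives no formula. The sketch below shows one form consistent with that description, a cross-entropy against a soft label `y` scaled by `|y - sigma| ** beta`. The function name, the clamp value, and the default `beta` are assumptions made for illustration and should be checked against the paper before use.

```python
import math


def quality_focal_loss(sigma, y, beta=2.0):
    """Focal-style loss for a continuous label y in [0, 1].

    sigma: predicted score in (0, 1); beta: assumed modulating exponent.
    """
    eps = 1e-12
    sigma = min(max(sigma, eps), 1.0 - eps)  # keep the logarithms finite
    cross_entropy = -((1.0 - y) * math.log(1.0 - sigma) + y * math.log(sigma))
    return abs(y - sigma) ** beta * cross_entropy


perfect = quality_focal_loss(0.7, 0.7)    # prediction equals the soft label
hard_pos = quality_focal_loss(0.25, 1.0)  # discrete label y = 1
print(perfect, hard_pos)
```

With `y = 1` the expression reduces to `(1 - sigma) ** beta * -log(sigma)`, the familiar focal-loss term for a positive sample without class weighting, which is the sense in which the continuous form generalizes the discrete one.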

18. Semantics-Driven Unsupervised Learning for Monocular Depth and Ego-Motion Estimation
Xiaobin Wei, Jianjiang Feng, Jie Zhou
Abstract: We propose a semantics-driven unsupervised learning approach for monocular depth and ego-motion estimation from videos in this paper. Recent unsupervised learning methods employ photometric errors between synthetic view and actual image as a supervision signal for training. In our method, we exploit semantic segmentation information to mitigate the effects of dynamic objects and occlusions in the scene, and to improve depth prediction performance by considering the correlation between depth and semantics. To avoid costly labeling process, we use noisy semantic segmentation results obtained by a pre-trained semantic segmentation network. In addition, we minimize the position error between the corresponding points of adjacent frames to utilize 3D spatial information. Experimental results on the KITTI dataset show that our method achieves good performance in both depth and ego-motion estimation tasks.

19. Neural Sparse Representation for Image Restoration
Yuchen Fan, Jiahui Yu, Yiqun Mei, Yulun Zhang, Yun Fu, Ding Liu, Thomas S. Huang
Abstract: Inspired by the robustness and efficiency of sparse representation in sparse coding based image restoration models, we investigate the sparsity of neurons in deep networks. Our method structurally enforces sparsity constraints upon hidden neurons. The sparsity constraints are favorable for gradient-based learning algorithms and attachable to convolution layers in various networks. Sparsity in neurons enables computation saving by only operating on non-zero components without hurting accuracy. Meanwhile, our method can magnify representation dimensionality and model capacity with negligible additional computation cost. Experiments show that sparse representation is crucial in deep neural networks for multiple image restoration tasks, including image super-resolution, image denoising, and image compression artifacts removal. Code is available at this https URL

20. Associate-3Ddet: Perceptual-to-Conceptual Association for 3D Point Cloud Object Detection
Liang Du, Xiaoqing Ye, Xiao Tan, Jianfeng Feng, Zhenbo Xu, Errui Ding, Shilei Wen
Abstract: Object detection from 3D point clouds remains a challenging task, though recent studies have pushed the envelope with deep learning techniques. Owing to severe spatial occlusion and the inherent variance of point density with the distance to the sensor, the appearance of the same object varies a lot in point cloud data. Designing a feature representation that is robust against such appearance changes is hence the key issue in a 3D object detection method. In this paper, we propose a domain-adaptation-like approach to enhance the robustness of the feature representation. More specifically, we bridge the gap between the perceptual domain, where the feature comes from a real scene, and the conceptual domain, where the feature is extracted from an augmented scene consisting of non-occluded point clouds rich in detailed information. This domain adaptation approach mimics the functionality of the human brain during object perception. Extensive experiments demonstrate that our simple yet effective approach fundamentally boosts the performance of 3D point cloud object detection and achieves state-of-the-art results.

21. Fast Synthetic LiDAR Rendering via Spherical UV Unwrapping of Equirectangular Z-Buffer Images
Mohammed Hossny, Khaled Saleh, Mohammed Attia, Ahmed Abobakr, Julie Iskander
Abstract: LiDAR data is becoming increasingly essential with the rise of autonomous vehicles. Its ability to provide a 360° horizontal field of view of the point cloud equips self-driving vehicles with enhanced situational awareness. While synthetic LiDAR data generation pipelines provide a good solution for advancing machine learning research on LiDAR, they suffer from a major shortcoming: rendering time. Physically accurate LiDAR simulators (e.g. BlenSor) are computationally expensive, with an average rendering time of 14-60 seconds per frame for urban scenes. This is often compensated for by using 3D models with simplified polygon topology (low-poly assets), as is the case in CARLA (Dosovitskiy et al., 2017). However, this comes at the price of coarse-grained, unrealistic LiDAR point clouds. In this paper, we present a novel method to simulate LiDAR point clouds with a faster rendering time of 1 sec per frame. The proposed method relies on spherical UV unwrapping of equirectangular Z-buffer images. We chose BlenSor (Gschwandtner et al., 2011) as the baseline method against which to compare the point clouds generated using the proposed method. The reported error for complex urban landscapes is 4.28 cm for a scanning range of 2-120 meters with Velodyne HDL64-E2 parameters. The proposed method reported a total time of 3.2 ± 0.31 seconds per frame. In contrast, the BlenSor baseline method reported 16.2 ± 1.82 seconds.
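To make the idea of unwrapping an equirectangular depth image into a point cloud more concrete, here is a minimal sketch. It is not the authors' implementation: the function name, the pixel-to-angle convention, and the assumption that each pixel stores a radial distance in meters are all choices made for this example. Only the 2-120 meter range comes from the abstract.

```python
import math

def equirect_depth_to_points(depth, min_range=2.0, max_range=120.0):
    """Convert an equirectangular range image into a list of (x, y, z) points.

    Assumption of this sketch: depth[v][u] holds the radial distance in meters
    along the ray through pixel (u, v).
    """
    height = len(depth)
    width = len(depth[0])
    points = []
    for v in range(height):
        # Elevation goes from +pi/2 (top row) to -pi/2 (bottom row), at pixel centers.
        elevation = math.pi / 2 - (v + 0.5) / height * math.pi
        for u in range(width):
            r = depth[v][u]
            if not (min_range <= r <= max_range):
                continue  # skip empty or out-of-range pixels
            # Azimuth goes from -pi to +pi across the image width.
            azimuth = (u + 0.5) / width * 2.0 * math.pi - math.pi
            x = r * math.cos(elevation) * math.cos(azimuth)
            y = r * math.cos(elevation) * math.sin(azimuth)
            z = r * math.sin(elevation)
            points.append((x, y, z))
    return points
```

Each valid pixel yields one point whose distance from the origin equals the stored range, so a constant-range image maps to points on a sphere.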

22. Deep Neural Network Based Real-time Kiwi Fruit Flower Detection in an Orchard Environment
JongYoon Lim, Ho Seok Ahn, Mahla Nejati, Jamie Bell, Henry Williams, Bruce A. MacDonald
Abstract: In this paper, we present a novel approach to kiwi fruit flower detection using Deep Neural Networks (DNNs) to build an accurate, fast, and robust autonomous pollination robot system. Recent work in deep neural networks has shown outstanding performance on object detection tasks in many areas. Inspired by this, we aim to exploit DNNs for kiwi fruit flower detection and present intensive experiments, with analysis, on two state-of-the-art object detectors, Faster R-CNN and Single Shot Detector (SSD) Net, and two feature extractors, Inception Net V2 and NAS Net, using real-world orchard datasets. We also compare these approaches to find an optimal model that suits a real-time agricultural pollination robot system in terms of accuracy and processing speed. We perform experiments with datasets collected in different seasons and locations (spatio-temporal consistency) in order to demonstrate the performance of the generalized model. The proposed system achieves promising results of 0.919, 0.874, and 0.889 for precision, recall, and F1-score respectively on our real-world dataset, and this performance satisfies the requirement for deploying the system on an autonomous pollination robotics system.

23. Fully Convolutional Mesh Autoencoder using Efficient Spatially Varying Kernels
Yi Zhou, Chenglei Wu, Zimo Li, Chen Cao, Yuting Ye, Jason Saragih, Hao Li, Yaser Sheikh
Abstract: Learning latent representations of registered meshes is useful for many 3D tasks. Techniques have recently shifted to neural mesh autoencoders. Although they demonstrate higher precision than traditional methods, they remain unable to capture fine-grained deformations. Furthermore, these methods can only be applied to a template-specific surface mesh and are not applicable to more general meshes, such as tetrahedrons and non-manifold meshes. While more general graph convolution methods can be employed, they fall short in reconstruction precision and require more memory. In this paper, we propose a non-template-specific fully convolutional mesh autoencoder for arbitrary registered mesh data. It is enabled by our novel convolution and (un)pooling operators, learned with globally shared weights and locally varying coefficients, which can efficiently capture the spatially varying contents presented by irregular mesh connections. Our model outperforms state-of-the-art methods on reconstruction accuracy. In addition, the latent codes of our network are fully localized thanks to the fully convolutional structure, and thus have much higher interpolation capability than many traditional 3D mesh generation models.

24. Ensemble Model with Batch Spectral Regularization and Data Blending for Cross-Domain Few-Shot Learning with Unlabeled Data
Zhen Zhao, Bingyu Liu, Yuhong Guo, Jieping Ye
Abstract: It is difficult for deep learning models to achieve good performance when data is scarce and there are domain gaps. Cross-domain few-shot learning (CD-FSL) is designed to address this problem. We propose an Ensemble Model with Batch Spectral Regularization and Data Blending for Track 2 of the CD-FSL challenge. We use different feature mapping matrices to obtain an ensemble framework. In each branch, Batch Spectral Regularization is used to suppress the singular values of the feature matrix and thereby improve the model's transferability. In the fine-tuning stage, a data blending method is used to fuse the information of unlabeled data with the support set. The prediction result is refined with label propagation. We conduct experiments on the CD-FSL benchmark tasks, and the results demonstrate the effectiveness of the proposed method.

25. Counterfactual VQA: A Cause-Effect Look at Language Bias
Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, Ji-Rong Wen
Abstract: Visual Question Answering (VQA) models tend to rely on language bias and thus fail to learn reasoning from visual knowledge, which is, however, the original intention of VQA. In this paper, we propose a novel cause-effect look at language bias, in which the bias is formulated as the direct effect of the question on the answer from the viewpoint of causal inference. The effect can be captured by counterfactual VQA, an imagined scenario in which the image had not existed. Our proposed cause-effect look 1) is general to any baseline VQA architecture, 2) achieves significant improvement on the language-bias-sensitive VQA-CP dataset, and 3) fills the theoretical gap in recent language-prior-based works.

26. Are We Hungry for 3D LiDAR Data for Semantic Segmentation?
Biao Gao, Yancheng Pan, Chengkun Li, Sibo Geng, Huijing Zhao
Abstract: 3D LiDAR semantic segmentation is a pivotal task that is widely involved in many applications, such as autonomous driving and robotics. Studies of 3D LiDAR semantic segmentation have recently achieved considerable development, especially in terms of deep learning strategies. However, these studies usually rely heavily on large amounts of finely annotated data, while point-wise 3D LiDAR datasets are extremely scarce and expensive to label. The performance limitation caused by the lack of training data is called the data hungry effect. This survey aims to explore whether and how we are hungry for 3D LiDAR data for semantic segmentation. To that end, we first provide an organized review of existing 3D datasets and 3D semantic segmentation methods. Then, we provide an in-depth analysis of three representative datasets and several experiments to evaluate the data hungry effects in different aspects. Efforts to solve data hungry problems are summarized for both 3D LiDAR-focused methods and general-purpose methods. Finally, insightful topics are discussed for future research on data hungry problems and open questions.

27. Text Detection and Recognition in the Wild: A Review
Zobeir Raisi, Mohamed A. Naiel, Paul Fieguth, Steven Wardell, John Zelek
Abstract: Detection and recognition of text in natural images are two main problems in computer vision, with a wide variety of applications in sports video analysis, autonomous driving, and industrial automation, to name a few. They face common challenges stemming from how text is represented and how it is affected by environmental conditions. Current state-of-the-art scene text detection and/or recognition methods have exploited recent advances in deep learning architectures and reported superior accuracy on benchmark datasets when tackling multi-resolution and multi-oriented text. However, several challenges remain for text in wild images that cause existing methods to underperform, because their models cannot generalize to unseen data and labeled data are insufficient. Thus, unlike previous surveys in this field, the objectives of this survey are as follows. First, it offers the reader not only a review of recent advances in scene text detection and recognition, but also the results of extensive experiments using a unified evaluation framework that assesses pre-trained models of the selected methods on challenging cases and applies the same evaluation criteria to these techniques. Second, it identifies several existing challenges for detecting or recognizing text in wild images, namely in-plane rotation, multi-oriented and multi-resolution text, perspective distortion, illumination reflection, partial occlusion, complex fonts, and special characters. Finally, the paper presents insight into potential research directions in this field to address some of the mentioned challenges that scene text detection and recognition techniques still encounter.

28. How useful is Active Learning for Image-based Plant Phenotyping?
Koushik Nagasubramanian, Talukder Z. Jubery, Fateme Fotouhi Ardakani, Seyed Vahid Mirnezami, Asheesh K. Singh, Arti Singh, Soumik Sarkar, Baskar Ganapathysubramanian
Abstract: Deep learning models have been successfully deployed for a diverse array of image-based plant phenotyping applications, including disease detection and classification. However, successful deployment of supervised deep learning models requires a large amount of labeled data, which is a significant challenge in plant science (and most biological) domains due to the inherent complexity. Specifically, data annotation is costly, laborious, and time consuming, and it needs domain expertise for phenotyping tasks, especially for diseases. To overcome this challenge, active learning algorithms have been proposed that reduce the amount of labeling needed by deep learning models to achieve good predictive performance. Active learning methods adaptively select samples to annotate using an acquisition function to achieve maximum (classification) performance under a fixed labeling budget. We report the performance of four different active learning methods, (1) Deep Bayesian Active Learning (DBAL), (2) Entropy, (3) Least Confidence, and (4) Coreset, compared with conventional random sampling-based annotation on two different image-based classification datasets. The first image dataset consists of soybean [Glycine max L. (Merr.)] leaves belonging to eight different soybean stresses and a healthy class, and the second consists of nine different weed species from the field. For a fixed labeling budget, we observed that the classification performance of deep learning models with active learning-based acquisition strategies is better than that with random sampling-based acquisition for both datasets. The integration of active learning strategies for data annotation can help mitigate labelling challenges in plant science applications, particularly where deep domain knowledge is required.
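As an illustration of two of the acquisition functions named in the abstract (Entropy and Least Confidence), the following sketch scores unlabeled samples by model uncertainty and picks the top ones within a labeling budget. The function names and the toy `pool` of class probabilities are hypothetical and chosen for this example; they are not taken from the paper.

```python
import math

def entropy_score(probs):
    """Predictive entropy of a class-probability vector; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def least_confidence_score(probs):
    """One minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)

def select_for_labeling(pool_probs, budget, score_fn):
    """Return the indices of the `budget` highest-scoring unlabeled samples."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score_fn(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Toy predicted probabilities for three unlabeled images (hypothetical values).
pool = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # very uncertain prediction
    [0.70, 0.20, 0.10],  # moderately uncertain prediction
]
chosen = select_for_labeling(pool, budget=2, score_fn=entropy_score)
```

With this toy pool, both scoring functions rank the second image as the most informative to annotate and the confident first image as the least.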

29. AdaLAM: Revisiting Handcrafted Outlier Detection
Luca Cavalli, Viktor Larsson, Martin Ralf Oswald, Torsten Sattler, Marc Pollefeys
Abstract: Local feature matching is a critical component of many computer vision pipelines, including, among others, Structure-from-Motion, SLAM, and Visual Localization. However, due to limitations in the descriptors, raw matches are often contaminated by a majority of outliers. As a result, outlier detection is a fundamental problem in computer vision, and a wide range of approaches has been proposed over the last decades. In this paper, we revisit handcrafted approaches to outlier filtering. Based on best practices, we propose a hierarchical pipeline for effective outlier detection and integrate novel ideas, which together lead to AdaLAM, an efficient and competitive approach to outlier rejection. AdaLAM is designed to effectively exploit modern parallel hardware, resulting in a very fast, yet very accurate, outlier filter. We validate AdaLAM on multiple large and diverse datasets, and we submit to the Image Matching Challenge (CVPR2020), obtaining competitive results with simple baseline descriptors. We show that AdaLAM is more than competitive with the current state of the art, in terms of both efficiency and effectiveness.

30. Efficient Poverty Mapping using Deep Reinforcement Learning
Kumar Ayush, Burak Uzkent, Marshall Burke, David Lobell, Stefano Ermon
Abstract: The combination of high-resolution satellite imagery and machine learning has proven useful in many sustainability-related tasks, including poverty prediction, infrastructure measurement, and forest monitoring. However, the accuracy afforded by high-resolution imagery comes at a cost, as such imagery is extremely expensive to purchase at scale. This creates a substantial hurdle to the efficient scaling and widespread adoption of high-resolution-based approaches. To reduce acquisition costs while maintaining accuracy, we propose a reinforcement learning approach in which free low-resolution imagery is used to dynamically identify where to acquire costly high-resolution images, prior to performing a deep learning task on the high-resolution images. We apply this approach to the task of poverty prediction in Uganda, building on an earlier approach that used object detection to count objects and used these counts to predict poverty. Our approach exceeds previous performance benchmarks on this task while using 80% fewer high-resolution images. Our approach could have application in many sustainability domains that require high-resolution imagery.

31. Thoracic Disease Identification and Localization using Distance Learning and Region Verification
Cheng Zhang, Francine Chen, Yan-Ying Chen
Abstract: The identification and localization of diseases in medical images using deep learning models have recently attracted significant interest. Existing methods only consider training the networks with each image independently and most leverage an activation map for disease localization. In this paper, we propose an alternative approach that learns discriminative features among triplets of images and cyclically trains on region features to verify whether attentive regions contain information indicative of a disease. Concretely, we adapt a distance learning framework for multi-label disease classification to differentiate subtle disease features. Additionally, we feed back the features of the predicted class-specific regions to a separate classifier during training to better verify the localized diseases. Our model can achieve state-of-the-art classification performance on the challenging ChestX-ray14 dataset, and our ablation studies indicate that both distance learning and region verification contribute to overall classification performance. Moreover, the distance learning and region verification modules can capture essential information for better localization than baseline models without these modules.
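The abstract does not spell out the loss used to learn from triplets of images. As a generic illustration of distance learning on triplets, the sketch below shows a standard triplet margin loss on feature vectors. Treating this as the loss is an assumption of the example, and the function names and `margin` default are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that reaches zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Minimizing such a loss pulls features of images that share a label toward the anchor and pushes features of images with a different label away, which is one common way to separate subtle classes.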

32. Finger Texture Biometric Characteristic: a Survey
Raid R. O. Al-Nima, Tingting Han, Taolue Chen, Satnam Dlay, Jonathon Chambers
Abstract: In recent years, the Finger Texture (FT) has attracted considerable attention as a biometric characteristic. It can provide efficient human recognition performance, because it has different human-specific features of apparent lines, wrinkles and ridges distributed along the inner surface of all fingers. Also, such pattern structures are reliable, unique and remain stable throughout a human's life. Efficient biometric systems can be established based only on FTs. In this paper, a comprehensive survey of the relevant FT studies is presented. We also summarise the main drawbacks and obstacles of employing the FT as a biometric characteristic, and provide useful suggestions to further improve the work on FT.

33. Learning pose variations within shape population by constrained mixtures of factor analyzers
Xilu Wang
Abstract: Mining and learning the shape variability of an underlying population has benefited applications including parametric shape modeling, 3D animation, and image segmentation. Current statistical shape modeling methods work well for learning unstructured shape variations without obvious pose changes (relative rotations of the body parts). Studying the pose variations within a shape population involves segmenting the shapes into different articulated parts and learning the transformations of the segmented parts. This paper formulates the pose learning problem as mixtures of factor analyzers. The segmentation is obtained from the components' posterior probabilities, and the rotations in pose variations are learned through the factor loading matrices. To guarantee that the factor loading matrices are composed of rotation matrices, constraints are imposed and the corresponding closed-form optimal solution is derived. Based on the proposed method, the pose variations are automatically learned from the given shape populations. The method is applied to motion animation, where new poses are generated by interpolating the existing poses in the training set. The obtained results are smooth and realistic.

34. Realistic text replacement with non-uniform style conditioning
Arseny Nerinovsky, Igor Buzhinsky, Andey Filchencov
Abstract: In this work, we study the possibility of realistic text replacement, the goal of which is to replace text present in an image with user-supplied text. The replacement should be performed so that the resulting image cannot be distinguished from the original. We achieve this goal by developing a novel non-uniform style conditioning layer and applying it to a ResNet-based encoder-decoder architecture. The resulting model is a single-stage model with no post-processing. The proposed model achieves realistic text replacement and outperforms existing approaches on ICDAR MLT.

35. Decentralised Learning from Independent Multi-Domain Labels for Person Re-Identification
Guile Wu, Shaogang Gong
Abstract: Deep learning has been successful for many computer vision tasks due to the availability of shared and centralised large-sized training data. However, increasing awareness of privacy concerns poses new challenges to deep learning, especially for human-subject-related recognition such as person re-identification (Re-ID). In this work, we solve the Re-ID problem by decentralised model learning from non-shared private training data distributed at multiple user sites with independent multi-domain labels. We propose a novel paradigm called Federated Person Re-Identification (FedReID) to construct a generalisable Re-ID model (a central server) by simultaneously learning collaboratively from multiple privacy-preserved local models (local clients). Each local client learns domain-specific local knowledge from its own set of labels, independent of all the other clients (each client has its own non-shared independent labels), while the central server selects and aggregates transferrable local updates to accumulate domain-generic knowledge (a general feature embedding model) without sharing local data, thereby inherently protecting privacy. Extensive experiments on 11 Re-ID benchmarks demonstrate the superiority of FedReID against the state-of-the-art Re-ID methods.

36. Peer Collaborative Learning for Online Knowledge Distillation
Guile Wu, Shaogang Gong
Abstract: Traditional knowledge distillation uses a two-stage training strategy to transfer knowledge from a high-capacity teacher model to a smaller student model, which relies heavily on the pre-trained teacher. Recent online knowledge distillation alleviates this limitation through collaborative learning, mutual learning and online ensembling, following a one-stage end-to-end training strategy. However, collaborative learning and mutual learning fail to construct an online high-capacity teacher, whilst online ensembling ignores the collaboration among branches and its logit summation impedes the further optimisation of the ensemble teacher. In this work, we propose a novel Peer Collaborative Learning method for online knowledge distillation. Specifically, we employ a multi-branch network (each branch is a peer) and assemble the features from peers with an additional classifier as the peer ensemble teacher, both to transfer knowledge from the high-capacity teacher to peers and to further optimise the ensemble teacher. Meanwhile, we employ the temporal mean model of each peer as the peer mean teacher to collaboratively transfer knowledge among peers, which helps to optimise a more stable model and alleviates the accumulation of training error among peers. Integrating them into a unified framework takes full advantage of online ensembling and network collaboration for improving the quality of online distillation. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet show that the proposed method not only significantly improves the generalisation capability of various backbone networks, but also outperforms the state-of-the-art alternative methods.
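A "temporal mean model" is commonly maintained as an exponential moving average of a network's weights over training steps. The sketch below shows that update for a flat list of weights. Whether the paper uses exactly this form is an assumption here, and the function name and the `decay` value are hypothetical.

```python
def update_temporal_mean(mean_weights, current_weights, decay=0.999):
    """Return the exponential moving average of a peer's weights after one step.

    mean_weights: the running average so far (the "mean teacher" weights).
    current_weights: the peer's weights after the latest optimiser step.
    """
    return [decay * m + (1.0 - decay) * w
            for m, w in zip(mean_weights, current_weights)]
```

Because each step moves the average only slightly, the mean model changes more smoothly than the peer it tracks, which is why such averages tend to serve as more stable teachers.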

37. Learning Texture Transformer Network for Image Super-Resolution
Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, Baining Guo
Abstract: We study image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer high-resolution (HR) textures from Ref images, which limits these approaches in challenging cases. In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively. TTSR consists of four closely related modules optimized for image generation tasks: a learnable DNN-based texture extractor, a relevance embedding module, a hard-attention module for texture transfer, and a soft-attention module for texture synthesis. Such a design encourages joint feature learning across LR and Ref images, in which deep feature correspondences can be discovered by attention, and thus accurate texture features can be transferred. The proposed texture transformer can be further stacked in a cross-scale way, which enables texture recovery from different levels (e.g., from 1x to 4x magnification). Extensive experiments show that TTSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations.
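The query/key formulation can be illustrated with a small sketch: each LR feature (query) is compared against every Ref feature (key), the best-matching key index acts as a hard-attention map, and its relevance score acts as a soft-attention weight. Using the normalized inner product as the relevance measure is an assumption of this example, and all names are hypothetical rather than taken from the TTSR code.

```python
import math

def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)

def hard_and_soft_attention(queries, keys):
    """For each query, return the index of the most relevant key (hard attention)
    and the relevance score of that key (soft attention)."""
    keys_n = [normalize(k) for k in keys]
    hard, soft = [], []
    for q in queries:
        q_n = normalize(q)
        relevance = [sum(a * b for a, b in zip(q_n, k)) for k in keys_n]
        best = max(range(len(relevance)), key=relevance.__getitem__)
        hard.append(best)
        soft.append(relevance[best])
    return hard, soft
```

Given a list `values` of HR texture features aligned with `keys`, the transferred features would be `[values[i] for i in hard]`, and the entries of `soft` could then weight how strongly each transferred feature is used.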

38. ADMP: An Adversarial Double Masks Based Pruning Framework For Unsupervised Cross-Domain Compression
Xiaoyu Feng, Zhuqing Yuan, Guijin Wang, Yongpan Liu
Abstract: Despite recent progress in network pruning, directly applying it to Internet of Things (IoT) applications still faces two challenges, i.e. the distribution divergence between end and cloud data and the lack of data labels on end devices. One straightforward solution is to combine unsupervised domain adaptation (UDA) with pruning. For example, the model is first pruned on the cloud and then transferred from cloud to end by UDA. However, such a naive combination suffers severe performance degradation. Hence, this work proposes Adversarial Double Masks based Pruning (ADMP) for such cross-domain compression. In ADMP, we construct a Knowledge Distillation framework not only to produce pseudo labels but also to provide a measurement of domain divergence as the output difference between the full-size teacher and the pruned student. Unlike existing mask-based pruning works, ADMP adopts two adversarial masks, i.e. soft and hard masks. ADMP can thus prune the model effectively while still allowing it to extract strong domain-invariant features and robust classification boundaries. During training, the Alternating Direction Multiplier Method is used to overcome the binary constraint of {0,1}-masks. On the Office31 and ImageCLEF-DA datasets, the proposed ADMP can prune 60% of channels with only 0.2% and 0.3% average accuracy loss respectively. Compared with the state of the art, we achieve about 1.63x parameter reduction and 4.1% and 5.1% accuracy improvement.

39. DiffGCN: Graph Convolutional Networks via Differential Operators and Algebraic Multigrid Pooling
Moshe Eliasof, Eran Treister
Abstract: Graph Convolutional Networks (GCNs) have been shown to be effective in handling unordered data such as point clouds and meshes. In this work, we propose novel approaches for graph convolution, pooling and unpooling, taking inspiration from finite-element and algebraic multigrid frameworks. We form a parameterized convolution kernel based on discretized differential operators, leveraging the graph mass, gradient and Laplacian. This way, the parameterization does not depend on the graph structure, only on the meaning of the network convolutions as differential operators. To allow hierarchical representations of the input, we propose pooling and unpooling operations that are based on algebraic multigrid methods. To motivate and explain our method, we compare it to standard Convolutional Neural Networks, and show their similarities and relations in the case of a regular grid. Our proposed method is demonstrated in various experiments such as classification and segmentation, achieving results on par with or better than the state of the art. We also analyze the computational cost of our method compared to other GCNs.

40. Robust Learning Through Cross-Task Consistency
Amir Zamir, Alexander Sax, Teresa Yeo, Oğuzhan Kar, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, Leonidas Guibas
Abstract: Visual perception entails solving a wide set of tasks, e.g., object detection, depth estimation, etc. The predictions made for multiple tasks from the same image are not independent, and therefore, are expected to be consistent. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path invariance over a graph of arbitrary tasks. We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs. This framework also leads to an informative unsupervised quantity, called Consistency Energy, based on measuring the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), thus it can be employed as an unsupervised confidence metric as well as for detection of out-of-distribution inputs (ROC-AUC=0.95). The evaluations are performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape, and they benchmark cross-task consistency versus various baselines including conventional multi-task learning, cycle consistency, and analytical consistency.

41. Multi-view Contrastive Learning for Online Knowledge Distillation
Chuanguang Yang, Zhulin An, Xiaolong Hu, Hui Zhu, Kaiqiang Xu, Yongjun Xu
Abstract: Existing Online Knowledge Distillation (OKD) aims to perform collaborative and mutual learning among multiple peer networks in terms of probabilistic outputs, but ignores the representational knowledge. We thus introduce Multi-view Contrastive Learning (MCL) for OKD to implicitly capture correlations of representations encoded by multiple peer networks, which provide various views for understanding the input data samples. A contrastive loss is applied to maximize the consensus of positive data pairs while pushing negative data pairs apart in the embedding space among the various views. Benefiting from MCL, we can learn a more discriminative representation for classification than previous OKD methods. Experimental results on image classification and few-shot learning demonstrate that our MCL-OKD outperforms other state-of-the-art methods of both OKD and KD by large margins without incurring additional inference cost.
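To make the "pull positives together, push negatives apart" idea concrete, here is a minimal sketch of an InfoNCE-style contrastive loss between embeddings from two peer networks. The InfoNCE form, the temperature value and all names are assumptions; the abstract says only that a contrastive loss is used.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Anchor: embedding of a sample from peer network A (hypothetical values).
# Positive: embedding of the same sample from peer network B.
# Negatives: embeddings of other samples from peer network B.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

The loss is small when the two views of the same sample agree and grows when a different sample looks more similar than the true positive.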

42. End-to-end Learning for Inter-Vehicle Distance and Relative Velocity Estimation in ADAS with a Monocular Camera
Zhenbo Song, Jianfeng Lu, Tong Zhang, Hongdong Li
Abstract: Inter-vehicle distance and relative velocity estimation are two basic functions of any ADAS (advanced driver-assistance system). In this paper, we propose a monocular-camera-based inter-vehicle distance and relative velocity estimation method built on end-to-end training of a deep neural network. The key novelty of our method is the integration of multiple visual clues provided by any two time-consecutive monocular frames, which include a deep feature clue, a scene geometry clue, and a temporal optical flow clue. We also propose a vehicle-centric sampling mechanism to alleviate the effect of perspective distortion in the motion field (i.e., optical flow). We implement the method with a light-weight deep neural network. Extensive experiments confirm the superior performance of our method over other state-of-the-art methods in terms of estimation accuracy, computational speed, and memory footprint.

43. CubifAE-3D: Monocular Camera Space Cubification on Autonomous Vehicles for Auto-Encoder based 3D Object Detection
Shubham Shrivastava, Punarjay Chakravarty
Abstract: We introduce a method for 3D object detection using a single monocular image. Depth data is used to pre-train an RGB-to-Depth Auto-Encoder (AE). The embedding learnt from this AE is then used to train a 3D Object Detector (3DOD) CNN, which regresses the parameters of 3D object poses after the encoder from the AE generates a latent embedding from the RGB image. We show that we can pre-train the AE once using paired RGB and depth images from simulation data, and subsequently train only the 3DOD network using real data comprising RGB images and 3D object pose labels (without the requirement of dense depth). Our 3DOD network utilizes a particular cubification of 3D space around the camera, where each cuboid is tasked with predicting N object poses, along with their class and confidence values. The AE pre-training and this method of dividing the 3D space around the camera into cuboids give our method its name: CubifAE-3D. We demonstrate results for monocular 3D object detection on the Virtual KITTI 2, KITTI, and nuScenes datasets for Autonomous Vehicle (AV) perception.
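The cubification step can be pictured as a regular grid laid over the space in front of the camera, with each object assigned to the cuboid that contains its centre. The sketch below shows only that indexing; the spatial bounds, grid resolution and function name are hypothetical and are not taken from the paper.

```python
def cuboid_index(point, lower, upper, grid):
    """Return the (ix, iy, iz) cuboid containing `point`, or None if it lies outside."""
    index = []
    for p, lo, hi, cells in zip(point, lower, upper, grid):
        if not lo <= p < hi:
            return None
        cell = int((p - lo) / (hi - lo) * cells)
        index.append(min(cell, cells - 1))  # guard against floating-point round-up
    return tuple(index)


# Hypothetical camera-space bounds in metres (x: lateral, y: vertical, z: forward)
# and a hypothetical 4 x 1 x 4 grid of cuboids.
lower = (-40.0, -2.0, 0.0)
upper = (40.0, 2.0, 80.0)
grid = (4, 1, 4)

car_centre = (10.0, 0.5, 25.0)
assigned = cuboid_index(car_centre, lower, upper, grid)
outside = cuboid_index((0.0, 0.0, 120.0), lower, upper, grid)
```

Under this reading, each cuboid's N pose predictions would be trained only on objects whose centres map to that cuboid.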

44. Siamese Keypoint Prediction Network for Visual Object Tracking
Qiang Li, Zekui Qin, Wenbo Zhang, Wen Zheng
Abstract: Visual object tracking aims to estimate the location of an arbitrary target in a video sequence given its initial bounding box. By utilizing offline feature learning, the siamese paradigm has recently become the leading framework for high-performance tracking. However, existing siamese trackers either rely heavily on complicated anchor-based detection networks or lack the ability to resist distractors. In this paper, we propose the Siamese keypoint prediction network (SiamKPN) to address these challenges. On top of a Siamese backbone for feature embedding, SiamKPN benefits from a cascade heatmap strategy for coarse-to-fine prediction modeling. In particular, the strategy is implemented by sequentially shrinking the coverage of the label heatmap along the cascade to apply loose-to-strict intermediate supervision. During inference, we find that the predicted heatmaps of successive stages gradually concentrate on the target and diminish on the distractors. SiamKPN performs well against state-of-the-art trackers for visual object tracking on four benchmark datasets, including OTB-100, VOT2018, LaSOT and GOT-10k, while running at real-time speed.
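One simple way to picture "shrinking the coverage of the label heatmap along the cascade" is a Gaussian label whose width decreases from stage to stage. The Gaussian form, the sigma values and the map size below are assumptions made for illustration; the abstract does not specify them.

```python
import math


def gaussian_heatmap(size, center, sigma):
    """Square label heatmap with a Gaussian peak of value 1.0 at `center`."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2)) for x in range(size)]
        for y in range(size)
    ]


def coverage(heatmap, threshold=0.5):
    """Number of cells whose label value is at least `threshold`."""
    return sum(1 for row in heatmap for value in row if value >= threshold)


# Hypothetical three-stage cascade: loose supervision first, strict supervision last.
stage_sigmas = [4.0, 2.0, 1.0]
labels = [gaussian_heatmap(17, (8, 8), sigma) for sigma in stage_sigmas]
areas = [coverage(label) for label in labels]
```

Early stages accept predictions anywhere in a wide neighbourhood of the target, while later stages reward only predictions close to the exact keypoint.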

45. NITS-VC System for VATEX Video Captioning Challenge 2020
Alok Singh, Thoudam Doren Singh, Sivaji Bandyopadhyay
Abstract: Video captioning is the process of summarising the content, events and actions of a video in a short textual form, which can be helpful in many research areas such as video-guided machine translation and video sentiment analysis, and in providing aid to individuals in need. In this paper, a system description of the framework used for the VATEX-2020 video captioning challenge is presented. We employ an encoder-decoder based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D). In the decoding phase, two Long Short Term Memory (LSTM) recurrent networks are used, in which visual features and input captions are fused separately, and the final output is generated by performing an element-wise product between the outputs of both LSTMs. Our model achieves BLEU scores of 0.20 and 0.22 on the public and private test data sets, respectively.

46. Facial Expression Recognition using Deep Learning
Raghu Vamshi.N, Bharathi Raja S
Abstract: Throughout the ages, facial expressions have become one of the universal means of non-verbal communication. The ability to recognize facial expressions would pave the way for many novel applications. Despite the success of traditional approaches in controlled environments, these approaches fail on challenging datasets consisting of partial faces. In this paper, I take one such dataset, FER-2013, and implement deep learning models that are able to achieve significant improvement over the previously used traditional approaches and even over some deep learning models.

47. DeepRelativeFusion: Dense Monocular SLAM using Single-Image Relative Depth Prediction
Shing Yan Loo, Syamsiah Mashohor, Sai Hong Tang, Hong Zhang
Abstract: Traditional monocular visual simultaneous localization and mapping (SLAM) algorithms have been extensively studied and proven to reliably recover a sparse structure and camera motion. Nevertheless, the sparse structure is still insufficient for scene interaction, e.g., visual navigation and augmented reality applications. To densify the scene reconstruction, the use of single-image absolute depth prediction from convolutional neural networks (CNNs) for filling in the missing structure has been proposed. However, the prediction accuracy tends not to generalize well to scenes that differ from the training datasets. In this paper, we propose a dense monocular SLAM system, named DeepRelativeFusion, that is capable of recovering a globally consistent 3D structure. To this end, we use a visual SLAM algorithm to reliably recover the camera poses and semi-dense depth maps of the keyframes, and then combine the keyframe pose-graph with the densified keyframe depth maps to reconstruct the scene. To perform the densification, we introduce two incremental improvements upon the energy minimization framework proposed by DeepFusion: (1) an additional image gradient term in the cost function, and (2) the use of single-image relative depth prediction. Despite the absence of absolute scale and depth range, the relative depth maps can be corrected using their respective semi-dense depth maps from the SLAM algorithm. We show that the corrected relative depth maps are sufficiently accurate to be used as priors for the densification. To demonstrate the generalizability of relative depth prediction, we illustrate qualitatively the dense reconstruction on two outdoor sequences. Our system also outperforms the state-of-the-art dense SLAM systems quantitatively in dense reconstruction accuracy by a large margin.
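The abstract does not say how the relative depth maps are corrected. A common choice for this kind of problem, assumed here purely for illustration, is to fit a scale and a shift by least squares at the pixels where the SLAM system supplies a semi-dense depth value, and then apply them to the whole relative depth map. All names and numbers below are hypothetical.

```python
def fit_scale_shift(relative, reference):
    """Least-squares scale and shift such that scale * relative + shift ~= reference."""
    n = len(relative)
    mean_rel = sum(relative) / n
    mean_ref = sum(reference) / n
    variance = sum((r - mean_rel) ** 2 for r in relative)
    covariance = sum((r - mean_rel) * (d - mean_ref) for r, d in zip(relative, reference))
    scale = covariance / variance
    shift = mean_ref - scale * mean_rel
    return scale, shift


# Relative depth predictions and SLAM semi-dense depths at the same four pixels.
relative_at_pixels = [0.1, 0.4, 0.5, 0.9]
semi_dense_at_pixels = [1.2, 1.8, 2.0, 2.8]

scale, shift = fit_scale_shift(relative_at_pixels, semi_dense_at_pixels)

# Apply the fitted correction to every pixel of the (flattened) relative depth map.
relative_map = [0.0, 0.25, 0.75, 1.0]
corrected_map = [scale * r + shift for r in relative_map]
```

With only two unknowns per keyframe, a handful of reliable semi-dense points is enough to anchor an otherwise scale-free prediction.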

48. SVGA-Net: Sparse Voxel-Graph Attention Network for 3D Object Detection from Point Clouds
Qingdong He, Zhengning Wang, Hao Zeng, Yi Zeng, Shuaicheng Liu, Bing Zeng
Abstract: Accurate 3D object detection from point clouds has become a crucial component in autonomous driving. However, the volumetric representations and the projection methods in previous works fail to establish the relationships between the local point sets. In this paper, we propose the Sparse Voxel-Graph Attention Network (SVGA-Net), a novel end-to-end trainable network which mainly consists of a voxel-graph module and a sparse-to-dense regression module, to achieve comparable 3D detection tasks from raw LIDAR data. Specifically, SVGA-Net constructs a local complete graph within each divided 3D spherical voxel and a global KNN graph through all voxels. The local and global graphs serve as the attention mechanism to enhance the extracted features. In addition, the novel sparse-to-dense regression module enhances the 3D box estimation accuracy through feature map aggregation at different levels. Experiments on the KITTI detection benchmark demonstrate the efficiency of extending the graph representation to 3D object detection, and the proposed SVGA-Net can achieve decent detection accuracy.
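The global KNN graph mentioned above can be sketched as follows: each node (here a hypothetical voxel centre) is connected to its k nearest neighbours by Euclidean distance. This brute-force version is only meant to show the data structure; the coordinates and the value of k are illustrative assumptions.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest neighbours (nearest first)."""
    neighbours = {}
    for i, p in enumerate(points):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours[i] = [j for _, j in by_distance[:k]]
    return neighbours


# Hypothetical voxel centres in 3D.
voxel_centres = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
graph = knn_graph(voxel_centres, k=2)
```

A local complete graph, by contrast, would connect every pair of points inside one voxel, so the two structures cover short-range and long-range relationships respectively.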

49. SharinGAN: Combining Synthetic and Real Data for Unsupervised Geometry Estimation
Koutilya PNVR, Hao Zhou, David Jacobs
Abstract: We propose a novel method for combining synthetic and real images when training networks to determine geometric information from a single image. We suggest a method for mapping both image types into a single, shared domain. This is connected to a primary network for end-to-end training. Ideally, this results in images from two domains that present shared information to the primary network. Our experiments demonstrate significant improvements over the state-of-the-art in two important domains, surface normal estimation of human faces and monocular depth estimation for outdoor scenes, both in an unsupervised setting.

50. MeshSDF: Differentiable Iso-Surface Extraction
Edoardo Remelli, Artem Lukoianov, Stephan R. Richter, Benoît Guillard, Timur Bagautdinov, Pierre Baque, Pascal Fua
Abstract: Geometric Deep Learning has recently made striking progress with the advent of continuous Deep Implicit Fields. They allow for detailed modeling of watertight surfaces of arbitrary topology while not relying on a 3D Euclidean grid, resulting in a learnable parameterization that is not limited in resolution. Unfortunately, these methods are often not suitable for applications that require an explicit mesh-based surface representation because converting an implicit field to such a representation relies on the Marching Cubes algorithm, which cannot be differentiated with respect to the underlying implicit field. In this work, we remove this limitation and introduce a differentiable way to produce explicit surface mesh representations from Deep Signed Distance Functions. Our key insight is that by reasoning on how implicit field perturbations impact local surface geometry, one can ultimately differentiate the 3D location of surface samples with respect to the underlying deep implicit field. We exploit this to define MeshSDF, an end-to-end differentiable mesh representation which can vary its topology. We use two different applications to validate our theoretical insight: Single-View Reconstruction via Differentiable Rendering and Physically-Driven Shape Optimization. In both cases our differentiable parameterization gives us an edge over state-of-the-art algorithms.
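The idea of reasoning about how field perturbations move the surface can be checked numerically on a simple case. For a true signed distance field (unit-length gradient), raising the field value by a small amount at a surface point moves that point by the same amount against the outward normal, so to first order the derivative of the surface location with respect to the field value is minus the normal. The sketch below verifies this on an analytic sphere; the sphere, the bisection search and all names are assumptions made for illustration, not the paper's implementation.

```python
import math


def sphere_sdf(point, radius, offset=0.0):
    """Signed distance to a sphere at the origin, plus a constant field perturbation."""
    return math.sqrt(sum(c * c for c in point)) - radius + offset


def surface_point(direction, radius, offset=0.0):
    """Locate the zero crossing of the field along a ray from the origin by bisection."""
    length = math.sqrt(sum(c * c for c in direction))
    normal = [c / length for c in direction]
    lo, hi = 0.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if sphere_sdf([mid * c for c in normal], radius, offset) < 0.0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return [t * c for c in normal], normal


delta = 1e-3
v0, normal = surface_point([1.0, 2.0, 2.0], radius=1.0)
v1, _ = surface_point([1.0, 2.0, 2.0], radius=1.0, offset=delta)

# Finite-difference estimate of d(surface location) / d(field value).
dv_ds = [(b - a) / delta for a, b in zip(v0, v1)]
```

Here `dv_ds` comes out as the negated unit normal, which is the kind of relation that lets a loss defined on mesh vertices be propagated back to the implicit field.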

51. Enhancing Facial Data Diversity with Style-based Face Aging
Markos Georgopoulos, James Oldfield, Mihalis A. Nicolaou, Yannis Panagakis, Maja Pantic
Abstract: A significant limiting factor in training fair classifiers relates to the presence of dataset bias. In particular, face datasets are typically biased in terms of attributes such as gender, age, and race. If not mitigated, bias leads to algorithms that exhibit unfair behaviour towards such groups. In this work, we address the problem of increasing the diversity of face datasets with respect to age. Concretely, we propose a novel, generative style-based architecture for data augmentation that captures fine-grained aging patterns by conditioning on multi-resolution age-discriminative representations. By evaluating on several age-annotated datasets in both single- and cross-database experiments, we show that the proposed method outperforms state-of-the-art algorithms for age transfer, especially in the case of age groups that lie in the tails of the label distribution. We further show significantly increased diversity in the augmented datasets, outperforming all compared methods according to established metrics.

52. Self-Supervised Dynamic Networks for Covariate Shift Robustness
Tomer Cohen, Noy Shulman, Hai Morgenstern, Roey Mechrez, Erez Farhan
Abstract: As supervised learning still dominates most AI applications, test-time performance is often unexpected. Specifically, a shift of the input covariates, caused by typical nuisances like background-noise, illumination variations or transcription errors, can lead to a significant decrease in prediction accuracy. Recently, it was shown that incorporating self-supervision can significantly improve covariate shift robustness. In this work, we propose Self-Supervised Dynamic Networks (SSDN): an input-dependent mechanism, inspired by dynamic networks, that allows a self-supervised network to predict the weights of the main network, and thus directly handle covariate shifts at test-time. We present the conceptual and empirical advantages of the proposed method on the problem of image classification under different covariate shifts, and show that it significantly outperforms comparable methods.
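The input-dependent mechanism can be pictured as a small weight generator: features from a self-supervised branch are mapped to the weights of a layer in the main network, so the main network changes with each input. The toy sketch below uses a single linear layer and hand-picked numbers; the shapes, the linear mapping and all names are hypothetical and much simpler than any real model.

```python
def predict_weights(ss_features, hyper_matrix):
    """Map self-supervised features to the weights of one main-network layer."""
    return [sum(h * f for h, f in zip(row, ss_features)) for row in hyper_matrix]


def main_network(x, weights):
    """A one-layer 'main network': a weighted sum of the input features."""
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical learned mapping from 2 self-supervised features to 3 layer weights.
hyper_matrix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [1.0, 2.0, 3.0]

# Stand-ins for what the self-supervised branch would output on a clean input
# and on the same content under a covariate shift.
clean_features = [1.0, 0.0]
shifted_features = [0.0, 1.0]

clean_output = main_network(x, predict_weights(clean_features, hyper_matrix))
shifted_output = main_network(x, predict_weights(shifted_features, hyper_matrix))
```

Because the weights are recomputed per input, adaptation to a shift happens at test time without any gradient step on the main network.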

53. Self-supervising Fine-grained Region Similarities for Large-scale Image Localization
Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, Hongsheng Li
Abstract: The task of large-scale retrieval-based image localization is to estimate the geographical location of a query image by recognizing its nearest reference images from a city-scale dataset. However, the common public benchmarks only provide noisy GPS labels associated with the training images, which act as weak supervision for learning image-to-image similarities. Such label noise prevents deep neural networks from learning discriminative features for accurate localization. To tackle this challenge, we propose to self-supervise image-to-region similarities in order to fully explore the potential of difficult positive images alongside their sub-regions. The estimated image-to-region similarities can serve as extra training supervision for improving the network in generations, which could in turn gradually refine the fine-grained similarities to achieve optimal performance. Our proposed self-enhanced image-to-region similarity labels effectively deal with the training bottleneck in the state-of-the-art pipelines without any additional parameters or manual annotations in both training and inference. Our method outperforms the state of the art on the standard localization benchmarks by noticeable margins and shows excellent generalization capability on multiple image retrieval datasets. Code of this work is available at this https URL.

54. A Robust Attentional Framework for License Plate Recognition in the Wild
Linjiang Zhang, Peng Wang, Hui Li, Zhen Li, Chunhua Shen, Yanning Zhang
Abstract: Recognizing car license plates in natural scene images is an important yet still challenging task in realistic applications. Many existing approaches perform well for license plates collected under constrained conditions, e.g., shot at frontal and horizontal view angles and under good lighting conditions. However, their performance drops significantly in an unconstrained environment that features rotation, distortion, occlusion, blurring, shading, or extreme dark or bright conditions. In this work, we propose a robust framework for license plate recognition in the wild. It is composed of a tailored CycleGAN model for license plate image generation and an elaborately designed image-to-sequence network for plate recognition. On the one hand, the CycleGAN-based plate generation engine alleviates the exhausting human annotation work. A massive amount of training data can be obtained with a more balanced character distribution and various shooting conditions, which helps to boost the recognition accuracy to a large extent. On the other hand, the 2D attention-based license plate recognizer with an Xception-based CNN encoder is capable of recognizing license plates with different patterns under various scenarios accurately and robustly. Without using any heuristic rules or post-processing, our method achieves state-of-the-art performance on four public datasets, which demonstrates the generality and robustness of our framework. Moreover, we released a new license plate dataset, named "CLPD", with 1200 images from all 31 provinces in mainland China. The dataset is available from: this https URL.

55. Ensemble Network for Ranking Images Based on Visual Appeal
Sachin Singh, Victor Sanchez, Tanaya Guha
Abstract: We propose a computational framework for ranking images (group photos in particular) taken at the same event within a short time span. The ranking is expected to correspond with human perception of overall appeal of the images. We hypothesize and provide evidence through subjective analysis that the factors that appeal to humans are its emotional content, aesthetics and image quality. We propose a network which is an ensemble of three information channels, each predicting a score corresponding to one of the three visual appeal factors. For group emotion estimation, we propose a convolutional neural network (CNN) based architecture for predicting group emotion from images. This new architecture enforces the network to put emphasis on the important regions in the images, and achieves comparable results to the state-of-the-art. Next, we develop a network for the image ranking task that combines group emotion, aesthetics and image quality scores. Owing to the unavailability of suitable databases, we created a new database of manually annotated group photos taken during various social events. We present experimental results on this database and other benchmark databases whenever available. Overall, our experiments show that the proposed framework can reliably predict the overall appeal of images with results closely corresponding to human ranking.

56. The Criminality From Face Illusion
Kevin W. Bowyer, Michael King, Walter Scheirer
Abstract: The automatic analysis of face images can generate predictions about a person's gender, age, race, facial expression, body mass index, and various other indices and conditions. A few recent publications have claimed success in analyzing an image of a person's face in order to predict the person's status as Criminal / Non-Criminal. Predicting criminality from face may initially seem similar to other facial analytics, but we argue that attempts to create a criminality-from-face algorithm are necessarily doomed to fail, that apparently promising experimental results in recent publications are an illusion resulting from inadequate experimental design, and that there is potentially a large social cost to belief in the criminality from face illusion.

57. ARID: A New Dataset for Recognizing Action in the Dark
Yuecong Xu, Jianfei Yang, Haozhi Cao, Kezhi Mao, Jianxiong Yin, Simon See
Abstract: The task of action recognition in dark videos is useful in various scenarios, e.g., night surveillance and self-driving at night. Though progress has been made in action recognition for videos under normal illumination, few have studied action recognition in the dark, partly due to the lack of sufficient datasets for such a task. In this paper, we explore the task of action recognition in dark videos. We bridge the data gap by collecting a new dataset: the Action Recognition in the Dark (ARID) dataset. It consists of 3,784 video clips with 11 action categories. To the best of our knowledge, it is the first dataset focused on human actions in dark videos. To gain further understanding of our ARID dataset, we analyze it in detail and show its necessity over synthetic dark videos. Additionally, we benchmark the performance of current action recognition models on our dataset and explore potential methods for increasing their performance. We show that current action recognition models and frame enhancement methods may not be effective solutions for the task of action recognition in dark videos.

58. Towards large-scale, automated, accurate detection of CCTV camera objects using computer vision. Applications and implications for privacy, safety, and cybersecurity. (Preprint)
Hannu Turtiainen, Andrei Costin, Timo Hamalainen, Tuomo Lahtinen
Abstract: While the first CCTV camera was developed almost a century ago, back in 1927, it is currently taken for granted that there are about 770 million CCTV cameras around the globe, and their number is casually predicted to surpass 1 billion in 2021. At the same time, the increasing, widespread, unwarranted, and unaccountable use of CCTV cameras globally has raised privacy risks and concerns for the last several decades. Recent technological advances implemented in CCTV cameras, such as AI-based facial recognition and IoT connectivity, only further fuel these concerns raised by privacy-minded persons. However, many of the debates, reports, and policies are based on assumptions and numbers that are neither necessarily factually accurate nor based on sound methodologies. For example, at present there is no accurate and global understanding of how many CCTV cameras are deployed and in use, where those cameras are located, who owns or operates them, etc. In addition, there are no proper (i.e., sound, accurate, advanced) tools that can help us achieve such counting, localization, and other information gathering. Therefore, new methods and tools must be developed in order to detect, count, and localize CCTV cameras. To close this gap, with this paper we introduce the first and only computer vision MS COCO-compatible models that are able to accurately detect CCTV and video surveillance cameras in images and video frames. To this end, our state-of-the-art detector was built using 3401 images that were manually reviewed and annotated, and achieves an accuracy between 91.1% and 95.6%. Moreover, we build and evaluate multiple models, present a comprehensive comparison of their performance, and outline core challenges associated with such research. We also present possible privacy-, safety-, and security-related practical applications of our core work.

59. Instance segmentation of buildings using keypoints
Qingyu Li, Lichao Mou, Yuansheng Hua, Yao Sun, Pu Jin, Yilei Shi, Xiao Xiang Zhu
Abstract: Building segmentation is of great importance in the task of remote sensing imagery interpretation. However, the existing semantic segmentation and instance segmentation methods often lead to segmentation masks with blurred boundaries. In this paper, we propose a novel instance segmentation network for building segmentation in high-resolution remote sensing images. More specifically, we consider segmenting an individual building as detecting several keypoints. The detected keypoints are subsequently reformulated as a closed polygon, which is the semantic boundary of the building. By doing so, the sharp boundary of the building could be preserved. Experiments are conducted on selected Aerial Imagery for Roof Segmentation (AIRS) dataset, and our method achieves better performance in both quantitative and qualitative results with comparison to the state-of-the-art methods. Our network is a bottom-up instance segmentation method that could well preserve geometric details.

60. Building a Heterogeneous, Large Scale Morphable Face Model
Claudio Ferrari, Stefano Berretti, Pietro Pala, Alberto Del Bimbo
Abstract: 3D Morphable Models (3DMMs) are powerful statistical tools for representing and modeling 3D faces. To build a 3DMM, a training set of fully registered face scans is required, and its modeling capabilities directly depend on the variability contained in the training data. Thus, accurately establishing a dense point-to-point correspondence across heterogeneous scans with sufficient diversity in terms of identities, ethnicities, or expressions becomes essential. In this manuscript, we present an approach that leverages a 3DMM to transfer its dense semantic annotation across a large set of heterogeneous 3D faces, establishing a dense correspondence between them. To this aim, we propose a novel formulation to learn a set of sparse deformation components with local support on the face that, together with an original non-rigid deformation algorithm, allow precisely fitting the 3DMM to arbitrary faces and transfer its semantic annotation. We experimented our approach on three large and diverse datasets, showing it can effectively generalize to very different samples and accurately establish a dense correspondence even in presence of complex facial expressions or unseen deformations. As main outcome of this work, we build a heterogeneous, large-scale 3DMM from more than 9,000 fully registered scans obtained joining the three datasets together.

61. 3D Self-Supervised Methods for Medical Imaging
Aiham Taleb, Winfried Loetzsch, Noel Danz, Julius Severin, Thomas Gaertner, Benjamin Bergner, Christoph Lippert
Abstract: Self-supervised learning methods have witnessed a recent surge of interest after proving successful in multiple application fields. In this work, we leverage these techniques, and we propose 3D versions for five different self-supervised methods, in the form of proxy tasks. Our methods facilitate neural network feature learning from unlabeled 3D images, aiming to reduce the required cost for expert annotation. The developed algorithms are 3D Contrastive Predictive Coding, 3D Rotation prediction, 3D Jigsaw puzzles, Relative 3D patch location, and 3D Exemplar networks. Our experiments show that pretraining models with our 3D tasks yields more powerful semantic representations, and enables solving downstream tasks more accurately and efficiently, compared to training the models from scratch and to pretraining them on 2D slices. We demonstrate the effectiveness of our methods on three downstream tasks from the medical imaging domain: i) Brain Tumor Segmentation from 3D MRI, ii) Pancreas Tumor Segmentation from 3D CT, and iii) Diabetic Retinopathy Detection from 2D Fundus images. In each task, we assess the gains in data-efficiency, performance, and speed of convergence. We achieve results competitive to state-of-the-art solutions at a fraction of the computational expense. We also publish the implementations for the 3D and 2D versions of our algorithms as an open-source library, in an effort to allow other researchers to apply and extend our methods on their datasets.

62. UCLID-Net: Single View Reconstruction in Object Space
Benoit Guillard, Edoardo Remelli, Pascal Fua
Abstract: Most state-of-the-art deep geometric learning single-view reconstruction approaches rely on encoder-decoder architectures that output either shape parametrizations or implicit representations. However, these representations rarely preserve the Euclidean structure of the 3D space objects exist in. In this paper, we show that building a geometry preserving 3-dimensional latent space helps the network concurrently learn global shape regularities and local reasoning in the object coordinate space and, as a result, boosts performance. We demonstrate both on ShapeNet synthetic images, which are often used for benchmarking purposes, and on real-world images that our approach outperforms state-of-the-art ones. Furthermore, the single-view pipeline naturally extends to multi-view reconstruction, which we also show.

63. An Empirical Analysis of the Impact of Data Augmentation on Knowledge Distillation
Deepan Das, Haley Massa, Abhimanyu Kulkarni, Theodoros Rekatsinas
Abstract: The generalization performance of deep learning models trained using Empirical Risk Minimization can be improved significantly by using data augmentation strategies such as simple transformations or mixed samples. In this work, we attempt to empirically analyse the impact of such augmentation strategies on the transfer of generalization between teacher and student models in a distillation setup. We observe that if a teacher is trained using any of the mixed sample augmentation strategies, the student model distilled from it is impaired in its generalization capabilities. We hypothesize that such strategies limit a model's capability to learn example-specific features, leading to a loss in quality of the supervision signal during distillation, without impacting its standalone prediction performance. We present a novel KL-Divergence based metric to quantitatively measure the generalization capacity of the different networks.
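As a rough illustration of the distillation setup this abstract discusses, the sketch below computes the temperature-softened KL divergence that is commonly used as the supervision signal between a teacher and a student. It is not the paper's proposed KL-divergence-based metric, which the abstract does not specify. The function names, the default temperature, and the example logits are assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_kl(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student predictions.

    The temperature of 4.0 is an arbitrary illustrative default, not a value
    taken from the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, q_student)


if __name__ == "__main__":
    # Hypothetical logits for a three-class problem.
    print(distillation_kl([3.0, 1.0, 0.0], [2.5, 1.2, 0.1]))
```

A higher temperature flattens both distributions, so the student is also supervised on the teacher's relative ranking of the non-target classes.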

64. Extracting Cellular Location of Human Proteins Using Deep Learning
Hanke Chen
Abstract: Understanding and extracting the patterns of microscopy images has been a major challenge in the biomedical field. Although trained scientists can locate the proteins of interest within a human cell, this procedure is neither efficient nor accurate enough to process a large amount of data, and it often leads to bias. To resolve this problem, we attempted to create an automatic image classifier using machine learning to locate human proteins with higher speed and accuracy than human beings. We implemented a Convolutional Neural Network classifier with Residue and Squeeze-Excitation layers to locate given proteins of any type in a subcellular structure. After training the model using a series of techniques, it can locate thousands of proteins in 27 different human cell types into 28 subcellular locations, considerably more than historical approaches. The model can classify 4,500 images per minute with an accuracy of 63.07%, surpassing human performance in accuracy (by 35%) and speed. Because our system can be implemented on different cell types, it opens a new vision of understanding in the biomedical field. From the locational information of human proteins, doctors can easily detect a cell's abnormal behaviors, including viral infection, pathogen invasion, and malignant tumor development. Given that the amount of data generated by experiments is greater than humans can analyze, the model cuts down the human resources and time needed to analyze data. Moreover, this locational information can be used in different scenarios such as subcellular engineering, medical care, and etiology inspection.

65. Deep Mining External Imperfect Data for Chest X-ray Disease Screening
Luyang Luo, Lequan Yu, Hao Chen, Quande Liu, Xi Wang, Jiaqi Xu, Pheng-Ann Heng
Abstract: Deep learning approaches have demonstrated remarkable progress in automatic Chest X-ray analysis. The data-driven nature of deep models requires training data that cover a large distribution. Therefore, it is essential to integrate knowledge from multiple datasets, especially for medical images. However, learning a disease classification model with extra Chest X-ray (CXR) data remains challenging. Recent research has demonstrated that a performance bottleneck exists in joint training on different CXR datasets, and few efforts have been made to address the obstacle. In this paper, we argue that incorporating an external CXR dataset leads to imperfect training data, which raises the challenges. Specifically, the imperfection is twofold: domain discrepancy, as the image appearances vary across datasets; and label discrepancy, as different datasets are partially labeled. To this end, we formulate the multi-label thoracic disease classification problem as weighted independent binary tasks according to the categories. For common categories shared across domains, we adopt task-specific adversarial training to alleviate the feature differences. For categories existing in a single dataset, we present uncertainty-aware temporal ensembling of model predictions to further mine the information from the missing labels. In this way, our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability. We conduct extensive experiments on three datasets with more than 360,000 Chest X-ray images. Our method outperforms other competing models and sets state-of-the-art performance on the official NIH test set with 0.8349 AUC, demonstrating its effectiveness in utilizing the external dataset to improve the internal classification.
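The idea of treating multi-label classification as weighted independent binary tasks over partially labeled data can be sketched as a masked binary cross-entropy. This is a simplification under stated assumptions: categories with a missing label are simply skipped here, whereas the abstract describes mining them with uncertainty-aware temporal ensembling. The function name, the use of `None` for a missing label, and the per-category weights are all hypothetical.

```python
import math


def masked_weighted_bce(probs, labels, weights, eps=1e-7):
    """Weighted binary cross-entropy averaged over the labeled categories only.

    probs   -- predicted probability per disease category
    labels  -- 1, 0, or None when the category is unlabeled in the source dataset
    weights -- per-category task weight (illustrative; not the paper's weighting)
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, w in zip(probs, labels, weights):
        if y is None:
            # Missing label: this category contributes nothing to the loss.
            continue
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    # Three categories; the third is unlabeled in this (hypothetical) sample.
    print(masked_weighted_bce([0.9, 0.2, 0.5], [1, 0, None], [1.0, 1.0, 1.0]))
```

Because each category is an independent binary task, a sample from a dataset that lacks some categories can still supervise the categories it does annotate.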

66. No-Reference Image Quality Assessment via Feature Fusion and Multi-Task Learning
S. Alireza Golestaneh, Kris Kitani
Abstract: Blind or no-reference image quality assessment (NR-IQA) is a fundamental, unsolved, and yet challenging problem due to the unavailability of a reference image. It is vital to the streaming and social media industries that impact billions of viewers daily. Although previous NR-IQA methods leveraged different feature extraction approaches, the performance bottleneck still exists. In this paper, we propose a simple and yet effective general-purpose no-reference (NR) image quality assessment (IQA) framework based on multi-task learning. Our model employs distortion types as well as subjective human scores to predict image quality. We propose a feature fusion method to utilize distortion information to improve the quality score estimation task. In our experiments, we demonstrate that by utilizing multi-task learning and our proposed feature fusion method, our model yields better performance for the NR-IQA task. To demonstrate the effectiveness of our approach, we test our approach on seven standard datasets and show that we achieve state-of-the-art results on various datasets.

67. MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level
Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu
Abstract: Grounding free-form textual queries necessitates an understanding of these textual phrases and their relation to the visual cues in order to reliably reason about the described locations. Spatial attention networks are known to learn this relationship and focus their gaze on salient objects in the image. Thus, we propose to utilize spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information, to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. We focus only on the phrase query and ground truth pair (referring expression), so that the model is independent of dataset-specific constraints, i.e., additional attributes, context, etc. On one such referring expression dataset, ReferIt Game, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state-of-the-art. Without the context from image captions and attribute information in Flickr30k Entities, we still achieve competitive results compared to the state-of-the-art.

68. Deep Octree-based CNNs with Output-Guided Skip Connections for 3D Shape and Scene Completion
Peng-Shuai Wang, Yang Liu, Xin Tong
Abstract: Acquiring complete and clean 3D shape and scene data is challenging due to geometric occlusion and insufficient views during 3D capturing. We present a simple yet effective deep learning approach for completing noisy and incomplete input shapes or scenes. Our network is built upon octree-based CNNs (O-CNN) with U-Net-like structures, which enjoy high computational and memory efficiency and support the construction of a very deep network structure for 3D CNNs. A novel output-guided skip-connection is introduced to the network structure to better preserve the input geometry and to learn geometry priors from data effectively. We show that with these simple adaptations (output-guided skip-connection and deeper O-CNN, up to 70 layers), our network achieves state-of-the-art results in 3D shape completion and semantic scene completion.

69. GRNet: Gridding Residual Network for Dense Point Cloud Completion
Haozhe Xie, Hongxun Yao, Shangchen Zhou, Jiageng Mao, Shengping Zhang, Wenxiu Sun
Abstract: Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds. We therefore propose a novel Gridding Residual Network (GRNet) for point cloud completion. In particular, we devise two novel differentiable layers, named Gridding and Gridding Reverse, to convert between point clouds and 3D grids without losing structural information. We also present the differentiable Cubic Feature Sampling layer to extract features of neighboring points, which preserves context information. In addition, we design a new loss function, namely Gridding Loss, to calculate the L1 distance between the 3D grids of the predicted and ground truth point clouds, which is helpful for recovering details. Experimental results indicate that the proposed GRNet performs favorably against state-of-the-art methods on the ShapeNet, Completion3D, and KITTI benchmarks.

70. Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation
Mingjie Li, Fuyu Wang, Xiaojun Chang, Xiaodan Liang
Abstract: Beyond the common difficulties faced in the natural image captioning, medical report generation specifically requires the model to describe a medical image with a fine-grained and semantic-coherence paragraph that should satisfy both medical commonsense and logic. Previous works generally extract the global image features and attempt to generate a paragraph that is similar to referenced reports; however, this approach has two limitations. Firstly, the regions of primary interest to radiologists are usually located in a small area of the global image, meaning that the remainder parts of the image could be considered as irrelevant noise in the training procedure. Secondly, there are many similar sentences used in each medical report to describe the normal regions of the image, which causes serious data bias. This deviation is likely to teach models to generate these inessential sentences on a regular basis. To address these problems, we propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns. In more detail, ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning. The core structure of ASGK consists of a medical graph encoder and a natural language decoder, inspired by advanced Generative Pre-Training (GPT). Experiments on the CX-CHR dataset and our COVID-19 CT Report dataset demonstrate that our proposed ASGK is able to generate a robust and accurate report, and moreover outperforms state-of-the-art methods on both medical terminology classification and paragraph generation metrics.

71. Simple Primary Colour Editing for Consumer Product Images
Han Gong, Luwen Yu, Stephen Westland
Abstract: We present a simple primary colour editing method for consumer product images. We show that by using colour correction and colour blending, we can automate the painstaking colour editing task and save time for consumer colour preference researchers. To improve the colour harmony between the primary colour and its complementary colours, our algorithm also tunes the other colours in the image. A preliminary experiment has shown some promising results compared with a state-of-the-art method and human editing.

72. Unsupervised Abnormality Detection Using Heterogeneous Autonomous Systems
Sayeed Shafayet Chowdhury, Kazi Mejbaul Islam, Rouhan Noor
Abstract: Anomaly detection in a surveillance scenario is an emerging and challenging field of research. For autonomous vehicles like drones or cars, it is immensely important to distinguish between normal and abnormal states in real-time to avoid/detect potential threats. But the nature and degree of abnormality may vary depending upon the actual environment and adversary. As a result, it is impractical to model all cases a priori and use supervised methods to classify. Also, an autonomous vehicle provides various data types like images and other analog or digital sensor data. In this paper, a heterogeneous system is proposed which estimates the degree of abnormality of an environment using drone-feed, analyzing real-time image and IMU sensor data in an unsupervised manner. Here, we have demonstrated AngleNet (a novel CNN architecture) to estimate the angle between a normal image and another image under consideration, which provides us with a measure of anomaly. Moreover, the IMU data are used in clustering models to predict abnormality. Finally, the results from these two algorithms are ensembled to estimate the final abnormality. The proposed method performs satisfactorily on the IEEE SP Cup-2020 dataset with an accuracy of 99.92%. Additionally, we have also tested this approach on an in-house dataset to validate its robustness.

73. WOAD: Weakly Supervised Online Action Detection in Untrimmed Videos
Mingfei Gao, Yingbo Zhou, Ran Xu, Richard Socher, Caiming Xiong
Abstract: Online action detection in untrimmed videos aims to identify an action as it happens, which makes it very important for real-time applications. Previous methods rely on tedious annotations of temporal action boundaries for model training, which hinders the scalability of online action detection systems. We propose WOAD, a weakly supervised framework that can be trained using only video-class labels. WOAD contains two jointly-trained modules, i.e., temporal proposal generator (TPG) and online action recognizer (OAR). Supervised by video-class labels, TPG works offline and targets on accurately mining pseudo frame-level labels for OAR. With the supervisory signals from TPG, OAR learns to conduct action detection in an online fashion. Experimental results on THUMOS'14 and ActivityNet1.2 show that our weakly-supervised method achieves competitive performance compared to previous strongly-supervised methods. Beyond that, our method is flexible to leverage strong supervision when it is available. When strongly supervised, our method sets new state-of-the-art results in the online action detection tasks including online per-frame action recognition and online detection of action start.

74. Dilated Convolutions with Lateral Inhibitions for Semantic Image Segmentation
Yujiang Wang, Mingzhi Dong, Jie Shen, Yiming Lin, Maja Pantic
Abstract: Dilated convolutions are widely used in deep semantic segmentation models as they can enlarge the filters' receptive field without adding additional weights nor sacrificing spatial resolution. However, as dilated convolutional filters do not possess positional knowledge about the pixels on semantically meaningful contours, they could lead to ambiguous predictions on object boundaries. In addition, although dilating the filter can expand its receptive field, the total number of sampled pixels remains unchanged, which usually comprises a small fraction of the receptive field's total area. Inspired by the Lateral Inhibition (LI) mechanisms in human visual systems, we propose the dilated convolution with lateral inhibitions (LI-Convs) to overcome these limitations. Introducing LI mechanisms improves the convolutional filter's sensitivity to semantic object boundaries. Moreover, since LI-Convs also implicitly take the pixels from the laterally inhibited zones into consideration, they can also extract features at a denser scale. By integrating LI-Convs into the Deeplabv3+ architecture, we propose the Lateral Inhibited Atrous Spatial Pyramid Pooling (LI-ASPP) and the Lateral Inhibited MobileNet-V2 (LI-MNV2). Experimental results on three benchmark datasets (PASCAL VOC 2012, CelebAMask-HQ and ADE20K) show that our LI-based segmentation models outperform the baseline on all of them, thus verify the effectiveness and generality of the proposed LI-Convs.
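The observation that dilation enlarges the receptive field while the number of sampled pixels stays fixed can be checked with a little arithmetic. The sketch below covers a single square kernel only; the function name and the example dilation rates are chosen for illustration and are not taken from the paper.

```python
def dilated_kernel_stats(kernel_size, dilation):
    """Return (effective_extent, sampled_pixels, sampled_fraction) for one
    square 2D convolution kernel with the given dilation rate.

    effective_extent -- side length of the area the dilated kernel spans
    sampled_pixels   -- number of input pixels actually read (independent of dilation)
    sampled_fraction -- sampled_pixels divided by the spanned area
    """
    effective_extent = kernel_size + (kernel_size - 1) * (dilation - 1)
    sampled_pixels = kernel_size ** 2
    sampled_fraction = sampled_pixels / effective_extent ** 2
    return effective_extent, sampled_pixels, sampled_fraction


if __name__ == "__main__":
    # Illustrative dilation rates for a 3x3 kernel.
    for rate in (1, 6, 12, 18):
        extent, sampled, fraction = dilated_kernel_stats(3, rate)
        print(f"dilation={rate:2d}: spans {extent}x{extent}, "
              f"reads {sampled} pixels ({fraction:.1%} of the area)")
```

For a 3x3 kernel with dilation 6, the kernel spans 13x13 pixels but still reads only 9 of them, which is the sparsity the abstract refers to.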

75. Visual Transformers: Token-based Image Representation and Processing for Computer Vision
Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, Peter Vajda
Abstract: Computer vision has achieved great success using standardized image representations (pixel arrays) and the corresponding deep learning operators (convolutions). In this work, we challenge this paradigm: we instead (a) represent images as a set of visual tokens and (b) apply visual transformers to find relationships between visual semantic concepts. Given an input image, we dynamically extract a set of visual tokens from the image to obtain a compact representation for high-level semantics. We then use visual transformers to operate over the visual tokens to densely model relationships between them. We find that this paradigm of token-based image representation and processing drastically outperforms its convolutional counterparts on image classification and semantic segmentation. To demonstrate the power of this approach on ImageNet classification, we use ResNet as a convenient baseline and use visual transformers to replace the last stage of convolutions. This reduces the stage's MACs by up to 6.9x, while attaining up to 4.53 points higher top-1 accuracy. For semantic segmentation, we use a visual-transformer-based FPN (VT-FPN) module to replace a convolution-based FPN, using 6.5x fewer MACs while achieving up to 0.35 points higher mIoU on LIP and COCO-stuff.

76. AutoHAS: Differentiable Hyper-parameter and Architecture Search
Xuanyi Dong, Mingxing Tan, Adams Wei Yu, Daiyi Peng, Bogdan Gabrys, Quoc V. Le
Abstract: Neural Architecture Search (NAS) has achieved significant progress in pushing state-of-the-art performance. While previous NAS methods search for different network architectures with the same hyper-parameters, we argue that such search would lead to sub-optimal results. We empirically observe that different architectures tend to favor their own hyper-parameters. In this work, we extend NAS to a broader and more practical space by combining hyper-parameter and architecture search. As architecture choices are often categorical whereas hyper-parameter choices are often continuous, a critical challenge here is how to handle these two types of values in a joint search space. To tackle this challenge, we propose AutoHAS, a differentiable hyper-parameter and architecture search approach, with the idea of discretizing the continuous space into a linear combination of multiple categorical basis. A key element of AutoHAS is the use of weight sharing across all architectures and hyper-parameters which enables efficient search over the large joint search space. Experimental results on MobileNet/ResNet/EfficientNet/BERT show that AutoHAS significantly improves accuracy up to 2% on ImageNet and F1 score up to 0.4 on SQuAD 1.1, with search cost comparable to training a single model. Compared to other AutoML methods, such as random search or Bayesian methods, AutoHAS can achieve better accuracy with 10x less compute cost.
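One way to picture the idea of expressing a continuous hyper-parameter as a linear combination of categorical bases is the sketch below: a small set of candidate values is mixed with weights derived from learnable logits. The softmax weighting, the function names, and the candidate learning rates are assumptions for illustration; the abstract does not state how AutoHAS computes or trains these weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_hyperparameter(candidates, logits):
    """Return (mixed_value, weights) where mixed_value is the weighted sum of
    the candidate values and the weights come from a softmax over the logits."""
    weights = softmax(logits)
    mixed_value = sum(w * c for w, c in zip(weights, candidates))
    return mixed_value, weights


if __name__ == "__main__":
    # Hypothetical categorical basis for a learning rate.
    learning_rates = [0.1, 0.01, 0.001]
    value, weights = mix_hyperparameter(learning_rates, [0.0, 0.0, 0.0])
    print(value, weights)
```

Because the mixed value varies smoothly with the logits, the same relaxation can in principle be optimized alongside categorical architecture choices, which is the joint search space the abstract describes.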

77. RIT-Eyes: Rendering of near-eye images for eye-tracking applications
Nitinraj Nair, Rakshit Kothari, Aayush K. Chaudhary, Zhizhuo Yang, Gabriel J. Diaz, Jeff B. Pelz, Reynold J. Bailey
Abstract: Deep neural networks for video-based eye tracking have demonstrated resilience to noisy environments, stray reflections, and low resolution. However, to train these networks, a large number of manually annotated images are required. To alleviate the cumbersome process of manual labeling, computer graphics rendering is employed to automatically generate a large corpus of annotated eye images under various conditions. In this work, we introduce a synthetic eye image generation platform that improves upon previous work by adding features such as an active deformable iris, an aspherical cornea, retinal retro-reflection, gaze-coordinated eye-lid deformations, and blinks. To demonstrate the utility of our platform, we render images reflecting the represented gaze distributions inherent in two publicly available datasets, NVGaze and OpenEDS. We also report on the performance of two semantic segmentation architectures (SegNet and RITnet) trained on rendered images and tested on the original datasets.

78. Knowledge transfer between bridges for drive-by monitoring using adversarial and multi-task learning [PDF] 返回目录
Jingxiao Liu, Mario Bergés, Jacobo Bielak, Hae Young Noh
Abstract: Monitoring bridge health using the vibrations of drive-by vehicles has various benefits, such as low cost and no need for direct installation or on-site maintenance of equipment on the bridge. However, many such approaches require labeled data from every bridge, which is expensive and time-consuming, if not impossible, to obtain. This is further exacerbated by having multiple diagnostic tasks, such as damage quantification and localization. One way to address this issue is to directly apply the supervised model trained for one bridge to other bridges, although this may significantly reduce the accuracy because of distribution mismatch between different bridges' data. To alleviate these problems, we introduce a transfer learning framework using domain-adversarial training and multi-task learning to detect, localize and quantify damage. Specifically, we train a deep network in an adversarial way to learn features that are 1) sensitive to damage and 2) invariant to different bridges. In addition, to improve the error propagation from one task to the next, our framework learns shared features for all the tasks using multi-task learning. We evaluate our framework using lab-scale experiments with two different bridges. On average, our framework achieves 94%, 97% and 84% accuracy for damage detection, localization and quantification, respectively, within one damage severity level.
摘要：按驱动车辆监控使用的振动桥梁健康有多种益处，比如成本低，无需直接安装或现场对桥梁维修设备。然而，许多这样的方法需要从每一个桥梁，这是昂贵和费时的，如果不是不可能的，获得标签的数据。这是通过具有多个诊断任务，如损伤定量和定位进一步加剧。解决这个问题的一种方法是直接申请训练了一个桥梁，桥梁等的监督模式，虽然这可能会显著减少，因为不同的bridges'data之间分布不匹配的精度。为了解决这些问题，我们将介绍使用域对抗性训练和多任务学习探测，定位和量化损伤转移学习框架。具体而言，我们在敌对方式训练深网络得知是1至损坏和2）不变的不同网桥敏感特性）。此外，为了提高从一个任务到下一个错误传播，我们的框架获悉共享所有使用多任务学习任务功能。我们使用的实验室规模的实验用两种不同的桥梁评估我们的框架。平均来说，我们的框架达到94％，97％和84％的准确度的损坏检测，定位和定量，分别。内的一个损坏严重级别。

79. Robust Face Verification via Disentangled Representations [PDF] 返回目录
Marius Arvinte, Ahmed H. Tewfik, Sriram Vishwanath
Abstract: We introduce a robust algorithm for face verification, i.e., deciding whether two images are of the same person or not. Our approach is a novel take on the idea of using deep generative networks for adversarial robustness. We use the generative model during training as an online augmentation method instead of a test-time purifier that removes adversarial noise. Our architecture uses a contrastive loss term and a disentangled generative model to sample negative pairs. Instead of randomly pairing two real images, we pair an image with its class-modified counterpart while keeping its content (pose, head tilt, hair, etc.) intact. This enables us to efficiently sample hard negative pairs for the contrastive loss. We experimentally show that, when coupled with adversarial training, the proposed scheme converges with a weak inner solver and has a higher clean and robust accuracy than state-of-the-art methods when evaluated against white-box physical attacks.
摘要：介绍一个强大的算法，人脸验证，即决定是否twoimages是同一人或没有的。我们的做法是一种新型采取的想法ofusing深生成网络的鲁棒性对抗。我们的培训作为在线隆胸方法，而不是测试timepurifier，消除对抗噪音过程中使用generativemodel。我们的架构采用对比损失termand一个解开生成模型，以样品负对。相反randomlypairing两个真实的图像，我们与它的类修饰的对应whilekeeping其内容（姿势，头部倾斜，毛发等）完整配对的图像。这使我们能够efficientlysample难负对用于对比的损失。我们实验表明，当加上对抗性训练，该方案具有收敛内aweak解算器和具有较高的清洁和坚固的精度比国家的最先进的方法当针对评估白盒物理攻击。

80. SparseFusion: Dynamic Human Avatar Modeling from Sparse RGBD Images [PDF] 返回目录
Xinxin Zuo, Sen Wang, Jiangbin Zheng, Weiwei Yu, Minglun Gong, Ruigang Yang, Li Cheng
Abstract: In this paper, we propose a novel approach to reconstruct 3D human body shapes based on a sparse set of RGBD frames using a single RGBD camera. We specifically focus on the realistic settings where human subjects move freely during the capture. The main challenge is how to robustly fuse these sparse frames into a canonical 3D model, under pose changes and surface occlusions. This is addressed by our new framework consisting of the following steps. First, based on a generative human template, an initial pairwise alignment is performed for every two frames having sufficient overlap. This is followed by a global non-rigid registration procedure, in which partial results from RGBD frames are collected into a unified 3D shape, under the guidance of correspondences from the pairwise alignment. Finally, the texture map of the reconstructed human model is optimized to deliver a clear and spatially consistent texture. Empirical evaluations on synthetic and real datasets demonstrate both quantitatively and qualitatively the superior performance of our framework in reconstructing complete 3D human models with high fidelity. It is worth noting that our framework is flexible, with potential applications going beyond shape reconstruction. As an example, we showcase its use in reshaping and reposing to a new avatar.
摘要：在本文中，我们提出了一种新颖的方法基于使用单个相机RGBD稀疏集合RGBD帧的重建三维人体形状。我们特别注重现实的设置里的人类受试者的拍摄过程中自由移动。主要的挑战是如何保持强劲的保险丝这些稀疏帧为规范的3D模型，在姿态的变化和表面遮挡。这是由我们的由以下各步骤的新框架解决。首先，根据一生成人模板，对于具有足够的重叠每两帧，初始成对对准进行;其次是一个全球性的非刚性配准过程，其中来自帧RGBD部分结果被收集到一个统一的3D字形，从两两比对的对应的指导下;最后，重建人体模型的纹理贴图优化，可以提供清晰，空间一致的质感。对合成和真实数据集实证评价显示在数量上和质量上我们与高保真重建完整的三维人体模型架构的卓越性能。值得注意的是，我们的架构是灵活的，具有潜在的应用超越形状重建。作为一个例子，我们展示重塑和寄托到一个新的头像其使用。

81. Inception Augmentation Generative Adversarial Network [PDF] 返回目录
Saman Motamed, Farzad Khalvati
Abstract: Successful training of convolutional neural networks (CNNs) requires a substantial amount of data. With small datasets, networks generalize poorly. Data Augmentation techniques improve the generalizability of neural networks by using existing training data more effectively. Standard data augmentation methods, however, produce limited plausible alternative data. Generative Adversarial Networks (GANs) have been utilized to generate new data and improve CNN performance. Nevertheless, generative models have not been used for augmenting data to improve the training of another generative model. In this work, we propose a new GAN architecture for semi-supervised augmentation of chest X-rays for the detection of pneumonia. We show that the proposed GAN can augment data for a specific class of images (pneumonia) using images from both classes (pneumonia and normal) in an image domain (chest X-rays). We demonstrate that using our proposed GAN-based data augmentation method significantly improves the performance of the state-of-the-art anomaly detection architecture, AnoGAN, in detecting pneumonia in chest X-rays, increasing AUC from 0.83 to 0.88.
摘要：卷积神经网络（细胞神经网络）的成功训练需要数据的大量。对于小数据集，网络推广不佳。数据增强技术，通过更有效地使用现有的训练数据改善神经网络的普遍性。标准数据增强方法，但是，产生限制似乎可能的替代数据。生成对抗性网络（甘斯）已被用来产生新的数据和改善CNN性能。然而，生成模型还没有被用于增强数据改善另一生成模型的培训。在这项工作中，我们提出了胸部X射线的半监督增强了检测肺炎的新GAN架构。我们表明，所提出的GAN可以用于使用从图像在图像域中两个类（肺炎和正常）（胸部X射线）的特定类的图像（肺炎）增强数据。我们证明了使用我们的提议的基于GaN的数据扩张方法显著提高国家的最先进的异常检测架构的性能，AnoGAN，在检测胸部X光肺炎，从0.83增加AUC至0.88。

82. Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge [PDF] 返回目录
Freddy A. Boulton, Elena Corina Grigore, Eric M. Wolff
Abstract: Predicting the future motion of vehicles has been studied using various techniques, including stochastic policies, generative models, and regression. Recent work has shown that classification over a trajectory set, which approximates possible motions, achieves state-of-the-art performance and avoids issues like mode collapse. However, map information and the physical relationships between nearby trajectories are not fully exploited in this formulation. We build on classification-based approaches to motion prediction by adding an auxiliary loss that penalizes off-road predictions. This auxiliary loss can easily be pretrained using only map information (e.g., off-road area), which significantly improves performance on small datasets. We also investigate weighted cross-entropy losses to capture spatial-temporal relationships among trajectories. Our final contribution is a detailed comparison of classification and ordinal regression on two public self-driving datasets.
摘要：预测车辆的未来运动一直使用各种技术，包括随机政策，生成模型和回归分析。最近的研究表明，在分类的轨迹集，接近可能的运动，达到类似模式崩溃的国家的最先进的性能和避免的问题。然而，地图信息和附近的轨迹之间的物理关系并未完全在该制剂开发。我们通过加入助剂损失惩罚越野预测建立在基于分类的方法运动预测。该辅助损失可以很容易地\ EMPH仅使用映射信息（例如，越野区域），其显著提高了小的数据集性能{预训练}。我们还调查加权互熵损失捕获轨迹中的时空关系。我们的最后的贡献是分类和有序回归的两个公众自驾车的数据集的详细比较。

83. Biomechanics-informed Neural Networks for Myocardial Motion Tracking in MRI [PDF] 返回目录
Chen Qin, Shuo Wang, Chen Chen, Huaqi Qiu, Wenjia Bai, Daniel Rueckert
Abstract: Image registration is an ill-posed inverse problem which often requires regularisation on the solution space. In contrast to most of the current approaches which impose explicit regularisation terms such as smoothness, in this paper we propose a novel method that can implicitly learn biomechanics-informed regularisation. Such an approach can incorporate application-specific prior knowledge into deep learning based registration. Particularly, the proposed biomechanics-informed regularisation leverages a variational autoencoder (VAE) to learn a manifold for biomechanically plausible deformations and to implicitly capture their underlying properties via reconstructing biomechanical simulations. The learnt VAE regulariser then can be coupled with any deep learning based registration network to regularise the solution space to be biomechanically plausible. The proposed method is validated in the context of myocardial motion tracking in cardiac MRI data. The results show that it can achieve better performance against other competing methods in terms of motion tracking accuracy and has the ability to learn biomechanical properties such as incompressibility and strains. The method has also been shown to have better generalisability to unseen domains compared with commonly used L2 regularisation schemes.
摘要：图像配准是一个病态逆问题，往往需要在解空间正则化。相比于其中大部分实行明确的正规化诸如平滑度，本文中的现有方法的提出，可以学习隐含生物力学知情正规化的新方法。这种方法可以将应用程序特定的先验知识进深学习基于登记。特别地，所提出的生物力学知情正规化杠杆一个变自动编码器（VAE）学习歧管，用于生物力学合理变形和隐式地通过重建的生物力学模拟捕获它们的基本性质。所学习的VAE正则化然后可以与任何深基于学习注册网络正规化解空间是生物力学合理的。所提出的方法在心肌运动跟踪的在心脏MRI数据的上下文中进行验证。结果表明，它可以在运动跟踪精度方面达到与其他竞争方法更好的性能，并具有学习生物力学性能，如不可压缩性和应变能力。该方法也已经显示出具有更好的可推广到与常用L2正规化方案相比看不见域。

84. Neural Architecture Search without Training [PDF] 返回目录
Joseph Mellor, Jack Turner, Amos Storkey, Elliot J. Crowley
Abstract: The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be extremely slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine how the linear maps induced by data points correlate for untrained network architectures in the NAS-Bench-201 search space, and motivate how this can be used to give a measure of modelling flexibility which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU. Code to reproduce our experiments is available at this https URL.
摘要：参与手工设计深层神经网络的时间和精力是巨大的。这促使神经结构搜索（NAS）技术的发展，自动执行此设计。然而，NAS算法往往是非常缓慢和昂贵;他们需要培训候选网络的广大通知搜索过程。如果我们能够从它的初始状态推断网络训练有素的准确性这可以补救。在这项工作中，我们研究如何线性图的数据点诱导的NAS-台-201的搜索空间未受过训练的网络架构相关，并鼓励这可怎么用来给建模灵活性的措施，其高度指示网络的的训练有素的性能。我们将这一措施成为一种简单的算法，使我们能够寻找强大的网络，而不在单个GPU在几秒钟之内的任何训练。代码重现我们的实验可在此HTTPS URL。

85. End-to-end learning for semiquantitative rating of COVID-19 severity on Chest X-rays [PDF] 返回目录
Alberto Signoroni, Mattia Savardi, Sergio Benini, Nicola Adami, Riccardo Leonardi, Paolo Gibellini, Filippo Vaccher, Marco Ravanelli, Andrea Borghesi, Roberto Maroldi, Davide Farina
Abstract: In this work we designed an end-to-end deep learning architecture for predicting, on chest X-ray (CXR) images, a multi-regional score conveying the degree of lung compromise in COVID-19 patients. This semiquantitative scoring system, namely the Brixia-score, was applied in serial monitoring of such patients, showing significant prognostic value, in one of the hospitals that experienced one of the highest pandemic peaks in Italy. To solve such a challenging visual task, we adopt a weakly supervised learning strategy structured to handle different tasks (segmentation, spatial alignment, and score estimation) trained with a "from part to whole" procedure involving different datasets. In particular, we exploited a clinical dataset of almost 5,000 annotated CXR images collected in the same hospital. Our BS-Net demonstrated self-attentive behavior and a high degree of accuracy in all processing stages. Through inter-rater agreement tests and a gold standard comparison, we were able to show that our solution outperforms single human annotators in rating accuracy and consistency, thus supporting the possibility of using this tool in contexts of computer-assisted monitoring. Highly resolved (super-pixel level) explainability maps were also generated, with an original technique, to visually help the understanding of the network activity on the lung areas. We eventually tested the performance robustness of our model on a variegated public COVID-19 dataset, for which we also provide Brixia-score annotations, observing good direct generalization and fine-tuning capabilities that favorably highlight the portability of BS-Net in other clinical settings.
摘要：在这项工作中，我们设计了一个终端到终端的深度学习结构预测，胸部X射线图像（CRX），多得分区域输送肺妥协COVID-19在患者的程度。这样的半定量计分系统，即Brixia得分，此类患者的连续监测应用，表现出显著预测值，在经历了意大利最流行的山峰之一的医院之一。为了解决这样的挑战视觉任务，我们采用弱监督学习策略结构来处理与涉及不同的数据集是“从部分到整体”程序的培训不同的任务（分割，空间定位，并分估计）。特别是，我们利用了同一家医院收集了近5000 CXR注释的图像的临床数据集。我们BS-Net的展示自我周到的行为，并在所有加工阶段高精确度。通过评估人之间一致试验和金标准的比较，我们能够证明我们的解决方案优于排名的准确性和一致性，单人工注释，从而支持使用的计算机辅助监视上下文这个工具的可能性。高分辨率（超象素级）explainability地图还产生了，与原来的技术，直观地帮助对肺区域网络活动的理解。最后我们测试了我们的模型的性能鲁棒性上的杂色公共COVID-19数据集，为此，我们还提供Brixia得分注解，观察良好的直接泛化和微调功能毫不逊色突出BS-网的其他临床设置便携。

86. On Universalized Adversarial and Invariant Perturbations [PDF] 返回目录
Sandesh Kamath, Amit Deshpande, K V Subrahmanyam
Abstract: Convolutional neural networks or standard CNNs (StdCNNs) are translation-equivariant models that achieve translation invariance when trained on data augmented with sufficient translations. Recent work on equivariant models for a given group of transformations (e.g., rotations) has led to group-equivariant convolutional neural networks (GCNNs). GCNNs trained on data augmented with sufficient rotations achieve rotation invariance. Recent work by authors arXiv:2002.11318 studies a trade-off between invariance and robustness to adversarial attacks. In another related work arXiv:2005.08632, given any model and any input-dependent attack that satisfies a certain spectral property, the authors propose a universalization technique called SVD-Universal to produce a universal adversarial perturbation by looking at very few test examples. In this paper, we study the effectiveness of SVD-Universal on GCNNs as they gain rotation invariance through a higher degree of training augmentation. We empirically observe that as GCNNs gain rotation invariance through training augmented with larger rotations, the fooling rate of SVD-Universal gets better. To understand this phenomenon, we introduce universal invariant directions and study their relation to the universal adversarial direction produced by SVD-Universal.
摘要：卷积神经网络或标准细胞神经网络（StdCNNs）是翻译等变模型上有足够的翻译增强数据训练时实现平移不变性。在等变模型一组给定的转换（例如，旋转）的最近的工作导致组等变卷积神经网络（GCNNs）。培训了足够的旋转增强数据GCNNs实现旋转不变性。由作者最近的工作的arXiv：2002.11318研究权衡不变性和鲁棒性之间的对抗攻击。在另一相关工作的arXiv：2005.08632，给出任何型号和任何输入相关的攻击，满足一定的光谱特性，作者提出了一个名为SVD-通用普遍化技术通过看很少有测试的例子，产生一个普遍的对抗性扰动。在本文中，我们研究SVD-通用的对，因为他们通过更高程度的培训隆胸获得旋转不变性GCNNs有效性。我们经验观察到通过培训GCNNs增益旋转不变性与较大的旋转增加，SVD-通用的嘴硬率变得更好。要理解这一现象，我们引入通用不变的方向，并研究它们之间的关系由SVD-通用生产的普遍对抗的方向。
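
The abstract above does not spell out how SVD-Universal builds its perturbation. The sketch below assumes, going by the name, that per-example attack directions are stacked as rows of a matrix and the top right singular vector is taken as the shared direction. The toy `directions` data and the function name are hypothetical, and power iteration stands in for a full SVD so that only the standard library is needed.

```python
import math

def top_right_singular_vector(rows, iters=200):
    """Power iteration on A^T A, where A stacks one attack direction per row."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim  # unit-norm starting vector
    for _ in range(iters):
        # u = A v
        u = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        # w = A^T u
        w = [sum(rows[i][j] * u[i] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical per-example attack directions (e.g. loss gradients), one per row.
directions = [[1.0, 0.0], [0.9, 0.1], [1.0, -0.1]]
universal = top_right_singular_vector(directions)
```

In this toy case all three directions point mostly along the first axis, so the recovered unit vector does too. A real use would scale the vector to the allowed perturbation budget before adding it to inputs.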

87. Training Deep Spiking Neural Networks [PDF] 返回目录
Eimantas Ledinauskas, Julius Ruseckas, Alfonsas Juršėnas, Giedrius Buračas
Abstract: Computation using brain-inspired spiking neural networks (SNNs) with neuromorphic hardware may offer orders of magnitude higher energy efficiency compared to the current analog neural networks (ANNs). Unfortunately, training SNNs with the same number of layers as state-of-the-art ANNs remains a challenge. To our knowledge, the only method that is successful in this regard is supervised training of an ANN followed by its conversion to an SNN. In this work we directly train deep SNNs using backpropagation with a surrogate gradient and find that, due to the implicitly recurrent nature of feed-forward SNNs, the exploding or vanishing gradient problem severely hinders their training. We show that this problem can be solved by tuning the surrogate gradient function. We also propose using batch normalization from the ANN literature on the input currents of SNN neurons. Using these improvements we show that it is possible to train an SNN with the ResNet50 architecture on the CIFAR100 and Imagenette object recognition datasets. The trained SNN falls behind in accuracy compared to the analogous ANN but requires several orders of magnitude fewer inference time steps (as low as 10) to reach good accuracy compared to SNNs obtained by conversion from ANNs, which require on the order of 1000 time steps.
摘要：计算使用脑启发扣球神经网络（SNNS）与神经形态硬件可提供的数量级能效更高的订单相比，目前的模拟神经网络（人工神经网络）。不幸的是，训练SNNS具有相同层数为艺术人工神经网络的状态仍然是一个挑战。据我们所知，这是成功的在这方面的唯一方法是神经网络的监督下的训练，然后将其转换为SNN。在这项工作中，我们直接使用培训与反向传播梯度替代深SNNS，发现饲料，由于隐含经常性向前SNN的爆炸或消失梯度问题严重阻碍他们的训练。我们发现，这个问题可以通过调整替代梯度功能来解决。我们还建议对SNN神经元的输入电流使用批标准化从ANN文献。通过这些改进，我们要显示它是可以训练SNN与CIFAR100和Imagenette物体识别数据集ResNet50架构。受过训练的SNN落在后面的精度相比类似的ANN，但需要的大小更小推理时间步骤几个数量级（低至10）相比，从ANN的转换而获得需要的1000个时间步骤的顺序上SNNS达到良好的精度。
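
The surrogate-gradient idea mentioned in the abstract above can be illustrated with a single neuron: the forward pass uses a hard threshold, while the backward pass substitutes a smooth stand-in derivative. The sketch below assumes a fast-sigmoid-style surrogate with a tunable `slope`; the abstract does not say which function the authors use, so the shape and the parameter names are illustrative only.

```python
def spike(v, threshold=1.0):
    """Forward pass: emit a spike (1.0) when the membrane potential reaches threshold."""
    return 1.0 if v >= threshold else 0.0

def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Backward pass stand-in for d(spike)/dv, which is zero almost everywhere.

    Uses 1 / (1 + slope * |v - threshold|)^2, assumed here for illustration.
    A larger slope concentrates the gradient more tightly around the threshold.
    """
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2

# The gradient is largest at the threshold and decays away from it.
grads = [surrogate_grad(v) for v in (0.0, 1.0, 2.0)]
```

Changing `slope` changes how much gradient flows through neurons far from threshold, which is the kind of knob the abstract refers to when it talks about tuning the surrogate gradient function.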

88. Cross-Domain Segmentation with Adversarial Loss and Covariate Shift for Biomedical Imaging [PDF] 返回目录
Bora Baydar, Savas Ozkan, A. Emre Kavur, N. Sinem Gezer, M. Alper Selver, Gozde Bozdagi Akar
Abstract: Despite the widespread use of deep learning methods for semantic segmentation of images acquired from a single source, clinicians often use multi-domain data for a detailed analysis. For instance, CT and MRI have advantages over each other in terms of imaging quality, artifacts, and output characteristics that lead to differential diagnosis. Current segmentation techniques can only work on an individual domain because of the differences between domains. However, a complete solution essentially needs models that are capable of working on all modalities. Furthermore, robustness is drastically affected by the number of samples in the training step, especially for deep learning models. Hence, all available data, regardless of domain, should be used to obtain reliable methods. For this purpose, this manuscript aims to implement a novel model that can learn robust representations from cross-domain data by encapsulating distinct and shared patterns from different modalities. Precisely, the covariate shift property is retained with structural modification and adversarial loss, where sparse and rich representations are obtained. Hence, a single parameter set is used to perform the cross-domain segmentation task. The advantage of the proposed method is that no information related to modalities is provided in either the training or the inference phase. Tests on CT and MRI liver data acquired in routine clinical workflows show that the proposed model outperforms all other baselines by a large margin. Experiments are also conducted on a Covid-19 dataset consisting of CT data in which significant intra-class visual differences are observed. Similarly, the proposed method achieves the best performance.
摘要：尽管广泛使用的深学习方法为从单一来源获取的图像语义分割，临床医生经常使用的多域数据的详细分析。例如，CT和MRI具有在彼此的优点在于导致鉴别诊断成像质量，工件和输出特性方面。目前分割技术的能力只允许用于个人领域的工作，由于他们之间的分歧。然而，基本上需要一个完整的解决方案，能够在所有工作模式的模型。此外，稳健性急剧通过在训练步骤样本数量的影响，特别是对深学习模式。因此，有必要将数据域的不顾一切可用的数据应该用于可靠的方法。为了这个目的，本手稿旨在实现能够通过来自不同模态封装不同的和共享模式学习从跨域数据健壮表示的新颖的模型。精确地，协变量移位属性被保持与结构改性和其中获得稀疏和丰富的表示对抗性损失。因此，一个单一的参数集被用于执行跨域分割任务。该方法的优势是，在任何培训或推断阶段没有提供相关的方式的信息。在日常临床工作流程获得的CT和MRI肝数据测试表明，该模型优于所有其他基线有大幅度提高。实验在Covid-19数据集也进行，它包括其中显著帧内类的视觉差异观察到的CT数据。同样，所提出的方法达到最佳性能。

89. Learning the Compositional Visual Coherence for Complementary Recommendations [PDF] 返回目录
Zhi Li, Bo Wu, Qi Liu, Likang Wu, Hongke Zhao, Tao Mei
Abstract: Complementary recommendations, which aim at providing users with product suggestions that are supplementary to and compatible with their obtained items, have become a hot topic in both academia and industry in recent years. However, it is challenging due to its complexity and subjectivity. Existing work mainly focused on modeling the co-purchased relations between two items, but the compositional associations of item collections are largely unexplored. Actually, when a user chooses the complementary items for the purchased products, it is intuitive that she will consider the visual semantic coherence (such as color collocations, texture compatibilities) in addition to global impressions. Towards this end, in this paper, we propose a novel Content Attentive Neural Network (CANN) to model the comprehensive compositional coherence on both global contents and semantic contents. Specifically, we first propose a Global Coherence Learning (GCL) module based on multi-head attention to model the global compositional coherence. Then, we generate the semantic-focal representations from different semantic regions and design a Focal Coherence Learning (FCL) module to learn the focal compositional coherence from different semantic-focal representations. Finally, we optimize the CANN in a novel compositional optimization strategy. Extensive experiments on large-scale real-world data clearly demonstrate the effectiveness of CANN compared with several state-of-the-art methods.
摘要：补充建议，旨在为用户提供产品建议是补充并与他们获得的项兼容，已成为近年来学术界和工业界的一个热门话题。％然而，它具有挑战性，因为它的复杂性和主观性。现有的工作主要集中在造型两个项目之间的合作关系，购买的，但项目的集合组成协会在很大程度上是未知。实际上，当用户选择的补充物品的购买的产品，直觉告诉我们，她会考虑视觉语义连贯（如颜色搭配，材质兼容性），除了全球性的印象。为此，在本文中，我们提出了一个新颖的内容细心的神经网络（CANN）的综合组成连贯了在全球内容和语义内容进行建模。具体而言，我们首先提出了\ {textit整体连贯学习}（GCL）的基础上多注意头模块全球成分的一致性模型。然后，我们生成不同的语义区域的语义焦表示和设计\ textit {联络连贯学习}（FCL）模块，以了解不同的语义焦点交涉焦点组成的连贯性。最后，我们优化CANN在新组成的优化策略。在大规模真实世界的数据大量实验清楚地证明与国家的最先进的几种方法相比CANN的有效性。

90. Photoacoustic Microscopy with Sparse Data Enabled by Convolutional Neural Networks for Fast Imaging [PDF] 返回目录
Jiasheng Zhou, Da He, Xiaoyu Shang, Zhendong Guo, Sung-liang Chen, Jiajia Luo
Abstract: Photoacoustic microscopy (PAM) has been a promising biomedical imaging technology in recent years. However, the point-by-point scanning mechanism results in low-speed imaging, which limits the application of PAM. Reducing sampling density can naturally shorten image acquisition time, which is at the cost of image quality. In this work, we propose a method using convolutional neural networks (CNNs) to improve the quality of sparse PAM images, thereby speeding up image acquisition while keeping good image quality. The CNN model utilizes both squeeze-and-excitation blocks and residual blocks to achieve the enhancement, which is a mapping from a 1/4 or 1/16 low-sampling sparse PAM image to a latent fully-sampled image. The perceptual loss function is applied to keep the fidelity of images. The model is mainly trained and validated on PAM images of leaf veins. The experiments show the effectiveness of our proposed method, which significantly outperforms existing methods quantitatively and qualitatively. Our model is also tested using in vivo PAM images of blood vessels of mouse ears and eyes. The results show that the model can enhance the image quality of the sparse PAM image of blood vessels from several aspects, which may help fast PAM and facilitate its clinical applications.
摘要：光声显微镜（PAM）一直是一个有前途的生物医学成像技术在近几年。然而，逐点扫描机制导致低速成像，这限制了PAM的应用。减少采样密度可以自然地缩短图像获取的时间，这是在图像质量为代价。在这项工作中，我们提出了一种方法，使用卷积神经网络（细胞神经网络），以改善稀疏PAM图像的质量，从而加快了图像采集，同时保持良好的图像质量。 CNN的模型利用两个挤压和激励块和残余块，以实现增强，这是从一个1/4或1/16低采样稀疏PAM图像映射到一个潜完全采样的图像。感知损失函数应用于保持图像的保真度。该模型主要的培训和叶脉的PAM图像验证。实验表明，我们提出的方法，其中显著优于现有方法定量和定性的有效性。我们的模型是使用在小鼠耳朵和眼睛的血管的体内图像PAM还测试。结果表明，该模型可以从几个方面，这可能有助于快速PAM和促进其临床应用增强血管稀疏PAM图像的图像质量。

91. Multi-step Estimation for Gradient-based Meta-learning [PDF] 返回目录
Jin-Hwa Kim, Junyoung Park, Yongseok Choi
Abstract: Gradient-based meta-learning approaches have been successful in few-shot learning, transfer learning, and a wide range of other domains. Despite its efficacy and simplicity, the burden of calculating the Hessian matrix with large memory footprints is the critical challenge in large-scale applications. To tackle this issue, we propose a simple yet straightforward method to reduce the cost by reusing the same gradient in a window of inner steps. We describe the dynamics of the multi-step estimation in the Lagrangian formalism and discuss how to reduce evaluating second-order derivatives estimating the dynamics. To validate our method, we experiment on meta-transfer learning and few-shot learning tasks for multiple settings. The experiment on meta-transfer emphasizes the applicability of training meta-networks, where other approximations are limited. For few-shot learning, we evaluate time and memory complexities compared with popular baselines. We show that our method significantly reduces training time and memory usage, maintaining competitive accuracies, or even outperforming in some cases.
摘要：基于梯度的元学习方法已在一些次学习，学习转移和其他广泛领域的成功。尽管其疗效和简单性，计算与大的存储器脚印的Hessian矩阵的负担是在大规模应用关键挑战。为了解决这个问题，我们提出了一个简单而直接的方法，通过在内部阶梯窗口重复使用相同的梯度，以降低成本。我们描述了多步骤估计的动力学在拉格朗日形式主义，讨论如何减少评估二阶导数估计所述动力学。为了验证我们的方法，我们在实验多设置元转让学习和几个次学习任务。的元传递实验强调训练元网络，其中其它的近似有限的适用性。对于几拍的学习，我们与流行基线相比，评估的时间和存储的复杂性。我们证明了我们的方法显著减少培训时间和内存利用率，保持竞争力的准确度，甚至在某些情况下超越。
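
The core trick named in the abstract above, reusing one gradient across a window of inner steps, can be sketched on a toy problem. This is a minimal illustration that assumes a scalar parameter and a hand-written gradient function; it shows only the reduced number of gradient evaluations in the inner loop and not the second-order meta-gradient computation the paper is concerned with. `inner_loop` and `window` are hypothetical names.

```python
def inner_loop(w, grad_fn, steps, lr, window):
    """Gradient descent that recomputes the gradient only every `window` steps."""
    evals = 0
    g = None
    for t in range(steps):
        if t % window == 0:
            g = grad_fn(w)  # fresh gradient at the start of each window
            evals += 1
        w = w - lr * g      # steps inside the window reuse the same gradient
    return w, evals

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_reuse, evals_reuse = inner_loop(0.0, grad, steps=6, lr=0.1, window=3)
w_exact, evals_exact = inner_loop(0.0, grad, steps=6, lr=0.1, window=1)
```

With `window=3` the six inner steps need two gradient evaluations instead of six, while still moving the parameter toward the minimum at 3. Setting `window=1` recovers ordinary gradient descent.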

92. EDropout: Energy-Based Dropout and Pruning of Deep Neural Networks [PDF] 返回目录
Hojjat Salehinejad, Shahrokh Valaee
Abstract: Dropout is well-known as an effective regularization method by sampling a sub-network from a larger deep neural network and training different sub-networks on different subsets of the data. Inspired by the concept of dropout, we stochastically select, train, and evolve a population of sub-networks, where each sub-network is represented by a state vector and a scalar energy. The proposed energy-based dropout (EDropout) method provides a unified framework that can be applied on any arbitrary neural network without the need for proper normalization. The concept of energy in EDropout has the capability of handling diverse number of constraints without any limit on the size or length of the state vectors. The selected set of sub-networks converges during the training to a sub-network that minimizes the energy of the candidate state vectors. The rest of training time is then allocated to fine-tuning the selected sub-network. This process will be equivalent to pruning. We evaluate the proposed method on different flavours of ResNets, AlexNet, and SqueezeNet on the Kuzushiji, Fashion, CIFAR-10, CIFAR-100, and Flowers datasets, and compare with the state-of-the-art pruning and compression methods. We show that on average the networks trained with EDropout achieve a pruning rate of more than 50% of the trainable parameters with approximately <5% and <1% drop of top-1 and top-5 classification accuracy, respectively.
摘要：差是众所周知的，通过从更大的深神经网络采样的子网络和所述数据的不同子集训练不同的子网络中的有效正则化方法。由信息丢失的概念的启发，我们随机选择，火车，和进化子网络，其中每个子网络由状态矢量和标量能量表示的群体。所提出的基于能量的差（EDropout）方法提供了可任意神经网络上，而不需要进行适当的归一化被应用的统一框架。能源在EDropout概念具有处理约束的不同数目而对状态向量的大小或长度的任何限制的能力。训练到子网络候选状态矢量的能量最小化过程中所选择的组的子网络的收敛。随后的训练休息的时候被分配到微调选定的子网络。这个过程将是等同于修剪。我们评估了该方法的ResNets，AlexNet的不同口味，并SqueezeNet在Kuzushiji，时尚，CIFAR-10，CIFAR-100和鲜花的数据集，并与国家的最先进的修剪和压缩方法进行了比较。我们表明，平均有EDropout训练网络实现的可训练参数的50％以上与修剪率约<5％和<1％分别下降顶-1和top-5分类精度。

93. Self-Representation Based Unsupervised Exemplar Selection in a Union of Subspaces [PDF] 返回目录
Chong You, Chi Li, Daniel P. Robinson, Rene Vidal
Abstract: Finding a small set of representatives from an unlabeled dataset is a core problem in a broad range of applications such as dataset summarization and information extraction. Classical exemplar selection methods such as $k$-medoids work under the assumption that the data points are close to a few cluster centroids, and cannot handle the case where data lie close to a union of subspaces. This paper proposes a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the $\ell_1$ norm of the representation coefficients. Geometrically, this subset best covers all the data points as measured by the Minkowski functional of the subset. To solve our model efficiently, we introduce a farthest first search algorithm that iteratively selects the worst represented point as an exemplar. When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace. We further develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data. Moreover, we show that a classifier trained on the selected exemplars (when they are labeled) can correctly classify the rest of the data points.
摘要：从一个未标记的数据集寻找一个小组的代表是在宽范围的应用的核心问题，如数据集的总结和信息提取。经典范例的选择方法，如假设的数据点都接近一些聚类中心，而不能处理数据位于接近子空间的联合情况下$ $ķ工作-medoids。本文提出了一种新的典范选择模型，对于一个子集，最好的重建由代表系数$ \ $ ell_1测量规范所有的数据点搜索。几何上，该子集最好覆盖了所有的数据点由闵可夫斯基功能子集的测量。为了有效地解决我们的模型中，我们引入了最远的第一搜索算法，反复选择最差的代表点作为一个典范。当这些数据是由独立的子空间的联合绘制的，我们的方法是能够从每个子空间选择足够多的代表。我们进一步开发了基于标本子空间聚类方法，它具有较强的抗不平衡数据和高效的大规模的数据。此外，我们表明，经过训练所选典范（当它们被标记）的分类可以正确分类数据点的其余部分。
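
The farthest-first search described in the abstract above repeatedly picks the point that the current exemplars represent worst. The sketch below keeps that greedy loop but swaps in plain Euclidean distance to the nearest exemplar as the representation cost; the paper instead measures cost through the ℓ1 norm of self-representation coefficients, which is not reproduced here. The function name and the toy points are hypothetical.

```python
import math

def farthest_first(points, k, start=0):
    """Greedy exemplar selection: always add the currently worst-covered point.

    Cost of a point = Euclidean distance to its nearest chosen exemplar
    (a simplified stand-in for the paper's self-representation cost).
    """
    chosen = [start]
    cost = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        worst = max(range(len(points)), key=lambda i: cost[i])
        chosen.append(worst)
        # Each point keeps the distance to its nearest exemplar so far.
        cost = [min(c, math.dist(p, points[worst])) for c, p in zip(cost, points)]
    return chosen

# Three loose groups: near the origin, near (5, 0), and an isolated point at (0, 4).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (0.0, 4.0)]
exemplars = farthest_first(points, k=3)
```

On this toy data the selection ends up with one exemplar per group, which mirrors the abstract's claim that the method picks representatives from each subspace even when the data are imbalanced.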

94. Unsupervised Learning for Subterranean Junction Recognition Based on 2D Point Cloud [PDF] 返回目录
Sina Sharif Mansouri, Farhad Pourkamali-Anaraki, Miguel Castano Arranz, Ali-akbar Agha-mohammadi, Joel Burdick, George Nikolakopoulos
Abstract: This article proposes a novel unsupervised learning framework for detecting the number of tunnel junctions in subterranean environments based on acquired 2D point clouds. The implementation of the framework provides valuable information for high level mission planners to navigate an aerial platform in unknown areas or robot homing missions. The framework utilizes spectral clustering, which is capable of uncovering hidden structures from connected data points lying on non-linear manifolds. The spectral clustering algorithm computes a spectral embedding of the original 2D point cloud by utilizing the eigen decomposition of a matrix that is derived from the pairwise similarities of these points. We validate the developed framework using multiple data-sets, collected from multiple realistic simulations, as well as from real flights in underground environments, demonstrating the performance and merits of the proposed methodology.
摘要：本文提出了一种用于检测基于采集的2D点云在地下环境中的隧道结的数目的新颖的无监督学习框架。该框架的实现提供了高层次的任务规划者有价值的信息导航在未知的领域或机器人归巢任务的空中平台。该框架利用谱聚类，其能够从连接数据点躺在非线性歧管揭示隐藏结构。光谱聚类算法计算的光谱原始2D点云的通过利用从这些点的成对的相似性获得的矩阵的特征分解嵌入。我们验证使用多个数据集从地下环境中真实的飞行，充分展示了性能和拟议的方法的优点发达的框架，从多个逼真的模拟采集，以及。

95. A multi-channel framework for joint reconstruction of multi-contrast parallel MRI [PDF] 返回目录
Erfan Ebrahim Esfahani
Abstract: Compressed sensing, multi-contrast and parallel imaging are all techniques that exploit certain types of redundancies to speed up the image acquisition process and improve image quality in MRI. Although each individual category has been well developed, the combination of the three has not received significant attention, much less the potential benefit of isotropy within such a setting. In this paper, a novel isotropic multi-channel image regularizer is introduced, and its full potential is unleashed by integrating it into compressed multi-contrast multi-coil MRI.
摘要：压缩感测，多对比度和并行成像是利用某些类型的冗余，以加快在MRI图像获取处理和图像质量的所有技术。虽然各个类别已经非常发达，三者的组合已经没有这样的环境内收到显著的关注，更各向同性的潜在益处。在本文中，一种新颖的各向同性多通道图像正则引入，并充分发挥其潜力是通过将其集成到压缩多对比度多线圈MRI释放。

96. A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton [PDF] 返回目录
Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, Jin Zhang
Abstract: In recent years, a variety of gradient-based first-order methods have been developed to solve bi-level optimization problems for learning applications. However, theoretical guarantees of these existing approaches heavily rely on the simplification that for each fixed upper-level variable, the lower-level solution must be a singleton (a.k.a., Lower-Level Singleton, LLS). In this work, we first design a counter-example to illustrate the invalidation of such LLS condition. Then by formulating BLPs from the viewpoint of optimistic bi-level and aggregating hierarchical objective information, we establish Bi-level Descent Aggregation (BDA), a flexible and modularized algorithmic framework for generic bi-level optimization. Theoretically, we derive a new methodology to prove the convergence of BDA without the LLS condition. Our investigations also demonstrate that BDA is indeed compatible with a variety of particular first-order computation modules. Additionally, as an interesting byproduct, we also improve these conventional first-order bi-level schemes (under the LLS simplification). Particularly, we establish their convergences with weaker assumptions. Extensive experiments justify our theoretical results and demonstrate the superiority of the proposed BDA for different tasks, including hyper-parameter optimization and meta learning.
摘要：近年来，各种基于梯度的一阶方法已发展到解决双层优化问题的学习应用。然而，这些现有的理论保证接近严重依赖于简化对于每个固定的上级变量，较低级别的溶液必须是单（又名，较低级别的Singleton，LLS）。在这项工作中，我们首先设计了一个反例来说明这种LLS条件无效。然后由乐观双电平的观点制定BLPs和汇总层次客观的信息，我们建立了双水平下降聚合（BDA），通用双层优化灵活的，模块化的算法框架。从理论上讲，我们得出一个新的方法来证明BDA的收敛而不LLS条件。我们的调查还表明，BDA确实是兼容的验证特定第一阶的计算模块。此外，作为一个有趣的副产品，我们还提高（下LLS简化）这些传统的一阶双级方案。特别是，我们建立了自己的收敛与弱假设。大量的实验证明我们的理论成果，证明了该BDA的优势为不同的任务，其中包括超参数优化和元学习。

97. A Comparative Analysis of E-Scooter and E-Bike Usage Patterns: Findings from the City of Austin, TX
Mohammed Hamad Almannaa, Huthaifa I. Ashqar, Mohammed Elhenawy, Mahmoud Masoud, Andry Rakotonirainy, Hesham Rakha
Abstract: E-scooter-sharing and e-bike-sharing systems are accommodating and easing the increased traffic in dense cities and are expanding considerably. However, these new micro-mobility transportation modes raise numerous operational and safety concerns. This study analyzes e-scooter and dockless e-bike sharing system user behavior. We investigate how the average trip speed changes depending on the day of the week and the time of the day. We used a dataset from the city of Austin, TX, covering December 2018 to May 2019. Our results generally show that the average trip speed for e-bikes ranges between 3.01 and 3.44 m/s, which is higher than that for e-scooters (2.19 to 2.78 m/s). Results also show a similar usage pattern for the average speed of e-bikes and e-scooters throughout the days of the week and a different usage pattern for the average speed of e-bikes and e-scooters over the hours of the day. We found that users tend to ride e-bikes and e-scooters at a slower average speed for recreational purposes than when they ride them for commuting purposes. This study, the first of its kind, is a building block in this field and sheds significant new light on this emerging class of shared-road users.
摘要：E-摩托车共享和电动自行车共享系统适应和缓解了密集的城市增加的交通流量，并显着扩大。然而，这些新的微移动运输方式提高许多操作和安全问题。这项研究分析了电动摩托车和dockless电动自行车共享系统用户的行为。我们根据一周的日子，这一天的时间研究如何平均出行速度的变化。我们使用从城市奥斯汀，德克萨斯州的一个数据集从2018 12至5月2019年我们的研究结果普遍显示，3.01 3.44米之间电动自行车的范围之旅平均速度/ s，这比对电子滑板车高（ 2.19 2.78 M / S）。结果还显示了电动自行车和电子踏板车的平均速度整个一周的日子和电动自行车和电子踏板车在一天的小时平均车速不同的使用模式相似的使用模式。我们发现，用户往往骑电动自行车和电子踏板车较慢的平均速度为娱乐目的相比，当他们乘坐通勤目的。这项研究是在这一领域，作为一个先河的积木，鸡舍和这个新兴的类共享道路使用者的显著新认识的光。

98. Neural Networks Out-of-Distribution Detection: Hyperparameter-Free Isotropic Maximization Loss, The Principle of Maximum Entropy, Cold Training, and Branched Inferences
David Macêdo, Teresa Ludermir
Abstract: Current out-of-distribution detection (ODD) approaches present severe drawbacks that make their large-scale adoption in real-world applications impracticable. In this paper, we propose a novel loss called Hyperparameter-Free IsoMax that overcomes these limitations. We modified the original IsoMax loss to improve ODD performance while maintaining benefits such as high classification accuracy, fast and energy-efficient inference, and scalability. The global hyperparameter is replaced by learnable parameters to increase performance. Additionally, a theoretical motivation to explain the high ODD performance of the proposed loss is presented. Finally, to keep high classification performance, slightly different inference mathematical expressions for classification and ODD are developed. No access to out-of-distribution samples is required, as there is no hyperparameter to tune. Our solution works as a straightforward SoftMax loss drop-in replacement that can be incorporated without relying on adversarial training or validation, model structure changes, ensemble methods, or generative approaches. The experiments showed that our approach is competitive against state-of-the-art solutions while avoiding their additional requirements and undesired side effects.
摘要：当前外的分布检测器（ODD）接近目前严重的缺点，使不切实际在实际应用中的大规模普及。在本文中，我们提出了一个所谓的超参数，免费ISOMAX新颖损失克服了这些限制。我们修改了原来ISOMAX损失，提高光驱性能，同时保持优势，如高分类精度，快速和高效节能的推理和可扩展性。全球超参数是可以学习的参数替代，以提高性能。此外，一个理论的动机来解释所提出的损失的高奇性能呈现。最后，要保持较高的分类性能，分类和ODD略有不同的推理的数学表达式开发。到外的分发样本的任何访问是必需的，因为没有超参数调整。我们的解决方案可以作为一个简单的使用SoftMax损失的直接替代，可以不依赖于对抗训练或验证，模型结构的机会，乐团的方法，或生成方法结合。实验表明，我们的做法是对国家的最先进的解决方案，有竞争力的同时避免它们额外的要求和期望的副作用。

99. An Efficient $k$-modes Algorithm for Clustering Categorical Datasets
Karin S. Dorman, Ranjan Maitra
Abstract: Mining clusters from datasets is an important endeavor in many applications. The $k$-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data but cannot be applied to categorical-valued observations. The $k$-modes algorithm addresses this lacuna by taking the $k$-means objective function, replacing the dissimilarity measure and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both $k$-modes and $k$-means are scalable, because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of $k$-modes, OTQT, and prove that it can find clusterings superior to those of existing algorithms. We also examine five initialization methods and three types of $K$-selection methods, many of them novel, and all appropriate for $k$-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best initializer, a novel $K$-selection method is more accurate than two methods adapted from $k$-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.
摘要：从数据集采矿集群是在许多应用中的一个重要的努力。的$ $ķ算法-means为聚类数值值数据的常见和有效的免费分发的方法，但不能被应用到分类值观测。的$ $ķ算法-modes解决这一缺陷通过采取$ $ķ-means目标函数，取代了相异性度量和使用模式，而不是在修改目标函数的装置。不像许多其他的聚类算法，既$ķ$ -modes和$ķ$ -means是可伸缩的，因为他们不要求所有成对差异进行计算。我们提供了一个快速，高效计算执行$ķ$ -modes，OTQT，并证明它可以找到优越的聚类到现有的算法。我们还检查5种初始化方法和三种$的K $ -selection方法，其中不乏新颖，所有适用于$ķ$ -modes。通过考察真实和模拟数据集的性能，我们表明，简单随机初始化是最好的初始化程序，一个新的$ķ$ -selection方法比两种方法从$ķ$ -means适应更准确，新OTQT算法更准确，而且几乎总是比现有算法快。
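
As a rough illustration of the $k$-modes idea described in this abstract (a simple-matching dissimilarity in place of squared distance, and per-attribute modes in place of means), the sketch below runs a plain Lloyd-style alternation. It is not the authors' OTQT algorithm, whose details the abstract does not give; the function names `matching_distance`, `column_modes` and `k_modes`, the toy records, and the choice of initial modes are all assumptions made for illustration.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple-matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Per-attribute mode of a non-empty list of categorical records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=100):
    """Lloyd-style k-modes: alternate cluster assignment and mode update."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_distance(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        # Update step: recompute each cluster's mode attribute by attribute.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the previous mode if a cluster empties
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy records: (colour, size, shape).
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
cost = sum(matching_distance(row, modes[lab]) for row, lab in zip(data, labels))
```

Each iteration compares every record only with the k current modes, never with every other record, which is the scalability property the abstract points to.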

100. Robust watermarking with double detector-discriminator approach
Marcin Plata, Piotr Syga
Abstract: In this paper we present a novel deep framework for watermarking, a technique of embedding a transparent message into an image in a way that allows retrieving the message from a (perturbed) copy, so that copyright infringement can be tracked. For this technique, it is essential to extract the information from the image even after imposing some digital processing operations on it. Our framework outperforms recent methods in the context of robustness against not only a spectrum of attacks (e.g. rotation, resizing, Gaussian smoothing) but also against compression, especially JPEG. The bit accuracy of our method is at least 0.86 for all types of distortions. We also achieved 0.90 bit accuracy for JPEG, while recent methods achieved at most 0.83. Our method retains high transparency and capacity as well. Moreover, we present our double detector-discriminator approach, a scheme to detect and discriminate whether the image contains the embedded message or not, which is crucial for real-life watermarking systems and has so far not been investigated using neural networks. With this, we design a testing formula to validate our extended approach and compare it with a common procedure. We also present an alternative method of balancing between image quality and robustness to attacks which is easily applicable to the framework.
摘要：在本文中，我们提出一个新的深框架水印 - 在允许检索来自所述消息的方式嵌入透明的消息的技术成图像（扰动）进行复制，以便侵犯版权可以被跟踪。对于这种技术，它是必不可少的，即使它强加一些数字处理操作后，从图像中提取信息。我们的框架优于近期的鲁棒性，不仅对攻击频谱范围内的方法（例如旋转，缩放，高斯平滑），而且还针对压缩，尤其是JPEG。我们的方法的位精度至少是0.86对所有类型的失真。我们也取得了0.90位精度为JPEG，而最近的方法至多0.83提供。我们的方法保持高透明度和能力，以及。此外，我们提出我们的双重检测，鉴别方法 - 一个方案来检测和判别，如果图像包含嵌入的信息或没有，这是现实生活中的水印系统的关键，到现在为止并没有利用神经网络调查。有了这个，我们设计了一个测试公式来验证我们的扩展方法，并用常用的方法进行了比较。我们还提出的图像质量和鲁棒性的攻击是很容易适用于框架之间平衡的另一种方法。
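
To make the reported "bit accuracy" figures concrete, here is a minimal sketch of how such a metric is commonly computed: the fraction of message bits that the decoder recovers correctly. The abstract does not define the metric, so this reading, the function name `bit_accuracy`, and the example bit strings are assumptions.

```python
def bit_accuracy(embedded_bits, decoded_bits):
    """Fraction of positions where the decoded bit equals the embedded bit."""
    if len(embedded_bits) != len(decoded_bits):
        raise ValueError("messages must have the same length")
    if not embedded_bits:
        raise ValueError("messages must be non-empty")
    matches = sum(1 for a, b in zip(embedded_bits, decoded_bits) if a == b)
    return matches / len(embedded_bits)


# Hypothetical 10-bit message; the decoder gets one bit (index 3) wrong.
embedded = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
decoded = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
accuracy = bit_accuracy(embedded, decoded)
```

Under this reading, a bit accuracy of 0.86 would mean that on average 86 of every 100 embedded bits survive the distortion.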

101. Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement
Xin Liu, Josh Fromm, Shwetak Patel, Daniel McDuff
Abstract: Telehealth and remote health monitoring have become increasingly important during the SARS-CoV-2 pandemic and it is widely expected that this will have a lasting impact on healthcare practices. These tools can help reduce the risk of exposing patients and medical staff to infection, make healthcare services more accessible, and allow providers to see more patients. However, objective measurement of vital signs is challenging without direct contact with a patient. We present a video-based and on-device optical cardiopulmonary vital sign measurement approach. It leverages a novel multi-task temporal shift convolutional attention network (MTTS-CAN) and enables real-time cardiovascular and respiratory measurements on mobile platforms. We evaluate our system on an ARM CPU and achieve state-of-the-art accuracy while running at over 150 frames per second which enables real-time applications. Systematic experimentation on large benchmark datasets reveals that our approach leads to substantial (20%-50%) reductions in error and generalizes well across datasets.
摘要：远程医疗和远程健康监测在SARS-COV-2大流行已经变得越来越重要，人们普遍预计，这将对医疗实践产生持续的影响。这些工具可以帮助减少暴露病人和医务人员感染的风险，使医疗服务更加方便，并让供应商看到更多的患者。然而，生命体征客观测量是具有挑战性不与病人直接接触。我们提出了一个基于视频和设备上的光学心肺生物信号测量的方法。它利用一种新型的多任务时移卷积关注网络（MTTS-CAN），并能够在移动平台上的实时心血管和呼吸测量。我们评估的ARM CPU上我们的体系，实现国家的最先进的精度每秒使实时应用超过150帧运行时。对大型数据集的基准系统实验表明，我们的方法会导致错误大幅度（20％-50％）的减少，概括整个数据集良好。
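
The abstract names a temporal shift mechanism without describing it. As a hedged illustration of the general temporal-shift idea (moving a fraction of the feature channels one step along the time axis, so that a per-frame operation also sees neighbouring frames at no extra multiply cost), the sketch below shifts one group of channels forward, one group backward, and leaves the rest in place. The split into thirds, the zero padding at the sequence ends, and the name `temporal_shift` are assumptions, not MTTS-CAN's actual configuration.

```python
def temporal_shift(frames, fold_div=3):
    """Shift channel groups along time.

    frames: list of T feature vectors, each a list of C numbers.
    The first C // fold_div channels at time t are taken from time t - 1,
    the next C // fold_div channels from time t + 1, the rest from time t.
    Positions with no source frame are zero-filled.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1  # carries information from the previous frame
            elif c < 2 * fold:
                src = t + 1  # carries information from the next frame
            else:
                src = t      # untouched channels
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Hypothetical 3-frame clip with 3 channels per frame.
clip = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
shifted_clip = temporal_shift(clip)
```

The shift itself is only a memory move with no multiplications, which is why designs of this kind are attractive for on-device use.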

102. Applied Awareness: Test-Driven GUI Development using Computer Vision and Cryptography
Donald Beaver
Abstract: Graphical user interface testing is significantly challenging, and automating it even more so. Test-driven development is impractical: it generally requires an initial implementation of the GUI to generate golden images or to construct interactive test scenarios, and subsequent maintenance is costly. While computer vision has been applied to several aspects of GUI testing, we demonstrate a novel and immediately applicable approach to interpreting GUI presentation in terms of backend communications, modeling "awareness" in the fashion employed by cryptographic proofs of security. This focus on backend communication circumvents deficiencies in typical testing methodologies that rely on platform-dependent UI affordances or accessibility features. Our interdisciplinary work is ready for off-the-shelf practice: we report a self-contained, practical implementation with both online and offline validation, using simple designer specifications at the outset and specifically avoiding any requirements for a bootstrap implementation or golden images. In addition to the practical implementation, ties to formal verification methods in cryptography are explored and explained, providing fertile perspectives on assurance in UI and interpretability in AI.
摘要：图形用户界面测试显著挑战，以及自动化它更应如此。测试驱动开发是不切实际的：它通常需要初始执行GUI生成黄金映像或构建交互式测试场景，以及后续的维护是昂贵的。虽然计算机视觉已经被应用到GUI测试的几个方面，我们展示了一种新的解释，并在后台通信方面GUI演示，建模通过安全的加密证明采用的时尚“意识”的立即适用的方法。这种专注于依赖于平台相关的UI可供性或辅助功能的典型测试方法后端通信规避不足。我们的跨学科的工作是准备好现成的现成做法：我们报告自包含的，实际执行以在线和离线验证，使用简单的设计者规范从一开始就特别避免了自举的实施或黄金映像的任何要求。除了实际的实施，关系到加密形式验证方法进行了探索和解释，提供在UI保证和解释性的AI肥沃的观点。

103. Texture Interpolation for Probing Visual Perception
Jonathan Vacher, Aida Davila, Adam Kohn, Ruben Coen-Cagli
Abstract: Texture synthesis models are important for understanding visual processing. In particular, statistical approaches based on neurally relevant features have been instrumental to understanding aspects of visual perception and of neural coding. New deep learning-based approaches further improve the quality of synthetic textures. Yet, it is still unclear why deep texture synthesis performs so well, and applications of this new framework to probe visual perception are scarce. Here, we show that distributions of deep convolutional neural network (CNN) activations of a texture are well described by elliptical distributions and therefore, following optimal transport theory, constraining their mean and covariance is sufficient to generate new texture samples. Then, we propose the natural geodesics (i.e. the shortest path between two points) arising with the optimal transport metric to interpolate between arbitrary textures. The comparison to alternative interpolation methods suggests that ours more closely matches the geometry of texture perception, and is better suited to study its statistical nature. We demonstrate our method by measuring the perceptual scale associated with the interpolation parameter in human observers, and the neural sensitivity of different areas of visual cortex in macaque monkeys.
摘要：纹理合成模型是重要的是了解视觉处理。尤其是，基于neurally相关特性的统计方法都有助于和神经编码视觉感知的理解方面。新深基于学习的方法进一步提高合成纹理的质量。然而，目前还不清楚为什么深纹理合成表现这么好，而这一新框架的应用，探讨视觉感知是稀缺的。在这里，我们表明，纹理深卷积神经网络（CNN）的激活的分布以及由椭圆形分布，因此，下面的最佳输运理论所描述的，限制了它们的均值和协方差是足以产生新的纹理样本。然后，我们提出以最佳的运输指标任意纹理之间进行插值所产生的自然测地线（即两点之间的最短路径）。替代插值方法的比较表明，我们的质地感觉更加紧密的几何形状相匹配，并且更适合于研究其统计特性。我们证明通过测量人类观察者相关插值参数感性的规模，并在猕猴视觉皮层的不同区域的神经敏感性我们的方法。
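
The geodesic mentioned in this abstract has a well-known closed form when the two distributions are Gaussian: under the 2-Wasserstein metric the mean interpolates linearly, and the covariance follows $\Sigma_t = ((1-t)I + tT)\,\Sigma_0\,((1-t)I + tT)$ with $T = \Sigma_0^{-1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}\Sigma_0^{-1/2}$. The sketch below evaluates that formula for 2x2 covariances only, so that it needs no external library. The authors work with CNN activation statistics of far higher dimension, and the abstract does not state their exact procedure, so the function names and the 2x2 restriction are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def sqrt_spd(a):
    """Principal square root of a symmetric positive-definite 2x2 matrix.

    Uses the 2x2 identity sqrt(A) = (A + s*I) / t with s = sqrt(det A)
    and t = sqrt(trace A + 2*s).
    """
    s = math.sqrt(a[0][0] * a[1][1] - a[0][1] * a[1][0])
    t = math.sqrt(a[0][0] + a[1][1] + 2.0 * s)
    return [[(a[0][0] + s) / t, a[0][1] / t],
            [a[1][0] / t, (a[1][1] + s) / t]]


def inverse(a):
    """Inverse of an invertible 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def gaussian_geodesic(mean0, cov0, mean1, cov1, t):
    """Point at parameter t on the 2-Wasserstein geodesic between Gaussians."""
    root0 = sqrt_spd(cov0)
    inv_root0 = inverse(root0)
    middle = sqrt_spd(matmul(matmul(root0, cov1), root0))
    transport = matmul(matmul(inv_root0, middle), inv_root0)  # the map T
    blend = [[(1.0 - t) * (1.0 if i == j else 0.0) + t * transport[i][j]
              for j in range(2)] for i in range(2)]
    mean_t = [(1.0 - t) * m0 + t * m1 for m0, m1 in zip(mean0, mean1)]
    cov_t = matmul(matmul(blend, cov0), blend)
    return mean_t, cov_t


# Hypothetical statistics of two "textures" with axis-aligned covariances.
mid_mean, mid_cov = gaussian_geodesic(
    [0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]],
    [2.0, 4.0], [[9.0, 0.0], [0.0, 16.0]],
    0.5,
)
```

In this axis-aligned example the standard deviations (1, 2) and (3, 4) interpolate linearly to (2, 3) at the midpoint, so the midpoint covariance is diag(4, 9) rather than the element-wise average diag(5, 10) that naive linear blending of covariances would give.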

104. Evaluating the Disentanglement of Deep Generative Models through Manifold Topology
Sharon Zhou, Eric Zelikman, Fred Lu, Andrew Y. Ng, Stefano Ermon
Abstract: Learning disentangled representations is regarded as a fundamental task for improving the generalization, robustness, and interpretability of generative models. However, measuring disentanglement has been challenging and inconsistent, often dependent on an ad-hoc external model or specific to a certain dataset. To address this, we present a method for quantifying disentanglement that only uses the generative model, by measuring the topological similarity of conditional submanifolds in the learned representation. This method showcases both unsupervised and supervised variants. To illustrate the effectiveness and applicability of our method, we empirically evaluate several state-of-the-art models across multiple datasets. We find that our method ranks models similarly to existing methods.
摘要：学习解开表示被认为是改善泛化，稳健性和生成模型的可解释性的根本任务。然而，测量解缠结具有挑战性和不一致，往往依赖的ad-hoc外部模型或特定于某个特定数据集。为了解决这个问题，我们提出了量化的解开，只有使用生成模型，通过测量学表示有条件子流形的拓扑相似的方法。这种方法既陈列柜无监督和监督的变种。为了说明的有效性和我们的方法的适用性，我们凭经验评估跨越多个数据集的国家的最先进的几种模式。我们发现，我们的方法同样居模式与现有方法。

105. Hierarchical Class-Based Curriculum Loss
Palash Goyal, Shalini Ghosh
Abstract: Classification algorithms in machine learning often assume a flat label space. However, most real-world data have dependencies between the labels, which can often be captured by using a hierarchy. Utilizing this relation can help develop a model capable of satisfying the dependencies and improving model accuracy and interpretability. Further, as different levels in the hierarchy correspond to different granularities, penalizing each label equally can be detrimental to model learning. In this paper, we propose a loss function, hierarchical curriculum loss, with two properties: (i) it satisfies the hierarchical constraints present in the label space, and (ii) it provides non-uniform weights to labels based on their levels in the hierarchy, learned implicitly by the training paradigm. We theoretically show that the proposed loss function is a tighter bound on the 0-1 loss compared to any other loss satisfying the hierarchical constraints. We test our loss function on real-world image data sets, and show that it substantially outperforms multiple baselines.
摘要：机分类算法学习往往呈现为平坦的标签空间。然而，大多数现实世界的数据有标签，而这往往通过使用分层捕获之间的依赖关系。利用这种关系可以帮助开发能够满足的依赖并提高模型的准确性和可解释性的典范。此外，如不同水平的层次结构中的对应于不同的粒度，同样惩罚每一个标签可损害模型学习。在本文中，我们提出了一个损失函数，层次课程的损失，有两个特性：（一）满足目前在标签空间等级限制，以及（ii）根据其在层级，学会标签提供非均匀权重隐式培训模式。我们从理论上表明，该损失函数是比较满足等级限制任何其他损失更紧密结合的0-1损失。我们测试我们对真实世界的图像数据集损失函数，并表明它显著大幅优于多个基准。

106. Equivariant Maps for Hierarchical Structures
Renhao Wang, Marjan Albooyeh, Siamak Ravanbakhsh
Abstract: In many real-world settings, we are interested in learning invariant and equivariant functions over nested or multiresolution structures, such as a set of sequences, a graph of graphs, or a multiresolution image. While equivariant linear maps and by extension multilayer perceptrons (MLPs) for many of the individual basic structures are known, a formalism for dealing with a hierarchy of symmetry transformations is lacking. Observing that the transformation group for a nested structure corresponds to the "wreath product" of the symmetry groups of the building blocks, we show how to obtain the equivariant map for hierarchical data-structures using an intuitive combination of the equivariant maps for the individual blocks. To demonstrate the effectiveness of this type of model, we use a hierarchy of translation and permutation symmetries for learning on point cloud data, and report state-of-the-art results on Semantic3D and S3DIS, two of the largest real-world benchmarks for 3D semantic segmentation.
摘要：在许多真实世界的设置，我们有兴趣不变的，等变函数嵌套上多分辨率或结构，比如一组序列，图形的图形，或者多分辨率图象。虽然等变线性地图并通过扩展多层感知器（的MLP），用于许多个人基本结构是已知的，用于处理对称变换的层次结构的形式主义是缺乏。观察到对于嵌套结构对应于积木的对称群的``圈积“”的变换群，我们显示如何使用用于等变映射的一个直观的组合，以获得分层数据结构的变映射各个块。为了证明这种类型的模型的有效性，我们采用翻译和排列对称的层次学习上的点云数据，并报告国家的最先进的\千瓦{semantic3d}和\千瓦{s3dis}，二最大的现实世界的基准3D语义分割。
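
The abstract notes that equivariant linear maps for the individual building blocks are already known. As a small illustration of what equivariance means for one such block, the sketch below uses the familiar permutation-equivariant map on a set of scalars (each output mixes the element itself with a pooled summary of the whole set) and checks that permuting the input permutes the output in the same way. It does not implement the paper's wreath-product construction for nested structures; the function names and weights are assumptions.

```python
def equivariant_set_layer(xs, self_weight, pooled_weight):
    """Permutation-equivariant linear map on a set of scalars.

    Output i is self_weight * x_i + pooled_weight * mean(xs); the mean is
    invariant to reordering, so the whole map commutes with permutations.
    """
    mean = sum(xs) / len(xs)
    return [self_weight * x + pooled_weight * mean for x in xs]


def permute(xs, perm):
    """Reorder xs so that position i holds the element at index perm[i]."""
    return [xs[i] for i in perm]


# Hypothetical set of four scalars and an arbitrary reordering.
points = [1.0, 4.0, -2.0, 0.5]
perm = [2, 0, 3, 1]

layer_then_permute = permute(equivariant_set_layer(points, 2.0, -1.0), perm)
permute_then_layer = equivariant_set_layer(permute(points, perm), 2.0, -1.0)
max_gap = max(abs(a - b) for a, b in zip(layer_then_permute, permute_then_layer))
```

The two orders of operation agree up to floating-point rounding, which is the defining property of an equivariant map.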

107. 3D Augmented Reality-Assisted CT-Guided Interventions: System Design and Preclinical Trial on an Abdominal Phantom using HoloLens 2
Brian J. Park, Stephen J. Hunt, Gregory J. Nadolski, Terence P. Gade
Abstract: Background: Out-of-plane lesions pose challenges for CT-guided interventions. Augmented reality (AR) headset devices have evolved and are readily capable of providing virtual 3D guidance to improve CT-guided targeting. Purpose: To describe the design of a three-dimensional (3D) AR-assisted navigation system using HoloLens 2 and evaluate its performance through CT-guided simulations. Materials and Methods: A prospective trial was performed assessing CT-guided needle targeting on an abdominal phantom with and without AR guidance. A total of 8 operators with varying clinical experience were enrolled and performed a total of 86 needle passes. Procedure efficiency, radiation dose, and complication rates were compared with and without AR guidance. Vector analysis of the first needle pass was also performed. Results: The average total number of needle passes to reach the target decreased from 7.4 passes without AR to 3.4 passes with AR (54.2% decrease, p=0.011). Average dose-length product (DLP) decreased from 538 mGy-cm without AR to 318 mGy-cm with AR (41.0% decrease, p=0.009). The complication rate of hitting a non-targeted lesion decreased from 11.9% without AR (7/59 needle passes) to 0% with AR (0/27 needle passes). First needle passes were more nearly aligned with the ideal target trajectory with AR than without AR (4.6° vs 8.0° offset, respectively, p=0.018). Medical students, residents, and attendings all performed at the same level with AR guidance. Conclusions: 3D AR guidance can provide significant improvements in procedural efficiency and radiation dose savings for targeting challenging, out-of-plane lesions. AR guidance elevated the performance of all operators to the same level irrespective of prior clinical experience.
摘要：背景：外的平面构成病变的CT引导下介入的挑战。增强现实（AR）耳机设备的发展，很容易能够提供虚拟三维引导到提高CT引导下定位。目的：描述了使用HoloLens 2的三维（3D）AR-辅助导航系统的设计和评估通过CT引导模拟其性能。材料与方法：进行了前瞻性研究评估CT引导下穿刺在有和没有现实指导腹部幻象目标。共有8个运营商具有不同的临床经验被纳入和总共86个通行证执行。流程效率，辐射剂量率和并发症有和没有现实指导进行了比较。还进行了第一针通的矢量分析。结果：针的平均总数传递到到达目标从7.4遍，而不AR减少到3.4通行证与AR（54.2％的减少，P = 0.011）。平均剂量 - 长度乘积（DLP）从538毫戈瑞厘米降低而不AR至318毫戈瑞厘米与AR（41.0％的减少，P = 0.009）。击中非靶向病变的并发症发生率从11.9％下降而不AR（五十九分之七针传递）至0％与AR（0/27针通过）。第一针道次更近与AR的理想目标轨道对准相对于无AR（4.6°和8.0°的偏移，分别P = 0.018）。医学生，住院医生，主治医生和所有在用AR指导同级别执行。结论：3D AR指导可用于靶向挑战，外的平面病变提供在程序效率和辐射剂量的节省显著改进。 AR指导提高所有运营商的表现，以相同的电平，而之前的临床经验。
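
The percentages in the Results can be roughly reproduced from the figures quoted in the abstract. The short sketch below recomputes them; the helper name `percent_decrease` is an assumption. Working from the rounded averages gives values within about 0.1 percentage points of the reported decreases, presumably because the authors computed theirs from unrounded data (an assumption, since the abstract does not say).

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Average needle passes: 7.4 without AR, 3.4 with AR (reported: 54.2% decrease).
passes_drop = percent_decrease(7.4, 3.4)      # about 54.1 from rounded means

# Average dose-length product in mGy-cm (reported: 41.0% decrease).
dlp_drop = percent_decrease(538, 318)         # about 40.9 from rounded means

# Complication rate without AR: 7 of 59 needle passes (reported: 11.9%).
complication_rate_no_ar = 100.0 * 7 / 59      # about 11.9

# The two arms together account for all reported needle passes.
total_passes = 59 + 27
```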

Note: the Chinese text is machine-translated!


[arXiv Papers] Computer Vision and Pattern Recognition 2020-06-09

Contents

Abstracts