[arXiv Papers] Computer Vision and Pattern Recognition 2020-08-03

Table of Contents

1. Object Detection and Tracking Algorithms for Vehicle Counting: A Comparative Analysis [PDF] Abstract
2. Palm Vein Identification based on hybrid features selection model [PDF] Abstract
3. Self-supervised learning through the eyes of a child [PDF] Abstract
4. Exploring Image Enhancement for Salient Object Detection in Low Light Images [PDF] Abstract
5. Physical Adversarial Attack on Vehicle Detector in the Carla Simulator [PDF] Abstract
6. Neural Architecture Search as Sparse Supernet [PDF] Abstract
7. Curriculum learning for annotation-efficient medical image analysis: scheduling data with prior knowledge and uncertainty [PDF] Abstract
8. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution [PDF] Abstract
9. Traffic Control Gesture Recognition for Autonomous Vehicles [PDF] Abstract
10. Pixel-wise Crowd Understanding via Synthetic Data [PDF] Abstract
11. DynaMiTe: A Dynamic Local Motion Model with Temporal Constraints for Robust Real-Time Feature Matching [PDF] Abstract
12. Disentangling Human Error from the Ground Truth in Segmentation of Medical Images [PDF] Abstract
13. Feature Learning for Accelerometer based Gait Recognition [PDF] Abstract
14. Neural Style Transfer for Remote Sensing [PDF] Abstract
15. Saliency-driven Class Impressions for Feature Visualization of Deep Neural Networks [PDF] Abstract
16. Learning the Distribution: A Unified Distillation Paradigm for Fast Uncertainty Estimation in Computer Vision [PDF] Abstract
17. Rethinking PointNet Embedding for Faster and Compact Model [PDF] Abstract
18. Resist: Reconstruction of irises from templates [PDF] Abstract
19. ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation [PDF] Abstract
20. Adversarial Bipartite Graph Learning for Video Domain Adaptation [PDF] Abstract
21. Blending Generative Adversarial Image Synthesis with Rendering for Computer Graphics [PDF] Abstract
22. Neural Compression and Filtering for Edge-assisted Real-time Object Detection in Challenged Networks [PDF] Abstract
23. Robust Template Matching via Hierarchical Convolutional Features from a Shape Biased CNN [PDF] Abstract
24. Looking At The Body: Automatic Analysis of Body Gestures and Self-Adaptors in Psychological Distress [PDF] Abstract
25. AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [PDF] Abstract
26. LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities [PDF] Abstract
27. Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies [PDF] Abstract
28. Unidentified Floating Object detection in maritime environment using dictionary learning [PDF] Abstract
29. Deep learning for lithological classification of carbonate rock micro-CT images [PDF] Abstract
30. From A Glance to "Gotcha": Interactive Facial Image Retrieval with Progressive Relevance Feedback [PDF] Abstract
31. Mix Dimension in Poincaré Geometry for 3D Skeleton-based Action Recognition [PDF] Abstract
32. HMCNAS: Neural Architecture Search using Hidden Markov Chains and Bayesian Optimization [PDF] Abstract
33. Computer-aided Tumor Diagnosis in Automated Breast Ultrasound using 3D Detection Network [PDF] Abstract
34. Learning-based Computer-aided Prescription Model for Parkinson's Disease: A Data-driven Perspective [PDF] Abstract
35. L$^2$C -- Learning to Learn to Compress [PDF] Abstract
36. Evaluating Automatically Generated Phoneme Captions for Images [PDF] Abstract
37. A Novel Global Spatial Attention Mechanism in Convolutional Neural Network for Medical Image Classification [PDF] Abstract
38. Robust Retinal Vessel Segmentation from a Data Augmentation Perspective [PDF] Abstract
39. Residual-CycleGAN based Camera Adaptation for Robust Diabetic Retinopathy Screening [PDF] Abstract
40. Estimating Motion Codes from Demonstration Videos [PDF] Abstract
41. A Survey on Concept Factorization: From Shallow to Deep Representation Learning [PDF] Abstract
42. OREBA: A Dataset for Objectively Recognizing Eating Behaviour and Associated Intake [PDF] Abstract

Abstracts

1. Object Detection and Tracking Algorithms for Vehicle Counting: A Comparative Analysis [PDF] Back to Contents
  Vishal Mandal, Yaw Adu-Gyamfi
Abstract: Rapid advances in the fields of deep learning and high-performance computing have greatly expanded the scope of video-based vehicle counting systems. In this paper, the authors deploy several state-of-the-art object detection and tracking algorithms to detect and track different classes of vehicles in their regions of interest (ROI). The goal of correctly detecting and tracking vehicles in their ROI is to obtain an accurate vehicle count. Multiple combinations of object detection models coupled with different tracking systems are applied to assess the best vehicle counting framework. The models address challenges associated with different weather conditions, occlusion, and low-light settings, and efficiently extract vehicle information and trajectories through computationally rich training and feedback cycles. The automatic vehicle counts resulting from all the model combinations are validated and compared against manually counted ground truths over 9 hours of traffic video data obtained from the Louisiana Department of Transportation and Development. Experimental results demonstrate that the combinations of CenterNet and Deep SORT, Detectron2 and Deep SORT, and YOLOv4 and Deep SORT produced the best overall counting percentages for all vehicles.
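
Once a detector (e.g. YOLOv4 or CenterNet) and Deep SORT have associated detections into per-vehicle tracks, counting reduces to checking where each trajectory crosses a virtual line inside the ROI. A minimal sketch of that final step, with a hypothetical track format rather than the authors' implementation:

```python
def count_line_crossings(tracks, line_y):
    """Count vehicles whose trajectory crosses a virtual counting line.

    tracks: dict mapping track_id -> list of (frame, cx, cy) centroids,
            e.g. produced by a detector + Deep SORT association.
    line_y: y-coordinate of the counting line inside the ROI.
    """
    count = 0
    for points in tracks.values():
        ys = [cy for _, _, cy in sorted(points)]  # order by frame index
        # A crossing is a sign change of (cy - line_y) along the track.
        for prev, cur in zip(ys, ys[1:]):
            if (prev - line_y) * (cur - line_y) < 0:
                count += 1
                break  # count each vehicle at most once
    return count

# Hypothetical toy tracks: vehicle 1 crosses y=20, vehicle 2 does not.
tracks = {1: [(0, 10, 5), (1, 10, 15), (2, 10, 25)],
          2: [(0, 50, 5), (1, 50, 8)]}
print(count_line_crossings(tracks, line_y=20))  # -> 1
```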

2. Palm Vein Identification based on hybrid features selection model [PDF] Back to Contents
  Mohammed Hamzah Abed, Ali H. Alsaeedi, Ali D. Alfoudi, Abayomi M. Otebolaku, Yasmeen Sajid Razooqi
Abstract: Palm vein identification (PVI) is a modern biometric security technique used to strengthen security and authentication systems. The key characteristics of palm vein patterns are that they are unique to each individual, unforgeable, non-intrusive, and cannot be taken from an unauthorized person. However, the features extracted from the palm vein pattern are numerous and highly redundant. In this paper, we propose a combined model of two-dimensional Discrete Wavelet Transform, Principal Component Analysis (PCA), and Particle Swarm Optimization (PSO) (2D-DWTPP) to enhance the prediction of palm vein patterns. The 2D-DWT extracts features from palm vein images, and PCA reduces the redundancy in the palm vein features. The system is trained to select highly relevant features based on the wrapper model. The PSO feeds the wrapper model an optimal subset of features. The proposed system uses four classifiers as an objective function to determine PVI, including Support Vector Machine (SVM), K Nearest Neighbor (KNN), Decision Tree (DT) and Naïve Bayes (NB). The empirical results show the proposed system achieved the best results with SVM. The proposed 2D-DWTPP model has been evaluated, and the results show remarkable efficiency in comparison with AlexNet and a classifier without feature selection. Experimentally, our model has better accuracy, reflected by 98.65, while AlexNet achieves 63.5 and the applied classifier without feature selection achieves 78.79.
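
The first two stages of the pipeline are standard enough to sketch. Below, a single-level 2D DWT (assuming a Haar wavelet, which the abstract does not specify) turns each image into subband features and PCA removes redundancy; the PSO-driven wrapper selection and the four classifiers are omitted:

```python
import numpy as np
import pywt                                   # pip install PyWavelets
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
images = rng.random((40, 64, 64))             # stand-in palm vein images

def dwt_features(img):
    # Single-level 2D DWT: approximation plus the three detail subbands.
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    return np.concatenate([c.ravel() for c in (cA, cH, cV, cD)])

X = np.stack([dwt_features(im) for im in images])
X_reduced = PCA(n_components=20).fit_transform(X)   # redundancy reduction
print(X.shape, "->", X_reduced.shape)               # (40, 4096) -> (40, 20)
```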

3. Self-supervised learning through the eyes of a child [PDF] Back to Contents
  A. Emin Orhan, Vaibhav V. Gupta, Brenden M. Lake
Abstract: Within months of birth, children have meaningful expectations about the world around them. How much of this early knowledge can be explained through generic learning mechanisms applied to sensory data, and how much of it requires more substantive innate inductive biases? Addressing this fundamental question in its full generality is currently infeasible, but we can hope to make real progress in more narrowly defined domains, such as the development of high-level visual categories, thanks to improvements in data collecting technology and recent progress in deep learning. In this paper, our goal is to achieve such progress by utilizing modern self-supervised deep learning methods and a recent longitudinal, egocentric video dataset recorded from the perspective of several young children (Sullivan et al., 2020). Our results demonstrate the emergence of powerful, high-level visual representations from developmentally realistic natural videos using generic self-supervised learning objectives.

4. Exploring Image Enhancement for Salient Object Detection in Low Light Images [PDF] Back to Contents
  Xin Xu, Shiqin Wang, Zheng Wang, Xiaolong Zhang, Ruimin Hu
Abstract: Low light images captured in a non-uniform illumination environment are usually degraded with the scene depth and the corresponding environment lights. This degradation results in severe object information loss in the degraded image modality, which makes salient object detection more challenging due to the low contrast property and artificial light influence. However, existing salient object detection models are developed based on the assumption that the images are captured under a sufficiently bright environment, which is impractical in real-world scenarios. In this work, we propose an image enhancement approach to facilitate salient object detection in low light images. The proposed model directly embeds the physical lighting model into the deep neural network to describe the degradation of low light images, in which the environment light is treated as a point-wise variate that changes with local content. Moreover, a Non-Local-Block Layer is utilized to capture the difference of local content of an object against its local neighborhood favoring regions. For quantitative evaluation, we construct a low light image dataset with pixel-level human-labeled ground-truth annotations and report promising results on four public datasets and our benchmark dataset.

5. Physical Adversarial Attack on Vehicle Detector in the Carla Simulator [PDF] Back to Contents
  Tong Wu, Xuefei Ning, Wenshuo Li, Ranran Huang, Huazhong Yang, Yu Wang
Abstract: In this paper, we tackle the issue of physical adversarial examples for object detectors in the wild. Specifically, we propose to generate adversarial patterns to be applied on the vehicle surface so that it is not recognizable by detectors in the photo-realistic Carla simulator. Our approach contains two main techniques, an \textit{Enlarge-and-Repeat} process and a \textit{Discrete Searching} method, to craft mosaic-like adversarial vehicle textures without access to either the model weights of the detector or a differentiable rendering procedure. The experimental results demonstrate the effectiveness of our approach in the simulator.

6. Neural Architecture Search as Sparse Supernet [PDF] Back to Contents
  Yan Wu, Aoming Liu, Zhiwu Huang, Siwei Zhang, Luc Van Gool
Abstract: This paper aims at enlarging the problem of Neural Architecture Search from Single-Path and Multi-Path Search to automated Mixed-Path Search. In particular, we model the new problem as a sparse supernet with a new continuous architecture representation using a mixture of sparsity constraints, i.e., Sparse Group Lasso. The sparse supernet is expected to automatically achieve sparsely-mixed paths upon a compact set of nodes. To optimize the proposed sparse supernet, we exploit a hierarchical accelerated proximal gradient algorithm within a bi-level optimization framework. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet and ImageNet demonstrate that the proposed methodology is capable of searching for compact, general and powerful neural architectures.
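
The mixture of sparsity constraints can be made concrete: Sparse Group Lasso combines an L1 term, which zeroes individual architecture weights, with a group-L2 term, which prunes whole candidate paths. A small sketch of such a penalty (the weight layout and coefficients are assumptions, not the paper's exact objective):

```python
import numpy as np

def sparse_group_lasso(groups, lam=1e-3, alpha=0.5):
    """Sparse Group Lasso penalty over architecture weights.

    groups: list of 1-D arrays, one per candidate path/group.
    The L1 term enforces element-wise sparsity; the group-L2 term
    drives entire paths to zero, yielding sparsely-mixed paths.
    """
    l1 = sum(np.abs(g).sum() for g in groups)
    group_l2 = sum(np.sqrt(len(g)) * np.linalg.norm(g) for g in groups)
    return lam * (alpha * l1 + (1 - alpha) * group_l2)

paths = [np.array([0.8, 0.0, 0.1]), np.array([0.0, 0.0, 0.0])]
print(sparse_group_lasso(paths))  # the all-zero path contributes nothing
```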

7. Curriculum learning for annotation-efficient medical image analysis: scheduling data with prior knowledge and uncertainty [PDF] Back to Contents
  Amelia Jiménez-Sánchez, Diana Mateus, Sonja Kirchhoff, Chlodwig Kirchhoff, Peter Biberthaler, Nassir Navab, Miguel A. González Ballester, Gemma Piella
Abstract: Convolutional neural networks (CNNs) for multi-class classification require training on large, representative, and high quality annotated datasets. However, in the field of medical imaging, data and annotations are both difficult and expensive to acquire. Moreover, they frequently suffer from highly imbalanced distributions, and potentially noisy labels due to intra- or inter-expert disagreement. To deal with such challenges, we propose a unified curriculum learning framework to schedule the order and pace of the training samples presented to the optimizer. Our novel framework reunites three strategies consisting of individually weighting training samples, reordering the training set, or sampling subsets of data. The core of these strategies is a scoring function ranking the training samples according to either difficulty or uncertainty. We define the scoring function from domain-specific prior knowledge or by directly measuring the uncertainty in the predictions. We perform a variety of experiments with a clinical dataset for the multi-class classification of proximal femur fractures and the publicly available MNIST dataset. Our results show that the sequence and weight of the training samples play an important role in the optimization process of CNNs. Proximal femur fracture classification is improved up to the performance of experienced trauma surgeons. We further demonstrate the benefits of our unified curriculum learning method for three controlled and challenging digit recognition scenarios: with limited amounts of data, under class-imbalance, and in the presence of label noise.
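
A minimal sketch of the reordering-plus-pacing idea: samples are sorted by a difficulty or uncertainty score, and the visible pool grows each epoch. The schedule constants below are hypothetical:

```python
import numpy as np

def curriculum_subset(scores, epoch, base=0.2, pace=0.4):
    """Order samples easy-to-hard and grow the visible pool per epoch.

    scores: per-sample difficulty (from prior knowledge or predictive
            uncertainty); base/pace are hypothetical schedule constants.
    """
    order = np.argsort(scores)                       # easiest first
    frac = min(1.0, base + pace * epoch)
    n_visible = int(np.ceil(len(scores) * frac))
    return order[:n_visible]

difficulty = np.array([0.9, 0.1, 0.5, 0.3])
for epoch in range(3):
    print(epoch, curriculum_subset(difficulty, epoch))
# 0 [1]   1 [1 3 2]   2 [1 3 2 0]
```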

8. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution [PDF] Back to Contents
  Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, Song Han
Abstract: Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve the fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard. It also achieves 8x computation reduction and 3x measured speedup over MinkowskiNet with higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI.

9. Traffic Control Gesture Recognition for Autonomous Vehicles [PDF] Back to Contents
  Julian Wiederer, Arij Bouazizi, Ulrich Kressel, Vasileios Belagiannis
Abstract: A car driver knows how to react to the gestures of traffic officers. Clearly, this is not the case for the autonomous vehicle, unless it has road traffic control gesture recognition functionalities. In this work, we address the limitation of existing autonomous driving datasets in providing learning data for traffic control gesture recognition. We introduce a dataset based on 3D body skeleton input to perform traffic control gesture classification at every time step. Our dataset consists of 250 sequences from several actors, ranging from 16 to 90 seconds per sequence. To evaluate our dataset, we propose eight sequential processing models based on deep neural networks such as recurrent networks, attention mechanisms, temporal convolutional networks and graph convolutional networks. We present an extensive evaluation and analysis of all approaches on our dataset, as well as a real-world quantitative evaluation. The code and dataset are publicly available.
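
As a flavor of the sequential baselines, here is a minimal recurrent model that classifies a gesture at every time step from flattened 3D skeletons; the joint and class counts below are assumptions, not the dataset's specification:

```python
import torch
import torch.nn as nn

# Per-time-step gesture classification from 3D body skeletons (sketch).
n_joints, n_classes = 17, 9
rnn = nn.LSTM(input_size=n_joints * 3, hidden_size=64, batch_first=True)
head = nn.Linear(64, n_classes)

skeletons = torch.randn(2, 300, n_joints * 3)   # (batch, time, joints*xyz)
hidden, _ = rnn(skeletons)
logits = head(hidden)                           # a label at every time step
print(logits.shape)                             # torch.Size([2, 300, 9])
```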

10. Pixel-wise Crowd Understanding via Synthetic Data [PDF] Back to Contents
  Wang Qi, Gao Junyu, Lin Wei, Yuan Yuan
Abstract: Crowd analysis via computer vision techniques is an important topic in the field of video surveillance, with widespread applications including crowd monitoring, public safety, space design and so on. Pixel-wise crowd understanding is the most fundamental task in crowd analysis because of its finer results for video sequences or still images than other analysis tasks. Unfortunately, pixel-level understanding needs a large amount of labeled training data. Annotating it is expensive work, which is why current crowd datasets are small. As a result, most algorithms suffer from over-fitting to varying degrees. In this paper, taking crowd counting and segmentation as examples of pixel-wise crowd understanding, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a free data collector and labeler to generate synthetic and labeled crowd scenes in a computer game, Grand Theft Auto V. We then use it to construct a large-scale, diverse synthetic crowd dataset, named the "GCC Dataset". Secondly, we propose two simple methods to improve the performance of crowd understanding via exploiting the synthetic data. To be specific, 1) supervised crowd understanding: pre-train a crowd analysis model on the synthetic data, then fine-tune it using the real data and labels, which makes the model perform better on the real world; 2) crowd understanding via domain adaptation: translate the synthetic data into photo-realistic images, then train the model on the translated data and labels. As a result, the trained model works well in real crowd scenes.
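
Strategy 1 boils down to two training passes with different data and learning rates. A sketch under assumed components (the toy density regressor and the loaders are hypothetical stand-ins, not the authors' network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of strategy 1 (supervised crowd understanding): pre-train a
# density regressor on large synthetic GCC-style data, then fine-tune
# on the small real dataset.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 1))      # toy counting network

def train(loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, density_maps in loader:
            loss = F.mse_loss(model(images), density_maps)
            opt.zero_grad(); loss.backward(); opt.step()

# train(synthetic_loader, epochs=50, lr=1e-4)   # plentiful, free labels
# train(real_loader, epochs=10, lr=1e-5)        # small real set, lower lr
```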

11. DynaMiTe: A Dynamic Local Motion Model with Temporal Constraints for Robust Real-Time Feature Matching [PDF] Back to Contents
  Patrick Ruhkamp, Ruiqi Gong, Nassir Navab, Benjamin Busam
Abstract: Feature based visual odometry and SLAM methods require accurate and fast correspondence matching between consecutive image frames for precise camera pose estimation in real-time. Current feature matching pipelines either rely solely on the descriptive capabilities of the feature extractor or need computationally complex optimization schemes. We present the lightweight pipeline DynaMiTe, which is agnostic to the descriptor input and leverages spatial-temporal cues with efficient statistical measures. The theoretical backbone of the method lies within a probabilistic formulation of feature matching and the respective study of physically motivated constraints. A dynamically adaptable local motion model encapsulates groups of features in an efficient data structure. Temporal constraints transfer information of the local motion model across time, thus additionally reducing the search space complexity for matching. DynaMiTe achieves superior results both in terms of matching accuracy and camera pose estimation with high frame rates, outperforming state-of-the-art matching methods while being computationally more efficient.

12. Disentangling Human Error from the Ground Truth in Segmentation of Medical Images [PDF] Back to Contents
  Le Zhang, Ryutaro Tanno, Mou-Cheng Xu, Chen Jin, Joseph Jacob, Olga Ciccarelli, Frederik Barkhof, Daniel C. Alexander
Abstract: Recent years have seen increasing use of supervised learning methods for segmentation tasks. However, the predictive performance of these algorithms depends on the quality of labels. This problem is particularly pertinent in the medical image domain, where both the annotation cost and inter-observer variability are high. In a typical label acquisition process, different human experts provide their estimates of the 'true' segmentation labels under the influence of their own biases and competence levels. Treating these noisy labels blindly as the ground truth limits the performance that automatic segmentation algorithms can achieve. In this work, we present a method for jointly learning, from purely noisy observations alone, the reliability of individual annotators and the true segmentation label distributions, using two coupled CNNs. The separation of the two is achieved by encouraging the estimated annotators to be maximally unreliable while achieving high fidelity with the noisy training data. We first define a toy segmentation dataset based on MNIST and study the properties of the proposed algorithm. We then demonstrate the utility of the method on three public medical imaging segmentation datasets with simulated (when necessary) and real diverse annotations: 1) MSLSC (multiple-sclerosis lesions); 2) BraTS (brain tumours); 3) LIDC-IDRI (lung abnormalities). In all cases, our method outperforms competing methods and relevant baselines particularly in cases where the number of annotations is small and the amount of disagreement is large. The experiments also show strong ability to capture the complex spatial characteristics of annotators' mistakes.
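
A simplified sketch of the coupled-model idea, reduced to one annotator and a classification setting: one network outputs the true label distribution, the other an annotator confusion matrix, and a trace penalty pushes the estimated annotator toward maximal unreliability. The shapes and loss form are assumptions distilled from the abstract, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def noisy_label_loss(p_true, confusion, noisy_labels, trace_weight=0.1):
    """p_true: (B, C) estimated true label distribution;
    confusion: (B, C, C) column-stochastic; confusion[b, i, j] is the
               probability the annotator writes i when the truth is j;
    the observed noisy label is modelled as confusion @ p_true, and the
    trace penalty keeps the estimated annotator maximally unreliable."""
    p_noisy = torch.bmm(confusion, p_true.unsqueeze(-1)).squeeze(-1)
    nll = F.nll_loss(torch.log(p_noisy + 1e-8), noisy_labels)
    trace = confusion.diagonal(dim1=1, dim2=2).sum(-1).mean()
    return nll + trace_weight * trace

B, C = 4, 3
p = torch.softmax(torch.randn(B, C), dim=-1)
A = torch.softmax(torch.randn(B, C, C), dim=1)   # columns sum to 1
y = torch.randint(0, C, (B,))
print(noisy_label_loss(p, A, y))
```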

13. Feature Learning for Accelerometer based Gait Recognition [PDF] Back to Contents
  Szilárd Nemes, Margit Antal
Abstract: Recent advances in pattern matching, such as speech or object recognition, support the viability of feature learning with deep learning solutions for gait recognition. Past papers have evaluated deep neural networks trained in a supervised manner for this task. In this work, we investigated both supervised and unsupervised approaches. Feature extractors using similar architectures, incorporated into end-to-end models and autoencoders, were compared based on their ability to learn good representations for a gait verification system. Both feature extractors were trained on the IDNet dataset and then used for feature extraction on the ZJU-GaitAccel dataset. Results show that autoencoders are very close to discriminative end-to-end models with regard to their feature learning ability, and that fully convolutional models are able to learn good feature representations, regardless of the training strategy.

14. Neural Style Transfer for Remote Sensing [PDF] Back to Contents
  Maria Karatzoglidi, Georgios Felekis, Eleni Charou
Abstract: The well-known technique outlined in the paper of Leon A. Gatys et al., A Neural Algorithm of Artistic Style, has become a trending topic both in academic literature and industrial applications. Neural Style Transfer (NST) constitutes an essential tool for a wide range of applications, such as artistic stylization of 2D images, user-assisted creation tools and production tools for entertainment applications. The purpose of this study is to present a method for creating artistic maps from satellite images, based on the NST algorithm. This method includes three basic steps: (i) application of semantic image segmentation on the original satellite image, dividing its content into classes (i.e. land, water), (ii) application of neural style transfer for each class, and (iii) creation of a collage, i.e. an artistic image consisting of a combination of the stylized images generated in the previous step.

15. Saliency-driven Class Impressions for Feature Visualization of Deep Neural Networks [PDF] Back to Contents
  Sravanti Addepalli, Dipesh Tamboli, R. Venkatesh Babu, Biplab Banerjee
Abstract: In this paper, we propose a data-free method of extracting Impressions of each class from the classifier's memory. The Deep Learning regime empowers classifiers to extract distinct patterns (or features) of a given class from training data, which is the basis on which they generalize to unseen data. Before deploying these models on critical applications, it is advantageous to visualize the features considered to be essential for classification. Existing visualization methods develop high confidence images consisting of both background and foreground features. This makes it hard to judge what the crucial features of a given class are. In this work, we propose a saliency-driven approach to visualize discriminative features that are considered most important for a given task. Another drawback of existing methods is that confidence of the generated visualizations is increased by creating multiple instances of the given class. We restrict the algorithm to develop a single object per image, which helps further in extracting features of high confidence and also results in better visualizations. We further demonstrate the generation of negative images as naturally fused images of two or more classes.

16. Learning the Distribution: A Unified Distillation Paradigm for Fast Uncertainty Estimation in Computer Vision [PDF] Back to Contents
  Yichen Shen, Zhilu Zhang, Mert R. Sabuncu, Lin Sun
Abstract: Calibrated estimates of uncertainty are critical for many real-world computer vision applications of deep learning. While there are several widely-used uncertainty estimation methods, dropout inference stands out for its simplicity and efficacy. This technique, however, requires multiple forward passes through the network during inference and therefore can be too resource-intensive to be deployed in real-time applications. To tackle this issue, we propose a unified distillation paradigm for learning the conditional predictive distribution of a pre-trained dropout model for fast uncertainty estimation of both aleatoric and epistemic uncertainty at the same time. We empirically test the effectiveness of the proposed method on both semantic segmentation and depth estimation tasks, and observe that the student model can well approximate the probability distribution generated by the teacher model, i.e the pre-trained dropout model. In addition to a significant boost in speed, we demonstrate the quality of uncertainty estimates and the overall predictive performance can also be improved with the proposed method.

17. Rethinking PointNet Embedding for Faster and Compact Model [PDF] Back to Contents
  Teppei Suzuki, Keisuke Ozawa, Yusuke Sekikawa
Abstract: PointNet, which is the widely used point-wise embedding method and known as a universal approximator for continuous set functions, can process one million points per second. Nevertheless, real-time inference for recently developed high-performing sensors is still challenging with existing neural network-based methods, including PointNet. In ordinary cases, the embedding function of PointNet behaves like a soft-indicator function that is activated when the input points exist in a certain local region of the input space. Leveraging this property, we reduce the computational costs of point-wise embedding by replacing the embedding function of PointNet with the soft-indicator function given by Gaussian kernels. Moreover, we show that the Gaussian kernels also satisfy the universal approximation theorem that PointNet satisfies. In experiments, we verify that our model using the Gaussian kernels achieves results comparable to baseline methods, but with far fewer floating-point operations per sample, up to a 92\% reduction from PointNet.
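
The replacement itself is compact: each Gaussian kernel acts as a soft indicator of a local region, and max-pooling over points yields a permutation-invariant set feature. A sketch with assumed shapes:

```python
import numpy as np

def gaussian_embedding(points, centers, sigma=0.5):
    """Embed a point set with Gaussian kernels instead of PointNet MLPs.

    points:  (N, 3) array; centers: (K, 3) kernel centers (here random,
             in practice learned or chosen). Max-pooling over the point
    dimension gives a permutation-invariant (K,) set feature."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    activations = np.exp(-d2 / (2 * sigma ** 2))
    return activations.max(axis=0)                                  # (K,)

rng = np.random.default_rng(0)
cloud = rng.random((1024, 3))
centers = rng.random((64, 3))
print(gaussian_embedding(cloud, centers).shape)  # (64,)
```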

18. Resist: Reconstruction of irises from templates [PDF] Back to Contents
  Sohaib Ahmad, Benjamin Fuller
Abstract: Iris recognition systems transform an iris image into a feature vector. The seminal pipeline segments an image into iris and non-iris pixels, normalizes this region into a fixed-dimension rectangle, and extracts features which are stored and called a template (Daugman, 2009). This template is stored on a system. A future reading of an iris can be transformed and compared against template vectors to determine or verify the identity of an individual. As templates are often stored together, they are a valuable target to an attacker. We show how to invert templates across a variety of iris recognition systems. Our inversion is based on a convolutional neural network architecture we call RESIST (REconStructing IriSes from Templates). We apply RESIST to a traditional Gabor filter pipeline, to a DenseNet (Huang et al., CVPR 2017) feature extractor, and to a DenseNet architecture that works without normalization. Both DenseNet feature extractors are based on the recent ThirdEye recognition system (Ahmad and Fuller, BTAS 2019). When training and testing using the ND-0405 dataset, reconstructed images demonstrate a rank-1 accuracy of 100%, 76%, and 96% respectively for the three pipelines. The core of our approach is similar to an autoencoder. To obtain high accuracy, this core is integrated into an adversarial network (Goodfellow et al., NeurIPS, 2014).

19. ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation [PDF] Back to Contents
  Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang, Otmar Hilliges
Abstract: Gaze estimation is a fundamental task in many applications of computer vision, human computer interaction and robotics. Many state-of-the-art methods are trained and tested on custom datasets, making comparison across methods challenging. Furthermore, existing gaze estimation datasets have limited head pose and gaze variations, and the evaluations are conducted using different protocols and metrics. In this paper, we propose a new gaze estimation dataset called ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses. We collect this dataset from 110 participants with a custom hardware setup including 18 digital SLR cameras and adjustable illumination conditions, and a calibrated system to record ground truth gaze targets. We show that our dataset can significantly improve the robustness of gaze estimation methods across different head poses and gaze angles. Additionally, we define a standardized experimental protocol and evaluation metric on ETH-XGaze, to better unify gaze estimation research going forward. The dataset and benchmark website are available at this https URL
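
For context, the metric typically reported in gaze estimation protocols is the angle between predicted and ground-truth 3D gaze vectors; the sketch below assumes that convention rather than quoting the paper's exact evaluation code:

```python
import numpy as np

def angular_error_deg(gaze_pred, gaze_true):
    """Angle between predicted and ground-truth 3D gaze vectors, in
    degrees (a common gaze evaluation metric)."""
    p = gaze_pred / np.linalg.norm(gaze_pred, axis=-1, keepdims=True)
    t = gaze_true / np.linalg.norm(gaze_true, axis=-1, keepdims=True)
    cos = np.clip((p * t).sum(-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

print(angular_error_deg(np.array([0.0, 0.0, -1.0]),
                        np.array([0.1, 0.0, -1.0])))  # ~5.7 degrees
```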

20. Adversarial Bipartite Graph Learning for Video Domain Adaptation [PDF] Back to Contents
  Yadan Luo, Zi Huang, Zijian Wang, Zheng Zhang, Mahsa Baktashmotlagh
Abstract: Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area due to the significant spatial and temporal shifts across the source (i.e. training) and target (i.e. test) domains. As such, recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations and strengthen the feature transferability, are not highly effective on videos. To overcome this limitation, in this paper, we learn a domain-agnostic video classifier instead of learning domain-invariant representations, and propose an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions with a network topology of the bipartite graph. Specifically, the source and target frames are sampled as heterogeneous vertexes while the edges connecting two types of nodes measure the affinity among them. Through message-passing, each vertex aggregates the features from its heterogeneous neighbors, forcing the features coming from the same class to be mixed evenly. Explicitly exposing the video classifier to such cross-domain representations at the training and test stages makes our model less biased to the labeled source data, which in turn results in achieving a better generalization on the target domain. To further enhance the model capacity and testify the robustness of the proposed architecture on difficult transfer tasks, we extend our model to work in a semi-supervised setting using an additional video-level bipartite graph. Extensive experiments conducted on four benchmarks evidence the effectiveness of the proposed approach over SOTA methods on the task of video recognition.
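
One cross-domain message-passing step on such a bipartite graph might look like the following (an assumed formulation, not the paper's exact propagation rule): softmax affinities on the connecting edges let each source vertex aggregate target features, mixing the two domains:

```python
import torch

d = 128
src = torch.randn(8, d)                          # source frame features
tgt = torch.randn(6, d)                          # target frame features
# Edge weights between the two vertex sets, scaled dot-product style.
affinity = torch.softmax(src @ tgt.T / d ** 0.5, dim=1)   # (8, 6)
src_updated = affinity @ tgt                     # aggregate target neighbors
print(src_updated.shape)                         # torch.Size([8, 128])
```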

21. Blending Generative Adversarial Image Synthesis with Rendering for Computer Graphics [PDF] Back to Contents
  Ekim Yurtsever, Dongfang Yang, Ibrahim Mert Koc, Keith A. Redmill
Abstract: Conventional computer graphics pipelines require detailed 3D models, meshes, textures, and rendering engines to generate 2D images from 3D scenes. These processes are labor-intensive. We introduce Hybrid Neural Computer Graphics (HNCG) as an alternative. The contribution is a novel image formation strategy to reduce the 3D model and texture complexity of computer graphics pipelines. Our main idea is straightforward: Given a 3D scene, render only important objects of interest and use generative adversarial processes for synthesizing the rest of the image. To this end, we propose a novel image formation strategy to form 2D semantic images from 3D scenery consisting of simple object models without textures. These semantic images are then converted into photo-realistic RGB images with a state-of-the-art conditional Generative Adversarial Network (cGAN) based image synthesizer trained on real-world data. Meanwhile, objects of interest are rendered using a physics-based graphics engine. This is necessary as we want to have full control over the appearance of objects of interest. Finally, the partially-rendered and cGAN synthesized images are blended with a blending GAN. We show that the proposed framework outperforms conventional rendering with ablation and comparison studies. Semantic retention and Fréchet Inception Distance (FID) measurements were used as the main performance metrics.

22. Neural Compression and Filtering for Edge-assisted Real-time Object Detection in Challenged Networks [PDF] Back to Contents
  Yoshitomo Matsubara, Marco Levorato
Abstract: The edge computing paradigm places compute-capable devices (edge servers) at the network edge to assist mobile devices in executing data analysis tasks. Intuitively, offloading compute-intense tasks to edge servers can reduce their execution time. However, poor conditions of the wireless channel connecting the mobile devices to the edge servers may degrade the overall capture-to-output delay achieved by edge offloading. Herein, we focus on edge computing supporting remote object detection by means of Deep Neural Networks (DNNs), and develop a framework to reduce the amount of data transmitted over the wireless link. The core idea we propose builds on recent approaches splitting DNNs into sections, namely head and tail models, executed by the mobile device and edge server, respectively. The wireless link, then, is used to transport the output of the last layer of the head model to the edge server, instead of the DNN input. Most prior work focuses on classification tasks and leaves the DNN structure unaltered. Herein, our focus is on DNNs for three different object detection tasks, which present a much more convoluted structure, and we modify the architecture of the network to: (i) achieve in-network compression by introducing a bottleneck layer in the early layers of the head model, and (ii) prefilter pictures that do not contain objects of interest using a convolutional neural network. Results show that the proposed technique represents an effective intermediate option between local and edge computing in a parameter region where these extreme point solutions fail to provide satisfactory performance. We release the code and trained models at this https URL .
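
The head/tail split with an in-network bottleneck can be sketched directly; the layer sizes below are hypothetical, but they show how the transmitted tensor shrinks relative to the raw input:

```python
import torch
import torch.nn as nn

# Minimal sketch (hypothetical layer sizes) of the split: the head runs
# on the mobile device, its small bottleneck output crosses the wireless
# link, and the tail runs on the edge server.
head = nn.Sequential(                              # mobile device
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 3, stride=2, padding=1),      # bottleneck: 2 channels
)
tail = nn.Sequential(                              # edge server
    nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

x = torch.randn(1, 3, 224, 224)
z = head(x)                                        # tensor actually sent
print(z.numel(), "values transmitted vs", x.numel())   # 6272 vs 150528
out = tail(z)
```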

23. Robust Template Matching via Hierarchical Convolutional Features from a Shape Biased CNN [PDF] Back to Contents
  Bo Gao, Michael Spratling
Abstract: Finding a template in a search image is an important task underlying many computer vision applications. Recent approaches perform template matching in a feature-space, such as that produced by a convolutional neural network (CNN), that provides more tolerance to changes in appearance. In this article we investigate combining features from different layers of a CNN in order to obtain a feature-space that allows both precise and tolerant template matching. Furthermore we investigate if enhancing the encoding of shape information by the CNN can improve the performance of template matching. These investigations result in a new template matching method that produces state-of-the-art results on a standard benchmark. To confirm these results we also create a new benchmark and show that the proposed method also outperforms existing techniques on this new dataset. We further applied the proposed method to tracking and achieved more robust results.

24. Looking At The Body: Automatic Analysis of Body Gestures and Self-Adaptors in Psychological Distress [PDF] Back to Contents
  Weizhe Lin, Indigo Orton, Qingbiao Li, Gabriela Pavarini, Marwa Mahmoud
Abstract: Psychological distress is a significant and growing issue in society. Automatic detection, assessment, and analysis of such distress is an active area of research. Compared to modalities such as face, head, and vocal, research investigating the use of the body modality for these tasks is relatively sparse. This is, in part, due to the limited available datasets and difficulty in automatically extracting useful body features. Recent advances in pose estimation and deep learning have enabled new approaches to this modality and domain. To enable this research, we have collected and analyzed a new dataset containing full body videos for short interviews and self-reported distress labels. We propose a novel method to automatically detect self-adaptors and fidgeting, a subset of self-adaptors that has been shown to be correlated with psychological distress. We perform analysis on statistical body gestures and fidgeting features to explore how distress levels affect participants' behaviors. We then propose a multi-modal approach that combines different feature representations using Multi-modal Deep Denoising Auto-Encoders and Improved Fisher Vector Encoding. We demonstrate that our proposed model, combining audio-visual features with automatically detected fidgeting behavioral cues, can successfully predict distress levels in a dataset labeled with self-reported anxiety and depression levels.

25. AR-Net: Adaptive Frame Resolution for Efficient Action Recognition [PDF] Back to Contents
  Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, Rogerio Feris
Abstract: Action recognition is an open and challenging problem in computer vision. While current state-of-the-art models offer excellent recognition results, their computational expense limits their impact for many real-world applications. In this paper, we propose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on several challenging action recognition benchmark datasets well demonstrate the efficacy of our proposed approach over state-of-the-art methods. The project page can be found at this https URL
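
A rough sketch of the per-frame decision at inference time (sizes and the policy network are assumptions): a cheap low-resolution pass scores each frame, and the backbone then processes the selected resolution. During training, AR-Net keeps the discrete choice differentiable (e.g. with a Gumbel-Softmax style estimator); here we simply take the argmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

resolutions = [84, 112, 168, 224]                # candidate frame sizes
policy = nn.Sequential(nn.Flatten(),
                       nn.Linear(3 * 84 * 84, len(resolutions)))

frames = torch.randn(16, 3, 224, 224)            # one untrimmed clip
small = F.interpolate(frames, size=84)           # cheap policy input
choice = policy(small).argmax(dim=1)             # per-frame decision
resized = [F.interpolate(f[None], size=resolutions[int(c)])
           for f, c in zip(frames, choice)]      # what the backbone sees
print([r.shape[-1] for r in resized])
```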

26. LEMMA: A Multi-view Dataset for Learning Multi-agent Multi-task Activities [PDF] Back to Contents
  Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, Song-chun Zhu
Abstract: Understanding and interpreting human actions is a long-standing challenge and a critical indicator of perception in artificial intelligence. However, a few imperative components of daily human activities are largely missed in prior literature, including the goal-directed actions, concurrent multi-tasks, and collaborations among multi-agents. We introduce the LEMMA dataset to provide a single home to address these missing dimensions with meticulously designed settings, wherein the number of tasks and agents varies to highlight different learning objectives. We densely annotate the atomic-actions with human-object interactions to provide ground-truths of the compositionality, scheduling, and assignment of daily activities. We further devise challenging compositional action recognition and action/task anticipation benchmarks with baseline models to measure the capability of compositional action understanding and temporal reasoning. We hope this effort would drive the machine vision community to examine goal-directed human activities and further study the task scheduling and assignment in the real world.

27. Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies [PDF] Back to Contents
  Leo K. Tam, Xiaosong Wang, Evrim Turkbey, Kevin Lu, Yuhong Wen, Daguang Xu
Abstract: Detecting clinically relevant objects in medical images is a challenge despite large datasets due to the lack of detailed labels. To address the label issue, we utilize the scene-level labels with a detection architecture that incorporates natural language information. We present a challenging new set of radiologist paired bounding box and natural language annotations on the publicly available MIMIC-CXR dataset especially focussed on pneumonia and pneumothorax. Along with the dataset, we present a joint vision language weakly supervised transformer layer-selected one-stage dual head detection architecture (LITERATI) alongside strong baseline comparisons with class activation mapping (CAM), gradient CAM, and relevant implementations on the NIH ChestXray-14 and MIMIC-CXR dataset. Borrowing from advances in vision language architectures, the LITERATI method demonstrates joint image and referring expression (objects localized in the image using natural language) input for detection that scales in a purely weakly supervised fashion. The architectural modifications address three obstacles -- implementing a supervised vision and language detection method in a weakly supervised fashion, incorporating clinical referring expression natural language information, and generating high fidelity detections with map probabilities. Nevertheless, the challenging clinical nature of the radiologist annotations including subtle references, multi-instance specifications, and relatively verbose underlying medical reports, ensures the vision language detection task at scale remains stimulating for future investigation.

28. Unidentified Floating Object detection in maritime environment using dictionary learning [PDF] Back to Contents
  Darshan Venkatrayappa, Agnès Desolneux, Jean-Michel Hubert, Josselin Manceau
Abstract: Maritime domain is one of the most challenging scenarios for object detection due to the complexity of the observed scene. In this article, we present a new approach to detect unidentified floating objects in the maritime environment. The proposed approach is capable of detecting floating objects without any prior knowledge of their visual appearance, shape or location. The input image from the video stream is denoised using a visual dictionary learned from a K-SVD algorithm. The denoised image is made of self-similar content. Later, we extract the residual image, which is the difference between the original image and the denoised (self-similar) image. Thus, the residual image contains noise and salient structures (objects). These salient structures can be extracted using an a contrario model. We demonstrate the capabilities of our algorithm by testing it on videos exhibiting varying maritime scenarios.
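
The denoise-and-subtract idea can be sketched with off-the-shelf pieces; sklearn's online dictionary learner stands in for K-SVD here (an assumption), and the a contrario detection step is omitted:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.default_rng(0)
frame = rng.normal(0.5, 0.05, (64, 64))          # stand-in maritime frame
patches = extract_patches_2d(frame, (8, 8), max_patches=500, random_state=0)
P = patches.reshape(len(patches), -1)

# The learned atoms capture the self-similar sea texture, so whatever
# the sparse code cannot reconstruct ends up in the residual.
dico = MiniBatchDictionaryLearning(n_components=32,
                                   transform_algorithm="omp",
                                   transform_n_nonzero_coefs=2,
                                   random_state=0)
codes = dico.fit(P).transform(P)
residual = P - codes @ dico.components_          # salient structures live here
print(np.abs(residual).mean())
```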

29. Deep learning for lithological classification of carbonate rock micro-CT images [PDF] Back to Contents
  Carlos E. M. dos Anjos, Manuel R. V. Avila, Adna G. P. Vasconcelos, Aurea M.P. Neta, Lizianne C. Medeiros, Alexandre G. Evsukoff, Rodrigo Surmas
Abstract: In addition to the ongoing development, pre-salt carbonate reservoir characterization remains a challenge, primarily due to inherent geological particularities. These challenges stimulate the use of well-established technologies, such as artificial intelligence algorithms, for image classification tasks. Therefore, this work intends to present an application of deep learning techniques to identify patterns in Brazilian pre-salt carbonate rock microtomographic images, thus making possible lithological classification. Four convolutional neural network models were proposed. The first model includes three convolutional layers followed by fully connected layers and is used as a base model for the following proposals. In the next two models, we replace the max pooling layer with a spatial pyramid pooling and a global average pooling layer. The last model uses a combination of spatial pyramid pooling followed by global average pooling in place of the last pooling layer. All models are compared using original images, when possible, as well as resized images. The dataset consists of 6,000 images from three different classes. The model performances were evaluated by each image individually, as well as by the most frequently predicted class for each sample. According to accuracy, Model 2 trained on resized images achieved the best results, reaching an average of 75.54% for the first evaluation approach and an average of 81.33% for the second. We developed a workflow to automate and accelerate the lithology classification of Brazilian pre-salt carbonate samples by categorizing microtomographic images using deep learning algorithms in a non-destructive way.

30. From A Glance to "Gotcha": Interactive Facial Image Retrieval with Progressive Relevance Feedback [PDF] Back to Contents
  Xinru Yang, Haozhi Qi, Mingyang Li, Alexander Hauptmann
Abstract: Facial image retrieval plays a significant role in forensic investigations where an untrained witness tries to identify a suspect from a massive pool of images. However, due to the difficulties in describing human facial appearances verbally and directly, people naturally tend to depict by referring to well-known existing images and comparing specific areas of faces with them and it is also challenging to provide complete comparison at each time. Therefore, we propose an end-to-end framework to retrieve facial images with relevance feedback progressively provided by the witness, enabling an exploitation of history information during multiple rounds and an interactive and iterative approach to retrieving the mental image. With no need of any extra annotations, our model can be applied at the cost of a little response effort. We experiment on \texttt{CelebA} and evaluate the performance by ranking percentile and achieve 99\% under the best setting. Since this topic remains little explored to the best of our knowledge, we hope our work can serve as a stepping stone for further research.

31. Mix Dimension in Poincaré Geometry for 3D Skeleton-based Action Recognition [PDF]
  Wei Peng, Jingang Shi, Zhaoqiang Xia, Guoying Zhao
Abstract: Graph Convolutional Networks (GCNs) have already demonstrated their powerful ability to model irregular data, e.g., skeletal data in human action recognition, providing an exciting new way to fuse rich structural information for nodes residing in different parts of a graph. In human action recognition, current works introduce a dynamic graph generation mechanism to better capture the underlying semantic skeleton connections and thus improve performance. In this paper, we provide an orthogonal way to explore the underlying connections. Instead of introducing an expensive dynamic graph generation paradigm, we build a more efficient GCN on a Riemannian manifold, which we consider a more suitable space for modeling graph data, to make the extracted representations fit the embedding matrix. Specifically, we present a novel spatial-temporal GCN (ST-GCN) architecture defined via the Poincaré geometry, so that it can better model the latent anatomy of the structural data. To further explore the optimal projection dimension in the Riemannian space, we mix different dimensions on the manifold and provide an efficient way to explore the dimension for each ST-GCN layer. With the resulting architecture, we evaluate our method on the two current largest-scale 3D datasets, i.e., NTU RGB+D and NTU RGB+D 120. The comparison results show that the model achieves superior performance under all given evaluation metrics with only 40\% of the model size of the previous best GCN method, which proves the effectiveness of our model.
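The primitive underlying a Poincaré-geometry network is the hyperbolic distance in the Poincaré ball, which is simple to state on its own; the full ST-GCN layers, dimension mixing, and Riemannian optimization are substantially more involved and are not reproduced in this sketch.

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    """Geodesic distance in the Poincare ball model of hyperbolic space:
    d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    Points must lie strictly inside the unit ball."""
    sq = ((u - v) ** 2).sum(-1)
    nu = (1 - (u ** 2).sum(-1)).clamp_min(eps)
    nv = (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh((1 + 2 * sq / (nu * nv)).clamp_min(1 + eps))

u = torch.tensor([0.10, 0.20])
v = torch.tensor([0.40, -0.30])
print(poincare_distance(u, v))   # grows rapidly as points approach the boundary
```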

32. HMCNAS: Neural Architecture Search using Hidden Markov Chains and Bayesian Optimization [PDF]
  Vasco Lopes, Luís A. Alexandre
Abstract: Neural Architecture Search has achieved state-of-the-art performance in a variety of tasks, out-performing human-designed networks. However, many assumptions that require human definition, related to the problems being solved or the models generated, are still needed: final model architectures, the number of layers to be sampled, forced operations, and small search spaces. These ultimately contribute to models with higher performance at the cost of inducing bias into the system. In this paper, we propose HMCNAS, which is composed of two novel components: i) a method that leverages information about human-designed models to autonomously generate a complex search space, and ii) an Evolutionary Algorithm with Bayesian Optimization that is capable of generating competitive CNNs from scratch, without relying on human-defined parameters or small search spaces. The experimental results show that the proposed approach yields competitive architectures obtained in a very short time. HMCNAS provides a step towards generalizing NAS, by providing a way to create competitive models without requiring any human knowledge about the specific task.
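To make the search loop concrete, here is a bare-bones evolutionary skeleton over list-encoded architectures. The operation set, mutation rate, and toy fitness are placeholders, and HMCNAS's two actual contributions, the hidden-Markov-chain-generated search space and the Bayesian-optimization surrogate, are deliberately omitted.

```python
import random

OPS = ("conv3", "conv5", "pool", "skip")         # placeholder operation set

def random_arch(depth=8):
    return [random.choice(OPS) for _ in range(depth)]

def mutate(arch, p=0.2):
    return [random.choice(OPS) if random.random() < p else op for op in arch]

def evolve(fitness, pop_size=20, generations=10):
    """Truncation-selection EA; `fitness` would normally wrap building,
    training, and validating a CNN decoded from the architecture list."""
    pop = [random_arch() for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 4]
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(pop_size - len(parents))]
    return max(pop, key=fitness)

# Toy fitness (stand-in for validation accuracy): reward operation diversity.
print(evolve(lambda a: len(set(a))))
```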

33. Computer-aided Tumor Diagnosis in Automated Breast Ultrasound using 3D Detection Network [PDF]
  Junxiong Yu, Chaoyu Chen, Xin Yang, Yi Wang, Dan Yan, Jianxing Zhang, Dong Ni
Abstract: Automated breast ultrasound (ABUS) is a new and promising imaging modality for breast cancer detection and diagnosis, which can provide intuitive 3D information and coronal-plane information of great diagnostic value. However, manually screening and diagnosing tumors from ABUS images is very time-consuming, and abnormalities may be overlooked. In this study, we propose a novel two-stage 3D detection network for locating suspected lesion areas and further classifying lesions as benign or malignant tumors. Specifically, we propose a 3D detection network, rather than the frequently used segmentation network, to locate lesions in ABUS images, so our network can make full use of the spatial context information in ABUS images. A novel similarity loss is designed to effectively distinguish lesions from the background. Then a classification network is employed to identify the located lesions as benign or malignant. An IoU-balanced classification loss is adopted to improve the correlation between the classification and localization tasks. The efficacy of our network is verified on a collected dataset of 418 patients with 145 benign tumors and 273 malignant tumors. Experiments show our network attains a sensitivity of 97.66% with 1.23 false positives (FPs), and has an area under the curve (AUC) of 0.8720.
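The IoU-balanced classification loss can be sketched as a cross-entropy in which each sample is reweighted by its localization IoU raised to a power, so that well-localized detections dominate the classification gradient; the exponent and the renormalization below are common choices from the IoU-balanced-loss literature, not necessarily this paper's exact settings.

```python
import torch
import torch.nn.functional as F

def iou_balanced_ce(logits, targets, ious, eta=1.5):
    """Per-sample cross-entropy weighted by IoU**eta; weights are rescaled
    so the overall loss magnitude stays comparable to plain CE."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    w = ious.clamp(0, 1) ** eta
    w = w * (w.numel() / w.sum().clamp_min(1e-12))
    return (w * ce).mean()

logits = torch.randn(16, 2)                 # benign vs. malignant scores
targets = torch.randint(0, 2, (16,))
ious = torch.rand(16)                       # IoU of each located lesion with GT
print(iou_balanced_ce(logits, targets, ious))
```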

34. Learning-based Computer-aided Prescription Model for Parkinson's Disease: A Data-driven Perspective [PDF]
  Yinghuan Shi, Wanqi Yang, Kim-Han Thung, Hao Wang, Yang Gao, Yang Pan, Li Zhang, Dinggang Shen
Abstract: In this paper, we study a novel problem: "automatic prescription recommendation for PD patients." To realize this goal, we first build a dataset by collecting 1) symptoms of PD patients, and 2) their prescription drugs provided by neurologists. Then, we build a novel computer-aided prescription model by learning the relation between observed symptoms and prescription drugs. Finally, for newly arriving patients, our prescription model can recommend (predict) suitable prescription drugs based on their observed symptoms. On the methodology side, our proposed model, namely Prescription viA Learning lAtent Symptoms (PALAS), recommends prescriptions using a multi-modality representation of the data. In PALAS, a latent symptom space is learned to better model the relationship between symptoms and prescription drugs, as there is a large semantic gap between them. Moreover, we present an efficient alternating optimization method for PALAS. We evaluated our method using data collected from 136 PD patients at Nanjing Brain Hospital, which can be regarded as a large dataset in the PD research community. The experimental results demonstrate the effectiveness and clinical potential of our method in this recommendation task, compared with other competing methods.
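As a rough rendering of the latent-symptom idea (illustrative only: PALAS is fit by alternating optimization, and its multi-modality encoding is richer than a single linear layer), symptoms can be pictured as being projected into a latent space from which drug scores are predicted.

```python
import torch
import torch.nn as nn

class LatentPrescription(nn.Module):
    """Toy stand-in: encode symptom features into a latent 'symptom space'
    that bridges the semantic gap, then score candidate drugs from it."""
    def __init__(self, sym_dim=64, latent_dim=16, n_drugs=5):
        super().__init__()
        self.encode = nn.Linear(sym_dim, latent_dim)    # symptoms -> latent
        self.score = nn.Linear(latent_dim, n_drugs)     # latent -> drug scores

    def forward(self, symptoms):
        return self.score(torch.relu(self.encode(symptoms)))

model = LatentPrescription()
drug_scores = model(torch.randn(8, 64))     # one score vector per patient
```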

35. L$^2$C -- Learning to Learn to Compress [PDF]
  Nannan Zou, Honglei Zhang, Francesco Cricri, Hamed R. Tavakoli, Jani Lainema, Miska Hannuksela, Emre Aksu, Esa Rahtu
Abstract: In this paper we present an end-to-end meta-learned system for image compression. Traditional machine learning based approaches to image compression train one or more neural networks for generalization performance. However, at inference time, the encoder or the latent tensor output by the encoder can be optimized for each test image. This optimization can be regarded as a form of adaptation or benevolent overfitting to the input content. In order to reduce the gap between training and inference conditions, we propose a new training paradigm for learned image compression, which is based on meta-learning. In a first phase, the neural networks are trained normally. In a second phase, the Model-Agnostic Meta-Learning approach is adapted to the specific case of image compression, where the inner loop performs latent-tensor overfitting and the outer loop updates both the encoder and decoder neural networks based on the overfitting performance. Furthermore, after meta-learning, we propose to overfit and cluster the bias terms of the decoder on training image patches, so that at inference time the optimal content-specific bias terms can be selected at the encoder side. Finally, we propose a new probability model for lossless compression, which combines concepts from both multi-scale and super-resolution probability model approaches. We show the benefits of all our proposed ideas via carefully designed experiments.
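A first-order sketch of the two-loop scheme follows, with a toy encoder/decoder pair; the rate term, entropy model, and second-order MAML gradients are all omitted, so this is an assumption-laden simplification rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def inner_overfit(z0, decoder, image, steps=10, lr=1e-2):
    """Inner loop: adapt only the latent tensor to one image, the
    'benevolent overfitting' the outer loop learns to support."""
    z = z0.detach().requires_grad_(True)
    for _ in range(steps):
        loss = F.mse_loss(decoder(z), image)
        (g,) = torch.autograd.grad(loss, z)
        z = (z - lr * g).detach().requires_grad_(True)
    return z.detach()

enc = torch.nn.Conv2d(3, 8, 4, stride=2, padding=1)           # toy encoder
dec = torch.nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1)  # toy decoder
meta_opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), 1e-4)

for image in [torch.rand(1, 3, 32, 32) for _ in range(4)]:    # outer loop
    meta_opt.zero_grad()
    z_star = inner_overfit(enc(image), dec, image)
    # Post-adaptation loss trains the decoder; the unadapted term gives
    # the encoder a gradient in this first-order variant.
    loss = F.mse_loss(dec(z_star), image) + F.mse_loss(dec(enc(image)), image)
    loss.backward()
    meta_opt.step()
```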

36. Evaluating Automatically Generated Phoneme Captions for Images [PDF]
  Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg
Abstract: Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, an Image2Speech system was first implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for how well they describe the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics and is currently the best existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should instead assume its input to be parts of words, i.e. phonemes.
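Computing BLEU4 over phoneme tokens rather than word tokens requires no new machinery, only a different notion of token; the ARPAbet-style sequences below are invented for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["DH", "AH", "D", "AO", "G", "R", "AH", "N", "Z"]]   # made up
hypothesis = ["DH", "AH", "D", "AO", "G", "W", "AO", "K", "S"]    # made up

# Smoothing avoids zero scores on short sequences with missing n-grams.
score = sentence_bleu(reference, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"phoneme-level BLEU4 = {score:.3f}")
```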

37. A Novel Global Spatial Attention Mechanism in Convolutional Neural Network for Medical Image Classification [PDF]
  Linchuan Xu, Jun Huang, Atsushi Nitanda, Ryo Asaoka, Kenji Yamanishi
Abstract: Spatial attention has been introduced into convolutional neural networks (CNNs) to improve both their performance and interpretability in visual tasks including image classification. The essence of spatial attention is to learn a weight map that represents the relative importance of activations within the same layer or channel. All existing attention mechanisms are local attentions in the sense that the weight maps are image-specific. However, in the medical field, there are cases where all the images should share the same weight map, because the images record the same kind of symptom on the same object and thereby share the same structural content. In this paper, we thus propose a novel global spatial attention mechanism in CNNs, mainly for medical image classification. The global weight map is instantiated by a decision boundary between important and unimportant pixels, and we propose to realize the decision boundary with a binary classifier in which the intensities of all images at a pixel serve as that pixel's features. The binary classifier is integrated into an image classification CNN and is optimized together with the CNN. Experiments on two medical image datasets and one facial expression dataset showed that, with the proposed attention, not only can the performance of four powerful CNNs (GoogleNet, VGG, ResNet, and DenseNet) be improved, but meaningful attended regions can also be obtained, which is beneficial for understanding the image content of a domain.
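The structural difference from local attention fits in a few lines: the weight map is a parameter of the model rather than a function of the input image, so every image is modulated by the same map. The sketch below uses a directly learnable logit map for brevity, whereas the paper derives the map from a binary pixel classifier trained jointly with the CNN.

```python
import torch
import torch.nn as nn

class GlobalSpatialAttention(nn.Module):
    """One weight map shared by *all* images (contrast with per-image
    local attention, where the map is computed from each input)."""
    def __init__(self, height, width):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(1, 1, height, width))

    def forward(self, x):                        # x: (B, C, H, W)
        return x * torch.sigmoid(self.logits)    # broadcast over batch/channel

attn = GlobalSpatialAttention(224, 224)
out = attn(torch.randn(4, 3, 224, 224))          # same map for every image
```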

38. Robust Retinal Vessel Segmentation from a Data Augmentation Perspective [PDF]
  Xu Sun, Xingxing Cao, Yehui Yang, Lei Wang, Yanwu Xu
Abstract: Retinal vessel segmentation is a fundamental step in the screening, diagnosis, and treatment of various cardiovascular and ophthalmic diseases. Robustness is one of the most critical requirements for practical utilization, since test images may be captured by different fundus cameras or affected by various pathological changes. We investigate this problem from a data augmentation perspective, which has the merit of requiring no additional training data or inference time. In this paper, we propose two new data augmentation modules: channel-wise random gamma correction and channel-wise random vessel augmentation. Given a training color fundus image, the former applies random gamma correction to each color channel of the entire image, while the latter intentionally enhances or attenuates only the fine-grained blood vessel regions using morphological transformations. With the additional training samples generated by applying these two modules sequentially, a model can learn features that are more invariant and discriminative against both global and local disturbances. Experimental results on both real-world and synthetic datasets demonstrate that our method improves the performance and robustness of a classic convolutional neural network architecture. Source code is available at this https URL
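The first module is simple enough to state directly in code; the gamma range below is an assumed hyperparameter, and the second module (channel-wise random vessel augmentation via morphological transformations) is omitted for brevity.

```python
import numpy as np

def channelwise_random_gamma(img, gamma_range=(0.5, 2.0), rng=None):
    """Apply an independent random gamma curve to each color channel of a
    fundus image with values in [0, 1], roughly simulating the response
    differences between fundus camera models."""
    if rng is None:
        rng = np.random.default_rng()
    gammas = rng.uniform(*gamma_range, size=img.shape[-1])
    return np.stack([img[..., c] ** gammas[c]
                     for c in range(img.shape[-1])], axis=-1)

img = np.random.rand(512, 512, 3)                # stand-in color fundus image
aug = channelwise_random_gamma(img)
print(aug.shape, aug.min() >= 0, aug.max() <= 1)
```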

39. Residual-CycleGAN based Camera Adaptation for Robust Diabetic Retinopathy Screening [PDF]
  Dalu Yang, Yehui Yang, Tiantian Huang, Binghong Wu, Lei Wang, Yanwu Xu
Abstract: There is extensive research focusing on automated diabetic retinopathy (DR) detection from fundus images. However, an accuracy drop is observed when applying these models in real-world DR screening, where the fundus camera brands differ from the ones used to capture the training images. How can we train a classification model on labeled fundus images acquired from only one camera brand, yet still achieve good performance on images taken by other brands of cameras? In this paper, we quantitatively verify the impact of the fundus camera-brand-related domain shift on the performance of DR classification models from an experimental perspective. Further, we propose a camera-oriented residual-CycleGAN to mitigate the camera brand difference through domain adaptation and achieve improved classification performance on target camera images. Extensive ablation experiments on both the EyePACS dataset and a private dataset show that the camera brand difference can significantly impact classification performance, and prove that our proposed method can effectively improve model performance on the target domain. We have inferred and labeled the camera brand for each image in the EyePACS dataset and will publicize the camera brand labels for further research on domain adaptation.
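The "residual" idea can be illustrated with a generator that predicts only an additive, camera-specific correction to its input, so the anatomical content is preserved by construction; the tiny network below is a stand-in, and the cycle-consistency and adversarial losses of the full residual-CycleGAN are omitted.

```python
import torch
import torch.nn as nn

class ResidualGenerator(nn.Module):
    """Maps images from one camera domain toward another by adding a
    predicted residual to the input (images assumed scaled to [-1, 1])."""
    def __init__(self, ch=3, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return (x + self.body(x)).clamp(-1, 1)   # input + camera residual

g = ResidualGenerator()
adapted = g(torch.rand(2, 3, 256, 256) * 2 - 1)
```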

40. Estimating Motion Codes from Demonstration Videos [PDF]
  Maxat Alibayev, David Paulius, Yu Sun
Abstract: A motion taxonomy can encode manipulations as a binary-encoded representation, which we refer to as motion codes. These motion codes innately represent a manipulation action in an embedded space that describes the motion's mechanical features, including contact and trajectory type. The key advantage of using motion codes for embedding is that motions can be more appropriately defined with robotic-relevant features, and their distances can be more reasonably measured using these motion features. In this paper, we develop a deep learning pipeline to extract motion codes from demonstration videos in an unsupervised manner so that knowledge from these videos can be properly represented and used for robots. Our evaluations show that motion codes can be extracted from demonstrations of action in the EPIC-KITCHENS dataset.
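Because motion codes are binary vectors of mechanical attributes, comparing two manipulations reduces to bit-level operations such as the Hamming distance; the specific bits below are hypothetical and do not follow the taxonomy's actual field layout.

```python
import numpy as np

def hamming(a, b):
    """Number of differing bits, i.e., how many mechanical features
    (contact type, trajectory type, ...) two motions disagree on."""
    return int(np.sum(np.array(a) != np.array(b)))

pour = [1, 0, 1, 1, 0, 1]    # hypothetical code bits
stir = [1, 0, 1, 0, 1, 1]
cut  = [1, 1, 0, 0, 0, 1]
print(hamming(pour, stir), hamming(pour, cut))   # stirring is 'closer' to pouring
```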

41. A Survey on Concept Factorization: From Shallow to Deep Representation Learning [PDF]
  Zhao Zhang, Yan Zhang, Li Zhang, Shuicheng Yan
Abstract: The quality of the features learned by representation learning determines the performance of learning algorithms and the related application tasks (such as high-dimensional data clustering). As a relatively new paradigm for representation learning, Concept Factorization (CF) has attracted a great deal of interest in the areas of machine learning and data mining for over a decade. Many effective CF-based methods have been proposed from different perspectives and with different properties, but it remains difficult to grasp the essential connections and figure out the underlying explanatory factors from existing studies. In this paper, we therefore survey the recent advances in CF methodologies and the potential benchmarks by categorizing and summarizing the current methods. Specifically, we first review the root CF method, and then explore the advancement of CF-based representation learning, ranging from shallow to deep/multilayer cases. We also introduce the potential application areas of CF-based methods. Finally, we point out some future directions for studying CF-based representation learning. Overall, this survey provides an insightful overview of both the theoretical basis and current developments in the field of CF, which can also help interested researchers understand the current trends of CF and find the most appropriate CF techniques for particular applications.
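The root, shallow CF method admits a compact reference implementation: minimize ||X - X W V^T||_F^2 over nonnegative W and V with the standard multiplicative updates (this sketch assumes elementwise-nonnegative data so that the Gram matrix X^T X is nonnegative).

```python
import numpy as np

def concept_factorization(X, k, n_iter=200, eps=1e-9, seed=0):
    """Shallow CF: X ~ X @ W @ V.T with W, V >= 0, via multiplicative
    updates derived from the gradients of the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W, V = rng.random((n, k)), rng.random((n, k))
    K = X.T @ X                                   # Gram matrix of the data
    for _ in range(n_iter):
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)
    return W, V

X = np.abs(np.random.default_rng(1).normal(size=(50, 30)))
W, V = concept_factorization(X, k=5)
print(np.linalg.norm(X - X @ W @ V.T) / np.linalg.norm(X))  # relative error
```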

42. OREBA: A Dataset for Objectively Recognizing Eating Behaviour and Associated Intake [PDF]
  Philipp V. Rouast, Hamid Heydarian, Marc T. P. Adam, Megan E. Rollo
Abstract: Automatic detection of intake gestures is a key element of automatic dietary monitoring. Several types of sensors, including inertial measurement units (IMU) and video cameras, have been used for this purpose. The common machine learning approaches make use of the labelled sensor data to automatically learn how to make detections. One characteristic, especially for deep learning models, is the need for large datasets. To meet this need, we collected the Objectively Recognizing Eating Behavior and Associated Intake (OREBA) dataset. The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9069 intake gestures. Available sensor data consists of synchronized frontal video and IMU with accelerometer and gyroscope for both hands. We report the details of data collection and annotation, as well as technical details of sensor processing. The results of studies on IMU and video data involving deep learning models are reported to provide a baseline for future research.
