Abstracts

1. AssembleNet++: Assembling Modality Representations via Attention Connections
Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
Abstract: We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality. Even without pre-training, our models outperform the previous work on standard public activity recognition datasets with continuous videos, establishing a new state of the art. We also confirm that our findings, namely the neural connections from the object modality and the use of peer-attention, are generally applicable to different existing architectures, improving their performance. We name our model explicitly as AssembleNet++. The code will be available at: this https URL
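The peer-attention idea described above (attention weights for one block computed from another block or modality) can be illustrated with a minimal sketch. It assumes a channel-wise gate obtained by average-pooling the peer features, applying a linear map, and squashing with a sigmoid; the function name, the parameter shapes, and the `weights`/`bias` values are hypothetical and are not taken from the authors' code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def peer_attention(features, peer_features, weights, bias):
    """Gate each channel of `features` with a weight derived from a peer block.

    features:      list of C channels, each a list of activations
    peer_features: list of P channels from another block or modality
    weights, bias: hypothetical learned parameters (C x P matrix, length-C vector)
    """
    # Global average pooling over each peer channel.
    pooled = [sum(ch) / len(ch) for ch in peer_features]
    # One gate per output channel: sigmoid(W @ pooled + b).
    gates = [
        sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]
    # Scale every activation of a channel by that channel's gate.
    return [[g * a for a in ch] for g, ch in zip(gates, features)]
```

With an all-zero peer signal and zero bias, every gate is 0.5, so each channel is simply halved; a peer that activates strongly would push the corresponding gates toward 1.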

2. AB3DMOT: A Baseline for 3D Multi-Object Tracking and New Evaluation Metrics
Xinshuo Weng, Jianren Wang, David Held, Kris Kitani
Abstract: 3D multi-object tracking (MOT) is essential to applications such as autonomous driving. Recent work focuses on developing accurate systems, giving less attention to computational cost and system complexity. In contrast, this work proposes a simple real-time 3D MOT system with strong performance. Our system first obtains 3D detections from a LiDAR point cloud. Then, a straightforward combination of a 3D Kalman filter and the Hungarian algorithm is used for state estimation and data association. Additionally, 3D MOT datasets such as KITTI evaluate MOT methods in 2D space, and standardized 3D MOT evaluation tools are missing for a fair comparison of 3D MOT methods. We propose a new 3D MOT evaluation tool along with three new metrics to comprehensively evaluate 3D MOT methods. We show that our proposed method achieves strong 3D MOT performance on KITTI and runs at a rate of 207.4 FPS on the KITTI dataset, achieving the fastest speed among modern 3D MOT systems. Our code is publicly available at this http URL.
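The data-association step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: track positions are taken as already predicted (the Kalman filter itself is omitted), the cost is the Euclidean distance between 3D centers, and a brute-force search over permutations stands in for the Hungarian algorithm, so it is suitable for a handful of objects only. The function name and the `max_dist` gate are hypothetical.

```python
from itertools import permutations
from math import dist


def associate(predicted, detections, max_dist=2.0):
    """Return (track_index, detection_index) pairs with minimal total distance.

    Brute-force stand-in for the Hungarian algorithm: it finds the same
    optimal one-to-one assignment but in factorial time.
    """
    n_t, n_d = len(predicted), len(detections)
    if n_t == 0 or n_d == 0:
        return []
    # Enumerate one-to-one assignments of the smaller set into the larger one.
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), perm))
                      for perm in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(perm, range(n_d)))
                      for perm in permutations(range(n_t), n_d))
    best_cost, best_pairs = float("inf"), []
    for pairs in candidates:
        cost = sum(dist(predicted[t], detections[d]) for t, d in pairs)
        if cost < best_cost:
            best_cost, best_pairs = cost, pairs
    # Gate: discard matches that are too far apart to be the same object.
    return [(t, d) for t, d in best_pairs
            if dist(predicted[t], detections[d]) <= max_dist]
```

Unmatched detections would typically start new tracks and unmatched tracks would age out; both policies are left out of this sketch.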

3. Communicative Reinforcement Learning Agents for Landmark Detection in Brain Images
Guy Leroy, Daniel Rueckert, Amir Alansary
Abstract: Accurate detection of anatomical landmarks is an essential step in several medical imaging tasks. We propose a novel communicative multi-agent reinforcement learning (C-MARL) system to automatically detect landmarks in 3D brain images. C-MARL enables the agents to learn explicit communication channels, as well as implicit communication signals by sharing certain weights of the architecture among all the agents. The proposed approach is evaluated on two brain imaging datasets from adult magnetic resonance imaging (MRI) and fetal ultrasound scans. Our experiments show that involving multiple cooperating agents by learning their communication with each other outperforms previous approaches using single agents.

4. Gradients as a Measure of Uncertainty in Neural Networks
Jinsol Lee, Ghassan AlRegib
Abstract: Despite the tremendous success of modern neural networks, they are known to be overconfident even when the model encounters inputs with unfamiliar conditions. Detecting such inputs is vital to preventing models from making naive predictions that may jeopardize real-world applications of neural networks. In this paper, we address the challenging problem of devising a simple yet effective measure of uncertainty in deep neural networks. Specifically, we propose to utilize backpropagated gradients to quantify the uncertainty of trained models. Gradients depict the required amount of change for a model to properly represent given inputs, thus providing a valuable insight into how familiar and certain the model is regarding the inputs. We demonstrate the effectiveness of gradients as a measure of model uncertainty in applications of detecting unfamiliar inputs, including out-of-distribution and corrupted samples. We show that our gradient-based method outperforms state-of-the-art methods by up to 4.8% in AUROC score in out-of-distribution detection and 35.7% in corrupted input detection.

5. Multilanguage Number Plate Detection using Convolutional Neural Networks
Jatin Gupta, Vandana Saini, Kamaldeep Garg
Abstract: Object detection is a popular field of research for recent technologies. In recent years, the performance of deep learning has attracted researchers to use it in many applications. Number plate (NP) detection and classification has been analyzed for decades; however, it needs approaches that are more precise and independent of state, language and design, since cars now move easily from one state to another. In this paper, we suggest a new strategy to detect NPs and identify the nation, language and layout of NPs. A YOLOv2 detector with a ResNet feature extractor core is proposed for NP detection, and a new convolutional neural network architecture is suggested to classify NPs. The detector achieves an average precision of 99.57% and a country, language and layout classification precision of 99.33%. The results outperform the majority of previous works and can move the area forward toward international NP detection and recognition.

6. MaskedFace-Net -- A Dataset of Correctly/Incorrectly Masked Face Images in the Context of COVID-19
Adnane Cabani, Karim Hammoudi, Halim Benhabiles, Mahmoud Melkemi
Abstract: Wearing face masks appears to be a solution for limiting the spread of COVID-19. In this context, efficient recognition systems are expected for checking that people's faces are masked in regulated areas. To perform this task, a large dataset of masked faces is necessary for training deep learning models to detect people wearing masks and those not wearing masks. Some large datasets of masked faces are available in the literature. However, at the moment, there is no large dataset of masked face images available that makes it possible to check whether the masks on detected faces are worn correctly. Indeed, many people do not wear their masks correctly due to bad practices, bad behaviors or the vulnerability of individuals (e.g., children, old people). For these reasons, several mask-wearing campaigns intend to sensitize people to this problem and to good practices. In this sense, this work proposes three types of masked face detection dataset: the Correctly Masked Face Dataset (CMFD), the Incorrectly Masked Face Dataset (IMFD), and their combination for global masked face detection (MaskedFace-Net). Realistic masked face datasets are proposed with a twofold objective: i) to detect whether people have their faces masked or not, and ii) to detect whether masks are worn correctly or incorrectly (e.g., at airport portals or in crowds). To the best of our knowledge, no large dataset of masked faces provides such a granularity of classification for mask-wearing analysis. Moreover, this work presents the applied mask-to-face deformable model, which permits the generation of other masked face images, notably with specific masks. Our datasets of masked face images (137,016 images) are available at this https URL.

7. Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks
Gouthaman KV, Athira Nambiar, Kancheti Sai Srinivas, Anurag Mittal
Abstract: Attention models are widely used in Vision-language (V-L) tasks to perform the visual-textual correlation. Humans perform such a correlation with a strong linguistic understanding of the visual world. However, even the best performing attention model in V-L tasks lacks such a high-level linguistic understanding, thus creating a semantic gap between the modalities. In this paper, we propose an attention mechanism - Linguistically-aware Attention (LAT) - that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents visual and textual modalities in a common linguistically-rich space, thus providing linguistic awareness to the attention process. We apply and demonstrate the effectiveness of LAT in three V-L tasks: Counting-VQA, VQA, and Image captioning. In Counting-VQA, we propose a novel counting-specific VQA model to predict an intuitive count and achieve state-of-the-art results on five datasets. In VQA and Captioning, we show the generic nature and effectiveness of LAT by adapting it into various baselines and consistently improving their performance.

8. Reinforcement Learning for Improving Object Detection
Siddharth Nayak, Balaraman Ravindran
Abstract: The performance of a trained object detection neural network depends a lot on the image quality. Generally, images are pre-processed before feeding them into the neural network and domain knowledge about the image dataset is used to choose the pre-processing techniques. In this paper, we introduce an algorithm called ObjectRL to choose the amount of a particular pre-processing to be applied to improve the object detection performances of pre-trained networks. The main motivation for ObjectRL is that an image which looks good to a human eye may not necessarily be the optimal one for a pre-trained object detector to detect objects.

9. Anomaly Detection with Convolutional Autoencoders for Fingerprint Presentation Attack Detection
Jascha Kolberg, Marcel Grimmer, Marta Gomez-Barrero, Christoph Busch
Abstract: In recent years, the popularity of fingerprint-based biometric authentication systems has significantly increased. However, together with many advantages, biometric systems are still vulnerable to presentation attacks (PAs). In particular, this applies to unsupervised applications, where new attacks unknown to the system operator may occur. Therefore, presentation attack detection (PAD) methods are used to determine whether samples stem from a live subject (bona fide) or from a presentation attack instrument (PAI). In this context, most works are dedicated to solving PAD as a two-class classification problem, which includes training a model on both bona fide and PA samples. In spite of the good detection rates reported, these methods still face difficulties detecting PAIs from unknown materials. To address this issue, we propose a new PAD technique based on autoencoders (AEs) trained only on bona fide samples (i.e. one-class). In the experimental evaluation over a database of 19,711 bona fide and 4,339 PA images, including 45 different PAI species, a detection equal error rate (D-EER) of 2.00% was achieved. Additionally, our best performing AE model is compared to further one-class classifiers (support vector machine, Gaussian mixture model). The results show the effectiveness of the AE model as it significantly outperforms the previously proposed methods.

10. Hierarchical HMM for Eye Movement Classification
Ye Zhu, Yan Yan, Oleg Komogortsev
Abstract: In this work, we tackle the problem of ternary eye movement classification, which aims to separate fixations, saccades and smooth pursuits from the raw eye positional data. The efficient classification of these different types of eye movements helps to better analyze and utilize the eye tracking data. Different from the existing methods that detect eye movement by several pre-defined threshold values, we propose a hierarchical Hidden Markov Model (HMM) statistical algorithm for detecting fixations, saccades and smooth pursuits. The proposed algorithm leverages different features from the recorded raw eye tracking data with a hierarchical classification strategy, separating one type of eye movement each time. Experimental results demonstrate the effectiveness and robustness of the proposed method by achieving competitive or better performance compared to the state-of-the-art methods.
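As an illustration of the HMM building block, the sketch below decodes a sequence of eye velocities into two states with the standard Viterbi algorithm. It is a flat, single-level HMM and not the hierarchical scheme of the paper; the state names, the Gaussian emission parameters (in degrees per second), and the transition probabilities are hypothetical values chosen for the example.

```python
import math


def gaussian_log_pdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely state sequence for `observations` (standard Viterbi)."""
    score = {s: log_start[s] + log_emit(s, observations[0]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev, score, pointer = score, {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit(s, obs)
            pointer[s] = best
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Hypothetical two-state model over eye velocity.
STATES = ["fixation", "saccade"]
EMISSION = {"fixation": (5.0, 10.0), "saccade": (300.0, 100.0)}  # (mean, std)
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {a: {b: math.log(0.9 if a == b else 0.1) for b in STATES} for a in STATES}


def log_emit(state, velocity):
    mean, std = EMISSION[state]
    return gaussian_log_pdf(velocity, mean, std)


labels = viterbi([2.0, 4.0, 350.0, 400.0, 3.0], STATES, LOG_START, LOG_TRANS, log_emit)
```

A hierarchical variant in the spirit of the abstract could run one such binary decoder to split off one movement type, then a second decoder on the remaining samples; that staging is an assumption about the design, not a description of the authors' implementation.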

11. Dataset Bias in Few-shot Image Recognition
Shuqiang Jiang, Yaohui Zhu, Chenlong Liu, Xinhang Song, Xiangyang Li, Weiqing Min
Abstract: The goal of few-shot image recognition (FSIR) is to identify novel categories with a small number of annotated samples by exploiting transferable knowledge from training data (base categories). Most current studies assume that the transferable knowledge can be well used to identify novel categories. However, such transferable capability may be impacted by the dataset bias, and this problem has rarely been investigated before. Besides, most few-shot learning methods are biased to different datasets, which is also an important issue that needs to be investigated in depth. In this paper, we first investigate the impact of transferable capabilities learned from base categories. Specifically, we introduce relevance to describe relations of base and novel categories, along with instance diversity and category diversity to depict distributions of base categories. The FSIR model learns better transferable knowledge from relative training data. In the relative data, diverse instances or categories can further enrich the learned knowledge. We conduct experiments on different sub-datasets of ImageNet, and experimental results demonstrate that category relevance and category/instance diversity can depict transferable bias from distributions of base categories. Second, we investigate performance differences on different datasets from dataset structures and few-shot learning methods. Specifically, we introduce image complexity, inner-concept visual consistency, and inter-concept visual similarity to quantify characteristics of dataset structures. We use these quantitative characteristics and four few-shot learning methods to analyze performance differences on five different datasets. Based on experimental analysis, some insightful observations are obtained from the perspective of both dataset structures and few-shot learning methods. These observations are useful to guide future FSIR research.

12. Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
Ye Zhu, Yu Wu, Yi Yang, Yan Yan
Abstract: With the arising concerns for the AI systems provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents - Q-BOT - is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent who has already seen the entire video, assists Q-BOT to accomplish the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that Q-BOT can effectively learn to describe an unseen video by the proposed model and the cooperative learning method, achieving the promising performance where Q-BOT is given the full ground truth history dialog.

13. Motion Capture from Internet Videos
Junting Dong, Qing Shuai, Yuanqing Zhang, Xian Liu, Xiaowei Zhou, Hujun Bao
Abstract: Recent advances in image-based human pose estimation make it possible to capture 3D human motion from a single RGB video. However, the inherent depth ambiguity and self-occlusion in a single view prohibit the recovery of as high-quality motion as multi-view reconstruction. While multi-view videos are not common, the videos of a celebrity performing a specific action are usually abundant on the Internet. Even if these videos were recorded at different time instances, they would encode the same motion characteristics of the person. Therefore, we propose to capture human motion by jointly analyzing these Internet videos instead of using single videos separately. However, this new task poses many new challenges that cannot be addressed by existing methods, as the videos are unsynchronized, the camera viewpoints are unknown, the background scenes are different, and the human motions are not exactly the same among videos. To address these challenges, we propose a novel optimization-based framework and experimentally demonstrate its ability to recover much more precise and detailed motion from multiple videos, compared against monocular motion capture methods.

14. Feature Products Yield Efficient Networks
Philipp Grüning, Thomas Martinetz, Erhardt Barth
Abstract: We introduce Feature-Product networks (FP-nets) as a novel deep-network architecture based on a new building block inspired by principles of biological vision. For each input feature map, a so-called FP-block learns two different filters, the outputs of which are then multiplied. Such FP-blocks are inspired by models of end-stopped neurons, which are common in cortical areas V1 and especially in V2. Convolutional neural networks can be transformed into parameter-efficient FP-nets by substituting conventional blocks of regular convolutions with FP-blocks. In this way, we create several novel FP-nets based on state-of-the-art networks and evaluate them on the Cifar-10 and ImageNet challenges. We show that the use of FP-blocks reduces the number of parameters significantly without decreasing generalization capability. Since so far heuristics and search algorithms have been used to find more efficient networks, it seems remarkable that we can obtain even more efficient networks based on a novel bio-inspired design principle.
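The core operation of an FP-block as described above (two filters applied to the same input, outputs multiplied) can be shown with a minimal 1-D sketch. The helper names and the example kernels are hypothetical, and the "convolution" is the cross-correlation conventionally used in deep learning, without padding; the actual blocks operate on 2-D feature maps with learned filters.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def fp_block(signal, kernel_a, kernel_b):
    """Apply two filters to the same input and multiply their outputs element-wise."""
    out_a = conv1d(signal, kernel_a)
    out_b = conv1d(signal, kernel_b)
    return [a * b for a, b in zip(out_a, out_b)]


# A difference filter times a sum filter responds only at the edges of the bar.
response = fp_block([0, 0, 1, 1, 0, 0], [1, -1], [1, 1])
```

Because the two responses are multiplied, the output is non-zero only where both filters respond, which is the AND-like selectivity that makes the product a plausible model of end-stopped behavior.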

15. Visibility-aware Multi-view Stereo Network
Jingyang Zhang, Yao Yao, Shiwei Li, Zixin Luo, Tian Fang
Abstract: Learning-based multi-view stereo (MVS) methods have demonstrated promising results. However, very few existing networks explicitly take the pixel-wise visibility into consideration, resulting in erroneous cost aggregation from occluded pixels. In this paper, we explicitly infer and integrate the pixel-wise occlusion information in the MVS network via the matching uncertainty estimation. The pair-wise uncertainty map is jointly inferred with the pair-wise depth map, which is further used as weighting guidance during the multi-view cost volume fusion. As such, the adverse influence of occluded pixels is suppressed in the cost fusion. The proposed framework Vis-MVSNet significantly improves depth accuracies in the scenes with severe occlusion. Extensive experiments are performed on DTU, BlendedMVS, and Tanks and Temples datasets to justify the effectiveness of the proposed framework.

16. Person image generation with semantic attention network for person re-identification
Meichen Liu, Kejun Wang, Juihang Ji, Shuzhi Sam Ge
Abstract: Pose variation is one of the key factors which prevents the network from learning a robust person re-identification (Re-ID) model. To address this issue, we propose a novel person pose-guided image generation method, which is called the semantic attention network. The network consists of several semantic attention blocks, where each block attends to preserve and update the pose code and the clothing textures. The introduction of the binary segmentation mask and the semantic parsing is important for seamlessly stitching foreground and background in the pose-guided image generation. Compared with other methods, our network can characterize better body shape and keep clothing attributes, simultaneously. Our synthesized image can obtain better appearance and shape consistency related to the original image. Experimental results show that our approach is competitive with respect to both quantitative and qualitative results on Market-1501 and DeepFashion. Furthermore, we conduct extensive evaluations by using person re-identification (Re-ID) systems trained with the pose-transferred person based augmented data. The experiment shows that our approach can significantly enhance the person Re-ID accuracy.

17. Self-supervised Sparse to Dense Motion Segmentation
Amirhossein Kardoost, Kalun Ho, Peter Ochs, Margret Keuper
Abstract: Observable motion in videos can give rise to the definition of objects moving with respect to the scene. The task of segmenting such moving objects is referred to as motion segmentation and is usually tackled either by aggregating motion information in long, sparse point trajectories, or by directly producing per frame dense segmentations relying on large amounts of training data. In this paper, we propose a self supervised method to learn the densification of sparse motion segmentations from single video frames. While previous approaches towards motion segmentation build upon pre-training on large surrogate datasets and use dense motion information as an essential cue for the pixelwise segmentation, our model does not require pre-training and operates at test time on single frames. It can be trained in a sequence specific way to produce high quality dense segmentations from sparse and noisy input. We evaluate our method on the well-known motion segmentation datasets FBMS59 and DAVIS16.

18. Depth Completion with RGB Prior
Yuri Feldman, Yoel Shapiro, Dotan Di Castro
Abstract: Depth cameras are a prominent perception system for robotics, especially when operating in natural unstructured environments. Industrial applications, however, typically involve reflective objects under harsh lighting conditions, a challenging scenario for depth cameras, as it induces numerous reflections and deflections, leading to loss of robustness and deteriorated accuracy. Here, we developed a deep model to correct the depth channel in RGBD images, aiming to restore the depth information to the required accuracy. To train the model, we created a novel industrial dataset that we now present to the public. The data was collected with low-end depth cameras and the ground truth depth was generated by multi-view fusion.

19. Image Pre-processing on NumtaDB for Bengali Handwritten Digit Recognition
Ovi Paul
Abstract: NumtaDB is by far the largest dataset collection of handwritten digits in Bengali. It is a diverse dataset containing more than 85,000 images, but this diversity also makes the dataset very difficult to work with. The goal of this paper is to find a benchmark for pre-processed images that gives good accuracy with any machine learning model. The reason is that, unlike MNIST for English digits, no pre-processed data is available for Bengali digit recognition.

20. EASTER: Efficient and Scalable Text Recognizer
Kartik Chaudhary, Raghav Bali
Abstract: Recent progress in deep learning has led to the development of Optical Character Recognition (OCR) systems which perform remarkably well. Most research has centered on recurrent networks as well as complex gated layers, which make the overall solution complex and difficult to scale. In this paper, we present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine-printed and handwritten text. Our model utilises 1-D convolutional layers without any recurrence, which enables parallel training with a considerably smaller volume of data. We experimented with multiple variations of our architecture, and one of the smallest variants (in terms of depth and number of parameters) performs comparably to complex RNN-based choices. Our deepest, 20-layer variant outperforms RNN architectures by a good margin on benchmarking datasets like IIIT-5k and SVT. We also showcase improvements over the current best results on the offline handwritten text recognition task. We also present data generation pipelines with an augmentation setup to generate synthetic datasets for both handwritten and machine-printed text.

21. Tackling the Unannotated: Scene Graph Generation with Bias-Reduced Models
Tzu-Jui Julius Wang, Selen Pehlivan, Jorma Laaksonen
Abstract: Predicting a scene graph that captures visual entities and their interactions in an image has been considered a crucial step towards full scene comprehension. Recent scene graph generation (SGG) models have shown their capability of capturing the most frequent relations among visual entities. However, the state-of-the-art results are still far from satisfactory, e.g. models can obtain 31% in overall recall R@100, whereas the likewise important mean class-wise recall mR@100 is only around 8% on Visual Genome (VG). The discrepancy between R and mR results urges to shift the focus from pursuing a high R to a high mR with a still competitive R. We suspect that the observed discrepancy stems from both the annotation bias and sparse annotations in VG, in which many visual entity pairs are either not annotated at all or only with a single relation when multiple ones could be valid. To address this particular issue, we propose a novel SGG training scheme that capitalizes on self-learned knowledge. It involves two relation classifiers, one offering a less biased setting for the other to base on. The proposed scheme can be applied to most of the existing SGG models and is straightforward to implement. We observe significant relative improvements in mR (between +6.6% and +20.4%) and competitive or better R (between -2.4% and 0.3%) across all standard SGG tasks.
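The gap between R and mR discussed above comes from class imbalance, which a small worked example makes concrete. The sketch below is a simplification: it pools all ground-truth relations into one list of (predicate class, recovered-in-top-K) records, whereas benchmark implementations usually compute recall per image first. The function name and the toy counts are hypothetical.

```python
from collections import defaultdict


def recall_and_mean_recall(records):
    """records: iterable of (predicate_class, hit) for each ground-truth relation.

    Returns (overall recall, mean of per-class recalls).
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for predicate, hit in records:
        total[predicate] += 1
        hits[predicate] += int(hit)
    overall = sum(hits.values()) / sum(total.values())
    mean_per_class = sum(hits[c] / total[c] for c in total) / len(total)
    return overall, mean_per_class


# A frequent predicate that is mostly recovered and a rare one that is mostly missed.
toy = ([("on", True)] * 8 + [("on", False)] * 2
       + [("riding", True)] * 1 + [("riding", False)] * 4)
r, mr = recall_and_mean_recall(toy)
```

Here the overall recall is 9/15 = 0.6, while the mean class-wise recall is (0.8 + 0.2) / 2 = 0.5: the frequent class dominates R, and the rare class pulls mR down.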

22. Mastering Large Scale Multi-label Image Recognition with high efficiency over Camera trap images
Miroslav Valan, Lukáš Picek
Abstract: Camera traps are crucial in biodiversity-motivated studies; however, dealing with a large number of images while annotating these data sets is a tedious and time-consuming task. To speed up this process, machine learning approaches are a reasonable asset. In this article we propose an easy, accessible, light-weight, fast and efficient approach based on our winning submission to the "Hakuna Ma-data - Serengeti Wildlife Identification challenge". Our system achieved an accuracy of 97% and outperformed human-level performance. We show that, given relatively large data sets, it is effective to look at each image only once with little or no augmentation. By utilizing such a simple yet effective baseline we were able to avoid over-fitting without extensive regularization techniques and to train a top-scoring system on very limited hardware featuring a single GPU (1080Ti) despite the large training set (6.7M images and 6TB).

23. ConvGRU in Fine-grained Pitching Action Recognition for Action Outcome Prediction
Tianqi Ma, Lin Zhang, Xiumin Diao, Ou Ma
Abstract: Prediction of the action outcome is a new challenge for a robot working collaboratively with humans. With the impressive progress in video action recognition in recent years, fine-grained action recognition from video data has become a new concern. Fine-grained action recognition detects subtle differences between actions at a more specific granularity and is significant in many fields such as human-robot interaction, intelligent traffic management, sports training, and health care. Considering that different outcomes are closely connected to subtle differences in actions, fine-grained action recognition is a practical method for action outcome prediction. In this paper, we explore the performance of the convolutional gated recurrent unit (ConvGRU) method on a fine-grained action recognition task: predicting outcomes of ball-pitching. Based on sequences of RGB images of human actions, the proposed approach achieved 79.17% accuracy, which exceeds the current state-of-the-art result. We also compared different network implementations and showed the influence of different image sampling methods, different fusion methods, pre-training, etc. Finally, we discussed the advantages and limitations of ConvGRU in such action outcome prediction and fine-grained action recognition tasks.

24. Retargetable AR: Context-aware Augmented Reality in Indoor Scenes based on 3D Scene Graph
Tomu Tahara, Takashi Seno, Gaku Narita, Tomoya Ishikawa
Abstract: In this paper, we present Retargetable AR, a novel AR framework that yields an AR experience that is aware of scene contexts set in various real environments, achieving natural interaction between the virtual and real worlds. To this end, we characterize scene contexts with relationships among objects in 3D space, not with coordinates transformations. A context assumed by an AR content and a context formed by a real environment where users experience AR are represented as abstract graph representations, i.e. scene graphs. From RGB-D streams, our framework generates a volumetric map in which geometric and semantic information of a scene are integrated. Moreover, using the semantic map, we abstract scene objects as oriented bounding boxes and estimate their orientations. With such a scene representation, our framework constructs, in an online fashion, a 3D scene graph characterizing the context of a real environment for AR. The correspondence between the constructed graph and an AR scene graph denoting the context of AR content provides a semantically registered content arrangement, which facilitates natural interaction between the virtual and real worlds. We performed extensive evaluations on our prototype system through quantitative evaluation of the performance of the oriented bounding box estimation, subjective evaluation of the AR content arrangement based on constructed 3D scene graphs, and an online AR demonstration. The results of these evaluations showed the effectiveness of our framework, demonstrating that it can provide a context-aware AR experience in a variety of real scenes.

25. Knowledge Transfer via Dense Cross-Layer Mutual-Distillation
Anbang Yao, Dawei Sun
Abstract: Knowledge Distillation (KD) based methods adopt the one-way Knowledge Transfer (KT) scheme in which training a lower-capacity student network is guided by a pre-trained high-capacity teacher network. Recently, Deep Mutual Learning (DML) presented a two-way KT strategy, showing that the student network can be also helpful to improve the teacher network. In this paper, we propose Dense Cross-layer Mutual-distillation (DCM), an improved two-way KT method in which the teacher and student networks are trained collaboratively from scratch. To augment knowledge representation learning, well-designed auxiliary classifiers are added to certain hidden layers of both teacher and student networks. To boost KT performance, we introduce dense bidirectional KD operations between the layers appended with classifiers. After training, all auxiliary classifiers are discarded, and thus there are no extra parameters introduced to final models. We test our method on a variety of KT tasks, showing its superiorities over related methods. Code is available at this https URL
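The two-way transfer between classifier outputs described in the abstract above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that each network's loss is its cross-entropy on the label plus a KL term towards every classifier output of the other network, and all function names and the temperature parameter are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(probs, label):
    return -math.log(probs[label])


def mutual_distillation_losses(teacher_heads, student_heads, label, temperature=1.0):
    """Return (teacher_loss, student_loss).

    teacher_heads / student_heads: lists of logit vectors, one per classifier
    (auxiliary or final) attached to the respective network.
    """
    t_probs = [softmax(h, temperature) for h in teacher_heads]
    s_probs = [softmax(h, temperature) for h in student_heads]

    # Supervised term: every classifier is trained on the ground-truth label.
    teacher_loss = sum(cross_entropy(softmax(h), label) for h in teacher_heads)
    student_loss = sum(cross_entropy(softmax(h), label) for h in student_heads)

    # Dense two-way term: every teacher head is paired with every student head.
    for tp in t_probs:
        for sp in s_probs:
            student_loss += kl_divergence(tp, sp)  # student mimics teacher
            teacher_loss += kl_divergence(sp, tp)  # teacher mimics student
    return teacher_loss, student_loss


# Hypothetical logits for a 3-class problem, two classifiers per network.
t_loss, s_loss = mutual_distillation_losses(
    teacher_heads=[[2.0, 0.5, -1.0], [1.5, 0.2, -0.5]],
    student_heads=[[1.0, 0.8, -0.2], [0.5, 0.4, 0.1]],
    label=0,
)
print(t_loss, s_loss)
```

In an autodiff framework, the distribution being imitated would typically be treated as a constant when computing each network's gradient; that detail is omitted here.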

26. Mesh Guided One-shot Face Reenactment using Graph Convolutional Networks
Guangming Yao, Yi Yuan, Tianjia Shao, Kun Zhou
Abstract: Face reenactment aims to animate a source face image to a different pose and expression provided by a driving image. Existing approaches are either designed for a specific identity, or suffer from the identity preservation problem in the one-shot or few-shot scenarios. In this paper, we introduce a method for one-shot face reenactment, which uses the reconstructed 3D meshes (i.e., the source mesh and driving mesh) as guidance to learn the optical flow needed for the reenacted face synthesis. Technically, we explicitly exclude the driving face's identity information in the reconstructed driving mesh. In this way, our network can focus on the motion estimation for the source face without the interference of driving face shape. We propose a motion net to learn the face motion, which is an asymmetric autoencoder. The encoder is a graph convolutional network (GCN) that learns a latent motion vector from the meshes, and the decoder serves to produce an optical flow image from the latent vector with CNNs. Compared to previous methods using sparse keypoints to guide the optical flow learning, our motion net learns the optical flow directly from 3D dense meshes, which provide the detailed shape and pose information for the optical flow, so it can achieve more accurate expression and pose on the reenacted face. Extensive experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

27. Pix2Surf: Learning Parametric 3D Surface Models of Objects from Images
Jiahui Lei, Srinath Sridhar, Paul Guerrero, Minhyuk Sung, Niloy Mitra, Leonidas J. Guibas
Abstract: We investigate the problem of learning to generate 3D parametric surface representations for novel object instances, as seen from one or more views. Previous work on learning shape reconstruction from multiple views uses discrete representations such as point clouds or voxels, while continuous surface generation approaches lack multi-view consistency. We address these issues by designing neural networks capable of generating high-quality parametric 3D surfaces which are also consistent between views. Furthermore, the generated 3D surfaces preserve accurate image pixel to 3D surface point correspondences, allowing us to lift texture information to reconstruct shapes with rich geometry and appearance. Our method is supervised and trained on a public dataset of shapes from common object categories. Quantitative results indicate that our method significantly outperforms previous work, while qualitative results demonstrate the high quality of our reconstructions.

28. Equivalent Classification Mapping for Weakly Supervised Temporal Action Localization
Le Yang, Dingwen Zhang, Tao Zhao, Junwei Han
Abstract: Weakly supervised temporal action localization is a newly emerging yet widely studied topic in recent years. The existing methods can be categorized into two localization-by-classification pipelines, i.e., the pre-classification pipeline and the post-classification pipeline. The pre-classification pipeline first performs classification on each video snippet and then aggregates the snippet-level classification scores to obtain the video-level classification score, while the post-classification pipeline aggregates the snippet-level features first and then predicts the video-level classification score based on the aggregated feature. Although the classifiers in these two pipelines are used in different ways, the role they play is exactly the same: to classify the given features and identify the corresponding action categories. To this end, an ideal classifier can make both pipelines work. This inspires us to simultaneously learn these two pipelines in a unified framework to obtain an effective classifier. Specifically, in the proposed learning framework, we implement two parallel network streams to model the two localization-by-classification pipelines simultaneously and make the two network streams share the same classifier, thus achieving the novel Equivalent Classification Mapping (ECM) mechanism. Considering that an ideal classifier would make the classification results of the two network streams identical and make the frame-level classification scores obtained from the pre-classification pipeline consistent with the feature aggregation weights in the post-classification pipeline, we further introduce an equivalent classification loss and an equivalent weight transition module to endow the proposed learning framework with such properties. Comprehensive experiments are carried out on three benchmarks, and the proposed ECM achieves superior performance over other state-of-the-art methods.
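The two pipelines in the abstract above differ only in the order of classification and aggregation. The sketch below shows both orders with a shared classifier. It assumes a purely linear classifier and a fixed weighted-sum aggregation, in which case the two video-level scores coincide by linearity; the names and numbers are hypothetical, and the paper's actual classifier and aggregation are not specified here and may well be nonlinear.

```python
def classify(weights, feature):
    """Linear classifier: one score per class (one row of `weights` per class)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def weighted_sum(vectors, agg_weights):
    """Aggregate a list of equal-length vectors with per-snippet weights."""
    dim = len(vectors[0])
    return [sum(a * v[d] for a, v in zip(agg_weights, vectors)) for d in range(dim)]


def pre_classification(weights, snippets, agg_weights):
    # Classify each snippet first, then aggregate the snippet-level scores.
    snippet_scores = [classify(weights, s) for s in snippets]
    return weighted_sum(snippet_scores, agg_weights)


def post_classification(weights, snippets, agg_weights):
    # Aggregate the snippet-level features first, then classify once.
    video_feature = weighted_sum(snippets, agg_weights)
    return classify(weights, video_feature)


# Hypothetical toy video: three snippets with 2-D features, two action classes.
shared_classifier = [[1.0, -1.0], [0.5, 2.0]]
snippets = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
agg_weights = [0.2, 0.3, 0.5]

pre_scores = pre_classification(shared_classifier, snippets, agg_weights)
post_scores = post_classification(shared_classifier, snippets, agg_weights)
print(pre_scores, post_scores)
```

Once the classifier or the aggregation is nonlinear, the two orders can produce different scores, so agreement between the streams is no longer automatic.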

29. SoDA: Multi-Object Tracking with Soft Data Association
Wei-Chih Hung, Henrik Kretzschmar, Tsung-Yi Lin, Yuning Chai, Ruichi Yu, Ming-Hsuan Yang, Drago Anguelov
Abstract: Robust multi-object tracking (MOT) is a prerequisite for a safe deployment of self-driving cars. Tracking objects, however, remains a highly challenging problem, especially in cluttered autonomous driving scenes in which objects tend to interact with each other in complex ways and frequently get occluded. We propose a novel approach to MOT that uses attention to compute track embeddings that encode the spatiotemporal dependencies between observed objects. This attention measurement encoding allows our model to relax hard data associations, which may lead to unrecoverable errors. Instead, our model aggregates information from all object detections via soft data associations. The resulting latent space representation allows our model to learn to reason about occlusions in a holistic data-driven way and maintain track estimates for objects even when they are occluded. Our experimental results on the Waymo OpenDataset suggest that our approach leverages modern large-scale datasets and performs favorably compared to the state of the art in visual multi-object tracking.

30. Domain Generalizer: A Few-shot Meta Learning Framework for Domain Generalization in Medical Imaging
Pulkit Khandelwal, Paul Yushkevich
Abstract: Deep learning models perform best when tested on target (test) data domains whose distribution is similar to the set of source (train) domains. However, model generalization can be hindered when there is significant difference in the underlying statistics between the target and source domains. In this work, we adapt a domain generalization method based on a model-agnostic meta-learning framework to biomedical imaging. The method learns a domain-agnostic feature representation to improve generalization of models to the unseen test distribution. The method can be used for any imaging task, as it does not depend on the underlying model architecture. We validate the approach through a computed tomography (CT) vertebrae segmentation task across healthy and pathological cases on three datasets. Next, we employ few-shot learning, i.e. training the generalized model using very few examples from the unseen domain, to quickly adapt the model to new unseen data distribution. Our results suggest that the method could help generalize models across different medical centers, image acquisition protocols, anatomies, different regions in a given scan, healthy and diseased populations across varied imaging modalities.

31. Multiple View Generation and Classification of Mid-wave Infrared Images using Deep Learning
Maliha Arif, Abhijit Mahalanobis
Abstract: We propose a novel study of generating unseen arbitrary viewpoints for infrared imagery in the non-linear feature subspace. Current methods use synthetic images and often result in blurry and distorted outputs. Our approach, on the contrary, understands the semantic information in natural images and encapsulates it such that our predicted unseen views possess good 3D representations. We further explore the non-linear feature subspace and conclude that our network does not operate in the Euclidean subspace but rather in the Riemannian subspace. It does not learn the geometric transformation for predicting the position of the pixel in the new image but rather learns the manifold. To this end, we use t-SNE visualisations to conduct a detailed analysis of our network and perform classification of generated images as a low-shot learning task.

32. Contact Area Detector using Cross View Projection Consistency for COVID-19 Projects
Pan Zhang, Wilfredo Torres Calderon, Bokyung Lee, Alex Tessier, Jacky Bibliowicz, Liviu Calin, Michael Lee
Abstract: The ability to determine what parts of objects and surfaces people touch as they go about their daily lives would be useful in understanding how the COVID-19 virus spreads. To determine whether a person has touched an object or surface using visual data, images, or videos, is a hard problem. Computer vision 3D reconstruction approaches project objects and the human body from the 2D image domain to 3D and perform 3D space intersection directly. However, this solution would not meet the accuracy requirement in applications due to projection error. Another standard approach is to train a neural network to infer touch actions from the collected visual data. This strategy would require significant amounts of training data to generalize over scale and viewpoint variations. A different approach to this problem is to identify whether a person has touched a defined object. In this work, we show that the solution to this problem can be straightforward. Specifically, we show that the contact between an object and a static surface can be identified by projecting the object onto the static surface through two different viewpoints and analyzing their 2D intersection. The object contacts the surface when the projected points are close to each other; we call this cross view projection consistency. Instead of doing 3D scene reconstruction or transfer learning from deep networks, a mapping from the surface in the two camera views to the surface space is the only requirement. For planar space, this mapping is the Homography transformation. This simple method can be easily adapted to real-life applications. In this paper, we apply our method to do office occupancy detection for studying the COVID-19 transmission pattern from an office desk in a meeting room using the contact information.
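The cross view projection consistency test described in the abstract above can be illustrated with a minimal sketch. It assumes a planar surface and two known 3x3 homographies, one per camera, that map image coordinates to surface coordinates; the function names, the tolerance, and the toy matrices are hypothetical and not taken from the paper.

```python
import math


def apply_homography(h, point):
    """Map a 2-D image point to plane coordinates with a 3x3 homography."""
    x, y = point
    xp = h[0][0] * x + h[0][1] * y + h[0][2]
    yp = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xp / w, yp / w)


def is_contact(h_view1, h_view2, point_view1, point_view2, tolerance):
    """Declare contact when both views project the point to nearly the same
    location on the surface."""
    p1 = apply_homography(h_view1, point_view1)
    p2 = apply_homography(h_view2, point_view2)
    return math.dist(p1, p2) <= tolerance


# Hypothetical calibration: view 1 is already in surface coordinates,
# view 2 sees the surface at twice the scale.
H1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H2 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]

# A point lying on the surface projects consistently from both views.
touching = is_contact(H1, H2, (10.0, 20.0), (20.0, 40.0), tolerance=1.0)
# A point above the surface lands in different places (parallax).
hovering = is_contact(H1, H2, (10.0, 20.0), (30.0, 40.0), tolerance=1.0)
print(touching, hovering)
```

The tolerance absorbs calibration and detection noise; how it should be chosen in practice is left open in this sketch.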

33. One-pixel Signature: Characterizing CNN Models for Backdoor Detection
Shanjiaoyang Huang, Weiqi Peng, Zhiwei Jia, Zhuowen Tu
Abstract: We tackle the convolution neural networks (CNNs) backdoor detection problem by proposing a new representation called one-pixel signature. Our task is to detect/classify if a CNN model has been maliciously inserted with an unknown Trojan trigger or not. Here, each CNN model is associated with a signature that is created by generating, pixel-by-pixel, an adversarial value that is the result of the largest change to the class prediction. The one-pixel signature is agnostic to the design choice of CNN architectures, and how they were trained. It can be computed efficiently for a black-box CNN model without accessing the network parameters. Our proposed one-pixel signature demonstrates a substantial improvement (by around 30% in the absolute detection accuracy) over the existing competing methods for backdoored CNN detection/classification. One-pixel signature is a general representation that can be used to characterize CNN models beyond backdoor detection.

34. A Deep Dive into Adversarial Robustness in Zero-Shot Learning
Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Pinar Duygulu
Abstract: Machine learning (ML) systems have introduced significant advances in various fields, due to the introduction of highly complex models. Despite their success, it has been shown multiple times that machine learning models are prone to imperceptible perturbations that can severely degrade their accuracy. So far, existing studies have primarily focused on models where supervision across all classes was available. In contrast, Zero-shot Learning (ZSL) and Generalized Zero-shot Learning (GZSL) tasks inherently lack supervision across all classes. In this paper, we present a study aimed at evaluating the adversarial robustness of ZSL and GZSL models. We leverage the well-established label embedding model and subject it to a set of established adversarial attacks and defenses across multiple datasets. In addition to creating possibly the first benchmark on adversarial robustness of ZSL models, we also present analyses on important points that require attention for better interpretation of ZSL robustness results. We hope these points, along with the benchmark, will help researchers establish a better understanding of what challenges lie ahead and help guide their work.

35. Lazy caterer jigsaw puzzles: Models, properties, and a mechanical system-based solver
Peleg Harel, Ohad Ben-Shahar
Abstract: Jigsaw puzzle solving, the problem of constructing a coherent whole from a set of non-overlapping unordered fragments, is fundamental to numerous applications, and yet most of the literature has focused thus far on less realistic puzzles whose pieces are identical squares. Here we formalize a new type of jigsaw puzzle where the pieces are general convex polygons generated by cutting through a global polygonal shape with an arbitrary number of straight cuts, a generation model inspired by the celebrated Lazy caterer's sequence. We analyze the theoretical properties of such puzzles, including the inherent challenges in solving them once pieces are contaminated with geometrical noise. To cope with such difficulties and obtain tractable solutions, we abstract the problem as a multi-body spring-mass dynamical system endowed with hierarchical loop constraints and a layered reconstruction process. We define evaluation metrics and present experimental results to indicate that such puzzles are solvable completely automatically.
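For reference, the lazy caterer's sequence mentioned in the abstract above gives the maximum number of pieces obtainable from a convex shape with n straight cuts, p(n) = n(n + 1)/2 + 1. The short sketch below only evaluates this closed form; it says nothing about how the paper samples its cuts, and the function name is chosen here for illustration.

```python
def lazy_caterer(cuts):
    """Maximum number of pieces produced by `cuts` straight cuts."""
    if cuts < 0:
        raise ValueError("number of cuts must be non-negative")
    return cuts * (cuts + 1) // 2 + 1


# Upper bound on the number of puzzle pieces for 0 to 5 cuts.
max_pieces = [lazy_caterer(n) for n in range(6)]
print(max_pieces)
```

A puzzle generated with n arbitrary cuts has at most this many pieces; cuts that are parallel, or that meet at a common point, yield fewer.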

36. Learning Graph Edit Distance by Graph Neural Networks
Pau Riba, Andreas Fischer, Josep Lladós, Alicia Fornés
Abstract: The emergence of geometric deep learning as a novel framework to deal with graph-based representations has displaced traditional approaches in favor of completely new methodologies. In this paper, we propose a new framework able to combine the advances in deep metric learning with traditional approximations of the graph edit distance. Hence, we propose an efficient graph distance based on the novel field of geometric deep learning. Our method employs a message passing neural network to capture the graph structure and thus leverages this information for use in a distance computation. The performance of the proposed graph distance is validated in two different scenarios. On the one hand, in graph retrieval of handwritten words, i.e. keyword spotting, it shows superior performance when compared with (approximate) graph edit distance benchmarks. On the other hand, it demonstrates competitive results for graph similarity learning when compared with the current state of the art on a recent benchmark dataset.

37. A Smartphone-based System for Real-time Early Childhood Caries Diagnosis
Yipeng Zhang, Haofu Liao, Jin Xiao, Nisreen Al Jallad, Oriana Ly-Mapes, Jiebo Luo
Abstract: Early childhood caries (ECC) is the most common, yet preventable chronic disease in children under the age of 6. Treatments on severe ECC are extremely expensive and unaffordable for socioeconomically disadvantaged families. The identification of ECC in an early stage usually requires expertise in the field, and hence is often ignored by parents. Therefore, early prevention strategies and easy-to-adopt diagnosis techniques are desired. In this study, we propose a multistage deep learning-based system for cavity detection. We create a dataset containing RGB oral images labeled manually by dental practitioners. We then investigate the effectiveness of different deep learning models on the dataset. Furthermore, we integrate the deep learning system into an easy-to-use mobile application that can diagnose ECC from an early stage and provide real-time results to untrained users.

38. Polyth-Net: Classification of Polythene Bags for Garbage Segregation Using Deep Learning
Divyansh Singh
Abstract: Polythene has been a threat to the environment ever since its invention. It is non-biodegradable and very difficult to recycle. Even after many awareness campaigns and practices, separation of polythene bags from waste has been a challenge for human civilization. The primary method of segregation deployed is manual handpicking, which poses dangerous health hazards to the workers and is also highly inefficient due to human error. In this paper I design and study image-based classification of polythene bags using a deep-learning model, and examine its efficiency. This paper focuses on the architecture and statistical analysis of its performance on the data set as well as problems experienced in the classification. It also suggests a modified loss function to specifically detect polythene irrespective of its individual features. It aims to help current environmental protection endeavours and save the countless lives lost to the hazards caused by current methods.

39. TactileSGNet: A Spiking Graph Neural Network for Event-based Tactile Object Recognition
Fuqiang Gu, Weicong Sng, Tasbolat Taunyazov, Harold Soh
Abstract: Tactile perception is crucial for a variety of robot tasks including grasping and in-hand manipulation. New advances in flexible, event-driven, electronic skins may soon endow robots with touch perception capabilities similar to humans. These electronic skins respond asynchronously to changes (e.g., in pressure, temperature), and can be laid out irregularly on the robot's body or end-effector. However, these unique features may render current deep learning approaches such as convolutional feature extractors unsuitable for tactile learning. In this paper, we propose a novel spiking graph neural network for event-based tactile object recognition. To make use of local connectivity of taxels, we present several methods for organizing the tactile data in a graph structure. Based on the constructed graphs, we develop a spiking graph convolutional network. The event-driven nature of spiking neural network makes it arguably more suitable for processing the event-based data. Experimental results on two tactile datasets show that the proposed method outperforms other state-of-the-art spiking methods, achieving high accuracies of approximately 90% when classifying a variety of different household objects.

40. Self-supervised Denoising via Diffeomorphic Template Estimation: Application to Optical Coherence Tomography
Guillaume Gisbert, Neel Dey, Hiroshi Ishikawa, Joel Schuman, James Fishbaugh, Guido Gerig
Abstract: Optical Coherence Tomography (OCT) is pervasive in both the research and clinical practice of Ophthalmology. However, OCT images are strongly corrupted by noise, limiting their interpretation. Current OCT denoisers leverage assumptions on noise distributions or generate targets for training deep supervised denoisers via averaging of repeat acquisitions. However, recent self-supervised advances allow the training of deep denoising networks using only repeat acquisitions without clean targets as ground truth, reducing the burden of supervised learning. Despite the clear advantages of self-supervised methods, their use is precluded as OCT shows strong structural deformations even between sequential scans of the same subject due to involuntary eye motion. Further, direct nonlinear alignment of repeats induces correlation of the noise between images. In this paper, we propose a joint diffeomorphic template estimation and denoising framework which enables the use of self-supervised denoising for motion deformed repeat acquisitions, without empirically registering their noise realizations. Strong qualitative and quantitative improvements are achieved in denoising OCT images, with generic utility in any imaging modality amenable to multiple exposures.

41. Offloading Optimization in Edge Computing for Deep Learning Enabled Target Tracking by Internet-of-UAVs
Bo Yang, Xuelin Cao, Chau Yuen, Lijun Qian
Abstract: The empowering unmanned aerial vehicles (UAVs) have been extensively used in providing intelligence such as target tracking. In our field experiments, a pre-trained convolutional neural network (CNN) is deployed at the UAV to identify a target (a vehicle) from the captured video frames and enable the UAV to keep tracking. However, this kind of visual target tracking demands a lot of computational resources due to the desired high inference accuracy and stringent delay requirement. This motivates us to consider offloading this type of deep learning (DL) tasks to a mobile edge computing (MEC) server due to limited computational resource and energy budget of the UAV, and further improve the inference accuracy. Specifically, we propose a novel hierarchical DL tasks distribution framework, where the UAV is embedded with lower layers of the pre-trained CNN model, while the MEC server with rich computing resources will handle the higher layers of the CNN model. An optimization problem is formulated to minimize the weighted-sum cost including the tracking delay and energy consumption introduced by communication and computing of the UAVs, while taking into account the quality of data (e.g., video frames) input to the DL model and the inference errors. Analytical results are obtained and insights are provided to understand the tradeoff between the weighted-sum cost and inference error rate in the proposed framework. Numerical results demonstrate the effectiveness of the proposed offloading framework.

42. Comparison of Convolutional neural network training parameters for detecting Alzheimers disease and effect on visualization
Arjun Haridas Pallath, Martin Dyrba
Abstract: Convolutional neural networks (CNN) have become a powerful tool for detecting patterns in image data. Recent papers report promising results in the domain of disease detection using brain MRI data. Despite the high accuracy obtained from CNN models for MRI data so far, almost no papers provided information on the features or image regions driving this accuracy as adequate methods were missing or challenging to apply. Recently, the toolbox iNNvestigate has become available, implementing various state of the art methods for deep learning visualizations. Currently, there is a great demand for a comparison of visualization algorithms to provide an overview of the practical usefulness and capability of these algorithms. Therefore, this thesis has two goals: 1. To systematically evaluate the influence of CNN hyper-parameters on model accuracy. 2. To compare various visualization methods with respect to the quality (i.e. randomness/focus, soundness).

43. Grading Loss: A Fracture Grade-based Metric Loss for Vertebral Fracture Detection
Malek Husseini, Anjany Sekuboyina, Maximilian Loeffler, Fernando Navarro, Bjoern H. Menze, Jan S. Kirschke
Abstract: Osteoporotic vertebral fractures have a severe impact on patients' overall well-being but are severely under-diagnosed. These fractures present themselves at various levels of severity, measured using Genant's grading scale. Insufficient annotated datasets, severe data imbalance, and minor differences in appearance between fractured and healthy vertebrae cause naive classification approaches to yield poor discriminatory performance. Addressing this, we propose a representation learning-inspired approach for automated vertebral fracture detection, aimed at learning latent representations efficient for fracture detection. Building on state-of-the-art metric losses, we present a novel Grading Loss for learning representations that respect Genant's fracture grading scheme. On a publicly available spine dataset, the proposed loss function achieves a fracture detection F1 score of 81.5%, a 10% increase over a naive classification baseline.

44. ReLMoGen: Leveraging Motion Generation in Reinforcement Learning for Mobile Manipulation
Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, Silvio Savarese
Abstract: Many Reinforcement Learning (RL) approaches use joint control signals (positions, velocities, torques) as action space for continuous control tasks. We propose to lift the action space to a higher level in the form of subgoals for a motion generator (a combination of motion planner and trajectory executor). We argue that, by lifting the action space and by leveraging sampling-based motion planners, we can efficiently use RL to solve complex, long-horizon tasks that could not be solved with existing RL methods in the original action space. We propose ReLMoGen -- a framework that combines a learned policy to predict subgoals and a motion generator to plan and execute the motion needed to reach these subgoals. To validate our method, we apply ReLMoGen to two types of tasks: 1) Interactive Navigation tasks, navigation problems where interactions with the environment are required to reach the destination, and 2) Mobile Manipulation tasks, manipulation tasks that require moving the robot base. These problems are challenging because they are usually long-horizon, hard to explore during training, and comprise alternating phases of navigation and interaction. Our method is benchmarked on a diverse set of seven robotics tasks in photo-realistic simulation environments. In all settings, ReLMoGen outperforms state-of-the-art Reinforcement Learning and Hierarchical Reinforcement Learning baselines. ReLMoGen also shows outstanding transferability between different motion generators at test time, indicating a great potential to transfer to real robots.

45. Fully automated deep learning based segmentation of normal, infarcted and edema regions from multiple cardiac MRI sequences
Xiaoran Zhang, Michelle Noga, Kumaradevan Punithakumar
Abstract: Myocardial characterization is essential for patients with myocardial infarction and other myocardial diseases, and the assessment is often performed using cardiac magnetic resonance (CMR) sequences. In this study, we propose a fully automated approach using deep convolutional neural networks (CNN) for cardiac pathology segmentation, including left ventricular (LV) blood pool, right ventricular blood pool, LV normal myocardium, LV myocardial edema (ME) and LV myocardial scars (MS). The input to the network consists of three CMR sequences, namely, late gadolinium enhancement (LGE), T2 and balanced steady state free precession (bSSFP). The proposed approach uses the data provided by the MyoPS challenge hosted by MICCAI 2020 in conjunction with STACOM. The training set for the CNN model consists of images acquired from 25 cases, and the gold standard labels are provided by trained raters and validated by radiologists. The proposed approach introduces a data augmentation module, a linear encoder and decoder module, and a network module to increase the number of training samples and improve the prediction accuracy for LV ME and MS. The proposed approach is evaluated by the challenge organizers on a test set of 20 cases and achieves a mean Dice score of $46.8\%$ for LV MS and $55.7\%$ for LV ME+MS.

46. UDC 2020 Challenge on Image Restoration of Under-Display Camera: Methods and Results
Yuqian Zhou, Michael Kwan, Kyle Tolentino, Neil Emerton, Sehoon Lim, Tim Large, Lijiang Fu, Zhihong Pan, Baopu Li, Qirui Yang, Yihao Liu, Jigang Tang, Tao Ku, Shibin Ma, Bingnan Hu, Jiarong Wang, Densen Puthussery, Hrishikesh P S, Melvin Kuriakose, Jiji C V, Varun Sundar, Sumanth Hegde, Divya Kothandaraman, Kaushik Mitra, Akashdeep Jassal, Nisarg A. Shah, Sabari Nathan, Nagat Abdalla Esiad Rahel, Dafan Chen, Shichao Nie, Shuting Yin, Chengconghui Ma, Haoran Wang, Tongtong Zhao, Shanshan Zhao, Joshua Rego, Huaijin Chen, Shuai Li, Zhenhua Hu, Kin Wai Lau, Lai-Man Po, Dahai Yu, Yasar Abbas Ur Rehman, Yiqun Li, Lianping Xing
Abstract: This paper is the report of the first Under-Display Camera (UDC) image restoration challenge, held in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly collected Under-Display Camera database. The challenge tracks correspond to two types of display: a 4K Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). About 150 teams registered for the challenge, and eight and nine teams, respectively, submitted results for the two tracks during the testing phase. The results in the paper represent state-of-the-art performance in Under-Display Camera restoration. Datasets and paper are available at this https URL.

47. REFORM: Recognizing F-formations for Social Robots
Hooman Hedayati, Annika Muehlbradt, Daniel J. Szafir, Sean Andrist
Abstract: Recognizing and understanding conversational groups, or F-formations, is a critical task for situated agents designed to interact with humans. F-formations contain complex structures and dynamics, yet are used intuitively by people in everyday face-to-face conversations. Prior research exploring ways of identifying F-formations has largely relied on heuristic algorithms that may not capture the rich dynamic behaviors employed by humans. We introduce REFORM (REcognize F-FORmations with Machine learning), a data-driven approach for detecting F-formations given human and agent positions and orientations. REFORM decomposes the scene into all possible pairs and then reconstructs F-formations with a voting-based scheme. We evaluated our approach across three datasets: the SALSA dataset, a newly collected human-only dataset, and a new set of acted human-robot scenarios, and found that REFORM yielded improved accuracy over a state-of-the-art F-formation detection algorithm. We also introduce symmetry and tightness as quantitative measures to characterize F-formations. Supplementary video: this https URL , Dataset available at: this http URL

48. Inverse Distance Aggregation for Federated Learning with Non-IID Data
Yousef Yeganeh, Azade Farshad, Nassir Navab, Shadi Albarqouni
Abstract: Federated learning (FL) has been a promising approach in the field of medical imaging in recent years. A critical problem in FL, specifically in medical scenarios, is to obtain a more accurate shared model that is robust to noisy and out-of-distribution clients. In this work, we tackle the problem of statistical heterogeneity in data for FL, which is highly plausible in medical data where, for example, the data come from different sites with different scanner settings. We propose IDA (Inverse Distance Aggregation), a novel adaptive weighting approach for clients based on meta-information, which handles unbalanced and non-iid data. We extensively analyze and evaluate our method against the well-known FL approach, Federated Averaging, as a baseline.
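
For orientation, the Federated Averaging baseline named in the abstract can be sketched as a sample-count-weighted mean of client parameters. This is only an illustrative sketch of that baseline, not the paper's IDA method; the function name `federated_average` and the flat-list parameter layout are assumptions made for this example.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each client by its sample count.

    client_weights: list of equal-length lists of floats (hypothetical flat layout).
    client_sizes:   number of local training samples per client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        # Each client contributes in proportion to its share of the data.
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


if __name__ == "__main__":
    # Two clients; the second holds three times as much data as the first.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

An adaptive scheme such as the one the abstract proposes would replace the `size / total` factor with a different per-client weight; the abstract does not give that formula, so it is not reproduced here.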

49. A Deep Network for Joint Registration and Reconstruction of Images with Pathologies
Xu Han, Zhengyang Shen, Zhenlin Xu, Spyridon Bakas, Hamed Akbari, Michel Bilello, Christos Davatzikos, Marc Niethammer
Abstract: Registration of images with pathologies is challenging due to tissue appearance changes and missing correspondences caused by the pathologies. Moreover, mass effects as observed for brain tumors may displace tissue, creating larger deformations over time than what is observed in a healthy brain. Deep learning models have successfully been applied to image registration to offer dramatic speed up and to use surrogate information (e.g., segmentations) during training. However, existing approaches focus on learning registration models using images from healthy patients. They are therefore not designed for the registration of images with strong pathologies for example in the context of brain tumors, and traumatic brain injuries. In this work, we explore a deep learning approach to register images with brain tumors to an atlas. Our model learns an appearance mapping from images with tumors to the atlas, while simultaneously predicting the transformation to atlas space. Using separate decoders, the network disentangles the tumor mass effect from the reconstruction of quasi-normal images. Results on both synthetic and real brain tumor scans show that our approach outperforms cost function masking for registration to the atlas and that reconstructed quasi-normal images can be used for better longitudinal registrations.

50. Uncertainty Quantification using Variational Inference for Biomedical Image Segmentation
Abhinav Sagar
Abstract: Deep learning motivated by convolutional neural networks has been highly successful in a range of medical imaging problems such as image classification, image segmentation, and image synthesis. However, for validation and interpretability, we need not only the predictions made by the model but also how confident it is in those predictions. This is important for acceptance in safety-critical applications. In this work, we used an encoder-decoder architecture based on variational inference techniques for segmenting brain tumour images. We compare different backbone architectures, such as U-Net, V-Net and FCN, as sampling data from the conditional distribution for the encoder. We evaluate our work on the publicly available BRATS dataset using Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) as the evaluation metrics. Our model outperforms previous state-of-the-art results while making use of uncertainty quantification in a principled Bayesian manner.
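
The two evaluation metrics named in the abstract have standard definitions for binary masks: DSC = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. A minimal sketch follows, assuming masks flattened to 0/1 sequences; the function name `dice_and_iou` and the convention that two empty masks score 1.0 are assumptions for this example, not taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")

    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - intersection

    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    # One overlapping pixel out of two predicted and two true pixels.
    print(dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because Dice = 2·IoU / (1 + IoU) for any pair of masks, the two metrics always rank a pair of predictions on the same image in the same order.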

51. Anatomy-Aware Cardiac Motion Estimation
Pingjun Chen, Xiao Chen, Eric Z. Chen, Hanchao Yu, Terrence Chen, Shanhui Sun
Abstract: Cardiac motion estimation is critical to the assessment of cardiac function. Myocardium feature tracking (FT) can directly estimate cardiac motion from cine MRI, which requires no special scanning procedure. However, current deep learning-based FT methods may result in unrealistic myocardium shapes since the learning is solely guided by image intensities without considering anatomy. On the other hand, motion estimation through learning is challenging because ground-truth motion fields are almost impossible to obtain. In this study, we propose a novel Anatomy-Aware Tracker (AATracker) for cardiac motion estimation that preserves anatomy by weak supervision. A convolutional variational autoencoder (VAE) is trained to encapsulate realistic myocardium shapes. A baseline dense motion tracker is trained to approximate the motion fields and then refined to estimate anatomy-aware motion fields under the weak supervision from the VAE. We evaluate the proposed method on long-axis cardiac cine MRI, which has more complex myocardium appearances and motions than short-axis. Compared with other methods, AATracker significantly improves the tracking performance and provides visually more realistic tracking results, demonstrating the effectiveness of the proposed weak-supervision scheme in cardiac motion estimation.

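Note: the Chinese abstracts in the original post are machine translations. The cover image is a word cloud of the paper titles.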


[arXiv papers] Computer Vision and Pattern Recognition 2020-08-19
