
[arXiv Papers] Computer Vision and Pattern Recognition 2020-05-04

Contents

1. An Efficient Integration of Disentangled Attended Expression and Identity Features For Facial Expression Transfer and Synthesis [PDF] Abstract
2. Computing the Testing Error without a Testing Set [PDF] Abstract
3. Investigating Class-level Difficulty Factors in Multi-label Classification Problems [PDF] Abstract
4. Aggregation and Finetuning for Clothes Landmark Detection [PDF] Abstract
5. MOPS-Net: A Matrix Optimization-driven Network for Task-Oriented 3D Point Cloud Downsampling [PDF] Abstract
6. A Comprehensive Study on Visual Explanations for Spatio-temporal Networks [PDF] Abstract
7. Generative Adversarial Data Programming [PDF] Abstract
8. M^3VSNet: Unsupervised Multi-metric Multi-view Stereo Network [PDF] Abstract
9. Survey on Reliable Deep Learning-Based Person Re-Identification Models: Are We There Yet? [PDF] Abstract
10. The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [PDF] Abstract
11. Adversarial Synthesis of Human Pose from Text [PDF] Abstract
12. Diverse Visuo-Lingustic Question Answering (DVLQA) Challenge [PDF] Abstract
13. ACCL: Adversarial constrained-CNN loss for weakly supervised medical image segmentation [PDF] Abstract
14. PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-resolution [PDF] Abstract
15. Multi-Camera Trajectory Forecasting: Pedestrian Trajectory Prediction in a Network of Cameras [PDF] Abstract
16. Recognizing American Sign Language Nonmanual Signal Grammar Errors in Continuous Videos [PDF] Abstract
17. Deepfake Forensics Using Recurrent Neural Networks [PDF] Abstract
18. Deeply Cascaded U-Net for Multi-Task Image Processing [PDF] Abstract
19. The AVA-Kinetics Localized Human Actions Video Dataset [PDF] Abstract
20. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [PDF] Abstract
21. Sequence Information Channel Concatenation for Improving Camera Trap Image Burst Classification [PDF] Abstract
22. Domain Siamese CNNs for Sparse Multispectral Disparity Estimation [PDF] Abstract
23. Importance Driven Continual Learning for Segmentation Across Domains [PDF] Abstract
24. Occlusion resistant learning of intuitive physics from videos [PDF] Abstract
25. CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs [PDF] Abstract
26. HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do [PDF] Abstract
27. A Naturalness Evaluation Database for Video Prediction Models [PDF] Abstract
28. On-board Deep-learning-based Unmanned Aerial Vehicle Fault Cause Detection and Identification [PDF] Abstract
29. Defocus Deblurring Using Dual-Pixel Data [PDF] Abstract
30. Distilling Spikes: Knowledge Distillation in Spiking Neural Networks [PDF] Abstract
31. Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage [PDF] Abstract
32. Conceptual Design of Human-Drone Communication in Collaborative Environments [PDF] Abstract
33. Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness [PDF] Abstract
34. Unsupervised Lesion Detection via Image Restoration with a Normative Prior [PDF] Abstract

Abstracts

1. An Efficient Integration of Disentangled Attended Expression and Identity Features For Facial Expression Transfer and Synthesis [PDF] Back to Contents
  Kamran Ali, Charles E. Hughes
Abstract: In this paper, we present an Attention-based Identity Preserving Generative Adversarial Network (AIP-GAN) to overcome the identity leakage problem from a source image to a generated face image, an issue that is encountered in a cross-subject facial expression transfer and synthesis process. Our key insight is that the identity preserving network should be able to disentangle and compose shape, appearance, and expression information for efficient facial expression transfer and synthesis. Specifically, the expression encoder of our AIP-GAN disentangles the expression information from the input source image by predicting its facial landmarks using our supervised spatial and channel-wise attention module. Similarly, the disentangled expression-agnostic identity features are extracted from the input target image by inferring its combined intrinsic-shape and appearance image employing our self-supervised spatial and channel-wise attention module. To leverage the expression and identity information encoded by the intermediate layers of both of our encoders, we combine these features with the features learned by the intermediate layers of our decoder using a cross-encoder bilinear pooling operation. Experimental results show the promising performance of our AIP-GAN based technique.
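The "cross-encoder bilinear pooling" mentioned above can be illustrated with a small sketch. This is only an illustrative, uncompressed outer-product pooling of two feature maps; the tensor shapes, the function name, and the sign-sqrt/L2 normalisation step are assumptions, not the authors' exact operation.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Fuse two feature maps (B, C1, H, W) and (B, C2, H, W) by an outer
    product per spatial location, averaged over space -> (B, C1 * C2)."""
    B, C1, H, W = feat_a.shape
    C2 = feat_b.shape[1]
    a = feat_a.reshape(B, C1, H * W)                      # (B, C1, HW)
    b = feat_b.reshape(B, C2, H * W)                      # (B, C2, HW)
    fused = torch.bmm(a, b.transpose(1, 2)) / (H * W)     # (B, C1, C2)
    fused = fused.reshape(B, C1 * C2)
    # signed square-root + L2 normalisation (a common stabilisation, assumed here)
    fused = torch.sign(fused) * torch.sqrt(fused.abs() + 1e-8)
    return F.normalize(fused, dim=1)

# e.g. fuse expression-encoder and identity-encoder features of matching size
print(bilinear_pool(torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)).shape)
```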

2. Computing the Testing Error without a Testing Set [PDF] Back to Contents
  Ciprian Corneanu, Meysam Madadi, Sergio Escalera, Aleix Martinez
Abstract: Deep Neural Networks (DNNs) have revolutionized computer vision. We now have DNNs that achieve top (performance) results in many problems, including object recognition, facial expression analysis, and semantic segmentation, to name but a few. The design of the DNNs that achieve top results is, however, non-trivial and mostly done by trial-and-error. That is, typically, researchers will derive many DNN architectures (i.e., topologies) and then test them on multiple datasets. However, there are no guarantees that the selected DNN will perform well in the real world. One can use a testing set to estimate the performance gap between the training and testing sets, but avoiding overfitting-to-the-testing-data is almost impossible. Using a sequestered testing dataset may address this problem, but this requires a constant update of the dataset, a very expensive venture. Here, we derive an algorithm to estimate the performance gap between training and testing that does not require any testing dataset. Specifically, we derive a number of persistent topology measures that identify when a DNN is learning to generalize to unseen samples. This allows us to compute the DNN's testing error on unseen samples, even when we do not have access to them. We provide extensive experimental validation on multiple networks and datasets to demonstrate the feasibility of the proposed approach.

3. Investigating Class-level Difficulty Factors in Multi-label Classification Problems [PDF] Back to Contents
  Mark Marsden, Kevin McGuinness, Joseph Antony, Haolin Wei, Milan Redzic, Jian Tang, Zhilan Hu, Alan Smeaton, Noel E O'Connor
Abstract: This work investigates the use of class-level difficulty factors in multi-label classification problems for the first time. Four class-level difficulty factors are proposed: frequency, visual variation, semantic abstraction, and class co-occurrence. Once computed for a given multi-label classification dataset, these difficulty factors are shown to have several potential applications including the prediction of class-level performance across datasets and the improvement of predictive performance through difficulty weighted optimisation. Significant improvements to mAP and AUC performance are observed for two challenging multi-label datasets (WWW Crowd and Visual Genome) with the inclusion of difficulty weighted optimisation. The proposed technique does not require any additional computational complexity during training or inference and can be extended over time with inclusion of other class-level difficulty factors.
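A minimal sketch of what "difficulty weighted optimisation" could look like for a multi-label classifier: per-class weights derived from the four difficulty factors scale a binary cross-entropy loss. The normalisation rule and variable names below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def difficulty_weighted_bce(logits, targets, class_difficulty):
    """logits, targets: (B, num_classes); class_difficulty: (num_classes,)
    with larger values for harder classes (rare, visually varied, abstract, ...)."""
    weights = class_difficulty / class_difficulty.mean()   # keep the overall loss scale
    per_class = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (per_class * weights).mean()

logits = torch.randn(4, 80)
targets = torch.randint(0, 2, (4, 80)).float()
difficulty = torch.rand(80) + 0.5   # assumed combination of the four difficulty factors
loss = difficulty_weighted_bce(logits, targets, difficulty)
```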

4. Aggregation and Finetuning for Clothes Landmark Detection [PDF] Back to Contents
  Tzu-Heng Lin
Abstract: Landmark detection for clothes is a fundamental problem for many applications. In this paper, a new training scheme for clothes landmark detection: $\textit{Aggregation and Finetuning}$, is proposed. We investigate the homogeneity among landmarks of different categories of clothes, and utilize it to design the procedure of training. Extensive experiments show that our method outperforms current state-of-the-art methods by a large margin. Our method also won the 1st place in the DeepFashion2 Challenge 2020 - Clothes Landmark Estimation Track with an AP of 0.590 on the test set, and 0.615 on the validation set. Code will be publicly available at this https URL .

5. MOPS-Net: A Matrix Optimization-driven Network for Task-Oriented 3D Point Cloud Downsampling [PDF] Back to Contents
  Yue Qian, Junhui Hou, Yiming Zeng, Qijian Zhang, Sam Kwong, Ying He
Abstract: Downsampling is a commonly-used technique in 3D point cloud processing for saving storage space, transmission bandwidth and computational complexity. The classic downsampling methods, such as farthest point sampling and Poisson disk sampling, though widely used for decades, are independent of the subsequent applications, since the downsampled point cloud may compromise their performance severely. This paper explores the problem of task-oriented downsampling, which aims to downsample a point cloud and maintain the performance of subsequent applications as much as possible. We propose MOPS-Net, a novel end-to-end deep neural network which is designed from the perspective of matrix optimization, making it fundamentally different from the existing deep learning-based methods. Due to its discrete and combinatorial nature, it is difficult to solve the downsampling problem directly. To tackle this challenge, we relax the binary constraint of each variable and lift 3D points to a higher dimensional feature space, in which a constrained and differentiable matrix optimization problem with an implicit objective function is formulated. We then propose a deep neural network architecture to mimic the matrix optimization problem by exploring both the local and the global structures of the input data. With a task network, MOPS-Net can be end-to-end trained. Moreover, it is permutation-invariant, making it robust to input data. We also extend MOPS-Net such that a single network after one-time training is capable of handling arbitrary downsampling ratios. Extensive experimental results show that MOPS-Net can achieve favorable performance against state-of-the-art deep learning-based methods over various tasks, including point cloud classification, retrieval and reconstruction.
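The relaxation of the binary selection variables into a differentiable sampling matrix can be sketched as follows. The softmax relaxation, the temperature, and the assumption that a network produces one score row per output point are illustrative choices, not MOPS-Net's exact formulation.

```python
import torch

def soft_downsample(points: torch.Tensor, scores: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """points: (B, N, 3) input cloud; scores: (B, M, N) unnormalised selection
    scores (one row per output point, produced by a network).
    Returns (B, M, 3): each output point is a soft, differentiable combination
    of input points, i.e. a relaxed 0/1 selection matrix applied to the cloud."""
    S = torch.softmax(scores / temperature, dim=-1)   # rows sum to 1
    return torch.bmm(S, points)

B, N, M = 2, 1024, 64
sampled = soft_downsample(torch.randn(B, N, 3), torch.randn(B, M, N))
# `sampled` stays differentiable, so a downstream task loss (classification,
# retrieval, reconstruction) can be back-propagated through the sampling step.
```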

6. A Comprehensive Study on Visual Explanations for Spatio-temporal Networks [PDF] Back to Contents
  Zhenqiang Li, Weimin Wang, Zuoyue Li, Yifei Huang, Yoichi Sato
Abstract: Identifying and visualizing regions that are significant for a given deep neural network model, i.e., attribution methods, is still a vital but challenging task, especially for spatio-temporal networks that process videos as input. Albeit some methods that have been proposed for video attribution, it is yet to be studied what types of network structures each video attribution method is suitable for. In this paper, we provide a comprehensive study of the existing video attribution methods of two categories, gradient-based and perturbation-based, for visual explanation of neural networks that take videos as the input (spatio-temporal networks). To perform this study, we extended a perturbation-based attribution method from 2D (images) to 3D (videos) and validated its effectiveness by mathematical analysis and experiments. For a more comprehensive analysis of existing video attribution methods, we introduce objective metrics that are complementary to existing subjective ones. Our experimental results indicate that attribution methods tend to show opposite performances on objective and subjective metrics.

7. Generative Adversarial Data Programming [PDF] Back to Contents
  Arghya Pal, Vineeth N Balasubramanian
Abstract: The paucity of large curated hand-labeled training data forms a major bottleneck in the deployment of machine learning models in computer vision and other fields. Recent work (Data Programming) has shown how distant supervision signals in the form of labeling functions can be used to obtain labels for given data in near-constant time. In this work, we present Adversarial Data Programming (ADP), which presents an adversarial methodology to generate data as well as a curated aggregated label, given a set of weak labeling functions. More interestingly, such labeling functions are often easily generalizable, thus allowing our framework to be extended to different setups, including self-supervised labeled image generation, zero-shot text to labeled image generation, transfer learning, and multi-task learning.

8. M^3VSNet: Unsupervised Multi-metric Multi-view Stereo Network [PDF] Back to Contents
  Baichuan Huang, Can Huang, Yijia He, Jingbin Liu, Xiao Liu
Abstract: The present MVS methods with deep learning have an impressive performance than traditional MVS methods. However, the learning-based networks need lots of ground-truth 3D training data, which is not always easy to be available. To relieve the expensive costs, we propose an unsupervised normal-aided multi-metric network, named M^3VSNet, for multi-view stereo reconstruction without ground-truth 3D training data. Our network puts forward: (a) Pyramid feature aggregation to extract more contextual information; (b) Normal-depth consistency to make estimated depth maps more reasonable and precise in the real 3D world; (c) The multi-metric combination of pixel-wise and feature-wise loss function to learn the inherent constraint from the perspective of perception beyond the pixel value. The abundant experiments prove our M^3VSNet state of the arts in the DTU dataset with effective improvement. Without any finetuning, M^3VSNet ranks 1st among all unsupervised MVS network on the leaderboard of Tanks & Temples datasets until April 17, 2020. Our codebase is available at this https URL.
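The multi-metric idea, combining a pixel-wise photometric term with a feature-wise term, can be sketched as below. The VGG16 feature extractor, layer cut-off, and loss weights are assumptions for illustration; the paper's normal-depth consistency term is omitted.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# frozen feature extractor for the feature-wise term (an illustrative choice;
# uses the torchvision >= 0.13 weights API)
vgg = models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def multi_metric_loss(warped_src, ref, w_pixel=0.8, w_feat=0.2):
    """warped_src: source image warped into the reference view using the
    predicted depth; ref: reference image. Both (B, 3, H, W) in [0, 1]."""
    l_pixel = F.l1_loss(warped_src, ref)                          # pixel-wise term
    f_w, f_r = vgg(warped_src), vgg(ref)
    l_feat = (1.0 - F.cosine_similarity(f_w, f_r, dim=1)).mean()  # feature-wise term
    return w_pixel * l_pixel + w_feat * l_feat
```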

9. Survey on Reliable Deep Learning-Based Person Re-Identification Models: Are We There Yet? [PDF] Back to Contents
  Bahram Lavi, Ihsan Ullah, Mehdi Fatan, Anderson Rocha
Abstract: Intelligent video-surveillance (IVS) is currently an active research field in computer vision and machine learning and provides useful tools for surveillance operators and forensic video investigators. Person re-identification (PReID) is one of the most critical problems in IVS, and it consists of recognizing whether or not an individual has already been observed over a camera in a network. Solutions to PReID have myriad applications including retrieval of video-sequences showing an individual of interest or even pedestrian tracking over multiple camera views. Different techniques have been proposed to increase the performance of PReID in the literature, and more recently researchers utilized deep neural networks (DNNs) given their compelling performance on similar vision problems and fast execution at test time. Given the importance and wide range of applications of re-identification solutions, our objective herein is to discuss the work carried out in the area and come up with a survey of state-of-the-art DNN models being used for this task. We present descriptions of each model along with their evaluation on a set of benchmark datasets. Finally, we show a detailed comparison among these models, which are followed by some discussions on their limitations that can work as guidelines for future research.

10. The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines [PDF] Back to Contents
  Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
Abstract: Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording is started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos after recording, thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions, e.g. 'closing a tap' from 'opening' it up.

11. Adversarial Synthesis of Human Pose from Text [PDF] Back to Contents
  Yifei Zhang, Rania Briq, Julian Tanke, Juergen Gall
Abstract: This work introduces the novel task of human pose synthesis from text. In order to solve this task, we propose a model that is based on a conditional generative adversarial network. It is designed to generate 2D human poses conditioned on human-written text descriptions. The model is trained and evaluated using the COCO dataset, which consists of images capturing complex everyday scenes. We show through qualitative and quantitative results that the model is capable of synthesizing plausible poses matching the given text, indicating it is possible to generate poses that are consistent with the given semantic features, especially for actions with distinctive poses. We also show that the model outperforms a vanilla GAN.

12. Diverse Visuo-Lingustic Question Answering (DVLQA) Challenge [PDF] Back to Contents
  Shailaja Sampat, Yezhou Yang, Chitta Baral
Abstract: Existing question answering datasets mostly contain homogeneous contexts, based on either textual or visual information alone. On the other hand, digitalization has evolved the nature of reading which often includes integrating information across multiple heterogeneous sources. To bridge the gap between two, we compile a Diverse Visuo-Lingustic Question Answering (DVLQA) challenge corpus, where the task is to derive joint inference about the given image-text modality in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e. ignoring either of them would make the question unanswerable. We first explore the combination of best existing deep learning architectures for visual question answering and machine comprehension to solve DVLQA subsets and show that they are unable to reason well on the joint task. We then develop a modular method which demonstrates slightly better baseline performance and offers more transparency for interpretation of intermediate outputs. However, this is still far behind the human performance, therefore we believe DVLQA will be a challenging benchmark for question answering involving reasoning over visuo-linguistic context. The dataset, code and public leaderboard will be made available at this https URL.

13. ACCL: Adversarial constrained-CNN loss for weakly supervised medical image segmentation [PDF] Back to Contents
  Pengyi Zhang, Yunxin Zhong, Xiaoqiong Li
Abstract: We propose adversarial constrained-CNN loss, a new paradigm of constrained-CNN loss methods, for weakly supervised medical image segmentation. In the new paradigm, prior knowledge is encoded and depicted by reference masks, and is further employed to impose constraints on segmentation outputs through adversarial learning with reference masks. Unlike pseudo label methods for weakly supervised segmentation, such reference masks are used to train a discriminator rather than a segmentation network, and thus are not required to be paired with specific images. Our new paradigm not only greatly facilitates imposing prior knowledge on network's outputs, but also provides stronger and higher-order constraints, i.e., distribution approximation, through adversarial learning. Extensive experiments involving different medical modalities, different anatomical structures, different topologies of the object of interest, different levels of prior knowledge and weakly supervised annotations with different annotation ratios is conducted to evaluate our ACCL method. Consistently superior segmentation results over the size constrained-CNN loss method have been achieved, some of which are close to the results of full supervision, thus fully verifying the effectiveness and generalization of our method. Specifically, we report an average Dice score of 75.4% with an average annotation ratio of 0.65%, surpassing the prior art, i.e., the size constrained-CNN loss method, by a large margin of 11.4%. Our codes are made publicly available at this https URL.
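An illustrative training step for the adversarial constraint: a discriminator learns to tell reference masks (encoding prior shape knowledge, not paired with the images) from the segmentation network's outputs, and the segmentation network is pushed to fool it on top of its weak supervision. The network definitions, the weak-label encoding (ignore_index for unannotated pixels), and the loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def accl_step(seg_net, disc, images, weak_labels, ref_masks,
              opt_seg, opt_disc, lambda_adv=0.1):
    """images: (B, C, H, W); weak_labels: (B, H, W) with -100 on unannotated
    pixels; ref_masks: (B, K, H, W) reference masks, NOT paired with `images`."""
    # --- discriminator: reference masks are "real", current predictions "fake"
    with torch.no_grad():
        fake = torch.softmax(seg_net(images), dim=1)
    d_real, d_fake = disc(ref_masks), disc(fake)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_disc.zero_grad(); loss_d.backward(); opt_disc.step()

    # --- segmentation network: weak supervision + adversarial (constraint) term
    logits = seg_net(images)
    loss_weak = F.cross_entropy(logits, weak_labels, ignore_index=-100)
    d_out = disc(torch.softmax(logits, dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    loss_g = loss_weak + lambda_adv * loss_adv
    opt_seg.zero_grad(); loss_g.backward(); opt_seg.step()
    return loss_d.item(), loss_g.item()
```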

14. PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-resolution [PDF] Back to Contents
  Hao Dou, Chen Chen, Xiyuan Hu, Zhisen Hu, Silong Peng
Abstract: Generative Adversarial Networks (GAN) have been employed for face super resolution but they bring distorted facial details easily and still have weakness on recovering realistic texture. To further improve the performance of GAN based models on super-resolving face images, we propose PCA-SRGAN which pays attention to the cumulative discrimination in the orthogonal projection space spanned by PCA projection matrix of face data. By feeding the principal component projections ranging from structure to details into the discriminator, the discrimination difficulty will be greatly alleviated and the generator can be enhanced to reconstruct clearer contour and finer texture, helpful to achieve the high perception and low distortion eventually. This incremental orthogonal projection discrimination has ensured a precise optimization procedure from coarse to fine and avoids the dependence on the perceptual regularization. We conduct experiments on CelebA and FFHQ face datasets. The qualitative visual effect and quantitative evaluation have demonstrated the overwhelming performance of our model over related works.
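The incremental orthogonal projection can be sketched as projecting images onto a growing number of principal components of the face data before they reach the discriminator. The projection math below is standard PCA reconstruction; the linear schedule for k is an illustrative assumption.

```python
import torch

def pca_project(x: torch.Tensor, mean: torch.Tensor, basis: torch.Tensor,
                k: int) -> torch.Tensor:
    """x: (B, D) flattened face images; mean: (D,); basis: (D, D) orthonormal
    principal components (columns sorted by eigenvalue). Returns the images
    reconstructed from their first k components: coarse structure for small k,
    fine texture as k grows."""
    V = basis[:, :k]
    return mean + (x - mean) @ V @ V.t()

def k_schedule(step: int, total_steps: int, d: int) -> int:
    """Illustrative schedule: the discriminator first sees few components and
    progressively more as training proceeds."""
    return max(1, int(d * min(1.0, step / total_steps)))

# both the real HR face and the generator output are projected with the same k
# before being fed to the discriminator, e.g. disc(pca_project(fake, mean, V, k)).
```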

15. Multi-Camera Trajectory Forecasting: Pedestrian Trajectory Prediction in a Network of Cameras [PDF] Back to Contents
  Olly Styles, Tanaya Guha, Victor Sanchez, Alex Kot
Abstract: We introduce the task of multi-camera trajectory forecasting (MCTF), where the future trajectory of an object is predicted in a network of cameras. Prior works consider forecasting trajectories in a single camera view. Our work is the first to consider the challenging scenario of forecasting across multiple non-overlapping camera views. This has wide applicability in tasks such as re-identification and multi-target multi-camera tracking. To facilitate research in this new area, we release the Warwick-NTU Multi-camera Forecasting Database (WNMF), a unique dataset of multi-camera pedestrian trajectories from a network of 15 synchronized cameras. To accurately label this large dataset (600 hours of video footage), we also develop a semi-automated annotation method. An effective MCTF model should proactively anticipate where and when a person will re-appear in the camera network. In this paper, we consider the task of predicting the next camera a pedestrian will re-appear after leaving the view of another camera, and present several baseline approaches for this. The labeled database is available online: this https URL.

16. Recognizing American Sign Language Nonmanual Signal Grammar Errors in Continuous Videos [PDF] Back to Contents
  Elahe Vahdani, Longlong Jing, Yingli Tian, Matt Huenerfauth
Abstract: As part of the development of an educational tool that can help students achieve fluency in American Sign Language (ASL) through independent and interactive practice with immediate feedback, this paper introduces a near real-time system to recognize grammatical errors in continuous signing videos without necessarily identifying the entire sequence of signs. Our system automatically recognizes if performance of ASL sentences contains grammatical errors made by ASL students. We first recognize the ASL grammatical elements including both manual gestures and nonmanual signals independently from multiple modalities (i.e. hand gestures, facial expressions, and head movements) by 3D-ResNet networks. Then the temporal boundaries of grammatical elements from different modalities are examined to detect ASL grammatical mistakes by using a sliding window-based approach. We have collected a dataset of continuous sign language, ASL-HW-RGBD, covering different aspects of ASL grammars for training and testing. Our system is able to recognize grammatical elements on ASL-HW-RGBD from manual gestures, facial expressions, and head movements and successfully detect 8 ASL grammatical mistakes.

17. Deepfake Forensics Using Recurrent Neural Networks [PDF] Back to Contents
  Rahul U, Ragul M, Raja Vignesh K, Tejeswinee K
Abstract: As of late an AI based free programming device has made it simple to make authentic face swaps in recordings that leaves barely any hints of control, in what are known as "deepfake" recordings. Situations where these genuine istic counterfeit recordings are utilized to make political pain, extort somebody or phony fear based oppression occasions are effectively imagined. This paper proposes a transient mindful pipeline to automatically recognize deepfake recordings. Our framework utilizes a convolutional neural system (CNN) to remove outline level highlights. These highlights are then used to prepare a repetitive neural network (RNN) that figures out how to characterize if a video has been subject to control or not. We assess our technique against a huge arrangement of deepfake recordings gathered from different video sites. We show how our framework can accomplish aggressive outcomes in this assignment while utilizing a basic design.

18. Deeply Cascaded U-Net for Multi-Task Image Processing [PDF] Back to Contents
  Ilja Gubins, Remco C. Veltkamp
Abstract: In current practice, many image processing tasks are done sequentially (e.g. denoising, dehazing, followed by semantic segmentation). In this paper, we propose a novel multi-task neural network architecture designed for combining sequential image processing tasks. We extend U-Net by additional decoding pathways for each individual task, and explore deep cascading of outputs and connectivity from one pathway to another. We demonstrate effectiveness of the proposed approach on denoising and semantic segmentation, as well as on progressive coarse-to-fine semantic segmentation, and achieve better performance than multiple individual or jointly-trained networks, with lower number of trainable parameters.

19. The AVA-Kinetics Localized Human Actions Video Dataset [PDF] Back to Contents
  Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
Abstract: This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set. The dataset can be downloaded from this https URL

20. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [PDF] Back to Contents
  Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
Abstract: We present HERO, a Hierarchical EncodeR for Omni-representation learning, for large-scale video+language pre-training. HERO encodes multimodal inputs in a hierarchical fashion, where local textual context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. Besides standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. Different from previous work that mostly focused on cooking or narrated instructional videos, HERO is jointly trained on HowTo100M and large-scale TV show datasets to learn complex social scenes, dynamics backdrop transitions and multi-character interactions. Extensive experiments demonstrate that HERO achieves new state of the art on both text-based video moment retrieval and video question answering tasks across different domains.
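The Frame Order Modeling (FOM) objective can be sketched as shuffling the frame features and training a classifier to recover each frame's original position. The head design, feature size, and shapes below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameOrderHead(nn.Module):
    """Predict each shuffled frame's original index from its contextual feature."""
    def __init__(self, hidden: int, num_frames: int):
        super().__init__()
        self.cls = nn.Linear(hidden, num_frames)

    def forward(self, frame_feats: torch.Tensor, orig_pos: torch.Tensor):
        # frame_feats: (B, T, hidden), frames fed to the encoder in shuffled order
        # orig_pos:   (B, T), original position of each shuffled frame
        logits = self.cls(frame_feats)                     # (B, T, T)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               orig_pos.reshape(-1))

B, T, H = 2, 8, 768
head = FrameOrderHead(H, T)
orig_pos = torch.stack([torch.randperm(T) for _ in range(B)])
loss = head(torch.randn(B, T, H), orig_pos)
```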

21. Sequence Information Channel Concatenation for Improving Camera Trap Image Burst Classification [PDF] Back to Contents
  Bhuvan Malladihalli Shashidhara, Darshan Mehta, Yash Kale, Dan Morris, Megan Hazen
Abstract: Camera Traps are extensively used to observe wildlife in their natural habitat without disturbing the ecosystem. This could help in the early detection of natural or human threats to animals, and help towards ecological conservation. Currently, a massive number of such camera traps have been deployed at various ecological conservation areas around the world, collecting data for decades, thereby requiring automation to detect images containing animals. Existing systems perform classification to detect if images contain animals by considering a single image. However, due to challenging scenes with animals camouflaged in their natural habitat, it sometimes becomes difficult to identify the presence of animals from merely a single image. We hypothesize that a short burst of images instead of a single image, assuming that the animal moves, makes it much easier for a human as well as a machine to detect the presence of animals. In this work, we explore a variety of approaches, and measure the impact of using short image sequences (burst of 3 images) on improving the camera trap image classification. We show that concatenating masks containing sequence information and the images from the 3-image-burst across channels, improves the ROC AUC by 20% on a test-set from unseen camera-sites, as compared to an equivalent model that learns from a single image.
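The channel concatenation itself is simple to sketch: the three burst images and their sequence-information masks are stacked along the channel dimension and fed to a standard classifier whose first convolution is widened. The mask construction (e.g. motion/foreground masks) and the ResNet-18 backbone are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_burst_input(burst: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """burst: (B, 3, 3, H, W) -- three RGB frames from one trigger event;
    masks: (B, 3, H, W) -- one sequence-information mask per frame.
    Returns (B, 12, H, W): 9 image channels + 3 mask channels."""
    B, n, c, H, W = burst.shape
    return torch.cat([burst.reshape(B, n * c, H, W), masks], dim=1)

backbone = models.resnet18(num_classes=2)                  # animal vs. empty
backbone.conv1 = nn.Conv2d(12, 64, kernel_size=7, stride=2, padding=3, bias=False)

x = build_burst_input(torch.rand(2, 3, 3, 224, 224), torch.rand(2, 3, 224, 224))
logits = backbone(x)
```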

22. Domain Siamese CNNs for Sparse Multispectral Disparity Estimation [PDF] Back to Contents
  David-Alexandre Beaupre, Guillaume-Alexandre Bilodeau
Abstract: Multispectral disparity estimation is a difficult task for many reasons: it has all the same challenges as traditional visible-visible disparity estimation (occlusions, repetitive patterns, textureless surfaces), in addition of having very few common visual information between images (e.g. color information vs. thermal information). In this paper, we propose a new CNN architecture able to do disparity estimation between images from different spectrum, namely thermal and visible in our case. Our proposed model takes two patches as input and proceeds to do domain feature extraction for each of them. Features from both domains are then merged with two fusion operations, namely correlation and concatenation. These merged vectors are then forwarded to their respective classification heads, which are responsible for classifying the inputs as being same or not. Using two merging operations gives more robustness to our feature extraction process, which leads to more precise disparity estimation. Our method was tested using the publicly available LITIV 2014 and LITIV 2018 datasets, and showed best results when compared to other state of the art methods.

23. Importance Driven Continual Learning for Segmentation Across Domains [PDF] Back to Contents
  Sinan Özgür Özgün, Anne-Marie Rickmann, Abhijit Guha Roy, Christian Wachinger
Abstract: The ability of neural networks to continuously learn and adapt to new tasks while retaining prior knowledge is crucial for many applications. However, current neural networks tend to forget previously learned tasks when trained on new ones, i.e., they suffer from Catastrophic Forgetting (CF). The objective of Continual Learning (CL) is to alleviate this problem, which is particularly relevant for medical applications, where it may not be feasible to store and access previously used sensitive patient data. In this work, we propose a Continual Learning approach for brain segmentation, where a single network is consecutively trained on samples from different domains. We build upon an importance driven approach and adapt it for medical image segmentation. Particularly, we introduce learning rate regularization to prevent the loss of the network's knowledge. Our results demonstrate that directly restricting the adaptation of important network parameters clearly reduces Catastrophic Forgetting for segmentation across domains.
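One way to read the "learning rate regularization" driven by parameter importance is to scale each parameter's step size down in proportion to how important it was for previously seen domains. The importance estimate (a running average of squared gradients, Fisher/MAS-style) and the scaling rule below are assumptions, not necessarily the paper's exact scheme.

```python
import torch

def accumulate_importance(model, importances, decay=0.9):
    """Running average of squared gradients as an (assumed) importance estimate."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        omega = importances.setdefault(name, torch.zeros_like(p))
        omega.mul_(decay).add_(p.grad.detach() ** 2, alpha=1 - decay)

@torch.no_grad()
def importance_scaled_step(model, importances, base_lr=1e-3, alpha=10.0):
    """SGD-like step where parameters important for earlier domains move slowly."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        omega = importances.get(name, torch.zeros_like(p))
        lr = base_lr / (1.0 + alpha * omega)      # per-parameter learning rate
        p.sub_(lr * p.grad)
```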

24. Occlusion resistant learning of intuitive physics from videos [PDF] Back to Contents
  Ronan Riochet, Josef Sivic, Ivan Laptev, Emmanuel Dupoux
Abstract: To reach human performance on complex tasks, a key ability for artificial systems is to understand physical interactions between objects, and predict future outcomes of a situation. This ability, often referred to as intuitive physics, has recently received attention and several methods were proposed to learn these physical rules from video sequences. Yet, most of these methods are restricted to the case where no, or only limited, occlusions occur. In this work we propose a probabilistic formulation of learning intuitive physics in 3D scenes with significant inter-object occlusions. In our formulation, object positions are modeled as latent variables enabling the reconstruction of the scene. We then propose a series of approximations that make this problem tractable. Object proposals are linked across frames using a combination of a recurrent interaction network, modeling the physics in object space, and a compositional renderer, modeling the way in which objects project onto pixel space. We demonstrate significant improvements over state-of-the-art in the intuitive physics benchmark of IntPhys. We apply our method to a second dataset with increasing levels of occlusions, showing it realistically predicts segmentation masks up to 30 frames in the future. Finally, we also show results on predicting motion of objects in real videos.

25. CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs [PDF] Back to Contents
  Li'an Zhuo, Baochang Zhang, Hanlin Chen, Linlin Yang, Chen Chen, Yanjun Zhu, David Doermann
Abstract: Neural architecture search (NAS) proves to be among the best approaches for many tasks by generating an application-adaptive neural architecture, which is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binarized weights and activations show their potential for resource-limited embedded devices. One natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS by taking advantage of the strengths of each in a unified framework. To this end, a Child-Parent (CP) model is introduced to a differentiable NAS to search the binarized architecture (Child) under the supervision of a full-precision model (Parent). In the search stage, the Child-Parent model uses an indicator generated by the child and parent model accuracy to evaluate the performance and abandon operations with less potential. In the training stage, a kernel-level CP loss is introduced to optimize the binarized network. Extensive experiments demonstrate that the proposed CP-NAS achieves a comparable accuracy with traditional NAS on both the CIFAR and ImageNet databases. It achieves the accuracy of $95.27\%$ on CIFAR-10, $64.3\%$ on ImageNet with binarized weights and activations, and a $30\%$ faster search than prior arts.

26. HLVU : A New Challenge to Test Deep Understanding of Movies the Way Humans do [PDF] Back to Contents
  Keith Curtis, George Awad, Shahzad Rajput, Ian Soboroff
Abstract: In this paper we propose a new evaluation challenge and direction in the area of High-level Video Understanding. The challenge we are proposing is designed to test automatic video analysis and understanding, and how accurately systems can comprehend a movie in terms of actors, entities, events and their relationship to each other. A pilot High-Level Video Understanding (HLVU) dataset of open source movies were collected for human assessors to build a knowledge graph representing each of them. A set of queries will be derived from the knowledge graph to test systems on retrieving relationships among actors, as well as reasoning and retrieving non-visual concepts. The objective is to benchmark if a computer system can "understand" non-explicit but obvious relationships the same way humans do when they watch the same movies. This is long-standing problem that is being addressed in the text domain and this project moves similar research to the video domain. Work of this nature is foundational to future video analytics and video understanding technologies. This work can be of interest to streaming services and broadcasters hoping to provide more intuitive ways for their customers to interact with and consume video content.

27. A Naturalness Evaluation Database for Video Prediction Models [PDF] Back to Contents
  Nagabhushan Somraj, Manoj Surya Kashi, S. P. Arun, Rajiv Soundararajan
Abstract: The study of video prediction models is believed to be a fundamental approach to representation learning for videos. While a plethora of generative models for predicting the future frame pixel values given the past few frames exist, the quantitative evaluation of the predicted frames has been found to be extremely challenging. In this context, we introduce the problem of naturalness evaluation, which refers to how natural or realistic a predicted video looks. We create the Indian Institute of Science Video Naturalness Evaluation (IISc VINE) Database consisting of 300 videos, obtained by applying different prediction models on different datasets, and accompanying human opinion scores. 50 human subjects participated in our study yielding around 6000 human ratings of naturalness. Our subjective study reveals that human observers show a highly consistent judgement of naturalness. We benchmark several popularly used measures for evaluating video prediction and show that they do not adequately correlate with the subjective scores. We introduce two new features to help effectively capture naturalness. In particular, we show that motion compensated cosine similarities of deep features of predicted frames with past frames and deep features extracted from rescaled frame differences lead to state of the art naturalness prediction in accordance with human judgements. The database and code will be made publicly available at our project website: this https URL.
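The feature-similarity idea can be sketched as a cosine similarity between deep features of the predicted frame and the last observed frame; the motion-compensation and rescaled-frame-difference features described in the abstract are omitted here, and the VGG16 extractor is an assumed stand-in.

```python
import torch
import torch.nn.functional as F
from torchvision import models

feat_net = models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in feat_net.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def feature_similarity(pred_frame: torch.Tensor, past_frame: torch.Tensor):
    """Both (B, 3, H, W) in [0, 1]. Returns one score per video; higher values
    are expected to correlate with more natural-looking predictions."""
    f_pred, f_past = feat_net(pred_frame), feat_net(past_frame)
    return F.cosine_similarity(f_pred, f_past, dim=1).mean(dim=(1, 2))   # (B,)
```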

28. On-board Deep-learning-based Unmanned Aerial Vehicle Fault Cause Detection and Identification [PDF] Back to Contents
  Vidyasagar Sadhu, Saman Zonouz, Dario Pompili
Abstract: With the increase in use of Unmanned Aerial Vehicles (UAVs)/drones, it is important to detect and identify causes of failure in real time for proper recovery from a potential crash-like scenario or post incident forensics analysis. The cause of crash could be either a fault in the sensor/actuator system, a physical damage/attack, or a cyber attack on the drone's software. In this paper, we propose novel architectures based on deep Convolutional and Long Short-Term Memory Neural Networks (CNNs and LSTMs) to detect (via Autoencoder) and classify drone mis-operations based on sensor data. The proposed architectures are able to learn high-level features automatically from the raw sensor data and learn the spatial and temporal dynamics in the sensor data. We validate the proposed deep-learning architectures via simulations and experiments on a real drone. Empirical results show that our solution is able to detect with over 90% accuracy and classify various types of drone mis-operations (with about 99% accuracy (simulation data) and upto 88% accuracy (experimental data)).

29. Defocus Deblurring Using Dual-Pixel Data [PDF] Back to Contents
  Abdullah Abuolaim, Michael S. Brown
Abstract: Defocus blur arises in images that are captured with a shallow depth of field due to the use of a wide aperture. Correcting defocus blur is challenging because the blur is spatially varying and difficult to estimate. We propose an effective defocus deblurring method that exploits data available on dual-pixel (DP) sensors found on most modern cameras. DP sensors are used to assist a camera's auto-focus by capturing two sub-aperture views of the scene in a single image shot. The two sub-aperture images are used to calculate the appropriate lens position to focus on a particular scene region and are discarded afterwards. We introduce a deep neural network (DNN) architecture that uses these discarded sub-aperture images to reduce defocus deblur. A key contribution of our effort is a carefully captured dataset of 500 scenes (2,000 images) where each scene has: (i) an image with defocus blur captured at a large aperture; (ii) the two associated DP sub-aperture views; and (iii) the corresponding all-in-focus image captured with a small aperture. Our proposed DNN produces results that are significantly better than conventional single image methods in terms of both quantitative and perceptual metrics -- all from data that is already available on the camera but ignored. The dataset, code, and the trained models will be available at this https URL.

30. Distilling Spikes: Knowledge Distillation in Spiking Neural Networks [PDF] Back to Contents
  Ravi Kumar Kushawaha, Saurabh Kumar, Biplab Banerjee, Rajbabu Velmurugan
Abstract: Spiking Neural Networks (SNN) are energy-efficient computing architectures that exchange spikes for processing information, unlike classical Artificial Neural Networks (ANN). Due to this, SNNs are better suited for real-life deployments. However, similar to ANNs, SNNs also benefit from deeper architectures to obtain improved performance. Furthermore, like the deep ANNs, the memory, compute and power requirements of SNNs also increase with model size, and model compression becomes a necessity. Knowledge distillation is a model compression technique that enables transferring the learning of a large machine learning model to a smaller model with minimal loss in performance. In this paper, we propose techniques for knowledge distillation in spiking neural networks for the task of image classification. We present ways to distill spikes from a larger SNN, also called the teacher network, to a smaller one, also called the student network, while minimally impacting the classification accuracy. We demonstrate the effectiveness of the proposed method with detailed experiments on three standard datasets while proposing novel distillation methodologies and loss functions. We also present a multi-stage knowledge distillation technique for SNNs using an intermediate network to obtain higher performance from the student network. Our approach is expected to open up new avenues for deploying high performing large SNN models on resource-constrained hardware platforms.
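A common way to realise distillation between spiking networks is to treat time-averaged spike counts at the output layer as class scores and apply the standard soft-target loss; the rate-coded readout, temperature, and weighting below are illustrative assumptions rather than the paper's specific methodology.

```python
import torch
import torch.nn.functional as F

def spike_distillation_loss(student_spikes, teacher_spikes, labels,
                            T: float = 4.0, alpha: float = 0.7):
    """student_spikes, teacher_spikes: (B, time_steps, num_classes) binary spike
    trains at the output layer; spike rates over time act as class scores."""
    s_rate = student_spikes.float().mean(dim=1)        # (B, num_classes)
    t_rate = teacher_spikes.float().mean(dim=1)
    kd = F.kl_div(F.log_softmax(s_rate / T, dim=1),
                  F.softmax(t_rate / T, dim=1),
                  reduction="batchmean") * (T * T)     # soft-target term
    ce = F.cross_entropy(s_rate, labels)               # hard-label term
    return alpha * kd + (1 - alpha) * ce

loss = spike_distillation_loss(torch.randint(0, 2, (8, 20, 10)),
                               torch.randint(0, 2, (8, 20, 10)),
                               torch.randint(0, 10, (8,)))
```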

31. Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage [PDF] Back to Contents
  Ashish V. Thapliyal, Radu Soricut
Abstract: Cross-modal language generation tasks such as image captioning are directly hurt in their ability to support non-English languages by the trend of data-hungry models combined with the lack of non-English annotations. We investigate potential solutions for combining existing language-generation annotations in English with translation capabilities in order to create solutions at web-scale in both domain and language coverage. We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages directly at training time both existing English annotations (gold data) as well as their machine-translated versions (silver data); at run-time, it generates first an English caption and then a corresponding target-language caption. We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages, under a large-domain testset using images from the Open Images dataset. Furthermore, we find an interesting effect where the English captions generated by the PLuGS models are better than the captions generated by the original, monolingual English model.

32. Conceptual Design of Human-Drone Communication in Collaborative Environments [PDF] Back to Contents
  Hans Dermot Doran, Monika Reif, Marco Oehler, Curdin Stoehr, Pierluigi Capone
Abstract: Autonomous robots and drones will work collaboratively and cooperatively in tomorrow's industry and agriculture. Before this becomes a reality, some form of standardised communication between man and machine must be established that specifically facilitates communication between autonomous machines and both trained and untrained human actors in the working environment. We present preliminary results on a human-drone and a drone-human language situated in the agricultural industry where interactions with trained and untrained workers and visitors can be expected. We present basic visual indicators enhanced with flight patterns for drone-human interaction and human signaling based on aircraft marshaling for humane-drone interaction. We discuss preliminary results on image recognition and future work.

33. Bridging Mode Connectivity in Loss Landscapes and Adversarial Robustness [PDF] Back to Contents
  Pu Zhao, Pin-Yu Chen, Payel Das, Karthikeyan Natesan Ramamurthy, Xue Lin
Abstract: Mode connectivity provides novel geometric insights on analyzing loss landscapes and enables building high-accuracy pathways between well-trained neural networks. In this work, we propose to employ mode connectivity in loss landscapes to study the adversarial robustness of deep neural networks, and provide novel methods for improving this robustness. Our experiments cover various types of adversarial attacks applied to different network architectures and datasets. When network models are tampered with backdoor or error-injection attacks, our results demonstrate that the path connection learned using limited amount of bonafide data can effectively mitigate adversarial effects while maintaining the original accuracy on clean data. Therefore, mode connectivity provides users with the power to repair backdoored or error-injected models. We also use mode connectivity to investigate the loss landscapes of regular and robust models against evasion attacks. Experiments show that there exists a barrier in adversarial robustness loss on the path connecting regular and adversarially-trained models. A high correlation is observed between the adversarial robustness loss and the largest eigenvalue of the input Hessian matrix, for which theoretical justifications are provided. Our results suggest that mode connectivity offers a holistic tool and practical means for evaluating and improving adversarial robustness.
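Mode connectivity is usually realised as a low-loss parametric curve, e.g. a quadratic Bezier curve, between two trained weight vectors with a learned control point. The sketch below only evaluates points on such a curve; training the control point theta and the attack/repair procedures are outside its scope, and it assumes all state_dict entries are floating-point tensors.

```python
import torch

def bezier_point(w_a: dict, w_b: dict, theta: dict, t: float) -> dict:
    """phi(t) = (1-t)^2 * w_a + 2t(1-t) * theta + t^2 * w_b, applied entry-wise
    to state_dicts of two trained endpoint models and a learned control point."""
    return {name: ((1 - t) ** 2) * w_a[name]
                  + 2 * t * (1 - t) * theta[name]
                  + (t ** 2) * w_b[name]
            for name in w_a}

def load_point(model: torch.nn.Module, w_a, w_b, theta, t: float):
    """Instantiate the model at position t on the path, e.g. to probe accuracy or
    adversarial robustness along the connection."""
    model.load_state_dict(bezier_point(w_a, w_b, theta, t))
    return model
```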

34. Unsupervised Lesion Detection via Image Restoration with a Normative Prior [PDF] Back to Contents
  Xiaoran Chen, Suhang You, Kerem Can Tezcan, Ender Konukoglu
Abstract: Unsupervised lesion detection is a challenging problem that requires accurately estimating normative distributions of healthy anatomy and detecting lesions as outliers without training examples. Recently, this problem has received increased attention from the research community following the advances in unsupervised learning with deep learning. Such advances allow the estimation of high-dimensional distributions, such as normative distributions, with higher accuracy than previous methods. The main approach of the recently proposed methods is to learn a latent-variable model parameterized with networks to approximate the normative distribution using example images showing healthy anatomy, perform prior-projection, i.e. reconstruct the image with lesions using the latent-variable model, and determine lesions based on the differences between the reconstructed and original images. While being promising, the prior-projection step often leads to a large number of false positives. In this work, we approach unsupervised lesion detection as an image restoration problem and propose a probabilistic model that uses a network-based prior as the normative distribution and detect lesions pixel-wise using MAP estimation. The probabilistic model punishes large deviations between restored and original images, reducing false positives in pixel-wise detections. Experiments with gliomas and stroke lesions in brain MRI using publicly available datasets show that the proposed approach outperforms the state-of-the-art unsupervised methods by a substantial margin, +0.13 (AUC), for both glioma and stroke detection. Extensive model analysis confirms the effectiveness of MAP-based image restoration.
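A minimal sketch of MAP-style restoration with a normative prior: the observed image is optimised toward a version that a model trained only on healthy anatomy considers likely, while staying close to the observation; large residuals then flag candidate lesions. Approximating -log p(x) with an autoencoder reconstruction error, and the weights and step counts, are assumptions for illustration.

```python
import torch

def map_restore(y: torch.Tensor, normative_ae, steps: int = 100,
                lr: float = 0.05, lam: float = 1.0):
    """y: (B, 1, H, W) observed image; normative_ae: a model trained on healthy data."""
    x = y.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        data_term = ((x - y) ** 2).mean()                  # stay close to the observation
        prior_term = ((normative_ae(x) - x) ** 2).mean()   # stay near the healthy manifold
        (data_term + lam * prior_term).backward()
        opt.step()
    residual = (y - x.detach()).abs()    # large residuals mark candidate lesions
    return x.detach(), residual
```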
