【arXiv papers】Computer Vision and Pattern Recognition 2020-04-30

Contents

1. VGGSound: A Large-scale Audio-Visual Dataset [PDF] Abstract
2. Editing in Style: Uncovering the Local Semantics of GANs [PDF] Abstract
3. Smart Attendance System Using CNN [PDF] Abstract
4. Tensor train rank minimization with nonlocal self-similarity for tensor completion [PDF] Abstract
5. Action Sequence Predictions of Vehicles in Urban Environments using Map and Social Context [PDF] Abstract
6. Image Captioning through Image Transformer [PDF] Abstract
7. Deepfake Video Forensics based on Transfer Learning [PDF] Abstract
8. Assessing Car Damage using Mask R-CNN [PDF] Abstract
9. Zero-Shot Learning and its Applications from Autonomous Vehicles to COVID-19 Diagnosis: A Review [PDF] Abstract
10. Minimal Rolling Shutter Absolute Pose with Unknown Focal Length and Radial Distortion [PDF] Abstract
11. Single-Side Domain Generalization for Face Anti-Spoofing [PDF] Abstract
12. Counting of Grapevine Berries in Images via Semantic Segmentation using Convolutional Neural Networks [PDF] Abstract
13. Retinal vessel segmentation by probing adaptive to lighting variations [PDF] Abstract
14. Motion Guided 3D Pose Estimation from Videos [PDF] Abstract
15. Skeleton Focused Human Activity Recognition in RGB Video [PDF] Abstract
16. Effective Human Activity Recognition Based on Small Datasets [PDF] Abstract
17. Deep Transfer Learning For Plant Center Localization [PDF] Abstract
18. Video Contents Understanding using Deep Neural Networks [PDF] Abstract
19. Deflating Dataset Bias Using Synthetic Data Augmentation [PDF] Abstract
20. Less is More: Sample Selection and Label Conditioning Improve Skin Lesion Segmentation [PDF] Abstract
21. Boosting Deep Open World Recognition by Clustering [PDF] Abstract
22. Pyramid Attention Networks for Image Restoration [PDF] Abstract
23. CMRNet++: Map and Camera Agnostic Monocular Visual Localization in LiDAR Maps [PDF] Abstract
24. Cross-modal Speaker Verification and Recognition: A Multilingual Perspective [PDF] Abstract
25. Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube [PDF] Abstract
26. Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision [PDF] Abstract
27. A Wearable Social Interaction Aid for Children with Autism [PDF] Abstract
28. A Fast 3D CNN for Hyperspectral Image Classification [PDF] Abstract
29. Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Scans [PDF] Abstract
30. Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance [PDF] Abstract
31. DR-SPAAM: A Spatial-Attention and Auto-regressive Model for Person Detection in 2D Range Data [PDF] Abstract
32. Image Morphing with Perceptual Constraints and STN Alignment [PDF] Abstract
33. The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset [PDF] Abstract
34. Span-based Localizing Network for Natural Language Video Localization [PDF] Abstract
35. An Auto-Encoder Strategy for Adaptive Image Segmentation [PDF] Abstract
36. Unmanned Aerial Systems for Wildland and Forest Fires: Sensing, Perception, Cooperation and Assistance [PDF] Abstract
37. Histogram-based Auto Segmentation: A Novel Approach to Segmenting Integrated Circuit Structures from SEM Images [PDF] Abstract
38. Character-level Japanese Text Generation with Attention Mechanism for Chest Radiography Diagnosis [PDF] Abstract
39. Minority Reports Defense: Defending Against Adversarial Patches [PDF] Abstract

Abstracts

1. VGGSound: A Large-scale Audio-Visual Dataset [PDF] Back to Contents
  Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman
Abstract: Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves obtaining videos from YouTube; using image classification algorithms to localize audio-visual correspondence; and filtering out ambient noise using audio verification. Second, we use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes. Third, we investigate various Convolutional Neural Network~(CNN) architectures and aggregation approaches to establish audio recognition baselines for our new dataset. Compared to existing audio datasets, VGGSound ensures audio-visual correspondence and is collected under unconstrained conditions. Code and the dataset are available at this http URL
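As a concrete illustration of the kind of CNN baseline the paper benchmarks, the sketch below feeds a log-mel spectrogram into a small 2D CNN; the architecture, sample rate, and mel settings are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchaudio

# Stand-in 10-second clip at 16 kHz (VGGSound clips come from YouTube audio).
wav = torch.randn(1, 16000 * 10)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(wav)
logmel = torch.log(mel + 1e-6).unsqueeze(0)          # (1, 1, 64, frames)

# Tiny CNN classifier over the spectrogram; 310 outputs match the class count.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 310),
)
logits = cnn(logmel)                                  # (1, 310) class scores
```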

2. Editing in Style: Uncovering the Local Semantics of GANs [PDF] Back to Contents
  Edo Collins, Raja Bala, Bob Price, Sabine Süsstrunk
Abstract: While the quality of GAN image synthesis has improved tremendously in recent years, our ability to control and condition the output is still limited. Focusing on StyleGAN, we introduce a simple and effective method for making local, semantically-aware edits to a target output image. This is accomplished by borrowing elements from a source image, also a GAN output, via a novel manipulation of style vectors. Our method requires neither supervision from an external model, nor involves complex spatial morphing operations. Instead, it relies on the emergent disentanglement of semantic objects that is learned by StyleGAN during its training. Semantic editing is demonstrated on GANs producing human faces, indoor scenes, cats, and cars. We measure the locality and photorealism of the edits produced by our method, and find that it accomplishes both.
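A minimal sketch of the core manipulation, assuming StyleGAN-style per-layer style codes: channels that control a local semantic part are copied from a source latent into the target. Which channels control which part is what the paper uncovers; the indices below are made up.

```python
import numpy as np

# Per-layer style codes, StyleGAN2-like shapes (18 layers x 512 channels).
w_target = np.random.randn(18, 512)
w_source = np.random.randn(18, 512)

# Hypothetical channels found to control one local part (e.g. "eyes").
part_channels = [12, 87, 301]

# Local edit: borrow only those channels from the source output.
w_edit = w_target.copy()
w_edit[:, part_channels] = w_source[:, part_channels]
```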

3. Smart Attendance System Using CNN [PDF] Back to Contents
  Shailesh Arya, Hrithik Mesariya, Vishal Parekh
Abstract: The research on attendance systems has been going on for a very long time; numerous arrangements have been proposed in the last decade to make such systems efficient and less time consuming, but all those systems have several flaws. In this paper, we introduce a smart and efficient system for attendance using face detection and face recognition. This system can be used to take attendance in colleges or offices using real-time face recognition with the help of a Convolutional Neural Network (CNN). Conventional methods like Eigenfaces and Fisherfaces are sensitive to lighting, noise, posture, obstruction, illumination etc. Hence, we have used a CNN to recognize the face and overcome such difficulties. The attendance records will be updated automatically and stored in an Excel sheet as well as in a database. We have used MongoDB as a backend database for attendance records.
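The paper stores attendance records in MongoDB; a minimal sketch of that step with pymongo is below (the database, collection, and field names are hypothetical, not from the paper).

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["attendance_db"]["records"]  # hypothetical names

def mark_attendance(student_id: str, confidence: float) -> None:
    """Insert one attendance record for a face the CNN recognized."""
    records.insert_one({
        "student_id": student_id,
        "timestamp": datetime.now(timezone.utc),
        "confidence": confidence,  # recognition score for the match
    })

mark_attendance("S1234", 0.97)
```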

4. Tensor train rank minimization with nonlocal self-similarity for tensor completion [PDF] Back to Contents
  Meng Ding, Ting-Zhu Huang, Xi-Le Zhao, Michael K. Ng, Tian-Hui Ma
Abstract: The tensor train (TT) rank has received increasing attention in tensor completion due to its ability to capture the global correlation of high-order tensors ($\textrm{order} >3$). For third order visual data, direct TT rank minimization has not exploited the potential of TT rank for high-order tensors. The TT rank minimization accompany with \emph{ket augmentation}, which transforms a lower-order tensor (e.g., visual data) into a higher-order tensor, suffers from serious block-artifacts. To tackle this issue, we suggest the TT rank minimization with nonlocal self-similarity for tensor completion by simultaneously exploring the spatial, temporal/spectral, and nonlocal redundancy in visual data. More precisely, the TT rank minimization is performed on a formed higher-order tensor called group by stacking similar cubes, which naturally and fully takes advantage of the ability of TT rank for high-order tensors. Moreover, the perturbation analysis for the TT low-rankness of each group is established. We develop the alternating direction method of multipliers tailored for the specific structure to solve the proposed model. Extensive experiments demonstrate that the proposed method is superior to several existing state-of-the-art methods in terms of both qualitative and quantitative measures.
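For readers unfamiliar with TT rank, the sketch below computes a tensor train decomposition by sequential truncated SVD (the classical TT-SVD); it illustrates what the TT cores and ranks are, not the paper's nonlocal minimization model.

```python
import numpy as np

def tt_svd(tensor: np.ndarray, max_rank: int):
    """Decompose a d-way tensor into TT cores G_k of shape (r_{k-1}, n_k, r_k)."""
    dims, cores, r_prev = tensor.shape, [], 1
    mat = tensor.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))                       # truncation: the TT rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        mat = (S[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

cores = tt_svd(np.random.rand(8, 9, 10), max_rank=4)
print([c.shape for c in cores])  # [(1, 8, 4), (4, 9, 4), (4, 10, 1)]
```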

5. Action Sequence Predictions of Vehicles in Urban Environments using Map and Social Context [PDF] Back to Contents
  Jan-Nico Zaech, Dengxin Dai, Alexander Liniger, Luc Van Gool
Abstract: This work studies the problem of predicting the sequence of future actions for surround vehicles in real-world driving scenarios. To this aim, we make three main contributions. The first contribution is an automatic method to convert the trajectories recorded in real-world driving scenarios to action sequences with the help of HD maps. The method enables automatic dataset creation for this task from large-scale driving data. Our second contribution lies in applying the method to the well-known traffic agent tracking and prediction dataset Argoverse, resulting in 228,000 action sequences. Additionally, 2,245 action sequences were manually annotated for testing. The third contribution is to propose a novel action sequence prediction method by integrating past positions and velocities of the traffic agents, map information and social context into a single end-to-end trainable neural network. Our experiments prove the merit of the data creation method and the value of the created dataset - prediction performance improves consistently with the size of the dataset and shows that our action prediction method outperforms comparing models.

6. Image Captioning through Image Transformer [PDF] Back to Contents
  Sen He, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn, Nicolas Pugeault
Abstract: Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect in captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the \textit{transformer} architecture for image captioning. However, the structure between the \textit{semantic units} in images (usually the detected regions from an object detection model) and sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the \textbf{\textit{image transformer}}, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.

7. Deepfake Video Forensics based on Transfer Learning [PDF] Back to Contents
  Rahul U, Ragul M, Raja Vignesh K, Tejeswinee K
Abstract: Deep learning has been used to solve complex problems in various domains. As it advances, it also creates applications which become a major threat to our privacy, security and even to our democracy. One such recently developed application is the "deepfake". Deepfake models can create fake images and videos that humans cannot distinguish from genuine ones. Counter-applications that automatically detect and analyze digital visual media are therefore necessary in today's world. This paper details retraining image classification models to capture the features of each deepfake video frame. Every video frame is fed through a pretrained bottleneck layer of the neural network; this layer condenses the data of all images and exposes the artificial manipulations in deepfake videos. When checking deepfake videos, this technique achieved more than 87 per cent accuracy. The technique has been tested on the Face Forensics dataset and obtained good detection accuracy.
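A hedged sketch of the transfer-learning step in PyTorch: a pretrained classifier is reused and only a new real-vs-fake head is trained. The paper retrains pretrained image classifiers; the ResNet-18 choice and the freezing policy here are assumptions.

```python
import torch.nn as nn
from torchvision import models

# Reuse an ImageNet-pretrained backbone as a fixed feature extractor.
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head: 2 classes, real vs. deepfake.
model.fc = nn.Linear(model.fc.in_features, 2)
# Only model.fc's parameters are then optimized on deepfake video frames.
```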

8. Assessing Car Damage using Mask R-CNN [PDF] Back to Contents
  Sarath P, Soorya M, Shaik Abdul Rahman A, S Suresh Kumar, K Devaki
Abstract: Image-based vehicle insurance processing is an important area with large scope for automation. In this paper we consider the problem of car damage classification, where some of the categories can be fine-grained. We explore deep learning based techniques for this purpose. Initially, we try directly training a CNN. However, due to the small set of labeled data, it does not work well. Then, we explore the effect of domain-specific pre-training followed by fine-tuning. Finally, we experiment with transfer learning and ensemble learning. Experimental results show that transfer learning works better than domain-specific fine-tuning. We achieve an accuracy of 89.5% with a combination of transfer and ensemble learning.
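A sketch of how an off-the-shelf Mask R-CNN would be queried in PyTorch; a damage-assessment model would first be fine-tuned on annotated car images, and the score threshold below is an assumption.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(pretrained=True).eval()
image = torch.rand(3, 480, 640)          # stand-in for a car photo in [0, 1]
with torch.no_grad():
    pred = model([image])[0]             # dict: boxes, labels, scores, masks
keep = pred["scores"] > 0.5              # hypothetical confidence cutoff
damage_masks = pred["masks"][keep]       # per-instance soft masks
```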

9. Zero-Shot Learning and its Applications from Autonomous Vehicles to COVID-19 Diagnosis: A Review [PDF] Back to Contents
  Mahdi Rezaei, Mahsa Shahidi
Abstract: The challenge of learning a new concept without receiving any examples beforehand is called zero-shot learning (ZSL). One of the major issues in deep learning based methodologies is the requirement of feeding a vast amount of annotated and labelled images by a human to train the network model. ZSL is known for having minimal human intervention by relying only on previously known concepts and auxiliary information. It is an ever-growing research area since it has human-like characteristics in learning new concepts, which makes it applicable in many real-world scenarios, from autonomous vehicles to surveillance systems to medical imaging and COVID-19 CT scan-based diagnosis. In this paper, we present the definition of the problem, we review over fundamentals, and the challenging steps of Zero-shot learning, including recent categories of solutions, motivations behind each approach, and their advantages over other categories. Inspired from different settings and extensions, we have a broaden solution called one/few-shot learning. We then review thorough datasets, the variety of splits, and the evaluation protocols proposed so far. Finally, we discuss the recent applications and possible future directions of ZSL. We aim to convey a useful intuition through this paper towards the goal of handling computer vision learning tasks more similar to the way humans learn.
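To make the ZSL setting concrete, here is the classic attribute-compatibility scoring in miniature: an unseen class is predicted by matching an image embedding against per-class attribute vectors (a generic textbook formulation, not a specific method from the review).

```python
import numpy as np

rng = np.random.default_rng(0)
img_emb = rng.standard_normal(64)            # image feature projected into...
class_attrs = rng.standard_normal((5, 64))   # ...the space of 5 unseen classes

# Predict the unseen class with the highest compatibility score.
scores = class_attrs @ img_emb
predicted_class = int(np.argmax(scores))
```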

10. Minimal Rolling Shutter Absolute Pose with Unknown Focal Length and Radial Distortion [PDF] Back to Contents
  Zuzana Kukelova, Cenek Albl, Akihiro Sugimoto, Konrad Schindler, Tomas Pajdla
Abstract: The internal geometry of most modern consumer cameras is not adequately described by the perspective projection. Almost all cameras exhibit some radial lens distortion and are equipped with an electronic rolling shutter that induces distortions when the camera moves during the image capture. When focal length has not been calibrated offline, the parameters that describe the radial and rolling shutter distortions are usually unknown. While for global shutter cameras, minimal solvers for the absolute camera pose and unknown focal length and radial distortion are available, solvers for the rolling shutter were missing. We present the first minimal solutions for the absolute pose of a rolling shutter camera with unknown rolling shutter parameters, focal length, and radial distortion. Our new minimal solvers combine iterative schemes designed for calibrated rolling shutter cameras with fast generalized eigenvalue and Groebner basis solvers. In a series of experiments, with both synthetic and real data, we show that our new solvers provide accurate estimates of the camera pose, rolling shutter parameters, focal length, and radial distortion parameters.

11. Single-Side Domain Generalization for Face Anti-Spoofing [PDF] Back to Contents
  Yunpei Jia, Jie Zhang, Shiguang Shan, Xilin Chen
Abstract: Existing domain generalization methods for face anti-spoofing endeavor to extract common differentiation features to improve the generalization. However, due to large distribution discrepancies among fake faces of different domains, it is difficult to seek a compact and generalized feature space for the fake faces. In this work, we propose an end-to-end single-side domain generalization framework (SSDG) to improve the generalization ability of face anti-spoofing. The main idea is to learn a generalized feature space, where the feature distribution of the real faces is compact while that of the fake ones is dispersed among domains but compact within each domain. Specifically, a feature generator is trained to make only the real faces from different domains undistinguishable, but not for the fake ones, thus forming a single-side adversarial learning. Moreover, an asymmetric triplet loss is designed to constrain the fake faces of different domains separated while the real ones aggregated. The above two points are integrated into a unified framework in an end-to-end training manner, resulting in a more generalized class boundary, especially good for samples from novel domains. Feature and weight normalization is incorporated to further improve the generalization ability. Extensive experiments show that our proposed approach is effective and outperforms the state-of-the-art methods on four public databases.

12. Counting of Grapevine Berries in Images via Semantic Segmentation using Convolutional Neural Networks [PDF] Back to Contents
  Laura Zabawa, Anna Kicherer, Lasse Klingbeil, Reinhard Töpfer, Heiner Kuhlmann, Ribana Roscher
Abstract: The extraction of phenotypic traits is often very time and labour intensive. Especially the investigation in viticulture is restricted to on-site analysis due to the perennial nature of grapevine. Traditionally, skilled experts examine small samples and extrapolate the results to a whole plot. Thereby different grapevine varieties and training systems, e.g. vertical shoot positioning (VSP) and semi minimal pruned hedges (SMPH), pose different challenges. In this paper we present an objective framework based on automatic image analysis which works on two different training systems. The images are collected semi-automatically by a camera system which is installed in a modified grape harvester. The system produces overlapping images from the sides of the plants. Our framework uses a convolutional neural network to detect single berries in images by performing a semantic segmentation. Each berry is then counted with a connected component algorithm. We compare our results with Mask R-CNN, a state-of-the-art network for instance segmentation, and with a regression approach for counting. The experiments presented in this paper show that we are able to detect green berries in images despite the different training systems. We achieve an accuracy for the berry detection of 94.0% in the VSP and 85.6% in the SMPH.
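The counting step is standard connected-component labelling; a minimal sketch with SciPy on a toy binary mask is below.

```python
import numpy as np
from scipy import ndimage

# Toy stand-in for the network's binary "berry" segmentation mask.
berry_mask = np.zeros((64, 64), dtype=bool)
berry_mask[5:9, 5:9] = True
berry_mask[20:23, 30:34] = True

labels, num_berries = ndimage.label(berry_mask)
print(num_berries)  # -> 2
```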

13. Retinal vessel segmentation by probing adaptive to lighting variations [PDF] Back to Contents
  Guillaume Noyel, Christine Vartin, Peter Boyle, Laurent Kodjikian
Abstract: We introduce a novel method to extract the vessels in eye fundus images which is adaptive to lighting variations. In the Logarithmic Image Processing framework, a 3-segment probe detects the vessels by probing the topographic surface of an image from below. A map of contrasts between the probe and the image allows the vessels to be detected by a threshold. In a low-contrast image, results show that our method extracts the vessels better than another state-of-the-art method. In a highly contrasted image database (DRIVE) with a reference, ours has an accuracy of 0.9454, which is similar to or better than three state-of-the-art methods and below three others. The three best methods have a higher accuracy than a manual segmentation by another expert. Importantly, our method automatically adapts to the lighting conditions of the image acquisition.

14. Motion Guided 3D Pose Estimation from Videos [PDF] Back to Contents
  Jingbo Wang, Sijie Yan, Yuanjun Xiong, Dahua Lin
Abstract: We propose a new loss function, called motion loss, for the problem of monocular 3D Human pose estimation from 2D pose. In computing motion loss, a simple yet effective representation for keypoint motion, called pairwise motion encoding, is introduced. We design a new graph convolutional network architecture, U-shaped GCN (UGCN). It captures both short-term and long-term motion information to fully leverage the additional supervision from the motion loss. We experiment training UGCN with the motion loss on two large scale benchmarks: Human3.6M and MPI-INF-3DHP. Our model surpasses other state-of-the-art models by a large margin. It also demonstrates strong capacity in producing smooth 3D sequences and recovering keypoint motion.
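A hedged sketch of a motion-supervision term: penalize the difference between predicted and ground-truth frame-to-frame keypoint displacements. The paper's pairwise motion encoding is richer than this; the sketch only conveys the idea of supervising motion rather than positions.

```python
import torch

def motion_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (T, J, 3) sequences of J 3D keypoints over T frames."""
    pred_motion = pred[1:] - pred[:-1]   # per-frame displacement of each joint
    gt_motion = gt[1:] - gt[:-1]
    return torch.mean(torch.abs(pred_motion - gt_motion))

loss = motion_loss(torch.randn(50, 17, 3), torch.randn(50, 17, 3))
```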

15. Skeleton Focused Human Activity Recognition in RGB Video [PDF] Back to Contents
  Bruce X. B. Yu, Yan Liu, Keith C. C. Chan
Abstract: The data-driven approach that learns an optimal representation of vision features like skeleton frames or RGB videos is currently a dominant paradigm for activity recognition. While great improvements have been achieved from existing single modal approaches with increasingly larger datasets, the fusion of various data modalities at the feature level has seldom been attempted. In this paper, we propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity. The objective is to improve the activity recognition accuracy by effectively utilizing the mutual complemental information among different data modalities. For the skeleton modality, we propose to use a graph convolutional subnetwork to learn the skeleton representation. Whereas for the RGB modality, we will use the spatial-temporal region of interest from RGB videos and take the attention features from the skeleton modality to guide the learning process. The model could be either individually or uniformly trained by the back-propagation algorithm in an end-to-end manner. The experimental results for the NTU-RGB+D and Northwestern-UCLA Multiview datasets achieved state-of-the-art performance, which indicates that the proposed skeleton-driven attention mechanism for the RGB modality increases the mutual communication between different data modalities and brings more discriminative features for inferring human activities.

16. Effective Human Activity Recognition Based on Small Datasets [PDF] Back to Contents
  Bruce X. B. Yu, Yan Liu, Keith C. C. Chan
Abstract: Most recent work on vision-based human activity recognition (HAR) focuses on designing complex deep learning models for the task. In so doing, there is a requirement for large datasets to be collected. As acquiring and processing large training datasets are usually very expensive, the problem of how dataset size can be reduced without affecting recognition accuracy has to be tackled. To do so, we propose a HAR method that consists of three steps: (i) data transformation involving the generation of new features based on transforming of raw data, (ii) feature extraction involving the learning of a classifier based on the AdaBoost algorithm and the use of training data consisting of the transformed features, and (iii) parameter determination and pattern recognition involving the determination of parameters based on the features generated in (ii) and the use of the parameters as training data for deep learning algorithms to be used to recognize human activities. Compared to existing approaches, this proposed approach has the advantageous characteristics that it is simple and robust. The proposed approach has been tested with a number of experiments performed on a relatively small real dataset. The experimental results indicate that using the proposed method, human activities can be more accurately recognized even with smaller training data size.
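Step (ii) hinges on AdaBoost over the transformed features; a minimal scikit-learn sketch is below, with synthetic data standing in for the transformed features and activity labels.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))        # stand-in transformed features
y = rng.integers(0, 4, size=200)          # four hypothetical activity classes

clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(clf.predict(X[:5]))
```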

17. Deep Transfer Learning For Plant Center Localization [PDF] Back to Contents
  Enyu Cai, Sriram Baireddy, Changye Yang, Melba Crawford, Edward J. Delp
Abstract: Plant phenotyping focuses on the measurement of plant characteristics throughout the growing season, typically with the goal of evaluating genotypes for plant breeding. Estimating plant location is important for identifying genotypes which have low emergence, which is also related to the environment and management practices such as fertilizer applications. The goal of this paper is to investigate methods that estimate plant locations for a field-based crop using RGB aerial images captured using Unmanned Aerial Vehicles (UAVs). Deep learning approaches provide promising capability for locating plants observed in RGB images, but they require large quantities of labeled data (ground truth) for training. Using a deep learning architecture fine-tuned on a single field or a single type of crop on fields in other geographic areas or with other crops may not have good results. The problem of generating ground truth for each new field is labor-intensive and tedious. In this paper, we propose a method for estimating plant centers by transferring an existing model to a new scenario using limited ground truth data. We describe the use of transfer learning using a model fine-tuned for a single field or a single type of plant on a varied set of similar crops and fields. We show that transfer learning provides promising results for detecting plant locations.

18. Video Contents Understanding using Deep Neural Networks [PDF] Back to Contents
  Mohammadhossein Toutiaee, Abbas Keshavarzi, Abolfazl Farahani, John A. Miller
Abstract: We propose a novel application of Transfer Learning to classify video-frame sequences over multiple classes. This is a pre-weighted model that does not require to train a fresh CNN. This representation is achieved with the advent of "deep neural network" (DNN), which is being studied these days by many researchers. We utilize the classical approaches for video classification task using object detection techniques for comparison, such as "Google Video Intelligence API" and this study will run experiments as to how those architectures would perform in foggy or rainy weather conditions. Experimental evaluation on video collections shows that the new proposed classifier achieves superior performance over existing solutions.

19. Deflating Dataset Bias Using Synthetic Data Augmentation [PDF] Back to Contents
  Nikita Jaipuria, Xianling Zhang, Rohan Bhasin, Mayar Arafa, Punarjay Chakravarty, Shubham Shrivastava, Sagar Manglani, Vidya N. Murali
Abstract: Deep Learning has seen an unprecedented increase in vision applications since the publication of large-scale object recognition datasets and introduction of scalable compute hardware. State-of-the-art methods for most vision tasks for Autonomous Vehicles (AVs) rely on supervised learning and often fail to generalize to domain shifts and/or outliers. Dataset diversity is thus key to successful real-world deployment. No matter how big the size of the dataset, capturing long tails of the distribution pertaining to task-specific environmental factors is impractical. The goal of this paper is to investigate the use of targeted synthetic data augmentation - combining the benefits of gaming engine simulations and sim2real style transfer techniques - for filling gaps in real datasets for vision tasks. Empirical studies on three different computer vision tasks of practical use to AVs - parking slot detection, lane detection and monocular depth estimation - consistently show that having synthetic data in the training mix provides a significant boost in cross-dataset generalization performance as compared to training on real data only, for the same size of the training set.

20. Less is More: Sample Selection and Label Conditioning Improve Skin Lesion Segmentation [PDF] Back to Contents
  Vinicius Ribeiro, Sandra Avila, Eduardo Valle
Abstract: Segmenting skin lesions images is relevant both for itself and for assisting in lesion classification, but suffers from the challenge in obtaining annotated data. In this work, we show that segmentation may improve with less data, by selecting the training samples with best inter-annotator agreement, and conditioning the ground-truth masks to remove excessive detail. We perform an exhaustive experimental design considering several sources of variation, including three different test sets, two different deep-learning architectures, and several replications, for a total of 540 experimental runs. We found that sample selection and detail removal may have impacts corresponding, respectively, to 12% and 16% of the one obtained by picking a better deep-learning model.

21. Boosting Deep Open World Recognition by Clustering [PDF] Back to Contents
  Dario Fontanel, Fabio Cermelli, Massimiliano Mancini, Samuel Rota Bulò, Elisa Ricci, Barbara Caputo
Abstract: While convolutional neural networks have brought significant advances in robot vision, their ability is often limited to closed world scenarios, where the number of semantic concepts to be recognized is determined by the available training set. Since it is practically impossible to capture all possible semantic concepts present in the real world in a single training set, we need to break the closed world assumption, equipping our robot with the capability to act in an open world. To provide such ability, a robot vision system should be able to (i) identify whether an instance does not belong to the set of known categories (i.e. open set recognition), and (ii) extend its knowledge to learn new classes over time (i.e. incremental learning). In this work, we show how we can boost the performance of deep open world recognition algorithms by means of a new loss formulation enforcing a global to local clustering of class-specific features. In particular, a first loss term, i.e. global clustering, forces the network to map samples closer to the class centroid they belong to while the second one, local clustering, shapes the representation space in such a way that samples of the same class get closer in the representation space while pushing away neighbours belonging to other classes. Moreover, we propose a strategy to learn class-specific rejection thresholds, instead of heuristically estimating a single global threshold, as in previous works. Experiments on RGB-D Object and Core50 datasets show the effectiveness of our approach.

22. Pyramid Attention Networks for Image Restoration [PDF] Back to Contents
  Yiqun Mei, Yuchen Fan, Yulun Zhang, Jiahui Yu, Yuqian Zhou, Ding Liu, Yun Fu, Thomas S. Huang, Honghui Shi
Abstract: Self-similarity refers to the image prior widely used in image restoration algorithms that small but similar patterns tend to occur at different locations and scales. However, recent advanced deep convolutional neural network based methods for image restoration do not take full advantage of self-similarities by relying on self-attention neural modules that only process information at the same scale. To solve this problem, we present a novel Pyramid Attention module for image restoration, which captures long-range feature correspondences from a multi-scale feature pyramid. Inspired by the fact that corruptions, such as noise or compression artifacts, drop drastically at coarser image scales, our attention module is designed to be able to borrow clean signals from their "clean" correspondences at the coarser levels. The proposed pyramid attention module is a generic building block that can be flexibly integrated into various neural architectures. Its effectiveness is validated through extensive experiments on multiple image restoration tasks: image denoising, demosaicing, compression artifact reduction, and super resolution. Without any bells and whistles, our PANet (pyramid attention module with simple network backbones) can produce state-of-the-art results with superior accuracy and visual quality.

23. CMRNet++: Map and Camera Agnostic Monocular Visual Localization in LiDAR Maps [PDF] Back to Contents
  Daniele Cattaneo, Domenico Giorgio Sorrenti, Abhinav Valada
Abstract: Localization is a critically essential and crucial enabler of autonomous robots. While deep learning has made significant strides in many computer vision tasks, it is still yet to make a sizeable impact on improving capabilities of metric visual localization. One of the major hindrances has been the inability of existing Convolutional Neural Network (CNN)-based pose regression methods to generalize to previously unseen places. Our recently introduced CMRNet effectively addresses this limitation by enabling map independent monocular localization in LiDAR-maps. In this paper, we now take it a step further by introducing CMRNet++ which is a significantly more robust model that not only generalizes to new places effectively, but is also independent of the camera parameters. We enable this capability by moving any metric reasoning outside of the learning process. Extensive evaluations of our proposed CMRNet++ on three challenging autonomous driving datasets namely, KITTI, Argoverse, and Lyft Level5, demonstrate that our network substantially outperforms CMRNet as well as other baselines by a large margin. More importantly, for the first-time, we demonstrate the ability of a deep learning approach to accurately localize without any retraining or fine-tuning in a completely new environment and independent of the camera parameters.

24. Cross-modal Speaker Verification and Recognition: A Multilingual Perspective [PDF] Back to Contents
  Muhammad Saad Saeed, Shah Nawaz, Pietro Morerio, Arif Mahmood, Ignazio Gallo, Muhammad Haroon Yousaf, Alessio Del Bue
Abstract: Recent years have seen a surge in finding association between faces and voices within a cross-modal biometric application along with speaker recognition. Inspired from this, we introduce a challenging task in establishing association between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: \textit{"Is face-voice association language independent?"} and \textit{"Can a speaker be recognised irrespective of the spoken language?"}. These two questions are very important to understand effectiveness and to boost development of multilingual biometric systems. To answer them, we collected a Multilingual Audio-Visual dataset, containing human speech clips of $154$ identities with $3$ language annotations extracted from various videos uploaded online. Extensive experiments on the three splits of the proposed dataset have been performed to investigate and answer these novel research questions that clearly point out the relevance of the multilingual problem.

25. Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube [PDF] Back to Contents
  Jack Hessel, Zhenhai Zhu, Bo Pang, Radu Soricut
Abstract: Pretraining from unlabelled web videos has quickly become the de-facto means of achieving high performance on many video understanding tasks. Features are learned via prediction of grounded relationships between visual content and automatic speech recognition (ASR) tokens. However, prior pretraining work has been limited to only instructional videos, a domain that, a priori, we expect to be relatively "easy:" speakers in instructional videos will often reference the literal objects/actions being depicted. Because instructional videos make up only a fraction of the web's diverse video content, we ask: can similar models be trained on broader corpora? And, if so, what types of videos are "grounded" and what types are not? We examine the diverse YouTube8M corpus, first verifying that it contains many non-instructional videos via crowd labeling. We pretrain a representative model on YouTube8M and study its success and failure cases. We find that visual-textual grounding is indeed possible across previously unexplored video categories, and that pretraining on a more diverse set still results in representations that generalize to both non-instructional and instructional domains.

26. Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision [PDF] Back to Contents
  Soo-Whan Chung, Hong Goo Kang, Joon Son Chung
Abstract: The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a signficant margin.

27. A Wearable Social Interaction Aid for Children with Autism [PDF] Back to Contents
  Nick Haber, Catalin Voss, Jena Daniels, Peter Washington, Azar Fazel, Aaron Kline, Titas De, Terry Winograd, Carl Feinstein, Dennis P. Wall
Abstract: With most recent estimates giving an incidence rate of 1 in 68 children in the United States, the autism spectrum disorder (ASD) is a growing public health crisis. Many of these children struggle to make eye contact, recognize facial expressions, and engage in social interactions. Today the standard for treatment of the core autism-related deficits focuses on a form of behavior training known as Applied Behavioral Analysis. To address perceived deficits in expression recognition, ABA approaches routinely involve the use of prompts such as flash cards for repetitive emotion recognition training via memorization. These techniques must be administered by trained practitioners and often at clinical centers that are far outnumbered by and out of reach from the many children and families in need of attention. Waitlists for access are up to 18 months long, and this wait may lead to children regressing down a path of isolation that worsens their long-term prognosis. There is an urgent need to innovate new methods of care delivery that can appropriately empower caregivers of children at risk or with a diagnosis of autism, and that capitalize on mobile tools and wearable devices for use outside of clinical settings.

28. A Fast 3D CNN for Hyperspectral Image Classification [PDF] Back to Contents
  Muhammad Ahmad
Abstract: Hyperspectral imaging (HSI) has been extensively utilized for a number of real-world applications. HSI classification (HSIC) is a challenging task due to high inter-class similarity, high intra-class variability, overlapping, and nested regions. A 2D Convolutional Neural Network (CNN) is a viable approach, but HSIC highly depends on both spectral and spatial information; a 3D CNN can be an alternative, but is highly computationally complex due to the volume and spectral dimensions. Furthermore, such models may not extract quality feature maps and may underperform over regions having similar textures. Therefore, this work proposes a 3D CNN model that utilizes both spatial and spectral feature maps to attain good performance. In order to achieve this performance, the HSI cube is first divided into small overlapping 3D patches. These patches are then processed to generate 3D feature maps using a 3D kernel function over multiple contiguous bands, which also preserves the spectral information. Benchmark HSI datasets (Pavia University, Salinas and Indian Pines) are considered to validate the performance of our proposed method. The results are further compared with several state-of-the-art methods.
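To show what "both spectral and spatial" means in code, the sketch below runs 3D convolutions over a hyperspectral patch whose third kernel dimension spans contiguous bands; the patch and layer sizes are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# One overlapping HSI patch: (batch, channel, bands, height, width).
patch = torch.rand(1, 1, 30, 11, 11)

# Kernels extend over 7 (then 5) contiguous bands and 3x3 spatial windows.
conv = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=(7, 3, 3)), nn.ReLU(),
    nn.Conv3d(8, 16, kernel_size=(5, 3, 3)), nn.ReLU(),
)
features = conv(patch)   # -> (1, 16, 20, 7, 7) spatial-spectral feature maps
```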

29. Inf-Net: Automatic COVID-19 Lung Infection Segmentation from CT Scans [PDF] Back to Contents
  Deng-Ping Fan, Tao Zhou, Ge-Peng Ji, Yi Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, Ling Shao
Abstract: Coronavirus Disease 2019 (COVID-19) spread globally in early 2020, causing the world to face an existential health crisis. Automated detection of lung infections from computed tomography (CT) images offers a great potential to augment the traditional healthcare strategy for tackling COVID-19. However, segmenting infected regions from CT scans faces several challenges, including high variation in infection characteristics, and low intensity contrast between infections and normal tissues. Further, collecting a large amount of data is impractical within a short time period, inhibiting the training of a deep model. To address these challenges, a novel COVID-19 Lung Infection Segmentation Deep Network (Inf-Net) is proposed to automatically identify infected regions from chest CT scans. In our Inf-Net, a parallel partial decoder is used to aggregate the high-level features and generate a global map. Then, the implicit reverse attention and explicit edge-attention are utilized to model the boundaries and enhance the representations. Moreover, to alleviate the shortage of labeled data, we present a semi-supervised segmentation framework based on a randomly selected propagation strategy, which only requires a few labeled images and leverages primarily unlabeled data. Our semi-supervised framework can improve the learning ability and achieve a higher performance. Extensive experiments on a COVID-19 infection dataset demonstrate that the proposed COVID-SemiSeg outperforms most cutting-edge segmentation models and advances the state-of-the-art performance.

30. Informative Scene Decomposition for Crowd Analysis, Comparison and Simulation Guidance [PDF] Back to Contents
  Feixiang He, Yuanhang Xiang, Xi Zhao, He Wang
Abstract: Crowd simulation is a central topic in several fields including graphics. To achieve high-fidelity simulations, data has been increasingly relied upon for analysis and simulation guidance. However, the information in real-world data is often noisy, mixed and unstructured, making it difficult for effective analysis, therefore has not been fully utilized. With the fast-growing volume of crowd data, such a bottleneck needs to be addressed. In this paper, we propose a new framework which comprehensively tackles this problem. It centers at an unsupervised method for analysis. The method takes as input raw and noisy data with highly mixed multi-dimensional (space, time and dynamics) information, and automatically structure it by learning the correlations among these dimensions. The dimensions together with their correlations fully describe the scene semantics which consists of recurring activity patterns in a scene, manifested as space flows with temporal and dynamics profiles. The effectiveness and robustness of the analysis have been tested on datasets with great variations in volume, duration, environment and crowd dynamics. Based on the analysis, new methods for data visualization, simulation evaluation and simulation guidance are also proposed. Together, our framework establishes a highly automated pipeline from raw data to crowd analysis, comparison and simulation guidance. Extensive experiments and evaluations have been conducted to show the flexibility, versatility and intuitiveness of our framework.

31. DR-SPAAM: A Spatial-Attention and Auto-regressive Model for Person Detection in 2D Range Data [PDF] Back to Contents
  Dan Jia, Alexander Hermans, Bastian Leibe
Abstract: Detecting persons using a 2D LiDAR is a challenging task due to the low information content of 2D range data. To alleviate the problem caused by the sparsity of the LiDAR points, current state-of-the-art methods fuse multiple previous scans and perform detection using the combined scans. The downside of such a backward looking fusion is that all the scans need to be aligned explicitly, and the necessary alignment operation makes the whole pipeline more expensive -- often too expensive for real-world applications. In this paper, we propose a person detection network which uses an alternative strategy to combine scans obtained at different times. Our method, Distance Robust SPatial Attention and Auto-regressive Model (DR-SPAAM), follows a forward looking paradigm. It keeps the intermediate features from the backbone network as a template and recurrently updates the template when a new scan becomes available. The updated feature template is in turn used for detecting persons currently in the scene. On the DROW dataset, our method outperforms the existing state-of-the-art, while being approximately four times faster, running at 87.2 FPS on a laptop with a dedicated GPU and at 22.6 FPS on an NVIDIA Jetson AGX embedded GPU. We release our code in PyTorch and a ROS node including pre-trained models.

32. Image Morphing with Perceptual Constraints and STN Alignment [PDF] Back to Contents
  Noa Fish, Richard Zhang, Lilach Perry, Daniel Cohen-Or, Eli Shechtman, Connelly Barnes
Abstract: In image morphing, a sequence of plausible frames are synthesized and composited together to form a smooth transformation between given instances. Intermediates must remain faithful to the input, stand on their own as members of the set, and maintain a well-paced visual transition from one to the next. In this paper, we propose a conditional GAN morphing framework operating on a pair of input images. The network is trained to synthesize frames corresponding to temporal samples along the transformation, and learns a proper shape prior that enhances the plausibility of intermediate frames. While individual frame plausibility is boosted by the adversarial setup, a special training protocol producing sequences of frames, combined with a perceptual similarity loss, promote smooth transformation over time. Explicit stating of correspondences is replaced with a grid-based freeform deformation spatial transformer that predicts the geometric warp between the inputs, instituting the smooth geometric effect by bringing the shapes into an initial alignment. We provide comparisons to classic as well as latent space morphing techniques, and demonstrate that, given a set of images for self-supervision, our network learns to generate visually pleasing morphing effects featuring believable in-betweens, with robustness to changes in shape and texture, requiring no correspondence annotation.

33. The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset [PDF] Back to Contents
  Arjun D. Desai, Francesco Caliva, Claudia Iriondo, Naji Khosravan, Aliasghar Mortazi, Sachin Jambawalikar, Drew Torigian, Jutta Ellerman, Mehmet Akcakaya, Ulas Bagci, Radhika Tibrewala, Io Flament, Matthew O'Brien, Sharmila Majumdar, Mathias Perslev, Akshay Pai, Christian Igel, Erik B. Dam, Sibaji Gaj, Mingrui Yang, Kunio Nakamura, Xiaojuan Li, Cem M. Deniz, Vladimir Juras, Ravinder Regatte, Garry E. Gold, Brian A. Hargreaves, Valentina Pedoia, Akshay S. Chaudhari
Abstract: Purpose: To organize a knee MRI segmentation challenge for characterizing the semantic and clinical efficacy of automatic segmentation methods relevant for monitoring osteoarthritis progression. Methods: A dataset partition consisting of 3D knee MRI from 88 subjects at two timepoints with ground-truth articular (femoral, tibial, patellar) cartilage and meniscus segmentations was standardized. Challenge submissions and a majority-vote ensemble were evaluated using Dice score, average symmetric surface distance, volumetric overlap error, and coefficient of variation on a hold-out test set. Similarities in network segmentations were evaluated using pairwise Dice correlations. Articular cartilage thickness was computed per-scan and longitudinally. Correlation between thickness error and segmentation metrics was measured using Pearson's coefficient. Two empirical upper bounds for ensemble performance were computed using combinations of model outputs that consolidated true positives and true negatives. Results: Six teams (T1-T6) submitted entries for the challenge. No significant differences were observed across all segmentation metrics for all tissues (p=1.0) among the four top-performing networks (T2, T3, T4, T6). Dice correlations between network pairs were high (>0.85). Per-scan thickness errors were negligible among T1-T4 (p=0.99) and longitudinal changes showed minimal bias (<0.03mm). Low correlations (<0.41) were observed between segmentation metrics and thickness error. The majority-vote ensemble was comparable to the top-performing networks (p=1.0). Empirical upper-bound performances were similar for both combinations (p=1.0). Conclusion: Diverse networks learned to segment the knee similarly, where high segmentation accuracy did not correlate to cartilage thickness accuracy. Voting ensembles did not outperform individual networks but may help regularize individual models.
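For reference, the Dice score used as the challenge's main overlap metric is simple to compute for binary masks; the helper below is a standard implementation, not challenge code.

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:  # both masks empty: treat as perfect agreement
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```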

34. Span-based Localizing Network for Natural Language Video Localization [PDF] 返回目录
  Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
Abstract: Given an untrimmed video and a text query, natural language video localization (NLVL) aims to locate a matching span in the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task, applying a multimodal matching architecture, or as a regression task that directly regresses the target video span. In this work, we address the NLVL task with a span-based QA approach by treating the input video as a text passage. We propose a video span localizing network (VSLNet), built on top of the standard span-based QA framework, to address NLVL. The proposed VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that the proposed VSLNet outperforms state-of-the-art methods, and that adopting a span-based QA framework is a promising direction for solving NLVL.
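The span-based formulation reduces localization to scoring each frame as a potential start or end, just as extractive QA scores tokens. A minimal sketch with illustrative layer names; the actual VSLNet adds the query-guided highlighting (QGH) module on top of such heads.

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Predict start/end logits over fused video-query features
    (illustrative sketch of a span-based QA head, not VSLNet itself)."""

    def __init__(self, dim: int):
        super().__init__()
        self.start_head = nn.Linear(dim, 1)
        self.end_head = nn.Linear(dim, 1)

    def forward(self, fused: torch.Tensor):
        # fused: (batch, n_frames, dim) video features conditioned on the query
        start_logits = self.start_head(fused).squeeze(-1)
        end_logits = self.end_head(fused).squeeze(-1)
        return start_logits, end_logits
```

At inference one would pick the pair (i, j) with j >= i that maximizes P_start(i) * P_end(j) as the predicted span.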

35. An Auto-Encoder Strategy for Adaptive Image Segmentation [PDF] 返回目录
  Evan M. Yu, Juan Eugenio Iglesias, Adrian V. Dalca, Mert R. Sabuncu
Abstract: Deep neural networks are powerful tools for biomedical image segmentation. These models are often trained with heavy supervision, relying on pairs of images and corresponding voxel-level labels. However, obtaining segmentations of anatomical regions on a large number of cases can be prohibitively expensive. Thus there is a strong need for deep learning-based segmentation tools that do not require heavy supervision and can continuously adapt. In this paper, we propose a novel perspective of segmentation as a discrete representation learning problem, and present a variational autoencoder segmentation strategy that is flexible and adaptive. Our method, called Segmentation Auto-Encoder (SAE), leverages all available unlabeled scans and merely requires a segmentation prior, which can be a single unpaired segmentation image. In experiments, we apply SAE to brain MRI scans. Our results show that SAE can produce good quality segmentations, particularly when the prior is good. We demonstrate that a Markov Random Field prior can yield significantly better results than a spatially independent prior. Our code is freely available at this https URL.
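From the abstract, SAE couples an encoder that emits a (soft) segmentation with a decoder that reconstructs the scan, regularized toward a segmentation prior in VAE fashion. The loss below is an assumed ELBO-style sketch of how such an objective could look, not the paper's actual formulation.

```python
import torch.nn.functional as F

def sae_loss(recon_image, image, seg_logits, prior_logprob, kl_weight=1.0):
    """Assumed segmentation-auto-encoder objective: image reconstruction
    plus a KL-like term pulling the soft segmentation toward a prior.
    `prior_logprob` is E_q[log p(seg)] under the chosen prior (atlas, MRF,
    or a single unpaired segmentation); all names are hypothetical."""
    recon = F.mse_loss(recon_image, image)
    seg = F.softmax(seg_logits, dim=1)  # soft per-voxel label map
    entropy = -(seg * F.log_softmax(seg_logits, dim=1)).sum(dim=1).mean()
    # KL(q || p) = -H(q) - E_q[log p]
    return recon + kl_weight * (-entropy - prior_logprob)
```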

36. Unmanned Aerial Systems for Wildland and Forest Fires: Sensing, Perception, Cooperation and Assistance [PDF] 返回目录
  Moulay A. Akhloufi, Nicolas A. Castro, Andy Couturier
Abstract: Wildfires represent an important natural risk, causing economic losses, loss of human life, and significant environmental damage. In recent years, we have witnessed an increase in fire intensity and frequency. Research has been conducted towards the development of dedicated solutions for wildland and forest fire assistance and fighting. Systems have been proposed for the remote detection and tracking of fires. These systems have shown improvements in efficient data collection and fire characterization within small-scale environments. However, wildfires cover large areas, making some of the proposed ground-based systems unsuitable for optimal coverage. To tackle this limitation, Unmanned Aerial Systems (UAS) have been proposed. UAS have proven to be useful due to their maneuverability, allowing for the implementation of remote sensing, allocation strategies, and task planning. They can provide a low-cost alternative for the prevention, detection, and real-time support of firefighting. In this paper, we review previous work related to the use of UAS in wildfires. Onboard sensor instruments, fire perception algorithms, and coordination strategies are considered. In addition, we present some of the recent frameworks proposing the use of both aerial vehicles and Unmanned Ground Vehicles (UGV) for a more efficient wildland firefighting strategy at a larger scale.

37. Histogram-based Auto Segmentation: A Novel Approach to Segmenting Integrated Circuit Structures from SEM Images [PDF] 返回目录
  Ronald Wilson, Navid Asadizanjani, Domenic Forte, Damon L. Woodard
Abstract: In the Reverse Engineering and Hardware Assurance domain, a majority of the data acquisition is done through electron microscopy techniques such as Scanning Electron Microscopy (SEM). However, unlike optical imaging, only a limited number of techniques are available to enhance and extract information from raw SEM images. In this paper, we introduce an algorithm to segment out Integrated Circuit (IC) structures from the SEM image. Unlike existing algorithms discussed in this paper, our algorithm is unsupervised, parameter-free, and does not require prior information on the noise model or the features in the target image, making it effective in low-quality image acquisition scenarios as well. Furthermore, the results from applying the algorithm to various structures and layers in the IC are reported and discussed.
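The abstract does not spell out the algorithm, so classic Otsu thresholding is shown here purely as a stand-in for the same flavor of unsupervised, parameter-free, histogram-driven segmentation; it is not the paper's method.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the gray level maximizing between-class variance of the
    histogram (classic Otsu; assumes 8-bit single-channel input)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    omega = np.cumsum(p)                    # class-0 probability
    mu = np.cumsum(p * np.arange(256))      # cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

# mask = sem_image > otsu_threshold(sem_image)
```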

38. Character-level Japanese Text Generation with Attention Mechanism for Chest Radiography Diagnosis [PDF] 返回目录
  Kenya Sakka, Kotaro Nakayama, Nisei Kimura, Taiki Inoue, Yusuke Iwasawa, Ryohei Yamaguchi, Yosimasa Kawazoe, Kazuhiko Ohe, Yutaka Matsuo
Abstract: Chest radiography is a general method for diagnosing a patient's condition and identifying important information; therefore, radiography is used extensively in routine medical practice in various situations, such as emergency medical care and medical checkups. However, a high level of expertise is required to interpret chest radiographs, so medical specialists spend considerable time diagnosing huge numbers of radiographs. To solve these problems, methods for generating findings have been proposed. However, the study of generating chest radiograph findings has primarily focused on the English language, and to the best of our knowledge, no studies have examined Japanese data on this subject. There are two challenges involved in generating findings in Japanese. The first is that word splitting is difficult because the boundaries of Japanese words are not clear. The second is that there are numerous orthographic variants. To deal with these two challenges, we propose an end-to-end model that generates Japanese findings at the character level from chest radiographs. In addition, we introduce an attention mechanism to improve not only the accuracy but also the interpretability of the results. We evaluated the proposed method using a public dataset with Japanese findings. The effectiveness of the proposed method was confirmed using the Bilingual Evaluation Understudy (BLEU) score. Moreover, the generated findings confirmed that the proposed method was able to handle orthographic variants. Furthermore, we confirmed via visual inspection that the attention mechanism captures the features and positional information of radiographs.
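The attention mechanism used for interpretability is, in spirit, standard encoder-decoder attention: each generated character attends over radiograph features, and the attention weights double as the visualized saliency map. A minimal Bahdanau-style sketch with illustrative names follows.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention over image-region features, of the kind
    used to tie generated characters to radiograph regions (illustrative
    names, not the paper's exact architecture)."""

    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, enc_feats: torch.Tensor, dec_state: torch.Tensor):
        # enc_feats: (B, n_regions, enc_dim); dec_state: (B, dec_dim)
        e = self.score(torch.tanh(
            self.w_enc(enc_feats) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(e.squeeze(-1), dim=1)  # (B, n_regions)
        context = (weights.unsqueeze(-1) * enc_feats).sum(dim=1)
        return context, weights  # weights give the attention map to inspect
```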

39. Minority Reports Defense: Defending Against Adversarial Patches [PDF] 返回目录
  Michael McCoyd, Won Park, Steven Chen, Neil Shah, Ryan Roggenkemper, Minjune Hwang, Jason Xinyu Liu, David Wagner
Abstract: Deep learning image classification is vulnerable to adversarial attack, even if the attacker changes just a small patch of the image. We propose a defense against patch attacks based on partially occluding the image around each candidate patch location, so that a few occlusions each completely hide the patch. We demonstrate on CIFAR-10, Fashion MNIST, and MNIST that our defense provides certified security against patch attacks of a certain size.
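The defense can be pictured as voting over partially occluded copies of the input: if the occlusion grid is dense enough that, for any possible patch location, a few occlusions fully cover the patch, then a clean image yields (near-)unanimous predictions while a patched one produces dissenting "minority reports". The sketch below uses a sliding zero-valued occluder and illustrative sizes; it is not the authors' code.

```python
import numpy as np

def occlusion_vote(image: np.ndarray, classify, occ_size: int = 12,
                   stride: int = 6):
    """Classify many partially occluded copies of `image` (H, W, C) and
    check agreement. `classify` is any function mapping an image array
    to an integer label; sizes are illustrative, not certified bounds."""
    h, w = image.shape[:2]
    votes = []
    for y in range(0, h - occ_size + 1, stride):
        for x in range(0, w - occ_size + 1, stride):
            occluded = image.copy()
            occluded[y:y + occ_size, x:x + occ_size] = 0  # occluder block
            votes.append(classify(occluded))
    labels, counts = np.unique(votes, return_counts=True)
    majority = int(labels[np.argmax(counts)])
    # unanimity supports the prediction; dissent flags a possible patch
    return majority, counts.max() == len(votes)
```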
