Abstracts

1. High-Capacity Expert Binary Networks
Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Abstract: Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between such models and their real-valued counterparts remains an unsolved and challenging research problem. To this end, we make the following three contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the binary operations within the same budget. (c) To improve network design, we propose a principled binary network growth mechanism that unveils a set of network topologies with favorable properties. Overall, our method improves upon prior work by ~6% with no increase in computational cost, reaching a groundbreaking ~71% on ImageNet classification.
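Contribution (a) can be illustrated with a minimal sketch, assuming a linear gating function and a single feature vector in place of a full convolution. The names `binarize`, `select_expert`, and `expert_binary_layer` are hypothetical, and the abstract does not specify the gating function or the training procedure, so this is only one plausible reading of "select one expert binary filter conditioned on input features".

```python
def binarize(values):
    """Map real values to {-1, +1} with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def select_expert(features, gate_weights):
    """Return the index of the expert with the highest gating score.

    gate_weights[e] is a real-valued weight vector for expert e. The linear
    gate is an assumption; the abstract does not describe the gate.
    """
    scores = [sum(w * x for w, x in zip(weights, features))
              for weights in gate_weights]
    return max(range(len(scores)), key=scores.__getitem__)


def expert_binary_layer(features, gate_weights, expert_filters):
    """Apply only the selected expert's binarized filters to binarized input.

    expert_filters[e] is a list of real-valued filters, each with the same
    length as `features`.
    """
    expert = select_expert(features, gate_weights)
    x_bin = binarize(features)
    outputs = []
    for real_filter in expert_filters[expert]:
        w_bin = binarize(real_filter)
        # With +/-1 operands, this dot product is what XNOR + popcount computes.
        outputs.append(sum(w * x for w, x in zip(w_bin, x_bin)))
    return expert, outputs


# Illustrative values only.
features = [0.7, -1.2, 0.1, -0.4]
gate_weights = [[1.0, 0.0, 0.0, 0.0],    # expert 0 responds to feature 0
                [0.0, -1.0, 0.0, 0.0]]   # expert 1 responds to a negative feature 1
expert_filters = [
    [[0.5, 0.5, -0.5, -0.5]],
    [[0.3, -0.2, 0.9, -0.1]],
]
chosen, out = expert_binary_layer(features, gate_weights, expert_filters)
```

Because only one expert is evaluated per input, the number of binary operations matches that of a single filter bank, which is consistent with the abstract's claim of no increase in computational cost.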

2. On the Evaluation of Generative Adversarial Networks By Discriminative Models
Amirsina Torfi, Mohammadreza Beyki, Edward A. Fox
Abstract: Generative Adversarial Networks (GANs) can accurately model complex multi-dimensional data and generate realistic samples. However, due to their implicit estimation of data distributions, their evaluation is a challenging task. The majority of research efforts associated with tackling this issue were validated by qualitative visual evaluation. Such approaches do not generalize well beyond the image domain. Since many of those evaluation metrics are proposed for and bound to the vision domain, they are difficult to apply to other domains. Quantitative measures are necessary to better guide the training and comparison of different GAN models. In this work, we leverage Siamese neural networks to propose a domain-agnostic evaluation metric that (1) yields a qualitative evaluation consistent with human evaluation, (2) is robust to common GAN issues such as mode dropping and invention, and (3) does not require any pretrained classifier. The empirical results in this paper demonstrate the superiority of this method over the popular Inception Score and show that it is competitive with the FID score.

3. BoMuDA: Boundless Multi-Source Domain Adaptive Segmentation in Unconstrained Environments
Divya Kothandaraman, Rohan Chandra, Dinesh Manocha
Abstract: We present an unsupervised multi-source domain adaptive semantic segmentation approach for unstructured and unconstrained traffic environments. We propose a novel training strategy that alternates between single-source domain adaptation (DA) and multi-source distillation, and also between setting up an improvised cost function and optimizing it. In each iteration, the single-source DA first learns a neural network on a selected source, which is followed by a multi-source fine-tuning step using the remaining sources. We call this training routine the Alternating-Incremental ("Alt-Inc") algorithm. Furthermore, our approach is also boundless, i.e., it can explicitly classify categories that do not belong to the training dataset (as opposed to labeling such objects as "unknown"). We have conducted extensive experiments and ablation studies using the Indian Driving Dataset, CityScapes, Berkeley DeepDrive, GTA V, and Synscapes datasets, and we show that our unsupervised approach outperforms other unsupervised and semi-supervised SOTA benchmarks by 5.17% to 42.9% while reducing model size by up to 5.2x.

4. Learning Monocular 3D Vehicle Detection without 3D Bounding Box Labels
L. Koestler, N. Yang, R. Wang, D. Cremers
Abstract: The training of deep-learning-based 3D object detectors requires large datasets with 3D bounding box labels for supervision that have to be generated by hand-labeling. We propose a network architecture and training procedure for learning monocular 3D object detection without 3D bounding box labels. By representing the objects as triangular meshes and employing differentiable shape rendering, we define loss functions based on depth maps, segmentation masks, and ego- and object-motion, which are generated by pre-trained, off-the-shelf networks. We evaluate the proposed algorithm on the real-world KITTI dataset and achieve promising performance in comparison to state-of-the-art methods requiring 3D bounding box labels for training and superior performance to conventional baseline methods.

5. Reconfigurable Cyber-Physical System for Lifestyle Video-Monitoring via Deep Learning
Daniel Deniz, Francisco Barranco, Juan Isern, Eduardo Ros
Abstract: Indoor monitoring of people in their homes has become a popular application in Smart Health. Advances in Machine Learning and in hardware for embedded devices enable new distributed approaches for Cyber-Physical Systems (CPSs). In addition, changing environments and the need for cost reduction motivate novel reconfigurable CPS architectures. In this work, we propose a reconfigurable indoor-monitoring CPS that uses embedded local nodes (Nvidia Jetson TX2). We embed Deep Learning architectures to address Human Action Recognition. Local processing at these nodes lets us tackle some common issues: reduction of data bandwidth usage and preservation of privacy (no raw images are transmitted). Real-time processing is also facilitated, since each optimized node processes only its local video feed. Regarding the reconfiguration, a remote platform monitors CPS qualities, and a Quality and Resource Management (QRM) tool sends commands to the CPS core to trigger its reconfiguration. Our proposal is an energy-aware system that triggers reconfiguration based on the energy consumption of battery-powered nodes. Reconfiguration reduces the local nodes' energy consumption by up to 22%, extending the device operating time while preserving accuracy similar to that of the alternative with no reconfiguration.

6. Super-Human Performance in Online Low-latency Recognition of Conversational Speech
Thai-Son Nguyen, Sebastian Stueker, Alex Waibel
Abstract: Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult than read speech, as hesitations, disfluencies, false starts and sloppy articulation complicate acoustic processing and require robust joint handling of acoustic, lexical and language context. Early attempts, even with statistical models, could only reach error rates in excess of 50%, far from human performance (WER of around 5.5%). Neural hybrid models and recent attention-based encoder-decoder models have considerably improved performance, as such context can now be learned in an integral fashion. However, processing such contexts requires presentation of an entire utterance and thus introduces unwanted delays before a recognition result can be output. In this paper, we address performance as well as latency. We present results for a system that is able to achieve super-human performance (a WER of 5.0% on the Switchboard conversational benchmark) at a latency of only 1 second behind a speaker's speech. The system uses attention-based encoder-decoder networks, but can also be configured to use ensembles with Transformer-based models at low latency.
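The word error rate (WER) figures quoted above are conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard definition; the function name `word_error_rate` and the example sentences are illustrative, and the scoring and text normalization actually used for the Switchboard results are not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Since insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.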

7. Universal Weighting Metric Learning for Cross-Modal Matching
Jiwei Wei, Xing Xu, Yang Yang, Yanli Ji, Zheng Wang, Heng Tao Shen
Abstract: Cross-modal matching has been a highlighted research topic in both vision and language areas. Learning an appropriate mining strategy to sample and weight informative pairs is crucial for cross-modal matching performance. However, most existing metric learning methods are developed for unimodal matching and are unsuitable for cross-modal matching on multimodal data with heterogeneous features. To address this problem, we propose a simple and interpretable universal weighting framework for cross-modal matching, which provides a tool to analyze the interpretability of various loss functions. Furthermore, we introduce a new polynomial loss under the universal weighting framework, which defines separate weight functions for the positive and negative informative pairs. Experimental results on two image-text matching benchmarks and two video-text matching benchmarks validate the efficacy of the proposed method.

8. A study on using image based machine learning methods to develop the surrogate models of stamp forming simulations
Haosu Zhou, Qingfeng Xu, Nan Li
Abstract: In the design optimization of metal forming, it is increasingly important to use surrogate models to analyse finite element analysis (FEA) simulations. However, traditional surrogate models using scalar based machine learning methods (SBMLMs) fall short in accuracy and generalizability. This is because SBMLMs fail to harness the location information of the simulations. To overcome these shortcomings, image based machine learning methods (IBMLMs) are leveraged in this paper. The underlying theory of location information, which supports the advantages of IBMLMs, is qualitatively interpreted. Based on this theory, a Res-SE-U-Net IBMLM surrogate model is developed and compared with a multi-layer perceptron (MLP) as a reference SBMLM surrogate model. It is demonstrated that the IBMLM model is advantageous over the MLP SBMLM model in accuracy, generalizability, robustness, and informativeness. This paper presents a promising methodology for leveraging IBMLMs in surrogate models to make maximum use of the information in FEA results. Prospective future studies inspired by this paper are also discussed.

9. Deep Learning in Diabetic Foot Ulcers Detection: A Comprehensive Evaluation
Moi Hoon Yap, Ryo Hachiuma, Azadeh Alavi, Raphael Brungel, Manu Goyal, Hongtao Zhu, Bill Cassidy, Johannes Ruckert, Moshe Olshansky, Xiao Huang, Hideo Saito, Saeed Hassanpour, Christoph M. Friedrich, David Ascher, Anping Song, Hiroki Kajita, David Gillespie, Neil D. Reeves, Joseph Pappachan, Claire O'Shea, Eibe Frank
Abstract: There has been a substantial amount of research on computer methods and technology for the detection and recognition of diabetic foot ulcers (DFUs), but there is a lack of systematic comparisons of state-of-the-art deep learning object detection frameworks applied to this problem. With recent development and data sharing performed as part of the DFU Challenge (DFUC2020), such a comparison becomes possible: DFUC2020 provided participants with a comprehensive dataset consisting of 2,000 images for training each method and 2,000 images for testing them. The following deep learning-based algorithms are compared in this paper: Faster R-CNN; three variants of Faster R-CNN; an ensemble consisting of four models obtained using Faster R-CNN approaches; YOLOv3; YOLOv5; EfficientDet; and a new Cascade Attention Network. These deep learning methods achieved the top 5 results in DFUC2020. For each deep learning method, this paper provides a detailed description of how the model was obtained, including pre-processing, data augmentation, architecture, training and post-processing, and parameter settings. We provide a comprehensive evaluation of each method. All the methods required a data augmentation stage to increase the number of images available for training and a post-processing stage to remove false positives. The best performance is obtained by Deformable Convolution, a variant of Faster R-CNN, with a mAP of 0.6940 and an F1-Score of 0.7434. Our results show that state-of-the-art deep learning methods can detect DFUs with some accuracy, but there are many challenges ahead before they can be implemented in real-world settings.
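For readers less familiar with the reported F1-Score, the sketch below shows the standard definition in terms of detection counts. The function name `detection_scores` and the counts in the example are made up for illustration; they are not DFUC2020 results, and the challenge's matching criterion between predicted and ground-truth boxes is not stated in the abstract.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 correct detections, 20 spurious, 40 missed ulcers.
precision, recall, f1 = detection_scores(80, 20, 40)
```

A post-processing stage that removes false positives, as mentioned in the abstract, raises precision and therefore F1 as long as it does not also discard true detections.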

10. Multi-label classification of promotions in digital leaflets using textual and visual information
Roberto Arroyo, David Jiménez-Cabello, Javier Martínez-Cebrián
Abstract: Product descriptions in e-commerce platforms contain detailed and valuable information about retailers' assortments. In particular, coding promotions within digital leaflets is of great interest in e-commerce, as they capture the attention of consumers by showing regular promotions for different products. However, this information is embedded into images, making it difficult to extract and process for downstream tasks. In this paper, we present an end-to-end approach that classifies promotions within digital leaflets into their corresponding product categories using both visual and textual information. Our approach can be divided into three key components: 1) region detection, 2) text recognition and 3) text classification. In many cases, a single promotion refers to multiple product categories, so we introduce a multi-label objective in the classification head. We demonstrate the effectiveness of our approach on two separate tasks: 1) image-based detection of the descriptions for each individual promotion and 2) multi-label classification of the product categories using the text from the product descriptions. We train and evaluate our models using a private dataset composed of images from digital leaflets obtained by Nielsen. Results show that we consistently outperform the proposed baseline by a large margin in all the experiments.

11. Contour Primitive of Interest Extraction Network Based on One-shot Learning for Object-Agnostic Vision Measurement
Fangbo Qin, Jie Qin, Siyu Huang, De Xu
Abstract: Image contour based vision measurement is widely applied in robot manipulation and industrial automation. It is appealing to realize an object-agnostic vision system, which can be conveniently reused for various types of objects. We propose the contour primitive of interest extraction network (CPieNet) based on the one-shot learning framework. First, CPieNet's contour primitive of interest (CPI) output, a designated regular contour part lying on a specified object, provides the essential geometric information for vision measurement. Second, CPieNet has one-shot learning ability, utilizing a support sample to assist the perception of a novel object. To realize lower-cost training, we generate support-query sample pairs from unpaired online public images, which cover a wide range of object categories. To obtain a single-pixel-wide contour for precise measurement, a Gabor-filter-based non-maximum suppression is designed to thin the raw contour. For the novel CPI extraction task, we built the Object Contour Primitives dataset using online public images, and the Robotic Object Contour Measurement dataset using a camera mounted on a robot. The effectiveness of the proposed methods is validated by a series of experiments.

12. YOdar: Uncertainty-based Sensor Fusion for Vehicle Detection with Camera and Radar Sensors
Kamil Kowol, Matthias Rottmann, Stefan Bracke, Hanno Gottschalk
Abstract: In this work, we present an uncertainty-based method for sensor fusion with camera and radar data. The outputs of two neural networks, one processing camera data and the other radar data, are combined in an uncertainty-aware manner. To this end, we gather the outputs and corresponding meta information for both networks. For each predicted object, the gathered information is post-processed by a gradient boosting method to produce a joint prediction of both networks. In our experiments we combine the YOLOv3 object detection network with a customized 1D radar segmentation network and evaluate our method on the nuScenes dataset. In particular we focus on night scenes, where the capability of object detection networks based on camera data is potentially handicapped. Our experiments show that this approach of uncertainty-aware fusion, which is also highly modular, gains significant performance compared to single-sensor baselines and is within range of specifically tailored deep-learning-based fusion approaches.

13. Rotation-Invariant Local-to-Global Representation Learning for 3D Point Cloud
Seohyun Kim, Jaeyoo Park, Bohyung Han
Abstract: We propose a local-to-global representation learning algorithm for 3D point cloud data, which is suited to handling various geometric transformations, especially rotation, without explicit data augmentation with respect to the transformations. Our model takes advantage of multi-level abstraction based on graph convolutional neural networks, which constructs a descriptor hierarchy to encode rotation-invariant shape information of an input object in a bottom-up manner. The descriptors in each level are obtained from neural networks based on graphs via stochastic sampling of 3D points, which is effective in making the learned representations robust to variations in the input data. The proposed algorithm achieves state-of-the-art performance on the rotation-augmented 3D object recognition benchmarks, and we further analyze its characteristics through comprehensive ablation experiments.
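The reason a rotation-invariant descriptor removes the need for rotation augmentation can be seen in a small sketch. It uses sorted pairwise distances as a stand-in descriptor; this is not the paper's graph-based descriptor hierarchy, which the abstract does not detail, and the helper names `rotate_z` and `sorted_pairwise_distances` are hypothetical.

```python
import math


def rotate_z(points, angle):
    """Rotate 3D points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def sorted_pairwise_distances(points):
    """A simple rotation-invariant descriptor: all pairwise distances, sorted."""
    distances = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distances.append(math.dist(points[i], points[j]))
    return sorted(distances)


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
rotated = rotate_z(cloud, math.radians(37.0))

original_descriptor = sorted_pairwise_distances(cloud)
rotated_descriptor = sorted_pairwise_distances(rotated)

# Raw coordinates change under rotation, but the descriptor does not.
coordinates_changed = any(
    abs(a - b) > 1e-9 for p, q in zip(cloud, rotated) for a, b in zip(p, q)
)
descriptor_unchanged = all(
    abs(a - b) < 1e-9 for a, b in zip(original_descriptor, rotated_descriptor)
)
```

A model that consumes only such relative quantities sees identical input for every rotated copy of an object, so rotated training copies add no new information.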

14. CD-UAP: Class Discriminative Universal Adversarial Perturbation
Chaoning Zhang, Philipp Benz, Tooba Imtiaz, In So Kweon
Abstract: A single universal adversarial perturbation (UAP) can be added to all natural images to change most of their predicted class labels. It is of high practical relevance for an attacker to have flexible control over the targeted classes to be attacked; however, the existing UAP method attacks samples from all classes. In this work, we propose a new universal attack method to generate a single perturbation that fools a target network into misclassifying only a chosen group of classes, while having limited influence on the remaining classes. Since the proposed attack generates a universal adversarial perturbation that discriminates between targeted and non-targeted classes, we term it class discriminative universal adversarial perturbation (CD-UAP). We propose one simple yet effective algorithm framework, under which we design and compare various loss function configurations tailored for the class discriminative universal attack. The proposed approach has been evaluated with extensive experiments on various benchmark datasets. Additionally, our proposed approach achieves state-of-the-art performance for the original task of UAP attacking all classes, which demonstrates the effectiveness of our approach.
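One way to quantify the goal stated above (fool the targeted classes, leave the rest alone) is to measure how often the predicted label changes in each group. The sketch below is an assumed evaluation helper, not the paper's metric: the name `group_fooling_ratios` is hypothetical, samples are grouped by their clean prediction rather than by ground-truth label, and the example predictions are made up.

```python
def group_fooling_ratios(clean_predictions, perturbed_predictions, targeted_classes):
    """Return (targeted_ratio, non_targeted_ratio).

    Each ratio is the fraction of samples in the group whose predicted label
    changes once the single universal perturbation is added.
    """
    targeted = set(targeted_classes)
    changed = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for clean, perturbed in zip(clean_predictions, perturbed_predictions):
        in_target_group = clean in targeted
        total[in_target_group] += 1
        if clean != perturbed:
            changed[in_target_group] += 1

    def ratio(group):
        return changed[group] / total[group] if total[group] else 0.0

    return ratio(True), ratio(False)


# Hypothetical predictions for six images; class 0 is the targeted class.
clean = [0, 0, 1, 1, 2, 2]
perturbed = [3, 2, 1, 1, 2, 0]
targeted_ratio, non_targeted_ratio = group_fooling_ratios(clean, perturbed, {0})
```

Under this reading, a class discriminative perturbation aims for a high first value and a low second value, whereas a conventional UAP drives both toward 1.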

15. Attention Model Enhanced Network for Classification of Breast Cancer Image
Xiao Kang, Xingbo Liu, Xiushan Nie, Xiaoming Xi, Yilong Yin
Abstract: Breast cancer classification remains a challenging task due to inter-class ambiguity and intra-class variability. Existing deep learning-based methods try to confront this challenge by utilizing complex nonlinear projections. However, these methods typically extract global features from entire images, neglecting the fact that subtle detail information can be crucial in extracting discriminative features. In this study, we propose a novel method named Attention Model Enhanced Network (AMEN), which is formulated in a multi-branch fashion with a pixel-wise attention model and a classification submodule. Specifically, the feature learning part in AMEN generates a pixel-wise attention map, while the classification submodule is used to classify the samples. To focus more on subtle detail information, the sample image is enhanced by the pixel-wise attention map generated from the former branch. Furthermore, a boosting strategy is adopted to fuse classification results from different branches for better performance. Experiments conducted on three benchmark datasets demonstrate the superiority of the proposed method under various scenarios.

16. Learning Binary Semantic Embedding for Histology Image Classification and Retrieval
Xiao Kang, Xingbo Liu, Xiushan Nie, Yilong Yin
Abstract: With the development of medical imaging technology and machine learning, computer-assisted diagnosis, which can provide a valuable reference for pathologists, attracts extensive research interest. The exponential growth of medical images and the uninterpretability of traditional classification models have hindered the application of computer-assisted diagnosis. To address these issues, we propose a novel method for Learning Binary Semantic Embedding (LBSE). Based on the efficient and effective embedding, classification and retrieval are performed to provide interpretable computer-assisted diagnosis for histology images. Furthermore, double supervision, bit uncorrelation and balance constraints, an asymmetric strategy and discrete optimization are seamlessly integrated in the proposed method for learning binary embedding. Experiments conducted on three benchmark datasets validate the superiority of LBSE under various scenarios.

17. Variational Transfer Learning for Fine-grained Few-shot Visual Recognition
Jingyi Xu, Mingzhen Huang, ShahRukh Athar, Dimitris Samaras
Abstract: Fine-grained few-shot recognition often suffers from the problem of training data scarcity for novel categories. The network tends to overfit and does not generalize well to unseen classes due to insufficient training data. Many methods have been proposed to synthesize additional data to support the training. In this paper, we focus on enlarging the intra-class variance of the unseen class to improve few-shot classification performance. We assume that the distribution of intra-class variance generalizes across the base class and the novel class. Thus, the intra-class variance of the base set can be transferred to the novel set for feature augmentation. Specifically, we first model the distribution of intra-class variance on the base set via variational inference. Then the learned distribution is transferred to the novel set to generate additional features, which are used together with the original ones to train a classifier. Experimental results show a significant boost over the state-of-the-art methods on the challenging fine-grained few-shot image classification benchmarks.
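The transfer idea can be sketched under a strong simplification: instead of the variational inference used in the paper, the example below estimates a per-dimension (diagonal) Gaussian over within-class deviations on the base set and samples from it around a novel-class feature. The names `per_dimension_variance` and `augment`, the toy feature vectors, and the diagonal Gaussian are all assumptions made for illustration.

```python
import random


def per_dimension_variance(class_features):
    """Pool within-class squared deviations over all base classes.

    class_features maps a class label to a list of equal-length feature vectors.
    """
    dim = len(next(iter(class_features.values()))[0])
    squared_sum = [0.0] * dim
    count = 0
    for vectors in class_features.values():
        mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        for v in vectors:
            for d in range(dim):
                squared_sum[d] += (v[d] - mean[d]) ** 2
            count += 1
    return [s / count for s in squared_sum]


def augment(novel_feature, variance, num_samples, rng):
    """Generate extra features by adding base-set-like noise to a novel feature."""
    return [
        [x + rng.gauss(0.0, var ** 0.5) for x, var in zip(novel_feature, variance)]
        for _ in range(num_samples)
    ]


# Toy base set with two classes and 2-D features.
base_features = {
    "a": [[0.0, 0.0], [2.0, 0.0]],
    "b": [[5.0, 1.0], [5.0, 3.0]],
}
variance = per_dimension_variance(base_features)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
augmented = augment([10.0, 10.0], variance, num_samples=3, rng=rng)
```

The augmented vectors would then be combined with the original novel-class features to train the classifier, as the abstract describes.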

18. Learning Clusterable Visual Features for Zero-Shot Recognition
Jingyi Xu, Zhixin Shu, Dimitris Samaras
Abstract: In zero-shot learning (ZSL), conditional generators have been widely used to generate additional training features. These features can then be used to train the classifiers for testing data. However, some testing data are considered "hard" as they lie close to the decision boundaries and are prone to misclassification, leading to performance degradation for ZSL. In this paper, we propose to learn clusterable features for ZSL problems. Using a Conditional Variational Autoencoder (CVAE) as the feature generator, we project the original features to a new feature space supervised by an auxiliary classification loss. To further increase clusterability, we fine-tune the features using a Gaussian similarity loss. The clusterable visual features are not only more suitable for CVAE reconstruction but are also more separable, which improves classification accuracy. Moreover, we introduce Gaussian noise to enlarge the intra-class variance of the generated features, which helps to improve the classifier's robustness. Our experiments on the SUN, CUB, and AWA2 datasets show consistent improvement over previous state-of-the-art ZSL results by a large margin. In addition to its effectiveness on zero-shot classification, experiments show that our method of increasing feature clusterability benefits few-shot learning algorithms as well.

19. RealSmileNet: A Deep End-To-End Network for Spontaneous and Posed Smile Recognition
Yan Yang, Md Zakir Hossain, Tom Gedeon, Shafin Rahman
Abstract: Smiles play a vital role in the understanding of social interactions within different communities, and reveal the physical state of mind of people in both real and deceptive ways. Several methods have been proposed to recognize spontaneous and posed smiles. All follow a feature-engineering-based pipeline requiring costly pre-processing steps such as manual annotation of face landmarks, tracking, segmentation of smile phases, and hand-crafted features. The resulting computation is expensive and strongly dependent on pre-processing steps. We investigate an end-to-end deep learning model to address these problems, the first end-to-end model for spontaneous and posed smile recognition. Our fully automated model is fast and learns the feature extraction processes by training a series of convolution and ConvLSTM layers from scratch. Our experiments on four datasets demonstrate the robustness and generalization of the proposed model by achieving state-of-the-art performance.

20. Vision-Based Object Recognition in Indoor Environments Using Topologically Persistent Features
Ekta U. Samani, Xingjian Yang, Ashis G. Banerjee
Abstract: Object recognition in unseen indoor environments has been challenging for most state-of-the-art object detectors. To address this challenge, we propose the use of topologically persistent features for object recognition. We extract two kinds of persistent features from binary segmentation maps, namely sparse PI features and amplitude features, by applying persistent homology to filtrations of the cubical complexes of segmentation maps generated using height functions in multiple directions. The features are used for training a fully connected network for recognition. For performance evaluation, in addition to a widely used shape dataset, we collect new datasets comprising scene images from two different environments, i.e., a living room and a warehouse. Scene images in both environments consist of up to five different objects with varying poses, chosen from a set of fourteen objects, taken in varying illumination conditions from different distances and camera viewing angles. The overall performance of our methods, trained using living room images, remains relatively unaltered on the unseen warehouse images, without retraining. In contrast, a state-of-the-art object detector's accuracy drops considerably. Our methods also achieve higher overall recall and accuracy than the state-of-the-art in the unseen warehouse environment. We also implement the proposed framework on a real-world robot to demonstrate its usefulness.
摘要：在看不见的室内环境中物体识别已具有挑战性的状态的最先进的最物体检测器。为了应对这一挑战，我们提出了目标识别使用的拓扑持久性特征。我们提取2种的二值分割的地图，即稀疏PI特征和振幅特征，持久性特征通过应用持久同源性细分的立方体的复合物的过滤使用映射函数的高度在多个方向上生成的。该功能用于训练一个完全连接的网络进行识别。性能评价中，除了一种广泛使用的形状的数据集，我们收集包括从两个不同的环境中，即，一个客厅和一个仓库的场景图像的新的数据集。在两种环境中的场景图像由最多具有不同姿势，从一组14个的对象选择的五个不同的对象，采取改变从不同的距离和照相机的视角的照明条件。我们的方法的整体性能，使用客厅图片的训练，留在看不见的仓库图像相对不变，没有再培训。与此相反，一个国家的最先进的对象检测器的精度显着下降。我们的方法也能达到更高的整体召回和精度比国家的最先进的看不见的仓库环境。我们还实行对现实世界的机器人所提议的框架来证明其有效性。

21. VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks
Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon
Abstract: Text-to-image multimodal tasks, generating/retrieving an image from a given text description, are extremely challenging since raw text descriptions carry too little information to fully describe visually realistic images. We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input. First, we use the text description as initial input and conduct dependency parsing to extract the syntactic structure and analyse the semantic aspect, including object quantities, to extract the scene graph. Then, we train the extracted objects, attributes, and relations in the scene graph and the corresponding geometric relation information using Graph Convolutional Networks, which generates a text representation that integrates textual and visual semantic information. The text representation is aggregated with word-level and sentence-level embeddings to generate both visual contextual word and sentence representations. For the evaluation, we attached VICTR to the state-of-the-art models in text-to-image generation. VICTR is easily added to existing models and improves across both quantitative and qualitative aspects.
摘要：文本到图像多峰的任务，生成/从给定的文本描述检索图像，是极具挑战性的任务，因为原始文本描述盖以便相当有限信息，以充分地描述在视觉上逼真的图像。我们提出了文本到影像多任务，VICTR，捕捉从文本输入对象的丰富的视觉语义信息的新的视觉语境的文字表示。首先，我们使用文本描述作为初始输入和行为依赖解析以提取语法结构和分析语义方面，包括对象的数量，以提取场景图。然后，我们培养所提取的对象，属性，以及在场景图中的关系，使用Graph卷积网络对应的几何关系的信息，并生成文本表示它集成了文本和视觉的语义信息。该文本表示汇总与字级和句子级嵌入同时生成视觉语境单词和句子的表示。对于评估，我们连接到VICTR国家的最先进的机型文本到影像generation.VICTR容易加到现有的模型和跨定量和定性两个方面得到改善。

22. A Study on Trees's Knots Prediction from their Bark Outer-Shape
Mejri Mohamed, Antoine Richard, Cedric Pradalier
Abstract: In the industry, the value of wood-logs strongly depends on their internal structure and more specifically on the knots' distribution inside the trees. As of today, CT-scanners are the prevalent tool to acquire accurate images of the trees' internal structure. However, CT-scanners are expensive and slow, making their use impractical for most industrial applications. Knowing where the knots are within a tree could improve the efficiency of the overall tree industry by reducing waste and improving the quality of wood-logs by-products. In this paper we evaluate different deep-learning based architectures to predict the internal knots distribution of a tree from its outer-shape, something that has never been done before. Three types of techniques based on Convolutional Neural Networks (CNN) will be studied. The architectures are tested on both real and synthetic CT-scanned trees. With these experiments, we demonstrate that CNNs can be used to predict internal knots distribution based on the external surface of the trees. The goal is to show that these inexpensive and fast methods could be used to replace the CT-scanners. Additionally, we look into the performance of several off-the-shelf object-detectors to detect knots inside CT-scanned images. This method is used to autonomously label part of our real CT-scanned trees, alleviating the need to manually segment the whole of the images.
摘要：在工业，木材，原木的价值很大程度上取决于其内部结构，更具体的结树内分布。截至今天，CT，扫描仪获取树木内部结构的精确图像的普遍工具。然而，CT的扫描仪价格昂贵，速度慢，使得它们的使用不切实际的大多数工业应用。知道哪里是结是一个树内可以提高整体的树行业通过减少浪费和副产品提高木原木的质量效率。在本文中，我们评估不同的深学习基础架构来预测其外部形状的树，一些从未做过的内部结分布。三种类型的基于卷积神经网络（CNN）技术将进行研究。该体系结构在两个真实的和合成的CT扫描树测试。通过这些实验，我们证明了细胞神经网络，可以使用基于树的外表面上预测其内部结分布。其目标是要表明，这些廉价和快速的方法可以用来代替CT扫描器。此外，我们考虑的几个关的，现成的对象检测器的性能，以检测内部CT扫描图像结。此方法用于我们真正的CT扫描树减轻需要整个图像的手动段的自主标签的一部分。

23. DML-GANR: Deep Metric Learning With Generative Adversarial Network Regularization for High Spatial Resolution Remote Sensing Image Retrieval
Yun Cao, Yuebin Wang, Junhuan Peng, Liqiang Zhang, Linlin Xu, Kai Yan, Lihua Li
Abstract: Training with a small number of labeled samples can save considerable manpower and material resources, especially when the number of high spatial resolution remote sensing images (HSR-RSIs) increases considerably. However, many deep models face the problem of overfitting when using a small number of labeled samples. This might degrade HSR-RSI retrieval accuracy. Aiming at obtaining more accurate HSR-RSI retrieval performance with small training samples, we develop a deep metric learning approach with generative adversarial network regularization (DML-GANR) for HSR-RSI retrieval. The DML-GANR starts from a high-level feature extraction (HFE) to extract high-level features, which includes convolutional layers and fully connected (FC) layers. Each of the FC layers is constructed by deep metric learning (DML) to maximize the interclass variations and minimize the intraclass variations. The generative adversarial network (GAN) is adopted to mitigate the overfitting problem and validate the qualities of extracted high-level features. DML-GANR is optimized through a customized approach, and the optimal parameters are obtained. The experimental results on the three data sets demonstrate the superior performance of DML-GANR over state-of-the-art techniques in HSR-RSI retrieval.
摘要：利用少量训练标记的样品，它可以节省大量的人力和物力，特别是当高空间分辨率的量遥感图像（HSR-RSIS）显着地增加。然而，许多深模特脸上使用少量的标签样本时过度拟合的问题。这可能会降低HSRRSI检索精度。针对获得与小训练样本更准确HSR-RSI检索性能，我们开发出生成对抗网络正规化（DML-GANR）为HSR-RSI检索了深刻的度量学习方法。从高级特征提取（HFE）的DML-GANR开始以提取高级特征，其中包括卷积层和完全连接（FC）的层。所述FC层的每一个由深度量学习（DML）以最大化类间的变化最小化和组内的变化构成的。的生成对抗网络（GAN）被采用以减轻所述过拟合问题，并验证所提取的高级特征的品质。 DML-GANR通过定制的方法进行了优化，并获得了最佳参数。对三个数据集的实验结果表明，DML-GANR超过在HSR-RSI检索状态的最先进的技术的优良的性能。

24. SLCRF: Subspace Learning with Conditional Random Field for Hyperspectral Image Classification
Yun Cao, Jie Mei, Yuebin Wang, Liqiang Zhang, Junhuan Peng, Bing Zhang, Lihua Li, Yibo Zheng
Abstract: Subspace learning (SL) plays an important role in hyperspectral image (HSI) classification, since it can provide an effective solution to reduce the redundant information in the image pixels of HSIs. Previous works about SL aim to improve the accuracy of HSI recognition. Using a large number of labeled samples, related methods can train the parameters of the proposed solutions to obtain better representations of HSI pixels. However, the data instances may not be sufficient to learn a precise model for HSI classification in real applications. Moreover, it is well-known that it takes much time, labor and human expertise to label HSI images. To avoid the aforementioned problems, a novel SL method that includes the probability assumption called subspace learning with conditional random field (SLCRF) is developed. In SLCRF, first, the 3D convolutional autoencoder (3DCAE) is introduced to remove the redundant information in HSI pixels. In addition, the relationships are also constructed using the spectral-spatial information among the adjacent pixels. Then, the conditional random field (CRF) framework can be constructed and further embedded into the HSI SL procedure with the semi-supervised approach. Through the linearized alternating direction method termed LADMAP, the objective function of SLCRF is optimized using a defined iterative algorithm. The proposed method is comprehensively evaluated using the challenging public HSI datasets. We can achieve state-of-the-art performance using these HSI sets.
摘要：子空间学习（SL）起着光谱图像（HSI）分类的重要作用，因为它可以提供减少HSIS的图像像素的冗余信息的有效解决方案。关于SL以前的作品旨在提高HSI识别的准确性。使用大量标记的样品，相关方法可以训练提出的解决方案的参数，以获得HSI像素的更好表示。然而，该数据实例可能不足以足够的学习在实际应用恒指分类的精确模型。此外，这是众所周知的，它需要大量的时间，劳动和人力专长标记HSI图像。为了避免上述问题，其包括称为子空间中的概率假设与条件随机场学习一种新的方法，SL（SLCRF）显影。在SLCRF，首先，三维卷积自动编码器（3DCAE）被引入以去除在HSI像素中的冗余信息。此外，该关系是利用相邻像素之间的谱空间信息也构成。然后，条件随机场（CRF）的框架可以被构造并进一步嵌入到与半监督方法的HSI SL过程。通过线性化交替方向方法称为LADMAP，SLCRF的目标函数使用定义的迭代算法优化。该方法是使用公共挑战恒指数据集进行综合评价。我们可以利用这些HSI套实现stateof最先进的性能。

25. Channel Recurrent Attention Networks for Video Pedestrian Retrieval
Pengfei Fang, Pan Ji, Jieming Zhou, Lars Petersson, Mehrtash Harandi
Abstract: Full attention, which generates an attention value per element of the input feature maps, has been successfully demonstrated to be beneficial in visual tasks. In this work, we propose a fully attentional network, termed channel recurrent attention network, for the task of video pedestrian retrieval. The main attention unit, channel recurrent attention, identifies attention maps at the frame level by jointly leveraging spatial and channel patterns via a recurrent neural network. This channel recurrent attention is designed to build a global receptive field by recurrently receiving and learning the spatial vectors. Then, a set aggregation cell is employed to generate a compact video representation. Empirical experimental results demonstrate the superior performance of the proposed deep network, outperforming current state-of-the-art results across standard video person retrieval benchmarks, and a thorough ablation study shows the effectiveness of the proposed units.
摘要：充分的重视，产生每个输入要素的地图元素的关注价值，已经成功地展示了在视觉任务是有益的。在这项工作中，我们提出了一个完全的注意网络，被称为{\它通道经常关注网络}，视频检索行人的任务。主要注意单元，\ textit {信道经常关注}，识别关注通过经由回归神经网络共同利用的空间和通道图案在帧级映射。这个通道经常关注旨在通过反复接受和学习空间向量建立一个全球性感受野。然后，\ textit {集聚集}小区被利用来产生紧凑的视频表示。实证实验结果表明，所提出的深网络的性能优越，通过标准的视频检索的人跑赢基准的国家的最先进的电流结果，并进行彻底消融研究显示了该单元的有效性。

26. Adversarial Patch Attacks on Monocular Depth Estimation Networks
Koichiro Yamanaka, Ryutaroh Matsumoto, Keita Takahashi, Toshiaki Fujii
Abstract: Thanks to the excellent learning capability of deep convolutional neural networks (CNN), monocular depth estimation using CNNs has achieved great success in recent years. However, depth estimation from a monocular image alone is essentially an ill-posed problem, and thus, it seems that this approach would have inherent vulnerabilities. To reveal this limitation, we propose a method of adversarial patch attack on monocular depth estimation. More specifically, we generate artificial patterns (adversarial patches) that can fool the target methods into estimating an incorrect depth for the regions where the patterns are placed. Our method can be implemented in the real world by physically placing the printed patterns in real scenes. We also analyze the behavior of monocular depth estimation under attacks by visualizing the activation levels of the intermediate layers and the regions potentially affected by the adversarial attack.
摘要：由于深卷积神经网络（CNN），利用细胞神经网络的单眼深度估计的优秀学习能力在近几年取得了巨大的成功。然而，单从单目图像深度估计本质上是一个病态问题，因此，似乎这种做法将有固有的弱点。为了揭示这一限制，我们提出了单眼深度估计对抗性补丁攻击的方法。更具体地说，我们生成人工模式（对抗性补丁）可骗过目标方法分成估计不正确的深度，其中图案被放置的区域。我们的方法可以在现实世界中通过亲身的印刷图案在真实场景中实现。我们还通过可视化的中间层的激活水平分析单眼深度估计下攻击行为和地区可能会受到敌对攻击。

27. Domain Adaptive Transfer Learning on Visual Attention Aware Data Augmentation for Fine-grained Visual Categorization
Ashiq Imran, Vassilis Athitsos
Abstract: Fine-Grained Visual Categorization (FGVC) is a challenging topic in computer vision. It is a problem characterized by large intra-class differences and subtle inter-class differences. In this paper, we tackle this problem in a weakly supervised manner, where neural network models are getting fed with additional data using a data augmentation technique through a visual attention mechanism. We perform domain adaptive knowledge transfer via fine-tuning on our base network model. We perform our experiment on six challenging and commonly used FGVC datasets, and we show competitive improvement on accuracies by using attention-aware data augmentation techniques with features derived from deep learning model InceptionV3, pre-trained on large scale datasets. Our method outperforms competitor methods on multiple FGVC datasets and showed competitive results on other datasets. Experimental studies show that transfer learning from large scale datasets can be utilized effectively with visual attention based data augmentation, which can obtain state-of-the-art results on several FGVC datasets. We present a comprehensive analysis of our experiments. Our method achieves state-of-the-art results in multiple fine-grained classification datasets including challenging CUB200-2011 bird, Flowers-102, and FGVC-Aircrafts datasets.
摘要：细粒度的视觉分类（FGVC）是计算机视觉中一个富有挑战性的课题。它的特点是大的类内的差异和微妙的类间的差异的问题。在本文中，我们解决在弱监督的方式，其中的神经网络模型得到与使用数据增强技术，通过视觉注意机制额外的数据送入这个问题。我们对我们的基础网络模型执行通过微调域自适应知识转移。我们在六大挑战和常用FGVC数据集执行我们的实验中，我们展示使用注意感知的数据增强技术与深度学习模型InceptionV3导出功能，在大规模数据集预先训练精度上有竞争力的提高。在多个FGVC数据集和其他数据集显示出有竞争力的结果，我们的方法优于竞争对手的方法。试验研究表明，从大规模的数据集即转印学习可以有效地与视觉注意基于数据扩张，这可以得到在几个数据集FGVC国家的最先进的结果被利用。我们提出我们的实验进行综合分析。我们的方法实现状态的最先进的结果在多个细粒度分类数据集包括具有挑战性CUB200-2011鸟，花-102，和FGVC-飞机数据集。

28. Weakly-Supervised Feature Learning via Text and Image Matching
Gongbo Liang, Connor Greenwell, Yu Zhang, Xiaoqin Wang, Ramakanth Kavuluru, Nathan Jacobs
Abstract: When training deep neural networks for medical image classification, obtaining a sufficient number of manually annotated images is often a significant challenge. We propose to use textual findings, which are routinely written by clinicians during manual image analysis, to help overcome this problem. The key idea is to use a contrastive loss to train image and text feature extractors to recognize if a given image-finding pair is a true match. The learned image feature extractor is then fine-tuned, in a transfer learning setting, for a supervised classification task. This approach makes it possible to train using large datasets because pairs of images and textual findings are widely available in medical records. We evaluate our method on three datasets and find consistent performance improvements. The biggest gains are realized when fewer manually labeled examples are available. In some cases, our method achieves the same performance as the baseline even when using 70% to 98% fewer labeled examples.
摘要：当训练深层神经网络的医学图像分类，获得足够数量的手动注释的图像往往是一个显著的挑战。我们建议使用文本发现，这是手动图像分析时经常写的医生，以帮助解决这个问题。其核心思想是使用对比损失火车图片和文字特征提取认识到，如果给定的图像，发现对是一个真正的比赛。博学的图像特征提取，然后微调，在转移的学习环境，对于监督分类任务。这种方法使得可以使用大型数据集训练，因为对图像和文字的发现是在病历广泛使用。我们评估三个数据集的方法和发现一致的性能改进。当较少的手动标记的例子是目前最大的收效。 98点\％较少的标记的例子 - 在某些情况下，使用70 \％，即使我们的方法达到相同的性能基准。
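
The image-finding matching objective described above can be illustrated with a small sketch. This is not the authors' implementation: the hinge form, the `margin` value, and the function names `cosine` and `contrastive_loss` are assumptions chosen for illustration, and the feature vectors stand in for the outputs of the image and text extractors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_feats, text_feats, margin=0.5):
    """Hinge-style contrastive loss over a batch of at least two pairs.

    image_feats[i] and text_feats[i] are assumed to be a true match;
    every other text in the batch serves as a negative for image i.
    """
    n = len(image_feats)
    total = 0.0
    for i in range(n):
        pos = cosine(image_feats[i], text_feats[i])
        for j in range(n):
            if j != i:
                neg = cosine(image_feats[i], text_feats[j])
                # Penalize negatives that score within `margin` of the true match.
                total += max(0.0, margin - pos + neg)
    return total / (n * (n - 1))

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, aligned_texts))  # matched pairs: zero loss
print(contrastive_loss(images, swapped_texts))  # mismatched pairs: positive loss
```

The loss is zero when each image is closest to its own finding and grows when a wrong finding scores higher, which is the signal the extractors would be trained on before fine-tuning.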

29. Rotate to Attend: Convolutional Triplet Attention Module
Diganta Misra, Trikay Nalamada, Ajay Uppili Arasanipalai, Qibin Hou
Abstract: Benefiting from the capability of building inter-dependencies among channels or spatial locations, attention mechanisms have been extensively studied and broadly used in a variety of computer vision tasks recently. In this paper, we investigate light-weight but effective attention mechanisms and present triplet attention, a novel method for computing attention weights by capturing cross-dimension interaction using a three-branch structure. For an input tensor, triplet attention builds inter-dimensional dependencies by the rotation operation followed by residual transformations and encodes inter-channel and spatial information with negligible computational overhead. Our method is simple as well as efficient and can be easily plugged into classic backbone networks as an add-on module. We demonstrate the effectiveness of our method on various challenging tasks including image classification on ImageNet-1k and object detection on MSCOCO and PASCAL VOC datasets. Furthermore, we provide extensive insight into the performance of triplet attention by visually inspecting the GradCAM and GradCAM++ results. The empirical evaluation of our method supports our intuition on the importance of capturing dependencies across dimensions when computing attention weights. Code for this paper can be publicly accessed at this https URL
摘要：从渠道建设或空间位置之间的相互依存关系的能力中受益，关注机制已被广泛研究，并在各种计算机视觉任务广泛使用最近。在本文中，我们研究重量轻但有效的注意机制和本三重注意，用于通过捕获使用三分支结构的横尺寸相互作用计算关注的权重的新方法。用于将输入张量，三重峰注意构建通过旋转操作，随后残余变换和编码的信道间和具有可忽略的计算开销空间信息维间的依赖关系。我们的方法很简单，以及高效，可以很容易地插入到经典的骨干网作为一个附加模块。我们证明了我们在各种具有挑战性的任务，包括图像分类方法对ImageNet-1K和目标检测上MSCOCO和PASCAL VOC数据集的有效性。此外，我们提供视线广泛进入三联注意通过视觉检查GradCAM和GradCAM ++结果的性能。我们的方法的实证分析支持了我们在计算权重的关注跨时捕捉尺寸依赖的重要性的直觉。代码本文可以在此HTTPS URL公开访问

30. A deep learning pipeline for identification of motor units in musculoskeletal ultrasound
Hazrat Ali, Johannes Umander, Robin Rohlén, Christer Grönlund
Abstract: Ultrasound imaging provides information from a large part of the muscle. It has recently been shown that ultrafast ultrasound imaging can be used to record and analyze the mechanical response of individual MUs using blind source separation. In this work, we present an alternative method - a deep learning pipeline - to identify active MUs in ultrasound image sequences, including segmentation of their territories and signal estimation of their mechanical responses (twitch train). We train and evaluate the model using simulated data mimicking the complex activation pattern of tens of activated MUs with overlapping territories and partially synchronized activation patterns. Using a slow fusion approach (based on 3D CNNs), we transform the spatiotemporal image sequence data to 2D representations and apply a deep neural network architecture for segmentation. Next, we employ a second deep neural network architecture for signal estimation. The results show that the proposed pipeline can effectively identify individual MUs, estimate their territories, and estimate their twitch train signal at low contraction forces. The framework can retain spatio-temporal consistencies and information of the mechanical response of MU activity even when the ultrasound image sequences are transformed into a 2D representation for compatibility with more traditional computer vision and image processing techniques. The proposed pipeline is potentially useful to identify simultaneously active MUs in whole muscles in ultrasound image sequences of voluntary skeletal muscle contractions at low force levels.
摘要：超声成像提供从肌肉的大部分信息。最近已经表明，超快超声成像可以被用于记录和分析使用盲源分离个体的MU的机械响应。在这项工作中，我们提出了一种替代的方法 - 一个深学习管道 - 识别超声图像序列活性的MU，包括其领土的分段和它们的机械响应（抽搐列车）的信号估计。我们培养和评价使用模拟数据模拟数十激活亩的复杂激活模式具有重叠领土和部分同步激活模式模型。使用缓慢的融合方法（基于三维细胞神经网络），我们变换了时空图像序列数据，以2D表示并应用深层神经网络结构进行分割。接下来，我们使用的信号估计第二个深层神经网络结构。结果表明，该管道可以有效地识别单个亩，估算其领土，并在低收缩力估计其抽搐串信号。该框架可以保留的空间 - 时间的一致性和MU活动的机械响应的信息，即使在超声图像的序列被变换成用于与更传统的计算机视觉和图像处理技术的兼容性的2D表示。建议的管道是潜在有用的，以确定整个肌肉在低力量水平自愿骨骼肌收缩的超声图像序列，同时积极亩。

31. Predicting Hourly Demand in Station-free Bike-sharing Systems with Video-level Data
Xiao Yan, Gang Kou, Feng Xiao, Dapeng Zhang, Xianghua Gan
Abstract: Temporal and spatial features are both important for predicting the demands in bike-sharing systems. Many relevant experiments in the literature support this. Meanwhile, it is observed that the data structure of spatial features in vector form is weaker in space than that of videos, which have natural spatial structure. Therefore, to obtain more spatial features, this study introduces a city map to generate GPS demand videos while employing a novel algorithm, an eidetic 3D convolutional long short-term memory network named E3D-LSTM, to process the video-level data in a bike-sharing system. The spatio-temporal correlations and feature importance are tested and visualized to validate the significance of spatial and temporal features. Although the deep learning model is powerful in non-linear fitting ability, the statistical model offers better interpretability. This study adopts ensemble learning, which is a popular policy, to improve the performance and decrease variance. In this paper, we propose a novel model stacked from deep learning and statistical models, named the fusion multi-channel eidetic 3D convolutional long short-term memory network (FM-E3DCL-Net), to better process temporal and spatial features on a dataset of about 100,000 transactions within one month in Shanghai from the Mobike company. Furthermore, other factors like weather, holiday and time intervals are proved useful in addition to historical demand, since they decrease the root mean squared error (RMSE) by 29.4%. On this basis, the ensemble learning further decreases RMSE by 6.6%.
摘要：时间和空间特征是预测自行车共享系统的需求都很重要。在文献中许多相关的实验支持这一点。同时，可以观察到的空间特征与矢量形式的数据结构是在比所述视频，其具有天然空间结构空间弱。因此，为了获得更多的空间特征，本研究介绍城市地图，生成GPS点播视频，同时采用新算法：极为逼真的3D卷积长短期记忆网名为E3D-LSTM来处理自行车共享系统的视频级数据。时空相关性和功能重要性的实验和可视化验证的空间和时间特征的意义。尽管深度学习模型是非线性拟合能力强大，统计模型具有较好的解释。本研究采用集成学习，这是一个受欢迎的政策，以提高性能和降低方差。在本文中，我们提出通过深度学习和统计模型堆叠一个新的模型，命名为融合多渠道极为逼真的3D卷积长短期记忆网络（FM-E3DCL-网），以更好地处理时间和空间特征的数据集关于MOBIKE公司上海一个月内10万周的交易。此外，其它因素，如气候，度假和时间间隔被证实除了历史需求是有用的，因为它们由29.4％降低均方根误差（RMSE）。在此基础上，集成学习进一步的6.6％下降RMSE。
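
For reference, the RMSE metric and a relative reduction like those quoted above can be computed as in the sketch below. The demand values are made-up toy numbers, not data from the paper, and `relative_reduction` is a hypothetical helper name; how the paper's percentages were derived beyond what the abstract states is not assumed here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline, improved):
    """Fractional decrease of an error metric relative to a baseline."""
    return (baseline - improved) / baseline

# Toy hourly demand counts (illustrative only).
actual = [10.0, 20.0, 30.0, 40.0]
history_only = [14.0, 16.0, 34.0, 36.0]     # every prediction is off by 4
with_extra_feats = [12.0, 18.0, 32.0, 38.0]  # every prediction is off by 2

base = rmse(actual, history_only)
better = rmse(actual, with_extra_feats)
print(base, better, relative_reduction(base, better))
```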

32. Place Recognition in Forests with Urquhart Tessellations
Guilherme V. Nardari, Avraham Cohen, Steven W. Chen, Xu Liu, Vaibhav Arcot, Roseli A. F. Romero, Vijay Kumar
Abstract: In this letter we present a novel descriptor based on polygons derived from Urquhart tessellations on the position of trees in a forest detected from lidar scans. We present a framework that leverages these polygons to generate a signature that is used to detect previously seen observations even with partial overlap and different levels of noise while also inferring landmark correspondences to compute an affine transformation between observations. We run loop-closure experiments in simulation and real-world data map-merging from different flights of an Unmanned Aerial Vehicle (UAV) in a pine tree forest and show that our method outperforms state-of-the-art approaches in accuracy and robustness.
摘要：在这封信中，我们提出从激光雷达扫描检测的基础上，从厄克特镶嵌上的森林树木的位置得出的多边形一个新的描述符。我们提出了一个框架，利用这些多边形以产生用于检测先前看到的，即使部分重叠和不同水平的噪声的观测，同时还推断的对应界标来计算观测值之间的仿射变换的签名。我们运行在模拟和真实世界的数据来自于松树林的无人机（UAV）的不同航班地图合并循环封闭实验表明，该方法优于国家的最先进的精确度和耐用性接近。

33. Real-Time Resource Allocation for Tracking Systems
Yash Satsangi, Shimon Whiteson, Frans A. Oliehoek, Henri Bouma
Abstract: Automated tracking is key to many computer vision applications. However, many tracking systems struggle to perform in real-time due to the high computational cost of detecting people, especially in ultra high resolution images. We propose a new algorithm called PartiMax that greatly reduces this cost by applying the person detector only to the relevant parts of the image. PartiMax exploits information in the particle filter to select k of the n candidate pixel boxes in the image. We prove that PartiMax is guaranteed to make a near-optimal selection with error bounds that are independent of the problem size. Furthermore, empirical results on a real-life dataset show that our system runs in real-time by processing only 10% of the pixel boxes in the image while still retaining 80% of the original tracking performance achieved when processing all pixel boxes.
摘要：自动跟踪的关键是许多计算机视觉应用。然而，许多跟踪系统的斗争中实时进行，由于检测人计算成本高，尤其是在超高分辨率的图像。我们提出了一个所谓的新算法\ {EMPH} PartiMax通过申请的人只检测到图像的相关部分大大降低这一成本。 PartiMax利用在颗粒过滤器信息的图像中选择的$ n $的候选\ EMPH {像素盒} $ $ķ。我们证明了PartiMax保证让一个接近最优的选择，错误的界限，使独立于问题的大小。此外，在现实生活中的数据集上的实证结果是处理所有像素盒当我们的系统实时运行由图像中仅处理10 \％的像素箱的同时仍保留原来的跟踪性能的80 \％实现。
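
To make the idea of choosing k of n candidate pixel boxes from particle-filter information concrete, here is a minimal sketch using plain greedy maximum coverage. It is not PartiMax itself: the coverage objective, the box format `(x0, y0, x1, y1)`, and the names `covers` and `greedy_select` are assumptions for illustration only.

```python
def covers(box, particle):
    """True if the particle (x, y) lies inside box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    x, y = particle
    return x0 <= x < x1 and y0 <= y < y1

def greedy_select(boxes, particles, k):
    """Pick up to k box indices, each time taking the box that covers
    the most particles not yet covered by earlier picks."""
    uncovered = set(range(len(particles)))
    chosen = []
    for _ in range(min(k, len(boxes))):
        best, best_gain = None, -1
        for i, box in enumerate(boxes):
            if i in chosen:
                continue
            gain = sum(1 for j in uncovered if covers(box, particles[j]))
            if gain > best_gain:  # ties keep the earliest box
                best, best_gain = i, gain
        chosen.append(best)
        uncovered -= {j for j in uncovered if covers(boxes[best], particles[j])}
    return chosen

boxes = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 0, 5, 5)]
particles = [(1, 1), (2, 2), (3, 3), (12, 5)]
print(greedy_select(boxes, particles, k=2))
```

The person detector would then run only on the selected boxes. The third box is skipped here because it covers particles already covered by the first, which is the kind of redundancy a selection step is meant to avoid.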

34. IS-CAM: Integrated Score-CAM for axiomatic-based explanations
Rakshit Naidu, Ankita Ghosh, Yash Maurya, Shamanth R Nayak K, Soumya Snigdha Kundu
Abstract: Convolutional Neural Networks have been known as black-box models as humans cannot interpret their inner functionalities. With an attempt to make CNNs more interpretable and trustworthy, we propose IS-CAM (Integrated Score-CAM), where we introduce the integration operation within the Score-CAM pipeline to achieve visually sharper attribution maps quantitatively. Our method is evaluated on 2000 randomly selected images from the ILSVRC 2012 Validation dataset, which proves the versatility of IS-CAM to account for different models and methods.
摘要：卷积神经网络已经被称为黑盒模型作为人类无法解释其内在的功能。与试图使细胞神经网络的更多解释和值得信赖的，我们的建议是-CAM（综合评分-CAM），在这里我们介绍一下分数-CAM管道内的一体化运作，以实现更清晰的视觉地图的归属定量。我们的方法是从ILSVRC 2012验证数据集，这证明IS-CAM的通用性，以考虑不同的模型和方法在2000年随机选择的图像进行评价。

35. Global Self-Attention Networks for Image Recognition
Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, Ching-Hui Chen
Abstract: Recently, a series of works in computer vision have shown promising results on various image and video understanding tasks using self-attention. However, due to the quadratic computational and memory complexities of self-attention, these works either apply attention only to low-resolution feature maps in later stages of a deep network or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this work introduces a new global self-attention module, referred to as the GSA module, which is efficient enough to serve as the backbone component of a deep network. This module consists of two parallel layers: a content attention layer that attends to pixels based only on their content and a positional attention layer that attends to pixels based on their spatial locations. The output of this module is the sum of the outputs of the two layers. Based on the proposed GSA module, we introduce new standalone global attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Due to the global extent of the proposed GSA module, a GSA network has the ability to model long-range pixel interactions throughout the network. Our experimental results show that GSA networks outperform the corresponding convolution-based networks significantly on the CIFAR-100 and ImageNet datasets while using fewer parameters and computations. The proposed GSA networks also outperform various existing attention-based networks on the ImageNet dataset.
摘要：近日，一系列的计算机视觉作品都表现出有前途的各种图像，并利用自身的关注视频了解任务的结果。然而，由于自身关注的二次计算和存储的复杂性，这些作品无论是适用只顾低分辨率的功能映射在深网络的后期阶段或限制在每一层的一个小局部区域的关注感受野。为了克服这些限制，这项工作引入了一个新的全球性的自我关注模块，简称GSA模块，它是有效的，足以作为深网的骨干组成。该模块由两个平行的层组成：一个内容关注层照顾到像素仅基于他们的内容和照顾到像素基于其空间位置的位置的关注层。该模块的输出是这两个层的输出的总和。根据所提出的GSA模块上，我们介绍使用GSA模块，而不是对卷积模型像素交互新的独立的全球关注，基于深网络。由于所提出的GSA模块的全球范围内，一个GSA网络具有到整个网络远距离像素交互进行建模的能力。我们的实验结果表明，GSA网络显著优于对CIFAR-100和ImageNet数据集对应的基于卷积的网络，同时使用更少的参数和计算。所提出的GSA网络也跑赢上ImageNet集各种现有的注意基于网络。

36. Online Action Detection in Streaming Videos with Time Buffers
Bowen Zhang, Hao Chen, Meng Wang, Yuanjun Xiong
Abstract: We formulate the problem of online temporal action detection in live streaming videos, acknowledging one important property of live streaming videos: there is normally a broadcast delay between the latest captured frame and the actual frame viewed by the audience. The standard setting of the online action detection task requires immediate prediction after a new frame is captured. We illustrate that its lack of consideration of the delay imposes unnecessary constraints on the models and is thus not suitable for this problem. We propose to adopt a problem setting that allows models to make use of the small 'buffer time' incurred by the delay in live streaming videos. We design an action start and end detection framework for this online-with-buffer setting with two major components: flattened I3D and window-based suppression. Experiments on three standard temporal action detection benchmarks under the proposed setting demonstrate the effectiveness of the proposed framework. We show that by having a suitable problem setting for this problem with wide applications, we can achieve much better detection accuracy than off-the-shelf online action detection models.
摘要：我们制定在线时间动作检测的问题，在直播的流媒体视频，承认通常是有最新捕获的帧，观众观看到的实际帧之间的延时播出的直播流媒体视频的一个重要特性。在线动作检测任务的标准制定需要一个新的帧拍摄后立即预测。我们举例说明其缺乏考虑的延迟是在模型强加不必要的限制，因此不适合这个问题。我们建议采用有问题的设置，可以让模型使用的小`缓冲区以现场直播的视频招致的延迟时间”。我们与两个主要组成部分缓冲区设置在线设计动作的开始和结束检测框架是：扁平I3D和基于窗口的抑制。在根据拟议设置三个标准时间动作检测基准实验验证了该框架的有效性。我们表明，具有合适的问题设置针对此问题具有广泛的应用，我们可以实现比关闭的，现成的在线动作检测模型更好的检测精度。
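
The abstract does not spell out how window-based suppression works, so the following is only a guess at the general idea: among per-frame action-start scores, keep the strongest detections and drop weaker ones that fall within a fixed window of a kept one, in the spirit of non-maximum suppression. The function name `window_suppress`, the `window` and `threshold` parameters, and the scores are all hypothetical.

```python
def window_suppress(scores, window, threshold):
    """Return sorted frame indices of detections that survive suppression.

    A frame is a candidate if its score reaches `threshold`. Candidates are
    visited from highest to lowest score, and a candidate is kept only if no
    already-kept frame lies within `window` frames of it.
    """
    candidates = [i for i, s in enumerate(scores) if s >= threshold]
    candidates.sort(key=lambda i: -scores[i])
    kept = []
    for i in candidates:
        if all(abs(i - j) > window for j in kept):
            kept.append(i)
    return sorted(kept)

start_scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.95, 0.3]
print(window_suppress(start_scores, window=2, threshold=0.5))
```

In a buffered online setting, a routine like this could be applied to the frames currently held in the buffer, since the delay gives the model a short look-ahead before a decision must be emitted.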

37. Motion Prediction Using Temporal Inception Module
Tim Lebailly, Sena Kiciroglu, Mathieu Salzmann, Pascal Fua, Wei Wang
Abstract: Human motion prediction is a necessary component for many applications in robotics and autonomous driving. Recent methods propose using sequence-to-sequence deep learning models to tackle this problem. However, they do not focus on exploiting different temporal scales for different length inputs. We argue that the diverse temporal scales are important as they allow us to look at the past frames with different receptive fields, which can lead to better predictions. In this paper, we propose a Temporal Inception Module (TIM) to encode human motion. Making use of TIM, our framework produces input embeddings using convolutional layers, by using different kernel sizes for different input lengths. The experimental results on standard motion prediction benchmark datasets Human3.6M and CMU motion capture dataset show that our approach consistently outperforms the state of the art methods.
摘要：人的运动预测是用于机器人和自动驾驶许多应用的必要组成部分。最近的方法提出了利用序列对序列深度学习模型来解决这个问题。然而，他们不注重对不同长度的输入，利用不同时间尺度。我们认为，不同的时间尺度很重要，因为它们让我们来看看不同的感受野，这可能会导致更好的预测过去帧。在本文中，我们提出了一个颞启模块（TIM）来编码人体运动。利用TIM的，我们的框架中使用卷积层，通过使用不同的内核大小对于不同的输入长度产生输入的嵌入。在标准的运动预测基准数据集Human3.6M和CMU动作捕捉数据集上的实验结果表明，我们的方法始终优于现有技术方法的状态。
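
The idea of pairing different kernel sizes with different input lengths can be sketched on a single joint-coordinate trajectory. This is an illustrative toy under stated assumptions, not the TIM architecture: the kernels here are fixed averaging or summing filters rather than learned weights, and `conv1d`, `temporal_inception`, and the `branches` format are names invented for this example.

```python
def conv1d(seq, kernel):
    """Valid (no padding) 1D correlation of a sequence with a kernel."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def temporal_inception(seq, branches):
    """Concatenate features from several branches.

    Each branch is (input_length, kernel): it looks at the last
    `input_length` frames of `seq` and filters them with `kernel`.
    """
    features = []
    for input_length, kernel in branches:
        features.extend(conv1d(seq[-input_length:], kernel))
    return features

trajectory = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
branches = [
    (3, [0.5, 0.5]),       # recent frames, small kernel: fine detail
    (6, [1.0, 1.0, 1.0]),  # longer history, larger kernel: coarse trend
]
print(temporal_inception(trajectory, branches))
```

Short inputs with small kernels capture recent fine-grained motion, while longer inputs with larger kernels summarize the slower trend; the concatenated vector plays the role of the input embedding.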

38. Using Sentences as Semantic Representations in Large Scale Zero-Shot Learning
Yannick Le Cacheux, Hervé Le Borgne, Michel Crucianu
Abstract: Zero-shot learning aims to recognize instances of unseen classes, for which no visual instance is available during training, by learning multimodal relations between samples from seen classes and corresponding class semantic representations. These class representations usually consist of either attributes, which do not scale well to large datasets, or word embeddings, which lead to poorer performance. A good trade-off could be to employ short sentences in natural language as class descriptions. We explore different solutions to use such short descriptions in a ZSL setting and show that while simple methods cannot achieve very good results with sentences alone, a combination of usual word embeddings and sentences can significantly outperform current state-of-the-art.
摘要：零次学习目标识别看不见类，对于没有视觉实例训练期间可用，通过学习从看到类型的样品之间的关系多式联运和相应的类语义表示的实例。这些类表示通常由两种属性，它不能很好地扩展到大型数据集，或字的嵌入，这导致较差的性能。一个很好的权衡可能是采用自然语言类描述短句。我们在探索不同的解决方案在ZSL设置使用这种简短描述和显示，虽然简单的方法无法实现单独的句子很好的效果，常用字的嵌入，并能句子的组合显著优于当前国家的最先进的。

39. Learning to Represent Image and Text with Denotation Graph
Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, Fei Sha
Abstract: Learning to fuse vision and language information and representing them is an important research problem with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied, visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. This type of generic-to-specific relation can be discovered using linguistic analysis tools. We propose methods to incorporate such relations into learning representations. We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition. Both our code and the extracted denotation graphs on the Flickr30K and the COCO datasets are publicly available on this https URL.
摘要：学习保险丝视觉和语言的信息，并表示他们是一个重要的研究课题有许多应用。最近的进展已经利用预先训练（从语言建模）和注意力层的想法变形金刚从包含与描述图像的语言表达对准的图像数据集的学习表现。在本文中，我们建议从一组图片和文字之间隐含的，视觉上表达接地学习表示，自动从这些数据集开采。特别是，我们用外延图表来表示具体怎么概念（如描述图片的句子），可以链接到也直观地接地抽象和通用的概念（如短语）。这种类型一般到特殊的关系，可以用语言分析工具发现。我们建议的方法将这类关系到学习的表示。我们发现，国家的最先进的多模态学习模型可以通过利用自动收获结构关系得到进一步改善。所述表示导致跨模态图像检索的下游任务更强经验结果，参照表达，和组成属性对象识别。无论我们的代码，并在Flickr30K提取的外延图形和COCO数据集在此HTTPS URL公开可用。

40. Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
Utku Evci, Yani A. Ioannou, Cem Keskin, Yann Dauphin
Abstract: Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). In this work, we attempt to answer: (1) why does training unstructured sparse networks from random initialization perform poorly; and (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and propose a modified initialization for unstructured connectivity. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow; rather, their success lies in re-learning the pruning solution they are derived from. However, this comes at the cost of learning novel solutions.
Summary: Training unstructured sparse networks from random initialization generalizes poorly, except for Lottery Tickets (LTs) and Dynamic Sparse Training (DST). The paper attributes this to poor gradient flow at initialization and proposes a modified initialization for unstructured connectivity. DST improves gradient flow during training, whereas LTs succeed by re-learning the pruning solution they are derived from, at the cost of learning novel solutions.

41. Discriminative Cross-Modal Data Augmentation for Medical Imaging Applications
Yue Yang, Pengtao Xie
Abstract: While deep learning methods have shown great success in medical image analysis, they require a number of medical images to train. Due to data privacy concerns and unavailability of medical annotators, it is oftentimes very difficult to obtain a lot of labeled medical images for model training. In this paper, we study cross-modality data augmentation to mitigate the data deficiency issue in the medical imaging domain. We propose a discriminative unpaired image-to-image translation model which translates images in source modality into images in target modality where the translation task is conducted jointly with the downstream prediction task and the translation is guided by the prediction. Experiments on two applications demonstrate the effectiveness of our method.
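Summary: To mitigate the shortage of labeled medical images, the paper studies cross-modality data augmentation. It proposes a discriminative unpaired image-to-image translation model that translates images from a source modality into a target modality, with the translation trained jointly with, and guided by, the downstream prediction task. Experiments on two applications demonstrate the effectiveness of the method.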

42. Deep Neural Network: An Efficient and Optimized Machine Learning Paradigm for Reducing Genome Sequencing Error
Ferdinand Kartriku, Dr. Robert Sowah, Charles Saah
Abstract: Genomic data is used in many fields, but it has become known that most of the platforms used in the sequencing process produce significant errors. This means that the analyses and inferences generated from these data may contain errors that need to be corrected. Of the two main types of genome errors, substitutions and indels, our work focuses on correcting indels. A deep learning approach was used to correct the sequencing errors in the chosen dataset.
Summary: Most sequencing platforms produce significant errors, which carry over into any analysis built on the data. Of the two main error types, substitutions and indels, this work uses a deep learning approach to correct indels in the chosen dataset.

43. Memory-efficient GAN-based Domain Translation of High Resolution 3D Medical Images
Hristina Uzunova, Jan Ehrhardt, Heinz Handels
Abstract: Generative adversarial networks (GANs) are currently rarely applied to large 3D medical images because of their immense computational demand. The present work proposes a multi-scale patch-based GAN approach for establishing unpaired domain translation by generating high-resolution 3D medical image volumes in a memory-efficient way. The key idea to enable memory-efficient image generation is to first generate a low-resolution version of the image, followed by the generation of patches of constant size but successively growing resolution. To avoid patch artifacts and incorporate global information, the patch generation is conditioned on patches from previous resolution scales. These multi-scale GANs are trained to generate realistic-looking images from image sketches in order to perform an unpaired domain translation. This makes it possible to preserve the topology of the test data while generating the appearance of the training domain data. The domain translation scenarios are evaluated on brain MRIs of size 155x240x240 and thorax CTs of size up to 512x512x512. Compared to common patch-based approaches, the multi-resolution scheme yields better image quality and prevents patch artifacts. It also ensures a constant GPU memory demand independent of the image size, allowing for the generation of arbitrarily large images.
Summary: A multi-scale patch-based GAN generates high-resolution 3D medical volumes in a memory-efficient way: it first produces a low-resolution image, then patches of constant size at successively higher resolutions, each conditioned on patches from the previous scale. Evaluated on brain MRIs of size 155x240x240 and thorax CTs of size up to 512x512x512, the scheme improves image quality over common patch-based approaches, prevents patch artifacts, and keeps GPU memory demand constant regardless of image size.

44. Descriptive analysis of computational methods for automating mammograms with practical applications
Aparna Bhale, Manish Joshi
Abstract: Mammography is a vital screening technique for the early detection and identification of breast cancer, helping to decrease the mortality rate. Practical applications of mammograms are not limited to breast cancer detection and identification, but include task-based lens design, image compression, image classification, content-based image retrieval, and a host of others. Computational analysis methods for mammography are a useful tool for specialists to reveal hidden features and extract significant information from mammograms. Digital mammograms are mammography images available alongside conventional screen-film mammography, and they make the automation of mammograms easier. In this paper, we descriptively discuss computational advances in digital mammography to serve as a compass for research and practice in the domain of computational mammography and related fields. The discussion focuses on research aimed at a variety of applications and automations of mammograms. It covers different perspectives on image pre-processing, feature extraction, applications of mammograms, screen-film mammograms, digital mammograms, and the development of benchmark corpora for experimenting with digital mammograms.
Summary: A descriptive survey of computational methods for digital mammography, intended as a guide for research and practice in computational mammography and related fields. It covers image pre-processing, feature extraction, applications of mammograms, screen-film and digital mammograms, and the development of benchmark corpora for experiments.

45. Secure 3D medical Imaging
Shadi Al-Zu'bi
Abstract: Image segmentation has proved its importance and plays an important role in various domains such as health systems and satellite-oriented military applications. In this context, accuracy, image quality, and execution time are deemed to be the major issues to consider. Although many techniques have been applied, and their experimental results have shown appealing achievements for 2D images in real-time environments, there is a lack of work on 3D image segmentation despite its importance in improving segmentation accuracy. Specifically, Hidden Markov Models (HMMs) have been used in this domain. However, they suffer from high time complexity, which has been addressed using different accelerators. As it is important to have efficient 3D image segmentation, we propose in this paper a novel system for partitioning the 3D segmentation process across several distributed machines. The concepts behind distributed multimedia network segmentation were employed to reduce the computational time of training HMMs for segmentation. Furthermore, secure transmission has been considered in this distributed environment, and various bidirectional multimedia security algorithms have been applied. The contribution of this work lies in providing an efficient and secure algorithm for 3D image segmentation. A number of extensive experiments show that our proposed system is comparable to state-of-the-art methods in terms of segmentation accuracy, security, and execution time.
Summary: The paper proposes a system that partitions HMM-based 3D image segmentation across several distributed machines to reduce training time, and applies bidirectional multimedia security algorithms to secure transmission between them. Experiments show the system to be comparable to state-of-the-art methods in segmentation accuracy, security, and execution time.

46. Batch Normalization Increases Adversarial Vulnerability: Disentangling Usefulness and Robustness of Model Features
Philipp Benz, Chaoning Zhang, In So Kweon
Abstract: Batch normalization (BN) has been widely used in modern deep neural networks (DNNs) due to fast convergence. BN is observed to increase model accuracy at the cost of adversarial robustness. We conjecture that the increased adversarial vulnerability is caused by BN shifting the model to rely more on non-robust features (NRFs). Our exploration finds that other normalization techniques also increase adversarial vulnerability, and our conjecture is further supported by analyzing the model's corruption robustness and feature transferability. With a classifier DNN defined as a feature set $F$, we propose a framework for disentangling $F$ robust usefulness into $F$ usefulness and $F$ robustness. We adopt a local linearity based metric, termed LIGS, to define and quantify $F$ robustness. Measuring $F$ robustness with LIGS provides direct insight into the feature robustness shift independent of usefulness. Moreover, the LIGS trend over the whole training stage sheds light on the order in which features are learned, i.e., from robust features (RFs) to NRFs, or vice versa. Our work analyzes how BN and other factors influence the DNN from the feature perspective. Prior works mainly adopt accuracy to evaluate their influence on $F$ usefulness, while we believe evaluating $F$ robustness is equally important; our work fills this gap.
Summary: Batch normalization (BN) improves accuracy at the cost of adversarial robustness, which the authors conjecture is caused by BN shifting the model toward non-robust features (NRFs). They propose a framework that disentangles the robust usefulness of a feature set $F$ into $F$ usefulness and $F$ robustness, and quantify $F$ robustness with a local linearity based metric termed LIGS. Other normalization techniques are also found to increase adversarial vulnerability.

47. Double Targeted Universal Adversarial Perturbations
Philipp Benz, Chaoning Zhang, Tooba Imtiaz, In So Kweon
Abstract: Despite their impressive performance, deep neural networks (DNNs) are widely known to be vulnerable to adversarial attacks, which makes it challenging to deploy them in security-sensitive applications such as autonomous driving. Image-dependent perturbations can fool a network for one specific image, while universal adversarial perturbations are capable of fooling a network for samples from all classes without selection. We introduce double targeted universal adversarial perturbations (DT-UAPs) to bridge the gap between the instance-discriminative image-dependent perturbations and the generic universal perturbations. This universal perturbation shifts one targeted source class to a sink class while having a limited adversarial effect on other, non-targeted source classes, so as to avoid raising suspicion. Because it targets the source and sink class simultaneously, we term it a double targeted attack (DTA). This gives an attacker the freedom to perform precise attacks on a DNN model while raising little suspicion. We show the effectiveness of the proposed DTA algorithm on a wide range of datasets and also demonstrate its potential as a physical attack.
Summary: Double targeted universal adversarial perturbations (DT-UAPs) sit between image-dependent and generic universal perturbations: a single perturbation shifts one targeted source class to a chosen sink class while having limited effect on the other classes, which keeps the attack inconspicuous. The proposed double targeted attack (DTA) is shown to be effective on a wide range of datasets and has potential as a physical attack.

48. Low-Rank Robust Online Distance/Similarity Learning based on the Rescaled Hinge Loss
Davood Zabihzadeh, Ali Karami-Mollaee
Abstract: An important challenge in metric learning is scalability with respect to both the size and the dimension of the input data. Online metric learning algorithms have been proposed to address this challenge. Existing methods are commonly based on the Passive-Aggressive (PA) approach. Hence, they can rapidly process large volumes of data with an adaptive learning rate. However, these algorithms are based on the hinge loss and so are not robust against outliers and label noise. Also, existing online methods usually assume that training triplets or pairwise constraints exist in advance. However, many datasets in real-world applications come in the form of input data and their associated labels. We address these challenges by formulating the online Distance-Similarity learning problem with the robust rescaled hinge loss function. The proposed model is rather general and can be applied to any PA-based online Distance-Similarity algorithm. We also develop an efficient and robust one-pass triplet construction algorithm. Finally, to provide scalability in high-dimensional DML environments, we present a low-rank version of the proposed methods that not only reduces the computational cost significantly but also preserves the predictive performance of the learned metrics. It also provides a straightforward extension of our methods to deep Distance-Similarity learning. We conduct several experiments on datasets from various applications. The results confirm that the proposed methods outperform state-of-the-art online DML methods by a large margin in the presence of label noise and outliers.
Summary: Online metric learning methods based on the Passive-Aggressive (PA) approach scale well but rely on the hinge loss, which is not robust to outliers and label noise. The paper reformulates online Distance-Similarity learning with the rescaled hinge loss, adds a one-pass triplet construction algorithm for labeled data, and presents a low-rank version for high-dimensional settings. The proposed methods outperform state-of-the-art online DML methods by a large margin under label noise and outliers.
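
To make the robustness argument in entry 48 concrete, the sketch below compares the ordinary hinge loss with a rescaled hinge loss. It assumes the form commonly used in the robust SVM literature, beta * (1 - exp(-eta * hinge)) with beta = 1 / (1 - exp(-eta)); the paper's exact formulation, and the function names and default `eta` used here, are assumptions for illustration only.

```python
import math


def hinge(margin):
    """Ordinary hinge loss for a signed margin; grows without bound as the margin decreases."""
    return max(0.0, 1.0 - margin)


def rescaled_hinge(margin, eta=0.5):
    """Rescaled hinge loss (assumed form): bounded above by beta, so outliers cannot dominate."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    # beta normalizes the loss so that rescaled_hinge(0) == 1, matching the hinge loss there.
    beta = 1.0 / (1.0 - math.exp(-eta))
    return beta * (1.0 - math.exp(-eta * hinge(margin)))


if __name__ == "__main__":
    for m in (2.0, 0.5, 0.0, -5.0, -50.0):
        print(f"margin={m:6.1f}  hinge={hinge(m):7.2f}  rescaled={rescaled_hinge(m):.4f}")
```

Because the rescaled loss saturates at beta, a badly violated constraint (for example, a mislabeled triplet) contributes a bounded amount, whereas the hinge loss keeps growing linearly.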

49. A Novel Face-tracking Mouth Controller and its Application to Interacting with Bioacoustic Models
Gamhewage C. de Silva, Tamara Smyth, Michael J. Lyons
Abstract: We describe a simple, computationally light, real-time system for tracking the lower face and extracting information about the shape of the open mouth from a video sequence. The system allows unencumbered control of audio synthesis modules by the action of the mouth. We report work in progress to use the mouth controller to interact with a physical model of sound production by the avian syrinx.
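Summary: A simple, computationally light, real-time system tracks the lower face and extracts information about the shape of the open mouth from a video sequence, allowing unencumbered control of audio synthesis modules by mouth actions. The authors report work in progress on using the mouth controller to interact with a physical model of sound production by the avian syrinx.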

50. Deep Learning-Based Grading of Ductal Carcinoma In Situ in Breast Histopathology Images
Suzanne C. Wetstein, Nikolas Stathonikos, Josien P.W. Pluim, Yujing J. Heng, Natalie D. ter Hoeve, Celien P.H. Vreuls, Paul J. van Diest, Mitko Veta
Abstract: Ductal carcinoma in situ (DCIS) is a non-invasive breast cancer that can progress into invasive ductal carcinoma (IDC). Studies suggest DCIS is often overtreated since a considerable part of DCIS lesions may never progress into IDC. Lower grade lesions have a lower progression speed and risk, possibly allowing treatment de-escalation. However, studies show significant inter-observer variation in DCIS grading. Automated image analysis may provide an objective solution to address high subjectivity of DCIS grading by pathologists. In this study, we developed a deep learning-based DCIS grading system. It was developed using the consensus DCIS grade of three expert observers on a dataset of 1186 DCIS lesions from 59 patients. The inter-observer agreement, measured by quadratic weighted Cohen's kappa, was used to evaluate the system and compare its performance to that of expert observers. We present an analysis of the lesion-level and patient-level inter-observer agreement on an independent test set of 1001 lesions from 50 patients. The deep learning system (dl) achieved on average slightly higher inter-observer agreement to the observers (o1, o2 and o3) ($\kappa_{o1,dl}=0.81, \kappa_{o2,dl}=0.53, \kappa_{o3,dl}=0.40$) than the observers amongst each other ($\kappa_{o1,o2}=0.58, \kappa_{o1,o3}=0.50, \kappa_{o2,o3}=0.42$) at the lesion-level. At the patient-level, the deep learning system achieved similar agreement to the observers ($\kappa_{o1,dl}=0.77, \kappa_{o2,dl}=0.75, \kappa_{o3,dl}=0.70$) as the observers amongst each other ($\kappa_{o1,o2}=0.77, \kappa_{o1,o3}=0.75, \kappa_{o2,o3}=0.72$). In conclusion, we developed a deep learning-based DCIS grading system that achieved a performance similar to expert observers. We believe this is the first automated system that could assist pathologists by providing robust and reproducible second opinions on DCIS grade.
Summary: A deep learning system for grading ductal carcinoma in situ (DCIS) was developed using the consensus grade of three expert observers on 1186 lesions from 59 patients, and evaluated with quadratic weighted Cohen's kappa on an independent test set of 1001 lesions from 50 patients. Its agreement with the observers was on average slightly higher than the agreement among the observers at the lesion level, and similar at the patient level. The authors believe it is the first automated system that could assist pathologists with robust and reproducible second opinions on DCIS grade.
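
The agreement statistic used in entry 50, quadratic weighted Cohen's kappa, can be illustrated with a minimal sketch. It assumes ordinal grades encoded as integers from 0 to `num_grades - 1`; the function name and the toy grades are hypothetical and not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_grades):
    """Cohen's kappa with quadratic disagreement weights for grades 0..num_grades-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of items")
    if num_grades < 2:
        raise ValueError("need at least two grades")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (grade_a, grade_b)
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_sq = (num_grades - 1) ** 2
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / max_sq     # 0 on the diagonal, 1 at maximal disagreement
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0.0:
        return 1.0  # both raters assigned one and the same grade to every item
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    # Toy grades for six lesions from two hypothetical observers.
    print(quadratic_weighted_kappa([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0], 3))
```

The quadratic weights penalize a two-grade disagreement four times as heavily as a one-grade disagreement, which suits ordinal scales such as tumor grade.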

51. Sonification of Facial Actions for Musical Expression
Mathias Funk, Kazuhiro Kuwabara, Michael J. Lyons
Abstract: The central role of the face in social interaction and non-verbal communication suggests we explore facial action as a means of musical expression. This paper presents the design, implementation, and preliminary studies of a novel system utilizing face detection and optic flow algorithms to associate facial movements with sound synthesis in a topographically specific fashion. We report on our experience with various gesture-to-sound mappings and applications, and describe our preliminary experiments at musical performance using the system.
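Summary: Motivated by the central role of the face in social interaction and non-verbal communication, the paper explores facial action as a means of musical expression. It presents the design, implementation, and preliminary studies of a system that uses face detection and optic flow algorithms to associate facial movements with sound synthesis in a topographically specific fashion, and reports experience with various gesture-to-sound mappings and preliminary musical performance experiments.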

52. Designing, Playing, and Performing with a Vision-based Mouth Interface
Michael J. Lyons, Michael Haehnel, Nobuji Tetsutani
Abstract: The role of the face and mouth in speech production as well as non-verbal communication suggests the use of facial action to control musical sound. Here we document work on the Mouthesizer, a system which uses a headworn miniature camera and computer vision algorithm to extract shape parameters from the mouth opening and output these as MIDI control changes. We report our experience with various gesture-to-sound mappings and musical applications, and describe a live performance which used the Mouthesizer interface.
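Summary: The Mouthesizer uses a headworn miniature camera and a computer vision algorithm to extract shape parameters from the mouth opening and output them as MIDI control changes. The paper reports experience with various gesture-to-sound mappings and musical applications, and describes a live performance that used the interface.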
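
The step of turning a mouth shape parameter into a MIDI control change in entry 52 can be sketched as follows. The three-byte message layout (status `0xB0` plus channel, controller number, 7-bit value) follows the MIDI 1.0 protocol; the function names, the linear mapping, and the calibration range are assumptions for illustration and do not describe the Mouthesizer's actual implementation.

```python
def control_change(channel, controller, value):
    """Build a raw 3-byte MIDI control change message."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be in 0..15")
    if not 0 <= controller <= 127:
        raise ValueError("controller must be in 0..127")
    if not 0 <= value <= 127:
        raise ValueError("value must be in 0..127")
    return bytes([0xB0 | channel, controller, value])


def shape_to_cc(param, low, high, channel=0, controller=1):
    """Linearly map a shape parameter measured in [low, high] to a 7-bit controller value."""
    if high <= low:
        raise ValueError("high must be greater than low")
    t = (param - low) / (high - low)
    t = min(1.0, max(0.0, t))  # clamp measurements outside the calibrated range
    return control_change(channel, controller, round(t * 127))


if __name__ == "__main__":
    # Hypothetical mouth-opening height in pixels, calibrated to the range 5..60.
    print(shape_to_cc(32.0, low=5.0, high=60.0).hex())
```

Clamping keeps tracking glitches from producing out-of-range bytes, which a MIDI receiver would otherwise interpret as a new status byte.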

53. M3Lung-Sys: A Deep Learning System for Multi-Class Lung Pneumonia Screening from CT Imaging
Xuelin Qian, Huazhu Fu, Weiya Shi, Tao Chen, Yanwei Fu, Fei Shan, Xiangyang Xue
Abstract: To counter the outbreak of COVID-19, the accurate diagnosis of suspected cases plays a crucial role in timely quarantine, medical treatment, and preventing the spread of the pandemic. Considering the limited training cases and resources (e.g., time and budget), we propose a Multi-task Multi-slice Deep Learning System (M3Lung-Sys) for multi-class lung pneumonia screening from CT imaging, which consists of only two 2D CNN networks, i.e., slice-level and patient-level classification networks. The former aims to learn feature representations from abundant CT slices instead of limited CT volumes; for the overall pneumonia screening, the latter recovers the temporal information by feature refinement and aggregation between different slices. In addition to distinguishing COVID-19 from Healthy, H1N1, and CAP cases, our M3Lung-Sys is also able to locate the areas of relevant lesions without any pixel-level annotation. To further demonstrate the effectiveness of our model, we conduct extensive experiments on a chest CT imaging dataset with a total of 734 patients (251 healthy people, 245 COVID-19 patients, 105 H1N1 patients, and 133 CAP patients). The quantitative results on a range of metrics indicate the superiority of our proposed model on both slice-level and patient-level classification tasks. More importantly, the generated lesion location maps make our system interpretable and more valuable to clinicians.
Summary: M3Lung-Sys is a multi-task, multi-slice deep learning system for multi-class pneumonia screening from CT imaging, built from two 2D CNNs: a slice-level and a patient-level classification network. It distinguishes COVID-19 from Healthy, H1N1, and CAP cases and locates relevant lesions without pixel-level annotation. Experiments on chest CT data from 734 patients indicate superior performance on both slice-level and patient-level classification.

54. WDN: A Wide and Deep Network to Divide-and-Conquer Image Super-resolution
Vikram Singh, Anurag Mittal
Abstract: Divide and conquer is an established algorithm design paradigm that has proven itself to solve a variety of problems efficiently. However, it is yet to be fully explored in solving problems with a neural network, particularly the problem of image super-resolution. In this work, we propose an approach to divide the problem of image super-resolution into multiple sub-problems and then solve/conquer them with the help of a neural network. Unlike a typical deep neural network, we design an alternate network architecture that is much wider (along with being deeper) than existing networks and is specially designed to implement the divide-and-conquer design paradigm with a neural network. Additionally, a technique to calibrate the intensities of feature map pixels is being introduced. Extensive experimentation on five datasets reveals that our approach towards the problem and the proposed architecture generate better and sharper results than current state-of-the-art methods.
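Summary: The paper applies the divide-and-conquer paradigm to image super-resolution by splitting the problem into multiple sub-problems that are solved with a neural network. The proposed architecture is much wider, as well as deeper, than existing networks, and a technique for calibrating the intensities of feature map pixels is introduced. Experiments on five datasets show better and sharper results than current state-of-the-art methods.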

55. Conditional Generative Modeling via Learning the Latent Space
Sameera Ramasinghe, Kanchana Ranasinghe, Salman Khan, Nick Barnes, Stephen Gould
Abstract: Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces, that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find optimal solutions corresponding to multiple output modes. Compared to existing generative solutions, in multimodal spaces, our approach demonstrates faster and stable convergence, and can learn better representations for downstream tasks. Importantly, it provides a simple generic model that can beat highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs. Our codes will be released.
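Summary: Most deep models are deterministic at inference, which limits them to single-modal settings. The paper proposes a general-purpose framework for conditional generation in multimodal spaces that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions; at inference, the latent variables are optimized to find solutions corresponding to multiple output modes. Compared to existing generative solutions, the approach converges faster and more stably and learns better representations for downstream tasks. The code will be released.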

56. A Fast and Effective Method of Macula Automatic Detection for Retina Images
Yukang Jiang, Jianying Pan, Yanhe Shen, Jin Zhu, Jiamin Huang, Huirui Xie, Xueqin Wang, Yan Luo
Abstract: Retina image processing is one of the crucial and popular topics of medical image processing. The macula fovea is responsible for sharp central vision, which is necessary for human behaviors where visual detail is of primary importance, such as reading, writing, driving, etc. This paper proposes a novel method to locate the macula through a series of morphological processing. On the premise of maintaining high accuracy, our approach is simpler and faster than others. Furthermore, for the hospital's real images, our method is also able to detect the macula robustly.
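Summary: The macula fovea is responsible for sharp central vision, which is needed for activities such as reading, writing, and driving. The paper proposes a method that locates the macula through a series of morphological processing steps. While maintaining high accuracy, the approach is simpler and faster than others, and it also detects the macula robustly on real hospital images.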

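Note: the summaries are derived from machine translations of the abstracts. The cover image is a word cloud of the paper titles.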


[arXiv papers] Computer Vision and Pattern Recognition 2020-10-08

Contents

Abstracts