
[arXiv Papers] Computer Vision and Pattern Recognition 2020-04-28

Contents

1. MakeItTalk: Speaker-Aware Talking Head Animation [PDF] Abstract
2. CoReNet: Coherent 3D scene reconstruction from a single RGB image [PDF] Abstract
3. Audio-Visual Instance Discrimination with Cross-Modal Agreement [PDF] Abstract
4. Improvement in Land Cover and Crop Classification based on Temporal Features Learning from Sentinel-2 Data Using Recurrent-Convolutional Neural Network (R-CNN) [PDF] Abstract
5. AI-Driven CT-based quantification, staging and short-term outcome prediction of COVID-19 pneumonia [PDF] Abstract
6. A Deep Attentive Convolutional Neural Network for Automatic Cortical Plate Segmentation in Fetal MRI [PDF] Abstract
7. Detecting and Tracking Communal Bird Roosts in Weather Radar Data [PDF] Abstract
8. Unsupervised Real Image Super-Resolution via Generative Variational AutoEncoder [PDF] Abstract
9. Per-pixel Classification Rebar Exposures in Bridge Eye-inspection [PDF] Abstract
10. Adversarial Fooling Beyond "Flipping the Label" [PDF] Abstract
11. A Skip-connected Multi-column Network for Isolated Handwritten Bangla Character and Digit recognition [PDF] Abstract
12. Semantic Neighborhood-Aware Deep Facial Expression Recognition [PDF] Abstract
13. Unsupervised Domain Adaptation with Multiple Domain Discriminators and Adaptive Self-Training [PDF] Abstract
14. GraftNet: An Engineering Implementation of CNN for Fine-grained Multi-label Task [PDF] Abstract
15. On indirect assessment of heart rate in video [PDF] Abstract
16. In-Vehicle Object Detection in the Wild for Driverless Vehicles [PDF] Abstract
17. Remote Photoplethysmography: Rarely Considered Factors [PDF] Abstract
18. Distance Guided Channel Weighting for Semantic Segmentation [PDF] Abstract
19. Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos [PDF] Abstract
20. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection [PDF] Abstract
21. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents [PDF] Abstract
22. Preliminary Forensics Analysis of DeepFake Images [PDF] Abstract
23. Localizing Grouped Instances for Efficient Detection in Low-Resource Scenarios [PDF] Abstract
24. Maximum Density Divergence for Domain Adaptation [PDF] Abstract
25. Deploying Image Deblurring across Mobile Devices: A Perspective of Quality and Latency [PDF] Abstract
26. VTGNet: A Vision-based Trajectory Generation Network for Autonomous Vehicles in Urban Environments [PDF] Abstract
27. Difficulty Translation in Histopathology Images [PDF] Abstract
28. Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays [PDF] Abstract
29. Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes [PDF] Abstract
30. On the Limits to Multi-Modal Popularity Prediction on Instagram -- A New Robust, Efficient and Explainable Baseline [PDF] Abstract
31. Designing a physically-feasible colour filter to make a camera more colorimetric [PDF] Abstract
32. One-Shot Identity-Preserving Portrait Reenactment [PDF] Abstract
33. All you need is a second look: Towards Tighter Arbitrary shape text detection [PDF] Abstract
34. Stitcher: Feedback-driven Data Provider for Object Detection [PDF] Abstract
35. Disentangled Image Generation Through Structured Noise Injection [PDF] Abstract
36. Hyperspectral image classification based on multi-scale residual network with attention mechanism [PDF] Abstract
37. Evaluation Metrics for Conditional Image Generation [PDF] Abstract
38. When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition [PDF] Abstract
39. Climate Adaptation: Reliably Predicting from Imbalanced Satellite Data [PDF] Abstract
40. Transfer learning for leveraging computer vision in infrastructure maintenance [PDF] Abstract
41. A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging [PDF] Abstract
42. AutoHR: A Strong End-to-end Baseline for Remote Heart Rate Measurement with Neural Searching [PDF] Abstract
43. Stomach 3D Reconstruction Based on Virtual Chromoendoscopic Image Generation [PDF] Abstract
44. Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset [PDF] Abstract
45. Learning to Autofocus [PDF] Abstract
46. TPNet: Trajectory Proposal Network for Motion Prediction [PDF] Abstract
47. Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs [PDF] Abstract
48. Congestion-aware Evacuation Routing using Augmented Reality Devices [PDF] Abstract
49. Detective: An Attentive Recurrent Model for Sparse Object Detection [PDF] Abstract
50. EfficientPose: Scalable single-person pose estimation [PDF] Abstract
51. Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection [PDF] Abstract
52. Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory [PDF] Abstract
53. CNN based Road User Detection using the 3D Radar Cube [PDF] Abstract
54. Offline Signature Verification on Real-World Documents [PDF] Abstract
55. How to read faces without looking at them [PDF] Abstract
56. Clustering by Constructing Hyper-Planes [PDF] Abstract
57. Deep Multimodal Neural Architecture Search [PDF] Abstract
58. CS-AF: A Cost-sensitive Multi-classifier Active Fusion Framework for Skin Lesion Classification [PDF] Abstract
59. Deep convolutional neural networks for face and iris presentation attack detection: Survey and case study [PDF] Abstract
60. StRDAN: Synthetic-to-Real Domain Adaptation Network for Vehicle Re-Identification [PDF] Abstract
61. Deepfakes Detection with Automatic Face Weighting [PDF] Abstract
62. Neural Head Reenactment with Latent Pose Descriptors [PDF] Abstract
63. Extending and Analyzing Self-Supervised Learning Across Domains [PDF] Abstract
64. DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation [PDF] Abstract
65. DriftNet: Aggressive Driving Behavior Classification using 3D EfficientNet Architecture [PDF] Abstract
66. Leveraging Planar Regularities for Point Line Visual-Inertial Odometry [PDF] Abstract
67. Eigenfeatures: Discrimination of X-ray images of epoxy resins using singular value decomposition of deep learning features [PDF] Abstract
68. Defining Benchmarks for Continual Few-Shot Learning [PDF] Abstract
69. Extreme Consistency: Overcoming Annotation Scarcity and Domain Shifts [PDF] Abstract
70. On the safety of vulnerable road users by cyclist orientation detection using Deep Learning [PDF] Abstract
71. Towards causal generative scene models via competition of experts [PDF] Abstract
72. Control Design of Autonomous Drone Using Deep Learning Based Image Understanding Techniques [PDF] Abstract
73. A Light CNN for detecting COVID-19 from CT scans of the chest [PDF] Abstract
74. A Critic Evaluation of Methods for COVID-19 Automatic Detection from X-Ray Images [PDF] Abstract
75. Quantifying Graft Detachment after Descemet's Membrane Endothelial Keratoplasty with Deep Convolutional Neural Networks [PDF] Abstract
76. A Cascaded Learning Strategy for Robust COVID-19 Pneumonia Chest X-Ray Screening [PDF] Abstract
77. Boosting Connectivity in Retinal Vessel Segmentation via a Recursive Semantics-Guided Network [PDF] Abstract
78. EAO-SLAM: Monocular Semi-Dense Object SLAM Based on Ensemble Data Association [PDF] Abstract
79. Reconstructing normal section profiles of 3D revolving structures via pose-unconstrained multi-line structured-light vision [PDF] Abstract
80. OR-UNet: an Optimized Robust Residual U-Net for Instrument Segmentation in Endoscopic Images [PDF] Abstract
81. Continuous hand-eye calibration using 3D points [PDF] Abstract
82. Improving Endoscopic Decision Support Systems by Translating Between Imaging Modalities [PDF] Abstract
83. Robust Screening of COVID-19 from Chest X-ray via Discriminative Cost-Sensitive Learning [PDF] Abstract
84. Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation [PDF] Abstract
85. Towards Accurate and Robust Domain Adaptation under Noisy Environments [PDF] Abstract
86. Cross-Domain Structure Preserving Projection for Heterogeneous Domain Adaptation [PDF] Abstract
87. Towards Feature Space Adversarial Attack [PDF] Abstract
88. Joint Liver Lesion Segmentation and Classification via Transfer Learning [PDF] Abstract
89. DeepSeg: Deep Neural Network Framework for Automatic Brain Tumor Segmentation using Magnetic Resonance FLAIR Images [PDF] Abstract
90. Development of a High Fidelity Simulator for Generalised Photometric Based Space Object Classification using Machine Learning [PDF] Abstract
91. Deep DIH: Statistically Inferred Reconstruction of Digital In-Line Holography by Deep Learning [PDF] Abstract
92. Explainable Deep CNNs for MRI-Based Diagnosis of Alzheimer's Disease [PDF] Abstract
93. A Survey on Domain Knowledge Powered Deep Learning for Medical Image Analysis [PDF] Abstract
94. POCOVID-Net: Automatic Detection of COVID-19 From a New Lung Ultrasound Imaging Dataset (POCUS) [PDF] Abstract
95. SAIA: Split Artificial Intelligence Architecture for Mobile Healthcare System [PDF] Abstract
96. GPO: Global Plane Optimization for Fast and Accurate Monocular SLAM Initialization [PDF] Abstract
97. Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement [PDF] Abstract
98. Spectral Data Augmentation Techniques to quantify Lung Pathology from CT-images [PDF] Abstract
99. DeepMerge: Classifying High-redshift Merging Galaxies with Deep Neural Networks [PDF] Abstract

Abstracts

1. MakeItTalk: Speaker-Aware Talking Head Animation [PDF] Back to Contents
  Yang Zhou, Dingzeyu Li, Xintong Han, Evangelos Kalogerakis, Eli Shechtman, Jose Echevarria
Abstract: We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion and also animate artistic paintings, sketches, 2D cartoon characters, Japanese mangas, stylized caricatures in a single unified framework. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to prior state-of-the-art.

2. CoReNet: Coherent 3D scene reconstruction from a single RGB image [PDF] Back to Contents
  Stefan Popov, Pablo Bauszat, Vittorio Ferrari
Abstract: Advances in deep learning techniques have allowed recent work to reconstruct the shape of a single object given only one RGB image as input. Building on common encoder-decoder architectures for this task, we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume. We validate the impact of our contributions experimentally both on synthetic data from ShapeNet as well as real images from Pix3D. Our method outperforms the state-of-the-art single-object methods on both datasets. Finally, we evaluate performance quantitatively on multiple object reconstruction with synthetic scenes assembled from ShapeNet objects.

3. Audio-Visual Instance Discrimination with Cross-Modal Agreement [PDF] Back to Contents
  Pedro Morgado, Nuno Vasconcelos, Ishan Misra
Abstract: We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves state-of-the-art results when finetuned on action recognition tasks. While recent work in contrastive learning defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group together multiple instances as positives by measuring their similarity in both the video and the audio feature spaces. Cross-modal agreement creates better positive and negative sets, and allows us to calibrate visual similarities by seeking within-modal discrimination of positive instances.
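As a concrete illustration, cross-modal instance discrimination is typically implemented as an InfoNCE-style objective where each video must identify its own audio within a batch, and vice versa. The following is a minimal sketch assuming L2-normalized embeddings from video and audio encoders; the names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, temperature=0.07):
    """Cross-modal contrastive loss: each video clip must pick out its own
    audio among all audios in the batch, and vice versa.
    video_emb, audio_emb: (N, D) L2-normalized embeddings."""
    logits = video_emb @ audio_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # video -> audio discrimination plus audio -> video discrimination
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# usage (hypothetical encoders):
# v = F.normalize(video_encoder(clips), dim=1)
# a = F.normalize(audio_encoder(spectrograms), dim=1)
# loss = cross_modal_nce(v, a)
```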

4. Improvement in Land Cover and Crop Classification based on Temporal Features Learning from Sentinel-2 Data Using Recurrent-Convolutional Neural Network (R-CNN) [PDF] Back to Contents
  Vittorio Mazzia, Aleem Khaliq, Marcello Chiaberge
Abstract: The increasing spatial and temporal resolution of globally available satellite images, such as those provided by Sentinel-2, creates new possibilities for researchers to use freely available multi-spectral optical images, with decametric spatial resolution and more frequent revisits, for remote sensing applications such as land cover and crop classification (LC&CC), agricultural monitoring and management, and environment monitoring. Existing solutions dedicated to cropland mapping can be categorized as per-pixel based or object-based. However, it is still challenging when more classes of agricultural crops are considered at a massive scale. In this paper, a novel and optimal deep learning model for pixel-based LC&CC is developed and implemented based on Recurrent Neural Networks (RNN) in combination with Convolutional Neural Networks (CNN), using multi-temporal Sentinel-2 imagery of the central north part of Italy, which has a diverse agricultural system dominated by economic crop types. The proposed methodology is capable of automated feature extraction by learning the time correlation of multiple images, which reduces manual feature engineering and the modeling of crop phenological stages. Fifteen classes, including major agricultural crops, were considered in this study. We also tested other widely used traditional machine learning algorithms for comparison, such as support vector machine (SVM), random forest (RF), kernel SVM, and gradient boosting machine, also called XGBoost. The overall accuracy achieved by our proposed Pixel R-CNN was 96.5%, which showed considerable improvements in comparison with existing mainstream methods. This study showed that the Pixel R-CNN based model offers a highly accurate way to assess and employ time-series data for multi-temporal classification tasks.
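To make the recurrent-convolutional idea concrete, a toy per-pixel classifier could combine a 1D convolution over acquisition dates with a GRU that learns the temporal correlation. The sketch below assumes 13 Sentinel-2 bands and 15 classes; it is only a stand-in for the paper's Pixel R-CNN, not its architecture.

```python
import torch
import torch.nn as nn

class PixelRCNN(nn.Module):
    """Toy recurrent-convolutional classifier for per-pixel time series:
    input is (batch, timesteps, bands) of multi-temporal reflectances."""
    def __init__(self, n_bands=13, n_classes=15, hidden=64):
        super().__init__()
        # spectral feature extraction applied along the acquisition-date axis
        self.conv = nn.Sequential(
            nn.Conv1d(n_bands, 32, kernel_size=3, padding=1), nn.ReLU())
        # temporal correlation across dates
        self.rnn = nn.GRU(32, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                                 # x: (B, T, n_bands)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv over time axis
        _, last = self.rnn(h)                             # final hidden state summarizes the season
        return self.head(last.squeeze(0))

# logits = PixelRCNN()(torch.randn(8, 30, 13))  # 30 acquisition dates, 13 bands
```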

5. AI-Driven CT-based quantification, staging and short-term outcome prediction of COVID-19 pneumonia [PDF] Back to Contents
  Guillaume Chassagnon, Maria Vakalopoulou, Enzo Battistella, Stergios Christodoulidis, Trieu-Nghi Hoang-Thi, Severine Dangeard, Eric Deutsch, Fabrice Andre, Enora Guillo, Nara Halm, Stefany El Hajj, Florian Bompard, Sophie Neveu, Chahinez Hani, Ines Saab, Alienor Campredon, Hasmik Koulakian, Souhail Bennani, Gael Freche, Aurelien Lombard, Laure Fournier, Hippolyte Monnier, Teodor Grand, Jules Gregory, Antoine Khalil, Elyas Mahdjoub, Pierre-Yves Brillet, Stephane Tran Ba, Valerie Bousson, Marie-Pierre Revel, Nikos Paragios
Abstract: Chest computed tomography (CT) is widely used for the management of Coronavirus disease 2019 (COVID-19) pneumonia because of its availability and rapidity. The standard of reference for confirming COVID-19 relies on microbiological tests but these tests might not be available in an emergency setting and their results are not immediately available, contrary to CT. In addition to its role for early diagnosis, CT has a prognostic role by allowing visually evaluating the extent of COVID-19 lung abnormalities. The objective of this study is to address prediction of short-term outcomes, especially need for mechanical ventilation. In this multi-centric study, we propose an end-to-end artificial intelligence solution for automatic quantification and prognosis assessment by combining automatic CT delineation of lung disease meeting performance of experts and data-driven identification of biomarkers for its prognosis. AI-driven combination of variables with CT-based biomarkers offers perspectives for optimal patient management given the shortage of intensive care beds and ventilators.

6. A Deep Attentive Convolutional Neural Network for Automatic Cortical Plate Segmentation in Fetal MRI [PDF] Back to Contents
  Haoran Dou, Davood Karimi, Caitlin K. Rollins, Cynthia M. Ortinau, Lana Vasung, Clemente Velasco-Annis, Abdelhakim Ouaalam, Xin Yang, Dong Ni, Ali Gholipour
Abstract: Fetal cortical plate segmentation is essential in quantitative analysis of fetal brain maturation and cortical folding. Manual segmentation of the cortical plate, or manual refinement of automatic segmentations is tedious and time consuming, and automatic segmentation of the cortical plate is challenged by the relatively low resolution of the reconstructed fetal brain MRI scans compared to the thin structure of the cortical plate, partial voluming, and the wide range of variations in the morphology of the cortical plate as the brain matures during gestation. To reduce the burden of manual refinement of segmentations, we have developed a new and powerful deep learning segmentation method that exploits new deep attentive modules with mixed kernel convolutions within a fully convolutional neural network architecture that utilizes deep supervision and residual connections. Quantitative evaluation based on several performance measures and expert evaluations show that our method outperformed several state-of-the-art deep models for segmentation, as well as a state-of-the-art multi-atlas segmentation technique. In particular, we achieved average Dice similarity coefficient of 0.87, average Hausdorff distance of 0.96mm, and average symmetric surface difference of 0.28mm in cortical plate segmentation on reconstructed fetal brain MRI scans of fetuses scanned in the gestational age range of 16 to 39 weeks. By generating accurate cortical plate segmentations in less than 2 minutes, our method can facilitate and accelerate large-scale studies on normal and altered fetal brain cortical maturation and folding.
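For reference, the Dice similarity coefficient reported above measures twice the overlap between predicted and reference masks divided by their total size; a minimal NumPy version:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks:
    DSC = 2|A ∩ B| / (|A| + |B|), where 1.0 means perfect overlap."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

# e.g. dice_coefficient(model_mask, expert_mask) -> ~0.87 in the paper's evaluation
```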

7. Detecting and Tracking Communal Bird Roosts in Weather Radar Data [PDF] Back to Contents
  Zezhou Cheng, Saadia Gabriel, Pankaj Bhambhani, Daniel Sheldon, Subhransu Maji, Andrew Laughlin, David Winkler
Abstract: The US weather radar archive holds detailed information about biological phenomena in the atmosphere over the last 20 years. Communally roosting birds congregate in large numbers at nighttime roosting locations, and their morning exodus from the roost is often visible as a distinctive pattern in radar images. This paper describes a machine learning system to detect and track roost signatures in weather radar data. A significant challenge is that labels were collected opportunistically from previous research studies and there are systematic differences in labeling style. We contribute a latent variable model and EM algorithm to learn a detection model together with models of labeling styles for individual annotators. By properly accounting for these variations we learn a significantly more accurate detector. The resulting system detects previously unknown roosting locations and provides comprehensive spatio-temporal data about roosts across the US. This data will provide biologists important information about the poorly understood phenomena of broad-scale habitat use and movements of communally roosting birds during the non-breeding season.

8. Unsupervised Real Image Super-Resolution via Generative Variational AutoEncoder [PDF] Back to Contents
  Zhi-Song Liu, Wan-Chi Siu, Li-Wen Wang, Chu-Tak Li, Marie-Paule Cani, Yui-Lam Chan
Abstract: Benefiting from deep learning, image super-resolution has become one of the most actively developing research fields in computer vision. Depending on whether or not a discriminator is used, a deep convolutional neural network can provide an image with high fidelity or better perceptual quality. Due to the lack of ground truth images in real life, people prefer a photo-realistic image with low fidelity to a blurry image with high fidelity. In this paper, we revisit the classic example based image super-resolution approaches and come up with a novel generative model for perceptual image super-resolution. Given that real images contain various noise and artifacts, we propose a joint image denoising and super-resolution model via Variational AutoEncoder. We come up with a conditional variational autoencoder to encode the reference for a dense feature vector which can then be transferred to the decoder for target image denoising. With the aid of the discriminator, an additional super-resolution subnetwork is attached to super-resolve the denoised image with photo-realistic visual quality. We participated in the NTIRE2020 Real Image Super-Resolution Challenge. Experimental results show that by using the proposed approach, we can obtain enlarged images with clean and pleasant features compared to other supervised methods. We also compared our approach with state-of-the-art methods on various datasets to demonstrate the efficiency of our proposed unsupervised super-resolution model.

9. Per-pixel Classification Rebar Exposures in Bridge Eye-inspection [PDF] Back to Contents
  Yasuno Takato, Nakajima Michihiro, Noda Kazuhiro
Abstract: Efficient inspection and accurate diagnosis are required for civil infrastructure, much of which is now 50 years past completion. Especially in municipalities, the shortage of technical staff and budget constraints on repair expenses have become a critical problem. If damage can be detected automatically, per pixel, from inspection records, in addition to the 5-step judgment and countermeasure classification of visual inspection, then countermeasure information can be provided more flexibly: whether repair is needed and how large the exposed damage of interest is. Damage photos are often sparse; unless the image is zoomed in on the damage, the region where the detection target appears covers at most 1% of the photo. Rebar exposure occurs frequently, and there are many opportunities to judge repair measures from it. In this paper, we propose three transfer-learning damage detection methods that enable semantic segmentation of low-resolution images using damaged photos from human visual inspection. We also created a deep convolutional network from scratch, with preprocessing that generates random crops with rotations. We show the results of applying this method to 208 rebar-exposure images from 106 real-world bridges. Finally, future tasks in damage detection modeling are discussed.

10. Adversarial Fooling Beyond "Flipping the Label" [PDF] Back to Contents
  Konda Reddy Mopuri, Vaisakh Shaj, R. Venkatesh Babu
Abstract: Recent advancements in CNNs have shown remarkable achievements in various CV/AI applications. Though CNNs show near human or better than human performance in many critical tasks, they are quite vulnerable to adversarial attacks. These attacks are potentially dangerous in real-life deployments. Though there have been many adversarial attacks proposed in recent years, there is no proper way of quantifying the effectiveness of these attacks. As of today, mere fooling rate is used for measuring the susceptibility of the models, or the effectiveness of adversarial attacks. Fooling rate just considers label flipping and does not consider the cost of such flipping, for instance, in some deployments, flipping between two species of dogs may not be as severe as confusing a dog category with that of a vehicle. Therefore, the metric to quantify the vulnerability of the models should capture the severity of the flipping as well. In this work we first bring out the drawbacks of the existing evaluation and propose novel metrics to capture various aspects of the fooling. Further, for the first time, we present a comprehensive analysis of several important adversarial attacks over a set of distinct CNN architectures. We believe that the presented analysis brings valuable insights about the current adversarial attacks and the CNN models.
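One plausible way to realize a severity-aware metric, in the spirit of (but not necessarily identical to) the metrics proposed here, is to weight each label flip by a class-to-class cost matrix. The sketch below is hypothetical:

```python
import numpy as np

def weighted_fooling_rate(clean_pred, adv_pred, severity):
    """Fooling rate that weights each label flip by its semantic severity.
    severity[i, j] in [0, 1]: cost of flipping class i -> class j (0 on the
    diagonal). The plain fooling rate is recovered when severity is 1
    everywhere off-diagonal."""
    flips = clean_pred != adv_pred
    costs = severity[clean_pred, adv_pred]   # per-sample flip severity
    return (flips * costs).mean()

# e.g. severity could be 0.2 for dog-breed confusions but 1.0 for dog -> vehicle
```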

11. A Skip-connected Multi-column Network for Isolated Handwritten Bangla Character and Digit recognition [PDF] Back to Contents
  Animesh Singh, Ritesh Sarkhel, Nibaran Das, Mahantapas Kundu, Mita Nasipuri
Abstract: Finding local invariant patterns in handwritten characters and/or digits for optical character recognition is a difficult task. Variations in writing styles from one person to another make this task challenging. We have proposed a non-explicit feature extraction method using a multi-scale multi-column skip convolutional neural network in this work. Local and global features extracted from different layers of the proposed architecture are combined to derive the final feature descriptor encoding a character or digit image. Our method is evaluated on four publicly available datasets of isolated handwritten Bangla characters and digits. Exhaustive comparative analysis against contemporary methods establishes the efficacy of our proposed approach.

12. Semantic Neighborhood-Aware Deep Facial Expression Recognition [PDF] Back to Contents
  Yongjian Fu, Xintian Wu, Xi Li, Zhijie Pan, Daxin Luo
Abstract: Different from many other attributes, facial expression can change in a continuous way, and therefore, a slight semantic change of input should also lead to output fluctuation limited to a small scale. This consistency is important. However, current Facial Expression Recognition (FER) datasets may have an extreme imbalance problem, as well as a lack of data and excessive amounts of noise, hindering this consistency and leading to performance decreases when testing. In this paper, we not only consider the prediction accuracy on sample points, but also take their neighborhood smoothness into consideration, focusing on the stability of the output with respect to slight semantic perturbations of the input. A novel method is proposed to formulate semantic perturbation and select unreliable samples during training, reducing their bad effect. Experiments show the effectiveness of the proposed method, and state-of-the-art results are reported, getting 30% closer to an upper limit than previous state-of-the-art methods on AffectNet, the largest in-the-wild FER database to date.

13. Unsupervised Domain Adaptation with Multiple Domain Discriminators and Adaptive Self-Training [PDF] Back to Contents
  Teo Spadotto, Marco Toldo, Umberto Michieli, Pietro Zanuttigh
Abstract: Unsupervised Domain Adaptation (UDA) aims at improving the generalization capability of a model trained on a source domain to perform well on a target domain for which no labeled data is available. In this paper, we consider the semantic segmentation of urban scenes and we propose an approach to adapt a deep neural network trained on synthetic data to real scenes addressing the domain shift between the two different data distributions. We introduce a novel UDA framework where a standard supervised loss on labeled synthetic data is supported by an adversarial module and a self-training strategy aiming at aligning the two domain distributions. The adversarial module is driven by a couple of fully convolutional discriminators dealing with different domains: the first discriminates between ground truth and generated maps, while the second between segmentation maps coming from synthetic or real world data. The self-training module exploits the confidence estimated by the discriminators on unlabeled data to select the regions used to reinforce the learning process. Furthermore, the confidence is thresholded with an adaptive mechanism based on the per-class overall confidence. Experimental results prove the effectiveness of the proposed strategy in adapting a segmentation network trained on synthetic datasets like GTA5 and SYNTHIA, to real world datasets like Cityscapes and Mapillary.
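The confidence-thresholded self-training step might look roughly like the sketch below. The tensor shapes, the ignore-index convention, and the exact way discriminator confidence and per-class confidence enter the mask are assumptions for illustration, not the authors' implementation:

```python
import torch

def select_pseudo_labels(seg_logits, disc_confidence, base_thresh, class_conf):
    """Pick reliable target-domain pixels for self-training.
    seg_logits: (B, C, H, W) segmentation scores on unlabeled target images;
    disc_confidence: (B, 1, H, W) discriminator score that the prediction
    looks 'source-like'; class_conf: (C,) per-class overall confidence used
    to adapt the threshold (rare classes get a looser cut)."""
    prob, pseudo = seg_logits.softmax(dim=1).max(dim=1)   # (B, H, W)
    thresh = base_thresh * class_conf[pseudo]             # adaptive per-class threshold
    mask = (disc_confidence.squeeze(1) > thresh) & (prob > thresh)
    pseudo[~mask] = 255                                   # ignore-index for the loss
    return pseudo
```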

14. GraftNet: An Engineering Implementation of CNN for Fine-grained Multi-label Task [PDF] Back to Contents
  Chunhua Jia, Lei Zhang, Hui Huang, Weiwei Cai, Hao Hu, Rohan Adivarekar
Abstract: Multi-label networks with branches have been shown to perform well in both accuracy and speed, but they lack the flexibility to dynamically extend to new labels, owing to the low efficiency of re-annotating and re-training. For a multi-label classification task, to cover new labels we need to annotate not only newly collected images, but also the entire previous dataset to check for the presence of these new labels. Training on the whole re-annotated dataset also costs much time. In order to recognize new labels more effectively and accurately, we propose GraftNet, a customizable tree-like network whose trunk is pretrained with a dynamic graph for generic feature extraction, and whose branches are separately trained on single-label sub-datasets to improve accuracy. GraftNet can reduce cost, increase flexibility, and incrementally handle new labels. Experimental results show that it performs well on our human attribute recognition task, which is fine-grained multi-label classification.

15. On indirect assessment of heart rate in video [PDF] Back to Contents
  Mikhail Kopeliovich, Konstantin Kalinin, Yuriy Mironenko, Mikhail Petrushan
Abstract: The problem of indirect assessment of heart rate in video is addressed. Several indirect evaluation methods (adaptive baselines) were examined on the Remote Physiological Signal Sensing challenge. In particular, regression models of the dependency of heart rate on estimated age and motion intensity were obtained on the challenge's training set. Accounting for both motion and age in the regression model led to a top-quarter position on the leaderboard. The practical value of such adaptive baseline approaches is discussed. Although such approaches are considered non-applicable in medicine, they are valuable as baselines for the photoplethysmography problem.
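Such an adaptive baseline is straightforward to reproduce; below is a least-squares sketch of regressing heart rate on estimated age and motion intensity (variable names are illustrative):

```python
import numpy as np

# Toy adaptive baseline: regress heart rate on estimated age and motion
# intensity over a training set, then predict without touching the PPG signal.
def fit_hr_baseline(age, motion, hr):
    X = np.column_stack([age, motion, np.ones_like(age)])
    coef, *_ = np.linalg.lstsq(X, hr, rcond=None)   # ordinary least squares
    return coef

def predict_hr(coef, age, motion):
    return coef[0] * age + coef[1] * motion + coef[2]
```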

16. In-Vehicle Object Detection in the Wild for Driverless Vehicles [PDF] Back to Contents
  Ranjith Dinakaran, Li Zhang, Richard Jiang
Abstract: In-vehicle human object identification plays an important role in vision-based automated vehicle driving systems, while objects such as pedestrians and vehicles on roads or streets are the primary targets to protect from driverless vehicles. A challenge is the difficulty of detecting moving objects under wild conditions, where illumination and image quality can vary drastically. In this work, to address this challenge, we exploit Deep Convolutional Generative Adversarial Networks (DCGANs) with a Single Shot Detector (SSD) to handle the wild conditions. In our work, a GAN was trained with low-quality images to handle the challenges arising from the wild conditions in smart cities, while a cascaded SSD is employed as the object detector to perform with the GAN. We tested our approach under wild conditions using taxi driver videos on London streets in both daylight and nighttime, and the tests from in-vehicle videos demonstrate that this strategy can drastically achieve a better detection rate under the wild conditions.

17. Remote Photoplethysmography: Rarely Considered Factors [PDF] Back to Contents
  Yuriy Mironenko, Konstantin Kalinin, Mikhail Kopeliovich, Mikhail Petrushan
Abstract: Remote Photoplethysmography (rPPG) is a fast-growing technique of vital sign estimation by analyzing video of a person. Several major phenomena affecting rPPG signals have been studied (e.g. video compression, distance from person to camera, skin tone, head motions). However, to develop a highly accurate rPPG method, new, minor factors should be investigated. The first factor considered is the irregular frame rate of video recordings. Despite the transformation of the PPG signal by frame rate irregularity, no significant distortion of the PPG signal spectra was found in the experiments. The second factor is the rolling shutter effect, which generates a tiny phase shift of the same PPG signal in different parts of the frame, caused by progressive scanning. Under particular conditions, the effect of this artifact can be of the same order of magnitude as physiologically caused phase shifts. The third factor is the size of the temporal windows, which can significantly influence the estimated error of vital sign evaluation. It follows that one should account for differences in processing window size when comparing rPPG methods. A short series of experiments was conducted to estimate the importance of these phenomena and to determine the necessity of further comprehensive study.
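The third factor follows directly from spectral resolution: a window of W seconds yields FFT bins spaced 1/W Hz, i.e. 60/W bpm, apart. A minimal spectral estimator illustrating this dependence:

```python
import numpy as np

def estimate_hr_bpm(ppg, fs, window_s):
    """Estimate heart rate from a PPG trace by spectral peak picking.
    Shorter windows react faster but have coarser frequency resolution
    (df = 1 / window_s Hz), which bounds the achievable HR precision."""
    n = int(window_s * fs)
    segment = ppg[:n] - np.mean(ppg[:n])
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 4.0)   # 42-240 bpm physiological band
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

# A 10 s window gives 6 bpm resolution; a 30 s window gives 2 bpm.
```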

18. Distance Guided Channel Weighting for Semantic Segmentation [PDF] Back to Contents
  Lanyun Zhu, Shiping Zhu, Xuanyi Liu, Li Luo
Abstract: Recent works have achieved great success in improving the performance of multiple computer vision tasks by capturing features with a high channel number utilizing deep neural networks. However, many channels of extracted features are not discriminative and contain a lot of redundant information. In this paper, we address above issue by introducing the Distance Guided Channel Weighting (DGCW) Module. The DGCW module is constructed in a pixel-wise context extraction manner, which enhances the discriminativeness of features by weighting different channels of each pixel's feature vector when modeling its relationship with other pixels. It can make full use of the high-discriminative information while ignore the low-discriminative information containing in feature maps, as well as capture the long-range dependencies. Furthermore, by incorporating the DGCW module with a baseline segmentation network, we propose the Distance Guided Channel Weighting Network (DGCWNet). We conduct extensive experiments to demonstrate the effectiveness of DGCWNet. In particular, it achieves 81.16% mIoU on Cityscapes with only fine annotated data for training, and also gains satisfactory performance on another two semantic segmentation datasets, i.e. Pascal Context and ADE20K. Code will be available soon at this https URL.
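A highly simplified stand-in for per-pixel channel re-weighting (not the actual distance-guided DGCW module) is a 1x1 convolution that predicts a sigmoid gate for every pixel and channel:

```python
import torch
import torch.nn as nn

class PixelChannelWeighting(nn.Module):
    """Toy per-pixel channel re-weighting: each pixel predicts a gate over its
    own C channels, boosting discriminative channels and damping redundant
    ones. A simplified illustration of the idea, not the paper's DGCW module."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat):            # feat: (B, C, H, W)
        return feat * self.gate(feat)   # per-pixel, per-channel weights in (0, 1)
```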

19. Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos [PDF] Back to Contents
  Rafi Umer, Andreas Doering, Bastian Leibe, Juergen Gall
Abstract: Video annotation is expensive and time consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have more sparse annotations compared to large scale image datasets for human pose estimation. This makes it challenging to learn deep learning based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusions for the task of multi-person pose tracking. To address this issue, we propose an approach that relies on keypoint correspondences for associating persons in videos. Instead of training the network for estimating keypoint correspondences on video data, it is trained on a large scale image datasets for human pose estimation using self-supervision. Combined with a top-down framework for human pose estimation, we use keypoints correspondences to (i) recover missed pose detections (ii) associate pose detections across video frames. Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets.

20. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection [PDF] Back to Contents
  Jin Hyeok Yoo, Yeocheol Kim, Ji Song Kim, Jun Won Choi
Abstract: In this paper, we propose a new deep architecture for fusing camera and LiDAR sensors for 3D object detection. Because the camera and LiDAR sensor signals have different characteristics and distributions, fusing these two modalities is expected to improve both the accuracy and robustness of 3D object detection. One of the challenges presented by the fusion of cameras and LiDAR is that the spatial feature maps obtained from each modality are represented by significantly different views in the camera and world coordinates; hence, it is not an easy task to combine two heterogeneous feature maps without loss of information. To address this problem, we propose a method called 3D-CVF that combines the camera and LiDAR features using the cross-view spatial feature fusion strategy. First, the method employs auto-calibrated projection to transform the 2D camera features to a smooth spatial feature map with the highest correspondence to the LiDAR features in the bird's eye view (BEV) domain. Then, a gated feature fusion network is applied to use the spatial attention maps to mix the camera and LiDAR features appropriately according to the region. Next, camera-LiDAR feature fusion is also achieved in the subsequent proposal refinement stage. The camera feature is used from the 2D camera-view domain via 3D RoI grid pooling and fused with the BEV feature for proposal refinement. Our evaluations, conducted on the KITTI and nuScenes 3D object detection datasets demonstrate that the camera-LiDAR fusion offers significant performance gain over single modality and that the proposed 3D-CVF achieves state-of-the-art performance in the KITTI benchmark.
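The gated fusion mechanism can be sketched as a per-cell softmax over the two modalities. The toy module below assumes both feature maps have already been warped to a common BEV grid with equal channel counts, and is illustrative of the mechanism only:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Toy spatial-attention gate mixing camera and LiDAR BEV feature maps:
    a 1x1 conv predicts, per BEV cell, how much to trust each modality.
    Illustrative only; the paper's gated feature fusion network is richer."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=1), nn.Softmax(dim=1))

    def forward(self, cam_bev, lidar_bev):     # both (B, C, H, W) in BEV
        w = self.attn(torch.cat([cam_bev, lidar_bev], dim=1))  # (B, 2, H, W)
        return w[:, :1] * cam_bev + w[:, 1:] * lidar_bev       # per-cell mix
```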

21. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents [PDF] Back to Contents
  Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, Kavita Sultanpure
Abstract: An automatic table recognition method for interpretation of tabular data in document images mainly involves solving two problems: table detection and table structure recognition. Prior work solved both problems independently using two separate approaches. More recent works use deep learning-based solutions while also attempting to design an end to end solution. In this paper, we present an improved deep learning-based end to end approach for solving both problems of table detection and structure recognition using a single Convolution Neural Network (CNN) model. We propose CascadeTabNet: a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on the ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in the ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank datasets. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset. Additionally, we demonstrate effective transfer learning and image augmentation techniques that enable CNNs to achieve very accurate table detection results. Code and dataset have been made available at: this https URL

22. Preliminary Forensics Analysis of DeepFake Images [PDF] Back to Contents
  Luca Guarnera, Oliver Giudice, Cristina Nastasi, Sebastiano Battiato
Abstract: One of the most terrifying phenomena nowadays is the DeepFake: the possibility of automatically replacing a person's face in images and videos by exploiting deep learning algorithms. This paper will present a brief overview of technologies able to produce DeepFake images of faces. A forensic analysis of those images with standard methods will be presented: unsurprisingly, state-of-the-art techniques are not fully able to detect the fakery. To address this, a preliminary idea on how to fight DeepFake images of faces will be presented by analysing anomalies in the frequency domain.

23. Localizing Grouped Instances for Efficient Detection in Low-Resource Scenarios [PDF] Back to Contents
  Amelie Royer, Christoph H. Lampert
Abstract: State-of-the-art detection systems are generally evaluated on their ability to exhaustively retrieve objects densely distributed in the image, across a wide variety of appearances and semantic categories. Orthogonal to this, many real-life object detection applications, for example in remote sensing, instead require dealing with large images that contain only a few small objects of a single class, scattered heterogeneously across the space. In addition, they are often subject to strict computational constraints, such as limited battery capacity and computing power. To tackle these more practical scenarios, we propose a novel flexible detection scheme that efficiently adapts to variable object sizes and densities: We rely on a sequence of detection stages, each of which has the ability to predict groups of objects as well as individuals. Similar to a detection cascade, this multi-stage architecture spares computational effort by discarding large irrelevant regions of the image early during the detection process. The ability to group objects provides further computational and memory savings, as it allows working with lower image resolutions in early stages, where groups are more easily detected than individuals, as they are more salient. We report experimental results on two aerial image datasets, and show that the proposed method is as accurate yet computationally more efficient than standard single-shot detectors, consistently across three different backbone architectures.

24. Maximum Density Divergence for Domain Adaptation [PDF] Back to Contents
  Li Jingjing, Chen Erpeng, Ding Zhengming, Zhu Lei, Lu Ke, Shen Heng Tao
Abstract: Unsupervised domain adaptation addresses the problem of transferring knowledge from a well-labeled source domain to an unlabeled target domain where the two domains have distinctive data distributions. Thus, the essence of domain adaptation is to mitigate the distribution divergence between the two domains. The state-of-the-art methods practice this very idea by either conducting adversarial training or minimizing a metric which defines the distribution gaps. In this paper, we propose a new domain adaptation method named Adversarial Tight Match (ATM) which enjoys the benefits of both adversarial training and metric learning. Specifically, at first, we propose a novel distance loss, named Maximum Density Divergence (MDD), to quantify the distribution divergence. MDD minimizes the inter-domain divergence ("match" in ATM) and maximizes the intra-class density ("tight" in ATM). Then, to address the equilibrium challenge issue in adversarial domain adaptation, we consider leveraging the proposed MDD into adversarial domain adaptation framework. At last, we tailor the proposed MDD as a practical learning loss and report our ATM. Both empirical evaluation and theoretical analysis are reported to verify the effectiveness of the proposed method. The experimental results on four benchmarks, both classical and large-scale, show that our method is able to achieve new state-of-the-art performance on most evaluations. Codes and datasets used in this paper are available at this http URL.
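Schematically, the "match" and "tight" terms combine an inter-domain discrepancy with an intra-class scatter penalty. The sketch below is a simplified reading of this idea with assumed inputs (precomputed class centers); it is not the paper's exact MDD formulation:

```python
import torch

def mdd_style_loss(src_feat, tgt_feat, src_labels, src_centers):
    """Sketch of a match-and-tighten objective: pull the two domain means
    together ('match') while pulling each source sample toward its class
    center ('tight'). src_feat: (Ns, D), tgt_feat: (Nt, D),
    src_centers: (C, D). Illustrative only."""
    inter = (src_feat.mean(0) - tgt_feat.mean(0)).pow(2).sum()         # domain gap
    intra = (src_feat - src_centers[src_labels]).pow(2).sum(1).mean()  # class density
    return inter + intra
```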

25. Deploying Image Deblurring across Mobile Devices: A Perspective of Quality and Latency [PDF] Back to Contents
  Cheng-Ming Chiang, Yu Tseng, Yu-Syuan Xu, Hsien-Kai Kuo, Yi-Min Tsai, Guan-Yu Chen, Koan-Sin Tan, Wei-Ting Wang, Yu-Chieh Lin, Shou-Yao Roy Tseng, Wei-Shiang Lin, Chia-Lin Yu, BY Shen, Kloze Kao, Chia-Ming Cheng, Hung-Jen Chen
Abstract: Recently, image enhancement and restoration have become important applications on mobile devices, such as super-resolution and image deblurring. However, most state-of-the-art networks present extremely high computational complexity. This makes them difficult to be deployed on mobile devices with acceptable latency. Moreover, when deploying to different mobile devices, there is a large latency variation due to the difference and limitation of deep learning accelerators on mobile devices. In this paper, we conduct a search of portable network architectures for better quality-latency trade-off across mobile devices. We further present the effectiveness of widely used network optimizations for image deblurring task. This paper provides comprehensive experiments and comparisons to uncover the in-depth analysis for both latency and image quality. Through all the above works, we demonstrate the successful deployment of image deblurring application on mobile devices with the acceleration of deep learning accelerators. To the best of our knowledge, this is the first paper that addresses all the deployment issues of image deblurring task across mobile devices. This paper provides practical deployment-guidelines, and is adopted by the championship-winning team in NTIRE 2020 Image Deblurring Challenge on Smartphone Track.

26. VTGNet: A Vision-based Trajectory Generation Network for Autonomous Vehicles in Urban Environments [PDF] Back to Contents
  Peide Cai, Yuxiang Sun, Hengli Wang, Ming Liu
Abstract: Reliable navigation like expert human drivers in urban environments is a critical capability for autonomous vehicles. Traditional methods for autonomous driving are implemented with many building blocks from perception, planning and control, making them difficult to generalize to varied scenarios due to complex assumptions and interdependencies. In this paper, we develop an end-to-end trajectory generation method based on imitation learning. It can extract spatiotemporal features from the front-view camera images for scene understanding, then generate collision-free trajectories several seconds into the future. The proposed network consists of three sub-networks, which are selectively activated for three common driving tasks: keep straight, turn left and turn right. The experimental results suggest that under various weather and lighting conditions, our network can reliably generate trajectories in different urban environments, such as turning at intersections and slowing down for collision avoidance. Furthermore, by integrating the proposed network into a navigation system, good generalization performance is presented in an unseen simulated world for autonomous driving on different types of vehicles, such as cars and trucks.

27. Difficulty Translation in Histopathology Images [PDF] Back to Contents
  Jerry Wei, Arief Suriawinata, Xiaoying Liu, Bing Ren, Mustafa Nasir-Moin, Naofumi Tomita, Jason Wei, Saeed Hassanpour
Abstract: The unique nature of histopathology images opens the door to domain-specific formulations of image translation models. We propose a difficulty translation model that modifies colorectal histopathology images to become more challenging to classify. Our model comprises a scorer, which provides an output confidence to measure the difficulty of images, and an image translator, which learns to translate images from easy-to-classify to hard-to-classify using a training set defined by the scorer. We present three findings. First, generated images were indeed harder to classify for both human pathologists and machine learning classifiers than their corresponding source images. Second, image classifiers trained with generated images as augmented data performed better on both easy and hard images from an independent test set. Finally, human annotator agreement and our model's measure of difficulty correlated strongly, implying that for future work requiring human annotator agreement, the confidence score of a machine learning classifier could be used instead as a proxy.

28. Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays [PDF] 返回目录
  Laurie Bose, Jianing Chen, Stephen J. Carey, Piotr Dudek, Walterio Mayol-Cuevas
Abstract: We present a novel method of CNN inference for pixel processor array (PPA) vision sensors, designed to take advantage of their massive parallelism and analog compute capabilities. PPA sensors consist of an array of processing elements (PEs), with each PE capable of light capture, data storage and computation, allowing various computer vision processing to be executed directly upon the sensor device. The key idea behind our approach is storing network weights "in-pixel" within the PEs of the PPA sensor itself to allow various computations, such as multiple different image convolutions, to be carried out in parallel. Our approach can perform convolutional layers, max pooling, ReLU, and a final fully connected layer entirely upon the PPA sensor, while leaving no untapped computational resources. This is in contrast to previous works that only use sensor-level processing to sequentially compute image convolutions, and must transfer data to an external digital processor to complete the computation. We demonstrate our approach on the SCAMP-5 vision system, performing inference of an MNIST digit classification network at over 3000 frames per second with over 93% classification accuracy. This is the first work demonstrating CNN inference conducted entirely upon the processor array of a PPA vision sensor device, requiring no external processing.

29. Weakly Supervised Semantic Segmentation in 3D Graph-Structured Point Clouds of Wild Scenes [PDF] 返回目录
  Haiyan Wang, Xuejian Rong, Liang Yang, Jinglun Feng, Jizhong Xiao, Yingli Tian
Abstract: The deficiency of 3D segmentation labels is one of the main obstacles to effective point cloud segmentation, especially for scenes in the wild with varieties of different objects. To alleviate this issue, we propose a novel deep graph convolutional network-based framework for large-scale semantic scene segmentation in point clouds with only 2D supervision. Unlike numerous preceding multi-view supervised approaches that focus on single-object point clouds, we argue that 2D supervision is capable of providing sufficient guidance information for training 3D semantic segmentation models of natural scene point clouds, even without explicitly capturing their inherent structures and with only a single view per training sample. Specifically, a Graph-based Pyramid Feature Network (GPFN) is designed to implicitly infer both global and local features of point sets, and an Observability Network (OBSNet) is introduced to further solve the object occlusion problem caused by the complicated spatial relations of objects in 3D scenes. During the projection process, perspective rendering and semantic fusion modules are proposed to provide refined 2D supervision signals for training, along with a 2D-3D joint optimization strategy. Extensive experimental results demonstrate the effectiveness of our 2D-supervised framework, which achieves results comparable to state-of-the-art approaches trained with full 3D labels for semantic point cloud segmentation on the popular SUNCG synthetic dataset and the S3DIS real-world dataset.
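A minimal sketch of the projection step that makes 2D supervision possible: 3D points are projected into a labeled 2D view with a pinhole camera model, so per-pixel 2D labels can supervise per-point 3D predictions. The camera conventions here are standard assumptions, not the paper's exact formulation.

    import numpy as np

    def project_points(points_xyz, K, R, t):
        # points_xyz: (N, 3) world points; K: 3x3 intrinsics;
        # R, t: world-to-camera rotation (3x3) and translation (3,).
        cam = points_xyz @ R.T + t           # world -> camera frame
        in_front = cam[:, 2] > 1e-6          # keep points in front of the camera
        uvw = cam @ K.T                      # homogeneous pixel coordinates
        uv = uvw[:, :2] / uvw[:, 2:3]
        return uv, in_front

    # 2D labels sampled at uv can then serve as weak supervision
    # for the 3D segmentation network's per-point predictions.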

30. On the Limits to Multi-Modal Popularity Prediction on Instagram -- A New Robust, Efficient and Explainable Baseline [PDF] 返回目录
  Christoffer Riis, Damian Konrad Kowalczyk, Lars Kai Hansen
Abstract: The predictability of social media popularity is a topic of much scientific interest and significant practical importance. We present a new strong baseline for popularity prediction on Instagram, which is both robust and efficient to compute. The approach expands previous work through a comprehensive ablation study of the predictive power of multiple representations of the visual modality and through detailed use of explainability tools. We use transfer learning to extract visual semantics as concepts, scenes, and objects, which allows us to interpret and explain the trained model and its predictions. The study is based on one million posts extracted from Instagram. We approach the problem of popularity prediction as a ranking problem, where we predict the log-normalised number of likes. Through our ablation study design, we can suggest models that outperform a previous state-of-the-art black-box method for multi-modal popularity prediction on Instagram.

31. Designing a physically-feasible colour filter to make a camera more colorimetric [PDF] 返回目录
  Yuteng Zhu
Abstract: Previously, a method has been developed to find the best colour filter for a given camera, resulting in new effective camera sensitivities that best meet the Luther condition. That is, the new sensitivities are approximately linearly related to the XYZ colour matching functions. However, with no constraints, the filter derived from this Luther-condition based optimisation can be rather non-smooth and transmit very little light, which is impractical for fabrication. In this paper, we extend the Luther-condition filter optimisation method to incorporate both smoothness and transmittance bounds on the recovered filter, which are key practical concerns. Experiments demonstrate that we can find physically realisable filters which are smooth and reasonably transmissive, with which the effective "camera+filter" becomes significantly more colorimetric.
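The optimisation can be pictured as follows: find a per-wavelength filter f so that the filtered sensitivities diag(f) C are as close as possible to a linear transform of the XYZ colour matching functions, while keeping f smooth and above a transmittance floor. The projected-gradient sketch below is an illustrative reconstruction under these assumptions, not the paper's algorithm; all hyper-parameters are made up.

    import numpy as np

    def luther_filter(C, X, n_iter=2000, lr=1e-2, f_min=0.2, smooth=1e-2):
        # C, X: (n_wavelengths, 3) camera sensitivities and XYZ CMFs.
        n = C.shape[0]
        f = np.ones(n)
        for _ in range(n_iter):
            Cf = f[:, None] * C
            # Best 3x3 map M with X @ M ~ Cf (least squares; envelope theorem
            # lets us treat M as fixed when differentiating with respect to f).
            M, *_ = np.linalg.lstsq(X, Cf, rcond=None)
            R = Cf - X @ M                       # Luther-condition residual
            grad = (R * C).sum(axis=1)           # d||R||^2/df up to a factor of 2
            # First-difference smoothness penalty (boundary terms omitted).
            grad[1:-1] += smooth * (2 * f[1:-1] - f[:-2] - f[2:])
            f = np.clip(f - lr * grad, f_min, 1.0)   # project onto the bounds
        return f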

32. One-Shot Identity-Preserving Portrait Reenactment [PDF] 返回目录
  Sitao Xiang, Yuming Gu, Pengda Xiang, Mingming He, Koki Nagano, Haiwei Chen, Hao Li
Abstract: We present a deep learning-based framework for portrait reenactment from a single picture of a target (one-shot) and a video of a driving subject. Existing facial reenactment methods suffer from identity mismatch and produce inconsistent identities when a target and a driving subject are different (cross-subject), especially in one-shot settings. In this work, we aim to address identity preservation in cross-subject portrait reenactment from a single picture. We introduce a novel technique that can disentangle identity from expressions and poses, allowing identity preserving portrait reenactment even when the driver's identity is very different from that of the target. This is achieved by a novel landmark disentanglement network (LD-Net), which predicts personalized facial landmarks that combine the identity of the target with expressions and poses from a different subject. To handle portrait reenactment from unseen subjects, we also introduce a feature dictionary-based generative adversarial network (FD-GAN), which locally translates 2D landmarks into a personalized portrait, enabling one-shot portrait reenactment under large pose and expression variations. We validate the effectiveness of our identity disentangling capabilities via an extensive ablation study, and our method produces consistent identities for cross-subject portrait reenactment. Our comprehensive experiments show that our method significantly outperforms the state-of-the-art single-image facial reenactment methods. We will release our code and models for academic use.

33. All you need is a second look: Towards Tighter Arbitrary shape text detection [PDF] 返回目录
  Meng Cao, Yuexian Zou
Abstract: Deep learning-based scene text detection methods have progressed substantially over the past years. However, several problems remain to be solved. Generally, long curved text instances tend to be fragmented because of the limited receptive field size of CNNs. Besides, simple representations using rectangle or quadrangle bounding boxes fall short when dealing with more challenging arbitrary-shaped texts. In addition, the scale of text instances varies greatly, which makes accurate prediction through a single segmentation network difficult. To address these problems, we innovatively propose a two-stage segmentation-based arbitrary-shape text detector named NASK (Need A Second looK). Specifically, NASK consists of a Text Instance Segmentation network, TIS (1st stage), a Text RoI Pooling module, and a Fiducial pOint eXpression module termed FOX (2nd stage). Firstly, TIS conducts instance segmentation to obtain rectangle text proposals with a proposed Group Spatial and Channel Attention module (GSCA) to augment the feature expression. Then, Text RoI Pooling transforms these rectangles to a fixed size. Finally, FOX is introduced to reconstruct text instances with a tighter representation using the predicted geometrical attributes including text center line, text line orientation, character scale and character orientation. Experimental results on two public benchmarks, Total-Text and SCUT-CTW1500, demonstrate that the proposed NASK achieves state-of-the-art results.

34. Stitcher: Feedback-driven Data Provider for Object Detection [PDF] 返回目录
  Yukang Chen, Peizhen Zhang, Zeming Li, Yanwei Li, Xiangyu Zhang, Gaofeng Meng, Shiming Xiang, Jian Sun, Jiaya Jia
Abstract: Object detectors commonly vary in quality across scales, with performance on small objects being the least satisfying. In this paper, we investigate this phenomenon and discover that: in the majority of training iterations, small objects contribute barely anything to the total loss, causing poor performance due to imbalanced optimization. Inspired by this finding, we present Stitcher, a feedback-driven data provider, which aims to train object detectors in a balanced way. In Stitcher, images are resized into smaller components and then stitched to the same size as regular images. Stitched images inevitably contain smaller objects, which suits our core idea: exploiting the loss statistics as feedback to guide the next-iteration update. Experiments have been conducted on various detectors, backbones, training periods, datasets, and even on instance segmentation. Stitcher steadily improves performance by a large margin in all settings, especially for small objects, with nearly no additional computation in either the training or testing stage.
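The stitching operation itself is simple: several images are downscaled and tiled into one regular-sized image, and a feedback rule decides when to feed stitched images. The PyTorch sketch below illustrates this idea; the 2x2 layout and the threshold are assumptions, not the paper's exact settings.

    import torch
    import torch.nn.functional as F

    def stitch(images):
        # images: (4, C, H, W) -> one (C, H, W) image in a 2x2 layout,
        # turning every original object into a small object.
        small = F.interpolate(images, scale_factor=0.5, mode='bilinear',
                              align_corners=False)
        top = torch.cat([small[0], small[1]], dim=2)      # concat along width
        bottom = torch.cat([small[2], small[3]], dim=2)
        return torch.cat([top, bottom], dim=1)            # concat along height

    def need_stitching(loss_small, loss_total, ratio=0.1):
        # Feedback rule: if small objects contribute less than `ratio` of the
        # total loss, provide stitched images in the next iteration.
        return (loss_small / max(loss_total, 1e-12)) < ratio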

35. Disentangled Image Generation Through Structured Noise Injection [PDF] 返回目录
  Yazeed Alharbi, Peter Wonka
Abstract: We explore different design choices for injecting noise into generative adversarial networks (GANs) with the goal of disentangling the latent space. Instead of traditional approaches, we propose feeding multiple noise codes through separate fully-connected layers respectively. The aim is restricting the influence of each noise code to specific parts of the generated image. We show that disentanglement in the first layer of the generator network leads to disentanglement in the generated image. Through a grid-based structure, we achieve several aspects of disentanglement without complicating the network architecture and without requiring labels. We achieve spatial disentanglement, scale-space disentanglement, and disentanglement of the foreground object from the background style allowing fine-grained control over the generated images. Examples include changing facial expressions in face images, changing beak length in bird images, and changing car dimensions in car images. This empirically leads to better disentanglement scores than state-of-the-art methods on the FFHQ dataset.
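A minimal sketch of the core mechanism: separate noise codes pass through separate fully-connected layers, each mapped to one cell of a spatial grid in the first generator layer, so each code only influences its own image region. The grid size and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class StructuredNoise(nn.Module):
        def __init__(self, code_dim=32, channels=512, grid=4):
            super().__init__()
            self.grid, self.channels = grid, channels
            # One independent fully-connected layer per grid cell.
            self.fcs = nn.ModuleList(
                [nn.Linear(code_dim, channels) for _ in range(grid * grid)])

        def forward(self, codes):
            # codes: (batch, grid*grid, code_dim), one code per spatial cell.
            cells = [fc(codes[:, i]) for i, fc in enumerate(self.fcs)]
            x = torch.stack(cells, dim=2)                # (B, C, grid*grid)
            return x.view(-1, self.channels, self.grid, self.grid)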

36. Hyperspectral image classification based on multi-scale residual network with attention mechanism [PDF] 返回目录
  Xiangdong Zhang, Tengjun Wang, Yun Yang
Abstract: Compared with traditional machine learning methods, deep learning methods such as convolutional neural networks (CNNs) have achieved great success in the hyperspectral image (HSI) classification task. HSI contains abundant spatial and spectral information, but they also contain a lot of invalid information, which may introduce noises and weaken the performance of CNNs. In order to make full use of the useful information in HSI, we propose a multi-scale residual network integrated with the attention mechanism (MSRN-A) for HSI classification in this letter. In our method, we built two different multi-scale feature extraction blocks to extract the joint spatial-spectral features and the advanced spatial features, respectively. Moreover, a spatial-spectral attention module and a spatial attention module were set up to focus on the salient spatial parts and valid spectral information. Experimental results demonstrate that our method achieves high accuracy on the Indian Pines, Pavia University, and Salinas datasets. The source code can be found at this https URL.

37. Evaluation Metrics for Conditional Image Generation [PDF] 返回目录
  Yaniv Benny, Tomer Galanti, Sagie Benaim, Lior Wolf
Abstract: We present two new metrics for evaluating generative models in the class-conditional image generation setting. These metrics are obtained by generalizing the two most popular unconditional metrics: the Inception Score (IS) and the Fréchet Inception Distance (FID). A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts. The link takes the form of a product in the case of IS or an upper bound in the FID case. We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models, thus providing additional insights about their performance, from unlearned classes to mode collapse.
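For reference, the unconditional FID that the paper generalizes compares two Gaussians fitted to Inception features of real and generated images; a standard implementation is sketched below (the class-conditional variants aggregate such distances per class, which is not shown here).

    import numpy as np
    from scipy import linalg

    def fid(mu1, sigma1, mu2, sigma2):
        # FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))
        covmean = linalg.sqrtm(sigma1 @ sigma2)
        if np.iscomplexobj(covmean):
            covmean = covmean.real       # drop tiny numerical imaginary parts
        diff = mu1 - mu2
        return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))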

38. When CNNs Meet Random RNNs: Towards Multi-Level Analysis for RGB-D Object and Scene Recognition [PDF] 返回目录
  Ali Caglayan, Nevrez Imamoglu, Ahmet Burak Can, Ryosuke Nakamura
Abstract: Recognizing objects and scenes are two challenging but essential tasks in image understanding. In particular, the use of RGB-D sensors in handling these tasks has emerged as an important area of focus for better visual understanding. Meanwhile, deep neural networks, specifically convolutional neural networks (CNNs), have become widespread and have been applied to many visual tasks by replacing hand-crafted features with effective deep features. However, it is an open problem how to exploit deep features from a multi-layer CNN model effectively. In this paper, we propose a novel two-stage framework that extracts discriminative feature representations from multi-modal RGB-D images for object and scene recognition tasks. In the first stage, a pretrained CNN model has been employed as a backbone to extract visual features at multiple levels. The second stage maps these features into high level representations with a fully randomized structure of recursive neural networks (RNNs) efficiently. In order to cope with the high dimensionality of CNN activations, a random weighted pooling scheme has been proposed by extending the idea of randomness in RNNs. Multi-modal fusion has been performed through a soft voting approach by computing weights based on individual recognition confidences (i.e. SVM scores) of RGB and depth streams separately. This produces consistent class label estimation in final RGB-D classification performance. Extensive experiments verify that fully randomized structure in RNN stage encodes CNN activations to discriminative solid features successfully. Comparative experimental results on the popular Washington RGB-D Object and SUN RGB-D Scene datasets show that the proposed approach significantly outperforms state-of-the-art methods both in object and scene recognition tasks.

39. Climate Adaptation: Reliably Predicting from Imbalanced Satellite Data [PDF] 返回目录
  Ruchit Rawal, Prabhu Pradhan
Abstract: Aerial imagery (satellites, drones) has become an invaluable information source for cross-disciplinary applications, especially for crisis management. Most of the mapping and tracking efforts are manual, which is resource-intensive and often leads to delivery delays. Deep learning methods have boosted the capacity of relief efforts via recognition and detection, and are now being used for non-trivial applications. However, the data commonly available is highly imbalanced (similar to other real-life applications), which severely hampers the neural network's capabilities; this reduces robustness and trust. We give an overview of different kinds of techniques used for handling such extreme settings and present solutions aimed at maximizing performance on minority classes using a diverse set of methods (ranging from architectural tuning to augmentation) which, as a combination, generalizes across all minority classes. We hope to amplify cross-disciplinary efforts by enhancing model reliability.

40. Transfer learning for leveraging computer vision in infrastructure maintenance [PDF] 返回目录
  Mateusz Żarski, Bartosz Wójcik, Jarosław Adam Miszczak
Abstract: Monitoring the technical condition of infrastructure is a crucial element of its maintenance. Currently, the applied methods are outdated, labour-intensive and highly inaccurate. At the same time, the latest methods using Artificial Intelligence techniques, despite achieving satisfactory results in the detection of infrastructure damage, are severely limited in their application due to two main factors - labour-intensive gathering of new datasets and high demand for computing power. In the presented work, we propose to utilize Transfer Learning techniques and computer vision to overcome these limiting factors and fully harness the advantages of Artificial Intelligence methods. We describe a framework which enables hassle-free development of unique infrastructure defect detectors on digital images, achieving an accuracy above 90%. The framework supports semi-automatic creation of new datasets and has modest computing power requirements. It is implemented in the form of a ready-to-use software package distributed under an open software licence and available to the public. Thus, it can be used to immediately implement the methods proposed in this paper in the process of infrastructure management by government units, regardless of their financial capabilities. With the help of the introduced framework it is possible to improve the efficiency of infrastructure management and the quality of its life cycle documentation globally, leading to a more accurate mapping of the processes taking place in the infrastructure's life cycle for better infrastructure planning in the future.

41. A Global Benchmark of Algorithms for Segmenting Late Gadolinium-Enhanced Cardiac Magnetic Resonance Imaging [PDF] 返回目录
  Zhaohan Xiong, Qing Xia, Zhiqiang Hu, Ning Huang, Sulaiman Vesal, Nishant Ravikumar, Andreas Maier, Caizi Li, Qianqian Tong, Weixin Si, Elodie Puybareau, Younes Khoudli, Thierry Geraud, Chen Chen, Wenjia Bai, Daniel Rueckert, Lingchao Xu, Xiahai Zhuang, Xinzhe Luo, Shuman Jia, Maxime Sermesant, Davide Borra, Alessandro Masci, Cristiana Corsi, Rashed Karim, Coen de Vente, Mitko Veta, Chandrakanth Jayachandran Preetha, Sandy Engelhardt, Menyun Qiao, Yuanyuan Wang, Qian Tao, Marta Nunez-Garcia, Oscar Camara, Yashu Liu, Kuanquan Wang, Nicolo Savioli, Pablo Lamata, Jichao Zhao
Abstract: Segmentation of cardiac images, particularly late gadolinium-enhanced magnetic resonance imaging (LGE-MRI) widely used for visualizing diseased cardiac structures, is a crucial first step for clinical diagnosis and treatment. However, direct segmentation of LGE-MRIs is challenging due to their attenuated contrast. Since most clinical studies have relied on manual and labor-intensive approaches, automatic methods are of high interest, particularly optimized machine learning approaches. To address this, we organized the "2018 Left Atrium Segmentation Challenge" using 154 3D LGE-MRIs, currently the world's largest cardiac LGE-MRI dataset, and associated labels of the left atrium segmented by three medical experts, ultimately attracting the participation of 27 international teams. In this paper, extensive analysis of the submitted algorithms using technical and biological metrics was performed through subgroup analysis and hyper-parameter analysis, offering an overall picture of the major design choices of convolutional neural networks (CNNs) and practical considerations for achieving state-of-the-art left atrium segmentation. Results show the top method achieved a Dice score of 93.2% and a mean surface-to-surface distance of 0.7 mm, significantly outperforming the prior state of the art. Particularly, our analysis demonstrated that double, sequentially used CNNs, in which a first CNN is used for automatic region-of-interest localization and a subsequent CNN is used for refined regional segmentation, achieved far superior results than traditional methods and pipelines containing single CNNs. This large-scale benchmarking study makes a significant step towards much-improved segmentation methods for cardiac LGE-MRIs, and will serve as an important benchmark for evaluating and comparing future works in the field.

42. AutoHR: A Strong End-to-end Baseline for Remote Heart Rate Measurement with Neural Searching [PDF] 返回目录
  Zitong Yu, Xiaobai Li, Xuesong Niu, Jingang Shi, Guoying Zhao
Abstract: Remote photoplethysmography (rPPG), which aims at measuring heart activities without any contact, has great potential in many applications (e.g., remote healthcare). Existing end-to-end rPPG and heart rate (HR) measurement methods from facial videos are vulnerable to less-constrained scenarios (e.g., with head movement and bad illumination). In this letter, we explore the reason why existing end-to-end networks perform poorly in challenging conditions and establish a strong end-to-end baseline (AutoHR) for remote HR measurement with neural architecture search (NAS). The proposed method includes three parts: 1) a powerful searched backbone with a novel Temporal Difference Convolution (TDC), intended to capture intrinsic rPPG-aware clues between frames; 2) a hybrid loss function considering constraints from both the time and frequency domains; and 3) spatio-temporal data augmentation strategies for better representation learning. Comprehensive experiments are performed on three benchmark datasets to show our superior performance on both intra- and cross-dataset testing.
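One plausible reading of a temporal difference convolution, sketched below in PyTorch: blend a vanilla 3D convolution with the same convolution applied to frame-to-frame differences. The exact formulation and the blending coefficient theta are assumptions, not the paper's definition.

    import torch
    import torch.nn as nn

    class TemporalDiffConv(nn.Module):
        def __init__(self, in_ch, out_ch, theta=0.7):
            super().__init__()
            self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
            self.theta = theta   # weight of the temporal-difference term

        def forward(self, x):                 # x: (B, C, T, H, W)
            diff = torch.zeros_like(x)
            diff[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]   # temporal differences
            return self.conv(x) + self.theta * self.conv(diff)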

43. Stomach 3D Reconstruction Based on Virtual Chromoendoscopic Image Generation [PDF] 返回目录
  Aji Resindra Widya, Yusuke Monno, Masatoshi Okutomi, Sho Suzuki, Takuji Gotoda, Kenji Miki
Abstract: Gastric endoscopy is a standard clinical process that enables medical practitioners to diagnose various lesions inside a patient's stomach. If any lesion is found, it is very important to perceive the location of the lesion relative to the global view of the stomach. Our previous research showed that this could be addressed by reconstructing the whole stomach shape from chromoendoscopic images using a structure-from-motion (SfM) pipeline, in which images sprayed with indigo carmine (IC) blue dye were used to increase feature matches for SfM by enhancing the stomach surface's textures. However, spraying the IC dye over the whole stomach requires additional time, labor, and cost, which is not desirable for patients and practitioners. In this paper, we propose an alternative way to achieve whole stomach 3D reconstruction without the need for the IC dye, by generating virtual IC-sprayed (VIC) images based on image-to-image style translation trained on unpaired real no-IC and IC-sprayed images. We have specifically investigated the effect of input and output color channel selection for generating the VIC images and found that translating no-IC green-channel images to IC-sprayed red-channel images gives the best SfM reconstruction result.

44. Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset [PDF] 返回目录
  Menglin Jia, Mengyun Shi, Mikhail Sirotenko, Yin Cui, Claire Cardie, Bharath Hariharan, Hartwig Adam, Serge Belongie
Abstract: In this work we explore the task of instance segmentation with attribute localization, which unifies instance segmentation (detect and segment each object instance) and fine-grained visual attribute categorization (recognize one or multiple attributes). The proposed task requires both localizing an object and describing its properties. To illustrate the various aspects of this task, we focus on the domain of fashion and introduce Fashionpedia as a step toward mapping out the visual aspects of the fashion world. Fashionpedia consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset of everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes, built upon the Fashionpedia ontology. In order to solve this challenging task, we propose a novel Attribute-Mask RCNN model to jointly perform instance segmentation and localized attribute recognition, and provide a novel evaluation metric for the task. We also demonstrate that instance segmentation models pre-trained on Fashionpedia achieve better transfer learning performance on other fashion datasets than ImageNet pre-training. Fashionpedia is available at: this https URL.

45. Learning to Autofocus [PDF] 返回目录
  Charles Herrmann, Richard Strong Bowen, Neal Wadhwa, Rahul Garg, Qirui He, Jonathan T. Barron, Ramin Zabih
Abstract: Autofocus is an important task for digital cameras, yet current approaches often exhibit poor performance. We propose a learning-based approach to this problem, and provide a realistic dataset of sufficient size for effective learning. Our dataset is labeled with per-pixel depths obtained from multi-view stereo, following "Learning single camera depth estimation using dual-pixels". Using this dataset, we apply modern deep classification models and an ordinal regression loss to obtain an efficient learning-based autofocus technique. We demonstrate that our approach provides a significant improvement compared with previous learned and non-learned methods: our model reduces the mean absolute error by a factor of 3.6 over the best comparable baseline algorithm. Our dataset and code are publicly available.
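A common way to realize an ordinal regression loss, which may differ from the paper's exact choice, is to decompose K ordered focus positions into K-1 cumulative binary decisions ("is the in-focus index greater than k?"):

    import torch
    import torch.nn.functional as F

    def ordinal_loss(logits, labels):
        # logits: (B, K-1) cumulative-threshold scores;
        # labels: (B,) integer focus indices in [0, K-1].
        k = logits.shape[1]
        thresholds = torch.arange(k, device=labels.device)     # 0 .. K-2
        targets = (labels.unsqueeze(1) > thresholds).float()   # (B, K-1)
        return F.binary_cross_entropy_with_logits(logits, targets)

Unlike plain cross-entropy, this penalizes predictions more the further they land from the true focus position.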

46. TPNet: Trajectory Proposal Network for Motion Prediction [PDF] 返回目录
  Liangji Fang, Qinhong Jiang, Jianping Shi, Bolei Zhou
Abstract: Making accurate motion predictions of surrounding traffic agents such as pedestrians, vehicles, and cyclists is crucial for autonomous driving. Recent data-driven motion prediction methods have attempted to learn to directly regress the exact future position or its distribution from massive amounts of trajectory data. However, it remains difficult for these methods to provide multimodal predictions as well as to integrate physical constraints such as traffic rules and movable areas. In this work we propose a novel two-stage motion prediction framework, Trajectory Proposal Network (TPNet). TPNet first generates a candidate set of future trajectories as hypothesis proposals, then makes the final predictions by classifying and refining the proposals that meet the physical constraints. By steering the proposal generation process, safe and multimodal predictions are realized. Thus this framework effectively mitigates the complexity of the motion prediction problem while ensuring multimodal output. Experiments on four large-scale trajectory prediction datasets, i.e. the ETH, UCY, Apollo and Argoverse datasets, show that TPNet achieves state-of-the-art results both quantitatively and qualitatively.

47. Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs [PDF] 返回目录
  Tao Yuan, Hangxin Liu, Lifeng Fan, Zilong Zheng, Tao Gao, Yixin Zhu, Song-Chun Zhu
Abstract: Aiming to understand how human (false-)belief--a core socio-cognitive ability--would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs. Specifically, a parse graph (pg) is learned from a single-view spatiotemporal parsing by aggregating various object states along the time axis; such a learned representation is accumulated as the robot's knowledge. An inference algorithm is derived to fuse individual pgs from all robots across multi-views into a joint pg, which affords more effective reasoning and inference capability to overcome the errors originating from a single view. In the experiments, through the joint inference over pgs, the system correctly recognizes human (false-)belief in various settings and achieves better cross-view accuracy on a challenging small-object tracking dataset.

48. Congestion-aware Evacuation Routing using Augmented Reality Devices [PDF] 返回目录
  Zeyu Zhang, Hangxin Liu, Ziyuan Jiao, Yixin Zhu, Song-Chun Zhu
Abstract: We present a congestion-aware routing solution for indoor evacuation, which produces real-time, individually customized evacuation routes among multiple destinations while keeping track of all evacuees' locations. A population density map, obtained on-the-fly by aggregating locations of evacuees from user-end Augmented Reality (AR) devices, is used to model the congestion distribution inside a building. To efficiently search the evacuation route among all destinations, a variant of the A* algorithm is devised to obtain the optimal solution in a single pass. In a series of simulated studies, we show that the proposed algorithm is more computationally optimized than classic path planning algorithms; it generates a more time-efficient evacuation route for each individual that minimizes the overall congestion. A complete system using AR devices is implemented for a pilot study in real-world environments, demonstrating the efficacy of the proposed approach.
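A minimal sketch of a congestion-aware A* variant: edge costs are inflated by the local population density aggregated from the AR devices, so the planner trades longer paths against crowded ones. The graph representation and the congestion weight alpha are illustrative assumptions, not the paper's exact scheme.

    import heapq

    def astar(graph, density, start, goal, h, alpha=2.0):
        # graph: {node: [(neighbor, base_cost), ...]};
        # density: {node: crowd-density estimate}; h: admissible heuristic.
        # Nodes are assumed comparable (e.g., strings) for heap tie-breaking.
        open_set = [(h(start, goal), 0.0, start, [start])]
        best = {start: 0.0}
        while open_set:
            _, g, node, path = heapq.heappop(open_set)
            if node == goal:
                return path
            if g > best.get(node, float('inf')):
                continue                      # stale queue entry
            for nxt, cost in graph.get(node, []):
                # Congestion-aware edge cost: busier nodes cost more to enter.
                g2 = g + cost * (1.0 + alpha * density.get(nxt, 0.0))
                if g2 < best.get(nxt, float('inf')):
                    best[nxt] = g2
                    heapq.heappush(open_set,
                                   (g2 + h(nxt, goal), g2, nxt, path + [nxt]))
        return None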

49. Detective: An Attentive Recurrent Model for Sparse Object Detection [PDF] 返回目录
  Amine Kechaou, Manuel Martinez, Monica Haurilet, Rainer Stiefelhagen
Abstract: In this work, we present Detective - an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object's class and bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has great potential for future developments in applications where the order of the objects to be predicted is of interest.
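The Hungarian-algorithm-based training step can be pictured as a bipartite matching between predictions and ground truth over a combined localization and classification cost; a sketch using SciPy is below (the L1 box cost and the weighting are illustrative, not the paper's exact loss).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_predictions(pred_boxes, pred_probs, gt_boxes, gt_classes, w_box=1.0):
        # pred_boxes: (P, 4); pred_probs: (P, num_classes);
        # gt_boxes: (G, 4); gt_classes: (G,) integer labels.
        loc = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
        cls = -pred_probs[:, gt_classes]          # reward correct-class confidence
        pred_idx, gt_idx = linear_sum_assignment(w_box * loc + cls)
        # Matched pairs receive box + class losses; unmatched predictions are
        # typically trained to output "no object".
        return pred_idx, gt_idx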

50. EfficientPose: Scalable single-person pose estimation [PDF] 返回目录
  Daniel Groos, Heri Ramampiaro, Espen Ihlen
Abstract: Human pose estimation facilitates markerless movement analysis in sports as well as in clinical applications. Still, state-of-the-art models for human pose estimation generally do not meet the requirements of real-life deployment. The main reason for this is that the more the field progresses, the more expensive the approaches become, with high computational demands. To cope with the challenges caused by this trend, we propose a convolutional neural network architecture that benefits from the recently proposed EfficientNets to deliver scalable single-person pose estimation. To this end, we introduce EfficientPose, a family of models harnessing an effective multi-scale feature extractor, computation-efficient detection blocks utilizing mobile inverted bottleneck convolutions, and upscaling that improves the precision of pose configurations. EfficientPose enables real-world deployment on edge devices through a 500K-parameter model consuming less than one GFLOP. The results from our experiments, using the challenging MPII single-person benchmark, show that the proposed EfficientPose models substantially outperform the widely used OpenPose model in terms of accuracy, while at the same time being up to 15 times smaller and 20 times more computationally efficient than their counterpart.
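For context, a mobile inverted bottleneck block of the kind mentioned above (expand, depthwise convolution, project, with a residual when shapes match) can be sketched as follows; the expansion factor and kernel size are typical values, not necessarily the paper's configuration.

    import torch.nn as nn

    class MBConv(nn.Module):
        def __init__(self, ch_in, ch_out, expand=6, stride=1):
            super().__init__()
            mid = ch_in * expand
            self.use_residual = stride == 1 and ch_in == ch_out
            self.block = nn.Sequential(
                nn.Conv2d(ch_in, mid, 1, bias=False),            # expand
                nn.BatchNorm2d(mid), nn.SiLU(),
                nn.Conv2d(mid, mid, 3, stride, 1, groups=mid,
                          bias=False),                            # depthwise
                nn.BatchNorm2d(mid), nn.SiLU(),
                nn.Conv2d(mid, ch_out, 1, bias=False),            # project
                nn.BatchNorm2d(ch_out),
            )

        def forward(self, x):
            y = self.block(x)
            return x + y if self.use_residual else y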

51. Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection [PDF] 返回目录
  Dongzhan Zhou, Xinchi Zhou, Hongwen Zhang, Shuai Yi, Wanli Ouyang
Abstract: In this paper, we propose a general and efficient pre-training paradigm, Jigsaw pre-training, for object detection. Jigsaw pre-training needs only the target detection dataset while requiring only 1/4 of the computational resources of the widely adopted ImageNet pre-training. To build such an efficient paradigm, we reduce the potential redundancy by carefully extracting useful samples from the original images, assembling samples in a Jigsaw manner as input, and using an ERF-adaptive dense classification strategy for model pre-training. These designs include not only a new input pattern to improve spatial utilization but also a novel learning objective to expand the effective receptive field of the pre-trained model. The efficiency and superiority of Jigsaw pre-training are validated by extensive experiments on the MS-COCO dataset, where the results indicate that models using Jigsaw pre-training achieve on-par or even better detection performance than their ImageNet pre-trained counterparts.

52. Revisiting Sequence-to-Sequence Video Object Segmentation with Multi-Task Loss and Skip-Memory [PDF] 返回目录
  Fatemeh Azimi, Benjamin Bischke, Sebastian Palacio, Federico Raue, Joern Hees, Andreas Dengel
Abstract: Video Object Segmentation (VOS) is an active research area of the visual domain. One of its fundamental sub-tasks is semi-supervised / one-shot learning: given only the segmentation mask for the first frame, the task is to provide pixel-accurate masks for the object over the rest of the sequence. Despite much progress in the last years, we noticed that many of the existing approaches lose objects in longer sequences, especially when the object is small or briefly occluded. In this work, we build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data. We further improve this approach by proposing a model that manipulates multi-scale spatio-temporal information using memory-equipped skip connections. Furthermore, we incorporate an auxiliary task based on distance classification which greatly enhances the quality of edges in segmentation masks. We compare our approach to the state of the art and show considerable improvement in the contour accuracy metric and the overall segmentation accuracy.

53. CNN based Road User Detection using the 3D Radar Cube [PDF] 返回目录
  Andras Palffy, Jiaao Dong, Julian F. P. Kooij, Dariu M. Gavrila
Abstract: This letter presents a novel radar-based, single-frame, multi-class detection method for moving road users (pedestrian, cyclist, car), which utilizes low-level radar cube data. The method provides class information at both the radar-target and object level. Radar targets are classified individually after extending the target features with a cropped block of the 3D radar cube around their positions, thereby capturing the motion of moving parts in the local velocity distribution. A Convolutional Neural Network (CNN) is proposed for this classification step. Afterwards, object proposals are generated with a clustering step, which considers not only the radar targets' positions and velocities, but their calculated class scores as well. In experiments on a real-life dataset we demonstrate that our method outperforms state-of-the-art methods both target-wise and object-wise, reaching an average target-wise F1 score of 0.70 (baseline: 0.68) and an object-wise F1 score of 0.56 (baseline: 0.48). Furthermore, we examine the importance of the used features in an ablation study.

54. Offline Signature Verification on Real-World Documents [PDF] 返回目录
  Deniz Engin, Alperen Kantarcı, Seçil Arslan, Hazım Kemal Ekenel
Abstract: Research on offline signature verification has explored a large variety of methods on multiple signature datasets, which are collected under controlled conditions. However, these datasets may not fully reflect the characteristics of the signatures in some practical use cases. Real-world signatures extracted from the formal documents may contain different types of occlusions, for example, stamps, company seals, ruling lines, and signature boxes. Moreover, they may have very high intra-class variations, where even genuine signatures resemble forgeries. In this paper, we address a real-world writer independent offline signature verification problem, in which, a bank's customers' transaction request documents that contain their occluded signatures are compared with their clean reference signatures. Our proposed method consists of two main components, a stamp cleaning method based on CycleGAN and signature representation based on CNNs. We extensively evaluate different verification setups, fine-tuning strategies, and signature representation approaches to have a thorough analysis of the problem. Moreover, we conduct a human evaluation to show the challenging nature of the problem. We run experiments both on our custom dataset, as well as on the publicly available Tobacco-800 dataset. The experimental results validate the difficulty of offline signature verification on real-world documents. However, by employing the stamp cleaning process, we improve the signature verification performance significantly.

55. How to read faces without looking at them [PDF] 返回目录
  Suyash Shandilya, Waris Quamer
Abstract: Face reading is the most intuitive aspect of emotion recognition. Unfortunately, digital analysis of facial expression requires digitally recording personal faces. As emotional analysis is particularly required in a more poised scenario, capturing faces becomes a gross violation of privacy. In this paper, we use the concept of compressive analysis to conceptualise a system which compressively acquires faces in order to ascertain unusable reconstruction, while allowing for acceptable (and adjustable) accuracy in inference.

56. Clustering by Constructing Hyper-Planes [PDF] 返回目录
  Luhong Diao, Jinying Gao, Manman Deng
Abstract: As a kind of basic machine learning method, clustering algorithms group data points into different categories based on their similarity or distribution. We present a clustering algorithm that finds hyper-planes to distinguish the data points. It relies on the marginal space between the points. Then we combine these hyper-planes to determine the centers and number of clusters. Because the algorithm is based on linear structures, it can approximate the distribution of datasets accurately and flexibly. To evaluate its performance, we compared it with some famous clustering algorithms by carrying out experiments on different kinds of benchmark datasets. It clearly outperforms other methods.

57. Deep Multimodal Neural Architecture Search [PDF] 返回目录
  Zhou Yu, Yuhao Cui, Jun Yu, Meng Wang, Dacheng Tao, Qi Tian
Abstract: Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and superior experimental results show that MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding. Code will be made available.

58. CS-AF: A Cost-sensitive Multi-classifier Active Fusion Framework for Skin Lesion Classification [PDF] 返回目录
  Di Zhuang, Keyu Chen, J. Morris Chang
Abstract: Convolutional neural networks (CNNs) have achieved state-of-the-art performance in skin lesion analysis. Compared with a single CNN classifier, combining the results of multiple classifiers via fusion approaches proves to be more effective and robust. Since skin lesion datasets are usually limited and statistically biased, when designing an effective fusion approach it is important to consider not only the performance of each classifier on the training/validation dataset, but also the relative discriminative power (e.g., confidence) of each classifier regarding an individual sample in the testing phase, which calls for an active fusion approach. Furthermore, in skin lesion analysis, the data of certain classes is usually abundant, making them an over-represented majority (e.g., benign lesions), while the data of some other classes is deficient, making them an underrepresented minority (e.g., cancerous lesions). It is more crucial to precisely identify the samples from an underrepresented (i.e., in terms of the amount of data) but more important (e.g., cancerous lesions) minority class. In other words, misclassifying a more severe lesion as a benign or less severe lesion should carry a relatively higher cost (e.g., money, time and even lives). To address such challenges, we present CS-AF, a cost-sensitive multi-classifier active fusion framework for skin lesion classification. In the experimental evaluation, we prepared 60 base classifiers (of 10 CNN architectures) on the ISIC research datasets. Our experimental results show that our framework consistently outperforms the static fusion competitors.
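A minimal sketch of what a cost-sensitive active fusion step could look like: per-classifier posteriors are combined with per-sample confidence weights, and the final decision minimizes expected misclassification cost rather than maximizing posterior probability. The weighting scheme is an illustrative assumption, not the paper's exact design.

    import numpy as np

    def cost_sensitive_fusion(probs, weights, cost):
        # probs: (n_classifiers, n_classes) posteriors for one test sample;
        # weights: (n_classifiers,) per-sample confidences (the "active" part);
        # cost[i, j]: cost of predicting class j when the truth is class i
        # (e.g., missing a cancerous lesion carries the highest cost).
        w = weights / weights.sum()
        fused = (w[:, None] * probs).sum(axis=0)   # weighted average posterior
        expected_cost = fused @ cost               # risk of each possible decision
        return int(np.argmin(expected_cost))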

59. Deep convolutional neural networks for face and iris presentation attack detection: Survey and case study [PDF] 返回目录
  Yomna Safaa El-Din, Mohamed N. Moustafa, Hani Mahdi
Abstract: Biometric presentation attack detection is gaining increasing attention. Users of mobile devices find it more convenient to unlock their smart applications with finger, face or iris recognition instead of passwords. In this paper, we survey the approaches presented in the recent literature to detect face and iris presentation attacks. Specifically, we investigate the effectiveness of fine-tuning very deep convolutional neural networks for the task of face and iris antispoofing. We compare two different fine-tuning approaches on six publicly available benchmark datasets. Results show the effectiveness of these deep models in learning discriminative features that can tell apart real from fake biometric images with a very low error rate. Cross-dataset evaluation on face PAD showed better generalization than the state of the art. We also performed cross-dataset testing on iris PAD datasets in terms of equal error rate, which had not been reported in the literature before. Additionally, we propose the use of a single deep network trained to detect both face and iris attacks. We did not notice accuracy degradation compared to networks trained separately for a single biometric. Finally, we analyzed the features learned by the network, in correlation with the image frequency components, to justify its prediction decisions.

60. StRDAN: Synthetic-to-Real Domain Adaptation Network for Vehicle Re-Identification [PDF] 返回目录
  Sangrok Lee, Eunsoo Park, Hongsuk Yi, Sang Hun Lee
Abstract: Vehicle re-identification aims to identify the same vehicle across different vehicle images. It is challenging but essential for analyzing and predicting traffic flow in the city. Although deep learning methods have achieved enormous progress in this task, requiring a large amount of data is a critical shortcoming. To tackle this problem, we propose a novel framework called the Synthetic-to-Real Domain Adaptation Network (StRDAN), which is trained with inexpensive large-scale synthetic data as well as real data to improve performance. The training method for StRDAN combines domain adaptation and semi-supervised learning methods and their associated losses. StRDAN shows a significant improvement over the baseline model, which is trained using only real data, on two main datasets: VeRi and CityFlow-ReID. Evaluated with the mean average precision (mAP) metric, our model outperforms the reference model by 12.87% on CityFlow-ReID and 3.1% on VeRi.

61. Deepfakes Detection with Automatic Face Weighting [PDF] 返回目录
  Daniel Mas Montserrat, Hanxiang Hao, S. K. Yarlagadda, Sriram Baireddy, Ruiting Shao, Janos Horváth, Emily Bartusiak, Justin Yang, David Guera, Fengqing Zhu, Edward J. Delp
Abstract: Altered and manipulated multimedia is increasingly present and widely distributed via social media platforms. Advanced video manipulation tools enable the generation of highly realistic-looking altered multimedia. While many methods have been presented to detect manipulations, most of them fail when evaluated with data outside of the datasets used in research environments. In order to address this problem, the Deepfake Detection Challenge (DFDC) provides a large dataset of videos containing realistic manipulations and an evaluation system that ensures that methods work quickly and accurately, even when faced with challenging data. In this paper, we introduce a method based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) that extracts visual and temporal features from faces present in videos to accurately detect manipulations. The method is evaluated with the DFDC dataset, providing competitive results compared to other techniques.
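A minimal PyTorch sketch of the CNN-plus-RNN pattern described above; the tiny convolutional backbone and all layer sizes are placeholders for the stronger feature extractor used in the paper:

    import torch
    import torch.nn as nn

    class FrameSequenceDetector(nn.Module):
        """Per-frame CNN features -> GRU over time -> one real/fake logit."""
        def __init__(self, feat_dim=128, hidden=64):
            super().__init__()
            self.cnn = nn.Sequential(  # illustrative stand-in backbone
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )
            self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, clips):              # clips: (B, T, 3, H, W) face crops
            b, t = clips.shape[:2]
            feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
            _, h = self.rnn(feats)             # h: (num_layers, B, hidden)
            return self.head(h[-1])            # one manipulation logit per video

    logits = FrameSequenceDetector()(torch.randn(2, 8, 3, 64, 64))
    print(logits.shape)                        # torch.Size([2, 1])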

62. Neural Head Reenactment with Latent Pose Descriptors [PDF] 返回目录
  Egor Burkov, Igor Pasechnik, Artur Grigorev, Victor Lempitsky
Abstract: We propose a neural head reenactment system, which is driven by a latent pose representation and is capable of predicting the foreground segmentation alongside the RGB image. The latent pose representation is learned as a part of the entire reenactment system, and the learning process is based solely on image reconstruction losses. We show that despite its simplicity, with a large and diverse enough training dataset, such learning successfully decomposes pose from identity. The resulting system can then reproduce mimics of the driving person and, furthermore, can perform cross-person reenactment. Additionally, we show that the learned descriptors are useful for other pose-related tasks, such as keypoint prediction and pose-based retrieval.

63. Extending and Analyzing Self-Supervised Learning Across Domains [PDF] 返回目录
  Bram Wallace, Bharath Hariharan
Abstract: Self-supervised representation learning has achieved impressive results in recent years, with experiments primarily coming on ImageNet or other similarly large internet imagery datasets. There has been little to no work with these methods on other smaller domains, such as satellite, textural, or biological imagery. We experiment with several popular methods on an unprecedented variety of domains. We discover, among other findings, that Rotation is by far the most semantically meaningful task, with much of the performance of Jigsaw and Instance Discrimination being attributable to the nature of their induced distribution rather than semantic understanding. Additionally, there are several areas, such as fine-grain classification, where all tasks underperform. We quantitatively and qualitatively diagnose the reasons for these failures and successes via novel experiments studying pretext generalization, random labelings, and implicit dimensionality. Code and models are available at this https URL.
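The Rotation pretext task that the study singles out is easy to reproduce; a minimal PyTorch sketch (the linear model is only a placeholder for a real encoder):

    import torch
    import torch.nn as nn

    def rotation_batch(images):
        """Build the 4-way rotation pretext task from an unlabeled batch."""
        rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
        labels = torch.arange(4).repeat_interleave(images.size(0))
        return rotated, labels  # train a classifier to predict k from the image

    images = torch.randn(8, 3, 32, 32)            # unlabeled images
    inputs, targets = rotation_batch(images)
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))  # toy model
    loss = nn.CrossEntropyLoss()(encoder(inputs), targets)
    loss.backward()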

64. DALES: A Large-scale Aerial LiDAR Data Set for Semantic Segmentation [PDF] 返回目录
  Nina Varney, Vijayan K. Asari, Quinn Graehling
Abstract: We present the Dayton Annotated LiDAR Earth Scan (DALES) data set, a new large-scale aerial LiDAR data set with over a half-billion hand-labeled points spanning 10 square kilometers of area and eight object categories. Large annotated point cloud data sets have become the standard for evaluating deep learning methods. However, most of the existing data sets focus on data collected from a mobile or terrestrial scanner with few focusing on aerial data. Point cloud data collected from an Aerial Laser Scanner (ALS) presents a new set of challenges and applications in areas such as 3D urban modeling and large-scale surveillance. DALES is the most extensive publicly available ALS data set with over 400 times the number of points and six times the resolution of other currently available annotated aerial point cloud data sets. This data set gives a critical number of expert verified hand-labeled points for the evaluation of new 3D deep learning algorithms, helping to expand the focus of current algorithms to aerial data. We describe the nature of our data, annotation workflow, and provide a benchmark of current state-of-the-art algorithm performance on the DALES data set.

65. DriftNet: Aggressive Driving Behavior Classification using 3D EfficientNet Architecture [PDF] 返回目录
  Alam Noor, Bilel Benjdira, Adel Ammar, Anis Koubaa
Abstract: Aggressive driving (i.e., car drifting) is a dangerous behavior that puts human safety and life at significant risk. This behavior is considered an anomaly with respect to regular traffic on public transportation roads. Recent techniques in deep learning proposed new approaches for anomaly detection in different contexts such as pedestrian monitoring, street fighting, and threat detection. In this paper, we propose a new anomaly detection framework applied to the detection of aggressive driving behavior. Our contribution consists in the development of a 3D neural network architecture, based on the state-of-the-art EfficientNet 2D image classifier, for aggressive driving detection in videos. We propose an EfficientNet3D CNN feature extractor for video analysis, and we compare it with existing feature extractors. We also created a dataset of car drifting in the Saudi Arabian context: this https URL . To the best of our knowledge, this is the first work that addresses the problem of aggressive driving behavior using deep learning.

66. Leveraging Planar Regularities for Point Line Visual-Inertial Odometry [PDF] 返回目录
  Xin Li, Yijia He, Jinlong Lin, Xiao Liu
Abstract: With a monocular Visual-Inertial Odometry (VIO) system, the 3D point cloud and camera motion can be estimated simultaneously. Because pure sparse 3D points provide a structureless representation of the environment, generating a 3D mesh from sparse points can further model the environment topology and produce dense mapping. To improve the accuracy of 3D mesh generation and localization, we propose a tightly-coupled monocular VIO system, PLP-VIO, which exploits point features and line features as well as plane regularities. The co-planarity constraints are used to leverage additional structure information for the more accurate estimation of 3D points and spatial lines in the state estimator. To detect planes and 3D meshes robustly, we combine line features with point features in the detection method. The effectiveness of the proposed method is verified on both synthetic data and public datasets and is compared with other state-of-the-art algorithms.

67. Eigenfeatures: Discrimination of X-ray images of epoxy resins using singular value decomposition of deep learning features [PDF] 返回目录
  Edgar Avalos, Kazuto Akagi, Yasumasa Nishiura
Abstract: Although the process variables of epoxy resins alter their mechanical properties, the visual identification of the characteristic features of X-ray images of samples of these materials is challenging. To facilitate the identification, we approximate the magnitude of the gradient of the intensity field of the X-ray images of different kinds of epoxy resins and then we use deep learning to discover the most representative features of the transformed images. In this solution of the inverse problem of finding characteristic features to discriminate samples of heterogeneous materials, we use the eigenvectors obtained from the singular value decomposition of all the channels of the feature maps of the early layers in a convolutional neural network. While the strongest activated channel gives a visual representation of the characteristic features, often these are not robust enough in some practical settings. On the other hand, the left singular vectors of the matrix decomposition of the feature maps barely change when variables such as the capacity of the network or the network architecture change. High classification accuracy and robustness of characteristic features are demonstrated in this work.
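The use of left singular vectors of flattened feature-map channels can be sketched with plain numpy; the layer choice and array shapes below are assumptions for illustration:

    import numpy as np

    def eigenfeature(feature_maps, k=0):
        """Left singular vector of flattened CNN channels as a robust map.

        feature_maps: (C, H, W) activations from an early conv layer.
        Returns an (H, W) map built from the k-th left singular vector.
        """
        c, h, w = feature_maps.shape
        m = feature_maps.reshape(c, h * w).T        # columns = channels
        u, s, vt = np.linalg.svd(m, full_matrices=False)
        return u[:, k].reshape(h, w)

    fmap = np.random.rand(64, 28, 28).astype(np.float32)
    print(eigenfeature(fmap).shape)  # (28, 28)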

68. Defining Benchmarks for Continual Few-Shot Learning [PDF] 返回目录
  Antreas Antoniou, Massimiliano Patacchiola, Mateusz Ochal, Amos Storkey
Abstract: Both few-shot and continual learning have seen substantial progress in the last years due to the introduction of proper benchmarks. That being said, the field has yet to frame a suite of benchmarks for the highly desirable setting of continual few-shot learning, where the learner is presented with a number of few-shot tasks, one after the other, and then asked to perform well on a validation set stemming from all previously seen tasks. Continual few-shot learning has a small computational footprint and is thus an excellent setting for efficient investigation and experimentation. In this paper we first define a theoretical framework for continual few-shot learning, taking into account recent literature, then we propose a range of flexible benchmarks that unify the evaluation criteria and allow exploring the problem from multiple perspectives. As part of the benchmark, we introduce a compact variant of ImageNet, called SlimageNet64, which retains all original 1000 classes but only contains 200 instances of each one (a total of 200K data-points) downscaled to 64 x 64 pixels. We provide baselines for the proposed benchmarks using a number of popular few-shot learning algorithms, as a result exposing previously unknown strengths and weaknesses of those algorithms in continual and data-limited settings.

69. Extreme Consistency: Overcoming Annotation Scarcity and Domain Shifts [PDF] 返回目录
  Gaurav Fotedar, Nima Tajbakhsh, Shilpa Ananth, Xiaowei Ding
Abstract: Supervised learning has proved effective for medical image analysis. However, it can utilize only the small labeled portion of data; it fails to leverage the large amounts of unlabeled data that is often available in medical image datasets. Supervised models are further handicapped by domain shifts, when the labeled dataset, despite being large enough, fails to cover different protocols or ethnicities. In this paper, we introduce \emph{extreme consistency}, which overcomes the above limitations, by maximally leveraging unlabeled data from the same or a different domain in a teacher-student semi-supervised paradigm. Extreme consistency is the process of sending an extreme transformation of a given image to the student network and then constraining its prediction to be consistent with the teacher network's prediction for the untransformed image. The extreme nature of our consistency loss distinguishes our method from related works that yield suboptimal performance by exercising only mild prediction consistency. Our method is 1) auto-didactic, as it requires no extra expert annotations; 2) versatile, as it handles both domain shift and limited annotation problems; 3) generic, as it is readily applicable to classification, segmentation, and detection tasks; and 4) simple to implement, as it requires no adversarial training. We evaluate our method for the tasks of lesion and retinal vessel segmentation in skin and fundus images. Our experiments demonstrate a significant performance gain over both modern supervised networks and recent semi-supervised models. This performance is attributed to the strong regularization enforced by extreme consistency, which enables the student network to learn how to handle extreme variants of both labeled and unlabeled images. This enhances the network's ability to tackle the inevitable same- and cross-domain data variability during inference.
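A minimal sketch of a consistency loss in this spirit, assuming a geometric transform with a known inverse; the paper's choice of extreme transformations and its teacher-student setup are richer than the horizontal flip used here for brevity:

    import torch
    import torch.nn.functional as F

    def extreme_consistency_loss(student, teacher, unlabeled, transform):
        """Student sees a transformed image; teacher sees the original."""
        x_ext, inverse = transform(unlabeled)       # e.g., flip + heavy distortion
        with torch.no_grad():
            target = teacher(unlabeled)             # pseudo-target, no gradient
        pred = inverse(student(x_ext))              # map back to original geometry
        return F.mse_loss(pred, target)

    # Toy demo: horizontal flip as the "extreme" transform (self-inverse).
    flip = lambda x: (torch.flip(x, dims=(3,)), lambda y: torch.flip(y, dims=(3,)))
    net = torch.nn.Conv2d(1, 1, 3, padding=1)
    loss = extreme_consistency_loss(net, net, torch.randn(2, 1, 16, 16), flip)
    loss.backward()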

70. On the safety of vulnerable road users by cyclist orientation detection using Deep Learning [PDF] 返回目录
  Marichelo Garcia-Venegas, Diego A. Mercado-Ravell, Carlos A. Carballo-Monsivais
Abstract: In this work, orientation detection using Deep Learning is addressed for a particularly vulnerable class of road users, the cyclists. Knowing the cyclists' orientation is of great relevance since it provides a good notion of their future trajectory, which is crucial to avoid accidents in the context of intelligent transportation systems. Using Transfer Learning with pre-trained models and TensorFlow, we present a performance comparison between the main algorithms reported in the literature for object detection, such as SSD, Faster R-CNN and R-FCN, along with MobilenetV2, InceptionV2, ResNet50 and ResNet101 feature extractors. Moreover, we propose multi-class detection with eight different classes according to orientations. To do so, we introduce a new dataset called "Detect-Bike", containing 20,229 cyclist instances over 11,103 images, which has been labeled based on the cyclists' orientation. Then, the same Deep Learning methods used for detection are trained to determine the target's heading. Our experimental results and extensive evaluation showed satisfactory performance of all of the studied methods for cyclist and orientation detection; in particular, Faster R-CNN with ResNet50 proved to be precise but significantly slower. Meanwhile, SSD using InceptionV2 provided a good trade-off between precision and execution time, and is to be preferred for real-time embedded applications.

71. Towards causal generative scene models via competition of experts [PDF] 返回目录
  Julius von Kügelgen, Ivan Ustyuzhaninov, Peter Gehler, Matthias Bethge, Bernhard Schölkopf
Abstract: Learning how to model complex scenes in a modular way with recombinable components is a pre-requisite for higher-order reasoning and acting in the physical world. However, current generative models lack the ability to capture the inherently compositional and layered nature of visual scenes. While recent work has made progress towards unsupervised learning of object-based scene representations, most models still maintain a global representation space (i.e., objects are not explicitly separated), and cannot generate scenes with novel object arrangement and depth ordering. Here, we present an alternative approach which uses an inductive bias encouraging modularity by training an ensemble of generative models (experts). During training, experts compete for explaining parts of a scene, and thus specialise on different object classes, with objects being identified as parts that re-occur across multiple scenes. Our model allows for controllable sampling of individual objects and recombination of experts in physically plausible ways. In contrast to other methods, depth layering and occlusion are handled correctly, moving this approach closer to a causal generative scene model. Experiments on simple toy data qualitatively demonstrate the conceptual advantages of the proposed approach.

72. Control Design of Autonomous Drone Using Deep Learning Based Image Understanding Techniques [PDF] 返回目录
  Seid Miad Zandavi, Vera Chung, Ali Anaissi
Abstract: This paper presents a new framework that uses images as the inputs for the controller to achieve autonomous flight, considering the noisy indoor environment and uncertainties. A new Proportional-Integral-Derivative-Accelerated (PIDA) control with a derivative filter is proposed to improve drone/quadcopter flight stability within a noisy environment and to enable autonomous flight using object and depth detection techniques. The mathematical model is derived from an accurate model with a high level of fidelity by addressing the problems of non-linearity, uncertainties, and coupling. The proposed PIDA controller is tuned by the Stochastic Dual Simplex Algorithm (SDSA) to support autonomous flight. The simulation results show that adapting the deep learning-based image understanding techniques (RetinaNet ant colony detection and PSMNet) to the proposed controller can enable the generation and tracking of the desired point in the presence of environmental disturbances.
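A discrete-time PIDA loop with a first-order low-pass filter on the derivative terms can be sketched in pure Python; the gains, filter constant and discretization below are illustrative assumptions, not the tuned values from the paper:

    class PIDA:
        """Discrete PIDA controller: PID plus an acceleration (second
        derivative) term, with a first-order filter on the derivative
        terms to suppress measurement noise."""
        def __init__(self, kp, ki, kd, ka, dt, tau=0.05):
            self.kp, self.ki, self.kd, self.ka = kp, ki, kd, ka
            self.dt, self.alpha = dt, dt / (tau + dt)   # filter smoothing factor
            self.i = self.prev_e = self.d_f = self.prev_d = self.a_f = 0.0

        def step(self, error):
            self.i += error * self.dt                    # integral term
            d_raw = (error - self.prev_e) / self.dt      # raw derivative
            self.d_f += self.alpha * (d_raw - self.d_f)  # filtered derivative
            a_raw = (self.d_f - self.prev_d) / self.dt   # filtered "acceleration"
            self.a_f += self.alpha * (a_raw - self.a_f)
            self.prev_e, self.prev_d = error, self.d_f
            return (self.kp * error + self.ki * self.i
                    + self.kd * self.d_f + self.ka * self.a_f)

    ctrl = PIDA(kp=1.2, ki=0.4, kd=0.3, ka=0.05, dt=0.01)
    print(ctrl.step(1.0))  # control output for a unit error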

73. A Light CNN for detecting COVID-19 from CT scans of the chest [PDF] 返回目录
  Matteo Polsinelli, Luigi Cinque, Giuseppe Placidi
Abstract: COVID-19 is a world-wide disease that has been declared a pandemic by the World Health Organization. Computer Tomography (CT) imaging of the chest seems to be a valid diagnosis tool to detect COVID-19 promptly and to control the spread of the disease. Deep Learning has been extensively used in medical imaging, and convolutional neural networks (CNNs) have also been used for classification of CT images. We propose a light CNN design based on the model of the SqueezeNet, for the efficient discrimination of COVID-19 CT images from other CT images (community-acquired pneumonia and/or healthy images). On the tested datasets, the proposed modified SqueezeNet CNN achieved 83.00% accuracy, 85.00% sensitivity, 81.00% specificity, 81.73% precision and an F1-score of 0.8333 in a very efficient way (7.81 seconds on a medium-end laptop without GPU acceleration). Besides performance, the average classification time is very competitive with respect to more complex CNN designs, thus allowing its use also on medium-power computers. In the near future we aim at improving the performance of the method along two directions: 1) by increasing the training dataset (as soon as other CT images become available); 2) by introducing an efficient pre-processing strategy.
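For reference, the reported percentages are mutually consistent; a small helper that recomputes them from confusion-matrix counts (the counts below are chosen to match the reported numbers, not taken from the paper):

    def screening_metrics(tp, fp, tn, fn):
        """Accuracy, sensitivity, specificity, precision and F1 from counts."""
        acc = (tp + tn) / (tp + fp + tn + fn)
        sens = tp / (tp + fn)            # recall on COVID-19 positives
        spec = tn / (tn + fp)
        prec = tp / (tp + fp)
        f1 = 2 * prec * sens / (prec + sens)
        return acc, sens, spec, prec, f1

    # tp=85, fn=15, tn=81, fp=19 reproduces 0.83 / 0.85 / 0.81 / 0.8173 / 0.8333
    print(screening_metrics(tp=85, fp=19, tn=81, fn=15))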

74. A Critic Evaluation of Methods for COVID-19 Automatic Detection from X-Ray Images [PDF] 返回目录
  Gianluca Maguolo, Loris Nanni
Abstract: In this paper, we compare and evaluate different testing protocols used for automatic COVID-19 diagnosis from X-Ray images. We show that similar results can be obtained using X-Ray images that do not contain most of the lungs. We are able to remove them from the images by blacking out the center of the scan. Hence, we deduce that several testing protocols for the recognition are not fair and that the neural networks are learning patterns in the images that are not correlated with the presence of COVID-19. We propose a new testing protocol that consists in using different datasets for training and testing, and we provide a method to measure how fair a specific testing protocol is. We suggest following the proposed protocol in future research, and provide tools to better interpret the results of a classifier.
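The sanity check described above (blacking out the lung region and re-testing) is straightforward to reproduce; a numpy sketch, where the masked fraction is an assumption:

    import numpy as np

    def blackout_center(image, fraction=0.5):
        """Turn the central crop (covering most of the lungs) to black;
        fraction is the masked side length relative to the image."""
        h, w = image.shape[:2]
        mh, mw = int(h * fraction), int(w * fraction)
        y0, x0 = (h - mh) // 2, (w - mw) // 2
        out = image.copy()
        out[y0:y0 + mh, x0:x0 + mw] = 0
        return out

    xray = np.random.rand(224, 224).astype(np.float32)
    masked = blackout_center(xray)  # if accuracy survives this, the model is
                                    # keying on non-lung cues, not pathology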

75. Quantifying Graft Detachment after Descemet's Membrane Endothelial Keratoplasty with Deep Convolutional Neural Networks [PDF] 返回目录
  Friso G. Heslinga, Mark Alberti, Josien P.W. Pluim, Javier Cabrerizo, Mitko Veta
Abstract: Purpose: We developed a method to automatically locate and quantify graft detachment after Descemet's Membrane Endothelial Keratoplasty (DMEK) in Anterior Segment Optical Coherence Tomography (AS-OCT) scans. Methods: 1280 AS-OCT B-scans were annotated by a DMEK expert. Using the annotations, a deep learning pipeline was developed to localize scleral spur, center the AS-OCT B-scans and segment the detached graft sections. Detachment segmentation model performance was evaluated per B-scan by comparing (1) length of detachment and (2) horizontal projection of the detached sections with the expert annotations. Horizontal projections were used to construct graft detachment maps. All final evaluations were done on a test set that was set apart during training of the models. A second DMEK expert annotated the test set to determine inter-rater performance. Results: Mean scleral spur localization error was 0.155 mm, whereas the inter-rater difference was 0.090 mm. The estimated graft detachment lengths were in 69% of the cases within a 10-pixel (~150 µm) difference from the ground truth (77% for the second DMEK expert). Dice scores for the horizontal projections of all B-scans with detachments were 0.896 and 0.880 for our model and the second DMEK expert, respectively. Conclusion: Our deep learning model can be used to automatically and instantly localize graft detachment in AS-OCT B-scans. Horizontal detachment projections can be determined with the same accuracy as a human DMEK expert, allowing for the construction of accurate graft detachment maps. Translational Relevance: Automated localization and quantification of graft detachment can support DMEK research and standardize clinical decision making.

76. A Cascaded Learning Strategy for Robust COVID-19 Pneumonia Chest X-Ray Screening [PDF] 返回目录
  Chun-Fu Yeh, Hsien-Tzu Cheng, Andy Wei, Keng-Chi Liu, Mong-Chi Ko, Po-Chen Kuo, Ray-Jade Chen, Po-Chang Lee, Jen-Hsiang Chuang, Chi-Mai Chen, Nai-Kuan Chou, Yeun-Chung Chang, Kuan-Hua Chao, Yi-Chin Tu, Tyng-Luh Liu
Abstract: We introduce a comprehensive screening platform for the COVID-19 (a.k.a., SARS-CoV-2) pneumonia. The proposed AI-based system works on chest x-ray (CXR) images to predict whether a patient is infected with the COVID-19 disease. Despite the recent international joint effort on making all sorts of open data available, the public collection of CXR images is still relatively small for reliably training a deep neural network (DNN) to carry out COVID-19 prediction. To better address such inefficiency, we design a cascaded learning strategy to improve both the sensitivity and the specificity of the resulting DNN classification model. Our approach leverages a large CXR image dataset of non-COVID-19 pneumonia to generalize the original well-trained classification model via a cascaded learning scheme. The resulting screening system is shown to achieve good classification performance on the expanded dataset, including those newly added COVID-19 CXR images.

77. Boosting Connectivity in Retinal Vessel Segmentation via a Recursive Semantics-Guided Network [PDF] 返回目录
  Rui Xu, Tiantian Liu, Xinchen Ye, Yen-Wei Chen
Abstract: Many deep learning based methods have been proposed for retinal vessel segmentation; however, few of them focus on the connectivity of segmented vessels, which is quite important for a practical computer-aided diagnosis system on retinal images. In this paper, we propose an efficient network to address this problem. A U-shape network is enhanced by introducing a semantics-guided module, which integrates the enriched semantics information into shallow layers to guide the network to explore more powerful features. Besides, a recursive refinement iteratively applies the same network over the previous segmentation results, progressively boosting the performance while adding no extra network parameters. The carefully designed recursive semantics-guided network has been extensively evaluated on several public datasets. Experimental results have shown the efficiency of the proposed method.
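A minimal PyTorch sketch of the recursive refinement idea, with a tiny stand-in for the U-shape network; feeding the previous prediction back as an extra input channel is one plausible realization, not necessarily the paper's exact wiring:

    import torch
    import torch.nn as nn

    class RecursiveRefiner(nn.Module):
        """Apply the SAME segmentation network repeatedly, reusing its
        weights each step (so refinement adds no new parameters)."""
        def __init__(self, steps=3):
            super().__init__()
            self.steps = steps
            self.net = nn.Sequential(             # placeholder for the U-shape net
                nn.Conv2d(1 + 1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, image):
            seg = torch.zeros_like(image[:, :1])  # initial empty prediction
            for _ in range(self.steps):
                seg = self.net(torch.cat([image, seg], dim=1))
            return seg

    out = RecursiveRefiner()(torch.randn(2, 1, 64, 64))
    print(out.shape)  # torch.Size([2, 1, 64, 64])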

78. EAO-SLAM: Monocular Semi-Dense Object SLAM Based on Ensemble Data Association [PDF] 返回目录
  Yanmin Wu, Yunzhou Zhang, Delong Zhu, Yonghui Feng, Sonya Coleman, Dermot Kerr
Abstract: Object-level data association and pose estimation play a fundamental role in semantic SLAM, and they remain unsolved due to the lack of robust and accurate algorithms. In this work, we propose an ensemble data association strategy to integrate the parametric and nonparametric statistical tests. By exploiting the nature of different statistics, our method can effectively aggregate the information of different measurements, and thus significantly improve the robustness and accuracy of the association process. We then present an accurate object pose estimation framework, in which an outlier-robust centroid and scale estimation algorithm and an object pose initialization algorithm are developed to help improve the optimality of the estimated results. Furthermore, we build a SLAM system that can generate semi-dense or lightweight object-oriented maps with a monocular camera. Extensive experiments are conducted on three publicly available datasets and a real scenario. The results show that our approach significantly outperforms state-of-the-art techniques in accuracy and robustness.

79. Reconstructing normal section profiles of 3D revolving structures via pose-unconstrained multi-line structured-light vision [PDF] 返回目录
  Junhua Sun, Zhou Zhang, Jie Zhang
Abstract: The wheel of a train is a 3D revolving geometrical structure. Reconstructing the normal section profile is an effective approach to determine the critical geometric parameters and wear of the wheel in the railway safety community. The existing reconstruction methods typically require a sensor working in a constrained position and pose, suffering from poor flexibility and a limited viewing angle. This paper proposes a pose-unconstrained normal section profile reconstruction framework for 3D revolving structures via multiple 3D general section profiles acquired by a multi-line structured light vision sensor. First, we establish a model to estimate the axis of the 3D revolving geometrical structure and the normal section profile using corresponding points. Then, we embed the model into an iterative algorithm to optimize the corresponding points and finally reconstruct the accurate normal section profile. We conducted a real experiment on reconstructing the normal section profile of a 3D wheel. The results demonstrate that our algorithm reaches a mean precision of 0.068 mm and good repeatability with an STD of 0.007 mm. It is also robust to varying poses of the sensor. Our proposed framework and models generalize to any 3D wheel-type revolving components.

80. OR-UNet: an Optimized Robust Residual U-Net for Instrument Segmentation in Endoscopic Images [PDF] 返回目录
  Fabian Isensee, Klaus H. Maier-Hein
Abstract: Segmentation of endoscopic images is an essential processing step for computer and robotics-assisted interventions. The Robust-MIS challenge provides the largest dataset of annotated endoscopic images to date, with 5983 manually annotated images. Here we describe OR-UNet, our optimized robust residual 2D U-Net for endoscopic image segmentation. As the name implies, the network makes use of residual connections in the encoder. It is trained with the sum of Dice and cross-entropy loss and deep supervision. During training, extensive data augmentation is used to increase the robustness. In an 8-fold cross-validation on the training images, our model achieved a mean (median) Dice score of 87.41 (94.35). We use the eight models from the cross-validation as an ensemble on the test set.
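The Dice-plus-cross-entropy training objective is standard and easy to write down; a PyTorch sketch for the binary case (deep supervision and multi-class handling omitted for brevity):

    import torch
    import torch.nn.functional as F

    def dice_ce_loss(logits, target, eps=1e-6):
        """Sum of soft Dice loss and binary cross-entropy."""
        prob = torch.sigmoid(logits)
        inter = (prob * target).sum(dim=(1, 2, 3))
        denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
        dice = 1 - (2 * inter + eps) / (denom + eps)
        ce = F.binary_cross_entropy_with_logits(logits, target)
        return dice.mean() + ce

    logits = torch.randn(2, 1, 64, 64, requires_grad=True)
    target = (torch.rand(2, 1, 64, 64) > 0.5).float()
    dice_ce_loss(logits, target).backward()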

81. Continuous hand-eye calibration using 3D points [PDF] 返回目录
  Bjarne Grossmann, Volker Krueger
Abstract: The recent development of calibration algorithms has been driven in two major directions: (1) an increasing accuracy of mathematical approaches and (2) an increasing flexibility in usage by reducing the dependency on calibration objects. These two trends, however, seem to be contradictory since the overall accuracy is directly related to the accuracy of the pose estimation of the calibration object, therefore demanding large objects, while an increased flexibility leads to smaller objects or noisier estimation methods. The method presented in this paper aims to resolve this problem in two steps: First, we derive a simple closed-form solution with a shifted focus towards the equation of translation that solves only for the necessary hand-eye transformation. We show that it is superior in accuracy and robustness compared to traditional approaches. Second, we decrease the dependency on the calibration object to a single 3D point by using a similar formulation based on the equation of translation, which is much less affected by the estimation error of the calibration object's orientation. Moreover, it makes the estimation of the orientation obsolete while taking advantage of the higher accuracy and robustness of the first solution, resulting in a versatile method for continuous hand-eye calibration.
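The translation-only idea can be illustrated with the standard hand-eye translation equation (R_A - I) t_X = R_X t_B - t_A stacked into a least-squares problem; this numpy sketch is a generic version, not the authors' exact closed form:

    import numpy as np

    def hand_eye_translation(motions, Rx):
        """Least-squares hand-eye translation t_X from n motion pairs.

        motions: list of (R_A, t_A, R_B, t_B); Rx: known hand-eye rotation.
        Stacks (R_A - I) t_X = R_X t_B - t_A over all pairs.
        """
        A = np.vstack([Ra - np.eye(3) for Ra, _, _, _ in motions])
        b = np.concatenate([Rx @ tb - ta for _, ta, _, tb in motions])
        tx, *_ = np.linalg.lstsq(A, b, rcond=None)
        return tx

    # Synthetic check with identity hand-eye rotation and a known translation.
    rng = np.random.default_rng(0)
    Rx, tx_true = np.eye(3), np.array([0.1, -0.2, 0.3])
    motions = []
    for _ in range(5):
        q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random rotation via QR
        Ra = q * np.sign(np.linalg.det(q))
        tb = rng.normal(size=3)
        ta = Rx @ tb - (Ra - np.eye(3)) @ tx_true      # consistent t_A
        motions.append((Ra, ta, Ra, tb))               # R_B = R_A when Rx = I
    print(hand_eye_translation(motions, Rx).round(3))  # ~ [ 0.1 -0.2  0.3]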

82. Improving Endoscopic Decision Support Systems by Translating Between Imaging Modalities [PDF] 返回目录
  Georg Wimmer, Michael Gadermayr, Andreas Vécsei, Andreas Uhl
Abstract: Novel imaging technologies raise many questions concerning the adaptation of computer-aided decision support systems. Classification models either need to be adapted or even newly trained from scratch to exploit the full potential of enhanced techniques. Both options typically require the acquisition of new labeled training data. In this work we investigate the applicability of image-to-image translation to endoscopic images showing different imaging modalities, namely conventional white-light and narrow-band imaging. In a study on computer-aided celiac disease diagnosis, we explore whether image-to-image translation is capable of effectively performing the translation between the domains. We investigate if models can be trained on virtual (or a mixture of virtual and real) samples to improve overall accuracy in a setting with limited labeled training data. Finally, we also ask whether a translation of testing images to another domain is capable of improving accuracy by exploiting the enhanced imaging characteristics.

83. Robust Screening of COVID-19 from Chest X-ray via Discriminative Cost-Sensitive Learning [PDF] 返回目录
  Tianyang Li, Zhongyi Han, Benzheng Wei, Yuanjie Zheng, Yanfei Hong, Jinyu Cong
Abstract: This paper addresses the new problem of automated screening for coronavirus disease 2019 (COVID-19) based on chest X-rays, which is urgently needed to help stop the pandemic quickly. Robust and accurate screening of COVID-19 from chest X-rays is still a globally recognized challenge because of two bottlenecks: 1) imaging features of COVID-19 share some similarities with other pneumonia on chest X-rays, and 2) the misdiagnosis rate of COVID-19 is very high, and the misdiagnosis cost is expensive. While a few pioneering works have made much progress, they underestimate both crucial bottlenecks. In this paper, we report our solution, discriminative cost-sensitive learning (DCSL), which should be the method of choice when clinics need assisted screening of COVID-19 from chest X-rays. DCSL combines the advantages of fine-grained classification and cost-sensitive learning. Firstly, DCSL develops a conditional center loss that learns a deep discriminative representation. Secondly, DCSL establishes score-level cost-sensitive learning that can adaptively enlarge the cost of misclassifying COVID-19 examples into other classes. DCSL is so flexible that it can be applied to any deep neural network. We collected a large-scale multi-class dataset comprised of 2,239 chest X-ray examples: 239 examples from confirmed COVID-19 cases, 1,000 examples with confirmed bacterial or viral pneumonia cases, and 1,000 examples of healthy people. Extensive experiments on the three-class classification show that our algorithm remarkably outperforms state-of-the-art algorithms. It achieves an accuracy of 97.01%, a precision of 97%, a sensitivity of 97.09%, and an F1-score of 96.98%. These results establish our algorithm as an efficient tool for the fast large-scale screening of COVID-19.
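Score-level cost-sensitive learning can be approximated by weighting class scores with a misclassification cost matrix; a generic PyTorch sketch (the cost values are illustrative, and this is a stand-in for DCSL's actual formulation):

    import torch

    def cost_sensitive_ce(logits, targets, cost):
        """Expected misclassification cost under the predicted distribution.

        cost[i, j]: cost of predicting class j when the true class is i
        (zeros on the diagonal).
        """
        probs = torch.softmax(logits, dim=1)
        expected_cost = (probs * cost[targets]).sum(dim=1)  # per-sample risk
        return expected_cost.mean()

    # 3 classes: healthy, other pneumonia, COVID-19 (missing COVID costs most).
    cost = torch.tensor([[0., 1., 1.],
                         [1., 0., 1.],
                         [5., 5., 0.]])
    logits = torch.randn(4, 3, requires_grad=True)
    targets = torch.tensor([2, 0, 1, 2])
    cost_sensitive_ce(logits, targets, cost).backward()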

84. Towards Efficient COVID-19 CT Annotation: A Benchmark for Lung and Infection Segmentation [PDF] 返回目录
  Jun Ma, Yixin Wang, Xingle An, Cheng Ge, Ziqi Yu, Jianan Chen, Qiongjie Zhu, Guoqiang Dong, Jian He, Zhiqiang He, Ziwei Nie, Xiaoping Yang
Abstract: Accurate segmentation of lung and infection in COVID-19 CT scans plays an important role in the quantitative management of patients. Most of the existing studies are based on large and private annotated datasets that are impractical to obtain from a single institution, especially when radiologists are busy fighting the coronavirus disease. Furthermore, it is hard to compare current COVID-19 CT segmentation methods as they are developed on different datasets, trained in different settings, and evaluated with different metrics. In this paper, we created a COVID-19 3D CT dataset with 20 cases that contains 1800+ annotated slices and made it publicly available. To promote the development of annotation-efficient deep learning methods, we built three benchmarks for lung and infection segmentation that contain current main research interests, e.g., few-shot learning, domain generalization, and knowledge transfer. For a fair comparison among different segmentation methods, we also provide unified training, validation and testing dataset splits, and evaluation metrics and corresponding code. In addition, we provided more than 40 pre-trained baseline models for the benchmarks, which not only serve as out-of-the-box segmentation tools but also save computational time for researchers who are interested in COVID-19 lung and infection segmentation. To the best of our knowledge, this work presents the largest public annotated COVID-19 CT volume dataset, the first segmentation benchmark, and the most pre-trained models up to now. We hope these resources (\url{this https URL}) could advance the development of deep learning methods for COVID-19 CT segmentation with limited data.

85. Towards Accurate and Robust Domain Adaptation under Noisy Environments [PDF] 返回目录
  Zhongyi Han, Xian-Jin Gui, Chaoran Cui, Yilong Yin
Abstract: In non-stationary environments, learning machines usually confront the domain adaptation scenario where the data distribution changes over time. Previous domain adaptation works have achieved great success in theory and practice. However, they always lose robustness in noisy environments where the labels and features of examples from the source domain become corrupted. In this paper, we report our attempt towards achieving accurate noise-robust domain adaptation. We first give a theoretical analysis that reveals how harmful noises influence unsupervised domain adaptation. To eliminate the effect of label noise, we propose an offline curriculum learning scheme for minimizing a newly-defined empirical source risk. To reduce the impact of feature noise, we propose a proxy-distribution-based margin discrepancy. We seamlessly transform our methods into an adversarial network that performs efficient joint optimization for them, successfully mitigating the negative influence of both data corruption and distribution shift. A series of empirical studies show that our algorithm remarkably outperforms the state of the art, with over 10% accuracy improvements in some domain adaptation tasks under noisy environments.

86. Cross-Domain Structure Preserving Projection for Heterogeneous Domain Adaptation [PDF] 返回目录
  Qian Wang, Toby P. Breckon
Abstract: Heterogeneous Domain Adaptation (HDA) aims to enable effective transfer learning across domains of different modalities (e.g., texts and images) or feature dimensions (e.g., features extracted with different methods). Traditional domain adaptation algorithms assume that the representations of source and target samples reside in the same feature space, hence are likely to fail in solving the heterogeneous domain adaptation problem where the source and target domain data are represented by completely different features. To address this issue, we propose a Cross-Domain Structure Preserving Projection (CDSPP) algorithm, as an extension of the classic LPP, which aims to learn domain-specific projections to map sample features from source and target domains into a common subspace such that the class consistency is preserved and data distributions are sufficiently aligned. CDSPP is naturally suitable for supervised HDA but can be extended for semi-supervised HDA where the unlabeled target domain samples are available. Our approach illustrates superior results when evaluated against both supervised and semi-supervised state-of-the-art approaches on several HDA benchmark datasets.

87. Towards Feature Space Adversarial Attack [PDF] 返回目录
  Qiuling Xu, Guanhong Tao, Siyuan Cheng, Lin Tan, Xiangyu Zhang
Abstract: We propose a new type of adversarial attack on Deep Neural Networks (DNNs) for image classification, different from most existing attacks that directly perturb input pixels. Our attack focuses on perturbing abstract features, more specifically, features that denote styles, including interpretable styles such as vivid colors and sharp outlines, and uninterpretable ones. It induces model misclassification by injecting style changes that humans are insensitive to, through an optimization procedure. We show that state-of-the-art adversarial attack detection and defense techniques are ineffective in guarding against feature space attacks.

88. Joint Liver Lesion Segmentation and Classification via Transfer Learning [PDF] 返回目录
  Michal Heker, Hayit Greenspan
Abstract: Transfer learning and joint learning approaches are extensively used to improve the performance of Convolutional Neural Networks (CNNs). In medical imaging applications in which the target dataset is typically very small, transfer learning improves feature learning while joint learning has shown effectiveness in improving the network's generalization and robustness. In this work, we study the combination of these two approaches for the problem of liver lesion segmentation and classification. For this purpose, 332 abdominal CT slices containing lesion segmentation and classification of three lesion types are evaluated. For feature learning, the dataset of MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge is used. Joint learning shows improvement in both segmentation and classification results. We show that a simple joint framework outperforms the commonly used multi-task architecture (Y-Net), achieving an improvement of 10% in classification accuracy, compared to a 3% improvement with Y-Net.
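A minimal PyTorch sketch of the "simple joint framework": one shared encoder, two heads, and the sum of both losses (the tiny encoder and the unweighted loss sum are illustrative assumptions):

    import torch
    import torch.nn as nn

    class JointLesionNet(nn.Module):
        """Shared encoder with a segmentation head and a classification head."""
        def __init__(self, n_classes=3):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            )
            self.seg_head = nn.Conv2d(32, 1, 1)   # lesion mask logits
            self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(32, n_classes))

        def forward(self, x):
            f = self.encoder(x)
            return self.seg_head(f), self.cls_head(f)

    net = JointLesionNet()
    seg_logits, cls_logits = net(torch.randn(2, 1, 64, 64))
    seg_loss = nn.BCEWithLogitsLoss()(seg_logits, torch.rand(2, 1, 64, 64).round())
    cls_loss = nn.CrossEntropyLoss()(cls_logits, torch.tensor([0, 2]))
    (seg_loss + cls_loss).backward()  # joint training signal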

89. DeepSeg: Deep Neural Network Framework for Automatic Brain Tumor Segmentation using Magnetic Resonance FLAIR Images [PDF] 返回目录
  Ramy A. Zeineldin, Mohamed E. Karar, Jan Coburger, Christian R. Wirtz, Oliver Burgert
Abstract: Purpose: Gliomas are the most common and aggressive type of brain tumors due to their infiltrative nature and rapid progression. The process of distinguishing tumor boundaries from healthy cells is still a challenging task in the clinical routine. Fluid-Attenuated Inversion Recovery (FLAIR) MRI modality can provide the physician with information about tumor infiltration. Therefore, this paper proposes a new generic deep learning architecture, namely DeepSeg, for fully automated detection and segmentation of brain lesions using FLAIR MRI data. Methods: The developed DeepSeg is a modular decoupling framework. It consists of two connected core parts based on an encoding and decoding relationship. The encoder part is a convolutional neural network (CNN) responsible for spatial information extraction. The resulting semantic map is inserted into the decoder part to get the full-resolution probability map. Based on a modified U-Net architecture, different CNN models such as Residual Neural Network (ResNet), Dense Convolutional Network (DenseNet), and NASNet have been utilized in this study. Results: The proposed deep learning architectures have been successfully tested and evaluated on-line based on MRI datasets of the Brain Tumor Segmentation (BraTS 2019) challenge, including 336 cases as training data and 125 cases for validation data. The Dice and Hausdorff distance scores of the obtained segmentation results are about 0.81 to 0.84 and 9.8 to 19.7, respectively. Conclusion: This study showed successful feasibility and comparative performance of applying different deep learning models in a new DeepSeg framework for automated brain tumor segmentation in FLAIR MR images. The proposed DeepSeg is open-source and freely available at this https URL.

90. Development of a High Fidelity Simulator for Generalised Photometric Based Space Object Classification using Machine Learning [PDF] 返回目录
  James Allworth, Lloyd Windrim, Jeffrey Wardman, Daniel Kucharski, James Bennett, Mitch Bryson
Abstract: This paper presents the initial stages in the development of a deep learning classifier for generalised Resident Space Object (RSO) characterisation that combines high-fidelity simulated light curves with transfer learning to improve the performance of object characterisation models that are trained on real data. The classification and characterisation of RSOs is a significant goal in Space Situational Awareness (SSA) in order to improve the accuracy of orbital predictions. The specific focus of this paper is the development of a high-fidelity simulation environment for generating realistic light curves. The simulator takes in a textured geometric model of an RSO as well as the object's ephemeris and uses Blender to generate photo-realistic images of the RSO that are then processed to extract the light curve. Simulated light curves have been compared with real light curves extracted from telescope imagery to provide validation for the simulation environment. Future work will involve further validation and the use of the simulator to generate a dataset of realistic light curves for the purpose of training neural networks.

91. Deep DIH : Statistically Inferred Reconstruction of Digital In-Line Holography by Deep Learning [PDF] 返回目录
  Huayu Li, Xiwen Chen, Haiyu Wu, Zaoyi Chi, Christopher Mann, Abolfazl Razi
Abstract: Digital in-line holography is commonly used to reconstruct 3D images from 2D holograms for microscopic objects. One of the technical challenges that arise in the signal processing stage is removing the twin image that is caused by the phase-conjugate wavefront from the recorded holograms. Twin image removal is typically formulated as a non-linear inverse problem due to the irreversible scattering process when generating the hologram. Recently, end-to-end deep learning-based methods have been utilized to reconstruct the object wavefront (as a surrogate for the 3D structure of the object) directly from a single-shot in-line digital hologram. However, massive data pairs are required to train deep learning models for acceptable reconstruction precision. In contrast to typical image processing problems, well-curated datasets for in-line digital holography do not exist. Also, the trained model is highly influenced by the morphological properties of the object and hence can vary for different applications. Therefore, data collection can be prohibitively cumbersome in practice as a major hindrance to using deep learning for digital holography. In this paper, we propose a novel implementation of an autoencoder-based deep learning architecture for single-shot hologram reconstruction solely based on the current sample, without the need for massive datasets to train the model. The simulation results demonstrate the superior performance of the proposed method compared to the state-of-the-art single-shot compressive digital in-line hologram reconstruction method.

92. Explainable Deep CNNs for MRI-Based Diagnosis of Alzheimer's Disease [PDF] 返回目录
  Eduardo Nigri, Nivio Ziviani, Fabio Cappabianco, Augusto Antunes, Adriano Veloso
Abstract: Deep Convolutional Neural Networks (CNNs) are becoming prominent models for semi-automated diagnosis of Alzheimer's Disease (AD) using brain Magnetic Resonance Imaging (MRI). Although highly accurate, deep CNN models lack transparency and interpretability, precluding adequate clinical reasoning and not complying with most current regulatory demands. One popular choice for explaining deep image models is occluding regions of the image to isolate their influence on the prediction. However, existing methods for occluding patches of brain scans generate images outside the distribution the model was trained on, thus leading to unreliable explanations. In this paper, we propose an alternative explanation method that is specifically designed for the brain scan task. Our method, which we refer to as Swap Test, produces heatmaps that depict the areas of the brain that are most indicative of AD, providing interpretability for the model's decisions in a format understandable to clinicians. Experimental results using an axiomatic evaluation show that the proposed method is more suitable for explaining the diagnosis of AD using MRI, while the opposite trend was observed when using a typical occlusion test. Therefore, we believe our method may address the inherent black-box nature of deep neural networks that are capable of diagnosing AD.

93. A Survey on Domain Knowledge Powered Deep Learning for Medical Image Analysis [PDF] 返回目录
  Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang
Abstract: Although deep learning models like CNNs have achieved a great success in medical image analysis, small-sized medical datasets remain to be the major bottleneck in this area. To address this problem, researchers start looking for external information beyond the current available medical datasets. Traditional approaches generally leverage the information from natural images. More recent works utilize the domain knowledge from medical doctors, by letting networks either resemble how they are trained, mimic their diagnostic patterns, or focus on the features or areas they particular pay attention to. In this survey, we summarize the current progress on introducing medical domain knowledge in deep learning models for various tasks like disease diagnosis, lesion, organ and abnormality detection, lesion and organ segmentation. For each type of task, we systematically categorize different kinds of medical domain knowledge that have been utilized and the corresponding integrating methods. We end with a summary of challenges, open problems, and directions for future research.

94. POCOVID-Net: Automatic Detection of COVID-19 From a New Lung Ultrasound Imaging Dataset (POCUS) [PDF] 返回目录
  Jannis Born, Gabriel Brändle, Manuel Cossio, Marion Disdier, Julie Goulet, Jérémie Roulin, Nina Wiedemann
Abstract: With the rapid development of COVID-19 into a global pandemic, there is an ever more urgent need for cheap, fast and reliable tools that can assist physicians in diagnosing COVID-19. Medical imaging such as CT can take a key role in complementing conventional diagnostic tools from molecular biology, and, using deep learning techniques, several automatic systems have demonstrated promising performance on CT or X-ray data. Here, we advocate a more prominent role for point-of-care ultrasound imaging to guide COVID-19 detection. Ultrasound is non-invasive and ubiquitous in medical facilities around the globe. Our contribution is threefold. First, we gather a lung ultrasound (POCUS) dataset consisting of (currently) 1103 images (654 COVID-19, 277 bacterial pneumonia and 172 healthy controls), sampled from 64 videos. While this dataset was assembled from various online sources and is by no means exhaustive, it was processed specifically to feed deep learning models and is intended to serve as a starting point for an open-access initiative. Second, we train a deep convolutional neural network (POCOVID-Net) on this 3-class dataset and achieve an accuracy of 89% and, by majority vote, a video accuracy of 92%. For detecting COVID-19 in particular, the model performs with a sensitivity of 0.96, a specificity of 0.79 and an F1-score of 0.92 in a 5-fold cross validation. Third, we provide an open-access web service (POCOVIDScreen) that is available at: this https URL. The website deploys the predictive model, allowing users to run predictions on lung ultrasound images. In addition, it grants medical staff the option to (bulk) upload their own screenings in order to contribute to the growing public database of pathological lung ultrasound images. Dataset and code are available from: this https URL
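As a small illustration of the frame-to-video step, the sketch below aggregates per-frame class probabilities into a single video label by majority vote, mirroring the evaluation described above. The function name and the three-class ordering are assumptions, not the authors' code.

import numpy as np
from collections import Counter

def video_label(frame_probs):
    # frame_probs: iterable of length-3 probability vectors over the classes
    # (COVID-19, bacterial pneumonia, healthy) -- ordering assumed here.
    votes = [int(np.argmax(p)) for p in frame_probs]
    return Counter(votes).most_common(1)[0][0]

# e.g. three frames, two voting for class 0 (COVID-19):
# video_label([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]) -> 0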

95. SAIA: Split Artificial Intelligence Architecture for Mobile Healthcare System [PDF] 返回目录
  Di Zhuang, Nam Nguyen, Keyu Chen, J. Morris Chang
Abstract: With the advancement of deep learning (DL), the Internet of Things and cloud computing techniques for the analysis and diagnosis of biomedical and health care problems over the last decade, mobile healthcare applications have received unprecedented attention. Since DL techniques usually require enormous amounts of computation, most of them cannot be directly deployed on computation-constrained and energy-limited mobile and IoT devices. Hence, most mobile healthcare applications leverage cloud computing infrastructure, where the data collected on mobile/IoT devices are transmitted to cloud computing platforms for analysis. However, in contested environments, relying on the cloud server might not be practical at all times; for instance, satellite communication might be denied or disrupted. In this paper, we propose SAIA, a Split Artificial Intelligence Architecture for mobile healthcare systems. Unlike the traditional approach to artificial intelligence (AI), which solely exploits the computational power of the cloud server, SAIA not only relies on the cloud computing infrastructure while wireless communication is available, but also utilizes lightweight AI solutions that work locally at the client side (e.g., mobile and IoT devices); hence, it can work even when communication is impeded. In SAIA, we propose a meta-information based decision unit that decides, under different conditions, whether a sample captured by the client should be handled by the embedded AI or the networked AI. In our experimental evaluation, extensive experiments were conducted on two popular healthcare datasets. Our results show that SAIA consistently outperforms its baselines in terms of both effectiveness and efficiency.
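A minimal sketch of a split-AI decision unit in the spirit of the description above. The abstract does not give the actual routing criterion, so the confidence threshold and all names here are assumptions for illustration.

def route(sample, link_up, embedded_model, cloud_model, threshold=0.8):
    # Run the lightweight client-side model first; escalate to the cloud
    # model only when the link is up and the local prediction is unsure
    # (assumed rule; the paper's decision unit uses meta-information).
    probs = embedded_model(sample)          # embedded AI (on-device)
    if link_up and max(probs) < threshold:
        return cloud_model(sample)          # networked AI (cloud)
    return probs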

96. GPO: Global Plane Optimization for Fast and Accurate Monocular SLAM Initialization [PDF] 返回目录
  Sicong Du, Hengkai Guo, Yao Chen, Yilun Lin, Xiangbing Meng, Linfu Wen, Fei-Yue Wang
Abstract: Initialization is essential to monocular Simultaneous Localization and Mapping (SLAM) problems. This paper focuses on a novel initialization method for monocular SLAM based on planar features. The algorithm starts with homography estimation in a sliding window. It then proceeds to a global plane optimization (GPO) to obtain camera poses and the plane normal. 3D points can be recovered using planar constraints without triangulation. The proposed method fully exploits the plane information from multiple frames and avoids the ambiguities in homography decomposition. We validate our algorithm on the collected chessboard dataset against baseline implementations and present extensive analysis. Experimental results show that our method outperforms the fine-tuned baselines in both accuracy and real-time performance.
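A minimal OpenCV sketch of the first stage only, robust homography estimation over a sliding window of matched points. The GPO stage that jointly refines camera poses and the plane normal is not shown, and the window size is an assumed parameter.

import cv2
import numpy as np

def window_homographies(matches, window=5):
    # matches: list of (pts_ref, pts_cur) arrays of matched 2D points,
    # one pair per frame in the sliding window (shape (N, 2), N >= 4).
    Hs = []
    for pts_ref, pts_cur in matches[-window:]:
        H, inliers = cv2.findHomography(
            np.asarray(pts_ref, np.float32), np.asarray(pts_cur, np.float32),
            cv2.RANSAC, 3.0)   # RANSAC with a 3 px reprojection threshold
        Hs.append(H)
    return Hs  # inputs to the subsequent global plane optimization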

97. Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement [PDF] 返回目录
  Zakaria Aldeneh, Anushree Prasanna Kumar, Barry-John Theobald, Erik Marchi, Sachin Kajarekar, Devang Naik, Ahmed Hussen Abdelaziz
Abstract: We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual features provide not only high-level information about speech activity, i.e., speech vs. no speech, but also fine-grained visual information about the place of articulation. An interesting byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual representations for classifying visemes (the visual analogues of phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
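The byproduct claim above, that the learned visual embeddings transfer to other tasks, can be checked with a simple linear probe. The scikit-learn setup below is an assumed illustration, not the authors' classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

def viseme_probe(train_emb, train_visemes, test_emb):
    # train_emb/test_emb: (N, D) arrays of frozen visual speech embeddings;
    # train_visemes: integer viseme labels. A linear probe keeps the
    # representation fixed and measures how linearly separable it is.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_visemes)
    return clf.predict(test_emb)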

98. Spectral Data Augmentation Techniques to quantify Lung Pathology from CT-images [PDF] 返回目录
  Subhradeep Kayal, Florian Dubost, Harm A. W. M. Tiddens, Marleen de Bruijne
Abstract: Data augmentation is of paramount importance in biomedical image processing tasks, which are characterized by inadequate amounts of labelled data, in order to make the best use of all the data that is present. In-use techniques range from intensity transformations and elastic deformations to linearly combining existing data points to make new ones. In this work, we propose the use of spectral techniques for data augmentation, using the discrete cosine and wavelet transforms. We empirically evaluate our approaches on a CT texture analysis task to detect abnormal lung tissue in patients with cystic fibrosis. Empirical experiments show that the proposed spectral methods perform favourably compared to the existing methods. When used in combination with existing methods, our proposed approach can increase the relative minority-class segmentation performance by 44.1% over a simple replication baseline.
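A toy example of spectral augmentation in the spirit described above: perturb an image in the DCT domain and invert. The masking scheme (randomly dropping a fraction of coefficients while protecting the low-frequency block) is an assumption for illustration, not the paper's exact recipe.

import numpy as np
from scipy.fft import dctn, idctn

def dct_augment(image, keep=0.9, seed=0):
    # image: 2D array. Randomly zero ~10% of DCT coefficients, but always
    # keep the low-frequency quarter so the anatomy stays recognizable.
    rng = np.random.default_rng(seed)
    coeffs = dctn(image, norm="ortho")
    mask = rng.random(coeffs.shape) < keep
    h, w = coeffs.shape
    mask[: h // 4, : w // 4] = True
    return idctn(coeffs * mask, norm="ortho")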

99. DeepMerge: Classifying High-redshift Merging Galaxies with Deep Neural Networks [PDF] 返回目录
  A. Ćiprijanović, G. F. Snyder, B. Nord, J. E. G. Peek
Abstract: We investigate and demonstrate the use of convolutional neural networks (CNNs) for the task of distinguishing between merging and non-merging galaxies in simulated images, and for the first time at high redshifts (i.e. $z=2$). We extract images of merging and non-merging galaxies from the Illustris-1 cosmological simulation and apply observational and experimental noise that mimics that from the Hubble Space Telescope; the data without noise form a "pristine" data set and that with noise form a "noisy" data set. The test set classification accuracy of the CNN is $79\%$ for pristine and $76\%$ for noisy. The CNN outperforms a Random Forest classifier, which was shown to be superior to conventional one- or two-dimensional statistical methods (Concentration, Asymmetry, the Gini and $M_{20}$ statistics, etc.) that are commonly used when classifying merging galaxies. We also investigate the selection effects of the classifier with respect to merger state and star formation rate, finding no bias. Finally, we extract Grad-CAMs (Gradient-weighted Class Activation Mapping) from the results to further assess and interrogate the fidelity of the classification model.
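A compact PyTorch sketch of the Grad-CAM step mentioned above: weight a convolutional layer's activations by the spatially averaged gradients of the class score. This is a generic implementation under assumed names, not the authors' code; the model and layer choice are left to the caller.

import torch

def grad_cam(model, conv_layer, x, class_idx):
    # x: (1, C, H, W) input; conv_layer: the conv module to explain.
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(
        lambda m, inp, out: acts.update(a=out))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gin, gout: grads.update(g=gout[0]))
    score = model(x)[0, class_idx]       # score for the target class
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)   # per-channel weights
    cam = torch.relu((w * acts["a"]).sum(dim=1))    # (1, H', W') heatmap
    return cam / (cam.max() + 1e-8)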
