Abstracts

1. Learning to Communicate and Correct Pose Errors
Nicholas Vadivelu, Mengye Ren, James Tu, Jingkang Wang, Raquel Urtasun
Abstract: Learned communication makes multi-agent systems more effective by aggregating distributed information. However, it also exposes individual agents to the threat of erroneous messages they might receive. In this paper, we study the setting proposed in V2VNet, where nearby self-driving vehicles jointly perform object detection and motion forecasting in a cooperative manner. Despite a huge performance boost when the agents solve the task together, the gain is quickly diminished in the presence of pose noise since the communication relies on spatial transformations. Hence, we propose a novel neural reasoning framework that learns to communicate, to estimate potential errors, and finally, to reach a consensus about those errors. Experiments confirm that our proposed framework significantly improves the robustness of multi-agent self-driving perception and motion forecasting systems under realistic and severe localization noise.

2. Perception Improvement for Free: Exploring Imperceptible Black-box Adversarial Attacks on Image Classification
Yongwei Wang, Mingquan Feng, Rabab Ward, Z. Jane Wang, Lanjun Wang
Abstract: Deep neural networks are vulnerable to adversarial attacks. White-box adversarial attacks can fool neural networks with small adversarial perturbations, especially for large images. However, keeping successful adversarial perturbations imperceptible is especially challenging for transfer-based black-box adversarial attacks. Often such adversarial examples can be easily spotted due to their unpleasantly poor visual quality, which compromises the threat of adversarial attacks in practice. In this study, to improve the perceptual image quality of black-box adversarial examples, we propose structure-aware adversarial attacks by generating adversarial images based on psychological perceptual models. Specifically, we allow higher perturbations on perceptually insignificant regions, while assigning lower or no perturbation on visually sensitive regions. In addition to the proposed spatial-constrained adversarial perturbations, we also propose a novel structure-aware frequency adversarial attack method in the discrete cosine transform (DCT) domain. Since the proposed attacks are independent of the gradient estimation, they can be directly incorporated with existing gradient-based attacks. Experimental results show that, at a comparable attack success rate (ASR), the proposed methods can produce adversarial examples with considerably improved visual quality for free. At a comparable perceptual quality, the proposed approaches achieve higher attack success rates: particularly for the frequency structure-aware attacks, the average ASR improves by more than 10% over the baseline attacks.

3. Classification of Polarimetric SAR Images Using Compact Convolutional Neural Networks
Mete Ahishali, Serkan Kiranyaz, Turker Ince, Moncef Gabbouj
Abstract: Classification of polarimetric synthetic aperture radar (PolSAR) images is an active research area with a major role in environmental applications. The traditional Machine Learning (ML) methods proposed in this domain generally focus on utilizing highly discriminative features to improve the classification performance, but this task is complicated by the well-known "curse of dimensionality" phenomenon. Other approaches based on deep Convolutional Neural Networks (CNNs) have certain limitations and drawbacks, such as high computational complexity, an unfeasibly large training set with ground-truth labels, and special hardware requirements. In this work, to address the limitations of traditional ML and deep CNN based methods, a novel and systematic classification framework is proposed for the classification of PolSAR images, based on a compact and adaptive implementation of CNNs using a sliding-window classification approach. The proposed approach has three advantages. First, there is no requirement for an extensive feature extraction process. Second, it is computationally efficient due to the compact configurations used. In particular, the proposed compact and adaptive CNN model is designed to achieve the maximum classification accuracy with minimum training and computational complexity. This is of considerable importance considering the high costs involved in labelling in PolSAR classification. Finally, the proposed approach can perform classification using smaller window sizes than deep CNNs. Experimental evaluations have been performed over the four most commonly used benchmark PolSAR images: AIRSAR L-Band and RADARSAT-2 C-Band data of the San Francisco Bay and Flevoland areas. The best overall accuracies obtained range from 92.33% to 99.39% for these benchmark study sites.

4. Temporal Stochastic Softmax for 3D CNNs: An Application in Facial Expression Recognition
Théo Ayral, Marco Pedersoli, Simon Bacon, Eric Granger
Abstract: Training deep learning models for accurate spatiotemporal recognition of facial expressions in videos requires significant computational resources. For practical reasons, 3D Convolutional Neural Networks (3D CNNs) are usually trained with relatively short clips randomly extracted from videos. However, such uniform sampling is generally sub-optimal because equal importance is assigned to each temporal clip. In this paper, we present a strategy for efficient video-based training of 3D CNNs. It relies on softmax temporal pooling and a weighted sampling mechanism to select the most relevant training clips. The proposed softmax strategy provides several advantages: a reduced computational complexity due to efficient clip sampling, and an improved accuracy since temporal weighting focuses on more relevant clips during both training and inference. Experimental results obtained with the proposed method on several facial expression recognition benchmarks show the benefits of focusing on more informative clips in training videos. In particular, our approach improves performance and computational cost by reducing the impact of inaccurate trimming and coarse annotation of videos, and heterogeneous distribution of visual information across time.
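The two ingredients named in the abstract, softmax temporal pooling and weighted clip sampling, can be illustrated with a minimal sketch. It assumes each clip already has a scalar score; the function names and the `temperature` parameter are hypothetical and are not taken from the paper.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of per-clip scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_temporal_pool(clip_scores, temperature=1.0):
    """Pool per-clip scores into one video-level score.

    Each clip is weighted by its softmax weight, so higher-scoring
    clips contribute more than under a plain average.
    """
    weights = softmax(clip_scores, temperature)
    return sum(w * s for w, s in zip(weights, clip_scores))

def sample_clip(clip_scores, temperature=1.0, rng=random):
    """Draw one clip index with probability equal to its softmax weight."""
    weights = softmax(clip_scores, temperature)
    return rng.choices(range(len(clip_scores)), weights=weights, k=1)[0]
```

With this weighting the pooled score lies between the plain mean and the maximum of the clip scores, and a lower temperature moves both pooling and sampling toward the highest-scoring clip.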

5. Pixel precise unsupervised detection of viral particle proliferation in cellular imaging data
Birgitta Dresp-Langley, John M. Wandeto
Abstract: Cellular and molecular imaging techniques and models have been developed to characterize single stages of viral proliferation after focal infection of cells in vitro. The fast and automatic classification of cell imaging data may prove helpful prior to any further comparison of representative experimental data to mathematical models of viral propagation in host cells. Here, we use computer generated images drawn from a reproduction of an imaging model from a previously published study of experimentally obtained cell imaging data representing progressive viral particle proliferation in host cell monolayers. Inspired by experimental time-based imaging data, here in this study viral particle increase in time is simulated by a one-by-one increase, across images, in black or gray single pixels representing dead or partially infected cells, and hypothetical remission by a one-by-one increase in white pixels coding for living cells in the original image model. The image simulations are submitted to unsupervised learning by a Self-Organizing Map (SOM) and the Quantization Error in the SOM output (SOM-QE) is used for automatic classification of the image simulations as a function of the represented extent of viral particle proliferation or cell recovery. Unsupervised classification by SOM-QE of 160 model images, each with more than three million pixels, is shown to provide a statistically reliable, pixel precise, and fast classification model that outperforms human computer-assisted image classification by RGB image mean computation. The automatic classification procedure proposed here provides a powerful approach to understand finely tuned mechanisms in the infection and proliferation of virus in cell lines in vitro or other cells.
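As an illustration of the SOM quantization error used above, the sketch below computes the mean distance between each input vector and its best-matching unit. It assumes the SOM has already been trained and that its unit weights are available as a plain list (`codebook`); both names are hypothetical, and the Euclidean distance is an assumption rather than a detail stated in the abstract.

```python
import math

def quantization_error(samples, codebook):
    """Mean Euclidean distance from each sample to its best-matching unit.

    samples:  list of equal-length numeric tuples (e.g. pixel colour vectors)
    codebook: list of SOM unit weight vectors of the same length
    """
    total = 0.0
    for x in samples:
        # The best-matching unit is the codebook vector closest to x.
        total += min(math.dist(x, w) for w in codebook)
    return total / len(samples)
```

An image whose pixels drift away from the learned codebook yields a larger value, which is what allows the error to serve as a scalar indicator of change across a series of images.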

6. A Multi-Plant Disease Diagnosis Method using Convolutional Neural Network
Muhammad Mohsin Kabir, Abu Quwsar Ohi, M. F. Mridha
Abstract: A disease that limits a plant from its maximal capacity is defined as plant disease. From the perspective of agriculture, diagnosing plant disease is crucial, as diseases often limit plants' production capacity. However, manual approaches to recognize plant diseases are often temporal, challenging, and time-consuming. Therefore, computerized recognition of plant diseases is highly desired in the field of agricultural automation. Due to the recent improvement of computer vision, identifying diseases using leaf images of a particular plant has already been introduced. Nevertheless, most of the introduced models can only diagnose diseases of a specific plant. Hence, in this chapter, we investigate an optimal plant disease identification model combining the diagnosis of multiple plants. Rather than relying on multi-class classification, the model inherits a multilabel classification method to identify the plant and the type of disease in parallel. For the experiment and evaluation, we collected data from various online sources that included leaf images of six plants: tomato, potato, rice, corn, grape, and apple. In our investigation, we implement numerous popular convolutional neural network (CNN) architectures. The experimental results validate that the Xception and DenseNet architectures perform better in multi-label plant disease classification tasks. Through architectural investigation, we infer that skip connections, spatial convolutions, and shorter hidden layer connectivity lead to better results in plant disease classification.

7. Multi-modal, multi-task, multi-attention (M3) deep learning detection of reticular pseudodrusen: towards automated and accessible classification of age-related macular degeneration
Qingyu Chen, Tiarnan D. L. Keenan, Alexis Allot, Yifan Peng, Elvira Agrón, Amitha Domalpally, Caroline C. W. Klaver, Daniel T. Luttikhuizen, Marcus H. Colyer, Catherine A. Cukras, Henry E. Wiley, M. Teresa Magone, Chantal Cousineau-Krieger, Wai T. Wong, Yingying Zhu, Emily Y. Chew, Zhiyong Lu
Abstract: Objective: Reticular pseudodrusen (RPD), a key feature of age-related macular degeneration (AMD), are poorly detected by human experts on standard color fundus photography (CFP) and typically require advanced imaging modalities such as fundus autofluorescence (FAF). The objective was to develop and evaluate the performance of a novel 'M3' deep learning framework on RPD detection. Materials and Methods: A deep learning framework, M3, was developed to detect RPD presence accurately using CFP alone, FAF alone, or both, employing >8000 CFP-FAF image pairs obtained prospectively (Age-Related Eye Disease Study 2). The M3 framework includes multi-modal (detection from single or multiple image modalities), multi-task (training different tasks simultaneously to improve generalizability), and multi-attention (improving ensembled feature representation) operation. Performance on RPD detection was compared with state-of-the-art deep learning models and 13 ophthalmologists; performance on detection of two other AMD features (geographic atrophy and pigmentary abnormalities) was also evaluated. Results: For RPD detection, M3 achieved area under receiver operating characteristic (AUROC) 0.832, 0.931, and 0.933 for CFP alone, FAF alone, and both, respectively. M3 performance on CFP was very substantially superior to human retinal specialists (median F1-score 0.644 versus 0.350). External validation (on Rotterdam Study, Netherlands) demonstrated high accuracy on CFP alone (AUROC 0.965). The M3 framework also accurately detected geographic atrophy and pigmentary abnormalities (AUROC 0.909 and 0.912, respectively), demonstrating its generalizability. Conclusion: This study demonstrates the successful development, robust evaluation, and external validation of a novel deep learning framework that enables accessible, accurate, and automated AMD diagnosis and prognosis.

8. Multi-pooled Inception features for no-reference image quality assessment
Domonkos Varga
Abstract: Image quality assessment (IQA) is an important element of a broad spectrum of applications ranging from automatic video streaming to display technology. Furthermore, the measurement of image quality requires a balanced investigation of image content and features. Our proposed approach extracts visual features by attaching global average pooling (GAP) layers to multiple Inception modules of a convolutional neural network (CNN) pretrained on the ImageNet database. In contrast to previous methods, we do not take patches from the input image. Instead, the input image is treated as a whole and is run through a pretrained CNN body to extract resolution-independent, multi-level deep features. As a consequence, our method can be easily generalized to any input image size and pretrained CNN. Thus, we present a detailed parameter study with respect to the CNN base architectures and the effectiveness of different deep features. We demonstrate that our best proposal, called MultiGAP-NRIQA, is able to provide state-of-the-art results on three benchmark IQA databases. Furthermore, these results were also confirmed in a cross-database test using the LIVE In the Wild Image Quality Challenge database.
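A minimal sketch of why global average pooling gives resolution-independent, multi-level features: each feature map is reduced to one value per channel, and the per-level vectors are concatenated. Feature maps are represented here as nested lists (channels x rows x columns); the function names are hypothetical, and the CNN that would produce the maps is assumed rather than shown.

```python
def global_average_pool(feature_map):
    """Average a C x H x W feature map over its spatial dimensions.

    Returns a list of C values, regardless of H and W.
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def multi_level_descriptor(feature_maps):
    """Concatenate the GAP vectors of feature maps taken at several depths."""
    descriptor = []
    for fmap in feature_maps:
        descriptor.extend(global_average_pool(fmap))
    return descriptor
```

The descriptor length depends only on the channel counts of the tapped layers, so images of different sizes map to vectors of the same length.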

9. On-Device Language Identification of Text in Images using Diacritic Characters
Shubham Vatsal, Nikhil Arora, Gopi Ramena, Sukumar Moharana, Dhruval Jain, Naresh Purre, Rachit S Munjal
Abstract: Diacritic characters can be considered a unique set of characters that provide adequate and significant clues for identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics, often serve as a distinguishing feature for many languages, especially those with a Latin script. In this proposed work, we aim to identify the language of text in images using the presence of diacritic characters in order to improve Optical Character Recognition (OCR) performance in any given automated environment. We showcase our work across 13 Latin languages encompassing 85 diacritic characters. We use an architecture similar to Squeezedet for object detection of diacritic characters, followed by a shallow network to finally identify the language. OCR systems, when supplied with the identified language as a parameter, tend to produce better results than OCR systems deployed alone. Apart from guaranteeing an improvement in OCR results, the discussed work also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time.

10. MP-ResNet: Multi-path Residual Network for the Semantic segmentation of High-Resolution PolSAR Images
Lei Ding, Kai Zheng, Dong Lin, Yuxing Chen, Bing Liu, Jiansheng Li, Lorenzo Bruzzone
Abstract: There are limited studies on the semantic segmentation of high-resolution Polarimetric Synthetic Aperture Radar (PolSAR) images due to the scarcity of training data and the interference of speckle noise. The Gaofen contest has provided open access to a high-quality PolSAR semantic segmentation dataset. Taking this opportunity, we propose a Multi-path ResNet (MP-ResNet) architecture for the semantic segmentation of high-resolution PolSAR images. Compared to conventional U-shape encoder-decoder convolutional neural network (CNN) architectures, the MP-ResNet learns semantic context with its parallel multi-scale branches, which greatly enlarges its valid receptive fields and improves the embedding of local discriminative features. In addition, MP-ResNet adopts a multi-level feature fusion design in its decoder to make the best use of the features learned from its different branches. Ablation studies show that the MP-ResNet has significant advantages over its baseline method (FCN with ResNet34). It also surpasses several classic state-of-the-art methods in terms of overall accuracy (OA), mean F1 and fwIoU, whereas its computational costs are not much increased. This CNN architecture can be used as a baseline method for future studies on the semantic segmentation of PolSAR images. The code is available at: this https URL.

11. Decoupled Appearance and Motion Learning for Efficient Anomaly Detection in Surveillance Video
Bo Li, Sam Leroux, Pieter Simoens
Abstract: Automating the analysis of surveillance video footage is of great interest when urban environments or industrial sites are monitored by a large number of cameras. As anomalies are often context-specific, it is hard to predefine events of interest and collect labelled training data. A purely unsupervised approach for automated anomaly detection is much more suitable. For every camera, a separate algorithm could then be deployed that learns over time a baseline model of appearance and motion related features of the objects within the camera viewport. Anything that deviates from this baseline is flagged as an anomaly for further analysis downstream. We propose a new neural network architecture that learns the normal behavior in a purely unsupervised fashion. In contrast to previous work, we use latent code predictions as our anomaly metric. We show that this outperforms reconstruction-based and frame prediction-based methods on different benchmark datasets both in terms of accuracy and robustness against changing lighting and weather conditions. By decoupling an appearance and a motion model, our model can also process 16 to 45 times more frames per second than related approaches which makes our model suitable for deploying on the camera itself or on other edge devices.

12. Human-centric Spatio-Temporal Video Grounding With Visual Transformers
Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, Dong Xu
Abstract: In this work, we introduce a novel task: Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatiotemporal tube of the target person from an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security-related applications, where the surveillance videos can be extremely long but only a specific person during a specific period of time is of interest. HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization. Unfortunately, the existing grounding methods cannot handle this task well. We tackle this task by proposing an effective baseline method named Spatio-Temporal Grounding with Visual Transformers (STGVT), which utilizes Visual Transformers to extract cross-modal representations for video-sentence matching and temporal localization. To facilitate this task, we also contribute an HC-STVG dataset consisting of 5,660 video-sentence pairs on complex multi-person scenes. Specifically, each video lasts for 20 seconds and is paired with a natural query sentence with an average of 17.25 words. Extensive experiments are conducted on this dataset, demonstrating that the newly proposed method outperforms the existing baseline methods.

13. Point Cloud Registration Based on Consistency Evaluation of Rigid Transformation in Parameter Space
Masaki Yoshii, Ikuko Shimizu
Abstract: We can use a method called registration to integrate multiple point clouds that represent the shape of the real world. In this paper, we propose a highly accurate and stable registration method. Our method detects keypoints from point clouds and generates triplets using multiple descriptors. Furthermore, our method evaluates the consistency of the rigid transformation parameters of each triplet with histograms and obtains the rigid transformation between the point clouds. In the experiments of this paper, our method had minimal errors and no major failures. As a result, we obtained sufficiently accurate and stable registration results compared to the comparative methods.

14. Residual Pose: A Decoupled Approach for Depth-based 3D Human Pose Estimation
Angel Martínez-González, Michael Villamizar, Olivier Canévet, Jean-Marc Odobez
Abstract: We propose to leverage recent advances in reliable 2D pose estimation with Convolutional Neural Networks (CNN) to estimate the 3D pose of people from depth images in multi-person Human-Robot Interaction (HRI) scenarios. Our method is based on the observation that using the depth information to obtain 3D lifted points from 2D body landmark detections provides a rough estimate of the true 3D human pose, thus requiring only a refinement step. In that line our contributions are threefold. (i) we propose to perform 3D pose estimation from depth images by decoupling 2D pose estimation and 3D pose refinement; (ii) we propose a deep-learning approach that regresses the residual pose between the lifted 3D pose and the true 3D pose; (iii) we show that despite its simplicity, our approach achieves very competitive results both in accuracy and speed on two public datasets and is therefore appealing for multi-person HRI compared to recent state-of-the-art methods.
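The decoupling described above (lift 2D landmarks to 3D with depth, then add a regressed residual) can be sketched as follows. The sketch assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, which the abstract does not specify, and it takes the residuals as given rather than regressing them with a network. All function names are hypothetical.

```python
def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with measured depth to a 3D camera-frame point.

    Assumes a pinhole camera model with focal lengths (fx, fy)
    and principal point (cx, cy).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def refine_pose(lifted_points, residuals):
    """Add a per-joint 3D residual to each lifted point.

    In the paper's setting the residuals would come from a learned
    regressor; here they are plain inputs.
    """
    return [
        tuple(p + r for p, r in zip(point, residual))
        for point, residual in zip(lifted_points, residuals)
    ]
```

Because the lifted point is already a rough estimate of the joint position, the regressor only has to predict a small correction rather than the full 3D pose.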

15. Deep Multimodal Fusion by Channel Exchanging
Yikai Wang, Wenbing Huang, Fuchun Sun, Tingyang Xu, Yu Rong, Junzhou Huang
Abstract: Deep multimodal fusion by using multiple sources of data for classification or regression has exhibited a clear advantage over the unimodal counterpart on various applications. Yet, current methods including aggregation-based and alignment-based fusion are still inadequate in balancing the trade-off between inter-modal fusion and intra-modal processing, incurring a bottleneck of performance improvement. To this end, this paper proposes Channel-Exchanging-Network (CEN), a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities. Specifically, the channel exchanging process is self-guided by individual channel importance that is measured by the magnitude of Batch-Normalization (BN) scaling factor during training. The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network. Extensive experiments on semantic segmentation via RGB-D data and image translation through multi-domain input verify the effectiveness of our CEN compared to current state-of-the-art methods. Detailed ablation studies have also been carried out, which provably affirm the advantage of each component we propose. Our code is available at this https URL.
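The exchange rule described above can be illustrated for two modalities: a channel whose Batch-Normalization scaling factor is close to zero is treated as unimportant and is replaced by the corresponding channel from the other modality. This is a minimal sketch; the `threshold` value, the function name, and the list-based feature representation are assumptions for illustration and are not taken from the paper.

```python
def exchange_channels(feat_a, feat_b, gamma_a, gamma_b, threshold=0.02):
    """Exchange low-importance channels between two modality sub-networks.

    feat_a, feat_b:   lists of per-channel feature values, one entry per channel
    gamma_a, gamma_b: BN scaling factors, one per channel, for each modality
    threshold:        hypothetical cut-off below which a channel is replaced
    """
    out_a, out_b = [], []
    for ch_a, ch_b, g_a, g_b in zip(feat_a, feat_b, gamma_a, gamma_b):
        # A near-zero scaling factor marks the channel as uninformative,
        # so the other modality's channel is used in its place.
        out_a.append(ch_b if abs(g_a) < threshold else ch_a)
        out_b.append(ch_a if abs(g_b) < threshold else ch_b)
    return out_a, out_b
```

The exchange itself introduces no new learnable parameters, since the decision reuses scaling factors that the BN layers already learn, which is in line with the abstract's description of the framework as parameter-free.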

16. Joint Super-Resolution and Rectification for Solar Cell Inspection
Mathis Hoffmann, Thomas Köhler, Bernd Doll, Frank Schebesch, Florian Talkenberg, Ian Marius Peters, Christoph J. Brabec, Andreas Maier, Vincent Christlein
Abstract: Visual inspection of solar modules is an important monitoring facility in photovoltaic power plants. Since a single measurement of fast CMOS sensors is limited in spatial resolution and often not sufficient to reliably detect small defects, we apply multi-frame super-resolution (MFSR) to a sequence of low resolution measurements. In addition, the rectification and removal of lens distortion simplifies subsequent analysis. Therefore, we propose to fuse this pre-processing with standard MFSR algorithms. This is advantageous, because we omit a separate processing step, the motion estimation becomes more stable and the spacing of high-resolution (HR) pixels on the rectified module image becomes uniform w.r.t. the module plane, regardless of perspective distortion. We present a comprehensive user study showing that MFSR is beneficial for defect recognition by human experts and that the proposed method performs better than the state of the art. Furthermore, we apply automated crack segmentation and show that the proposed method performs 3x better than bicubic upsampling and 2x better than the state of the art for automated inspection.

17. Removing Brightness Bias in Rectified Gradients
Lennart Brocki, Neo Christopher Chung
Abstract: Interpretation and improvement of deep neural networks rely on a better understanding of their underlying mechanisms. In particular, gradients of classes or concepts with respect to the input features (e.g., pixels in images) are often used as importance scores, which are visualized in saliency maps. Thus, a family of saliency methods provides an intuitive way to identify input features with substantial influence on classifications or latent concepts. Rectified Gradients (Kim et al., 2019) is a new method that introduces layer-wise thresholding in order to denoise the saliency maps. While the resulting maps are visually coherent in certain cases, we identify a brightness bias in Rectified Gradients. We demonstrate that dark areas of an input image are not highlighted by a saliency map using Rectified Gradients, even if they are relevant for the class or concept. Even in scaled images, the bias exists around an artificial point in the color spectrum. Our simple modification removes this bias and recovers input features that were removed due to their colors. "No Bias Rectified Gradient" is available at this https URL.

18. AIM 2020 Challenge on Learned Image Signal Processing Pipeline
Andrey Ignatov, Radu Timofte, Zhilu Zhang, Ming Liu, Haolin Wang, Wangmeng Zuo, Jiawei Zhang, Ruimao Zhang, Zhanglin Peng, Sijie Ren, Linhui Dai, Xiaohong Liu, Chengqi Li, Jun Chen, Yuichi Ito, Bhavya Vasudeva, Puneesh Deora, Umapada Pal, Zhenyu Guo, Yu Zhu, Tian Liang, Chenghua Li, Cong Leng, Zhihong Pan, Baopu Li, Byung-Hoon Kim, Joonyoung Song, Jong Chul Ye, JaeHyun Baek, Magauiya Zhussip, Yeskendir Koishekenov, Hwechul Cho Ye, Xin Liu, Xueying Hu, Jun Jiang, Jinwei Gu, Kai Li, Pengliang Tan, Bingxin Hou
Abstract: This paper reviews the second AIM learned ISP challenge and describes the proposed solutions and results. The participating teams were solving a real-world RAW-to-RGB mapping problem, where the goal was to map the original low-quality RAW images captured by the Huawei P20 device to the same photos obtained with the Canon 5D DSLR camera. The considered task embraced a number of complex computer vision subtasks, such as image demosaicing, denoising, white balancing, color and contrast correction, demoireing, etc. The target metric used in this challenge combined fidelity scores (PSNR and SSIM) with the solutions' perceptual results measured in a user study. The proposed solutions significantly improved the baseline results, defining the state-of-the-art for practical image signal processing pipeline modeling.
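
The abstract above names PSNR as one of the fidelity scores. As a rough illustration only, the sketch below computes PSNR for two flat sequences of pixel values. The function name `psnr`, the flat-list input format, and the default peak value of 255 are assumptions made for this example; the challenge's actual evaluation code is not described in the abstract.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    `max_value` is the largest possible pixel value (assumed 255 for 8-bit images).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the estimate is closer to the reference; PSNR does not capture perceptual quality, which is presumably why the challenge pairs it with SSIM and a user study.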

19. SelfDeco: Self-Supervised Monocular Depth Completion in Challenging Indoor Environments
Jaehoon Choi, Dongki Jung, Yonghan Lee, Deokhwa Kim, Dinesh Manocha, Donghwan Lee
Abstract: We present a novel algorithm for self-supervised monocular depth completion. Our approach is based on training a neural network that requires only sparse depth measurements and corresponding monocular video sequences without dense depth labels. Our self-supervised algorithm is designed for challenging indoor environments with textureless regions, glossy and transparent surfaces, non-Lambertian surfaces, moving people, longer and diverse depth ranges, and scenes captured by complex ego-motions. Our novel architecture leverages both deep stacks of sparse convolution blocks to extract sparse depth features and pixel-adaptive convolutions to fuse image and depth features. We compare with existing approaches on the NYUv2, KITTI and NAVERLABS indoor datasets, and observe a 5-34% reduction in root-mean-square error (RMSE).
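
To make the reported RMSE figure concrete, here is a minimal sketch of how RMSE is typically evaluated against sparse depth ground truth, and how a relative reduction is derived from two RMSE values. The helper names `masked_rmse` and `relative_reduction`, and the convention that a ground-truth value of 0 marks a missing measurement, are assumptions for illustration and are not taken from the paper.

```python
import math


def masked_rmse(pred, target, invalid=0.0):
    """RMSE over pixels whose ground-truth depth is valid.

    Assumption: pixels without a measurement are stored as `invalid` (0.0 here).
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t != invalid]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


def relative_reduction(baseline_rmse, new_rmse):
    """Percentage by which `new_rmse` improves on `baseline_rmse`."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse
```

For instance, dropping from an RMSE of 0.5 to 0.4 (in whatever depth unit is used) corresponds to a 20% reduction, which would fall inside the 5-34% range quoted in the abstract.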

20. Conceptual Compression via Deep Structure and Texture Synthesis
Jianhui Chang, Zhenghui Zhao, Chuanmin Jia, Shiqi Wang, Lingbo Yang, Jian Zhang, Siwei Ma
Abstract: Existing compression methods typically focus on the removal of signal-level redundancies, while the potential and versatility of decomposing visual data into compact conceptual components still lack further study. To this end, we propose a novel conceptual compression framework that encodes visual data into compact structure and texture representations, then decodes in a deep synthesis fashion, aiming to achieve better visual reconstruction quality, flexible content manipulation, and potential support for various vision tasks. In particular, we propose to compress images by a dual-layered model consisting of two complementary visual features: 1) structure layer represented by structural maps and 2) texture layer characterized by low-dimensional deep representations. At the encoder side, the structural maps and texture representations are individually extracted and compressed, generating the compact, interpretable, inter-operable bitstreams. During the decoding stage, a hierarchical fusion GAN (HF-GAN) is proposed to learn the synthesis paradigm where the textures are rendered into the decoded structural maps, leading to high-quality reconstruction with remarkable visual realism. Extensive experiments on diverse images have demonstrated the superiority of our framework with lower bitrates, higher reconstruction quality, and increased versatility towards visual analysis and content manipulation tasks.

21. Detecting Human-Object Interaction with Mixed Supervision
Suresh Kirthi Kumaraswamy, Miaojing Shi, Ewa Kijak
Abstract: Human-object interaction (HOI) detection is an important task in image understanding and reasoning. It takes the form of an HOI triplet ⟨human; verb; object⟩, requiring bounding boxes for the human and the object, and the action between them, for task completion. In other words, this task requires strong supervision for training, which is however hard to procure. A natural solution to overcome this is to pursue weakly-supervised learning, where we only know the presence of certain HOI triplets in images but their exact location is unknown. Most weakly-supervised learning methods do not make provision for leveraging data with strong supervision when they are available; indeed, a naive combination of these two paradigms in HOI detection fails to make them contribute to each other. In this regard we propose a mixed-supervised HOI detection pipeline, built on a specific design of momentum-independent learning that learns seamlessly across these two types of supervision. Moreover, in light of the annotation insufficiency in mixed supervision, we introduce an HOI element swapping technique to synthesize diverse and hard negatives across images and improve the robustness of the model. Our method is evaluated on the challenging HICO-DET dataset. It performs close to or even better than many fully-supervised methods by using a mixed amount of strong and weak annotations; furthermore, it outperforms representative state-of-the-art weakly- and fully-supervised methods under the same supervision.

22. Unsupervised Contrastive Photo-to-Caricature Translation based on Auto-distortion
Yuhe Ding, Xin Ma, Mandi Luo, Aihua Zheng, Ran He
Abstract: Photo-to-caricature translation aims to synthesize the caricature as a rendered image exaggerating the features through sketching, pencil strokes, or other artistic drawings. Style rendering and geometry deformation are the most important aspects of the photo-to-caricature translation task. To take both into consideration, we propose an unsupervised contrastive photo-to-caricature translation architecture. Considering the intuitive artifacts in the existing methods, we propose a contrastive style loss for style rendering to enforce the similarity between the style of the rendered photo and the caricature, and simultaneously enhance its discrepancy from the photos. To obtain an exaggerating deformation in an unpaired/unsupervised fashion, we propose a Distortion Prediction Module (DPM) to predict a set of displacement vectors for each input image while fixing some control points, followed by thin plate spline interpolation for warping. The model is trained on unpaired photos and caricatures, yet can offer bidirectional synthesis by taking either a photo or a caricature as input. Extensive experiments demonstrate that the proposed model is more effective at generating hand-drawn-like caricatures than existing competitors.

23. Multi-modal Fusion for Single-Stage Continuous Gesture Recognition
Harshala Gammulle, Simon Denman, Sridha Sridharan, Clinton Fookes
Abstract: Gesture recognition is a much studied research area which has myriad real-world applications including robotics and human-machine interaction. Current gesture recognition methods have heavily focused on isolated gestures, and existing continuous gesture recognition methods are limited by a two-stage approach where independent models are required for detection and classification, with the performance of the latter being constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition model, that can detect and classify multiple gestures in a single video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation stage to detect individual gestures. To enable this, we introduce a multi-modal fusion mechanism to support the integration of important information that flows from multi-modal inputs, and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features respectively. To further enhance the performance we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction. We demonstrate the utility of our proposed framework which can handle variable-length input videos, and outperforms the state-of-the-art on two challenging datasets, EgoGesture, and IPN hand. Furthermore, ablative experiments show the importance of different components of the proposed framework.

24. Simple means Faster: Real-Time Human Motion Forecasting in Monocular First Person Videos on CPU
Junaid Ahmed Ansari, Brojeshwar Bhowmick
Abstract: We present a simple, fast, and light-weight RNN-based framework for forecasting future locations of humans in first-person monocular videos. The primary motivation for this work was to design a network which could accurately predict future trajectories at a very high rate on a CPU. Typical applications of such a system would be a social robot or a visual assistance system for all, as neither can afford high compute power without getting heavier, less power efficient, and costlier. In contrast to many previous methods which rely on multiple types of cues such as camera ego-motion or 2D pose of the human, we show that a carefully designed network model which relies solely on bounding boxes can not only perform better but also predict trajectories at a very high rate while being quite small, at approximately 17 MB. Specifically, we demonstrate that having an auto-encoder in the encoding phase of the past information and a regularizing layer at the end boosts the accuracy of predictions with negligible overhead. We experiment with three first-person video datasets: CityWalks, FPL and JAAD. Our simple method trained on CityWalks surpasses the prediction accuracy of the state-of-the-art method (STED) while being 9.6x faster on a CPU (STED runs on a GPU). We also demonstrate that our model can transfer zero-shot or after just 15% fine-tuning to other similar datasets and perform on par with the state-of-the-art methods on such datasets (FPL and DTP). To the best of our knowledge, we are the first to accurately forecast trajectories at a very high prediction rate of 78 trajectories per second on CPU.

25. Stage-wise Channel Pruning for Model Compression
Mingyang Zhang, Linlin Ou
Abstract: Auto-ML pruning methods aim at automatically searching for a pruning strategy to reduce the computational complexity of deep Convolutional Neural Networks (deep CNNs). However, some previous works found that the results of many Auto-ML pruning methods cannot even surpass those of the uniform pruning method. In this paper, we first analyze the reason for the ineffectiveness of Auto-ML pruning. Subsequently, a stage-wise pruning (SP) method is proposed to solve the above problem. As with most of the previous Auto-ML pruning methods, SP also trains a super-net that can provide proxy performance for sub-nets and searches for the sub-net with the best proxy performance. Different from previous works, we split a deep CNN into several stages and use a full-net, in which no layers are pruned, to supervise the training and the searching of sub-nets. Remarkably, the proxy performance of sub-nets trained with SP is closer to the actual performance than in most of the previous Auto-ML pruning works. Therefore, SP achieves the state-of-the-art on both CIFAR-10 and ImageNet under the mobile setting.

26. CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection
Qijian Zhang, Runmin Cong, Junhui Hou, Chongyi Li, Yao Zhao
Abstract: Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images. One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships. In this paper, we present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images. First, we integrate saliency priors into the backbone features to suppress the redundant background information through an online intra-saliency guidance structure. After that, we design a two-stage aggregate-and-distribute architecture to explore group-wise semantic interactions and produce the co-saliency features. In the first stage, we propose a group-attentional semantic aggregation module that models inter-image relationships to generate the group-wise semantic representations. In the second stage, we propose a gated group distribution module that adaptively distributes the learned group semantics to different individuals in a dynamic gating mechanism. Finally, we develop a group consistency preserving decoder tailored for the CoSOD task, which maintains group constraints during feature decoding to predict more consistent full-resolution co-saliency maps. The proposed CoADNet is evaluated on four prevailing CoSOD benchmark datasets, which demonstrates the remarkable performance improvement over ten state-of-the-art competitors.

27. A low latency ASR-free end to end spoken language understanding system
Mohamed Mhiri, Samuel Myer, Vikrant Singh Tomar
Abstract: In recent years, developing a speech understanding system that classifies a waveform into structured data, such as intents and slots, without first transcribing the speech to text has emerged as an interesting research problem. This work proposes such a system with the additional constraint that its footprint be small enough to run on small micro-controllers and embedded systems with minimal latency. Given a streaming input speech signal, the proposed system can process it segment-by-segment without needing the entire stream at the moment of processing. The proposed system is evaluated on the publicly available Fluent Speech Commands dataset. Experiments show that the proposed system yields state-of-the-art performance with the advantage of low latency and a much smaller model when compared to other published works on the same task.

28. STCNet: Spatio-Temporal Cross Network for Industrial Smoke Detection
Yichao Cao, Qingfei Tang, Xiaobo Lu, Fan Li, Jinde Cao
Abstract: Industrial smoke emissions present a serious threat to natural ecosystems and human health. Prior works have shown that using computer vision techniques to identify smoke is a low-cost and convenient method. However, industrial smoke detection is a challenging task because industrial emission particles often decay rapidly outside the stacks or facilities and steam is very similar to smoke. To overcome these problems, a novel Spatio-Temporal Cross Network (STCNet) is proposed to recognize industrial smoke emissions. The proposed STCNet involves a spatial pathway to extract texture features and a temporal pathway to capture smoke motion information. We assume that the spatial and temporal pathways could guide each other. For example, the spatial path can easily recognize obvious interference such as trees and buildings, and the temporal path can highlight the obscure traces of smoke movement. If the two pathways could guide each other, it would be helpful for smoke detection performance. In addition, we design an efficient and concise spatio-temporal dual pyramid architecture to ensure better fusion of multi-scale spatiotemporal information. Finally, extensive experiments on a public dataset show that our STCNet achieves clear improvements on the challenging RISE industrial smoke detection dataset against the best competitors by 6.2%. The code will be available at: this https URL.

29. On Efficient and Robust Metrics for RANSAC Hypotheses and 3D Rigid Registration
Jiaqi Yang, Zhiqiang Huang, Siwen Quan, Qian Zhang, Yanning Zhang, Zhiguo Cao
Abstract: This paper focuses on developing efficient and robust evaluation metrics for RANSAC hypotheses to achieve accurate 3D rigid registration. Estimating six-degree-of-freedom (6-DoF) pose from feature correspondences remains a popular approach to 3D rigid registration, where random sample consensus (RANSAC) is a de-facto choice for this problem. However, existing metrics for RANSAC hypotheses are either time-consuming or sensitive to common nuisances, parameter variations, and different application scenarios, resulting in deterioration of overall registration accuracy and speed. We alleviate this problem by first analyzing the contributions of inliers and outliers, and then proposing several efficient and robust metrics with different design motivations for RANSAC hypotheses. Comparative experiments on four standard datasets with different nuisances and application scenarios verify that the proposed metrics can significantly improve the registration performance and are more robust than several state-of-the-art competitors, making them well suited to practical applications. This work also draws an interesting conclusion, i.e., not all inliers are equal while all outliers should be equal, which may shed new light on this research problem.
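
The closing remark that "not all inliers are equal while all outliers should be equal" can be illustrated with two classic hypothesis scores. The sketch below contrasts plain inlier counting with a truncated quadratic (MSAC-style) score; these are well-known baselines chosen for illustration, not the metrics proposed in the paper, and the function names and example residuals are hypothetical.

```python
def inlier_count(residuals, threshold):
    """Classic RANSAC score: every residual below the threshold counts as 1."""
    return sum(1 for r in residuals if r < threshold)


def truncated_quadratic_score(residuals, threshold):
    """MSAC-style score (higher is better).

    Inliers are weighted by how small their residual is, so tighter fits
    score more; every outlier contributes exactly 0, however far away it is.
    """
    return sum(1.0 - (r / threshold) ** 2 for r in residuals if r < threshold)


# Two hypothetical hypotheses with the same number of inliers at threshold 1.0.
tight_fit = [0.1, 0.2, 0.9, 5.0]
loose_fit = [0.8, 0.9, 0.9, 5.0]
```

With these residuals, inlier counting cannot separate the two hypotheses (both have three inliers), while the truncated score prefers `tight_fit`. Replacing the outlier residual 5.0 with 500.0 changes neither score, which is the sense in which all outliers are treated equally.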

30. Understanding the hand-gestures using Convolutional Neural Networks and Generative Adversial Networks
Arpita Vats
Abstract: In this paper, a hand gesture recognition system is introduced to recognize characters in real time. The system consists of three modules: real-time hand tracking, gesture training, and gesture recognition using Convolutional Neural Networks. The Camshift algorithm and hand blob analysis are used for hand tracking to obtain motion descriptors and the hand region. The system is fairly robust to background clutter and uses skin color for hand gesture tracking and recognition. Furthermore, techniques have been proposed to improve recognition performance and accuracy, such as selection of the training images and an adaptive threshold gesture to remove non-gesture patterns, which helps to qualify an input pattern as a gesture. In the experiments, the system has been tested on a vocabulary of 36 gestures including the alphabet and digits, and the results show the effectiveness of the approach.

31. Social-STAGE: Spatio-Temporal Multi-Modal Future Trajectory Forecast
Srikanth Malla, Chiho Choi, Behzad Dariush
Abstract: This paper considers the problem of multi-modal future trajectory forecast with ranking. Here, multi-modality and ranking refer to the multiple plausible path predictions and the confidence in those predictions, respectively. We propose Social-STAGE, a Social interaction-aware Spatio-Temporal multi-Attention Graph convolution network with novel Evaluation for multi-modality. Our main contributions include analysis and formulation of multi-modality with ranking using interaction and multi-attention, and introduction of new metrics to evaluate the diversity and associated confidence of multi-modal predictions. We evaluate our approach on the existing public datasets ETH and UCY and show that the proposed algorithm outperforms the state of the art on these datasets.

32. Ellipse Detection and Localization with Applications to Knots in Sawn Lumber Images
Shenyi Pan, Shuxian Fan, Samuel W.K. Wong, James V. Zidek, Helge Rhodin
Abstract: While general object detection has seen tremendous progress, localization of elliptical objects has received little attention in the literature. Our motivating application is the detection of knots in sawn timber images, which is an important problem since the number and types of knots are visual characteristics that adversely affect the quality of sawn timber. We demonstrate how models can be tailored to the elliptical shape and thereby improve on general purpose detectors; more generally, elliptical defects are common in industrial production, such as enclosed air bubbles when casting glass or plastic. In this paper, we adapt the Faster R-CNN with its Region Proposal Network (RPN) to model elliptical objects with a Gaussian function, and extend the existing Gaussian Proposal Network (GPN) architecture by adding the region-of-interest pooling and regression branches, as well as using the Wasserstein distance as the loss function to predict the precise locations of elliptical objects. Our proposed method has promising results on the lumber knot dataset: knots are detected with an average intersection over union of 73.05%, compared to 63.63% for general purpose detectors. Specific to the lumber application, we also propose an algorithm to correct any misalignment in the raw timber images during scanning, and contribute the first open-source lumber knot dataset by labeling the elliptical knots in the preprocessed images.
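
The abstract above models elliptical objects as Gaussians and uses the Wasserstein distance as a loss. As a hedged sketch, the function below evaluates the closed-form squared 2-Wasserstein distance between two 2D Gaussians, W^2 = ||m1 - m2||^2 + tr(S1) + tr(S2) - 2 tr((S2^(1/2) S1 S2^(1/2))^(1/2)), using the fact that for 2x2 positive semi-definite matrices the last trace equals sqrt(tr(S1 S2) + 2 sqrt(det(S1) det(S2))). How the paper parameterizes ellipses and which exact loss variant it uses are not stated in the abstract; the function name and the tuple-based covariance format are assumptions.

```python
import math


def gaussian_w2_squared(mean1, cov1, mean2, cov2):
    """Squared 2-Wasserstein distance between two 2D Gaussians.

    Means are (x, y) pairs; covariances are symmetric positive semi-definite
    2x2 matrices given as ((a, b), (b, c)).
    """
    (a1, b1), (_, c1) = cov1
    (a2, b2), (_, c2) = cov2
    mean_term = (mean1[0] - mean2[0]) ** 2 + (mean1[1] - mean2[1]) ** 2
    trace_sum = a1 + c1 + a2 + c2
    # tr(S1 S2) for symmetric 2x2 matrices.
    trace_prod = a1 * a2 + 2.0 * b1 * b2 + c1 * c2
    det_prod = (a1 * c1 - b1 * b1) * (a2 * c2 - b2 * b2)
    # tr((S2^(1/2) S1 S2^(1/2))^(1/2)), valid in two dimensions only.
    cross = math.sqrt(max(trace_prod + 2.0 * math.sqrt(max(det_prod, 0.0)), 0.0))
    # Clamp tiny negative values caused by floating-point rounding.
    return max(mean_term + trace_sum - 2.0 * cross, 0.0)
```

Unlike intersection over union, this quantity stays informative when two ellipses do not overlap at all, since it still grows with the distance between their centers.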

33. CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection
Ramin Nabati, Hairong Qi
Abstract: The perception system in autonomous vehicles is responsible for detecting and tracking the surrounding objects. This is usually done by taking advantage of several sensing modalities to increase robustness and accuracy, which makes sensor fusion a crucial part of the perception system. In this paper, we focus on the problem of radar and camera sensor fusion and propose a middle-fusion approach to exploit both radar and camera data for 3D object detection. Our approach, called CenterFusion, first uses a center point detection network to detect objects by identifying their center points on the image. It then solves the key data association problem using a novel frustum-based method to associate the radar detections to their corresponding object's center point. The associated radar detections are used to generate radar-based feature maps to complement the image features, and regress to object properties such as depth, rotation and velocity. We evaluate CenterFusion on the challenging nuScenes dataset, where it improves the overall nuScenes Detection Score (NDS) of the state-of-the-art camera-based algorithm by more than 12%. We further show that CenterFusion significantly improves the velocity estimation accuracy without using any additional temporal information. The code is available at this https URL .

34. Kinematics-Guided Reinforcement Learning for Object-Aware 3D Ego-Pose Estimation
Zhengyi Luo, Ryo Hachiuma, Ye Yuan, Shun Iwase, Kris M. Kitani
Abstract: We propose a method for incorporating object interaction and human body dynamics into the task of 3D ego-pose estimation using a head-mounted camera. We use a kinematics model of the human body to represent the entire range of human motion, and a dynamics model of the body to interact with objects inside a physics simulator. By bringing together object modeling, kinematics modeling, and dynamics modeling in a reinforcement learning (RL) framework, we enable object-aware 3D ego-pose estimation. We devise several representational innovations through the design of the state and action space to incorporate 3D scene context and improve pose estimation quality. We also construct a fine-tuning step to correct the drift and refine the estimated human-object interaction. This is the first work to estimate a physically valid 3D full-body interaction sequence with objects (e.g., chairs, boxes, obstacles) from egocentric videos. Experiments with both controlled and in-the-wild settings show that our method can successfully extract an object-conditioned 3D ego-pose sequence that is consistent with the laws of physics.

35. After All, Only The Last Neuron Matters: Comparing Multi-modal Fusion Functions for Scene Graph Generation
Mohamed Karim Belaid
Abstract: From object segmentation to word vector representations, Scene Graph Generation (SGG) became a complex task built upon numerous research results. In this paper, we focus on the last module of this model: the fusion function. Its role is to combine three hidden states. We perform an ablation test in order to compare different implementations. First, we reproduce the state-of-the-art results using the SUM and GATE functions. Then we expand the original solution by adding more model-agnostic functions: an adapted version of DIST and a mixture between MFB and GATE. On the basis of the state-of-the-art configuration, DIST achieved the best Recall@K, which now makes it part of the state-of-the-art.

36. Predicting Landsat Reflectance with Deep Generative Fusion
Shahine Bouabid, Maxim Chernetskiy, Maxime Rischard, Jevgenij Gamper
Abstract: Public satellite missions are commonly bound to a trade-off between spatial and temporal resolution as no single sensor provides fine-grained acquisitions with frequent coverage. This hinders their potential to assist vegetation monitoring or humanitarian actions, which require detecting rapid and detailed terrestrial surface changes. In this work, we probe the potential of deep generative models to produce high-resolution optical imagery by fusing products with different spatial and temporal characteristics. We introduce a dataset of co-registered Moderate Resolution Imaging Spectroradiometer (MODIS) and Landsat surface reflectance time series and demonstrate the ability of our generative model to blend coarse daily reflectance information into low-paced finer acquisitions. We benchmark our proposed model against state-of-the-art reflectance fusion algorithms.

37. MUSE: Illustrating Textual Attributes by Portrait Generation
Xiaodan Hu, Pengfei Yu, Kevin Knight, Heng Ji, Bo Li, Honghui Shi
Abstract: We propose a novel approach, MUSE, to illustrate textual attributes visually via portrait generation. MUSE takes as input a set of attributes written in text, in addition to facial features extracted from a photo of the subject. We propose 11 attribute types to represent inspirations from a subject's profile, emotion, story, and environment. We propose a novel stacked neural network architecture by extending an image-to-image generative model to accept textual attributes. Experiments show that our approach significantly outperforms several state-of-the-art methods that do not use textual attributes, with the Inception Score increased by 6% and the Fréchet Inception Distance (FID) decreased by 11%. We also propose a new attribute reconstruction metric to evaluate whether the generated portraits preserve the subject's attributes. Experiments show that our approach can accurately illustrate 78% of the textual attributes, which also help MUSE capture the subject in a more creative and expressive way.

38. Learning to Infer Semantic Parameters for 3D Shape Editing
Fangyin Wei, Elena Sizikova, Avneesh Sud, Szymon Rusinkiewicz, Thomas Funkhouser
Abstract: Many applications in 3D shape design and augmentation require the ability to make specific edits to an object's semantic parameters (e.g., the pose of a person's arm or the length of an airplane's wing) while preserving as many existing details as possible. We propose to learn a deep network that infers the semantic parameters of an input shape and then allows the user to manipulate those parameters. The network is trained jointly on shapes from an auxiliary synthetic template and unlabeled realistic models, ensuring robustness to shape variability while relieving the need to label realistic exemplars. At testing time, edits within the parameter space drive deformations to be applied to the original shape, which provides semantically meaningful manipulation while preserving the details. This is in contrast to prior methods that either use autoencoders with a limited latent-space dimensionality, failing to preserve arbitrary detail, or drive deformations with purely geometric controls, such as cages, losing the ability to update local part regions. Experiments with datasets of chairs, airplanes, and human bodies demonstrate that our method produces more natural edits than prior work.

39. Similarity-Based Clustering for Enhancing Image Classification Architectures
Dishant Parikh, Shambhavi Aggarwal
Abstract: Convolutional networks are at the core of state-of-the-art computer vision applications for a wide variety of tasks. Since 2014, a substantial amount of work has gone into designing better convolutional architectures, yielding considerable gains on various benchmarks. Although increased model size and computational cost generally translate into immediate quality gains for most tasks, architectures now need additional information to improve performance further. We show empirical evidence that combining content-based image similarity with deep learning models provides a flow of information that makes clustered learning possible. We show that parallel training on sub-dataset clusters not only reduces the cost of computation but also increases benchmark accuracies by 5-11 percent.

40. Ontology-driven Event Type Classification in Images
Eric Müller-Budack, Matthias Springstein, Sherzod Hakimov, Kevin Mrutzek, Ralph Ewerth
Abstract: Event classification can add valuable information for semantic search and the increasingly important topic of fact validation in news. So far, only few approaches address image classification for newsworthy event types such as natural disasters, sports events, or elections. Previous work distinguishes only between a limited number of event types and relies on rather small datasets for training. In this paper, we present a novel ontology-driven approach for the classification of event types in images. We leverage a large number of real-world news events to pursue two objectives: First, we create an ontology based on Wikidata comprising the majority of event types. Second, we introduce a novel large-scale dataset that was acquired through Web crawling. Several baselines are proposed including an ontology-driven learning approach that aims to exploit structured information of a knowledge graph to learn relevant event relations using deep neural networks. Experimental results on existing as well as novel benchmark datasets demonstrate the superiority of the proposed ontology-driven approach.

41. Explainable COVID-19 Detection Using Chest CT Scans and Deep Learning
Hammam Alshazly, Christoph Linse, Erhardt Barth, Thomas Martinetz
Abstract: This paper explores how well deep learning models trained on chest CT images can diagnose COVID-19 infected people in a fast and automated process. To this end, we adopt advanced deep network architectures and propose a transfer learning strategy using custom-sized input tailored for each deep architecture to achieve the best performance. We conduct extensive sets of experiments on two CT image datasets, namely the SARS-CoV-2 CT-scan and the COVID19-CT. The obtained results show superior performances for our models compared with previous studies, where our best models achieve average accuracy, precision, sensitivity, specificity and F1 score of 99.4%, 99.6%, 99.8%, 99.6% and 99.4% on the SARS-CoV-2 dataset; and 92.9%, 91.3%, 93.7%, 92.2% and 92.5% on the COVID19-CT dataset, respectively. Furthermore, we apply two visualization techniques to provide visual explanations for the models' predictions. The visualizations show well-separated clusters for CT images of COVID-19 from other lung diseases, and accurate localizations of the COVID-19 associated regions.
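The five reported metrics are all derived from the same binary confusion matrix. As a minimal sketch of how they relate, assuming the usual definitions (the function name and the counts below are hypothetical and not taken from the paper):

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall or true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only; these are not the paper's results.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
```

Sensitivity and specificity are reported separately because a screening model can score well on one while missing cases on the other.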

42. An Attack on InstaHide: Is Private Learning Possible with Instance Encoding?
Nicholas Carlini, Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Shuang Song, Abhradeep Thakurta, Florian Tramer
Abstract: A learning algorithm is private if the produced model does not reveal (too much) about its training set. InstaHide [Huang, Song, Li, Arora, ICML'20] is a recent proposal that claims to preserve privacy by an encoding mechanism that modifies the inputs before being processed by the normal learner. We present a reconstruction attack on InstaHide that is able to use the encoded images to recover visually recognizable versions of the original images. Our attack is effective and efficient, and empirically breaks InstaHide on CIFAR-10, CIFAR-100, and the recently released InstaHide Challenge. We further formalize various privacy notions of learning through instance encoding and investigate the possibility of achieving these notions. We prove barriers against achieving (indistinguishability based notions of) privacy through any learning protocol that uses instance encoding.

43. EPSR: Edge Profile Super resolution
Jiun Lee, Joongkyu Kim
Abstract: Recently, numerous deep convolutional neural networks (CNNs) have been explored for single image super-resolution (SISR), achieving significant performance. However, most deep CNN-based SR methods focus on designing wider or deeper architectures, and it is hard to find methods that utilize image properties in SISR. In this paper, by developing an edge-profile approach to the SISR problem based on an end-to-end CNN model, we propose edge profile super resolution (EPSR). Specifically, we construct a residual edge enhance block (REEB), which consists of a residual efficient channel attention block (RECAB), an edge profile (EP) module, and a context network (CN) module. RECAB extracts adaptively rescaled channel-wise features by efficiently considering interdependencies among channels. From these features, the EP module generates edge-guided features by extracting the edge profile itself, and the CN module then enhances details by exploiting contextual information of the features. To utilize information from low- to high-frequency components, we design a fractal skip connection (FSC) structure. Owing to the self-similarity of the architecture, the FSC structure allows EPSR to bypass abundant information into each REEB block. Experimental results show that EPSR achieves competitive performance against state-of-the-art methods.

44. OpenKinoAI: An Open Source Framework for Intelligent Cinematography and Editing of Live Performances
Rémi Ronfard, Rémi Colin de Verdière
Abstract: OpenKinoAI is an open source framework for post-production of ultra high definition video which makes it possible to emulate professional multiclip editing techniques for the case of single camera recordings. OpenKinoAI includes tools for uploading raw video footage of live performances on a remote web server, detecting, tracking and recognizing the performers in the original material, reframing the raw video into a large choice of cinematographic rushes, editing the rushes into movies, and annotating rushes and movies for documentation purposes. OpenKinoAI is made available to promote research in multiclip video editing of ultra high definition video, and to allow performing artists and companies to use this research for archiving, documenting and sharing their work online in an innovative fashion.

45. Pristine annotations-based multi-modal trained artificial intelligence solution to triage chest X-ray for COVID-19
Tao Tan, Bipul Das, Ravi Soni, Mate Fejes, Sohan Ranjan, Daniel Attila Szabo, Vikram Melapudi, K S Shriram, Utkarsh Agrawal, Laszlo Rusko, Zita Herczeg, Barbara Darazs, Pal Tegzes, Lehel Ferenczi, Rakesh Mullick, Gopal Avinash
Abstract: The COVID-19 pandemic continues to spread and impact the well-being of the global population. Front-line modalities including computed tomography (CT) and X-ray play an important role in triaging COVID patients. Considering the limited access to resources (both hardware and trained personnel) and decontamination considerations, CT may not be ideal for triaging suspected subjects. Triaging and monitoring require experienced radiologists to identify COVID patients in a timely manner and to further delineate the disease region boundary, so artificial intelligence (AI) assisted X-ray based applications are seen as a promising solution. Our proposed solution differs from existing solutions by industry and academic communities, and demonstrates a functional AI model that triages by inferencing on a single X-ray image, while the deep-learning model is trained using both X-ray and CT data. We report on how such multi-modal training improves the solution compared to X-ray only training. The multi-modal solution increases the AUC (area under the receiver operating characteristic curve) from 0.89 to 0.93 and also positively impacts the Dice coefficient (0.59 to 0.62) for localizing the pathology. To the best of our knowledge, it is the first X-ray solution to leverage multi-modal information for its development.
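The Dice coefficient used for the localization result measures the overlap between a predicted mask and a reference mask. A minimal sketch, assuming flattened binary masks and the standard definition 2|A∩B| / (|A| + |B|); the helper name and the masks are hypothetical:

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two equal-length binary masks given as lists of 0/1."""
    intersection = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks for illustration only; these are not data from the paper.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 1, 1, 0]
score = dice_coefficient(pred_mask, true_mask)
```

A value of 1 means the two regions coincide exactly and 0 means they do not overlap at all.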

46. Bridging the Performance Gap between FGSM and PGD Adversarial Training
Tianjin Huang, Vlado Menkovski, Yulong Pei, Mykola Pechenizkiy
Abstract: Deep learning achieves state-of-the-art performance in many tasks but is exposed to an underlying vulnerability to adversarial examples. Among existing defense techniques, adversarial training with the projected gradient descent attack (adv.PGD) is considered one of the most effective ways to achieve moderate adversarial robustness. However, adv.PGD requires too much training time since the projected gradient descent attack (PGD) takes multiple iterations to generate perturbations. On the other hand, adversarial training with the fast gradient sign method (adv.FGSM) takes much less training time since the fast gradient sign method (FGSM) takes one step to generate perturbations, but it fails to increase adversarial robustness. In this work, we extend adv.FGSM to make it achieve the adversarial robustness of adv.PGD. We demonstrate that the large curvature along the FGSM perturbed direction leads to a large difference in adversarial robustness between adv.FGSM and adv.PGD, and therefore propose combining adv.FGSM with a curvature regularization (adv.FGSMR) in order to bridge the performance gap between adv.FGSM and adv.PGD. The experiments show that adv.FGSMR has higher training efficiency than adv.PGD. In addition, it achieves comparable adversarial robustness on the MNIST dataset under white-box attack, and on the CIFAR-10 dataset it achieves better performance than adv.PGD under white-box attack and effectively defends against transferable adversarial attacks.
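The cost difference between the two attacks comes from the number of gradient evaluations: FGSM takes one signed-gradient step, while PGD repeats smaller steps and projects back into the perturbation ball after each one. The following is a minimal sketch on a toy logistic-regression model. The model, the helper names, and all numbers are assumptions for illustration; the curvature regularization proposed in the paper is not shown, and PGD here starts from the clean input rather than a random point.

```python
import math


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def loss_grad_x(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x (label y in {0, 1})."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """Single signed-gradient step of size eps."""
    g = loss_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(w, b, x, y, eps, step, n_iter):
    """Repeated signed-gradient steps, each clipped to the L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(n_iter):
        g = loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: keep every coordinate within eps of the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Hypothetical toy model and input.
w, b = [1.0, -2.0, 0.5], 0.1
x, y = [0.2, 0.4, -0.3], 1
x_fgsm = fgsm(w, b, x, y, eps=0.1)
x_pgd = pgd(w, b, x, y, eps=0.1, step=0.025, n_iter=10)
```

For this linear model the gradient sign never changes, so both attacks reach the same corner of the ball. On a deep network with a curved loss surface the two results generally differ, which is the gap the abstract refers to.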

47. Principles of Stochastic Computing: Fundamental Concepts and Applications
S. Rahimi Kari
Abstract: The semiconductor and IC industry is facing the issue of high energy consumption. Modern computers and processing systems are designed based on the Turing machine and the Von Neumann architecture. This architecture mainly focuses on designing systems based on deterministic behaviors. To tackle energy consumption and reliability in systems, Stochastic Computing was introduced. In this research, we aim to review and study the principles behind stochastic computing and its implementation techniques. By utilizing stochastic computing, we can achieve higher energy efficiency and smaller area when designing arithmetic units. Also, we aim to popularize the use of stochastic systems in designing futuristic BLSI and neuromorphic systems.
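A standard textbook illustration of why stochastic arithmetic units can be small is unipolar multiplication. A value in [0, 1] is encoded as a random bitstream whose fraction of ones equals the value, and the product of two independent streams is obtained with a single AND gate per bit. The sketch below simulates this in software; it is not taken from the abstract, and the function names, stream length, and seed are arbitrary assumptions.

```python
import random


def encode(p, n, rng):
    """Encode a value p in [0, 1] as n random bits, each 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def decode(bits):
    """Recover the encoded value as the fraction of ones in the stream."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000
a_stream = encode(0.5, n, rng)
b_stream = encode(0.4, n, rng)
product = decode(sc_multiply(a_stream, b_stream))  # close to 0.5 * 0.4 = 0.2
```

The result is only approximate, and its error shrinks as the stream gets longer, which is the accuracy-versus-latency trade-off typical of this style of computing.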

48. Classification of optics-free images with deep neural networks
Soren Nelson, Rajesh Menon
Abstract: The thinnest possible camera is achieved by removing all optics, leaving only the image sensor. We train deep neural networks to perform multi-class detection and binary classification (with accuracy of 92%) on optics-free images without the need for anthropocentric image reconstructions. Inferencing from optics-free images has the potential for enhanced privacy and power efficiency.

49. A Soft Computing Approach for Selecting and Combining Spectral Bands
Juan F. H. Albarracín, Rafael S. Oliveira, Marina Hirota, Jefersson A. dos Santos, Ricardo da S. Torres
Abstract: We introduce a soft computing approach for automatically selecting and combining indices from remote sensing multispectral images that can be used for classification tasks. The proposed approach is based on a Genetic-Programming (GP) framework, a technique successfully used in a wide variety of optimization problems. Through GP, it is possible to learn indices that maximize the separability of samples from two different classes. Once the indices specialized for all pairs of classes are obtained, they are used in pixelwise classification tasks. We used the GP-based solution to evaluate complex classification problems, such as those related to the discrimination of vegetation types within and between tropical biomes. Using time series defined in terms of the learned spectral indices, we show that the GP framework yields better results than other indices that are used to discriminate and classify tropical biomes.

50. Noise2Stack: Improving Image Restoration by Learning from Volumetric Data
Mikhail Papkov, Kenny Roberts, Lee Ann Madissoon, Omer Bayraktar, Dmytro Fishman, Kaupo Palo, Leopold Parts
Abstract: Biomedical images are noisy. The imaging equipment itself has physical limitations, and the consequent experimental trade-offs between signal-to-noise ratio, acquisition speed, and imaging depth exacerbate the problem. Denoising is, therefore, an essential part of any image processing pipeline, and convolutional neural networks are currently the method of choice for this task. One popular approach, Noise2Noise, does not require clean ground truth, and instead, uses a second noisy copy as a training target. Self-supervised methods, like Noise2Self and Noise2Void, relax data requirements by learning the signal without an explicit target but are limited by the lack of information in a single image. Here, we introduce Noise2Stack, an extension of the Noise2Noise method to image stacks that takes advantage of a shared signal between spatially neighboring planes. Our experiments on magnetic resonance brain scans and newly acquired multiplane microscopy data show that learning only from image neighbors in a stack is sufficient to outperform Noise2Noise and Noise2Void and close the gap to supervised denoising methods. Our findings point towards low-cost, high-reward improvement in the denoising pipeline of multiplane biomedical images. As a part of this work, we release a microscopy dataset to establish a benchmark for the multiplane image denoising.
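The core idea, using neighboring planes of a stack as the network input and the plane between them as the noisy training target, can be sketched as a simple index-pairing step. This is an illustration only: the use of exactly the two immediate neighbors and the function name are assumptions, since the abstract does not specify the configuration.

```python
def neighbor_training_pairs(num_planes):
    """List (input_plane_indices, target_plane_index) for every interior plane of a stack.

    The denoiser would see only the neighbors of plane z and be trained to predict
    the noisy plane z itself, so no clean ground truth is required.
    """
    return [((z - 1, z + 1), z) for z in range(1, num_planes - 1)]


# A hypothetical stack with five planes yields three training pairs.
pairs = neighbor_training_pairs(5)
```

Because the noise in adjacent planes is independent while the underlying signal is shared, the target plane acts like the second noisy copy that Noise2Noise requires.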

51. Deep correction of breathing-related artifacts in MR-thermometry
Baudouin Denis de Senneville, Pierrick Coupé, Mario Ries, Laurent Facq, Chrit Moonen
Abstract: Real-time MR-imaging has been clinically adapted for monitoring thermal therapies since it can provide on-the-fly temperature maps simultaneously with anatomical information. However, proton resonance frequency based thermometry of moving targets remains challenging since temperature artifacts are induced by respiratory as well as physiological motion. If left uncorrected, these artifacts lead to severe errors in temperature estimates and impair therapy guidance. In this study, we evaluated deep learning for on-line correction of motion-related errors in abdominal MR-thermometry. For this, a convolutional neural network (CNN) was designed to learn the apparent temperature perturbation from images acquired during a preparative learning stage prior to hyperthermia. The input of the designed CNN is the most recent magnitude image, and no surrogate of motion is needed. During the subsequent hyperthermia procedure, the most recent magnitude image is used as an input for the CNN model in order to generate an on-line correction for the current temperature map. The method's artifact suppression performance was evaluated on 12 free-breathing volunteers and was found to be robust and artifact-free in all examined cases. Furthermore, thermometric precision and accuracy were assessed for in vivo ablation using high intensity focused ultrasound. All calculations involved at the different stages of the proposed workflow were designed to be compatible with the clinical time constraints of a therapeutic procedure.

52. Tattoo tomography: Freehand 3D photoacoustic image reconstruction with an optical pattern
Niklas Holzwarth, Melanie Schellenberg, Janek Gröhl, Kris Dreher, Jan-Hinrich Nölke, Alexander Seitel, Minu D. Tizabi, Beat P. Müller-Stich, Lena Maier-Hein
Abstract: Purpose: Photoacoustic tomography (PAT) is a novel imaging technique that can spatially resolve both morphological and functional tissue properties, such as the vessel topology and tissue oxygenation. While this capacity makes PAT a promising modality for the diagnosis, treatment and follow-up of various diseases, a current drawback is the limited field-of-view (FoV) provided by the conventionally applied 2D probes. Methods: In this paper, we present a novel approach to 3D reconstruction of PAT data (Tattoo tomography) that does not require an external tracking system and can smoothly be integrated into clinical workflows. It is based on an optical pattern placed on the region of interest prior to image acquisition. This pattern is designed in a way that a tomographic image of it enables the recovery of the probe pose relative to the coordinate system of the pattern. This allows the transformation of a sequence of acquired PA images into one common global coordinate system and thus the consistent 3D reconstruction of PAT imaging data. Results: An initial feasibility study conducted with experimental phantom data and in vivo forearm data indicates that the Tattoo approach is well-suited for 3D reconstruction of PAT data with high accuracy and precision. Conclusion: In contrast to previous approaches to 3D ultrasound (US) or PAT reconstruction, the Tattoo approach neither requires complex external hardware nor training data acquired for a specific application. It could thus become a valuable tool for clinical freehand PAT.
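Once the probe pose relative to the pattern is known for each frame, placing that frame in the common coordinate system amounts to a rigid transform of its pixel positions. A minimal sketch, assuming the pose is given as a rotation matrix R and translation t and that the image lies in the probe's local z = 0 plane. These conventions and the helper names are assumptions; how the pose is recovered from the pattern is not shown.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def to_pattern_frame(R, t, u, v):
    """Map an in-plane image point (u, v) to 3D pattern coordinates: R @ [u, v, 0] + t."""
    p = (u, v, 0.0)
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


# Hypothetical pose: probe rotated 90 degrees about z and shifted by (10, 0, 5).
point = to_pattern_frame(rot_z(math.pi / 2), (10.0, 0.0, 5.0), 1.0, 2.0)
```

Applying each frame's own pose to all of its pixels places the whole image sequence in one coordinate system, from which a 3D volume can be resampled.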

53. AIM 2020 Challenge on Rendering Realistic Bokeh
Andrey Ignatov, Radu Timofte, Ming Qian, Congyu Qiao, Jiamin Lin, Zhenyu Guo, Chenghua Li, Cong Leng, Jian Cheng, Juewen Peng, Xianrui Luo, Ke Xian, Zijin Wu, Zhiguo Cao, Densen Puthussery, Jiji C V, Hrishikesh P S, Melvin Kuriakose, Saikat Dutta, Sourya Dipta Das, Nisarg A. Shah, Kuldeep Purohit, Praveen Kandula, Maitreya Suin, A. N. Rajagopalan, Saagara M B, Minnu A L, Sanjana A R, Praseeda S, Ge Wu, Xueqin Chen, Tengyao Wang, Max Zheng, Hulk Wong, Jay Zou
Abstract: This paper reviews the second AIM realistic bokeh effect rendering challenge and describes the proposed solutions and results. The participating teams were solving a real-world bokeh simulation problem, where the goal was to learn a realistic shallow focus technique using the large-scale EBB! bokeh dataset, consisting of 5K shallow / wide depth-of-field image pairs captured using the Canon 7D DSLR camera. The participants had to render the bokeh effect based on a single frame without any additional data from other cameras or sensors. The target metric used in this challenge combined the runtime and the perceptual quality of the solutions measured in a user study. To ensure the efficiency of the submitted models, we measured their runtime on standard desktop CPUs and also ran the models on smartphone GPUs. The proposed solutions significantly improved the baseline results, defining the state of the art for the practical bokeh effect rendering problem.

54. Towards a Better Global Loss Landscape of GANs
Ruoyu Sun, Tiantian Fang, Alex Schwing
Abstract: Understanding of GAN training is still very limited. One major challenge is its non-convex-non-concave min-max objective, which may lead to sub-optimal local minima. In this work, we perform a global landscape analysis of the empirical loss of GANs. We prove that a class of separable-GAN, including the original JS-GAN, has exponentially many bad basins which are perceived as mode-collapse. We also study the relativistic pairing GAN (RpGAN) loss which couples the generated samples and the true samples. We prove that RpGAN has no bad basins. Experiments on synthetic data show that the predicted bad basin can indeed appear in training. We also perform experiments to support our theory that RpGAN has a better landscape than separable-GAN. For instance, we empirically show that RpGAN performs better than separable-GAN with relatively narrow neural nets. The code is available at this https URL.
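The difference between a separable loss and a pairing loss can be made concrete with discriminator scores. In the sketch below, the separable form scores real and generated samples independently, while the pairing form depends only on the difference between a real score and a generated score. Using the logistic (softplus) function for both is an assumption made for illustration, and the function names are hypothetical; the paper's analysis covers a broader class of losses.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return math.log1p(math.exp(-abs(t))) + max(t, 0.0)


def separable_d_loss(real_scores, fake_scores):
    """JS-GAN style discriminator loss: real and fake terms are summed independently."""
    real_term = sum(softplus(-r) for r in real_scores) / len(real_scores)
    fake_term = sum(softplus(f) for f in fake_scores) / len(fake_scores)
    return real_term + fake_term


def rpgan_d_loss(real_scores, fake_scores):
    """Pairing loss: each real score is coupled with one fake score via their difference."""
    diffs = [r - f for r, f in zip(real_scores, fake_scores)]
    return sum(softplus(-d) for d in diffs) / len(diffs)


# Hypothetical discriminator outputs.
loss_a = rpgan_d_loss([1.0, 2.0], [0.0, 1.0])
loss_b = rpgan_d_loss([11.0, 12.0], [10.0, 11.0])  # same differences, shifted scores
```

The pairing loss is unchanged when all scores are shifted by a constant, because only the relative ordering of real and generated samples matters. The separable loss does not have this property.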

55. The Virtual Goniometer: A new method for measuring angles on 3D models of fragmentary bone and lithics
Katrina Yezzi-Woodley, Jeff Calder, Peter J. Olver, Annie Melton, Paige Cody, Thomas Huffstutler, Alexander Terwilliger, Martha Tappen, Reed Coil, Gilbert Tostevin
Abstract: The contact goniometer is a commonly used tool in lithic and zooarchaeological analysis, despite suffering from a number of shortcomings due to the physical interaction between the measuring implement, the object being measured, and the individual taking the measurements. However, lacking a simple and efficient alternative, researchers in a variety of fields continue to use the contact goniometer to this day. In this paper, we present a new goniometric method that we call the virtual goniometer, which takes angle measurements virtually on a 3D model of an object. The virtual goniometer allows for rapid data collection, and for the measurement of many angles that cannot be physically accessed by a manual goniometer. We compare the intra-observer variability of the manual and virtual goniometers, and find that the virtual goniometer is far more consistent and reliable. Furthermore, the virtual goniometer allows for precise replication of angle measurements, even among multiple users, which is important for reproducibility of goniometric-based research. The virtual goniometer is available as a plug-in in the open source mesh processing packages Meshlab and Blender, making it easily accessible to researchers exploring the potential for goniometry to improve archaeological methods and address anthropological questions.

56. Learnings from Frontier Development Lab and SpaceML -- AI Accelerators for NASA and ESA
Siddha Ganju, Anirudh Koul, Alexander Lavin, Josh Veitch-Michaelis, Meher Kasam, James Parr
Abstract: Research with AI and ML technologies lives in a variety of settings with often asynchronous goals and timelines: academic labs and government organizations pursue open-ended research focusing on discoveries with long-term value, while research in industry is driven by commercial pursuits and hence focuses on short-term timelines and return on investment. The journey from research to product is often tacit or ad hoc, resulting in technology transition failures, further exacerbated when research and development is interorganizational and interdisciplinary. Moreover, much of the ability to produce results remains locked in the private repositories and know-how of the individual researcher, slowing the impact on future research by others and contributing to the ML community's challenges in reproducibility. With research organizations focused on an exploding array of fields, opportunities for the handover and maturation of interdisciplinary research diminish. With these tensions, we see an emerging need to measure the correctness, impact, and relevance of research during its development to enable better collaboration, improved reproducibility, faster progress, and more trusted outcomes. We perform a case study of the Frontier Development Lab (FDL), an AI accelerator under a public-private partnership from NASA and ESA. FDL research follows principled practices that are grounded in responsible development, conduct, and dissemination of AI research, enabling FDL to churn out successful interdisciplinary and interorganizational research projects, measured through NASA's Technology Readiness Levels. We also take a look at the SpaceML Open Source Research Program, which helps accelerate and transition FDL's research to deployable projects with widespread adoption amongst citizen scientists.

57. Deep Reinforcement Learning for Navigation in AAA Video Games [PDF]
Eloi Alonso, Maxim Peter, David Goumard, Joshua Romoff
Abstract: In video games, non-player characters (NPCs) are used to enhance the players' experience in a variety of ways, e.g., as enemies, allies, or innocent bystanders. A crucial component of NPCs is navigation, which allows them to move from one point to another on the map. The most popular approach for NPC navigation in the video game industry is to use a navigation mesh (NavMesh), which is a graph representation of the map, with nodes and edges indicating traversable areas. Unfortunately, complex navigation abilities that extend the character's capacity for movement, e.g., grappling hooks, jetpacks, teleportation, or double-jumps, increase the complexity of the NavMesh, making it intractable in many practical scenarios. Game designers are thus constrained to add only abilities that can be handled by a NavMesh if they want to have NPC navigation. As an alternative, we propose to use Deep Reinforcement Learning (Deep RL) to learn how to navigate 3D maps using any navigation ability. We test our approach on complex 3D environments in the Unity game engine that are notably an order of magnitude larger than maps typically used in the Deep RL literature. One of these maps is directly modeled after a Ubisoft AAA game. We find that our approach performs surprisingly well, achieving at least $90\%$ success rate on all tested scenarios. A video of our results is available at this https URL.
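To make the NavMesh idea in the abstract above concrete, here is a minimal sketch of a navigation mesh treated as a graph and searched with breadth-first search. It is not the paper's method (the paper proposes Deep RL as an alternative to this kind of search). The region names, the `NAVMESH` dictionary, and the `find_path` helper are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical toy NavMesh: each node is a walkable region, and each edge
# links two regions an NPC can walk between directly.
NAVMESH = {
    "spawn": ["bridge", "cave"],
    "bridge": ["spawn", "tower"],
    "cave": ["spawn", "tower"],
    "tower": ["bridge", "cave", "roof"],
    "roof": ["tower"],
    "island": [],  # no walking edges: would need e.g. a jetpack to reach
}


def find_path(mesh, start, goal):
    """Breadth-first search; returns a fewest-hops path, or None if unreachable."""
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in mesh[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


print(find_path(NAVMESH, "spawn", "roof"))
print(find_path(NAVMESH, "spawn", "island"))
```

In this toy mesh, supporting an ability such as a jetpack would mean adding extra edges (for example to "island") for every pair of regions the ability connects, which hints at why the abstract says such abilities increase the NavMesh's complexity.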

58. Predicting the Future is like Completing a Painting! [PDF]
Nadir Maaroufi, Mehdi Najib, Mohamed Bakhouya
Abstract: This article is an introductory work towards a larger research framework on Scientific Prediction. It is a mix of science and philosophy of science; we can therefore speak of an Experimental Philosophy of Science. As a first result, we introduce a new forecasting method based on image completion, named Forecasting Method by Image Inpainting (FM2I). In fact, time series forecasting is transformed into fully image- and signal-based processing procedures. After transforming time series data into its corresponding image, the problem of data forecasting becomes essentially an image inpainting problem, i.e., completing missing data in the image. An extensive experimental evaluation is conducted using a large dataset proposed by the well-known M3-competition. Results show that FM2I represents an efficient and robust tool for time series forecasting. It has achieved prominent results in terms of accuracy and outperforms the best M3 forecasting methods.

Note: The Chinese text is machine-translated. The cover image is a word cloud of the paper titles.


[arXiv papers] Computer Vision and Pattern Recognition 2020-11-11

Contents

Abstracts