Contents
3. ConvSequential-SLAM: A Sequence-based, Training-less Visual Place Recognition Technique for Changing Environments [PDF] Abstract
4. Arabic Handwritten Character Recognition based on Convolution Neural Networks and Support Vector Machine [PDF] Abstract
17. Driver Drowsiness Classification Based on Eye Blink and Head Movement Features Using the k-NN Algorithm [PDF] Abstract
20. Trainable Structure Tensors for Autonomous Baggage Threat Detection Under Extreme Occlusion [PDF] Abstract
21. Segmentation and Analysis of a Sketched Truss Frame Using Morphological Image Processing Techniques [PDF] Abstract
32. Concentrated Multi-Grained Multi-Attention Network for Video Based Person Re-Identification [PDF] Abstract
33. Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect [PDF] Abstract
39. Virtual Experience to Real World Application: Sidewalk Obstacle Avoidance Using Reinforcement Learning for Visually Impaired [PDF] Abstract
41. Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization [PDF] Abstract
45. MicroAnalyzer: A Python Tool for Automated Bacterial Analysis with Fluorescence Microscopy [PDF] Abstract
47. Enhancing a Neurocognitive Shared Visuomotor Model for Object Identification, Localization, and Grasping With Learning From Auxiliary Tasks [PDF] Abstract
51. Few-shot Object Detection with Self-adaptive Attention Network for Remote Sensing Images [PDF] Abstract
53. DT-Net: A novel network based on multi-directional integrated convolution and threshold convolution [PDF] Abstract
56. A light-weight method to foster the (Grad)CAM interpretability and explainability of classification networks [PDF] Abstract
60. SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation [PDF] Abstract
61. Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition [PDF] Abstract
71. Deep EvoGraphNet Architecture For Time-Dependent Brain Graph Data Synthesis From a Single Timepoint [PDF] Abstract
75. Cloud Removal for Remote Sensing Imagery via Spatial Attention Generative Adversarial Network [PDF] Abstract
79. VATLD: A Visual Analytics System to Assess, Understand and Improve Traffic Light Detection [PDF] Abstract
80. Classification and understanding of cloud structures via satellite images with EfficientUNet [PDF] Abstract
86. Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array [PDF] Abstract
88. Iterative Reconstruction for Low-Dose CT using Deep Gradient Priors of Generative Model [PDF] Abstract
90. Quantitative and Qualitative Evaluation of Explainable Deep Learning Methods for Ophthalmic Diagnosis [PDF] Abstract
91. Deep Learning-based Four-region Lung Segmentation in Chest Radiography for COVID-19 Diagnosis [PDF] Abstract
93. Physics-Guided Recurrent Graph Networks for Predicting Flow and Temperature in River Networks [PDF] Abstract
94. Deep Selective Combinatorial Embedding and Consistency Regularization for Light Field Super-resolution [PDF] Abstract
99. Democratizing Artificial Intelligence in Healthcare: A Study of Model Development Across Two Institutions Incorporating Transfer Learning [PDF] Abstract
Abstracts
1. Afro-MNIST: Synthetic generation of MNIST-style datasets for low-resource languages [PDF] Back to Contents
Daniel J Wu, Andrew C Yang, Vinay U Prabhu
Abstract: We present Afro-MNIST, a set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge`ez (Ethiopic), Vai, Osmanya, and N'Ko. These datasets serve as "drop-in" replacements for MNIST. We also describe and open-source a method for synthetic MNIST-style dataset generation from single examples of each digit. These datasets can be found at this https URL. We hope that MNIST-style datasets will be developed for other numeral systems, and that these datasets vitalize machine learning education in underrepresented nations in the research community.
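The single-exemplar synthesis idea can be illustrated with a short sketch: starting from one glyph per digit, new samples are produced by small random perturbations. The sketch below uses random affine jitter as the perturbation, which is only an assumption for illustration; the authors' open-sourced pipeline may differ.

```python
import numpy as np
from scipy.ndimage import affine_transform

def synthesize_digits(exemplar, n_samples=100, seed=0):
    """Generate MNIST-style variants of a single 28x28 exemplar glyph by
    applying small random affine perturbations (rotation, scale, shift).
    Illustrative approximation only, not the authors' released method."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_samples):
        theta = rng.uniform(-0.15, 0.15)              # small rotation (radians)
        scale = rng.uniform(0.9, 1.1)
        c, s = np.cos(theta) / scale, np.sin(theta) / scale
        matrix = np.array([[c, -s], [s, c]])          # output -> input mapping
        centre = np.array(exemplar.shape) / 2
        shift = rng.uniform(-2, 2, size=2)
        offset = centre - matrix @ centre + shift     # rotate about the centre
        out.append(affine_transform(exemplar, matrix, offset=offset))
    return np.stack(out)

digit = np.zeros((28, 28)); digit[6:22, 12:16] = 1.0  # toy stand-in glyph
print(synthesize_digits(digit).shape)                 # (100, 28, 28)
```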
2. A complete character recognition and transliteration technique for Devanagari script [PDF] Back to Contents
Jasmine Kaur, Vinay Kumar
Abstract: Transliteration involves the transformation of one script into another based on phonetic similarities between the characters of two distinct scripts. In this paper, we present a novel technique for automatic transliteration of Devanagari script using character recognition. One of the first tasks performed to isolate the constituent characters is segmentation. The line segmentation methodology in this manuscript addresses the case of overlapping lines. The character segmentation algorithm is designed to segment conjuncts and separate shadow characters. The presented shadow-character segmentation scheme employs a connected-component method to isolate each character while keeping the constituent characters intact. Statistical features, namely moments of different orders such as area, variance, skewness and kurtosis, along with structural features of the characters, are employed in a two-phase recognition process. After recognition, the constituent Devanagari characters are mapped to corresponding Roman letters such that the resulting Roman letters have pronunciations similar to the source characters.
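The moment features named in the abstract can be read as statistics of the foreground-pixel coordinate distribution of a binarized glyph; a minimal sketch under that interpretation (the paper's exact feature definitions may differ):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def moment_features(glyph):
    """Area plus per-axis variance, skewness and kurtosis of the
    foreground-pixel coordinates of a binary character image. One
    plausible reading of the paper's statistical features."""
    ys, xs = np.nonzero(glyph)
    feats = [float(glyph.sum())]                 # area (0th-order moment)
    for coords in (ys, xs):
        feats += [np.var(coords), skew(coords), kurtosis(coords)]
    return np.array(feats)

glyph = np.zeros((32, 32), dtype=np.uint8)
glyph[4:28, 14:18] = 1                           # toy vertical stroke
print(moment_features(glyph))                    # 7-dimensional feature vector
```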
3. ConvSequential-SLAM: A Sequence-based, Training-less Visual Place Recognition Technique for Changing Environments [PDF] Back to Contents
Mihnea-Alexandru Tomită, Mubariz Zaffar, Michael Milford, Klaus McDonald-Maier, Shoaib Ehsan
Abstract: Visual Place Recognition (VPR) is the ability to correctly recall a previously visited place under changing viewpoints and appearances. A large number of handcrafted and deep-learning-based VPR techniques exist, where the former suffer from appearance changes and the latter have significant computational needs. In this paper, we present a new handcrafted VPR technique that achieves state-of-the-art place matching performance under challenging conditions. Our technique combines the best of two existing training-less VPR techniques, SeqSLAM and CoHOG, which are robust to conditional and viewpoint changes, respectively. This blend, namely ConvSequential-SLAM, utilises sequential information and block-normalisation to handle appearance changes, while using regional-convolutional matching to achieve viewpoint-invariance. We analyse content-overlap between query frames to find a minimum sequence length, while also re-using the image entropy information for environment-based sequence length tuning. State-of-the-art performance is reported in comparison with 8 contemporary VPR techniques on 4 public datasets. Qualitative insights and an ablation study on sequence length are also provided.
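The core of sequence-based matching is to score a candidate reference position by accumulating per-frame descriptor distances over an aligned window, rather than matching single frames. A generic sketch of that scoring follows (SeqSLAM-style accumulation over a fixed window; ConvSequential-SLAM's regional-convolutional descriptors and entropy-based adaptive sequence length are not reproduced here):

```python
import numpy as np

def sequence_match(query_desc, ref_desc, seq_len=5):
    """Score each reference position by the summed cosine distance over an
    aligned window of seq_len frames and return the best match. Generic
    sequence-matching sketch, not the paper's exact formulation."""
    q = query_desc[-seq_len:]                                  # (L, D) window
    best_score, best_idx = np.inf, -1
    for i in range(ref_desc.shape[0] - seq_len + 1):
        r = ref_desc[i:i + seq_len]
        sims = np.sum(q * r, axis=1) / (
            np.linalg.norm(q, axis=1) * np.linalg.norm(r, axis=1) + 1e-8)
        score = np.sum(1.0 - sims)                             # window distance
        if score < best_score:
            best_score, best_idx = score, i + seq_len - 1      # matched frame
    return best_idx, best_score

rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 128))
query = ref[40:50] + 0.05 * rng.normal(size=(10, 128))         # noisy revisit
print(sequence_match(query, ref))                              # index near 49
```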
4. Arabic Handwritten Character Recognition based on Convolution Neural Networks and Support Vector Machine [PDF] Back to Contents
Mahmoud Shams, Amira. A. Elsonbaty, Wael. Z. ElSawy
Abstract: Recognition of Arabic characters is essential for the natural language processing and computer vision fields, so the ability to recognize and classify handwritten Arabic letters and characters is essentially required. In this paper, we present an algorithm for recognizing Arabic letters and characters based on deep convolutional neural networks (DCNN) and support vector machines (SVM). The paper addresses the problem of recognizing handwritten Arabic characters by determining the similarity between the input templates and pre-stored templates using both a fully connected DCNN and a dropout SVM. Furthermore, the paper determines the correct classification rate (CRR), which depends on the accuracy of the correctly classified templates of the recognized handwritten Arabic characters. Moreover, we determine the error classification rate (ECR). The experimental results of this work indicate the ability of the proposed algorithm to recognize, identify, and verify input handwritten Arabic characters. Furthermore, the proposed system identifies similar Arabic characters using a clustering algorithm based on the K-means approach to handle the problem of multi-stroke Arabic characters. A comparative evaluation is presented: the system accuracy reached a 95.07% CRR with a 4.93% ECR, compared with the state of the art.
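The two-stage pipeline (CNN features, SVM classifier) can be sketched as follows; the toy network below is only a structural placeholder, since the paper's DCNN architecture and training details are not specified here:

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Placeholder CNN feature extractor; the paper's DCNN is trained on
# Arabic handwriting and is certainly deeper than this sketch.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
)

def extract_features(images):
    """images: (N, 1, 32, 32) tensor -> (N, 128) feature matrix."""
    with torch.no_grad():
        return cnn(images).numpy()

# Random toy data standing in for labeled Arabic character images.
X_train = torch.randn(64, 1, 32, 32)
y_train = torch.randint(0, 28, (64,)).numpy()
svm = SVC(kernel="rbf").fit(extract_features(X_train), y_train)
print(svm.predict(extract_features(torch.randn(4, 1, 32, 32))))
```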
5. Learning to Detect Objects with a 1 Megapixel Event Camera [PDF] Back to Contents
Etienne Perot, Pierre de Tournemire, Davide Nitti, Jonathan Masci, Amos Sironi
Abstract: Event cameras encode visual information with high temporal precision, low data-rate, and high-dynamic range. Thanks to these characteristics, event cameras are particularly suited for scenarios with high motion, challenging lighting conditions and requiring low latency. However, due to the novelty of the field, the performance of event-based systems on many vision tasks is still lower compared to conventional frame-based solutions. The main reasons for this performance gap are: the lower spatial resolution of event sensors, compared to frame cameras; the lack of large-scale training datasets; the absence of well established deep learning architectures for event-based processing. In this paper, we address all these problems in the context of an event-based object detection task. First, we publicly release the first high-resolution large-scale dataset for object detection. The dataset contains more than 14 hours of recordings of a 1 megapixel event camera, in automotive scenarios, together with 25M bounding boxes of cars, pedestrians, and two-wheelers, labeled at high frequency. Second, we introduce a novel recurrent architecture for event-based detection and a temporal consistency loss for better-behaved training. The ability to compactly represent the sequence of events into the internal memory of the model is essential to achieve high accuracy. Our model outperforms feed-forward event-based architectures by a large margin. Moreover, our method does not require any reconstruction of intensity images from events, showing that training directly from raw events is possible, more efficient, and more accurate than passing through an intermediate intensity image. Experiments on the dataset introduced in this work, for which events and gray level images are available, show performance on par with that of highly tuned and studied frame-based detectors.
6. A Study on Lip Localization Techniques used for Lip reading from a Video [PDF] Back to Contents
S.D. Lalitha, K.K. Thyagharajan
Abstract: In this paper, some of the different techniques used to localize the lips in the face are discussed and compared along with their processing steps. Lip localization is the basic step needed to read the lips for extracting visual information from the video input. The techniques can be applied to asymmetric lips as well as to mouths with visible teeth or tongue, and mouths with a moustache. The process of lip reading generally uses the following steps: initially locating the lips in the first frame of the video input, then tracking the lips in the following frames using the pixel points resulting from the initial step, and finally converting the tracked lip model to its corresponding matched letter to give the visual information. A new proposal is also initiated from the discussed techniques. Lip reading is useful for automatic speech recognition when the audio is absent or weak, with or without noise, in communication systems. Human-computer communication will also require speech recognition.
7. CAT STREET: Chronicle Archive of Tokyo Street-fashion [PDF] Back to Contents
Satoshi Takahashi, Keiko Yamaguchi, Asuka Watanabe
Abstract: The analysis of daily life fashion trends can help us understand our societies and human cultures profoundly. However, no appropriate database exists that includes images illustrating what people wore in their daily lives over an extended period. In this study, we propose a new fashion image archive, Chronicle Archive of Tokyo Street-fashion (CAT STREET), to shed light on daily life fashion trends. CAT STREET includes images showing what people wore in their daily lives during the period 1970-2017, and these images contain timestamps and street location annotations. This novel database enables us to observe long-term daily life fashion trends using quantitative methods. To evaluate the potential of our database, we corroborated the rules-of-thumb for two fashion trend phenomena, namely how economic conditions affect fashion style share in the long term and how fashion styles emerge in the street and diffuse from street to street. Our findings show that the Conservative style trend, a type of luxury fashion style, is affected by economic conditions. We also introduce four cases of how fashion styles emerge in the street and diffuse from street to street in fashion-conscious streets in Tokyo. Our study demonstrates CAT STREET's potential to promote understanding of societies and human cultures through quantitative analysis of daily life fashion trends.
8. RS-MetaNet: Deep meta metric learning for few-shot remote sensing scene classification [PDF] Back to Contents
Haifeng Li, Zhenqi Cui, Zhiqing Zhu, Li Chen, Jiawei Zhu, Haozhe Huang, Chao Tao
Abstract: Training a modern deep neural network on massive labeled samples is the main paradigm for solving the scene classification problem in remote sensing, but learning from only a few data points remains a challenge. Existing methods for few-shot remote sensing scene classification operate in a sample-level manner, resulting in easy overfitting of learned features to individual samples and inadequate generalization of learned category segmentation surfaces. To solve this problem, learning should be organized at the task level rather than the sample level. Learning on tasks sampled from a task family can help tune learning algorithms to perform well on new tasks sampled from that family. Therefore, we propose a simple but effective method, called RS-MetaNet, to resolve the issues related to few-shot remote sensing scene classification in the real world. RS-MetaNet raises the level of learning from the sample to the task by organizing training in a meta way, and it learns to learn a metric space that can well classify remote sensing scenes from a series of tasks. We also propose a new loss function, called Balance Loss, which maximizes the generalization ability of the model to new samples by maximizing the distance between different categories, providing the scenes in different categories with better linear segmentation planes while ensuring model fit. The experimental results on three open and challenging remote sensing datasets, UCMerced_LandUse, NWPU-RESISC45, and Aerial Image Data, demonstrate that our proposed RS-MetaNet method achieves state-of-the-art results in cases where there are only 1-20 labeled samples.
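The abstract describes the Balance Loss as maximizing the distance between categories while ensuring model fit. A speculative sketch of that idea, combining cross-entropy with a hinge on pairwise class-prototype distances (this is my own reconstruction, not the published formulation):

```python
import torch
import torch.nn.functional as F

def balance_style_loss(embeddings, logits, labels, margin=1.0, alpha=0.1):
    """Cross-entropy for model fit plus a term pushing per-class embedding
    means (prototypes) apart. Speculative reconstruction of the Balance
    Loss idea, not the paper's exact definition."""
    ce = F.cross_entropy(logits, labels)
    protos = torch.stack([embeddings[labels == c].mean(0)
                          for c in labels.unique()])
    dists = torch.cdist(protos, protos)                      # pairwise distances
    off_diag = dists[~torch.eye(len(protos), dtype=torch.bool)]
    separation = F.relu(margin - off_diag).mean()            # penalize closeness
    return ce + alpha * separation

emb = torch.randn(32, 64, requires_grad=True)
logits = torch.randn(32, 5, requires_grad=True)
labels = torch.randint(0, 5, (32,))
print(balance_style_loss(emb, logits, labels))
```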
9. Learning Category- and Instance-Aware Pixel Embedding for Fast Panoptic Segmentation [PDF] Back to Contents
Naiyu Gao, Yanhu Shan, Xin Zhao, Kaiqi Huang
Abstract: Panoptic segmentation (PS) is a complex scene understanding task that requires providing high-quality segmentation for both thing objects and stuff regions. Previous methods handle these two classes with semantic and instance segmentation modules separately, followed by heuristic fusion or additional modules to resolve the conflicts between the two outputs. This work simplifies this pipeline of PS by consistently modeling the two classes with a novel PS framework, which extends a detection model with an extra module to predict category- and instance-aware pixel embedding (CIAE). CIAE is a novel pixel-wise embedding feature that encodes both semantic-classification and instance-distinction information. At inference, PS results are simply derived by assigning each pixel to a detected instance or a stuff class according to the learned embedding. Our method not only shows fast inference speed but is also the first one-stage method to achieve performance comparable to two-stage methods on the challenging COCO benchmark.
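The inference rule described in the abstract, assigning each pixel to the nearest learned embedding among detected instances and stuff classes, can be sketched directly (shapes and distance metric are assumptions for illustration):

```python
import numpy as np

def assign_pixels(pixel_emb, instance_embs, stuff_embs):
    """pixel_emb: (H, W, D) per-pixel embeddings; instance_embs: (I, D)
    embeddings of detected 'thing' instances; stuff_embs: (S, D) stuff-class
    embeddings. Assigns every pixel to the nearest reference embedding;
    ids < I denote instances, the rest stuff. Shapes assumed for illustration."""
    H, W, D = pixel_emb.shape
    refs = np.concatenate([instance_embs, stuff_embs])       # (I+S, D)
    flat = pixel_emb.reshape(-1, D)
    dists = np.linalg.norm(flat[:, None, :] - refs[None, :, :], axis=-1)
    return dists.argmin(axis=1).reshape(H, W)

rng = np.random.default_rng(0)
seg = assign_pixels(rng.normal(size=(16, 16, 8)),
                    rng.normal(size=(3, 8)), rng.normal(size=(2, 8)))
print(np.unique(seg))                                        # labels in 0..4
```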
10. Weakly Supervised Deep Functional Map for Shape Matching [PDF] Back to Contents
Abhishek Sharma, Maks Ovsjanikov
Abstract: A variety of deep functional maps have been proposed recently, from fully supervised to totally unsupervised, with a range of loss functions as well as different regularization terms. However, it is still not clear what the minimum ingredients of a deep functional map pipeline are, and whether such ingredients unify or generalize all recent work on deep functional maps. We empirically show the minimum components for obtaining state-of-the-art results with different loss functions, supervised as well as unsupervised. Furthermore, we propose a novel framework designed for both full-to-full as well as partial-to-full shape matching that achieves state-of-the-art results on several benchmark datasets, outperforming even the fully supervised methods by a significant margin. Our code is publicly available at this https URL
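For context, the core computation shared by functional-map pipelines is a least-squares solve for the map matrix C between spectral descriptor coefficients; the sketch below shows that textbook step (the paper's contribution lies in the losses and training setup built on top of it):

```python
import numpy as np

def fit_functional_map(A, B):
    """Solve min_C ||C A - B||_F, where A (k1, n) and B (k2, n) hold n
    descriptor functions expressed in the source and target Laplace-Beltrami
    eigenbases. Textbook functional-map step, not this paper's full method."""
    # C A = B  <=>  A^T C^T = B^T, a standard least-squares problem.
    Ct, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return Ct.T                                              # (k2, k1)

rng = np.random.default_rng(0)
C_true = rng.normal(size=(30, 30))
A = rng.normal(size=(30, 200))
B = C_true @ A                                               # consistent data
print(np.allclose(fit_functional_map(A, B), C_true))         # True
```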
11. Addressing Class Imbalance in Scene Graph Parsing by Learning to Contrast and Score [PDF] Back to Contents
He Huang, Shunta Saito, Yuta Kikuchi, Eiichi Matsumoto, Wei Tang, Philip S. Yu
Abstract: Scene graph parsing aims to detect objects in an image scene and recognize their relations. Recent approaches have achieved high average scores on some popular benchmarks, but fail in detecting rare relations, as the highly long-tailed distribution of data biases the learning towards frequent labels. Motivated by the fact that detecting these rare relations can be critical in real-world applications, this paper introduces a novel integrated framework of classification and ranking to resolve the class imbalance problem in scene graph parsing. Specifically, we design a new Contrasting Cross-Entropy loss, which promotes the detection of rare relations by suppressing incorrect frequent ones. Furthermore, we propose a novel scoring module, termed as Scorer, which learns to rank the relations based on the image features and relation features to improve the recall of predictions. Our framework is simple and effective, and can be incorporated into current scene graph models. Experimental results show that the proposed approach improves the current state-of-the-art methods, with a clear advantage of detecting rare relations.
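A speculative sketch of the "suppress incorrect frequent classes" idea behind the Contrasting Cross-Entropy: standard cross-entropy plus a frequency-weighted penalty on the probability mass of wrong classes (my own reading, not the published loss):

```python
import torch
import torch.nn.functional as F

def contrasting_ce(logits, labels, class_freq, beta=0.5):
    """Cross-entropy plus a contrast term penalizing probability assigned
    to incorrect classes, weighted by class frequency so frequent wrong
    predictions are suppressed hardest. Speculative reconstruction only."""
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=1)
    weights = class_freq / class_freq.sum()                  # frequent => larger
    wrong_mass = (probs * weights).scatter(1, labels[:, None], 0.0)
    return ce + beta * wrong_mass.sum(dim=1).mean()

logits = torch.randn(8, 4, requires_grad=True)
labels = torch.randint(0, 4, (8,))
freq = torch.tensor([1000., 500., 50., 5.])                  # long-tailed counts
print(contrasting_ce(logits, labels, freq))
```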
12. EvolGAN: Evolutionary Generative Adversarial Networks [PDF] Back to Contents
Baptiste Roziere, Fabien Teytaud, Vlad Hosu, Hanhe Lin, Jeremy Rapin, Mariia Zameshina, Olivier Teytaud
Abstract: We propose to use a quality estimator and evolutionary methods to search the latent space of generative adversarial networks trained on small, difficult datasets, or both. The new method leads to the generation of significantly higher quality images while preserving the original generator's diversity. Human raters preferred an image from the new version with frequency 83.7% for Cats, 74% for FashionGen, 70.4% for Horses, and 69.2% for Artworks, and minor improvements for the already excellent GANs for faces. This approach applies to any quality scorer and GAN generator.
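The search itself can be as simple as a (1+λ) evolution strategy in the latent space, keeping whichever mutated latent the quality estimator scores highest. A toy sketch with stubbed generator and scorer (the paper uses trained GANs and a learned quality estimator):

```python
import numpy as np

def evolve_latent(generate, score, dim=64, pop=8, iters=50, sigma=0.2, seed=0):
    """(1+lambda) evolution strategy in a generator's latent space: mutate
    the current latent, keep the best-scoring child if it beats the parent.
    `generate` and `score` are stubs for a trained GAN and quality estimator."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=dim)
    best = score(generate(z))
    for _ in range(iters):
        children = z + sigma * rng.normal(size=(pop, dim))
        scores = [score(generate(c)) for c in children]
        i = int(np.argmax(scores))
        if scores[i] > best:                       # elitist selection
            z, best = children[i], scores[i]
    return z, best

generate = lambda z: z                              # stub "generator"
score = lambda img: -float(np.sum(img ** 2))        # stub "quality estimator"
z_opt, q = evolve_latent(generate, score)
print(round(q, 4))                                  # climbs toward 0
```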
13. Cuid: A new study of perceived image quality and its subjective assessment [PDF] Back to Contents
Lucie Lévêque, Ji Yang, Xiaohan Yang, Pengfei Guo, Kenneth Dasalla, Leida Li, Yingying Wu, Hantao Liu
Abstract: Research on image quality assessment (IQA) remains limited mainly due to our incomplete knowledge about human visual perception. Existing IQA algorithms have been designed or trained with insufficient subjective data with a small degree of stimulus variability. This has led to challenges for those algorithms to handle complexity and diversity of real-world digital content. Perceptual evidence from human subjects serves as a grounding for the development of advanced IQA algorithms. It is thus critical to acquire reliable subjective data with controlled perception experiments that faithfully reflect human behavioural responses to distortions in visual signals. In this paper, we present a new study of image quality perception where subjective ratings were collected in a controlled lab environment. We investigate how quality perception is affected by a combination of different categories of images and different types and levels of distortions. The database will be made publicly available to facilitate calibration and validation of IQA algorithms.
14. The Elements of End-to-end Deep Face Recognition: A Survey of Recent Advances [PDF] Back to Contents
Hang Du, Hailin Shi, Dan Zeng, Tao Mei
Abstract: Face recognition is one of the most fundamental and long-standing topics in the computer vision community. With the recent developments of deep convolutional neural networks and large-scale datasets, deep face recognition has made remarkable progress and been widely used in real-world applications. Given a natural image or video frame as input, an end-to-end deep face recognition system outputs the face feature for recognition. To achieve this, the whole system is generally built with three key elements: face detection, face preprocessing, and face representation. Face detection locates faces in the image or frame. Then, face preprocessing calibrates the faces to a canonical view and crops them to a normalized pixel size. Finally, in the face representation stage, discriminative features are extracted from the preprocessed faces for recognition. All three elements are implemented with deep convolutional neural networks. In this paper, we present a comprehensive survey of the recent advances in every element of end-to-end deep face recognition, since thriving deep learning techniques have greatly improved their capability. To start with, we introduce an overview of end-to-end deep face recognition, which, as mentioned above, includes face detection, face preprocessing, and face representation. Then, we review the deep learning based advances of each element, respectively, covering many aspects such as up-to-date algorithm designs, evaluation metrics, datasets, performance comparisons, existing challenges, and promising directions for future research. We hope this survey brings helpful insights for a better understanding of the big picture of end-to-end face recognition and for deeper exploration in a systematic way.
15. Multi-scale Receptive Fields Graph Attention Network for Point Cloud Classification [PDF] Back to Contents
Xi-An Li, Lei Zhang, Li-Yan Wang, Jian Lu
Abstract: Understanding the implications of a point cloud remains challenging for classification and segmentation, owing to the irregular and sparse structure of point clouds. The PointNet architecture is ground-breaking work for point clouds: it can efficiently learn shape features directly on unordered 3D point clouds and has achieved favorable performance. However, this model fails to consider the fine-grained semantic information of the local structure of a point cloud. Subsequently, many valuable works were proposed to enhance the performance of PointNet by means of semantic features of local patches of the point cloud. In this paper, a multi-scale receptive fields graph attention network (named MRFGAT) for point cloud classification is proposed. By focusing on the local fine features of the point cloud and applying multiple attention modules based on channel affinity, the learned feature map of our network can well capture the abundant feature information of the point cloud. The proposed MRFGAT architecture is tested on the ModelNet10 and ModelNet40 datasets, and the results show it achieves state-of-the-art performance on shape classification tasks.
16. Learning to Adapt Multi-View Stereo by Self-Supervision [PDF] Back to Contents
Arijit Mallick, Jörg Stückler, Hendrik Lensch
Abstract: 3D scene reconstruction from multiple views is an important classical problem in computer vision. Deep learning based approaches have recently demonstrated impressive reconstruction results. When training such models, self-supervised methods are favourable since they do not rely on ground truth data which would be needed for supervised training and is often difficult to obtain. Moreover, learned multi-view stereo reconstruction is prone to environment changes and should robustly generalise to different domains. We propose an adaptive learning approach for multi-view stereo which trains a deep neural network for improved adaptability to new target domains. We use model-agnostic meta-learning (MAML) to train base parameters which, in turn, are adapted for multi-view stereo on new domains through self-supervised training. Our evaluations demonstrate that the proposed adaptation method is effective in learning self-supervised multi-view stereo reconstruction in new domains.
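The MAML mechanics referenced in the abstract, an inner-loop adaptation per task followed by an outer update of the base parameters, can be sketched on a toy regression problem (the paper instead plugs in an MVS network with self-supervised losses):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(1, 5, requires_grad=True)           # base (meta) parameters
b = torch.zeros(1, requires_grad=True)

def forward(W, b, x):
    return x @ W.t() + b

def maml_outer_step(tasks, inner_lr=0.05, outer_lr=0.01):
    """One second-order MAML step over tasks of (support, query) batches:
    adapt on support, accumulate query loss, update base parameters."""
    meta_loss = 0.0
    for (xs, ys), (xq, yq) in tasks:
        loss_s = F.mse_loss(forward(W, b, xs), ys)
        gW, gb = torch.autograd.grad(loss_s, (W, b), create_graph=True)
        W_ad, b_ad = W - inner_lr * gW, b - inner_lr * gb    # inner adaptation
        meta_loss = meta_loss + F.mse_loss(forward(W_ad, b_ad, xq), yq)
    gW, gb = torch.autograd.grad(meta_loss, (W, b))
    with torch.no_grad():                                    # outer update
        W.sub_(outer_lr * gW)
        b.sub_(outer_lr * gb)
    return meta_loss.item()

def make_task(slope):                                # toy task family
    x = torch.randn(16, 5)
    y = slope * x.sum(dim=1, keepdim=True)
    return (x[:8], y[:8]), (x[8:], y[8:])

for _ in range(100):
    loss = maml_outer_step([make_task(s) for s in (0.5, 1.0, 2.0)])
print(round(loss, 4))                                # decreases over steps
```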
17. Driver Drowsiness Classification Based on Eye Blink and Head Movement Features Using the k-NN Algorithm [PDF] Back to Contents
Mariella Dreissig, Mohamed Hedi Baccour, Tim Schaeck, Enkelejda Kasneci
Abstract: Modern advanced driver-assistance systems analyze the driving performance to gather information about the driver's state. Such systems are able, for example, to detect signs of drowsiness by evaluating the steering or lane keeping behavior and to alert the driver when the drowsiness state reaches a critical level. However, these kinds of systems have no access to direct cues about the driver's state. Hence, the aim of this work is to extend the driver drowsiness detection in vehicles using signals of a driver monitoring camera. For this purpose, 35 features related to the driver's eye blinking behavior and head movements are extracted in driving simulator experiments. Based on that large dataset, we developed and evaluated a feature selection method based on the k-Nearest Neighbor algorithm for the driver's state classification. A concluding analysis of the best performing feature sets yields valuable insights about the influence of drowsiness on the driver's blink behavior and head movements. These findings will help in the future development of robust and reliable driver drowsiness monitoring systems to prevent fatigue-induced accidents.
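As an illustration of the kind of pipeline described, the sketch below runs greedy forward feature selection with a k-NN classifier over a synthetic stand-in for the 35 blink/head-movement features; the data, class labels, and hyperparameters are assumptions, not the authors' setup.

```python
# Greedy forward feature selection with a k-NN classifier; the data is
# a synthetic stand-in for the 35 extracted driver-state features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 35))        # 35 blink/head-movement features
y = rng.integers(0, 3, size=200)      # e.g. alert / drowsy / very drowsy

def forward_select(X, y, k=5, max_feats=10):
    """Repeatedly add the feature that most improves CV accuracy."""
    chosen, best = [], 0.0
    while len(chosen) < max_feats:
        scores = {}
        for f in range(X.shape[1]):
            if f not in chosen:
                clf = KNeighborsClassifier(n_neighbors=k)
                scores[f] = cross_val_score(clf, X[:, chosen + [f]], y, cv=5).mean()
        f, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best:
            break                     # no remaining feature helps
        chosen.append(f)
        best = s
    return chosen, best

features, acc = forward_select(X, y)
```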
18. Texture Memory-Augmented Deep Patch-Based Image Inpainting [PDF] 返回目录
Rui Xu, Minghao Guo, Jiaqi Wang, Xiaoxiao Li, Bolei Zhou, Chen Change Loy
Abstract: Patch-based methods and deep networks have been employed to tackle the image inpainting problem, each with their own strengths and weaknesses. Patch-based methods are capable of restoring a missing region with high-quality texture by searching nearest-neighbor patches from the unmasked regions. However, these methods bring problematic contents when recovering large missing regions. Deep networks, on the other hand, show promising results in completing large regions. Nonetheless, the results often lack faithful and sharp details that resemble the surrounding area. By bringing together the best of both paradigms, we propose a new deep inpainting framework where texture generation is guided by a texture memory of patch samples extracted from unmasked regions. The framework has a novel design that allows texture memory retrieval to be trained end-to-end with the deep inpainting network. In addition, we introduce a patch distribution loss to encourage high-quality patch synthesis. The proposed method shows superior performance both qualitatively and quantitatively on three challenging image benchmarks, i.e., the Places, CelebA-HQ, and Paris Street-View datasets.
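The classical ingredient the framework builds on is nearest-neighbor patch retrieval from unmasked regions. The brute-force L2 search below is only a sketch of that idea; the paper's texture-memory retrieval is learned end-to-end.

```python
# Brute-force nearest-neighbor patch retrieval from unmasked regions.
import numpy as np

def extract_patches(img, mask, p=7):
    """Collect every fully-unmasked p x p patch as the texture memory
    (mask == 1 marks known pixels)."""
    H, W = img.shape
    patches = []
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            if mask[i:i + p, j:j + p].all():
                patches.append(img[i:i + p, j:j + p])
    return np.stack(patches)

def best_patch(query, memory):
    """Return the memory patch closest to `query` in L2 distance."""
    d = ((memory - query) ** 2).sum(axis=(1, 2))
    return memory[np.argmin(d)]
```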
19. Video Face Recognition System: RetinaFace-mnet-faster and Secondary Search [PDF] 返回目录
Qian Li, Nan Guo, Xiaochun Ye, Dongrui Fan, Zhimin Tang
Abstract: Face recognition is widely used in real-world scenes. However, different visual environments require different methods, and face recognition remains difficult in complex environments. Therefore, this paper focuses on complex faces in video. First, we design an image pre-processing module for fuzzy scenes or under-exposed faces to enhance images. Our experimental results demonstrate that effective image pre-processing improves accuracy by 0.11%, 0.2% and 1.4% on LFW, WIDER FACE and our datasets, respectively. Second, we propose RetinaFace-mnet-faster for detection and a confidence threshold specification for face recognition, reducing the miss rate. Our experimental results show that RetinaFace-mnet-faster improves speed by 16.7% and 70.2% for 640*480 resolution on the Tesla P40 and in single-thread mode, respectively. Finally, we design a secondary search mechanism with HNSW to improve performance. Our method is suitable for large-scale datasets, and experimental results show that it is 82% faster than brute-force retrieval for single-frame detection.
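For the secondary search, HNSW is a standard approximate-nearest-neighbor index. The sketch below uses the hnswlib API on placeholder face embeddings to show a retrieve-then-verify pattern; the embedding size, thresholds, and acceptance rule are assumptions, not the paper's implementation.

```python
# Approximate retrieval with HNSW, then a confidence-threshold check.
import numpy as np
import hnswlib

dim, n = 128, 10_000
gallery = np.random.default_rng(0).normal(size=(n, dim)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(gallery, np.arange(n))
index.set_ef(64)                       # search-time accuracy/speed knob

def identify(query, threshold=0.35):
    labels, dists = index.knn_query(query[None, :], k=5)
    # Secondary step: accept only matches under the cosine-distance threshold.
    hits = labels[0][dists[0] < threshold]
    return hits if hits.size else None
```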
20. Trainable Structure Tensors for Autonomous Baggage Threat Detection Under Extreme Occlusion [PDF] 返回目录
Taimur Hassan, Samet Akcay, Mohammed Bennamoun, Salman Khan, Naoufel Werghi
Abstract: Detecting baggage threats is one of the most difficult tasks, even for expert officers. Many researchers have developed computer-aided screening systems to recognize these threats from the baggage X-ray scans. However, all of these frameworks are limited in recognizing contraband items under extreme occlusion. This paper presents a novel instance detector that utilizes a trainable structure tensor scheme to highlight the contours of the occluded and cluttered contraband items (obtained from multiple predominant orientations) while simultaneously suppressing all the other baggage content within the scan, leading to robust detection. The proposed framework has been rigorously tested on four publicly available X-ray datasets where it outperforms the state-of-the-art frameworks in terms of mean average precision scores. Furthermore, to the best of our knowledge, it is the only framework that has been rigorously tested on combined grayscale and colored scans obtained from four different types of X-ray scanners.
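For reference, the classical (non-trainable) structure tensor that the proposed scheme generalizes is the window-smoothed outer product of image gradients. A minimal NumPy/SciPy sketch, with a box filter standing in for the usual Gaussian:

```python
# Classical structure tensor: window-smoothed outer product of gradients.
import numpy as np
from scipy.ndimage import convolve

def structure_tensor(img, win=5):
    gy, gx = np.gradient(img.astype(float))
    k = np.ones((win, win)) / win ** 2          # box filter for brevity
    smooth = lambda a: convolve(a, k, mode="nearest")
    Jxx, Jxy, Jyy = smooth(gx * gx), smooth(gx * gy), smooth(gy * gy)
    # Strong contours: large dominant eigenvalue of [[Jxx, Jxy], [Jxy, Jyy]];
    # the corresponding eigenvector gives the local orientation.
    return Jxx, Jxy, Jyy
```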
21. Segmentation and Analysis of a Sketched Truss Frame Using Morphological Image Processing Techniques [PDF] 返回目录
Mirsalar Kamari, Oguz Gunes
Abstract: The development of computational tools to analyze and assess building capacities has had a major impact on civil engineering. Interaction with structural software packages is becoming easier, and modeling tools are becoming smarter by automating the user's role during their interaction with the software. One of the most difficult and time-consuming steps involved in structural modeling is defining the geometry of the structure for analysis. This paper is dedicated to the development of a methodology to automate the analysis of a hand-sketched or computer-generated truss frame drawn on a piece of paper. First, we focus on segmentation methodologies for hand-sketched truss components using morphological image processing techniques, and then we provide a real-time analysis of the truss. We visualize and augment the results on the input image to facilitate public understanding of the truss geometry and internal forces. MATLAB is used as the programming language for the image processing, and the truss is analyzed using the Sap2000 API, integrated with MATLAB, to provide a convenient structural analysis. This paper highlights the potential of automating structural analysis using image processing to quickly assess the efficiency of structural systems. Further development of this framework is likely to revolutionize the way that structures are modeled and analyzed.
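The sketch below illustrates the kind of morphological operations such a pipeline typically relies on: binarizing the drawing and separating members with oriented structuring elements. The SciPy function names are real; the thresholds, element sizes, and overall flow are assumptions (the paper itself works in MATLAB).

```python
# Oriented morphological openings to separate truss members (assumed flow).
import numpy as np
from scipy import ndimage

def segment_members(gray):
    binary = gray < 128                              # dark ink on white paper
    binary = ndimage.binary_closing(binary)          # bridge small pen gaps
    horiz = ndimage.binary_opening(binary, structure=np.ones((1, 15)))
    vert = ndimage.binary_opening(binary, structure=np.ones((15, 1)))
    diag = binary & ~horiz & ~vert                   # leftover oblique members
    labels, n = ndimage.label(diag)                  # one label per member
    return horiz, vert, labels, n
```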
22. Interpretable Detail-Fidelity Attention Network for Single Image Super-Resolution [PDF] 返回目录
Yuanfei Huang, Jie Li, Xinbo Gao, Yanting Hu, Wen Lu
Abstract: Benefiting from the strong capabilities of deep CNNs for feature representation and nonlinear mapping, deep-learning-based methods have achieved excellent performance in single image super-resolution. However, most existing SR methods depend on the high capacity of networks initially designed for visual recognition, and rarely consider the original intention of super-resolution: detail fidelity. In pursuing this intention, there are two challenging issues to be solved: (1) learning appropriate operators which are adaptive to the diverse characteristics of smooth regions and details; (2) improving the ability of the model to preserve low-frequency smooth regions and reconstruct high-frequency details. To solve them, we propose a purposeful and interpretable detail-fidelity attention network to progressively process these smooth regions and details in a divide-and-conquer manner, which is a novel and specific prospect of image super-resolution aimed at improving detail fidelity, instead of blindly designing or employing deep CNN architectures merely for feature representation in local receptive fields. In particular, we propose a Hessian filtering for interpretable feature representation that is well-suited to detail inference, along with a dilated encoder-decoder and a distribution alignment cell to improve the inferred Hessian features in a morphological and a statistical manner, respectively. Extensive experiments demonstrate that the proposed methods achieve superior performance over the state-of-the-art methods both quantitatively and qualitatively. Code is available at this https URL.
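As a rough illustration of plain (non-learned) Hessian filtering, second-order finite differences give a classic detail/blob response; the paper's interpretable detail inference builds on this intuition:

```python
# Determinant-of-Hessian response via second-order finite differences.
import numpy as np

def hessian_response(img):
    gy, gx = np.gradient(img.astype(float))
    gyy, gyx = np.gradient(gy)
    gxy, gxx = np.gradient(gx)
    return gxx * gyy - gxy * gyx   # large magnitude on fine detail
```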
23. RRPN++: Guidance Towards More Accurate Scene Text Detection [PDF] 返回目录
Jianqi Ma
Abstract: RRPN is among the outstanding scene text detection approaches, but its manually-designed anchors and coarse proposal refinement leave its performance still far from perfect. In this paper, we propose RRPN++ to exploit the potential of the RRPN-based model through several improvements. Based on RRPN, we propose the Anchor-free Pyramid Proposal Networks (APPN) to generate first-stage proposals, adopting an anchor-free design to reduce the number of proposals and accelerate inference. In our second stage, both the detection branch and the recognition branch are incorporated to perform multi-task learning. In the inference stage, the detection branch outputs the proposal refinement and the recognition branch predicts the transcript of the refined text region. Further, the recognition branch also helps rescore the proposals and eliminate false positives through a joint filtering strategy. With these enhancements, we boost the detection results by $6\%$ in F-measure on ICDAR2015 compared to RRPN. Experiments conducted on other benchmarks also illustrate the superior performance and efficiency of our model.
24. Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation [PDF] 返回目录
Jiannan Xiang, Xin Eric Wang, William Yang Wang
Abstract: Vision-and-Language Navigation (VLN) is a natural language grounding task where an agent learns to follow language instructions and navigate to specified destinations in real-world environments. A key challenge is to recognize and stop at the correct location, especially for complicated outdoor environments. Existing methods treat the STOP action equally as other actions, which results in undesirable behaviors that the agent often fails to stop at the destination even though it might be on the right path. Therefore, we propose Learning to Stop (L2Stop), a simple yet effective policy module that differentiates STOP and other actions. Our approach achieves the new state of the art on a challenging urban VLN dataset Touchdown, outperforming the baseline by 6.89% (absolute improvement) on Success weighted by Edit Distance (SED).
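One plausible reading of a policy module that differentiates STOP from the other actions, sketched below as a two-headed network: a standard action head plus a dedicated stop classifier. This is an illustration of the idea, not the authors' architecture; all sizes and names are hypothetical.

```python
# Two heads: one for navigation actions, one deciding whether to stop.
import torch
import torch.nn as nn

class StopAwarePolicy(nn.Module):
    def __init__(self, d=256, n_actions=4):
        super().__init__()
        self.action_head = nn.Linear(d, n_actions)  # forward / turn / ...
        self.stop_head = nn.Linear(d, 1)            # dedicated STOP decision

    def forward(self, state):
        return self.action_head(state), torch.sigmoid(self.stop_head(state))

policy = StopAwarePolicy()
action_logits, p_stop = policy(torch.randn(1, 256))
```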
25. NITI: Training Integer Neural Networks Using Integer-only Arithmetic [PDF] 返回目录
Maolin Wang, Seyedramin Rasoulinezhad, Philip H.W. Leong, Hayden K.H. So
Abstract: While integer arithmetic has been widely adopted for improved performance in deep quantized neural network inference, training remains a task primarily executed using floating point arithmetic. This is because both high dynamic range and numerical accuracy are central to the success of most modern training algorithms. However, due to their potential for computational, storage and energy advantages in hardware accelerators, neural network training methods that can be implemented with low-precision integer-only arithmetic remain an active research challenge. In this paper, we present NITI, an efficient deep neural network training framework that stores all parameters and intermediate values as integers, and computes exclusively with integer arithmetic. A pseudo stochastic rounding scheme that eliminates the need for external random number generation is proposed to facilitate conversion from wider intermediate results to low-precision storage. Furthermore, a cross-entropy loss backpropagation scheme computed with integer-only arithmetic is proposed. A proof-of-concept open-source software implementation of NITI that utilizes native 8-bit integer operations in modern GPUs to achieve end-to-end training is presented. When compared with an equivalent training setup implemented with floating point storage and arithmetic, NITI achieves negligible accuracy degradation on the MNIST and CIFAR10 datasets using 8-bit integer storage and computation. On ImageNet, 16-bit integers are needed for weight accumulation with an 8-bit datapath. This achieves training results comparable to all-floating-point implementations.
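A hedged sketch of stochastic rounding from a wide integer accumulator down to int8 storage is shown below. NITI's pseudo-stochastic variant derives the randomness from the data itself to avoid an external RNG; an explicit NumPy RNG is used here purely for clarity.

```python
# Stochastic rounding of wide accumulators to int8 after a right shift.
import numpy as np

def stochastic_round_to_int8(acc, shift, rng=np.random.default_rng(0)):
    """The bits dropped by the shift set the probability of rounding up.
    (NITI derives this randomness from the data; explicit RNG for clarity.)"""
    scaled = acc / (1 << shift)
    low = np.floor(scaled)
    prob_up = scaled - low                  # fractional part in [0, 1)
    rounded = low + (rng.random(acc.shape) < prob_up)
    return np.clip(rounded, -128, 127).astype(np.int8)

acc = np.random.default_rng(1).integers(-2**20, 2**20, size=8, dtype=np.int32)
print(stochastic_round_to_int8(acc, 12))
```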
26. PERF-Net: Pose Empowered RGB-Flow Net [PDF] 返回目录
Yinxiao Li, Zhichao Lu, Xuehan Xiong, Jonathan Huang
Abstract: In recent years, many works in the video action recognition literature have shown that two stream models (combining spatial and temporal input streams) are necessary for achieving state of the art performance. In this paper we show the benefits of including yet another stream based on human pose estimated from each frame -- specifically by rendering pose on input RGB frames. At first blush, this additional stream may seem redundant given that human pose is fully determined by RGB pixel values -- however we show (perhaps surprisingly) that this simple and flexible addition can provide complementary gains. Using this insight, we then propose a new model, which we dub PERF-Net (short for Pose Empowered RGB-Flow Net), which combines this new pose stream with the standard RGB and flow based input streams via distillation techniques and show that our model outperforms the state-of-the-art by a large margin in a number of human action recognition datasets while not requiring flow or pose to be explicitly computed at inference time.
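A minimal sketch of the extra input stream: rasterizing estimated keypoints onto an RGB frame. The keypoint format and the crude disc rendering are assumptions; a real system would render limbs as well, e.g. with OpenCV drawing primitives.

```python
# Stamp estimated keypoints onto a frame as the extra "pose" stream.
import numpy as np

def render_pose(frame, keypoints, radius=3):
    """frame: (H, W, 3) uint8 image; keypoints: iterable of (x, y)."""
    out = frame.copy()
    H, W = frame.shape[:2]
    yy, xx = np.mgrid[0:H, 0:W]
    for x, y in keypoints:
        out[(xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2] = 255
    return out

frame = np.zeros((64, 64, 3), dtype=np.uint8)
pose_frame = render_pose(frame, [(10, 12), (20, 30), (40, 40)])
```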
27. Distribution Matching for Crowd Counting [PDF] 返回目录
Boyu Wang, Huidong Liu, Dimitris Samaras, Minh Hoai
Abstract: In crowd counting, each training image contains multiple people, where each person is annotated by a dot. Existing crowd counting methods need to use a Gaussian to smooth each annotated dot or to estimate the likelihood of every pixel given the annotated point. In this paper, we show that imposing Gaussians to annotations hurts generalization performance. Instead, we propose to use Distribution Matching for crowd COUNTing (DM-Count). In DM-Count, we use Optimal Transport (OT) to measure the similarity between the normalized predicted density map and the normalized ground truth density map. To stabilize OT computation, we include a Total Variation loss in our model. We show that the generalization error bound of DM-Count is tighter than that of the Gaussian smoothed methods. In terms of Mean Absolute Error, DM-Count outperforms the previous state-of-the-art methods by a large margin on two large-scale counting datasets, UCF-QNRF and NWPU, and achieves the state-of-the-art results on the ShanghaiTech and UCF-CC50 datasets. Notably, DM-Count ranked first on the leaderboard for the NWPU benchmark, reducing the error of the state-of-the-art published result by approximately 16%. Code is available at this https URL.
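A rough sketch of how the three ingredients compose: a counting loss on total mass, an entropic-OT (Sinkhorn) loss between the normalized maps, and a total-variation term. The weights, the cost design, and the plain Sinkhorn iteration are simplifications, not the paper's exact formulation.

```python
# Counting + Sinkhorn OT + total-variation terms over normalized maps.
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropic optimal transport cost between histograms a and b."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()

def dm_count_style_loss(pred, gt, coords, w_ot=0.1, w_tv=0.01):
    """pred, gt: flattened density maps; coords: (n, 2) pixel locations
    scaled to [0, 1] so the exponential in Sinkhorn stays well-behaved."""
    count = abs(pred.sum() - gt.sum())
    a, b = pred / pred.sum(), gt / gt.sum()
    C = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return count + w_ot * sinkhorn(a, b, C) + w_tv * 0.5 * np.abs(a - b).sum()
```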
28. Semi-Supervised Image Deraining using Gaussian Processes [PDF] 返回目录
Rajeev Yasarla, V.A. Sindagi, V.M. Patel
Abstract: Recent CNN-based methods for image deraining have achieved excellent performance in terms of reconstruction error as well as visual quality. However, these methods are limited in the sense that they can be trained only on fully labeled data. Due to various challenges in obtaining real-world fully-labeled image deraining datasets, existing methods are trained only on synthetically generated data and hence generalize poorly to real-world images. The use of real-world data in training image deraining networks is relatively less explored in the literature. We propose a Gaussian Process-based semi-supervised learning framework which enables the network to learn to derain using a synthetic dataset while generalizing better using unlabeled real-world images. More specifically, we model the latent space vectors of unlabeled data using Gaussian Processes, which are then used to compute pseudo-ground-truth for supervising the network on unlabeled data. Through extensive experiments and ablations on several challenging datasets (such as Rain800, Rain200L and DDN-SIRR), we show that the proposed method is able to effectively leverage unlabeled data, thereby resulting in significantly better performance as compared to labeled-only training. Additionally, we demonstrate that using unlabeled real-world images in the proposed GP-based framework results in improved performance on real-world rainy images.
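The core mechanism, sketched below under simplifying assumptions (RBF kernel, scalar targets): the GP posterior mean at an unlabeled latent vector is a kernel-weighted combination of labeled targets, and serves as its pseudo-ground-truth.

```python
# GP posterior mean over latent vectors as pseudo-ground-truth.
import numpy as np

def rbf(A, B, gamma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def gp_pseudo_labels(Z_lab, y_lab, Z_unlab, noise=1e-3):
    """Z_*: latent vectors; y_lab: targets of the labeled (synthetic) data."""
    K = rbf(Z_lab, Z_lab) + noise * np.eye(len(Z_lab))
    return rbf(Z_unlab, Z_lab) @ np.linalg.solve(K, y_lab)
```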
29. Rotated Binary Neural Network [PDF] 返回目录
Mingbao Lin, Rongrong Ji, Zihan Xu, Baochang Zhang, Yan Wang, Yongjian Wu, Feiyue Huang, Chia-Wen Lin
Abstract: Binary Neural Network (BNN) shows its predominance in reducing the complexity of deep neural networks. However, it suffers from severe performance degradation. One of the major impediments is the large quantization error between the full-precision weight vector and its binary vector. Previous works focus on compensating for the norm gap while leaving the angular bias hardly touched. In this paper, for the first time, we explore the influence of angular bias on the quantization error and then introduce a Rotated Binary Neural Network (RBNN), which considers the angle alignment between the full-precision weight vector and its binarized version. At the beginning of each training epoch, we propose to rotate the full-precision weight vector towards its binary vector to reduce the angular bias. To avoid the high complexity of learning a large rotation matrix, we further introduce a bi-rotation formulation that learns two smaller rotation matrices. In the training stage, we devise an adjustable rotated weight vector for binarization to escape the potential local optimum. Our rotation leads to around 50% weight flips, which maximizes the information gain. Finally, we propose a training-aware approximation of the sign function for the backward gradient. Experiments on CIFAR-10 and ImageNet demonstrate the superiority of RBNN over many state-of-the-art methods. Our source code, experimental settings, training logs and binary models are available at this https URL.
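A sketch of the bi-rotation binarization step. The paper learns the two small rotation matrices; fixed random orthogonal matrices are used here purely to show the shape of the computation.

```python
# Binarize a weight matrix after a bi-rotation W_rot = R1 @ W @ R2.
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n):
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return Q                            # a random orthogonal matrix

def rotated_binarize(W):
    R1, R2 = random_rotation(W.shape[0]), random_rotation(W.shape[1])
    W_rot = R1 @ W @ R2                 # reduce the angular bias first
    return np.abs(W_rot).mean() * np.sign(W_rot)

B = rotated_binarize(rng.normal(size=(8, 8)))
```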
30. Event-based Action Recognition Using Timestamp Image Encoding Network [PDF] 返回目录
Chaoxing Huang
Abstract: An event camera is an asynchronous, high-frequency vision sensor with low power consumption, which makes it suitable for human action recognition tasks. It is vital to encode the spatial-temporal information of event data properly and to use standard computer vision tools to learn from the data. In this work, we propose a timestamp image encoding 2D network, which takes the encoded spatial-temporal images of the event data as input and outputs the action label. Experiment results show that our method can achieve the same level of performance as RGB-based benchmarks on real-world action recognition, and also achieves the SOTA result on gesture recognition.
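Timestamp image encoding itself is simple enough to sketch directly: each pixel stores the (normalized) time of its most recent event, turning the event stream into a 2D image a standard CNN can consume. The event-tuple layout below is an assumption.

```python
# Encode an event stream as a per-pixel most-recent-timestamp image.
import numpy as np

def timestamp_image(events, H, W):
    """events: (N, 3) array of (x, y, t) rows, sorted by t ascending."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    img = np.zeros((H, W))
    mask = np.zeros((H, W), dtype=bool)
    img[y, x] = t                   # the last (most recent) event wins
    mask[y, x] = True
    t0, t1 = t.min(), t.max()
    return np.where(mask, (img - t0) / max(t1 - t0, 1e-9), 0.0)
```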
31. Kernel Based Progressive Distillation for Adder Neural Networks [PDF] 返回目录
Yixing Xu, Chang Xu, Xinghao Chen, Wei Zhang, Chunjing Xu, Yunhe Wang
Abstract: Adder Neural Networks (ANNs), which only contain additions, bring us a new way of developing deep neural networks with low energy consumption. Unfortunately, there is an accuracy drop when replacing all convolution filters with adder filters. The main reason is the optimization difficulty of ANNs using the $\ell_1$-norm, in which the estimation of the gradient in back propagation is inaccurate. In this paper, we present a novel method for further improving the performance of ANNs without increasing the trainable parameters, via a progressive kernel based knowledge distillation (PKKD) method. A convolutional neural network (CNN) with the same architecture is simultaneously initialized and trained as a teacher network; features and weights of the ANN and CNN are transformed to a new space to eliminate the accuracy drop. The similarity is computed in a higher-dimensional space to disentangle the difference of their distributions using a kernel-based method. Finally, the desired ANN is learned based on information from both the ground-truth and the teacher, progressively. The effectiveness of the proposed method for learning an ANN with higher performance is then well-verified on several benchmarks. For instance, the ANN-50 trained using the proposed PKKD method obtains a 76.8\% top-1 accuracy on the ImageNet dataset, which is 0.6\% higher than that of the ResNet-50.
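The adder operation underlying ANNs replaces the convolutional inner product with a negative $\ell_1$ distance, so a layer uses only additions and subtractions. A fully-connected variant, sketched in NumPy:

```python
# Fully-connected adder layer: out[b, u] = -sum_i |X[b, i] - W[i, u]|.
import numpy as np

def adder_layer(X, W):
    return -np.abs(X[:, :, None] - W[None, :, :]).sum(axis=1)

X = np.random.default_rng(0).normal(size=(4, 16))   # (batch, features)
W = np.random.default_rng(1).normal(size=(16, 8))   # (features, units)
print(adder_layer(X, W).shape)                      # (4, 8)
```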
32. Concentrated Multi-Grained Multi-Attention Network for Video Based Person Re-Identification [PDF] 返回目录
Panwen Hu, Jiazhen Liu, Rui Huang
Abstract: Occlusion is still a severe problem in the video-based Re-IDentification (Re-ID) task, which has a great impact on the success rate. The attention mechanism has been proved to be helpful in solving the occlusion problem by a large number of existing methods. However, their attention mechanisms still lack the capability to extract sufficient discriminative information into the final representations from the videos. The single attention module scheme employed by existing methods cannot exploit multi-scale spatial cues, and the attention of the single module will be dispersed by multiple salient parts of the person. In this paper, we propose a Concentrated Multi-grained Multi-Attention Network (CMMANet) where two multi-attention modules are designed to extract multi-grained information through processing multi-scale intermediate features. Furthermore, multiple attention submodules in each multi-attention module can automatically discover multiple discriminative regions of the video frames. To achieve this goal, we introduce a diversity loss to diversify the submodules in each multi-attention module, and a concentration loss to integrate their attention responses so that each submodule can strongly focus on a specific meaningful part. The experimental results show that the proposed approach outperforms the state-of-the-art methods by large margins on multiple public datasets.
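One plausible reading of the diversity loss, sketched below: penalize pairwise cosine similarity between the flattened attention maps so that each submodule attends to a different region. The paper's exact formulation may differ.

```python
# Mean pairwise cosine similarity between flattened attention maps.
import numpy as np

def diversity_loss(att):
    """att: (M, H*W) attention maps from M submodules."""
    a = att / (np.linalg.norm(att, axis=1, keepdims=True) + 1e-9)
    sim = a @ a.T                       # (M, M) cosine similarities
    M = len(att)
    return (sim.sum() - np.trace(sim)) / (M * (M - 1))
```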
33. Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect [PDF] 返回目录
Kaihua Tang, Jianqiang Huang, Hanwang Zhang
Abstract: As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the sample-of-interest co-exists with each other in one collectable unit, e.g., multiple visual instances in one image. Therefore, long-tailed classification is the key to deep learning at scale. However, existing methods are mainly based on re-weighting/re-sampling heuristics that lack a fundamental theory. In this paper, we establish a causal inference framework, which not only unravels the whys of previous methods, but also derives a new principled solution. Specifically, our theory shows that the SGD momentum is essentially a confounder in long-tailed classification. On one hand, it has a harmful causal effect that misleads the tail prediction biased towards the head. On the other hand, its induced mediation also benefits the representation learning and head prediction. Our framework elegantly disentangles the paradoxical effects of the momentum, by pursuing the direct causal effect caused by an input sample. In particular, we use causal intervention in training, and counterfactual reasoning in inference, to remove the "bad" while keep the "good". We achieve new state-of-the-arts on three long-tailed visual recognition benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification and LVIS for instance segmentation.
34. AIM 2020 Challenge on Video Temporal Super-Resolution [PDF] 返回目录
Sanghyun Son, Jaerin Lee, Seungjun Nah, Radu Timofte, Kyoung Mu Lee
Abstract: Videos in the real world contain various dynamics and motions that may look unnaturally discontinuous in time when the recorded frame rate is low. This paper reports on the second AIM challenge on Video Temporal Super-Resolution (VTSR), a.k.a. frame interpolation, with a focus on the proposed solutions, results, and analysis. From low-frame-rate (15 fps) videos, the challenge participants are required to submit higher-frame-rate (30 and 60 fps) sequences by estimating temporally intermediate frames. To simulate realistic and challenging dynamics in the real world, we employ the REDS_VTSR dataset derived from diverse videos captured with a hand-held camera for training and evaluation purposes. There have been 68 registered participants in the competition, and 5 teams (one withdrawn) competed in the final testing phase. The winning team proposes an enhanced quadratic video interpolation method and achieves state-of-the-art on the VTSR task.
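The winning team's method builds on quadratic video interpolation. A minimal sketch of the underlying constant-acceleration motion model (the "enhanced" solution adds learned refinements on top of this):

```python
import numpy as np

def quadratic_flow(flow_0_to_prev, flow_0_to_next, t):
    """Flow from frame 0 to an intermediate time t in (0, 1) under a
    quadratic (constant-acceleration) motion model, given optical flows
    from frame 0 to frames -1 and +1. Arrays have shape (H, W, 2).
    """
    velocity = 0.5 * (flow_0_to_next - flow_0_to_prev)
    acceleration = flow_0_to_next + flow_0_to_prev
    # Sanity check: t=1 recovers flow_0_to_next, t=-1 recovers flow_0_to_prev.
    return velocity * t + 0.5 * acceleration * t ** 2
```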
35. Recognition and Synthesis of Object Transport Motion [PDF] 返回目录
Connor Daly
Abstract: Deep learning typically requires vast numbers of training examples in order to be used successfully. Conversely, motion capture data is often expensive to generate, requiring specialist equipment along with actors to perform the prescribed motions, meaning that motion capture datasets tend to be relatively small. Motion capture data does, however, provide a rich source of information that is becoming increasingly useful in a wide variety of applications, from gesture recognition in human-robot interaction to data-driven animation. This project illustrates how deep convolutional networks can be used, alongside specialized data augmentation techniques, on a small motion capture dataset to learn detailed information from sequences of a specific type of motion (object transport). The project shows how these same augmentation techniques can be scaled up for use in the more complex task of motion synthesis. By exploring recent developments in Generative Adversarial Networks (GANs), specifically the Wasserstein GAN, this project outlines a model that is able to successfully generate lifelike object transportation motions, with the generated samples displaying varying styles and transport strategies.
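For reference, the Wasserstein GAN objective the project builds on can be written in a few lines of PyTorch; this minimal sketch omits the Lipschitz constraint (weight clipping or gradient penalty) that a real WGAN training loop needs:

```python
import torch

def wgan_losses(critic, real_clips, fake_clips):
    """Standard WGAN objectives for motion clips (sketch).

    The critic maximizes the score gap between real and generated
    samples; the generator maximizes the critic's score on fakes.
    """
    critic_loss = critic(fake_clips).mean() - critic(real_clips).mean()
    generator_loss = -critic(fake_clips).mean()
    return critic_loss, generator_loss
```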
36. Human-Object Interaction Detection: A Quick Survey and Examination of Methods [PDF] 返回目录
Trevor Bergstrom, Humphrey Shi
Abstract: Human-object interaction detection is a relatively new task in the world of computer vision and visual semantic information extraction. With the goal of machines identifying interactions that humans perform on objects, there are many real-world use cases for the research in this field. To our knowledge, this is the first general survey of the state-of-the-art and milestone works in this field. We provide a basic survey of the developments in the field of human-object interaction detection. Many works in this field use multi-stream convolutional neural network architectures, which combine features from multiple sources in the input image. Most commonly these are the humans and objects in question, as well as the spatial relationship of the two. As far as we are aware, there have not been in-depth studies that examine the performance of each component individually. In order to provide insight to future researchers, we perform an individualized study that examines the performance of each component of a multi-stream convolutional neural network architecture for human-object interaction detection. Specifically, we examine the HORCNN architecture as it is a foundational work in the field. In addition, we provide an in-depth look at the HICO-DET dataset, a popular benchmark in the field of human-object interaction detection. Code and papers can be found at this https URL.
37. A Survey on Deep Learning Methods for Semantic Image Segmentation in Real-Time [PDF] 返回目录
Georgios Takos
Abstract: Semantic image segmentation is one of the fastest growing areas in computer vision, with a variety of applications. In many areas, such as robotics and autonomous vehicles, semantic image segmentation is crucial, since it provides the necessary context for actions to be taken based on a scene understanding at the pixel level. Moreover, the success of medical diagnosis and treatment relies on an extremely accurate understanding of the data under consideration, and semantic image segmentation is one of the important tools in many cases. Recent developments in deep learning have provided a host of tools to tackle this problem efficiently and with increased accuracy. This work provides a comprehensive analysis of state-of-the-art deep learning architectures in image segmentation and, more importantly, an extensive list of techniques to achieve fast inference and computational efficiency. The origins of these techniques as well as their strengths and trade-offs are discussed with an in-depth analysis of their impact in the area. The best-performing architectures are summarized with a list of methods used to achieve these state-of-the-art results.
38. Two-stream Encoder-Decoder Network for Localizing Image Forgeries [PDF] 返回目录
Aniruddha Mazumdar, Prabin Kumar Bora
Abstract: This paper proposes a novel two-stream encoder-decoder network, which utilizes both the high-level and the low-level image features for precisely localizing forged regions in a manipulated image. This is motivated by the fact that the forgery creation process generally introduces both high-level artefacts (e.g. unnatural contrast) and low-level artefacts (e.g. noise inconsistency) to the forged images. In the proposed two-stream network, one stream learns the low-level manipulation-related features in the encoder side by extracting noise residuals through a set of high-pass filters in the first layer of the encoder network. In the second stream, the encoder learns the high-level image manipulation features from the input image RGB values. The coarse feature maps of both encoders are upsampled by their corresponding decoder networks to produce dense feature maps. The dense feature maps of the two streams are concatenated and fed to a final convolutional layer with sigmoidal activation to produce pixel-wise predictions. We have carried out experimental analysis on multiple standard forensics datasets to evaluate the performance of the proposed method. The experimental results show the efficacy of the proposed method with respect to the state-of-the-art.
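The low-level stream's first layer extracts noise residuals with fixed high-pass filters. A minimal PyTorch sketch under that assumption, using a single classic SRM-style kernel applied per channel (the paper uses a set of such filters):

```python
import torch
import torch.nn as nn

class NoiseResidualLayer(nn.Module):
    """Fixed high-pass filtering that suppresses image content and keeps
    noise residuals (sketch of the low-level stream's first layer)."""

    def __init__(self):
        super().__init__()
        srm = torch.tensor([[-1., 2., -1.],
                            [ 2., -4., 2.],
                            [-1., 2., -1.]]) / 4.0
        # One depthwise copy of the kernel per RGB channel.
        self.register_buffer("kernel", srm.expand(3, 1, 3, 3).clone())

    def forward(self, rgb):  # rgb: (N, 3, H, W)
        return nn.functional.conv2d(rgb, self.kernel, padding=1, groups=3)
```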
39. Virtual Experience to Real World Application: Sidewalk Obstacle Avoidance Using Reinforcement Learning for Visually Impaired [PDF] 返回目录
Faruk Ahmed, Md Sultan Mahmud, Kazi Ashraf Moinuddin, Mohammed Istiaque Hyder, Mohammed Yeasin
Abstract: Finding a path free from obstacles that poses minimal risk is critical for safe navigation. People who are sighted and people who are visually impaired both require navigation safety while walking on a sidewalk. In this research we developed an assistive navigation system for sidewalks by integrating sensory inputs using reinforcement learning. We trained a Sidewalk Obstacle Avoidance Agent (SOAA) through reinforcement learning in a simulated robotic environment. A Sidewalk Obstacle Conversational Agent (SOCA) is built by training a natural language conversation agent with real conversation data. The SOAA, along with SOCA, was integrated in a prototype device called augmented guide (AG). Empirical analysis showed that this prototype improved the obstacle avoidance experience by about 5% over a base case of 81.29%.
40. Adaptive confidence thresholding for semi-supervised monocular depth estimation [PDF] 返回目录
Hyesong Choi, Hunsang Lee, Sunkyung Kim, Sunok Kim, Seungryong Kim, Dongbo Min
Abstract: Self-supervised monocular depth estimation has become an appealing solution to the lack of ground truth labels, but its reconstruction loss often produces over-smoothed results across object boundaries and is incapable of handling occlusion explicitly. In this paper, we propose a new approach to leverage pseudo ground truth depth maps of stereo images generated from pretrained stereo matching methods. Our method is comprised of three subnetworks: a monocular depth network, a confidence network, and a threshold network. The confidence map of the pseudo ground truth depth map is first estimated to mitigate performance degradation caused by inaccurate pseudo depth maps. To cope with the prediction error of the confidence map itself, we also propose to leverage the threshold network, which learns the threshold τ in an adaptive manner. The confidence map is thresholded via a differentiable soft-thresholding operator using this truncation boundary τ. The pseudo depth labels filtered by the thresholded confidence map are finally used to supervise the monocular depth network. To apply the proposed method to various training datasets, we introduce the network-wise training strategy that transfers the knowledge learned from one dataset to another. Experimental results demonstrate performance superior to state-of-the-art monocular depth estimation methods. Lastly, we show that the threshold network can also be used to improve the performance of existing confidence estimation approaches.
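The differentiable soft-thresholding operator is what lets gradients reach the threshold network that predicts τ. A minimal sketch, with `temperature` as an assumed smoothing constant rather than anything from the paper:

```python
import torch

def soft_threshold(confidence, tau, temperature=0.05):
    """Differentiable approximation of the hard mask (confidence > tau).

    As temperature -> 0 this approaches a step function; keeping it
    finite lets the threshold network receive gradients through tau.
    """
    return torch.sigmoid((confidence - tau) / temperature)

# Pseudo-label supervision weighted by the soft mask (illustrative):
# loss = (soft_threshold(conf, tau) * (depth_pred - pseudo_depth).abs()).mean()
```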
41. Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization [PDF] 返回目录
Haoliang Li, YuFei Wang, Renjie Wan, Shiqi Wang, Tie-Qiang Li, Alex C. Kot
Abstract: Recently, we have witnessed great progress in the field of medical imaging classification by adopting deep neural networks. However, recent advanced models still require access to sufficiently large and representative datasets for training, which is often unfeasible in clinically realistic environments. When trained on limited datasets, deep neural networks lack generalization capability, as a network trained on data from a certain distribution (e.g. data captured by a certain device vendor or patient population) may not be able to generalize to data from another distribution. In this paper, we introduce a simple but effective approach to improve the generalization capability of deep neural networks in the field of medical imaging classification. Motivated by the observation that the domain variability of the medical images is to some extent compact, we propose to learn a representative feature space through variational encoding with a novel linear-dependency regularization term to capture the shareable information among medical data collected from different domains. As a result, the trained neural network is expected to be equipped with better generalization capability to "unseen" medical data. Experimental results on two challenging medical imaging classification tasks indicate that our method can achieve better cross-domain generalization capability compared with state-of-the-art baselines.
42. AIM 2020: Scene Relighting and Illumination Estimation Challenge [PDF] 返回目录
Majed El Helou, Ruofan Zhou, Sabine Süsstrunk, Radu Timofte, Mahmoud Afifi, Michael S. Brown, Kele Xu, Hengxing Cai, Yuzhong Liu, Li-Wen Wang, Zhi-Song Liu, Chu-Tak Li, Sourya Dipta Das, Nisarg A. Shah, Akashdeep Jassal, Tongtong Zhao, Shanshan Zhao, Sabari Nathan, M. Parisa Beham, R. Suganya, Qing Wang, Zhongyun Hu, Xin Huang, Yaning Li, Maitreya Suin, Kuldeep Purohit, A. N. Rajagopalan, Densen Puthussery, Hrishikesh P S, Melvin Kuriakose, Jiji C V, Yu Zhu, Liping Dong, Zhuolong Jiang, Chenghua Li, Cong Leng, Jian Cheng
Abstract: We review the AIM 2020 challenge on virtual image relighting and illumination estimation. This paper presents the novel VIDIT dataset used in the challenge and the different proposed solutions and final evaluation results over the 3 challenge tracks. The first track considered one-to-one relighting; the objective was to relight an input photo of a scene with a different color temperature and illuminant orientation (i.e., light source position). The goal of the second track was to estimate illumination settings, namely the color temperature and orientation, from a given image. Lastly, the third track dealt with any-to-any relighting, thus a generalization of the first track. The target color temperature and orientation, rather than being pre-determined, are instead given by a guide image. Participants were allowed to make use of their track 1 and 2 solutions for track 3. The tracks had 94, 52, and 56 registered participants, respectively, leading to 20 confirmed submissions in the final competition stage.
43. Semi-Supervised Learning for In-Game Expert-Level Music-to-Dance Translation [PDF] 返回目录
Yinglin Duan, Tianyang Shi, Zhengxia Zou, Jia Qin, Yifei Zhao, Yi Yuan, Jie Hou, Xiang Wen, Changjie Fan
Abstract: Music-to-dance translation is a brand-new and powerful feature in recent role-playing games. Players can now let their characters dance along with specified music clips and even generate fan-made dance videos. Previous works on this topic consider music-to-dance as a supervised motion generation problem based on time-series data. However, these methods suffer from limited training data pairs and the degradation of movements. This paper provides a new perspective on this task: we re-formulate the translation problem as a piece-wise dance phrase retrieval problem based on choreography theory. With such a design, players are allowed to further edit the dance movements on top of our generation, while other regression-based methods ignore such user interactivity. Considering that dance motion capture is an expensive and time-consuming procedure which requires the assistance of professional dancers, we train our method under a semi-supervised learning framework with a large unlabeled dataset (20x the labeled data) collected. A co-ascent mechanism is introduced to improve the robustness of our network. Using this unlabeled dataset, we also introduce self-supervised pre-training so that the translator can understand the melody, rhythm, and other components of music phrases. We show that the pre-training significantly improves the translation accuracy compared with training from scratch. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in expert-level choreography for game players.
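The piece-wise phrase retrieval formulation can be illustrated with a toy nearest-neighbor lookup over learned embeddings; the actual system learns the joint embedding semi-supervisedly and supports user edits on top. All names below are illustrative:

```python
import numpy as np

def retrieve_dance_phrase(music_emb, dance_embs):
    """Return the index of the dance phrase whose embedding best
    matches the current music phrase, by cosine similarity.

    music_emb:  (D,) embedding of the music phrase
    dance_embs: (K, D) embeddings of candidate dance phrases
    """
    m = music_emb / np.linalg.norm(music_emb)
    d = dance_embs / np.linalg.norm(dance_embs, axis=1, keepdims=True)
    return int(np.argmax(d @ m))
```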
44. Handwriting Prediction Considering Inter-Class Bifurcation Structures [PDF] 返回目录
Masaki Yamagata, Hideaki Hayashi, Seiichi Uchida
Abstract: Temporal prediction is still a difficult task due to the chaotic behavior, non-Markovian characteristics, and non-stationary noise of temporal signals. Handwriting prediction is also challenging because of uncertainty arising from inter-class bifurcation structures, in addition to the above problems. For example, the classes '0' and '6' are very similar in terms of their beginning parts; therefore it is nearly impossible to predict the subsequent parts from the beginning part alone. In other words, '0' and '6' have a bifurcation structure due to ambiguity between classes, and we cannot make a long-term prediction in this context. In this paper, we propose a temporal prediction model that can deal with this bifurcation structure. Specifically, the proposed model learns the bifurcation structure explicitly as a Gaussian mixture model (GMM) for each class, as well as the posterior probability of the classes. The final result of prediction is represented as the weighted sum of GMMs using the class probabilities as weights. When multiple classes have large weights, the model can handle a bifurcation and thus avoid an inaccurate prediction. The proposed model is formulated as a neural network including long short-term memories and is thus trained in an end-to-end manner. The proposed model was evaluated on the UNIPEN online handwritten character dataset, and the results show that the model can catch and deal with the bifurcation structures.
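The prediction head described above outputs a GMM per class plus a class posterior, and the final density is their weighted sum. A minimal 1-D sketch of that combination (the paper models 2-D pen trajectories):

```python
import numpy as np

def mixture_density(x, class_probs, weights, means, stds):
    """p(x) = sum_c p(c) * sum_k w_ck * N(x; mu_ck, sigma_ck), 1-D case.

    class_probs:          (C,)   posterior over character classes
    weights, means, stds: (C, K) GMM parameters per class
    """
    gauss = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    per_class = (weights * gauss).sum(axis=1)   # (C,) per-class GMM densities
    return float(class_probs @ per_class)
```

When two classes such as '0' and '6' both carry large weight, the mixture keeps both branches alive instead of averaging them into an implausible middle prediction.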
45. MicroAnalyzer: A Python Tool for Automated Bacterial Analysis with Fluorescence Microscopy [PDF] 返回目录
Jonathan Reiner, Guy Azran, Gal Hyams
Abstract: Fluorescence microscopy is a widely used method among cell biologists for studying the localization and co-localization of fluorescent proteins. For microbial cell biologists, these studies often include tedious and time-consuming manual segmentation of bacteria and of the fluorescence clusters, or working with multiple programs. Here, we present MicroAnalyzer - a tool that automates these tasks by providing an end-to-end platform for microscope image analysis. While such tools do exist, they are costly, black-boxed programs. MicroAnalyzer offers an open-source alternative to these tools, allowing flexibility and expandability by advanced users. MicroAnalyzer provides accurate cell and fluorescence cluster segmentation based on state-of-the-art deep-learning segmentation models, combined with ad-hoc post-processing and Colicoords - an open-source cell image analysis tool for calculating general cell and fluorescence measurements. Using these methods, it performs better than generic approaches, since the dynamic nature of neural networks allows for a quick adaptation to experiment restrictions and assumptions. Other existing tools do not consider experiment assumptions, nor do they provide fluorescence cluster detection without the need for any specialized equipment. The key goal of MicroAnalyzer is to automate the entire process of cell and fluorescence image analysis "from microscope to database", meaning it does not require any further input from the researcher except for the initial deep-learning model training. In this fashion, it allows researchers to concentrate on the bigger picture instead of granular, eye-straining labor.
46. I Like to Move It: 6D Pose Estimation as an Action Decision Process [PDF] 返回目录
Benjamin Busam, Hyun Jun Jung, Nassir Navab
Abstract: Object pose estimation is an integral part of robot vision and augmented reality. Robust and accurate pose prediction of both object rotation and translation is a crucial element for enabling precise and safe human-machine interactions and for visualization in mixed reality. Previous 6D pose estimation methods either treat the problem as a regression task or discretize the pose space for classification. We reformulate the problem as an action decision process in which an initial pose is updated in incremental discrete steps that sequentially move a virtual 3D rendering towards the correct solution. A neural network iteratively estimates likely moves from a single RGB image and thus determines an acceptable final pose. In comparison to previous approaches that learn an object-specific pose embedding, a decision process allows for a lightweight architecture while naturally generalizing to unseen objects. Moreover, a coherent action for process termination enables dynamic reduction of the computation cost when changes in a video sequence are insignificant. While other methods only provide a static inference time, we can thereby automatically increase the runtime depending on the object motion. We evaluate the robustness and accuracy of our action decision network on video scenes with known and unknown objects and show how this can significantly improve the state-of-the-art on YCB videos.
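The action decision loop implied by the abstract is easy to state; the sketch below shows only that control flow, and `Action`, `policy`, and `render` are hypothetical stand-ins, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Action:
    is_stop: bool                         # coherent termination action
    delta: Tuple[float, ...] = (0.,) * 6  # (rx, ry, rz, tx, ty, tz) increment

def estimate_pose(image, init_pose, policy: Callable, render: Callable,
                  max_steps: int = 100):
    """Iteratively refine a 6D pose in discrete increments until the
    policy emits the stop action or the step budget runs out."""
    pose = init_pose
    for _ in range(max_steps):
        action = policy(image, render(pose))  # compare image vs. rendering
        if action.is_stop:
            break                             # pose judged acceptable
        pose = tuple(p + d for p, d in zip(pose, action.delta))
    return pose
```

On a static scene the stop action fires early, which is exactly the dynamic runtime reduction the abstract describes.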
47. Enhancing a Neurocognitive Shared Visuomotor Model for Object Identification, Localization, and Grasping With Learning From Auxiliary Tasks [PDF] 返回目录
Matthias Kerzel, Fares Abawi, Manfred Eppe, Stefan Wermter
Abstract: We present a follow-up study on our unified visuomotor neural model for the robotic tasks of identifying, localizing, and grasping a target object in a scene with multiple objects. Our RetinaNet-based model enables end-to-end training of visuomotor abilities in a biologically inspired developmental approach. In our initial implementation, a neural model was able to grasp selected objects from a planar surface. We embodied the model on the NICO humanoid robot. In this follow-up study, we expand the task and the model to reaching for objects in a three-dimensional space with a novel dataset based on augmented reality and a simulation environment. We evaluate the influence of training with auxiliary tasks, i.e., whether learning of the primary visuomotor task is supported by learning to classify and locate different objects. We show that the proposed visuomotor model can learn to reach for objects in a three-dimensional space. We analyze the results for biologically-plausible biases based on object locations or properties. We show that the primary visuomotor task can be successfully trained simultaneously with one of the two auxiliary tasks. This is enabled by a complex neurocognitive model with shared and task-specific components, similar to models found in biological systems.
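The shared-plus-task-specific design can be sketched as one trunk with a primary grasping head and two auxiliary heads; every layer size below is a placeholder, not the paper's architecture:

```python
import torch.nn as nn

class SharedVisuomotor(nn.Module):
    """Shared trunk, primary visuomotor head, auxiliary heads (sketch)."""

    def __init__(self, feat=256, n_classes=10, n_joints=6):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat), nn.ReLU())
        self.grasp_head = nn.Linear(feat, n_joints)  # primary: motor output
        self.cls_head = nn.Linear(feat, n_classes)   # auxiliary: identify
        self.loc_head = nn.Linear(feat, 2)           # auxiliary: localize

    def forward(self, x):
        h = self.trunk(x)
        return self.grasp_head(h), self.cls_head(h), self.loc_head(h)
```

Training then minimizes the primary loss plus a weighted auxiliary loss, so gradients from classification and localization shape the shared trunk.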
48. Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks [PDF] 返回目录
Heng Zhang, Elisa Fromont, Sébastien Lefevre, Bruno Avignon
Abstract: Multispectral images (e.g. visible and infrared) may be particularly useful when detecting objects with the same model in different environments (e.g. day/night outdoor scenes). To effectively use the different spectra, the main technical problem resides in the information fusion process. In this paper, we propose a new halfway feature fusion method for neural networks that leverages the complementary/consistency balance existing in multispectral features by adding to the network architecture, a particular module that cyclically fuses and refines each spectral feature. We evaluate the effectiveness of our fusion method on two challenging multispectral datasets for object detection. Our results show that implementing our Cyclic Fuse-and-Refine module in any network improves the performance on both datasets compared to other state-of-the-art multispectral object detection methods.
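A control-flow sketch of the cyclic fuse-and-refine idea, assuming 1x1 convolutions for fusing and refining (the real block's internals differ):

```python
import torch
import torch.nn as nn

class CyclicFuseAndRefine(nn.Module):
    """Cyclically fuse two spectral feature maps and refine each one
    with the fused result (sketch)."""

    def __init__(self, channels, n_cycles=2):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.refine_vis = nn.Conv2d(2 * channels, channels, 1)
        self.refine_ir = nn.Conv2d(2 * channels, channels, 1)
        self.n_cycles = n_cycles

    def forward(self, vis, ir):
        for _ in range(self.n_cycles):
            fused = self.fuse(torch.cat([vis, ir], dim=1))
            vis = vis + self.refine_vis(torch.cat([vis, fused], dim=1))
            ir = ir + self.refine_ir(torch.cat([ir, fused], dim=1))
        return vis, ir
```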
49. Interactive White Balancing for Camera-Rendered Images [PDF] 返回目录
Mahmoud Afifi, Michael S. Brown
Abstract: White balance (WB) is one of the first photo-finishing steps used to render a captured image to its final output. WB is applied to remove the color cast caused by the scene's illumination. Interactive photo-editing software allows users to manually select different regions in a photo as examples of the illumination for WB correction (e.g., clicking on achromatic objects). Such interactive editing is possible only with images saved in a RAW image format. This is because RAW images have no photo-rendering operations applied, and photo-editing software is able to apply WB and other photo-finishing procedures to render the final image. Interactively editing WB in camera-rendered images is significantly more challenging. This is because the camera hardware has already applied WB to the image and subsequent nonlinear photo-processing routines. These nonlinear rendering operations make it difficult to change the WB post-capture. The goal of this paper is to allow interactive WB manipulation of camera-rendered images. The proposed method is an extension of our recent work [afifi2019color] that proposed a post-capture method for WB correction based on nonlinear color-mapping functions. Here, we introduce a new framework that links the nonlinear color-mapping functions directly to user-selected colors to enable interactive WB manipulation. This new framework is also more efficient in terms of memory and run-time (99% reduction in memory and 3x speed-up). Lastly, we describe how our framework can leverage a simple illumination estimation method (i.e., gray-world) to perform auto-WB correction that is on a par with the WB correction results of [afifi2019color]. The source code is publicly available at this https URL.
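Methods in this family typically kernelize RGB values with a polynomial expansion and fit a least-squares mapping towards the desired colors (e.g., those implied by the user's clicks). A generic numpy sketch of that idea; the exact kernel and fitting procedure in the paper may differ:

```python
import numpy as np

def poly_kernel(rgb):
    """Degree-2 polynomial expansion of (N, 3) RGB values."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    return np.stack([r, g, b, r * g, r * b, g * b,
                     r * r, g * g, b * b, np.ones_like(r)], axis=1)

def fit_color_mapping(src_rgb, target_rgb):
    """Least-squares matrix mapping kernelized source colors to their
    desired white-balanced values. Apply with poly_kernel(pixels) @ M."""
    M, *_ = np.linalg.lstsq(poly_kernel(src_rgb), target_rgb, rcond=None)
    return M  # shape (10, 3)
```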
50. Grasp Proposal Networks: An End-to-End Solution for Visual Learning of Robotic Grasps [PDF] 返回目录
Chaozheng Wu, Jian Chen, Qiaoyu Cao, Jianchi Zhang, Yunxin Tai, Lin Sun, Kui Jia
Abstract: Learning robotic grasps from visual observations is a promising yet challenging task. Recent research shows its great potential by preparing and learning from large-scale synthetic datasets. For the popular 6 degree-of-freedom (6-DOF) grasp setting of a parallel-jaw gripper, most existing methods take the strategy of heuristically sampling grasp candidates and then evaluating them using learned scoring functions. This strategy is limited in terms of the conflict between sampling efficiency and coverage of optimal grasps. To this end, we propose in this work a novel, end-to-end Grasp Proposal Network (GPNet) to predict a diverse set of 6-DOF grasps for an unseen object observed from a single and unknown camera view. GPNet builds on a key design of the grasp proposal module that defines anchors of grasp centers at discrete but regular 3D grid corners, which is flexible to support either more precise or more diverse grasp predictions. To test GPNet, we contribute a synthetic dataset of 6-DOF object grasps; evaluation is conducted using rule-based criteria, simulation tests, and real tests. Comparative results show the advantage of our methods over existing ones. Notably, GPNet gains better simulation results via the specified coverage, which helps achieve a ready translation in real tests. We will make our dataset publicly available.
51. Few-shot Object Detection with Self-adaptive Attention Network for Remote Sensing Images [PDF] 返回目录
Zixuan Xiao, Wei Xue, Ping Zhong
Abstract: In the remote sensing field, object detection has found many applications in recent years, and these demand a great amount of labeled data. However, we may be faced with cases where only limited data are available. In this paper, we propose a few-shot object detector designed to detect novel objects given only a few examples. In particular, to fit the object detection setting, our proposed few-shot detector concentrates on relations that lie at the level of objects instead of the full image, with the assistance of a Self-Adaptive Attention Network (SAAN). The SAAN can fully leverage object-level relations through a relation GRU unit and simultaneously attaches attention to object features in a self-adaptive way according to the object-level relations, so as to avoid situations where the additional attention is useless or even detrimental. Eventually, the detection results are produced from the attention-augmented features and can thus be obtained simply. The experiments demonstrate the effectiveness of the proposed method in few-shot scenes.
52. A Few-shot Learning Approach for Historical Ciphered Manuscript Recognition [PDF] 返回目录
Mohamed Ali Souibgui, Alicia Fornés, Yousri Kessentini, Crina Tudor
Abstract: Encoded (or ciphered) manuscripts are a special type of historical documents that contain encrypted text. The automatic recognition of this kind of documents is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpus for training and 3) touching symbols make the symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten ciphers recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if few labeled pages with the same alphabet are used for fine tuning, our method surpasses existing unsupervised and supervised HTR methods for ciphers recognition.
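The decoding step described above admits a short sketch. One plausible reading (sort detections left to right and transcribe each as its highest-similarity symbol; the paper's decoder may be more involved):

```python
import numpy as np

def decode_line(detections, alphabet):
    """detections: list of (x_position, similarity_scores) pairs, one per
    detected symbol, where similarity_scores holds one score per alphabet entry.
    Returns the left-to-right transcription as a list of symbols."""
    detections = sorted(detections, key=lambda d: d[0])
    return [alphabet[int(np.argmax(scores))] for _, scores in detections]
```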
53. DT-Net: A novel network based on multi-directional integrated convolution and threshold convolution [PDF] 返回目录
Hongfeng You, Long Yu, Shengwei Tian, Xiang Ma, Yan Xing, Xiaojie Ma
Abstract: Since medical image data sets contain few samples and singular features, lesions appear highly similar to other tissues. The traditional neural network has a limited ability to learn features. Even if a host of feature maps is expanded to obtain more semantic information, the accuracy of segmenting the final medical image is only slightly improved, and the features are excessively redundant. To solve the above problems, in this paper we propose a novel end-to-end semantic segmentation algorithm, DT-Net, and use two new convolution strategies to better achieve end-to-end semantic segmentation of medical images. 1. In the feature mining and feature fusion stage, we construct a multi-directional integrated convolution (MDIC). The core idea is to use multi-scale convolution to enhance the local multi-directional feature maps, generating enhanced feature maps and mining features that contain more semantics without increasing the number of feature maps. 2. We also aim to further excavate and retain more meaningful deep features and to reduce the host of noise features arising in the training process. Therefore, we propose a convolution thresholding strategy. The central idea is to set a threshold to eliminate a large number of redundant features and reduce computational complexity. Through the two strategies proposed above, the algorithm proposed in this paper produces state-of-the-art results on two public medical image datasets. We demonstrate in detail that our proposed strategy plays an important role in feature mining and in eliminating redundant features. Compared with existing semantic segmentation algorithms, our proposed algorithm has better robustness.
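The convolution thresholding strategy can be illustrated with a hard threshold on feature maps. This is our reading of the central idea (suppress low-magnitude responses so that only salient features propagate), not the authors' exact operator, and `tau` is an assumed hyper-parameter:

```python
import numpy as np

def threshold_features(feature_maps, tau=0.1):
    """Zero out low-magnitude responses, eliminating redundant features and
    reducing downstream computation. feature_maps: array of shape (C, H, W)."""
    out = feature_maps.copy()
    out[np.abs(out) < tau] = 0.0
    return out
```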
54. Affinity Space Adaptation for Semantic Segmentation Across Domains [PDF] 返回目录
Wei Zhou, Yukang Wang, Jiajia Chu, Jiehua Yang, Xiang Bai, Yongchao Xu
Abstract: Semantic segmentation with dense pixel-wise annotation has achieved excellent performance thanks to deep learning. However, the generalization of semantic segmentation in the wild remains challenging. In this paper, we address the problem of unsupervised domain adaptation (UDA) in semantic segmentation. Motivated by the fact that source and target domain have invariant semantic structures, we propose to exploit such invariance across domains by leveraging co-occurring patterns between pairwise pixels in the output of structured semantic segmentation. This is different from most existing approaches that attempt to adapt domains based on individual pixel-wise information in image, feature, or output level. Specifically, we perform domain adaptation on the affinity relationship between adjacent pixels termed affinity space of source and target domain. To this end, we develop two affinity space adaptation strategies: affinity space cleaning and adversarial affinity space alignment. Extensive experiments demonstrate that the proposed method achieves superior performance against some state-of-the-art methods on several challenging benchmarks for semantic segmentation across domains. The code is available at this https URL.
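As an illustration of what an affinity space between adjacent pixels might look like (our sketch, not the paper's exact definition), the snippet below computes the cosine similarity between each pixel's class-score vector and those of its right and down neighbors:

```python
import numpy as np

def affinity_space(prob):
    """prob: array of shape (H, W, C) of per-pixel class scores. Returns the
    horizontal (H, W-1) and vertical (H-1, W) affinities between neighbors."""
    unit = prob / (np.linalg.norm(prob, axis=-1, keepdims=True) + 1e-8)
    right = (unit[:, :-1] * unit[:, 1:]).sum(-1)  # pixel vs. right neighbor
    down = (unit[:-1, :] * unit[1:, :]).sum(-1)   # pixel vs. down neighbor
    return right, down
```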
55. Causal Intervention for Weakly-Supervised Semantic Segmentation [PDF] 返回目录
Dong Zhang, Hanwang Zhang, Jinhui Tang, Xiansheng Hua, Qianru Sun
Abstract: We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS). Specifically, we aim to generate better pixel-level pseudo-masks by using only image-level labels -- the most crucial step in WSSS. We attribute the cause of the ambiguous boundaries of pseudo-masks to the confounding context, e.g., the correct image-level classification of "horse" and "person" may be not only due to the recognition of each instance, but also their co-occurrence context, making the model inspection (e.g., CAM) hard to distinguish between the boundaries. Inspired by this, we propose a structural causal model to analyze the causalities among images, contexts, and class labels. Based on it, we develop a new method: Context Adjustment (CONTA), to remove the confounding bias in image-level classification and thus provide better pseudo-masks as ground-truth for the subsequent segmentation model. On PASCAL VOC 2012 and MS-COCO, we show that CONTA boosts various popular WSSS methods to new state-of-the-arts.
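The intervention underlying this line of work is the standard backdoor adjustment from causal inference. Schematically (the general formula, not the paper's specific instantiation):

\[
P(Y \mid do(X)) \;=\; \sum_{z} P(Y \mid X, z)\, P(z),
\]

where $z$ ranges over the confounding contexts, so the outcome is averaged over contexts according to their prior rather than the context distribution that co-occurs with $X$.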
56. A light-weight method to foster the (Grad)CAM interpretability and explainability of classification networks [PDF] 返回目录
Alfred Schöttl
Abstract: We consider a light-weight method which allows us to improve the explainability of localized classification networks. The method considers (Grad)CAM maps during the training process by modifying the training loss and does not require additional structural elements. It is demonstrated that the (Grad)CAM interpretability, as measured by several indicators, can be improved in this way. Since the method shall be applicable on embedded systems and on standard deeper architectures, it essentially takes advantage of second-order derivatives during training and does not require additional model layers.
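As a hedged sketch of how a (Grad)CAM term could enter the training loss without extra structural elements, the snippet below computes a differentiable Grad-CAM map and adds an entropy penalty that discourages diffuse maps; the entropy penalty is our assumption for illustration, not the paper's exact loss term:

```python
import torch

def gradcam_regularized_loss(task_loss, class_score, feats, lam=0.1):
    """task_loss: scalar training loss; class_score: scalar class logit still in
    the autograd graph; feats: (N, C, H, W) feature maps the logit depends on.
    create_graph=True keeps the CAM differentiable, so training involves
    second-order derivatives, as the abstract notes."""
    grads = torch.autograd.grad(class_score, feats, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)       # Grad-CAM channel weights
    cam = torch.relu((weights * feats).sum(dim=1))       # (N, H, W) CAM maps
    p = cam.flatten(1).softmax(dim=1)                    # normalize per image
    entropy = -(p * (p + 1e-8).log()).sum(dim=1).mean()  # penalize diffuse CAMs
    return task_loss + lam * entropy
```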
57. Neural Twins Talk [PDF] 返回目录
Zanyar Zohourianshahzadi, Jugal Kumar Kalita
Abstract: Inspired by how the human brain employs more neural pathways when increasing its focus on a subject, we introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model originally implemented with a single channel of attention for the visual grounding task. Visual grounding ensures the existence of words in the caption sentence that are grounded in a particular region of the input image. After a deep learning model is trained on the visual grounding task, the model employs the learned patterns regarding the visual grounding and the order of objects in the caption sentences when generating captions. We report the results of our experiments on three image captioning tasks on the COCO dataset. The results are reported using standard image captioning metrics to show the improvements achieved by our model over the previous image captioning model. The results gathered from our experiments suggest that employing more parallel attention pathways in a deep neural network leads to higher performance. Our implementation of NTT is publicly available at: this https URL.
58. Dense-View GEIs Set: View Space Covering for Gait Recognition based on Dense-View GAN [PDF] 返回目录
Rijun Liao, Weizhi An, Shiqi Yu, Zhu Li, Yongzhen Huang
Abstract: Gait recognition has proven to be effective for long-distance human recognition. However, view variance of gait features changes human appearance greatly and reduces recognition performance. Most existing gait datasets collect data from only a dozen or so different angles, or even fewer. Limited view angles prevent learning a better view-invariant feature. The robustness of gait recognition could be further improved if data were collected at various angles at 1-degree intervals, but collecting such a dataset is time- and labor-consuming. In this paper, we therefore introduce a Dense-View GEIs Set (DV-GEIs) to deal with the challenge of limited view angles. This set covers the whole view space, with view angles from 0 to 180 degrees at 1-degree intervals. In addition, Dense-View GAN (DV-GAN) is proposed to synthesize this dense view set. DV-GAN consists of a Generator, a Discriminator, and a Monitor, where the Monitor is designed to preserve human identification and view information. The proposed method is evaluated on the CASIA-B and OU-ISIR datasets. The experimental results show that DV-GEIs synthesized by DV-GAN are an effective way to learn better view-invariant features. We believe the idea of densely generated view samples will further advance the development of gait recognition.
59. Dictionary Learning with Low-rank Coding Coefficients for Tensor Completion [PDF] 返回目录
Tai-Xiang Jiang, Xi-Le Zhao, Hao Zhang, Michael K. Ng
Abstract: In this paper, we propose a novel tensor learning and coding model for third-order data completion. Our model learns a data-adaptive dictionary from the given observations and determines the coding coefficients of third-order tensor tubes. In the completion process, we minimize the low-rankness of each tensor slice containing the coding coefficients. Compared with a traditional pre-defined transform basis, the advantages of the proposed model are that (i) the dictionary can be learned based on the given data observations, so the basis can be constructed more adaptively and accurately, and (ii) the low-rankness of the coding coefficients allows the linear combination of dictionary features to be used more effectively. We also develop a multi-block proximal alternating minimization algorithm for solving this tensor learning and coding model, and show that the sequence generated by the algorithm globally converges to a critical point. Extensive experimental results on real data sets such as videos, hyperspectral images, and traffic data demonstrate these advantages and show that the performance of the proposed tensor learning and coding method is significantly better than other tensor completion methods in terms of several evaluation metrics.
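One plausible reading of the completion model, written schematically (the paper's exact formulation, tensor product, and rank surrogate may differ):

\[
\min_{\mathcal{D},\,\mathcal{C}} \;\tfrac{1}{2}\,\big\| \mathcal{P}_{\Omega}\!\left(\mathcal{X} - \mathcal{D} * \mathcal{C}\right) \big\|_F^2 \;+\; \lambda \sum_{k} \big\| \mathcal{C}^{(k)} \big\|_{*},
\]

where $\mathcal{D}$ is the learned data-adaptive dictionary, $\mathcal{C}$ holds the coding coefficients of the tensor tubes, $\mathcal{D} * \mathcal{C}$ denotes their tube-wise combination, $\mathcal{P}_{\Omega}$ keeps only the observed entries, and the nuclear norm of each coefficient slice $\mathcal{C}^{(k)}$ is the usual convex surrogate for the low-rankness being minimized.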
60. SIA-GCN: A Spatial Information Aware Graph Neural Network with 2D Convolutions for Hand Pose Estimation [PDF] 返回目录
Deying Kong, Haoyu Ma, Xiaohui Xie
Abstract: Graph Neural Networks (GNNs) generalize neural networks from applications on regular structures to applications on arbitrary graphs, and have shown success in many application domains such as computer vision, social networks and chemistry. In this paper, we extend GNNs along two directions: a) allowing features at each node to be represented by 2D spatial confidence maps instead of 1D vectors; and b) proposing an efficient operation to integrate information from neighboring nodes through 2D convolutions with different learnable kernels at each edge. The proposed SIA-GCN can efficiently extract spatial information from 2D maps at each node and propagate them through graph convolution. By associating each edge with a designated convolution kernel, the SIA-GCN could capture different spatial relationships for different pairs of neighboring nodes. We demonstrate the utility of SIA-GCN on the task of estimating hand keypoints from single-frame images, where the nodes represent the 2D coordinate heatmaps of keypoints and the edges denote the kinetic relationships between keypoints. Experiments on multiple datasets show that SIA-GCN provides a flexible and yet powerful framework to account for structural constraints between keypoints, and can achieve state-of-the-art performance on the task of hand pose estimation.
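A hedged sketch of the per-edge 2D-convolution aggregation described in (b); the graph bookkeeping is simplified and the kernels stand in for the learnable ones:

```python
import numpy as np
from scipy.signal import convolve2d

def sia_propagate(node_maps, edges, kernels):
    """node_maps: dict node -> (H, W) confidence map; edges: list of directed
    (u, v) pairs; kernels: dict (u, v) -> (k, k) edge-specific kernel.
    Each node v aggregates neighbor u by convolving u's 2D map with the kernel
    dedicated to edge (u, v), so spatial information is preserved."""
    out = {v: m.copy() for v, m in node_maps.items()}
    for (u, v) in edges:
        out[v] += convolve2d(node_maps[u], kernels[(u, v)], mode="same")
    return out
```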
61. Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition [PDF] 返回目录
G M Mashrur E Elahi, Yee-Hong Yang
Abstract: Video processing has become a popular research direction in computer vision due to its various applications, such as video summarization, action recognition, etc. Recently, deep learning-based methods have achieved impressive results in action recognition. However, these methods need to process a full video sequence to recognize the action, even though most of these frames are similar and non-essential to recognizing a particular action. Additionally, these non-essential frames increase the computational cost and can confuse a method in action recognition. Instead, the important frames, called keyframes, not only are helpful in the recognition of an action but also can reduce the processing time of each video sequence for classification or in other applications, e.g., summarization. Moreover, current methods in video processing have not yet been demonstrated in an online fashion. Motivated by the above, we propose an online learnable module for keyframe extraction. This module can be used to select key-shots in video and thus can be applied to video summarization. The extracted keyframes can be used as input to any deep learning-based classification model to recognize action. We also propose a plugin module that uses the semantic word vector as input along with keyframes, and a novel train/test strategy for the classification models. To our best knowledge, this is the first time such an online module and train/test strategy have been proposed. The experimental results on many commonly used datasets in video summarization and in action recognition have shown impressive results using the proposed module.
62. Towards General Purpose and Geometry Preserving Single-View Depth Estimation [PDF] 返回目录
Mikhail Romanov, Nikolay Patatkin, Anna Vorontsova, Anton Konushin
Abstract: Single-view depth estimation plays a crucial role in scene understanding for AR applications and 3D modelling, as it allows the geometry of a scene to be retrieved. However, this is only possible if the inverse depth estimates are unbiased, i.e. they are either absolute or Up-to-Scale (UTS). In recent years, great progress has been made in general-purpose single-view depth estimation. Nevertheless, the latest general-purpose models were trained using ranking or Up-to-Shift-Scale (UTSS) data. As a result, they provide UTSS predictions that cannot be used to reconstruct scene geometry. In this work, we strive to build a general-purpose single-view UTS depth estimation model. Following Ranftl et al., we train our model on a mixture of datasets and test it on several previously unseen datasets. We show that our method outperforms previous state-of-the-art UTS models. We train several light-weight models following the proposed training scheme and show that our ideas are applicable to computationally efficient depth estimation.
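The UTS/UTSS distinction becomes concrete with the least-squares alignment commonly used in depth evaluation (a standard protocol sketch, not the paper's code):

```python
import numpy as np

def align_scale(pred, gt):
    """Up-to-Scale (UTS): the single scale s minimizing ||s*pred - gt||^2,
    in closed form s = <pred, gt> / <pred, pred>."""
    s = (pred * gt).sum() / (pred * pred).sum()
    return s * pred

def align_scale_shift(pred, gt):
    """Up-to-Shift-Scale (UTSS): least-squares fit of s, t in s*pred + t ~ gt.
    The unknown shift t is exactly what prevents recovering scene geometry
    from UTSS predictions."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t
```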
63. Graph Adversarial Networks: Protecting Information against Adversarial Attacks [PDF] 返回目录
Peiyuan Liao, Han Zhao, Keyulu Xu, Tommi Jaakkola, Geoffrey Gordon, Stefanie Jegelka, Ruslan Salakhutdinov
Abstract: We explore the problem of protecting information when learning with graph-structured data. While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representational learning in many applications, the neighborhood aggregation paradigm exposes additional vulnerabilities to attackers seeking to extract node-level information about sensitive attributes. To counter this, we propose a minimax game between the desired GNN encoder and the worst-case attacker. The resulting adversarial training creates a strong defense against inference attacks, while only suffering small loss in task performance. We analyze the effectiveness of our framework against a worst-case adversary, and characterize the trade-off between predictive accuracy and adversarial defense. Experiments across multiple datasets from recommender systems, knowledge graphs and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks, while producing competitive GNN encoders.
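A standard way to optimize such an encoder-versus-attacker minimax objective is a gradient reversal layer; the sketch below shows the trick itself (a common technique, not necessarily the paper's exact optimization scheme):

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam in the backward
    pass, so the attacker minimizes its loss while the encoder maximizes it."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def attacker_loss(node_embeddings, attacker, sensitive_labels, lam=1.0):
    """Cross-entropy of the attacker's predictions of sensitive attributes from
    node embeddings; the reversed gradients train the encoder adversarially."""
    z = GradReverse.apply(node_embeddings, lam)
    return F.cross_entropy(attacker(z), sensitive_labels)
```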
64. EIS -- a family of activation functions combining Exponential, ISRU, and Softplus [PDF] 返回目录
Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, Ashish Kumar Pandey
Abstract: Activation functions play a pivotal role in function learning using neural networks. The non-linearity in the learned function is achieved by repeated use of the activation function. Over the years, numerous activation functions have been proposed to improve accuracy on several tasks. Basic functions like ReLU, Exponential, Tanh, or Softplus have been favorites in the deep learning community because of their simplicity. In recent years, several novel activation functions arising from these basic functions have been proposed, which have improved accuracy on some challenging datasets with complicated models. We propose a five-hyper-parameter family of activation functions, namely EIS, defined as \[ \frac{x(\ln(1+e^x))^\alpha}{\sqrt{\beta+\gamma x^2}+\delta e^{-\theta x}}. \] We show examples of activation functions from the EIS family which outperform widely used activation functions on some well-known datasets and models. For example, $\frac{x\ln(1+e^x)}{x+1.16e^{-x}}$ beats ReLU by 0.89\% in DenseNet-169 and 0.24\% in Inception V3 on the CIFAR100 dataset, and by 1.13\% in Inception V3, 0.13\% in DenseNet-169, and 0.94\% in the SimpleNet model on the CIFAR10 dataset. Also, $\frac{x\ln(1+e^x)}{\sqrt{1+x^2}}$ beats ReLU by 1.68\% in DenseNet-169 and 0.30\% in Inception V3 on the CIFAR100 dataset, and by 1.0\% in Inception V3, 0.15\% in DenseNet-169, and 1.13\% in the SimpleNet model on the CIFAR10 dataset.
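Since the family is given by an explicit formula, it transcribes directly to code. A NumPy sketch (the default hyper-parameters are arbitrary placeholders; the two highlighted members are written out verbatim):

```python
import numpy as np

def eis(x, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0, theta=1.0):
    """EIS: x * ln(1+e^x)^alpha / (sqrt(beta + gamma*x^2) + delta*e^(-theta*x)).
    np.logaddexp(0, x) computes ln(1 + e^x) (Softplus) without overflow."""
    softplus = np.logaddexp(0.0, x)
    return x * softplus**alpha / (np.sqrt(beta + gamma * x * x)
                                  + delta * np.exp(-theta * x))

def eis_member_1(x):
    """x * ln(1+e^x) / (x + 1.16 * e^(-x)), the first member reported above."""
    return x * np.logaddexp(0.0, x) / (x + 1.16 * np.exp(-x))

def eis_member_2(x):
    """x * ln(1+e^x) / sqrt(1 + x^2), the second member reported above."""
    return x * np.logaddexp(0.0, x) / np.sqrt(1.0 + x * x)
```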
65. The Smart Parking Management System [PDF] 返回目录
Amira. A. Elsonbaty, Mahmoud Shams
Abstract: Demand for car parking grows with the number of car users. With the increased use of smartphones and their applications, users prefer mobile-phone-based solutions. This paper proposes a Smart Parking Management System (SPMS) that depends on Arduino parts and Android applications and is based on IoT. It gives the client the ability to check available parking spaces and reserve a parking spot. IR sensors are utilized to detect whether a parking space is occupied. Occupancy data are transmitted over a Wi-Fi module to the server and retrieved by the mobile application, which offers many options conveniently and at no cost to users and lets the user check reservation details. With IoT technology, the smart parking system can be connected wirelessly to easily track available locations.
66. Group Whitening: Balancing Learning Efficiency and Representational Capacity [PDF] 返回目录
Lei Huang, Li Liu, Fan Zhu, Ling Shao
Abstract: Batch normalization (BN) is an important technique commonly incorporated into deep learning models to perform standardization within mini-batches. The merits of BN in improving model's learning efficiency can be further amplified by applying whitening, while its drawbacks in estimating population statistics for inference can be avoided through group normalization (GN). This paper proposes group whitening (GW), which elaborately exploits the advantages of the whitening operation and avoids the disadvantages of normalization within mini-batches. Specifically, GW divides the neurons of a sample into groups for standardization, like GN, and then further decorrelates the groups. In addition, we quantitatively analyze the constraint imposed by normalization, and show how the batch size (group number) affects the performance of batch (group) normalized networks, from the perspective of model's representational capacity. This analysis provides theoretical guidance for applying GW in practice. Finally, we apply the proposed GW to ResNet and ResNeXt architectures and conduct experiments on the ImageNet and COCO benchmarks. Results show that GW consistently improves the performance of different architectures, with absolute gains of $1.02\%$ $\sim$ $1.49\%$ in top-1 accuracy on ImageNet and $1.82\%$ $\sim$ $3.21\%$ in bounding box AP on COCO.
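A hedged per-sample sketch of the idea: as in GN, statistics come from channel groups of a single sample, but each group is ZCA-whitened (decorrelated) rather than merely standardized. This is our reading of the operation, not the authors' implementation:

```python
import numpy as np

def group_whiten(x, groups=4, eps=1e-5):
    """x: one sample's activations of shape (C, H, W), with C divisible by
    `groups`. Each channel group is centered and ZCA-whitened across the
    H*W spatial positions."""
    C, H, W = x.shape
    out = np.empty_like(x)
    for g in np.split(np.arange(C), groups):
        Xg = x[g].reshape(len(g), -1)                  # (channels_in_group, H*W)
        Xg = Xg - Xg.mean(axis=1, keepdims=True)       # center per channel
        cov = Xg @ Xg.T / Xg.shape[1] + eps * np.eye(len(g))
        vals, vecs = np.linalg.eigh(cov)
        W_zca = vecs @ np.diag(vals ** -0.5) @ vecs.T  # ZCA whitening matrix
        out[g] = (W_zca @ Xg).reshape(len(g), H, W)
    return out
```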
67. AI Progress in Skin Lesion Analysis [PDF] 返回目录
Philippe M. Burlina, William Paul, Phil A. Mathew, Neil J. Joshi, Alison W. Rebman, John N. Aucott
Abstract: We examine progress in the use of AI for detecting skin lesions, with particular emphasis on the erythema migrans rash of acute Lyme disease, and on other lesions, such as those from conditions like herpes zoster (shingles), tinea corporis, erythema multiforme, cellulitis, insect bites, or tick bites. We discuss important challenges for these applications, including the problem of AI bias, especially regarding the lack of skin images of dark-skinned individuals; the ability to accurately detect, delineate, and segment lesions or regions of interest compared to normal skin in images; and the challenge of low-shot learning (addressing classification with a paucity of training images). Solving these problems ranges from highly desirable requirements, e.g. delineation, which may be useful to disambiguate between similar types of lesions and to perform improved diagnostics, to required ones, as is the case for AI de-biasing, which allows fair AI techniques to be deployed in the clinic for skin lesion analysis. For the problem of low-shot learning in particular, we report skin analysis algorithms that degrade gracefully and still perform well at low shots when compared to baseline algorithms: when using as few as 10 training exemplars per class, the baseline DL algorithm's performance degrades significantly, with an accuracy of 56.41%, close to chance, whereas the best-performing low-shot algorithm yields an accuracy of 83.33%.
68. High-throughput molecular imaging via deep learning enabled Raman spectroscopy [PDF] 返回目录
Conor C. Horgan, Magnus Jensen, Anika Nagelkerke, Jean-Phillipe St-Pierre, Tom Vercauteren, Molly M. Stevens, Mads S. Bergholt
Abstract: Raman spectroscopy enables non-destructive, label-free imaging with unprecedented molecular contrast but is limited by slow data acquisition, largely preventing high-throughput imaging applications. Here, we present a comprehensive framework for higher-throughput molecular imaging via deep learning enabled Raman spectroscopy, termed DeepeR, trained on a large dataset of hyperspectral Raman images, with over 1.5 million spectra (400 hours of acquisition) in total. We firstly perform denoising and reconstruction of low signal-to-noise ratio Raman molecular signatures via deep learning, with a 9x improvement in mean squared error over state-of-the-art Raman filtering methods. Next, we develop a neural network for robust 2-4x super-resolution of hyperspectral Raman images that preserves molecular cellular information. Combining these approaches, we achieve Raman imaging speed-ups of up to 160x, enabling high resolution, high signal-to-noise ratio cellular imaging in under one minute. Finally, transfer learning is applied to extend DeepeR from cell to tissue-scale imaging. DeepeR provides a foundation that will enable a host of higher-throughput Raman spectroscopy and molecular imaging applications across biomedicine.
69. Characterization of Covid-19 Dataset using Complex Networks and Image Processing [PDF] 返回目录
Josimar Chire, Esteban Wilfredo Vilca Zuniga
Abstract: This paper aims to explore the structure of patterns behind a Covid-19 dataset. The dataset includes medical images with positive and negative cases. A sample of 100 images is chosen, 50 per class. A frequency histogram is calculated to obtain features using statistical measurements, alongside feature extraction using the Grey Level Co-Occurrence Matrix (GLCM). Complex Networks are built from both sets of features to analyze the adjacency matrices and check for the presence of patterns. Initial experiments provide evidence of hidden patterns in the dataset for each class, which become visible using the Complex Network representation.
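As an illustration of the pipeline described above, the sketch below computes GLCM texture descriptors per image and links images with similar descriptors into a network whose adjacency matrix can then be inspected. The descriptor set, the single distance/angle pair, and the similarity threshold are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np
import networkx as nx
from skimage.feature import graycomatrix, graycoprops

def build_feature_network(images, threshold=0.9):
    # GLCM texture descriptors per image (contrast, homogeneity, energy, correlation).
    feats = []
    for img in images:  # each img: 2D uint8 array
        glcm = graycomatrix(img, distances=[1], angles=[0],
                            levels=256, symmetric=True, normed=True)
        feats.append([graycoprops(glcm, p)[0, 0]
                      for p in ("contrast", "homogeneity", "energy", "correlation")])
    f = np.asarray(feats)
    f = (f - f.mean(axis=0)) / (f.std(axis=0) + 1e-8)
    # Link images whose descriptors are similar; the resulting adjacency
    # matrix is what one would inspect for class-specific structure.
    sim = np.corrcoef(f)
    adj = (sim > threshold) & ~np.eye(len(f), dtype=bool)
    return nx.from_numpy_array(adj.astype(int))
```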
70. Scalable Transfer Learning with Expert Models [PDF] 返回目录
Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Cedric Renggli, André Susano Pinto, Sylvain Gelly, Daniel Keysers, Neil Houlsby
Abstract: Transfer of pre-trained representations can improve sample efficiency and reduce computational requirements for new tasks. However, representations used for transfer are usually generic, and are not tailored to a particular distribution of downstream tasks. We explore the use of expert representations for transfer with a simple, yet effective, strategy. We train a diverse set of experts by exploiting existing label structures, and use cheap-to-compute performance proxies to select the relevant expert for each target task. This strategy scales the process of transferring to new tasks, since it does not revisit the pre-training data during transfer. Accordingly, it requires little extra compute per target task, and results in a speed-up of 2-3 orders of magnitude compared to competing approaches. Further, we provide an adapter-based architecture able to compress many experts into a single model. We evaluate our approach on two different data sources and demonstrate that it outperforms baselines on over 20 diverse vision tasks in both cases.
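The abstract does not name the performance proxy, so the sketch below uses a kNN cross-validation score on frozen expert features as one plausible cheap-to-compute proxy for selecting the relevant expert per target task.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def select_expert(expert_features, labels, k=5):
    # expert_features: {expert_name: (n_samples, d) frozen features on the target task}
    # A kNN cross-validation score serves here as the cheap proxy; the paper's
    # exact proxy is not specified in the abstract, so this is an assumption.
    scores = {
        name: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, labels, cv=3).mean()
        for name, X in expert_features.items()
    }
    return max(scores, key=scores.get)
```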
71. Deep EvoGraphNet Architecture For Time-Dependent Brain Graph Data Synthesis From a Single Timepoint [PDF] 返回目录
Ahmed Nebli, Ugur Ali Kaplan, Islem Rekik
Abstract: Learning how to predict the brain connectome (i.e. graph) development and aging is of paramount importance for charting the future of within-disorder and cross-disorder landscapes of brain dysconnectivity evolution. Indeed, predicting the longitudinal (i.e., time-dependent) brain dysconnectivity as it emerges and evolves over time from a single timepoint can help design personalized treatments for disordered patients at a very early stage. Despite its significance, evolution models of the brain graph are largely overlooked in the literature. Here, we propose EvoGraphNet, the first end-to-end geometric deep learning-powered graph-generative adversarial network (gGAN) for predicting time-dependent brain graph evolution from a single timepoint. Our EvoGraphNet architecture cascades a set of time-dependent gGANs, where each gGAN communicates its predicted brain graphs at a particular timepoint to train the next gGAN in the cascade at the follow-up timepoint. Therefore, we obtain each next predicted timepoint by setting the output of each generator as the input of its successor, which enables us to predict a given number of timepoints using only one single timepoint in an end-to-end fashion. At each timepoint, to better align the distribution of the predicted brain graphs with that of the ground-truth graphs, we further integrate an auxiliary Kullback-Leibler divergence loss function. To capture time-dependency between two consecutive observations, we impose an l1 loss to minimize the sparse distance between two serialized brain graphs. A series of benchmarks against variants and ablated versions of our EvoGraphNet showed that we can achieve the lowest brain graph evolution prediction error using a single baseline timepoint. Our EvoGraphNet code is available at this http URL.
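A hedged PyTorch sketch of how the two auxiliary terms named in the abstract could be combined is shown below. The adversarial gGAN term is omitted, the weighting coefficients are assumptions, and pairing the l1 term over consecutive predictions is one plausible reading of "two serialized brain graphs".

```python
import torch
import torch.nn.functional as F

def evographnet_style_aux_loss(pred_next, real_next, pred_prev,
                               lambda_kl=0.1, lambda_l1=1.0):
    # Hypothetical weights; the abstract names the terms but not their coefficients.
    # KL term: align the distribution of predicted edge weights with the ground truth.
    log_p = F.log_softmax(pred_next.flatten(), dim=0)
    q = F.softmax(real_next.flatten(), dim=0)
    kl = F.kl_div(log_p, q, reduction="sum")
    # l1 term: sparse distance between consecutive serialized graphs (time-dependency).
    l1 = F.l1_loss(pred_next, pred_prev)
    return lambda_kl * kl + lambda_l1 * l1
```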
72. Automated Pancreas Segmentation Using Multi-institutional Collaborative Deep Learning [PDF] 返回目录
Pochuan Wang, Chen Shen, Holger R. Roth, Dong Yang, Daguang Xu, Masahiro Oda, Kazunari Misawa, Po-Ting Chen, Kao-Lang Liu, Wei-Chih Liao, Weichung Wang, Kensaku Mori
Abstract: The performance of deep learning-based methods strongly relies on the number of datasets used for training. Many efforts have been made to increase the data in the medical image analysis field. However, unlike photography images, it is hard to generate centralized databases to collect medical images because of numerous technical, legal, and privacy issues. In this work, we study the use of federated learning between two institutions in a real-world setting to collaboratively train a model without sharing the raw data across national boundaries. We quantitatively compare the segmentation models obtained with federated learning and local training alone. Our experimental results show that federated learning models have higher generalizability than standalone training.
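The abstract does not spell out the aggregation rule, so the sketch below shows standard FedAvg-style weighted parameter averaging as one plausible instantiation of two-institution federated training without raw-data sharing.

```python
def federated_average(state_dicts, num_examples):
    # FedAvg-style aggregation, weighted by each institution's dataset size.
    # The paper's exact aggregation scheme is not given in the abstract;
    # this is the standard federated-averaging baseline, shown as an assumption.
    total = float(sum(num_examples))
    keys = state_dicts[0].keys()
    return {
        k: sum((n / total) * sd[k] for n, sd in zip(num_examples, state_dicts))
        for k in keys
    }
```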
73. Amodal 3D Reconstruction for Robotic Manipulation via Stability and Connectivity [PDF] 返回目录
William Agnew, Christopher Xie, Aaron Walsman, Octavian Murad, Caelen Wang, Pedro Domingos, Siddhartha Srinivasa
Abstract: Learning-based 3D object reconstruction enables single- or few-shot estimation of 3D object models. For robotics, this holds the potential to allow model-based methods to rapidly adapt to novel objects and scenes. Existing 3D reconstruction techniques optimize for visual reconstruction fidelity, typically measured by chamfer distance or voxel IOU. We find that when applied to realistic, cluttered robotics environments, these systems produce reconstructions with low physical realism, resulting in poor task performance when used for model-based control. We propose ARM, an amodal 3D reconstruction system that introduces (1) a stability prior over object shapes, (2) a connectivity prior, and (3) a multi-channel input representation that allows for reasoning over relationships between groups of objects. By using these priors over the physical properties of objects, our system improves reconstruction quality not just by standard visual metrics, but also performance of model-based control on a variety of robotics manipulation tasks in challenging, cluttered environments. Code is available at this http URL.
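For reference, one common symmetric variant of the chamfer distance mentioned above, for two point sets:

```python
import numpy as np

def chamfer_distance(a, b):
    # a: (n, 3), b: (m, 3) point sets; symmetric nearest-neighbour distance,
    # a standard visual-fidelity metric for 3D reconstruction.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```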
74. Medical Image Segmentation Using Deep Learning: A Survey [PDF] 返回目录
Tao Lei, Risheng Wang, Yong Wan, Xiaogang Du, Hongying Meng, Asoke K. Nandi
Abstract: Deep learning has been widely used for medical image segmentation, and a large number of papers have been presented recording the success of deep learning in the field. In this paper, we present a comprehensive thematic survey on medical image segmentation using deep learning techniques. This paper makes two original contributions. Firstly, compared to traditional surveys that directly divide the literature on deep learning for medical image segmentation into many groups and introduce it in detail for each group, we classify currently popular work according to a multi-level structure from coarse to fine. Secondly, this paper focuses on supervised and weakly supervised learning approaches, without including unsupervised approaches, since they have been introduced in many older surveys and are not currently popular. For supervised learning approaches, we analyze the literature in three aspects: the selection of backbone networks, the design of network blocks, and the improvement of loss functions. For weakly supervised learning approaches, we investigate the literature according to data augmentation, transfer learning, and interactive segmentation, separately. Compared to existing surveys, this survey classifies the literature very differently from before and is more convenient for readers to understand the relevant rationale, and it will guide them to think of appropriate improvements in medical image segmentation based on deep learning approaches.
75. Cloud Removal for Remote Sensing Imagery via Spatial Attention Generative Adversarial Network [PDF] 返回目录
Heng Pan
Abstract: Optical remote sensing imagery has been widely used in many fields due to its high resolution and stable geometric properties. However, remote sensing imagery is inevitably affected by climate, especially clouds. Removing the cloud in a high-resolution remote sensing satellite image is an indispensable pre-processing step before analyzing it. Thanks to large-scale training data, neural networks have been successful in many image processing tasks, but the use of neural networks to remove cloud from remote sensing imagery is still relatively rare. We adopt a generative adversarial network to solve this task, introduce the spatial attention mechanism into the remote sensing imagery cloud removal task, and propose a model named spatial attention generative adversarial network (SpA GAN), which imitates the human visual mechanism and recognizes and focuses on the cloud area with local-to-global spatial attention, thereby enhancing the information recovery of these areas and generating cloudless images with better quality...
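The abstract does not detail the attention block, so below is a generic spatial-attention gate in PyTorch as an illustrative stand-in: a single convolution produces a sigmoid mask that re-weights features toward regions of interest (here, cloud areas).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Generic spatial-attention gate (illustrative; the SpA GAN block's
    internals are not given in the abstract)."""
    def __init__(self, in_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        attn = torch.sigmoid(self.conv(x))   # (B, 1, H, W) mask over cloud regions
        return x * attn, attn                # re-weighted features plus the mask
```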
76. Visual Steering for One-Shot Deep Neural Network Synthesis [PDF] 返回目录
Anjul Tyagi, Cong Xie, Klaus Mueller
Abstract: Recent advancements in the area of deep learning have shown the effectiveness of very large neural networks in several applications. However, as these deep neural networks continue to grow in size, it becomes more and more difficult to configure their many parameters to obtain good results. Presently, analysts must experiment with many different configurations and parameter settings, which is labor-intensive and time-consuming. On the other hand, the capacity of fully automated techniques for neural network architecture search is limited without the domain knowledge of human experts. To deal with the problem, we formulate the task of neural network architecture optimization as a graph space exploration, based on the one-shot architecture search technique. In this approach, a super-graph of all candidate architectures is trained in one-shot and the optimal neural network is identified as a sub-graph. In this paper, we present a framework that allows analysts to effectively build the solution sub-graph space and guide the network search by injecting their domain knowledge. Starting with the network architecture space composed of basic neural network components, analysts are empowered to effectively select the most promising components via our one-shot search scheme. Applying this technique in an iterative manner allows analysts to converge to the best performing neural network architecture for a given application. During the exploration, analysts can use their domain knowledge aided by cues provided from a scatterplot visualization of the search space to edit different components and guide the search for faster convergence. We designed our interface in collaboration with several deep learning researchers and its final effectiveness is evaluated with a user study and two case studies.
77. Interventional Few-Shot Learning [PDF] 返回目录
Zhongqi Yue, Hanwang Zhang, Qianru Sun, Xian-Sheng Hua
Abstract: We uncover an ever-overlooked deficiency in the prevailing Few-Shot Learning (FSL) methods: the pre-trained knowledge is indeed a confounder that limits the performance. This finding is rooted from our causal assumption: a Structural Causal Model (SCM) for the causalities among the pre-trained knowledge, sample features, and labels. Thanks to it, we propose a novel FSL paradigm: Interventional Few-Shot Learning (IFSL). Specifically, we develop three effective IFSL algorithmic implementations based on the backdoor adjustment, which is essentially a causal intervention towards the SCM of many-shot learning: the upper-bound of FSL in a causal view. It is worth noting that the contribution of IFSL is orthogonal to existing fine-tuning and meta-learning based FSL methods, hence IFSL can improve all of them, achieving a new 1-/5-shot state-of-the-art on \textit{mini}ImageNet, \textit{tiered}ImageNet, and cross-domain CUB. Code is released at this https URL.
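For reference, the backdoor adjustment underlying the three IFSL implementations replaces the ordinary conditional with an intervention that stratifies over the confounder D (here, the pre-trained knowledge):

```latex
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{d} P\bigl(Y \mid X,\, D = d\bigr)\, P(D = d)
```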
78. Sparse-data based 3D surface reconstruction with vector matching [PDF] 返回目录
Bin Wu, Xue-Cheng Tai, Talal Rahman
Abstract: Three-dimensional surface reconstruction based on two-dimensional sparse information, in the form of only a small number of level lines of a surface with moderately complex structures containing both structured and unstructured geometries, is considered in this paper. A new model is proposed which is based on the idea of normal vector matching combined with first-order and second-order total variation regularizers. A fast algorithm based on the augmented Lagrangian is also proposed. Numerical experiments are provided showing the effectiveness of the model and the algorithm in reconstructing surfaces with detailed features and complex structures for both synthetic and real-world digital maps.
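In their standard continuous form, the first- and second-order total variation regularizers referred to above are the following (the paper's exact discretization and weighting are not given in the abstract):

```latex
\mathrm{TV}_1(u) = \int_\Omega \lvert \nabla u \rvert \, dx,
\qquad
\mathrm{TV}_2(u) = \int_\Omega \lvert \nabla^2 u \rvert \, dx
```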
79. VATLD: A Visual Analytics System to Assess, Understand and Improve Traffic Light Detection [PDF] 返回目录
Liang Gou, Lincan Zou, Nanxiang Li, Michael Hofmann, Arvind Kumar Shekar, Axel Wendt, Liu Ren
Abstract: Traffic light detection is crucial for environment perception and decision-making in autonomous driving. State-of-the-art detectors are built upon deep Convolutional Neural Networks (CNNs) and have exhibited promising performance. However, one looming concern with CNN based detectors is how to thoroughly evaluate the performance of accuracy and robustness before they can be deployed to autonomous vehicles. In this work, we propose a visual analytics system, VATLD, equipped with a disentangled representation learning and semantic adversarial learning, to assess, understand, and improve the accuracy and robustness of traffic light detectors in autonomous driving applications. The disentangled representation learning extracts data semantics to augment human cognition with human-friendly visual summarization, and the semantic adversarial learning efficiently exposes interpretable robustness risks and enables minimal human interaction for actionable insights. We also demonstrate the effectiveness of various performance improvement strategies derived from actionable insights with our visual analytics system, VATLD, and illustrate some practical implications for safety-critical applications in autonomous driving.
80. Classification and understanding of cloud structures via satellite images with EfficientUNet [PDF] 返回目录
Tashin Ahmed, Noor Hossain Nuri Sabab
Abstract: Climate change has been a common interest and at the forefront of crucial political discussion and decision-making for many years. Shallow clouds play a significant role in understanding the Earth's climate, but they are challenging to interpret and represent in a climate model. By classifying these cloud structures, there is a better possibility of understanding the physical structures of the clouds, which would improve climate model generation, resulting in better prediction of climate change or forecasting of weather updates. Clouds organise in many forms, which makes it challenging to build traditional rule-based algorithms to separate cloud features. In this paper, classification of cloud organization patterns was performed using a new scaled-up version of a Convolutional Neural Network (CNN) named EfficientNet as the encoder and UNet as the decoder, where they work as the feature extractor and the reconstructor of a fine-grained feature map, and the combination is used as a classifier, which will help experts to understand how clouds will shape the future climate. By using a segmentation model in a classification task, it was shown that with a good encoder alongside UNet, it is possible to obtain good performance from this dataset. The Dice coefficient has been used as the final evaluation metric, which gave scores of 66.26% and 66.02% on the public and private leaderboards of the Kaggle competition, respectively.
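For reference, the Dice coefficient used as the evaluation metric is 2|A ∩ B| / (|A| + |B|); a minimal NumPy implementation for binary masks:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # Dice = 2|A ∩ B| / (|A| + |B|) for binary masks of identical shape.
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```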
81. Learning to Improve Image Compression without Changing the Standard Decoder [PDF] 返回目录
Yannick Strümpler, Ren Yang, Radu Timofte
Abstract: In recent years we have witnessed an increasing interest in applying Deep Neural Networks (DNNs) to improve the rate-distortion performance in image compression. However, the existing approaches either train a post-processing DNN on the decoder side, or propose learning for image compression in an end-to-end manner. This way, the trained DNNs are required in the decoder, leading to incompatibility with the standard image decoders (e.g., JPEG) in personal computers and mobiles. Therefore, we propose learning to improve the encoding performance with the standard decoder. In this paper, we work on JPEG as an example. Specifically, a frequency-domain pre-editing method is proposed to optimize the distribution of DCT coefficients, aiming at facilitating the JPEG compression. Moreover, we propose learning the JPEG quantization table jointly with the pre-editing network. Most importantly, we do not modify the JPEG decoder, and therefore our approach is applicable when viewing images with the widely used standard JPEG decoder. The experiments validate that our approach successfully improves the rate-distortion performance of JPEG in terms of various quality metrics, such as PSNR, MS-SSIM and LPIPS. Visually, this translates to better overall color retention, especially when strong compression is applied.
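Learning a quantization table jointly with a network requires differentiating through rounding; the sketch below uses a straight-through estimator as one common relaxation. This is an assumption for illustration, since the abstract does not describe the paper's exact relaxation.

```python
import torch

def quantize_dct(dct_blocks, qtable):
    # JPEG-style quantization of DCT coefficients with a straight-through
    # estimator, so gradients can reach a learnable quantization table.
    # (Assumption: the paper's exact differentiable relaxation is not
    # described in the abstract.)
    q = dct_blocks / qtable
    q_rounded = torch.round(q)
    return q + (q_rounded - q).detach()  # forward: rounded; backward: identity
```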
82. Interaction-Based Trajectory Prediction Over a Hybrid Traffic Graph [PDF] 返回目录
Sumit Kumar, Yiming Gu, Jerrick Hoang, Galen Clark Haynes, Micol Marchetti-Bowick
Abstract: Behavior prediction of traffic actors is an essential component of any real-world self-driving system. Actors' long-term behaviors tend to be governed by their interactions with other actors or traffic elements (traffic lights, stop signs) in the scene. To capture this highly complex structure of interactions, we propose to use a hybrid graph whose nodes represent both the traffic actors as well as the static and dynamic traffic elements present in the scene. The different modes of temporal interaction (e.g., stopping and going) among actors and traffic elements are explicitly modeled by graph edges. This explicit reasoning about discrete interaction types not only helps in predicting future motion, but also enhances the interpretability of the model, which is important for safety-critical applications such as autonomous driving. We predict actors' trajectories and interaction types using a graph neural network, which is trained in a semi-supervised manner. We show that our proposed model, TrafficGraphNet, achieves state-of-the-art trajectory prediction accuracy while maintaining a high level of interpretability.
83. ESTAN: Enhanced Small Tumor-Aware Network for Breast Ultrasound Image Segmentation [PDF] 返回目录
Bryar Shareef, Alex Vakanski, Min Xian, Phoebe E. Freer
Abstract: Breast tumor segmentation is a critical task in computer-aided diagnosis (CAD) systems for breast cancer detection because accurate tumor size, shape and location are important for further tumor quantification and classification. However, segmenting small tumors in ultrasound images is challenging, due to the speckle noise, varying tumor shapes and sizes among patients, and the existence of tumor-like image regions. Recently, deep learning-based approaches have achieved great success for biomedical image analysis, but current state-of-the-art approaches achieve poor performance for segmenting small breast tumors. In this paper, we propose a novel deep neural network architecture, namely Enhanced Small Tumor-Aware Network (ESTAN), to accurately and robustly segment breast tumors. ESTAN introduces two encoders to extract and fuse image context information at different scales and utilizes row-column-wise kernels in the encoder to adapt to breast anatomy. We validate the proposed approach and compare it to nine state-of-the-art approaches on three public breast ultrasound datasets using seven quantitative metrics. The results demonstrate that the proposed approach achieves the best overall performance and outperforms all other approaches on small tumor segmentation.
84. Learning Self-Expression Metrics for Scalable and Inductive Subspace Clustering [PDF] 返回目录
Julian Busch, Evgeniy Faerman, Matthias Schubert, Thomas Seidl
Abstract: Subspace clustering has established itself as a state-of-the-art approach to clustering high-dimensional data. In particular, methods relying on the self-expressiveness property have recently proved especially successful. However, they suffer from two major shortcomings: First, a quadratic-size coefficient matrix is learned directly, preventing these methods from scaling beyond small datasets. Secondly, the trained models are transductive and thus cannot be used to cluster out-of-sample data unseen during training. Instead of learning self-expression coefficients directly, we propose a novel metric learning approach to learn instead a subspace affinity function using a siamese neural network architecture. Consequently, our model benefits from a constant number of parameters and a constant-size memory footprint, allowing it to scale to considerably larger datasets. In addition, we can formally show that our model is still able to exactly recover subspace clusters given an independence assumption. The siamese architecture in combination with a novel geometric classifier further makes our model inductive, allowing it to cluster out-of-sample data. Additionally, non-linear clusters can be detected by simply adding an auto-encoder module to the architecture. The whole model can then be trained end-to-end in a self-supervised manner. This work in progress reports promising preliminary results on the MNIST dataset. In the spirit of reproducible research, we make all code publicly available. In future work we plan to investigate several extensions of our model and to expand experimental evaluation.
85. Normalization Techniques in Training DNNs: Methodology, Analysis and Application [PDF] 返回目录
Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, Ling Shao
Abstract: Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications. This paper reviews and comments on the past, present and future of normalization methods in the context of DNN training. We provide a unified picture of the main motivation behind different approaches from the perspective of optimization, and present a taxonomy for understanding the similarities and differences between them. Specifically, we decompose the pipeline of the most representative normalizing activation methods into three components: normalization area partitioning, the normalization operation, and normalization representation recovery. In doing so, we provide insight for designing new normalization techniques. Finally, we discuss the current progress in understanding normalization methods, and provide a comprehensive review of the applications of normalization to particular tasks, in which it can effectively solve the key issues.
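The three-component decomposition can be read off a plain batch-normalization implementation; the sketch below is a simplified illustration of the taxonomy, not code from the paper.

```python
# Batch normalization written as the survey's three components:
# 1) area partitioning (which axes statistics are pooled over),
# 2) the normalization operation, 3) representation recovery (affine).
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    dims = (0, 2, 3)                               # 1) per-channel over N, H, W
    mu = x.mean(dim=dims, keepdim=True)
    var = x.var(dim=dims, keepdim=True, unbiased=False)
    x_hat = (x - mu) / torch.sqrt(var + eps)       # 2) standardize
    return gamma * x_hat + beta                    # 3) recover representation

x = torch.randn(4, 3, 8, 8)
y = batch_norm(x, torch.ones(1, 3, 1, 1), torch.zeros(1, 3, 1, 1))
```

Changing only the partitioning axes (e.g., to `(1, 2, 3)`) yields layer normalization, which is what makes this decomposition convenient for comparing methods.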
86. Agile Reactive Navigation for A Non-Holonomic Mobile Robot Using A Pixel Processor Array [PDF] 返回目录
Yanan Liu, Laurie Bose, Colin Greatwood, Jianing Chen, Rui Fan, Thomas Richardson, Stephen J. Carey, Piotr Dudek, Walterio Mayol-Cuevas
Abstract: This paper presents an agile reactive navigation strategy for driving a non-holonomic ground vehicle around a preset course of gates in a cluttered environment using a low-cost processor array sensor. This enables machine vision tasks to be performed directly upon the sensor's image plane, rather than using a separate general-purpose computer. We demonstrate a small ground vehicle running through or avoiding multiple gates at high speed using minimal computational resources. To achieve this, target tracking algorithms are developed for the Pixel Processing Array, and captured images are then processed directly on the vision sensor, acquiring target information for controlling the ground vehicle. The algorithm can run at up to 2000 fps outdoors and 200 fps at indoor illumination levels. Conducting image processing at the sensor level avoids the bottleneck of image transfer encountered in conventional sensors. The real-time performance and robustness of on-board image processing are validated through experiments. Experimental results demonstrate the algorithm's ability to enable a ground vehicle to navigate at an average speed of 2.20 m/s for passing through multiple gates and 3.88 m/s for a 'slalom' task in an environment featuring significant visual clutter.
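The control side of such a pipeline can be as simple as a proportional law on the target's image-plane offset; the toy sketch below uses an invented gain and interface, not the paper's controller.

```python
# Toy reactive steering: the pixel processor array reports a target's
# horizontal image position, and a proportional law turns the pixel error
# into a steering rate for the non-holonomic vehicle.
def steering_command(target_x, image_width=256, k_p=0.01):
    """Map horizontal pixel error to a steering rate (rad/s)."""
    error = target_x - image_width / 2.0   # pixels off the image centre
    return -k_p * error

print(steering_command(180))  # target right of centre -> steer left: -0.52
```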
87. Hierarchical Deep Multi-modal Network for Medical Visual Question Answering [PDF] 返回目录
Deepak Gupta, Swati Suman, Asif Ekbal
Abstract: Visual Question Answering in the Medical domain (VQA-Med) plays an important role in providing medical assistance to end-users. These users are expected to raise either a straightforward question with a Yes/No answer or a challenging question that requires a detailed and descriptive answer. The existing techniques in VQA-Med fail to distinguish between the different question types, which sometimes complicates the simpler problems or over-simplifies the complicated ones. It is certainly true that for different question types, several distinct systems can lead to confusion and discomfort for the end-users. To address this issue, we propose a hierarchical deep multi-modal network that analyzes and classifies end-user questions/queries and then incorporates a query-specific approach for answer prediction. We refer to our proposed approach as Hierarchical Question Segregation based Visual Question Answering, in short HQS-VQA. Our contributions are three-fold, viz. firstly, we propose a question segregation (QS) technique for VQA-Med; secondly, we integrate the QS model into the hierarchical deep multi-modal neural network to generate proper answers to queries related to medical images; and thirdly, we study the impact of QS in Medical-VQA by comparing the performance of the proposed model with QS and a model without QS. We evaluate the performance of our proposed model on two benchmark datasets, viz. RAD and CLEF18. Experimental results show that our proposed HQS-VQA technique outperforms the baseline models by significant margins. We also conduct a detailed quantitative and qualitative analysis of the obtained results and discover potential causes of errors and their solutions.
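As a toy illustration of question segregation, the rule-based router below stands in for the paper's learned QS classifier (the cue list is an assumption):

```python
# Hypothetical question-segregation step: route each question to a yes/no
# answering head or a descriptive answering head, as in the HQS-VQA design.
def segregate(question: str) -> str:
    yes_no_cues = ("is ", "are ", "does ", "do ", "was ", "can ")
    return "binary" if question.lower().startswith(yes_no_cues) else "descriptive"

print(segregate("Is there a fracture in the image?"))    # binary
print(segregate("What abnormality is seen in the CT?"))  # descriptive
```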
88. Iterative Reconstruction for Low-Dose CT using Deep Gradient Priors of Generative Model [PDF] 返回目录
Zhuonan He, Yikun Zhang, Yu Guan, Shanzhou Niu, Yi Zhang, Yang Chen, Qiegen Liu
Abstract: Dose reduction in computed tomography (CT) is essential for decreasing radiation risk in clinical applications. Iterative reconstruction is one of the most promising ways to compensate for the increased noise due to the reduction of photon flux. Unlike most existing prior-driven algorithms, which benefit from manually designed prior functions or supervised learning schemes, in this work we integrate data-consistency as a conditional term into the iterative generative model for low-dose CT. At first, a score-based generative network is used for unsupervised distribution learning, and the gradient of the generative density prior is learned from normal-dose images. Then, annealed Langevin dynamics is employed to update the trained priors with a conditional scheme, i.e., the distance between the reconstructed image and the manifold is minimized along with data fidelity during reconstruction. Experimental comparisons demonstrate the noise reduction and detail preservation abilities of the proposed method.
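A compact sketch of annealed Langevin sampling with a data-consistency term, following the abstract's description; `A`, `At` (the CT forward operator and its adjoint) and `score_net` are user-supplied placeholders, and the step sizes are illustrative.

```python
# Annealed Langevin reconstruction with a conditional data-fidelity term
# (a schematic reading of the abstract, not the authors' implementation).
import torch

def langevin_reconstruct(y, A, At, score_net, sigmas, step=1e-5, lam=1.0):
    x = torch.randn_like(At(y))                         # start from noise
    for sigma in sigmas:                                # annealing over noise levels
        alpha = step * (sigma / sigmas[-1]) ** 2
        for _ in range(10):
            prior = score_net(x, sigma)                 # learned gradient of log p(x)
            fidelity = At(A(x) - y)                     # data-consistency gradient
            noise = (2 * alpha) ** 0.5 * torch.randn_like(x)
            x = x + alpha * (prior - lam * fidelity) + noise
    return x

# Toy demo with an identity "projection" and a zero score (no trained prior):
x_hat = langevin_reconstruct(torch.randn(1, 1, 16, 16),
                             A=lambda v: v, At=lambda v: v,
                             score_net=lambda v, s: torch.zeros_like(v),
                             sigmas=[1.0, 0.5, 0.1])
```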
89. COVID-19 Infection Map Generation and Detection from Chest X-Ray Images [PDF] 返回目录
Aysen Degerli, Mete Ahishali, Mehmet Yamac, Serkan Kiranyaz, Muhammad E. H. Chowdhury, Khalid Hameed, Tahir Hamid, Rashid Mazhar, Moncef Gabbouj
Abstract: Computer-aided diagnosis has become a necessity for accurate and immediate coronavirus disease 2019 (COVID-19) detection to aid treatment and prevent the spread of the virus. Compared to other diagnosis methodologies, chest X-ray (CXR) imaging is an advantageous tool since it is fast, low-cost, and easily accessible. Thus, CXR has a great potential not only to help diagnose COVID-19 but also to track the progression of the disease. Numerous studies have proposed to use Deep Learning techniques for COVID-19 diagnosis. However, they have used very limited CXR image repositories for evaluation, with a small number, a few hundred, of COVID-19 samples. Moreover, these methods can neither localize nor grade the severity of COVID-19 infection. For this purpose, recent studies proposed to explore the activation maps of deep networks. However, they remain inaccurate for localizing the actual infection, making them unreliable for clinical use. This study proposes a novel method for the joint localization, severity grading, and detection of COVID-19 from CXR images by generating the so-called infection maps, which can accurately localize and grade the severity of COVID-19 infection. To accomplish this, we have compiled the largest COVID-19 dataset to date, with 2951 COVID-19 CXR images, where the annotation of the ground-truth segmentation masks is performed on CXRs by a novel collaborative expert human-machine approach. Furthermore, we publicly release the first CXR dataset with the ground-truth segmentation masks of the COVID-19 infected regions. A detailed set of experiments shows that state-of-the-art segmentation networks can learn to localize COVID-19 infection with an F1-score of 85.81%, which is significantly superior to the activation maps created by the previous methods. Finally, the proposed approach achieved a COVID-19 detection performance with 98.37% sensitivity and 99.16% specificity.
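For reference, the mask-level F1-score behind the reported 85.81% localization result can be computed as follows (a generic metric helper, not the authors' evaluation code):

```python
# Pixel-wise F1 between a predicted infection mask and the ground truth.
import numpy as np

def mask_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2 * tp / (2 * tp + fp + fn + 1e-8)

print(mask_f1(np.ones((4, 4)), np.eye(4)))  # over-segmentation example: 0.4
```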
90. Quantitative and Qualitative Evaluation of Explainable Deep Learning Methods for Ophthalmic Diagnosis [PDF] 返回目录
Amitojdeep Singh, J. Jothi Balaji, Varadharajan Jayakumar, Mohammed Abdul Rasheed, Rajiv Raman, Vasudevan Lakshminarayanan
Abstract: Background: The lack of explanations for the decisions made by algorithms such as deep learning has hampered their acceptance by the clinical community despite highly accurate results on multiple problems. Recently, attribution methods have emerged for explaining deep learning models, and they have been tested on medical imaging problems. The performance of attribution methods is compared on standard machine learning datasets and not on medical images. In this study, we perform a comparative analysis to determine the most suitable explainability method for retinal OCT diagnosis. Methods: A commonly used deep learning model known as Inception v3 was trained to diagnose 3 retinal diseases - choroidal neovascularization (CNV), diabetic macular edema (DME), and drusen. The explanations from 13 different attribution methods were rated by a panel of 14 clinicians for clinical significance. Feedback was obtained from the clinicians regarding the current and future scope of such methods. Results: An attribution method based on a Taylor series expansion, called Deep Taylor, was rated the highest by clinicians, with a median rating of 3.85/5. It was followed by two other attribution methods, Guided backpropagation and SHAP (SHapley Additive exPlanations). Conclusion: Explanations of deep learning models can make them more transparent for clinical diagnosis. This study compared different explanation methods in the context of retinal OCT diagnosis and found that the best-performing method may not be the one considered best for other deep learning tasks. Overall, there was a high degree of acceptance from the clinicians surveyed in the study. Keywords: explainable AI, deep learning, machine learning, image processing, Optical coherence tomography, retina, Diabetic macular edema, Choroidal Neovascularization, Drusen
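As a reminder of what this family of methods computes, here is the simplest member, plain input-gradient saliency, with a tiny stand-in CNN rather than Inception v3; Deep Taylor, Guided Backprop, and SHAP refine this basic recipe.

```python
# Vanilla gradient attribution: the gradient of the predicted-class score
# with respect to the input pixels gives a per-pixel relevance map.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 3))
x = torch.randn(1, 3, 128, 128, requires_grad=True)
model(x).max().backward()                      # backprop the top logit
saliency = x.grad.abs().max(dim=1).values      # (1, 128, 128) attribution map
```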
91. Deep Learning-based Four-region Lung Segmentation in Chest Radiography for COVID-19 Diagnosis [PDF] 返回目录
Young-Gon Kim, Kyungsang Kim, Dufan Wu, Hui Ren, Won Young Tak, Soo Young Park, Yu Rim Lee, Min Kyu Kang, Jung Gil Park, Byung Seok Kim, Woo Jin Chung, Mannudeep K. Kalra, Quanzheng Li
Abstract: Purpose. Imaging plays an important role in assessing the severity of COVID 19 pneumonia. However, semantic interpretation of chest radiography (CXR) findings does not include a quantitative description of radiographic opacities. Most current AI-assisted CXR image analysis frameworks do not quantify regional variations of disease. To address these, we propose a four-region lung segmentation method to assist accurate quantification of COVID 19 pneumonia. Methods. A segmentation model to separate the left and right lungs is first applied, and then a carina and left hilum detection network is used, as these are the clinical landmarks for separating the upper and lower lungs. To improve the segmentation performance on COVID 19 images, an ensemble strategy incorporating five models is exploited. Using each region, we evaluated the clinical relevance of the proposed method with the Radiographic Assessment of the Quality of Lung Edema (RALE). Results. The proposed ensemble strategy showed a Dice score of 0.900, which is significantly higher than conventional methods (0.854-0.889). Mean intensities of the segmented four regions indicate a positive correlation to the extent and density scores of pulmonary opacities under the RALE framework. Conclusion. A deep learning based model in CXR can accurately segment and quantify the regional distribution of pulmonary opacities in patients with COVID 19 pneumonia.
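The ensembling and Dice evaluation described in Methods/Results reduce to a few lines; the sketch below uses averaged probability maps and a 0.5 threshold as illustrative choices.

```python
# Five-model mask ensembling plus Dice scoring (generic sketch, not the
# authors' pipeline): average the probability maps, binarize, then score.
import numpy as np

def ensemble_mask(prob_maps, thresh=0.5):
    return np.mean(prob_maps, axis=0) >= thresh

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

probs = [np.random.rand(256, 256) for _ in range(5)]   # five model outputs
gt = np.random.rand(256, 256) > 0.5                    # dummy ground truth
print(dice(ensemble_mask(probs), gt))
```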
92. Potential Features of ICU Admission in X-ray Images of COVID-19 Patients [PDF] 返回目录
Douglas P. S. Gomes, Anwaar Ulhaq, Manoranjan Paul, Michael J. Horry, Subrata Chakraborty, Manas Saha, Tanmoy Debnath, D.M. Motiur Rahaman
Abstract: X-ray images may present non-trivial features with predictive information of patients that develop severe symptoms of COVID-19. If true, this hypothesis may have practical value in allocating resources to particular patients while using a relatively inexpensive imaging technique. The difficulty of testing such a hypothesis comes from the need for large sets of labelled data, which not only need to be well-annotated but also should contemplate the post-imaging severity outcome. On this account, this paper presents a methodology for extracting features from a limited data set with outcome label (patient required ICU admission or not) and correlating its significance to an additional, larger data set with hundreds of images. The methodology employs a neural network trained to recognise lung pathologies to extract the semantic features, which are then analysed with a shallow decision tree to limit overfitting while increasing interpretability. This analysis points out that only a few features explain most of the variance between patients that developed severe symptoms. When applied to an unrelated, larger data set with labels extracted from clinical notes, the method classified distinct sets of samples where there was a much higher frequency of labels such as `Consolidation', `Effusion', and `alveolar'. A further brief analysis of the locations of such labels also showed a significant increase in the frequency of words like `bilateral', `middle', and `lower' in patients classified as having higher chances of becoming severe. The methodology for dealing with the lack of specific ICU label data while attesting correlations with a data set containing text notes is novel; its results suggest that some pathologies should receive higher weights when assessing disease severity.
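A schematic of the feature-to-tree pipeline, with random vectors standing in for the network's semantic features; the depth limit mirrors the paper's use of a shallow tree to curb overfitting.

```python
# Deep features -> shallow decision tree (illustrative stand-in data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(120, 32)              # pretend semantic features per X-ray
y = np.random.randint(0, 2, 120)         # ICU admission label (0/1)
tree = DecisionTreeClassifier(max_depth=3)   # shallow: interpretable, less overfitting
tree.fit(X, y)
print(tree.feature_importances_.argsort()[-3:])  # a few features dominate
```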
93. Physics-Guided Recurrent Graph Networks for Predicting Flow and Temperature in River Networks [PDF] 返回目录
Xiaowei Jia, Jacob Zwart, Jeffery Sadler, Alison Appling, Samantha Oliver, Steven Markstrom, Jared Willard, Shaoming Xu, Michael Steinbach, Jordan Read, Vipin Kumar
Abstract: This paper proposes a physics-guided machine learning approach that combines advanced machine learning models and physics-based models to improve the prediction of water flow and temperature in river networks. We first build a recurrent graph network model to capture the interactions among multiple segments in the river network. Then we present a pre-training technique which transfers knowledge from physics-based models to initialize the machine learning model and learn the physics of streamflow and thermodynamics. We also propose a new loss function that balances the performance over different river segments. We demonstrate the effectiveness of the proposed method in predicting temperature and streamflow in a subset of the Delaware River Basin. In particular, we show that the proposed method brings a 33%/14% improvement over the state-of-the-art physics-based model and 24%/14% over traditional machine learning models (e.g., Long-Short Term Memory Neural Network) in temperature/streamflow prediction using very sparse (0.1%) observation data for training. The proposed method has also been shown to produce better performance when generalized to different seasons or river segments with different streamflow ranges.
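One plausible reading of the segment-balancing loss is an equally weighted per-segment MSE, sketched below; the exact weighting in the paper may differ.

```python
# Loss that weights every river segment equally, regardless of how many
# observations each segment contributes (an assumed form of "balancing").
import torch

def balanced_segment_loss(pred, target, segment_ids):
    segments = segment_ids.unique()
    loss = sum(torch.mean((pred[segment_ids == s] - target[segment_ids == s]) ** 2)
               for s in segments)
    return loss / len(segments)

pred, target = torch.randn(100), torch.randn(100)
print(balanced_segment_loss(pred, target, torch.randint(0, 5, (100,))))
```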
94. Deep Selective Combinatorial Embedding and Consistency Regularization for Light Field Super-resolution [PDF] 返回目录
Jing Jin, Junhui Hou, Zhiyu Zhu, Jie Chen, Sam Kwong
Abstract: Light field (LF) images acquired by hand-held devices usually suffer from low spatial resolution as the limited detector resolution has to be shared with the angular dimension. LF spatial super-resolution (SR) thus becomes an indispensable part of the LF camera processing pipeline. The high-dimensionality characteristic and complex geometrical structure of LF images make the problem more challenging than traditional single-image SR. The performance of existing methods is still limited as they fail to thoroughly explore the coherence among LF sub-aperture images (SAIs) and are insufficient in accurately preserving the scene's parallax structure. To tackle this challenge, we propose a novel learning-based LF spatial SR framework. Specifically, each SAI of an LF image is first coarsely and individually super-resolved by exploring the complementary information among SAIs with selective combinatorial geometry embedding. To achieve efficient and effective selection of the complementary information, we propose two novel sub-modules conducted hierarchically: the patch selector provides an option of retrieving similar image patches based on offline disparity estimation to handle large-disparity correlations; and the SAI selector adaptively and flexibly selects the most informative SAIs to improve the embedding efficiency. To preserve the parallax structure among the reconstructed SAIs, we subsequently append a consistency regularization network trained over a structure-aware loss function to refine the parallax relationships over the coarse estimation. In addition, we extend the proposed method to irregular LF data. To the best of our knowledge, this is the first learning-based SR method for irregular LF data. Experimental results over both synthetic and real-world LF datasets demonstrate the significant advantage of our approach over state-of-the-art methods.
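The adaptive SAI selector can be pictured as a top-k ranking of auxiliary views by feature similarity; the cosine measure below is an illustrative stand-in for the learned selector.

```python
# Toy SAI selector: keep the k sub-aperture views whose features are most
# similar to the view being super-resolved (similarity measure assumed).
import torch

def select_sais(target_feat, sai_feats, k=4):
    sims = torch.cosine_similarity(sai_feats, target_feat.unsqueeze(0), dim=1)
    return sims.topk(k).indices          # indices of the k most informative SAIs

target = torch.randn(64)                 # feature of the current view
print(select_sais(target, torch.randn(24, 64)))  # e.g. tensor([ 3, 17,  5, 11])
```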
95. Unsupervised Model Adaptation for Continual Semantic Segmentation [PDF] 返回目录
Serban Stan, Mohammad Rostami
Abstract: We develop an algorithm for adapting a semantic segmentation model that is trained using a labeled source domain to generalize well in an unlabeled target domain. A similar problem has been studied extensively in the unsupervised domain adaptation (UDA) literature, but existing UDA algorithms require access to both the source domain labeled data and the target domain unlabeled data for training a domain agnostic semantic segmentation model. Relaxing this constraint enables a user to adapt pretrained models to generalize in a target domain, without requiring access to source data. To this end, we learn a prototypical distribution for the source domain in an intermediate embedding space. This distribution encodes the abstract knowledge that is learned from the source domain. We then use this distribution for aligning the target domain distribution with the source domain distribution in the embedding space. We provide theoretical analysis and explain conditions under which our algorithm is effective. Experiments on benchmark adaptation task demonstrate our method achieves competitive performance even compared with joint UDA approaches.
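A minimal sketch of source-free alignment against a stored prototypical distribution; the Gaussian prototypes and moment-matching loss are simplifying assumptions, not the paper's exact formulation.

```python
# Pull target-domain embeddings toward source prototypes stored in the
# embedding space, so no source images are needed at adaptation time.
import torch

def prototype_alignment_loss(target_emb, proto_mu, proto_sigma):
    mu_t = target_emb.mean(dim=0)        # first moment of target features
    sigma_t = target_emb.std(dim=0)      # second moment
    return torch.norm(mu_t - proto_mu) + torch.norm(sigma_t - proto_sigma)

emb = torch.randn(32, 16)                # a batch of target embeddings
print(prototype_alignment_loss(emb, torch.zeros(16), torch.ones(16)))
```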
96. Generating Realistic COVID19 X-rays with a Mean Teacher + Transfer Learning GAN [PDF] 返回目录
Sumeet Menon, Joshua Galita, David Chapman, Aryya Gangopadhyay, Jayalakshmi Mangalagiri, Phuong Nguyen, Yaacov Yesha, Yelena Yesha, Babak Saboury, Michael Morris
Abstract: COVID-19 is a novel infectious disease responsible for over 800K deaths worldwide as of August 2020. The need for rapid testing is a high priority, and alternative testing strategies including X-ray image classification are a promising area of research. However, at present, public datasets for COVID19 x-ray images have low data volumes, making it challenging to develop accurate image classifiers. Several recent papers have made use of Generative Adversarial Networks (GANs) in order to increase the training data volumes. But realistic synthetic COVID19 X-rays remain challenging to generate. We present a novel Mean Teacher + Transfer GAN (MTT-GAN) that generates COVID19 chest X-ray images of high quality. In order to create a more accurate GAN, we employ transfer learning from the Kaggle Pneumonia X-Ray dataset, a highly relevant data source orders of magnitude larger than public COVID19 datasets. Furthermore, we employ the Mean Teacher algorithm as a constraint to improve stability of training. Our qualitative analysis shows that the MTT-GAN generates X-ray images that are greatly superior to a baseline GAN and visually comparable to real X-rays, although board-certified radiologists can distinguish MTT-GAN fakes from real COVID19 X-rays. Quantitative analysis shows that MTT-GAN greatly improves the accuracy of both a binary COVID19 classifier as well as a multi-class Pneumonia classifier as compared to a baseline GAN. Our classification accuracy is favourable as compared to recently reported results in the literature for similar binary and multi-class COVID19 screening tasks.
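The Mean Teacher constraint boils down to an exponential-moving-average (EMA) copy of the student's weights; a generic sketch follows (the decay value and stand-in module are illustrative).

```python
# Generic Mean Teacher update: the teacher tracks an EMA of the student's
# parameters after every optimizer step, which stabilizes training.
import copy
import torch
import torch.nn as nn

def ema_update(teacher, student, decay=0.999):
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

student = nn.Linear(100, 100)        # stand-in for the generator
teacher = copy.deepcopy(student)
# ... after each optimizer step on `student`:
ema_update(teacher, student)
```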
97. Enhanced 3D Myocardial Strain Estimation from Multi-View 2D CMR Imaging [PDF] 返回目录
Mohamed Abdelkhalek, Heba Aguib, Mohamed Moustafa, Khalil Elkhodary
Abstract: In this paper, we propose an enhanced 3D myocardial strain estimation procedure which combines complementary displacement information from multiple orientations of a single imaging modality (untagged CMR SSFP images). To estimate myocardial strain across the left ventricle, we register the sets of short-axis, four-chamber and two-chamber views via a 2D non-rigid registration algorithm implemented in a commercial software (Segment, Medviso). We then create a series of interpolating functions for the three orthogonal directions of motion and use them to deform a tetrahedral mesh representation of a patient-specific left ventricle. Additionally, we correct for overestimation of displacement by introducing a weighting scheme that is based on displacement along the long axis. The procedure was evaluated on the STACOM 2011 dataset containing CMR SSFP images for 16 healthy volunteers. We show increased accuracy in estimating the three strain components (radial, circumferential, longitudinal) compared to reported results in the challenge, for the imaging modality of interest (SSFP). Our peak strain estimates are also significantly closer to reported measurements from studies of a larger cohort in the literature. Our proposed procedure provides a fast way to accurately reconstruct a deforming patient-specific model of the left ventricle using the commonest imaging modality routinely administered in clinical settings, without requiring additional or specialized imaging protocols.
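The interpolation step can be sketched with a radial-basis-function interpolant over the measured displacements; the long-axis weighting below is an assumed form, not the paper's exact scheme.

```python
# Scattered 2-D registration displacements -> smooth 3-D displacement field
# used to deform the tetrahedral mesh (illustrative data and weighting).
import numpy as np
from scipy.interpolate import RBFInterpolator

pts = np.random.rand(200, 3)               # sample points from the view planes
disp = np.random.rand(200, 3) * 0.01       # measured displacement vectors
interp = RBFInterpolator(pts, disp)        # one vector-valued interpolant

verts = np.random.rand(500, 3)             # tetrahedral-mesh vertices
w = 1.0 / (1.0 + verts[:, 2:3])            # assumed damping along the long axis (z)
deformed = verts + w * interp(verts)       # displaced patient-specific mesh
```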
98. Blind Image Super-Resolution with Spatial Context Hallucination [PDF] 返回目录
Dong Huo, Yee-Hong Yang
Abstract: Deep convolutional neural networks (CNNs) play a critical role in single image super-resolution (SISR), owing to the remarkable advances in high-performance computing. However, most super-resolution (SR) methods only focus on recovering from bicubic degradation. Reconstructing high-resolution (HR) images from randomly blurred and noisy low-resolution (LR) images is still a challenging problem. In this paper, we propose a novel Spatial Context Hallucination Network (SCHN) for blind super-resolution without knowing the degradation kernel. We find that when the blur kernel is unknown, separate deblurring and super-resolution could limit the performance because of the accumulation of error. Thus, we integrate denoising, deblurring and super-resolution within one framework to avoid such a problem. We train our model on two high quality datasets, DIV2K and Flickr2K. Our method performs better than state-of-the-art methods when input images are corrupted with random blur and noise.
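Training or evaluating blind SR requires synthesizing such degradations; below is a sketch of a random blur-downsample-noise pipeline with illustrative parameter ranges (not the paper's exact settings).

```python
# Random degradation for blind SR: Gaussian blur with a random width,
# bicubic downsampling, then additive noise.
import torch
import torch.nn.functional as F

def degrade(hr, scale=4, sigma_range=(0.2, 3.0), noise_std=0.02, k=21):
    sigma = torch.empty(1).uniform_(*sigma_range).item()
    ax = torch.arange(k) - k // 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / (g.sum() ** 2)    # normalized 2-D Gaussian
    kernel = kernel.expand(hr.shape[1], 1, k, k)
    blurred = F.conv2d(hr, kernel, padding=k // 2, groups=hr.shape[1])
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic")
    return lr + noise_std * torch.randn_like(lr)

lr = degrade(torch.rand(1, 3, 128, 128))   # (1, 3, 32, 32) noisy, blurry input
```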
99. Democratizing Artificial Intelligence in Healthcare: A Study of Model Development Across Two Institutions Incorporating Transfer Learning [PDF] 返回目录
Vikash Gupta, Holger Roth, Varun Buch, Marcio A.B.C. Rockenbach, Richard D. White, Dong Yang, Olga Laur, Brian Ghoshhajra, Ittai Dayan, Daguang Xu, Mona G. Flores, Barbaros Selnur Erdal
Abstract: Training deep learning models typically requires extensive data, yet large, well-curated medical-image datasets for developing artificial intelligence (AI) models in radiology are not readily available. Recognizing the potential of transfer learning (TL) to let a fully trained model from one institution be fine-tuned by another institution using a much smaller local dataset, this report describes the challenges, methodology, and benefits of TL in the context of developing an AI model for a basic use case: segmentation of the left ventricular myocardium (LVM) on images from 4-dimensional coronary computed tomography angiography. Ultimately, comparisons of LVM segmentations predicted by a model trained locally from random initialization versus one whose training was enhanced by TL showed that a use-case model initiated by TL can be developed with sparse labels and acceptable performance. This process reduces the time required to build a new model in the clinical environment at a different institution.
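A generic fine-tuning loop of the kind the report evaluates, sketched in PyTorch; the optimizer, loss, and binary-mask assumption are illustrative, not taken from the paper:

```python
import torch

def finetune(model, pretrained_ckpt, local_loader, epochs=10, lr=1e-4):
    """Fine-tune a model fully trained at institution A on a small local
    dataset at institution B -- a generic TL recipe, not the authors' pipeline."""
    model.load_state_dict(torch.load(pretrained_ckpt))   # institution A weights
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.BCEWithLogitsLoss()             # binary LVM mask (assumed)
    model.train()
    for _ in range(epochs):
        for image, mask in local_loader:                 # sparse local labels
            optimizer.zero_grad()
            loss = criterion(model(image), mask)
            loss.backward()
            optimizer.step()
    return model
```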
100. Deep Artifact-Free Residual Network for Single Image Super-Resolution [PDF] 返回目录
Hamdollah Nasrollahi, Kamran Farajzadeh, Vahid Hosseini, Esmaeil Zarezadeh, Milad Abdollahzadeh
Abstract: Recently, convolutional neural networks have shown promising performance for single-image super-resolution. In this paper, we propose the Deep Artifact-Free Residual (DAFR) network, which combines the merits of residual learning with the use of the ground-truth image as the training target. Our framework uses a deep model to extract the high-frequency information necessary for high-quality image reconstruction. We use a skip connection to feed the low-resolution image to the network just before image reconstruction. In this way, we can use the ground-truth images as targets and avoid misleading the network with artifacts in the difference image. To extract clean high-frequency information, we train the network in two steps: the first is traditional residual learning, which uses the difference image as the target; the trained parameters of this step are then transferred to the main training in the second step. Our experimental results show that the proposed method achieves better quantitative and qualitative image quality than existing methods.
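A minimal PyTorch sketch of the DAFR idea: a residual body plus a skip connection that re-injects the input image before reconstruction, so supervision can use the ground-truth HR image itself. Depth and width are illustrative, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DAFRSketch(nn.Module):
    """Residual body with a skip connection feeding the input image back in
    just before reconstruction (layer sizes are assumptions)."""
    def __init__(self, width=64, depth=8):
        super().__init__()
        layers = [nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(width, 3, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)   # skip connection: input image re-enters here

# Two-step training sketch:
#   step 1: fit self.body to difference images (HR minus upsampled LR)  [residual learning]
#   step 2: transfer those weights, then train forward() against HR directly
```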
101. SceneGen: Generative Contextual Scene Augmentation using Scene Graph Priors [PDF] 返回目录
Mohammad Keshavarzi, Aakash Parikh, Xiyu Zhai, Melody Mao, Luisa Caldas, Allen Yang
Abstract: Spatial computing experiences are constrained by the real-world surroundings of the user. In such experiences, augmenting existing scenes with virtual objects requires a contextual approach in which geometric conflicts are avoided and functional, plausible relationships to other objects are maintained in the target environment. Yet, owing to the complexity and diversity of user environments, automatically calculating ideal positions for virtual content that adapt to the context of the scene is a challenging task. Motivated by this problem, in this paper we introduce SceneGen, a generative contextual augmentation framework that predicts virtual object positions and orientations within existing scenes. SceneGen takes a semantically segmented scene as input and outputs positional and orientational probability maps for placing virtual content. We formulate a novel spatial Scene Graph representation that encapsulates explicit topological properties between objects, object groups, and the room. We believe explicit and intuitive features play an important role in informative content creation and user interaction in spatial computing settings, a quality that implicit models do not capture. We use kernel density estimation (KDE) to build a multivariate conditional knowledge model trained on prior spatial Scene Graphs extracted from real-world 3D scanned data. To further capture orientational properties, we develop a fast pose-annotation tool to extend current real-world datasets with orientational labels. Finally, to demonstrate our system in action, we develop an Augmented Reality application in which objects can be contextually augmented in real time.
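The KDE step can be illustrated with scikit-learn; the four-dimensional feature vector below is a hypothetical stand-in for SceneGen's Scene Graph features, and the random arrays stand in for real scanned-scene data:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Prior: spatial features of one object category gathered from scanned scenes.
# The 4-D feature (x, y, distance-to-wall, orientation) is assumed for illustration.
prior = np.random.rand(500, 4)
kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(prior)

# Score candidate placements in a new room; higher log-density means a more
# plausible position/orientation for the virtual object.
candidates = np.random.rand(2000, 4)
scores = kde.score_samples(candidates)
best_placement = candidates[np.argmax(scores)]
```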
Note: any Chinese text in this digest is machine-translated; the cover image is a word cloud of the paper titles.