Contents
1. Preparing for the Worst: Making Networks Less Brittle with Adversarial Batch Normalization [PDF]
2. Light Direction and Color Estimation from Single Image with Deep Regression [PDF]
3. Multi-Resolution Graph Neural Network for Large-Scale Pointcloud Segmentation [PDF]
4. Deep Learning for 3D Point Cloud Understanding: A Survey [PDF]
5. Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders [PDF]
6. Image Captioning with Attention for Smart Local Tourism using EfficientNet [PDF]
7. Faster Gradient-based NAS Pipeline Combining Broad Scalable Architecture with Confident Learning Rate [PDF]
8. PMVOS: Pixel-Level Matching-Based Video Object Segmentation [PDF]
9. Synthetic Convolutional Features for Improved Semantic Segmentation [PDF]
10. IDA: Improved Data Augmentation Applied to Salient Object Detection [PDF]
11. Densely Guided Knowledge Distillation using Multiple Teacher Assistants [PDF]
12. $σ^2$R Loss: a Weighted Loss by Multiplicative Factors using Sigmoidal Functions [PDF]
13. Commands 4 Autonomous Vehicles (C4AV) Workshop Summary [PDF]
14. DeltaGAN: Towards Diverse Few-shot Image Generation with Sample-Specific Delta [PDF]
15. Moving object detection for visual odometry in a dynamic environment based on occlusion accumulation [PDF]
16. Contextual Semantic Interpretability [PDF]
17. Progressive Semantic-Aware Style Transformation for Blind Face Restoration [PDF]
18. Learning Emotional-Blinded Face Representations [PDF]
20. DeepRemaster: Temporal Source-Reference Attention Networks for Comprehensive Video Enhancement [PDF]
23. TopNet: Topology Preserving Metric Learning for Vessel Tree Reconstruction and Labelling [PDF]
27. Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [PDF]
28. Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for Unsupervised Domain Adaptation in Segmentation [PDF]
30. MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [PDF]
31. Objective, Probabilistic, and Generalized Noise Level Dependent Classifications of sets of more or less 2D Periodic Images into Plane Symmetry Groups [PDF]
33. Predicting molecular phenotypes from histopathology images: a transcriptome-wide expression-morphology analysis in breast cancer [PDF]
36. Classification and Region Analysis of COVID-19 Infection using Lung CT Images and Deep Convolutional Neural Networks [PDF]
38. Fused Deep Convolutional Neural Network for Precision Diagnosis of COVID-19 Using Chest X-Ray Images [PDF]
40. An Analysis by Synthesis Method that Allows Accurate Spatial Modeling of Thickness of Cortical Bone from Clinical QCT [PDF]
42. SCREENet: A Multi-view Deep Convolutional Neural Network for Classification of High-resolution Synthetic Mammographic Screening Scans [PDF]
43. The Next Big Thing(s) in Unsupervised Machine Learning: Five Lessons from Infant Learning [PDF]
Abstracts
1. Preparing for the Worst: Making Networks Less Brittle with Adversarial Batch Normalization [PDF]
Manli Shu, Zuxuan Wu, Micah Goldblum, Tom Goldstein
Abstract: Adversarial training is the industry standard for producing models that are robust to small adversarial perturbations. However, machine learning practitioners need models that are robust to domain shifts that occur naturally, such as changes in the style or illumination of input images. Such changes in input distribution have been effectively modeled as shifts in the mean and variance of deep image features. We adapt adversarial training by adversarially perturbing these feature statistics, rather than image pixels, to produce models that are robust to domain shift. We also visualize images from adversarially crafted distributions. Our method, Adversarial Batch Normalization (AdvBN), significantly improves the performance of ResNet-50 on ImageNet-C (+8.1%), Stylized-ImageNet (+6.7%), and ImageNet-Instagram (+3.9%) over standard training practices. In addition, we demonstrate that AdvBN can also improve generalization on semantic segmentation.
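To make the idea concrete, here is a minimal PyTorch-style sketch of adversarially perturbing per-channel feature statistics rather than pixels. The function, its hyperparameters (eps, steps, lr), and the multiplicative PGD-like inner loop are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def adv_feature_stats_perturb(features, labels, head, eps=1.1, steps=3, lr=0.1):
    """Adversarially perturb per-channel feature mean/std (not pixels).

    features: (N, C, H, W) activations from an early network stage.
    head: the remainder of the network, mapping features to logits.
    The multiplicative perturbation and its eps range are assumed choices.
    """
    n, c, _, _ = features.shape
    mu = features.mean(dim=(2, 3), keepdim=True)           # (N, C, 1, 1)
    sigma = features.std(dim=(2, 3), keepdim=True) + 1e-5

    # Multiplicative perturbations of the statistics, initialised at identity.
    d_mu = torch.ones(n, c, 1, 1, device=features.device, requires_grad=True)
    d_sigma = torch.ones(n, c, 1, 1, device=features.device, requires_grad=True)

    for _ in range(steps):
        normed = (features - mu) / sigma                   # re-normalise
        adv = normed * (sigma * d_sigma) + mu * d_mu       # perturbed stats
        loss = F.cross_entropy(head(adv), labels)
        g_mu, g_sigma = torch.autograd.grad(loss, [d_mu, d_sigma])
        with torch.no_grad():
            d_mu += lr * g_mu.sign()                       # ascend the loss
            d_sigma += lr * g_sigma.sign()
            d_mu.clamp_(1 / eps, eps)                      # stay near identity
            d_sigma.clamp_(1 / eps, eps)

    normed = (features - mu) / sigma
    return (normed * (sigma * d_sigma) + mu * d_mu).detach()
```

The returned features can then be fed through the rest of the network as an adversarial training example, analogous to standard adversarial training in pixel space.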
2. Light Direction and Color Estimation from Single Image with Deep Regression [PDF]
Hassan A. Sial, Ramon Baldrich, Maria Vanrell, Dimitris Samaras
Abstract: We present a method to estimate the direction and color of the scene light source from a single image. Our method is based on two main ideas: (a) we use a new synthetic dataset with strong shadow effects with similar constraints to the SID dataset; (b) we define a deep architecture trained on the mentioned dataset to estimate the direction and color of the scene light source. Apart from showing good performance on synthetic images, we additionally propose a preliminary procedure to obtain light positions of the Multi-Illumination dataset, and, in this way, we also prove that our trained model achieves good performance when it is applied to real scenes.
3. Multi-Resolution Graph Neural Network for Large-Scale Pointcloud Segmentation [PDF]
Liuyue Xie, Tomotake Furuhata, Kenji Shimada
Abstract: In this paper, we propose a multi-resolution deep-learning architecture to semantically segment dense large-scale pointclouds. Dense pointcloud data require a computationally expensive feature encoding process before semantic segmentation. Previous work has used different approaches to drastically downsample from the original pointcloud so common computing hardware can be utilized. While these approaches can relieve the computation burden to some extent, they are still limited in their processing capability for multiple scans. We present MuGNet, a memory-efficient, end-to-end graph neural network framework to perform semantic segmentation on large-scale pointclouds. We reduce the computation demand by utilizing a graph neural network on the preformed pointcloud graphs and retain the precision of the segmentation with a bidirectional network that fuses feature embedding at different resolutions. Our framework has been validated on benchmark datasets including Stanford Large-Scale 3D Indoor Spaces Dataset(S3DIS) and Virtual KITTI Dataset. We demonstrate that our framework can process up to 45 room scans at once on a single 11 GB GPU while still surpassing other graph-based solutions for segmentation on S3DIS with an 88.5\% (+3\%) overall accuracy and 69.8\% (+7.7\%) mIOU accuracy.
4. Deep Learning for 3D Point Cloud Understanding: A Survey [PDF]
Haoming Lu, Humphrey Shi
Abstract: The development of practical applications, such as autonomous driving and robotics, has brought increasing attention to 3D point cloud understanding. While deep learning has achieved remarkable success on image-based tasks, there are many unique challenges faced by deep neural networks in processing massive, unstructured and noisy 3D points. To demonstrate the latest progress of deep learning for 3D point cloud understanding, this paper summarizes recent remarkable research contributions in this area from several different directions (classification, segmentation, detection, tracking, flow estimation, registration, augmentation and completion), together with commonly used datasets, metrics and state-of-the-art performances. More information regarding this survey can be found at: this https URL.
5. Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders [PDF]
Abhishek Banerjee, Uttaran Bhattacharya, Aniket Bera
Abstract: We present a novel generalized zero-shot algorithm to recognize perceived emotions from gestures. Our task is to map gestures to novel emotion categories not encountered in training. We introduce an adversarial, autoencoder-based representation learning that correlates 3D motion-captured gesture sequence with the vectorized representation of the natural-language perceived emotion terms using word2vec embeddings. The language-semantic embedding provides a representation of the emotion label space, and we leverage this underlying distribution to map the gesture-sequences to the appropriate categorical emotion labels. We train our method using a combination of gestures annotated with known emotion terms and gestures not annotated with any emotions. We evaluate our method on the MPI Emotional Body Expressions Database (EBEDB) and obtain an accuracy of $58.43\%$. This improves the performance of current state-of-the-art algorithms for generalized zero-shot learning by $25$--$27\%$ on the absolute.
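As a rough illustration of the zero-shot mapping described above, the sketch below scores a gesture embedding against word2vec vectors of emotion terms. The names (proj, emotion_word_vecs) are hypothetical, and the paper's adversarial autoencoder machinery is omitted:

```python
import torch
import torch.nn.functional as F

def zero_shot_emotion(gesture_feat, emotion_word_vecs, proj):
    """Label a gesture by its nearest emotion word in the word2vec space.

    gesture_feat: (D,) encoder output for a motion-captured gesture sequence.
    emotion_word_vecs: (K, 300) word2vec vectors of K emotion terms,
        which may include terms never seen during training.
    proj: a learned projection from gesture space into the 300-d word space.
    """
    z = proj(gesture_feat)                                     # (300,)
    sims = F.cosine_similarity(z.unsqueeze(0), emotion_word_vecs, dim=1)
    return sims.argmax().item()                                # predicted emotion index
```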
6. Image Captioning with Attention for Smart Local Tourism using EfficientNet [PDF]
Dhomas Hatta Fudholi, Yurio Windiatmoko, Nurdi Afrianto, Prastyo Eko Susanto, Magfirah Suyuti, Ahmad Fathan Hidayatullah, Ridho Rahmadi
Abstract: Smart systems have been developed extensively to help humans with various tasks. Deep Learning technologies, fueled by the explosion of data lakes, push assistant systems even further toward accuracy. One task of such smart systems is to disseminate the information users need, which is crucial in the tourism sector for promoting local destinations. In this research, we design a model for local-tourism-specific image captioning, which will later support the development of AI-powered systems that assist various users. The model is built with a visual Attention mechanism and uses the state-of-the-art feature extractor architecture EfficientNet. A local tourism dataset is collected and used in the research, along with two different kinds of captions: captions that describe the image literally, and captions that represent human logical responses to the image. This is done to make the captioning model more humane when implemented in an assistance system. We compared the performance of two models using EfficientNet architectures (B0 and B4) against the well-known VGG16 and InceptionV3. The best BLEU scores we obtain, using EfficientNetB0, are 73.39 on the training set and 24.51 on the validation set. Captioning results with the developed model show that it can produce logical captions for local tourism-related images.
7. Faster Gradient-based NAS Pipeline Combining Broad Scalable Architecture with Confident Learning Rate [PDF]
Ding Zixiang, Chen Yaran, Li Nannan, Zhao Dongbin
Abstract: In order to further improve the search efficiency of Neural Architecture Search (NAS), we propose B-DARTS, a novel pipeline combining a broad scalable architecture with a Confident Learning Rate (CLR). In B-DARTS, the Broad Convolutional Neural Network (BCNN) is employed as the scalable architecture for DARTS, a popular differentiable NAS approach. On one hand, BCNN is a broad scalable architecture whose topology offers two advantages over a deep one, mainly faster single-step training and higher memory efficiency (i.e., a larger batch size for architecture search), both of which improve the search efficiency of NAS. On the other hand, DARTS discovers the optimal architecture by gradient-based optimization and thus benefits from both of these strengths of BCNN simultaneously. Like vanilla DARTS, B-DARTS suffers from the performance collapse issue, where weight-free operations are prone to be selected by the search strategy. Therefore, we propose CLR, which scales the confidence placed in architecture-weight gradients with the training time of the over-parameterized model, to mitigate this issue. Experimental results on CIFAR-10 and ImageNet show that 1) B-DARTS delivers state-of-the-art efficiency of 0.09 GPU days using a first-order approximation on CIFAR-10; 2) the architecture learned by B-DARTS achieves competitive performance on ImageNet with state-of-the-art composite multiply-accumulate operations and parameters; and 3) the proposed CLR is effective in alleviating the performance collapse issue for both B-DARTS and DARTS.
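A minimal sketch of what a confidence-scaled architecture learning rate could look like; the linear schedule and the floor value are assumptions for illustration, not the paper's exact CLR formulation:

```python
def confident_lr(base_lr, epoch, total_epochs, floor=0.1):
    """Scale the architecture-weight learning rate by a confidence factor
    that grows with training time, so that early (less reliable) gradients
    of the over-parameterized model move the architecture weights less.
    The linear ramp and the floor are illustrative choices.
    """
    confidence = floor + (1.0 - floor) * (epoch / total_epochs)
    return base_lr * confidence
```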
8. PMVOS: Pixel-Level Matching-Based Video Object Segmentation [PDF]
Suhwan Cho, Heansung Lee, Sungmin Woo, Sungjun Jang, Sangyoun Lee
Abstract: Semi-supervised video object segmentation (VOS) aims to segment arbitrary target objects in video when the ground truth segmentation mask of the initial frame is provided. Due to this limitation of using prior knowledge about the target object, feature matching, which compares template features representing the target object with input features, is an essential step. Recently, pixel-level matching (PM), which matches every pixel in template features and input features, has been widely used for feature matching because of its high performance. However, despite its effectiveness, the information used to build the template features is limited to the initial and previous frames. We address this issue by proposing a novel method-PM-based video object segmentation (PMVOS)-that constructs strong template features containing the information of all past frames. Furthermore, we apply self-attention to the similarity maps generated from PM to capture global dependencies. On the DAVIS 2016 validation set, we achieve new state-of-the-art performance among real-time methods (> 30 fps), with a J&F score of 85.6%. Performance on the DAVIS 2017 and YouTube-VOS validation sets is also impressive, with J&F scores of 74.0% and 68.2%, respectively.
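The core of pixel-level matching can be sketched as an all-pairs similarity between template and query feature maps (a PyTorch-style sketch; how PMVOS builds its templates from all past frames and applies self-attention to the similarity maps is not shown):

```python
import torch
import torch.nn.functional as F

def pixel_level_matching(template, query):
    """All-pairs pixel matching between template and query feature maps.

    template: (C, Ht, Wt) features representing the target object
        (e.g. from the initial frame).
    query:    (C, Hq, Wq) features of the current frame.
    Returns a (Ht*Wt, Hq, Wq) stack of cosine-similarity maps; pooling
    these maps and decoding them into a mask is the model's job.
    """
    c, ht, wt = template.shape
    t = F.normalize(template.reshape(c, -1), dim=0)    # (C, Ht*Wt), unit columns
    q = F.normalize(query.reshape(c, -1), dim=0)       # (C, Hq*Wq)
    sim = t.t() @ q                                    # (Ht*Wt, Hq*Wq) cosine sims
    return sim.reshape(ht * wt, *query.shape[1:])
```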
9. Synthetic Convolutional Features for Improved Semantic Segmentation [PDF]
Yang He, Bernt Schiele, Mario Fritz
Abstract: Recently, learning-based image synthesis has made it possible to generate high-resolution images, using either popular adversarial training or a powerful perceptual loss. However, it remains challenging to successfully leverage synthetic data to improve semantic segmentation with additional synthetic images. Therefore, we propose to generate intermediate convolutional features, and present the first synthesis approach tailored to such intermediate convolutional features. This allows us to generate new features from label masks and successfully incorporate them into the training procedure in order to improve the performance of semantic segmentation. Experimental results and analysis on two challenging datasets, Cityscapes and ADE20K, show that our generated features improve performance on segmentation tasks.
10. IDA: Improved Data Augmentation Applied to Salient Object Detection [PDF]
Daniel V. Ruiz, Bruno A. Krinski, Eduardo Todt
Abstract: In this paper, we present an Improved Data Augmentation (IDA) technique focused on Salient Object Detection (SOD). Standard data augmentation techniques proposed in the literature, such as image cropping, rotation, flipping, and resizing, only generate variations of the existing examples and provide limited generalization. Our method combines image inpainting, affine transformations, and the linear combination of different generated background images with salient objects extracted from labeled data. Our proposed technique enables more precise control of the object's position and size while preserving background information. The background choice is based on an inter-image optimization, the object size follows a uniform random distribution within a specified interval, and the object position is intra-image optimal. We show that our method improves segmentation quality when used for training state-of-the-art neural networks on several famous datasets of the SOD field. Combining our method with others surpasses traditional techniques such as horizontal flipping by 0.52% in F-measure and 1.19% in Precision. We also provide an evaluation on 7 different SOD datasets, with 9 distinct evaluation metrics and an average ranking of the evaluated methods.
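A compositing core in this spirit might look as follows (a numpy/OpenCV sketch; the inpainting step, the inter-image background optimization, and the intra-image optimal placement are omitted, with placement randomized here for brevity):

```python
import random
import cv2

def ida_composite(background, obj_rgb, obj_mask, size_range=(0.3, 0.7)):
    """Paste a salient object onto a chosen background (numpy arrays) at a
    uniformly random scale, keeping background pixels outside the mask.
    The size range is an illustrative assumption; in IDA, the background is
    chosen by inter-image optimization and the position is intra-image optimal.
    """
    bh, bw = background.shape[:2]
    s = random.uniform(*size_range)                     # object size ~ U(a, b)
    ow, oh = max(1, int(bw * s)), max(1, int(bh * s))
    obj = cv2.resize(obj_rgb, (ow, oh))
    mask = cv2.resize(obj_mask, (ow, oh)).astype(bool)

    x = random.randint(0, bw - ow)                      # random placement here
    y = random.randint(0, bh - oh)
    out = background.copy()
    region = out[y:y + oh, x:x + ow]
    region[mask] = obj[mask]                            # object pixels only
    return out
```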
11. Densely Guided Knowledge Distillation using Multiple Teacher Assistants [PDF]
Wonchul Son, Jaemin Na, Wonjun Hwang
Abstract: With the success of deep neural networks, knowledge distillation which guides the learning of a small student network from a large teacher network is being actively studied for model compression and transfer learning. However, few studies have been performed to resolve the poor learning issue of the student network when the student and teacher model sizes significantly differ. In this paper, we propose a densely guided knowledge distillation using multiple teacher assistants that gradually decrease the model size to efficiently bridge the gap between teacher and student networks. To stimulate more efficient learning of the student network, we guide each teacher assistant to every other smaller teacher assistant step by step. Specifically, when teaching a smaller teacher assistant at the next step, the existing larger teacher assistants from the previous step are used as well as the teacher network to increase the learning efficiency. Moreover, we design stochastic teaching where, for each mini-batch during training, a teacher or a teacher assistant is randomly dropped. This acts as a regularizer like dropout to improve the accuracy of the student network. Thus, the student can always learn rich distilled knowledge from multiple sources ranging from the teacher to multiple teacher assistants. We verified the effectiveness of the proposed method for a classification task using Cifar-10, Cifar-100, and Tiny ImageNet. We also achieved significant performance improvements with various backbone architectures such as a simple stacked convolutional neural network, ResNet, and WideResNet.
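A hedged sketch of a distillation objective with stochastic teaching: each guide (the teacher or a teacher assistant) is independently dropped per mini-batch, acting like dropout over teachers. The temperature, mixing weight, and drop probability are illustrative, and the dense guidance of each assistant by all larger guides is not shown:

```python
import random
import torch
import torch.nn.functional as F

def stochastic_kd_loss(student_logits, guide_logits_list, labels,
                       T=4.0, alpha=0.5, p_drop=0.5):
    """Distill from the teacher and its assistants with random guide dropout.

    guide_logits_list: logits from the teacher and every teacher assistant.
    T, alpha, p_drop are illustrative values, not the paper's settings.
    """
    kept = [g for g in guide_logits_list if random.random() > p_drop]
    if not kept:                                    # always keep at least one guide
        kept = [random.choice(guide_logits_list)]
    kd = sum(
        F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(g.detach() / T, dim=1),
            reduction="batchmean",
        ) * (T * T)
        for g in kept
    ) / len(kept)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1 - alpha) * kd
```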
12. $σ^2$R Loss: a Weighted Loss by Multiplicative Factors using Sigmoidal Functions [PDF]
Riccardo La Grassa, Ignazio Gallo, Nicola Landro
Abstract: In neural networks, the loss function represents the core of the learning process, leading the optimizer to an approximation of the optimal convergence error. Convolutional neural networks (CNNs) use the loss function as a supervisory signal to train a deep model, and it contributes significantly to achieving the state of the art in some fields of artificial vision. Cross-entropy and Center loss functions are commonly used to increase the discriminating power of learned features and to improve the generalization performance of the model. Center loss minimizes the intra-class variance and at the same time penalizes long distances between the deep features inside each class. However, the total error of the center loss is heavily influenced by the majority of instances, which can lead to a frozen state in terms of intra-class variance. To address this, we introduce a new loss function called sigma squared reduction loss ($\sigma^2$R loss), which is regulated by a sigmoid function that inflates/deflates the error per instance and then continues to reduce the intra-class variance. Our loss has a clear intuition and geometric interpretation; furthermore, we demonstrate by experiments the effectiveness of our proposal on several benchmark datasets, showing the intra-class variance reduction and surpassing the results obtained with center loss and soft nearest neighbour functions.
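Reading the abstract literally, a sigmoidal multiplicative weighting of a per-instance loss could be sketched as below; the threshold and steepness parameters are assumptions, and the paper's exact factors are not reproduced:

```python
import torch

def sigmoidal_weighted_loss(per_sample_loss, tau=1.0, k=4.0):
    """Weight each instance's loss by a sigmoidal multiplicative factor:
    errors above the threshold tau are inflated, errors below it deflated.
    tau (threshold) and k (steepness) are illustrative hyperparameters.
    """
    w = torch.sigmoid(k * (per_sample_loss - tau))   # in (0, 1), ~0.5 at tau
    return (2.0 * w * per_sample_loss).mean()        # factor in (0, 2)
```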
13. Commands 4 Autonomous Vehicles (C4AV) Workshop Summary [PDF]
Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Yu Liu, Luc Van Gool, Matthew Blaschko, Tinne Tuytelaars, Marie-Francine Moens
Abstract: The task of visual grounding requires locating the most relevant region or object in an image, given a natural language query. So far, progress on this task was mostly measured on curated datasets, which are not always representative of human spoken language. In this work, we deviate from recent, popular task settings and consider the problem under an autonomous vehicle scenario. In particular, we consider a situation where passengers can give free-form natural language commands to a vehicle which can be associated with an object in the street scene. To stimulate research on this topic, we have organized the \emph{Commands for Autonomous Vehicles} (C4AV) challenge based on the recent \emph{Talk2Car} dataset (URL: this https URL). This paper presents the results of the challenge. First, we compare the used benchmark against existing datasets for visual grounding. Second, we identify the aspects that render top-performing models successful, and relate them to existing state-of-the-art models for visual grounding, in addition to detecting potential failure cases by evaluating on carefully selected subsets. Finally, we discuss several possibilities for future work.
14. DeltaGAN: Towards Diverse Few-shot Image Generation with Sample-Specific Delta [PDF]
Yan Hong, Li Niu, Jianfu Zhang, Jing Liang, Liqing Zhang
Abstract: Learning to generate new images for a novel category based on only a few images, named as few-shot image generation, has attracted increasing research interest. Several state-of-the-art works have yielded impressive results, but the diversity is still limited. In this work, we propose a novel Delta Generative Adversarial Network (DeltaGAN), which consists of a reconstruction subnetwork and a generation subnetwork. The reconstruction subnetwork captures intra-category transformation, i.e., "delta", between same-category pairs. The generation subnetwork generates sample-specific "delta" for an input image, which is combined with this input image to generate a new image within the same category. Besides, an adversarial delta matching loss is designed to link the above two subnetworks together. Extensive experiments on five few-shot image datasets demonstrate the effectiveness of our proposed method.
15. Moving object detection for visual odometry in a dynamic environment based on occlusion accumulation [PDF]
Haram Kim, Pyojin Kim, H. Jin Kim
Abstract: Detection of moving objects is an essential capability in dealing with dynamic environments. Most moving object detection algorithms have been designed for color images without depth. For robotic navigation, where real-time RGB-D data is often readily available, utilizing the depth information would be beneficial for obstacle recognition. Here, we propose a simple moving object detection algorithm that uses RGB-D images. The proposed algorithm does not require estimating a background model. Instead, it uses an occlusion model which enables us to estimate the camera pose against a background confused with moving objects that dominate the scene. The proposed algorithm separates moving object detection from visual odometry (VO), so that an arbitrary robust VO method can be employed in a dynamic situation in combination with moving object detection, whereas other VO algorithms for dynamic environments are inseparable from it. In this paper, we use dense visual odometry (DVO) as the VO method, with a bi-square regression weight. Experimental results show the segmentation accuracy and the performance improvement of DVO in such situations. We validate our algorithm on public datasets and on our own dataset, which is also publicly accessible.
16. Contextual Semantic Interpretability [PDF]
Diego Marcos, Ruth Fong, Sylvain Lobry, Remi Flamary, Nicolas Courty, Devis Tuia
Abstract: Convolutional neural networks (CNNs) are known to learn an image representation that captures concepts relevant to the task, but they do so in an implicit way that hampers model interpretability. However, one could argue that such a representation is hidden in the neurons and can be made explicit by teaching the model to recognize semantically interpretable attributes that are present in the scene. We call such an intermediate layer a \emph{semantic bottleneck}. Once the attributes are learned, they can be re-combined to reach the final decision and provide both an accurate prediction and an explicit reasoning behind the CNN decision. In this paper, we look into semantic bottlenecks that capture context: we want attributes to be in groups of a few meaningful elements that participate jointly in the final decision. We use a two-layer semantic bottleneck that gathers attributes into interpretable, sparse groups, allowing them to contribute differently to the final output depending on the context. We test our contextual semantic interpretable bottleneck (CSIB) on the task of landscape scenicness estimation and train the semantic interpretable bottleneck using an auxiliary database (SUN Attributes). Our model yields predictions as accurate as a non-interpretable baseline when applied to a real-world test set of Flickr images, all while providing clear and interpretable explanations for each prediction.
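The two-layer bottleneck could be sketched as follows (PyTorch-style; layer sizes, activations, and the way sparsity over the groups is enforced are illustrative assumptions, and the SUN Attributes supervision is not shown):

```python
import torch
import torch.nn as nn

class SemanticBottleneck(nn.Module):
    """A two-layer semantic bottleneck (sketch): backbone features map to
    interpretable attribute scores, attributes map to a few contextual
    groups, and a linear head reads the final score off the groups.
    The sizes and activations here are illustrative choices.
    """
    def __init__(self, feat_dim, n_attr, n_groups):
        super().__init__()
        self.to_attr = nn.Linear(feat_dim, n_attr)     # attribute scores
        self.to_group = nn.Linear(n_attr, n_groups)    # attributes -> groups
        self.head = nn.Linear(n_groups, 1)             # groups -> scenicness

    def forward(self, feats):
        attrs = torch.sigmoid(self.to_attr(feats))     # interpretable layer
        groups = torch.relu(self.to_group(attrs))      # contextual groups
        return self.head(groups), attrs, groups        # expose both layers
```

Exposing the attribute and group activations alongside the prediction is what makes the explanation explicit: each output can be traced back to a handful of named attributes.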
17. Progressive Semantic-Aware Style Transformation for Blind Face Restoration [PDF]
Chaofeng Chen, Xiaoming Li, Lingbo Yang, Xianhui Lin, Lei Zhang, Kwan-Yee K. Wong
Abstract: Face restoration is important in face image processing and has been widely studied in recent years. However, previous works often fail to generate plausible high quality (HQ) results for real-world low quality (LQ) face images. In this paper, we propose a new progressive semantic-aware style transformation framework, named PSFR-GAN, for face restoration. Specifically, instead of using an encoder-decoder framework as in previous methods, we formulate the restoration of LQ face images as a multi-scale progressive restoration procedure through semantic-aware style transformation. Given an LQ face image and its corresponding parsing map, we first generate a multi-scale pyramid of the inputs, and then progressively modulate features at different scales, from coarse to fine, in a semantic-aware style-transfer manner. Compared with previous networks, the proposed PSFR-GAN makes full use of the semantic (parsing maps) and pixel (LQ images) space information from input pairs at different scales. In addition, we introduce a semantic-aware style loss which calculates the feature style loss for each semantic region individually to improve the details of face textures. Finally, we pretrain a face parsing network which can generate decent parsing maps from real-world LQ face images. Experiment results show that our model trained with synthetic data can not only produce more realistic high-resolution results for synthetic LQ inputs but also generalize better to natural LQ face images compared with state-of-the-art methods. Codes are available at this https URL.
18. Learning Emotional-Blinded Face Representations [PDF] 返回目录
Alejandro Peña, Julian Fierrez, Agata Lapedriza, Aythami Morales
Abstract: We propose two face representations that are blind to the facial expressions associated with emotional responses. This work is in part motivated by new international regulations for personal data protection, which require data controllers to protect any kind of sensitive information involved in automatic processes. Advances in Affective Computing have contributed to improving human-machine interfaces but, at the same time, the capacity to monitor emotional responses triggers potential risks for humans, both in terms of fairness and privacy. We propose two different methods to learn these expression-blinded facial features. We show that it is possible to eliminate information related to emotion recognition tasks while the performance of subject verification, gender recognition, and ethnicity classification is only slightly affected. We also present an application to train fairer classifiers in a case study of attractiveness classification with respect to a protected facial expression attribute. The results demonstrate that it is possible to reduce emotional information in the face representation while retaining competitive performance in other face-based artificial intelligence tasks.
19. Searching for Low-Bit Weights in Quantized Neural Networks [PDF] 返回目录
Zhaohui Yang, Yunhe Wang, Kai Han, Chunjing Xu, Chao Xu, Dacheng Tao, Chang Xu
Abstract: Quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators. However, the quantization functions used in most conventional quantization methods are non-differentiable, which increases the optimization difficulty of quantized networks. Compared with full-precision parameters (i.e., 32-bit floating-point numbers), low-bit values are selected from a much smaller set; for example, there are only 16 possibilities in 4-bit space. Thus, we propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differential method to search them accurately. In particular, each weight is represented as a probability distribution over the discrete value set. The probabilities are optimized during training, and the values with the highest probability are selected to establish the desired quantized network. Experimental results on benchmarks demonstrate that the proposed method produces quantized neural networks with higher performance than state-of-the-art methods on both image classification and super-resolution tasks.
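The core idea, each discrete weight as a learnable distribution over the low-bit value set, can be sketched in a few lines. The layer below is a hedged illustration of that formulation: uniformly spaced candidate values (16 of them in 4-bit space), softmax probabilities, a soft expectation during training, and an argmax at inference. The paper's exact training recipe (temperature schedules, regularizers) may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableQuantLinear(nn.Module):
    """Each weight is a learnable categorical distribution over a fixed
    low-bit value set; training optimizes the probabilities, inference
    takes the most probable value."""

    def __init__(self, in_features, out_features, bits=4):
        super().__init__()
        n_levels = 2 ** bits                       # e.g. 16 values in 4-bit space
        # Uniformly spaced candidate values in [-1, 1]
        self.register_buffer("values", torch.linspace(-1.0, 1.0, n_levels))
        # One logit vector per weight entry
        self.logits = nn.Parameter(torch.zeros(out_features, in_features, n_levels))
        self.temperature = 1.0

    def forward(self, x):
        if self.training:
            probs = F.softmax(self.logits / self.temperature, dim=-1)
            w = (probs * self.values).sum(-1)      # expected (soft) weight
        else:
            idx = self.logits.argmax(-1)           # pick the most probable value
            w = self.values[idx]
        return F.linear(x, w)
```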
20. DeepRemaster: Temporal Source-Reference Attention Networks for Comprehensive Video Enhancement [PDF] 返回目录
Satoshi Iizuka, Edgar Simo-Serra
Abstract: The remastering of vintage film comprises a diversity of sub-tasks, including super-resolution, noise removal, and contrast enhancement, which aim to restore the deteriorated film medium to its original state. Additionally, due to the technical limitations of the time, most vintage film is either recorded in black and white or has low-quality colors, for which colorization becomes necessary. In this work, we propose a single framework to tackle the entire remastering task semi-interactively. Our work is based on temporal convolutional neural networks with attention mechanisms, trained on videos with data-driven deterioration simulation. Our proposed source-reference attention allows the model to handle an arbitrary number of reference color images to colorize long videos without the need for segmentation while maintaining temporal consistency. Quantitative analysis shows that our framework outperforms existing approaches and that, in contrast to existing approaches, the performance of our framework increases with longer videos and more reference color images.
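A minimal sketch of the source-reference attention pattern, with hypothetical shapes: flattened video features form the queries, and the concatenated features of all reference color images form the keys and values, which is what lets the model handle an arbitrary number of references.

```python
import torch
import torch.nn as nn

class SourceReferenceAttention(nn.Module):
    """Video (source) features attend over the features of an arbitrary
    number of reference color images; a sketch of the attention pattern
    described in the abstract, not the paper's exact module."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, src, ref):
        # src: (B, N_src, C) flattened spatio-temporal video features
        # ref: (B, N_ref, C) flattened features of all reference images
        attn = torch.softmax(self.q(src) @ self.k(ref).transpose(1, 2) * self.scale, dim=-1)
        return src + attn @ self.v(ref)            # residual color guidance
```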
21. Conditional Image Generation with One-Vs-All Classifier [PDF] 返回目录
Xiangrui Xu, Yaqin Li, Cao Yuan
Abstract: This paper explores conditional image generation with a One-Vs-All classifier based on Generative Adversarial Networks (GANs). Instead of the real/fake discriminator used in vanilla GANs, we propose to extend the discriminator to a One-Vs-All classifier (GAN-OVA) that can assign each input to its category label. Specifically, we feed certain additional information as conditions to the generator and take the discriminator as a One-Vs-All classifier to identify each conditional category. Our model can be applied with different divergences or distances used to define the objective function, such as the Jensen-Shannon divergence and the Earth-Mover (also called Wasserstein-1) distance. We evaluate GAN-OVAs on the MNIST and CelebA-HQ datasets, and the experimental results show that GAN-OVAs make progress toward more stable training than regular conditional GANs. Furthermore, GAN-OVAs effectively accelerate the generation process of different classes and improve generation quality.
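The one-vs-all discriminator can be read as a C-way classifier with independent per-class binary targets: a real image of class c should activate only output c, a generated image should activate none, and the generator tries to activate the output of its conditioned class. The losses below are an illustrative PyTorch rendering of that reading, not necessarily the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def ova_discriminator_loss(d_real_logits, labels, d_fake_logits, num_classes):
    """One-vs-all discriminator objective: d_*_logits are (B, num_classes)
    per-class scores; real samples target their own class, fakes target none."""
    real_targets = F.one_hot(labels, num_classes).float()
    fake_targets = torch.zeros_like(d_fake_logits)
    return (F.binary_cross_entropy_with_logits(d_real_logits, real_targets)
            + F.binary_cross_entropy_with_logits(d_fake_logits, fake_targets))

def ova_generator_loss(d_fake_logits, target_labels, num_classes):
    # The generator tries to make the discriminator fire on the conditioned class
    targets = F.one_hot(target_labels, num_classes).float()
    return F.binary_cross_entropy_with_logits(d_fake_logits, targets)
```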
22. Face Sketch Synthesis with Style Transfer using Pyramid Column Feature [PDF] 返回目录
Chaofeng Chen, Xiao Tan, Kwan-Yee K. Wong
Abstract: In this paper, we propose a novel framework based on deep neural networks for face sketch synthesis from a photo. Imitating the process of how artists draw sketches, our framework synthesizes face sketches in a cascaded manner. A content image is first generated that outlines the shape of the face and the key facial features. Textures and shadings are then added to enrich the details of the sketch. We utilize a fully convolutional neural network (FCNN) to create the content image, and propose a style transfer approach to introduce textures and shadings based on a newly proposed pyramid column feature. We demonstrate that our style transfer approach based on the pyramid column feature not only preserves more sketch details than the common style transfer method, but also surpasses traditional patch-based methods. Quantitative and qualitative evaluations suggest that our framework outperforms other state-of-the-art methods and also generalizes well to different test images. Codes are available at this https URL
23. TopNet: Topology Preserving Metric Learning for Vessel Tree Reconstruction and Labelling [PDF] 返回目录
Deepak Keshwani, Yoshiro Kitamura, Satoshi Ihara, Satoshi Iizuka, Edgar Simo-Serra
Abstract: Reconstructing portal vein and hepatic vein trees from contrast-enhanced abdominal CT scans is a prerequisite for preoperative liver surgery simulation. Existing deep learning based methods treat vascular tree reconstruction as a semantic segmentation problem. However, vessels such as the hepatic and portal veins look very similar locally and need to be traced to their source for robust label assignment. Therefore, semantic segmentation based on local 3D patches results in noisy misclassifications. To tackle this, we propose a novel multi-task deep learning architecture for vessel tree reconstruction. The network architecture simultaneously solves the task of detecting voxels on vascular centerlines (i.e., nodes) and estimates connectivity between center-voxels (edges) in the tree structure to be reconstructed. Further, we propose a novel connectivity metric which considers both inter-class distance and intra-class topological distance between center-voxel pairs. Vascular trees are reconstructed starting from the vessel source using the learned connectivity metric and the shortest path tree algorithm. A thorough evaluation on the public IRCAD dataset shows that the proposed method considerably outperforms existing semantic segmentation based methods. To the best of our knowledge, this is the first deep learning based approach which learns multi-label tree structure connectivity from images.
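Once centerline voxels (nodes) and a learned connectivity cost for candidate node pairs (edges) are available, the final step reduces to a standard shortest-path-tree computation from the vessel source. A minimal SciPy sketch with hypothetical array layouts:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def reconstruct_tree(nodes, edges, costs, root):
    """Rebuild a vessel tree from detected centerline nodes.

    nodes: (N, 3) voxel coordinates of detected centerline points.
    edges: (E, 2) candidate node pairs; costs: (E,) learned connectivity
    metric for each pair (lower = more likely connected). root: index of
    the vessel source node. Returns each node's predecessor in the
    shortest-path tree, which defines the tree topology.
    """
    n = len(nodes)
    graph = csr_matrix((costs, (edges[:, 0], edges[:, 1])), shape=(n, n))
    # Shortest-path tree from the source: predecessors encode the branches
    _, predecessors = dijkstra(graph, directed=False, indices=root,
                               return_predecessors=True)
    return predecessors
```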
24. Performance Monitoring of Object Detection During Deployment [PDF] 返回目录
Quazi Marufur Rahman, Niko Sünderhauf, Feras Dayoub
Abstract: Performance monitoring of object detection is crucial for safety-critical applications such as autonomous vehicles that operate under varying and complex environmental conditions. Currently, object detectors are evaluated using summary metrics based on a single dataset that is assumed to be representative of all future deployment conditions. In practice, this assumption does not hold, and the performance fluctuates as a function of the deployment conditions. To address this issue, we propose an introspection approach to performance monitoring during deployment, without the need for ground truth data. We do so by predicting when the per-frame mean average precision drops below a critical threshold using the detector's internal features. We quantitatively evaluate and demonstrate our method's ability to reduce risk by raising an alarm and abstaining from detection rather than making an incorrect decision.
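The introspection idea amounts to a small classifier on top of the detector's internal features, trained offline on frames where ground truth, and hence per-frame mAP, is available. A hedged sketch with hypothetical layer sizes:

```python
import torch
import torch.nn as nn

class IntrospectionHead(nn.Module):
    """Predicts from a detector's pooled internal features whether the
    per-frame mAP falls below a critical threshold; labels come from data
    where ground truth is still available, so no labels are needed at
    deployment time."""

    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))                     # logit of "performance drop"

    def forward(self, feats):
        # feats: (B, C, H, W) backbone features -> global average pool
        pooled = feats.mean(dim=(2, 3))
        return self.net(pooled).squeeze(-1)

# Deployment rule: raise an alarm (abstain from acting on detections)
# whenever the predicted drop probability exceeds a chosen operating point.
```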
25. Identification of Abnormal States in Videos of Ants Undergoing Social Phase Change [PDF] 返回目录
Taeyeong Choi, Benjamin Pyenson, Juergen Liebig, Theodore P. Pavlic
Abstract: Biology is both an important application area and a source of motivation for the development of advanced machine learning techniques. Although much attention has been paid to large and complex data sets resulting from high-throughput sequencing, advances in high-quality video recording technology have begun to generate similarly rich data sets requiring sophisticated techniques from both computer vision and time-series analysis. Moreover, just as studying gene expression patterns in one organism can reveal general principles that apply to other organisms, the study of complex social interactions in an experimentally tractable model system, such as a laboratory ant colony, can provide general principles about the dynamics of many other social groups. Here, we focus on one such example from the study of reproductive regulation in small laboratory colonies of $\sim$50 Harpegnathos ants. These ants can be artificially induced to begin a $\sim$20-day process of hierarchy reformation. Although the conclusion of this process is conspicuous to a human observer, it is still unclear which behaviors during the transients are contributing to the process. To address this issue, we explore the potential application of One-class Classification (OC) to the detection of abnormal states in ant colonies for which behavioral data is only available for the normal societal conditions during training. Specifically, we build upon the Deep Support Vector Data Description (DSVDD) and introduce the Inner-Outlier Generator (IO-GEN), which synthesizes fake "inner outlier" observations during training that are near the center of the DSVDD data description. We show that IO-GEN increases the reliability of the final OC classifier relative to other DSVDD baselines. This method can be used to screen video frames for which additional human observation is needed.
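A rough sketch of the two ingredients: the Deep SVDD objective pulls embeddings of normal observations toward a fixed center, and IO-GEN supplies fake "inner outlier" points near that center for the final one-class classifier to reject. The radius parameterization below is a hypothetical simplification, not the paper's generator design.

```python
import torch
import torch.nn as nn

def dsvdd_loss(encoder, x_normal, center):
    """Deep SVDD: pull embeddings of normal observations toward a fixed
    center c in latent space."""
    z = encoder(x_normal)
    return ((z - center) ** 2).sum(dim=1).mean()

def io_gen_fakes(generator, noise, center, radius=0.1):
    """Synthesize fake 'inner outlier' embeddings lying near the DSVDD
    center, so the final one-class classifier must separate normal data
    from points deep inside the data description (radius is a
    hypothetical hyperparameter)."""
    z = generator(noise)
    direction = z / (z.norm(dim=1, keepdim=True) + 1e-8)
    return center + radius * direction
```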
26. 6-DoF Grasp Planning using Fast 3D Reconstruction and Grasp Quality CNN [PDF] 返回目录
Yahav Avigal, Samuel Paradis, Harry Zhang
Abstract: Recent consumer demand for home robots has accelerated progress in robotic grasping. However, a key component of the perception pipeline, the depth camera, is still expensive and inaccessible to most consumers. In addition, grasp planning has significantly improved recently by leveraging large datasets and cloud robotics, and by limiting the state and action space to top-down grasps with 4 degrees of freedom (DoF). By leveraging multi-view geometry of the object using inexpensive equipment, such as off-the-shelf RGB cameras, and state-of-the-art algorithms such as Learn Stereo Machine (LSM\cite{kar2017learning}), the robot is able to generate more robust 6-DoF grasps from different angles. In this paper, we present a modification of LSM for graspable objects, evaluate the resulting grasps, and develop a 6-DoF grasp planner based on Grasp-Quality CNN (GQ-CNN\cite{mahler2017dex}) that exploits multiple camera views to plan a robust grasp, even in the absence of a possible top-down grasp.
27. Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [PDF] 返回目录
Jie Wu, Guanbin Li, Xiaoguang Han, Liang Lin
Abstract: Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task facilitating cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which merely has access to coarse video-level language descriptions without temporal boundaries; this is more consistent with reality, as such weak labels are more readily available in practice. In this paper, we propose a \emph{Boundary Adaptive Refinement} (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. To the best of our knowledge, this is the first attempt to extend RL to the temporal localization task under weak supervision. As it is non-trivial to obtain a straightforward reward function in the absence of pairwise granular boundary-query annotations, a cross-modal alignment evaluator is crafted to measure the alignment degree of each segment-query pair and provide tailor-designed rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and contributes to acquiring more efficient, boundary-flexible, and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly-supervised method and even beats some competitive fully-supervised ones.
28. Consistency Regularization with High-dimensional Non-adversarial Source-guided Perturbation for Unsupervised Domain Adaptation in Segmentation [PDF] 返回目录
Kaihong Wang, Chenhongyi Yang, Margrit Betke
Abstract: Unsupervised domain adaptation for semantic segmentation has been intensively studied due to the low cost of the pixel-level annotation for synthetic data. The most common approaches try to generate images or features mimicking the distribution in the target domain while preserving the semantic contents in the source domain so that a model can be trained with annotations from the latter. However, such methods highly rely on an image translator or feature extractor trained in an elaborated mechanism including adversarial training, which brings in extra complexity and instability in the adaptation process. Furthermore, these methods mainly focus on taking advantage of the labeled source dataset, leaving the unlabeled target dataset not fully utilized. In this paper, we propose a bidirectional style-induced domain adaptation method, called BiSIDA, that employs consistency regularization to efficiently exploit information from the unlabeled target domain dataset, requiring only a simple neural style transfer model. BiSIDA aligns domains by not only transferring source images into the style of target images but also transferring target images into the style of source images to perform high-dimensional perturbation on the unlabeled target images, which is crucial to the success in applying consistency regularization in segmentation tasks. Extensive experiments show that our BiSIDA achieves new state-of-the-art on two commonly-used synthetic-to-real domain adaptation benchmarks: GTA5-to-CityScapes and SYNTHIA-to-CityScapes.
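Stripped of the bidirectional details, the consistency term can be sketched as follows: predictions on an unlabeled target image serve as pseudo-labels for its source-stylized, high-dimensional, non-adversarial perturbation. A minimal PyTorch sketch, with style_transfer standing in for the simple pretrained neural style transfer model the abstract mentions:

```python
import torch
import torch.nn.functional as F

def consistency_loss(seg_model, style_transfer, x_target, x_source):
    """Source-guided consistency regularization (a sketch): an unlabeled
    target image and its version re-rendered in the style of a source
    image should receive the same segmentation."""
    with torch.no_grad():
        pseudo = seg_model(x_target).argmax(dim=1)       # (B, H, W) pseudo-labels
    x_perturbed = style_transfer(x_target, x_source)     # high-dim. perturbation
    logits = seg_model(x_perturbed)                      # (B, K, H, W)
    return F.cross_entropy(logits, pseudo)
```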
29. Accelerating Search on Binary Codes in Weighted Hamming Space [PDF] 返回目录
Zhenyu Weng, Yuesheng Zhu, Ruixin Liu
Abstract: Compared to the Hamming distance, the weighted Hamming distance, as a similarity measure between binary codes and a binary query point, can provide superior accuracy in search tasks. However, how to efficiently find the $K$ binary codes in a dataset that have the smallest weighted Hamming distances to the query is still an open issue. In this paper, a non-exhaustive search framework is proposed to accelerate the search speed and guarantee the search accuracy on binary codes in weighted Hamming space. By separating the binary codes into multiple disjoint substrings used as bucket indices, the search framework iteratively probes the buckets until the query's nearest neighbors are found. The framework consists of two modules: the search module and the decision module. The search module successively probes the buckets and collects the candidates according to a proper probing sequence generated by the proposed search algorithm. The decision module decides, according to a designed decision criterion, whether the query's nearest neighbors have been found or more buckets should be probed. The analysis and experiments indicate that the search framework solves the nearest neighbor search problem in weighted Hamming space and is orders of magnitude faster than the linear scan baseline.
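The bucketing scheme follows the familiar multi-index pattern: each binary code is split into disjoint substrings, and each substring value addresses a bucket, so a query only touches buckets that share a substring value with it. The sketch below shows the index build and an exact-match probe for integer-packed codes; the paper's contribution, the weighted-distance probing order and the stopping criterion, is omitted here.

```python
from collections import defaultdict

def build_index(codes, num_substrings, bits):
    """Bucket binary codes (Python ints of `bits` bits) by each of their
    disjoint substrings, so a query only probes buckets sharing a
    substring value with it."""
    sub_bits = bits // num_substrings
    mask = (1 << sub_bits) - 1
    tables = [defaultdict(list) for _ in range(num_substrings)]
    for i, c in enumerate(codes):
        for s in range(num_substrings):
            tables[s][(c >> (s * sub_bits)) & mask].append(i)
    return tables, sub_bits, mask

def probe(tables, query, sub_bits, mask):
    # Exact-match probe of each substring table; a full search would
    # enumerate substring values in increasing weighted-distance order.
    candidates = set()
    for s, table in enumerate(tables):
        candidates.update(table[(query >> (s * sub_bits)) & mask])
    return candidates
```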
30. MUTANT: A Training Paradigm for Out-of-Distribution Generalization in Visual Question Answering [PDF] 返回目录
Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
Abstract: While progress has been made on the visual question answering leaderboards, models often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such, evaluation on out-of-distribution (OOD) test samples has emerged as a proxy for generalization. In this paper, we present \textit{MUTANT}, a training paradigm that exposes the model to perceptually similar, yet semantically distinct \textit{mutations} of the input, to improve OOD generalization, such as the VQA-CP challenge. Under this paradigm, models utilize a consistency-constrained training objective to understand the effect of semantic changes in input (question-image pair) on the output (answer). Unlike existing methods on VQA-CP, \textit{MUTANT} does not rely on the knowledge about the nature of train and test answer distributions. \textit{MUTANT} establishes a new state-of-the-art accuracy on VQA-CP with a $10.57\%$ improvement. Our work opens up avenues for the use of semantic input mutations for OOD generalization in question answering.
31. Objective, Probabilistic, and Generalized Noise Level Dependent Classifications of sets of more or less 2D Periodic Images into Plane Symmetry Groups [PDF] 返回目录
Andrew Dempsey, Peter Moeck
Abstract: Crystallographic symmetry classifications from real-world images with periodicities in two dimensions (2D) are of interest to crystallographers and practitioners of computer vision studies alike. Currently, these classifications are typically made by both communities in a subjective manner that relies on arbitrary thresholds for judgments, and are reported under the pretense of being definitive, which is impossible. Moreover, the computer vision community tends to use direct space methods to make such classifications instead of more powerful and computationally efficient Fourier space methods. This is because the proper functioning of those methods requires more periodic repeats of a unit cell motif than are commonly present in images analyzed by the computer vision community. We demonstrate a novel approach to plane symmetry group classifications that is enabled by Kenichi Kanatani's Geometric Akaike Information Criterion and associated Geometric Akaike weights. Our approach leverages the advantages of working in Fourier space, is well suited for handling the hierarchic nature of crystallographic symmetries, and yields probabilistic results that are generalized noise level dependent. The latter feature means crystallographic symmetry classifications can be updated when less noisy image data and more accurate processing algorithms become available. We demonstrate the ability of our approach to objectively estimate the plane symmetry and pseudosymmetries of sets of synthetic 2D-periodic images with varying amounts of red-green-blue and spread noise. Additionally, we suggest a simple solution to the problem of too few periodic repeats in an input image for practical application of Fourier space methods. In doing so, we effectively solve the decades-old and heretofore intractable problem from computer vision of symmetry detection and classification from images in the presence of noise.
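The geometric Akaike weights are what turn the classification from a hard decision into a noise-level-dependent probability. Given the geometric AIC score of each candidate plane symmetry group, the standard Akaike-weight computation is:

```python
import numpy as np

def akaike_weights(aic_scores):
    """Convert (geometric) AIC scores of candidate symmetry-group models
    into probabilistic weights: w_i = exp(-Delta_i / 2) / sum_j exp(-Delta_j / 2),
    where Delta_i = AIC_i - min_j AIC_j. The group with the largest weight
    is the most plausible classification, and the weights quantify how
    confident that classification is."""
    delta = np.asarray(aic_scores) - np.min(aic_scores)
    w = np.exp(-0.5 * delta)
    return w / w.sum()
```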
32. Smartphone Camera De-identification while Preserving Biometric Utility [PDF] 返回目录
Sudipta Banerjee, Arun Ross
Abstract: The principle of Photo Response Non Uniformity (PRNU) is often exploited to deduce the identity of the smartphone device whose camera or sensor was used to acquire a certain image. In this work, we design an algorithm that perturbs a face image acquired using a smartphone camera such that (a) sensor-specific details pertaining to the smartphone camera are suppressed (sensor anonymization); (b) the sensor pattern of a different device is incorporated (sensor spoofing); and (c) biometric matching using the perturbed image is not affected (biometric utility). We employ a simple approach utilizing Discrete Cosine Transform to achieve the aforementioned objectives. Experiments conducted on the MICHE-I and OULU-NPU datasets, which contain periocular and facial data acquired using 12 smartphone cameras, demonstrate the efficacy of the proposed de-identification algorithm on three different PRNU-based sensor identification schemes. This work has application in sensor forensics and personal privacy.
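A speculative sketch of what such a DCT-based pipeline could look like: attenuate the high-frequency DCT band, where the multiplicative PRNU fingerprint mostly resides, for anonymization, then inject a scaled residual extracted from a different camera for spoofing. The masking scheme and all parameters here are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np
from scipy.fft import dctn, idctn

def deidentify(image, spoof_pattern=None, alpha=0.9, beta=0.02):
    """Sketch of DCT-based sensor de-identification for a 2D grayscale
    image. alpha and beta are hypothetical strength parameters."""
    coeffs = dctn(image.astype(np.float64), norm="ortho")
    h, w = coeffs.shape
    mask = np.ones_like(coeffs)
    mask[h // 4:, w // 4:] = alpha                 # damp the high-frequency band
    out = idctn(coeffs * mask, norm="ortho")       # sensor anonymization
    if spoof_pattern is not None:
        out = out + beta * out * spoof_pattern     # multiplicative PRNU model
    return np.clip(out, 0, 255)
```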
33. Predicting molecular phenotypes from histopathology images: a transcriptome-wide expression-morphology analysis in breast cancer [PDF] 返回目录
Yinxi Wang, Kimmo Kartasalo, Masi Valkonen, Christer Larsson, Pekka Ruusuvuori, Johan Hartman, Mattias Rantalainen
Abstract: Molecular phenotyping is central to cancer precision medicine, but remains costly, and standard methods only provide a tumour-average profile. Microscopic morphological patterns observable in histopathology sections from tumours are determined by the underlying molecular phenotype and associated with clinical factors. The relationship between morphology and molecular phenotype can potentially be exploited to predict the molecular phenotype from the morphology visible in histopathology images. We report the first transcriptome-wide Expression-MOrphology (EMO) analysis in breast cancer, where gene-specific models were optimised and validated for prediction of mRNA expression both as a tumour average and in a spatially resolved manner. Individual deep convolutional neural networks (CNNs) were optimised to predict the expression of 17,695 genes from hematoxylin and eosin (HE) stained whole slide images (WSIs). Predictions for 9,334 (52.75%) genes were significantly associated with RNA-sequencing estimates (FDR adjusted p-value < 0.05). 1,011 of the genes were brought forward for validation, with 876 (87%) and 908 (90%) successfully replicated in internal and external test data, respectively. Predicted spatial intra-tumour variabilities in expression were validated in 76 genes, out of which 59 (77.6%) had a significant association (FDR adjusted p-value < 0.05) with spatial transcriptomics estimates. These results suggest that the proposed methodology can be applied to predict both tumour-average gene expression and intra-tumour spatial expression directly from morphology, thus providing a scalable approach to characterise intra-tumour heterogeneity.
34. AdderSR: Towards Energy Efficient Image Super-Resolution [PDF] 返回目录
Dehua Song, Yunhe Wang, Hanting Chen, Chang Xu, Chunjing Xu, DaCheng Tao
Abstract: This paper studies the single image super-resolution problem using adder neural networks (AdderNet). Compared with convolutional neural networks, AdderNet uses additions to calculate the output features, thus avoiding the massive energy consumption of conventional multiplications. However, it is very hard to directly transfer the existing success of AdderNet on large-scale image classification to the image super-resolution task, owing to the different calculation paradigm. Specifically, the adder operation cannot easily learn the identity mapping, which is essential for image processing tasks. In addition, the functionality of high-pass filters cannot be ensured by AdderNet. To this end, we thoroughly analyze the relationship between the adder operation and the identity mapping, and insert shortcuts to enhance the performance of SR models using adder networks. Then, we develop a learnable power activation for adjusting the feature distribution and refining details. Experiments conducted on several benchmark models and datasets demonstrate that our image super-resolution models using AdderNet achieve performance and visual quality comparable to their CNN baselines, with an approximately 2$\times$ reduction in energy consumption.
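To make the calculation paradigm concrete, here is a minimal sketch of an adder layer and a learnable power activation, written from the abstract alone; the layer names, initialisation, and exponent clamp are our assumptions. The adder layer scores each filter by negative L1 distance to the input patch instead of a multiply-accumulate, which is why a shortcut is needed before such a layer can approximate an identity mapping.

```python
# Illustrative sketch (not the authors' code): an adder layer replaces the
# convolution's multiply-accumulate with negative L1 distance, and a
# learnable power activation rescales feature magnitudes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdderConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch * k * k) * 0.1)
        self.k, self.padding = k, padding

    def forward(self, x):
        b, _, h, w = x.shape
        patches = F.unfold(x, self.k, padding=self.padding)   # (B, C*k*k, H*W)
        # output = -sum |w - x| over the patch, per filter and location
        diff = self.weight[None, :, :, None] - patches[:, None, :, :]
        out = -(diff.abs().sum(dim=2))                        # (B, out_ch, H*W)
        return out.view(b, -1, h, w)

class LearnablePowerActivation(nn.Module):
    """Sign-preserving |x|^alpha with a trainable exponent (our naming)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        return torch.sign(x) * torch.abs(x).pow(self.alpha.clamp(min=0.1))

layer = nn.Sequential(AdderConv2d(3, 8), LearnablePowerActivation())
print(layer(torch.randn(1, 3, 16, 16)).shape)  # torch.Size([1, 8, 16, 16])
```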
35. Multi-modal Experts Network for Autonomous Driving [PDF] 返回目录
Shihong Fang, Anna Choromanska
Abstract: End-to-end learning from sensory data has shown promising results in autonomous driving. While employing many sensors enhances world perception and should lead to more robust and reliable behavior of autonomous vehicles, it is challenging to train and deploy such a network, and at least two problems are encountered in the considered setting. The first is the increase of computational complexity with the number of sensing devices. The other is the phenomenon of the network overfitting to the simplest and most informative input. We address both challenges with a novel, carefully tailored multi-modal experts network architecture and propose a multi-stage training procedure. The network contains a gating mechanism, which selects the most relevant input at each inference time step using a mixed discrete-continuous policy. We demonstrate the plausibility of the proposed approach on our 1/6 scale truck equipped with three cameras and one LiDAR.
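A gating mechanism of this kind can be prototyped in a few lines. The sketch below is our own simplification, not the paper's architecture: one expert encoder per sensor and a gate that mixes experts softly during training but selects a single expert (one-hot) at inference, loosely mirroring a mixed discrete-continuous policy.

```python
# Minimal mixture-of-experts sketch under our own assumptions: per-sensor
# expert encoders plus a gating head, soft in training and hard at inference.
import torch
import torch.nn as nn

class GatedExperts(nn.Module):
    def __init__(self, n_sensors=4, feat_dim=64, out_dim=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(128, feat_dim) for _ in range(n_sensors))
        self.gate = nn.Linear(n_sensors * 128, n_sensors)
        self.head = nn.Linear(feat_dim, out_dim)   # e.g. steering, throttle

    def forward(self, sensor_inputs, hard=False):
        feats = torch.stack([e(x) for e, x in zip(self.experts, sensor_inputs)], dim=1)
        w = torch.softmax(self.gate(torch.cat(sensor_inputs, dim=-1)), dim=-1)
        if hard:  # discrete selection of the most relevant input at inference
            w = torch.nn.functional.one_hot(w.argmax(-1), w.shape[-1]).float()
        fused = (w.unsqueeze(-1) * feats).sum(dim=1)
        return self.head(fused)

model = GatedExperts()
x = [torch.randn(8, 128) for _ in range(4)]
print(model(x).shape, model(x, hard=True).shape)  # torch.Size([8, 2]) twice
```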
36. Classification and Region Analysis of COVID-19 Infection using Lung CT Images and Deep Convolutional Neural Networks [PDF] 返回目录
Saddam Hussain Khan, Anabia Sohail, Asifullah Khan, Yeon Soo Lee
Abstract: COVID-19 is a global health problem. Consequently, early detection and analysis of infection patterns are crucial for controlling infection spread as well as devising a treatment plan. This work proposes a two-stage deep Convolutional Neural Network (CNN) based framework for delineation of COVID-19 infected regions in lung CT images. In the first stage, COVID-19-specific CT image features are enhanced using a two-level discrete wavelet transformation. These enhanced CT images are then classified using the proposed custom-made deep CoV-CTNet. In the second stage, the CT images classified as infectious are provided to segmentation models for the identification and analysis of COVID-19 infectious regions. In this regard, we propose a novel semantic segmentation model, CoV-RASeg, which systematically uses average and max pooling operations in the encoder and decoder blocks. This systematic use of max and average pooling helps the proposed CoV-RASeg simultaneously learn both boundaries and region homogeneity. Moreover, an attention mechanism is incorporated to deal with mildly infected regions. The proposed two-stage framework is evaluated on a standard lung CT image dataset, and its performance is compared with existing deep CNN models. The performance of the proposed CoV-CTNet is evaluated using the Matthews Correlation Coefficient (MCC) measure (0.98), and that of the proposed CoV-RASeg using the Dice Similarity (DS) score (0.95). The promising results on an unseen test set suggest that the proposed framework has the potential to help radiologists in the identification and analysis of COVID-19 infected regions.
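The systematic pairing of max and average pooling is the distinctive architectural choice. Below is our reading of one encoder stage, not the released model: both pooled maps are kept and fused, so edge-sensitive (max) and homogeneity-sensitive (average) signals propagate together.

```python
# Sketch of the paired-pooling idea (our interpretation): each encoder stage
# downsamples with both max and average pooling and fuses the two maps.
import torch
import torch.nn as nn

class DualPoolDown(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU())
        self.max_pool = nn.MaxPool2d(2)
        self.avg_pool = nn.AvgPool2d(2)
        self.merge = nn.Conv2d(2 * out_ch, out_ch, 1)  # fuse the two pooled maps

    def forward(self, x):
        x = self.conv(x)
        return self.merge(torch.cat([self.max_pool(x), self.avg_pool(x)], dim=1))

block = DualPoolDown(1, 32)
print(block(torch.randn(2, 1, 128, 128)).shape)  # torch.Size([2, 32, 64, 64])
```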
37. Search and Rescue with Airborne Optical Sectioning [PDF] 返回目录
David C. Schedl, Indrajit Kurmi, Oliver Bimber
Abstract: We show that automated person detection under occlusion conditions can be significantly improved by combining multi-perspective images before classification. Here, we employed image integration by Airborne Optical Sectioning (AOS)---a synthetic aperture imaging technique that uses camera drones to capture unstructured thermal light fields---to achieve this with a precision/recall of 96/93%. Finding lost or injured people in dense forests is not generally feasible with thermal recordings, but becomes practical with the use of AOS integral images. Our findings lay the foundation for effective future search and rescue technologies that can be applied in combination with autonomous or manned aircraft. They can also be beneficial for other fields that currently suffer from inaccurate classification of partially occluded people, animals, or objects.
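The core operation is shift-and-average synthetic aperture integration. The toy below assumes, as a simplification, that focusing on the forest floor reduces to translating each drone image by a known per-camera disparity and averaging the stack; occlusion is simulated as a random foliage mask.

```python
# Toy shift-and-average integral image under our simplifying assumptions.
import numpy as np

def integrate(images, offsets_px):
    """images: list of HxW arrays; offsets_px: per-image (dy, dx) disparities."""
    acc = np.zeros_like(images[0], dtype=float)
    for img, (dy, dx) in zip(images, offsets_px):
        acc += np.roll(img, shift=(dy, dx), axis=(0, 1))
    return acc / len(images)

rng = np.random.default_rng(1)
scene = np.zeros((64, 64)); scene[30:34, 30:34] = 1.0     # warm body on the ground
stack, offsets = [], []
for i in range(9):                                         # 9 viewpoints
    occlusion = rng.random(scene.shape) < 0.6              # random foliage mask
    dy, dx = rng.integers(-3, 4, size=2)
    stack.append(np.roll(np.where(occlusion, 0.0, scene), (dy, dx), (0, 1)))
    offsets.append((-dy, -dx))                             # undo the shift when integrating
print(integrate(stack, offsets)[30:34, 30:34].mean())      # the body survives ~40% visibility
```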
38. Fused Deep Convolutional Neural Network for Precision Diagnosis of COVID-19 Using Chest X-Ray Images [PDF] 返回目录
Hussin K. Ragb, Ian T. Dover, Redha Ali
Abstract: With the Coronavirus disease (COVID-19) case count exceeding 10 million worldwide, there is an increased need for diagnostic capability. The main variables in increasing diagnostic capability are reduced cost, turnaround or diagnosis time, and upfront equipment cost and accessibility. Two candidates for machine-learning-based COVID-19 diagnosis are Computed Tomography (CT) scans and plain chest X-rays. While CT scans score higher in sensitivity, they have a higher cost, maintenance requirement, and turnaround time compared to plain chest X-rays. The use of portable chest X-radiographs (CXR) is recommended by the American College of Radiology (ACR), since using CT places a massive burden on radiology services. Therefore, X-ray imagery paired with machine learning techniques is proposed as a first-line triage tool for COVID-19 diagnostics. In this paper we propose a computer-aided diagnosis (CAD) system to accurately classify chest X-ray scans of COVID-19 and normal subjects by fine-tuning several neural networks (ResNet18, ResNet50, DenseNet201) pre-trained on the ImageNet dataset. These neural networks are fused in a parallel architecture, and voting criteria are applied to the final classification decision between the candidate object classes, where the output of each neural network represents a single vote. Several experiments are conducted on the weakly labeled COVID-19-CT-CXR dataset, consisting of 263 COVID-19 CXR images extracted from PubMed Central Open Access subsets combined with 25 normal-classification CXR images. These experiments show an optimistic result and the capability of the proposed model to outperform many state-of-the-art algorithms on several measures. Using k-fold cross-validation and a bagging classifier ensemble, we achieve an accuracy of 99.7% and a sensitivity of 100%.
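The parallel fusion with one vote per network can be wired up as below. This is a sketch under our own assumptions (untrained weights, binary COVID/normal heads); the paper fine-tunes ImageNet-pretrained ResNet18, ResNet50, and DenseNet201. With three binary voters a strict majority always exists, so no tie-break is needed.

```python
# Sketch of the parallel ensemble: three backbones, majority vote per image.
import torch
import torch.nn as nn
from torchvision import models

resnet18 = models.resnet18(weights=None); resnet18.fc = nn.Linear(512, 2)
resnet50 = models.resnet50(weights=None); resnet50.fc = nn.Linear(2048, 2)
densenet = models.densenet201(weights=None)
densenet.classifier = nn.Linear(densenet.classifier.in_features, 2)
for m in (resnet18, resnet50, densenet):
    m.eval()

def vote(nets, x):
    votes = torch.stack([m(x).argmax(dim=1) for m in nets])  # (3, B), one vote each
    return votes.mode(dim=0).values                           # majority class per image

x = torch.randn(4, 3, 224, 224)
print(vote([resnet18, resnet50, densenet], x))
```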
39. Residual Spatial Attention Network for Retinal Vessel Segmentation [PDF] 返回目录
Changlu Guo, Márton Szemenyei, Yugen Yi, Wei Zhou, Haodong Bian
Abstract: Reliable segmentation of retinal vessels can be employed as a way of monitoring and diagnosing certain diseases, such as diabetes and hypertension, as they affect the retinal vascular structure. In this work, we propose the Residual Spatial Attention Network (RSAN) for retinal vessel segmentation. RSAN employs a modified residual block structure that integrates DropBlock, which can not only be utilized to construct deep networks that extract more complex vascular features, but can also effectively alleviate overfitting. Moreover, in order to further improve the representation capability of the network, we introduce spatial attention (SA) into this modified residual block and propose the Residual Spatial Attention Block (RSAB), from which RSAN is built. We adopt the public DRIVE and CHASE DB1 color fundus image datasets to evaluate the proposed RSAN. Experiments show that the modified residual structure and the spatial attention are both effective, and our proposed RSAN achieves state-of-the-art performance.
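One plausible reading of the block, sketched below with our own choices: a two-convolution residual unit whose output is reweighted by a CBAM-style spatial attention map before the skip connection. Dropout2d stands in for DropBlock here to keep the example short.

```python
# Sketch of a Residual Spatial Attention Block (our interpretation, not the
# released code); Dropout2d approximates DropBlock for brevity.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):
        # channel-wise mean and max summarise "where" the signal is
        pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class RSAB(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.Dropout2d(0.1),
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.Dropout2d(0.1), nn.BatchNorm2d(ch))
        self.sa = SpatialAttention()
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.sa(self.body(x)))   # residual + spatial attention

print(RSAB(32)(torch.randn(2, 32, 48, 48)).shape)   # torch.Size([2, 32, 48, 48])
```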
40. An Analysis by Synthesis Method that Allows Accurate Spatial Modeling of Thickness of Cortical Bone from Clinical QCT [PDF] 返回目录
Stefan Reinhold, Timo Damm, Sebastian Büsse, Stanislav N. Gorb, Claus-C. Glüer, Reinhard Koch
Abstract: Osteoporosis is a skeletal disorder that leads to increased fracture risk due to decreased strength of cortical and trabecular bone. Even with state-of-the-art non-invasive assessment methods there is still a high underdiagnosis rate. Quantitative computed tomography (QCT) permits the selective analysis of cortical bone; however, the low spatial resolution of clinical QCT leads to an overestimation of cortical thickness (Ct.Th.) and bone strength. We propose a novel, model based, fully automatic image analysis method that allows accurate spatial modeling of the thickness distribution of cortical bone from clinical QCT. In an analysis-by-synthesis (AbS) fashion, a stochastic scan is synthesized from a probabilistic bone model; the optimal model parameters are estimated using a maximum a-posteriori approach. By exploiting the different characteristics of the in-plane and out-of-plane point spread functions of CT scanners, the proposed method is able to assess the spatial distribution of cortical thickness. The method was evaluated on eleven cadaveric human vertebrae, scanned by clinical QCT and analyzed using standard methods and AbS, both compared to high resolution peripheral QCT (HR-pQCT) as the gold standard. While standard QCT based measurements overestimated Ct.Th. by 560% and did not show a significant correlation with the gold standard ($r^2 = 0.20,\, p = 0.169$), the proposed method eliminated the overestimation and showed a significant, tight correlation with the gold standard ($r^2 = 0.98,\, p < 0.0001$) and a root mean square error below 10%.
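Analysis-by-synthesis is easiest to see in one dimension. The sketch below is a schematic of the idea only, with invented numbers: a cortex of thickness t is synthesised, blurred by a Gaussian stand-in for the scanner PSF, and t is recovered by minimising the negative log posterior (least squares plus a Gaussian prior on t).

```python
# Schematic 1D analysis-by-synthesis fit (our simplification of the method).
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.optimize import minimize_scalar

x = np.linspace(-5, 5, 201)          # mm across the cortex, 0.05 mm/sample
PSF_SIGMA_SAMPLES = 10               # assumed in-plane blur (0.5 mm here)

def synthesize(t):
    profile = ((x > -t / 2) & (x < t / 2)).astype(float)   # ideal cortex of thickness t
    return gaussian_filter1d(profile, PSF_SIGMA_SAMPLES)

true_t = 1.2
scan = synthesize(true_t) + np.random.default_rng(2).normal(0, 0.02, x.size)

def neg_log_posterior(t):
    likelihood = np.sum((scan - synthesize(t)) ** 2) / (2 * 0.02 ** 2)
    prior = (t - 1.5) ** 2 / (2 * 0.5 ** 2)   # weak prior: cortices around 1.5 mm
    return likelihood + prior

fit = minimize_scalar(neg_log_posterior, bounds=(0.1, 4.0), method="bounded")
print(f"estimated thickness: {fit.x:.2f} mm (true {true_t} mm)")
```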
41. Pruning Neural Networks at Initialization: Why are We Missing the Mark? [PDF] 返回目录
Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin
Abstract: Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, accuracy is the same or higher when randomly shuffling which weights these methods prune within each layer or sampling new initial values. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property undermines the claimed justifications for these methods and suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.
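The paper's central control experiment, shuffling which weights are pruned within each layer, takes only a few lines. The sketch below is our own illustration of that experiment, not the authors' implementation; it permutes each layer's mask while preserving the layerwise sparsity.

```python
# Shuffle pruning masks within each layer, keeping per-layer sparsity fixed.
import torch

def shuffle_mask_within_layers(masks):
    """masks: dict of layer name -> 0/1 tensor. Returns shuffled copies."""
    shuffled = {}
    for name, m in masks.items():
        flat = m.flatten()
        perm = torch.randperm(flat.numel())
        shuffled[name] = flat[perm].reshape(m.shape)   # same sparsity, random positions
    return shuffled

masks = {"conv1": (torch.rand(16, 3, 3, 3) > 0.8).float(),
         "fc": (torch.rand(10, 256) > 0.5).float()}
shuf = shuffle_mask_within_layers(masks)
for k in masks:
    assert masks[k].mean() == shuf[k].mean()           # layerwise fraction preserved
print({k: float(v.mean()) for k, v in shuf.items()})
```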
42. SCREENet: A Multi-view Deep Convolutional Neural Network for Classification of High-resolution Synthetic Mammographic Screening Scans [PDF] 返回目录
Saeed Seyyedi, Margaret J. Wong, Debra M. Ikeda, Curtis P. Langlotz
Abstract: Purpose: To develop and evaluate the accuracy of a multi-view deep learning approach to the analysis of high-resolution synthetic mammograms from digital breast tomosynthesis screening cases, and to assess the effect of image resolution and training set size on accuracy. Materials and Methods: In a retrospective study, 21,264 screening digital breast tomosynthesis (DBT) exams obtained at our institution were collected along with associated radiology reports. The 2D synthetic mammographic images from these exams, with varying resolutions and data set sizes, were used to train a multi-view deep convolutional neural network (MV-CNN) to classify screening images into BI-RADS classes (BI-RADS 0, 1 and 2) before evaluation on a held-out set of exams. Results: Area under the receiver operating characteristic curve (AUC) for the BI-RADS 0 vs non-BI-RADS 0 class was 0.912 for the MV-CNN trained on the full dataset. The model obtained an accuracy of 84.8%, recall of 95.9% and precision of 95.0%. This AUC value decreased when the same model was trained with 50% and 25% of the images (AUC = 0.877, P=0.010 and 0.834, P=0.009, respectively). The performance also dropped when the same model was trained using images that were under-sampled by 1/2 and 1/4 (AUC = 0.870, P=0.011 and 0.813, P=0.009, respectively). Conclusion: This deep learning model classified high-resolution synthetic mammography scans into normal vs needing further workup using tens of thousands of high-resolution images. Smaller training data sets and lower resolution images both caused a significant decrease in performance.
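A generic multi-view classifier over the BI-RADS classes can be sketched as follows; the trunk choice, view count, and fusion by concatenation are our assumptions rather than the paper's exact architecture.

```python
# Sketch of a multi-view CNN: one trunk per mammographic view, features
# concatenated into a joint head over the three BI-RADS classes used here.
import torch
import torch.nn as nn
from torchvision import models

class MultiViewCNN(nn.Module):
    def __init__(self, n_views=4, n_classes=3):
        super().__init__()
        trunks = []
        for _ in range(n_views):
            t = models.resnet18(weights=None)
            t.fc = nn.Identity()                       # keep the 512-d embedding
            trunks.append(t)
        self.trunks = nn.ModuleList(trunks)
        self.head = nn.Linear(512 * n_views, n_classes)

    def forward(self, views):                           # list of (B, 3, H, W) tensors
        feats = [t(v) for t, v in zip(self.trunks, views)]
        return self.head(torch.cat(feats, dim=1))

model = MultiViewCNN()
views = [torch.randn(2, 3, 224, 224) for _ in range(4)]
print(model(views).shape)                               # torch.Size([2, 3])
```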
43. The Next Big Thing(s) in Unsupervised Machine Learning: Five Lessons from Infant Learning [PDF] 返回目录
Lorijn Zaadnoordijk, Tarek R. Besold, Rhodri Cusack
Abstract: After a surge in popularity of supervised Deep Learning, the desire to reduce the dependence on curated, labelled data sets and to leverage the vast quantities of unlabelled data available recently triggered renewed interest in unsupervised learning algorithms. Despite significantly improved performance due to approaches such as the identification of disentangled latent representations, contrastive learning, and clustering optimisations, the performance of unsupervised machine learning still falls short of its hypothesised potential. Machine learning has previously taken inspiration from neuroscience and cognitive science with great success. However, this has mostly been based on adult learners with access to labels and a vast amount of prior knowledge. In order to push unsupervised machine learning forward, we argue that developmental science of infant cognition might hold the key to unlocking the next generation of unsupervised learning approaches. Conceptually, human infant learning is the closest biological parallel to artificial unsupervised learning, as infants too must learn useful representations from unlabelled data. In contrast to machine learning, these new representations are learned rapidly and from relatively few examples. Moreover, infants learn robust representations that can be used flexibly and efficiently in a number of different tasks and contexts. We identify five crucial factors enabling infants' quality and speed of learning, assess the extent to which these have already been exploited in machine learning, and propose how further adoption of these factors can give rise to previously unseen performance levels in unsupervised learning.
44. Keep off the Grass: Permissible Driving Routes from Radar with Weak Audio Supervision [PDF] 返回目录
David Williams, Daniele De Martini, Matthew Gadd, Letizia Marchegiani, Paul Newman
Abstract: Reliable outdoor deployment of mobile robots requires the robust identification of permissible driving routes in a given environment. The performance of LiDAR and vision-based perception systems deteriorates significantly if certain environmental factors are present, e.g. rain, fog, or darkness. Perception systems based on FMCW scanning radar maintain full performance regardless of environmental conditions and with a longer range than alternative sensors. Learning to segment a radar scan based on driveability in a fully supervised manner is not feasible, as labelling each radar scan on a bin-by-bin basis is both difficult and time-consuming to do by hand. We therefore weakly supervise the training of the radar-based classifier through an audio-based classifier that is able to predict the terrain type underneath the robot. By combining odometry, GPS and the terrain labels from the audio classifier, we are able to construct a terrain-labelled trajectory of the robot in the environment, which is then used to label the radar scans. Using a curriculum learning procedure, we then train a radar segmentation network to generalise beyond the initial labelling and to detect all permissible driving routes in the environment.
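The weak-labelling step, as we understand it from the abstract, amounts to projecting audio-labelled trajectory poses into each radar scan's frame. The sketch below is our own toy rasterisation with invented label semantics, not the paper's pipeline.

```python
# Toy projection of audio-derived terrain labels onto a scan-centred grid.
import numpy as np

def label_scan(scan_pose, traj_xy, traj_labels, grid_m=0.5, size=64):
    """Rasterise labelled trajectory points into a grid centred on the scan."""
    grid = np.full((size, size), -1, dtype=int)           # -1 = unlabelled
    rel = traj_xy - scan_pose                              # into the scan frame
    ij = np.round(rel / grid_m).astype(int) + size // 2
    ok = (ij >= 0).all(axis=1) & (ij < size).all(axis=1)
    grid[ij[ok, 0], ij[ok, 1]] = traj_labels[ok]
    return grid

traj = np.cumsum(np.random.default_rng(3).normal(0, 0.5, (200, 2)), axis=0)
labels = (np.sin(np.arange(200) / 20) > 0).astype(int)    # e.g. 0 = grass, 1 = gravel
print(np.unique(label_scan(traj[100], traj, labels), return_counts=True))
```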