目录
1. Optimizing Neural Architecture Search using Limited GPU Time in a Dynamic Search Space: A Gene Expression Programming Approach [PDF] 摘要
2. Guided interactive image segmentation using machine learning and color based data set clustering [PDF] 摘要
3. ResMoNet: A Residual Mobile-based Network for Facial Emotion Recognition in Resource-Limited Systems [PDF] 摘要
7. Convex Shape Prior for Deep Neural Convolution Network based Eye Fundus Images Segmentation [PDF] 摘要
10. Persistent Map Saving for Visual Localization for Autonomous Vehicles: An ORB-SLAM Extension [PDF] 摘要
11. Exploring the Capabilities and Limits of 3D Monocular Object Detection -- A Study on Simulation and Real World Data [PDF] 摘要
12. Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [PDF] 摘要
23. Small-brain neural networks rapidly solve inverse problems with vortex Fourier encoders [PDF] 摘要
25. 3D deformable registration of longitudinal abdominopelvic CT images using unsupervised deep learning [PDF] 摘要
27. TripletUNet: Multi-Task U-Net with Online Voxel-Wise Learning for Precise CT Prostate Segmentation [PDF] 摘要
28. Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching [PDF] 摘要
31. Cycle-Consistent Adversarial Networks for Realistic Pervasive Change Generation in Remote Sensing Imagery [PDF] 摘要
摘要
1. Optimizing Neural Architecture Search using Limited GPU Time in a Dynamic Search Space: A Gene Expression Programming Approach [PDF] 返回目录
Jeovane Honorio Alves, Lucas Ferrari de Oliveira
Abstract: Efficient identification of people and objects, segmentation of regions of interest and extraction of relevant data in images, texts, audios and videos are evolving considerably in these past years, which deep learning methods, combined with recent improvements in computational resources, contributed greatly for this achievement. Although its outstanding potential, development of efficient architectures and modules requires expert knowledge and amount of resource time available. In this paper, we propose an evolutionary-based neural architecture search approach for efficient discovery of convolutional models in a dynamic search space, within only 24 GPU hours. With its efficient search environment and phenotype representation, Gene Expression Programming is adapted for network's cell generation. Despite having limited GPU resource time and broad search space, our proposal achieved similar state-of-the-art to manually-designed convolutional networks and also NAS-generated ones, even beating similar constrained evolutionary-based NAS works. The best cells in different runs achieved stable results, with a mean error of 2.82% in CIFAR-10 dataset (which the best model achieved an error of 2.67%) and 18.83% for CIFAR-100 (best model with 18.16%). For ImageNet in the mobile setting, our best model achieved top-1 and top-5 errors of 29.51% and 10.37%, respectively. Although evolutionary-based NAS works were reported to require a considerable amount of GPU time for architecture search, our approach obtained promising results in little time, encouraging further experiments in evolutionary-based NAS, for search and network representation improvements.
摘要:近年来,图像、文本、音频和视频中人与物体的高效识别、感兴趣区域的分割以及相关数据的提取取得了长足进展,深度学习方法与计算资源的持续改进对此贡献巨大。尽管潜力突出,高效架构与模块的设计仍需要专家知识和大量的资源时间。本文提出一种基于进化算法的神经架构搜索方法,可在动态搜索空间中仅用24个GPU小时高效地发现卷积模型。借助其高效的搜索环境和表现型表示,基因表达式编程(Gene Expression Programming)被改造用于网络单元(cell)的生成。尽管GPU时间有限且搜索空间广阔,我们的方法取得了与人工设计的卷积网络以及NAS生成网络相当的先进水平,甚至超过了类似受限条件下的进化式NAS工作。不同运行中得到的最优单元结果稳定,在CIFAR-10数据集上的平均错误率为2.82%(最佳模型为2.67%),在CIFAR-100上为18.83%(最佳模型为18.16%)。在ImageNet移动端设定下,最佳模型的top-1和top-5错误率分别为29.51%和10.37%。虽然以往的进化式NAS工作通常需要大量GPU时间进行架构搜索,我们的方法在很短的时间内就取得了有希望的结果,鼓励在进化式NAS的搜索和网络表示方面开展进一步的实验改进。
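For readers unfamiliar with Gene Expression Programming, the toy sketch below (plain Python) illustrates how a fixed-length genotype could be decoded into a cell's list of operations; the operation vocabulary, gene layout, and decoding rule are hypothetical and far simpler than the paper's actual encoding.

```python
# Toy illustration only: decoding a fixed-length GEP-style gene into a list of
# cell operations. The paper's genotype/phenotype mapping (head/tail structure,
# linking of sub-expression trees) is more involved; OPS below is hypothetical.
import random

OPS = ["sep_conv_3x3", "sep_conv_5x5", "max_pool_3x3", "avg_pool_3x3", "identity"]

def random_gene(length=8, seed=0):
    rng = random.Random(seed)
    return [rng.randrange(len(OPS)) for _ in range(length)]

def decode(gene):
    """Phenotype: each pair of gene symbols becomes (input_node, operation)."""
    cell = []
    for node, (inp, op) in enumerate(zip(gene[::2], gene[1::2]), start=2):
        cell.append({"node": node, "input": inp % node, "op": OPS[op % len(OPS)]})
    return cell

if __name__ == "__main__":
    for edge in decode(random_gene()):
        print(edge)
```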
2. Guided interactive image segmentation using machine learning and color based data set clustering [PDF] 返回目录
Adrian Friebel, Tim Johann, Dirk Drasdo, Stefan Hoehme
Abstract: We present a novel approach that combines machine learning based interactive image segmentation with a two-stage clustering method for identification of similarly colored images enabling efficient batch image segmentation through guided reuse of interactively trained classifiers. The segmentation task is formulated as a supervised machine learning problem working on supervoxels. These visually homogeneous groups of voxels are characterized using local color, edge and texture features. Classifiers are interactively trained from sparse annotations in a iterative process of annotation refinement. Resulting models can be used for batch processing of previously unseen images. However, due to systemic discrepancies of image colorization classifier reusability is typically limited. By clustering a set of images into subsets of similar colorization, considering characteristic dominant color vectors obtained from the individual images it is possible to identify a minimal set of prototype images eligible for interactive segmentation. We demonstrate that limiting the reuse of pre-trained classifiers to images in the same color-cluster significantly improves the average segmentation performance of batch processing. The described methods are implemented in our free image processing and quantification software TiQuant released alongside this publication.
摘要:我们提出一种新方法,将基于机器学习的交互式图像分割与用于识别颜色相近图像的两阶段聚类方法相结合,通过有指导地复用交互式训练得到的分类器,实现高效的批量图像分割。分割任务被表述为在超体素(supervoxel)上进行的有监督机器学习问题,这些视觉上均匀的体素组使用局部颜色、边缘和纹理特征来刻画。分类器在标注迭代细化的过程中由稀疏标注交互式地训练得到,所得模型可用于批量处理此前未见过的图像。然而,由于图像着色的系统性差异,分类器的可复用性通常受限。通过依据从各图像中提取的代表性主色向量将图像集聚类为着色相近的子集,可以确定一组最小的原型图像用于交互式分割。我们证明,将预训练分类器的复用限制在同一颜色簇内的图像上,可以显著提升批量处理的平均分割性能。所述方法已在我们随本文发布的免费图像处理与定量软件TiQuant中实现。
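A minimal sketch of the color-based grouping idea: cluster images by a dominant-color descriptor and reuse an interactively trained classifier only within one cluster. The descriptor and the single-stage KMeans below are stand-ins for the paper's two-stage clustering, so treat this purely as an illustration.

```python
# Hedged sketch: group images by a dominant-color descriptor so that an
# interactively trained classifier is only reused within one color cluster.
import numpy as np
from sklearn.cluster import KMeans

def dominant_color(image, k=3):
    """Return the centroid of the largest color cluster of one image (H, W, 3)."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_[np.argmax(counts)]

def cluster_images(images, n_clusters=4):
    """Cluster images by dominant color; one prototype per cluster would then be
    segmented interactively and its classifier reused for the remaining images."""
    descriptors = np.stack([dominant_color(img) for img in images])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(descriptors)
    return km.labels_, descriptors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imgs = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(8)]
    labels, _ = cluster_images(imgs, n_clusters=2)
    print(labels)
```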
3. ResMoNet: A Residual Mobile-based Network for Facial Emotion Recognition in Resource-Limited Systems [PDF] 返回目录
Rodolfo Ferro-Pérez, Hugo Mitre-Hernandez
Abstract: The Deep Neural Networks (DNNs) models have contributed a high accuracy for the classification of human emotional states from facial expression recognition data sets, where efficiency is an important factor for resource-limited systems as mobile devices and embedded systems. There are efficient Convolutional Neural Networks (CNN) models as MobileNet, PeleeNet, Extended Deep Neural Network (EDNN) and Inception-Based Deep Neural Network (IDNN) in terms of model architecture results: parameters, Floating-point OPerations (FLOPs) and accuracy. Although these results are satisfactory, it is necessary to evaluate other computational resources related to the trained model such as main memory utilization and response time to complete the emotion recognition. In this paper, we compare our proposed model inspired in depthwise separable convolutions and residual blocks with MobileNet, PeleeNet, EDNN and IDNN. The comparative results of the CNN architectures and the trained models --with Radboud Faces Database (RaFD)-- installed in a resource-limited device are discussed.
摘要:深度神经网络(DNN)模型在基于面部表情识别数据集的人类情绪状态分类中取得了很高的精度,而对于移动设备和嵌入式系统等资源受限系统而言,效率是一个重要因素。就模型架构指标(参数量、浮点运算量(FLOPs)和精度)而言,已有MobileNet、PeleeNet、扩展深度神经网络(EDNN)和基于Inception的深度神经网络(IDNN)等高效的卷积神经网络(CNN)模型。尽管这些结果令人满意,仍有必要评估与训练模型相关的其他计算资源,例如完成情绪识别所需的内存占用和响应时间。本文将我们提出的、受深度可分离卷积和残差块启发的模型与MobileNet、PeleeNet、EDNN和IDNN进行比较,并讨论了这些CNN架构以及使用Radboud人脸数据库(RaFD)训练的模型部署在资源受限设备上的对比结果。
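The abstract names depthwise separable convolutions and residual blocks as the inspiration for the proposed model; below is a minimal PyTorch sketch of these two standard building blocks, not the actual ResMoNet architecture.

```python
# Minimal PyTorch illustration of a depthwise separable convolution wrapped in a
# residual (skip) connection; channel counts and layer order are assumptions.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=channels)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # pointwise: 1x1 convolution mixing channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ResidualDSBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = DepthwiseSeparableConv(channels)

    def forward(self, x):
        return x + self.conv(x)  # residual connection

if __name__ == "__main__":
    print(ResidualDSBlock(32)(torch.randn(1, 32, 48, 48)).shape)
```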
4. HNAS: Hierarchical Neural Architecture Search on Mobile Devices [PDF] 返回目录
Xin Xia, Wenrui Ding
Abstract: Neural Architecture Search (NAS) has attracted growing interest. To reduce the search cost, recent work has explored weight sharing across models and made major progress in One-Shot NAS. However, it has been observed that a model with higher one-shot model accuracy does not necessarily perform better when stand-alone trained. To address this issue, in this paper, we propose a new method, named Hierarchical Neural Architecture Search (HNAS). Unlike previous approaches where the same operation search space is shared by all the layers in the supernet, we formulate a hierarchical search strategy based on operation pruning and build a layer-wise operation search space. In this way, HNAS can automatically select the operations for each layer. During the search, we also take the hardware platform constraints into consideration for efficient neural network model deployment. Extensive experiments on ImageNet show that under mobile latency constraint, our models consistently outperform state-of-the-art models both designed manually and generated automatically by NAS methods.
摘要:神经架构搜索(NAS)受到越来越多的关注。为降低搜索成本,近期工作探索了跨模型的权重共享,并在单次(One-Shot)NAS方面取得重大进展。然而,人们观察到,在超网中一次性评估精度较高的模型在独立训练时并不一定表现更好。针对这一问题,本文提出一种新方法,称为分层神经架构搜索(HNAS)。与以往所有层共享同一操作搜索空间的做法不同,我们制定了基于操作剪枝的分层搜索策略,并构建逐层的操作搜索空间。通过这种方式,HNAS可以为每一层自动选择操作。在搜索过程中,我们还将硬件平台约束纳入考虑,以实现高效的神经网络模型部署。在ImageNet上的大量实验表明,在移动端延迟约束下,我们的模型持续优于人工设计以及由NAS方法自动生成的最先进模型。
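A hedged illustration of the layer-wise operation-pruning idea: each layer keeps its own candidate operation set and the lowest-scoring candidates are dropped. The candidate operations and scores below are placeholders, not the paper's actual search procedure.

```python
# Hedged sketch of layer-wise operation pruning: every layer has its own
# candidate-operation set, and the worst-scoring candidates are pruned. Scores
# stand in for whatever fitness signal the actual method uses.
CANDIDATE_OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3",
                 "max_pool_3x3", "skip_connect"]

def prune_layerwise(op_scores, keep=2):
    """op_scores: list (one entry per layer) of {op_name: score} dicts.
    Returns, per layer, the `keep` best operations."""
    pruned = []
    for layer_scores in op_scores:
        best = sorted(layer_scores, key=layer_scores.get, reverse=True)[:keep]
        pruned.append(best)
    return pruned

if __name__ == "__main__":
    scores = [
        dict(zip(CANDIDATE_OPS, [0.71, 0.69, 0.55, 0.40, 0.62])),
        dict(zip(CANDIDATE_OPS, [0.52, 0.74, 0.70, 0.45, 0.33])),
    ]
    print(prune_layerwise(scores, keep=2))
```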
5. Temperate Fish Detection and Classification: a Deep Learning based Approach [PDF] 返回目录
Kristian Muri Knausgård, Arne Wiklund, Tonje Knutsen Sørdalen, Kim Halvorsen, Alf Ring Kleiven, Lei Jiao, Morten Goodwin
Abstract: A wide range of applications in marine ecology extensively uses underwater cameras. Still, to efficiently process the vast amount of data generated, we need to develop tools that can automatically detect and recognize species captured on film. Classifying fish species from videos and images in natural environments can be challenging because of noise and variation in illumination and the surrounding habitat. In this paper, we propose a two-step deep learning approach for the detection and classification of temperate fishes without pre-filtering. The first step is to detect each single fish in an image, independent of species and sex. For this purpose, we employ the You Only Look Once (YOLO) object detection technique. In the second step, we adopt a Convolutional Neural Network (CNN) with the Squeeze-and-Excitation (SE) architecture for classifying each fish in the image without pre-filtering. We apply transfer learning to overcome the limited training samples of temperate fishes and to improve the accuracy of the classification. This is done by training the object detection model with ImageNet and the fish classifier via a public dataset (Fish4Knowledge), whereupon both the object detection and classifier are updated with temperate fishes of interest. The weights obtained from pre-training are applied to post-training as a priori. Our solution achieves the state-of-the-art accuracy of 99.27\% on the pre-training. The percentage values for accuracy on the post-training are good; 83.68\% and 87.74\% with and without image augmentation, respectively, indicating that the solution is viable with a more extensive dataset.
摘要:海洋生态学的许多应用广泛使用水下摄像机。然而,为了高效处理由此产生的海量数据,我们需要开发能够自动检测并识别影像中物种的工具。由于噪声以及光照和周围栖息环境的变化,在自然环境的视频和图像中对鱼类物种进行分类具有挑战性。本文提出一种无需预过滤的两步深度学习方法,用于温带鱼类的检测与分类。第一步是在图像中检测出每一条鱼,与物种和性别无关;为此我们采用YOLO(You Only Look Once)目标检测技术。第二步采用带有挤压-激励(Squeeze-and-Excitation,SE)结构的卷积神经网络(CNN),在不进行预过滤的情况下对图像中的每条鱼进行分类。我们利用迁移学习克服温带鱼类训练样本有限的问题并提高分类精度:目标检测模型用ImageNet训练,鱼类分类器用公共数据集(Fish4Knowledge)训练,随后两者再用目标温带鱼类数据进行更新;预训练得到的权重作为先验用于后续训练。我们的方案在预训练阶段达到了99.27%的最先进精度;后续训练阶段的精度也较好,在有和没有图像增广的情况下分别为83.68%和87.74%,表明该方案在更大规模的数据集上是可行的。
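The classification stage uses a Squeeze-and-Excitation (SE) architecture; the following is a standard, minimal PyTorch SE block for reference, not the authors' implementation.

```python
# Minimal Squeeze-and-Excitation block: global average pooling ("squeeze") then a
# small bottleneck MLP with sigmoid gating ("excitation") to reweight channels.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # channel-wise reweighting

if __name__ == "__main__":
    print(SEBlock(64)(torch.randn(2, 64, 16, 16)).shape)
```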
6. History for Visual Dialog: Do we really need it? [PDF] 返回目录
Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, Verena Rieser
Abstract: Visual Dialog involves "understanding" the dialog history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to generate the correct response. In this paper, we show that co-attention models which explicitly encode dialog history outperform models that don't, achieving state-of-the-art performance (72 % NDCG on val set). However, we also expose shortcomings of the crowd-sourcing dataset collection procedure by showing that history is indeed only required for a small amount of the data and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisDialConv) of the VisDial val set and provide a benchmark of 63% NDCG.
摘要:视觉对话(Visual Dialog)要求在图像信息接地的基础上,"理解"对话历史(此前讨论过的内容)和当前问题(所问的内容),从而生成正确的回答。本文表明,显式编码对话历史的协同注意力(co-attention)模型优于不编码历史的模型,取得了最先进的性能(验证集上NDCG达72%)。然而,我们也揭示了众包数据集收集流程的不足:实际上只有一小部分数据真正需要对话历史,而且当前的评估指标会鼓励泛泛的回答。为此,我们提出了VisDial验证集的一个具有挑战性的子集(VisDialConv),并给出了63% NDCG的基准结果。
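The reported numbers are NDCG scores on the VisDial val set; for clarity, a small reference implementation of NDCG with the usual log2 discount is given below (the cutoff convention used by the official evaluation may differ).

```python
# Reference NDCG computation: relevance-weighted gains with a log2 discount,
# normalized by the ideal (perfectly sorted) ranking.
import numpy as np

def ndcg(relevances, ranking, k=None):
    """relevances: ground-truth relevance per candidate answer;
    ranking: candidate indices sorted by the model's score (best first)."""
    relevances = np.asarray(relevances, dtype=float)
    k = k or len(ranking)
    gains = relevances[np.asarray(ranking[:k])]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = float(np.sum(gains * discounts))
    ideal = float(np.sum(np.sort(relevances)[::-1][:k] * discounts))
    return dcg / ideal if ideal > 0 else 0.0

if __name__ == "__main__":
    print(ndcg([0.0, 1.0, 0.5, 0.0], ranking=[1, 2, 0, 3]))  # 1.0, perfect order
```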
7. Convex Shape Prior for Deep Neural Convolution Network based Eye Fundus Images Segmentation [PDF] 返回目录
Jun Liu, Xue-Cheng Tai, Shousheng Luo
Abstract: Convex Shapes (CS) are common priors for optic disc and cup segmentation in eye fundus images. It is important to design proper techniques to represent convex shapes. So far, it is still a problem to guarantee that the output objects from a Deep Neural Convolution Networks (DCNN) are convex shapes. In this work, we propose a technique which can be easily integrated into the commonly used DCNNs for image segmentation and guarantee that outputs are convex shapes. This method is flexible and it can handle multiple objects and allow some of the objects to be convex. Our method is based on the dual representation of the sigmoid activation function in DCNNs. In the dual space, the convex shape prior can be guaranteed by a simple quadratic constraint on a binary representation of the shapes. Moreover, our method can also integrate spatial regularization and some other shape prior using a soft thresholding dynamics (STD) method. The regularization can make the boundary curves of the segmentation objects to be simultaneously smooth and convex. We design a very stable active set projection algorithm to numerically solve our model. This algorithm can form a new plug-and-play DCNN layer called CS-STD whose outputs must be a nearly binary segmentation of convex objects. In the CS-STD block, the convexity information can be propagated to guide the DCNN in both forward and backward propagation during training and prediction process. As an application example, we apply the convexity prior layer to the retinal fundus images segmentation by taking the popular DeepLabV3+ as a backbone network. Experimental results on several public datasets show that our method is efficient and outperforms the classical DCNN segmentation methods.
摘要:凸形状(CS)是眼底图像中视盘和视杯分割常用的先验,设计合适的技术来表示凸形状十分重要。到目前为止,如何保证深度卷积神经网络(DCNN)输出的对象是凸形状仍是一个难题。在这项工作中,我们提出一种可以方便地集成到常用图像分割DCNN中、并保证输出为凸形状的技术。该方法十分灵活,可以处理多个对象,并允许其中部分对象为凸。我们的方法基于DCNN中sigmoid激活函数的对偶表示:在对偶空间中,凸形状先验可以通过对形状的二值表示施加一个简单的二次约束来保证。此外,该方法还可以借助软阈值动力学(soft thresholding dynamics,STD)融入空间正则化及其他一些形状先验,正则化可以使分割对象的边界曲线同时保持光滑和凸。我们设计了一个非常稳定的活动集投影算法来对模型进行数值求解。该算法可以构成一个新的即插即用DCNN层,称为CS-STD,其输出必须是凸对象的近似二值分割。在CS-STD模块中,凸性信息可以在训练和预测过程中沿前向和反向传播来引导DCNN。作为应用示例,我们以流行的DeepLabV3+为骨干网络,将凸先验层应用于视网膜眼底图像分割。在多个公共数据集上的实验结果表明,我们的方法是有效的,并优于经典的DCNN分割方法。
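The method builds on the dual (variational) representation of the sigmoid. Assuming this refers to the standard variational form, it can be written as below; the paper's specific quadratic convexity constraint on the binary shape representation is not reproduced here.

```latex
% Hedged note: the standard variational (dual) form of the sigmoid, assumed to be
% the representation the abstract refers to.
\sigma(o) \;=\; \operatorname*{arg\,max}_{u \in [0,1]}
  \Big\{ u\,o \;-\; u\ln u \;-\; (1-u)\ln(1-u) \Big\},
\qquad \sigma(o) \;=\; \frac{1}{1+e^{-o}} .
```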
8. PrimiTect: Fast Continuous Hough Voting for Primitive Detection [PDF] 返回目录
Christiane Sommer, Yumin Sun, Erik Bylow, Daniel Cremers
Abstract: This paper tackles the problem of data abstraction in the context of 3D point sets. Our method classifies points into different geometric primitives, such as planes and cones, leading to a compact representation of the data. Being based on a semi-global Hough voting scheme, the method does not need initialization and is robust, accurate, and efficient. We use a local, low-dimensional parameterization of primitives to determine type, shape and pose of the object that a point belongs to. This makes our algorithm suitable to run on devices with low computational power, as often required in robotics applications. The evaluation shows that our method outperforms state-of-the-art methods both in terms of accuracy and robustness.
摘要:本文研究三维点集场景下的数据抽象问题。我们的方法将点分类到不同的几何基元(例如平面和圆锥)中,从而得到数据的紧凑表示。该方法基于一种半全局的霍夫投票方案,无需初始化,并且鲁棒、准确、高效。我们使用基元的局部低维参数化来确定一个点所属对象的类型、形状和位姿。这使得我们的算法适合在计算能力较低的设备上运行,而这正是机器人应用中常见的需求。评估表明,我们的方法在精度和鲁棒性方面都优于最先进的方法。
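A toy numpy sketch of Hough voting for plane primitives: each oriented point votes for one plane hypothesis (normal direction plus offset), and well-supported accumulator cells are reported. The paper's continuous, semi-global voting over several primitive types is considerably richer; bin sizes here are arbitrary.

```python
# Toy Hough voting for planes: a point with normal n at position p supports the
# plane with normal n and offset d = n . p; votes are accumulated in a discretized
# (inclination, azimuth, offset) grid.
import numpy as np
from collections import Counter

def vote_planes(points, normals, angle_bins=18, dist_res=0.05):
    votes = Counter()
    for p, n in zip(points, normals):
        n = n / np.linalg.norm(n)
        theta = np.arccos(np.clip(n[2], -1.0, 1.0))      # inclination
        phi = np.arctan2(n[1], n[0]) % (2 * np.pi)       # azimuth
        d = float(np.dot(n, p))                          # plane offset
        cell = (int(theta / np.pi * angle_bins),
                int(phi / (2 * np.pi) * angle_bins),
                int(round(d / dist_res)))
        votes[cell] += 1
    return votes.most_common(3)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = np.c_[rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200), np.zeros(200)]
    nrm = np.tile([0.0, 0.0, 1.0], (200, 1))             # points on the plane z = 0
    print(vote_planes(pts, nrm))
```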
9. A Deep Learning-based Radar and Camera Sensor Fusion Architecture for Object Detection [PDF] 返回目录
Felix Nobis, Maximilian Geisslinger, Markus Weber, Johannes Betz, Markus Lienkamp
Abstract: Object detection in camera images, using deep learning has been proven successfully in recent years. Rising detection rates and computationally efficient network structures are pushing this technique towards application in production vehicles. Nevertheless, the sensor quality of the camera is limited in severe weather conditions and through increased sensor noise in sparsely lit areas and at night. Our approach enhances current 2D object detection networks by fusing camera data and projected sparse radar data in the network layers. The proposed CameraRadarFusionNet (CRF-Net) automatically learns at which level the fusion of the sensor data is most beneficial for the detection result. Additionally, we introduce BlackIn, a training strategy inspired by Dropout, which focuses the learning on a specific sensor type. We show that the fusion network is able to outperform a state-of-the-art image-only network for two different datasets. The code for this research will be made available to the public at: this https URL.
摘要:近年来,利用深度学习在相机图像中进行目标检测已被证明是成功的。不断提升的检测率和计算高效的网络结构正推动这项技术走向量产车辆的应用。然而,相机的传感质量在恶劣天气条件下,以及在光照稀少的区域和夜间因传感器噪声增加而受到限制。我们的方法通过在网络层中融合相机数据和投影后的稀疏雷达数据来增强现有的二维目标检测网络。所提出的CameraRadarFusionNet(CRF-Net)能够自动学习在哪个层级融合传感器数据对检测结果最有利。此外,我们提出了BlackIn,一种受Dropout启发的训练策略,使学习聚焦于特定的传感器类型。我们证明,该融合网络在两个不同的数据集上都能超越最先进的纯图像检测网络。本研究的代码将在以下地址公开:this https URL。
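A hedged sketch of the BlackIn training strategy as described: analogous to Dropout, the entire (projected) radar input of a sample is blanked with some probability so the network cannot over-rely on a single sensor. The probability and tensor layout are assumptions.

```python
# Hedged sketch: zero the projected radar channels of a random fraction of
# training samples, in the spirit of the BlackIn strategy described above.
import torch

def blackin(radar, p=0.2, training=True):
    """radar: (B, C, H, W) projected radar channels. With probability p,
    blank the radar input of a sample during training."""
    if not training or p <= 0:
        return radar
    keep = (torch.rand(radar.shape[0], 1, 1, 1, device=radar.device) >= p).float()
    return radar * keep

if __name__ == "__main__":
    r = torch.ones(8, 2, 4, 4)
    print(blackin(r, p=0.5).flatten(1).sum(dim=1))  # some samples blanked to zero
```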
10. Persistent Map Saving for Visual Localization for Autonomous Vehicles: An ORB-SLAM Extension [PDF] 返回目录
Felix Nobis, Odysseas Papanikolaou, Johannes Betz, Markus Lienkamp
Abstract: Electric vhicles and autonomous driving dominate current research efforts in the automotive sector. The two topics go hand in hand in terms of enabling safer and more environmentally friendly driving. One fundamental building block of an autonomous vehicle is the ability to build a map of the environment and localize itself on such a map. In this paper, we make use of a stereo camera sensor in order to perceive the environment and create the map. With live Simultaneous Localization and Mapping (SLAM), there is a risk of mislocalization, since no ground truth map is used as a reference and errors accumulate over time. Therefore, we first build up and save a map of visual features of the environment at low driving speeds with our extension to the ORB-SLAM\,2 package. In a second run, we reload the map and then localize on the previously built-up map. Loading and localizing on a previously built map can improve the continuous localization accuracy for autonomous vehicles in comparison to a full SLAM. This map saving feature is missing in the original ORB-SLAM\,2 implementation. We evaluate the localization accuracy for scenes of the KITTI dataset against the built up SLAM map. Furthermore, we test the localization on data recorded with our own small scale electric model car. We show that the relative translation error of the localization stays under 1\% for a vehicle travelling at an average longitudinal speed of 36 m/s in a feature-rich environment. The localization mode contributes to a better localization accuracy and lower computational load compared to a full SLAM. The source code of our contribution to the ORB-SLAM2 will be made public at: this https URL.
摘要:电动汽车和自动驾驶主导着当前汽车行业的研究工作,这两个主题在实现更安全、更环保的驾驶方面相辅相成。自动驾驶车辆的一个基本组成部分,是构建环境地图并在地图上对自身进行定位的能力。本文利用双目相机传感器来感知环境并创建地图。在实时同步定位与建图(SLAM)中存在误定位的风险,因为没有真值地图作为参考,误差会随时间累积。因此,我们首先利用对ORB-SLAM2软件包的扩展,在低速行驶时构建并保存环境视觉特征地图;在第二次运行中重新加载该地图,并在先前构建的地图上进行定位。与完整的SLAM相比,加载并在预先构建的地图上定位可以提高自动驾驶车辆的连续定位精度,而这一地图保存功能在原始ORB-SLAM2实现中是缺失的。我们基于构建的SLAM地图评估了KITTI数据集场景的定位精度,并在我们自己的小型电动模型车记录的数据上测试了定位效果。结果表明,在特征丰富的环境中,车辆以36 m/s的平均纵向速度行驶时,定位的相对平移误差保持在1%以下。与完整SLAM相比,定位模式有助于获得更好的定位精度和更低的计算负载。我们对ORB-SLAM2的贡献的源代码将在以下地址公开:this https URL。
11. Exploring the Capabilities and Limits of 3D Monocular Object Detection -- A Study on Simulation and Real World Data [PDF] 返回目录
Felix Nobis, Fabian Brunhuber, Simon Janssen, Johannes Betz, Markus Lienkamp
Abstract: 3D object detection based on monocular camera data is a key enabler for autonomous driving. The task however, is ill-posed due to lack of depth information in 2D images. Recent deep learning methods show promising results to recover depth information from single images by learning priors about the environment. Several competing strategies tackle this problem. In addition to the network design, the major difference of these competing approaches lies in using a supervised or self-supervised optimization loss function, which require different data and ground truth information. In this paper, we evaluate the performance of a 3D object detection pipeline which is parameterizable with different depth estimation configurations. We implement a simple distance calculation approach based on camera intrinsics and 2D bounding box size, a self-supervised, and a supervised learning approach for depth estimation. Ground truth depth information cannot be recorded reliable in real world scenarios. This shifts our training focus to simulation data. In simulation, labeling and ground truth generation can be automatized. We evaluate the detection pipeline on simulator data and a real world sequence from an autonomous vehicle on a race track. The benefit of simulation training to real world application is investigated. Advantages and drawbacks of the different depth estimation strategies are discussed.
摘要:基于单目相机数据的三维目标检测是自动驾驶的关键技术。然而,由于二维图像缺乏深度信息,该任务是一个病态问题。近期的深度学习方法通过学习关于环境的先验,在从单幅图像恢复深度信息方面显示出可喜的成果。多种相互竞争的策略都在解决这一问题:除网络设计外,这些方法的主要区别在于使用有监督还是自监督的优化损失函数,二者需要不同的数据和真值信息。本文评估了一个可配置不同深度估计方案的三维目标检测流水线的性能。我们实现了一种基于相机内参和二维包围框尺寸的简单距离计算方法、一种自监督方法以及一种有监督的深度估计学习方法。真值深度信息在真实场景中难以可靠采集,这使我们将训练重点转向仿真数据;在仿真中,标注和真值生成可以自动化完成。我们在仿真器数据以及一辆自动驾驶车辆在赛道上采集的真实序列上评估了该检测流水线,研究了仿真训练对真实应用的益处,并讨论了不同深度估计策略的优缺点。
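The "simple distance calculation approach based on camera intrinsics and 2D bounding box size" can be illustrated with the pinhole model: depth ≈ f_y · H / h for an object of known real-world height H appearing h pixels tall. The per-class height prior and the focal length below are assumptions of this sketch.

```python
# Pinhole-model distance estimate from a 2D bounding box height; the object's
# real height is a class-level prior assumed here, not part of the original work.
def depth_from_bbox(bbox_height_px, focal_length_y_px, real_height_m):
    """Estimate object distance (meters) along the optical axis."""
    return focal_length_y_px * real_height_m / bbox_height_px

if __name__ == "__main__":
    # e.g. a ~1.5 m tall car seen 75 px tall with f_y = 721 px (a typical KITTI value)
    print(round(depth_from_bbox(75, 721.0, 1.5), 2), "m")  # ~14.42 m
```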
12. Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [PDF] 返回目录
Quande Liu, Lequan Yu, Luyang Luo, Qi Dou, Pheng Ann Heng
Abstract: Training deep neural networks usually requires a large amount of labeled data to obtain good performance. However, in medical image analysis, obtaining high-quality labels for the data is laborious and expensive, as accurately annotating medical images demands expertise knowledge of the clinicians. In this paper, we present a novel relation-driven semi-supervised framework for medical image classification. It is a consistency-based method which exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations, and leverages a self-ensembling model to produce high-quality consistency targets for the unlabeled data. Considering that human diagnosis often refers to previous analogous cases to make reliable decisions, we introduce a novel sample relation consistency (SRC) paradigm to effectively exploit unlabeled data by modeling the relationship information among different samples. Superior to existing consistency-based methods which simply enforce consistency of individual predictions, our framework explicitly enforces the consistency of semantic relation among different samples under perturbations, encouraging the model to explore extra semantic information from unlabeled data. We have conducted extensive experiments to evaluate our method on two public benchmark medical image classification datasets, i.e.,skin lesion diagnosis with ISIC 2018 challenge and thorax disease classification with ChestX-ray14. Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
摘要:训练深度神经网络通常需要大量带标注的数据才能获得良好性能。然而在医学图像分析中,为数据获取高质量标注既费力又昂贵,因为准确标注医学图像需要临床医生的专业知识。本文提出一种新颖的关系驱动的半监督医学图像分类框架。它是一种基于一致性的方法:通过鼓励给定输入在扰动下的预测一致性来利用无标注数据,并借助自集成(self-ensembling)模型为无标注数据生成高质量的一致性目标。考虑到人类诊断往往参考以往的类似病例来做出可靠决策,我们提出一种新颖的样本关系一致性(SRC)范式,通过建模不同样本之间的关系信息来有效利用无标注数据。与现有仅强制单个预测一致性的基于一致性的方法相比,我们的框架显式地强制扰动下不同样本之间语义关系的一致性,鼓励模型从无标注数据中挖掘额外的语义信息。我们在两个公开的医学图像分类基准数据集上进行了大量实验来评估该方法,即ISIC 2018挑战赛的皮肤病变诊断和ChestX-ray14的胸部疾病分类。在单标签和多标签图像分类场景下,我们的方法均优于许多最先进的半监督学习方法。
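A minimal PyTorch sketch of the sample relation consistency (SRC) idea: compare the pairwise feature-similarity ("relation") matrices produced by the student and the self-ensembling teacher on the same batch. The cosine similarity and MSE penalty are assumptions, not necessarily the paper's exact choices.

```python
# Hedged sketch: penalize the discrepancy between student and teacher pairwise
# relation matrices, on top of the usual per-sample prediction consistency.
import torch
import torch.nn.functional as F

def relation_matrix(features):
    """features: (B, D) -> (B, B) cosine-similarity relation matrix."""
    f = F.normalize(features, dim=1)
    return f @ f.t()

def src_loss(student_feats, teacher_feats):
    return F.mse_loss(relation_matrix(student_feats),
                      relation_matrix(teacher_feats).detach())

if __name__ == "__main__":
    s = torch.randn(16, 128, requires_grad=True)
    t = torch.randn(16, 128)
    print(src_loss(s, t).item())
```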
13. Resisting the Distracting-factors in Pedestrian Detection [PDF] 返回目录
Zhe Wang, Jun Wang, Yezhou Yang
Abstract: Pedestrian detection has been heavily studied in the last decade due to its wide applications. Despite incremental progress, several distracting-factors in the aspect of geometry and appearance still remain. In this paper, we first analyze these impeding factors and their effect on the general region-based detection framework. We then present a novel model that is resistant to these factors by incorporating methods that are not solely restricted to pedestrian detection domain. Specifically, to address the geometry distraction, we design a novel coulomb loss as a regulator on bounding box regression, in which proposals are attracted by their target instance and repelled by the adjacent non-target instances. For appearance distraction, we propose an efficient semantic-driven strategy for selecting anchor locations, which can sample informative negative examples at training phase for classification refinement. Our detector can be trained in an end-to-end manner, and achieves consistently high performance on both the Caltech-USA and CityPersons benchmarks. Code will be publicly available upon publication.
摘要:由于应用广泛,行人检测在过去十年中得到了大量研究。尽管进展不断,几何和外观方面的若干干扰因素仍然存在。本文首先分析这些阻碍因素及其对一般基于区域的检测框架的影响,随后提出一个新模型,通过引入并不局限于行人检测领域的方法来抵抗这些因素。具体而言,针对几何方面的干扰,我们设计了一种新颖的库仑损失(coulomb loss)作为包围框回归的调节器,使候选框被其目标实例吸引、被相邻的非目标实例排斥。针对外观方面的干扰,我们提出一种高效的语义驱动的锚点位置选择策略,可在训练阶段采样有信息量的负样本以细化分类。我们的检测器可以端到端训练,并在Caltech-USA和CityPersons两个基准上都取得了稳定的高性能。代码将在论文发表后公开。
14. ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language [PDF] 返回目录
Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang
Abstract: Person search by natural language aims at retrieving a specific person in a large-scale image pool that matches the given textual descriptions. While most of the current methods treat the task as a holistic visual and textual feature matching one, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. We achieve success as well as the performance boosting by a robust feature learning that the referred identity can be accurately bundled by multiple attribute visual cues. To be concrete, our Visual-Textual Attribute Alignment model (dubbed as ViTAA) learns to disentangle the feature space of a person into subspaces corresponding to attributes using a light auxiliary attribute segmentation computing branch. It then aligns these visual features with the textual attributes parsed from the sentences by using a novel contrastive learning loss. Upon that, we validate our ViTAA framework through extensive experiments on tasks of person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performances. Code will be publicly available upon publication.
摘要:基于自然语言的行人搜索旨在从大规模图像库中检索与给定文本描述相匹配的特定人物。当前大多数方法将该任务视为整体性的视觉-文本特征匹配,而我们从属性对齐的角度出发,将特定的属性短语与相应的视觉区域进行关联。借助鲁棒的特征学习,所指身份可以由多个属性视觉线索准确地组合刻画,从而在取得成功的同时提升性能。具体而言,我们的视觉-文本属性对齐模型(称为ViTAA)利用一个轻量的辅助属性分割计算分支,学习将人物的特征空间解耦为与各属性对应的子空间;随后通过一种新颖的对比学习损失,将这些视觉特征与从句子中解析出的文本属性对齐。在此基础上,我们通过在自然语言行人搜索和属性短语查询任务上的大量实验验证了ViTAA框架,我们的系统在这些任务上取得了最先进的性能。代码将在论文发表后公开。
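The visual-textual alignment relies on "a novel contrastive learning loss"; the InfoNCE-style loss below is a generic stand-in showing how matching region/phrase pairs can be pulled together and mismatched pairs pushed apart. The temperature and symmetric formulation are assumptions.

```python
# Generic contrastive alignment loss (InfoNCE-style stand-in, not the paper's
# exact formulation): row i of each input is assumed to be a matching pair.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(region_feats, phrase_feats, temperature=0.07):
    """region_feats, phrase_feats: (N, D)."""
    v = F.normalize(region_feats, dim=1)
    t = F.normalize(phrase_feats, dim=1)
    logits = v @ t.t() / temperature          # (N, N) similarities of all pairs
    targets = torch.arange(v.size(0), device=v.device)
    # symmetric cross-entropy: regions -> phrases and phrases -> regions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    print(contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```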
15. Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [PDF] 返回目录
Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu
Abstract: Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that destine their impressive success. To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) Learned attention matrix in pre-trained models demonstrates patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually-interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architecture and objectives for multimodal pre-training.
摘要:近期基于Transformer的大规模预训练模型彻底改变了视觉-语言(V+L)研究。ViLBERT、LXMERT和UNITER等模型通过图像-文本联合预训练,在众多V+L基准上显著提升了最先进水平。然而,人们对造就其出色表现的内部机制知之甚少。为了揭示这些强大模型的幕后秘密,我们提出VALUE(Vision-And-Language Understanding Evaluation),一套精心设计、可推广到标准预训练V+L模型的探测任务(例如视觉指代消解、视觉关系检测、语言学探测任务),旨在解析多模态预训练的内部工作机制(例如各注意力头所蕴含的隐式知识、通过上下文化多模态嵌入学到的内在跨模态对齐)。通过借助这些探测任务对各典型模型架构进行深入分析,我们的主要观察是:(i)预训练模型在推理时表现出更关注文本而非图像的倾向;(ii)存在一部分专门用于捕捉跨模态交互的注意力头;(iii)预训练模型学到的注意力矩阵呈现出与图像区域和文本词之间潜在对齐相一致的模式;(iv)注意力模式的可视化揭示了图像区域之间可解释的视觉关系;(v)纯语言知识也被有效地编码在注意力头中。这些有价值的见解有助于指导未来工作为多模态预训练设计更好的模型架构和目标。
16. Investigating Bias in Deep Face Analysis: The KANFace Dataset and Empirical Study [PDF] 返回目录
Markos Georgopoulos, Yannis Panagakis, Maja Pantic
Abstract: Deep learning-based methods have pushed the limits of the state-of-the-art in face analysis. However, despite their success, these models have raised concerns regarding their bias towards certain demographics. This bias is inflicted both by limited diversity across demographics in the training set, as well as the design of the algorithms. In this work, we investigate the demographic bias of deep learning models in face recognition, age estimation, gender recognition and kinship verification. To this end, we introduce the most comprehensive, large-scale dataset of facial images and videos to date. It consists of 40K still images and 44K sequences (14.5M video frames in total) captured in unconstrained, real-world conditions from 1,045 subjects. The data are manually annotated in terms of identity, exact age, gender and kinship. The performance of state-of-the-art models is scrutinized and demographic bias is exposed by conducting a series of experiments. Lastly, a method to debias network embeddings is introduced and tested on the proposed benchmarks.
摘要:深基于学习的方法有推国家的最先进的人脸分析的极限。然而,尽管他们的成功,这些模型都提出了关于他们对某些人口偏差的担忧。这种偏差是通过在训练集跨越人口统计有限的多样性,以及该算法的设计造成两者。在这项工作中,我们探讨面部识别,年龄估计,性别认同和亲属验证深学习模型的人口偏差。为此,我们介绍的面部图像和视频迄今为止最全面的,大规模的数据集。它由40K静止图像和44K从序列受试者1045在无约束的,真实世界条件下捕获(总共14.5M视频帧)。该数据在身份,确切的年龄,性别和亲属的条款手动注释。状态的最先进的模型的性能被仔细检查和人口统计偏压通过进行一系列的实验露出。最后,为了消除直流偏压网络的嵌入的方法被引入并且在拟议的基准测试。
17. DeepFaceFlow: In-the-wild Dense 3D Facial Motion Estimation [PDF] 返回目录
Mohammad Rami Koujan, Anastasios Roussos, Stefanos Zafeiriou
Abstract: Dense 3D facial motion capture from only monocular in-the-wild pairs of RGB images is a highly challenging problem with numerous applications, ranging from facial expression recognition to facial reenactment. In this work, we propose DeepFaceFlow, a robust, fast, and highly-accurate framework for the dense estimation of 3D non-rigid facial flow between pairs of monocular images. Our DeepFaceFlow framework was trained and tested on two very large-scale facial video datasets, one of them of our own collection and annotation, with the aid of occlusion-aware and 3D-based loss function. We conduct comprehensive experiments probing different aspects of our approach and demonstrating its improved performance against state-of-the-art flow and 3D reconstruction methods. Furthermore, we incorporate our framework in a full-head state-of-the-art facial video synthesis method and demonstrate the ability of our method in better representing and capturing the facial dynamics, resulting in a highly-realistic facial video synthesis. Given registered pairs of images, our framework generates 3D flow maps at ~60 fps.
摘要:从唯一的单眼密集3D面部动作捕捉内式野生对RGB图像是与许多应用,从面部表情识别面部重演高度挑战性的问题。在这项工作中,我们提出DeepFaceFlow,稳健,快速,高精度的对单眼图像之间的3D非刚性面部流动的密集评估框架。我们DeepFaceFlow框架进行训练,并且在两个很大规模的面部视频数据集,是我们自己的收集和注释的一个测试,以闭塞感知和基于3D的损失函数的帮助。我们进行了全面的实验探测我们的方法的不同方面,表现出其对国家的最先进的流量和3D重建方法改进的性能。此外,我们将我们的框架,在全头的国家的最先进的面部影像的合成方法,并展示更好代表和捕捉面部动态,导致高度逼真的面部视频合成了该方法的能力。由于注册的图像对,我们的框架生成3D在〜60 fps的流图。
18. Taskology: Utilizing Task Relations at Scale [PDF] 返回目录
Yao Lu, Sören Pirk, Jan Dlabal, Anthony Brohan, Ankita Pasad, Zhao Chen, Vincent Casser, Anelia Angelova, Ariel Gordon
Abstract: It has been recognized that the joint training of computer vision tasks with shared network components enables higher performance for each individual task. Training tasks together allows learning the inherent relationships among them; however, this requires large sets of labeled data. Instead, we argue that utilizing the known relationships between tasks explicitly allows improving their performance with less labeled data. To this end, we aim to establish and explore a novel approach for the collective training of computer vision tasks. In particular, we focus on utilizing the inherent relations of tasks by employing consistency constraints derived from physics, geometry, and logic. We show that collections of models can be trained without shared components, interacting only through the consistency constraints as supervision (peer-supervision). The consistency constraints enforce the structural priors between tasks, which enables their mutually consistent training, and -- in turn -- leads to overall higher performance. Treating individual tasks as modules, agnostic to their implementation, reduces the engineering overhead to collectively train many tasks to a minimum. Furthermore, the collective training can be distributed among multiple compute nodes, which further facilitates training at scale. We demonstrate our framework on subsets of the following collection of tasks: depth and normal prediction, semantic segmentation, 3D motion estimation, and object tracking and detection in point clouds.
摘要:已经认识到的计算机视觉任务与共享的网络组件的联合训练,可以为每个单独的任务更高的性能。一起训练任务,可以学习它们之间的内在联系;然而,这需要大型成套标签数据。相反,我们认为,利用任务之间的已知关系明确允许提高它们的性能与更低的标签数据。为此,我们的目标是建立和探索的计算机视觉任务的集体训练的新方法。特别是,我们专注于通过使用从物理,几何形状和逻辑导出一致性约束利用任务的固有的关系。我们表明,模型的集合可以不共享组件进行培训,只有通过一致性约束监督(同行监督)进行交互。该一致性约束执行任务之间的结构先验,这使它们相互一致的培训, - 反过来 - 导致整体更高的性能。对待各个任务模块,不可知的实施,降低了工程成本,以集体训练许多任务到最低限度。此外,集体训练可以在多个计算节点,其进一步有利于大规模培养中分配。我们证明了我们对任务的下列集合的子集架构:深度和正常预测,语义分割,3D运动估计和目标跟踪和检测点云。
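One concrete instance of the cross-task consistency constraints described above is tying a depth network and a surface-normal network together: normals derived from the predicted depth should agree with the normals predicted directly, with no labels involved (peer-supervision). The sketch below is an assumed, simplified version of such a constraint; the finite-difference normal computation ignores camera intrinsics and is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    """depth: (B, 1, H, W). Approximate surface normals from depth gradients
    (simplified: unit focal length, intrinsics ignored)."""
    dzdx = depth[:, :, :, 1:] - depth[:, :, :, :-1]   # (B,1,H,W-1)
    dzdy = depth[:, :, 1:, :] - depth[:, :, :-1, :]   # (B,1,H-1,W)
    dzdx = F.pad(dzdx, (0, 1, 0, 0))                  # pad back to (B,1,H,W)
    dzdy = F.pad(dzdy, (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def consistency_loss(depth_pred, normal_pred):
    """Peer-supervision: agreement between the two task outputs, no ground truth."""
    return (1.0 - (normals_from_depth(depth_pred) * normal_pred).sum(dim=1)).mean()

# Toy usage with stand-in predictions from two independent task modules.
depth = torch.rand(2, 1, 64, 64, requires_grad=True)
normals = F.normalize(torch.rand(2, 3, 64, 64), dim=1)
loss = consistency_loss(depth, normals)
loss.backward()
```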
19. SUPER: A Novel Lane Detection System [PDF] 返回目录
Pingping Lu, Chen Cui, Shaobing Xu, Huei Peng, Fan Wang
Abstract: AI-based lane detection algorithms were actively studied over the last few years. Many have demonstrated superior performance compared with traditional feature-based methods. The accuracy, however, is still generally in the low 80% or high 90%, or even lower when challenging images are used. In this paper, we propose a real-time lane detection system, called Scene Understanding Physics-Enhanced Real-time (SUPER) algorithm. The proposed method consists of two main modules: 1) a hierarchical semantic segmentation network as the scene feature extractor and 2) a physics enhanced multi-lane parameter optimization module for lane inference. We train the proposed system using heterogeneous data from Cityscapes, Vistas and Apollo, and evaluate the performance on four completely separate datasets (that were never seen before), including Tusimple, Caltech, URBAN KITTI-ROAD, and X-3000. The proposed approach performs the same or better than lane detection models already trained on the same dataset and performs well even on datasets it was never trained on. Real-world vehicle tests were also conducted. Preliminary test results show promising real-time lane-detection performance compared with the Mobileye.
摘要:基于AI-车道检测算法进行了积极的研究,在过去的几年里。与传统的基于特征的方法相比,许多人都表现出卓越的性能。的准确性,但是,仍然是一般在低80%或高90%,或使用具有挑战性的图像时甚至更低。在本文中,我们提出了一个实时的道路检测系统,被称为场景理解物理性增强的实时(SUPER)算法。所提出的方法包括两个主要模块:1)分层语义分割网络作为场景特征提取和2)用于车道推理一个物理增强的多车道参数优化模块。我们培养使用从城市景观,景观和阿波罗异构数据所提出的系统,并评估在四个完全独立的数据集(即从未见过),包括Tusimple,加州理工学院,城市KITTI-ROAD,和X-3000的性能。所提出的方法进行的培训已经对同一数据集,并执行良好,即使在数据集这是从来没有受过训练的相同或更好的车道检测模型。现实世界中的车辆测试也在进行。初步测试结果表明与Mobileye在比较看好的实时车道检测性能。
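The second module described above fits lane parameters to the scene segmentation. A minimal stand-in for that step (ignoring the physics-based constraints and the multi-lane coupling of the actual system) is a least-squares polynomial fit over the pixels labeled as lane markings; the class id and the x = f(y) parameterization below are assumptions for illustration.

```python
import numpy as np

LANE_CLASS_ID = 1  # hypothetical label id for "lane marking" in the segmentation map

def fit_lane_polynomial(seg_map, degree=2):
    """seg_map: (H, W) integer class map. Returns coefficients of x = f(y);
    lanes are roughly vertical in the image, so x is fit as a function of y."""
    ys, xs = np.nonzero(seg_map == LANE_CLASS_ID)
    if len(xs) < degree + 1:
        return None  # not enough evidence for a lane in this image
    return np.polyfit(ys, xs, deg=degree)

def sample_lane(coeffs, height, num_points=20):
    ys = np.linspace(0, height - 1, num_points)
    return np.stack([np.polyval(coeffs, ys), ys], axis=1)

# Toy usage: a synthetic curved lane drawn into a segmentation map.
seg = np.zeros((256, 256), dtype=np.int64)
y = np.arange(256)
x = (0.001 * (y - 128) ** 2 + 100).astype(int).clip(0, 255)
seg[y, x] = LANE_CLASS_ID
print(sample_lane(fit_lane_polynomial(seg), height=256)[:3])
```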
20. Bi3D: Stereo Depth Estimation via Binary Classifications [PDF] 返回目录
Abhishek Badki, Alejandro Troccoli, Kihwan Kim, Jan Kautz, Pradeep Sen, Orazio Gallo
Abstract: Stereo-based depth estimation is a cornerstone of computer vision, with state-of-the-art methods delivering accurate results in real time. For several applications such as autonomous navigation, however, it may be useful to trade accuracy for lower latency. We present Bi3D, a method that estimates depth via a series of binary classifications. Rather than testing if objects are at a particular depth $D$, as existing stereo methods do, it classifies them as being closer or farther than $D$. This property offers a powerful mechanism to balance accuracy and latency. Given a strict time budget, Bi3D can detect objects closer than a given distance in as little as a few milliseconds, or estimate depth with arbitrarily coarse quantization, with complexity linear with the number of quantization levels. Bi3D can also use the allotted quantization levels to get continuous depth, but in a specific depth range. For standard stereo (i.e., continuous depth on the whole range), our method is close to or on par with state-of-the-art, finely tuned stereo methods.
摘要:立体声基于深度估计是计算机视觉的基石,与国家的最先进的方法,实时提供准确的结果。对于一些应用,如自主导航,但是,它可能交易精度较低的延迟是有用的。我们目前Bi3D,即通过一系列的二元分类的估计深度的方法。而不是,如果测试对象在特定深度$ d $因为现有的立体方法做的,它把它们归为大于$ d $更近或更远。此属性提供了一个强有力的机制来平衡精度和延迟。鉴于严格的时间预算,Bi3D可以比给定的距离在靠近检测对象小至几毫秒,或估计深度与任意粗糙量化,用量化级的数目的复杂性是线性的。 Bi3D也可以使用分配的量化水平,以获得持续的深度,但在特定的深度范围。对于标准立体声(即,在整个范围的连续深度),我们的方法是靠近或在与国家的最先进的,精细调整立体声的方法相提并论。
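The key idea above — replacing depth regression with a series of "closer or farther than plane D" decisions — can be made concrete with a tiny sketch: given per-plane binary probability maps (however they are produced), a quantized depth estimate follows by summing the maps over the plane axis and scaling by the plane spacing. This is a simplified reading of the mechanism, not the Bi3D implementation.

```python
import numpy as np

def depth_from_binary_maps(beyond_probs, d_min, d_max):
    """beyond_probs: (K, H, W) with beyond_probs[k] = P(depth > plane_k), where the K
    planes are assumed evenly spaced in [d_min, d_max). Summing the per-plane decisions
    counts how many planes the surface lies beyond, giving depth up to the plane spacing."""
    k = beyond_probs.shape[0]
    step = (d_max - d_min) / k
    return d_min + beyond_probs.sum(axis=0) * step

# Toy usage: a fronto-parallel surface at 5 m, 16 planes spanning 1-17 m.
planes = np.linspace(1.0, 17.0, 16, endpoint=False)          # planes at 1, 2, ..., 16 m
probs = (5.0 > planes).astype(np.float32)[:, None, None] * np.ones((16, 4, 4), np.float32)
print(depth_from_binary_maps(probs, 1.0, 17.0)[0, 0])        # -> 5.0 (quantized estimate)
```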
21. Direction-aware Residual Network for Road Extraction in VHR Remote Sensing Images [PDF] 返回目录
Lei Ding, Lorenzo Bruzzone
Abstract: The binary segmentation of roads in very high resolution (VHR) remote sensing images (RSIs) has always been a challenging task due to factors such as occlusions (caused by shadows, trees, buildings, etc.) and the intra-class variances of road surfaces. The wide use of convolutional neural networks (CNNs) has greatly improved the segmentation accuracy and made the task end-to-end trainable. However, there are still margins to improve in terms of the completeness and connectivity of the results. In this paper, we consider the specific context of road extraction and present a direction-aware residual network that includes three main contributions: 1) ResDec: an asymmetric residual segmentation network with deconvolutional layers and a structural supervision to enhance the learning of road topology; 2) a pixel-level supervision of local directions to enhance the embedding of linear features; 3) Refnet: a refinement network to optimize the segmentation results. An ablation study on the benchmark Massachusetts dataset has confirmed the effectiveness of the presented designs. Comparative experiments with other approaches show that the proposed method has advantages in both overall accuracy and F1-score.
摘要:道路在非常高的分辨率(VHR)遥感图像(RSIS)的二元分割一直是一个具有挑战性的任务,由于一些因素,如闭塞(造成阴影,树木,建筑物等)和类内变化的路面。的广泛使用卷积神经网络的(细胞神经网络),大大提高了分割精度和取得的任务的端至端可训练。然而,仍然有利润的结果的完整性和连通性方面得到改善。在本文中,我们考虑道路提取的特定上下文和呈现的方向感知剩余网络,其包括三个主要贡献:1)ResDec:不对称残余分割网络与解卷积层和结构的监督,以提高道路拓扑的学习; 2)局部方向的像素级的监督,以增强的线性特征的嵌入; 3)Refnet:一种改进的网络,以优化的分割结果。基准的马萨诸塞州数据集消融研究证实呈现设计的有效性。与其他方法对比实验表明,该方法在两个整体精度和F1-比分优势。
22. Evolved Explainable Classifications for Lymph Node Metastases [PDF] 返回目录
Iam Palatnik de Sousa, Marley Maria Bernardes Rebuzzi Vellasco, Eduardo Costa da Silva
Abstract: A novel evolutionary approach for Explainable Artificial Intelligence is presented: the "Evolved Explanations" model (EvEx). This methodology consists in combining Local Interpretable Model Agnostic Explanations (LIME) with Multi-Objective Genetic Algorithms to allow for automated segmentation parameter tuning in image classification tasks. In this case, the dataset studied is Patch-Camelyon, comprised of patches from pathology whole slide images. A publicly available Convolutional Neural Network (CNN) was trained on this dataset to provide a binary classification for presence/absence of lymph node metastatic tissue. In turn, the classifications are explained by means of evolving segmentations, seeking to optimize three evaluation goals simultaneously. The final explanation is computed as the mean of all explanations generated by Pareto front individuals, evolved by the developed genetic algorithm. To enhance reproducibility and traceability of the explanations, each of them was generated from several different seeds, randomly chosen. The observed results show remarkable agreement between different seeds. Despite the stochastic nature of LIME explanations, regions of high explanation weights proved to have good agreement in the heat maps, as computed by pixel-wise relative standard deviations. The found heat maps coincide with expert medical segmentations, which demonstrates that this methodology can find high quality explanations (according to the evaluation metrics), with the novel advantage of automated parameter fine tuning. These results give additional insight into the inner workings of neural network black box decision making for medical data.
摘要:解释的人工智能新的进化方法,提出:在“进化的解释”模式(EVEX)。该方法在于在组合本地可解释模型不可知论说明(石灰)与多目标遗传算法,以允许在图像分类任务自动分割参数调谐。在这种情况下,所研究的数据集是膜片Camelyon,由来自病理整个幻灯片图像补丁。公众可用的卷积神经网络(CNN)在此训练数据集,以提供存在/不存在淋巴结转移组织的二元分类。反过来,分类是由进化分割的方式解释,寻求同时优化三份评估目标。最后一种解释是计算平均帕累托前个人产生的所有解释,由开发遗传算法进化的。为了增强的解释的重现性和可跟踪性,它们中的每从几个不同的种子,随机选择的生成。观察结果表明不同种子之间存在显着的协议。尽管LIME解释的随机性,高说明权重的地区被证明有在热图一致,通过逐像素相对标准偏差为计算。所发现的热图与专家的医疗分割一致,这表明该方法可以找到高质量的解释(根据评价指标),与自动化参数微调的新颖优点。这些结果提供额外洞察神经网络黑箱决策的医疗数据的内部运作。
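The coupling of LIME with a genetic algorithm described above amounts to searching over segmentation parameters and aggregating the explanations of the Pareto-optimal individuals. The loop below is a heavily simplified, single-objective sketch of that search; `explain_with_params`, the fitness function, and the parameter ranges are hypothetical stand-ins, since EvEx optimizes three objectives with a full multi-objective GA and averages the Pareto-front explanations.

```python
import numpy as np

rng = np.random.default_rng(0)

def explain_with_params(image, kernel_size, max_dist, ratio):
    """Hypothetical stand-in: run LIME with a segmentation configured by
    (kernel_size, max_dist, ratio) and return a per-pixel explanation weight map."""
    return rng.random(image.shape[:2])  # placeholder heat map for the sketch

def fitness(heatmap):
    """Placeholder scalar fitness; the paper evaluates three objectives instead."""
    return float(heatmap.max() - heatmap.mean())

def evolve_segmentation_params(image, pop_size=8, generations=5):
    pop = rng.uniform([1, 5, 0.1], [8, 200, 1.0], size=(pop_size, 3))  # assumed ranges
    for _ in range(generations):
        scores = np.array([fitness(explain_with_params(image, *ind)) for ind in pop])
        parents = pop[np.argsort(scores)[-pop_size // 2:]]              # keep best half
        children = parents + rng.normal(scale=0.1, size=parents.shape)  # Gaussian mutation
        pop = np.vstack([parents, children])
    best = pop[np.argmax([fitness(explain_with_params(image, *ind)) for ind in pop])]
    return explain_with_params(image, *best)  # EvEx instead averages Pareto-front maps

heatmap = evolve_segmentation_params(np.zeros((96, 96, 3)))
```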
23. Small-brain neural networks rapidly solve inverse problems with vortex Fourier encoders [PDF] 返回目录
Baurzhan Muminov, Luat T. Vuong
Abstract: We introduce a vortex phase transform with a lenslet-array to accompany shallow, dense, ``small-brain'' neural networks for high-speed and low-light imaging. Our single-shot ptychographic approach exploits the coherent diffraction, compact representation, and edge enhancement of Fourier-transformed spiral-phase gradients. With vortex spatial encoding, a small brain is trained to deconvolve images at rates 5-20 times faster than those achieved with random encoding schemes, where greater advantages are gained in the presence of noise. Once trained, the small brain reconstructs an object from intensity-only data, solving an inverse mapping without performing iterations on each image and without deep-learning schemes. With this hybrid, optical-digital, vortex Fourier encoded, small-brain scheme, we reconstruct MNIST Fashion objects illuminated with low-light flux (5 nJ/cm$^2$) at a rate of several thousand frames per second on a 15 W central processing unit, two orders of magnitude faster than convolutional neural networks.
摘要:我们介绍利用小透镜阵列的涡流相位变换陪浅,致密,``小脑“”用于高速和低光成像神经网络。我们的单次折叠写入方法利用相干衍射,紧凑表示,和傅立叶tranformed螺旋相位梯度的边缘增强。与涡流空间编码,小脑被训练以在去卷积的图像速率比用随机编码方案,其中,更大的优点在存在噪声的情况获得了实现更快的5-20倍。一旦被训练,小脑重建来自仅强度数据的对象,求解逆映射而无需在每个图像上,没有深学习方案进行迭代。与此混合,光学数字化,涡流傅立叶编码,小脑方案,我们重建MNIST时装对象以每秒几千帧上的15的速率与低光束(5毫微/厘米$ ^ 2 $)照射w ^中央处理单元,两个数量级比卷积神经网络快。
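The "small brain" above is a shallow dense network that inverts intensity-only measurements produced by an optical encoder. As a purely illustrative sketch, the snippet below trains a two-layer perceptron to map simulated intensity measurements back to images; the random-phase mask and the synthetic data are stand-ins, not the lenslet/vortex optics or the MNIST Fashion setup of the paper.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
H = 28
phase = np.exp(1j * 2 * np.pi * rng.random((H, H)))  # stand-in for the vortex phase mask

def encode(img):
    """Intensity-only measurement of a phase-modulated Fourier transform (simplified)."""
    return np.abs(np.fft.fft2(img * phase)) ** 2

# Simulated dataset: random images and their encoded intensity measurements.
imgs = rng.random((256, H, H)).astype(np.float32)
meas = np.stack([encode(im) for im in imgs]).astype(np.float32)
meas /= meas.max()

small_brain = nn.Sequential(nn.Flatten(), nn.Linear(H * H, 512), nn.ReLU(),
                            nn.Linear(512, H * H))  # shallow, dense "small brain"
opt = torch.optim.Adam(small_brain.parameters(), lr=1e-3)
x, y = torch.from_numpy(meas), torch.from_numpy(imgs.reshape(256, -1))
for _ in range(50):  # a short loop for illustration; the paper trains far longer
    opt.zero_grad()
    loss = nn.functional.mse_loss(small_brain(x), y)
    loss.backward()
    opt.step()
```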
24. Grounding Language in Play [PDF] 返回目录
Corey Lynch, Pierre Sermanet
Abstract: Natural language is perhaps the most versatile and intuitive way for humans to communicate tasks to a robot. Prior work on Learning from Play (LfP) [Lynch et al, 2019] provides a simple approach for learning a wide variety of robotic behaviors from general sensors. However, each task must be specified with a goal image---something that is not practical in open-world environments. In this work we present a simple and scalable way to condition policies on human language instead. We extend LfP by pairing short robot experiences from play with relevant human language after-the-fact. To make this efficient, we introduce multicontext imitation, which allows us to train a single agent to follow image or language goals, then use just language conditioning at test time. This reduces the cost of language pairing to less than 1% of collected robot experience, with the majority of control still learned via self-supervised imitation. At test time, a single agent trained in this manner can perform many different robotic manipulation skills in a row in a 3D environment, directly from images, and specified only with natural language (e.g. "open the drawer...now pick up the block...now press the green button..."). Finally, we introduce a simple technique that transfers knowledge from large unlabeled text corpora to robotic learning. We find that transfer significantly improves downstream robotic manipulation. It also allows our agent to follow thousands of novel instructions at test time in zero shot, in 16 different languages. See videos of our experiments at this http URL
摘要:自然语言也许是人类的任务传达给机器人最通用和直观的方式。从播放(LFP)[Lynch等,2019]上学习先前的工作提供了一种用于学习各种各样从一般传感器的机器人的行为的简单方法。然而,每一个任务必须与目标图像---东西是不实际的开放世界环境中指定。在这项工作中,我们提出了一个简单易扩展的方式对人类语言的条件,而不是政策。我们从玩配对短机器人的经验与相关的人类语言后 - 事实上延长LFP。为了使这个效率,我们引进multicontext模仿,这使得我们可以训练一个代理遵循图像或语言的目标,那么在测试时只使用语言调节。这减少了语言配对成本的收集机器人体验不到1%,通过自我监督的模仿还是学到了广大的控制。在测试时,以这种方式培养了单剂可以在3D环境中连续执行许多不同的机器人操作技能,直接从图像,并且只用自然语言规定(如“打开抽屉......现在拿起块...现在按下绿色按钮...“)。最后,我们介绍一个简单的方法,从大的未标记文本语料库转移知识的机器人学习。我们发现,转移显著改善下游机器人操作。这也使我们的代理遵循成千上万的新指令的测试时间在零出手,在16种不同的语言。见我们的实验视频在这个HTTP URL
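Multicontext imitation, as described above, trains one goal-conditioned policy while letting the goal come either from a final image or from a language instruction, both mapped into a shared latent goal space. The sketch below shows only that conditioning structure in PyTorch; the dimensions, the encoders, and the behavior-cloning target are assumptions made for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

LATENT = 32

class MulticontextPolicy(nn.Module):
    def __init__(self, obs_dim=64, act_dim=8, lang_dim=128):
        super().__init__()
        self.image_goal_enc = nn.Linear(obs_dim, LATENT)   # encodes a goal image
        self.lang_goal_enc = nn.Linear(lang_dim, LATENT)    # encodes a language embedding
        self.policy = nn.Sequential(nn.Linear(obs_dim + LATENT, 256), nn.ReLU(),
                                    nn.Linear(256, act_dim))

    def forward(self, obs, goal, goal_is_language):
        z = self.lang_goal_enc(goal) if goal_is_language else self.image_goal_enc(goal)
        return self.policy(torch.cat([obs, z], dim=-1))

policy = MulticontextPolicy()
obs = torch.rand(16, 64)
# Conceptually a mixed batch: goals relabeled from play (images) and goals paired with language.
act_img = policy(obs, torch.rand(16, 64), goal_is_language=False)
act_lang = policy(obs, torch.rand(16, 128), goal_is_language=True)
bc_loss = ((act_img - torch.rand(16, 8)) ** 2).mean()  # behavior cloning against played actions
```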
25. 3D deformable registration of longitudinal abdominopelvic CT images using unsupervised deep learning [PDF] 返回目录
Maureen van Eijnatten, Leonardo Rundo, K. Joost Batenburg, Felix Lucka, Emma Beddowes, Carlos Caldas, Ferdia A. Gallagher, Evis Sala, Carola-Bibiane Schönlieb, Ramona Woitek
Abstract: This study investigates the use of the unsupervised deep learning framework VoxelMorph for deformable registration of longitudinal abdominopelvic CT images acquired in patients with bone metastases from breast cancer. The CT images were refined prior to registration by automatically removing the CT table and all other extra-corporeal components. To improve the learning capabilities of VoxelMorph when only a limited amount of training data is available, a novel incremental training strategy is proposed based on simulated deformations of consecutive CT images. In a 4-fold cross-validation scheme, the incremental training strategy achieved significantly better registration performance compared to training on a single volume. Although our deformable image registration method did not outperform iterative registration using NiftyReg (considered as a benchmark) in terms of registration quality, the registrations were approximately 300 times faster. This study showed the feasibility of deep learning based deformable registration of longitudinal abdominopelvic CT images via a novel incremental training strategy based on simulated deformations.
摘要:本研究探讨使用无监督学习深框架VoxelMorph用于患者从乳房癌骨转移获取纵向腹盆CT图像的可变形配准的。 CT图像是由自动移除CT表和所有其他额外有形成分登记前精制。为了提高VoxelMorph的学习能力,只有当训练数据有限数量的可用,一个新的增量训练策略是基于连续的CT图像的模拟变形建议。在4倍交叉验证方案,增量训练策略相比单卷上训练实现显著更好注册的性能。虽然我们的变形图像配准方法并没有在注册质量方面使用NiftyReg(考虑为基准)跑赢大市反复注册,登记数量大约快300倍。本研究通过基于模拟变形的新颖增量训练策略表明纵向腹盆CT图像的深度学习基于可变形配准的可行性。
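The incremental strategy above augments a small longitudinal dataset by warping existing scans with simulated deformations and training the registration network on those synthetic pairs first. Below is a hedged outline of the data-simulation half of that strategy; the deformation model (Gaussian-smoothed random displacements) is an assumption, and the VoxelMorph network itself is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

rng = np.random.default_rng(0)

def random_smooth_deformation(shape, max_disp=4.0, smoothing=8.0):
    """Random displacement field, smoothed so the simulated deformation is plausible."""
    field = rng.normal(size=(len(shape),) + shape)
    field = np.stack([gaussian_filter(f, smoothing) for f in field])
    return field / (np.abs(field).max() + 1e-8) * max_disp

def warp(volume, disp):
    """Resample `volume` at the displaced grid coordinates (linear interpolation)."""
    grid = np.meshgrid(*[np.arange(s) for s in volume.shape], indexing="ij")
    coords = [g + d for g, d in zip(grid, disp)]
    return map_coordinates(volume, coords, order=1, mode="nearest")

def simulated_pairs(ct_volumes, pairs_per_volume=2):
    """Yield (moving, fixed, ground-truth displacement) triples for incremental training."""
    for vol in ct_volumes:
        for _ in range(pairs_per_volume):
            disp = random_smooth_deformation(vol.shape)
            yield warp(vol, disp), vol, disp

# Toy usage with small random volumes standing in for consecutive CT scans.
volumes = [rng.random((32, 32, 32)).astype(np.float32) for _ in range(3)]
moving, fixed, disp = next(simulated_pairs(volumes))
```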
26. Enhancing Perceptual Loss with Adversarial Feature Matching for Super-Resolution [PDF] 返回目录
Akella Ravi Tej, Shirsendu Sukanta Halder, Arunav Pratap Shandeelya, Vinod Pankajakshan
Abstract: Single image super-resolution (SISR) is an ill-posed problem with an indeterminate number of valid solutions. Solving this problem with neural networks would require access to extensive experience, either presented as a large training set over natural images or a condensed representation from another pre-trained network. Perceptual loss functions, which belong to the latter category, have achieved breakthrough success in SISR and several other computer vision tasks. While perceptual loss plays a central role in the generation of photo-realistic images, it also produces undesired pattern artifacts in the super-resolved outputs. In this paper, we show that the root cause of these pattern artifacts can be traced back to a mismatch between the pre-training objective of perceptual loss and the super-resolution objective. To address this issue, we propose to augment the existing perceptual loss formulation with a novel content loss function that uses the latent features of a discriminator network to filter the unwanted artifacts across several levels of adversarial similarity. Further, our modification has a stabilizing effect on non-convex optimization in adversarial training. The proposed approach offers notable gains in perceptual quality based on an extensive human evaluation study and a competent reconstruction fidelity when tested on objective evaluation metrics.
摘要:单张超分辨率(SISR)是一个病态问题的有效解决方案的一个不确定的数字。解决这个问题的神经网络将需要获得丰富的经验,无论是呈现为一个大的训练集在自然图像或从其他预先训练网络压缩表示。知觉丧失功能,这属于后一类,都实现了SISR和突破性的成功等几个计算机视觉任务。而感知损失起着照片般逼真图像的生成中心作用,它也产生在超分辨的输出不期望的图案的伪像。在本文中,我们表明,这些图案文物的根源可以追溯到之间的失配前培训客观感知的损失和超分辨率客观的。为了解决这个问题,我们提出了与使用鉴别网络的潜在功能,跨对抗性相似的几个层次过滤不需要的斑块,新颖的内容损失函数,以增强现有视觉损失配方。此外,我们的修饰对对抗性训练非凸优化具有稳定作用。基于广泛的人类评估研究和主管重建的保真度的感知质量所提出的方法提供了显着的收益时,客观评价指标进行测试。
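The content loss proposed above compares images in the latent feature space of the discriminator rather than only in that of a fixed pre-trained network. A minimal sketch of such a multi-level feature-matching term follows; the toy discriminator and the choice of tapped layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SmallDiscriminator(nn.Module):
    """Toy discriminator whose intermediate activations are exposed for feature matching."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2)),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2)),
        ])
        self.head = nn.Conv2d(128, 1, 1)

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return self.head(x), feats

def feature_matching_loss(disc, sr, hr):
    """Match discriminator features of the super-resolved image to those of the ground truth."""
    _, feats_sr = disc(sr)
    with torch.no_grad():
        _, feats_hr = disc(hr)
    return sum(nn.functional.l1_loss(a, b) for a, b in zip(feats_sr, feats_hr))

disc = SmallDiscriminator()
sr, hr = torch.rand(2, 3, 64, 64, requires_grad=True), torch.rand(2, 3, 64, 64)
loss = feature_matching_loss(disc, sr, hr)
loss.backward()
```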
27. TripletUNet: Multi-Task U-Net with Online Voxel-Wise Learning for Precise CT Prostate Segmentation [PDF] 返回目录
Kelei He, Chunfeng Lian, Ehsan Adeli, Jing Huo, Yinghuan Shi, Yang Gao, Bing Zhang, Junfeng Zhang, Dinggang Shen
Abstract: Fully convolutional networks (FCNs), including U-Net and V-Net, are widely-used network architecture for semantic segmentation in recent studies. However, conventional FCNs are typically trained by the cross-entropy loss or dice loss, in which the relationships among voxels are neglected. This often results in non-smooth neighborhoods in the output segmentation map. This problem becomes more serious in CT prostate segmentation as CT images are usually of low tissue contrast. To address this problem, we propose a two-stage framework. The first stage quickly localizes the prostate region. Then, the second stage precisely segments the prostate by a multi-task FCN-based on the U-Net architecture. We introduce a novel online voxel-triplet learning module through metric learning and voxel feature embeddings in the multi-task network. The proposed network has two branches guided by two tasks: 1) a segmentation sub-network aiming to generate prostate segmentation, and 2) a triplet learning sub-network aiming to improve the quality of the learned feature space supervised by a mixed of triplet and pair-wise loss function. The triplet learning sub-network samples triplets from the inter-mediate heatmap. Unlike conventional deep triplet learning methods that generate triplets before the training phase, our proposed voxel-triplets are sampled in an online manner and operates in an end-to-end fashion via multi-task learning. To evaluate the proposed method, we implement comprehensive experiments on a CT image dataset consisting of 339 patients. The ablation studies show that our method can effectively learn more representative voxel-level features compared with the conventional FCN network. And the comparisons show that the proposed method outperforms the state-of-the-art methods by a large margin.
摘要:全卷积网络(FCNs),包括U形网和虚拟网络,是语义分割广泛使用的网络架构在最近的研究中。然而,传统的FCNs通常由交叉熵损失或损失骰子,其中体素之间的关系被忽略训练。这通常会导致在非平滑的街区中的输出分割图。作为CT图像通常的低组织对比度是,这个问题变得在CT前列腺分割更为严重。为了解决这个问题,我们提出了两个阶段的框架。第一阶段迅速本地化前列腺区域。然后,第二级精确段前列腺通过多任务系FCN在U网络架构。我们通过引入多任务网络中度量学习和素功能的嵌入一个新的在线体素的三重学习模块。 1)细分子网络,旨在产生前列腺分割,和2)三重学习子网旨在提高学习地物空间的质量监督由混合的三重态的和:建议网络有两个分支由两个任务引导成对损失函数。三重学习子网样本间中介热图三胞胎。不同于生成训练阶段之前三胞胎常规深三重的学习方法,我们提出的体素的三元组以在线的方式采样,并通过多任务学习的端至端的方式运作。为了评估所提出的方法,我们落实包括339例患者的CT图像数据集综合性实验。消融研究表明,我们的方法可以有效的学习比较有代表性的体素水平特点与传统的FCN网络相比。和比较表明,该方法优于国家的最先进的方法,通过一个大的裕度。
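The online voxel-triplet module above samples anchor/positive/negative voxels from the intermediate heat map during training and applies a metric-learning loss to their feature embeddings. The snippet below is a simplified 2D version of such online sampling with a standard triplet margin loss; sampling by the ground-truth mask is an assumption, not the paper's exact mixed triplet/pair-wise objective.

```python
import torch
import torch.nn.functional as F

def online_voxel_triplet_loss(features, mask, num_triplets=64, margin=0.5):
    """features: (C, H, W) embedding map from the network; mask: (H, W) binary ground truth.
    Anchors and positives are sampled inside the organ, negatives outside."""
    c, h, w = features.shape
    flat = features.reshape(c, -1).t()                 # (H*W, C) per-voxel embeddings
    pos_idx = torch.nonzero(mask.reshape(-1) > 0).squeeze(1)
    neg_idx = torch.nonzero(mask.reshape(-1) == 0).squeeze(1)
    if len(pos_idx) < 2 or len(neg_idx) < 1:
        return features.sum() * 0.0                    # degenerate slice: no loss
    a = flat[pos_idx[torch.randint(len(pos_idx), (num_triplets,))]]
    p = flat[pos_idx[torch.randint(len(pos_idx), (num_triplets,))]]
    n = flat[neg_idx[torch.randint(len(neg_idx), (num_triplets,))]]
    return F.triplet_margin_loss(a, p, n, margin=margin)

# Toy usage with a random embedding map and a square "prostate" mask.
feats = torch.rand(16, 64, 64, requires_grad=True)
mask = torch.zeros(64, 64)
mask[20:40, 20:40] = 1
loss = online_voxel_triplet_loss(feats, mask)
loss.backward()
```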
28. Near-duplicate video detection featuring coupled temporal and perceptual visual structures and logical inference based matching [PDF] 返回目录
B. Tahayna, M. Belkhatir
Abstract: We propose in this paper an architecture for near-duplicate video detection based on: (i) index and query signature based structures integrating temporal and perceptual visual features and (ii) a matching framework computing the logical inference between index and query documents. As far as indexing is concerned, instead of concatenating low-level visual features in high-dimensional spaces which results in curse of dimensionality and redundancy issues, we adopt a perceptual symbolic representation based on color and texture concepts. For matching, we propose to instantiate a retrieval model based on logical inference through the coupling of an N-gram sliding window process and theoretically-sound lattice-based structures. The techniques we cover are robust and insensitive to general video editing and/or degradation, making it ideal for re-broadcasted video search. Experiments are carried out on large quantities of video data collected from the TRECVID 02, 03 and 04 collections and real-world video broadcasts recorded from two German TV stations. An empirical comparison over two state-of-the-art dynamic programming techniques is encouraging and demonstrates the advantage and feasibility of our method.
摘要:本文提出了一种基于近重复的视频检测的体系结构:(I)索引和查询基于签名的结构整合时间和感性的视觉特征以及(ii)匹配的框架计算索引和查询文档之间的逻辑推理。至于索引而言,而不将低级别的视觉特征的高维空间中的维和冗余问题,诅咒其结果,我们采用基于颜色和纹理概念的感性符号表示。为匹配,我们提出了一种基于通过N克滑动窗口处理和理论上声基于晶格结构的耦合逻辑推理来实例化一个检索模型。我们覆盖的技术是稳健和不敏感的普通视频编辑和/或降解,使其成为理想的重新广播的视频搜索。实验是在大量来自两个德国电视台录制的TRECVID 02,03和04的集合和现实世界的视频播放采集到的视频数据进行。在两个国家的最先进的动态编程技术的实证比较令人鼓舞,体现了优势,我们的方法的可行性。
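The matching stage above slides an N-gram window over symbolized frame sequences (frames represented by perceptual color/texture concepts) and scores the overlap between query and indexed videos. A simplified, purely illustrative version of that sliding-window N-gram comparison is sketched below; the symbol alphabet and the scoring rule are assumptions, not the logical-inference model of the paper.

```python
from collections import Counter

def ngrams(symbols, n=3):
    """Consecutive n-grams of a symbolized frame sequence."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

def sliding_ngram_score(query, candidate, n=3):
    """Best n-gram overlap between the query and any same-length window of the candidate."""
    q = Counter(ngrams(query, n))
    best = 0.0
    for start in range(len(candidate) - len(query) + 1):
        window = Counter(ngrams(candidate[start:start + len(query)], n))
        overlap = sum((q & window).values())
        best = max(best, overlap / max(sum(q.values()), 1))
    return best

# Toy usage: each symbol stands for a frame-level color/texture concept.
query = ["red-smooth", "red-smooth", "blue-coarse", "green-smooth"]
candidate = ["gray", "red-smooth", "red-smooth", "blue-coarse", "green-smooth", "gray"]
print(sliding_ngram_score(query, candidate))  # 1.0: the query appears intact
```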
29. Visual Perception Model for Rapid and Adaptive Low-light Image Enhancement [PDF] 返回目录
Xiaoxiao Li, Xiaopeng Guo, Liye Mei, Mingyu Shang, Jie Gao, Maojing Shu, Xiang Wang
Abstract: Low-light image enhancement is a promising solution to tackle the problem of insufficient sensitivity of human vision system (HVS) to perceive information in low light environments. Previous Retinex-based works always accomplish enhancement task by estimating light intensity. Unfortunately, single light intensity modelling is hard to accurately simulate visual perception information, leading to the problems of imbalanced visual photosensitivity and weak adaptivity. To solve these problems, we explore the precise relationship between light source and visual perception and then propose the visual perception (VP) model to acquire a precise mathematical description of visual perception. The core of VP model is to decompose the light source into light intensity and light spatial distribution to describe the perception process of HVS, offering refinement estimation of illumination and reflectance. To reduce complexity of the estimation process, we introduce the rapid and adaptive $\mathbf{\beta}$ and $\mathbf{\gamma}$ functions to build an illumination and reflectance estimation scheme. Finally, we present an optimal determination strategy, consisting of a \emph{cycle operation} and a \emph{comparator}. Specifically, the \emph{comparator} is responsible for determining the optimal enhancement results from multiple enhanced results through implementing the \emph{cycle operation}. By coordinating the proposed VP model, illumination and reflectance estimation scheme, and the optimal determination strategy, we propose a rapid and adaptive framework for low-light image enhancement. Extensive experiment results demonstrate that the proposed method achieves better performance in terms of visual comparison, quantitative assessment, and computational efficiency, compared with the currently state-of-the-arts.
摘要:低光图像增强是一种很有前途的解决方案,以解决人类视觉系统(HVS)的灵敏度不足以在低光照环境感知信息的问题。以前基于视网膜皮层,作品总是通过估计光强度增强完成任务。不幸的是,单光强造型是很难准确地模拟视觉信息,导致不平衡的视觉光敏性和弱适应性的问题。为了解决这些问题,我们探索光源和视觉感知之间的精确关系,然后提出视觉感知(VP)模型来获取视觉感知的精确的数学描述。 VP模型的核心是将光源分解成的光强度和光空间分布来描述HVS的感知过程,提供照明和反射率的细化估计。为了减少估计过程的复杂性,我们引入了快速和自适应$ \ mathbf {\测试} $和$ \ mathbf {\伽马} $函数构建的照明和反射率估计方案。最后,我们提出了一个最佳判断策略,由\ {EMPH循环运转}和\ {EMPH比较}的。具体而言,\ {EMPH比较}是负责通过实施\ {EMPH循环运转}确定从多个增强的结果的最佳增强结果。通过协调所提出的VP模式,照明和反射率估计方案,最佳判断策略,我们提出了低光图像增强的快速和自适应框架。大量的实验结果demenstrate,与当前国家的最技术相比,该方法实现了视觉比较,定量评估和计算效率方面性能更好。
30. SAGE: Sequential Attribute Generator for Analyzing Glioblastomas using Limited Dataset [PDF] 返回目录
Padmaja Jonnalagedda, Brent Weinberg, Jason Allen, Taejin L. Min, Shiv Bhanu, Bir Bhanu
Abstract: While deep learning approaches have shown remarkable performance in many imaging tasks, most of these methods rely on availability of large quantities of data. Medical image data, however, is scarce and fragmented. Generative Adversarial Networks (GANs) have recently been very effective in handling such datasets by generating more data. If the datasets are very small, however, GANs cannot learn the data distribution properly, resulting in less diverse or low-quality results. One such limited dataset is that for the concurrent gain of 19 and 20 chromosomes (19/20 co-gain), a mutation with positive prognostic value in Glioblastomas (GBM). In this paper, we detect imaging biomarkers for the mutation to streamline the extensive and invasive prognosis pipeline. Since this mutation is relatively rare, i.e. small dataset, we propose a novel generative framework - the Sequential Attribute GEnerator (SAGE), that generates detailed tumor imaging features while learning from a limited dataset. Experiments show that not only does SAGE generate high quality tumors when compared to standard Deep Convolutional GAN (DC-GAN) and Wasserstein GAN with Gradient Penalty (WGAN-GP), it also captures the imaging biomarkers accurately.
摘要:虽然深学习方法表明,在许多成像任务骄人的业绩,这些方法大多依赖于大量数据的可用性。医学图像数据,但是,是稀缺的和零散的。生成对抗性网络(甘斯)最近一直在通过产生更多的数据处理此类数据集是非常有效的。如果数据集是非常小的,但是,甘斯无法正常学习数据分布,从而减少不同的或低质量的结果。一种这样的数据集的限制是,对于19个20条染色体(19/20共增益),在胶质母细胞瘤(GBM)阳性预后价值的突变并发增益。在本文中,我们检测成像生物标志物,以简化广泛浸润预后管道的突变。由于该突变是相对罕见的,即小的数据集,我们提出了一种新的生成框架 - 序贯属性发生器(SAGE),其生成详细的肿瘤成像特征,同时从有限的数据集学习。实验结果表明,不仅确实比标准深卷积GAN(DC-GAN)和GAN瓦瑟斯坦具有梯度罚款(WGAN-GP)时SAGE生成高品质的肿瘤,也准确地捕捉成像生物标志物。
31. Cycle-Consistent Adversarial Networks for Realistic Pervasive Change Generation in Remote Sensing Imagery [PDF] 返回目录
Christopher X. Ren, Amanda Ziemann, Alice M.S. Durieux, James Theiler
Abstract: This paper introduces a new method of generating realistic pervasive changes in the context of evaluating the effectiveness of change detection algorithms in controlled settings. The method, a cycle-consistent adversarial network (CycleGAN), requires low quantities of training data to generate realistic changes. Here we show an application of CycleGAN in creating realistic snow-covered scenes of multispectral Sentinel-2 imagery, and demonstrate how these images can be used as a test bed for anomalous change detection algorithms.
摘要:介绍产生在控制的设置的评价变化检测算法的有效性的上下文现实普遍的变化的新方法。该方法中,一个周期一致的对抗网络(CycleGAN),需要低数量的训练数据以生成逼真的变化。在这里,我们显示CycleGAN在建立多光谱哨兵-2影像逼真冰雪覆盖场景的应用程序,并展示这些图像如何被用作试验台的异常变化检测算法。
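CycleGAN, as used above to synthesize snow-covered versions of Sentinel-2 scenes, trains two generators (e.g. snow-free to snowy and back) with adversarial losses plus a cycle-consistency term that lets unpaired imagery be used. The sketch below shows only that cycle-consistency objective in PyTorch, with toy generators standing in for the actual ResNet-style generators and PatchGAN discriminators.

```python
import torch
import torch.nn as nn

# Toy generators: snow-free -> snowy (G) and snowy -> snow-free (F_); the real model
# uses ResNet-style generators and adversarial discriminators, omitted here.
G = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
F_ = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
l1 = nn.L1Loss()

def cycle_consistency_loss(x_clear, y_snow, lam=10.0):
    """||F(G(x)) - x|| + ||G(F(y)) - y||: each image must survive a round trip."""
    return lam * (l1(F_(G(x_clear)), x_clear) + l1(G(F_(y_snow)), y_snow))

x = torch.rand(2, 3, 64, 64)   # unpaired snow-free patches
y = torch.rand(2, 3, 64, 64)   # unpaired snow-covered patches
loss = cycle_consistency_loss(x, y)
loss.backward()
```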
注:中文为机器翻译结果!