
[arXiv Papers] Computer Vision and Pattern Recognition 2020-03-16

Contents

1. Probabilistic Future Prediction for Video Scene Understanding [PDF] Abstract
2. Extending Maps with Semantic and Contextual Object Information for Robot Navigation: a Learning-Based Framework using Visual and Depth Cues [PDF] Abstract
3. Harmonizing Transferability and Discriminability for Adapting Object Detectors [PDF] Abstract
4. Automatic Lesion Detection System (ALDS) for Skin Cancer Classification Using SVM and Neural Classifiers [PDF] Abstract
5. Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [PDF] Abstract
6. Semantic Pyramid for Image Generation [PDF] Abstract
7. Pyramidal Edge-maps based Guided Thermal Super-resolution [PDF] Abstract
8. Adaptive Graph Convolutional Network with Attention Graph Clustering for Co-saliency Detection [PDF] Abstract
9. Gimme Signals: Discriminative signal encoding for multimodal activity recognition [PDF] Abstract
10. PointINS: Point-based Instance Segmentation [PDF] Abstract
11. Is There Tradeoff between Spatial and Temporal in Video Super-Resolution? [PDF] Abstract
12. Partial Weight Adaptation for Robust DNN Inference [PDF] Abstract
13. Dual Temporal Memory Network for Efficient Video Object Segmentation [PDF] Abstract
14. BIHL:A Fast and High Performance Object Proposals based on Binarized HL Frequency [PDF] Abstract
15. A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [PDF] Abstract
16. BigGAN-based Bayesian reconstruction of natural images from human brain activity [PDF] Abstract
17. Deep Domain-Adversarial Image Generation for Domain Generalisation [PDF] Abstract
18. Interaction Graphs for Object Importance Estimation in On-road Driving Videos [PDF] Abstract
19. Analyzing Visual Representations in Embodied Navigation Tasks [PDF] Abstract
20. LaserFlow: Efficient and Probabilistic Object Detection and Motion Forecasting [PDF] Abstract
21. Sparse Graphical Memory for Robust Planning [PDF] Abstract
22. Advanced Deep Learning Methodologies for Skin Cancer Classification in Prodromal Stages [PDF] Abstract
23. Human Activity Recognition using Multi-Head CNN followed by LSTM [PDF] Abstract
24. A Power-Efficient Binary-Weight Spiking Neural Network Architecture for Real-Time Object Classification [PDF] Abstract
25. What Information Does a ResNet Compress? [PDF] Abstract
26. Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis [PDF] Abstract
27. On the effectiveness of convolutional autoencoders on image-based personalized recommender systems [PDF] Abstract
28. Towards a Framework for Visual Intelligence in Service Robotics: Epistemic Requirements and Gap Analysis [PDF] Abstract
29. Random smooth gray value transformations for cross modality learning with gray value invariant networks [PDF] Abstract
30. LIBRE: The Multiple 3D LiDAR Dataset [PDF] Abstract
31. Action for Better Prediction [PDF] Abstract
32. Human Grasp Classification for Reactive Human-to-Robot Handovers [PDF] Abstract
33. Autoencoders [PDF] Abstract
34. LiDAR guided Small obstacle Segmentation [PDF] Abstract
35. W2S: A Joint Denoising and Super-Resolution Dataset [PDF] Abstract

Abstracts

1. Probabilistic Future Prediction for Video Scene Understanding [PDF] Back to contents
  Anthony Hu, Fergal Cotter, Nikhil Mohan, Corina Gurau, Alex Kendall
Abstract: We present a novel deep learning architecture for probabilistic future prediction from video. We predict the future semantics, geometry and motion of complex real-world urban scenes and use this representation to control an autonomous vehicle. This work is the first to jointly predict ego-motion, static scene, and the motion of dynamic agents in a probabilistic manner, which allows sampling consistent, highly probable futures from a compact latent space. Our model learns a representation from RGB video with a spatio-temporal convolutional module. The learned representation can be explicitly decoded to future semantic segmentation, depth, and optical flow, in addition to being an input to a learnt driving policy. To model the stochasticity of the future, we introduce a conditional variational approach which minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution.
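
The present/future distribution matching admits a compact sketch. Below is a minimal, hypothetical PyTorch rendering of the idea: two heads parameterize diagonal Gaussians (the "present" head sees only past features, the "future" head also sees what actually happened), training minimizes the KL divergence between them, and inference samples futures from the present head alone. Layer sizes and names are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PresentFutureHeads(nn.Module):
    """Hypothetical heads mapping features to diagonal Gaussians."""
    def __init__(self, feat_dim=512, latent_dim=32):
        super().__init__()
        self.present = nn.Linear(feat_dim, 2 * latent_dim)  # conditioned on past only
        self.future = nn.Linear(feat_dim, 2 * latent_dim)   # conditioned on past + future

    @staticmethod
    def to_gaussian(params):
        mu, log_sigma = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_sigma.exp())

    def forward(self, past_feat, future_feat):
        present = self.to_gaussian(self.present(past_feat))
        future = self.to_gaussian(self.future(future_feat))
        # Training: keep the present distribution close to what actually happened.
        kl = torch.distributions.kl_divergence(future, present).sum(-1).mean()
        z = future.rsample()  # decoded to segmentation/depth/flow during training
        return z, kl

# Inference: sample diverse futures from the present distribution alone, e.g.
#   z = heads.to_gaussian(heads.present(past_feat)).sample()
```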

2. Extending Maps with Semantic and Contextual Object Information for Robot Navigation: a Learning-Based Framework using Visual and Depth Cues [PDF] Back to contents
  Renato Martins, Dhiego Bersan, Mario F. M. Campos, Erickson R. Nascimento
Abstract: This paper addresses the problem of building augmented metric representations of scenes with semantic information from RGB-D images. We propose a complete framework to create an enhanced map representation of the environment with object-level information to be used in several applications such as human-robot interaction, assistive robotics, visual navigation, or in manipulation tasks. Our formulation leverages a CNN-based object detector (Yolo) with a 3D model-based segmentation technique to perform instance semantic segmentation, and to localize, identify, and track different classes of objects in the scene. The tracking and positioning of semantic classes is done with a dictionary of Kalman filters in order to combine sensor measurements over time and then providing more accurate maps. The formulation is designed to identify and to disregard dynamic objects in order to obtain a medium-term invariant map representation. The proposed method was evaluated with collected and publicly available RGB-D data sequences acquired in different indoor scenes. Experimental results show the potential of the technique to produce augmented semantic maps containing several objects (notably doors). We also provide to the community a dataset composed of annotated object classes (doors, fire extinguishers, benches, water fountains) and their positioning, as well as the source code as ROS packages.

3. Harmonizing Transferability and Discriminability for Adapting Object Detectors [PDF] Back to contents
  Chaoqi Chen, Zebiao Zheng, Xinghao Ding, Yue Huang, Qi Dou
Abstract: Recent advances in adaptive object detection have achieved compelling results in virtue of adversarial feature adaptation to mitigate the distributional shifts along the detection pipeline. Whilst adversarial adaptation significantly enhances the transferability of feature representations, the feature discriminability of object detectors remains less investigated. Moreover, transferability and discriminability may come at a contradiction in adversarial adaptation given the complex combinations of objects and the differentiated scene layouts between domains. In this paper, we propose a Hierarchical Transferability Calibration Network (HTCN) that hierarchically (local-region/image/instance) calibrates the transferability of feature representations for harmonizing transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the underlying complementary effect between the instance-level feature and the global context information for the instance-level feature alignment; (3) local feature masks that calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment. Experimental results show that HTCN significantly outperforms the state-of-the-art methods on benchmark datasets.

4. Automatic Lesion Detection System (ALDS) for Skin Cancer Classification Using SVM and Neural Classifiers [PDF] Back to contents
  Muhammad Ali Farooq, Muhammad Aatif Mobeen Azhar, Rana Hammad Raza
Abstract: Technology-aided platforms provide reliable tools in almost every field these days. These tools, supported by computational power, are significant for applications that need sensitive and precise data analysis. One such important application in the medical field is an Automatic Lesion Detection System (ALDS) for skin cancer classification. Computer-aided diagnosis helps physicians and dermatologists obtain a second opinion for proper analysis and treatment of skin cancer. Precise segmentation of the cancerous mole along with the surrounding area is essential for proper analysis and diagnosis. This paper focuses on the development of an improved ALDS framework based on a probabilistic approach that initially utilizes active contours and a watershed-merged mask to segment out the mole; an SVM and a neural classifier are then applied for the classification of the segmented mole. After lesion segmentation, the selected features are classified to ascertain whether the case under consideration is melanoma or non-melanoma. The approach is tested on varying datasets, and a comparative analysis is performed that reflects the effectiveness of the proposed system.

5. Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [PDF] Back to contents
  Patrick Knöbelreiter, Christian Sormann, Alexander Shekhovtsov, Friedrich Fraundorfer, Thomas Pock
Abstract: It has been proposed by many researchers that combining deep neural networks with graphical models can create more efficient and better regularized composite models. The main difficulties in implementing this in practice are associated with a discrepancy in suitable learning objectives as well as with the necessity of approximations for the inference. In this work we take one of the simplest inference methods, a truncated max-product Belief Propagation, and add what is necessary to make it a proper component of a deep learning model: We connect it to learning formulations with losses on marginals and compute the backprop operation. This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs), allowing us to design a hierarchical model composing BP inference and CNNs at different scale levels. The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
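
The inference primitive being wrapped into a layer is easy to state concretely. The NumPy sketch below runs min-sum message passing (max-product in negative-log space) on a 1D chain MRF; on a chain one forward/backward sweep is exact, while the paper truncates the iterations on image grids and adds learned losses on marginals plus a backprop rule, which this illustration omits.

```python
import numpy as np

def min_sum_chain(unary, pairwise):
    """Min-sum BP (max-product in -log space) on a chain MRF.

    unary:    (T, L) label costs for T nodes
    pairwise: (L, L) transition costs, pairwise[i, j] = cost(label i, label j)
    Returns (T, L) min-marginals; argmin over labels gives the MAP labeling.
    """
    T, L = unary.shape
    fwd = np.zeros((T, L))   # messages arriving from the left
    bwd = np.zeros((T, L))   # messages arriving from the right
    for t in range(1, T):
        m = unary[t - 1] + fwd[t - 1]
        fwd[t] = (m[:, None] + pairwise).min(axis=0)
    for t in range(T - 2, -1, -1):
        m = unary[t + 1] + bwd[t + 1]
        bwd[t] = (pairwise + m[None, :]).min(axis=1)
    return unary + fwd + bwd
```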

6. Semantic Pyramid for Image Generation [PDF] Back to contents
  Assaf Shocher, Yossi Gandelsman, Inbar Mosseri, Michal Yarom, Michal Irani, William T. Freeman, Tali Dekel
Abstract: We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid -- a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.

7. Pyramidal Edge-maps based Guided Thermal Super-resolution [PDF] Back to contents
  Honey Gupta, Kaushik Mitra
Abstract: Thermal imaging is a robust sensing technique but its consumer applicability is limited by the high cost of thermal sensors. Nevertheless, low-resolution thermal cameras are relatively affordable and are also usually accompanied by a high-resolution visible-range camera. This visible-range image can be used as a guide to reconstruct a high-resolution thermal image using guided super-resolution(GSR) techniques. However, the difference in wavelength-range of the input images makes this task challenging. Improper processing can introduce artifacts such as blur and ghosting, mainly due to texture and content mismatch. To this end, we propose a novel algorithm for guided super-resolution that explicitly tackles the issue of texture-mismatch caused due to multimodality. We propose a two-stage network that combines information from a low-resolution thermal and a high-resolution visible image with the help of multi-level edge-extraction and integration. The first stage of our network extracts edge-maps from the visual image at different pyramidal levels and the second stage integrates these edge-maps into our proposed super-resolution network at appropriate layers. Extraction and integration of edges belonging to different scales simplifies the task of GSR as it provides texture to object-level information in a progressive manner. Using multi-level edges also allows us to adjust the contribution of the visual image directly at the time of testing and thus provides controllability at test-time. We perform multiple experiments and show that our method performs better than existing state-of-the-art guided super-resolution methods both quantitatively and qualitatively.

8. Adaptive Graph Convolutional Network with Attention Graph Clustering for Co-saliency Detection [PDF] Back to contents
  Kaihua Zhang, Tengpeng Li, Shiwen Shen, Bo Liu, Jin Chen, Qingshan Liu
Abstract: Co-saliency detection aims to discover the common and salient foregrounds from a group of relevant images. For this task, we present a novel adaptive graph convolutional network with attention graph clustering (GCAGC). Three major contributions have been made, and are experimentally shown to have substantial practical merits. First, we propose a graph convolutional network design to extract information cues to characterize the intra- and inter-image correspondence. Second, we develop an attention graph clustering algorithm to discriminate the common objects from all the salient foreground objects in an unsupervised fashion. Third, we present a unified framework with an encoder-decoder structure to jointly train and optimize the graph convolutional network, attention graph clustering, and co-saliency detection decoder in an end-to-end manner. We evaluate our proposed GCAGC method on three co-saliency detection benchmark datasets (iCoseg, Cosal2015 and COCO-SEG). Our GCAGC method obtains significant improvements over the state of the art on most of them.

9. Gimme Signals: Discriminative signal encoding for multimodal activity recognition [PDF] Back to contents
  Raphael Memmesheimer, Nick Theisen, Dietrich Paulus
Abstract: We present a simple, yet effective and flexible method for action recognition supporting multiple sensor modalities. Multivariate signal sequences are encoded in an image and are then classified using a recently proposed EfficientNet CNN architecture. Our focus was to find an approach that generalizes well across different sensor modalities without specific adaptations while still achieving good results. We apply our method to 4 action recognition datasets containing skeleton sequences, inertial and motion capturing measurements as well as WiFi fingerprints, ranging up to 120 action classes. Our method defines the current best CNN-based approach on the NTU RGB+D 120 dataset, lifts the state of the art on the ARIL Wi-Fi dataset by +6.78%, improves the UTD-MHAD inertial baseline by +14.4% and the UTD-MHAD skeleton baseline by 1.13%, and achieves 96.11% on the Simitate motion capturing data (80/20 split). We further demonstrate experiments on both modality fusion at the signal level and signal reduction to prevent the representation from overloading.
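
As a rough illustration of the signal-to-image encoding step, the sketch below resamples each sensor channel to a fixed length and stacks the normalized channels as image rows. This is an assumed stand-in for exposition; the paper renders signals as plot-like images before feeding them to the CNN.

```python
import numpy as np

def signals_to_image(seq, out_len=224):
    """Encode a (T, C) multivariate signal as a (C, out_len) grayscale image.

    Each channel is linearly resampled to a fixed length and min-max
    normalized to [0, 1]; rows of the image correspond to sensor channels.
    (Assumed encoding for illustration only.)
    """
    T, C = seq.shape
    xs = np.linspace(0, T - 1, out_len)
    img = np.stack([np.interp(xs, np.arange(T), seq[:, c]) for c in range(C)])
    lo, hi = img.min(axis=1, keepdims=True), img.max(axis=1, keepdims=True)
    return (img - lo) / np.maximum(hi - lo, 1e-8)
```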

10. PointINS: Point-based Instance Segmentation [PDF] Back to contents
  Lu Qi, Xiangyu Zhang, Yingcong Chen, Yukang Chen, Jian Sun, Jiaya Jia
Abstract: A single-point feature has shown its effectiveness in object detection. However, for instance segmentation, it does not lead to satisfactory results. The reasons are twofold. Firstly, it has limited representation capacity. Secondly, it could be misaligned with potential instances. To address the above issues, we propose a new point-based framework, namely PointINS, to segment instances from single points. The core module of our framework is instance-aware convolution, including the instance-agnostic feature and instance-aware weights. The instance-agnostic feature for each Point-of-Interest (PoI) serves as a template for potential instance masks. In this way, instance-aware features are computed by convolving this template with instance-aware weights for the following mask prediction. Given the independence of instance-aware convolution, PointINS is general and practical as a one-stage detector for anchor-based and anchor-free frameworks. In our extensive experiments, we show the effectiveness of our framework on RetinaNet and FCOS. With a ResNet101 backbone, PointINS achieves 38.3 mask mAP on the challenging COCO dataset, outperforming its competitors by a large margin. The code will be made publicly available.
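
The core idea of instance-aware convolution, a shared instance-agnostic template convolved with per-instance predicted kernels, can be sketched as dynamic convolution. The layout below is an assumption for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def instance_aware_conv(template, inst_weights):
    """Convolve one shared template feature with per-instance kernels.

    template:     (C_in, H, W) instance-agnostic feature for a PoI
    inst_weights: (N, C_out, C_in, k, k) instance-aware kernels, one per instance
    Returns:      (N, C_out, H, W) instance-specific features for mask prediction
    """
    x = template.unsqueeze(0)  # (1, C_in, H, W)
    k = inst_weights.shape[-1]
    outs = [F.conv2d(x, w, padding=k // 2) for w in inst_weights]
    return torch.cat(outs, dim=0)

# e.g. instance_aware_conv(torch.randn(256, 14, 14), torch.randn(5, 8, 256, 1, 1))
```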

11. Is There Tradeoff between Spatial and Temporal in Video Super-Resolution? [PDF] Back to contents
  Haochen Zhang, Dong Liu, Zhiwei Xiong
Abstract: Recent advances of deep learning lead to great success of image and video super-resolution (SR) methods that are based on convolutional neural networks (CNN). For video SR, advanced algorithms have been proposed to exploit the temporal correlation between low-resolution (LR) video frames, and/or to super-resolve a frame with multiple LR frames. These methods pursue higher quality of super-resolved frames, where the quality is usually measured frame by frame in e.g. PSNR. However, frame-wise quality may not reveal the consistency between frames. If an algorithm is applied to each frame independently (which is the case of most previous methods), the algorithm may cause temporal inconsistency, which can be observed as flickering. It is a natural requirement to improve both frame-wise fidelity and between-frame consistency, which are termed spatial quality and temporal quality, respectively. Then we may ask, is a method optimized for spatial quality also optimized for temporal quality? Can we optimize the two quality metrics jointly?

12. Partial Weight Adaptation for Robust DNN Inference [PDF] Back to contents
  Xiufeng Xie, Kyu-Han Kim
Abstract: Mainstream video analytics uses a pre-trained DNN model with an assumption that inference input and training data follow the same probability distribution. However, this assumption does not always hold in the wild: autonomous vehicles may capture video with varying brightness; unstable wireless bandwidth calls for adaptive bitrate streaming of video; and, inference servers may serve inputs from heterogeneous IoT devices/cameras. In such situations, the level of input distortion changes rapidly, thus reshaping the probability distribution of the input. We present GearNN, an adaptive inference architecture that accommodates heterogeneous DNN inputs. GearNN employs an optimization algorithm to identify a small set of "distortion-sensitive" DNN parameters, given a memory budget. Based on the distortion level of the input, GearNN then adapts only the distortion-sensitive parameters, while reusing the rest of constant parameters across all input qualities. In our evaluation of DNN inference with dynamic input distortions, GearNN improves the accuracy (mIoU) by an average of 18.12% over a DNN trained with the undistorted dataset and 4.84% over stability training from Google, with only 1.8% extra memory overhead.
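
A hypothetical sketch of the parameter-selection step: freeze everything, then unfreeze the most "distortion-sensitive" tensors until a memory budget is spent. Gradient magnitude is used here as a stand-in sensitivity score; GearNN's actual criterion may differ.

```python
import torch

def select_sensitive_params(model, budget_bytes):
    """Freeze everything, then unfreeze the most 'distortion-sensitive'
    parameter tensors until the memory budget is exhausted.

    Sensitivity is approximated by gradient magnitude accumulated from a
    backward pass on distorted inputs (an assumed stand-in criterion).
    """
    scored = []
    for name, p in model.named_parameters():
        score = p.grad.abs().mean().item() if p.grad is not None else 0.0
        scored.append((score, name, p))
        p.requires_grad_(False)
    used = 0
    for score, name, p in sorted(scored, key=lambda t: t[0], reverse=True):
        size = p.numel() * p.element_size()
        if used + size <= budget_bytes:
            p.requires_grad_(True)  # only these tensors get per-distortion copies
            used += size
```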

13. Dual Temporal Memory Network for Efficient Video Object Segmentation [PDF] Back to contents
  Kaihua Zhang, Long Wang, Dong Liu, Bo Liu, Qingshan Liu, Zhu Li
Abstract: Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the most use of the temporal information to boost the performance. We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories to address the temporal modeling in VOS. Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames in video via a graph-based learning framework, which can well preserve the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of the object via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves a favorable and competitive performance on three frequently-used VOS datasets, including DAVIS 2016, DAVIS 2017 and Youtube-VOS, in terms of both speed and accuracy.

14. BIHL:A Fast and High Performance Object Proposals based on Binarized HL Frequency [PDF] Back to contents
  Jiang Chao, Liang Huawei, Wang Zhiling
Abstract: In recent years, the use of object proposals as a preprocessing step for target detection to improve computational efficiency has become an effective method. Good object proposal methods should have a high object detection recall rate and low computational cost, as well as good location accuracy and repeatability. However, it is difficult for current advanced algorithms to achieve a good balance in the above performance. Therefore, it is especially important to ensure that the recall rate and location quality are not degraded while accelerating object generation. For this problem, we propose a class-independent object proposal algorithm, BIHL. It combines the advantages of window scoring and superpixel merging. First, a binarized horizontal high-frequency component feature and a linear classifier are used to learn and generate a set of candidate boxes with an objective score. Then, the candidate boxes are merged based on the principle of location and score proximity. Unlike superpixel merging algorithms, our method does not use pixel-level operations, avoiding a large amount of computation without losing performance. Experimental results on the VOC2007 dataset and the VOC2007 synthetic interference dataset containing 297,120 test images show that, when including difficult-to-identify objects, with an IoU threshold of 0.5 and 10000 budget proposals, our method achieves a 99.3% detection recall and a mean average best overlap of 81.1%. The average processing time of our method for all test set images is 0.0015 seconds, which is nearly 3 times faster than the current fastest method. In repeatability testing, our method achieves the highest average repeatability among the methods that achieve good repeatability to various disturbances, and the average repeatability is 10% higher than RPN. The code will be published at this https URL

15. A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [PDF] Back to contents
  Beihao Xia, Conghao Wang, Qinmu Peng, Xinge You, Dacheng Tao
Abstract: It remains challenging to automatically predict the multi-agent trajectory due to multiple interactions, including agent-to-agent interaction and scene-to-agent interaction. Although recent methods have achieved promising performance, most of them just consider the spatial influence of the interactions and ignore the fact that temporal influence always accompanies spatial influence. Moreover, those methods based on scene information always require extra segmented scene images to generate multiple socially acceptable trajectories. To solve these limitations, we propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC). First, a spatial-temporal attention mechanism is presented to explore the most useful and important information. Second, we conduct a joint feature sequence based on the sequence and instant state information to make the generated trajectories keep spatial continuity. Experiments are performed on the two widely used ETH-UCY datasets and demonstrate that the proposed model achieves state-of-the-art prediction accuracy and handles more complex scenarios.

16. BigGAN-based Bayesian reconstruction of natural images from human brain activity [PDF] Back to contents
  Kai Qiao, Jian Chen, Linyuan Wang, Chi Zhang, Li Tong, Bin Yan
Abstract: In the visual decoding domain, visually reconstructing presented images given the corresponding human brain activity monitored by functional magnetic resonance imaging (fMRI) is difficult, especially when reconstructing viewed natural images. Visual reconstruction is a conditional image generation on fMRI data and thus generative adversarial network (GAN) for natural image generation is recently introduced for this task. Although GAN-based methods have greatly improved, the fidelity and naturalness of reconstruction are still unsatisfactory due to the small number of fMRI data samples and the instability of GAN training. In this study, we proposed a new GAN-based Bayesian visual reconstruction method (GAN-BVRM) that includes a classifier to decode categories from fMRI data, a pre-trained conditional generator to generate natural images of specified categories, and a set of encoding models and evaluator to evaluate generated images. GAN-BVRM employs the pre-trained generator of the prevailing BigGAN to generate masses of natural images, and selects the images that best matches with the corresponding brain activity through the encoding models as the reconstruction of the image stimuli. In this process, the semantic and detailed contents of reconstruction are controlled by decoded categories and encoding models, respectively. GAN-BVRM used the Bayesian manner to avoid contradiction between naturalness and fidelity from current GAN-based methods and thus can improve the advantages of GAN. Experimental results revealed that GAN-BVRM improves the fidelity and naturalness, that is, the reconstruction is natural and similar to the presented image stimuli.

17. Deep Domain-Adversarial Image Generation for Domain Generalisation [PDF] Back to contents
  Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, Tao Xiang
Abstract: Machine learning models typically suffer from the domain shift problem when trained on a source dataset and evaluated on a target dataset of different distribution. To overcome this problem, domain generalisation (DG) methods aim to leverage data from multiple source domains so that a trained model can generalise to unseen domains. In this paper, we propose a novel DG approach based on \emph{Deep Domain-Adversarial Image Generation} (DDAIG). Specifically, DDAIG consists of three components, namely a label classifier, a domain classifier and a domain transformation network (DoTNet). The goal for DoTNet is to map the source training data to unseen domains. This is achieved by having a learning objective formulated to ensure that the generated data can be correctly classified by the label classifier while fooling the domain classifier. By augmenting the source training data with the generated unseen domain data, we can make the label classifier more robust to unknown domain changes. Extensive experiments on four DG datasets demonstrate the effectiveness of our approach.

18. Interaction Graphs for Object Importance Estimation in On-road Driving Videos [PDF] Back to contents
  Zehua Zhang, Ashish Tawari, Sujitha Martin, David Crandall
Abstract: A vehicle driving along the road is surrounded by many objects, but only a small subset of them influence the driver's decisions and actions. Learning to estimate the importance of each object on the driver's real-time decision-making may help better understand human driving behavior and lead to more reliable autonomous driving systems. Solving this problem requires models that understand the interactions between the ego-vehicle and the surrounding objects. However, interactions among other objects in the scene can potentially also be very helpful, e.g., a pedestrian beginning to cross the road between the ego-vehicle and the car in front will make the car in front less important. We propose a novel framework for object importance estimation using an interaction graph, in which the features of each object node are updated by interacting with others through graph convolution. Experiments show that our model outperforms state-of-the-art baselines with much less input and pre-processing.

19. Analyzing Visual Representations in Embodied Navigation Tasks [PDF] Back to contents
  Erik Wijmans, Julian Straub, Dhruv Batra, Irfan Essa, Judy Hoffman, Ari Morcos
Abstract: Recent advances in deep reinforcement learning require a large amount of training data and generally result in representations that are often over specialized to the target task. In this work, we present a methodology to study the underlying potential causes for this specialization. We use the recently proposed projection weighted Canonical Correlation Analysis (PWCCA) to measure the similarity of visual representations learned in the same environment by performing different tasks. We then leverage our proposed methodology to examine the task dependence of visual representations learned on related but distinct embodied navigation tasks. Surprisingly, we find that slight differences in task have no measurable effect on the visual representation for both SqueezeNet and ResNet architectures. We then empirically demonstrate that visual representations learned on one task can be effectively transferred to a different task.

20. LaserFlow: Efficient and Probabilistic Object Detection and Motion Forecasting [PDF] Back to contents
  Gregory P. Meyer, Jake Charland, Shreyash Pandey, Ankit Laddha, Carlos Vallespi-Gonzalez, Carl K. Wellington
Abstract: In this work, we present LaserFlow, an efficient method for 3D object detection and motion forecasting from LiDAR. Unlike the previous work, our approach utilizes the native range view representation of the LiDAR, which enables our method to operate at the full range of the sensor in real-time without voxelization or compression of the data. We propose a new multi-sweep fusion architecture, which extracts and merges temporal features directly from the range images. Furthermore, we propose a novel technique for learning a probability distribution over future trajectories inspired by curriculum learning. We evaluate LaserFlow on two autonomous driving datasets and demonstrate competitive results when compared to the existing state-of-the-art methods.

21. Sparse Graphical Memory for Robust Planning [PDF] Back to contents
  Michael Laskin, Scott Emmons, Ajay Jain, Thanard Kurutach, Pieter Abbeel, Deepak Pathak
Abstract: To operate effectively in the real world, artificial agents must act from raw sensory input such as images and achieve diverse goals across long time-horizons. On the one hand, recent strides in deep reinforcement and imitation learning have demonstrated impressive ability to learn goal-conditioned policies from high-dimensional image input, though only for short-horizon tasks. On the other hand, classical graphical methods like A* search are able to solve long-horizon tasks, but assume that the graph structure is abstracted away from raw sensory input and can only be constructed with task-specific priors. We wish to combine the strengths of deep learning and classical planning to solve long-horizon tasks from raw sensory input. To this end, we introduce Sparse Graphical Memory (SGM), a new data structure that stores observations and feasible transitions in a sparse memory. SGM can be combined with goal-conditioned RL or imitative agents to solve long-horizon tasks across a diverse set of domains. We show that SGM significantly outperforms current state of the art methods on long-horizon, sparse-reward visual navigation tasks. Project video and code are available at this https URL

22. Advanced Deep Learning Methodologies for Skin Cancer Classification in Prodromal Stages [PDF] Back to contents
  Muhammad Ali Farooq, Asma Khatoon, Viktor Varkarakis, Peter Corcoran
Abstract: Technology-assisted platforms provide reliable solutions in almost every field these days. One such important application in the medical field is skin cancer classification in preliminary stages, which needs sensitive and precise data analysis. For the proposed study, the Kaggle skin cancer dataset is utilized. The proposed study consists of two main phases. In the first phase, the images are preprocessed to remove the clutter, thus producing a refined version of the training images. To achieve that, a sharpening filter is applied, followed by a hair removal algorithm. Different image quality measurement metrics including Peak Signal-to-Noise Ratio (PSNR), Mean Square Error (MSE), Maximum Absolute Squared Deviation (MXERR) and Energy Ratio/Ratio of Squared Norms (L2RAT) are used to compare the overall image quality before and after applying the preprocessing operations. The results from the aforementioned image quality metrics show that image quality is not compromised; rather, it is improved by applying the preprocessing operations. The second phase of the proposed research work incorporates deep learning methodologies that play an imperative role in accurate, precise and robust classification of the lesion mole. This has been reflected by using two state-of-the-art deep learning models: Inception-v3 and MobileNet. The experimental results demonstrate notable improvement in training and validation accuracy by using the refined version of the images for both networks; however, the Inception-v3 network was able to achieve better validation accuracy, so it was finally selected for evaluation on the test data. The final test accuracy using the Inception-v3 network was 86%.
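
Assuming the four metrics follow the standard definitions used by MATLAB's measerr function (a plausible reading, not confirmed by the abstract), they can be computed as:

```python
import numpy as np

def mse(ref, img):
    """Mean square error between a reference and a processed image."""
    return np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)

def psnr(ref, img, peak=255.0):
    """Peak signal-to-noise ratio in dB."""
    m = mse(ref, img)
    return np.inf if m == 0 else 10.0 * np.log10(peak ** 2 / m)

def mxerr(ref, img):
    """Maximum absolute squared deviation."""
    return np.max(np.abs(ref.astype(np.float64) - img.astype(np.float64)) ** 2)

def l2rat(ref, img):
    """Ratio of squared norms of the processed image to the reference."""
    return np.sum(img.astype(np.float64) ** 2) / np.sum(ref.astype(np.float64) ** 2)
```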

23. Human Activity Recognition using Multi-Head CNN followed by LSTM [PDF] Back to contents
  Waqar Ahmad, Misbah Kazmi, Hazrat Ali
Abstract: This study presents a novel method to recognize human physical activities using a CNN followed by an LSTM. Achieving high accuracy with traditional machine learning algorithms (such as SVM, KNN and the random forest method) is a challenging task because the data acquired from wearable sensors like the accelerometer and gyroscope is time-series data. So, to achieve high accuracy, we propose a multi-head CNN model comprising three CNNs to extract features from the data acquired from different sensors; the outputs of all three CNNs are then merged and followed by an LSTM layer and a dense layer. The configuration of all three CNNs is kept the same so that the same number of features is obtained for every input to a CNN. By using the proposed method, we achieve state-of-the-art accuracy, which is comparable to traditional machine learning algorithms and other deep neural network algorithms.
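
A minimal PyTorch sketch of the described architecture, with illustrative layer sizes (the paper's exact configuration is not given in the abstract): three identical 1D-CNN heads, one per sensor stream, whose outputs are merged and fed to an LSTM and a dense classifier.

```python
import torch
import torch.nn as nn

class MultiHeadCNNLSTM(nn.Module):
    """Three identical 1D-CNN heads, merged, then LSTM + dense (sizes assumed)."""
    def __init__(self, in_channels=3, n_classes=6, hidden=128):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )
        self.heads = nn.ModuleList([head() for _ in range(3)])
        self.lstm = nn.LSTM(input_size=3 * 64, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, xs):            # xs: list of 3 tensors, each (B, C, T)
        feats = [h(x) for h, x in zip(self.heads, xs)]
        merged = torch.cat(feats, dim=1).transpose(1, 2)  # (B, T/2, 3*64)
        out, _ = self.lstm(merged)
        return self.fc(out[:, -1])    # classify from the last time step
```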

24. A Power-Efficient Binary-Weight Spiking Neural Network Architecture for Real-Time Object Classification [PDF] Back to contents
  Pai-Yu Tan, Po-Yao Chuang, Yen-Ting Lin, Cheng-Wen Wu, Juin-Ming Lu
Abstract: Neural network hardware is considered an essential part of future edge devices. In this paper, we propose a binary-weight spiking neural network (BW-SNN) hardware architecture for low-power real-time object classification on edge platforms. This design stores a full neural network on-chip, and hence requires no off-chip bandwidth. The proposed systolic array maximizes data reuse for a typical convolutional layer. A 5-layer convolutional BW-SNN hardware is implemented in 90nm CMOS. Compared with state-of-the-art designs, the area cost and energy per classification are reduced by 7× and 23×, respectively, while also achieving a higher accuracy on the MNIST benchmark. This is also a pioneering SNN hardware architecture that supports advanced CNN architectures.
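
The "binary-weight" ingredient can be illustrated in software. The sketch below follows common binary-weight-network practice (sign binarization with a scale factor and a straight-through gradient estimator), which may differ from the paper's exact quantization scheme; the hardware would store only the signs and the scale.

```python
import torch

class BinarizeWeight(torch.autograd.Function):
    """Sign binarization with a straight-through gradient estimator."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w) * w.abs().mean()     # {-1, +1} times a scale factor

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()  # pass gradients inside [-1, 1]

# Usage during training: w_binary = BinarizeWeight.apply(layer.weight)
```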

25. What Information Does a ResNet Compress? [PDF] Back to contents
  Luke Nicholas Darlow, Amos Storkey
Abstract: The information bottleneck principle (Shwartz-Ziv & Tishby, 2017) suggests that SGD-based training of deep neural networks results in optimally compressed hidden layers, from an information theoretic perspective. However, this claim was established on toy data. The goal of the work we present here is to test whether the information bottleneck principle is applicable to a realistic setting using a larger and deeper convolutional architecture, a ResNet model. We trained PixelCNN++ models as inverse representation decoders to measure the mutual information between hidden layers of a ResNet and input image data, when trained for (1) classification and (2) autoencoding. We find that two stages of learning happen for both training regimes, and that compression does occur, even for an autoencoder. Sampling images by conditioning on hidden layers' activations offers an intuitive visualisation to understand what a ResNet learns to forget.

26. Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis [PDF] Back to contents
  Ting-Yao Hu, Ashish Shrivastava, Oncel Tuzel, Chandra Dhir
Abstract: We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the input text. Training the model in such a way leaks content information into the style vector. The decoder can use the leaked content and ignore some of the input text to minimize the reconstruction loss. At inference time, when the reference speech does not match the content input, the output may not contain all of the content of the input text. We refer to this problem as "content leakage", which we address by explicitly estimating and minimizing the mutual information between the style and the content through an adversarial training formulation. We call our method MIST - Mutual Information based Style Content Separation. The main goal of the method is to preserve the input content in the synthesized speech signal, which we measure by the word error rate (WER) and show substantial improvements over state-of-the-art unsupervised speech synthesis methods.

27. On the effectiveness of convolutional autoencoders on image-based personalized recommender systems [PDF] Back to contents
  E. Blanco-Mallo, B. Remeseiro, V. Bolón-Canedo, A. Alonso-Betanzos
Abstract: Recommender systems (RS) are increasingly present in our daily lives, especially since the advent of Big Data, which allows for storing all kinds of information about users' preferences. Personalized RS are successfully applied in platforms such as Netflix, Amazon or YouTube. However, they are missing in gastronomic platforms such as TripAdvisor, where moreover we can find millions of images tagged with users' tastes. This paper explores the potential of using those images as sources of information for modeling users' tastes and proposes an image-based classification system to obtain personalized recommendations, using a convolutional autoencoder as feature extractor. The proposed architecture will be applied to TripAdvisor data, using users' reviews that can be defined as a triad composed by a user, a restaurant, and an image of it taken by the user. Since the dataset is highly unbalanced, the use of data augmentation on the minority class is also considered in the experimentation. Results on data from three cities of different sizes (Santiago de Compostela, Barcelona and New York) demonstrate the effectiveness of using a convolutional autoencoder as feature extractor, instead of the standard deep features computed with convolutional neural networks.

28. Towards a Framework for Visual Intelligence in Service Robotics: Epistemic Requirements and Gap Analysis [PDF] Back to contents
  Agnese Chiatti, Enrico Motta, Enrico Daga
Abstract: A key capability required by service robots operating in real-world, dynamic environments is that of Visual Intelligence, i.e., the ability to use their vision system, reasoning components and background knowledge to make sense of their environment. In this paper, we analyze the epistemic requirements for Visual Intelligence, both in a top-down fashion, using existing frameworks for human-like Visual Intelligence in the literature, and from the bottom up, based on the errors emerging from object recognition trials in a real-world robotic scenario. Finally, we use these requirements to evaluate current knowledge bases for Service Robotics and to identify gaps in the support they provide for Visual Intelligence. These gaps provide the basis of a research agenda for developing more effective knowledge representations for Visual Intelligence.

29. Random smooth gray value transformations for cross modality learning with gray value invariant networks [PDF] Back to contents
  Nikolas Lessmann, Bram van Ginneken
Abstract: Random transformations are commonly used for augmentation of the training data with the goal of reducing the uniformity of the training samples. These transformations normally aim at variations that can be expected in images from the same modality. Here, we propose a simple method for transforming the gray values of an image with the goal of reducing cross modality differences. This approach enables segmentation of the lumbar vertebral bodies in CT images using a network trained exclusively with MR images. The source code is made available at this https URL
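
One plausible instantiation of such a transformation (an assumption; the paper's exact construction may differ): draw a few random control values, sort them to keep the mapping monotonic, and interpolate the gray values through the resulting curve.

```python
import numpy as np

def random_gray_transform(img, n_points=5, rng=None):
    """Apply a random smooth, monotonic gray-value mapping to an image in [0, 1].

    Control values are drawn uniformly and sorted so the mapping stays
    monotonic; linear interpolation between them gives a smooth curve.
    (An illustrative instantiation, not necessarily the paper's.)
    """
    rng = rng or np.random.default_rng()
    xs = np.linspace(0.0, 1.0, n_points)
    ys = np.sort(rng.uniform(0.0, 1.0, n_points))
    return np.interp(img, xs, ys)
```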

30. LIBRE: The Multiple 3D LiDAR Dataset [PDF] Back to contents
  Alexander Carballo, Jacob Lambert, Abraham Monrroy, David Wong, Patiphon Narksri, Yuki Kitsukawa, Eijiro Takeuchi, Shinpei Kato, Kazuya Takeda
Abstract: In this work, we present LIBRE: LiDAR Benchmarking and Reference, a first-of-its-kind dataset featuring 12 different LiDAR sensors, covering a range of manufacturers, models, and laser configurations. Data captured independently from each sensor includes four different environments and configurations: static obstacles placed at known distances and measured from a fixed position within a controlled environment; static obstacles measured from a moving vehicle, captured in a weather chamber where LiDARs were exposed to different conditions (fog, rain, strong light); dynamic objects actively measured from a fixed position by multiple LiDARs mounted side-by-side simultaneously, creating indirect interference conditions; and dynamic traffic objects captured from a vehicle driven on public urban roads multiple times at different times of the day, including data from supporting sensors such as cameras, infrared imaging, and odometry devices. LIBRE will help the research community to (1) provide a means for a fair comparison of currently available LiDARs, and (2) facilitate the improvement of existing self-driving vehicles and robotics-related software, in terms of development and tuning of LiDAR-based perception algorithms.

31. Action for Better Prediction [PDF] Back to contents
  Bernadette Bucher, Karl Schmeckpeper, Nikolai Matni, Kostas Daniilidis
Abstract: Good prediction is necessary for autonomous robotics to make informed decisions in dynamic environments. Improvements can be made to the performance of a given data-driven prediction model by using better sampling strategies when collecting training data. Active learning approaches to optimal sampling have been combined with the mathematically general approaches to incentivizing exploration presented in the curiosity literature via model-based formulations of curiosity. We present an adversarial curiosity method which maximizes a score given by a discriminator network. This score gives a measure of prediction certainty enabling our approach to sample sequences of observations and actions which result in outcomes considered the least realistic by the discriminator. We demonstrate the ability of our active sampling method to achieve higher prediction performance and higher sample efficiency in a domain transfer problem for robotic manipulation tasks. We also present a validation dataset of action-conditioned video of robotic manipulation tasks on which we test the prediction performance of our trained models.

32. Human Grasp Classification for Reactive Human-to-Robot Handovers [PDF] Back to contents
  Wei Yang, Chris Paxton, Maya Cakmak, Dieter Fox
Abstract: Transfer of objects between humans and robots is a critical capability for collaborative robots. Although there has been a recent surge of interest in human-robot handovers, most prior research focus on robot-to-human handovers. Further, work on the equally critical human-to-robot handovers often assumes humans can place the object in the robot's gripper. In this paper, we propose an approach for human-to-robot handovers in which the robot meets the human halfway, by classifying the human's grasp of the object and quickly planning a trajectory accordingly to take the object from the human's hand according to their intent. To do this, we collect a human grasp dataset which covers typical ways of holding objects with various hand shapes and poses, and learn a deep model on this dataset to classify the hand grasps into one of these categories. We present a planning and execution approach that takes the object from the human hand according to the detected grasp and hand position, and replans as necessary when the handover is interrupted. Through a systematic evaluation, we demonstrate that our system results in more fluent handovers versus two baselines. We also present findings from a user study (N = 9) demonstrating the effectiveness and usability of our approach with naive users in different scenarios. More results and videos can be found at this http URL.

33. Autoencoders [PDF] Back to contents
  Dor Bank, Noam Koenigstein, Raja Giryes
Abstract: An autoencoder is a specific type of a neural network, which is mainly designed to encode the input into a compressed and meaningful representation, and then decode it back such that the reconstructed input is as similar as possible to the original one. This chapter surveys the different types of autoencoders that are mainly used today. It also describes various applications and use-cases of autoencoders.
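
For concreteness, a minimal fully-connected autoencoder in PyTorch, with illustrative sizes:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """A minimal fully-connected autoencoder (illustrative sizes)."""
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),              # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim), nn.Sigmoid(),  # reconstruct the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Trained by minimizing a reconstruction loss, e.g. nn.MSELoss()(model(x), x).
```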

34. LiDAR guided Small obstacle Segmentation [PDF] Back to contents
  Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna
Abstract: Detecting small obstacles on the road is critical for autonomous driving. In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR (VLP-16) and monocular vision. LiDAR is employed to provide additional context in the form of confidence maps to monocular segmentation networks. We show significant performance gains when the context is fed as an additional input to monocular semantic segmentation frameworks. We further present a new semantic segmentation dataset to the community, comprising over 3000 image frames with corresponding LiDAR observations. The images come with pixel-wise annotations of three classes: off-road, road, and small obstacle. We stress that precise calibration between LiDAR and camera is crucial for this task and thus propose a novel Hausdorff distance based calibration refinement method over extrinsic parameters. As a first benchmark over this dataset, we report our results with 73% instance detection up to a distance of 50 meters on challenging scenarios. Qualitatively, by showcasing accurate segmentation of obstacles less than 15 cm at 50 m depth, and quantitatively, through favourable comparisons vis-à-vis prior art, we vindicate the method's efficacy. Our project page and dataset are hosted at this https URL

35. W2S: A Joint Denoising and Super-Resolution Dataset [PDF] Back to contents
  Ruofan Zhou, Majed El Helou, Daniel Sage, Thierry Laroche, Arne Seitz, Sabine Süsstrunk
Abstract: Denoising and super-resolution (SR) are fundamental tasks in imaging. These two restoration tasks are well covered in the literature, however, only separately. Given a noisy low-resolution (LR) input image, it is yet unclear what the best approach would be in order to obtain a noise-free high-resolution (HR) image. In order to study joint denoising and super-resolution (JDSR), a dataset containing pairs of noisy LR images and the corresponding HR images is fundamental. We propose such a novel JDSR dataset, Widefield2SIM (W2S), acquired using microscopy equipment and techniques. W2S is comprised of 144,000 real fluorescence microscopy images, used to form a total of 360 sets of images. A set is comprised of noisy LR images with different noise levels, a noise-free LR image, and a corresponding high-quality HR image. W2S allows us to benchmark the combinations of 6 denoising methods and 6 SR methods. We show that state-of-the-art SR networks perform very poorly on noisy inputs, with a loss reaching 14dB relative to noise-free inputs. Our evaluation also shows that applying the best denoiser in terms of reconstruction error followed by the best SR method does not yield the best result. The best denoising PSNR can, for instance, come at the expense of a loss in high frequencies, which is detrimental for SR methods. We lastly demonstrate that a light-weight SR network with a novel texture loss, trained specifically for JDSR, outperforms any combination of state-of-the-art deep denoising and SR networks.
