
[arXiv papers] Computer Vision and Pattern Recognition 2020-03-11

Contents

1. Unpaired Image-to-Image Translation using Adversarial Consistency Loss [PDF] Abstract
2. Image Restoration for Under-Display Camera [PDF] Abstract
3. PANDA: A Gigapixel-level Human-centric Video Dataset [PDF] Abstract
4. Hierarchical Human Parsing with Typed Part-Relation Reasoning [PDF] Abstract
5. SAD: Saliency-based Defenses Against Adversarial Examples [PDF] Abstract
6. Dam Burst: A region-merging-based image segmentation method [PDF] Abstract
7. Off-Road Drivable Area Extraction Using 3D LiDAR Data [PDF] Abstract
8. Multi-Task Recurrent Neural Network for Surgical Gesture Recognition and Progress Prediction [PDF] Abstract
9. AP-MTL: Attention Pruned Multi-task Learning Model for Real-time Instrument Detection and Segmentation in Robot-assisted Surgery [PDF] Abstract
10. Rainy screens: Collecting rainy datasets, indoors [PDF] Abstract
11. Restore from Restored: Single Image Denoising with Pseudo Clean Image [PDF] Abstract
12. Dual-attention Guided Dropblock Module for Weakly Supervised Object Localization [PDF] Abstract
13. Deep Blind Video Super-resolution [PDF] Abstract
14. Deep Hough Transform for Semantic Line Detection [PDF] Abstract
15. Realizing Pixel-Level Semantic Learning in Complex Driving Scenes based on Only One Annotated Pixel per Class [PDF] Abstract
16. Incremental Few-Shot Object Detection [PDF] Abstract
17. Lung Infection Quantification of COVID-19 in CT Images with Deep Learning [PDF] Abstract
18. HeatNet: Bridging the Day-Night Domain Gap in Semantic Segmentation with Thermal Images [PDF] Abstract
19. PnP-Net: A hybrid Perspective-n-Point Network [PDF] Abstract
20. Hierarchical Neural Architecture Search for Single Image Super-Resolution [PDF] Abstract
21. Convolutional Occupancy Networks [PDF] Abstract
22. Label-Driven Reconstruction for Domain Adaptation in Semantic Segmentation [PDF] Abstract
23. The Virtual Tailor: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style [PDF] Abstract
24. DymSLAM: 4D Dynamic Scene Reconstruction Based on Geometrical Motion Segmentation [PDF] Abstract
25. Channel Pruning via Optimal Thresholding [PDF] Abstract
26. 3D Quasi-Recurrent Neural Network for Hyperspectral Image Denoising [PDF] Abstract
27. PBRnet: Pyramidal Bounding Box Refinement to Improve Object Localization Accuracy [PDF] Abstract
28. Implementation of Deep Neural Networks to Classify EEG Signals using Gramian Angular Summation Field for Epilepsy Diagnosis [PDF] Abstract
29. FOAL: Fast Online Adaptive Learning for Cardiac Motion Estimation [PDF] Abstract
30. Compositional Convolutional Neural Networks: A Deep Architecture with Innate Robustness to Partial Occlusion [PDF] Abstract
31. Deep learning approach for breast cancer diagnosis [PDF] Abstract
32. Tracking Road Users using Constraint Programming [PDF] Abstract
33. Single-view 2D CNNs with Fully Automatic Non-nodule Categorization for False Positive Reduction in Pulmonary Nodule Detection [PDF] Abstract
34. SDVTracker: Real-Time Multi-Sensor Association and Tracking for Self-Driving Vehicles [PDF] Abstract
35. Multi-Scale Superpatch Matching using Dual Superpixel Descriptors [PDF] Abstract
36. Texture Superpixel Clustering from Patch-based Nearest Neighbor Matching [PDF] Abstract
37. FusionLane: Multi-Sensor Fusion for Lane Marking Semantic Segmentation Using Deep Neural Networks [PDF] Abstract
38. A New Meta-Baseline for Few-Shot Learning [PDF] Abstract
39. Category-wise Attack: Transferable Adversarial Examples for Anchor Free Object Detection [PDF] Abstract
40. Crossmodal learning for audio-visual speech event localization [PDF] Abstract
41. Video Caption Dataset for Describing Human Actions in Japanese [PDF] Abstract
42. Weighted Encoding Based Image Interpolation With Nonlocal Linear Regression Model [PDF] Abstract
43. Reconstruction of 3D flight trajectories from ad-hoc camera networks [PDF] Abstract
44. Computer Aided Diagnosis for Spitzoid lesions classification using Artificial Intelligence techniques [PDF] Abstract
45. TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning [PDF] Abstract
46. Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog [PDF] Abstract
47. Enabling Viewpoint Learning through Dynamic Label Generation [PDF] Abstract
48. DIBS: Diversity inducing Information Bottleneck in Model Ensembles [PDF] Abstract
49. Spine intervertebral disc labeling using a fully convolutional redundant counting model [PDF] Abstract
50. Automatic segmentation of spinal multiple sclerosis lesions: How to generalize across MRI contrasts? [PDF] Abstract

Abstracts

1. Unpaired Image-to-Image Translation using Adversarial Consistency Loss [PDF] Back to contents
  Yihao Zhao, Ruihai Wu, Hao Dong
Abstract: Unpaired image-to-image translation is a class of vision problems whose goal is to find the mapping between different image domains using unpaired training data. Cycle-consistency loss is a widely used constraint for such problems. However, due to the strict pixel-level constraint, it cannot perform geometric changes, remove large objects, or ignore irrelevant texture. In this paper, we propose a novel adversarial-consistency loss for image-to-image translation. This loss does not require the translated image to be translated back to a specific source image, but encourages the translated images to retain important features of the source images, overcoming the drawbacks of cycle-consistency loss noted above. Our method achieves state-of-the-art results on three challenging tasks: glasses removal, male-to-female translation, and selfie-to-anime translation.
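
For context, the pixel-level constraint that the paper relaxes is the standard cycle-consistency loss of CycleGAN-style models; for mappings G: X→Y and F: Y→X it reads as follows (the exact form of the proposed adversarial-consistency loss is defined in the paper itself):

```latex
% Standard cycle-consistency loss (CycleGAN); the strict pixel-wise L1
% terms are what prevent geometric changes and object removal.
\mathcal{L}_{cyc}(G, F) =
    \mathbb{E}_{x \sim p_{data}(x)}\left[ \lVert F(G(x)) - x \rVert_1 \right]
  + \mathbb{E}_{y \sim p_{data}(y)}\left[ \lVert G(F(y)) - y \rVert_1 \right]
```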

2. Image Restoration for Under-Display Camera [PDF] Back to contents
  Yuqian Zhou, David Ren, Neil Emerton, Sehoon Lim, Timothy Large
Abstract: The new trend of full-screen devices encourages us to position a camera behind a screen. Removing the bezel and centralizing the camera under the screen brings larger display-to-body ratio and enhances eye contact in video chat, but also causes image degradation. In this paper, we focus on a newly-defined Under-Display Camera (UDC), as a novel real-world single image restoration problem. First, we take a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED) and analyze their optical systems to understand the degradation. Second, we design a novel Monitor-Camera Imaging System (MCIS) for easier real pair data acquisition, and a model-based data synthesizing pipeline to generate UDC data only from display pattern and camera measurements. Finally, we resolve the complicated degradation using learning-based methods. Our model demonstrates a real-time high-quality restoration trained with either real or synthetic data. The presented results and methods provide good practice to apply image restoration to real-world applications.

3. PANDA: A Gigapixel-level Human-centric Video Dataset [PDF] Back to contents
  Xueyang Wang, Xiya Zhang, Yinheng Zhu, Yuchen Guo, Xiaoyun Yuan, Liuyu Xiang, Zerun Wang, Guiguang Ding, David J Brady, Qionghai Dai, Lu Fang
Abstract: We present PANDA, the first gigaPixel-level humAN-centric viDeo dAtaset, for large-scale, long-term, and multi-object visual analysis. The videos in PANDA were captured by a gigapixel camera and cover real-world scenes with both wide field-of-view (~1 square kilometer area) and high-resolution details (~gigapixel-level/frame). The scenes may contain 4k head counts with over 100x scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups and 2.9k interactions. We benchmark the human detection and tracking tasks. Due to the vast variance of pedestrian pose, scale, occlusion and trajectory, existing approaches are challenged by both accuracy and efficiency. Given the uniqueness of PANDA with both wide FoV and high resolution, a new task of interaction-aware group detection is introduced. We design a 'global-to-local zoom-in' framework, where global trajectories and local interactions are simultaneously encoded, yielding promising results. We believe PANDA will contribute to the community of artificial intelligence and praxeology by understanding human behaviors and interactions in large-scale real-world scenes. PANDA Website: this http URL.

4. Hierarchical Human Parsing with Typed Part-Relation Reasoning [PDF] Back to contents
  Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, Ling Shao
Abstract: Human parsing is for pixel-wise human semantic understanding. As human bodies are inherently hierarchically structured, how to model human structures is the central theme in this task. Focusing on this, we seek to simultaneously exploit the representational capacity of deep graph networks and the hierarchical human structures. In particular, we provide the following two contributions. First, three kinds of part relations, i.e., decomposition, composition, and dependency, are, for the first time, completely and precisely described by three distinct relation networks. This is in stark contrast to previous parsers, which only focus on a portion of the relations and adopt a type-agnostic relation modeling strategy. More expressive relation information can be captured by explicitly imposing the parameters in the relation networks to satisfy the specific characteristics of different relations. Second, previous parsers largely ignore the need for an approximation algorithm over the loopy human hierarchy, while we instead address an iterative reasoning process, by assimilating generic message-passing networks with their edge-typed, convolutional counterparts. With these efforts, our parser lays the foundation for more sophisticated and flexible human relation patterns of reasoning. Comprehensive experiments on five datasets demonstrate that our parser sets a new state-of-the-art on each.

5. SAD: Saliency-based Defenses Against Adversarial Examples [PDF] Back to contents
  Richard Tran, David Patrick, Michael Geyer, Amanda Fernandez
Abstract: With the rise in popularity of machine and deep learning models, there is an increased focus on their vulnerability to malicious inputs. These adversarial examples drift model predictions away from the original intent of the network and are a growing concern in practical security. In order to combat these attacks, neural networks can leverage traditional image processing approaches or state-of-the-art defensive models to reduce perturbations in the data. Defensive approaches that take a global approach to noise reduction are effective against adversarial attacks, however their lossy approach often distorts important data within the image. In this work, we propose a visual saliency based approach to cleaning data affected by an adversarial attack. Our model leverages the salient regions of an adversarial image in order to provide a targeted countermeasure while comparatively reducing loss within the cleaned images. We measure the accuracy of our model by evaluating the effectiveness of state-of-the-art saliency methods prior to attack, under attack, and after application of cleaning methods. We demonstrate the effectiveness of our proposed approach in comparison with related defenses and against established adversarial attack methods, across two saliency datasets. Our targeted approach shows significant improvements in a range of standard statistical and distance saliency metrics, in comparison with both traditional and state-of-the-art approaches.

6. Dam Burst: A region-merging-based image segmentation method [PDF] Back to contents
  Rui Tang, Wenlong Song, Xiaoping Guan, Huibin Ge, Deke Kong
Abstract: Until now, all single-level segmentation algorithms except CNN-based ones have led to over-segmentation, and CNN-based segmentation algorithms have problems of their own. To avoid over-segmentation, multiple criterion thresholds are adopted in the region-merging process to produce hierarchical segmentation results. However, extreme over-segmentation still occurs at the low levels of the hierarchy, while salient tiny objects are merged into their large neighbors at the high levels. This paper proposes a region-merging-based image segmentation method that we call Dam Burst. As a single-level segmentation algorithm, this method avoids over-segmentation while retaining detail. It is so named because it simulates a flood from underground that destroys the dams between pools of water. We treat edge-detection responses as reinforcement of a dam's structure wherever they lie on the dam. To simulate the flood from underground, regions are merged in ascending order of the average gradient inside each region.
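
A minimal sketch of the merging order the abstract describes: regions are visited in ascending order of their mean interior gradient, like pools flooded by water rising from below. The merge-target choice (closest mean gradient among neighbours) and the stopping `threshold` are illustrative assumptions, not the authors' criterion:

```python
import numpy as np

def dam_burst_sketch(labels, grad_mag, threshold):
    """Illustrative region merging in ascending order of mean interior
    gradient. labels: 2D int array of contiguous region ids; grad_mag:
    2D float array of gradient magnitudes; threshold: flood level at
    which merging stops (hypothetical)."""
    n = int(labels.max()) + 1
    parent = list(range(n))

    def find(a):                      # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Mean gradient inside each region.
    sums = np.bincount(labels.ravel(), weights=grad_mag.ravel(), minlength=n)
    counts = np.bincount(labels.ravel(), minlength=n)
    mean_grad = sums / np.maximum(counts, 1)

    # 4-neighbour region adjacency.
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1], labels[1:])):
        a, b = a.ravel(), b.ravel()
        pairs.update(zip(a[a != b].tolist(), b[a != b].tolist()))

    # Flood rises: the flattest regions are absorbed first.
    for r in np.argsort(mean_grad):
        if mean_grad[r] >= threshold:
            break
        nbrs = [q for p, q in pairs if p == r] + [p for p, q in pairs if q == r]
        if nbrs:
            # Assumption: merge into the neighbour of most similar flatness.
            q = min(nbrs, key=lambda t: abs(mean_grad[t] - mean_grad[r]))
            parent[find(int(r))] = find(q)

    return np.vectorize(find)(labels)
```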

7. Off-Road Drivable Area Extraction Using 3D LiDAR Data [PDF] Back to contents
  Biao Gao, Anran Xu, Yancheng Pan, Xijun Zhao, Wen Yao, Huijing Zhao
Abstract: We propose a method for off-road drivable area extraction using 3D LiDAR data with the goal of autonomous driving application. A specific deep learning framework is designed to deal with the ambiguous area, which is one of the main challenges in the off-road environment. To reduce the considerable demand for human-annotated data for network training, we utilize the information from vast quantities of vehicle paths and auto-generated obstacle labels. Using these autogenerated annotations, the proposed network can be trained using weakly supervised or semi-supervised methods, which can achieve better performance with fewer human annotations. The experiments on our dataset illustrate the reasonability of our framework and the validity of our weakly and semi-supervised methods.

8. Multi-Task Recurrent Neural Network for Surgical Gesture Recognition and Progress Prediction [PDF] Back to contents
  Beatrice van Amsterdam, Matthew J. Clarkson, Danail Stoyanov
Abstract: Surgical gesture recognition is important for surgical data science and computer-aided intervention. Even with robotic kinematic information, automatically segmenting surgical steps presents numerous challenges because surgical demonstrations are characterized by high variability in style, duration and order of actions. In order to extract discriminative features from the kinematic signals and boost recognition accuracy, we propose a multi-task recurrent neural network for simultaneous recognition of surgical gestures and estimation of a novel formulation of surgical task progress. To show the effectiveness of the presented approach, we evaluate its application on the JIGSAWS dataset, that is currently the only publicly available dataset for surgical gesture recognition featuring robot kinematic data. We demonstrate that recognition performance improves in multi-task frameworks with progress estimation without any additional manual labelling and training.

9. AP-MTL: Attention Pruned Multi-task Learning Model for Real-time Instrument Detection and Segmentation in Robot-assisted Surgery [PDF] Back to contents
  Mobarakol Islam, Vibashan VS, Hongliang Ren
Abstract: Surgical scene understanding and multi-task learning are crucial for image-guided robotic surgery. Training a real-time robotic system for the detection and segmentation of high-resolution images poses a challenging problem with the limited computational resources. The perception drawn can be applied in effective real-time feedback, surgical skill assessment, and human-robot collaborative surgeries to enhance surgical outcomes. For this purpose, we develop a novel end-to-end trainable real-time Multi-Task Learning (MTL) model with a weight-shared encoder and task-aware detection and segmentation decoders. Optimization of multiple tasks at the same convergence point is vital and presents a complex problem. Thus, we propose an asynchronous task-aware optimization (ATO) technique to calculate task-oriented gradients and train the decoders independently. Moreover, MTL models are always computationally expensive, which hinders real-time applications. To address this challenge, we introduce global attention dynamic pruning (GADP) by removing less significant and sparse parameters. We further design a skip squeeze-and-excitation (SE) module, which suppresses weak features, excites significant features and performs dynamic spatial and channel-wise feature re-calibration. Validated on the robotic instrument segmentation dataset of the MICCAI endoscopic vision challenge, our model significantly outperforms state-of-the-art segmentation and detection models, including the best-performing models in the challenge.

10. Rainy screens: Collecting rainy datasets, indoors [PDF] Back to contents
  Horia Porav, Valentina-Nicoleta Musat, Tom Bruls, Paul Newman
Abstract: Acquisition of data with adverse conditions in robotics is a cumbersome task due to the difficulty in guaranteeing proper ground truth and synchronising with desired weather conditions. In this paper, we present a simple method, recording a high-resolution screen, for generating diverse rainy images from existing clear ground-truth images; the method is domain- and source-agnostic, simple, and scales up. This setup allows us to leverage the diversity of existing datasets with auxiliary-task ground-truth data, such as semantic segmentation, object positions, etc. We generate rainy images with real adherent droplets and rain streaks based on Cityscapes and BDD, and train a de-raining model. We present quantitative results for image reconstruction and semantic segmentation, and qualitative results for an out-of-sample domain, showing that models trained with our data generalize well.

11. Restore from Restored: Single Image Denoising with Pseudo Clean Image [PDF] Back to contents
  Seunghwan Lee, Donghyeon Cho, Jiwon Kim, Tae Hyun Kim
Abstract: Under certain statistical assumptions of noise (e.g., zero-mean noise), recent self-supervised approaches for denoising have been introduced to learn network parameters without ground-truth clean images, and these methods can restore an image by exploiting information available from the given input (i.e., internal statistics) at test time. However, self-supervised methods are not yet properly combined with conventional supervised denoising methods which train the denoising networks with a large number of external training images. Thus, we propose a new denoising approach that can greatly outperform the state-of-the-art supervised denoising methods by adapting (fine-tuning) their network parameters to the given specific input through self-supervision without changing the fully original network architectures. We demonstrate that the proposed method can be easily employed with state-of-the-art denoising networks without additional parameters, and achieve state-of-the-art performance on numerous denoising benchmark datasets.
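
A plausible minimal sketch of the test-time adaptation loop implied by the title "restore from restored": the network's own output serves as a pseudo clean target, which is re-noised and used for a few fine-tuning steps on the specific input. The Gaussian noise model, L1 loss, and step counts here are assumptions; consult the paper for the actual procedure:

```python
import torch

def adapt_and_denoise(model, noisy, sigma=25 / 255, steps=10, lr=1e-5):
    """model: any pretrained denoising nn.Module; noisy: (1, C, H, W).
    Fine-tunes on the given input via its own pseudo clean image."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    with torch.no_grad():
        pseudo_clean = model(noisy)                # initial restored image
    for _ in range(steps):
        # Re-noise the pseudo clean image and learn to restore it,
        # exploiting the internal statistics of this specific input.
        renoised = pseudo_clean + sigma * torch.randn_like(pseudo_clean)
        loss = (model(renoised) - pseudo_clean).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(noisy)                        # output after adaptation
```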

12. Dual-attention Guided Dropblock Module for Weakly Supervised Object Localization [PDF] Back to contents
  Junhui Yin, Siqing Zhang, Dongliang Chang, Zhanyu Ma, Jun Guo
Abstract: In this paper, we propose a dual-attention guided dropblock module that targets learning the complementary and discriminative visual patterns for weakly supervised object localization (WSOL). We extend the attention mechanism to the task of WSOL, and carefully design two types of attention modules to capture the informative features for better feature representations. Based on the two types of attention mechanism, we propose a channel attention guided dropout (CAGD) and a spatial attention guided dropblock (SAGD). The CAGD ranks channel attentions according to a measure of importance and treats the top-k largest-magnitude attentions as important ones. The SAGD can not only completely remove information by erasing contiguous regions of feature maps rather than individual pixels, but also simply sense the foreground objects and background regions to alleviate attention misdirection. Extensive experiments demonstrate that the proposed method achieves new state-of-the-art localization accuracy on three challenging datasets.

13. Deep Blind Video Super-resolution [PDF] Back to contents
  Jinshan Pan, Songsheng Cheng, Jiawei Zhang, Jinhui Tang
Abstract: Existing video super-resolution (SR) algorithms usually assume that the blur kernels in the degradation process are known and do not model the blur kernels in the restoration. However, this assumption does not hold for video SR and usually leads to over-smoothed super-resolved images. In this paper, we propose a deep convolutional neural network (CNN) model to solve video SR by a blur kernel modeling approach. The proposed deep CNN model consists of motion blur estimation, motion estimation, and latent image restoration modules. The motion blur estimation module is used to provide reliable blur kernels. With the estimated blur kernel, we develop an image deconvolution method based on the image formation model of video SR to generate intermediate latent images so that some sharp image contents can be restored well. However, the generated intermediate latent images may contain artifacts. To generate high-quality images, we use the motion estimation module to explore the information from adjacent frames, where the motion estimation can constrain the deep CNN model for better image restoration. We show that the proposed algorithm is able to generate clearer images with finer structural details. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.

14. Deep Hough Transform for Semantic Line Detection [PDF] Back to contents
  Qi Han, Kai Zhao, Jun Xu, Ming-Ming Cheng
Abstract: In this paper, we put forward a simple yet effective method to detect meaningful straight lines, a.k.a. semantic lines, in given scenes. Prior methods take line detection as a special case of object detection, while neglecting the inherent characteristics of lines, leading to less efficient and suboptimal results. We propose a one-shot end-to-end framework by incorporating the classical Hough transform into deeply learned representations. By parameterizing lines with slopes and biases, we perform the Hough transform to translate deep representations into the parametric space and then directly detect lines in the parametric space. More concretely, we aggregate features along candidate lines on the feature map plane and then assign the aggregated features to corresponding locations in the parametric domain. Consequently, the problem of detecting semantic lines in the spatial domain is transformed to spotting individual points in the parametric domain, making the post-processing steps, i.e., non-maximal suppression, more efficient. Furthermore, our method makes it easy to extract contextual line features, which are critical to accurate line detection. Experimental results on a public dataset demonstrate the advantages of our method over the state of the art.
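
The aggregation step described above (features pooled along each candidate line, then written into one cell of the (angle, offset) parametric space) can be sketched naively in NumPy; this illustrates a Hough-style feature transform, not the paper's implementation:

```python
import numpy as np

def hough_feature_transform(feat, n_angles=180, n_offsets=128):
    """feat: (C, H, W) feature map. Returns (C, n_angles, n_offsets)
    parametric-space features, where each cell is the mean of the
    features sampled along the corresponding line."""
    C, H, W = feat.shape
    out = np.zeros((C, n_angles, n_offsets))
    ys, xs = np.mgrid[0:H, 0:W]
    # Centre coordinates so offsets are symmetric around the image centre.
    xc, yc = xs - W / 2.0, ys - H / 2.0
    max_r = np.hypot(H, W) / 2.0
    for a in range(n_angles):
        theta = np.pi * a / n_angles
        # Signed distance of every pixel to the line family at angle theta.
        r = xc * np.cos(theta) + yc * np.sin(theta)
        bins = ((r + max_r) / (2 * max_r) * (n_offsets - 1)).round().astype(int)
        for o in range(n_offsets):
            mask = bins == o
            if mask.any():
                out[:, a, o] = feat[:, mask].mean(axis=1)
    return out
```

Line detection then reduces to spotting local maxima in the returned parametric map, with non-maximal suppression performed there rather than in the image plane.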

15. Realizing Pixel-Level Semantic Learning in Complex Driving Scenes based on Only One Annotated Pixel per Class [PDF] Back to contents
  Xi Li, Huimin Ma, Sheng Yi, Yanxian Chen
Abstract: Semantic segmentation tasks based on the weakly supervised condition have been put forward to achieve a lightweight labeling process. For simple images that only include a few categories, research based on image-level annotations has achieved acceptable performance. However, when facing complex scenes, since an image contains a large number of classes, it becomes difficult to learn visual appearance based on image tags. In this case, image-level annotations are not effective in providing information. Therefore, we set up a new task in which only one annotated pixel is provided for each category. Based on this more lightweight and informative condition, a three-step process is built for pseudo-label generation, which progressively implements optimal feature representation for each category, image inference, and context-location based refinement. In particular, since high-level semantics and low-level imaging features have different discriminative ability for each class under driving scenes, we divide each category into "object" or "scene" and then provide different operations for the two types separately. Further, an alternate iterative structure is established to gradually improve segmentation performance, which combines CNN-based inter-image common semantic learning and imaging-prior based intra-image modification. Experiments on the Cityscapes dataset demonstrate that the proposed method provides a feasible way to solve weakly supervised semantic segmentation tasks under complex driving scenes.

16. Incremental Few-Shot Object Detection [PDF] Back to contents
  Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy Hospedales, Tao Xiang
Abstract: Most existing object detection methods rely on the availability of abundant labelled training samples per class and offline model training in a batch mode. These requirements substantially limit their scalability to open-ended accommodation of novel classes with limited labelled training data, both in terms of model accuracy and training efficiency during deployment. We present a study aiming to go beyond these limitations by considering the Incremental Few-Shot Detection (iFSD) problem setting, where new classes must be registered incrementally (without revisiting base classes) and with few examples. To this end we propose OpeN-ended Centre nEt (ONCE), a detector designed for incrementally learning to detect novel class objects with few examples. This is achieved by an elegant adaptation of the CentreNet detector to the few-shot learning scenario, and meta-learning a class-specific code generator model for registering novel classes. ONCE fully respects the incremental learning paradigm, with novel class registration requiring only a single forward pass of few-shot training samples, and no access to base classes, thus making it suitable for deployment on embedded devices. Extensive experiments conducted on both the standard object detection and fashion landmark detection tasks show the feasibility of iFSD for the first time, opening an interesting and very important line of research.

17. Lung Infection Quantification of COVID-19 in CT Images with Deep Learning [PDF] Back to contents
  Fei Shan+, Yaozong Gao+, Jun Wang, Weiya Shi, Nannan Shi, Miaofei Han, Zhong Xue, Dinggang Shen, Yuxin Shi
Abstract: CT imaging is crucial for diagnosing, assessing and staging COVID-19 infection. Follow-up scans every 3-5 days are often recommended to monitor disease progression. It has been reported that bilateral and peripheral ground glass opacification (GGO) with or without consolidation are the predominant CT findings in COVID-19 patients. However, due to the lack of computerized quantification tools, only qualitative impressions and rough descriptions of infected areas are currently used in radiological reports. In this paper, a deep learning (DL)-based segmentation system is developed to automatically quantify infection regions of interest (ROIs) and their volumetric ratios w.r.t. the lung. The performance of the system was evaluated by comparing the automatically segmented infection regions with the manually-delineated ones on 300 chest CT scans of 300 COVID-19 patients. For fast manual delineation of training samples and possible manual intervention on automatic results, a human-in-the-loop (HITL) strategy has been adopted to assist radiologists with infection region segmentation, which dramatically reduced the total segmentation time to 4 minutes after 3 iterations of model updating. The average Dice similarity coefficient showed 91.6% agreement between automatic and manual infection segmentations, and the mean estimation error of the percentage of infection (POI) was 0.3% for the whole lung. Finally, possible applications, including but not limited to analysis of follow-up CT scans and of infection distributions in the lobes and segments correlated with clinical findings, were discussed.
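
The reported percentage of infection (POI) is a simple volumetric ratio; a sketch, assuming binary 3D masks produced by the segmentation system:

```python
import numpy as np

def percentage_of_infection(infection_mask, lung_mask):
    """POI = infected volume / lung volume, in percent.
    Both arguments are binary 3D arrays over the CT volume."""
    return 100.0 * infection_mask.sum() / max(int(lung_mask.sum()), 1)
```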

18. HeatNet: Bridging the Day-Night Domain Gap in Semantic Segmentation with Thermal Images [PDF] Back to contents
  Johan Vertens, Jannik Zürn, Wolfram Burgard
Abstract: The majority of learning-based semantic segmentation methods are optimized for daytime scenarios and favorable lighting conditions. Real-world driving scenarios, however, entail adverse environmental conditions such as nighttime illumination or glare which remain a challenge for existing approaches. In this work, we propose a multimodal semantic segmentation model that can be applied during daytime and nighttime. To this end, besides RGB images, we leverage thermal images, making our network significantly more robust. We avoid the expensive annotation of nighttime images by leveraging an existing daytime RGB-dataset and propose a teacher-student training approach that transfers the dataset's knowledge to the nighttime domain. We further employ a domain adaptation method to align the learned feature spaces across the domains and propose a novel two-stage training scheme. Furthermore, due to a lack of thermal data for autonomous driving, we present a new dataset comprising over 20,000 time-synchronized and aligned RGB-thermal image pairs. In this context, we also present a novel target-less calibration method that allows for automatic robust extrinsic and intrinsic thermal camera calibration. Among others, we employ our new dataset to show state-of-the-art results for nighttime semantic segmentation.

19. PnP-Net: A hybrid Perspective-n-Point Network [PDF] Back to contents
  Roy Sheffer, Ami Wiesel
Abstract: We consider the robust Perspective-n-Point (PnP) problem using a hybrid approach that combines deep learning with model-based algorithms. PnP is the problem of estimating the pose of a calibrated camera given a set of 3D points in the world and their corresponding 2D projections in the image. In its more challenging robust version, some of the correspondences may be mismatched and must be efficiently discarded. Classical solutions address PnP via iterative robust non-linear least squares methods that exploit the problem's geometry but are either inaccurate or computationally intensive. In contrast, we propose to combine a deep learning initial phase followed by a model-based fine tuning phase. This hybrid approach, denoted PnP-Net, succeeds in estimating the unknown pose parameters under correspondence errors and noise, with low and fixed computational complexity requirements. We demonstrate its advantages on both synthetic data and real world data.
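
For reference, PnP is built on the standard pinhole projection model: with intrinsics K and pose (R, t), each 3D point maps to its 2D observation up to a scale, and robust PnP seeks the (R, t) minimizing reprojection error over the inlier correspondences:

```latex
% Pinhole projection of 3D world points X_i to image points (u_i, v_i).
s_i \begin{pmatrix} u_i \\ v_i \\ 1 \end{pmatrix}
  = K \left[\, R \mid t \,\right]
    \begin{pmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{pmatrix},
\qquad i = 1, \dots, n
```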

20. Hierarchical Neural Architecture Search for Single Image Super-Resolution [PDF] Back to contents
  Yong Guo, Yongsheng Luo, Zhenhao He, Jin Huang, Jian Chen
Abstract: Deep neural networks have exhibited promising performance in image super-resolution (SR). Most SR models follow a hierarchical architecture that contains both the cell-level design of computational blocks and the network-level design of the positions of upsampling blocks. However, designing SR models heavily relies on human expertise and is very labor-intensive. More critically, these SR models often contain a huge number of parameters and may not meet the requirements of computation resources in real-world applications. To address the above issues, we propose a Hierarchical Neural Architecture Search (HNAS) method to automatically design promising architectures with different requirements of computation cost. To this end, we design a hierarchical SR search space and propose a hierarchical controller for architecture search. Such a hierarchical controller is able to simultaneously find promising cell-level blocks and network-level positions of upsampling layers. Moreover, to design compact architectures with promising performance, we build a joint reward by considering both the performance and computation cost to guide the search process. Extensive experiments on five benchmark datasets demonstrate the superiority of our method over existing methods.

21. Convolutional Occupancy Networks [PDF] Back to contents
  Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, Andreas Geiger
Abstract: Recently, implicit neural representations have gained popularity for learning-based 3D reconstruction. While demonstrating promising results, most implicit approaches are limited to comparably simple geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor of implicit methods is their simple fully-connected network architecture which does not allow for integrating local information in the observations or incorporating inductive biases such as translational equivariance. In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases and Manhattan-world priors, enabling structured reasoning in 3D space. We investigate the effectiveness of the proposed representation by reconstructing complex geometry from noisy point clouds and low-resolution voxel representations. We empirically find that our method enables fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes and generalizes well from synthetic to real data.
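
A minimal sketch of the general design of pairing a convolutional encoder with an implicit decoder: features are trilinearly interpolated at each continuous query point and fed, together with the point, to a small MLP that outputs an occupancy logit. The single-layer encoder, layer sizes, and the single volumetric feature grid are simplifications relative to the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvOccupancySketch(nn.Module):
    """Encode a voxel grid with a 3D CNN, then predict occupancy at
    arbitrary continuous query points via trilinear feature sampling."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.encoder = nn.Conv3d(1, feat_dim, kernel_size=3, padding=1)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim + 3, 64), nn.ReLU(),
            nn.Linear(64, 1))                         # occupancy logit

    def forward(self, voxels, points):
        # voxels: (B, 1, D, H, W); points: (B, N, 3) in [-1, 1]^3
        feat = self.encoder(voxels)                   # (B, C, D, H, W)
        grid = points.view(points.shape[0], 1, 1, -1, 3)
        sampled = F.grid_sample(feat, grid, align_corners=True)
        sampled = sampled.view(feat.shape[0], feat.shape[1], -1)  # (B, C, N)
        x = torch.cat([sampled.transpose(1, 2), points], dim=-1)  # (B, N, C+3)
        return self.decoder(x).squeeze(-1)            # (B, N) logits
```

The convolutional encoder is what supplies the translational equivariance and local context that a plain fully-connected implicit network lacks.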

22. Label-Driven Reconstruction for Domain Adaptation in Semantic Segmentation [PDF] Back to contents
  Jinyu Yang, Weizhi An, Sheng Wang, Xinliang Zhu, Chaochao Yan, Junzhou Huang
Abstract: Unsupervised domain adaptation enables to alleviate the need for pixel-wise annotation in the semantic segmentation. One of the most common strategies is to translate images from the source domain to the target domain and then align their marginal distributions in the feature space using adversarial learning. However, source-to-target translation enlarges the bias in translated images, owing to the dominant data size of the source domain. Furthermore, consistency of the joint distribution in source and target domains cannot be guaranteed through global feature alignment. Here, we present an innovative framework, designed to mitigate the image translation bias and align cross-domain features with the same category. This is achieved by 1) performing the target-to-source translation and 2) reconstructing both source and target images from their predicted labels. Extensive experiments on adapting from synthetic to real urban scene understanding demonstrate that our framework competes favorably against existing state-of-the-art methods.

23. The Virtual Tailor: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style [PDF] Back to contents
  Chaitanya Patel, Zhouyingcheng Liao, Gerard Pons-Moll
Abstract: In this paper, we present TailorNet, a neural model which predicts clothing deformation in 3D as a function of three factors: pose, shape and style (garment geometry), while retaining wrinkle detail. This goes beyond prior models, which are either specific to one style and shape, or generalize to different shapes producing smooth results, despite being style specific. Our hypothesis is that (even non-linear) combinations of examples smooth out high frequency components such as fine-wrinkles, which makes learning the three factors jointly hard. At the heart of our technique is a decomposition of deformation into a high frequency and a low frequency component. While the low-frequency component is predicted from pose, shape and style parameters with an MLP, the high-frequency component is predicted with a mixture of shape-style specific pose models. The weights of the mixture are computed with a narrow bandwidth kernel to guarantee that only predictions with similar high-frequency patterns are combined. The style variation is obtained by computing, in a canonical pose, a subspace of deformation, which satisfies physical constraints such as inter-penetration, and draping on the body. TailorNet delivers 3D garments which retain the wrinkles from the physics based simulations (PBS) it is learned from, while running more than 1000 times faster. In contrast to classical PBS, TailorNet is easy to use and fully differentiable, which is crucial for computer vision algorithms. Several experiments demonstrate TailorNet produces more realistic results than prior work, and even generates temporally coherent deformations on sequences of the AMASS dataset, despite being trained on static poses from a different dataset. To stimulate further research in this direction, we will make a dataset consisting of 55800 frames, as well as our model publicly available at this https URL.
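
Paraphrasing the decomposition from the abstract in symbols (θ pose, β shape, γ style, D the predicted garment deformation; the precise definitions are in the paper):

```latex
D(\theta, \beta, \gamma) \approx
    \underbrace{D_{LF}(\theta, \beta, \gamma)}_{\text{low-frequency, one MLP}}
  + \underbrace{\sum_{k} w_k(\beta, \gamma)\, D^{k}_{HF}(\theta)}_{\text{mixture of shape-style specific pose models}}
```

where the mixture weights w_k are computed with a narrow-bandwidth kernel, so that only predictions with similar high-frequency wrinkle patterns are combined.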

24. DymSLAM: 4D Dynamic Scene Reconstruction Based on Geometrical Motion Segmentation [PDF] Back to contents
  Chenjie Wang, Bin Luo, Yun Zhang, Qing Zhao, Lu Yin, Wei Wang, Xin Su, Yajun Wang, Chengyuan Li
Abstract: Most SLAM algorithms are based on the assumption that the scene is static. However, in practice, most scenes are dynamic and usually contain moving objects, so these methods are not suitable. In this paper, we introduce DymSLAM, a dynamic stereo visual SLAM system capable of reconstructing a 4D (3D + time) dynamic scene with rigid moving objects. The only input of DymSLAM is stereo video, and its output includes a dense map of the static environment, 3D models of the moving objects, and the trajectories of the camera and the moving objects. We first detect and match interest points between successive frames by using traditional SLAM methods. Then the interest points belonging to different motion models (including ego-motion and the motion models of rigid moving objects) are segmented by a multi-model fitting approach. Based on the interest points belonging to the ego-motion, we are able to estimate the trajectory of the camera and reconstruct the static background. The interest points belonging to the motion models of rigid moving objects are then used to estimate their relative motion models to the camera and reconstruct the 3D models of the objects. We then transform the relative motion to the trajectories of the moving objects in the global reference frame. Finally, we fuse the 3D models of the moving objects into the 3D map of the environment by considering their motion trajectories to obtain a 4D (3D + time) sequence. DymSLAM obtains information about the dynamic objects instead of ignoring them and is suitable for unknown rigid objects. Hence, the proposed system allows the robot to be employed for high-level tasks, such as obstacle avoidance for dynamic objects. We conducted experiments in a real-world environment where both the camera and the objects were moving over a wide range.

25. Channel Pruning via Optimal Thresholding [PDF] Back to contents
  Yun Ye, Ganmei You, Jong-Kae Fwu, Xia Zhu, Qing Yang, Yuan Zhu
Abstract: Structured pruning, especially channel pruning, is widely used for its reduced computational cost and compatibility with off-the-shelf hardware devices. Among existing works, weights are typically removed using a predefined global threshold, or a threshold computed from a predefined metric. Designs based on a predefined global threshold ignore the variation among different layers and weight distributions and, therefore, may often result in sub-optimal performance caused by over-pruning or under-pruning. In this paper, we present a simple yet effective method, termed Optimal Thresholding (OT), to prune channels with layer-dependent thresholds that optimally separate important from negligible channels. By using OT, most negligible or unimportant channels are pruned to achieve high sparsity while minimizing performance degradation. Since the most important weights are preserved, the pruned model can be further fine-tuned and quickly converges within very few iterations. Our method demonstrates superior performance, especially when compared to the state-of-the-art designs at high levels of sparsity. On CIFAR-100, a DenseNet-121 pruned and fine-tuned using OT achieves 75.99% accuracy with only 1.46e8 FLOPs and 0.71M parameters. Code is available at this https URL.
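
A sketch of layer-dependent thresholding, assuming channel importance is measured by the magnitude of batch-normalization scale factors (one common choice) and that each layer's threshold is set by a simple cumulative-energy criterion; the paper's definition of the optimal threshold differs and should be consulted:

```python
import numpy as np

def prune_masks_per_layer(bn_scales_per_layer, keep_energy=0.99):
    """bn_scales_per_layer: list of 1D arrays, |gamma| per channel and layer.
    Returns one boolean keep-mask per layer. Unlike a single global
    threshold, each layer gets its own cut-off, so layers whose scales are
    small but uniformly important are not over-pruned."""
    masks = []
    for scales in bn_scales_per_layer:
        order = np.argsort(scales)[::-1]            # most important first
        energy = np.cumsum(scales[order] ** 2)
        k = int(np.searchsorted(energy, keep_energy * energy[-1])) + 1
        thresh = scales[order[k - 1]]               # layer-specific threshold
        masks.append(scales >= thresh)
    return masks
```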

26. 3D Quasi-Recurrent Neural Network for Hyperspectral Image Denoising [PDF] Back to contents
  Kaixuan Wei, Ying Fu, Hua Huang
Abstract: In this paper, we propose an alternating directional 3D quasi-recurrent neural network for hyperspectral image (HSI) denoising, which can effectively embed the domain knowledge -- structural spatio-spectral correlation and global correlation along spectrum. Specifically, 3D convolution is utilized to extract structural spatio-spectral correlation in an HSI, while a quasi-recurrent pooling function is employed to capture the global correlation along spectrum. Moreover, alternating directional structure is introduced to eliminate the causal dependency with no additional computation cost. The proposed model is capable of modeling spatio-spectral dependency while preserving the flexibility towards HSIs with arbitrary number of bands. Extensive experiments on HSI denoising demonstrate significant improvement over state-of-the-arts under various noise settings, in terms of both restoration accuracy and computation time. Our code is available at this https URL.

27. PBRnet: Pyramidal Bounding Box Refinement to Improve Object Localization Accuracy [PDF] Back to contents
  Li Xiao, Yufan Luo, Chunlong Luo, Lianhe Zhao, Quanshui Fu, Guoqing Yang, Anpeng Huang, Yi Zhao
Abstract: Many recently developed object detectors focus on the coarse-to-fine framework, which contains several stages that classify and regress proposals from coarse to fine granularity, gradually obtaining more accurate detections. Multi-resolution models such as the Feature Pyramid Network (FPN) integrate information from different levels of resolution and effectively improve performance. Previous research has also revealed that localization can be further improved by: 1) using fine-grained information, which is more translationally variant; 2) refining local areas, which focuses more on local boundary information. Based on these principles, we design a novel boundary refinement architecture to improve localization accuracy by combining the coarse-to-fine framework with the feature pyramid structure, named the Pyramidal Bounding Box Refinement network (PBRnet), which parameterizes gradually focused boundary areas of objects and leverages lower-level feature maps to extract finer local information when refining the predicted bounding boxes. Extensive experiments are performed on the MS-COCO dataset. PBRnet brings a significant performance gain of roughly 3 points of mAP when added to FPN or Libra R-CNN. Moreover, by treating Cascade R-CNN as a coarse-to-fine detector and replacing its localization branch with the regressor of PBRnet, it yields an extra performance improvement of 1.5 mAP, for a total boost of as much as 5 points of mAP.

28. Implementation of Deep Neural Networks to Classify EEG Signals using Gramian Angular Summation Field for Epilepsy Diagnosis [PDF] Back to contents
  K. Palani Thanaraj, B. Parvathavarthini, U. John Tanik, V. Rajinikanth, Seifedine Kadry, K. Kamalanand
Abstract: This paper evaluates the approach of imaging time-series data, such as EEG, in the diagnosis of epilepsy through a Deep Neural Network (DNN). The EEG signal is transformed into an RGB image using the Gramian Angular Summation Field (GASF). Many such EEG epochs are transformed into GASF images for the normal and focal EEG signals. Then, some of the Deep Neural Networks widely used for image classification problems are applied here to detect the focal GASF images. Three pre-trained DNNs, namely AlexNet, VGG16, and VGG19, are validated for epilepsy detection based on the transfer learning approach. Furthermore, textural features are extracted from the GASF images, and prominent features are selected for a multilayer Artificial Neural Network (ANN) classifier. Lastly, a Custom Convolutional Neural Network (CNN) with three CNN layers, Batch Normalization, a Max-pooling layer, and Dense layers is proposed for epilepsy diagnosis from GASF images. The results of this paper show that the Custom CNN model was able to discriminate between the focal and normal GASF images with an average peak Precision of 0.885, Recall of 0.92, and F1-score of 0.90. Moreover, the Area Under the Curve (AUC) value of the Receiver Operating Characteristic (ROC) curve is 0.92 for the Custom CNN model. This paper suggests that Deep Learning methods widely used in image classification problems can be an alternative approach for epilepsy detection from EEG signals through GASF images.
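
The GASF transform itself is compact, following its standard definition: rescale the series to [-1, 1], take polar angles phi_i = arccos(x_i), and form G_ij = cos(phi_i + phi_j). A sketch:

```python
import numpy as np

def gasf(x):
    """Gramian Angular Summation Field of a 1D signal (e.g. one EEG epoch).
    Returns an (n, n) matrix that can be rendered as an image for a CNN."""
    x = np.asarray(x, dtype=float)
    # Min-max rescale to [-1, 1] so arccos is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    # cos(phi_i + phi_j) = x_i * x_j - sqrt(1 - x_i^2) * sqrt(1 - x_j^2)
    return np.cos(phi[:, None] + phi[None, :])
```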

29. FOAL: Fast Online Adaptive Learning for Cardiac Motion Estimation [PDF] Back to contents
  Hanchao Yu, Shanhui Sun, Haichao Yu, Xiao Chen, Honghui Shi, Thomas Huang, Terrence Chen
Abstract: Motion estimation of cardiac MRI videos is crucial for the evaluation of human heart anatomy and function. Recent research shows promising results with deep learning-based methods. In clinical deployment, however, they suffer dramatic performance drops due to mismatched distributions between training and testing datasets, commonly encountered in the clinical environment. On the other hand, it is arguably impossible to collect all representative datasets and to train a universal tracker before deployment. In this context, we propose a novel fast online adaptive learning (FOAL) framework: an online gradient descent based optimizer that is optimized by a meta-learner. The meta-learner enables the online optimizer to perform a fast and robust adaptation. We evaluated our method through extensive experiments on two public clinical datasets. The results showed the superior performance of FOAL in accuracy compared to the offline-trained tracking method. On average, FOAL took only 0.4 seconds per video for online optimization.

30. Compositional Convolutional Neural Networks: A Deep Architecture with Innate Robustness to Partial Occlusion [PDF] Back to contents
  Adam Kortylewski, Ju He, Qing Liu, Alan Yuille
Abstract: Recent work has shown that deep convolutional neural networks (DCNNs) do not generalize well under partial occlusion. Inspired by the success of compositional models at classifying partially occluded objects, we propose to integrate compositional models and DCNNs into a unified deep model with innate robustness to partial occlusion. We term this architecture Compositional Convolutional Neural Network. In particular, we propose to replace the fully connected classification head of a DCNN with a differentiable compositional model. The generative nature of the compositional model enables it to localize occluders and subsequently focus on the non-occluded parts of the object. We conduct classification experiments on artificially occluded images as well as real images of partially occluded objects from the MS-COCO dataset. The results show that DCNNs do not classify occluded objects robustly, even when trained with data that is strongly augmented with partial occlusions. Our proposed model outperforms standard DCNNs by a large margin at classifying partially occluded objects, even when it has not been exposed to occluded objects during training. Additional experiments demonstrate that CompositionalNets can also localize the occluders accurately, despite being trained with class labels only.

31. Deep learning approach for breast cancer diagnosis [PDF] 返回目录
  Essam A. Rashed, M. Samir Abou El Seoud
Abstract: Breast cancer is one of the leading fatal diseases worldwide, and the risk can be managed well if it is discovered early. The conventional method for breast screening is X-ray mammography, which is known to be challenging for early detection of cancer lesions. The dense breast structure produced by the compression process during imaging leads to difficulty in recognizing small abnormalities. Also, inter- and intra-subject variations of breast tissue make it significantly harder to achieve high diagnosis accuracy using hand-crafted features. Deep learning is an emerging machine learning technology that requires relatively high computational power, yet it has proved to be very effective in several difficult tasks that require decision making at the level of human intelligence. In this paper, we develop a new network architecture inspired by the U-Net structure that can be used for effective and early detection of breast cancer. Results indicate high sensitivity and specificity, suggesting the potential usefulness of the proposed approach in clinical use.

32. Tracking Road Users using Constraint Programming [PDF] 返回目录
  Alexandre Pineault, Guillaume-Alexandre Bilodeau, Gilles Pesant
Abstract: In this paper, we aim at improving the tracking of road users in urban scenes. We present a constraint programming (CP) approach for the data association phase of the tracking-by-detection paradigm for the multiple object tracking (MOT) problem. Such an approach can solve the data association problem more efficiently than graph-based methods and can better handle the combinatorial explosion that occurs when multiple frames are analyzed. Because our focus is on the data association problem, our MOT method uses only simple image features, namely the center position and color of the detections in each frame. Constraints are defined on these two features and on the general MOT problem. For example, we enforce color appearance preservation over trajectories and constrain the extent of motion between frames. Filtering layers are used to eliminate detection candidates before running CP and to remove the dummy trajectories produced by the CP solver. Our proposed method was tested on a motorized vehicle tracking dataset and produces results that outperform the top methods of the UA-DETRAC benchmark.
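As a rough illustration of position-plus-color association under a motion bound, here is a sketch in which SciPy's Hungarian solver stands in for the paper's constraint programming model; the weights and thresholds are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, max_dist=50.0, w_color=0.5):
    """Match tracks to detections using center position and mean color.

    tracks, detections: float arrays of shape (n, 5): x, y, r, g, b.
    """
    pos = np.linalg.norm(tracks[:, None, :2] - detections[None, :, :2], axis=-1)
    col = np.linalg.norm(tracks[:, None, 2:] - detections[None, :, 2:], axis=-1)
    cost = pos + w_color * col
    cost[pos > max_dist] = 1e6                  # constrain inter-frame motion
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e6]

tracks = np.random.rand(4, 5) * 100
detections = np.random.rand(5, 5) * 100
print(associate(tracks, detections))
```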

33. Single-view 2D CNNs with Fully Automatic Non-nodule Categorization for False Positive Reduction in Pulmonary Nodule Detection [PDF] 返回目录
  Hyunjun Eun, Daeyeong Kim, Chanho Jung, Changick Kim
Abstract: Background and Objective: In pulmonary nodule detection, the first stage, candidate detection, aims to detect suspicious pulmonary nodules. However, detected candidates include many false positives, and thus in the following stage, false positive reduction, such false positives must be reliably reduced. Note that this task is challenging due to 1) the imbalance between the numbers of nodules and non-nodules and 2) the intra-class diversity of non-nodules. Although techniques using 3D convolutional neural networks (CNNs) have shown promising performance, they suffer from high computational complexity, which hinders constructing deep networks. To efficiently address these problems, we propose a novel framework using an ensemble of 2D CNNs on single views, which outperforms existing 3D CNN-based methods. Methods: Our ensemble of 2D CNNs utilizes single-view 2D patches to improve both computational and memory efficiency compared to previous techniques exploiting 3D CNNs. We first categorize non-nodules on the basis of features encoded by an autoencoder. Then, all 2D CNNs are trained with the same nodule samples but with different types of non-nodules. By extending the learning capability, this training scheme resolves the difficulty of extracting representative features from non-nodules with large appearance variations. Note that, instead of manual categorization, which would require a heavy workload from radiologists, we propose to automatically categorize non-nodules based on the autoencoder and k-means clustering.
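The automatic categorization step can be sketched briefly; here PCA is only a placeholder for the paper's autoencoder, the patches are random, and the number of clusters is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA   # placeholder for the autoencoder
from sklearn.cluster import KMeans

patches = np.random.rand(500, 32 * 32)  # flattened 2D candidate patches (stand-in)

codes = PCA(n_components=16).fit_transform(patches)        # "encoded" features
types = KMeans(n_clusters=4, n_init=10).fit_predict(codes)

# Each cluster id becomes one non-nodule type; each 2D CNN of the
# ensemble is then trained with the same nodules but one such type.
print(np.bincount(types))
```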

34. SDVTracker: Real-Time Multi-Sensor Association and Tracking for Self-Driving Vehicles [PDF] 返回目录
  Shivam Gautam, Gregory P. Meyer, Carlos Vallespi-Gonzalez, Brian C. Becker
Abstract: Accurate motion state estimation of Vulnerable Road Users (VRUs) is a critical requirement for autonomous vehicles that navigate in urban environments. Due to their computational efficiency, many traditional autonomy systems perform multi-object tracking using Kalman Filters, which frequently rely on hand-engineered association. However, such methods fail to generalize to crowded scenes and multi-sensor modalities, often resulting in poor state estimates which cascade into inaccurate predictions. We present a practical and lightweight tracking system, SDVTracker, that uses a deep learned model for association and state estimation in conjunction with an Interacting Multiple Model (IMM) filter. The proposed tracking method is fast, robust, and generalizes across multiple sensor modalities and different VRU classes. In this paper, we detail a model that jointly optimizes both association and state estimation with a novel loss, an algorithm for determining ground-truth supervision, and a training procedure. We show that this system significantly outperforms hand-engineered methods on a real-world urban driving dataset while running in less than 2.5 ms on CPU for a scene with 100 actors, making it suitable for self-driving applications where low latency and high accuracy are critical.

35. Multi-Scale Superpatch Matching using Dual Superpixel Descriptors [PDF] 返回目录
  Rémi Giraud, Merlin Boyer, Michaël Clément
Abstract: Over-segmentation into superpixels is a very effective dimensionality reduction strategy, enabling fast dense image processing. The main issue of this approach is the inherent irregularity of the image decomposition compared to standard hierarchical multi-resolution schemes, especially when searching for similar neighboring patterns. Several works have attempted to overcome this issue by taking the region irregularity into account in their comparison model. Nevertheless, they remain sub-optimal for providing robust and accurate superpixel neighborhood descriptors, since they only compute features within each region, poorly capturing contour information at superpixel borders. In this work, we address these limitations by introducing the dual superpatch, a novel superpixel neighborhood descriptor. This structure contains features computed in reduced superpixel regions, as well as at the interfaces of multiple superpixels, to explicitly capture contour structure information. A fast multi-scale non-local matching framework is also introduced for the search of similar descriptors at different resolution levels in an image dataset. The proposed dual superpatch enables more accurate capture of similar structured patterns at different scales, and we demonstrate the robustness and performance of this new strategy on matching and supervised labeling applications.

36. Texture Superpixel Clustering from Patch-based Nearest Neighbor Matching [PDF] 返回目录
  Rémi Giraud, Yannick Berthoumieu
Abstract: Superpixels are widely used in computer vision applications. Nevertheless, decomposition methods may still fail to efficiently cluster image pixels according to their local texture. In this paper, we propose a new Nearest Neighbor-based Superpixel Clustering (NNSC) method to generate texture-aware superpixels in a limited computational time compared to previous approaches. We introduce a new clustering framework using patch-based nearest neighbor matching, whereas most existing methods are based on a pixel-wise K-means clustering. Therefore, we directly group pixels in the patch space, enabling the capture of texture information. We demonstrate the efficiency of our method with favorable comparisons in terms of segmentation performance on both standard color and texture datasets. We also show the computational efficiency of NNSC compared to recent texture-aware superpixel methods.
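To make the idea of grouping pixels in patch space concrete, here is a simplified sketch on a grayscale image: plain K-means over patch features plus weighted coordinates stands in for the paper's nearest-neighbor matching scheme, and all parameters are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def texture_superpixels(img, patch=5, k=50, w_xy=0.5):
    """Cluster pixels by their local patch plus weighted (x, y) position,
    so groups capture texture while staying spatially compact."""
    h, w = img.shape
    pad = np.pad(img, patch // 2, mode="reflect")
    feats = np.stack([pad[dy:dy + h, dx:dx + w].ravel()
                      for dy in range(patch) for dx in range(patch)], axis=1)
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.hstack([feats, w_xy * np.c_[ys.ravel(), xs.ravel()]])
    return KMeans(n_clusters=k, n_init=4).fit_predict(feats).reshape(h, w)

labels = texture_superpixels(np.random.rand(64, 64))
```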

37. FusionLane: Multi-Sensor Fusion for Lane Marking Semantic Segmentation Using Deep Neural Networks [PDF] 返回目录
  Ruochen Yin, Biao Yu, Huapeng Wu, Yutao Song, Runxin Niu
Abstract: Effective semantic segmentation of lane markings is a crucial step in constructing lane-level high-precision maps. In recent years, many image semantic segmentation methods have been proposed. These methods mainly focus on images from a camera; due to the limitations of the sensor itself, the accurate three-dimensional spatial position of the lane marking cannot be obtained, so the demands of lane-level high-precision map construction cannot be met. This paper proposes a lane marking semantic segmentation method based on a deep neural network that fuses LIDAR and camera data. Unlike other methods, in order to obtain accurate position information for the segmentation results, the object of our semantic segmentation is a bird's-eye view converted from a LIDAR point cloud instead of an image captured by a camera. The method first uses the DeepLabv3+ network to segment the image captured by the camera, and the segmentation result is merged with the point cloud collected by the LIDAR as the input of the proposed network. In this neural network, we also add a long short-term memory (LSTM) structure to assist the network in the semantic segmentation of lane markings by using time-series information. Experiments on more than 14,000 images, which we manually labeled and expanded, show that the proposed method has better performance on the semantic segmentation of point-cloud bird's-eye views. Therefore, the automation of high-precision map construction can be significantly improved. Our code is available at this https URL.

38. A New Meta-Baseline for Few-Shot Learning [PDF] 返回目录
  Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, Trevor Darrell
Abstract: Meta-learning has become a popular framework for few-shot learning in recent years, with the goal of learning a model from collections of few-shot classification tasks. While more and more novel meta-learning models are being proposed, our research has uncovered simple baselines that have been overlooked. We present a Meta-Baseline method: by pre-training a classifier on all base classes and meta-learning on a nearest-centroid based few-shot classification algorithm, it outperforms recent state-of-the-art methods by a large margin. Why does this simple method work so well? In the meta-learning stage, we observe that a model generalizing better on unseen tasks from base classes can have decreasing performance on tasks from novel classes, indicating a potential objective discrepancy. We find that both pre-training and inheriting a good few-shot classification metric from the pre-trained classifier are important for Meta-Baseline, which potentially helps the model better utilize the pre-trained representations with stronger transferability. Furthermore, we investigate when we need meta-learning in this Meta-Baseline. Our work sets up a new solid benchmark for this field and sheds light on further understanding the phenomena in the meta-learning framework for few-shot learning.
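The nearest-centroid classifier at the heart of this baseline is a few lines; the sketch below assumes embeddings already computed by the pre-trained backbone and a temperature-scaled cosine similarity, with all shapes illustrative.

```python
import torch
import torch.nn.functional as F

def nearest_centroid_logits(support, support_y, query, n_way, tau=10.0):
    """support: (n_support, d) embeddings with labels support_y in [0, n_way);
    query: (n_query, d) embeddings. Returns (n_query, n_way) logits."""
    centroids = torch.stack(
        [support[support_y == c].mean(dim=0) for c in range(n_way)])
    sims = F.cosine_similarity(query[:, None, :], centroids[None, :, :], dim=-1)
    return tau * sims   # fed to cross-entropy during the meta-learning stage

support = torch.randn(25, 64)                     # a 5-way 5-shot task
support_y = torch.arange(5).repeat_interleave(5)
query = torch.randn(15, 64)
logits = nearest_centroid_logits(support, support_y, query, n_way=5)
```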

39. Category-wise Attack: Transferable Adversarial Examples for Anchor Free Object Detection [PDF] 返回目录
  Quanyu Liao, Xin Wang, Bin Kong, Siwei Lyu, Youbing Yin, Qi Song, Xi Wu
Abstract: Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbations can completely change the classification results. Their vulnerability has led to a surge of research in this direction. However, most works are dedicated to attacking anchor-based object detection models. In this work, we aim to present an effective and efficient algorithm that generates adversarial examples to attack anchor-free object detection models, based on two approaches. First, we conduct category-wise instead of instance-wise attacks on the object detectors. Second, we leverage high-level semantic information to generate the adversarial examples. Surprisingly, the generated adversarial examples are not only able to effectively attack the targeted anchor-free object detector but can also be transferred to attack other object detectors, even anchor-based detectors such as Faster R-CNN.

40. Crossmodal learning for audio-visual speech event localization [PDF] 返回目录
  Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan
Abstract: An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, and when, how, and where. Media content is rich in multiple modalities, such as visuals and audio, which can be used to learn speaker activity in videos. In this work, we present visual representations that carry implicit information about when someone is talking and where. We propose a crossmodal neural network for audio speech event detection using the visual frames. We use the learned representations for two downstream tasks: i) audio-visual voice activity detection and ii) active speaker localization in video frames. We present a state-of-the-art audio-visual voice activity detection system and demonstrate that the learned embeddings can effectively localize active speakers in the visual frames.

41. Video Caption Dataset for Describing Human Actions in Japanese [PDF] 返回目录
  Yutaro Shigeto, Yuya Yoshikawa, Jiaqing Lin, Akikazu Takeuchi
Abstract: In recent years, automatic video caption generation has attracted considerable attention. This paper focuses on the generation of Japanese captions for describing human actions. While most currently available video caption datasets have been constructed for English, there is no equivalent Japanese dataset. To address this, we constructed a large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in our dataset describes a video in the form of "who does what and where." To describe human actions, it is important to identify the details of a person, place, and action. Indeed, when we describe human actions, we usually mention the scene, person, and action. In our experiments, we evaluated two caption generation methods to obtain benchmark results. Further, we investigated whether those generation methods could specify "who does what and where."

42. Weighted Encoding Based Image Interpolation With Nonlocal Linear Regression Model [PDF] 返回目录
  Junchao Zhang
Abstract: Image interpolation is a special case of image super-resolution, where the low-resolution image is directly down-sampled from its high-resolution counterpart without blurring or noise. Therefore, the assumptions adopted in super-resolution models are not valid for image interpolation. To address this problem, we propose a novel image interpolation model based on sparse representation. Two widely used priors, sparsity and nonlocal self-similarity, are used as regularization terms to enhance the stability of the interpolation model. Meanwhile, we incorporate nonlocal linear regression into this model, since nonlocal similar patches could provide a better approximation to a given patch. Moreover, we propose a new approach that learns adaptive sub-dictionaries online instead of through clustering. For each patch, similar patches are grouped to learn an adaptive sub-dictionary, generating a sparser and more accurate representation. Finally, weighted encoding is introduced to suppress the tails of the fitting residuals in the data fidelity term. Abundant experimental results demonstrate that our proposed method outperforms several state-of-the-art methods in terms of quantitative measures and visual quality.
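The weighted data-fidelity idea can be illustrated with an iteratively reweighted least-squares sketch; a ridge penalty stands in for the paper's sparse model, the weight function is a guess, and the nonlocal regression and sub-dictionary learning are omitted.

```python
import numpy as np

def weighted_encoding(y, D, lam=0.1, iters=10):
    """Solve min_x ||W^(1/2)(y - Dx)||^2 + lam * ||x||^2, re-estimating the
    diagonal weights W from the residuals so heavy-tailed entries count less."""
    k = D.shape[1]
    x = np.linalg.lstsq(D, y, rcond=None)[0]
    for _ in range(iters):
        r = y - D @ x
        w = np.exp(-np.abs(r) / (np.abs(r).mean() + 1e-8))  # down-weight outliers
        DtW = D.T * w[None, :]                              # D^T W
        x = np.linalg.solve(DtW @ D + lam * np.eye(k), DtW @ y)
    return x

D = np.random.randn(64, 32)                     # toy dictionary
y = D @ np.random.randn(32) + 0.1 * np.random.randn(64)
x = weighted_encoding(y, D)
```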

43. Reconstruction of 3D flight trajectories from ad-hoc camera networks [PDF] 返回目录
  Jingtong Li, Jesse Murray, Dorina Ismaili, Konrad Schindler, Cenek Albl
Abstract: We present a method to reconstruct the 3D trajectory of an airborne robotic system only from videos recorded with cameras that are unsynchronized, may feature rolling shutter distortion, and whose viewpoints are unknown. Our approach enables robust and accurate outside-in tracking of dynamically flying targets, with cheap and easy-to-deploy equipment. We show that, in spite of the weakly constrained setting, recent developments in computer vision make it possible to reconstruct trajectories in 3D from unsynchronized, uncalibrated networks of consumer cameras, and validate the proposed method in a realistic field experiment. We make our code available along with the data, including cm-accurate ground-truth from differential GNSS navigation.

44. Computer Aided Diagnosis for Spitzoid lesions classification using Artificial Intelligence techniques [PDF] 返回目录
  Abir Belaala, Labib Sadek, Noureddine Zerhouni, Christine Devalland
Abstract: Spitzoid lesions may be largely categorized into Spitz Nevus, Atypical Spitz Tumors, and Spitz Melanomas. Classifying a lesion precisely as an Atypical Spitz Tumor (AST) is challenging and often requires the integration of clinical, histological, and immunohistochemical features to differentiate ASTs from regular Spitz nevi and malignant Spitz melanomas. Specifically, this paper aims to test several artificial intelligence techniques so as to build a computer-aided diagnosis system. A three-phase approach is proposed. In Phase I, the collected data are preprocessed with a Synthetic Minority Oversampling TEchnique (SMOTE)-based method to treat the imbalanced data problem. Then, in Phase II, a feature selection mechanism using a genetic algorithm (GA) is applied. Finally, in Phase III, a ten-fold cross-validation method is used to compare the performance of seven machine-learning algorithms for classification. Results obtained with a SMOTE-Multilayer Perceptron using 14 GA-selected features show the highest classification accuracy (0.98), a sensitivity of 0.99, and a specificity of 0.98, outperforming the other Spitzoid lesion classification algorithms.
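A skeleton of the three phases is easy to set up with scikit-learn and imbalanced-learn; the data below are synthetic, the MLP stands in for the best-performing classifier, and the genetic feature selection of Phase II is omitted.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the clinical features (14, as in the paper).
X, y = make_classification(n_samples=300, n_features=14, weights=[0.8],
                           random_state=0)

# Phase I: oversample the minority class.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Phase III: ten-fold cross-validation.
scores = cross_val_score(MLPClassifier(max_iter=500, random_state=0),
                         X_bal, y_bal, cv=10)
print(scores.mean())
```

Note that oversampling before the split lets synthetic samples leak across folds; resampling inside each training fold (e.g. with an imbalanced-learn Pipeline) is the safer design.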

45. TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning [PDF] 返回目录
  Fernando Pérez-García, Rachel Sparks, Sebastien Ourselin
Abstract: We present TorchIO, an open-source Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images for deep learning. It follows the design of PyTorch and relies on standard medical image processing libraries such as SimpleITK or NiBabel to efficiently process large 3D images during the training of convolutional neural networks. We provide multiple generic as well as magnetic-resonance-imaging-specific operations for preprocessing and augmentation of medical images. TorchIO is an open-source project with code, comprehensive examples and extensive documentation shared at this https URL.
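Basic usage follows the library's documentation; the file paths below are placeholders and the chosen transforms are just examples.

```python
import torchio as tio

subject = tio.Subject(
    mri=tio.ScalarImage('t1.nii.gz'),   # placeholder path
    seg=tio.LabelMap('seg.nii.gz'),     # placeholder path
)
transform = tio.Compose([
    tio.RandomAffine(),                 # spatial augmentation
    tio.RandomNoise(),                  # intensity augmentation
])
augmented = transform(subject)
```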

46. Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog [PDF] 返回目录
  Shen Gao, Xiuying Chen, Chang Liu, Li Liu, Dongyan Zhao, Rui Yan
Abstract: Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching the text labels of stickers with previous utterances. However, due to the large number of stickers, it is impractical to require text labels for all of them. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is to learn the semantic meaning of stickers without corresponding text labels. Another challenge is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolutional sticker image encoder and a self-attention based multi-turn dialog encoder to obtain representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependencies between all interaction results with a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves state-of-the-art performance on all commonly-used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 340K multi-turn dialog and sticker pairs.

47. Enabling Viewpoint Learning through Dynamic Label Generation [PDF] 返回目录
  Michael Schelling, Pedro Hermosilla, Pere-Pau Vazquez, Timo Ropinski
Abstract: Optimal viewpoint prediction is an essential task in many computer graphics applications. Unfortunately, common viewpoint qualities suffer from major drawbacks: dependency on clean surface meshes, which are not always available, insensitivity to upright orientation, and the lack of closed-form expressions, which requires a costly sampling process involving rendering. We overcome these limitations through a 3D deep learning approach, which solely exploits vertex coordinate information to predict optimal viewpoints under upright orientation, while reflecting both informational content and human preference analysis. To enable this approach we propose a dynamic label generation strategy, which resolves inherent label ambiguities during training. In contrast to previous viewpoint prediction methods, which evaluate many rendered views, we directly learn on the 3D mesh, and are thus independent from rendering. Furthermore, by exploiting unstructured learning, we are independent of mesh discretization. We show how the proposed technology enables learned prediction from model to viewpoints for different object categories and viewpoint qualities. Additionally, we show that prediction times are reduced from several minutes to a fraction of a second, as compared to viewpoint quality evaluation. We will release the code and training data, which will to our knowledge be the biggest viewpoint quality dataset available.

48. DIBS: Diversity inducing Information Bottleneck in Model Ensembles [PDF] 返回目录
  Samarth Sinha, Homanga Bharadhwaj, Anirudh Goyal, Hugo Larochelle, Animesh Garg, Florian Shkurti
Abstract: Although deep learning models have achieved state-of-the-art performance on a number of vision tasks, generalization over high-dimensional multi-modal data and reliable predictive uncertainty estimation are still active areas of research. Bayesian approaches, including Bayesian Neural Nets (BNNs), do not scale well to modern computer vision tasks, as they are difficult to train and have poor generalization under dataset shift. This motivates the need for effective ensembles that can generalize and give reliable uncertainty estimates. In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction. We explicitly optimize a diversity-inducing adversarial loss for learning the stochastic latent variables and thereby obtain the diversity in the output predictions necessary for modeling multi-modal data. We evaluate our method on benchmark datasets: MNIST, CIFAR100, TinyImageNet and MIT Places 2. Compared to the most competitive baselines, it shows significant improvements in classification accuracy under a shift in the data distribution and in out-of-distribution detection. Code will be released at this https URL.

49. Spine intervertebral disc labeling using a fully convolutional redundant counting model [PDF] 返回目录
  Lucas Rouhier, Francisco Perdigon Romero, Joseph Paul Cohen, Julien Cohen-Adad
Abstract: Labeling intervertebral discs is relevant as it notably enables clinicians to understand the relationship between a patient's symptoms (pain, paralysis) and the exact level of spinal cord injury. However, manually labeling those discs is a tedious and user-biased task which would benefit from automated methods. While some automated methods already exist for MRI and CT scans, they are either not publicly available or fail to generalize across various imaging contrasts. In this paper, we combine a Fully Convolutional Network (FCN) with inception modules to localize and label intervertebral discs. We demonstrate a proof-of-concept application on a publicly available multi-center and multi-contrast MRI database (n=235 subjects). The code is publicly available at this https URL.

50. Automatic segmentation of spinal multiple sclerosis lesions: How to generalize across MRI contrasts? [PDF] 返回目录
  Olivier Vincent, Charley Gros, Joseph Paul Cohen, Julien Cohen-Adad
Abstract: Despite recent improvements in medical image segmentation, the ability to generalize across imaging contrasts remains an open issue. To tackle this challenge, we implement Feature-wise Linear Modulation (FiLM) to leverage physics knowledge within the segmentation model and learn the characteristics of each contrast. Interestingly, a well-optimised U-Net reached the same performance as our FiLMed-Unet on a multi-contrast dataset (Dice score of 0.72), which suggests that there is a bottleneck in spinal MS lesion segmentation distinct from generalization across varying contrasts. This bottleneck likely stems from inter-rater variability, which is estimated at a Dice score of 0.61 in our dataset.
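A FiLM layer itself is tiny; the PyTorch sketch below conditions feature maps on an embedding of the MRI contrast, with all dimensions illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift each channel with
    parameters predicted from a conditioning vector."""
    def __init__(self, n_channels, cond_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, x, cond):
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over H and W
        beta = beta[:, :, None, None]
        return gamma * x + beta

feats = torch.randn(2, 32, 64, 64)        # U-Net feature maps (N, C, H, W)
contrast = torch.randn(2, 8)              # embedding of the imaging contrast
out = FiLM(32, 8)(feats, contrast)
```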
