Abstracts

1. Learn2Perturb: an End-to-end Feature Perturbation Learning to Improve Adversarial Robustness
Ahmadreza Jeddi, Mohammad Javad Shafiee, Michelle Karg, Christian Scharfenberger, Alexander Wong
Abstract: While deep neural networks have been achieving state-of-the-art performance across a wide variety of applications, their vulnerability to adversarial attacks limits their widespread deployment for safety-critical applications. Alongside other adversarial defense approaches being investigated, there has been very recent interest in improving adversarial robustness in deep neural networks through the introduction of perturbations during the training process. However, such methods leverage fixed, pre-defined perturbations and require significant hyper-parameter tuning, which makes them very difficult to apply in a general fashion. In this study, we introduce Learn2Perturb, an end-to-end feature perturbation learning approach for improving the adversarial robustness of deep neural networks. More specifically, we introduce novel perturbation-injection modules that are incorporated at each layer to perturb the feature space and increase uncertainty in the network. This feature perturbation is performed at both the training and the inference stages. Furthermore, inspired by Expectation-Maximization, an alternating back-propagation training algorithm is introduced to train the network and noise parameters consecutively. Experimental results on the CIFAR-10 and CIFAR-100 datasets show that the proposed Learn2Perturb method can result in deep neural networks that are $4-7\%$ more robust against $l_{\infty}$ FGSM and PGD adversarial attacks and that significantly outperform the state of the art against the $l_2$ $C\&W$ attack and a wide range of well-known black-box attacks.
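
The perturbation-injection idea can be illustrated with a minimal sketch. It assumes additive zero-mean Gaussian noise with one learnable scale per feature; the abstract does not specify the noise distribution, and `perturb_features` and `sigmas` are hypothetical names, not the authors' implementation. The alternating training of network and noise parameters is not shown.

```python
import random


def perturb_features(features, sigmas, rng):
    """Add zero-mean Gaussian noise to each feature, scaled by its own sigma.

    Assumption: additive Gaussian noise with one learnable scale per feature.
    The abstract only states that the perturbation is learned and injected
    at each layer, during both training and inference.
    """
    if len(features) != len(sigmas):
        raise ValueError("features and sigmas must have the same length")
    return [f + s * rng.gauss(0.0, 1.0) for f, s in zip(features, sigmas)]


rng = random.Random(0)          # seeded so the sketch is reproducible
features = [0.5, -1.2, 3.0]     # toy activations of one layer
sigmas = [0.1, 0.0, 0.3]        # hypothetical learned noise scales
noisy = perturb_features(features, sigmas, rng)
```

A scale of zero leaves the corresponding feature untouched, so in this sketch the learned scales decide where the network is perturbed.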

2. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry
Nan Yang, Lukas von Stumberg, Rui Wang, Daniel Cremers
Abstract: We propose D3VO as a novel framework for monocular visual odometry that exploits deep networks on three levels -- deep depth, pose and uncertainty estimation. We first propose a novel self-supervised monocular depth estimation network trained on stereo videos without any external supervision. In particular, it aligns the training image pairs into similar lighting condition with predictive brightness transformation parameters. Besides, we model the photometric uncertainties of pixels on the input images, which improves the depth estimation accuracy and provides a learned weighting function for the photometric residuals in direct (feature-less) visual odometry. Evaluation results show that the proposed network outperforms state-of-the-art self-supervised depth estimation networks. D3VO tightly incorporates the predicted depth, pose and uncertainty into a direct visual odometry method to boost both the front-end tracking as well as the back-end non-linear optimization. We evaluate D3VO in terms of monocular visual odometry on both the KITTI odometry benchmark and the EuRoC MAV dataset. The results show that D3VO outperforms state-of-the-art traditional monocular VO methods by a large margin. It also achieves comparable results to state-of-the-art stereo/LiDAR odometry on KITTI and to the state-of-the-art visual-inertial odometry on EuRoC MAV, while using only a single camera.

3. Plug & Play Convolutional Regression Tracker for Video Object Detection
Ye Lyu, Michael Ying Yang, George Vosselman, Gui-Song Xia
Abstract: Video object detection aims to simultaneously localize the bounding boxes of the objects and identify their classes in a given video. One challenge for video object detection is to consistently detect all objects across the whole video. As the appearance of objects may deteriorate in some frames, features or detections from the other frames are commonly used to enhance the prediction. In this paper, we propose a Plug & Play scale-adaptive convolutional regression tracker for the video object detection task, which can be easily and compatibly integrated into current state-of-the-art detection networks. As the tracker reuses the features from the detector, it is a very lightweight addition to the detection network. The whole network runs at a speed close to that of a standard object detector. With our new video object detection pipeline design, image object detectors can be easily turned into efficient video object detectors without modifying any parameters. The performance is evaluated on the large-scale ImageNet VID dataset. Our Plug & Play design improves the mAP score of the image detector by around 5% with only a small drop in speed.

4. DriverMHG: A Multi-Modal Dataset for Dynamic Recognition of Driver Micro Hand Gestures and a Real-Time Recognition Framework
Okan Köpüklü, Thomas Ledwon, Yao Rong, Neslihan Kose, Gerhard Rigoll
Abstract: The use of hand gestures provides a natural alternative to cumbersome interface devices for Human-Computer Interaction (HCI) systems. However, real-time recognition of dynamic micro hand gestures from video streams is challenging for in-vehicle scenarios since (i) the gestures should be performed naturally without distracting the driver, (ii) micro hand gestures occur within very short time intervals at spatially constrained areas, (iii) the performed gesture should be recognized only once, and (iv) the entire architecture should be designed lightweight as it will be deployed to an embedded system. In this work, we propose an HCI system for dynamic recognition of driver micro hand gestures, which can have a crucial impact in automotive sector especially for safety related issues. For this purpose, we initially collected a dataset named Driver Micro Hand Gestures (DriverMHG), which consists of RGB, depth and infrared modalities. The challenges for dynamic recognition of micro hand gestures have been addressed by proposing a lightweight convolutional neural network (CNN) based architecture which operates online efficiently with a sliding window approach. For the CNN model, several 3-dimensional resource efficient networks are applied and their performances are analyzed. Online recognition of gestures has been performed with 3D-MobileNetV2, which provided the best offline accuracy among the applied networks with similar computational complexities. The final architecture is deployed on a driver simulator operating in real-time. We make DriverMHG dataset and our source code publicly available.

5. Always Look on the Bright Side of the Field: Merging Pose and Contextual Data to Estimate Orientation of Soccer Players
Adrià Arbués-Sangüesa, Adrián Martín, Javier Fernández, Carlos Rodríguez, Gloria Haro, Coloma Ballester
Abstract: Although orientation has proven to be a key skill of soccer players in order to succeed in a broad spectrum of plays, body orientation is a yet-little-explored area in sports analytics' research. Despite being an inherently ambiguous concept, player orientation can be defined as the projection (2D) of the normal vector placed in the center of the upper-torso of players (3D). This research presents a novel technique to obtain player orientation from monocular video recordings by mapping pose parts (shoulders and hips) in a 2D field by combining OpenPose with a super-resolution network, and merging the obtained estimation with contextual information (ball position). Results have been validated with players-held EPTS devices, obtaining a median error of 27 degrees/player. Moreover, three novel types of orientation maps are proposed in order to make raw orientation data easy to visualize and understand, thus allowing further analysis at team- or player-level.
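
The orientation definition above can be illustrated with a short sketch. It assumes the shoulders have already been mapped to 2D field coordinates (x to the right, y up, seen from above) and takes the facing direction as the perpendicular to the shoulder line; `facing_angle_deg` and `angular_error_deg` are hypothetical helpers, and the paper's actual pipeline (OpenPose, super-resolution, ball context) is not reproduced here.

```python
import math


def facing_angle_deg(left_shoulder, right_shoulder):
    """Facing direction in degrees, measured counter-clockwise from the +x axis.

    Assumption: field coordinates seen from above with x right and y up.
    The facing vector is the right-to-left shoulder vector rotated by
    90 degrees clockwise, i.e. the 2D normal of the upper torso.
    """
    dx = left_shoulder[0] - right_shoulder[0]
    dy = left_shoulder[1] - right_shoulder[1]
    fx, fy = dy, -dx
    return math.degrees(math.atan2(fy, fx)) % 360.0


def angular_error_deg(estimate, reference):
    """Smallest absolute difference between two angles, in [0, 180]."""
    return abs((estimate - reference + 180.0) % 360.0 - 180.0)


# A player whose left shoulder is to the west and right shoulder to the east
# is facing north (+y).
angle = facing_angle_deg((-1.0, 0.0), (1.0, 0.0))
error = angular_error_deg(350.0, 10.0)
```

The wrap-around in `angular_error_deg` matters when reporting errors such as the median of 27 degrees per player: 350 and 10 degrees differ by 20 degrees, not 340.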

6. Deep Meditations: Controlled navigation of latent space
Memo Akten, Rebecca Fiebrink, Mick Grierson
Abstract: We introduce a method which allows users to creatively explore and navigate the vast latent spaces of deep generative models. Specifically, our method enables users to discover and design trajectories in these high dimensional spaces, to construct stories, and to produce time-based media such as videos, with meaningful control over narrative. Our goal is to encourage and aid the use of deep generative models as a medium for creative expression and storytelling with meaningful human control. Our method is analogous to traditional video production pipelines in that we use a conventional non-linear video editor with proxy clips, and conform with arrays of latent space vectors. Examples can be seen at this http URL.

7. Learning Fast and Robust Target Models for Video Object Segmentation
Andreas Robinson, Felix Järemo Lawin, Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg
Abstract: Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time. The main difficulty is to effectively handle appearance changes and similar background objects, while maintaining accurate segmentation. Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting. More recent methods integrate generative target appearance models, but either achieve limited robustness or require large amounts of training data. We propose a novel VOS architecture consisting of two network components. The target appearance model consists of a light-weight module, learned during the inference stage using fast optimization techniques to predict a coarse but robust target segmentation. The segmentation model is exclusively trained offline, designed to process the coarse scores into high quality segmentation masks. Our method is fast, easily trainable and remains highly effective in cases of limited training data. We perform extensive experiments on the challenging YouTube-VOS and DAVIS datasets. Our network achieves favorable performance, while operating at significantly higher frame-rates compared to state-of-the-art. Code is available at this https URL.

8. Learning to See: You Are What You See
Memo Akten, Rebecca Fiebrink, Mick Grierson
Abstract: The authors present a visual instrument developed as part of the creation of the artwork Learning to See. The artwork explores bias in artificial neural networks and provides mechanisms for the manipulation of specifically trained for real-world representations. The exploration of these representations acts as a metaphor for the process of developing a visual understanding and/or visual vocabulary of the world. These representations can be explored and manipulated in real time, and have been produced in such a way so as to reflect specific creative perspectives that call into question the relationship between how both artificial neural networks and humans may construct meaning.

9. Revisiting Convolutional Neural Networks for Urban Flow Analytics
Yuxuan Liang, Kun Ouyang, Junbo Zhang, Yu Zheng, David S. Rosenblum
Abstract: Convolutional Neural Networks (CNNs) have been widely adopted in raster-based urban flow analytics by virtue of their capability in capturing nearby spatial context. By revisiting CNN-based methods for different analytics tasks, we expose two common critical drawbacks in the existing uses: 1) inefficiency in learning global context, and 2) overlooking latent region functions. To tackle these challenges, in this paper we present a novel framework entitled DeepLGR that can be easily generalized to address various urban flow analytics problems. This framework consists of three major parts: 1) a local context module to learn local representations of each region; 2) a global context module to extract global contextual priors and upsample them to generate the global features; and 3) a region-specific predictor based on tensor decomposition to provide customized predictions for each region, which is very parameter-efficient compared to previous methods. Extensive experiments on two typical urban analytics tasks demonstrate the effectiveness, stability, and generality of our framework.

10. Gated Fusion Network for Degraded Image Super Resolution
Xinyi Zhang, Hang Dong, Zhe Hu, Wei-Sheng Lai, Fei Wang, Ming-Hsuan Yang
Abstract: Single image super resolution aims to enhance image quality with respect to spatial content, which is a fundamental task in computer vision. In this work, we address the task of single frame super resolution with the presence of image degradation, e.g., blur, haze, or rain streaks. Due to the limitations of frame capturing and formation processes, image degradation is inevitable, and the artifacts would be exacerbated by super resolution methods. To address this problem, we propose a dual-branch convolutional neural network to extract base features and recovered features separately. The base features contain local and global information of the input image. On the other hand, the recovered features focus on the degraded regions and are used to remove the degradation. Those features are then fused through a recursive gate module to obtain sharp features for super resolution. By decomposing the feature extraction step into two task-independent streams, the dual-branch model can facilitate the training process by avoiding learning the mixed degradation all-in-one and thus enhance the final high-resolution prediction results. We evaluate the proposed method in three degradation scenarios. Experiments on these scenarios demonstrate that the proposed method performs more efficiently and favorably against the state-of-the-art approaches on benchmark datasets.

11. Instance Separation Emerges from Inpainting
Steffen Wolf, Fred A. Hamprecht, Jan Funke
Abstract: Deep neural networks trained to inpaint partially occluded images show a deep understanding of image composition and have even been shown to remove objects from images convincingly. In this work, we investigate how this implicit knowledge of image composition can be leveraged for fully self-supervised instance separation. We propose a measure for the independence of two image regions given a fully self-supervised inpainting network and separate objects by maximizing this independence. We evaluate our method on two microscopy image datasets and show that it reaches similar segmentation performance to fully supervised methods.

12. 3D Object Detection From LiDAR Data Using Distance Depended Feature Extraction
Guus Engels, Nerea Aranjuelo, Ignacio Arganda-Carreras, Marcos Nieto, Oihana Otaegui
Abstract: This paper presents a new approach to 3D object detection that leverages the properties of the data obtained by a LiDAR sensor. State-of-the-art detectors use neural network architectures based on assumptions valid for camera images. However, point clouds obtained from LiDAR are fundamentally different. Most detectors use shared filter kernels to extract features which do not take into account the range dependent nature of the point cloud features. To show this, different detectors are trained on two splits of the KITTI dataset: close range (points up to 25 meters from LiDAR) and long-range. Top view images are generated from point clouds as input for the networks. Combined results outperform the baseline network trained on the full dataset with a single backbone. Additional research compares the effect of using different input features when converting the point cloud to image. The results indicate that the network focuses on the shape and structure of the objects, rather than exact values of the input. This work proposes an improvement for 3D object detectors by taking into account the properties of LiDAR point clouds over distance. Results show that training separate networks for close-range and long-range objects boosts performance for all KITTI benchmark difficulties.
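
The close-range versus long-range split described above can be sketched as follows. The 25 meter threshold comes from the abstract; measuring distance in the ground plane (ignoring height) and the name `split_by_range` are assumptions made for illustration only.

```python
import math


def split_by_range(points, threshold_m=25.0):
    """Partition (x, y, z) points into close-range and long-range lists.

    Assumption: the sensor sits at the origin and distance is measured in
    the ground plane; the abstract only says "up to 25 meters from LiDAR".
    """
    close, far = [], []
    for x, y, z in points:
        if math.hypot(x, y) <= threshold_m:
            close.append((x, y, z))
        else:
            far.append((x, y, z))
    return close, far


cloud = [(3.0, 4.0, 0.0), (30.0, 0.0, 1.0), (15.0, 20.0, -1.0)]
close_points, far_points = split_by_range(cloud)
```

Each subset would then be rendered to its own top-view image and passed to a separately trained network, which is the setup the abstract compares against a single shared backbone.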

13. Adversarial Perturbations Prevail in the Y-Channel of the YCbCr Color Space
Camilo Pestana, Naveed Akhtar, Wei Liu, David Glance, Ajmal Mian
Abstract: Deep learning offers state-of-the-art solutions for image recognition. However, deep models are vulnerable to adversarial perturbations in images that are subtle but significantly change the model's prediction. In a white-box attack, these perturbations are generally learned for deep models that operate on RGB images and, hence, the perturbations are equally distributed in the RGB color space. In this paper, we show that the adversarial perturbations prevail in the Y-channel of the YCbCr space. Our finding is motivated by the fact that human vision and deep models are more responsive to shape and texture than to color. Based on our finding, we propose a defense against adversarial images. Our defense, coined ResUpNet, removes perturbations only from the Y-channel by exploiting ResNet features in an upsampling framework without the need for a bottleneck. At the final stage, the untouched CbCr-channels are combined with the refined Y-channel to restore the clean image. Note that ResUpNet is model agnostic as it does not modify the DNN structure. ResUpNet is trained end-to-end in PyTorch and the results are compared to existing defense techniques in the input transformation category. Our results show that our approach achieves the best balance between defense against adversarial attacks such as FGSM, PGD and DDN and maintaining the original accuracies of VGG-16, ResNet50 and DenseNet121 on clean images. We perform another experiment to show that learning adversarial perturbations only for the Y-channel results in higher fooling rates for the same perturbation magnitude.
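
The final recombination step (refined Y plus untouched CbCr) can be sketched for a single pixel. The conversion uses the full-range BT.601 constants found in JFIF; whether the paper uses exactly this variant is an assumption, and `refine_y` is a hypothetical stand-in for the ResUpNet network, which is not reproduced here.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) RGB -> YCbCr for values in [0, 255]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Inverse of rgb_to_ycbcr (no clipping or rounding applied)."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b


def restore_pixel(adversarial_rgb, refine_y):
    """Refine only the luma channel and keep the chroma channels untouched.

    refine_y is a hypothetical callable standing in for the learned defense.
    """
    y, cb, cr = rgb_to_ycbcr(*adversarial_rgb)
    return ycbcr_to_rgb(refine_y(y), cb, cr)


clean = (200.0, 100.0, 50.0)
y, cb, cr = rgb_to_ycbcr(*clean)
adversarial = ycbcr_to_rgb(y + 8.0, cb, cr)        # toy perturbation on Y only
restored = restore_pixel(adversarial, lambda luma: luma - 8.0)
```

In this toy case the perturbation lives entirely in Y, so undoing it there recovers the clean pixel while Cb and Cr pass through unchanged.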

14. The perceptual boost of visual attention is task-dependent in naturalistic settings
Freddie Bickford Smith, Xiaoliang Luo, Brett D. Roads, Bradley C. Love
Abstract: Attentional modulation of neural representations is known to enhance processing of task-relevant visual information. Is the resulting perceptual boost task-dependent in naturalistic settings? We aim to answer this with a large-scale computational experiment. First we design a series of visual tasks, each consisting of classifying images from a particular task set (group of image categories). The nature of a given task is determined by which categories are included in the task set. Then on each task we compare the accuracy of an attention-augmented neural network to that of an attention-free counterpart. We show that, all else being equal, the performance impact of attention is stronger with increasing task-set difficulty, weaker with increasing task-set size, and weaker with increasing perceptual similarity within a task set.

15. Introducing Fuzzy Layers for Deep Learning
Stanton R. Price, Steven R. Price, Derek T. Anderson
Abstract: Many state-of-the-art technologies developed in recent years have been influenced by machine learning to some extent. Most popular at the time of this writing are artificial intelligence methodologies that fall under the umbrella of deep learning. Deep learning has been shown across many applications to be extremely powerful and capable of handling problems that possess great complexity and difficulty. In this work, we introduce a new layer to deep learning: the fuzzy layer. Traditionally, the network architecture of neural networks is composed of an input layer, some combination of hidden layers, and an output layer. We propose the introduction of fuzzy layers into the deep learning architecture to exploit the powerful aggregation properties expressed through fuzzy methodologies, such as the Choquet and Sugeno fuzzy integrals. To date, fuzzy approaches taken to deep learning have been through the application of various fusion strategies at the decision level to aggregate outputs from state-of-the-art pre-trained models, e.g., AlexNet, VGG16, GoogLeNet, Inception-v3, ResNet-18, etc. While these strategies have been shown to improve accuracy for image classification tasks, none have explored the use of fuzzified intermediate, or hidden, layers. Herein, we present a new deep learning strategy that incorporates fuzzy strategies into the deep learning architecture focused on the application of semantic segmentation using per-pixel classification. Experiments are conducted on a benchmark data set as well as a data set collected via an unmanned aerial system at a U.S. Army test site for the task of automatic road segmentation, and preliminary results are promising.

16. Estimating a Null Model of Scientific Image Reuse to Support Research Integrity Investigations
Daniel E. Acuna, Ziyue Xiang
Abstract: When there is a suspicious figure reuse case in science, research integrity investigators often find it difficult to rebut authors claiming that "it happened by chance". In other words, when there is a "collision" of image features, it is difficult to justify whether it appears rarely or not. In this article, we provide a method to predict the rarity of an image feature by statistically estimating the chance of it randomly occurring across all scientific imagery. Our method is based on high-dimensional density estimation of ORB features using 7+ million images in the PubMed Open Access Subset dataset. We show that this method can lead to meaningful feedback during research integrity investigations by providing a null hypothesis for scientific image reuse and thus a p-value during deliberations. We apply the model to a sample of increasingly complex imagery and confirm that it produces decreasingly smaller p-values as expected. We discuss applications to research integrity investigations as well as future work.

17. A Multi-view Perspective of Self-supervised Learning
Chuanxing Geng, Zhenghao Tan, Songcan Chen
Abstract: As a newly emerging unsupervised learning paradigm, self-supervised learning (SSL) has recently gained widespread attention; it usually introduces a pretext task that needs no manual annotation of data. With the help of this task, SSL effectively learns feature representations beneficial for downstream tasks. Thus the pretext task plays a key role. However, the study of its design, and especially of its essence, is still open. In this paper, we borrow a multi-view perspective to decouple a class of popular pretext tasks into a combination of view data augmentation (VDA) and view label classification (VLC), where we attempt to explore the essence of such pretext tasks while providing some insights into their design. Specifically, a simple multi-view learning framework (SSL-MV) is specially designed, which assists the feature learning of downstream tasks (original view) through the same tasks on the augmented views. SSL-MV focuses on VDA while abandoning VLC, empirically uncovering that it is VDA, rather than the generally considered VLC, that dominates the performance of such SSL. Additionally, thanks to replacing VLC with VDA tasks, SSL-MV also enables an integrated inference combining the predictions from the augmented views, further improving the performance. Experiments on several benchmark datasets demonstrate its advantages.

18. Predicting TUG score from gait characteristics based on video analysis and machine learning
Jian Ma
Abstract: Falls are a leading cause of death among the elderly and a burden on society. The Timed Up and Go (TUG) test is a common tool for fall risk assessment. In this paper, we propose a method for predicting the TUG score from gait characteristics extracted from video, based on computer vision and machine learning technologies. First, 3D pose is estimated from video captured with 2D and 3D cameras during human motion, and then a group of gait characteristics is computed from the 3D pose series. After that, copula entropy is used to select the characteristics that are most strongly associated with the TUG score. Finally, the selected characteristics are fed into predictive models to predict the TUG score. Experiments on real-world data demonstrate the effectiveness of the proposed method. As a byproduct, associations between the TUG score and several gait characteristics are discovered, which lay the scientific foundation of the proposed method and make the resulting predictive models interpretable to clinical users.

19. Few-shot Learning with Weakly-supervised Object Localization
Jinfu Lin, Xiaojian He
Abstract: Few-shot learning (FSL) aims to learn novel visual categories from very few samples, which is a challenging problem in real-world applications. Many data generation methods have improved the performance of FSL models, but require lots of annotated images to train a specialized network (e.g., GAN) dedicated to hallucinating new samples. We argue that localization is a more efficient approach because it provides the most discriminative regions without using extra samples. In this paper, we propose a novel method to address the FSL task by achieving weakly-supervised object localization while performing few-shot classification. To this end, we design (i) a triplet-input module to obtain the initial object seeds and (ii) an Image-To-Class-Distance (ITCD) based localizer to activate the deep descriptors of the key objects, thus obtaining more discriminative representations for few-shot classification. Extensive experiments show that our method outperforms the state-of-the-art methods on benchmark datasets under various settings. Besides, our method achieves superior performance over previous methods when training the model on miniImageNet and evaluating it on different datasets (e.g., Stanford Dogs), demonstrating its superior generalization capacity. Extra visualization shows that the proposed method can localize the key objects accurately.

20. AlignSeg: Feature-Aligned Segmentation Networks
Zilong Huang, Yunchao Wei, Xinggang Wang, Honghui Shi, Wenyu Liu, Thomas S. Huang
Abstract: Aggregating features in terms of different convolutional blocks or contextual embeddings has been proven to be an effective way to strengthen feature representations for semantic segmentation. However, most of the current popular network architectures tend to ignore the misalignment issues during the feature aggregation process caused by 1) step-by-step downsampling operations, and 2) indiscriminate contextual information fusion. In this paper, we explore the principles in addressing such feature misalignment issues and inventively propose Feature-Aligned Segmentation Networks (AlignSeg). AlignSeg consists of two primary modules, i.e., the Aligned Feature Aggregation (AlignFA) module and the Aligned Context Modeling (AlignCM) module. First, AlignFA adopts a simple learnable interpolation strategy to learn transformation offsets of pixels, which can effectively relieve the feature misalignment issue caused by multiresolution feature aggregation. Second, with the contextual embeddings in hand, AlignCM enables each pixel to choose private custom contextual information in an adaptive manner, making the contextual embeddings aligned better to provide appropriate guidance. We validate the effectiveness of our AlignSeg network with extensive experiments on Cityscapes and ADE20K, achieving new state-of-the-art mIoU scores of 82.6% and 45.95%, respectively. Our source code will be made available.
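Summary: Aggregating features across convolutional blocks or contextual embeddings strengthens representations for semantic segmentation, but most popular architectures ignore the misalignment caused by step-by-step downsampling and indiscriminate context fusion. AlignSeg addresses this with two modules: Aligned Feature Aggregation (AlignFA), which learns pixel transformation offsets through a simple learnable interpolation strategy, and Aligned Context Modeling (AlignCM), which lets each pixel adaptively choose its own contextual information. AlignSeg reaches new state-of-the-art mIoU scores of 82.6% on Cityscapes and 45.95% on ADE20K. The source code will be made available.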

21. Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation
Myeongjin Kim, Hyeran Byun
Abstract: Since annotating pixel-level labels for semantic segmentation is laborious, leveraging synthetic data is an attractive solution. However, due to the domain gap between the synthetic and real domains, it is challenging for a model trained with synthetic data to generalize to real data. In this paper, considering texture to be the fundamental difference between the two domains, we propose a method to adapt to the texture of the target domain. First, we diversify the texture of synthetic images using a style transfer algorithm. The various textures of the generated images prevent a segmentation model from overfitting to one specific (synthetic) texture. Then, we fine-tune the model with self-training to get direct supervision of the target texture. Our results achieve state-of-the-art performance, and we analyze the properties of the model trained on the stylized dataset with extensive experiments.
Summary: Synthetic data avoids laborious pixel-level annotation, but the domain gap makes it hard for models trained on it to generalize to real data. Treating texture as the fundamental difference between the two domains, the authors diversify the texture of synthetic images with a style transfer algorithm to prevent overfitting to a single synthetic texture, then fine-tune the model with self-training for direct supervision on the target texture. The method achieves state-of-the-art performance.

22. Exposing Backdoors in Robust Machine Learning Models
Ezekiel Soremekun, Sakshi Udeshi, Sudipta Chattopadhyay, Andreas Zeller
Abstract: The introduction of robust optimisation has pushed the state-of-the-art in defending against adversarial attacks. However, the behaviour of such optimisation has not been studied in the light of a fundamentally different class of attacks called backdoors. In this paper, we demonstrate that adversarially robust models are susceptible to backdoor attacks. Subsequently, we observe that backdoors are reflected in the feature representation of such models. Then, this is leveraged to detect backdoor-infected models. Specifically, we use feature clustering to effectively detect backdoor-infected robust Deep Neural Networks (DNNs). In our evaluation of major classification tasks, our approach effectively detects robust DNNs infected with backdoors. Our investigation reveals that salient features of adversarially robust DNNs break the stealthy nature of backdoor attacks.
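Summary: Robust optimisation has advanced the state of the art in defending against adversarial attacks, but its behaviour under backdoor attacks has not been studied. The authors show that adversarially robust models are susceptible to backdoor attacks, observe that backdoors are reflected in the feature representations of such models, and use feature clustering to detect backdoor-infected robust Deep Neural Networks (DNNs). In their evaluation on major classification tasks, the approach detects infected robust DNNs effectively.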

23. Triangle-Net: Towards Robustness in Point Cloud Classification
Chenxi Xiao, Juan Wachs
Abstract: 3D object recognition is becoming a key desired capability for many computer vision systems such as autonomous vehicles, service robots and surveillance drones to operate more effectively in unstructured environments. These real-time systems require effective classification methods that are robust to sampling resolution, measurement noise, and pose configuration of the objects. Previous research has shown that sparsity, rotation and positional variance of points can lead to a significant drop in the performance of point cloud based classification techniques. In this regard, we propose a novel approach for 3D classification that takes sparse point clouds as input and learns a model that is robust to rotational and positional variance as well as point sparsity. To this end, we introduce new feature descriptors which are fed as an input to our proposed neural network in order to learn a robust latent representation of the 3D object. We show that such latent representations can significantly improve the performance of object classification and retrieval. Further, we show that our approach outperforms PointNet and 3DmFV by 34.4% and 27.4% respectively in classification tasks using sparse point clouds of only 16 points under arbitrary SO(3) rotation.
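Summary: 3D object recognition for systems such as autonomous vehicles, service robots and surveillance drones needs classifiers that are robust to sampling resolution, measurement noise and object pose. The authors propose new feature descriptors that are fed to a neural network to learn a latent representation robust to rotation, position and point sparsity. On classification with sparse point clouds of only 16 points under arbitrary SO(3) rotation, the approach outperforms PointNet and 3DmFV by 34.4% and 27.4%, respectively.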

24. Deep Learning on Radar Centric 3D Object Detection
Seungjun Lee
Abstract: Even though many existing 3D object detection algorithms rely mostly on cameras and LiDAR, both sensors are prone to be affected by harsh weather and lighting conditions. Radar, on the other hand, is resistant to such conditions. However, research has only recently begun to apply deep neural networks to radar data. In this paper, we introduce a deep learning approach to 3D object detection with radar only. To the best of our knowledge, we are the first to demonstrate a deep learning-based 3D object detection model that uses radar only and was trained on the public radar dataset. To overcome the lack of labeled radar data, we propose a novel way of making use of abundant LiDAR data by transforming it into radar-like point cloud data, together with aggressive radar augmentation techniques.
Summary: Cameras and LiDAR, which most 3D object detectors rely on, are affected by harsh weather and lighting, whereas radar is resistant to such conditions. The authors introduce a deep learning approach to 3D object detection that uses radar only and, to their knowledge, are the first to demonstrate such a model trained on the public radar dataset. To compensate for the lack of labeled radar data, they transform abundant LiDAR data into radar-like point clouds and apply aggressive radar augmentation.

25. Learning to Deblur and Generate High Frame Rate Video with an Event Camera
Chen Haoyu, Teng Minggui, Shi Boxin, Wang YIzhou, Huang Tiejun
Abstract: Event cameras are bio-inspired cameras which can measure the change of intensity asynchronously with high temporal resolution. One of the event cameras' advantages is that they do not suffer from motion blur when recording high-speed scenes. In this paper, we formulate the deblurring task on traditional cameras directed by events to be a residual learning one, and we propose corresponding network architectures for effective learning of deblurring and high frame rate video generation tasks. We first train a modified U-Net network to restore a sharp image from a blurry image using corresponding events. Then we train another similar network with different downsampling blocks to generate high frame rate video using the restored sharp image and events. Experiment results show that our method can restore sharper images and videos than state-of-the-art methods.
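Summary: Event cameras are bio-inspired cameras that measure intensity changes asynchronously with high temporal resolution and do not suffer from motion blur in high-speed scenes. The authors formulate event-guided deblurring for traditional cameras as a residual learning task. They first train a modified U-Net to restore a sharp image from a blurry image and its corresponding events, then train a similar network with different downsampling blocks to generate high frame rate video from the restored image and events. Experiments show sharper images and videos than state-of-the-art methods.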

26. FPGA Implementation of Minimum Mean Brightness Error Bi-Histogram Equalization
Abhishek Saroha, Avichal Rakesh, Rajiv Kumar Tripathi
Abstract: Histogram Equalization (HE) is a popular method for contrast enhancement. In general, HE does not preserve mean brightness. Bi-Histogram Equalization (BBHE) was initially proposed to enhance contrast while maintaining the mean brightness. However, when mean brightness is the primary concern, Minimum Mean Brightness Error Bi-Histogram Equalization (MMBEBHE) is the best technique. There are several implementations of Histogram Equalization on FPGAs; however, to our knowledge, MMBEBHE has not been implemented on FPGAs before. Therefore, we present an implementation of MMBEBHE on an FPGA.
Summary: Histogram Equalization (HE) is a popular contrast enhancement method that generally does not preserve mean brightness. Bi-Histogram Equalization (BBHE) enhances contrast while maintaining mean brightness, and Minimum Mean Brightness Error Bi-Histogram Equalization (MMBEBHE) is the best technique when mean brightness is the primary concern. HE has been implemented on FPGAs several times, but to the authors' knowledge MMBEBHE has not, so they present an FPGA implementation of it.
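
The difference between BBHE and MMBEBHE can be illustrated with a minimal Python sketch. It is not the authors' FPGA design. It assumes the image is a flat list of integer gray levels, splits the histogram at a threshold, equalizes each half onto its own output range, and (for MMBEBHE) searches exhaustively for the threshold whose output mean is closest to the input mean. The function names `bbhe`, `bbhe_mean_split`, and `mmbebhe` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def bbhe(pixels, threshold, levels=256):
    """Equalize the sub-histograms below and above `threshold` independently.

    Gray levels <= threshold are mapped into [0, threshold]; the rest are
    mapped into [threshold + 1, levels - 1].
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    def build_map(lo, hi):
        # Classic CDF-based equalization restricted to levels lo..hi.
        total = sum(hist[lo:hi + 1])
        mapping, running = {}, 0
        for g in range(lo, hi + 1):
            running += hist[g]
            cdf = running / total if total else 0.0
            mapping[g] = lo + round(cdf * (hi - lo))
        return mapping

    mapping = build_map(0, threshold)
    mapping.update(build_map(threshold + 1, levels - 1))
    return [mapping[p] for p in pixels]


def bbhe_mean_split(pixels, levels=256):
    # BBHE-style split: use the (floored) input mean as the threshold.
    return bbhe(pixels, int(mean(pixels)), levels)


def mmbebhe(pixels, levels=256):
    # Assumption: brute-force search over thresholds; the output whose mean
    # is closest to the input mean wins.
    target = mean(pixels)
    best = min(
        range(levels - 1),
        key=lambda t: abs(mean(bbhe(pixels, t, levels)) - target),
    )
    return bbhe(pixels, best, levels), best
```

The exhaustive search re-equalizes the image once per candidate threshold, so it is only practical for small inputs; it is meant to show the selection criterion rather than an efficient implementation.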

27. A Machine Learning Framework for Data Ingestion in Document Images
Han Fu, Yunyu Bai, Zhuo Li, Jun Shen, Jianling Sun
Abstract: Paper documents are widely used as an irreplaceable channel of information in many fields, especially in the financial industry, fostering great demand for systems that can convert document images into structured data representations. In this paper, we present a machine learning framework for data ingestion in document images, which processes the images uploaded by users and returns fine-grained data in JSON format. Details of model architectures, design strategies, distinctions from existing solutions and lessons learned during development are elaborated. We conduct extensive experiments on both synthetic and real-world data in State Street. The experimental results indicate the effectiveness and efficiency of our methods.
Summary: Paper documents remain an irreplaceable information channel in many fields, especially the financial industry, which creates demand for systems that convert document images into structured data. The authors present a machine learning framework that processes user-uploaded document images and returns fine-grained data in JSON format, and they describe the model architectures, design strategies, distinctions from existing solutions and lessons learned. Experiments on synthetic and real-world data in State Street indicate that the methods are effective and efficient.

28. On Parameter Tuning in Meta-learning for Computer Vision
Farid Ghareh Mohammadi, M. Hadi Amini, Hamid R. Arabnia
Abstract: Learning to learn plays a pivotal role in meta-learning (MTL) to obtain an optimal learning model. In this paper, we investigate image recognition for unseen categories of a given dataset with limited training information. We deploy a zero-shot learning (ZSL) algorithm to achieve this goal. We also explore the effect of parameter tuning on the performance of the semantic auto-encoder (SAE). We further address the parameter tuning problem for meta-learning, focusing especially on zero-shot learning. By combining different embedded parameters, we improved the accuracy of the tuned SAE. The advantages and disadvantages of parameter tuning and its application in image classification are also explored.
Summary: Learning to learn plays a pivotal role in meta-learning (MTL). The authors study image recognition for unseen categories of a dataset with limited training information using a zero-shot learning (ZSL) algorithm, and explore how parameter tuning affects the performance of the semantic auto-encoder (SAE). Combining different embedded parameters improves the accuracy of the tuned SAE. The advantages and disadvantages of parameter tuning and its application to image classification are also discussed.

29. Marine life through You Only Look Once's perspective
Herman Stavelin, Adil Rasheed, Omer San, Arne Johan Hestnes
Abstract: With the rising focus on man-made changes to our planet and the wildlife therein, more and more emphasis is put on the sustainable and responsible gathering of resources. In an effort to preserve maritime wildlife, the Norwegian government has decided that it is necessary to create an overview of the presence and abundance of various species of wildlife in the Norwegian fjords and oceans. In this paper, we apply and analyze an object detection scheme that detects fish in camera images. The data is sampled from a submerged data station at Fulehuk in Norway. We implement You Only Look Once (YOLO) version 3 and create a dataset consisting of 99,961 images, obtaining a mAP of $\sim 0.88$. We also investigate intermediate results within YOLO, gaining insight into how it performs object detection.
Summary: To support the Norwegian government's goal of mapping the presence and abundance of wildlife species in Norwegian fjords and oceans, the authors apply and analyze an object detection scheme that detects fish in camera images sampled from a submerged data station at Fulehuk in Norway. They implement You Only Look Once (YOLO) version 3, create a dataset of 99,961 images, and report a mAP of $\sim 0.88$. They also inspect intermediate results within YOLO to understand how it performs detection.

30. Deep Variational Luenberger-type Observer for Stochastic Video Prediction
Dong Wang, Feng Zhou, Zheng Yan, Guang Yao, Zongxuan Liu, Wennan Ma, Cewu Lu
Abstract: Considering the inherent stochasticity and uncertainty, predicting future video frames is exceptionally challenging. In this work, we study the problem of video prediction by combining the interpretability of stochastic state space models with the representation learning of deep neural networks. Our model builds upon a variational encoder, which transforms the input video into a latent feature space, and a Luenberger-type observer, which captures the dynamic evolution of the latent features. This enables the decomposition of videos into static features and dynamics in an unsupervised manner. By deriving the stability theory of the nonlinear Luenberger-type observer, the hidden states in the feature space become insensitive to the initial values, which improves the robustness of the overall model. Furthermore, the variational lower bound on the data log-likelihood can be derived to obtain a tractable posterior prediction distribution based on the variational principle. Finally, experiments on datasets such as Bouncing Balls and Pendulum demonstrate that the proposed model outperforms concurrent works.
Summary: Predicting future video frames is challenging because of inherent stochasticity and uncertainty. The authors combine the interpretability of stochastic state space models with the representation learning of deep neural networks: a variational encoder maps the input video into a latent feature space, and a Luenberger-type observer captures the dynamic evolution of the latent features, which decomposes videos into static features and dynamics without supervision. The stability theory derived for the nonlinear observer makes the hidden states insensitive to initial values, and a variational lower bound on the data log-likelihood yields a tractable posterior prediction distribution. Experiments on datasets such as Bouncing Balls and Pendulum show that the model outperforms concurrent works.

31. CALVIS: chest, waist and pelvis circumference from 3D human body meshes as ground truth for deep learning
Yansel Gonzalez Tejeda, Helmut Mayer
Abstract: In this paper we present CALVIS, a method to calculate $\textbf{C}$hest, w$\textbf{A}$ist and pe$\textbf{LVIS}$ circumference from 3D human body meshes. Our motivation is to use this data as ground truth for training convolutional neural networks (CNN). Previous work used the large-scale CAESAR dataset or determined these anthropometric measurements $\textit{manually}$ from a person or from human 3D body meshes. Unfortunately, acquiring these data is a costly and time-consuming endeavor. In contrast, our method can be applied to 3D meshes automatically. We synthesize eight human body meshes and apply CALVIS to calculate chest, waist and pelvis circumference. We evaluate the results qualitatively and observe that the measurements can indeed be used to estimate the shape of a person. We then assess the plausibility of our approach by generating ground truth with CALVIS to train a small CNN. After training the network with our data, we achieve a competitive validation error. Furthermore, we make the implementation of CALVIS publicly available to advance the field.
Summary: CALVIS is a method for calculating chest, waist and pelvis circumference from 3D human body meshes, intended to produce ground truth for training convolutional neural networks (CNNs). Earlier work relied on the large-scale CAESAR dataset or on manual measurements, which are costly and time-consuming to acquire, whereas CALVIS runs on 3D meshes automatically. The authors synthesize eight human body meshes, find qualitatively that the measurements can be used to estimate a person's shape, and train a small CNN on CALVIS-generated ground truth, reaching a competitive validation error. The implementation is publicly available.

32. CNN Hyperparameter tuning applied to Iris Liveness Detection
Gabriela Y. Kimura, Diego R. Lucio, Alceu S. Britto Jr., David Menotti
Abstract: The iris pattern has significantly improved the biometric recognition field due to its high level of stability and uniqueness. This physical feature has played an important role in security and other related areas. However, presentation attacks, also known as spoofing techniques, can be used to bypass the biometric system with artifacts such as printed images, artificial eyes, and textured contact lenses. To improve the security of these systems, many liveness detection methods have been proposed, and the first International Iris Liveness Detection competition was launched in 2013 to evaluate their effectiveness. In this paper, we propose a hyperparameter tuning of the CASIA algorithm, submitted by the Chinese Academy of Sciences to the third Iris Liveness Detection competition in 2017. The proposed modifications yielded an overall improvement, with an 8.48% Attack Presentation Classification Error Rate (APCER) and a 0.18% Bonafide Presentation Classification Error Rate (BPCER) on the combined datasets. Other threshold values were also evaluated in an attempt to reduce the trade-off between the APCER and the BPCER on the evaluated datasets, with successful results.
Summary: Iris patterns are stable and unique, but iris biometric systems can be bypassed by presentation (spoofing) attacks using printed images, artificial eyes or textured contact lenses. The authors tune the hyperparameters of the CASIA algorithm, which the Chinese Academy of Sciences submitted to the third Iris Liveness Detection competition in 2017. The modifications give an overall improvement, with an 8.48% Attack Presentation Classification Error Rate (APCER) and a 0.18% Bonafide Presentation Classification Error Rate (BPCER) on the combined datasets. Other threshold values were also evaluated and successfully reduced the trade-off between APCER and BPCER.
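
The two error rates reported above, and the threshold trade-off between them, can be sketched in a few lines of Python. This is a minimal illustration, not the competition's evaluation code: it assumes each presentation has a hypothetical liveness score in [0, 1], is accepted as bona fide when the score reaches the threshold, and that errors are pooled over all attack types. The function name `apcer_bpcer` is hypothetical.

```python
def apcer_bpcer(is_attack, live_scores, threshold=0.5):
    """Return (APCER, BPCER) for one decision threshold.

    APCER: fraction of attack presentations accepted as bona fide.
    BPCER: fraction of bona fide presentations rejected as attacks.
    Both classes must contain at least one sample.
    """
    attack_scores = [s for a, s in zip(is_attack, live_scores) if a]
    bonafide_scores = [s for a, s in zip(is_attack, live_scores) if not a]
    apcer = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    bpcer = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
    return apcer, bpcer


if __name__ == "__main__":
    # Toy data: four attacks followed by four bona fide presentations.
    labels = [True, True, True, True, False, False, False, False]
    scores = [0.1, 0.2, 0.6, 0.4, 0.9, 0.8, 0.45, 0.7]
    for t in (0.3, 0.5, 0.65):
        print(t, apcer_bpcer(labels, scores, threshold=t))
```

Under this decision rule, raising the threshold can only lower or preserve APCER and raise or preserve BPCER, which is why sweeping it exposes the trade-off.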

33. An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
Sicheng Zhao, Yunsheng Ma, Yang Gu, Jufeng Yang, Tengfei Xing, Pengfei Xu, Runbo Hu, Hua Chai, Kurt Keutzer
Abstract: Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ traditional two-stage shallow pipeline, i.e. extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e. polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: this https URL.
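Summary: Emotion recognition in user-generated videos matters for human-centered computing, but existing methods mostly use a two-stage shallow pipeline that extracts visual and/or audio features and then trains classifiers. The authors instead recognize video emotions end to end with convolutional neural networks (CNNs). Their Visual-Audio Attention Network (VAANet) integrates spatial, channel-wise and temporal attention into a visual 3D CNN and temporal attention into an audio 2D CNN, and is trained with a polarity-consistent cross-entropy loss based on the polarity-emotion hierarchy constraint. VAANet outperforms state-of-the-art approaches on the VideoEmotion-8 and Ekman-6 datasets. The source code has been released.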

34. Character Segmentation in Asian Collector's Seal Imprints: An Attempt to Retrieval Based on Ancient Character Typeface
Kangying Li, Biligsaikhan Batjargal, Akira Maeda
Abstract: Collector's seals provide important clues about the ownership of a book. They contain much information pertaining to the essential elements of ancient materials and also show the details of possession, its relation to the book, the identity of the collectors and their social status and wealth, amongst others. Asian collectors have typically used artistic ancient characters rather than modern ones to make their seals. In addition to the owner's name, several other words are used to express more profound meanings. A system that automatically recognizes these characters can help enthusiasts and professionals better understand the background information of these seals. However, there is a lack of training data and labelled images, as samples of some seals are scarce and most of them are degraded images. It is necessary to find new ways to make full use of such scarce data. While these data are available online, they do not contain information on the characters' positions. The goal of this research is to provide retrieval tools that assist in obtaining more information from Asian collector's seal imprints without consuming a lot of computational resources. In this paper, a character segmentation method is proposed to predict the candidate characters' areas without any labelled training data that contain character coordinate information. A retrieval-based recognition system that focuses on a single character is also proposed to support seal retrieval and matching. The experimental results demonstrate that the proposed character segmentation method performs well on Asian collector's seals, with 92% of the test data being correctly segmented.
Summary: Collector's seals give important clues about a book's ownership, including the collectors' identity, social status and wealth, and Asian collectors typically used artistic ancient characters for them. Automatic recognition would help enthusiasts and professionals, but training data and labelled images are scarce, many images are degraded, and the data available online carry no information on character positions. The authors propose a character segmentation method that predicts candidate character areas without labelled coordinate data, plus a retrieval-based recognition system focused on single characters to support seal retrieval and matching, all without heavy computational cost. The segmentation method correctly segments 92% of the test data.

35. GSANet: Semantic Segmentation with Global and Selective Attention
Qingfeng Liu, Mostafa El-Khamy, Dongwoon Bai, Jungwon Lee
Abstract: This paper proposes a novel deep learning architecture for semantic segmentation. The proposed Global and Selective Attention Network (GSANet) features Atrous Spatial Pyramid Pooling (ASPP) with a novel sparsemax global attention and a novel selective attention that deploys a condensation and diffusion mechanism to aggregate the multi-scale contextual information from the extracted deep features. A selective attention decoder is also proposed to process the GSA-ASPP outputs for optimizing the softmax volume. We are the first to benchmark the performance of semantic segmentation networks with the low-complexity feature extraction network (FXN) MobileNetEdge, which is optimized for low latency on edge devices. We show that GSANet can result in more accurate segmentation with MobileNetEdge, as well as with strong FXNs, such as Xception. GSANet improves the state-of-the-art semantic segmentation accuracy on both the ADE20k and the Cityscapes datasets.
Summary: The Global and Selective Attention Network (GSANet) is a deep learning architecture for semantic segmentation. It combines Atrous Spatial Pyramid Pooling (ASPP) with a sparsemax global attention and a selective attention that uses a condensation and diffusion mechanism to aggregate multi-scale context from the extracted deep features, and adds a selective attention decoder that processes the GSA-ASPP outputs to optimize the softmax volume. The authors are the first to benchmark segmentation networks with the low-complexity feature extraction network (FXN) MobileNetEdge, which is optimized for low latency on edge devices. GSANet gives more accurate segmentation with both MobileNetEdge and strong FXNs such as Xception, and improves state-of-the-art accuracy on ADE20k and Cityscapes.

36. Verifying Deep Learning-based Decisions for Facial Expression Recognition
Ines Rieger, Rene Kollmann, Bettina Finzel, Dominik Seuss, Ute Schmid
Abstract: Neural networks with high performance can still be biased towards non-relevant features. However, reliability and robustness are especially important for high-risk fields such as clinical pain treatment. We therefore propose a verification pipeline, which consists of three steps. First, we classify facial expressions with a neural network. Next, we apply layer-wise relevance propagation to create pixel-based explanations. Finally, we quantify these visual explanations based on a bounding-box method with respect to facial regions. Although our results show that the neural network achieves state-of-the-art results, the evaluation of the visual explanations reveals that relevant facial regions may not be considered.
Summary: High-performing neural networks can still be biased towards non-relevant features, which is a concern in high-risk fields such as clinical pain treatment. The authors propose a three-step verification pipeline: classify facial expressions with a neural network, apply layer-wise relevance propagation to create pixel-based explanations, and quantify those explanations with a bounding-box method over facial regions. The network achieves state-of-the-art results, yet the evaluation of the explanations reveals that relevant facial regions may not be considered.

37. CheXclusion: Fairness gaps in deep chest X-ray classifiers
Laleh Seyyed-Kalantari, Guanxiong Liu, Matthew McDermott, Marzyeh Ghassemi
Abstract: Machine learning systems have received much attention recently for their ability to achieve expert-level performance on clinical tasks, particularly in medical imaging. Here, we examine the extent to which state-of-the-art deep learning classifiers trained to yield diagnostic labels from X-ray images are biased with respect to protected attributes. We train convolutional neural networks to predict 14 diagnostic labels in three prominent public chest X-ray datasets: MIMIC-CXR, Chest-Xray8, and CheXpert. We then evaluate the TPR disparity (the difference in true positive rates, TPR) and the underdiagnosis rate (the false positive rate of a non-diagnosis) among different protected attributes such as patient sex, age, race, and insurance type. We demonstrate that TPR disparities exist in the state-of-the-art classifiers in all datasets, for all clinical tasks, and all subgroups. We find that TPR disparities are most commonly not significantly correlated with a subgroup's proportional disease burden; further, we find that some subgroups and subsections of the population are chronically underdiagnosed. Such performance disparities have real consequences as models move from papers to products, and should be carefully audited prior to deployment.
Summary: The authors examine how far state-of-the-art deep learning classifiers that produce diagnostic labels from X-ray images are biased with respect to protected attributes. They train convolutional neural networks to predict 14 diagnostic labels on three public chest X-ray datasets (MIMIC-CXR, Chest-Xray8 and CheXpert) and evaluate the TPR disparity and the underdiagnosis rate across patient sex, age, race and insurance type. TPR disparities exist in all datasets, for all clinical tasks and all subgroups, are most commonly not significantly correlated with a subgroup's proportional disease burden, and some subgroups are chronically underdiagnosed. Such disparities should be carefully audited before deployment.
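
A per-subgroup TPR audit of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: labels and predictions are binary for a single diagnostic label, and the disparity is measured against a reference group chosen by the caller (the choice of reference is an assumption; the abstract does not specify one). The function names are hypothetical.

```python
def true_positive_rate(y_true, y_pred):
    # TPR = correctly flagged positives / all actual positives.
    # Raises ZeroDivisionError if there are no positives.
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds_on_positives) / len(preds_on_positives)


def tpr_by_group(y_true, y_pred, groups):
    """Compute the TPR separately for each value of a protected attribute."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        rates[g] = true_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return rates


def tpr_disparity(rates, reference):
    # Signed gap of each group's TPR relative to the reference group.
    return {g: r - rates[reference] for g, r in rates.items()}


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 1, 1, 0, 1]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
    sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
    rates = tpr_by_group(y_true, y_pred, sex)
    print(rates, tpr_disparity(rates, reference="M"))
```

A negative gap means the classifier misses more true cases in that subgroup than in the reference group, which is the kind of disparity an audit before deployment would look for.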

38. Realistic River Image Synthesis using Deep Generative Adversarial Networks
Akshat Gautam, Muhammed Sit, Ibrahim Demir
Abstract: In this paper, we investigate an application of image generation for river satellite imagery. Specifically, we propose a generative adversarial network (GAN) model capable of generating high-resolution and realistic river images that can be used to support models in surface water estimation, river meandering, wetland loss and other hydrological research studies. First, we summarized an augmented, diverse repository of overhead river images to be used in training. Second, we incorporate the Progressive Growing GAN (PGGAN), a network architecture that iteratively trains smaller-resolution GANs to gradually build up to a very high resolution, to generate 256x256 river satellite imagery. With conventional GAN architectures, difficulties soon arise in terms of exponential increase of training time and vanishing/exploding gradient issues, which the PGGAN implementation seems to significantly reduce. Our preliminary results show great promise in capturing the detail of river flow and green areas present in river satellite images that can be used for supporting hydroinformatics studies.
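Summary: The authors apply image generation to river satellite imagery, proposing a generative adversarial network (GAN) that produces high-resolution, realistic river images to support models for surface water estimation, river meandering, wetland loss and other hydrological studies. They assemble an augmented, diverse repository of overhead river images for training and use the Progressive Growing GAN (PGGAN), which iteratively trains smaller-resolution GANs to build up to a high resolution, to generate 256x256 imagery. The PGGAN appears to significantly reduce the training-time growth and vanishing/exploding gradient problems of conventional GANs, and preliminary results capture the detail of river flow and green areas.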

39. SIP-SegNet: A Deep Convolutional Encoder-Decoder Network for Joint Semantic Segmentation and Extraction of Sclera, Iris and Pupil based on Periocular Region Suppression
Bilal Hassan, Ramsha Ahmed, Taimur Hassan, Naoufel Werghi
Abstract: The current developments in the field of machine vision have opened new vistas towards deploying multimodal biometric recognition systems in various real-world applications. These systems can deal with the limitations of unimodal biometric systems, which are vulnerable to spoofing, noise, non-universality and intra-class variations. In addition, among the various biometric traits, ocular traits are preferred in these recognition systems. Such systems possess high distinctiveness, permanence, and performance, while technologies based on other biometric traits (fingerprints, voice etc.) can be easily compromised. This work presents a novel deep learning framework called SIP-SegNet, which performs the joint semantic segmentation of ocular traits (sclera, iris and pupil) in unconstrained scenarios with greater accuracy. The images acquired under these scenarios exhibit Purkinje reflexes, specular reflections, eye gaze, off-angle shots, low resolution, and various occlusions, particularly by eyelids and eyelashes. To address these issues, SIP-SegNet begins by denoising the pristine image using a denoising convolutional neural network (DnCNN), followed by reflection removal and image enhancement based on contrast limited adaptive histogram equalization (CLAHE). Our proposed framework then extracts the periocular information using adaptive thresholding and employs a fuzzy filtering technique to suppress this information. Finally, the semantic segmentation of sclera, iris and pupil is achieved using a densely connected fully convolutional encoder-decoder network. We used five CASIA datasets to evaluate the performance of SIP-SegNet based on various evaluation metrics. The simulation results validate the optimal segmentation of the proposed SIP-SegNet, with mean F1 scores of 93.35, 95.11 and 96.69 for the sclera, iris and pupil classes, respectively.
Summary: Multimodal biometric systems overcome the limitations of unimodal ones, and ocular traits are preferred for their distinctiveness, permanence and performance. SIP-SegNet is a deep learning framework for joint semantic segmentation of sclera, iris and pupil in unconstrained scenarios, where images show Purkinje reflexes, specular reflections, off-angle shots, low resolution and occlusions by eyelids and eyelashes. It denoises the image with DnCNN, removes reflections and enhances the image with CLAHE, extracts the periocular information by adaptive thresholding and suppresses it with fuzzy filtering, then segments with a densely connected fully convolutional encoder-decoder network. On five CASIA datasets it reaches mean F1 scores of 93.35, 95.11 and 96.69 for the sclera, iris and pupil classes, respectively.

40. Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells
Gengchen Mai, Krzysztof Janowicz, Bo Yan, Rui Zhu, Ling Cai, Ni Lao
Abstract: Unsupervised text encoding models have recently fueled substantial progress in NLP. The key idea is to use neural networks to convert words in texts to vector space representations based on word positions in a sentence and their contexts, which are suitable for end-to-end training of downstream tasks. We see a strikingly similar situation in spatial analysis, which focuses on incorporating both absolute positions and spatial contexts of geographic objects such as POIs into models. A general-purpose representation model for space is valuable for a multitude of tasks. However, no such general model exists to date beyond simply applying discretization or feed-forward nets to coordinates, and little effort has been put into jointly modeling distributions with vastly different characteristics, which commonly emerges from GIS data. Meanwhile, Nobel Prize-winning Neuroscience research shows that grid cells in mammals provide a multi-scale periodic representation that functions as a metric for location encoding and is critical for recognizing places and for path-integration. Therefore, we propose a representation learning model called Space2Vec to encode the absolute positions and spatial relationships of places. We conduct experiments on two real-world geographic data for two different tasks: 1) predicting types of POIs given their positions and context, 2) image classification leveraging their geo-locations. Results show that because of its multi-scale representations, Space2Vec outperforms well-established ML approaches such as RBF kernels, multi-layer feed-forward nets, and tile embedding approaches for location modeling and image classification tasks. Detailed analysis shows that all baselines can at most well handle distribution at one scale but show poor performances in other scales. In contrast, Space2Vec's multi-scale representation can handle distributions at different scales.
摘要：无监督文本编码模型最近助长了NLP实质性进展。关键思想是利用神经网络的话在文本转换为基于句子中的单词位置和他们的环境，适合于终端到终端的培训下游任务矢量空间表示。我们看到在空间分析，其重点是结合两者的绝对位置和地理对象的空间环境下，比如兴趣点到模型中惊人相似的情况。空间通用表示模型是多项任务有价值。但是，没有这样的一般模型存在至今超越了简单的应用离散或前馈网，坐标，举手之劳已投入联合分布建模与完全不同的特性，这通常由GIS数据出现。同时，获得诺贝尔奖的神经科学的研究表明，在哺乳动物中网格单元提供了一个多尺度的周期性表现，其功能如同一个度量位置编码，并识别名额和路径整合的关键。因此，我们提出了一个名为Space2Vec表示学习模式编码的绝对位置和场所的空间关系。我们进行了两个真实世界的地理数据实验的两个不同的任务：1）预测给出自己的立场和背景种类的POI，2）图像分类利用其地缘位置。结果表明，由于其多尺度交涉，Space2Vec优于成熟的ML方法，如RBF内核，多层前馈网，和位置建模和图像分类任务瓷砖嵌入方法。详细的分析表明，所有基准最多能处理好分布在某一尺度，但显示其他尺度表现不佳。相比之下，Space2Vec的多尺度表示可以处理不同尺度分布。
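
The multi-scale periodic representation described in this abstract can be illustrated with a small sketch. This is not the Space2Vec implementation: the function name `multiscale_encode`, the parameters `num_scales`, `lambda_min` and `lambda_max`, and the choice of geometrically spaced wavelengths applied to each coordinate separately are all assumptions made for illustration.

```python
import math

def multiscale_encode(x, y, num_scales=4, lambda_min=1.0, lambda_max=1000.0):
    """Encode a 2-D point with sinusoids at geometrically spaced wavelengths.

    Hypothetical helper: parameter names and defaults are illustrative only.
    """
    features = []
    for s in range(num_scales):
        # Wavelengths grow geometrically from lambda_min to lambda_max.
        ratio = s / (num_scales - 1) if num_scales > 1 else 0.0
        wavelength = lambda_min * (lambda_max / lambda_min) ** ratio
        for coord in (x, y):
            features.append(math.sin(coord / wavelength))
            features.append(math.cos(coord / wavelength))
    return features

vec = multiscale_encode(12.5, -3.0)
print(len(vec))  # 4 scales * 2 coordinates * 2 functions = 16
```

Short wavelengths separate nearby points, while long wavelengths vary slowly and capture coarse position, which is one way to see why a single-scale encoding struggles with distributions at other scales.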

41. Breast Cancer Histopathology Image Classification and Localization using Multiple Instance Learning [PDF] 返回目录
Abhijeet Patil, Dipesh Tamboli, Swati Meena, Deepak Anand, Amit Sethi
Abstract: Breast cancer has the highest mortality among cancers in women. As the number of breast cancer patients increases, computer-aided analysis of microscopic histopathology images can reduce the cost and delay of diagnosis. Over the last decade, deep learning in histopathology has attracted attention for achieving state-of-the-art performance in classification and localization tasks. The convolutional neural network, a deep learning framework, provides remarkable results in tissue image analysis but offers little interpretation or reasoning behind its decisions. We aim to provide a better interpretation of classification results by providing localization on microscopic histopathology images. We frame the image classification problem as a weakly supervised multiple instance learning problem in which an image is a collection of patches, i.e., instances. Attention-based multiple instance learning (A-MIL) learns attention over the patches of an image to localize its malignant and normal regions and uses them to classify the image. We present classification and localization results on two publicly available datasets, BreakHIS and BACH. The classification and visualization results are compared with other recent techniques. The proposed method achieves better localization results without compromising classification accuracy.
摘要：乳腺癌在女性癌症中死亡率最高。计算机辅助病理分析诊断显微组织病理学图像与越来越多的乳腺癌患者可以带来成本和诊断下来的延迟。在病理组织学深学已引起重视了实现分类和定位任务的国家的最先进的性能的最后十年。卷积神经网络，深学习框架，提供了组织图像分析了显着成绩，但在提供解释和推理的决定背后的缺乏。我们的目标是通过对微观病理图像提供本地化提供分类结果更好的诠释。作为弱监督多示例学习问题，其中的图像是补丁即实例的集合，我们帧图像分类问题。关注基于多示例学习（A-MIL）学习从图像补丁注意本地化恶性和正常区域的图像中并用它们来对图像进行分类。我们两个公开可用BreakHIS和巴赫的数据集目前的分类和定位结果。分类和可视化结果与最近的其他技术相比。所提出的方法实现更好的定位结果不影响分类的准确性。
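
As a rough illustration of attention over patches, the sketch below pools a bag of patch feature vectors with softmax attention weights. It assumes one common formulation (a score `w · tanh(V h)` per patch); the abstract does not specify the scoring function, and `attention_pool`, `V` and `w` are hypothetical names with toy values.

```python
import math

def attention_pool(instances, V, w):
    """Return (attention weights, bag embedding) for a bag of instance vectors."""
    scores = []
    for h in instances:
        # Project the patch feature, squash it, then score it with w.
        hidden = [math.tanh(sum(v_ij * h_j for v_ij, h_j in zip(row, h))) for row in V]
        scores.append(sum(w_i * z_i for w_i, z_i in zip(w, hidden)))
    # Numerically stable softmax over the patches in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return weights, bag

patches = [[0.1, 0.2], [2.0, 1.5], [0.0, 0.1]]  # three toy patch features
V = [[1.0, 0.0], [0.0, 1.0]]                     # hypothetical projection matrix
w = [1.0, 1.0]                                   # hypothetical scoring vector
weights, bag = attention_pool(patches, V, w)
print(weights, bag)
```

The per-patch weights double as a localization map: patches with high attention are the ones that drive the image-level prediction.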

42. MADAN: Multi-source Adversarial Domain Aggregation Network for Domain Adaptation [PDF] 返回目录
Sicheng Zhao, Bo Li, Xiangyu Yue, Pengfei Xu, Kurt Keutzer
Abstract: Domain adaptation aims to learn a transferable model to bridge the domain shift between one labeled source domain and another sparsely labeled or unlabeled target domain. Since the labeled data may be collected from multiple sources, multi-source domain adaptation (MDA) has attracted increasing attention. Recent MDA methods do not consider the pixel-level alignment between sources and target or the misalignment across different sources. In this paper, we propose a novel MDA framework to address these challenges. Specifically, we design an end-to-end Multi-source Adversarial Domain Aggregation Network (MADAN). First, an adapted domain is generated for each source with dynamic semantic consistency while aligning towards the target at the pixel level in a cycle-consistent manner. Second, a sub-domain aggregation discriminator and a cross-domain cycle discriminator are proposed to make the different adapted domains more closely aggregated. Finally, feature-level alignment is performed between the aggregated domain and the target domain while training the task network. For segmentation adaptation, we further enforce category-level alignment and incorporate context-aware generation, which constitutes MADAN+. We conduct extensive MDA experiments on digit recognition, object classification, and simulation-to-real semantic segmentation. The results demonstrate that the proposed MADAN and MADAN+ models outperform state-of-the-art approaches by a large margin.
摘要：域名适应旨在学习转移模型来弥补一个标记源域和另一个稀疏标注或未标注的目标域之间的域转移。由于标记的数据可从多个源收集，多源域的适应（MDA）已经吸引了越来越多的关注。最近MDA方法没有考虑源和目标或在不同源的偏差之间的像素级的定位。在本文中，我们提出了一个新颖的MDA框架来应对这些挑战。具体来说，我们设计了一个端至端的多源对抗性域聚合网络（马丹）。首先，用于与同时在像素级对所述目标对准周期一致地动态语义一致性每个源产生的适应域。其次，子域名聚集鉴别和跨域循环标识符都提出了不同的适应领域更加紧密地聚集做。最后，聚集区和同时培养任务的网络目标域之间进行功能级别的对齐。对于分割适应，我们进一步执行类别级对准并结合上下文感知代，其构成马丹+。我们对数字识别，对象分类，并模拟到真实语义分割进行广泛的MDA的实验。结果表明，所提出的麻石和MANDA +车型超越国家的最先进的大幅度接近。

43. Recognizing Handwritten Mathematical Expressions as LaTex Sequences Using a Multiscale Robust Neural Network [PDF] 返回目录
Hongyu Wang, Guangcun Shan
Abstract: In this paper, a robust multiscale neural network is proposed to recognize handwritten mathematical expressions and output LaTeX sequences. The network effectively and correctly focuses on the region relevant to each output step, which helps in analyzing the two-dimensional structure of handwritten mathematical expressions and in identifying different mathematical symbols in a long expression. With the addition of visualization, the model's recognition process is shown in detail. In addition, our model achieved 49.459% and 46.062% ExpRate on the public CROHME 2014 and CROHME 2016 datasets, respectively. The present model results suggest that the state-of-the-art model has better robustness, fewer errors, and higher accuracy.
摘要：在本文中，一个强大的多尺度神经网络提出了识别手写的数学表达式和输出乳胶序列，其可有效地和正确地集中于其中输出的每一步骤应该关注，并且具有在分析所述二维结构的积极作用的手写数学表达式，并确定在长表达不同的数学符号。由于增加的可视化，该模型的识别处理中详细示出。此外，我们的模型实现了49.459％，并在公共CROHME 2014年和CROHME 2016数据集46.062％ExpRate。本模型结果表明，国家的最先进的模型具有较好的稳健性，更少的错误，以及更高的精度。

44. Deepfakes for Medical Video De-Identification: Privacy Protection and Diagnostic Information Preservation [PDF] 返回目录
Bingquan Zhu, Hao Fang, Yanan Sui, Luming Li
Abstract: Data sharing for medical research has been difficult because open-sourcing clinical data may violate patient privacy. Traditional methods for face de-identification wipe out facial information entirely, making it impossible to analyze facial behavior. Recent advances in whole-body keypoint detection also rely on facial input to estimate body keypoints. Both facial and body keypoints are critical in some medical diagnoses, so keypoint invariance after de-identification is of great importance. Here, we propose a solution using deepfake technology, specifically face swapping. While face swapping has been criticized for invading privacy and portrait rights, it can conversely protect privacy in medical video: patients' faces can be swapped for a suitable target face and become unrecognizable. However, it has remained an open question to what extent swapping-based de-identification affects the automatic detection of body keypoints. In this study, we apply deepfake technology to Parkinson's disease examination videos to de-identify subjects, and quantitatively show that face swapping is a reliable de-identification approach that keeps the keypoints almost invariant, significantly better than traditional methods. This study proposes a pipeline for video de-identification and keypoint preservation, clearing up some ethical restrictions on medical data sharing. This work could make open-source, high-quality medical video datasets more feasible and promote future medical research that benefits our society.
摘要：数据共享为医学研究一直是困难的，因为开放式采购的临床数据可能侵犯病人隐私。面部去标识的传统方法完全祛除面部信息，因此无法分析脸部行为。对全身的关键点的最新进展检测还要靠面部输入估计身体的关键点。无论脸部和身体的关键点是在一些医疗诊断的关键，去标识后的关键点不变是非常重要的。在这里，我们建议使用deepfake技术，脸部交换技术的解决方案。虽然这种互换方法已经被批评为侵犯隐私和肖像权，它可以在医疗视频反过来保护隐私：患者的面部可能被交换到正确目标的脸，变得面目全非。但是，它仍然是一个悬而未决的问题是在何种程度上交换去识别方法可能会影响身体的关键点的自动检测。在这项研究中，我们应用deepfake技术，帕金森氏病检查视频去标识科目，并定量地表明：面对面交换作为去识别方法是可靠的，它不断的关键点几乎不变，显著优于传统方法。这项研究提出了视频去识别和关键点保存管道，清理医疗数据共享一些道德限制。这项工作可以使开源高品质的医疗影像数据集的可行性，并促进未来医学研究有利于我们的社会。

45. Medicine Strip Identification using 2-D Cepstral Feature Extraction and Multiclass Classification Methods [PDF] 返回目录
Anirudh Itagi, Ritam Sil, Saurav Mohapatra, Subham Rout, Bharath K P, Karthik R, Rajesh Kumar Muthu
Abstract: Misclassification of medicine is perilous to the health of a patient, more so if the patient is visually impaired or simply does not recognize the color, shape or type of medicine strip. This paper proposes a method for identifying medicine strips by 2-D cepstral analysis of their images, followed by classification using K-Nearest Neighbor (KNN), Support Vector Machine (SVM) and Logistic Regression (LR) classifiers. The extracted 2-D cepstral features are highly distinctive for each medicine strip and consequently make identification exceptionally accurate. This paper also proposes the Color Gradient and Pill shape Feature (CGPF) extraction procedure and discusses the Binary Robust Invariant Scalable Keypoints (BRISK) algorithm. The mentioned algorithms were implemented and their identification results have been compared.
摘要：医学误判是危险的患者的健康，更是这样，如果说患者有视力障碍或根本不认识颜色，形状或药条的类型。本文提出了一种由它们的图像的2-d倒频谱分析鉴定药带的方法，接着进行已使用k近邻（KNN）进行分类，支持向量机（SVM）和Logistic回归（LR）的分类器。提取出的2-d倒谱特征是极为不同的药条，并因此使鉴定它们极其精确。本文还提出了颜色梯度和丸形状特征（CGPF）提取方法，并讨论了二进制鲁棒不变可伸缩关键点（BRISK）算法为好。所提到的算法得以实施，他们的鉴定结果进行了比较。

46. Vision based body gesture meta features for Affective Computing [PDF] 返回目录
Indigo J. D. Orton
Abstract: Early detection of psychological distress is key to effective treatment. Automatic detection of distress, such as depression, is an active area of research. Current approaches utilise vocal, facial, and bodily modalities. Of these, the bodily modality is the least investigated, partially due to the difficulty in extracting bodily representations from videos, and partially due to the lack of viable datasets. Existing body modality approaches use automatic categorization of expressions to represent body language as a series of specific expressions, much like words within natural language. In this dissertation I present a new type of feature, within the body modality, that represents meta information of gestures, such as speed, and use it to predict a non-clinical depression label. This differs from existing work by representing overall behaviour as a small set of aggregated meta features derived from a person's movement. In my method I extract pose estimation from videos, detect gestures within body parts, extract meta information from individual gestures, and finally aggregate these features to generate a small feature vector for use in prediction tasks. I introduce a new dataset of 65 video recordings of interviews with self-evaluated distress, personality, and demographic labels. This dataset enables the development of features utilising the whole body in distress detection tasks. I evaluate my newly introduced meta-features for predicting depression, anxiety, perceived stress, somatic stress, five standard personality measures, and gender. A linear regression based classifier using these features achieves an 82.70% F1 score for predicting depression within my novel dataset.
摘要：心理困扰的早期发现是关键，有效的治疗方法。自动检测困扰，如抑郁症，是一个活跃的研究领域。目前的方法利用声音，面部，和身体的方式。其中，身体形态是最少的调查，部分原因是由于难以从视频中提取的身体表示，部分由于缺乏可行的数据集。现有的身体形态的方法使用表达式的自动分类来表示的肢体语言为一系列具体表述，很像自然语言中的单词。在本文我提出了一种新类型的特征，身体形态中，表示的手势，如速度的元信息，并用它来预测非临床抑郁症的标签。这不同于由代表整体行为为一小部分从一个人的运动衍生聚集元功能现有的工作。在从视频中我的方法我提取姿态估计，检测手势的身体部位，从单个手势提取元数据信息中，最后汇总这些功能来生成用于预测任务使用小特征向量。我介绍的与自我评价的困扰，个性和人口标签采访的65个视频录制新的数据集。该数据集能够利用的整个身体处于困境的检测任务功能的开发。我评估我的新推出的元功能预测抑郁，焦虑，心理压力，躯体应激，五个标准的个性措施，和性别。使用这些功能的线性回归的分类器实现了82.70％的F1的成绩。我的小说集内预测抑郁症。

47. A Convolutional Baseline for Person Re-Identification Using Vision and Language Descriptions [PDF] 返回目录
Ammarah Farooq, Muhammad Awais, Fei Yan, Josef Kittler, Ali Akbari, Syed Safwan Khalid
Abstract: Classical person re-identification approaches assume that a person of interest has appeared across different cameras and can be queried by one of the existing images. However, in real-world surveillance scenarios, frequently no visual information will be available about the queried person. In such scenarios, a natural language description of the person by a witness will provide the only source of information for retrieval. In this work, person re-identification using both vision and language information is addressed under all possible gallery and query scenarios. A two-stream deep convolutional neural network framework supervised by cross entropy loss is presented. The weights connecting the second-to-last layer to the last layer with class probabilities, i.e., the logits of the softmax layer, are shared between both networks. Canonical Correlation Analysis is performed to enhance the correlation between the two modalities in a joint latent embedding space. To investigate the benefits of the proposed approach, a new testing protocol under a multi-modal ReID setting is proposed for the test split of the CUHK-PEDES and CUHK-SYSU benchmarks. The experimental results verify the merits of the proposed system. The learnt visual representations are more robust and perform 22% better during retrieval as compared to a single modality system. Retrieval with a multi-modal query greatly enhances the re-identification capability of the system quantitatively as well as qualitatively.
摘要：经典的人重新鉴定方法假设的权益的人已经在不同的摄像机出现，并且可以通过现有的图像之一进行查询。然而，在现实世界中的场景监控，经常没有视觉信息将有关查询人。在这样的情况下，通过证人的人的自然语言描述将提供用于检索信息的唯一来源。在这项工作中，同时使用视觉和语言信息的人重新鉴定是在所有可能的画廊和查询场景讨论。两流深卷积神经网络架构监督交叉熵损失提出。倒数第二层连接到与类概率，即最后的层的权重，SOFTMAX层的logits在两个网络共享。典型相关分析进行，以提高在合资潜在嵌入空间的两种模式之间的相关性。为了研究该方法的优点，在多模态里德设置一个新的测试方案，提出了中大 - 德斯和香港中文大学 - 中山大学基准测试分裂。实验结果验证了该系统的优点。博学的视觉表现更健壮，相比于单模态系统恢复过程中执行22 \％更好。具有多模态查询检索极大地提高了系统的重新识别能力定量以及定性。

48. Firearm Detection and Segmentation Using an Ensemble of Semantic Neural Networks [PDF] 返回目录
Alexander Egiazarov, Vasileios Mavroeidis, Fabio Massimo Zennaro, Kamer Vishi
Abstract: In recent years we have seen an upsurge in terror attacks around the world. Such attacks usually happen in public places with large crowds to cause the most damage possible and get the most attention. Even though surveillance cameras are assumed to be a powerful tool, their effect in preventing crime is far from clear, either because of limits on the ability of humans to vigilantly monitor video surveillance or for the simple reason that the cameras operate passively. In this paper, we present a weapon detection system based on an ensemble of semantic Convolutional Neural Networks that decomposes the problem of detecting and locating a weapon into a set of smaller problems concerned with the individual component parts of a weapon. This approach has computational and practical advantages: a set of simpler neural networks dedicated to specific tasks requires fewer computational resources and can be trained in parallel; the overall output of the system, given by the aggregation of the outputs of individual networks, can be tuned by a user to trade off false positives and false negatives; finally, according to ensemble theory, the output of the overall system will be robust and reliable even in the presence of weak individual models. We evaluated our system by running simulations aimed at assessing the accuracy of individual networks and of the whole system. The results on synthetic data and real-world data are promising, and they suggest that our approach may have advantages compared to the monolithic approach based on a single deep convolutional neural network.
摘要：近年来，我们看到世界各地的恐怖袭击的热潮。这种攻击通常在公共场所与大量人群发生造成最大的伤害可能并得到了最多的关注。尽管监控摄像机被认为是一个强大的工具，其在预防犯罪的效果还远未明朗因人的能力，警惕地监视视频监控或原因很简单，他们正在操作被动或者限制。在本文中，我们提出基于语义卷积神经网络的集合会分解的检测和定位的武器成一组涉及一种武器的各个组成部件较小的问题的问题的武器检测系统。这种方法具有计算和实用的优点：一个专用于特定任务的一组简单的神经网络需要较少的计算资源可以并行进行培训;通过各个网络的输出的聚集给出的系统的总输出可以由用户被调谐到权衡假阳性和假阴性;最后，根据合奏理论，整个系统的输出，即使在微弱的个别型号的存在是稳定可靠的。我们评估我们的系统旨在评估单个网络的精度和整个系统的仿真。在模拟数据和真实数据的结果是令人鼓舞的，他们认为我们的做法可能具有的优势与基于单一的深卷积神经网络的整体办法。

49. Task Augmentation by Rotating for Meta-Learning [PDF] 返回目录
Jialin Liu, Fei Chao, Chih-Min Lin
Abstract: Data augmentation is one of the most effective approaches for improving the accuracy of modern machine learning models, and it is also indispensable for training a deep model for meta-learning. In this paper, we introduce a task augmentation method by rotating, which increases the number of classes by rotating the original images by 90, 180 and 270 degrees, unlike traditional augmentation methods, which increase the number of images. With a larger number of classes, we can sample more diverse task instances during training. Therefore, task augmentation by rotating allows us to train a deep network with meta-learning methods with little over-fitting. Experimental results show that our approach is better than using rotation to increase the number of images and achieves state-of-the-art performance on the miniImageNet, CIFAR-FS, and FC100 few-shot learning benchmarks. The code is available at this http URL.
摘要：数据隆胸是最有效的办法对于提高现代机器学习模型的准确性之一，它也是必不可少的训练对元学习了深刻的模型。在本文中，我们引入通过旋转任务增强方法，该方法通过旋转原始图像90，180和270度，与传统的增强方法，其增加的图像的数量不同增加了类的数量。随着类用量较大，我们可以在训练中品尝更多样化的任务实例。因此，通过旋转任务增加允许我们训练由元的学习方法了深刻的网络，有点过度拟合。实验结果表明，该方法比旋转更好地为提高图像的数量，并实现对miniImageNet，CIFAR-FS国家的最先进的性能和FC100几拍的学习标杆。该代码可以在\ {URL这个HTTP URL}。
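
The idea of treating each rotation as a new class can be sketched as follows. This is an illustrative reading of the abstract, not the authors' released code: the helper names `rot90` and `augment_classes` and the label scheme `k * num_classes + label` are assumptions.

```python
def rot90(image):
    """Rotate a 2-D list-of-rows image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]

def augment_classes(dataset, num_classes):
    """Return (image, label) pairs in which each rotation defines a new class.

    Assumed label scheme: rotation index k maps label c to k * num_classes + c.
    """
    out = []
    for image, label in dataset:
        rotated = image
        for k in range(4):  # 0, 90, 180 and 270 degrees
            out.append((rotated, k * num_classes + label))
            rotated = rot90(rotated)
    return out

data = [([[1, 2], [3, 4]], 0), ([[5, 6], [7, 8]], 1)]  # toy 2x2 "images"
augmented = augment_classes(data, num_classes=2)
print(len(augmented), sorted({lbl for _, lbl in augmented}))
```

Two original classes become eight, so a few-shot task sampler has four times as many classes to draw from, whereas conventional rotation augmentation would keep two classes and only add images.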

50. Hypernetwork approach to generating point clouds [PDF] 返回目录
Przemysław Spurek, Sebastian Winczowski, Jacek Tabor, Maciej Zamorski, Maciej Zięba, Tomasz Trzciński
Abstract: In this work, we propose a novel method for generating 3D point clouds that leverages properties of hyper networks. Contrary to the existing methods that learn only the representation of a 3D object, our approach simultaneously finds a representation of the object and its 3D surface. The main idea of our HyperCloud method is to build a hyper network that returns weights of a particular neural network (target network) trained to map points from a uniform unit ball distribution into a 3D shape. As a consequence, a particular 3D shape can be generated using point-by-point sampling from the assumed prior distribution and transforming sampled points with the target network. Since the hyper network is based on an auto-encoder architecture trained to reconstruct realistic 3D shapes, the target network weights can be considered a parametrization of the surface of a 3D shape, and not a standard representation of a point cloud as usually returned by competing approaches. The proposed architecture allows finding mesh-based representations of 3D objects in a generative manner while providing point clouds on par in quality with state-of-the-art methods.
摘要：在这项工作中，我们提出了产生三维点云是超网络的杠杆性质的新方法。与此相反学习3D对象的仅表示现有的方法，我们的方法同时找到对象和它的3D表面的表示。我们的方法HyperCloud的主要思想是建立一个超网络训练以从均匀单元球分布为3D形状的点映射的特定神经网络（目标网络）的回报的权重。因此，可使用逐点从采样假定先验分布和用所述目标网络采样的点来产生特定的3D形状。由于超网络是基于一个自动编码器架构训练来重建逼真的三维形状，所述目标网络的权值可被认为是三维形状的表面的参数化，而不是点云的标准表示通常通过竞争性方法返回。所提出的架构允许找到3D的基于网格的表示在生成对象的方式，而在质量提供点云连接一对与国家的最先进的方法。
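
The sampling step described above (draw points from a uniform unit ball, then transform them with the target network) might look like the sketch below. The target network here is a tiny hand-written MLP whose weights are hypothetical placeholders; in the paper those weights are produced by the hyper network, which is not modeled here.

```python
import math
import random

def sample_unit_ball(rng, dim=3):
    """Draw one point uniformly from the unit ball in `dim` dimensions."""
    while True:
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        if norm > 0.0:
            break
    # Radius ~ U^(1/dim) makes the density uniform in volume.
    radius = rng.random() ** (1.0 / dim)
    return [radius * c / norm for c in direction]

def target_network(point, weights):
    """One-hidden-layer MLP standing in for the target network."""
    W1, b1, W2, b2 = weights
    hidden = [math.tanh(sum(w * p for w, p in zip(row, point)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Placeholder weights (assumption): a hyper network would emit these per shape.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = (identity, [0.0, 0.0, 0.0], identity, [0.0, 0.0, 0.5])

rng = random.Random(0)
points = [sample_unit_ball(rng) for _ in range(100)]
shape = [target_network(p, weights) for p in points]
print(len(shape), len(shape[0]))
```

Because the shape is defined by the network rather than by a fixed list of points, any number of points can be sampled from the same weights.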

51. Real-Time target detection in maritime scenarios based on YOLOv3 model [PDF] 返回目录
Alessandro Betti, Benedetto Michelozzi, Andrea Bracci, Andrea Masini
Abstract: In this work, a novel ship dataset is proposed, consisting of more than 56k images of marine vessels collected by means of web scraping and covering 12 ship categories. A YOLOv3 single-stage detector based on the Keras API is built on top of this dataset. Current results on four categories (cargo ship, naval ship, oil ship and tug ship) show Average Precision up to 96% for an Intersection over Union (IoU) of 0.5 and satisfactory detection performance up to an IoU of 0.8. A Data Analytics GUI service based on the QT framework and the Darknet-53 engine is also implemented in order to simplify the deployment process and analyse massive amounts of images even for people without Data Science expertise.
摘要：在这项工作中数据集提出由借助于收集海洋船舶的多于56K的图像的新颖的一般Web的刮削和包括12个船类别。基于Keras API甲YOLOv3单级检测器是建立在此数据集的顶部。四类（货船，军舰，油船和船舶拖船）当前结果表明平均精确到0.5和满意的检测性能96％交叉口超过联盟（IOU）高达0.8 IOU。基于Qt框架和暗网-53发动机的数据分析GUI服务，以简化部署流程和分析图像的巨量甚至对人没有数据科学的专业知识也可以实现。
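
For readers unfamiliar with the IoU thresholds quoted above, a minimal sketch of the metric for axis-aligned boxes follows. The `(x1, y1, x2, y2)` corner format is an assumption; the abstract does not state the box convention used.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0.0, 0.0, 10.0, 10.0)
ground_truth = (5.0, 0.0, 15.0, 10.0)
score = iou(predicted, ground_truth)
print(score, score >= 0.5)
```

A detection counts as correct at a given threshold only if its IoU with a ground-truth box reaches that value, so reporting results at 0.8 is a much stricter localization test than at 0.5.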

52. Preventing Clean Label Poisoning using Gaussian Mixture Loss [PDF] 返回目录
Muhammad Yaseen, Muneeb Aadil, Maria Sargsyan
Abstract: Since 2014, when Szegedy et al. showed that carefully designed perturbations of the input can lead Deep Neural Networks (DNNs) to misclassify it, there has been ongoing research to make DNNs more robust to such malicious perturbations. In this work, we consider a poisoning attack called the Clean Labeling poisoning attack (CLPA). The goal of CLPA is to inject seemingly benign instances that can drastically change the decision boundary of the DNN, so that subsequent queries at test time can be misclassified. We argue that a strong defense against CLPA can be embedded into the model during training by constraining the features of the network to follow a Large Margin Gaussian Mixture distribution in the penultimate layer. With such prior knowledge, we can systematically evaluate how unusual an example is, given the label it claims to have. We demonstrate our built-in defense via experiments on the MNIST and CIFAR datasets. We train two models on each dataset: one trained via softmax, another via LGM. We show that using LGM can substantially reduce the effectiveness of CLPA while incurring no additional overhead for data sanitization. The code to reproduce our results is available online.
摘要：从2014年时Szegedy等。结果表明，输入精心设计的扰动会导致深层神经网络（DNNs）错误地归类它的标签，出现了持续的研究，使DNNs更强大的这种恶意干扰。在这项工作中，我们认为所谓的清洁标签中毒攻击中毒攻击（CLPA）。 CLPA的目标是注入看似良性的情况下，可以彻底改变DNNs由于其在测试时后续查询可以误分类的决策边界。我们认为，对CLPA强大的防御可以通过实施网络的功能，遵循在倒数第二层大幅度高斯混合分布的训练过程中嵌入到模型中。通过具有这样的先验知识，我们可以系统地评价如何不同寻常的例子，给了它声称是标签。我们证明通过对MNIST和CIFAR数据集实验我们内置的防御。我们培养对每个数据集两种模式：一是通过SOFTMAX训练有素，另一个通过LGM。我们表明，使用LGM可以同时具有数据消毒的无额外的开销大大减少CLPA的有效性。重现我们的结果的代码是在网上提供。

53. Identity Recognition in Intelligent Cars with Behavioral Data and LSTM-ResNet Classifier [PDF] 返回目录
Michael Hammann, Maximilian Kraus, Sina Shafaei, Alois Knoll
Abstract: Identity recognition in a car cabin is a critical task nowadays and offers a wide field of applications, ranging from personalizing intelligent cars to suit drivers' physical and behavioral needs to increasing safety and security. However, the performance and applicability of published approaches are still not suitable for use in series cars and need to be improved. In this paper, we investigate Human Identity Recognition in a car cabin with Time Series Classification (TSC) and deep neural networks. We use gas and brake pedal pressure as input to our models. This data is easily collectable during driving in everyday situations. Since our classifiers have very low memory requirements and do not require any input data preprocessing, we were able to train on a single Intel i5-3210M processor. Our classification approach is based on a combination of LSTM and ResNet. The network trained on a subset of NUDrive outperforms the ResNet and LSTM models trained on their own by 35.9% and 53.85% accuracy, respectively. We reach a final accuracy of 79.49% on a 10-driver subset of NUDrive and 96.90% on a 5-driver subset of UTDrive.
摘要：在汽车驾驶室身份识别是时下一个重要的任务，并提供应用软件，从个性化的智能汽车，以适应驾驶员的身体和行为需要增加安全保障有很大场。但是，性能和公布方式的适用性仍然不适合在系列轿车的使用和需要改进。在本文中，我们在汽车舱时间序列分类（TSC）和深层神经网络调查人身份识别。我们使用天然气和制动踏板的压力输入到我们的模型。这个数据是在日常生活中驾驶过程中容易收藏。由于我们的分类有很少的内存需求，并且不需要任何输入数据preproccesing，我们能够在只有一个英特尔i5-3210M处理器训练。我们的分类方法是根据LSTM和RESNET的组合。培训了NUDrive的一个子集的网络优于RESNET和LSTM模型35.9％和53.85％的准确度分别单独训练。我们到达79.49％的NUDrive的10集司机和96.90％的UTDrive子集5驱动程序最终精度。

54. Unsupervised Learning of Depth, Optical Flow and Pose with Occlusion from 3D Geometry [PDF] 返回目录
Guangming Wang, Chi Zhang, Hesheng Wang, Jingchuan Wang, Yong Wang, Xinlei Wang
Abstract: In autonomous driving, monocular sequences contain lots of information. Monocular depth estimation, camera ego-motion estimation and optical flow estimation in consecutive frames have recently attracted considerable attention. By analyzing the tasks above, pixels in the first frame are modeled into three parts: the rigid region, the non-rigid region, and the occluded region. In joint unsupervised training of depth and pose, we can segment the occluded region explicitly. The occlusion information is used in unsupervised learning of depth, pose and optical flow, as the image reconstructed by depth, pose and flow will be invalid in occluded regions. A less-than-mean mask is designed to further exclude the mismatched pixels that are affected by motion or illumination change in the training of the depth and pose networks. This method is also used to exclude some trivial mismatched pixels in the training of the flow net. Maximum normalization is proposed for the smoothness term of the depth-pose networks to restrain degradation in textureless regions. In the occluded region, as depth and camera motion can provide more reliable motion estimation, they can be used to instruct unsupervised learning of flow. Our experiments on the KITTI dataset demonstrate that the model based on three regions, with full and explicit segmentation of the occluded, rigid and non-rigid regions and corresponding unsupervised losses, can significantly improve performance on all three tasks.
摘要：在自主驾驶，单眼序列含有大量的信息。单眼深度估计，相机自运动估计和光流估计在连续的帧是高调关注最近。通过分析上述任务，在所述第一帧中的像素被建模为三个部分：刚性区域，非刚性区域和遮挡区域。在深度和姿态的联合监督的培训，我们可以细分闭塞的区域明确。遮蔽信息在深度，姿势和光流的无监督学习使用的，作为图像重建由深度，姿势和流量将是无效的遮挡区域。甲低于平均掩模被设计成进一步排除这些干扰在深度和姿势网络的训练运动或光照变化不匹配的像素。这种方法也可以用来排除在流动净额的训练一些琐碎的不匹配的像素。最大的标准化提出了深度姿势网络的平滑项约束的无纹理区域的退化。在咬合区，作为深度和摄像机运动可以提供更可靠的运动估计，它们可以被用于指示流的无监督学习。我们在KITTI实验数据集表明，基于三个区域，闭塞的充分和明确的细分，刚性区和非刚性区域与相应的监督的损失会显著提高三个任务性能的模型。

55. Learning Depth via Interaction [PDF] 返回目录
Antonio Loquercio, Alexey Dosovitskiy, Davide Scaramuzza
Abstract: Motivated by the astonishing capabilities of natural intelligent agents and inspired by theories from psychology, this paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment. Existing works for depth estimation require either massive amounts of annotated training data or some form of hard-coded geometrical constraint. This paper explores a new approach to learning depth perception requiring neither of those. Specifically, we train a specialized global-local network architecture with what would be available to a robot interacting with the environment: from extremely sparse depth measurements down to even a single pixel per image. From a pair of consecutive images, our proposed network outputs a latent representation of the observer's motion between the images and a dense depth map. Experiments on several datasets show that, when ground truth is available even for just one of the image pixels, the proposed network can learn monocular dense depth estimation up to 22.5% more accurately than state-of-the-art approaches. We believe that this work, despite its scientific interest, lays the foundations to learn depth from extremely sparse supervision, which can be valuable to all robotic systems acting under severe bandwidth or sensing constraints.
摘要：由天然智能代理的惊人能力的启发，并从心理学理论的启发，本文探讨这种看法得到通过与环境的交互耦合到世界的3D性能的想法。对于深度估计现有作品需要注释的训练数据或某种形式的硬编码的几何约束的任大量。本文探讨了一种新的方法来学习的深度知觉既不需要这些的。具体来说，我们培养什么将提供给机器人与环境交互的专业全球和当地的网络架构：从极其稀疏深度测量低至每幅图像甚至单个像素。从一对连续图像的，我们所提出的网络输出的图像和稠密深度图之间的观察者的运动的潜表示。几个数据集的实验表明，当基础事实可即使对图像的像素只有一个，所提出的网络可以更准确地了解单眼密集深度估计达22.5％，比国家的最先进的方法。我们认为，这一工作，尽管它的科学兴趣，奠定了基础，从极其稀疏的监督，这可能是下严重的带宽或传感约束作用的所有机器人系统的宝贵学习的深度。

56. Long Short-Term Sample Distillation [PDF] 返回目录
Liang Jiang, Zujie Wen, Zhongping Liang, Yafang Wang, Gerard de Melo, Zhe Li, Liangzhuang Ma, Jiaxing Zhang, Xiaolong Li, Yuan Qi
Abstract: In the past decade, there has been substantial progress in training increasingly deep neural networks. Recent advances within the teacher-student training paradigm have established that information about past training updates shows promise as a source of guidance during subsequent training steps. Based on this notion, in this paper, we propose Long Short-Term Sample Distillation, a novel training policy that simultaneously leverages multiple phases of the previous training process to guide the later training updates to a neural network, while efficiently proceeding in just a single generation pass. With Long Short-Term Sample Distillation, the supervision signal for each sample is decomposed into two parts: a long-term signal and a short-term one. The long-term teacher draws on snapshots from several epochs ago in order to provide steadfast guidance and to guarantee teacher-student differences, while the short-term one yields more up-to-date cues with the goal of enabling higher-quality updates. Moreover, the teachers for each sample are unique, such that, overall, the model learns from a very diverse set of teachers. Comprehensive experimental results across a range of vision and NLP tasks demonstrate the effectiveness of this new training method.
摘要：在过去的十年里，在训练越来越深层神经网络已取得实质性进展。教师中的最新进展 - 学生培养模式已经建立了关于过去的培训更新的信息显示承诺为指导的在随后的训练步骤的来源。基于这个概念，在本文中，我们提出了长短期样品蒸馏，一种新颖的培训政策，即同时利用了以前的培训过程中的多个阶段，以指导以后的训练更新神经网络，而只是一个单一有效地进行代传。随着长短期样品蒸馏，每个样品的监管信号被分解为两个部分：一个长期的信号和短期的一个。长期的教师借鉴了几个时代的快照前，以提供坚定的指导和保障教师 - 学生的差异，而短期收益率一个更先进的最新线索有，可实现更高质量的更新的目标。此外，教师对每个样品是唯一的，这样，总体而言，从一个非常多样化的教师模型获悉。在一系列的视觉和NLP任务的综合实验结果表明，这种新的训练方法的有效性。

57. Towards Unconstrained Palmprint Recognition on Consumer Devices: a Literature Review [PDF] 返回目录
Adrian-S. Ungureanu, Saqib Salahuddin, Peter Corcoran
Abstract: As a biometric, palmprints have been largely under-utilized, but they offer some advantages over fingerprints and facial biometrics. Recent improvements in imaging capabilities on handheld and wearable consumer devices have re-awakened interest in the use of palmprints. The aim of this paper is to provide a comprehensive review of state-of-the-art methods for palmprint recognition, including Region of Interest extraction methods, feature extraction approaches and matching algorithms, along with an overview of available palmprint datasets, in order to understand the latest trends and research dynamics in the palmprint recognition field.
摘要：生物识别掌纹已基本得到充分利用，但他们提供了指纹和脸部生物特征一定的优势。在手持设备和可穿戴式消费设备的成像能力的最新改进在使用FO掌纹重新唤醒兴趣。本文的目的是为掌纹识别其中的感兴趣区域提取方法，特征提取方法和匹配算法，以便了解最新趋势的现有掌纹数据集的概述一起提供的先进设备，最先进的方法进行全面审查和研究动态的掌纹识别领域。

58. A-TVSNet: Aggregated Two-View Stereo Network for Multi-View Stereo Depth Estimation [PDF] 返回目录
Sizhang Dai, Weibing Huang
Abstract: We propose a learning-based network for depth map estimation from multi-view stereo (MVS) images. Our proposed network consists of three sub-networks: 1) a base network for initial depth map estimation from an unstructured stereo image pair, 2) a novel refinement network that leverages both photometric and geometric information, and 3) an attentional multi-view aggregation framework that enables efficient information exchange and integration among different stereo image pairs. The proposed network, called A-TVSNet, is evaluated on various MVS datasets and shows the ability to produce high-quality depth maps that outperform those of competing approaches. Our code is available at this https URL.
摘要：我们提出了从多视点立体（MVS）图像深度图估计基于学习的网络。我们提出的网络由三个子网络：1）一种基础网络用于初始深度图估计从非结构化立体图像对，2）一种新颖的改进的网络，它利用光度和几何信息，以及3）一个所注意的多视图聚合框架，使不同的立体图像对之间高效的信息交换和集成。所提出的网络，称为A-TVSNet，在各种数据集MVS和节目进行评估，以生产高品质的深度图性能优于竞争的方法的能力。我们的代码可在此HTTPS URL。

59. Learned Enrichment of Top-View Grid Maps Improves Object Detection [PDF] 返回目录
Sascha Wirges, Ye Yang, Sven Richter, Hahohao Hu, Christoph Stiller
Abstract: We propose an object detector for top-view grid maps which is additionally trained to generate an enriched version of its input. Our goal in the joint model is to improve generalization by regularizing towards structural knowledge in the form of a map fused from multiple adjacent range sensor measurements. This training data can be generated in an automatic fashion and thus does not require manual annotations. We present an evidential framework to generate training data, investigate different model architectures and show that predicting enriched inputs as an additional task can improve object detection performance.
摘要：我们提出了其附加训练以产生其输入的富集版本顶视图网格地图的对象检测器。我们在关节模型的目标是通过在朝着从多个相邻的范围传感器测量稠合的映射的形式的结构知识正则化，以改善泛化。此训练数据可以以自动的方式来产生，因此不需要手动注释。我们提出的证据框架来生成训练数据，研究不同模型架构和显示，预测丰富的输入作为一个额外的任务可以提高目标探测性能。

60. Unbiased Mean Teacher for Cross Domain Object Detection [PDF] 返回目录
Jinhong Deng, Wen Li, Yuhua Chen, Lixin Duan
Abstract: Cross domain object detection is challenging, because object detection models are often vulnerable to data variance, especially to the considerable domain shift in cross domain scenarios. In this paper, we propose a new approach called Unbiased Mean Teacher (UMT) for cross domain object detection. While the simple mean teacher (MT) model exhibits good robustness to small data variance, it can also become easily biased in cross domain scenarios. We thus improve it with several simple yet highly effective strategies. In particular, we first propose a novel cross domain distillation for MT to maximally exploit the expertise of the teacher model. Then, we further alleviate the bias in the student model by augmenting training samples with pixel-level adaptation. Feature-level adversarial training is also incorporated to learn domain-invariant representations. Those strategies can be implemented easily in MT and lead to our unbiased MT model. Our model surpasses the existing state-of-the-art models by large margins on benchmark datasets, which demonstrates the effectiveness of our approach.

61. GPU-Accelerated Mobile Multi-view Style Transfer
Puneet Kohli, Saravana Gunaseelan, Jason Orozco, Yiwen Hua, Edward Li, Nicolas Dahlquist
Abstract: An estimated 60% of smartphones sold in 2018 were equipped with multiple rear cameras, enabling a wide variety of 3D-enabled applications such as 3D Photos. The success of 3D Photo platforms (Facebook 3D Photo, Holopix, etc.) depends on a steady influx of user-generated content. These platforms must provide simple image manipulation tools to facilitate content creation, akin to traditional photo platforms. Artistic neural style transfer, propelled by recent advancements in GPU technology, is one such tool for enhancing traditional photos. However, naively extrapolating single-view neural style transfer to the multi-view scenario produces visually inconsistent results and is prohibitively slow on mobile devices. We present a GPU-accelerated multi-view style transfer pipeline which enforces style consistency between views with on-demand performance on mobile platforms. Our pipeline is modular and creates high-quality depth and parallax effects from a stereoscopic image pair.

62. Relational Deep Feature Learning for Heterogeneous Face Recognition
MyeongAh Cho, Taeoh Kim, Ig-Jae Kim, Sangyoun Lee
Abstract: Heterogeneous Face Recognition (HFR) is a task that matches faces across two different domains such as VIS (visible light), NIR (near-infrared), or the sketch domain. In contrast to face recognition in the visual spectrum, this task requires extracting domain-invariant features or learning a common space projection because of the domain discrepancy. To bridge this domain gap, we propose a graph-structured module that focuses on facial relational information to reduce the fundamental differences in domain characteristics. Since relational information is domain independent, our Relational Graph Module (RGM) performs relation modeling from node vectors that represent facial components such as lips, nose, and chin. Propagation of the generated relational graph then reduces the domain difference by transitioning from spatially correlated CNN (convolutional neural network) features to inter-dependent relational features. In addition, we propose a Node Attention Unit (NAU) that performs node-wise recalibration to focus on the more informative nodes arising from the relation-based propagation. Furthermore, we suggest a novel conditional-margin loss function (C-Softmax) for efficient projection learning on the common latent space of the embedding vector. Our module can be plugged into any pre-trained face recognition network to help overcome the limitations of a small HFR database. The proposed method shows superior performance on three different HFR databases, CASIA NIR-VIS 2.0, IIIT-D Sketch, and BUAA-VisNir, with various pre-trained networks. Furthermore, we explore how our C-Softmax loss boosts HFR performance and also apply our loss to the large-scale visual face database LFW (Labeled Faces in the Wild) by learning inter-class margins adaptively.

63. Deep Image Spatial Transformation for Person Image Generation
Yurui Ren, Xiaoming Yu, Junming Chen, Thomas H. Li, Ge Li
Abstract: Pose-guided person image generation is to transform a source person image to a target pose. This task requires spatial manipulations of source data. However, Convolutional Neural Networks are limited by lacking the ability to spatially transform the inputs. In this paper, we propose a differentiable global-flow local-attention framework to reassemble the inputs at the feature level. Specifically, our model first calculates the global correlations between sources and targets to predict flow fields. Then, the flowed local patch pairs are extracted from the feature maps to calculate the local attention coefficients. Finally, we warp the source features using a content-aware sampling method with the obtained local attention coefficients. The results of both subjective and objective experiments demonstrate the superiority of our model. Besides, additional results in video animation and view synthesis show that our model is applicable to other tasks requiring spatial transformation. Our source code is available at this https URL.

64. SketchGCN: Semantic Sketch Segmentation with Graph Convolutional Networks
Lumin Yang, Jiajie Zhuang, Hongbo Fu, Kun Zhou, Youyi Zheng
Abstract: We introduce SketchGCN, a graph convolutional neural network for semantic segmentation and labeling of free-hand sketches. We treat an input sketch as a 2D point set and encode the stroke structure information into graph node/edge representations. To predict the per-point labels, our SketchGCN uses graph convolution and a global-local branching network architecture to extract both intra-stroke and inter-stroke features. SketchGCN significantly improves the accuracy of the state-of-the-art methods for semantic sketch segmentation (by 11.4% in the pixel-based metric and 18.2% in the component-based metric over a large-scale challenging SPG dataset) and has orders of magnitude fewer parameters than both image-based and sequence-based methods.

65. Global Context-Aware Progressive Aggregation Network for Salient Object Detection
Zuyao Chen, Qianqian Xu, Runmin Cong, Qingming Huang
Abstract: Deep convolutional neural networks have achieved competitive performance in salient object detection, where learning effective and comprehensive features plays a critical role. Most previous works adopted multi-level feature integration yet ignored the gap between different features. In addition, high-level features are diluted as they are passed along the top-down pathway. To remedy these issues, we propose a novel network named GCPANet to effectively integrate low-level appearance features, high-level semantic features, and global context features through progressive context-aware Feature Interweaved Aggregation (FIA) modules and generate the saliency map in a supervised way. Moreover, a Head Attention (HA) module is used to reduce information redundancy and enhance the top-layer features by leveraging spatial and channel-wise attention, and the Self Refinement (SR) module is utilized to further refine and heighten the input features. Furthermore, we design the Global Context Flow (GCF) module to generate the global context information at different stages, which aims to learn the relationship among different salient regions and alleviate the dilution effect of high-level features. Experimental results on six benchmark datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both quantitatively and qualitatively.

66. VAE/WGAN-Based Image Representation Learning For Pose-Preserving Seamless Identity Replacement In Facial Images
Hiroki Kawai, Jiawei Chen, Prakash Ishwar, Janusz Konrad
Abstract: We present a novel variational generative adversarial network (VGAN) based on Wasserstein loss to learn a latent representation from a face image that is invariant to identity but preserves head-pose information. This facilitates synthesis of a realistic face image with the same head pose as a given input image, but with a different identity. One application of this network is in privacy-sensitive scenarios; after identity replacement in an image, utility, such as head pose, can still be recovered. Extensive experimental validation on synthetic and real human-face image datasets performed under 3 threat scenarios confirms the ability of the proposed network to preserve head pose of the input image, mask the input identity, and synthesize a good-quality realistic face image of a desired identity. We also show that our network can be used to perform pose-preserving identity morphing and identity-preserving pose morphing. The proposed method improves over a recent state-of-the-art method in terms of quantitative metrics as well as synthesized image quality.

67. A Novel Recurrent Encoder-Decoder Structure for Large-Scale Multi-view Stereo Reconstruction from An Open Aerial Dataset
Jin Liu, Shunping Ji
Abstract: A great deal of research has recently demonstrated that multi-view stereo (MVS) matching can be solved with deep learning methods. However, these efforts focused on close-range objects, and only very few of the deep learning-based methods were specifically designed for large-scale 3D urban reconstruction, due to the lack of multi-view aerial image benchmarks. In this paper, we present a synthetic aerial dataset, called the WHU dataset, which we created for MVS tasks and which, to our knowledge, is the first large-scale multi-view aerial dataset. It was generated from a highly accurate 3D digital surface model produced from thousands of real aerial images with precise camera parameters. We also introduce a novel network, called RED-Net, for wide-range depth inference, which we developed from a recurrent encoder-decoder structure to regularize cost maps across depths and a 2D fully convolutional network as framework. RED-Net's low memory requirements and high performance make it suitable for large-scale and highly accurate 3D Earth surface reconstruction. Our experiments confirmed that our method not only outperformed the current state-of-the-art MVS methods by more than 50% in mean absolute error (MAE) with less memory and computational cost, but was also more efficient. It outperformed one of the best commercial software programs based on conventional methods, improving on its efficiency 16-fold. Moreover, we showed that our RED-Net model pre-trained on the synthetic WHU dataset can be efficiently transferred to very different multi-view aerial image datasets without any fine-tuning. The dataset is available at this http URL.

68. Matching Neuromorphic Events and Color Images via Adversarial Learning
Fang Xu, Shijie Lin, Wen Yang, Lei Yu, Dengxin Dai, Gui-song Xia
Abstract: The event camera has appealing properties: high dynamic range, low latency, low power consumption and low memory usage, and thus complements conventional frame-based cameras. It only captures the dynamics of a scene and is able to capture almost "continuous" motion. However, unlike a frame-based camera, which records the whole appearance of a scene as it is, the event camera discards the detailed characteristics of objects, such as texture and color. To take advantage of both modalities, the event camera and the frame-based camera are combined for various machine vision tasks. Cross-modal matching between neuromorphic events and color images then plays a vital role. In this paper, we propose the Event-Based Image Retrieval (EBIR) problem to exploit the cross-modal matching task. Given an event stream depicting a particular object as a query, the aim is to retrieve color images containing the same object. This problem is challenging because there exists a large modality gap between neuromorphic events and color images. We address the EBIR problem by proposing neuromorphic Events-Color image Feature Learning (ECFL). In particular, adversarial learning is employed to jointly model neuromorphic events and color images in a common embedding space. We also contribute the N-UKbench and EC180 datasets to the community to promote the development of the EBIR problem. Extensive experiments on our datasets show that the proposed method is superior in learning effective modality-invariant representations to link two different modalities.

69. Extremely Dense Point Correspondences using a Learned Feature Descriptor
Xingtong Liu, Yiping Zheng, Benjamin Killeen, Masaru Ishii, Gregory D. Hager, Russell H. Taylor, Mathias Unberath
Abstract: High-quality 3D reconstructions from endoscopy video play an important role in many clinical applications, including surgical navigation where they enable direct video-CT registration. While many methods exist for general multi-view 3D reconstruction, these methods often fail to deliver satisfactory performance on endoscopic video. Part of the reason is that local descriptors that establish pair-wise point correspondences, and thus drive reconstruction, struggle when confronted with the texture-scarce surface of anatomy. Learning-based dense descriptors usually have larger receptive fields enabling the encoding of global information, which can be used to disambiguate matches. In this work, we present an effective self-supervised training scheme and novel loss design for dense descriptor learning. In direct comparison to recent local and dense descriptors on an in-house sinus endoscopy dataset, we demonstrate that our proposed dense descriptor can generalize to unseen patients and scopes, thereby largely improving the performance of Structure from Motion (SfM) in terms of model density and completeness. We also evaluate our method on a public dense optical flow dataset and a small-scale SfM public dataset to further demonstrate the effectiveness and generality of our method. The source code is available at this https URL.

70. 3D Point Cloud Processing and Learning for Autonomous Driving
Siheng Chen, Baoan Liu, Chen Feng, Carlos Vallespi-Gonzalez, Carl Wellington
Abstract: We present a review of 3D point cloud processing and learning for autonomous driving. As one of the most important sensors in autonomous vehicles, light detection and ranging (LiDAR) sensors collect 3D point clouds that precisely record the external surfaces of objects and scenes. The tools for 3D point cloud processing and learning are critical to the map creation, localization, and perception modules in an autonomous vehicle. While much attention has been paid to data collected from cameras, such as images and videos, an increasing number of researchers have recognized the importance and significance of LiDAR in autonomous driving and have proposed processing and learning algorithms to exploit 3D point clouds. We review the recent progress in this research area and summarize what has been tried and what is needed for practical and safe autonomous vehicles. We also offer perspectives on open issues that need to be solved in the future.

71. Rethinking Fully Convolutional Networks for the Analysis of Photoluminescence Wafer Images
Maike Lorena Stern, Hans Lindberg, Klaus Meyer-Wegener
Abstract: The manufacturing of light-emitting diodes is a complex semiconductor-manufacturing process, interspersed with different measurements. Among the employed measurements, photoluminescence imaging has several advantages, namely being a non-destructive, fast and thus cost-effective measurement. On a photoluminescence measurement image of an LED wafer, every pixel corresponds to an LED chip's brightness after photo-excitation, revealing chip performance information. However, generating a chip-fine defect map of the LED wafer, based on photoluminescence images, proves challenging for multiple reasons: on the one hand, the measured brightness values vary from image to image, in addition to local spots of differing brightness. On the other hand, certain defect structures may assume multiple shapes, sizes and brightness gradients, where salient brightness values may correspond to defective LED chips, measurement artefacts or non-defective structures. In this work, we revisit the creation of chip-fine defect maps using fully convolutional networks and show that the problem of segmenting objects at multiple scales can be improved by the incorporation of densely connected convolutional blocks and atrous spatial pyramid pooling modules. We also share implementation details and our experiences with training networks with small datasets of measurement images. The proposed architecture significantly improves the segmentation accuracy of highly variable defect structures over our previous version.
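As background for the atrous spatial pyramid pooling modules mentioned above: an atrous (dilated) convolution spaces the kernel taps `dilation` samples apart, so the same small kernel covers a wider span of the input. The following is a minimal 1D sketch in plain Python, not the authors' architecture. The function name and the toy signal are illustrative assumptions, and, as is usual in deep learning code, the operation is a "valid" cross-correlation (the kernel is not flipped).

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D cross-correlation with kernel taps spaced `dilation` apart."""
    # Number of input samples covered by one kernel placement.
    span = (len(kernel) - 1) * dilation + 1
    output = []
    for start in range(len(signal) - span + 1):
        total = 0.0
        for tap, weight in enumerate(kernel):
            total += weight * signal[start + tap * dilation]
        output.append(total)
    return output


if __name__ == "__main__":
    ramp = [1, 2, 3, 4, 5, 6, 7]
    difference_kernel = [1, 0, -1]
    # Same 3-tap kernel, two different dilation rates.
    print(dilated_conv1d(ramp, difference_kernel, dilation=1))
    print(dilated_conv1d(ramp, difference_kernel, dilation=2))
```

A pyramid pooling module would apply several such dilation rates in parallel and combine the results, which is one way to respond to structures at multiple scales without adding kernel weights.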

72. Fast Lidar Clustering by Density and Connectivity
Frederik Hasecke, Lukas Hahn, Anton Kummert
Abstract: Lidar sensors are widely used in various applications, ranging from scientific fields through industrial use to integration in consumer products. With an ever-growing number of different driver assistance systems, they have been introduced to automotive series production in recent years and are considered an important building block for the practical realisation of autonomous driving. However, due to the potentially large number of Lidar points per scan, tailored algorithms are required to identify objects (e.g. pedestrians or vehicles) with high precision in a very short time. In this work, we propose an algorithmic approach for real-time instance segmentation of Lidar sensor data. We show how our method leverages the properties of the Euclidean distance to retain three-dimensional measurement information, while being narrowed down to a two-dimensional representation for fast computation. We further introduce what we call skip connections to make our approach robust against over-segmentation and improve assignment in cases of partial occlusion. Through detailed evaluation on public data and comparison with established methods, we show how these aspects enable state-of-the-art performance and runtime on a single CPU core.

73. The Sloop System for Individual Animal Identification with Deep Learning
Kshitij Bakliwal, Sai Ravela
Abstract: The MIT Sloop system indexes and retrieves photographs from databases of non-stationary animal population distributions. To do this, it adaptively represents and matches generic visual feature representations using sparse relevance feedback from experts and crowds. Here, we describe the Sloop system and its application, then compare its approach to a standard deep learning formulation. We then show that priming with amplitude and deformation features requires only very shallow networks to produce superior recognition results. Results suggest that relevance feedback, which enables Sloop's high-recall performance, may also be essential for deep learning approaches to individual identification to deliver comparable results.

74. Soft-Root-Sign Activation Function
Yuan Zhou, Dandan Li, Shuwei Huo, Sun-Yuan Kung
Abstract: The choice of activation function in deep networks has a significant effect on the training dynamics and task performance. At present, the most effective and widely used activation function is ReLU. However, because of its non-zero mean, missing negative values, and unbounded output, ReLU is at a potential disadvantage during optimization. To this end, we introduce a novel activation function that overcomes the above three challenges. The proposed nonlinearity, namely "Soft-Root-Sign" (SRS), is smooth, non-monotonic, and bounded. Notably, the bounded property of SRS distinguishes it from most state-of-the-art activation functions. In contrast to ReLU, SRS can adaptively adjust the output by a pair of independent trainable parameters to capture negative information and provide a zero-mean property, which leads not only to better generalization performance, but also to faster learning speed. It also prevents the output distribution from being scattered in the non-negative real number space, making it more compatible with batch normalization (BN) and less sensitive to initialization. In experiments, we evaluated SRS on deep networks applied to a variety of tasks, including image classification, machine translation and generative modelling. Our SRS matches or exceeds models with ReLU and other state-of-the-art nonlinearities, showing that the proposed activation function generalizes and can achieve high performance across tasks. An ablation study further verified its compatibility with BN and its self-adaptability to different initializations.
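The abstract does not state the SRS formula. The sketch below assumes the form usually quoted for it, SRS(x) = x / (x/alpha + exp(-x/beta)), with alpha and beta as the pair of trainable parameters. That form and the default values alpha=2.0 and beta=3.0 are assumptions to check against the paper, and here the parameters are plain floats rather than learned values.

```python
import math


def soft_root_sign(x, alpha=2.0, beta=3.0):
    """Assumed SRS form: x / (x / alpha + exp(-x / beta)).

    alpha and beta stand in for the trainable parameters; with the
    assumed defaults the denominator stays positive for all real x.
    Very large negative x would overflow math.exp in this toy version.
    """
    return x / (x / alpha + math.exp(-x / beta))


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0, 10.0):
        print(value, soft_root_sign(value))
```

Under this assumed form the output passes through zero at x = 0, takes negative values for negative inputs, and approaches alpha from below as x grows, which matches the abstract's description of a bounded, non-monotonic function that keeps negative information.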

75. ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection
Zhenbo Xu, Wei Zhang, Xiaoqing Ye, Xiao Tan, Wei Yang, Shilei Wen, Errui Ding, Ajin Meng, Liusheng Huang
Abstract: 3D object detection is an essential task in autonomous driving and robotics. Though great progress has been made, challenges remain in estimating 3D pose for distant and occluded objects. In this paper, we present a novel framework named ZoomNet for stereo imagery-based 3D detection. The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes. To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module, adaptive zooming, which simultaneously resizes 2D instance bounding boxes to a unified resolution and adjusts the camera intrinsic parameters accordingly. In this way, we are able to estimate higher-quality disparity maps from the resized box images and then construct dense point clouds for both nearby and distant objects. Moreover, we propose learning part locations as complementary features to improve resistance against occlusion and put forward the 3D fitting score to better estimate the 3D detection quality. Extensive experiments on the popular KITTI 3D detection dataset indicate that ZoomNet surpasses all previous state-of-the-art methods by large margins (improved by 9.4% on APbv (IoU=0.7) over pseudo-LiDAR). An ablation study also demonstrates that our adaptive zooming strategy brings an improvement of over 10% on AP3d (IoU=0.7). In addition, since the official KITTI benchmark lacks fine-grained annotations like pixel-wise part locations, we also present our KFG dataset by augmenting KITTI with detailed instance-wise annotations including pixel-wise part location, pixel-wise disparity, etc. Both the KFG dataset and our code will be publicly available at this https URL.
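The adaptive zooming step described above resizes a box crop and adjusts the camera intrinsics to match. The sketch below shows the generic pinhole-camera arithmetic for a crop followed by a resize, which is one plausible reading of that step rather than the authors' implementation. The function names, the box format `(x0, y0, x1, y1)`, and the example numbers are all illustrative assumptions, and half-pixel center offsets are ignored.

```python
def zoom_intrinsics(fx, fy, cx, cy, box, out_size):
    """Intrinsics of the virtual camera that sees only the resized crop.

    box: (x0, y0, x1, y1) in original pixel coordinates.
    out_size: (width, height) of the resized crop.
    """
    x0, y0, x1, y1 = box
    out_w, out_h = out_size
    scale_x = out_w / (x1 - x0)
    scale_y = out_h / (y1 - y0)
    # Cropping shifts the principal point; resizing scales everything.
    return fx * scale_x, fy * scale_y, (cx - x0) * scale_x, (cy - y0) * scale_y


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point (X, Y, Z) to pixels."""
    X, Y, Z = point
    return fx * X / Z + cx, fy * Y / Z + cy


if __name__ == "__main__":
    original = (700.0, 700.0, 600.0, 180.0)  # hypothetical fx, fy, cx, cy
    box = (500.0, 100.0, 700.0, 200.0)       # hypothetical 2D detection
    zoomed = zoom_intrinsics(*original, box, (400, 200))
    print(project((1.0, 0.1, 10.0), *zoomed))
```

The consistency check is that projecting a 3D point with the adjusted intrinsics gives the same pixel as projecting it with the original intrinsics and then applying the crop and resize to that pixel.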

76. Deep Attention Aware Feature Learning for Person Re-Identification
Yifan Chen, Han Wang, Xiaolu Sun, Bin Fan, Chu Tang
Abstract: Visual attention has proven to be effective in improving the performance of person re-identification. Most existing methods apply visual attention heuristically by learning an additional attention map to re-weight the feature maps for person re-identification. However, such methods inevitably increase the model complexity and inference time. In this paper, we propose to incorporate attention learning as additional objectives in a person ReID network without changing the original structure, thus maintaining the same inference time and model size. Two kinds of attention are considered, making the learned feature maps aware of the person and of related body parts, respectively. Globally, a holistic attention branch (HAB) makes the feature maps obtained by the backbone focus on persons so as to alleviate the influence of background. Locally, a partial attention branch (PAB) decouples the extracted features into several groups that are separately responsible for different body parts (i.e., keypoints), thus increasing the robustness to pose variation and partial occlusion. These two kinds of attention are universal and can be incorporated into existing ReID networks. We have tested their performance on two typical networks (TriNet and Bag of Tricks) and observed significant performance improvement on five widely used datasets.

77. MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships
Yongjian Chen, Lei Tai, Kai Sun, Mingyang Li
Abstract: Monocular 3D object detection is an essential component in autonomous driving, yet it is challenging to solve, especially for occluded samples which are only partially visible. Most detectors consider each 3D object as an independent training target, inevitably resulting in a lack of useful information for occluded samples. To this end, we propose a novel method to improve monocular 3D object detection by considering the relationship of paired samples. This allows us to encode spatial constraints for partially occluded objects from their adjacent neighbors. Specifically, the proposed detector computes uncertainty-aware predictions for object locations and 3D distances for the adjacent object pairs, which are subsequently jointly optimized by nonlinear least squares. Finally, the one-stage uncertainty-aware prediction structure and the post-optimization module are tightly integrated to ensure run-time efficiency. Experiments demonstrate that our method yields the best performance on the KITTI 3D detection benchmark, outperforming state-of-the-art competitors by wide margins, especially for the hard samples.

78. PointASNL: Robust Point Clouds Processing using Nonlocal Neural Networks with Adaptive Sampling
Xu Yan, Chaoda Zheng, Zhen Li, Sheng Wang, Shuguang Cui
Abstract: Raw point cloud data inevitably contains outliers or noise introduced during acquisition from 3D sensors or by reconstruction algorithms. In this paper, we present a novel end-to-end network for robust point cloud processing, named PointASNL, which can deal with noisy point clouds effectively. The key component in our approach is the adaptive sampling (AS) module. It first re-weights the neighbors around the initial sampled points from farthest point sampling (FPS), and then adaptively adjusts the sampled points beyond the entire point cloud. Our AS module not only benefits the feature learning of point clouds, but also eases the biased effect of outliers. To further capture the neighbor and long-range dependencies of the sampled points, we propose a local-nonlocal (L-NL) module inspired by the nonlocal operation. This L-NL module makes the learning process insensitive to noise. Extensive experiments verify the robustness and superiority of our approach in point cloud processing tasks on synthetic data, indoor data, and outdoor data, with or without noise. Specifically, PointASNL achieves state-of-the-art robust performance for classification and segmentation tasks on all datasets, and significantly outperforms previous methods on the real-world outdoor SemanticKITTI dataset with considerable noise. Our code is released through this https URL.
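Farthest point sampling (FPS), which supplies the initial sampled points in the description above, is a standard greedy procedure: repeatedly pick the point whose distance to the already chosen set is largest. The sketch below is a plain-Python illustration of generic FPS only. The function name, the `start` argument, and the toy points are assumptions, and the adaptive re-weighting performed by the AS module is not shown.

```python
def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be far apart.

    points: sequence of equal-length coordinate tuples.
    k: number of samples, assumed to satisfy 1 <= k <= len(points).
    start: index of the first chosen point (an arbitrary seed).
    """
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    chosen = [start]
    # Squared distance from every point to its nearest chosen point so far.
    nearest = [squared_distance(p, points[start]) for p in points]
    while len(chosen) < k:
        next_index = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(next_index)
        nearest = [
            min(d, squared_distance(p, points[next_index]))
            for d, p in zip(nearest, points)
        ]
    return chosen


if __name__ == "__main__":
    cloud = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (5, 0, 0)]
    print(farthest_point_sampling(cloud, 3))
```

Because FPS always favors the most isolated remaining point, it tends to select outliers early, which is consistent with the abstract's motivation for adjusting the sampled points afterwards.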
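The farthest point sampling (FPS) step that the AS module starts from can be sketched as follows. This is a minimal pure-Python illustration of the standard greedy FPS algorithm, not the PointASNL implementation; the function name `farthest_point_sampling`, the `start` parameter, and the toy `cloud` are assumptions made for the example.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

# Hypothetical toy cloud: two near-duplicate points and two distant ones.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(farthest_point_sampling(cloud, 3))  # [0, 3, 2]
```

Because FPS always picks the most distant remaining point, an isolated outlier tends to be selected early, which is the kind of bias the adaptive re-weighting described in the abstract is meant to reduce.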

79. State-Aware Tracker for Real-Time Video Object Segmentation
Xi Chen, Zuoxin Li, Ye Yuan, Gang Yu, Jianxin Shen, Donglian Qi
Abstract: In this work, we address the task of semi-supervised video object segmentation (VOS) and explore how to make efficient use of video properties to tackle the challenge of semi-supervision. We propose a novel pipeline called State-Aware Tracker (SAT), which can produce accurate segmentation results at real-time speed. For higher efficiency, SAT takes advantage of inter-frame consistency and treats each target object as a tracklet. For more stable and robust performance over video sequences, SAT becomes aware of each state and adapts itself via two feedback loops. One loop assists SAT in generating more stable tracklets. The other loop helps to construct a more robust and holistic target representation. SAT achieves a promising result of 72.3% J&F mean at 39 FPS on the DAVIS2017-Val dataset, which shows a decent trade-off between efficiency and accuracy. Code will be released at this http URL.

80. STC-Flow: Spatio-temporal Context-aware Optical Flow Estimation
Xiaolin Song, Yuyang Zhao, Jingyu Yang, Cuiling Lan, Wenjun Zeng, Jiahao Li
Abstract: In this paper, we propose a spatio-temporal contextual network, STC-Flow, for optical flow estimation. Unlike previous optical flow estimation approaches with local pyramid feature extraction and multi-level correlation, we propose a contextual relation exploration architecture by capturing rich long-range dependencies in spatial and temporal dimensions. Specifically, STC-Flow contains three key context modules - pyramidal spatial context module, temporal context correlation module and recurrent residual contextual upsampling module, to build the relationship in each stage of feature extraction, correlation, and flow reconstruction, respectively. Experimental results indicate that the proposed scheme achieves the state-of-the-art performance of two-frame based methods on the Sintel dataset and the KITTI 2012/2015 datasets.

81. Deep Learning for Content-based Personalized Viewport Prediction of 360-Degree VR Videos
Xinwei Chen, Ali Taleb Zadeh Kasgari, Walid Saad
Abstract: In this paper, the problem of head movement prediction for virtual reality videos is studied. In the considered model, a deep learning network is introduced to leverage position data as well as video frame content to predict future head movement. To optimize the data input into this neural network, data sample rate, reduced data, and long-period prediction length are also explored for this model. Simulation results show that the proposed approach yields a 16.1% improvement in prediction accuracy compared to a baseline approach that relies only on the position data.

82. Learning When and Where to Zoom with Deep Reinforcement Learning
Burak Uzkent, Stefano Ermon
Abstract: While high resolution images contain semantically more useful information than their lower resolution counterparts, processing them is computationally more expensive, and in some applications, e.g., remote sensing, they can be much more expensive to acquire. For these reasons, it is desirable to develop an automatic method that selectively uses high resolution data when necessary while maintaining accuracy and reducing acquisition/run-time cost. In this direction, we propose PatchDrop, a reinforcement learning approach that dynamically identifies when and where to use/acquire high resolution data conditioned on the paired, cheap, low resolution images. We conduct experiments on the CIFAR10, CIFAR100, ImageNet and fMoW datasets, where we use significantly less high resolution data while maintaining similar accuracy to models which use full high resolution images.
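The idea of using high resolution data only where a policy asks for it can be illustrated with a small masking sketch. This is not the PatchDrop code: the reinforcement learning policy is omitted, the binary `mask` stands in for its per-patch decisions, and the names `drop_patches` and `fraction_used` are hypothetical.

```python
def drop_patches(image, mask, patch):
    """Zero out every patch of `image` whose entry in `mask` is 0.

    image: 2D list (H x W) of pixel values.
    mask:  2D list (H/patch x W/patch) of 0/1 keep decisions, standing in
           for what a learned policy might emit (assumption for this sketch).
    patch: side length of each square patch in pixels.
    """
    out = [row[:] for row in image]
    for r, row in enumerate(out):
        for c in range(len(row)):
            if mask[r // patch][c // patch] == 0:
                row[c] = 0
    return out

def fraction_used(mask):
    """Share of high resolution patches that were kept."""
    flat = [m for row in mask for m in row]
    return sum(flat) / len(flat)

image = [[1] * 4 for _ in range(4)]   # toy 4x4 high resolution image
mask = [[1, 0], [0, 0]]               # keep only the top-left 2x2 patch
masked = drop_patches(image, mask, patch=2)
print(masked)               # [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(fraction_used(mask))  # 0.25
```

In this toy case only a quarter of the high resolution patches would need to be acquired or processed; the trade-off described in the abstract is between that fraction and downstream accuracy.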

83. Towards Automatic Face-to-Face Translation
Prajwal K R, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, C.V. Jawahar
Abstract: In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact on multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, for generating realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models and demo video are made publicly available. Demo video: this https URL. Code and models: this https URL.

84. PF-Net: Point Fractal Network for 3D Point Cloud Completion
Zitian Huang, Yikuan Yu, Jiawen Xu, Feng Ni, Xinyi Le
Abstract: In this paper, we propose a Point Fractal Network (PF-Net), a novel learning-based approach for precise and high-fidelity point cloud completion. Unlike existing point cloud completion networks, which generate the overall shape of the point cloud from the incomplete point cloud and always change existing points and encounter noise and geometrical loss, PF-Net preserves the spatial arrangements of the incomplete point cloud and can figure out the detailed geometrical structure of the missing region(s) in the prediction. To succeed at this task, PF-Net estimates the missing point cloud hierarchically by utilizing a feature-points-based multi-scale generating network. Further, we add up multi-stage completion loss and adversarial loss to generate more realistic missing region(s). The adversarial loss can better tackle multiple modes in the prediction. Our experiments demonstrate the effectiveness of our method for several challenging point cloud completion tasks.

85. FMT:Fusing Multi-task Convolutional Neural Network for Person Search
Sulan Zhai, Shunqiang Liu, Xiao Wang, Jin Tang
Abstract: Person search aims to detect all persons in an image and identify the query persons among the detected persons, without proposals and bounding boxes, which distinguishes it from person re-identification. In this paper, we propose a fusing multi-task convolutional neural network (FMT-CNN) to tackle the correlation and heterogeneity of detection and re-identification with a single convolutional neural network. We focus on how the interplay of person detection and person re-identification affects the overall performance. We employ person labels in the region proposal network to produce features for the person re-identification and person detection networks, which can improve the accuracy of detection and re-identification simultaneously. We also use a multiple loss to train our re-identification network. Experimental results on the CUHK-SYSU Person Search dataset show that the performance of our proposed method is superior to state-of-the-art approaches in both mAP and top-1.

86. Cops-Ref: A new Dataset and Task on Compositional Referring Expression Comprehension
Zhenfang Chen, Peng Wang, Lin Ma, Kwan-Yee K. Wong, Qi Wu
Abstract: Referring expression comprehension (REF) aims at identifying a particular object in a scene by a natural language expression. It requires joint reasoning over the textual and visual domains to solve the problem. Some popular referring expression datasets, however, fail to provide an ideal test bed for evaluating the reasoning ability of the models, mainly because 1) their expressions typically describe only some simple distinctive properties of the object and 2) their images contain limited distracting information. To bridge the gap, we propose a new dataset for visual reasoning in context of referring expression comprehension with two main features. First, we design a novel expression engine rendering various reasoning logics that can be flexibly combined with rich visual properties to generate expressions with varying compositionality. Second, to better exploit the full reasoning chain embodied in an expression, we propose a new test setting by adding additional distracting images containing objects sharing similar properties with the referent, thus minimising the success rate of reasoning-free cross-domain alignment. We evaluate several state-of-the-art REF models, but find none of them can achieve promising performance. A proposed modular hard mining strategy performs the best but still leaves substantial room for improvement. We hope this new dataset and task can serve as a benchmark for deeper visual reasoning analysis and foster the research on referring expression comprehension.

87. Intelligent Home 3D: Automatic 3D-House Design from Linguistic Descriptions Only
Qi Chen, Qi Wu, Rui Tang, Yuhan Wang, Shuai Wang, Mingkui Tan
Abstract: Home design is a complex task that normally requires architects to apply their professional skills and tools. It would be fascinating if one could produce a house plan intuitively, for example via natural language, without much knowledge of home design or experience with complex design tools. In this paper, we formulate this as a language-conditioned visual content generation problem that is further divided into a floor plan generation task and an interior texture (such as floor and wall) synthesis task. The only control signal of the generation process is the linguistic expression given by users that describes the house details. To this end, we propose a House Plan Generative Model (HPGM) that first translates the language input to a structural graph representation and then predicts the layout of rooms with a Graph Conditioned Layout Prediction Network (GC LPN) and generates the interior texture with a Language Conditioned Texture GAN (LCT-GAN). With some post-processing, the final product of this task is a 3D house model. To train and evaluate our model, we build the first Text-to-3D House Model dataset.

88. Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision
Denis Gudovskiy, Alec Hodgkinson, Takuya Yamaguchi, Sotaro Tsukizawa
Abstract: Active learning (AL) aims to minimize labeling efforts for data-demanding deep neural networks (DNNs) by selecting the most representative data points for annotation. However, currently used methods are ill-equipped to deal with biased data. The main motivation of this paper is to consider a realistic setting for pool-based semi-supervised AL, where the unlabeled collection of train data is biased. We theoretically derive an optimal acquisition function for AL in this setting. It can be formulated as distribution shift minimization between unlabeled train data and weakly-labeled validation dataset. To implement such acquisition function, we propose a low-complexity method for feature density matching using self-supervised Fisher kernel (FK) as well as several novel pseudo-label estimators. Our FK-based method outperforms state-of-the-art methods on MNIST, SVHN, and ImageNet classification while requiring only 1/10th of processing. The conducted experiments show at least 40% drop in labeling efforts for the biased class-imbalanced data compared to existing methods.

89. Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning
Shizhe Chen, Yida Zhao, Qin Jin, Qi Wu
Abstract: Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach for this problem is to learn a joint embedding space to measure cross-modal similarities. However, simple joint embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which decomposes video-text matching into global-to-local levels. To be specific, the model disentangles texts into a hierarchical semantic graph including three levels of events, actions, and entities, together with the relationships across levels. Attention-based graph reasoning is utilized to generate hierarchical textual embeddings, which can guide the learning of diverse and hierarchical video representations. The HGR model aggregates matchings from different video-text levels to capture both global and local details. Experimental results on three video-text datasets demonstrate the advantages of our model. Such hierarchical decomposition also enables better generalization across datasets and improves the ability to distinguish fine-grained semantic differences.

90. Joint Wasserstein Distribution Matching
JieZhang Cao, Langyuan Mo, Qing Du, Yong Guo, Peilin Zhao, Junzhou Huang, Mingkui Tan
Abstract: The joint distribution matching (JDM) problem, which aims to learn bidirectional mappings to match the joint distributions of two domains, occurs in many machine learning and computer vision applications. This problem, however, is very difficult due to two critical challenges: (i) it is often difficult to exploit sufficient information from the joint distribution to conduct the matching; (ii) the problem is hard to formulate and optimize. In this paper, relying on optimal transport theory, we propose to address the JDM problem by minimizing the Wasserstein distance between the joint distributions in the two domains. However, the resultant optimization problem is still intractable. We then propose an important theorem to reduce the intractable problem to a simple optimization problem, and develop a novel method, called Joint Wasserstein Distribution Matching (JWDM), to solve it. In the experiments, we apply our method to unsupervised image translation and cross-domain video synthesis. Both qualitative and quantitative comparisons demonstrate the superior performance of our method over several state-of-the-art methods.
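For intuition about the quantity being minimized, the Wasserstein-1 distance has a simple closed form in one dimension: for two empirical samples of equal size, it is the mean absolute difference between the sorted values. The sketch below shows only that special case; it is not the JWDM method, which handles joint distributions of high-dimensional data, and the function name `wasserstein_1d` is an assumption.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    In one dimension the optimal transport plan matches the i-th smallest
    value of xs with the i-th smallest value of ys.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

print(wasserstein_1d([0, 1, 3], [5, 6, 8]))  # 5.0, a pure shift by 5
print(wasserstein_1d([0, 1, 3], [3, 0, 1]))  # 0.0, same sample reordered
```

The distance depends only on the two distributions and not on how the samples are ordered, which is why the second call returns zero.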

91. Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs
Shizhe Chen, Qin Jin, Peng Wang, Qi Wu
Abstract: Humans are able to describe image contents with coarse to fine details as they wish. However, most image captioning models are intention-agnostic and cannot actively generate diverse descriptions according to different user intentions. In this work, we propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level and control what the generated description should cover and how detailed it should be. The ASG is a directed graph consisting of three types of abstract nodes (object, attribute, relationship) grounded in the image without any concrete semantic labels. Thus it is easy to obtain either manually or automatically. From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph, and therefore generate desired captions according to the graph structure. Our model achieves better controllability conditioning on ASGs than carefully designed baselines on both the VisualGenome and MSCOCO datasets. It also significantly improves caption diversity by automatically sampling diverse ASGs as control signals.

92. Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks
Nicolae-Catalin Ristea, Liviu Cristian Dutu, Anamaria Radoi
Abstract: Emotion recognition has become an important field of research in the human-computer interaction domain. The latest advancements in the field show that combining visual with audio information leads to better results than using a single source of information. From a visual point of view, a human emotion can be recognized by analyzing the facial expression of the person. More precisely, the human emotion can be described through a combination of several Facial Action Units. In this paper, we propose a system that is able to recognize emotions with a high accuracy rate and in real time, based on deep Convolutional Neural Networks. In order to increase the accuracy of the recognition system, we also analyze the speech data and fuse the information coming from both sources, i.e., visual and audio. Experimental results show the effectiveness of the proposed scheme for emotion recognition and the importance of combining visual with audio data.

93. Learning Cross-domain Generalizable Features by Representation Disentanglement
Qingjie Meng, Daniel Rueckert, Bernhard Kainz
Abstract: Deep learning models exhibit limited generalizability across different domains. Specifically, transferring knowledge from available entangled domain features (source/target domain) and categorical features to new unseen categorical features in a target domain is an interesting and difficult problem that is rarely discussed in the current literature. This problem is essential for many real-world applications such as improving diagnostic classification or prediction in medical imaging. To address this problem, we propose Mutual-Information-based Disentangled Neural Networks (MIDNet) to extract generalizable features that enable transferring knowledge to unseen categorical features in target domains. The proposed MIDNet is developed as a semi-supervised learning paradigm to alleviate the dependency on labeled data. This is important for practical applications where data annotation requires rare expertise as well as intense time and labor. We demonstrate our method on handwritten digit datasets and a fetal ultrasound dataset for image classification tasks. Experiments show that our method outperforms the state-of-the-art and achieves the expected performance with sparsely labeled data.

94. Grounded and Controllable Image Completion by Incorporating Lexical Semantics
Shengyu Zhang, Tan Jiang, Qinghao Huang, Ziqi Tan, Zhou Zhao, Siliang Tang, Jin Yu, Hongxia Yang, Yi Yang, Fei Wu
Abstract: In this paper, we present an approach, namely Lexical Semantic Image Completion (LSIC), that may have potential applications in art, design, and heritage conservation, among several others. Existing image completion procedures are highly subjective because they consider only visual context, which may trigger unpredictable results that are plausible but not faithful to grounded knowledge. To permit a completion process that is both grounded and controllable, we advocate generating results faithful to both visual and lexical semantic context, i.e., the description of the holes or blank regions left in the image (e.g., hole description). One major challenge for LSIC comes from modeling and aligning the structure of visual-semantic context and translating across different modalities. We term this process structure completion, which is realized by multi-grained reasoning blocks in our model. Another challenge relates to unimodal biases, which occur when the model generates plausible results without using the textual description. This can happen because the annotated captions for an image are often semantically equivalent in existing datasets, and thus there is only one paired text for a masked image in training. We devise an unsupervised unpaired-creation learning path besides the over-explored paired-reconstruction path, as well as a multi-stage training strategy to mitigate the insufficiency of labeled data. We conduct extensive quantitative and qualitative experiments as well as ablation studies, which reveal the efficacy of our proposed LSIC.

95. Representations, Metrics and Statistics For Shape Analysis of Elastic Graphs
Xiaoyang Guo, Anuj Srivastava
Abstract: Past approaches for statistical shape analysis of objects have focused mainly on objects within the same topological classes, e.g., scalar functions, Euclidean curves, or surfaces. For objects that differ in more complex ways, the current literature offers only topological methods. This paper introduces a far-reaching geometric approach for analyzing shapes of graphical objects, such as road networks, blood vessels, and brain fiber tracts. It represents such objects, exhibiting differences in both geometries and topologies, as graphs made of curves with arbitrary shapes (edges) and connected at arbitrary junctions (nodes). To perform statistical analyses, one needs mathematical representations, metrics and other geometrical tools, such as geodesics, means, and covariances. This paper utilizes a quotient structure to develop efficient algorithms for computing these quantities, leading to useful statistical tools, including principal component analysis and analytical statistical testing and modeling of graphical shapes. The efficacy of this framework is demonstrated using various simulated data as well as real data from neurons and brain arterial networks.

96. Augmenting Visual Place Recognition with Structural Cues
Amadeus Oertel, Titus Cieslewski, Davide Scaramuzza
Abstract: In this paper, we propose to augment image-based place recognition with structural cues. Specifically, these structural cues are obtained using structure-from-motion, such that no additional sensors are needed for place recognition. This is achieved by augmenting the 2D convolutional neural network (CNN) typically used for image-based place recognition with a 3D CNN that takes as input a voxel grid derived from the structure-from-motion point cloud. We evaluate different methods for fusing the 2D and 3D features and obtain best performance with global average pooling and simple concatenation. The resulting descriptor exhibits superior recognition performance compared to descriptors extracted from only one of the input modalities, including state-of-the-art image-based descriptors. Especially at low descriptor dimensionalities, we outperform state-of-the-art descriptors by up to 90%.
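The fusion the abstract reports as performing best, global average pooling followed by simple concatenation, can be sketched in a few lines. The feature maps below are toy values and the names `global_average_pool` and `fuse` are assumptions; the actual 2D and 3D CNN backbones are not modeled here.

```python
def global_average_pool(feature_map):
    """Collapse each channel to its mean activation.

    feature_map: list of channels, each a flat list of activations
    (the spatial or voxel layout does not matter for a global mean).
    """
    return [sum(channel) / len(channel) for channel in feature_map]

def fuse(feat_2d, feat_3d):
    """Concatenate the pooled 2D (image) and 3D (voxel) descriptors."""
    return global_average_pool(feat_2d) + global_average_pool(feat_3d)

# Hypothetical toy features: two image channels and one voxel channel.
feat_2d = [[1.0, 3.0], [2.0, 2.0]]
feat_3d = [[0.0, 4.0, 8.0]]
print(fuse(feat_2d, feat_3d))  # [2.0, 2.0, 4.0]
```

The fused descriptor length is the sum of the two channel counts, independent of the input resolution of either branch, which keeps the two modalities easy to combine.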

97. Reusing Discriminators for Encoding Towards Unsupervised Image-to-Image Translation
Runfa Chen, Wenbing Huang, Binghui Huang, Fuchun Sun, Bin Fang
Abstract: Unsupervised image-to-image translation is a central task in computer vision. Current translation frameworks abandon the discriminator once the training process is completed. This paper proposes a novel role for the discriminator by reusing it for encoding the images of the target domain. The proposed architecture, termed NICE-GAN, exhibits two advantages over previous approaches: first, it is more compact since no independent encoding component is required; second, this plug-in encoder is directly trained by the adversary loss, making it more informative and more effectively trained if a multi-scale discriminator is applied. The main issue in NICE-GAN is the coupling of translation with discrimination along the encoder, which could incur training inconsistency when we play the min-max game via GAN. To tackle this issue, we develop a decoupled training strategy by which the encoder is only trained when maximizing the adversary loss and is kept frozen otherwise. Extensive experiments on four popular benchmarks demonstrate the superior performance of NICE-GAN over state-of-the-art methods in terms of FID, KID, and also human preference. Comprehensive ablation studies are also carried out to isolate the validity of each proposed component.

98. Joint Face Completion and Super-resolution using Multi-scale Feature Relation Learning
Zhilei Liu, Yunpeng Wu, Le Li, Cuicui Zhang, Baoyuan Wu
Abstract: Previous research on face restoration often focused on repairing a specific type of low-quality facial images such as low-resolution (LR) or occluded facial images. However, in the real world, both the above-mentioned forms of image degradation often coexist. Therefore, it is important to design a model that can repair LR occluded images simultaneously. This paper proposes a multi-scale feature graph generative adversarial network (MFG-GAN) to implement the face restoration of images in which both degradation modes coexist, and also to repair images with a single type of degradation. Based on the GAN, the MFG-GAN integrates the graph convolution and feature pyramid network to restore occluded low-resolution face images to non-occluded high-resolution face images. The MFG-GAN uses a set of customized losses to ensure that high-quality images are generated. In addition, we designed the network in an end-to-end format. Experimental results on the public-domain CelebA and Helen databases show that the proposed approach outperforms state-of-the-art methods in performing face super-resolution (up to 4x or 8x) and face completion simultaneously. Cross-database testing also revealed that the proposed approach has good generalizability.

99. NAS-Count: Counting-by-Density with Neural Architecture Search
Yutao Hu, Xiaolong Jiang, Xuhui Liu, Baochang Zhang, Jungong Han, Xianbin Cao, David Doermann
Abstract: Most of the recent advances in crowd counting have evolved from hand-designed density estimation networks, where multi-scale features are leveraged to address scale variation, but at the expense of demanding design efforts. In this work, we automate the design of counting models with Neural Architecture Search (NAS) and introduce an end-to-end searched encoder-decoder architecture, Automatic Multi-Scale Network (AMSNet). The encoder and decoder in AMSNet are composed of different cells discovered from counting-specific search spaces, each dedicated to extracting and aggregating multi-scale features adaptively. To resolve the pixel-level isolation issue in training density estimation models, AMSNet is optimized with a novel Scale Pyramid Pooling Loss (SPPLoss), which exploits a pyramidal architecture to achieve structural supervision at multiple scales. During training time, AMSNet and SPPLoss are searched end-to-end efficiently with differentiable NAS techniques. When testing, AMSNet produces state-of-the-art results that are considerably better than hand-designed models on four challenging datasets, fully demonstrating the efficacy of NAS-Count.

100. Channel Equilibrium Networks for Learning Deep Representation
Wenqi Shao, Shitao Tang, Xingang Pan, Ping Tan, Xiaogang Wang, Ping Luo
Abstract: Convolutional Neural Networks (CNNs) are typically constructed by stacking multiple building blocks, each of which contains a normalization layer such as batch normalization (BN) and a rectified linear function such as ReLU. However, this work shows that the combination of normalization and rectified linear function leads to inhibited channels, which have small magnitude and contribute little to the learned feature representation, impeding the generalization ability of CNNs. Unlike prior arts that simply removed the inhibited channels, we propose to "wake them up" during training by designing a novel neural building block, termed Channel Equilibrium (CE) block, which enables channels at the same layer to contribute equally to the learned representation. We show that CE is able to prevent inhibited channels both empirically and theoretically. CE has several appealing benefits. (1) It can be integrated into many advanced CNN architectures such as ResNet and MobileNet, outperforming their original networks. (2) CE has an interesting connection with the Nash Equilibrium, a well-known solution of a non-cooperative game. (3) Extensive experiments show that CE achieves state-of-the-art performance on various challenging benchmarks such as ImageNet and COCO.

101. Cross-Spectrum Dual-Subspace Pairing for RGB-infrared Cross-Modality Person Re-Identification
Xing Fan, Hao Luo, Chi Zhang, Wei Jiang
Abstract: Due to its potential wide applications in video surveillance and other computer vision tasks like tracking, person re-identification (ReID) has become popular and been widely investigated. However, conventional person re-identification can only handle RGB color images, which fail in dark conditions. Thus RGB-infrared ReID (also known as Infrared-Visible ReID or Visible-Thermal ReID) has been proposed. Apart from the appearance discrepancy in traditional ReID caused by illumination, pose variations and viewpoint changes, there is also a modality discrepancy produced by cameras of different spectra, which makes RGB-infrared ReID more difficult. To address this problem, we focus on extracting the shared cross-spectrum features of different modalities. In this paper, a novel multi-spectrum image generation method is proposed, and the generated samples are used to help the network find discriminative information for re-identifying the same person across modalities. Another challenge of RGB-infrared ReID is that the intra-person discrepancy (between images of the same person) is often larger than the inter-person discrepancy (between images of different persons), so a dual-subspace pairing strategy is proposed to alleviate this problem. Combining these two parts, we design a one-stream neural network, called the Cross-spectrum Dual-subspace Pairing (CDP) model, to extract compact representations of person images. Furthermore, during training, we propose a Dynamic Hard Spectrum Mining method that automatically mines more hard samples from hard spectra based on the current model state to further boost performance. Extensive experimental results on two public datasets, SYSU-MM01 with RGB + near-infrared images and RegDB with RGB + far-infrared images, demonstrate the efficiency and generality of the proposed method.

102. Learning to Compare Relation: Semantic Alignment for Few-Shot Learning
Congqi Cao, Yanning Zhang
Abstract: Few-shot learning is a fundamental and challenging problem since it requires recognizing novel categories from only a few examples. The objects to be recognized have multiple variants and can be located anywhere in an image. Directly comparing query images with example images cannot handle content misalignment. The representation and metric for comparison are critical but challenging to learn due to the scarcity and wide variation of the samples in few-shot learning. In this paper, we present a novel semantic alignment model to compare relations, which is robust to content misalignment. We propose to add two key ingredients to existing few-shot learning frameworks for better feature and metric learning ability. First, we introduce a semantic alignment loss to align the relation statistics of the features from samples that belong to the same category. Second, local and global mutual information maximization is introduced, allowing for representations that contain locally-consistent and intra-class shared information across structural locations in an image. In addition, we introduce a principled approach to weigh multiple loss functions by considering the homoscedastic uncertainty of each stream. We conduct extensive experiments on several few-shot learning datasets. Experimental results show that the proposed method is capable of comparing relations with semantic alignment strategies, and achieves state-of-the-art performance.
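The abstract above mentions weighing multiple loss functions by homoscedastic uncertainty but does not give the formula. A minimal sketch of one commonly used form follows, in which each loss L_i gets a learnable log-variance s_i and the total is the sum of 0.5 * (exp(-s_i) * L_i + s_i). The function name and this exact form are assumptions for illustration, not necessarily what the paper uses.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-stream losses using log-variances s_i (assumed form).

    total = sum_i 0.5 * (exp(-s_i) * L_i + s_i)
    A large s_i down-weights a noisy loss, while the +s_i term
    penalizes letting every s_i grow without bound.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need one log-variance per loss")
    return sum(0.5 * (math.exp(-s) * loss + s)
               for loss, s in zip(losses, log_vars))
```

In a real training loop the log-variances would be trainable parameters optimized together with the network weights; here they are plain floats so the arithmetic is easy to check.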

103. VideoSSL: Semi-Supervised Learning for Video Classification
Longlong Jing, Toufiq Parag, Zhe Wu, Yingli Tian, Hongcheng Wang
Abstract: We propose a semi-supervised learning approach for video classification, VideoSSL, using convolutional neural networks (CNN). Like other computer vision tasks, existing supervised video classification methods demand a large amount of labeled data to attain good performance. However, annotation of a large dataset is expensive and time consuming. To minimize the dependence on a large annotated dataset, our proposed semi-supervised method trains from a small number of labeled examples and exploits two regulatory signals from unlabeled data. The first signal is the pseudo-labels of unlabeled examples computed from the confidences of the CNN being trained. The other is the normalized probabilities, as predicted by an image classifier CNN, that captures the information about appearances of the interesting objects in the video. We show that, under the supervision of these guiding signals from unlabeled examples, a video classification CNN can achieve impressive performances utilizing a small fraction of annotated examples on three publicly available datasets: UCF101, HMDB51 and Kinetics.
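The first signal described above, pseudo-labels computed from the classifier's own confidences, can be sketched as follows. The confidence threshold of 0.95 and the function name are assumptions for illustration; the abstract does not state how the confidences are thresholded.

```python
import numpy as np

def pseudo_labels(probs, threshold=0.95):
    """Return (labels, indices) for unlabeled examples whose top
    softmax probability is at least `threshold` (hypothetical value)."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)      # top class probability per example
    labels = probs.argmax(axis=1)       # predicted class per example
    keep = confidence >= threshold      # only trust confident predictions
    return labels[keep], np.flatnonzero(keep)
```

Only the examples that pass the threshold would be added to the training set with their predicted class as the target.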

104. First Order Motion Model for Image Animation
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, Nicu Sebe
Abstract: Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g., faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories. Our source code is publicly available.

105. Robust 6D Object Pose Estimation by Learning RGB-D Features
Meng Tian, Liang Pan, Marcelo H Ang Jr, Gim Hee Lee
Abstract: Accurate 6D object pose estimation is fundamental to robotic manipulation and grasping. Previous methods follow a local optimization approach which minimizes the distance between closest point pairs to handle the rotation ambiguity of symmetric objects. In this work, we propose a novel discrete-continuous formulation for rotation regression to resolve this local-optimum problem. We uniformly sample rotation anchors in SO(3), and predict a constrained deviation from each anchor to the target, as well as uncertainty scores for selecting the best prediction. Additionally, the object location is detected by aggregating point-wise vectors pointing to the 3D center. Experiments on two benchmarks: LINEMOD and YCB-Video, show that the proposed method outperforms state-of-the-art approaches. Our code is available at this https URL.
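To make the idea of rotation anchors in SO(3) concrete, here is a minimal sketch that draws rotations uniformly at random by normalizing 4D Gaussian samples into unit quaternions and converting them to rotation matrices. This is a stand-in under stated assumptions: the paper may use a deterministic, evenly spaced anchor set, and the function name and seed handling are illustrative only.

```python
import numpy as np

def random_rotations(n, seed=0):
    """Sample n rotation matrices uniformly from SO(3).

    Normalized 4D Gaussian vectors are uniform on the unit 3-sphere,
    and unit quaternions (w, x, y, z) map onto rotations.
    """
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    w, x, y, z = q.T
    # Standard unit-quaternion to rotation-matrix conversion, row by row.
    rows = [
        1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w),
        2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w),
        2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y),
    ]
    return np.stack(rows, axis=1).reshape(n, 3, 3)
```

Each returned matrix is orthogonal with determinant 1, so it can serve as an anchor from which a small, constrained deviation to the target rotation is regressed.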

106. Augmented Cyclic Consistency Regularization for Unpaired Image-to-Image Translation
Takehiko Ohkawa, Naoto Inoue, Hirokatsu Kataoka, Nakamasa Inoue
Abstract: Unpaired image-to-image (I2I) translation has received considerable attention in pattern recognition and computer vision because of recent advancements in generative adversarial networks (GANs). However, due to the lack of explicit supervision, unpaired I2I models often fail to generate realistic images, especially in challenging datasets with different backgrounds and poses. Hence, stabilization is indispensable for real-world applications and GANs. Herein, we propose Augmented Cyclic Consistency Regularization (ACCR), a novel regularization method for unpaired I2I translation. Our main idea is to enforce consistency regularization originating from semi-supervised learning on the discriminators leveraging real, fake, reconstructed, and augmented samples. We regularize the discriminators to output similar predictions when fed pairs of original and perturbed images. We qualitatively clarify the generation property between unpaired I2I models and standard GANs, and explain why consistency regularization on fake and reconstructed samples works well. Quantitatively, our method outperforms the consistency regularized GAN (CR-GAN) in digit translations and demonstrates efficacy against several data augmentation variants and cycle-consistent constraints.
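The regularizer described above asks the discriminator to give similar outputs for an image and its perturbed copy. A minimal sketch of such a penalty follows, using a mean squared difference between the two outputs. The toy discriminator and the flip augmentation are hypothetical placeholders, and the squared-difference form is an assumption rather than the paper's stated loss.

```python
import numpy as np

def consistency_penalty(discriminator, images, augment):
    """Mean squared difference between discriminator outputs on the
    original images and on their augmented copies."""
    d_orig = np.asarray(discriminator(images), dtype=float)
    d_aug = np.asarray(discriminator(augment(images)), dtype=float)
    return float(np.mean((d_orig - d_aug) ** 2))

def toy_discriminator(images):
    # Placeholder score: mean intensity of each image in the batch.
    return images.reshape(len(images), -1).mean(axis=1)

def horizontal_flip(images):
    # Flip a batch of shape (N, H, W) along the width axis.
    return images[:, :, ::-1]
```

In the setting the abstract describes, the same kind of penalty would be evaluated on real, fake, and reconstructed samples and added to the discriminator loss.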

107. HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection
Maosheng Ye, Shuangjie Xu, Tongyi Cao
Abstract: We present Hybrid Voxel Network (HVNet), a novel one-stage unified network for point-cloud-based 3D object detection for autonomous driving. Recent studies show that 2D voxelization with a per-voxel PointNet-style feature extractor leads to accurate and efficient detectors for large 3D scenes. Since the size of the feature map determines the computation and memory cost, the size of the voxel becomes a parameter that is hard to balance. A smaller voxel size gives better performance, especially for small objects, but a longer inference time. A larger voxel can cover the same area with a smaller feature map, but fails to capture intricate features and accurate locations for smaller objects. We present a Hybrid Voxel network that solves this problem by fusing voxel feature encoders (VFE) of different scales at the point-wise level and projecting them into multiple pseudo-image feature maps. We further propose an attentive voxel feature encoding that outperforms plain VFE, and a feature fusion pyramid network to aggregate multi-scale information at the feature map level. Experiments on the KITTI benchmark show that a single HVNet achieves the best mAP among all existing methods with a real-time inference speed of 31 Hz.
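The trade-off between voxel size and feature-map size can be seen with a few lines of code. The sketch below assigns 2D points to voxels at several scales and counts the occupied cells; it illustrates only the voxelization step, not the feature encoders or the fusion, and the function names are hypothetical.

```python
import numpy as np

def voxel_indices(points_xy, voxel_size):
    """Integer grid cell of each 2D point for a given voxel edge length."""
    return np.floor(np.asarray(points_xy, dtype=float) / voxel_size).astype(int)

def count_occupied_voxels(points_xy, voxel_size):
    """Number of distinct voxels that contain at least one point."""
    cells = {tuple(int(c) for c in cell)
             for cell in voxel_indices(points_xy, voxel_size)}
    return len(cells)
```

With the same points, smaller voxels yield more occupied cells (finer detail, larger feature map), while larger voxels merge nearby points into one cell.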

108. Attention-aware fusion RGB-D face recognition
Hardik Uppal, Alireza Sepas-Moghaddam, Michael Greenspan, Ali Etemad
Abstract: A novel attention-aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition. The proposed method uses two attention layers, the first focused on the fused feature maps generated by convolution layers, and the second focused on the spatial features of those maps. The training database is preprocessed and augmented through a set of geometric transformations, and the learning process is further aided using transfer learning from a pure 2D RGB image training process. Comparative evaluations demonstrate that the proposed method outperforms other state-of-the-art approaches, including both traditional and deep neural network-based methods, on the challenging CurtinFaces and IIIT-D RGB-D benchmark databases, achieving classification accuracies over 98.2% and 99.3% respectively.

109. Towards Using Count-level Weak Supervision for Crowd Counting
Yinjie Lei, Yan Liu, Pingping Zhang, Lingqiao Liu
Abstract: Most existing crowd counting methods require object location-level annotation, i.e., placing a dot at the center of an object. While simpler than bounding-box or pixel-level annotation, obtaining this annotation is still labor-intensive and time-consuming, especially for images with highly crowded scenes. On the other hand, weaker annotations that only record the total count of objects can be almost effortless in many practical scenarios. Thus, it is desirable to develop a learning method that can effectively train models from count-level annotations. To this end, this paper studies the problem of weakly-supervised crowd counting, which learns a model from only a small amount of location-level annotations (fully-supervised) but a large amount of count-level annotations (weakly-supervised). To perform effective training in this scenario, we observe that the direct solution of regressing the integral of the density map to the object count is not sufficient, and that it is beneficial to introduce stronger regularizations on the predicted density maps of weakly-annotated images. We devise a simple-yet-effective training strategy, namely Multiple Auxiliary Tasks Training (MATT), to construct regularizers for restricting the freedom of the generated density maps. Through extensive experiments on existing datasets and a newly proposed dataset, we validate the effectiveness of the proposed weakly-supervised method and demonstrate its superior performance over existing solutions.
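The link between dot annotations, density maps, and counts can be sketched in a few lines: each dot becomes a normalized Gaussian, so the map sums to the number of objects, and a count-level loss compares that sum to the annotated total. The kernel width, the per-point normalization on the image grid, and the function names are assumptions for illustration.

```python
import numpy as np

def density_map(points, shape, sigma=2.0):
    """Build a density map from (row, col) dot annotations.

    Each dot contributes a Gaussian normalized to sum to 1 on the grid,
    so the whole map integrates to the number of dots.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape, dtype=float)
    for py, px in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()
    return dmap

def count_loss(predicted_map, true_count):
    """Count-level supervision: only the integral of the map is checked."""
    return abs(float(predicted_map.sum()) - true_count)
```

Many different maps share the same integral, which is why the abstract argues that count regression alone leaves the density map under-constrained and needs extra regularization.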

110. Self-supervised Representation Learning for Ultrasound Video
Jianbo Jiao, Richard Droste, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble
Abstract: Recent advances in deep learning have achieved promising performance for medical image analysis, while in most cases ground-truth annotations from human experts are necessary to train the deep model. In practice, such annotations are expensive to collect and can be scarce for medical imaging applications. Therefore, there is significant interest in learning representations from unlabelled raw data. In this paper, we propose a self-supervised learning approach to learn meaningful and transferable representations from medical imaging video without any type of human annotation. We assume that in order to learn such a representation, the model should identify anatomical structures from the unlabelled data. Therefore we force the model to address anatomy-aware tasks with free supervision from the data itself. Specifically, the model is designed to correct the order of a reshuffled video clip and at the same time predict the geometric transformation applied to the video clip. Experiments on fetal ultrasound video show that the proposed approach can effectively learn meaningful and strong representations, which transfer well to downstream tasks like standard plane detection and saliency prediction.
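The order-correction task described above needs no human labels because the target is produced by the shuffling itself. A minimal sketch of generating one training example follows; the function name and the use of a seeded NumPy generator are illustrative assumptions, and the geometric-transformation branch of the method is not shown.

```python
import numpy as np

def make_order_task(clip, seed=0):
    """Shuffle the frames of a clip and return (shuffled_clip, permutation).

    shuffled[i] == clip[perm[i]], so `perm` is the free supervision
    signal a model would be trained to predict.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(clip))
    return clip[perm], perm
```

Given a predicted permutation, the original order is recovered with `shuffled[np.argsort(perm)]`.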

111. Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova
Abstract: Recent contributions have demonstrated that it is possible to recognize the pose of humans densely and accurately given a large dataset of poses annotated in detail. In principle, the same approach could be extended to any animal class, but the effort required for collecting new annotations for each case makes this strategy impractical, despite important applications in nature conservation, science and business. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in dense pose recognition for humans, as well as in more general object detectors and segmenters, to the problem of dense pose recognition in other classes. We do this by (1) establishing a DensePose model for the new animal which is also geometrically aligned to humans, (2) introducing a multi-head R-CNN architecture that facilitates transfer of multiple recognition tasks between classes, (3) finding which combination of known classes can be transferred most effectively to the new animal, and (4) using self-calibrated uncertainty heads to generate pseudo-labels graded by quality for training a model for this class. We also introduce two benchmark datasets labelled in the manner of DensePose for the class chimpanzee and use them to evaluate our approach, showing excellent transfer learning performance.

112. Bio-Inspired Modality Fusion for Active Speaker Detection
Gustavo Assunção, Nuno Gonçalves, Paulo Menezes
Abstract: Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened enabling, for instance, the well known "cocktail party" and McGurk effects, i.e. speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, Neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.

113. Learning Nonparametric Human Mesh Reconstruction from a Single Image without Ground Truth Meshes
Kevin Lin, Lijuan Wang, Ying Jin, Zicheng Liu, Ming-Ting Sun
Abstract: Nonparametric approaches have shown promising results on reconstructing 3D human mesh from a single monocular image. Unlike previous approaches that use a parametric human model like skinned multi-person linear model (SMPL), and attempt to regress the model parameters, nonparametric approaches relax the heavy reliance on the parametric space. However, existing nonparametric methods require ground truth meshes as their regression target for each vertex, and obtaining ground truth mesh labels is very expensive. In this paper, we propose a novel approach to learn human mesh reconstruction without any ground truth meshes. This is made possible by introducing two new terms into the loss function of a graph convolutional neural network (Graph CNN). The first term is the Laplacian prior that acts as a regularizer on the reconstructed mesh. The second term is the part segmentation loss that forces the projected region of the reconstructed mesh to match the part segmentation. Experimental results on multiple public datasets show that without using 3D ground truth meshes, the proposed approach outperforms the previous state-of-the-art approaches that require ground truth meshes for training.
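A Laplacian prior on a mesh penalizes vertices that stray from the average of their neighbours. The sketch below uses a uniform-weight graph Laplacian built from triangle faces; the abstract does not specify the weighting or normalization, so this form and the function name are assumptions for illustration.

```python
import numpy as np

def laplacian_smoothness(vertices, faces):
    """Mean squared distance between each vertex and the centroid of
    its mesh neighbours (uniform-weight Laplacian)."""
    v = np.asarray(vertices, dtype=float)
    n = len(v)
    adj = np.zeros((n, n))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            adj[i, j] = adj[j, i] = 1.0   # undirected edge of the triangle
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                   # isolated vertices contribute zero
    delta = v - (adj @ v) / deg           # Laplacian coordinates
    return float((delta ** 2).sum(axis=1).mean())
```

Used as a regularizer, a lower value means a smoother reconstructed surface; it would be added to the other loss terms with a weighting factor.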

114. Unlimited Resolution Image Generation with R2D2-GANs
Marija Jegorova, Antti Ilari Karjalainen, Jose Vazquez, Timothy M. Hospedales
Abstract: In this paper we present a novel simulation technique for generating high-quality images of any predefined resolution. This method can be used to synthesize sonar scans of a size equivalent to those collected during a full-length mission, with across-track resolutions of any chosen magnitude. In essence, our model extends a Generative Adversarial Network (GAN) based architecture into a conditional recursive setting that facilitates the continuity of the generated images. The data produced is continuous and realistic-looking, and can be generated at least two times faster than the real speed of acquisition for sonars with higher resolutions, such as EdgeTech. The seabed topography can be fully controlled by the user. Visual assessment tests demonstrate that humans cannot distinguish the simulated images from real ones. Moreover, experimental results suggest that, in the absence of real data, autonomous recognition systems can benefit greatly from training with the synthetic data produced by the R2D2-GANs.

115. Constrained Nonnegative Matrix Factorization for Blind Hyperspectral Unmixing incorporating Endmember Independence
E.M.M.B. Ekanayake, Bhathiya Rathnayake, G.M.R.I. Godaliyadda, H.M.V.R. Herath, M.P.B. Ekanayake
Abstract: Hyperspectral image (HSI) analysis has become a key area in the field of remote sensing as a result of its ability to exploit richer information in the form of multiple spectral bands. The study of hyperspectral unmixing (HU) is important in HSI analysis due to the insufficient spatial resolution of customary imaging spectrometers. The endmembers of an HSI are likely to be generated by independent sources and to be mixed at a macroscopic scale before arriving at the sensor element of the imaging spectrometer as mixed spectra. Over the past few decades, many attempts have focused on imposing auxiliary constraints on the conventional nonnegative matrix factorization (NMF) framework in order to effectively unmix these mixed spectra. As a step toward finding an optimum constraint to extract endmembers, this paper presents a novel blind HU algorithm, referred to as Kurtosis-based Smooth Nonnegative Matrix Factorization (KbSNMF), which incorporates a novel constraint based on the statistical independence of the probability density functions of endmembers. Imposing this constraint on the conventional NMF framework promotes the extraction of independent endmembers while further enhancing the parts-based representation of data. The proposed algorithm outperforms several state-of-the-art NMF-based algorithms in extracting endmembers from hyperspectral remote sensing data, and hence could improve the performance of recent deep learning HU methods that use the endmembers as supervisory data for abundance extraction. Keywords: Hyperspectral unmixing (HU), blind source separation, kurtosis, constrained, Gaussianity, endmember independence, nonnegative matrix factorization (NMF).
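For readers unfamiliar with the baseline that KbSNMF extends, the sketch below shows plain NMF with the classic multiplicative updates for the Frobenius objective, plus a helper that computes excess kurtosis, the statistic the proposed constraint is built on. This is the unconstrained baseline only: how the kurtosis term enters the objective is not given in the abstract and is not implemented here. The function names, iteration count, and seed are illustrative assumptions.

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (m x n) as W @ H with W, H >= 0,
    using multiplicative updates that do not increase ||V - W H||_F."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    c = x - x.mean()
    return float((c ** 4).mean() / (c ** 2).mean() ** 2 - 3.0)
```

In an unmixing setting, the columns of `W` would play the role of endmember spectra and `H` the abundances, with each pixel's spectrum stored as a column of `V`.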

116. Evaluating Temporal Queries Over Video Feeds
Yueting Chen, Xiaohui Yu, Nick Koudas
Abstract: Recent advances in Computer Vision and Deep Learning have made possible the efficient extraction of a schema from frames of streaming video. As such, a stream of objects and their associated classes, along with unique object identifiers derived via object tracking, can be generated, providing unique objects as they are captured across frames. In this paper we initiate a study of temporal queries involving objects and their co-occurrences in video feeds. For example, queries that identify video segments during which the same two red cars and the same two humans appear jointly for five minutes are of interest to many applications ranging from law enforcement to security and safety. We take the first step and define such queries in a way that incorporates certain physical aspects of video capture such as object occlusion. We present an architecture consisting of three layers, namely object detection/tracking, intermediate data generation and query evaluation. We propose two techniques, MFS and SSG, to organize all detected objects in the intermediate data generation layer, which, given the queries, effectively minimizes the number of objects and frames that have to be considered during query evaluation. We also introduce an algorithm called State Traversal (ST) that processes incoming frames against the SSG and efficiently prunes objects and frames unrelated to query evaluation, while maintaining all states required for succinct query evaluation. We present the results of a thorough experimental evaluation utilizing both real and synthetic data, establishing the trade-offs between MFS and SSG. We stress various parameters of interest in our evaluation and demonstrate that the proposed query evaluation methodology, coupled with the proposed algorithms, is capable of evaluating temporal queries over video feeds efficiently, achieving orders-of-magnitude performance benefits.
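As a rough illustration of the kind of co-occurrence query described in this abstract, the sketch below scans per-frame sets of tracked object identifiers and reports the segments in which a given set of objects appears jointly for a minimum number of consecutive frames. It is a naive linear scan under simplifying assumptions (known identifiers, no occlusion handling); it is not the MFS, SSG, or State Traversal method from the paper, and the function name and sample identifiers are hypothetical.

```python
def co_occurrence_segments(frames, required_ids, min_frames):
    """Return inclusive (start, end) frame ranges in which every id in
    required_ids is present in every frame, for at least min_frames frames."""
    required = set(required_ids)
    segments = []
    start = None  # index of the first frame of the current run, if any
    for i, ids in enumerate(frames):
        if required <= set(ids):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(frames) - start >= min_frames:
        segments.append((start, len(frames) - 1))
    return segments


# Hypothetical tracker output: one set of object ids per frame.
frames = [
    {"car1", "car2"},
    {"car1", "car2", "person1"},
    {"car1", "car2", "person1"},
    {"car1"},
    {"car1", "car2", "person1"},
]
segments = co_occurrence_segments(frames, {"car1", "car2"}, min_frames=2)
print(segments)  # [(0, 2)]
```

A scan like this touches every frame and every object, which is the cost the intermediate data structures in the abstract are meant to reduce.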

117. Multi-View Learning for Vision-and-Language Navigation
Qiaolin Xia, Xiujun Li, Chunyuan Li, Yonatan Bisk, Zhifang Sui, Yejin Choi, Noah A. Smith
Abstract: Learning to navigate in a visual environment following natural language instructions is a challenging task because natural language instructions are highly variable, ambiguous, and under-specified. In this paper, we present a novel training paradigm, Learn from EveryOne (LEO), which leverages multiple instructions (as different views) for the same trajectory to resolve language ambiguity and improve generalization. By sharing parameters across instructions, our approach learns more effectively from limited training data and generalizes better in unseen environments. On the recent Room-to-Room (R2R) benchmark dataset, LEO achieves a 16% improvement (absolute) over a greedy agent as the base agent (25.3% $\rightarrow$ 41.4%) in Success Rate weighted by Path Length (SPL). Further, LEO is complementary to most existing models for vision-and-language navigation, allowing for easy integration with existing techniques. This leads to LEO+, which sets a new state of the art, pushing the R2R benchmark to 62% (9% absolute improvement).
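For readers unfamiliar with the SPL metric quoted in this abstract, the sketch below computes it using the commonly used definition: each episode contributes its success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length, averaged over episodes. The abstract itself does not spell out the formula, so treat this as an assumed definition; the episode values are made up for illustration.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes: 1 if the episode reached the goal, else 0
    shortest_lengths: shortest-path length from start to goal
    path_lengths: length of the path the agent actually took
    """
    terms = [
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(terms) / len(terms)


# Three made-up episodes: an optimal success, a success with a path twice
# as long as necessary, and a failure.
score = spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0])
print(score)  # 0.5
```

An agent that always succeeds but wanders scores below its raw success rate, which is why SPL is reported alongside it.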

118. Addressing target shift in zero-shot learning using grouped adversarial learning
Saneem Ahmed Chemmengath, Samarth Bharadwaj, Soumava Paul, Suranjana Samanta, Karthik Sankaranarayanan
Abstract: In this paper, we present a new paradigm for zero-shot learning (ZSL) that is trained by utilizing additional information (such as attribute-class mapping) for a specific set of unseen classes. We conjecture that such additional information about unseen classes is more readily available than unsupervised image sets. Further, on close examination of the underlying attribute predictors of popular ZSL algorithms, we find that they often leverage attribute correlations to make predictions. While attribute correlations that remain intact in the unseen (test) classes benefit the prediction of difficult attributes, changes in correlations can have an adverse effect on ZSL performance. For example, detecting the attribute 'brown' may be the same as detecting 'fur' over an image dataset of animals captured in the tropics. However, such a model might fail on unseen images of Arctic animals. To address this effect, termed target-shift in ZSL, we utilize our proposed framework to design grouped adversarial learning. We introduce grouping of attributes to enable the model to continue to benefit from useful correlations, while restricting cross-group correlations that may be harmful for generalization. Our analysis shows that it is possible not only to constrain the model from leveraging unwanted correlations, but also to adjust them to a specific test setting using only the additional information (the already available attribute-class mapping). We show empirical results for zero-shot predictions on standard benchmark datasets, namely aPY, AwA2, SUN and CUB. We further introduce to the research community a new experimental train-test split that maximizes target-shift, to further study its effects.

119. Quantized Neural Network Inference with Precision Batching
Maximilian Lam, Zachary Yedidia, Colby Banbury, Vijay Janapa Reddi
Abstract: We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without the need for retraining/recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full-precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
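The bitlayer idea in this abstract can be illustrated with a small NumPy sketch. It assumes weights already quantized to non-negative integers, splits them into binary bit planes, and accumulates the planes against full-precision activations; using fewer planes trades accuracy for work. This is only an arithmetic illustration under those assumptions, not the authors' GPU implementation, and all names and sizes are hypothetical.

```python
import numpy as np


def to_bitlayers(w_int, num_bits):
    """Split non-negative integer weights into 0/1 bit planes,
    ordered from most significant to least significant bit."""
    return [((w_int >> k) & 1).astype(np.float64)
            for k in reversed(range(num_bits))]


def bitlayer_matvec(bitlayers, x, num_bits, layers_to_use):
    """Accumulate the first layers_to_use bit planes against activations x."""
    y = np.zeros(bitlayers[0].shape[0])
    for i, plane in enumerate(bitlayers[:layers_to_use]):
        k = num_bits - 1 - i          # bit position of this plane
        y += (2 ** k) * (plane @ x)   # each plane is a binary matrix
    return y


rng = np.random.default_rng(0)
w_int = rng.integers(0, 16, size=(4, 8))   # hypothetical 4-bit weights
x = rng.standard_normal(8)                 # full-precision activations

layers = to_bitlayers(w_int, num_bits=4)
exact = w_int @ x
full = bitlayer_matvec(layers, x, num_bits=4, layers_to_use=4)
approx = bitlayer_matvec(layers, x, num_bits=4, layers_to_use=2)
```

With all four planes the result matches the ordinary integer matrix product; with two planes it matches the product computed from weights whose two low bits are zeroed, which is the runtime accuracy/speed knob the abstract describes.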

120. Convolutional Sparse Support Estimator Network (CSEN) From energy efficient support estimation to learning-aided Compressive Sensing
Mehmet Yamac, Mete Ahishali, Serkan Kiranyaz, Moncef Gabbouj
Abstract: Support estimation (SE) of a sparse signal refers to finding the location indices of the non-zero elements in a sparse representation. Most of the traditional approaches dealing with the SE problem are iterative algorithms based on greedy methods or optimization techniques. Indeed, a vast majority of them use sparse signal recovery techniques to obtain support sets instead of directly mapping the non-zero locations from denser measurements (e.g., Compressively Sensed Measurements). This study proposes a novel approach for learning such a mapping from a training set. To accomplish this objective, the Convolutional Support Estimator Networks (CSENs), each with a compact configuration, are designed. The proposed CSEN can be a crucial tool for the following scenarios: (i) real-time and low-cost support estimation can be applied in any mobile and low-power edge device for anomaly localization, simultaneous face recognition, etc.; (ii) CSEN's output can be used directly as "prior information", which improves the performance of sparse signal recovery algorithms. The results over the benchmark datasets show that state-of-the-art performance levels can be achieved by the proposed approach with significantly reduced computational complexity.

121. MVP: Unified Motion and Visual Self-Supervised Learning for Large-Scale Robotic Navigation
Marvin Chancán, Michael Milford
Abstract: Autonomous navigation emerges from both motion and local visual perception in real-world environments. However, most successful robotic motion estimation methods (e.g. VO, SLAM, SfM) and vision systems (e.g. CNN, visual place recognition-VPR) are often used separately for mapping and localization tasks. Conversely, recent reinforcement learning (RL) based methods for visual navigation rely on the quality of GPS data reception, which may not be reliable when directly using it as ground truth across multiple, month-spaced traversals in large environments. In this paper, we propose a novel motion and visual perception approach, dubbed MVP, that unifies these two sensor modalities for large-scale, target-driven navigation tasks. Our MVP-based method can learn faster, and is more accurate and more robust to both extreme environmental changes and poor GPS data than corresponding vision-only navigation methods. MVP temporally incorporates compact image representations, obtained using VPR, with optimized motion estimation data, including but not limited to those from VO or optimized radar odometry (RO), to efficiently learn self-supervised navigation policies via RL. We evaluate our method on two large real-world datasets, Oxford RobotCar and Nordland Railway, over a range of weather (e.g. overcast, night, snow, sun, rain, clouds) and seasonal (e.g. winter, spring, fall, summer) conditions using the new CityLearn framework, an interactive environment for efficiently training navigation agents. Our experimental results, on traversals of the Oxford RobotCar dataset with no GPS data, show that MVP can achieve 53% and 93% navigation success rates using VO and RO, respectively, compared to 7% for a vision-only method. We additionally report a trade-off between the RL success rate and the motion estimation precision.

122. PSF--NET: A Non-parametric Point Spread Function Model for Ground Based Optical Telescopes
Peng Jia, Xuebo Wu, Yi Huang, Bojun Cai, Dongmei Cai
Abstract: Ground-based optical telescopes are seriously affected by aberrations induced by atmospheric turbulence. Understanding the properties of these aberrations is important both for instrument design and for the development of image restoration methods. Because the point spread function reflects the performance of the whole optical system, it is appropriate to use it to describe atmospheric-turbulence-induced aberrations. Assuming that point spread functions induced by atmospheric turbulence with the same profile belong to the same manifold space, we propose a non-parametric point spread function model, PSF-NET. The PSF-NET has a cycle convolutional neural network structure and is a statistical representation of the manifold space of PSFs induced by atmospheric turbulence with the same profile. Testing the PSF-NET with simulated and real observation data, we find that a well-trained PSF-NET can restore any short-exposure images blurred by atmospheric turbulence with the same profile. In addition, we use the impulse response of the PSF-NET, which can be viewed as the statistical mean PSF, to analyze the interpretation properties of the PSF-NET. We find that variations of statistical mean PSFs are caused by variations of the atmospheric turbulence profile: as the difference between atmospheric turbulence profiles increases, the difference between statistical mean PSFs also increases. The PSF-NET proposed in this paper provides a new way to analyze atmospheric-turbulence-induced aberrations, which would benefit the development of new observation methods for ground-based optical telescopes.

123. 3DCFS: Fast and Robust Joint 3D Semantic-Instance Segmentation via Coupled Feature Selection
Liang Du, Jingang Tan, Xiangyang Xue, Lili Chen, Hongkai Wen, Jianfeng Feng, Jiamao Li, Xiaolin Zhang
Abstract: We propose a novel, fast and robust 3D point cloud segmentation framework via coupled feature selection, named 3DCFS, that jointly performs semantic and instance segmentation. Inspired by the human scene perception process, we design a novel coupled feature selection module, named CFSM, that adaptively selects and fuses the reciprocal semantic and instance features from the two tasks in a coupled manner. To further boost the performance of the instance segmentation task in our 3DCFS, we investigate a loss function that helps the model learn to balance the magnitudes of the output embedding dimensions during training, which makes calculating the Euclidean distance more reliable and enhances the generalizability of the model. Extensive experiments demonstrate that our 3DCFS outperforms state-of-the-art methods on benchmark datasets in terms of accuracy, speed and computational cost.

124. Dimensionality reduction to maximize prediction generalization capability
Takuya Isomura, Taro Toyoizumi
Abstract: This work develops an analytically solvable unsupervised learning scheme that extracts the most informative components for predicting future inputs, termed predictive principal component analysis (PredPCA). Our scheme can effectively remove unpredictable observation noise and globally minimize the test prediction error. Mathematical analyses demonstrate that, with sufficiently high-dimensional observations that are generated by a linear or nonlinear system, PredPCA can identify the optimal hidden state representation, true system parameters, and true hidden state dimensionality, with a global convergence guarantee. We demonstrate the performance of PredPCA by using sequential visual inputs comprising hand-digits, rotating 3D objects, and natural scenes. It reliably and accurately estimates distinct hidden states and predicts future outcomes of previously unseen test input data, even in the presence of considerable observation noise. The simple model structure and low computational cost of PredPCA make it highly desirable as a learning scheme for biological neural networks and neuromorphic chips.

125. Environment-agnostic Multitask Learning for Natural Language Grounded Navigation
Xin Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, Sujith Ravi
Abstract: Recent research efforts enable the study of natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, existing methods tend to overfit training data in seen environments and fail to generalize well in previously unseen environments. In order to close the gap between seen and unseen environments, we aim at learning a generalized navigation model from two novel perspectives: (1) we introduce a multitask navigation model that can be seamlessly trained on both Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks, which benefits from richer natural language guidance and effectively transfers knowledge across tasks; (2) we propose to learn environment-agnostic representations for the navigation policy that are invariant among the environments seen during training, thus generalizing better to unseen environments. Extensive experiments show that our navigation model trained using environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments and outperforms the baselines on unseen environments by 16% (relative measure on success rate) on VLN and 120% (goal progress) on NDH, establishing a new state of the art for the NDH task. The code for training the navigation model using environment-agnostic multitask learning is available at this https URL.

126. Why is the Mahalanobis Distance Effective for Anomaly Detection?
Ryo Kamoi, Kei Kobayashi
Abstract: The Mahalanobis distance-based confidence score, a recently proposed anomaly detection method for pre-trained neural classifiers, achieves state-of-the-art performance on both out-of-distribution and adversarial example detection. This work analyzes why this method exhibits such strong performance while imposing an implausible assumption; namely, that class-conditional distributions of intermediate features have tied covariance. We reveal that the reason for its effectiveness has been misunderstood. Although this method scores the prediction confidence for the original classification task, our analysis suggests that information critical for the classification task does not contribute to state-of-the-art performance on anomaly detection. To support this hypothesis, we demonstrate that a simpler confidence score that does not use class information is as effective as the original method in most cases. Moreover, our experiments show that the confidence scores can exhibit different behavior on other frameworks such as metric learning models, and their detection performance is sensitive to model architecture choice. These findings provide insight into the behavior of neural classifiers when provided with anomalous inputs.
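To make the tied-covariance assumption in this abstract concrete, the sketch below fits one mean per class and a single covariance shared by all classes, then scores an input by the negative squared Mahalanobis distance to the closest class mean. It operates on synthetic 2-D "features" rather than real intermediate network activations, and the function names and data are hypothetical; it illustrates the general form of the score, not the exact pipeline analyzed in the paper.

```python
import numpy as np


def fit_tied_gaussians(features, labels):
    """Estimate per-class means and one covariance shared by all classes."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Center each sample by its own class mean, then pool the covariance.
    centered = features - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(features)
    return means, np.linalg.pinv(cov)


def mahalanobis_confidence(x, means, precision):
    """Negative squared Mahalanobis distance to the closest class mean."""
    diffs = means - x                                    # (num_classes, dim)
    d2 = np.einsum("cd,de,ce->c", diffs, precision, diffs)
    return -d2.min()


rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),   # synthetic class 0
    rng.normal(5.0, 1.0, size=(200, 2)),   # synthetic class 1
])
labels = np.array([0] * 200 + [1] * 200)

means, precision = fit_tied_gaussians(features, labels)
score_in = mahalanobis_confidence(np.array([0.2, -0.1]), means, precision)
score_out = mahalanobis_confidence(np.array([20.0, -20.0]), means, precision)
```

A point near a class mean receives a score close to zero, while a point far from every class receives a strongly negative score, so thresholding the score flags the latter as anomalous.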

127. Unblind Your Apps: Predicting Natural-Language Labels for Mobile GUI Components by Deep Learning
Jieshan Chen, Chunyang Chen, Zhenchang Xing, Xiwei Xu, Liming Zhu, Guoqiang Li, Jinshui Wang
Abstract: According to the World Health Organization (WHO), it is estimated that approximately 1.3 billion people live with some form of vision impairment globally, of whom 36 million are blind. Due to their disability, integrating this minority into society is a challenging problem. The recent rise of smart mobile phones provides a new solution by giving blind users convenient access to information and services for understanding the world. Users with vision impairment can use the screen reader embedded in the mobile operating system to read the content of each screen within an app, and use gestures to interact with the phone. However, the prerequisite for using screen readers is that developers have to add natural-language labels to the image-based components when they are developing the app. Unfortunately, more than 77% of apps have issues with missing labels, according to our analysis of 10,408 Android apps. Most of these issues are caused by developers' lack of awareness and knowledge in considering this minority. And even if developers want to add labels to UI components, they may not come up with concise and clear descriptions, as most of them have no visual impairment themselves. To overcome these challenges, we develop a deep-learning-based model, called LabelDroid, to automatically predict the labels of image-based buttons by learning from large-scale commercial apps in Google Play. The experimental results show that our model can make accurate predictions and that the generated labels are of higher quality than those from real Android developers.

128. Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models
Xiao Zhang, Jinghui Chen, Quanquan Gu, David Evans
Abstract: Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under $\ell_2$ perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models. Code for all our experiments is available at this https URL.

129. An Evaluation of Knowledge Graph Embeddings for Autonomous Driving Data: Experience and Practice
Ruwan Wickramarachchi, Cory Henson, Amit Sheth
Abstract: The autonomous driving (AD) industry is exploring the use of knowledge graphs (KGs) to manage the vast amount of heterogeneous data generated from vehicular sensors. The various types of equipped sensors include video, LIDAR and RADAR. Scene understanding is an important topic in AD which requires consideration of various aspects of a scene, such as detected objects, events, time and location. Recent work on knowledge graph embeddings (KGEs) - an approach that facilitates neuro-symbolic fusion - has been shown to improve the predictive performance of machine learning models. With the expectation that neuro-symbolic fusion through KGEs will improve scene understanding, this research explores the generation and evaluation of KGEs for autonomous driving data. We also present an investigation of the relationship between the level of informational detail in a KG and the quality of its derivative embeddings. By systematically evaluating KGEs along four dimensions -- i.e. quality metrics, KG informational detail, algorithms, and datasets -- we show that (1) higher levels of informational detail in KGs lead to higher quality embeddings, (2) type and relation semantics are better captured by the semantic transitional distance-based TransE algorithm, and (3) some metrics, such as the coherence measure, may not be suitable for intrinsically evaluating KGEs in this domain. We also present an (early) investigation of the usefulness of KGEs for two use cases in the AD domain.

130. An estimation-based method to segment PET images
Ziping Liu, Richard Laforest, Joyce Mhlanga, Hae Sol Moon, Tyler J. Fraum, Malak Itani, Aaron Mintz, Farrokh Dehdashti, Barry A. Siegel, Abhinav K. Jha
Abstract: Tumor segmentation in oncological PET images is challenging, a major reason being the partial-volume effects that arise from low system resolution and a finite pixel size. The latter results in pixels containing more than one region, also referred to as tissue-fraction effects. Conventional classification-based segmentation approaches are inherently limited in accounting for the tissue-fraction effects. To address this limitation, we pose the segmentation task as an estimation problem. We propose a Bayesian method that estimates the posterior mean of the tumor-fraction area within each pixel and uses these estimates to define the segmented tumor boundary. The method was implemented using an autoencoder. Quantitative evaluation of the method was performed using realistic simulation studies conducted in the context of segmenting the primary tumor in PET images of patients with lung cancer. For these studies, a framework was developed to generate clinically realistic simulated PET images. The realism of these images was quantitatively confirmed using a two-alternative forced-choice study by six trained readers with expertise in reading PET scans. The evaluation studies demonstrated that the proposed segmentation method was accurate, significantly outperformed widely used conventional methods on the tasks of tumor segmentation and estimation of tumor-fraction areas, was relatively insensitive to partial-volume effects, and reliably estimated the ground-truth tumor boundaries. Further, these results were obtained across different clinical-scanner configurations. This proof-of-concept study demonstrates the efficacy of an estimation-based approach to PET segmentation.

131. Unsupervised Dictionary Learning for Anomaly Detection
Paul Irofti, Andra Băltoiu
Abstract: We investigate the possibilities of employing dictionary learning to address the requirements of most anomaly detection applications, such as the absence of supervision, online formulations, and low false positive rates. We present new results of our recent semi-supervised online algorithm, TODDLeR, on an anti-money laundering application. We also introduce a novel unsupervised method that uses the performance of the learning algorithm as an indication of the nature of the samples.

132. Image Hashing by Minimizing Independent Relaxed Wasserstein Distance
Khoa D. Doan, Amir Kimiyaie, Saurav Manchanda, Chandan K. Reddy
Abstract: Image hashing is a fundamental problem in the computer vision domain with various challenges, primarily in terms of efficiency and effectiveness. Existing hashing methods lack a principled characterization of the goodness of the hash codes and a principled approach to learn the discrete hash functions that are being optimized in the continuous space. Adversarial autoencoders are shown to be able to implicitly learn a robust hash function that generates hash codes which are balanced and have low quantization error. However, the existing adversarial autoencoders for hashing are too inefficient to be employed for large-scale image retrieval applications because of the minimax optimization procedure. In this paper, we propose an Independent Relaxed Wasserstein Autoencoder, which presents a novel, efficient hashing method that can implicitly learn the optimal hash function by directly training the adversarial autoencoder without any discriminator/critic. Our method is an order of magnitude more efficient and has a much lower sample complexity than the Optimal Transport formulation of the Wasserstein distance. The proposed method outperforms the current state-of-the-art image hashing methods for the retrieval task on several prominent image collections.

133. Inexpensive surface electromyography sleeve with consistent electrode placement enables dexterous and stable prosthetic control through deep learning
Jacob A. George, Anna Neibling, Michael D. Paskett, Gregory A. Clark
Abstract: The dexterity of conventional myoelectric prostheses is limited in part by the small datasets used to train the control algorithms. Variations in surface electrode positioning make it difficult to collect consistent data and to estimate motor intent reliably over time. To address these challenges, we developed an inexpensive, easy-to-don sleeve that can record robust and repeatable surface electromyography from 32 embedded monopolar electrodes. Embedded grommets are used to consistently align the sleeve with natural skin markings (e.g., moles, freckles, scars). The sleeve can be manufactured in a few hours for less than $60. Data from seven intact participants show the sleeve provides a signal-to-noise ratio of 14, a don-time under 11 seconds, and sub-centimeter precision for electrode placement. Furthermore, in a case study with one intact participant, we use the sleeve to demonstrate that neural networks can provide simultaneous and proportional control of six degrees of freedom, even 263 days after initial algorithm training. We also highlight that consistent recordings, accumulated over time to establish a large dataset, significantly improve dexterity. These results suggest that deep learning with a 74-layer neural network can substantially improve the dexterity and stability of myoelectric prosthetic control, and that deep-learning techniques can be readily instantiated and further validated through inexpensive sleeves/sockets with consistent recording locations.

134. SeismiQB -- a novel framework for deep learning with seismic data
Alexander Koryagin, Roman Khudorozhkov, Sergey Tsimfer, Darima Mylzenova
Abstract: In recent years, Deep Neural Networks have been successfully adopted in numerous domains to solve various image-related tasks, ranging from simple classification to fine border annotation. Naturally, many researchers have proposed using them to solve geological problems. Unfortunately, many seismic processing tools were developed years before the era of machine learning, including SEG-Y, the most popular data format for storing seismic cubes. Its slow loading speed heavily hampers experimentation speed, which is essential for getting acceptable results. Worse yet, there is no widely used format for storing surfaces inside the volume (for example, seismic horizons). To address these problems, we've developed an open-source Python framework, with an emphasis on working with neural networks, that provides convenient tools for (i) fast loading of seismic cubes in multiple data formats and converting between them, (ii) generating crops of a desired shape and augmenting them with various transformations, and (iii) pairing cube data with labeled horizons or other types of geobodies.

