[arXiv Papers] Computer Vision and Pattern Recognition 2020-03-05

Contents

1. Spatiotemporal-Aware Augmented Reality: Redefining HCI in Image-Guided Therapy [PDF] Abstract
2. Robust Perceptual Night Vision in Thermal Colorization [PDF] Abstract
3. HintPose [PDF] Abstract
4. VESR-Net: The Winning Solution to Youku Video Enhancement and Super-Resolution Challenge [PDF] Abstract
5. Unity Style Transfer for Person Re-Identification [PDF] Abstract
6. Mixup Regularization for Region Proposal based Object Detectors [PDF] Abstract
7. Vehicle-Human Interactive Behaviors in Emergency: Data Extraction from Traffic Accident Videos [PDF] Abstract
8. Learning to Transfer Texture from Clothing Images to 3D Humans [PDF] Abstract
9. Annotation-free Learning of Deep Representations for Word Spotting using Synthetic Data and Self Labeling [PDF] Abstract
10. Occlusion Aware Unsupervised Learning of Optical Flow From Video [PDF] Abstract
11. Automatic Signboard Detection from Natural Scene Image in Context of Bangladesh Google Street View [PDF] Abstract
12. Reveal of Domain Effect: How Visual Restoration Contributes to Object Detection in Aquatic Scenes [PDF] Abstract
13. Double Backpropagation for Training Autoencoders against Adversarial Attack [PDF] Abstract
14. GarmentGAN: Photo-realistic Adversarial Fashion Transfer [PDF] Abstract
15. MoVi: A Large Multipurpose Motion and Video Dataset [PDF] Abstract
16. A Deep Learning Method for Complex Human Activity Recognition Using Virtual Wearable Sensors [PDF] Abstract
17. Type I Attack for Generative Models [PDF] Abstract
18. Region Adaptive Graph Fourier Transform for 3D Point Clouds [PDF] Abstract
19. Watch your Up-Convolution: CNN Based Generative Deep Neural Networks are Failing to Reproduce Spectral Distributions [PDF] Abstract
20. Implicitly Defined Layers in Neural Networks [PDF] Abstract
21. RODNet: Object Detection under Severe Conditions Using Vision-Radio Cross-Modal Supervision [PDF] Abstract
22. TimeConvNets: A Deep Time Windowed Convolution Neural Network Design for Real-time Video Facial Expression Recognition [PDF] Abstract
23. A Robust Imbalanced SAR Image Change Detection Approach Based on Deep Difference Image and PCANet [PDF] Abstract
24. Blind Image Restoration without Prior Knowledge [PDF] Abstract
25. Image-based OoD-Detector Principles on Graph-based Input Data in Human Action Recognition [PDF] Abstract
26. Voxel Map for Visual SLAM [PDF] Abstract
27. Colored Noise Injection for Training Adversarially Robust Neural Networks [PDF] Abstract
28. Deep Joint Transmission-Recognition for Power-Constrained IoT Devices [PDF] Abstract
29. Redesigning SLAM for Arbitrary Multi-Camera Systems [PDF] Abstract
30. G-VAE: A Continuously Variable Rate Deep Image Compression Framework [PDF] Abstract
31. A Learning Strategy for Contrast-agnostic MRI Segmentation [PDF] Abstract
32. The iCub multisensor datasets for robot and computer vision applications [PDF] Abstract
33. Metrics and methods for robustness evaluation of neural networks with generative models [PDF] Abstract
34. Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement [PDF] Abstract
35. ADRN: Attention-based Deep Residual Network for Hyperspectral Image Denoising [PDF] Abstract
36. Semixup: In- and Out-of-Manifold Regularization for Deep Semi-Supervised Knee Osteoarthritis Severity Grading from Plain Radiographs [PDF] Abstract
37. Gaussianization Flows [PDF] Abstract
38. ETRI-Activity3D: A Large-Scale RGB-D Dataset for Robots to Recognize Daily Activities of the Elderly [PDF] Abstract
39. Black-box Smoothing: A Provable Defense for Pretrained Classifiers [PDF] Abstract
40. Localising Faster: Efficient and precise lidar-based robot localisation in large-scale environments [PDF] Abstract
41. Semantic sensor fusion: from camera to sparse lidar information [PDF] Abstract
42. Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data [PDF] Abstract
43. RMP-SNNs: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Networks [PDF] Abstract
44. Security of Deep Learning based Lane Keeping System under Physical-World Adversarial Attack [PDF] Abstract

Abstracts

1. Spatiotemporal-Aware Augmented Reality: Redefining HCI in Image-Guided Therapy [PDF] Back to Contents
  Javad Fotouhi, Arian Mehrfard, Tianyu Song, Alex Johnson, Greg Osgood, Mathias Unberath, Mehran Armand, Nassir Navab
Abstract: Suboptimal interaction with patient data and challenges in mastering 3D anatomy based on ill-posed 2D interventional images are essential concerns in image-guided therapies. Augmented reality (AR) has been introduced in the operating rooms in the last decade; however, in image-guided interventions, it has often only been considered as a visualization device improving traditional workflows. As a consequence, the technology is gaining minimum maturity that it requires to redefine new procedures, user interfaces, and interactions. The main contribution of this paper is to reveal how exemplary workflows are redefined by taking full advantage of head-mounted displays when entirely co-registered with the imaging system at all times. The proposed AR landscape is enabled by co-localizing the users and the imaging devices via the operating room environment and exploiting all involved frustums to move spatial information between different bodies. The awareness of the system from the geometric and physical characteristics of X-ray imaging allows the redefinition of different human-machine interfaces. We demonstrate that this AR paradigm is generic, and can benefit a wide variety of procedures. Our system achieved an error of $4.76\pm2.91$ mm for placing K-wire in a fracture management procedure, and yielded errors of $1.57\pm1.16^\circ$ and $1.46\pm1.00^\circ$ in the abduction and anteversion angles, respectively, for total hip arthroplasty. We hope that our holistic approach towards improving the interface of surgery not only augments the surgeon's capabilities but also augments the surgical team's experience in carrying out an effective intervention with reduced complications and provide novel approaches of documenting procedures for training purposes.

2. Robust Perceptual Night Vision in Thermal Colorization [PDF] Back to Contents
  Feras Almasri, Olivier Debeir
Abstract: Transforming a thermal infrared image into a robust perceptual colour Visible image is an ill-posed problem due to the differences in their spectral domains and in the objects' representations. Objects appear in one spectrum but not necessarily in the other, and the thermal signature of a single object may have different colours in its Visible representation. This makes a direct mapping from thermal to Visible images impossible and necessitates a solution that preserves texture captured in the thermal spectrum while predicting the possible colour for certain objects. In this work, a deep learning method to map the thermal signature from the thermal image's spectrum to a Visible representation in their low-frequency space is proposed. A pan-sharpening method is then used to merge the predicted low-frequency representation with the high-frequency representation extracted from the thermal image. The proposed model generates colour values consistent with the Visible ground truth when the object does not vary much in its appearance and generates averaged grey values in other cases. The proposed method shows robust perceptual night vision images in preserving the object's appearance and image context compared with the existing state-of-the-art.

3. HintPose [PDF] Back to Contents
  Sanghoon Hong, Hunchul Park, Jonghyuk Park, Sukhyun Cho, Heewoong Park
Abstract: Most of the top-down pose estimation models assume that there exists only one person in a bounding box. However, the assumption is not always correct. In this technical report, we introduce two ideas, instance cue and recurrent refinement, to an existing pose estimator so that the model is able to handle detection boxes with multiple persons properly. When we evaluated our model on the COCO17 keypoints dataset, it showed non-negligible improvement compared to its baseline model. Our model achieved 76.2 mAP as a single model and 77.3 mAP as an ensemble on the test-dev set without additional training data. After additional post-processing with a separate refinement network, our final predictions achieved 77.8 mAP on the COCO test-dev set.

4. VESR-Net: The Winning Solution to Youku Video Enhancement and Super-Resolution Challenge [PDF] Back to Contents
  Jiale Chen, Xu Tan, Chaowei Shan, Sen Liu, Zhibo Chen
Abstract: This paper introduces VESR-Net, a method for video enhancement and super-resolution (VESR). We design a separate non-local module to explore the relations among video frames and fuse video frames efficiently, and a channel attention residual block to capture the relations among feature maps for video frame reconstruction in VESR-Net. We conduct experiments to analyze the effectiveness of these designs in VESR-Net, which demonstrates the advantages of VESR-Net over previous state-of-the-art VESR methods. It is worth to mention that among more than thousands of participants for Youku video enhancement and super-resolution (Youku-VESR) challenge, our proposed VESR-Net beat other competitive methods and ranked the first place.

5. Unity Style Transfer for Person Re-Identification [PDF] Back to Contents
  Chong Liu, Xiaojun Chang, Yi-Dong Shen
Abstract: Style variation has been a major challenge for person re-identification, which aims to match the same pedestrians across different cameras. Existing works attempted to address this problem with camera-invariant descriptor subspace learning. However, there will be more image artifacts when the difference between the images taken by different cameras is larger. To solve this problem, we propose a UnityStyle adaption method, which can smooth the style disparities within the same camera and across different cameras. Specifically, we firstly create UnityGAN to learn the style changes between cameras, producing shape-stable style-unity images for each camera, which is called UnityStyle images. Meanwhile, we use UnityStyle images to eliminate style differences between different images, which makes a better match between query and gallery. Then, we apply the proposed method to Re-ID models, expecting to obtain more style-robust depth features for querying. We conduct extensive experiments on widely used benchmark datasets to evaluate the performance of the proposed framework, the results of which confirm the superiority of the proposed model.

6. Mixup Regularization for Region Proposal based Object Detectors [PDF] Back to Contents
  Shahine Bouabid, Vincent Delaitre
Abstract: Mixup - a neural network regularization technique based on linear interpolation of labeled sample pairs - has stood out by its capacity to improve model's robustness and generalizability through a surprisingly simple formalism. However, its extension to the field of object detection remains unclear as the interpolation of bounding boxes cannot be naively defined. In this paper, we propose to leverage the inherent region mapping structure of anchors to introduce a mixup-driven training regularization for region proposal based object detectors. The proposed method is benchmarked on standard datasets with challenging detection settings. Our experiments show an enhanced robustness to image alterations along with an ability to decontextualize detections, resulting in an improved generalization power.
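
For readers unfamiliar with mixup, here is a minimal PyTorch sketch of how image-level mixup can be adapted to detection targets. The helper name, the Beta parameter, and the per-box weighting scheme are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np
import torch

def mixup_detection(img1, boxes1, labels1, img2, boxes2, labels2, alpha=1.5):
    """Blend two detection samples (hypothetical helper, not the paper's recipe).

    Images are CHW float tensors of the same size; boxes are (N, 4) tensors.
    The mixed image is a convex combination, and the target set is the union
    of both box sets, each box carrying its mixing weight.
    """
    lam = float(np.random.beta(alpha, alpha))
    mixed = lam * img1 + (1.0 - lam) * img2
    boxes = torch.cat([boxes1, boxes2], dim=0)
    labels = torch.cat([labels1, labels2], dim=0)
    # Per-box weights let the detection loss down-weight the fainter sample.
    weights = torch.cat([
        torch.full((boxes1.shape[0],), lam),
        torch.full((boxes2.shape[0],), 1.0 - lam),
    ])
    return mixed, boxes, labels, weights
```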

7. Vehicle-Human Interactive Behaviors in Emergency: Data Extraction from Traffic Accident Videos [PDF] Back to Contents
  Wansong Liu, Danyang Luo, Changxu Wu, Minghui Zheng
Abstract: Currently, studying the vehicle-human interactive behavior in the emergency needs a large amount of datasets in the actual emergent situations that are almost unavailable. Existing public data sources on autonomous vehicles (AVs) mainly focus either on the normal driving scenarios or on emergency situations without human involvement. To fill this gap and facilitate related research, this paper provides a new yet convenient way to extract the interactive behavior data (i.e., the trajectories of vehicles and humans) from actual accident videos that were captured by both the surveillance cameras and driving recorders. The main challenge for data extraction from real-time accident video lies in the fact that the recording cameras are un-calibrated and the angles of surveillance are unknown. The approach proposed in this paper employs image processing to obtain a new perspective which is different from the original video's perspective. Meanwhile, we manually detect and mark object feature points in each image frame. In order to acquire a gradient of reference ratios, a geometric model is implemented in the analysis of reference pixel value, and the feature points are then scaled to the object trajectory based on the gradient of ratios. The generated trajectories not only restore the object movements completely but also reflect changes in vehicle velocity and rotation based on the feature points distributions.

8. Learning to Transfer Texture from Clothing Images to 3D Humans [PDF] Back to Contents
  Aymen Mir, Thiemo Alldieck, Gerard Pons-Moll
Abstract: In this paper, we present a simple yet effective method to automatically transfer textures of clothing images (front and back) to 3D garments worn on top SMPL, in real time. We first automatically compute training pairs of images with aligned 3D garments using a custom non-rigid 3D to 2D registration method, which is accurate but slow. Using these pairs, we learn a mapping from pixels to the 3D garment surface. Our idea is to learn dense correspondences from garment image silhouettes to a 2D-UV map of a 3D garment surface using shape information alone, completely ignoring texture, which allows us to generalize to the wide range of web images. Several experiments demonstrate that our model is more accurate than widely used baselines such as thin-plate-spline warping and image-to-image translation networks while being orders of magnitude faster. Our model opens the door for applications such as virtual-try on, and allows generation of 3D humans with varied textures which is necessary for learning.

9. Annotation-free Learning of Deep Representations for Word Spotting using Synthetic Data and Self Labeling [PDF] Back to Contents
  Fabian Wolf, Gernot A. Fink
Abstract: Word spotting is a popular tool for supporting the first exploration of historic, handwritten document collections. Today, the best performing methods rely on machine learning techniques, which require a high amount of annotated training material. As training data is usually not available in the application scenario, annotation-free methods aim at solving the retrieval task without representative training samples. In this work, we present an annotation-free method that still employs machine learning techniques and therefore outperforms other learning-free approaches. The weakly supervised training scheme relies on a lexicon, that does not need to precisely fit the dataset. In combination with a confidence based selection of pseudo-labeled training samples, we achieve state-of-the-art query-by-example performances. Furthermore, our method allows to perform query-by-string, which is usually not the case for other annotation-free methods.

10. Occlusion Aware Unsupervised Learning of Optical Flow From Video [PDF] Back to Contents
  Jianfeng Li, Junqiao Zhao, Tiantian Feng, Chen Ye, Lu Xiong
Abstract: In this paper, we proposed an unsupervised learning method for estimating the optical flow between video frames, especially to solve the occlusion problem. Occlusion is caused by the movement of an object or the movement of the camera, defined as when certain pixels are visible in one video frame but not in adjacent frames. Due to the lack of pixel correspondence between frames in the occluded area, incorrect photometric loss calculation can mislead the optical flow training process. In the video sequence, we found that the occlusion in the forward ($t\rightarrow t+1$) and backward ($t\rightarrow t-1$) frame pairs are usually complementary. That is, pixels that are occluded in subsequent frames are often not occluded in the previous frame and vice versa. Therefore, by using this complementarity, a new weighted loss is proposed to solve the occlusion problem. In addition, we calculate gradients in multiple directions to provide richer supervision information. Our method achieves competitive optical flow accuracy compared to the baseline and some supervised methods on KITTI 2012 and 2015 benchmarks. This source code has been released at this https URL.
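
The complementarity argument builds on an occlusion estimate; below is a minimal PyTorch sketch of the standard forward-backward consistency check such methods typically start from. The thresholds `alpha1`/`alpha2` and the warping details are illustrative, and the paper's weighted loss itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def fb_occlusion_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Occlusion estimation via forward-backward flow consistency.

    flow_fw: (B, 2, H, W) flow from frame t to t+1; flow_bw: flow from
    t+1 back to t. Returns a (B, 1, H, W) visibility mask (1 = visible).
    A common consistency heuristic, not necessarily the paper's exact form.
    """
    _, _, H, W = flow_fw.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().to(flow_fw.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow_fw                            # target coords
    # Normalize to [-1, 1] and warp the backward flow to frame t.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    bw_warped = F.grid_sample(flow_bw, torch.stack([gx, gy], dim=-1),
                              align_corners=True)
    # For visible pixels, forward and warped-backward flows should cancel.
    mismatch = ((flow_fw + bw_warped) ** 2).sum(1)
    bound = alpha1 * ((flow_fw ** 2).sum(1) + (bw_warped ** 2).sum(1)) + alpha2
    return (mismatch < bound).float().unsqueeze(1)
```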

11. Automatic Signboard Detection from Natural Scene Image in Context of Bangladesh Google Street View [PDF] Back to Contents
  Md. Sadrul Islam Toaha, Chowdhury Rafeed Rahman, Sakib Bin Asad, Tashin Ahmed, Mahfuz Ara Proma, S.M. Shahriar Haque
Abstract: Automatic signboard region detection is the first step of information extraction about establishments from an image, especially when there is a complex background and multiple signboard regions are present in the image. Automatic signboard detection in Bangladesh is a challenging task because of low quality street view image, presence of overlapping objects and presence of signboard like objects which are not actually signboards. In this research, we provide a novel dataset from the perspective of Bangladesh city streets with an aim of signboard detection, namely Bangladesh Street View Signboard Objects (BSVSO) image dataset. We introduce a novel approach to detect signboard accurately by applying smart image processing techniques and statistically determined hyperparameter based deep learning method, Faster R-CNN. Comparison of different variations of this segmentation based learning method have also been performed in this research.

12. Reveal of Domain Effect: How Visual Restoration Contributes to Object Detection in Aquatic Scenes [PDF] Back to Contents
  Xingyu Chen, Yue Lu, Zhengxing Wu, Junzhi Yu, Li Wen
Abstract: Underwater robotic perception usually requires visual restoration and object detection, both of which have been studied for many years. Meanwhile, data domain has a huge impact on modern data-driven leaning process. However, exactly indicating domain effect, the relation between restoration and detection remains unclear. In this paper, we generally investigate the relation of quality-diverse data domain to detection performance. In the meantime, we unveil how visual restoration contributes to object detection in real-world underwater scenes. According to our analysis, five key discoveries are reported: 1) Domain quality has an ignorable effect on within-domain convolutional representation and detection accuracy; 2) low-quality domain leads to higher generalization ability in cross-domain detection; 3) low-quality domain can hardly be well learned in a domain-mixed learning process; 4) degrading recall efficiency, restoration cannot improve within-domain detection accuracy; 5) visual restoration is beneficial to detection in the wild by reducing the domain shift between training data and real-world scenes. Finally, as an illustrative example, we successfully perform underwater object detection with an aquatic robot.

13. Double Backpropagation for Training Autoencoders against Adversarial Attack [PDF] Back to Contents
  Chengjin Sun, Sizhe Chen, Xiaolin Huang
Abstract: Deep learning, as widely known, is vulnerable to adversarial samples. This paper focuses on the adversarial attack on autoencoders. Safety of the autoencoders (AEs) is important because they are widely used as a compression scheme for data storage and transmission, however, the current autoencoders are easily attacked, i.e., one can slightly modify an input but has totally different codes. The vulnerability is rooted in the sensitivity of the autoencoders and to enhance the robustness, we propose to adopt double backpropagation (DBP) to secure autoencoders such as VAE and DRAW. We restrict the gradient from the reconstruction image to the original one so that the autoencoder is not sensitive to trivial perturbation produced by the adversarial attack. After smoothing the gradient by DBP, we further smooth the label by Gaussian Mixture Model (GMM), aiming for accurate and robust classification. We demonstrate in MNIST, CelebA, SVHN that our method leads to a robust autoencoder resistant to attack and a robust classifier able for image transition and immune to adversarial attack if combined with GMM.
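
A minimal sketch of a double-backpropagation penalty, assuming a PyTorch autoencoder and an L2 reconstruction loss; the weight `lam` and the exact quantity being penalized are illustrative.

```python
import torch

def dbp_loss(autoencoder, x, lam=0.1):
    """Reconstruction loss plus a double-backpropagation penalty: penalizing
    the gradient of the loss w.r.t. the input makes the model insensitive
    to small input perturbations. A sketch of the general DBP idea.
    """
    x = x.clone().requires_grad_(True)
    recon = autoencoder(x)
    rec_loss = ((recon - x) ** 2).mean()
    # First backward pass: gradient of the reconstruction loss w.r.t. x,
    # kept in the graph so its norm can be differentiated a second time.
    grad_x, = torch.autograd.grad(rec_loss, x, create_graph=True)
    penalty = (grad_x ** 2).sum(dim=tuple(range(1, x.dim()))).mean()
    return rec_loss + lam * penalty
```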

14. GarmentGAN: Photo-realistic Adversarial Fashion Transfer [PDF] Back to Contents
  Amir Hossein Raffiee, Michael Sollami
Abstract: The garment transfer problem comprises two tasks: learning to separate a person's body (pose, shape, color) from their clothing (garment type, shape, style) and then generating new images of the wearer dressed in arbitrary garments. We present GarmentGAN, a new algorithm that performs image-based garment transfer through generative adversarial methods. The GarmentGAN framework allows users to virtually try-on items before purchase and generalizes to various apparel types. GarmentGAN requires as input only two images, namely, a picture of the target fashion item and an image containing the customer. The output is a synthetic image wherein the customer is wearing the target apparel. In order to make the generated image look photo-realistic, we employ the use of novel generative adversarial techniques. GarmentGAN improves on existing methods in the realism of generated imagery and solves various problems related to self-occlusions. Our proposed model incorporates additional information during training, utilizing both segmentation maps and body key-point information. We show qualitative and quantitative comparisons to several other networks to demonstrate the effectiveness of this technique.

15. MoVi: A Large Multipurpose Motion and Video Dataset [PDF] Back to Contents
  Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, Nikolaus F. Troje
Abstract: Human movements are both an area of intense study and the basis of many applications such as character animation. For many applications, it is crucial to identify movements from videos or analyze datasets of movements. Here we introduce a new human Motion and Video dataset MoVi, which we make available publicly. It contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement. In five capture rounds, the same actors and movements were recorded using different hardware systems, including an optical motion capture system, video cameras, and inertial measurement units (IMU). For some of the capture rounds, the actors were recorded when wearing natural clothing, for the other rounds they wore minimal clothing. In total, our dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data. In this paper, we describe how the dataset was collected and post-processed; We present state-of-the-art estimates of skeletal motions and full-body shape deformations associated with skeletal motion. We discuss examples for potential studies this dataset could enable.

16. A Deep Learning Method for Complex Human Activity Recognition Using Virtual Wearable Sensors [PDF] Back to Contents
  Fanyi Xiao, Ling Pei, Lei Chu, Danping Zou, Wenxian Yu, Yifan Zhu, Tao Li
Abstract: Sensor-based human activity recognition (HAR) is now a research hotspot in multiple application areas. With the rise of smart wearable devices equipped with inertial measurement units (IMUs), researchers begin to utilize IMU data for HAR. By employing machine learning algorithms, early IMU-based research for HAR can achieve accurate classification results on traditional classical HAR datasets, containing only simple and repetitive daily activities. However, these datasets rarely display a rich diversity of information in real-scene. In this paper, we propose a novel method based on deep learning for complex HAR in the real-scene. Specially, in the off-line training stage, the AMASS dataset, containing abundant human poses and virtual IMU data, is innovatively adopted for enhancing the variety and diversity. Moreover, a deep convolutional neural network with an unsupervised penalty is proposed to automatically extract the features of AMASS and improve the robustness. In the on-line testing stage, by leveraging advantages of the transfer learning, we obtain the final result by fine-tuning the partial neural network (optimizing the parameters in the fully-connected layers) using the real IMU data. The experimental results show that the proposed method can surprisingly converge in a few iterations and achieve an accuracy of 91.15% on a real IMU dataset, demonstrating the efficiency and effectiveness of the proposed method.

17. Type I Attack for Generative Models [PDF] Back to Contents
  Chengjin Sun, Sizhe Chen, Jia Cai, Xiaolin Huang
Abstract: Generative models are popular tools with a wide range of applications. Nevertheless, it is as vulnerable to adversarial samples as classifiers. The existing attack methods mainly focus on generating adversarial examples by adding imperceptible perturbations to input, which leads to wrong result. However, we focus on another aspect of attack, i.e., cheating models by significant changes. The former induces Type II error and the latter causes Type I error. In this paper, we propose Type I attack to generative models such as VAE and GAN. One example given in VAE is that we can change an original image significantly to a meaningless one but their reconstruction results are similar. To implement the Type I attack, we destroy the original one by increasing the distance in input space while keeping the output similar because different inputs may correspond to similar features for the property of deep neural network. Experimental results show that our attack method is effective to generate Type I adversarial examples for generative models on large-scale image datasets.
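
A hedged PyTorch sketch of the Type I idea on a generative model: drive the input far from the original while pinning the model's output. Step counts, learning rate, and the loss weighting `beta` are illustrative assumptions.

```python
import torch

def type1_attack(model, x, steps=200, lr=0.01, beta=10.0):
    """Sketch of a Type I (false-negative) attack: maximize the input-space
    distance from x while keeping the model's output close to the original
    output, so a very different input maps to a similar reconstruction.
    """
    with torch.no_grad():
        y_ref = model(x)
    x_adv = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        y = model(x_adv)
        # Maximize input distance, penalize output distance.
        loss = -((x_adv - x) ** 2).mean() + beta * ((y - y_ref) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x_adv.clamp_(0.0, 1.0)  # keep a valid image range
    return x_adv.detach()
```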

18. Region Adaptive Graph Fourier Transform for 3D Point Clouds [PDF] Back to Contents
  Eduardo Pavez, Benjamin Girault, Antonio Ortega, Philip A. Chou
Abstract: We introduce the Region Adaptive Graph Fourier Transform (RA-GFT) for compression of 3D point cloud attributes. We assume the points are organized by a family of nested partitions represented by a tree. The RA-GFT is a multiresolution transform, formed by combining spatially localized block transforms. At each resolution level, attributes are processed in clusters by a set of block transforms. Each block transform produces a single approximation (DC) coefficient, and various detail (AC) coefficients. The DC coefficients are promoted up the tree to the next (lower resolution) level, where the process can be repeated until reaching the root. Since clusters may have a different numbers of points, each block transform must incorporate the relative importance of each coefficient. For this, we introduce the $\mathbf{Q}$-normalized graph Laplacian, and propose using its eigenvectors as the block transform. The RA-GFT outperforms the Region Adaptive Haar Transform (RAHT) by up to 2.5 dB, with a small complexity overhead.
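
The Q-normalized Laplacian named in the abstract can be sketched in a few lines of NumPy. The adjacency construction and the multi-level DC/AC bookkeeping of the full RA-GFT are omitted, and the projection shown in the closing comment is an illustrative plain-GFT view.

```python
import numpy as np

def q_normalized_basis(W, q):
    """Eigenbasis of the Q-normalized graph Laplacian for one point cluster.

    W: symmetric (n, n) adjacency weights of the cluster's graph.
    q: (n,) positive point weights (e.g. how many original points each node
    represents at coarser tree levels). A sketch of the construction named
    in the abstract, not the full RA-GFT pipeline.
    """
    L = np.diag(W.sum(axis=1)) - W            # combinatorial Laplacian
    d = 1.0 / np.sqrt(q)
    Lq = (d[:, None] * L) * d[None, :]        # Q^{-1/2} L Q^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(Lq)     # ascending graph frequencies
    return eigvals, eigvecs

# Transforming a cluster's attribute signal s (plain GFT view):
# coeffs = eigvecs.T @ s; coeffs[0] plays the role of the DC coefficient
# promoted to the next tree level, the rest are the AC details.
```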

19. Watch your Up-Convolution: CNN Based Generative Deep Neural Networks are Failing to Reproduce Spectral Distributions [PDF] Back to Contents
  Ricard Durall, Margret Keuper, Janis Keuper
Abstract: Generative convolutional deep neural networks, e.g. popular GAN architectures, are relying on convolution based up-sampling methods to produce non-scalar outputs like images or video sequences. In this paper, we show that common up-sampling methods, i.e. known as up-convolution or transposed convolution, are causing the inability of such models to reproduce spectral distributions of natural training data correctly. This effect is independent of the underlying architecture and we show that it can be used to easily detect generated data like deepfakes with up to 100% accuracy on public benchmarks. To overcome this drawback of current generative models, we propose to add a novel spectral regularization term to the training optimization objective. We show that this approach not only allows to train spectral consistent GANs that are avoiding high frequency errors. Also, we show that a correct approximation of the frequency spectrum has positive effects on the training stability and output quality of generative networks.
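
The spectral statistic implied here is commonly computed as an azimuthally averaged power spectrum; a minimal NumPy sketch follows. Bin count and normalization are arbitrary choices, and a simple classifier over such curves is the kind of detector the abstract alludes to.

```python
import numpy as np

def azimuthal_power_spectrum(img, bins=64):
    """1D azimuthally averaged power spectrum of a grayscale image; GAN
    outputs with naive up-convolutions tend to deviate from real images
    at the high-frequency end of this curve.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    ys, xs = np.indices((h, w))
    r = np.hypot(ys - h / 2, xs - w / 2)          # radius of each frequency bin
    r_edges = np.linspace(0, r.max(), bins + 1)
    which = np.digitize(r.ravel(), r_edges)
    spectrum = np.array([power.ravel()[which == i].mean()
                         for i in range(1, bins + 1)])
    return spectrum / spectrum[0]  # normalize so curves are comparable
```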

20. Implicitly Defined Layers in Neural Networks [PDF] Back to Contents
  Qianggong Zhang, Yanyang Gu, Michalkiewicz Mateusz, Mahsa Baktashmotlagh, Anders Eriksson
Abstract: In conventional formulations of multilayer feedforward neural networks, the individual layers are customarily defined by explicit functions. In this paper we demonstrate that defining individual layers in a neural network \emph{implicitly} provide much richer representations over the standard explicit one, consequently enabling a vastly broader class of end-to-end trainable architectures. We present a general framework of implicitly defined layers, where much of the theoretical analysis of such layers can be addressed through the implicit function theorem. We also show how implicitly defined layers can be seamlessly incorporated into existing machine learning libraries. In particular with respect to current automatic differentiation techniques for use in backpropagation based training. Finally, we demonstrate the versatility and relevance of our proposed approach on a number of diverse example problems with promising results.
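
A toy fixed-point layer makes the idea concrete. The sketch below iterates to the solution without storing the iterations and re-attaches the result with one differentiable step, a common approximation; exact implicit-function-theorem gradients would additionally solve an adjoint linear system. It is a sketch of the general concept, not the paper's framework.

```python
import torch

class FixedPointLayer(torch.nn.Module):
    """Implicitly defined layer: the output z solves z = tanh(W z + x)."""

    def __init__(self, dim, iters=50):
        super().__init__()
        # Small initialization keeps the map contractive so the
        # fixed-point iteration converges.
        self.W = torch.nn.Parameter(torch.randn(dim, dim) * (0.4 / dim ** 0.5))
        self.iters = iters

    def forward(self, x):
        # Solve the fixed point without building a deep autograd graph.
        with torch.no_grad():
            z = torch.zeros_like(x)
            for _ in range(self.iters):
                z = torch.tanh(z @ self.W.T + x)
        # One differentiable step re-attaches z to the graph (one-step
        # gradient approximation of the implicit function theorem).
        return torch.tanh(z @ self.W.T + x)
```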

21. RODNet: Object Detection under Severe Conditions Using Vision-Radio Cross-Modal Supervision [PDF] Back to Contents
  Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang, Guanbin Xing, Hui Liu
Abstract: Radar is usually more robust than the camera in severe autonomous driving scenarios, e.g., weak/strong lighting and bad weather. However, the semantic information from the radio signals is difficult to extract. In this paper, we propose a radio object detection network (RODNet) to detect objects purely from the processed radar data in the format of range-azimuth frequency heatmaps (RAMaps). To train the RODNet, we introduce a cross-modal supervision framework, which utilizes the rich information extracted by a vision-based object 3D localization technique to teach object detection for the radar. In order to train and evaluate our method, we build a new dataset -- CRUW, containing synchronized video sequences and RAMaps in various scenarios. After intensive experiments, our RODNet shows favorable object detection performance without the presence of the camera. To the best of our knowledge, this is the first work that can achieve accurate multi-class object detection purely using radar data as the input.

22. TimeConvNets: A Deep Time Windowed Convolution Neural Network Design for Real-time Video Facial Expression Recognition [PDF] Back to Contents
  James Ren Hou Lee, Alexander Wong
Abstract: A core challenge faced by the majority of individuals with Autism Spectrum Disorder (ASD) is an impaired ability to infer other people's emotions based on their facial expressions. With significant recent advances in machine learning, one potential approach to leveraging technology to assist such individuals to better recognize facial expressions and reduce the risk of possible loneliness and depression due to social isolation is the design of computer vision-driven facial expression recognition systems. Motivated by this social need as well as the low latency requirement of such systems, this study explores a novel deep time windowed convolutional neural network design (TimeConvNets) for the purpose of real-time video facial expression recognition. More specifically, we explore an efficient convolutional deep neural network design for spatiotemporal encoding of time windowed video frame sub-sequences and study the respective balance between speed and accuracy. Furthermore, to evaluate the proposed TimeConvNet design, we introduce a more difficult dataset called BigFaceX, composed of a modified aggregation of the extended Cohn-Kanade (CK+), BAUM-1, and the eNTERFACE public datasets. Different variants of the proposed TimeConvNet design with different backbone network architectures were evaluated using BigFaceX alongside other network designs for capturing spatiotemporal information, and experimental results demonstrate that TimeConvNets can better capture the transient nuances of facial expressions and boost classification accuracy while maintaining a low inference time.

23. A Robust Imbalanced SAR Image Change Detection Approach Based on Deep Difference Image and PCANet [PDF] Back to Contents
  Xinzheng Zhang, Hang Su, Ce Zhang, Peter M. Atkinson, Xiaoheng Tan, Xiaoping Zeng, Xin Jian
Abstract: In this research, a novel robust change detection approach is presented for imbalanced multi-temporal synthetic aperture radar (SAR) image based on deep learning. Our main contribution is to develop a novel method for generating difference image and a parallel fuzzy c-means (FCM) clustering method. The main steps of our proposed approach are as follows: 1) Inspired by convolution and pooling in deep learning, a deep difference image (DDI) is obtained based on parameterized pooling leading to better speckle suppression and feature enhancement than traditional difference images. 2) Two different parameter Sigmoid nonlinear mapping are applied to the DDI to get two mapped DDIs. Parallel FCM are utilized on these two mapped DDIs to obtain three types of pseudo-label pixels, namely, changed pixels, unchanged pixels, and intermediate pixels. 3) A PCANet with support vector machine (SVM) are trained to classify intermediate pixels to be changed or unchanged. Three imbalanced multi-temporal SAR image sets are used for change detection experiments. The experimental results demonstrate that the proposed approach is effective and robust for imbalanced SAR data, and achieve up to 99.52% change detection accuracy superior to most state-of-the-art methods.

24. Blind Image Restoration without Prior Knowledge [PDF] Back to Contents
  Noam Elron, Shahar S. Yuval, Dmitry Rudoy, Noam Levy
Abstract: Many image restoration techniques are highly dependent on the degradation used during training, and their performance declines significantly when applied to slightly different input. Blind and universal techniques attempt to mitigate this by producing a trained model that can adapt to varying conditions. However, blind techniques to date require prior knowledge of the degradation process, and assumptions regarding its parameter-space. In this paper we present the Self-Normalization Side-Chain (SNSC), a novel approach to blind universal restoration in which no prior knowledge of the degradation is needed. This module can be added to any existing CNN topology, and is trained along with the rest of the network in an end-to-end manner. The imaging parameters relevant to the task, as well as their dynamics, are deduced from the variety in the training data. We apply our solution to several image restoration tasks, and demonstrate that the SNSC encodes the degradation-parameters, improving restoration performance.

25. Image-based OoD-Detector Principles on Graph-based Input Data in Human Action Recognition [PDF] Back to Contents
  Jens Bayer, David Münch, Michael Arens
Abstract: Living in a complex world like ours makes it unacceptable that a practical implementation of a machine learning system assumes a closed world. Therefore, it is necessary for such a learning-based system in a real world environment, to be aware of its own capabilities and limits and to be able to distinguish between confident and unconfident results of the inference, especially if the sample cannot be explained by the underlying distribution. This knowledge is particularly essential in safety-critical environments and tasks e.g. self-driving cars or medical applications. Towards this end, we transfer image-based Out-of-Distribution (OoD)-methods to graph-based data and show the applicability in action recognition. The contribution of this work is (i) the examination of the portability of recent image-based OoD-detectors for graph-based input data, (ii) a Metric Learning-based approach to detect OoD-samples, and (iii) the introduction of a novel semi-synthetic action recognition dataset. The evaluation shows that image-based OoD-methods can be applied to graph-based data. Additionally, there is a gap between the performance on intraclass and intradataset results. First methods as the examined baseline or ODIN provide reasonable results. More sophisticated network architectures - in contrast to their image-based application - were surpassed in the intradataset comparison and even lead to less classification accuracy.

26. Voxel Map for Visual SLAM [PDF] Back to Contents
  Manasi Muglikar, Zichao Zhang, Davide Scaramuzza
Abstract: In modern visual SLAM systems, it is a standard practice to retrieve potential candidate map points from overlapping keyframes for further feature matching or direct tracking. In this work, we argue that keyframes are not the optimal choice for this task, due to several inherent limitations, such as weak geometric reasoning and poor scalability. We propose a voxel-map representation to efficiently retrieve map points for visual SLAM. In particular, we organize the map points in a regular voxel grid. Visible points from a camera pose are queried by sampling the camera frustum in a raycasting manner, which can be done in constant time using an efficient voxel hashing method. Compared with keyframes, the retrieved points using our method are geometrically guaranteed to fall in the camera field-of-view, and occluded points can be identified and removed to a certain extend. This method also naturally scales up to large scenes and complicated multicamera configurations. Experimental results show that our voxel map representation is as efficient as a keyframe map with 5 keyframes and provides significantly higher localization accuracy (average 46% improvement in RMSE) on the EuRoC dataset. The proposed voxel-map representation is a general approach to a fundamental functionality in visual SLAM and widely applicable.
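
A minimal sketch of the underlying data structure: a hash map from integer voxel coordinates to point indices, queried by stepping along rays. This is a simplified stand-in for the paper's frustum raycasting with voxel hashing; parameters and helpers are illustrative.

```python
import numpy as np

def build_voxel_map(points, voxel_size=0.2):
    """Hash map from integer voxel coordinates to indices of the map points
    that fall inside each voxel (constant-time lookup per voxel)."""
    vmap = {}
    keys = np.floor(points / voxel_size).astype(np.int64)
    for idx, key in enumerate(map(tuple, keys)):
        vmap.setdefault(key, []).append(idx)
    return vmap

def query_ray(vmap, origin, direction, voxel_size=0.2, max_range=10.0):
    """Collect point indices along one camera ray by stepping through voxels;
    sampling the whole camera frustum would repeat this over many rays."""
    step = voxel_size * 0.5
    hits, seen = [], set()
    for t in np.arange(0.0, max_range, step):
        key = tuple(np.floor((origin + t * direction) / voxel_size)
                    .astype(np.int64))
        if key not in seen:
            seen.add(key)
            hits.extend(vmap.get(key, []))
    return hits
```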

27. Colored Noise Injection for Training Adversarially Robust Neural Networks [PDF] Back to Contents
  Evgenii Zheltonozhskii, Chaim Baskin, Yaniv Nemcovsky, Brian Chmiel, Avi Mendelson, Alex M. Bronstein
Abstract: Even though deep learning has shown unmatched performance on various tasks, neural networks have been shown to be vulnerable to small adversarial perturbations of the input which lead to significant performance degradation. In this work we extend the idea of adding independent Gaussian noise to weights and activations during adversarial training (PNI) to the injection of colored noise for defense against common white-box and black-box attacks. We show that our approach outperforms PNI and various previous approaches in terms of adversarial accuracy on the CIFAR-10 dataset. In addition, we provide an extensive ablation study of the proposed method justifying the chosen configurations.
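
For intuition, a sketch of weight-noise injection with cross-weight correlation ("colored" noise) in a linear layer. Parameterizing the correlation with a learned mixing matrix is one plausible choice for illustration, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

class NoisyLinear(torch.nn.Module):
    """Linear layer that injects correlated Gaussian noise into its weights
    on every training forward pass (a sketch of the colored-noise idea).
    """
    def __init__(self, in_dim, out_dim, noise_scale=0.1):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)
        # Mixing matrix: noise covariance over inputs is scale^2 * (M M^T).
        self.mix = torch.nn.Parameter(torch.eye(in_dim))
        self.noise_scale = noise_scale

    def forward(self, x):
        if self.training:
            white = torch.randn_like(self.linear.weight)
            colored = white @ self.mix.T        # correlate noise across inputs
            w = self.linear.weight + self.noise_scale * colored
            return F.linear(x, w, self.linear.bias)
        return self.linear(x)
```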

28. Deep Joint Transmission-Recognition for Power-Constrained IoT Devices [PDF] Back to Contents
  Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
Abstract: We propose a joint transmission-recognition scheme for efficient inference at the wireless network edge. Our scheme allows for reliable image recognition over wireless channels with significant computational load reduction at the sender side. We incorporate recently proposed deep joint source-channel coding (JSCC) scheme, and combine it with novel filter pruning strategies aimed at reducing the redundant complexity from neural networks. We evaluate our approach on a classification task, and show satisfactory results in both transmission reliability and workload reduction. This is the first work that combines deep JSCC with network pruning and applies it to images classification over wireless network.

29. Redesigning SLAM for Arbitrary Multi-Camera Systems [PDF] Back to Contents
  Juichung Kuo, Manasi Muglikar, Zichao Zhang, Davide Scaramuzza
Abstract: Adding more cameras to SLAM systems improves robustness and accuracy but complicates the design of the visual front-end significantly. Thus, most systems in the literature are tailored for specific camera configurations. In this work, we aim at an adaptive SLAM system that works for arbitrary multi-camera setups. To this end, we revisit several common building blocks in visual SLAM. In particular, we propose an adaptive initialization scheme, a sensor-agnostic, information-theoretic keyframe selection algorithm, and a scalable voxel-based map. These techniques make little assumption about the actual camera setups and prefer theoretically grounded methods over heuristics. We adapt a state-of-the-art visual-inertial odometry with these modifications, and experimental results show that the modified pipeline can adapt to a wide range of camera setups (e.g., 2 to 6 cameras in one experiment) without the need of sensor-specific modifications or tuning.

30. G-VAE: A Continuously Variable Rate Deep Image Compression Framework [PDF] Back to Contents
  Ze Cui, Jing Wang, Bo Bai, Tiansheng Guo, Yihui Feng
Abstract: Rate adaption of deep image compression in a single model will become one of the decisive factors competing with the classical image compression codecs. However, until now, there is no perfect solution that neither increases the computation nor affects the compression performance. In this paper, we propose a novel image compression framework G-VAE (Gained Variational Autoencoder), which could achieve continuously variable rate in a single model. Unlike the previous solutions that encode progressively or change the internal unit of the network, G-VAE only adds a pair of gain units at the output of encoder and the input of decoder. It is so concise that G-VAE could be applied to almost all the image compression methods and achieve continuously variable rate with negligible additional parameters and computation. We also propose a new deep image compression framework, which outperforms all the published results on Kodak datasets in PSNR and MS-SSIM metrics. Experimental results show that adding a pair of gain units will not affect the performance of the basic models while endowing them with continuously variable rate.
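
A sketch of what a pair of gain units can look like in PyTorch, based only on the abstract's description: a channel-wise scaling of the latent at the encoder output and an inverse scaling at the decoder input. The discrete gain levels and the interpolation rule are illustrative assumptions.

```python
import torch

class GainUnit(torch.nn.Module):
    """Learned gain vectors applied around the quantizer: scaling the latent
    channel-wise trades rate against distortion, and interpolating between
    trained gain pairs yields rates in between.
    """
    def __init__(self, channels, n_levels=8):
        super().__init__()
        self.gain = torch.nn.Parameter(torch.ones(n_levels, channels))
        self.inv_gain = torch.nn.Parameter(torch.ones(n_levels, channels))

    def scale(self, y, level):
        # Encoder side: y is a (B, C, H, W) latent.
        return y * self.gain[level].view(1, -1, 1, 1)

    def unscale(self, y_hat, level):
        # Decoder side, applied to the dequantized latent.
        return y_hat * self.inv_gain[level].view(1, -1, 1, 1)

    def interp_gain(self, l0, l1, t):
        # Exponential interpolation between two trained levels (assumes
        # positive gains) for continuously variable rate.
        return self.gain[l0] ** (1 - t) * self.gain[l1] ** t
```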

31. A Learning Strategy for Contrast-agnostic MRI Segmentation [PDF] Back to Contents
  Benjamin Billot, Douglas Greve, Koen Van Leemput, Bruce Fischl, Juan Eugenio Iglesias, Adrian V. Dalca
Abstract: We present a deep learning strategy that enables, for the first time, contrast-agnostic semantic segmentation of completely unpreprocessed brain MRI scans, without requiring additional training or fine-tuning for new modalities. Classical Bayesian methods address this segmentation problem with unsupervised intensity models, but require significant computational resources. In contrast, learning-based methods can be fast at test time, but are sensitive to the data available at training. Our proposed learning method, SynthSeg, leverages a set of training segmentations (no intensity images required) to generate synthetic sample images of widely varying contrasts on the fly during training. These samples are produced using the generative model of the classical Bayesian segmentation framework, with randomly sampled parameters for appearance, deformation, noise, and bias field. Because each mini-batch has a different synthetic contrast, the final network is not biased towards any MRI contrast. We comprehensively evaluate our approach on four datasets comprising over 1,000 subjects and four types of MR contrast. The results show that our approach successfully segments every contrast in the data, performing slightly better than classical Bayesian segmentation, and three orders of magnitude faster. Moreover, even within the same type of MRI contrast, our strategy generalizes significantly better across datasets, compared to training using real images. Finally, we find that synthesizing a broad range of contrasts, even if unrealistic, increases the generalization of the neural network. Our code and model are open source at this https URL.
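
The sample-generation step can be sketched compactly: draw one random Gaussian intensity distribution per label and render the label map with it. Deformation and bias field, which the full generative model also randomizes, are omitted, and the intensity ranges are illustrative.

```python
import numpy as np

def synth_contrast(label_map, n_labels, noise_std=0.05):
    """Render a segmentation label map as a random-contrast image by sampling
    one Gaussian intensity distribution per label (simplified sketch).
    """
    means = np.random.uniform(0.0, 1.0, size=n_labels)
    stds = np.random.uniform(0.01, 0.1, size=n_labels)
    img = means[label_map] + stds[label_map] * np.random.randn(*label_map.shape)
    img += noise_std * np.random.randn(*label_map.shape)  # global noise
    return np.clip(img, 0.0, 1.0)

# Each call yields a new random "modality" for the same label map, so a
# network trained on (img, label_map) pairs never sees a fixed contrast.
```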

32. The iCub multisensor datasets for robot and computer vision applications [PDF] Back to Contents
  Murat Kirtay, Ugo Albanese, Lorenzo Vannucci, Guido Schillaci, Cecilia Laschi, Egidio Falotico
Abstract: This document presents novel datasets, constructed by employing the iCub robot equipped with an additional depth sensor and color camera. We used the robot to acquire color and depth information for 210 objects in different acquisition scenarios. At this end, the results were large scale datasets for robot and computer vision applications: object representation, object recognition and classification, and action recognition.

33. Metrics and methods for robustness evaluation of neural networks with generative models [PDF] Back to Contents
  Igor Buzhinsky, Arseny Nerinovsky, Stavros Tripakis
Abstract: Recent studies have shown that modern deep neural network classifiers are easy to fool, assuming that an adversary is able to slightly modify their inputs. Many papers have proposed adversarial attacks, defenses and methods to measure robustness to such adversarial perturbations. However, most commonly considered adversarial examples are based on $\ell_p$-bounded perturbations in the input space of the neural network, which are unlikely to arise naturally. Recently, especially in computer vision, researchers discovered "natural" or "semantic" perturbations, such as rotations, changes of brightness, or more high-level changes, but these perturbations have not yet been systematically utilized to measure the performance of classifiers. In this paper, we propose several metrics to measure robustness of classifiers to natural adversarial examples, and methods to evaluate them. These metrics, called latent space performance metrics, are based on the ability of generative models to capture probability distributions, and are defined in their latent spaces. On three image classification case studies, we evaluate the proposed metrics for several classifiers, including ones trained in conventional and robust ways. We find that the latent counterparts of adversarial robustness are associated with the accuracy of the classifier rather than its conventional adversarial robustness, but the latter is still reflected on the properties of found latent perturbations. In addition, our novel method of finding latent adversarial perturbations demonstrates that these perturbations are often perceptually small.
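
One of the latent-space quantities can be sketched as a search for a small latent perturbation that flips a classifier on the decoded image; minimizing its norm gives a "natural" counterpart of the usual input-space robustness radius. The loss weighting below is illustrative.

```python
import torch

def latent_perturbation(generator, classifier, z, steps=100, lr=0.05, reg=1.0):
    """Search for a small perturbation of latent code z such that the
    classifier's prediction on the decoded image changes (sketch).
    """
    with torch.no_grad():
        y0 = classifier(generator(z)).argmax(dim=1)
    delta = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = classifier(generator(z + delta))
        # Push the original class down while keeping the perturbation small.
        loss = logits.gather(1, y0[:, None]).mean() + reg * delta.pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```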
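A minimal sketch of how such a latent perturbation can be searched for: optimise a small latent offset until the classifier's decision on the decoded sample flips. The toy linear generator/classifier and the margin-plus-norm loss are stand-in assumptions, not the paper's models or exact metrics.

```python
import torch

torch.manual_seed(0)
generator = torch.nn.Linear(8, 32)    # latent z -> "image" (toy stand-in)
classifier = torch.nn.Linear(32, 5)   # "image" -> logits   (toy stand-in)

z0 = torch.randn(8)
target = classifier(generator(z0)).argmax()

delta = torch.zeros(8, requires_grad=True)     # latent perturbation to optimise
opt = torch.optim.Adam([delta], lr=0.05)
for _ in range(300):
    logits = classifier(generator(z0 + delta))
    others = logits.clone()
    others[target] = -1e9                      # mask out the original class
    margin = logits[target] - others.max()     # > 0 while the decision is unchanged
    loss = torch.relu(margin) + 0.1 * delta.norm()   # flip with minimal latent norm
    opt.zero_grad(); loss.backward(); opt.step()

flipped = classifier(generator(z0 + delta)).argmax() != target
print("decision flipped:", bool(flipped), "| latent norm:", round(delta.norm().item(), 3))
```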

34. Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement [PDF]
  Ren Yang, Fabian Mentzer, Luc Van Gool, Radu Timofte
Abstract: Recent years have witnessed the great potential of deep learning for video compression. In this paper, we propose the Hierarchical Learned Video Compression (HLVC) approach with three hierarchical quality layers and recurrent enhancement. To be specific, the frames in the first layer are compressed by an image compression method at the highest quality. Using them as references, we propose the Bi-Directional Deep Compression (BDDC) network to compress the second layer with relatively high quality. Then, the third-layer frames are compressed at the lowest quality by the proposed Single Motion Deep Compression (SMDC) network, which adopts a single motion map to estimate the motions of multiple frames, thus saving the bit-rate for motion information. In our deep decoder, we develop the Weighted Recurrent Quality Enhancement (WRQE) network, which takes both compressed frames and the bit stream as inputs. In the recurrent cell of WRQE, the memory and update signal are weighted by quality features to reasonably leverage multi-frame information for enhancement. In our HLVC approach, the hierarchical quality benefits coding efficiency, since the high-quality information facilitates the compression and enhancement of low-quality frames at the encoder and decoder sides, respectively. Finally, the experiments validate that our HLVC approach advances the state-of-the-art deep video compression methods, and outperforms the x265 low-delay P very-fast mode in terms of both PSNR and MS-SSIM. The project page is at this https URL.
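As a toy illustration of the three quality layers, the sketch below assigns the frames of a 10-frame GOP to layers; the exact assignment in HLVC may differ.

```python
# Assumed layout: frames 0 and 10 in layer 1, the middle frame in layer 2,
# everything else in layer 3 (compressed with a shared motion map by SMDC).
def quality_layer(frame_idx: int, gop: int = 10) -> int:
    pos = frame_idx % gop
    if pos == 0:
        return 1            # compressed as an image, highest quality
    if pos == gop // 2:
        return 2            # BDDC: bi-directionally predicted, medium quality
    return 3                # SMDC: shared motion map, lowest quality

for f in range(11):
    print(f, "-> layer", quality_layer(f))
```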

35. ADRN: Attention-based Deep Residual Network for Hyperspectral Image Denoising [PDF]
  Yongsen Zhao, Deming Zhai, Junjun Jiang, Xianming Liu
Abstract: Hyperspectral image (HSI) denoising is of crucial importance for many subsequent applications, such as HSI classification and interpretation. In this paper, we propose an attention-based deep residual network to directly learn a mapping from a noisy HSI to the clean one. To jointly utilize spatial-spectral information, the current band and its $K$ adjacent bands are simultaneously exploited as the input. Then, we adopt convolution layers with different filter sizes to fuse multi-scale features, and use shortcut connections to incorporate multi-level information for better noise removal. In addition, a channel attention mechanism is employed to make the network concentrate on the most relevant auxiliary information and features that best benefit the denoising process. To ease the training procedure, we reconstruct the output in a residual fashion rather than by straightforward prediction. Experimental results demonstrate that our proposed ADRN scheme outperforms the state-of-the-art methods in both quantitative and visual evaluations.
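The channel attention mentioned above is in the spirit of squeeze-and-excitation re-weighting; the following PyTorch sketch shows one plausible block, with the reduction ratio and layer sizes as assumptions rather than the authors' exact ADRN configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # re-weight feature channels

feats = torch.randn(2, 64, 32, 32)    # e.g. fused multi-scale HSI features
print(ChannelAttention(64)(feats).shape)
```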

36. Semixup: In- and Out-of-Manifold Regularization for Deep Semi-Supervised Knee Osteoarthritis Severity Grading from Plain Radiographs [PDF]
  Huy Hoang Nguyen, Simo Saarakkala, Matthew Blaschko, Aleksei Tiulpin
Abstract: Knee osteoarthritis (OA) is one of the major causes of disability in humans worldwide. This musculoskeletal disorder is assessed from clinical symptoms, and typically confirmed via radiographic assessment. This visual assessment done by a radiologist requires experience, and suffers from high inter-observer variability. Recent developments in the literature have shown that deep learning (DL) methods can reliably perform OA severity assessment according to the gold-standard Kellgren-Lawrence (KL) grading system. However, these methods require large amounts of labeled data, which are costly to obtain. In this study, we propose the Semixup algorithm, a semi-supervised learning (SSL) approach to leverage unlabeled data. Semixup relies on consistency regularization using in- and out-of-manifold samples, together with interpolated consistency. On an independent test set, our method significantly outperformed other state-of-the-art SSL methods in most cases, and even achieved a performance comparable to that of a well-tuned fully supervised learning (SL) model that required over 12 times more labeled data.
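A minimal sketch of the interpolation-consistency ingredient on unlabeled data: the prediction on a mixed input is pushed towards the mix of the predictions on the unmixed inputs. The tiny model and the Beta(0.75, 0.75) mixing coefficient are illustrative assumptions, not the exact Semixup losses.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 5), torch.nn.Softmax(dim=-1))

x1, x2 = torch.randn(8, 16), torch.randn(8, 16)   # two unlabeled batches
lam = torch.distributions.Beta(0.75, 0.75).sample()
x_mix = lam * x1 + (1 - lam) * x2                 # in/out-of-manifold mixed samples

with torch.no_grad():                             # targets come from unmixed inputs
    p_mix = lam * model(x1) + (1 - lam) * model(x2)
consistency = torch.mean((model(x_mix) - p_mix) ** 2)   # unsupervised penalty
print(consistency.item())
```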

37. Gaussianization Flows [PDF]
  Chenlin Meng, Yang Song, Jiaming Song, Stefano Ermon
Abstract: Iterative Gaussianization is a fixed-point iteration procedure that can transform any continuous random vector into a Gaussian one. Based on iterative Gaussianization, we propose a new type of normalizing flow model that enables both efficient computation of likelihoods and efficient inversion for sample generation. We demonstrate that these models, named Gaussianization flows, are universal approximators for continuous probability distributions under some regularity conditions. Because of this guaranteed expressivity, they can capture multimodal target distributions without compromising the efficiency of sample generation. Experimentally, we show that Gaussianization flows achieve better or comparable performance on several tabular datasets compared to other efficiently invertible flow models such as Real NVP, Glow and FFJORD. In particular, Gaussianization flows are easier to initialize, demonstrate better robustness with respect to different transformations of the training data, and generalize better on small training sets.
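One fixed-point iteration of classical iterative Gaussianization can be sketched in a few lines: marginal Gaussianization via the empirical CDF, followed by a random rotation. The learned, invertible parameterisation in Gaussianization Flows is richer; this only illustrates the underlying operation.

```python
import numpy as np
from scipy.stats import norm

def gaussianize_step(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    n, d = x.shape
    ranks = x.argsort(axis=0).argsort(axis=0) + 1        # per-dimension ranks
    u = ranks / (n + 1)                                  # empirical CDF, in (0, 1)
    z = norm.ppf(u)                                      # marginally Gaussian now
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))         # random orthogonal rotation
    return z @ q

rng = np.random.default_rng(0)
x = rng.exponential(size=(2000, 3))                      # clearly non-Gaussian data
for _ in range(5):
    x = gaussianize_step(x, rng)
print(np.round(np.cov(x.T), 2))                          # approaches the identity
```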

38. ETRI-Activity3D: A Large-Scale RGB-D Dataset for Robots to Recognize Daily Activities of the Elderly [PDF]
  Jinhyeok Jang, Dohyung Kim, Cheonshu Park, Minsu Jang, Jaeyeon Lee, Jaehong Kim
Abstract: Deep learning, on which many modern algorithms are based, is well known to be data-hungry. In particular, datasets appropriate for the intended application are difficult to obtain. To cope with this situation, we introduce a new dataset called ETRI-Activity3D, focusing on the daily activities of the elderly as seen from a robot's view. The major characteristics of the new dataset are as follows: 1) practical action categories that are selected from close observation of the daily lives of the elderly; 2) realistic data collection, which reflects the robot's working environment and service situations; and 3) a large scale that overcomes the limitations of current 3D activity analysis benchmark datasets. The proposed dataset contains 112,620 samples including RGB videos, depth maps, and skeleton sequences. During the data acquisition, 100 subjects were asked to perform 55 daily activities. Additionally, we propose a novel network called the four-stream adaptive CNN (FSA-CNN). The proposed FSA-CNN has three main properties: robustness to spatio-temporal variations, an input-adaptive activation function, and an extension of the conventional two-stream approach. In the experiment section, we confirm the superiority of the proposed FSA-CNN using NTU RGB+D and ETRI-Activity3D. Furthermore, the domain difference between the two age groups was verified experimentally. Finally, the extension of FSA-CNN to deal with multimodal data was investigated.

39. Black-box Smoothing: A Provable Defense for Pretrained Classifiers [PDF]
  Hadi Salman, Mingjie Sun, Greg Yang, Ashish Kapoor, J. Zico Kolter
Abstract: We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks. By prepending a custom-trained denoiser to any off-the-shelf image classifier and using randomized smoothing, we effectively create a new classifier that is guaranteed to be $\ell_p$-robust to adversarial examples, without modifying the pretrained classifier. The approach applies both to the case where we have full access to the pretrained classifier as well as the case where we only have query access. We refer to this defense as black-box smoothing, and we demonstrate its effectiveness through extensive experimentation on ImageNet and CIFAR-10. Finally, we use our method to provably defend the Azure, Google, AWS, and ClarifAI image classification APIs. Our code replicating all the experiments in the paper can be found at this https URL.
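Prediction under this defense can be sketched as follows: add Gaussian noise, denoise, classify with the frozen pretrained model, and take a majority vote over many draws. The toy denoiser/classifier and the noise level below are placeholders, and the certification step (confidence bounds on the vote) is omitted.

```python
import torch

torch.manual_seed(0)
denoiser = torch.nn.Identity()                 # stand-in for the custom-trained denoiser
classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))

def smoothed_predict(x: torch.Tensor, n: int = 100, sigma: float = 0.25) -> int:
    """Majority vote of classifier(denoiser(x + noise)) over n Gaussian draws."""
    votes = torch.zeros(10, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            noisy = x + sigma * torch.randn_like(x)
            pred = classifier(denoiser(noisy)).argmax(dim=-1)
            votes[pred] += 1
    return int(votes.argmax())

x = torch.rand(1, 3, 8, 8)
print(smoothed_predict(x))
```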

40. Localising Faster: Efficient and precise lidar-based robot localisation in large-scale environments [PDF]
  Li Sun, Daniel Adolfsson, Martin Magnusson, Henrik Andreasson, Ingmar Posner, Tom Duckett
Abstract: This paper proposes a novel approach for the global localisation of mobile robots in large-scale environments. Our method leverages learning-based localisation and filtering-based localisation to localise the robot efficiently and precisely by seeding Monte Carlo Localisation (MCL) with a deep-learned distribution. In particular, a fast localisation system rapidly estimates the 6-DOF pose through a deep probabilistic model (Gaussian Process Regression with a deep kernel), then a precise recursive estimator refines the estimated robot pose according to the geometric alignment. More importantly, the Gaussian method (i.e. deep probabilistic localisation) and the non-Gaussian method (i.e. MCL) can be integrated naturally via importance sampling. Consequently, the two systems can be integrated seamlessly and mutually benefit from each other. To verify the proposed framework, we provide a case study in large-scale localisation with a 3D lidar sensor. Our experiments on the Michigan NCLT long-term dataset show that the proposed method is able to localise the robot in 1.94 s on average (median of 0.8 s) with a precision of 0.75 m in a large-scale environment of approximately 0.5 km².
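A toy sketch of the seeding step: instead of spreading particles uniformly over the map, MCL is initialised from the learned predictive Gaussian. The pose dimensionality and values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([12.0, -3.0, 0.4])        # learned (x, y, yaw) pose prediction
cov = np.diag([1.0, 1.0, 0.1]) ** 2     # learned predictive uncertainty

particles = rng.multivariate_normal(mu, cov, size=500)   # MCL seed, not uniform
weights = np.full(500, 1.0 / 500)
# MCL then refines these with lidar observation likelihoods; importance sampling
# corrects for drawing from the learned proposal rather than the uniform prior.
print(particles.mean(axis=0))
```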

41. Semantic sensor fusion: from camera to sparse lidar information [PDF]
  Julie Stephany Berrio, Mao Shan, Stewart Worrall, James Ward, Eduardo Nebot
Abstract: To navigate through urban roads, an automated vehicle must be able to perceive and recognize objects in a three-dimensional environment. A high-level contextual understanding of the surroundings is necessary to plan and execute accurate driving maneuvers. This paper presents an approach to fuse different sensory information: Light Detection and Ranging (lidar) scans and camera images. The output of a convolutional neural network (CNN) is used as a classifier to obtain the labels of the environment. The transfer of semantic information between the labelled image and the lidar point cloud is performed in four steps: initially, we use heuristic methods to associate probabilities with all the semantic classes contained in the labelled images. Then, the lidar points are corrected to compensate for the vehicle's motion, given the difference between the timestamps of each lidar scan and camera image. In a third step, we calculate the pixel coordinates for the corresponding camera image. In the last step, we transfer the semantic information from the heuristic probability images to the lidar frame, while removing the lidar information that is not visible to the camera. We tested our approach on the USyd Dataset, obtaining qualitative and quantitative results that demonstrate the validity of our probabilistic sensory fusion approach.
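The geometric core of the label transfer (steps three and four) is a standard pinhole projection; a numpy sketch under made-up calibration follows. `transfer_labels`, `K`, and `T` are illustrative placeholders, not the USyd calibration or the authors' code.

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])            # camera intrinsics (assumed)
T = np.eye(4)                              # lidar -> camera extrinsics (assumed)

def transfer_labels(points: np.ndarray, prob_img: np.ndarray) -> np.ndarray:
    """points: (N, 3) lidar xyz; prob_img: (H, W, C) per-class probabilities."""
    h, w, c = prob_img.shape
    cam = (T @ np.c_[points, np.ones(len(points))].T)[:3]   # into the camera frame
    in_front = cam[2] > 0.1                                 # drop points behind the camera
    uv = K @ cam[:, in_front]
    uv = (uv[:2] / uv[2]).round().astype(int)               # pixel coordinates
    valid = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    labels = np.full((len(points), c), np.nan)
    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = prob_img[uv[1, valid], uv[0, valid]]
    return labels          # NaN rows = lidar points not visible to the camera

pts = np.random.default_rng(0).uniform(-5, 5, size=(100, 3)) + [0, 0, 8]
print(np.isnan(transfer_labels(pts, np.random.rand(480, 640, 3))).all(axis=1).sum())
```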

42. Learning Rope Manipulation Policies Using Dense Object Descriptors Trained on Synthetic Depth Data [PDF]
  Priya Sundaresan, Jennifer Grannen, Brijen Thananjeyan, Ashwin Balakrishna, Michael Laskey, Kevin Stone, Joseph E. Gonzalez, Ken Goldberg
Abstract: Robotic manipulation of deformable 1D objects such as ropes, cables, and hoses is challenging due to the lack of high-fidelity analytic models and large configuration spaces. Furthermore, learning end-to-end manipulation policies directly from images and physical interaction requires significant time on a robot and can fail to generalize across tasks. We address these challenges using interpretable deep visual representations for rope, extending recent work on dense object descriptors for robot manipulation. This facilitates the design of interpretable and transferable geometric policies built on top of the learned representations, decoupling visual reasoning and control. We present an approach that learns point-pair correspondences between initial and goal rope configurations, which implicitly encodes geometric structure, entirely in simulation from synthetic depth images. We demonstrate that the learned representation -- dense depth object descriptors (DDODs) -- can be used to manipulate a real rope into a variety of different arrangements either by learning from demonstrations or using interpretable geometric policies. In 50 trials of a knot-tying task with the ABB YuMi Robot, the system achieves a 66% knot-tying success rate from previously unseen configurations. See this https URL for supplementary material and videos.
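A sketch of how dense descriptors yield point-pair correspondences: for a pixel in the goal image, find the pixel in the current image with the nearest descriptor. The random descriptor maps below stand in for learned DDODs.

```python
import numpy as np

rng = np.random.default_rng(0)
desc_now = rng.normal(size=(64, 64, 16))     # (H, W, D) descriptor image, current rope
desc_goal = rng.normal(size=(64, 64, 16))    # same for the goal configuration

def correspond(goal_px: tuple) -> tuple:
    """Nearest-neighbour lookup in descriptor space for one annotated goal pixel."""
    d = desc_goal[goal_px]                               # (D,) query descriptor
    dists = np.linalg.norm(desc_now - d, axis=-1)        # (H, W) descriptor distances
    return tuple(np.unravel_index(dists.argmin(), dists.shape))

print(correspond((10, 20)))    # pixel on the current rope matching the goal pixel
```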

43. RMP-SNNs: Residual Membrane Potential Neuron for Enabling Deeper High-Accuracy and Low-Latency Spiking Neural Networks [PDF]
  Bing Han, Gopalakrishnan Srinivasan, Kaushik Roy
Abstract: Spiking Neural Networks (SNNs) have recently attracted significant research interest as the third generation of artificial neural networks that can enable low-power event-driven data analytics. The best performing SNNs for image recognition tasks are obtained by converting a trained Analog Neural Network (ANN), consisting of Rectified Linear Units (ReLU), to an SNN composed of integrate-and-fire neurons with "proper" firing thresholds. The converted SNNs typically incur a loss in accuracy compared to that provided by the original ANN and require a sizable number of inference time-steps to achieve the best accuracy. We find that performance degradation in the converted SNN stems from using a "hard reset" spiking neuron that is driven to a fixed reset potential once its membrane potential exceeds the firing threshold, leading to information loss during SNN inference. We propose ANN-SNN conversion using a "soft reset" spiking neuron model, referred to as the Residual Membrane Potential (RMP) spiking neuron, which retains the "residual" membrane potential above threshold at the firing instants. We demonstrate near loss-less ANN-SNN conversion using RMP neurons for VGG-16, ResNet-20, and ResNet-34 SNNs on challenging datasets including CIFAR-10 (93.63% top-1), CIFAR-100 (70.928% top-1), and ImageNet (73.26% top-1 accuracy). Our results also show that RMP-SNN achieves accuracy comparable to that provided by the converted SNN with "hard reset" spiking neurons, while using 2-8 times fewer inference time-steps across datasets.
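The hard-versus-soft reset distinction is small enough to show directly: in the sketch below only the reset rule differs, and the soft-reset (RMP-style) neuron preserves the residual charge above threshold instead of discarding it. Input values and the threshold are illustrative.

```python
def run(inputs, threshold=1.0, soft_reset=True):
    """Integrate-and-fire neuron; returns the spike train for a list of inputs."""
    v, spikes = 0.0, []
    for x in inputs:
        v += x                                # integrate the input current
        if v >= threshold:                    # fire
            spikes.append(1)
            v = v - threshold if soft_reset else 0.0   # RMP keeps the residual
        else:
            spikes.append(0)
    return spikes

inp = [0.6, 0.9, 0.6, 0.2]
print("hard:", run(inp, soft_reset=False))    # residual above threshold is lost
print("soft:", run(inp, soft_reset=True))     # residual carries into the next step
```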

44. Security of Deep Learning based Lane Keeping System under Physical-World Adversarial Attack [PDF]
  Takami Sato, Junjie Shen, Ningfei Wang, Yunhan Jack Jia, Xue Lin, Qi Alfred Chen
Abstract: Lane-Keeping Assistance Systems (LKAS) are convenient and widely available today, but they are also extremely security- and safety-critical. In this work, we design and implement the first systematic approach to attack real-world DNN-based LKASes. We identify dirty road patches as a novel and domain-specific threat model, chosen for practicality and stealthiness. We formulate the attack as an optimization problem, and address the challenge arising from the inter-dependencies among attacks on consecutive camera frames. We evaluate our approach on a state-of-the-art LKAS, and our preliminary results show that our attack can successfully cause it to drive off lane boundaries in as little as 1.3 seconds.
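A toy optimisation sketch of the inter-frame dependency: a single patch texture is shared by several consecutive frames (seen at drifting positions as the car moves) and optimised to bias a steering model's output. The model, patch placement, and loss are stand-in assumptions, not the paper's formulation.

```python
import torch

torch.manual_seed(0)
steer = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 16, 1))
frames = torch.rand(5, 3, 16, 16)            # consecutive clean camera frames
patch = torch.zeros(3, 4, 4, requires_grad=True)   # one shared patch texture
opt = torch.optim.Adam([patch], lr=0.1)

for _ in range(100):
    loss = 0.0
    for t, frame in enumerate(frames):       # inter-frame dependency: same patch
        attacked = frame.clone()
        attacked[:, 6 + t:10 + t, 6:10] = torch.sigmoid(patch)  # patch drifts per frame
        loss = loss - steer(attacked.unsqueeze(0)).squeeze()    # push steering one way
    opt.zero_grad(); loss.backward(); opt.step()

print(steer(frames[0:1]).item())             # clean steering output, for reference
```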
