Contents
1. A study of Neural networks point source extraction on simulated Fermi/LAT Telescope images [PDF] Abstract
2. SegFix: Model-Agnostic Boundary Refinement for Segmentation [PDF] Abstract
3. A Multi-Level Approach to Waste Object Segmentation [PDF] Abstract
4. Superpixel Segmentation using Dynamic and Iterative Spanning Forest [PDF] Abstract
5. Deformable spatial propagation network for depth completion [PDF] Abstract
6. Dynamic Group Convolution for Accelerating Convolutional Neural Networks [PDF] Abstract
7. Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms [PDF] Abstract
8. KIT MOMA: A Mobile Machines Dataset [PDF] Abstract
9. Evaluation for Weakly Supervised Object Localization: Protocol, Metrics, and Datasets [PDF] Abstract
10. Robust Re-Identification by Multiple Views Knowledge Distillation [PDF] Abstract
11. Combating Domain Shift with Self-Taught Labeling [PDF] Abstract
12. SLAP: Improving Physical Adversarial Examples with Short-Lived Adversarial Perturbations [PDF] Abstract
13. Delving into the Adversarial Robustness on Face Recognition [PDF] Abstract
14. A Distilled Model for Tracking and Tracker Fusion [PDF] Abstract
15. On Learning Semantic Representations for Million-Scale Free-Hand Sketches [PDF] Abstract
16. Synthetic-to-Real Domain Adaptation for Lane Detection [PDF] Abstract
17. Adaptive 3D Face Reconstruction from a Single Image [PDF] Abstract
18. Remix: Rebalanced Mixup [PDF] Abstract
Abstracts
1. A study of Neural networks point source extraction on simulated Fermi/LAT Telescope images [PDF] Back to contents
Mariia Drozdova, Anton Broilovskiy, Andrey Ustyuzhanin, Denys Malyshev
Abstract: Astrophysical images in the GeV band are challenging to analyze due to the strong contribution of the background and foreground astrophysical diffuse emission and the relatively broad point spread function of modern space-based instruments. In certain cases, even finding point sources in an image becomes a non-trivial task. We present a method for point-source extraction using a convolutional neural network (CNN) trained on our own artificial data set, which imitates images from the Fermi Large Area Telescope. These images are raw photon count maps of 10x10 degrees covering energies from 1 to 10 GeV. We compare different CNN architectures, which demonstrate a ~15% accuracy increase and reduce inference time by at least a factor of 4 with respect to similar state-of-the-art models.
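As a concrete illustration, the sketch below frames the task as per-pixel source detection with a small fully-convolutional network in PyTorch; the layer sizes and the sigmoid output head are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

# Hypothetical fully-convolutional detector: maps a raw photon count map
# (one channel) to a per-pixel point-source probability of the same size.
class PointSourceCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),   # per-pixel source logit
        )

    def forward(self, counts):                 # counts: (B, 1, H, W)
        return torch.sigmoid(self.net(counts))

model = PointSourceCNN()
toy_map = torch.poisson(torch.full((1, 1, 64, 64), 3.0))  # toy count map
prob_map = model(toy_map)                      # (1, 1, 64, 64) probabilities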
2. SegFix: Model-Agnostic Boundary Refinement for Segmentation [PDF] Back to contents
Yuhui Yuan, Jingyi Xie, Xilin Chen, Jingdong Wang
Abstract: We present a model-agnostic post-processing scheme to improve the boundary quality for the segmentation result that is generated by any existing segmentation model. Motivated by the empirical observation that the label predictions of interior pixels are more reliable, we propose to replace the originally unreliable predictions of boundary pixels by the predictions of interior pixels. Our approach processes only the input image through two steps: (i) localize the boundary pixels and (ii) identify the corresponding interior pixel for each boundary pixel. We build the correspondence by learning a direction away from the boundary pixel to an interior pixel. Our method requires no prior information of the segmentation models and achieves nearly real-time speed. We empirically verify that our SegFix consistently reduces the boundary errors for segmentation results generated from various state-of-the-art models on Cityscapes, ADE20K and GTA5. Code is available at: this https URL.
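To make the two-step procedure concrete, here is a minimal NumPy sketch of the refinement step; it assumes the boundary mask and the interior-pointing offsets have already been produced (in the paper both come from a learned network), and the helper below only consumes them.

import numpy as np

def segfix_refine(labels, boundary_mask, offsets):
    """labels: (H, W) int label map; boundary_mask: (H, W) bool;
    offsets: (H, W, 2) integer (dy, dx) pointing toward the interior."""
    refined = labels.copy()
    H, W = labels.shape
    ys, xs = np.nonzero(boundary_mask)
    for y, x in zip(ys, xs):
        dy, dx = offsets[y, x]
        iy = int(np.clip(y + dy, 0, H - 1))
        ix = int(np.clip(x + dx, 0, W - 1))
        refined[y, x] = labels[iy, ix]   # trust the interior prediction
    return refined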
3. A Multi-Level Approach to Waste Object Segmentation [PDF] Back to contents
Tao Wang, Yuanzheng Cai, Lingyu Liang, Dongyi Ye
Abstract: We address the problem of localizing waste objects from a color image and an optional depth image, which is a key perception component for robotic interaction with such objects. Specifically, our method integrates the intensity and depth information at multiple levels of spatial granularity. Firstly, a scene-level deep network produces an initial coarse segmentation, based on which we select a few potential object regions to zoom in and perform fine segmentation. The results of the above steps are further integrated into a densely connected conditional random field that learns to respect the appearance, depth, and spatial affinities with pixel-level accuracy. In addition, we create a new RGBD waste object segmentation dataset, MJU-Waste, that is made public to facilitate future research in this area. The efficacy of our method is validated on both MJU-Waste and the Trash Annotation in Context (TACO) dataset.
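The zoom-in step of the pipeline can be sketched as follows, assuming scipy is available; the coarse mask is taken as given, fine_model is a stand-in callable, and the dense-CRF stage is omitted.

import numpy as np
from scipy import ndimage

def refine_regions(image, coarse_mask, fine_model, max_regions=3):
    # Connected regions of the coarse mask are candidate objects.
    labeled, n = ndimage.label(coarse_mask)
    sizes = ndimage.sum(coarse_mask, labeled, range(1, n + 1))
    out = coarse_mask.astype(float).copy()
    # Zoom into the largest regions and paste the fine prediction back.
    for r in np.argsort(-sizes)[:max_regions] + 1:
        ys, xs = np.nonzero(labeled == r)
        y0, y1 = ys.min(), ys.max() + 1
        x0, x1 = xs.min(), xs.max() + 1
        out[y0:y1, x0:x1] = fine_model(image[y0:y1, x0:x1])
    return out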
4. Superpixel Segmentation using Dynamic and Iterative Spanning Forest [PDF] Back to contents
F.C. Belem, S.J.F. Guimaraes, A.X. Falcao
Abstract: As constituent parts of image objects, superpixels can improve several higher-level operations. However, image segmentation methods might have their accuracy seriously compromised for reduced numbers of superpixels. We have investigated a solution based on the Iterative Spanning Forest (ISF) framework. In this work, we present Dynamic ISF (DISF) -- a method based on the following steps. (a) It starts from an image graph and a seed set with considerably more pixels than the desired number of superpixels. (b) The seeds compete among themselves, and each seed conquers its most closely connected pixels, resulting in an image partition (spanning forest) with connected superpixels. In step (c), DISF assigns relevance values to seeds based on superpixel analysis and removes the most irrelevant ones. Steps (b) and (c) are repeated until the desired number of superpixels is reached. DISF has the chance to reconstruct relevant edges after each iteration, when compared to region merging algorithms. As compared to other seed-based superpixel methods, DISF is more likely to find relevant seeds. It also introduces dynamic arc-weight estimation in the ISF framework for more effective superpixel delineation, and we demonstrate all results on three datasets with distinct object properties.
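The outer loop of steps (a)-(c) can be sketched compactly; here the spanning-forest competition is replaced by a toy nearest-seed assignment in color-plus-position space, and seed relevance is approximated by superpixel size, both simplifications of the actual IFT-based method.

import numpy as np

def disf_sketch(image, n0_seeds, nf_superpixels):
    """image: (H, W, 3) float array; returns an (H, W) superpixel label map."""
    H, W, _ = image.shape
    yx = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
    feats = np.concatenate([image.reshape(-1, 3),
                            yx.reshape(-1, 2) / max(H, W)], axis=1)
    rng = np.random.default_rng(0)
    seeds = rng.choice(H * W, n0_seeds, replace=False)       # (a) oversample seeds
    while True:
        d = ((feats[:, None, :] - feats[seeds][None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                                  # (b) seeds compete
        if len(seeds) <= nf_superpixels:
            return assign.reshape(H, W)
        sizes = np.bincount(assign, minlength=len(seeds))     # toy relevance: size
        keep = np.argsort(-sizes)[:max(nf_superpixels, len(seeds) // 2)]
        seeds = seeds[keep]                                   # (c) drop irrelevant seeds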
5. Deformable spatial propagation network for depth completion [PDF] Back to contents
Zheyuan Xu, Yingfu Wang, Jian Yao
Abstract: Depth completion, which aims to recover a dense depth map from sparse depth measurements, has recently attracted extensive attention due to the development of autonomous driving. The convolutional spatial propagation network (CSPN) is one of the state-of-the-art methods for this task; it adopts a linear propagation model to refine coarse depth maps with local context. However, the propagation of each pixel occurs in a fixed receptive field. This may not be optimal for refinement, since different pixels need different local context. To tackle this issue, in this paper we propose a deformable spatial propagation network (DSPN) that adaptively generates a different receptive field and affinity matrix for each pixel. It allows the network to obtain information using far fewer but more relevant pixels for propagation. Experimental results on the KITTI depth completion benchmark demonstrate that our proposed method achieves state-of-the-art performance.
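A single propagation step of this idea, written in NumPy; the per-pixel neighbor offsets and affinities are taken as inputs here, whereas in the paper they are predicted by the network.

import numpy as np

def dspn_step(depth, offsets, affinity):
    """depth: (H, W); offsets: (H, W, K, 2) int (dy, dx) per pixel;
    affinity: (H, W, K) non-negative, rows summing to <= 1."""
    H, W, K, _ = offsets.shape
    out = (1.0 - affinity.sum(-1)) * depth       # residue keeps own depth
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    for k in range(K):
        ny = np.clip(ys + offsets[..., k, 0], 0, H - 1)
        nx = np.clip(xs + offsets[..., k, 1], 0, W - 1)
        out += affinity[..., k] * depth[ny, nx]  # weighted deformable neighbors
    return out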
6. Dynamic Group Convolution for Accelerating Convolutional Neural Networks [PDF] Back to contents
Zhuo Su, Linpu Fang, Wenxiong Kang, Dewen Hu, Matti Pietikäinen, Li Liu
Abstract: Replacing normal convolutions with group convolutions can significantly increase the computational efficiency of modern deep convolutional networks, and this has been widely adopted in compact network architecture designs. However, existing group convolutions undermine the original network structure by cutting off some connections permanently, resulting in significant accuracy degradation. In this paper, we propose dynamic group convolution (DGC), which adaptively selects which part of the input channels to connect within each group for individual samples on the fly. Specifically, we equip each group with a small feature selector to automatically select the most important input channels, conditioned on the input images. Multiple groups can adaptively capture abundant and complementary visual/semantic features for each input image. DGC preserves the original network structure and simultaneously has computational efficiency similar to conventional group convolutions. Extensive experiments on multiple image classification benchmarks, including CIFAR-10, CIFAR-100 and ImageNet, demonstrate its superiority over existing group convolution techniques and dynamic execution methods. The code is available at this https URL.
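A hedged PyTorch sketch of the mechanism: a small per-group gating head scores the input channels from global context, and each group convolves only its top-scoring channels for that sample. The head design and the hard top-k selection are assumptions, not the paper's exact saliency generator.

import torch
import torch.nn as nn

class DynamicGroupConv(nn.Module):
    def __init__(self, in_ch, out_ch, groups=2, keep_ratio=0.5):
        super().__init__()
        self.k = max(1, int(in_ch * keep_ratio))
        self.gates = nn.ModuleList(                    # one selector per group
            nn.Linear(in_ch, in_ch) for _ in range(groups))
        self.convs = nn.ModuleList(
            nn.Conv2d(self.k, out_ch // groups, 3, padding=1)
            for _ in range(groups))

    def forward(self, x):                              # x: (B, C, H, W)
        ctx = x.mean(dim=(2, 3))                       # per-sample channel context
        outs = []
        for gate, conv in zip(self.gates, self.convs):
            idx = gate(ctx).topk(self.k, dim=1).indices       # dynamic channel pick
            picked = torch.gather(
                x, 1, idx[:, :, None, None].expand(-1, -1, x.size(2), x.size(3)))
            outs.append(conv(picked))
        return torch.cat(outs, dim=1)

layer = DynamicGroupConv(16, 32, groups=2)
y = layer(torch.randn(4, 16, 8, 8))                    # -> (4, 32, 8, 8)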
7. Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms [PDF] Back to contents
Xingyi Yang, Xuehai He, Yuxiao Liang, Yue Yang, Shanghang Zhang, Pengtao Xie
Abstract: Pretraining has become a standard technique in computer vision and natural language processing, which usually helps to improve performance substantially. Previously, the dominant pretraining method was transfer learning (TL), which uses labeled data to learn a good representation network. Recently, a new pretraining approach -- self-supervised learning (SSL) -- has demonstrated promising results on a wide range of applications. SSL does not require annotated labels. It is conducted purely on input data, by solving auxiliary tasks defined on the input data examples. The currently reported results show that in certain applications SSL outperforms TL, while in other applications the reverse holds. There has not been a clear understanding of which properties of data and tasks make one approach outperform the other. Without an informed guideline, ML researchers have to try both methods to find out which one is better empirically, which is usually time-consuming. In this work, we aim to address this problem. We perform a comprehensive comparative study between SSL and TL regarding which one works better under different properties of data and tasks, including the domain difference between source and target tasks, the amount of pretraining data, class imbalance in the source data, and usage of target data for additional pretraining, etc. The insights distilled from our comparative studies can help ML researchers decide which method to use based on the properties of their applications.
8. KIT MOMA: A Mobile Machines Dataset [PDF] Back to contents
Yusheng Xiang, Hongzhe Wang, Tianqing Su, Ruoyu Li, Christine Brach, Samuel S. Mao, Marcus Geimer
Abstract: Mobile machines, typically working in a closed site, have a high potential to utilize autonomous driving technology. However, vigorously thriving development and innovation are happening mostly in the area of passenger cars. In contrast, although there is also much research on autonomous driving or working mobile machines, a consensus about the SOTA solution has still not been reached. We believe that the most urgent problem to be solved is the absence of a public and challenging visual dataset, which would make the results from different studies comparable. To address the problem, we publish the KIT MOMA dataset, including eight classes of commonly used mobile machines, which can be used as a benchmark to evaluate SOTA algorithms for detecting mobile construction machines. The gathered images are viewed from outside the mobile machines, since we believe fixed cameras on the ground are more suitable if all the machines of interest are working in a closed site. Most of the images in KIT MOMA are from real scenes, whereas some of the images are from the official websites of top construction machine companies. Also, we have evaluated the performance of YOLO v3 on our dataset, indicating that SOTA computer vision algorithms already show excellent performance for detecting mobile machines in a specific working site. Together with the dataset, we also upload the trained weights, which can be directly used by engineers from the construction machine industry. The dataset, trained weights, and updates can be found on our Github. Moreover, the demo can be found on our Youtube.
9. Evaluation for Weakly Supervised Object Localization: Protocol, Metrics, and Datasets [PDF] Back to contents
Junsuk Choe, Seong Joon Oh, Sanghyuk Chun, Zeynep Akata, Hyunjung Shim
Abstract: Weakly-supervised object localization (WSOL) has gained popularity over the last years for its promise to train localization models with only image-level labels. Since the seminal WSOL work of class activation mapping (CAM), the field has focused on how to expand the attention regions to cover objects more broadly and localize them better. However, these strategies rely on full localization supervision for validating hyperparameters and model selection, which is in principle prohibited under the WSOL setup. In this paper, we argue that WSOL task is ill-posed with only image-level labels, and propose a new evaluation protocol where full supervision is limited to only a small held-out set not overlapping with the test set. We observe that, under our protocol, the five most recent WSOL methods have not made a major improvement over the CAM baseline. Moreover, we report that existing WSOL methods have not reached the few-shot learning baseline, where the full-supervision at validation time is used for model training instead. Based on our findings, we discuss some future directions for WSOL.
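The data discipline at the heart of the protocol is simple to state in code; the sketch below only illustrates the split (full localization labels exist in a small held-out set used for model selection, disjoint from the test set), with a made-up held-out fraction.

import random

def wsol_protocol_split(image_ids, heldout_frac=0.05, seed=0):
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * heldout_frac)
    heldout = ids[:cut]   # full boxes/masks: hyperparameter search only
    test = ids[cut:]      # evaluated once; never used for tuning
    return heldout, test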
10. Robust Re-Identification by Multiple Views Knowledge Distillation [PDF] Back to contents
Angelo Porrello, Luca Bergamini, Simone Calderara
Abstract: To achieve robustness in Re-Identification, standard methods leverage tracking information in a Video-To-Video fashion. However, these solutions face a large drop in performance for single image queries (e.g., the Image-To-Video setting). Recent works address this severe degradation by transferring temporal information from a Video-based network to an Image-based one. In this work, we devise a training strategy that allows the transfer of a superior knowledge, arising from a set of views depicting the target object. Our proposal - Views Knowledge Distillation (VKD) - pins this visual variety as a supervision signal within a teacher-student framework, where the teacher educates a student who observes fewer views. As a result, the student outperforms not only its teacher but also the current state-of-the-art in Image-To-Video by a wide margin (6.3% mAP on MARS, 8.6% on Duke-Video-ReId and 5% on VeRi-776). A thorough analysis on Person, Vehicle and Animal Re-ID investigates the properties of VKD from both a qualitative and a quantitative perspective. Code is available at this https URL.
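A minimal sketch of the distillation objective implied by this setup, assuming the teacher's embedding and logits come from many views while the student sees fewer; the paper's actual loss combines several terms, so the composition below is an assumption.

import torch
import torch.nn.functional as F

def vkd_loss(student_emb, teacher_emb, student_logits, teacher_logits, T=4.0):
    # Match softened class distributions (standard KD term).
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits / T, dim=1),
                       reduction="batchmean") * (T * T)
    # Pull the few-view student embedding toward the multi-view teacher's.
    align = F.mse_loss(student_emb, teacher_emb)
    return distill + align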
11. Combating Domain Shift with Self-Taught Labeling [PDF] Back to contents
Jian Liang, Dapeng Hu, Jiashi Feng
Abstract: We present a novel method to combat domain shift when adapting classification models trained on one domain to other new domains with few or no target labels. In the existing literature, a prevailing solution paradigm is to learn domain-invariant feature representations so that a classifier learned on the source features generalizes well to the target features. However, such a classifier is inevitably biased to the source domain by overlooking the structure of the target data. Instead, we propose Self-Taught Labeling (SeTL), a new regularization approach that finds an auxiliary target-specific classifier for unlabeled data. During adaptation, this classifier is able to teach the target domain itself by providing \emph{unbiased accurate} pseudo labels. In particular, for each target data, we employ the memory bank to store the feature along with its soft label from the domain-shared classifier. Then we develop a non-parametric neighborhood aggregation strategy to generate new pseudo labels as well as confidence weights for unlabeled data. Though simply using the standard classification objective, SeTL significantly outperforms existing domain alignment techniques on a large variety of domain adaptation benchmarks. We expect that SeTL can provide a new perspective of addressing domain shift and inspire future research of domain adaptation and transfer learning.
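The neighborhood-aggregation step can be sketched as below; the choice of k and the exponential similarity weighting are assumptions, and the memory bank is taken as given.

import numpy as np

def aggregate_pseudo_label(feat, bank_feats, bank_soft_labels, k=5):
    """feat: (D,) L2-normalized; bank_feats: (N, D) L2-normalized;
    bank_soft_labels: (N, C) soft labels from the domain-shared classifier."""
    sims = bank_feats @ feat                   # cosine similarities
    nn_idx = np.argsort(-sims)[:k]             # k nearest neighbors in the bank
    w = np.exp(sims[nn_idx])
    w /= w.sum()
    soft = w @ bank_soft_labels[nn_idx]        # aggregated (C,) soft label
    return soft.argmax(), soft.max()           # pseudo label and its confidence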
12. SLAP: Improving Physical Adversarial Examples with Short-Lived Adversarial Perturbations [PDF] Back to contents
Giulio Lovisotto, Henry Turner, Ivo Sluganovic, Martin Strohmeier, Ivan Martinovic
Abstract: Whilst significant research effort into adversarial examples (AE) has emerged in recent years, the main vector to realize these attacks in the real-world currently relies on static adversarial patches, which are limited in their conspicuousness and cannot be modified once deployed. In this paper, we propose Short-Lived Adversarial Perturbations (SLAP), a novel technique that allows adversaries to realize robust, dynamic real-world AE from a distance. As we show in this paper, such attacks can be achieved using a light projector to shine a specifically crafted adversarial image in order to perturb real-world objects and transform them into AE. This allows the adversary greater control over the attack compared to adversarial patches: (i) projections can be dynamically turned on and off or modified at will, (ii) projections do not suffer from the locality constraint imposed by patches, making them harder to detect. We study the feasibility of SLAP in the self-driving scenario, targeting both object detector and traffic sign recognition tasks. We demonstrate that the proposed method generates AE that are robust to different environmental conditions for several networks and lighting conditions: we successfully cause misclassifications of state-of-the-art networks such as Yolov3 and Mask-RCNN with up to 98% success rate for a variety of angles and distances. Additionally, we demonstrate that AE generated with SLAP can bypass SentiNet, a recent AE detection method which relies on the fact that adversarial patches generate highly salient and localized areas in the input images.
13. Delving into the Adversarial Robustness on Face Recognition [PDF] Back to contents
Xiao Yang, Dingcheng Yang, Yinpeng Dong, Wenjian Yu, Hang Su, Jun Zhu
Abstract: Face recognition has recently made substantial progress and achieved high accuracy on standard benchmarks based on the development of deep convolutional neural networks (CNNs). However, the lack of robustness in deep CNNs to adversarial examples has raised security concerns to enormous face recognition applications. To facilitate a better understanding of the adversarial vulnerability of the existing face recognition models, in this paper we perform comprehensive robustness evaluations, which can be applied as reference for evaluating the robustness of subsequent works on face recognition. We investigate 15 popular face recognition models and evaluate their robustness by using various adversarial attacks as an important surrogate. These evaluations are conducted under diverse adversarial settings, including dodging and impersonation attacks, $\ell_2$ and $\ell_\infty$ attacks, white-box and black-box attacks. We further propose a landmark-guided cutout (LGC) attack method to improve the transferability of adversarial examples for black-box attacks, by considering the special characteristics of face recognition. Based on our evaluations, we draw several important findings, which are crucial for understanding the adversarial robustness and providing insights for future research on face recognition. Code is available at \url{this https URL}.
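For context, this is what a standard PGD l-infinity attack of the kind used in such evaluations looks like (a generic attack, not the paper's LGC method).

import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Untargeted PGD under an l_inf budget; x in [0, 1], y true labels."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()           # ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to ball
        x_adv = x_adv.clamp(0, 1)                              # stay a valid image
    return x_adv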
14. A Distilled Model for Tracking and Tracker Fusion [PDF] Back to contents
Matteo Dunnhofer, Niki Martinel, Christian Micheloni
Abstract: Visual object tracking has generally been tackled by reasoning independently about fast processing algorithms, accurate online adaptation methods, and fusion of trackers. In this paper, we unify these goals by proposing a novel tracking methodology that takes advantage of other visual trackers, offline and online. A compact student model is trained via the marriage of knowledge distillation and reinforcement learning. The first allows the tracking knowledge of other trackers to be transferred and compressed. The second enables the learning of evaluation measures which are then exploited online. After learning, the student can be ultimately used to build (i) a very fast single-shot tracker, (ii) a tracker with a simple and effective online adaptation mechanism, (iii) a tracker that performs fusion of other trackers. Extensive validation shows that the proposed algorithms compete with state-of-the-art trackers while running in real-time.
15. On Learning Semantic Representations for Million-Scale Free-Hand Sketches [PDF] Back to contents
Peng Xu, Yongye Huang, Tongtong Yuan, Tao Xiang, Timothy M. Hospedales, Yi-Zhe Song, Liang Wang
Abstract: In this paper, we study learning semantic representations for million-scale free-hand sketches. This is highly challenging due to the domain-unique traits of sketches, e.g., diverse, sparse, abstract, noisy. We propose a dual-branch CNNRNN network architecture to represent sketches, which simultaneously encodes both the static and temporal patterns of sketch strokes. Based on this architecture, we further explore learning the sketch-oriented semantic representations in two challenging yet practical settings, i.e., hashing retrieval and zero-shot recognition on million-scale sketches. Specifically, we use our dual-branch architecture as a universal representation framework to design two sketch-specific deep models: (i) We propose a deep hashing model for sketch retrieval, where a novel hashing loss is specifically designed to accommodate both the abstract and messy traits of sketches. (ii) We propose a deep embedding model for sketch zero-shot recognition, via collecting a large-scale edge-map dataset and proposing to extract a set of semantic vectors from edge-maps as the semantic knowledge for sketch zero-shot domain alignment. Both deep models are evaluated by comprehensive experiments on million-scale sketches and outperform the state-of-the-art competitors.
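A hedged PyTorch sketch of a dual-branch encoder of the kind described: a CNN branch over the rasterized sketch (static pattern) and a GRU branch over the stroke sequence (temporal pattern), fused into one embedding. All sizes, the stroke format, and the fusion layer are assumptions.

import torch
import torch.nn as nn

class DualBranchSketchNet(nn.Module):
    def __init__(self, emb=256):
        super().__init__()
        self.cnn = nn.Sequential(                       # static branch
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb))
        self.rnn = nn.GRU(input_size=3, hidden_size=emb, batch_first=True)
        self.fuse = nn.Linear(2 * emb, emb)

    def forward(self, raster, strokes):
        # raster: (B, 1, H, W); strokes: (B, T, 3) as (dx, dy, pen_state)
        _, h = self.rnn(strokes)                        # temporal branch
        return self.fuse(torch.cat([self.cnn(raster), h[-1]], dim=1))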
16. Synthetic-to-Real Domain Adaptation for Lane Detection [PDF] Back to contents
Noa Garnett, Roy Uziel, Netalee Efrat, Dan Levi
Abstract: Accurate lane detection, a crucial enabler for autonomous driving, currently relies on obtaining a large and diverse labeled training dataset. In this work, we explore learning from abundant, randomly generated synthetic data, together with unlabeled or partially labeled target domain data, instead. Randomly generated synthetic data has the advantage of controlled variability in the lane geometry and lighting, but it is limited in terms of photo-realism. This poses the challenge of adapting models learned on the unrealistic synthetic domain to real images. To this end we develop a novel autoencoder-based approach that uses synthetic labels unaligned with particular images for adapting to target domain data. In addition, we explore existing domain adaptation approaches, such as image translation and self-supervision, and adjust them to the lane detection task. We test all approaches in the unsupervised domain adaptation setting in which no target domain labels are available and in the semi-supervised setting in which a small portion of the target images are labeled. In extensive experiments using three different datasets, we demonstrate the possibility of saving costly target-domain labeling efforts. For example, using our proposed autoencoder approach on the llamas and tuSimple lane datasets, we can almost recover the fully supervised accuracy with only 10% of the labeled data. In addition, our autoencoder approach outperforms all other methods in the semi-supervised domain adaptation scenario.
17. Adaptive 3D Face Reconstruction from a Single Image [PDF] Back to contents
Kun Li, Jing Yang, Nianhong Jiao, Jinsong Zhang, Yu-Kun Lai
Abstract: 3D face reconstruction from a single image is a challenging problem, especially under partial occlusions and extreme poses. This is because the uncertainty of the estimated 2D landmarks will affect the quality of face reconstruction. In this paper, we propose a novel joint 2D and 3D optimization method to adaptively reconstruct 3D face shapes from a single image, which combines the depths of 3D landmarks to solve the uncertain detections of invisible landmarks. The strategy of our method involves two aspects: a coarse-to-fine pose estimation using both 2D and 3D landmarks, and an adaptive 2D and 3D re-weighting based on the refined pose parameter to recover accurate 3D faces. Experimental results on multiple datasets demonstrate that our method can generate high-quality reconstruction from a single color image and is robust for self-occlusion and large poses.
18. Remix: Rebalanced Mixup [PDF] 返回目录
Hsin-Ping Chou, Shih-Chieh Chang, Jia-Yu Pan, Wei Wei, Da-Cheng Juan
Abstract: Deep image classifiers often perform poorly when training data are heavily class-imbalanced. In this work, we propose a new regularization technique, Remix, that relaxes Mixup's formulation and enables the mixing factors of features and labels to be disentangled. Specifically, when mixing two samples, while features are mixed up proportionally in the same fashion as Mixup methods, Remix assigns the label in favor of the minority class by providing a disproportionately higher weight to the minority class. By doing so, the classifier learns to push the decision boundaries towards the majority classes and balances the generalization error between majority and minority classes. We have studied the state of the art regularization techniques such as Mixup, Manifold Mixup and CutMix under class-imbalanced regime, and shown that the proposed Remix significantly outperforms these state-of-the-arts and several re-weighting and re-sampling techniques, on the imbalanced datasets constructed by CIFAR-10, CIFAR-100, and CINIC-10. We have also evaluated Remix on a real-world large-scale imbalanced dataset, iNaturalist 2018. The experimental results confirmed that Remix provides consistent and significant improvements over the previous state-of-the-arts.
摘要:当训练数据存在严重的类别不平衡时,深度图像分类器往往表现不佳。在这项工作中,我们提出一种新的正则化技术 Remix,它放宽了 Mixup 的公式,使特征和标签的混合系数可以解耦。具体而言,在混合两个样本时,特征仍按 Mixup 的方式按比例混合,而 Remix 在分配标签时偏向少数类,为少数类赋予不成比例的更高权重。这样,分类器会学习将决策边界推向多数类,从而平衡多数类与少数类之间的泛化误差。我们在类别不平衡设定下研究了 Mixup、Manifold Mixup 和 CutMix 等最先进的正则化技术,并表明在由 CIFAR-10、CIFAR-100 和 CINIC-10 构造的不平衡数据集上,Remix 显著优于这些方法以及多种重加权和重采样技术。我们还在真实世界的大规模不平衡数据集 iNaturalist 2018 上评估了 Remix,实验结果证实 Remix 相比之前的最先进方法带来了一致且显著的改进。
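The rebalanced label assignment described above is concrete enough to sketch. Below is a minimal Python reconstruction of the mixing step; `n_per_class` (per-class training counts) and the thresholds `kappa`/`tau` follow the spirit of the paper's rule, but the specific values are illustrative assumptions.

```python
def remix_mix(x1, y1, x2, y2, n_per_class, lam, kappa=3.0, tau=0.5):
    """Mix features exactly as in Mixup, but push the label mixing
    factor toward the minority class when class sizes differ strongly
    (illustrative sketch; kappa and tau are assumed thresholds)."""
    x = lam * x1 + (1.0 - lam) * x2                     # proportional feature mixing
    n1, n2 = n_per_class[y1], n_per_class[y2]
    if n1 / n2 >= kappa and lam < tau:                  # y1 is the majority class
        lam_y = 0.0                                     # label goes entirely to y2
    elif n1 / n2 <= 1.0 / kappa and (1.0 - lam) < tau:  # y2 is the majority class
        lam_y = 1.0                                     # label goes entirely to y1
    else:
        lam_y = lam                                     # fall back to plain Mixup
    # train with lam_y * CE(f(x), y1) + (1 - lam_y) * CE(f(x), y2)
    return x, lam_y
```

Decoupling `lam_y` from `lam` is the whole trick: features stay smoothly interpolated while the supervision signal over-represents the minority class.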
19. Winning with Simple Learning Models: Detecting Earthquakes in Groningen, the Netherlands [PDF] 返回目录
Umair bin Waheed, Ahmed Shaheen, Mike Fehler, Ben Fulcher
Abstract: Deep learning is fast emerging as a potential disruptive tool to tackle longstanding research problems across the sciences. Notwithstanding its success across disciplines, the recent trend of the overuse of deep learning is concerning to many machine learning practitioners. Recently, seismologists have also demonstrated the efficacy of deep learning algorithms in detecting low magnitude earthquakes. Here, we revisit the problem of seismic event detection but using a logistic regression model with feature extraction. We select well-discriminating features from a huge database of time-series operations collected from interdisciplinary time-series analysis methods. Using a simple learning model with only five trainable parameters, we detect several low-magnitude induced earthquakes from the Groningen gas field that are not present in the catalog. We note that the added advantage of simpler models is that the selected features add to our understanding of the noise and event classes present in the dataset. Since simpler models are easy to maintain, debug, understand, and train, through this study we underscore that it might be a dangerous pursuit to use deep learning without carefully weighing simpler alternatives.
摘要:深度学习正迅速成为解决各学科长期研究问题的潜在颠覆性工具。尽管它在各学科都取得了成功,但近来过度使用深度学习的趋势令许多机器学习从业者担忧。最近,地震学家也展示了深度学习算法在检测低震级地震方面的有效性。在这里,我们重新审视地震事件检测问题,但采用带特征提取的逻辑回归模型。我们从跨学科时间序列分析方法汇集的庞大时间序列特征库中选取区分度高的特征。我们使用仅有五个可训练参数的简单学习模型,从 Groningen 气田检测出多个未收录在目录中的低震级诱发地震。我们指出,简单模型的额外优势在于所选特征有助于我们理解数据集中存在的噪声类别和事件类别。由于简单模型易于维护、调试、理解和训练,通过这项研究我们强调:在没有仔细权衡更简单替代方案的情况下使用深度学习,可能是一种危险的做法。
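Since the classifier is a logistic regression over five selected time-series features, the learning step is short to sketch; the feature extraction below is a placeholder for the paper's selection from a large interdisciplinary feature library, so the statistics shown are assumptions, not the actual chosen features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(window):
    """Stand-in for five well-discriminating time-series features
    (placeholder statistics; the paper selects its own from a large
    library of time-series operations)."""
    return np.array([window.mean(), window.std(), np.abs(window).max(),
                     np.percentile(window, 95), (np.diff(window) ** 2).mean()])

# synthetic stand-in data: 200 waveform windows of 1000 samples each
rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 1000))
y = rng.integers(0, 2, size=200)            # 1 = event, 0 = noise

X = np.stack([extract_features(w) for w in windows])
clf = LogisticRegression().fit(X, y)        # a tiny, fully inspectable model
event_prob = clf.predict_proba(X)[:, 1]     # P(event) per window
```

With so few weights, the fitted coefficients can be read directly to see which features separate events from noise, which is exactly the interpretability argument the abstract makes.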
20. Single-Frame based Deep View Synchronization for Unsynchronized Multi-Camera Surveillance [PDF] 返回目录
Qi Zhang, Antoni B. Chan
Abstract: Multi-camera surveillance has been an active research topic for understanding and modeling scenes. Compared to a single camera, multiple cameras provide a larger field-of-view and more object cues, and the related applications are multi-view counting, multi-view tracking, 3D pose estimation or 3D reconstruction, etc. It is usually assumed that the cameras are all temporally synchronized when designing models for these multi-camera based tasks. However, this assumption is not always valid, especially for multi-camera systems with network transmission delay and low frame rates due to limited network bandwidth, resulting in desynchronization of the captured frames across cameras. To handle the issue of unsynchronized multi-cameras, in this paper, we propose a synchronization model that works in conjunction with existing DNN-based multi-view models, thus avoiding the redesign of the whole model. Under the low-fps regime, we assume that only a single relevant frame is available from each view, and synchronization is achieved by matching together image contents guided by epipolar geometry. We consider two variants of the model, based on where in the pipeline the synchronization occurs: scene-level synchronization and camera-level synchronization. The view synchronization step and the task-specific view fusion and prediction step are unified in the same framework and trained in an end-to-end fashion. Our view synchronization models are applied to different DNN-based multi-camera vision tasks under the unsynchronized setting, including multi-view counting and 3D pose estimation, and achieve good performance compared to baselines.
摘要:多摄像头监控一直是场景理解与建模领域的一个活跃研究课题。与单个摄像头相比,多摄像头提供更大的视场和更多的目标线索,相关应用包括多视角计数、多视角跟踪、三维姿态估计和三维重建等。在为这些基于多摄像头的任务设计模型时,通常假设所有摄像头在时间上是同步的。然而这一假设并不总是成立,尤其是对于存在网络传输延迟、且因带宽受限而帧率较低的多摄像头系统,这会导致各摄像头捕获的帧之间失去同步。为了处理多摄像头不同步的问题,本文提出一种可与现有基于 DNN 的多视角模型配合使用的同步模型,从而避免重新设计整个模型。在低帧率条件下,我们假设每个视角只有一帧相关图像可用,并通过在对极几何引导下匹配图像内容来实现同步。根据同步发生在流程中的位置,我们考虑该模型的两个变体:场景级同步和摄像头级同步。视角同步步骤与特定任务的视角融合及预测步骤统一在同一框架中,并以端到端方式训练。我们将视角同步模型应用于不同步设定下的多种基于 DNN 的多摄像头视觉任务,包括多视角计数和三维姿态估计,相比基线取得了良好的性能。
21. When Perspective Comes for Free: Improving Depth Prediction with Camera Pose Encoding [PDF] 返回目录
Yunhan Zhao, Shu Kong, Charless Fowlkes
Abstract: Monocular depth prediction is a highly underdetermined problem and recent progress has relied on high-capacity CNNs to effectively learn scene statistics that disambiguate estimation. However, we observe that such models are strongly biased by the distribution of camera poses seen during training and fail to generalize to novel viewpoints, even when the scene geometry distribution remains fixed. To address this challenge, we propose a factored approach that estimates pose first, followed by a conditional depth estimation model that takes an encoding of the camera pose prior (CPP) as input. In many applications, a strong test-time pose prior comes for free, e.g., from inertial sensors or static camera deployment. A factored approach also allows for adapting pose prior estimation to new test domains using only pose supervision, without the need for collecting expensive ground-truth depth required for end-to-end training. We evaluate our pose-conditional depth predictor (trained on synthetic indoor scenes) on a real-world test set. Our factored approach, which only requires camera pose supervision for training, outperforms recent state-of-the-art methods trained with full scene depth supervision on 10x more data.
摘要:单目深度预测是一个高度欠定的问题,近期的进展依赖于大容量 CNN 来有效学习消除估计歧义的场景统计特性。然而,我们观察到这类模型会强烈偏向训练期间见过的相机姿态分布,即使场景几何分布保持不变,也无法泛化到新的视角。为应对这一挑战,我们提出一种分解式方法:先估计姿态,再由一个以相机姿态先验(CPP)编码为输入的条件深度估计模型进行深度估计。在许多应用中,强姿态先验在测试时可以免费获得,例如来自惯性传感器或固定摄像头部署。分解式方法还允许仅使用姿态监督就将姿态先验估计适配到新的测试域,而无需收集端到端训练所需的昂贵深度真值。我们在真实世界测试集上评估了(在合成室内场景上训练的)姿态条件深度预测器。我们的分解式方法只需要相机姿态监督即可训练,其性能超过了使用 10 倍数据、在完整场景深度监督下训练的近期最先进方法。
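A minimal sketch of the factored idea: pose is obtained first (from an estimator or a sensor), encoded, and fused into the depth network's features. The layer sizes and the additive fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PoseConditionedDepth(nn.Module):
    """Depth head conditioned on a camera-pose-prior (CPP) encoding."""
    def __init__(self, feat_ch=64, pose_dim=6):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, 3, padding=1)   # stand-in encoder
        self.pose_enc = nn.Sequential(nn.Linear(pose_dim, feat_ch), nn.ReLU())
        self.head = nn.Conv2d(feat_ch, 1, 3, padding=1)       # per-pixel depth

    def forward(self, img, pose):             # pose: (B, 6), e.g. from an IMU
        f = self.backbone(img)                                # (B, C, H, W)
        p = self.pose_enc(pose)[:, :, None, None]             # broadcast spatially
        return self.head(torch.relu(f + p))   # fuse the pose prior into features

depth = PoseConditionedDepth()(torch.randn(2, 3, 64, 64), torch.randn(2, 6))
```

The factoring also explains the cheap adaptation story: only the pose branch needs retargeting to a new domain, using pose supervision alone.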
22. PathGAN: Local Path Planning with Generative Adversarial Networks [PDF] 返回目录
Dooseop Choi, Seung-jun Han, Kyoungwook Min, Jeongdan Choi
Abstract: Targeting autonomous driving without High-Definition maps, we present a model capable of generating multiple plausible paths from sensory inputs for autonomous vehicles. Our generative model comprises two neural networks, Feature Extraction Network (FEN) and Path Generation Network (PGN). FEN extracts meaningful features from input scene images while PGN generates multiple paths from the features given a driving intention and speed. To make paths generated by PGN both plausible and matched to the intention, we introduce a discrimination network and train it with PGN under the generative adversarial networks (GANs) framework. Besides, to further increase the accuracy and diversity of the generated paths, we encourage PGN to capture intentions hidden in the positions in the paths and let the discriminator evaluate how realistic the sequential intentions are. Finally, we introduce ETRIDriving, a dataset for autonomous driving where the recorded sensory data is labeled with discrete high-level driving actions, and demonstrate the state-of-the-art performance of the proposed model on ETRIDriving in terms of accuracy and diversity.
摘要:面向无高精地图的自动驾驶,我们提出一个能够根据自动驾驶车辆的传感输入生成多条合理路径的模型。我们的生成模型由两个神经网络组成:特征提取网络(FEN)和路径生成网络(PGN)。FEN 从输入场景图像中提取有意义的特征,PGN 则在给定驾驶意图和速度的条件下,基于这些特征生成多条路径。为了使 PGN 生成的路径既合理又符合意图,我们引入一个判别网络,并在生成对抗网络(GAN)框架下与 PGN 一起训练。此外,为了进一步提高生成路径的准确性和多样性,我们鼓励 PGN 捕捉隐藏在路径各位置中的意图,并让判别器评估这些按序排列的意图的真实程度。最后,我们提出了用于自动驾驶的数据集 ETRIDriving,其中记录的传感数据标注有离散的高层驾驶动作,并在准确性和多样性方面展示了所提模型在 ETRIDriving 上的最先进性能。
23. Marginal loss and exclusion loss for partially supervised multi-organ segmentation [PDF] 返回目录
Gonglei Shi, Li Xiao, Yang Chen, S. Kevin Zhou
Abstract: Annotating multiple organs in medical images is both costly and time-consuming; therefore, existing multi-organ datasets with labels are often low in sample size and mostly partially labeled, that is, a dataset has a few organs labeled but not all organs. In this paper, we investigate how to learn a single multi-organ segmentation network from a union of such datasets. To this end, we propose two types of novel loss function, particularly designed for this scenario: (i) marginal loss and (ii) exclusion loss. Because the background label for a partially labeled image is, in fact, a `merged' label of all unlabelled organs and `true' background (in the sense of full labels), the probability of this `merged' background label is a marginal probability, summing the relevant probabilities before merging. This marginal probability can be plugged into any existing loss function (such as cross entropy loss, Dice loss, etc.) to form a marginal loss. Leveraging the fact that the organs are non-overlapping, we propose the exclusion loss to gauge the dissimilarity between labeled organs and the estimated segmentation of unlabelled organs. Experiments on a union of five benchmark datasets in multi-organ segmentation of liver, spleen, left and right kidneys, and pancreas demonstrate that using our newly proposed loss functions brings a conspicuous performance improvement for state-of-the-art methods without introducing any extra computation.
摘要:在医学图像中标注多个器官既昂贵又耗时;因此,现有带标注的多器官数据集通常样本量较小,且大多只有部分标注,即一个数据集只标注了少数器官而非全部器官。本文研究如何从这类数据集的并集中学习单个多器官分割网络。为此,我们提出两种专为此场景设计的新型损失函数:(i)边缘损失和(ii)互斥损失。由于部分标注图像中的背景标签实际上是所有未标注器官与(完整标注意义下的)"真实"背景合并后的标签,这个"合并"背景标签的概率是一个边缘概率,即在合并之前将相关概率求和。这个边缘概率可以代入任何现有的损失函数(如交叉熵损失、Dice 损失等)以构成边缘损失。利用器官之间互不重叠这一事实,我们提出互斥损失来度量已标注器官与未标注器官的预测分割之间的不相似性。在肝、脾、左右肾和胰腺多器官分割的五个基准数据集并集上的实验表明,使用我们新提出的损失函数可为最先进方法带来明显的性能提升,且不引入任何额外计算。
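Both losses are defined precisely enough to sketch. In the marginal loss, the probabilities of all organs unlabeled in a given dataset are summed into the background before taking the log; the exclusion loss penalizes overlap between organs directly. The remapping convention for `target` is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def marginal_ce(logits, target, labeled):
    """Marginal cross-entropy for a partially labeled dataset.

    logits: (B, C, H, W) over background (index 0) and all C-1 organs.
    labeled: organ indices annotated in this dataset, e.g. [2, 5].
    target: (B, H, W) remapped so 0 = merged background and
            i + 1 = labeled[i] (assumed convention).
    """
    p = F.softmax(logits, dim=1)
    unlabeled = [c for c in range(1, logits.shape[1]) if c not in labeled]
    merged_bg = p[:, [0] + unlabeled].sum(dim=1, keepdim=True)  # marginal prob.
    merged = torch.cat([merged_bg, p[:, labeled]], dim=1)
    return F.nll_loss(torch.log(merged.clamp_min(1e-8)), target)

def exclusion_loss(p_a, p_b):
    """Organs are disjoint, so joint predicted mass on a pixel is penalized."""
    return (p_a * p_b).mean()
```

The same merged probability can be dropped into a Dice loss instead of cross-entropy, which is what makes the construction loss-agnostic.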
24. PaMIR: Parametric Model-Conditioned Implicit Representation for Image-based Human Reconstruction [PDF] 返回目录
Zerong Zheng, Tao Yu, Yebin Liu, Qionghai Dai
Abstract: Modeling 3D humans accurately and robustly from a single image is very challenging, and the key for such an ill-posed problem is the 3D representation of the human models. To overcome the limitations of regular 3D representations, we propose Parametric Model-Conditioned Implicit Representation (PaMIR), which combines the parametric body model with the free-form deep implicit function. In our PaMIR-based reconstruction framework, a novel deep neural network is proposed to regularize the free-form deep implicit function using the semantic features of the parametric model, which improves the generalization ability under the scenarios of challenging poses and various clothing topologies. Moreover, a novel depth-ambiguity-aware training loss is further integrated to resolve depth ambiguities and enable successful surface detail reconstruction with imperfect body reference. Finally, we propose a body reference optimization method to improve the parametric model estimation accuracy and to enhance the consistency between the parametric model and the implicit function. With the PaMIR representation, our framework can be easily extended to multi-image input scenarios without the need of multi-camera calibration and pose synchronization. Experimental results demonstrate that our method achieves state-of-the-art performance for image-based 3D human reconstruction in the cases of challenging poses and clothing types.
摘要:从单张图像准确而鲁棒地建模三维人体非常具有挑战性,而解决这一病态问题的关键在于人体模型的三维表示。为克服常规三维表示的局限,我们提出参数化模型条件下的隐式表示(PaMIR),将参数化人体模型与自由形式的深度隐函数相结合。在我们基于 PaMIR 的重建框架中,提出了一种新颖的深度神经网络,利用参数化模型的语义特征对自由形式深度隐函数进行正则化,从而提高在高难度姿态和各种服装拓扑场景下的泛化能力。此外,我们进一步引入一种新颖的深度歧义感知训练损失,以消除深度歧义,并在人体参考不完善的情况下仍能成功重建表面细节。最后,我们提出一种人体参考优化方法,以提高参数化模型的估计精度,并增强参数化模型与隐函数之间的一致性。借助 PaMIR 表示,我们的框架可以方便地扩展到多图像输入场景,而无需多摄像头标定和姿态同步。实验结果表明,在高难度姿态和服装类型的情况下,我们的方法在基于图像的三维人体重建上达到了最先进的性能。
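A minimal sketch of the core representation: an implicit occupancy function queried at a 3D point, conditioned on a pixel-aligned image feature together with a feature derived from the fitted parametric body model at that point. The plain MLP and the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParametricConditionedOccupancy(nn.Module):
    """Occupancy MLP conditioned on image + parametric-model features."""
    def __init__(self, img_ch=256, body_ch=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_ch + body_ch + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())   # inside/outside probability

    def forward(self, img_feat, body_feat, xyz):
        # img_feat: (N, 256) pixel-aligned features; body_feat: (N, 128)
        # sampled from a feature volume built around the fitted body model.
        return self.mlp(torch.cat([img_feat, body_feat, xyz], dim=-1))

occ = ParametricConditionedOccupancy()(torch.randn(1024, 256),
                                        torch.randn(1024, 128),
                                        torch.randn(1024, 3))
```

The body-model branch is what regularizes the free-form implicit function: even where the image is ambiguous, the queried point carries a semantically meaningful prior.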
25. SiENet: Siamese Expansion Network for Image Extrapolation [PDF] 返回目录
Xiaofeng Zhang, Feng Chen, Cailing Wang, Songsong Wu, Ming Tao, Guoping Jiang
Abstract: Different from image inpainting, image outpainting has relatively little context to capture in the image center and more content to predict at the image border. Therefore, the classical encoder-decoder pipeline of existing methods may not predict the outstretched unknown content perfectly. In this paper, a novel two-stage siamese adversarial model for image extrapolation, named Siamese Expansion Network (SiENet), is proposed. In the two stages, a novel border-sensitive convolution, named adaptive filling convolution, is designed to allow the encoder to predict the unknown content, alleviating the burden of the decoder. Besides, to introduce prior knowledge into the network and reinforce the inferring ability of the encoder, a siamese adversarial mechanism is designed to enable our network to model the distribution of covered long-range features toward that of uncovered image features. The results on four datasets have demonstrated that our method outperforms existing state-of-the-art methods and produces realistic results.
摘要:与图像修复(inpainting)不同,图像外推(outpainting)在图像中心可供捕捉的上下文相对较少,而需要在图像边界预测的内容更多。因此,现有方法中经典的编码器-解码器流程可能无法完美预测向外延伸的未知内容。本文提出一种用于图像外推的新型两阶段孪生对抗模型,称为孪生扩展网络(SiENet)。在两个阶段中,我们设计了一种新颖的边界敏感卷积,称为自适应填充卷积,使编码器能够预测未知内容,从而减轻解码器的负担。此外,为了向网络引入先验知识并增强编码器的推断能力,我们设计了孪生对抗机制,使网络能够以未被遮盖图像特征的分布为目标,对被遮盖的长程特征的分布进行建模。在四个数据集上的结果表明,我们的方法优于现有的最先进方法,并能产生逼真的结果。
26. Spatio-Temporal Scene Graphs for Video Dialog [PDF] 返回目录
Shijie Geng, Peng Gao, Chiori Hori, Jonathan Le Roux, Anoop Cherian
Abstract: The Audio-Visual Scene-aware Dialog (AVSD) task requires an agent to indulge in a natural conversation with a human about a given video. Specifically, apart from the video frames, the agent receives the audio, brief captions, and a dialog history, and the task is to produce the correct answer to a question about the video. Due to the diversity in the type of inputs, this task poses a very challenging multimodal reasoning problem. Current approaches to AVSD either use global video-level features or those from a few sampled frames, and thus lack the ability to explicitly capture relevant visual regions or their interactions for answer generation. To this end, we propose a novel spatio-temporal scene graph representation (STSGR) modeling fine-grained information flows within videos. Specifically, on an input video sequence, STSGR (i) creates a two-stream visual and semantic scene graph on every frame, (ii) conducts intra-graph reasoning using node and edge convolutions generating visual memories, and (iii) applies inter-graph aggregation to capture their temporal evolutions. These visual memories are then combined with other modalities and the question embeddings using a novel semantics-controlled multi-head shuffled transformer, which then produces the answer recursively. Our entire pipeline is trained end-to-end. We present experiments on the AVSD dataset and demonstrate state-of-the-art results. A human evaluation on the quality of our generated answers shows 12% relative improvement against prior methods.
摘要:视听场景感知对话(AVSD)任务要求智能体就给定视频与人进行自然对话。具体而言,除视频帧外,智能体还接收音频、简短字幕和对话历史,任务是针对关于视频的问题给出正确答案。由于输入类型的多样性,该任务构成了一个极具挑战性的多模态推理问题。目前针对 AVSD 的方法要么使用全局视频级特征,要么使用少量采样帧的特征,因而缺乏显式捕捉相关视觉区域及其交互以生成答案的能力。为此,我们提出一种新颖的时空场景图表示(STSGR),对视频内部细粒度的信息流进行建模。具体而言,对输入视频序列,STSGR:(i)在每一帧上构建视觉和语义双流场景图;(ii)利用节点和边卷积进行图内推理以生成视觉记忆;(iii)应用图间聚合来捕捉其时间演化。这些视觉记忆随后通过一种新颖的语义控制多头 shuffled Transformer 与其他模态及问题嵌入相结合,递归地生成答案。整个流程以端到端方式训练。我们在 AVSD 数据集上进行了实验,取得了最先进的结果。对生成答案质量的人工评估显示,相比之前的方法有 12% 的相对提升。
27. Making Adversarial Examples More Transferable and Indistinguishable [PDF] 返回目录
Junhua Zou, Zhisong Pan, Junyang Qiu, Yexin Duan, Xin Liu, Yu Pan
Abstract: Many previous methods generate adversarial examples based on the fast gradient sign attack series. However, these methods cannot balance the indistinguishability and transferability due to the limitations of the basic sign structure. To address this problem, we propose an ADAM iterative fast gradient tanh method (AI-FGTM) to generate indistinguishable adversarial examples with high transferability. Extensive experiments on the ImageNet dataset show that our method generates more indistinguishable adversarial examples and achieves higher black-box attack success rates without extra running time and resource. Our best attack, TI-DI-AITM, can fool six black-box defense models with an average success rate of 88.0\%. We expect that our method will serve as a new baseline for generating adversarial examples with more transferability and indistinguishability.
摘要:许多先前的方法基于快速梯度符号攻击系列来生成对抗样本。然而,由于基本符号结构的限制,这些方法无法在不可区分性和可迁移性之间取得平衡。为解决这一问题,我们提出一种 ADAM 迭代快速梯度 tanh 方法(AI-FGTM),用于生成具有高可迁移性且难以区分的对抗样本。在 ImageNet 数据集上的大量实验表明,我们的方法生成的对抗样本更难以区分,并在不增加运行时间和资源的情况下取得更高的黑盒攻击成功率。我们最强的攻击 TI-DI-AITM 能够以 88.0% 的平均成功率欺骗六个黑盒防御模型。我们期望该方法能成为生成更具可迁移性和不可区分性的对抗样本的新基线。
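The method's name pins down the mechanics: an iterative attack with Adam-style moment accumulation, where a smooth tanh replaces the hard sign. A minimal sketch under those assumptions; the step scaling and decay rates here are illustrative, not necessarily the paper's exact values.

```python
import torch

def ai_fgtm(model, loss_fn, x, y, eps=8 / 255, steps=10, b1=0.9, b2=0.999):
    """Adam-style iterative fast gradient tanh attack (sketch)."""
    x_adv = x.clone().detach()
    m = torch.zeros_like(x)                        # first-moment estimate
    v = torch.zeros_like(x)                        # second-moment estimate
    for t in range(1, steps + 1):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad * grad
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        step = (eps / steps) * torch.tanh(m_hat / (v_hat.sqrt() + 1e-8))
        x_adv = x_adv.detach() + step              # tanh: smooth, bounded update
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```

Replacing sign with tanh keeps perturbations bounded while avoiding the blocky, easily spotted artifacts the hard sign produces, which is the indistinguishability argument above.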
28. Real-time Semantic Segmentation with Fast Attention [PDF] 返回目录
Ping Hu, Federico Perazzi, Fabian Caba Heilbron, Oliver Wang, Zhe Lin, Kate Saenko, Stan Sclaroff
Abstract: In deep CNN based models for semantic segmentation, high accuracy relies on rich spatial context (large receptive fields) and fine spatial details (high resolution), both of which incur high computational costs. In this paper, we propose a novel architecture that addresses both challenges and achieves state-of-the-art performance for semantic segmentation of high-resolution images and videos in real-time. The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism and captures the same rich spatial context at a small fraction of the computational cost, by changing the order of operations. Moreover, to efficiently process high-resolution input, we apply an additional spatial reduction to intermediate feature stages of the network with minimal loss in accuracy thanks to the use of the fast attention module to fuse features. We validate our method with a series of experiments, and show that results on multiple datasets demonstrate superior performance with better accuracy and speed compared to existing approaches for real-time semantic segmentation. On Cityscapes, our network achieves 74.4$\%$ mIoU at 72 FPS and 75.5$\%$ mIoU at 58 FPS on a single Titan X GPU, which is $\sim$50$\%$ faster than the state-of-the-art while retaining the same accuracy.
摘要:在基于深度 CNN 的语义分割模型中,高精度依赖于丰富的空间上下文(大感受野)和精细的空间细节(高分辨率),二者都会带来高昂的计算开销。本文提出一种新颖的架构,同时应对这两个挑战,并在高分辨率图像和视频的实时语义分割上达到最先进的性能。所提架构依赖于我们的快速空间注意力:它是对流行的自注意力机制的一个简单而高效的修改,通过改变运算顺序,以极小的计算开销捕捉同样丰富的空间上下文。此外,为了高效处理高分辨率输入,我们对网络的中间特征阶段施加额外的空间降采样,并借助快速注意力模块融合特征,使精度损失降到最低。我们通过一系列实验验证了该方法,在多个数据集上的结果表明,与现有的实时语义分割方法相比,我们的方法在精度和速度上都更优。在 Cityscapes 上,我们的网络在单块 Titan X GPU 上以 72 FPS 达到 74.4% mIoU,以 58 FPS 达到 75.5% mIoU,在保持相同精度的情况下比最先进方法快约 50%。
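The "order of operations" trick can be sketched compactly. Assuming the softmax is replaced by L2 normalization of queries and keys (one simple modification consistent with the abstract; the 1/n scaling is also an assumption), matrix associativity lets the key-value aggregation happen before the query lookup, dropping the cost from O(n^2 d) to O(n d^2).

```python
import torch
import torch.nn.functional as F

def fast_attention(q, k, v):
    """Linear-complexity attention by reordering the matmuls.
    q, k, v: (B, n, d) with n tokens of dimension d."""
    q = F.normalize(q, dim=-1)              # cosine-similarity affinities
    k = F.normalize(k, dim=-1)              # in place of softmax normalization
    context = k.transpose(1, 2) @ v         # (B, d, d): aggregate first, O(n d^2)
    return (q @ context) / q.shape[1]       # (B, n, d), never forming the n x n map

out = fast_attention(torch.randn(2, 4096, 64),
                     torch.randn(2, 4096, 64),
                     torch.randn(2, 4096, 64))
```

For segmentation-sized inputs n is tens of thousands of pixels while d is small, so aggregating first is what makes per-frame attention affordable in real time.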
29. A Free Viewpoint Portrait Generator with Dynamic Styling [PDF] 返回目录
Anpei Chen, Ruiyang Liu, Ling Xie, Jingyi Yu
Abstract: Generating portrait images from a single latent space faces the problem of entangled attributes, making it difficult to explicitly adjust the generation for specific attributes, e.g., contour and viewpoint control or dynamic styling. Therefore, we propose to decompose the generation space into two subspaces: geometric and texture space. We first encode portrait scans with a semantic occupancy field (SOF), which represents semantic-embedded geometry structure and outputs free-viewpoint semantic segmentation maps. Then we design a semantic instance-wise (SIW) StyleGAN to regionally style the segmentation map. We capture 664 3D portrait scans for our SOF training and use real captured photos (FFHQ and CelebA-HQ) for SIW StyleGAN training. Extensive experiments show that our representations enable appearance-consistent control of shape, pose, and regional styles, achieve state-of-the-art results, and generalize well in various application scenarios.
摘要:从单一潜在空间生成人像图像面临属性纠缠的问题,难以对特定属性(例如轮廓与视角控制或动态风格化)进行显式调整。因此,我们提出将生成空间分解为两个子空间:几何空间和纹理空间。我们首先用语义占据场(SOF)对人像扫描进行编码,它表示嵌入语义的几何结构,并输出自由视角的语义分割图。然后,我们设计了一个语义实例级(SIW)StyleGAN,对分割图进行区域化风格化。我们采集了 664 个三维人像扫描用于 SOF 训练,并使用真实拍摄的照片(FFHQ 和 CelebA-HQ)进行 SIW StyleGAN 训练。充分的实验表明,我们的表示能够实现外观一致的形状、姿态和区域风格控制,取得最先进的结果,并在各种应用场景中具有良好的泛化能力。
30. 3D Shape Reconstruction from Vision and Touch [PDF] 返回目录
Edward J. Smith, Roberto Calandra, Adriana Romero, Georgia Gkioxari, David Meger, Jitendra Malik, Michal Drozdzal
Abstract: When a toddler is presented a new toy, their instinctual behaviour is to pick it up and inspect it with their hand and eyes in tandem, clearly searching over its surface to properly understand what they are playing with. Here, touch provides high fidelity localized information while vision provides complementary global context. However, in 3D shape reconstruction, the complementary fusion of visual and haptic modalities remains largely unexplored. In this paper, we study this problem and present an effective chart-based approach to fusing vision and touch, which leverages advances in graph convolutional networks. To do so, we introduce a dataset of simulated touch and vision signals from the interaction between a robotic hand and a large array of 3D objects. Our results show that (1) leveraging both vision and touch signals consistently improves single-modality baselines; (2) our approach outperforms alternative modality fusion methods and strongly benefits from the proposed chart-based structure; (3) the reconstruction quality increases with the number of grasps provided; and (4) the touch information not only enhances the reconstruction at the touch site but also extrapolates to its local neighborhood.
摘要:当幼儿拿到一个新玩具时,其本能行为是把它拿起来,用手和眼睛协同检查,仔细探索其表面,以充分了解自己在玩什么。在这里,触觉提供高保真的局部信息,而视觉提供互补的全局上下文。然而,在三维形状重建中,视觉与触觉模态的互补融合在很大程度上仍未被探索。本文研究这一问题,提出一种有效的基于图块(chart)的视觉与触觉融合方法,该方法利用了图卷积网络的最新进展。为此,我们引入了一个由机械手与大量三维物体交互产生的模拟触觉与视觉信号数据集。我们的结果表明:(1)同时利用视觉和触觉信号能够一致地超越单模态基线;(2)我们的方法优于其他模态融合方法,并显著受益于所提出的基于图块的结构;(3)重建质量随提供的抓取次数增加而提高;(4)触觉信息不仅增强了接触部位的重建,还能外推到其局部邻域。
31. Placepedia: Comprehensive Place Understanding with Multi-Faceted Annotations [PDF] 返回目录
Huaiyi Huang, Yuqi Zhang, Qingqiu Huang, Zhengkui Guo, Ziwei Liu, Dahua Lin
Abstract: Place is an important element in visual understanding. Given a photo of a building, people can often tell its functionality, e.g. a restaurant or a shop, its cultural style, e.g. Asian or European, as well as its economic type, e.g. industry oriented or tourism oriented. While place recognition has been widely studied in previous work, there remains a long way towards comprehensive place understanding, which is far beyond categorizing a place with an image and requires information of multiple aspects. In this work, we contribute Placepedia, a large-scale place dataset with more than 35M photos from 240K unique places. Besides the photos, each place also comes with massive multi-faceted information, e.g. GDP, population, etc., and labels at multiple levels, including function, city, country, etc.. This dataset, with its large amount of data and rich annotations, allows various studies to be conducted. Particularly, in our studies, we develop 1) PlaceNet, a unified framework for multi-level place recognition, and 2) a method for city embedding, which can produce a vector representation for a city that captures both visual and multi-faceted side information. Such studies not only reveal key challenges in place understanding, but also establish connections between visual observations and underlying socioeconomic/cultural implications.
摘要:地点是视觉理解中的重要元素。给定一张建筑物的照片,人们通常能够判断其功能(如餐馆或商店)、文化风格(如亚洲或欧洲)以及经济类型(如工业导向或旅游导向)。尽管地点识别在以往工作中已被广泛研究,但距离全面的地点理解仍有很长的路要走:后者远不止用一张图像对地点进行分类,而是需要多方面的信息。在这项工作中,我们贡献了 Placepedia,这是一个大规模地点数据集,包含来自 24 万个不同地点的超过 3500 万张照片。除照片外,每个地点还附有大量多方面的信息(如 GDP、人口等)以及多个层级的标签(包括功能、城市、国家等)。该数据集凭借其海量数据和丰富标注,可支持多种研究。特别地,在我们的研究中,我们开发了:1)PlaceNet,一个用于多层级地点识别的统一框架;2)一种城市嵌入方法,可以为城市生成同时捕捉视觉信息和多方面附加信息的向量表示。这些研究不仅揭示了地点理解中的关键挑战,还建立了视觉观察与其背后社会经济/文化含义之间的联系。
32. Resonator networks for factoring distributed representations of data structures [PDF] 返回目录
E. Paxon Frady, Spencer Kent, Bruno A. Olshausen, Friedrich T. Sommer
Abstract: The ability to encode and manipulate data structures with distributed neural representations could qualitatively enhance the capabilities of traditional neural networks by supporting rule-based symbolic reasoning, a central property of cognition. Here we show how this may be accomplished within the framework of Vector Symbolic Architectures (VSA) (Plate, 1991; Gayler, 1998; Kanerva, 1996), whereby data structures are encoded by combining high-dimensional vectors with operations that together form an algebra on the space of distributed representations. In particular, we propose an efficient solution to a hard combinatorial search problem that arises when decoding elements of a VSA data structure: the factorization of products of multiple code vectors. Our proposed algorithm, called a resonator network, is a new type of recurrent neural network that interleaves VSA multiplication operations and pattern completion. We show in two examples -- parsing of a tree-like data structure and parsing of a visual scene -- how the factorization problem arises and how the resonator network can solve it. More broadly, resonator networks open the possibility to apply VSAs to myriad artificial intelligence problems in real-world domains. A companion paper (Kent et al., 2020) presents a rigorous analysis and evaluation of the performance of resonator networks, showing it out-performs alternative approaches.
摘要:利用分布式神经表示对数据结构进行编码和操作的能力,可以通过支持基于规则的符号推理(认知的核心属性)从质上增强传统神经网络的能力。本文展示了如何在向量符号架构(VSA)(Plate, 1991; Gayler, 1998; Kanerva, 1996)的框架内实现这一点:数据结构通过将高维向量与一组运算相结合来编码,这些运算在分布式表示空间上共同构成一个代数。特别地,我们为解码 VSA 数据结构元素时出现的一个困难组合搜索问题提出了一种高效解法:对多个码向量之积进行因子分解。我们提出的算法称为谐振器网络(resonator network),是一种交替执行 VSA 乘法运算和模式补全的新型循环神经网络。我们通过两个例子(解析树状数据结构和解析视觉场景)展示了因子分解问题如何产生,以及谐振器网络如何求解它。更广泛地说,谐振器网络开启了将 VSA 应用于现实世界领域中各类人工智能问题的可能性。配套论文(Kent et al., 2020)对谐振器网络的性能进行了严格的分析与评估,表明其性能优于其他方法。
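For bipolar codevectors the factorization loop is compact: each factor estimate unbinds the other factors' current estimates from the composite vector and is cleaned up against its own codebook. A minimal sketch assuming +1/-1 vectors bound by elementwise multiplication (one standard VSA choice):

```python
import numpy as np

def resonator_factorize(c, codebooks, iters=100):
    """Factor c = x1 * x2 * ... (elementwise) given per-factor codebooks.
    codebooks: list of (M, D) arrays of +1/-1 codevectors."""
    est = [np.sign(X.sum(axis=0) + 0.5) for X in codebooks]   # superposition init
    for _ in range(iters):
        for f, X in enumerate(codebooks):
            inferred = c.astype(float).copy()
            for g, e in enumerate(est):
                if g != f:
                    inferred *= e           # unbind: bipolar vectors self-invert
            u = X.T @ (X @ inferred)        # clean up against the codebook
            est[f] = np.where(u >= 0, 1.0, -1.0)
    return est

# toy usage: D = 256, three factors with 8 candidates each
rng = np.random.default_rng(0)
books = [rng.choice([-1.0, 1.0], size=(8, 256)) for _ in range(3)]
c = books[0][3] * books[1][5] * books[2][1]
estimates = resonator_factorize(c, books)
```

The interleaving of unbinding (VSA multiplication) and codebook cleanup (pattern completion) is exactly the alternation the abstract describes.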
33. Detection as Regression: Certified Object Detection by Median Smoothing [PDF] 返回目录
Ping-yeh Chiang, Michael J. Curry, Ahmed Abdelkader, Aounon Kumar, John Dickerson, Tom Goldstein
Abstract: Despite the vulnerability of object detectors to adversarial attacks, very few defenses are known to date. While adversarial training can improve the empirical robustness of image classifiers, a direct extension to object detection is very expensive. This work is motivated by recent progress on certified classification by randomized smoothing. We start by presenting a reduction from object detection to a regression problem. Then, to enable certified regression, where standard mean smoothing fails, we propose median smoothing, which is of independent interest. We obtain the first model-agnostic, training-free, and certified defense for object detection against $\ell_2$-bounded attacks.
摘要:尽管目标检测器易受对抗攻击,但迄今已知的防御方法极少。虽然对抗训练可以提高图像分类器的经验鲁棒性,但将其直接扩展到目标检测的代价非常高。这项工作受到基于随机平滑的可认证分类最新进展的启发。我们首先给出从目标检测到回归问题的归约。然后,为了在标准均值平滑失效的情况下实现可认证的回归,我们提出中位数平滑,其本身也具有独立的研究价值。我们由此得到了第一个模型无关、无需训练、可认证的目标检测防御,能够抵御 $\ell_2$ 有界攻击。
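The smoothing step itself is short once detection is reframed as regression: evaluate the base model under Gaussian input noise and take the per-output median rather than the mean. A minimal sketch (the noise scale and sample count are illustrative):

```python
import torch

def median_smooth(f, x, sigma=0.25, n=100):
    """Median-smoothed prediction of a regressor f at input x.
    Unlike the mean, the median tolerates a bounded fraction of wildly
    wrong outputs under perturbation, which enables certification."""
    noise = sigma * torch.randn(n, *x.shape)
    outs = torch.stack([f(x + eps) for eps in noise])   # (n, ...) predictions
    return outs.median(dim=0).values
```

Applied to the regression outputs of a detector (box coordinates and scores), this yields the training-free, model-agnostic certified defense the abstract claims.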
34. Self-Supervised Policy Adaptation during Deployment [PDF] 返回目录
Nicklas Hansen, Yu Sun, Pieter Abbeel, Alexei A. Efros, Lerrel Pinto, Xiaolong Wang
Abstract: In most real world scenarios, a policy trained by reinforcement learning in one environment needs to be deployed in another, potentially quite different environment. However, generalization across different environments is known to be hard. A natural solution would be to keep training after deployment in the new environment, but this cannot be done if the new environment offers no reward signal. Our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards. While previous methods explicitly anticipate changes in the new environment, we assume no prior knowledge of those changes yet still obtain significant improvements. Empirical evaluations are performed on diverse environments from DeepMind Control suite and ViZDoom. Our method improves generalization in 25 out of 30 environments across various tasks, and outperforms domain randomization on a majority of environments.
摘要:在大多数真实世界场景中,通过强化学习在一个环境中训练的策略需要部署到另一个可能截然不同的环境中。然而,众所周知,跨环境泛化是困难的。一个自然的解决方案是在新环境中部署后继续训练,但如果新环境不提供奖励信号,这便无法实现。我们的工作探索利用自监督,使策略在部署后无需任何奖励即可继续训练。以往的方法需要显式预判新环境中的变化,而我们不假设对这些变化有任何先验知识,却仍能获得显著改进。我们在来自 DeepMind Control 套件和 ViZDoom 的多种环境上进行了实证评估。我们的方法在涵盖多种任务的 30 个环境中的 25 个上提升了泛化能力,并在大多数环境上优于域随机化。
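The deployment-time adaptation loop can be pictured as follows. The abstract does not specify the auxiliary objective, so rotation prediction is used here purely as an illustrative self-supervised task; all names are placeholders.

    import torch
    import torch.nn.functional as F

    def adapt_step(encoder, ssl_head, obs, optimizer):
        # One reward-free update at deployment time: train the shared
        # encoder on a self-supervised task computed from observations only.
        k = torch.randint(0, 4, (obs.size(0),), device=obs.device)
        rotated = torch.stack([torch.rot90(o, int(r), dims=(-2, -1))
                               for o, r in zip(obs, k)])
        loss = F.cross_entropy(ssl_head(encoder(rotated)), k)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # the policy head itself stays frozen
        return loss.item()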
35. Quantifying and Leveraging Predictive Uncertainty for Medical Image Assessment [PDF] 返回目录
Florin C. Ghesu, Bogdan Georgescu, Awais Mansoor, Youngjin Yoo, Eli Gibson, R.S. Vishwanath, Abishek Balachandran, James M. Balter, Yue Cao, Ramandeep Singh, Subba R. Digumarthy, Mannudeep K. Kalra, Sasa Grbic, Dorin Comaniciu
Abstract: The interpretation of medical images is a challenging task, often complicated by the presence of artifacts, occlusions, limited contrast and more. Most notable is the case of chest radiography, where there is a high inter-rater variability in the detection and classification of abnormalities. This is largely due to inconclusive evidence in the data or subjective definitions of disease appearance. An additional example is the classification of anatomical views based on 2D Ultrasound images. Often, the anatomical context captured in a frame is not sufficient to recognize the underlying anatomy. Current machine learning solutions for these problems are typically limited to providing probabilistic predictions, relying on the capacity of underlying models to adapt to limited information and the high degree of label noise. In practice, however, this leads to overconfident systems with poor generalization on unseen data. To account for this, we propose a system that learns not only the probabilistic estimate for classification, but also an explicit uncertainty measure which captures the confidence of the system in the predicted output. We argue that this approach is essential to account for the inherent ambiguity characteristic of medical images from different radiologic exams including computed radiography, ultrasonography and magnetic resonance imaging. In our experiments we demonstrate that sample rejection based on the predicted uncertainty can significantly improve the ROC-AUC for various tasks, e.g., by 8% to 0.91 with an expected rejection rate of under 25% for the classification of different abnormalities in chest radiographs. In addition, we show that using uncertainty-driven bootstrapping to filter the training data, one can achieve a significant increase in robustness and accuracy.
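The sample-rejection evaluation reduces to a few lines: refer the most uncertain fraction of cases to a human reader and score the remainder. A sketch (the 25% rate follows the abstract; the uncertainty estimate itself comes from the trained model):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_after_rejection(y_true, y_prob, uncertainty, reject_rate=0.25):
        # Keep the (1 - reject_rate) least uncertain samples and compute
        # ROC-AUC on them; the rejected cases go to a human reader.
        keep = uncertainty <= np.quantile(uncertainty, 1.0 - reject_rate)
        return roc_auc_score(y_true[keep], y_prob[keep])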
36. A Benchmark of Medical Out of Distribution Detection [PDF] 返回目录
Tianshi Cao, Chinwei Huang, David Yu-Tung Hui, Joseph Paul Cohen
Abstract: There is a rise in the use of deep learning for automated medical diagnosis, most notably in medical imaging. Such an automated system uses a set of images from a patient to diagnose whether they have a disease. However, systems trained for one particular domain of images cannot be expected to perform accurately on images of a different domain. These images should be filtered out by an Out-of-Distribution Detection (OoDD) method prior to diagnosis. This paper benchmarks popular OoDD methods in three domains of medical imaging: chest x-rays, fundus images, and histology slides. Our experiments show that despite methods yielding good results on some types of out-of-distribution samples, they fail to recognize images close to the training distribution.
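The abstract does not enumerate the benchmarked methods, but a representative baseline such OoDD benchmarks include is the maximum softmax probability score, sketched here with AUROC as the separation metric:

    import torch
    import torch.nn.functional as F
    from sklearn.metrics import roc_auc_score

    def msp_score(model, x):
        # Maximum softmax probability: higher = more in-distribution.
        with torch.no_grad():
            return F.softmax(model(x), dim=1).max(dim=1).values

    def oodd_auroc(model, x_in, x_out):
        scores = torch.cat([msp_score(model, x_in), msp_score(model, x_out)])
        labels = torch.cat([torch.ones(len(x_in)), torch.zeros(len(x_out))])
        return roc_auc_score(labels.numpy(), scores.cpu().numpy())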
37. Understanding Object Affordances Through Verb Usage Patterns [PDF] 返回目录
Ka Chun Lam, Francisco Pereira, Maryam Vaziri-Pashkam, Kristin Woodard, Emalie McMahon
Abstract: In order to interact with objects in our environment, we rely on an understanding of the actions that can be performed on them, and the extent to which they rely or have an effect on the properties of the object. This knowledge is called the object "affordance". We propose an approach for creating an embedding of objects in an affordance space, in which each dimension corresponds to an aspect of meaning shared by many actions, using text corpora. This embedding makes it possible to predict which verbs will be applicable to a given object, as captured in human judgments of affordance. We show that the dimensions learned are interpretable, and that they correspond to patterns of interaction with objects. Finally, we show that they can be used to predict other dimensions of object representation that have been shown to underpin human judgments of object similarity.
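The abstract leaves the embedding procedure unspecified; a minimal corpus-based version builds a verb-object co-occurrence matrix and factorizes it so that each dimension aggregates meaning shared by many verbs. A sketch under that assumption:

    import numpy as np

    def affordance_embedding(cooccurrence, dim=20):
        # PPMI weighting of a verb-object count matrix (objects as rows),
        # then truncated SVD; rows of the result are object embeddings.
        p = cooccurrence / cooccurrence.sum()
        expected = p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)
        ppmi = np.maximum(np.log(p / expected + 1e-12), 0.0)
        u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
        return u[:, :dim] * s[:dim]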
38. Predicting the Accuracy of a Few-Shot Classifier [PDF] 返回目录
Myriam Bontonou, Louis Béthune, Vincent Gripon
Abstract: In the context of few-shot learning, one cannot measure the generalization ability of a trained classifier using validation sets, due to the small number of labeled samples. In this paper, we are interested in finding alternatives to answer the question: is my classifier generalizing well to previously unseen data? We first analyze the reasons for the variability of generalization performances. We then investigate the case of using transfer-based solutions, and consider three settings: i) supervised where we only have access to a few labeled samples, ii) semi-supervised where we have access to both a few labeled samples and a set of unlabeled samples and iii) unsupervised where we only have access to unlabeled samples. For each setting, we propose reasonable measures that we empirically demonstrate to be correlated with the generalization ability of considered classifiers. We also show that these simple measures can be used to predict generalization up to a certain confidence. We conduct our experiments on standard few-shot vision datasets.
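The abstract does not name the proposed measures, but one label-aware proxy in the spirit described, scoring how well the few support samples separate by class in the backbone's feature space, can be sketched as:

    from sklearn.metrics import silhouette_score

    def support_separability(features, labels):
        # Class separability of the labeled support set in feature space,
        # in [-1, 1]; higher values should correlate with generalization.
        # (Illustrative; not necessarily one of the paper's exact measures.)
        return silhouette_score(features, labels)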
39. Labelling imaging datasets on the basis of neuroradiology reports: a validation study [PDF] 返回目录
David A. Wood, Sina Kafiabadi, Aisha Al Busaidi, Emily Guilhem, Jeremy Lynch, Matthew Townend, Antanas Montvila, Juveria Siddiqui, Naveen Gadapa, Matthew Benger, Gareth Barker, Sebastian Ourselin, James H. Cole, Thomas C. Booth
Abstract: Natural language processing (NLP) shows promise as a means to automate the labelling of hospital-scale neuroradiology magnetic resonance imaging (MRI) datasets for computer vision applications. To date, however, there has been no thorough investigation into the validity of this approach, including determining the accuracy of report labels compared to image labels as well as examining the performance of non-specialist labellers. In this work, we draw on the experience of a team of neuroradiologists who labelled over 5000 MRI neuroradiology reports as part of a project to build a dedicated deep learning-based neuroradiology report classifier. We show that, in our experience, assigning binary labels (i.e. normal vs abnormal) to images from reports alone is highly accurate. In contrast to the binary labels, however, the accuracy of more granular labelling is dependent on the category, and we highlight reasons for this discrepancy. We also show that downstream model performance is reduced when labelling of training reports is performed by a non-specialist. To allow other researchers to accelerate their research, we make our refined abnormality definitions and labelling rules available, as well as our easy-to-use radiology report labelling app which helps streamline this process.
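The paper builds a dedicated deep report classifier; the overall labelling-from-reports setup can still be illustrated with a simple TF-IDF baseline (all variable names are placeholders):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_report_labeller(reports, labels):
        # reports: free-text neuroradiology reports
        # labels:  binary normal(0)/abnormal(1) annotations by specialists
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                            LogisticRegression(max_iter=1000))
        return clf.fit(reports, labels)
    # Predicted report labels then stand in for image labels downstream.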
40. Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [PDF] 返回目录
Abhinav Shukla, Stavros Petridis, Maja Pantic
Abstract: The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
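One joint training step combining the two pretext losses the abstract names; the heads, targets, and implicit unit loss weighting are placeholders, not the paper's configuration:

    import torch.nn.functional as F

    def joint_ssl_step(audio_encoder, attr_head, face_decoder,
                       wav, face_frame, target_attrs, optimizer):
        # target_attrs: informative audio attributes (e.g. energy/pitch
        # features) computed from `wav` itself, so no labels are needed.
        z = audio_encoder(wav)
        loss = (F.mse_loss(attr_head(z), target_attrs)     # audio-only SSL
                + F.l1_loss(face_decoder(z), face_frame))  # visual SSL
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()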
41. BS4NN: Binarized Spiking Neural Networks with Temporal Coding and Learning [PDF] 返回目录
Saeed Reza Kheradpisheh, Maryam Mirsadeghi, Timothée Masquelier
Abstract: We recently proposed the S4NN algorithm, essentially an adaptation of backpropagation to multilayer spiking neural networks that use simple non-leaky integrate-and-fire neurons and a form of temporal coding known as time-to-first-spike coding. With this coding scheme, neurons fire at most once per stimulus, but the firing order carries information. Here, we introduce BS4NN, a modification of S4NN in which the synaptic weights are constrained to be binary (+1 or -1), in order to decrease memory and computation footprints. This was done using two sets of weights: firstly, real-valued weights, updated by gradient descent, and used in the backward pass of backpropagation, and secondly, their signs, used in the forward pass. Similar strategies have been used to train (non-spiking) binarized neural networks. The main difference is that BS4NN operates in the time domain: spikes are propagated sequentially, and different neurons may reach their threshold at different times, which increases computational power. We validated BS4NN on two popular benchmarks, MNIST and Fashion MNIST, and obtained state-of-the-art accuracies for this sort of networks (97.0% and 87.3% respectively) with a negligible accuracy drop with respect to real-valued weights (0.4% and 0.7%, respectively). We also demonstrated that BS4NN outperforms a simple BNN with the same architectures on those two datasets (by 0.2% and 0.9% respectively), presumably because it leverages the temporal dimension.
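The two-set weight scheme, with real-valued weights updated by gradient descent and their signs used in the forward pass, is the classic straight-through estimator, sketched here in PyTorch (the spiking dynamics themselves are omitted):

    import torch

    class BinarizeSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, w):
            # Forward pass uses only the signs (+1 / -1).
            return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

        @staticmethod
        def backward(ctx, grad_output):
            # Backward pass routes the gradient straight to the real weights.
            return grad_output

    real_w = torch.randn(100, 784, requires_grad=True)  # updated by SGD
    binary_w = BinarizeSTE.apply(real_w)                # used going forward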
42. MCU-Net: A framework towards uncertainty representations for decision support system patient referrals in healthcare contexts [PDF] 返回目录
Nabeel Seedat
Abstract: Incorporating a human-in-the-loop system when deploying automated decision support is critical in healthcare contexts to create trust, as well as provide reliable performance on a patient-to-patient basis. Deep learning methods while having high performance, do not allow for this patient-centered approach due to the lack of uncertainty representation. Thus, we present a framework of uncertainty representation evaluated for medical image segmentation, using MCU-Net which combines a U-Net with Monte Carlo Dropout, evaluated with four different uncertainty metrics. The framework augments this by adding a human-in-the-loop aspect based on an uncertainty threshold for automated referral of uncertain cases to a medical professional. We demonstrate that MCU-Net combined with epistemic uncertainty and an uncertainty threshold tuned for this application maximizes automated performance on an individual patient level, yet refers truly uncertain cases. This is a step towards uncertainty representations when deploying machine learning based decision support in healthcare settings.
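A minimal sketch of the Monte Carlo Dropout half of MCU-Net and the referral rule; the paper compares four uncertainty metrics, so the predictive variance below is just one example:

    import torch
    import torch.nn as nn

    def enable_dropout(model):
        # Keep only the dropout layers stochastic at test time.
        for m in model.modules():
            if isinstance(m, (nn.Dropout, nn.Dropout2d)):
                m.train()

    def mc_dropout_predict(model, x, n_samples=20):
        model.eval()
        enable_dropout(model)
        with torch.no_grad():
            preds = torch.stack([torch.sigmoid(model(x))
                                 for _ in range(n_samples)])
        return preds.mean(0), preds.var(0)  # segmentation, uncertainty map

Cases whose aggregate uncertainty exceeds a threshold tuned on validation data are referred to a clinician; the rest are accepted automatically.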
43. Guidestar-free image-guided wavefront-shaping [PDF] 返回目录
Tomer Yeminy, Ori Katz
Abstract: Optical imaging through scattering media is a fundamental challenge in many applications. Recently, substantial breakthroughs such as imaging through biological tissues and looking around corners have been obtained by the use of wavefront-shaping approaches. However, these require an implanted guide-star for determining the wavefront correction, controlled coherent illumination, and most often raster scanning of the shaped focus. Alternative novel computational approaches that exploit speckle correlations, avoid guide-stars and wavefront control but are limited to small two-dimensional objects contained within the memory-effect correlations range. Here, we present a new concept, image-guided wavefront-shaping, allowing non-invasive, guidestar-free, widefield, incoherent imaging through highly scattering layers, without illumination control. Most importantly, the wavefront-correction is found even for objects that are larger than the memory-effect range, by blindly optimizing image-quality metrics. We demonstrate imaging of extended objects through highly-scattering layers and multi-core fibers, paving the way for non-invasive imaging in various applications, from microscopy to endoscopy.
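The "blindly optimizing image-quality metrics" step translates to a simple hardware-in-the-loop search; `set_slm_phase` and `capture_image` below are placeholders for the real spatial light modulator and camera interfaces, and the sharpness metric is one of several reasonable choices:

    import numpy as np

    def optimize_wavefront(set_slm_phase, capture_image, shape=(32, 32),
                           iters=2000, step=0.5, seed=0):
        rng = np.random.default_rng(seed)

        def sharpness(img):  # mean squared gradient as the quality metric
            gy, gx = np.gradient(img.astype(float))
            return np.mean(gx ** 2 + gy ** 2)

        phase = np.zeros(shape)
        set_slm_phase(phase)
        best = sharpness(capture_image())
        for _ in range(iters):
            trial = phase + step * rng.standard_normal(shape)
            set_slm_phase(trial)
            score = sharpness(capture_image())
            if score > best:              # keep improving perturbations
                phase, best = trial, score
        return phase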
44. Designing and Training of A Dual CNN for Image Denoising [PDF] 返回目录
Chunwei Tian, Yong Xu, Wangmeng Zuo, Bo Du, Chia-Wen Lin, David Zhang
Abstract: Deep convolutional neural networks (CNNs) for image denoising have recently attracted increasing research interest. However, plain networks cannot recover fine details for a complex task, such as real noisy images. In this paper, we propose a Dual denoising Network (DudeNet) to recover a clean image. Specifically, DudeNet consists of four modules: a feature extraction block, an enhancement block, a compression block, and a reconstruction block. The feature extraction block with a sparse mechanism extracts global and local features via two sub-networks. The enhancement block gathers and fuses the global and local features to provide complementary information for the latter network. The compression block refines the extracted information and compresses the network. Finally, the reconstruction block is utilized to reconstruct a denoised image. DudeNet has the following advantages: (1) The dual networks with a sparse mechanism can extract complementary features to enhance the generalization ability of the denoiser. (2) Fusing global and local features can extract salient features to recover fine details for complex noisy images. (3) A small-size filter is used to reduce the complexity of the denoiser. Extensive experiments demonstrate the superiority of DudeNet over existing state-of-the-art denoising methods.
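A block-level skeleton of the four-module layout the abstract describes; depths, widths, and the noise-subtraction output are illustrative guesses, not the published architecture:

    import torch
    import torch.nn as nn

    def conv_block(cin, cout, n=4):
        layers = []
        for i in range(n):
            layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                       nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    class DualDenoiser(nn.Module):
        def __init__(self, ch=64):
            super().__init__()
            self.branch_a = conv_block(3, ch)      # feature extraction via
            self.branch_b = conv_block(3, ch)      # two sub-networks
            self.enhance = conv_block(2 * ch, ch)  # gather and fuse features
            self.compress = nn.Conv2d(ch, ch, 1)   # 1x1 compression
            self.reconstruct = nn.Conv2d(ch, 3, 3, padding=1)

        def forward(self, x):
            f = torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
            return x - self.reconstruct(self.compress(self.enhance(f)))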
45. Operation-Aware Soft Channel Pruning using Differentiable Masks [PDF] 返回目录
Minsoo Kang, Bohyung Han
Abstract: We propose a simple but effective data-driven channel pruning algorithm, which compresses deep neural networks in a differentiable way by exploiting the characteristics of operations. The proposed approach makes a joint consideration of batch normalization (BN) and rectified linear unit (ReLU) for channel pruning; it estimates how likely the two successive operations deactivate each feature map and prunes the channels with high probabilities. To this end, we learn differentiable masks for individual channels and make soft decisions throughout the optimization procedure, which facilitates to explore larger search space and train more stable networks. The proposed framework enables us to identify compressed models via a joint learning of model parameters and channel pruning without an extra procedure of fine-tuning. We perform extensive experiments and achieve outstanding performance in terms of the accuracy of output networks given the same amount of resources when compared with the state-of-the-art methods.
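The core mechanism, a differentiable per-channel mask sitting between batch normalization and ReLU, can be sketched as below; the sigmoid parameterization is an illustrative stand-in for the paper's exact formulation:

    import torch
    import torch.nn as nn

    class SoftChannelGate(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.logit = nn.Parameter(torch.zeros(channels))

        def forward(self, x):  # x: (N, C, H, W)
            return x * torch.sigmoid(self.logit).view(1, -1, 1, 1)

    block = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1),
                          nn.BatchNorm2d(64),
                          SoftChannelGate(64),   # soft decision per channel
                          nn.ReLU(inplace=True))

Channels whose gate saturates near zero after joint training contribute almost nothing past the ReLU and can be removed without a separate fine-tuning stage.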
46. AUSN: Approximately Uniform Quantization by Adaptively Superimposing Non-uniform Distribution for Deep Neural Networks [PDF] 返回目录
Liu Fangxin, Zhao Wenbo, Wang Yanzhi, Dai Changzhi, Jiang Li
Abstract: Quantization is essential to simplify DNN inference in edge applications. Existing uniform and non-uniform quantization methods, however, exhibit an inherent conflict between the representing range and representing resolution, and thereby result in either underutilized bit-width or significant accuracy drop. Moreover, these methods encounter three drawbacks: i) the absence of a quantitative metric for in-depth analysis of the source of the quantization errors; ii) the limited focus on the image classification tasks based on CNNs; iii) unawareness of the real hardware and energy savings obtained by lowering the bit-width. In this paper, we first define two quantitative metrics, i.e., the clipping error and the rounding error, to analyze the quantization error distribution. We observe that the clipping and rounding errors vary significantly across layers, models and tasks. Consequently, we propose a novel quantization method to quantize the weights and activations. The key idea is to Approximate the Uniform quantization by Adaptively Superposing multiple Non-uniform quantized values, namely AUSN. AUSN consists of a decoder-free coding scheme that efficiently exploits the bit-width to its extreme, a superposition quantization algorithm that can adapt the coding scheme to different DNN layers, models and tasks without extra hardware design effort, and a rounding scheme that can eliminate the well-known bit-width overflow and re-quantization issues. Theoretical analysis (see Appendix A) and accuracy evaluation on various DNN models of different tasks show the effectiveness and generalization of AUSN. The synthesis results on FPGA (see Appendix B) show a $2\times$ reduction in energy consumption and a $2\times$ to $4\times$ reduction in hardware resources.
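The superposition idea, approximating a uniform grid by adaptively summing non-uniform (e.g. power-of-two) values, can be illustrated with a greedy residual decomposition; this illustrates the principle only and is not AUSN's actual coding scheme:

    import torch

    def superpose_pow2(w, terms=2):
        # Approximate each weight as a sum of `terms` signed powers of two,
        # quantizing the residual greedily at each step.
        approx = torch.zeros_like(w)
        residual = w.clone()
        for _ in range(terms):
            exponent = torch.round(torch.log2(residual.abs().clamp(min=1e-12)))
            q = torch.sign(residual) * torch.pow(2.0, exponent)
            approx = approx + q
            residual = residual - q
        return approx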
47. NVAE: A Deep Hierarchical Variational Autoencoder [PDF] 返回目录
Arash Vahdat, Jan Kautz
Abstract: Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256$\times$256 pixels.
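Two ingredients the abstract names can be sketched directly: a depthwise separable convolution with batch normalization, and the residual parameterization of the posterior Normal relative to the prior (sizes and the sampling helper are illustrative):

    import torch
    import torch.nn as nn

    # Depthwise separable convolution + batch normalization.
    sep_conv = nn.Sequential(
        nn.Conv2d(64, 64, 5, padding=2, groups=64),  # depthwise
        nn.Conv2d(64, 64, 1),                        # pointwise
        nn.BatchNorm2d(64))

    def residual_normal_sample(prior_mu, prior_logsig, d_mu, d_logsig):
        # The encoder predicts only a shift (d_mu, d_logsig) relative to
        # the prior, which stabilizes training of deep hierarchies.
        mu, logsig = prior_mu + d_mu, prior_logsig + d_logsig
        return mu + torch.exp(logsig) * torch.randn_like(mu)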
48. Low-dimensional Manifold Constrained Disentanglement Network for Metal Artifact Reduction [PDF] 返回目录
Chuang Niu, Wenxiang Cong, Fenglei Fan, Hongming Shan, Mengzhou Li, Jimin Liang, Ge Wang
Abstract: Deep neural network based methods have achieved promising results for CT metal artifact reduction (MAR), most of which use many synthesized paired images for training. As synthesized metal artifacts in CT images may not accurately reflect the clinical counterparts, an artifact disentanglement network (ADN) was proposed with unpaired clinical images directly, producing promising results on clinical datasets. However, without sufficient supervision, it is difficult for ADN to recover structural details of artifact-affected CT images based on adversarial losses only. To overcome these problems, here we propose a low-dimensional manifold (LDM) constrained disentanglement network (DN), leveraging the image characteristics that the patch manifold is generally low-dimensional. Specifically, we design an LDM-DN learning algorithm to empower the disentanglement network through optimizing the synergistic network loss functions while constraining the recovered images to be on a low-dimensional patch manifold. Moreover, learning from both paired and unpaired data, an efficient hybrid optimization scheme is proposed to further improve the MAR performance on clinical datasets. Extensive experiments demonstrate that the proposed LDM-DN approach can consistently improve the MAR performance in paired and/or unpaired learning settings, outperforming competing methods on synthesized and clinical datasets.
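The low-dimensional patch-manifold constraint can be approximated with a simple surrogate: penalize the energy of patch-matrix singular values beyond a small rank. This is an illustrative stand-in, not the paper's LDM-DN algorithm:

    import torch
    import torch.nn.functional as F

    def patch_lowdim_penalty(img, patch=8, rank=6):
        # img: (N, C, H, W). Collect non-overlapping patches as columns and
        # penalize singular values beyond `rank`, pushing recovered images
        # toward a low-dimensional patch manifold.
        cols = F.unfold(img, kernel_size=patch, stride=patch)  # (N, C*p*p, L)
        s = torch.linalg.svdvals(cols)
        return (s[:, rank:] ** 2).sum()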
49. Fine-grained Vibration Based Sensing Using a Smartphone [PDF] 返回目录
Kamran Ali, Alex X. Liu
Abstract: Recognizing surfaces based on their vibration signatures is useful as it can enable tagging of different locations without requiring any additional hardware such as Near Field Communication (NFC) tags. However, previous vibration based surface recognition schemes either use custom hardware for creating and sensing vibration, which makes them difficult to adopt, or use inertial (IMU) sensors in commercial off-the-shelf (COTS) smartphones to sense movements produced due to vibrations, which makes them coarse-grained because of the low sampling rates of IMU sensors. The mainstream COTS smartphones based schemes are also susceptible to inherent hardware based irregularities in vibration mechanism of the smartphones. Moreover, the existing schemes that use microphones to sense vibration are prone to short-term and constant background noises (e.g. intermittent talking, exhaust fan, etc.) because microphones not only capture the sounds created by vibration but also other interfering sounds present in the environment. In this paper, we propose VibroTag, a robust and practical vibration based sensing scheme that works with smartphones with different hardware, can extract fine-grained vibration signatures of different surfaces, and is robust to environmental noise and hardware based irregularities. We implemented VibroTag on two different Android phones and evaluated in multiple different environments where we collected data from 4 individuals for 5 to 20 consecutive days. Our results show that VibroTag achieves an average accuracy of 86.55% while recognizing 24 different locations/surfaces, even when some of those surfaces were made of similar material. VibroTag's accuracy is 37% higher than the average accuracy of 49.25% achieved by one of the state-of-the-art IMUs based schemes, which we implemented for comparison with VibroTag.
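A minimal version of the sensing pipeline, extracting a spectral signature from each recording and training a classifier over surfaces, with library choices that are assumptions rather than the paper's:

    import numpy as np
    from scipy.signal import spectrogram
    from sklearn.ensemble import RandomForestClassifier

    def vibration_signature(audio, fs=44100):
        # Average log-spectral profile: one value per frequency bin.
        _, _, sxx = spectrogram(audio, fs=fs, nperseg=1024)
        return np.log(sxx + 1e-12).mean(axis=1)

    def train_surface_classifier(recordings, surface_ids):
        feats = np.stack([vibration_signature(a) for a in recordings])
        return RandomForestClassifier(n_estimators=200).fit(feats, surface_ids)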
50. Consistency Regularization with Generative Adversarial Networks for Semi-Supervised Image Classification [PDF] 返回目录
Zexi Chen, Bharathkumar Ramachandra, Ranga Raju Vatsavai
Abstract: Generative Adversarial Networks (GANs) based semi-supervised learning (SSL) approaches are shown to improve classification performance by utilizing a large number of unlabeled samples in conjunction with limited labeled samples. However, their performance still lags behind the state-of-the-art non-GAN based SSL approaches. One main reason we identify is the lack of consistency in class probability predictions on the same image under local perturbations. This problem was addressed in the past in a generic setting using the label consistency regularization, which enforces the class probability predictions for an input image to be unchanged under various semantic-preserving perturbations. In this work, we incorporate the consistency regularization in the vanilla semi-GAN to address this critical limitation. In particular, we present a new composite consistency regularization method which, in spirit, combines two well-known consistency-based techniques -- Mean Teacher and Interpolation Consistency Training. We demonstrate the efficacy of our approach on two SSL image classification benchmark datasets, SVHN and CIFAR-10. Our experiments show that this new composite consistency regularization based semi-GAN significantly improves its performance and achieves new state-of-the-art performance among GAN-based SSL approaches.
51. Self-supervised Skull Reconstruction in Brain CT Images with Decompressive Craniectomy [PDF] 返回目录
Franco Matzkin, Virginia Newcombe, Susan Stevenson, Aneesh Khetani, Tom Newman, Richard Digby, Andrew Stevens, Ben Glocker, Enzo Ferrante
Abstract: Decompressive craniectomy (DC) is a common surgical procedure consisting of the removal of a portion of the skull that is performed after incidents such as stroke, traumatic brain injury (TBI) or other events that could result in acute subdural hemorrhage and/or increasing intracranial pressure. In these cases, CT scans are obtained to diagnose and assess injuries, or guide a certain therapy and intervention. We propose a deep learning based method to reconstruct the skull defect removed during DC performed after TBI from post-operative CT images. This reconstruction is useful in multiple scenarios, e.g. to support the creation of cranioplasty plates, accurate measurements of bone flap volume and total intracranial volume, important for studies that aim to relate later atrophy to patient outcome. We propose and compare alternative self-supervised methods where an encoder-decoder convolutional neural network (CNN) estimates the missing bone flap on post-operative CTs. The self-supervised learning strategy only requires images with complete skulls and avoids the need for annotated DC images. For evaluation, we employ real and simulated images with DC, comparing the results with other state-of-the-art approaches. The experiments show that the proposed model outperforms current manual methods, enabling reconstruction even in highly challenging cases where big skull defects have been removed during surgery.
摘要:去骨瓣减压术(DC)是一种常见外科手术,即在中风、创伤性脑损伤(TBI)等可能导致急性硬膜下出血和/或颅内压升高的事件发生后,切除一部分颅骨。在这些情况下,通常通过CT扫描来诊断和评估损伤,或指导治疗和干预。我们提出一种基于深度学习的方法,从术后CT图像重建TBI后DC手术中切除的颅骨缺损。这种重建在多种场景下都很有用,例如辅助制作颅骨修补板,以及精确测量骨瓣体积和颅内总体积,这对旨在将后期萎缩与患者预后关联起来的研究十分重要。我们提出并比较了多种自监督方法,其中编码器-解码器卷积神经网络(CNN)在术后CT上估计缺失的骨瓣。该自监督学习策略只需要颅骨完整的图像,无需带标注的DC图像。在评估中,我们使用了真实和模拟的DC图像,并与其他最先进方法进行比较。实验表明,所提模型优于现有的手工方法,即使在手术中切除大块颅骨缺损的极具挑战性的病例中也能实现重建。
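The self-supervision strategy can be illustrated with a toy sketch: virtually remove a random "bone flap" from a complete skull volume and train an encoder-decoder to predict the removed region, so no annotated DC images are needed. The flap-removal heuristic, network, and volume shapes below are placeholders, not the authors' setup.

```python
# Toy sketch of self-supervised flap reconstruction; the real pipeline uses
# registered CT volumes and a deeper encoder-decoder CNN.
import torch
import torch.nn as nn

def remove_virtual_flap(skull, size=16):
    """Zero out a random cubic patch, returning an (input, target-flap) pair."""
    x = skull.clone()
    d = torch.randint(0, skull.shape[-1] - size, (3,))
    x[..., d[0]:d[0]+size, d[1]:d[1]+size, d[2]:d[2]+size] = 0
    flap = skull - x  # the removed bone is the reconstruction target
    return x, flap

encoder_decoder = nn.Sequential(  # shallow stand-in for the paper's CNN
    nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv3d(8, 1, 3, padding=1),
)

skull = (torch.rand(1, 1, 32, 32, 32) > 0.5).float()  # toy binary skull volume
x, flap = remove_virtual_flap(skull)
loss = nn.functional.mse_loss(encoder_decoder(x), flap)
loss.backward()
```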
52. README: REpresentation learning by fairness-Aware Disentangling MEthod [PDF] 返回目录
Sungho Park, Dohyung Kim, Sunhee Hwang, Hyeran Byun
Abstract: Fair representation learning aims to encode invariant representation with respect to the protected attribute, such as gender or age. In this paper, we design Fairness-aware Disentangling Variational AutoEncoder (FD-VAE) for fair representation learning. This network disentangles latent space into three subspaces with a decorrelation loss that encourages each subspace to contain independent information: 1) target attribute information, 2) protected attribute information, 3) mutual attribute information. After the representation learning, this disentangled representation is leveraged for fairer downstream classification by excluding the subspace with the protected attribute information. We demonstrate the effectiveness of our model through extensive experiments on CelebA and UTK Face datasets. Our method outperforms the previous state-of-the-art method by large margins in terms of equal opportunity and equalized odds.
摘要:公平表示学习旨在编码对性别、年龄等受保护属性保持不变的表示。本文为公平表示学习设计了公平感知解耦变分自编码器(FD-VAE)。该网络借助去相关损失将潜在空间解耦为三个子空间,鼓励每个子空间包含相互独立的信息:1)目标属性信息;2)受保护属性信息;3)共有属性信息。表示学习完成后,通过排除包含受保护属性信息的子空间,将解耦后的表示用于更公平的下游分类。我们在CelebA和UTK Face数据集上的大量实验证明了模型的有效性。在机会均等(equal opportunity)和均衡几率(equalized odds)指标上,我们的方法大幅优于此前的最先进方法。
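A minimal sketch of the latent-space split may help: partition the code into three subspaces and penalize statistical dependence between them. The cross-covariance penalty below is an assumed stand-in for the paper's decorrelation loss, which the abstract does not specify.

```python
# Sketch: split a latent code into three subspaces and decorrelate them.
import torch

def decorrelation_loss(z_target, z_protected, z_mutual):
    # Penalize the cross-covariance between every pair of subspaces.
    def cross_cov(a, b):
        a = a - a.mean(dim=0)
        b = b - b.mean(dim=0)
        return ((a.T @ b) / a.size(0)).pow(2).mean()
    return (cross_cov(z_target, z_protected)
            + cross_cov(z_target, z_mutual)
            + cross_cov(z_protected, z_mutual))

z = torch.randn(64, 24, requires_grad=True)   # latent codes from an encoder
z_t, z_p, z_m = z.split([8, 8, 8], dim=1)     # three 8-dim subspaces
loss = decorrelation_loss(z_t, z_p, z_m)
loss.backward()
# For fairer downstream classification, only (z_t, z_m) would be used,
# dropping the protected-attribute subspace z_p.
```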
53. On the Generalization Effects of Linear Transformations in Data Augmentation [PDF] 返回目录
Sen Wu, Hongyang R. Zhang, Gregory Valiant, Christopher Ré
Abstract: Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations which preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations which mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms RandAugment by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR datasets.
摘要:数据增强是提升图像和文本分类等应用性能的强大技术。然而,对于各种增强为何有效以及如何起作用,目前仍缺乏严格的理解。在本工作中,我们考虑一族线性变换,并在过参数化线性回归设定下研究它们对岭估计量的影响。首先,我们证明保持数据标签不变的变换可以通过扩大训练数据的张成空间来改进估计。其次,我们证明混合数据的变换可以通过起到正则化作用来改进估计。最后,我们在MNIST上验证了这些理论见解。基于这些见解,我们提出一种增强方案,依据模型对变换后数据的不确定程度在变换空间中进行搜索。我们在图像和文本数据集上验证了所提方案。例如,在CIFAR-100上使用Wide-ResNet-28-10时,我们的方法比RandAugment高出1.24%。此外,我们在CIFAR数据集上取得了与SoTA的Adversarial AutoAugment相当的准确率。
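The uncertainty-driven search can be sketched as follows: among candidate label-preserving transforms, train next on the one the model is least certain about. The candidate set and the predictive-entropy criterion below are illustrative assumptions, not the paper's exact selection rule.

```python
# Sketch of uncertainty-based augmentation selection over a candidate set.
import torch
import torch.nn.functional as F

def pick_augmentation(model, x, transforms):
    best, best_entropy = None, -1.0
    with torch.no_grad():
        for t in transforms:
            probs = F.softmax(model(t(x)), dim=1)
            # Mean predictive entropy as the uncertainty measure (assumed).
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=1).mean()
            if entropy > best_entropy:
                best, best_entropy = t, entropy.item()
    return best  # train on best(x) next

# Toy usage with linear transforms of flattened images:
model = torch.nn.Linear(784, 10)
x = torch.randn(8, 784)
candidates = [lambda v: v, lambda v: -v,
              lambda v: v + 0.1 * torch.randn_like(v)]
chosen = pick_augmentation(model, x, candidates)
```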
注:中文为机器翻译结果!封面为论文标题词云图!