Table of Contents
12. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization [PDF] Abstract
14. Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy [PDF] Abstract
27. Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes [PDF] Abstract
32. Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing [PDF] Abstract
41. Region Proposal Network with Graph Prior and IoU-Balance Loss for Landmark Detection in 3D Ultrasound [PDF] Abstract
43. Semi-Supervised Cervical Dysplasia Classification With Learnable Graph Convolutional Network [PDF] Abstract
44. Boundary-Aware Dense Feature Indicator for Single-Stage 3D Object Detection from Point Clouds [PDF] Abstract
46. Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets without Superior Knowledge [PDF] Abstract
Abstracts
1. Evading Deepfake-Image Detectors with White- and Black-Box Attacks [PDF] Back to Contents
Nicholas Carlini, Hany Farid
Abstract: It is now possible to synthesize highly realistic images of people who don't exist. Such content has, for example, been implicated in the creation of fraudulent social-media profiles responsible for dis-information campaigns. Significant efforts are, therefore, being deployed to detect synthetically-generated content. One popular forensic approach trains a neural network to distinguish real from synthetic content. We show that such forensic classifiers are vulnerable to a range of attacks that reduce the classifier to near-0% accuracy. We develop five attack case studies on a state-of-the-art classifier that achieves an area under the ROC curve (AUC) of 0.95 on almost all existing image generators, when only trained on one generator. With full access to the classifier, we can flip the lowest bit of each pixel in an image to reduce the classifier's AUC to 0.0005; perturb 1% of the image area to reduce the classifier's AUC to 0.08; or add a single noise pattern in the synthesizer's latent space to reduce the classifier's AUC to 0.17. We also develop a black-box attack that, with no access to the target classifier, reduces the AUC to 0.22. These attacks reveal significant vulnerabilities of certain image-forensic classifiers.
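The white-box bit-flipping attack is simple to state: every pixel moves by one gray level (flipping its lowest bit) in whichever direction lowers the forensic score. A minimal sketch, assuming a differentiable model(x) that maps an image in [0, 1] to its "synthetic" logit (a hypothetical interface, not the authors' code):

```python
import torch

def lsb_flip_attack(image_uint8, model):
    # White-box attack: move every pixel by +/-1 gray level (which flips
    # its lowest bit) in the direction that lowers the "synthetic" score.
    x = (image_uint8.float() / 255.0).requires_grad_(True)
    model(x).sum().backward()              # gradient of the forensic score
    step = -x.grad.sign().long()           # descend on the score
    out = (image_uint8.long() + step).clamp(0, 255)
    return out.to(torch.uint8)
```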
2. Articulation-aware Canonical Surface Mapping [PDF] Back to Contents
Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani
Abstract: We tackle the tasks of: 1) predicting a Canonical Surface Mapping (CSM) that indicates the mapping from 2D pixels to corresponding points on a canonical template shape, and 2) inferring the articulation and pose of the template corresponding to the input image. While previous approaches rely on keypoint supervision for learning, we present an approach that can learn without such annotations. Our key insight is that these tasks are geometrically related, and we can obtain supervisory signal via enforcing consistency among the predictions. We present results across a diverse set of animal object categories, showing that our method can learn articulation and CSM prediction from image collections using only foreground mask labels for training. We empirically show that allowing articulation helps learn more accurate CSM prediction, and that enforcing the consistency with predicted CSM is similarly critical for learning meaningful articulation.
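The supervisory signal here is a geometric cycle: a pixel's predicted template point, after articulation and projection, should land back on that pixel. A simplified sketch of such a consistency loss (the names and the weak-perspective camera are our assumptions, not the authors' code):

```python
import torch

def csm_cycle_loss(pixels, csm_points, camera, mask):
    # pixels:     (N, 2) 2D pixel coordinates of foreground samples
    # csm_points: (N, 3) predicted template points, already articulated and
    #             posed into camera coordinates
    # camera:     weak-perspective parameters (scale s, translation t), (3,)
    # mask:       (N,) foreground weights
    s, t = camera[0], camera[1:]
    reprojected = s * csm_points[:, :2] + t   # weak-perspective projection
    err = ((reprojected - pixels) ** 2).sum(-1)
    return (mask * err).mean()
```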
3. EPOS: Estimating 6D Pose of Objects with Symmetries [PDF] Back to Contents
Tomas Hodan, Daniel Barath, Jiri Matas
Abstract: We present a new method for estimating the 6D pose of rigid objects with available 3D models from a single RGB input image. The method is applicable to a broad range of objects, including challenging ones with global or partial symmetries. An object is represented by compact surface fragments which allow handling symmetries in a systematic manner. Correspondences between densely sampled pixels and the fragments are predicted using an encoder-decoder network. At each pixel, the network predicts: (i) the probability of each object's presence, (ii) the probability of the fragments given the object's presence, and (iii) the precise 3D location on each fragment. A data-dependent number of corresponding 3D locations is selected per pixel, and poses of possibly multiple object instances are estimated using a robust and efficient variant of the PnP-RANSAC algorithm. In the BOP Challenge 2019, the method outperforms all RGB and most RGB-D and D methods on the T-LESS and LM-O datasets. On the YCB-V dataset, it is superior to all competitors, with a large margin over the second-best RGB method. Source code is at: this http URL.
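The final pose stage consumes the selected 2D-3D correspondences. A sketch using OpenCV's stock PnP-RANSAC in place of the paper's own robust and efficient variant:

```python
import numpy as np
import cv2

def pose_from_correspondences(obj_pts, img_pts, K):
    # obj_pts: (N, 3) 3D points on the object's surface fragments
    # img_pts: (N, 2) corresponding image pixels
    # K:       (3, 3) camera intrinsics
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts.astype(np.float32), img_pts.astype(np.float32),
        K.astype(np.float32), distCoeffs=None,
        reprojectionError=4.0, iterationsCount=200)
    return (rvec, tvec, inliers) if ok else None
```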
4. Robust Image Reconstruction with Misaligned Structural Information [PDF] Back to Contents
Leon Bungert, Matthias J. Ehrhardt
Abstract: Multi-modality (or multi-channel) imaging is becoming increasingly important and more widely available, e.g. hyperspectral imaging in remote sensing, spectral CT in material sciences as well as multi-contrast MRI and PET-MR in medicine. Research in the last decades resulted in a plethora of mathematical methods to combine data from several modalities. State-of-the-art methods, often formulated as variational regularization, have shown to significantly improve image reconstruction both quantitatively and qualitatively. Almost all of these models rely on the assumption that the modalities are perfectly registered, which is not the case in most real world applications. We propose a variational framework which jointly performs reconstruction and registration, thereby overcoming this hurdle. Numerical results show the potential of the proposed strategy for various applications for hyperspectral imaging, PET-MR and multi-contrast MRI: typical misalignments between modalities such as rotations, translations, zooms can be effectively corrected during the reconstruction process. Therefore the proposed framework allows the robust exploitation of shared information across multiple modalities under real conditions.
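In generic terms (our paraphrase; the symbols are chosen here, not taken from the paper), such a joint model couples the unknown image and the unknown deformation in a single objective:

```latex
\min_{u,\,\phi}\; \mathcal{D}(Au,\, f) \;+\; \alpha\, \mathcal{J}(u,\; v \circ \phi) \;+\; \beta\, \mathcal{R}(\phi)
```

where $u$ is the image to reconstruct, $f$ the measured data under the forward operator $A$, $v$ the structural side information from the other modality, $\phi$ the deformation registering $v$ to $u$, $\mathcal{J}$ a structure-coupling regularizer (e.g., a directional total variation), and $\mathcal{R}$ a penalty on implausible deformations. Alternating minimization over $u$ and $\phi$ then interleaves reconstruction and registration.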
5. Symmetry and Group in Attribute-Object Compositions [PDF] Back to Contents
Yong-Lu Li, Yue Xu, Xiaohan Mao, Cewu Lu
Abstract: Attributes and objects can compose diverse compositions. To model the compositional nature of these general concepts, it is a good choice to learn them through transformations, such as coupling and decoupling. However, complex transformations need to satisfy specific principles to guarantee the rationality. In this paper, we first propose a previously ignored principle of attribute-object transformation: Symmetry. For example, coupling peeled-apple with attribute peeled should result in peeled-apple, and decoupling peeled from apple should still output apple. Incorporating the symmetry principle, a transformation framework inspired by group theory is built, i.e. SymNet. SymNet consists of two modules, Coupling Network and Decoupling Network. With the group axioms and symmetry property as objectives, we adopt Deep Neural Networks to implement SymNet and train it in an end-to-end paradigm. Moreover, we propose a Relative Moving Distance (RMD) based recognition method to utilize the attribute change instead of the attribute pattern itself to classify attributes. Our symmetry learning can be utilized for the Compositional Zero-Shot Learning task and outperforms the state-of-the-art on widely-used benchmarks. Code is available at this https URL.
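A sketch of how such symmetry and invertibility objectives might be written, with C and D standing for the coupling and decoupling networks (our simplification of SymNet's losses, not the released code):

```python
import torch
import torch.nn.functional as F

def symmetry_losses(C, D, obj_with_a, obj_without_a, a):
    # C(x, a): coupling net that attaches attribute a to features x
    # D(x, a): decoupling net that removes attribute a from features x
    # Coupling an attribute the object already has should be a no-op:
    l_sym_c = F.l1_loss(C(obj_with_a, a), obj_with_a)
    # Decoupling an attribute the object lacks should be a no-op:
    l_sym_d = F.l1_loss(D(obj_without_a, a), obj_without_a)
    # Group-style invertibility: D(C(x, a), a) ~ x and C(D(y, a), a) ~ y
    l_inv = (F.l1_loss(D(C(obj_without_a, a), a), obj_without_a)
             + F.l1_loss(C(D(obj_with_a, a), a), obj_with_a))
    return l_sym_c + l_sym_d + l_inv
```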
6. Boosting Deep Hyperspectral Image Classification with Spectral Unmixing [PDF] Back to Contents
Alan J.X. Guo, Fei Zhu
Abstract: Recent advances in neural networks have made great progress in addressing the hyperspectral image (HSI) classification problem. However, the overfitting effect, which is mainly caused by complicated model structure and small training set, remains a major concern when applying neural networks to HSIs analysis. Reducing the complexity of the neural networks could prevent overfitting to some extent, but also weakens the networks' ability to extract more abstract features. Enlarging the training set is also difficult. To tackle the overfitting issue, we propose an abundance-based multi-HSI classification method. By applying an autoencoder-based spectral unmixing technique, different HSIs are first converted from the spectral domain to the abundance domain. After that, the abundance data obtained from multiple HSIs are collected to form an enlarged dataset. Lastly, a simple classifier is trained, which is capable of predicting on all the involved datasets. By taking advantage of spectral unmixing, converting the data from the spectral domain to the abundance domain can significantly simplify the classification tasks. This enables the use of a simple network as the classifier, thus alleviating the overfitting effect. Moreover, as much dataset-specific information is eliminated after spectral unmixing, a compatible classifier suitable for different HSIs is trained. In view of this, a training set enlarged several times over is constructed by bundling the training data of different HSIs. The effectiveness of the proposed method is verified by an ablation study and comparative experiments. On four public HSIs, the proposed method provides classification results comparable to those of two competing methods, but with a far simpler model.
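The pipeline's front end can be pictured as a small unmixing autoencoder whose bottleneck is the abundance vector; the abundances, not the spectra, then feed a shared classifier. A sketch with illustrative layer sizes (not the paper's architecture):

```python
import torch
import torch.nn as nn

class UnmixingAE(nn.Module):
    # Encoder maps a spectrum to non-negative abundances summing to one;
    # decoder is a linear mixing with learnable endmembers.
    def __init__(self, n_bands, n_endmembers):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_bands, 128), nn.ReLU(),
            nn.Linear(128, n_endmembers))
        self.decoder = nn.Linear(n_endmembers, n_bands, bias=False)

    def forward(self, spectra):
        abundances = torch.softmax(self.encoder(spectra), dim=-1)
        reconstruction = self.decoder(abundances)
        return abundances, reconstruction
```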
7. LGVTON: A Landmark Guided Approach to Virtual Try-On [PDF] Back to Contents
Debapriya Roy, Sanchayan Santra, Bhabatosh Chanda
Abstract: We address the problem of image-based virtual try-on (VTON), where the goal is to synthesize an image of a person wearing the cloth of a model. An essential requirement for generating a perceptually convincing VTON result is preserving the characteristics of the cloth and the person. Keeping this in mind, we propose \textit{LGVTON}, a novel self-supervised landmark-guided approach to image-based virtual try-on. The incorporation of self-supervision tackles the problem of lack of paired training data in the model-to-person VTON scenario. LGVTON uses two types of landmarks to warp the model cloth according to the shape and pose of the person: one, human landmarks, the locations of anatomical keypoints of the human; two, fashion landmarks, the structural keypoints of the cloth. We introduce a unique way of using landmarks for warping which is more efficient and effective compared to existing warping-based methods in the current problem scenario. In addition, to make the method robust in cases of noisy landmark estimates that cause inaccurate warping, we propose a mask generator module that attempts to predict the true segmentation mask of the model cloth on the person, which in turn guides our image synthesizer module in tackling warping issues. Experimental results show the effectiveness of our method in comparison to state-of-the-art VTON methods.
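A toy version of the landmark-guided warp, fitting a single robust affine map from matched fashion landmarks to human landmarks (LGVTON's actual warp is considerably richer than one affine; this is a stand-in):

```python
import numpy as np
import cv2

def warp_cloth_to_person(cloth_img, cloth_lms, person_lms, out_size):
    # cloth_lms, person_lms: (K, 2) matched fashion/human landmarks
    # out_size: (W, H) of the person image
    M, _ = cv2.estimateAffine2D(cloth_lms.astype(np.float32),
                                person_lms.astype(np.float32),
                                method=cv2.RANSAC)
    return cv2.warpAffine(cloth_img, M, out_size)
```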
8. Feature-Driven Super-Resolution for Object Detection [PDF] Back to Contents
Bin Wang, Tao Lu, Yanduo Zhang
Abstract: Some convolutional neural network (CNN) based super-resolution (SR) algorithms have recently yielded good visual performance on single images. However, most of them focus on perfect perceptual quality and ignore the specific needs of the subsequent detection task. This paper proposes a simple but powerful feature-driven super-resolution (FDSR) method to improve the detection performance on low-resolution (LR) images. First, the proposed method uses a feature-domain prior, extracted from an existing detector backbone, to guide the HR image reconstruction. Then, with the aligned features, FDSR updates the SR parameters for better detection performance. Compared with some state-of-the-art SR algorithms at a 4$\times$ scale factor, FDSR achieves better detection mAP on the MS COCO validation and VOC2007 databases, with good generalization to other detection networks.
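The core idea can be summarized as an extra feature-space term in the SR objective, computed with a frozen detector backbone. A sketch (the interface names are ours, not the paper's code):

```python
import torch
import torch.nn.functional as F

def feature_driven_sr_loss(sr_net, backbone, lr, hr, w_feat=1.0):
    # backbone: frozen feature extractor from an existing detector; the SR
    # output must reproduce its features on the ground-truth HR image.
    sr = sr_net(lr)
    pixel_loss = F.l1_loss(sr, hr)
    with torch.no_grad():
        feat_hr = backbone(hr)          # target features, no gradient
    feat_sr = backbone(sr)              # gradients flow back into sr_net
    feature_loss = F.l1_loss(feat_sr, feat_hr)
    return pixel_loss + w_feat * feature_loss
```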
9. Physically Realizable Adversarial Examples for LiDAR Object Detection [PDF] Back to Contents
James Tu, Mengye Ren, Siva Manivasagam, Ming Liang, Bin Yang, Richard Du, Frank Cheng, Raquel Urtasun
Abstract: Modern autonomous driving systems rely heavily on deep learning models to process point cloud sensory data; meanwhile, deep models have been shown to be susceptible to adversarial attacks with visually imperceptible perturbations. Despite the fact that this poses a security concern for the self-driving industry, there has been very little exploration in terms of 3D perception, as most adversarial attacks have only been applied to 2D flat images. In this paper, we address this issue and present a method to generate universal 3D adversarial objects to fool LiDAR detectors. In particular, we demonstrate that placing an adversarial object on the rooftop of any target vehicle hides the vehicle entirely from LiDAR detectors with a success rate of 80%. We report attack results on a suite of detectors using various input representations of point clouds. We also conduct a pilot study on adversarial defense using data augmentation. This is one step closer towards safer self-driving under unseen conditions from limited training data.
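In the abstract, the attack is an optimization over the object's geometry. A deliberately abstract sketch of that loop, where both lidar_sim (a differentiable LiDAR renderer) and detector are hypothetical placeholders for components the paper builds:

```python
import torch

def optimize_adversarial_mesh(vertices, faces, lidar_sim, detector,
                              steps=100, lr=0.01, eps=0.1):
    # Perturb the vertices of a rooftop mesh so the rendered point cloud
    # minimizes the detector's confidence on the target vehicle.
    delta = torch.zeros_like(vertices, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        cloud = lidar_sim(vertices + delta, faces)  # hypothetical renderer
        score = detector(cloud)                     # vehicle confidence
        opt.zero_grad()
        score.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                 # keep the shape plausible
    return (vertices + delta).detach()
```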
10. Future Video Synthesis with Object Motion Prediction [PDF] Back to Contents
Yue Wu, Rongrong Gao, Jaesik Park, Qifeng Chen
Abstract: We present an approach to predict future video frames given a sequence of continuous video frames in the past. Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics by decoupling the background scene and moving objects. The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects. The anticipated appearances are combined to create a reasonable video in the future. With this procedure, our method exhibits much less tearing or distortion artifact compared to other approaches. Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
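A sketch of the decoupled composition step for a single moving object: the background is warped by a predicted non-rigid flow and the object by a predicted affine transform, then the two are alpha-composited (shapes and names are our assumptions):

```python
import torch
import torch.nn.functional as F

def compose_future_frame(bg, bg_flow, obj, obj_mask, theta):
    # bg, obj: (1, 3, H, W); obj_mask: (1, 1, H, W)
    # bg_flow: (1, H, W, 2) offsets in normalized grid coordinates
    # theta:   (1, 2, 3) predicted affine matrix for the object
    H, W = bg.shape[-2:]
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0)
    warped_bg = F.grid_sample(bg, base + bg_flow, align_corners=True)
    grid = F.affine_grid(theta, obj.shape, align_corners=True)
    warped_obj = F.grid_sample(obj, grid, align_corners=True)
    warped_m = F.grid_sample(obj_mask, grid, align_corners=True)
    return warped_m * warped_obj + (1 - warped_m) * warped_bg
```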
11. Medical-based Deep Curriculum Learning for Improved Fracture Classification [PDF] Back to Contents
Amelia Jiménez-Sánchez, Diana Mateus, Sonja Kirchhoff, Chlodwig Kirchhoff, Peter Biberthaler, Nassir Navab, Miguel A. González Ballester, Gemma Piella
Abstract: Current deep-learning based methods do not easily integrate into clinical protocols, nor do they take full advantage of medical knowledge. In this work, we propose and compare several strategies relying on curriculum learning to support the classification of proximal femur fractures from X-ray images, a challenging problem as reflected by existing intra- and inter-expert disagreement. Our strategies are derived from knowledge such as medical decision trees and inconsistencies in the annotations of multiple experts, which allows us to assign a degree of difficulty to each training sample. We demonstrate that if we start learning from "easy" examples and move towards "hard" ones, the model can reach a better performance, even with fewer data. The evaluation is performed on the classification of a clinical dataset of about 1000 X-ray images. Our results show that, compared to class-uniform and random strategies, the proposed medical knowledge-based curriculum performs up to 15% better in terms of accuracy, achieving the performance of experienced trauma surgeons.
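One way to realize such a curriculum is a difficulty-ranked sampling distribution that anneals toward uniform; a sketch (the schedule is ours, while the difficulty scores would come from the medical knowledge sources the paper describes):

```python
import numpy as np

def curriculum_probs(difficulty, epoch, total_epochs, sharpness=5.0):
    # Easy-to-hard schedule: early epochs favour low-difficulty samples;
    # the distribution anneals toward uniform as training progresses.
    rank = np.argsort(np.argsort(difficulty))    # 0 = easiest sample
    temperature = 1.0 - epoch / total_epochs     # decays from 1 to 0
    logits = -sharpness * temperature * rank / len(difficulty)
    p = np.exp(logits - logits.max())
    return p / p.sum()

# e.g. batch = np.random.choice(len(d), 32, p=curriculum_probs(d, e, E))
```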
12. PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization [PDF] Back to Contents
Shunsuke Saito, Tomas Simon, Jason Saragih, Hanbyul Joo
Abstract: Recent advances in image-based 3D human shape estimation have been driven by the significant improvement in representation power afforded by deep neural networks. Although current approaches have demonstrated the potential in real world settings, they still fail to produce reconstructions with the level of detail often present in the input images. We argue that this limitation stems primarily from two conflicting requirements; accurate predictions require large context, but precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low resolution images as input to cover large spatial context, and produce less precise (or low resolution) 3D estimates as a result. We address this limitation by formulating a multi-level architecture that is end-to-end trainable. A coarse level observes the whole image at lower resolution and focuses on holistic reasoning. This provides context to a fine level which estimates highly detailed geometry by observing higher-resolution images. We demonstrate that our approach significantly outperforms existing state-of-the-art techniques on single image human shape reconstruction by fully leveraging 1k-resolution input images.
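The architecture can be sketched as two stacked implicit functions, with the fine MLP conditioned on the coarse MLP's intermediate embedding; feature extraction and pixel-aligned sampling are omitted, and the dimensions are illustrative rather than the paper's:

```python
import torch
import torch.nn as nn

class TwoLevelPIFu(nn.Module):
    # Coarse MLP reasons over low-resolution global features plus depth z;
    # fine MLP predicts occupancy from high-resolution local features
    # concatenated with the coarse MLP's embedding.
    def __init__(self, c_coarse=256, c_fine=64, c_embed=128):
        super().__init__()
        self.coarse = nn.Sequential(
            nn.Linear(c_coarse + 1, c_embed), nn.ReLU())  # +1 for depth z
        self.coarse_out = nn.Linear(c_embed, 1)
        self.fine = nn.Sequential(
            nn.Linear(c_fine + c_embed, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, feat_coarse, z, feat_fine):
        emb = self.coarse(torch.cat([feat_coarse, z], dim=-1))
        occ_coarse = torch.sigmoid(self.coarse_out(emb))
        occ_fine = torch.sigmoid(self.fine(torch.cat([feat_fine, emb], -1)))
        return occ_coarse, occ_fine
```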
13. Spatio-temporal Tubelet Feature Aggregation and Object Linking in Videos [PDF] Back to Contents
Daniel Cores, Víctor M. Brea, Manuel Mucientes
Abstract: This paper addresses the problem of how to exploit spatio-temporal information available in videos to improve the object detection precision. We propose a two-stage object detector called FANet based on short-term spatio-temporal feature aggregation to give a first detection set, and long-term object linking to refine these detections. Firstly, we generate a set of short tubelet proposals containing the object in $N$ consecutive frames. Then, we aggregate RoI pooled deep features through the tubelet using a temporal pooling operator that summarizes the information with a fixed size output independent of the number of input frames. On top of that, we define a double head implementation that we feed with spatio-temporal aggregated information for spatio-temporal object classification, and with spatial information extracted from the current frame for object localization and spatial classification. Furthermore, we also specialize each head branch architecture to better perform in each task, taking into account the input data. Finally, a long-term linking method builds long tubes using the previously calculated short tubelets to overcome detection errors. We have evaluated our model on the widely used ImageNet VID dataset, achieving an 80.9% mAP, which is the new state-of-the-art result for single models. Also, on the challenging small object detection dataset USC-GRAD-STDdb, our proposal outperforms the single frame baseline by 5.4% mAP.
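The temporal pooling operator only needs to reduce over frames and emit a fixed-size vector regardless of tubelet length; a minimal sketch (the exact operator is ours):

```python
import torch

def tubelet_pool(roi_feats):
    # roi_feats: (N, C) RoI-pooled features, one row per tubelet frame.
    # Output size is 2C no matter how many frames N the tubelet spans.
    mean = roi_feats.mean(dim=0)
    maxv = roi_feats.max(dim=0).values
    return torch.cat([mean, maxv], dim=-1)
```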
14. Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Analysis and a New Strategy [PDF] Back to Contents
Jaejun Yoo, Namhyuk Ahn, Kyung-Ah Sohn
Abstract: Data augmentation is an effective way to improve the performance of deep networks. Unfortunately, current methods are mostly developed for high-level vision tasks (e.g., classification) and few are studied for low-level vision tasks (e.g., image restoration). In this paper, we provide a comprehensive analysis of the existing augmentation methods applied to the super-resolution task. We find that methods that discard or heavily manipulate pixels or features hamper image restoration, where the spatial relationship is very important. Based on our analyses, we propose CutBlur, which cuts a low-resolution patch and pastes it to the corresponding high-resolution image region and vice versa. The key intuition of CutBlur is to enable a model to learn not only "how" but also "where" to super-resolve an image. By doing so, the model can understand "how much", instead of blindly learning to apply super-resolution to every given pixel. Our method consistently and significantly improves the performance across various scenarios, especially when the model size is big and the data is collected under real-world environments. We also show that our method improves other low-level vision tasks, such as denoising and compression artifact removal.
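CutBlur itself is a few lines: cut a random rectangle and swap its content between the bicubic-upsampled LR input and the HR ground truth. A sketch close to, but not taken from, the authors' implementation:

```python
import torch

def cutblur(lr_up, hr, alpha=0.7):
    # lr_up: upsampled LR image with the same (C, H, W) shape as hr.
    # Swapping a patch forces the model to learn *where* to super-resolve.
    _, h, w = hr.shape
    ch, cw = int(h * alpha * torch.rand(1)), int(w * alpha * torch.rand(1))
    cy = int(torch.randint(0, h - ch + 1, (1,)))
    cx = int(torch.randint(0, w - cw + 1, (1,)))
    inp = lr_up.clone()
    if torch.rand(1) < 0.5:                 # HR patch into the LR input
        inp[:, cy:cy + ch, cx:cx + cw] = hr[:, cy:cy + ch, cx:cx + cw]
    else:                                   # LR patch into the HR context
        inp = hr.clone()
        inp[:, cy:cy + ch, cx:cx + cw] = lr_up[:, cy:cy + ch, cx:cx + cw]
    return inp                              # the target remains hr
```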
15. Learning to Cluster Faces via Confidence and Connectivity Estimation [PDF] Back to Contents
Lei Yang, Dapeng Chen, Xiaohang Zhan, Rui Zhao, Chen Change Loy, Dahua Lin
Abstract: Face clustering is an essential tool for exploiting the unlabeled face data, and has a wide range of applications including face annotation and retrieval. Recent works show that supervised clustering can result in noticeable performance gain. However, they usually involve heuristic steps and require numerous overlapped subgraphs, severely restricting their accuracy and efficiency. In this paper, we propose a fully learnable clustering framework without requiring a large number of overlapped subgraphs. Instead, we transform the clustering problem into two sub-problems. Specifically, two graph convolutional networks, named GCN-V and GCN-E, are designed to estimate the confidence of vertices and the connectivity of edges, respectively. With the vertex confidence and edge connectivity, we can naturally organize more relevant vertices on the affinity graph and group them into clusters. Experiments on two large-scale benchmarks show that our method significantly improves clustering accuracy and thus performance of the recognition models trained on top, yet it is an order of magnitude more efficient than existing supervised methods.
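One way to read the two outputs: each vertex attaches to its most-connective higher-confidence neighbor, and cutting weak edges leaves a forest whose trees are the clusters. A simplified sketch of that grouping step (our reading of the GCN-V/GCN-E outputs, not the released code):

```python
import numpy as np

def clusters_from_confidence(conf, neighbors, conn, tau=0.8):
    # conf:      (N,) vertex confidences from GCN-V-style estimates
    # neighbors: list of neighbor-index arrays, one per vertex
    # conn:      list of matching edge-connectivity arrays (GCN-E-style)
    parent = np.arange(len(conf))
    for v in range(len(conf)):
        cand = [(c, u) for u, c in zip(neighbors[v], conn[v])
                if conf[u] > conf[v] and c >= tau]
        if cand:                       # attach to the strongest such edge
            parent[v] = max(cand)[1]

    def root(v):                       # follow parents to a cluster root
        while parent[v] != v:
            v = parent[v]
        return v

    return np.array([root(v) for v in range(len(conf))])
```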
16. Semantic Drift Compensation for Class-Incremental Learning [PDF] Back to Contents
Lu Yu, Bartłomiej Twardowski, Xialei Liu, Luis Herranz, Kai Wang, Yongmei Cheng, Shangling Jui, Joost van de Weijer
Abstract: Class-incremental learning of deep networks sequentially increases the number of classes to be classified. During training, the network has only access to data of one task at a time, where each task contains several classes. In this setting, networks suffer from catastrophic forgetting which refers to the drastic drop in performance on previous tasks. The vast majority of methods have studied this scenario for classification networks, where for each new task the classification layer of the network must be augmented with additional weights to make room for the newly added classes. Embedding networks have the advantage that new classes can be naturally included into the network without adding new weights. Therefore, we study incremental learning for embedding networks. In addition, we propose a new method to estimate the drift, called semantic drift, of features and compensate for it without the need of any exemplars. We approximate the drift of previous tasks based on the drift that is experienced by current task data. We perform experiments on fine-grained datasets, CIFAR100 and ImageNet-Subset. We demonstrate that embedding networks suffer significantly less from catastrophic forgetting. We outperform existing methods which do not require exemplars and obtain competitive results compared to methods which store exemplars. Furthermore, we show that our proposed SDC when combined with existing methods to prevent forgetting consistently improves results.
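The compensation step itself is small: drift vectors measured on current-task data are averaged with Gaussian weights around each old-class prototype, so no exemplars of old classes are needed. A sketch (sigma and the exact weighting are our choices):

```python
import numpy as np

def compensate_prototypes(prototypes, feats_old, feats_new, sigma=0.3):
    # prototypes:          (K, D) old-class prototypes (old feature space)
    # feats_old/feats_new: (N, D) current-task features before/after the
    #                      network is updated on the new task
    delta = feats_new - feats_old                                 # (N, D)
    d2 = ((feats_old[None] - prototypes[:, None]) ** 2).sum(-1)   # (K, N)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True) + 1e-8
    return prototypes + w @ delta      # shift each prototype by local drift
```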
17. Long-tail Visual Relationship Recognition with a Visiolinguistic Hubless Loss [PDF] Back to Contents
Sherif Abdelkarim, Panos Achlioptas, Jiaji Huang, Boyang Li, Kenneth Church, Mohamed Elhoseiny
Abstract: Scaling up the vocabulary and complexity of current visual understanding systems is necessary in order to bridge the gap between human and machine visual intelligence. However, a crucial impediment to this end lies in the difficulty of generalizing to data distributions that come from real-world scenarios. Typically such distributions follow Zipf's law which states that only a small portion of the collected object classes will have abundant examples (head); while most classes will contain just a few (tail). In this paper, we propose to study a novel task concerning the generalization of visual relationships that are on the distribution's tail, i.e. we investigate how to help AI systems to better recognize rare relationships, i.e. triples (S, P, O) where the subject S, predicate P, and/or the object O come from the tail of the corresponding distributions. To achieve this goal, we first introduce two large-scale visual-relationship detection benchmarks built upon the widely used Visual Genome and GQA datasets. We also propose an intuitive evaluation protocol that gives credit to classifiers who prefer concepts that are semantically close to the ground truth class according to wordNet- or word2vec-induced metrics. Finally, we introduce a visiolinguistic version of a Hubless loss which, as we show experimentally, consistently encourages classifiers to be more predictive of the tail classes while still being accurate on head classes. Our code and models are available on this http URL.
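The evaluation idea can be sketched as a similarity-thresholded "soft hit", given a matrix of L2-normalized class embeddings (assumed available from word2vec or WordNet; this is our illustration, not the paper's exact protocol):

```python
import numpy as np

def soft_hit(pred, gt, emb, thresh=0.7):
    # emb: (V, D) L2-normalized class embeddings; a prediction earns
    # partial credit when it is semantically close to the ground truth.
    sim = float(emb[pred] @ emb[gt])
    return sim if sim >= thresh else 0.0
```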
18. M2m: Imbalanced Classification via Major-to-minor Translation [PDF] 返回目录
Jaehyung Kim, Jongheon Jeong, Jinwoo Shin
Abstract: In most real-world scenarios, labeled training datasets are highly class-imbalanced, where deep neural networks suffer from generalizing to a balanced testing criterion. In this paper, we explore a novel yet simple way to alleviate this issue by augmenting less-frequent classes via translating samples (e.g., images) from more-frequent classes. This simple approach enables a classifier to learn more generalizable features of minority classes, by transferring and leveraging the diversity of the majority information. Our experimental results on a variety of class-imbalanced datasets show that the proposed method improves the generalization on minority classes significantly compared to other existing re-sampling or re-weighting methods. The performance of our method even surpasses those of previous state-of-the-art methods for the imbalanced classification.
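A minimal sketch of the translation step, assuming it is implemented as signed-gradient ascent on the classifier's logit for a chosen minority class; the stand-in model, class index, step size, and step count are illustrative, not the paper's exact procedure.

import torch
import torch.nn as nn

def translate_to_minority(model, x_major, minority_cls, steps=10, lr=0.1):
    # Perturb majority-class images until the classifier favors the chosen
    # minority class, yielding synthetic minority samples (M2m-style).
    x = x_major.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = -model(x)[:, minority_cls].sum()   # ascend the minority logit
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad.sign()               # signed-gradient step
            x.clamp_(0.0, 1.0)                    # stay in a valid image range
        x.grad.zero_()
    return x.detach()

# toy usage with a stand-in classifier
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x_synth = translate_to_minority(model, torch.rand(4, 3, 32, 32), minority_cls=7)

The synthesized samples would then be added to the minority class during training, exposing the classifier to more diverse minority examples than plain over-sampling would provide.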
19. Image Demoireing with Learnable Bandpass Filters [PDF]
Bolun Zheng, Shanxin Yuan, Gregory Slabaugh, Ales Leonardis
Abstract: Image demoireing is a multi-faceted image restoration task involving both texture and color restoration. In this paper, we propose a novel multiscale bandpass convolutional neural network (MBCNN) to address this problem. As an end-to-end solution, MBCNN respectively solves the two sub-problems. For texture restoration, we propose a learnable bandpass filter (LBF) to learn the frequency prior for moire texture removal. For color restoration, we propose a two-step tone mapping strategy, which first applies a global tone mapping to correct for a global color shift, and then performs local fine tuning of the color per pixel. Through an ablation study, we demonstrate the effectiveness of the different components of MBCNN. Experimental results on two public datasets show that our method outperforms state-of-the-art methods by a large margin (more than 2dB in terms of PSNR).
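A rough sketch of the two-step tone mapping: a single learned 3x3 color transform applied to every pixel (global step), followed by a convolution that refines the color per pixel (local step); the parameterization and layer sizes are assumptions.

import torch
import torch.nn as nn

class TwoStepToneMap(nn.Module):
    def __init__(self):
        super().__init__()
        self.global_mat = nn.Parameter(torch.eye(3))             # global color transform
        self.local = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # per-pixel fine tuning

    def forward(self, x):                                        # x: (B, 3, H, W)
        b, c, h, w = x.shape
        g = torch.einsum('ij,bjn->bin', self.global_mat, x.reshape(b, c, -1))
        g = g.reshape(b, c, h, w)                                # globally tone-mapped image
        return g + self.local(g)                                 # local residual correction

out = TwoStepToneMap()(torch.rand(2, 3, 64, 64))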
20. Two-shot Spatially-varying BRDF and Shape Estimation [PDF]
Mark Boss, Varun Jampani, Kihwan Kim, Hendrik P.A. Lensch, Jan Kautz
Abstract: Capturing the shape and spatially-varying appearance (SVBRDF) of an object from images is a challenging task that has applications in both computer vision and graphics. Traditional optimization-based approaches often need a large number of images taken from multiple views in a controlled environment. Newer deep learning-based approaches require only a few input images, but the reconstruction quality is not on par with optimization techniques. We propose a novel deep learning architecture with a stage-wise estimation of shape and SVBRDF. The previous predictions guide each estimation, and a joint refinement network later refines both SVBRDF and shape. We follow a practical mobile image capture setting and use unaligned two-shot flash and no-flash images as input. Both our two-shot image capture and network inference can run on mobile hardware. We also create a large-scale synthetic training dataset with domain-randomized geometry and realistic materials. Extensive experiments on both synthetic and real-world datasets show that our network trained on a synthetic dataset can generalize well to real-world images. Comparisons with recent approaches demonstrate the superior performance of the proposed approach.
21. More Grounded Image Captioning by Distilling Image-Text Matching Model [PDF]
Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang
Abstract: Visual attention not only improves the performance of image captioners, but also serves as a visual interpretation to qualitatively measure the caption rationality and model transparency. Specifically, we expect that a captioner can fix its attentive gaze on the correct objects while generating the corresponding words. This ability is also known as grounded image captioning. However, the grounding accuracy of existing captioners is far from satisfactory. To improve the grounding accuracy while retaining the captioning quality, it is expensive to collect the word-region alignment as strong supervision. To this end, we propose a Part-of-Speech (POS) enhanced image-text matching model, POS-SCAN, built on SCAN (Lee et al., 2018), as an effective knowledge distillation for more grounded image captioning. The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module. By showing benchmark experimental results, we demonstrate that conventional image captioners equipped with POS-SCAN can significantly improve the grounding accuracy without strong supervision. Last but not least, we explore the indispensable Self-Critical Sequence Training (SCST) (Rennie et al., 2017) in the context of grounded image captioning and show that the image-text matching score can serve as a reward for more grounded captioning (code: this https URL).
22. Single Image Optical Flow Estimation with an Event Camera [PDF]
Liyuan Pan, Miaomiao Liu, Richard Hartley
Abstract: Event cameras are bio-inspired sensors that asynchronously report intensity changes at microsecond resolution. DAVIS can capture high dynamics of a scene and simultaneously output high temporal resolution events and low frame-rate intensity images. In this paper, we propose an optical flow estimation approach based on a single (potentially blurred) image and its events. First, we demonstrate how events can be used to improve flow estimates. To this end, we encode the relation between flow and events effectively by presenting an event-based photometric consistency formulation. Then, we consider the special case of image blur caused by high dynamics in the visual environments and show that including the blur formation in our model further constrains flow estimation. This is in sharp contrast to existing works that ignore blurred images, while our formulation can naturally handle either blurred or sharp images to achieve accurate flow estimation. Finally, we reduce flow estimation, as well as image deblurring, to an alternative optimization problem over an objective function, using the primal-dual algorithm. Experimental results on both synthetic and real data (with blurred and non-blurred images) show the superiority of our model in comparison to state-of-the-art approaches.
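The photometric consistency builds on the standard event generation model: a pixel fires an event whenever its log intensity changes by a contrast threshold C, so the accumulated signed event count should match the log-intensity difference between two images. A toy check of that relation, with an assumed threshold and idealized noise-free counting:

import numpy as np

C = 0.2  # contrast threshold of the event camera (assumed value)

def event_consistency_residual(I0, I1, event_sum):
    # Residual of the event-based photometric consistency: zero wherever the
    # events fully explain the log-intensity change between the two images.
    dlog = np.log(I1 + 1e-6) - np.log(I0 + 1e-6)
    return dlog - C * event_sum

I0 = np.random.rand(8, 8) + 0.5
I1 = I0 * 1.3                                                   # scene brightens by 30%
events = np.round((np.log(I1 + 1e-6) - np.log(I0 + 1e-6)) / C)  # idealized event counts
print(np.abs(event_consistency_residual(I0, I1, events)).max()) # only quantization error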
23. Digit Recognition Using Convolution Neural Network [PDF]
Kajol Gupta
Abstract: In pattern recognition, digit recognition has always been a very challenging task. This paper aims to extract features that achieve better accuracy for digit recognition. Applications of digit recognition include passwords and bank-check processing, where valid user identification must be recognized. Earlier, several researchers used various machine learning algorithms for this pattern-recognition problem, e.g., KNN, SVM, and RFC. The main objective of this work is to obtain the highest accuracy, 99.15%, by using a convolutional neural network (CNN) to recognize digits without extensive pre-processing of the dataset.
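For reference, a small CNN of the kind typically used for MNIST-style digit recognition takes only a few lines; the layer sizes below are illustrative, not the exact architecture from the paper.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),                                           # 10 digit classes
)
logits = model(torch.rand(16, 1, 28, 28))  # raw MNIST-sized input, minimal pre-processing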
24. Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation [PDF]
Matteo Fabbri, Fabio Lanzi, Simone Calderara, Stefano Alletto, Rita Cucchiara
Abstract: In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at the constant runtime of 8 fps regardless of the number of subjects in the scene. Code and models available at this https URL .
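The compression idea can be sketched as a fully-convolutional 3D autoencoder over per-joint volumetric heatmaps; the joint count, channel widths, and volume size below are assumptions.

import torch
import torch.nn as nn

J = 14  # number of body joints (assumed)

encoder = nn.Sequential(
    nn.Conv3d(J, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv3d(32, 8, 4, stride=2, padding=1),             # dense intermediate code
)
decoder = nn.Sequential(
    nn.ConvTranspose3d(8, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(32, J, 4, stride=2, padding=1), nn.Sigmoid(),
)

heatmaps = torch.rand(2, J, 16, 32, 32)   # (batch, joints, depth, height, width)
code = encoder(heatmaps)                  # far smaller than the input volume
assert decoder(code).shape == heatmaps.shape

At training time a second network (the Code Predictor) would be trained to regress the compact code directly from the image, with the decoder recovering the full heatmaps afterwards.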
25. SoftSMPL: Data-driven Modeling of Nonlinear Soft-tissue Dynamics for Parametric Humans [PDF]
Igor Santesteban, Elena Garces, Miguel A. Otaduy, Dan Casas
Abstract: We present SoftSMPL, a learning-based method to model realistic soft-tissue dynamics as a function of body shape and motion. Datasets to learn such task are scarce and expensive to generate, which makes training models prone to overfitting. At the core of our method there are three key contributions that enable us to model highly realistic dynamics and better generalization capabilities than state-of-the-art methods, while training on the same data. First, a novel motion descriptor that disentangles the standard pose representation by removing subject-specific features; second, a neural-network-based recurrent regressor that generalizes to unseen shapes and motions; and third, a highly efficient nonlinear deformation subspace capable of representing soft-tissue deformations of arbitrary shapes. We demonstrate qualitative and quantitative improvements over existing methods and, additionally, we show the robustness of our method on a variety of motion capture databases.
26. Learning to Select Base Classes for Few-shot Classification [PDF]
Linjun Zhou, Peng Cui, Xu Jia, Shiqiang Yang, Qi Tian
Abstract: Few-shot learning has attracted intensive research attention in recent years. Many methods have been proposed to generalize a model learned from provided base classes to novel classes, but no previous work studies how to select base classes, or even whether different base classes will result in different generalization performance of the learned model. In this paper, we utilize a simple yet effective measure, the Similarity Ratio, as an indicator for the generalization performance of a few-shot model. We then formulate the base class selection problem as a submodular optimization problem over Similarity Ratio. We further provide theoretical analysis on the optimization lower bound of different optimization methods, which could be used to identify the most appropriate algorithm for different experimental settings. The extensive experiments on ImageNet, Caltech256 and CUB-200-2011 demonstrate that our proposed method is effective in selecting a better base dataset.
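Selection under a submodular objective is typically done greedily. The sketch below uses a facility-location-style coverage objective as a stand-in for the paper's Similarity Ratio (whose exact definition is not reproduced here); for monotone submodular objectives, greedy selection carries the classic (1 - 1/e) approximation guarantee, which is the kind of optimization lower bound the abstract alludes to.

import numpy as np

def greedy_select(S, k):
    # S[i, j]: similarity between candidate base class i and novel class j.
    # Greedily add the base class that most improves mean best-match coverage.
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(S.shape[0]):
            if i in chosen:
                continue
            val = S[chosen + [i]].max(axis=0).mean()  # coverage with i added
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen

S = np.random.rand(50, 10)   # 50 candidate base classes, 10 novel classes
print(greedy_select(S, k=5))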
27. Towards Achieving Adversarial Robustness by Enforcing Feature Consistency Across Bit Planes [PDF]
Sravanti Addepalli, Vivek B.S., Arya Baburaj, Gaurang Sriramanan, R. Venkatesh Babu
Abstract: As humans, we inherently perceive images based on their predominant features, and ignore noise embedded within lower bit planes. On the contrary, Deep Neural Networks are known to confidently misclassify images corrupted with meticulously crafted perturbations that are nearly imperceptible to the human eye. In this work, we attempt to address this problem by training networks to form coarse impressions based on the information in higher bit planes, and use the lower bit planes only to refine their prediction. We demonstrate that, by imposing consistency on the representations learned across differently quantized images, the adversarial robustness of networks improves significantly when compared to a normally trained model. Present state-of-the-art defenses against adversarial attacks require the networks to be explicitly trained using adversarial samples that are computationally expensive to generate. While such methods that use adversarial training continue to achieve the best results, this work paves the way towards achieving robustness without having to explicitly train on adversarial samples. The proposed approach is therefore faster, and also closer to the natural learning process in humans.
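Bit-plane decomposition itself is straightforward; the sketch below extracts the planes of an 8-bit image and builds the "coarse" input that keeps only the most-significant planes. The feature-consistency objective between the full and coarse images is the paper's contribution and is omitted; the number of planes kept is an assumption.

import numpy as np

def bit_planes(img_u8):
    # Decompose an 8-bit image into its 8 bit planes (k = 7 is the MSB).
    return [((img_u8 >> k) & 1).astype(np.uint8) for k in range(8)]

def keep_high_planes(img_u8, n=4):
    # Keep only the n most-significant bit planes (the "coarse impression").
    mask = np.uint8((0xFF << (8 - n)) & 0xFF)
    return img_u8 & mask

img = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
coarse = keep_high_planes(img, n=4)
# a consistency loss would then pull together features of `img` and `coarse`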
28. High-Performance Long-Term Tracking with Meta-Updater [PDF]
Kenan Dai, Yunhua Zhang, Dong Wang, Jianhua Li, Huchuan Lu, Xiaoyun Yang
Abstract: Long-term visual tracking has drawn increasing attention because it is much closer to practical applications than short-term tracking. Most top-ranked long-term trackers adopt the offline-trained Siamese architectures, thus, they cannot benefit from great progress of short-term trackers with online update. However, it is quite risky to straightforwardly introduce online-update-based trackers to solve the long-term problem, due to long-term uncertain and noisy observations. In this work, we propose a novel offline-trained Meta-Updater to address an important but unsolved problem: Is the tracker ready for updating in the current frame? The proposed meta-updater can effectively integrate geometric, discriminative, and appearance cues in a sequential manner, and then mine the sequential information with a designed cascaded LSTM module. Our meta-updater learns a binary output to guide the tracker's update and can be easily embedded into different trackers. This work also introduces a long-term tracking framework consisting of an online local tracker, an online verifier, a SiamRPN-based re-detector, and our meta-updater. Numerous experimental results on the VOT2018LT, VOT2019LT, OxUvALT, TLP, and LaSOT benchmarks show that our tracker performs remarkably better than other competing algorithms. Our project is available on the website: this https URL.
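The decision module can be pictured as a recurrent binary classifier over per-frame cue vectors; the sketch below uses a stacked LSTM as a stand-in for the designed cascaded LSTM, and the cue dimensionality and sequence length are assumptions.

import torch
import torch.nn as nn

class MetaUpdater(nn.Module):
    # Reads a sequence of per-frame cues (geometric, discriminative, and
    # appearance scores, assumed pre-computed) and decides whether the
    # tracker should update its model in the current frame.
    def __init__(self, cue_dim=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(cue_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # update / don't update

    def forward(self, cues):                      # cues: (batch, time, cue_dim)
        out, _ = self.lstm(cues)
        return self.head(out[:, -1])              # decision at the current frame

decision_logits = MetaUpdater()(torch.rand(4, 20, 8))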
29. Transfer Learning of Photometric Phenotypes in Agriculture Using Metadata [PDF]
Dan Halbersberg, Aharon Bar Hillel, Shon Mendelson, Daniel Koster, Lena Karol, Boaz Lerner
Abstract: Estimation of photometric plant phenotypes (e.g., hue, shine, chroma) in field conditions is important for decisions on the expected yield quality, fruit ripeness, and need for further breeding. Estimating these from images is difficult due to large variances in lighting conditions, shadows, and sensor properties. We combine the image and metadata regarding capturing conditions embedded into a network, enabling more accurate estimation and transfer between different conditions. Compared to a state-of-the-art deep CNN and a human expert, metadata embedding improves the estimation of the tomato's hue and chroma.
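A minimal sketch of the metadata embedding, assuming categorical capture metadata (e.g., a camera or session id) is embedded and concatenated with image features before the phenotype regression head; all sizes and the metadata vocabulary are illustrative.

import torch
import torch.nn as nn

class MetadataFusion(nn.Module):
    def __init__(self, img_dim=512, n_meta=10, meta_dim=16):
        super().__init__()
        self.meta_emb = nn.Embedding(n_meta, meta_dim)   # capture-condition embedding
        self.head = nn.Linear(img_dim + meta_dim, 3)     # e.g., hue / shine / chroma

    def forward(self, img_feat, meta_id):
        z = torch.cat([img_feat, self.meta_emb(meta_id)], dim=1)
        return self.head(z)

pred = MetadataFusion()(torch.rand(4, 512), torch.tensor([0, 3, 3, 7]))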
30. Evaluation of Model Selection for Kernel Fragment Recognition in Corn Silage [PDF]
Christoffer Bøgelund Rasmussen, Thomas B. Moeslund
Abstract: Model selection when designing deep learning systems for specific use-cases can be a challenging task, as many options exist and it can be difficult to know the trade-offs between them. Therefore, we investigate a number of state-of-the-art CNN models for the task of measuring kernel fragmentation in harvested corn silage. The models are evaluated across a number of feature extractors and image sizes in order to determine optimal model design choices based upon the trade-off between model complexity, accuracy, and speed. We show that accuracy improvements can be made with more complex meta-architectures, and that speed can be optimised by decreasing the image size with only slight losses in accuracy. Additionally, we show improvements of up to 20 percentage points in Average Precision at an Intersection over Union of 0.5, while also decreasing inference time in comparison to previously published work. This result for better model selection enables opportunities for creating systems that can aid farmers in improving their silage quality while harvesting.
31. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition [PDF]
Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, Feiyue Huang
Abstract: As an emerging topic in face recognition, designing margin-based loss functions can increase the feature margin between different classes for enhanced discriminability. More recently, the idea of mining-based strategies is adopted to emphasize the misclassified samples, achieving promising results. However, during the entire training process, the prior methods either do not explicitly emphasize samples based on their importance, leaving the hard samples not fully exploited, or they emphasize the effects of semi-hard/hard samples even at the early training stage, which may lead to convergence issues. In this work, we propose a novel Adaptive Curriculum Learning loss (CurricularFace) that embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, our CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages. In each stage, different samples are assigned different importance according to their corresponding difficulty. Extensive experimental results on popular benchmarks demonstrate the superiority of our CurricularFace over the state-of-the-art competitors.
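In spirit, the loss keeps the usual additive angular margin on the positive class while re-scaling hard negative logits by a factor that grows with a statistic t, updated as an exponential moving average over training, so hard samples gain weight as training matures. The sketch below is a simplified paraphrase under assumed hyperparameters, not a faithful reproduction of the published loss.

import torch
import torch.nn.functional as F

def curricular_margin_loss(cos_theta, labels, t, m=0.5, s=64.0, momentum=0.99):
    pos = cos_theta.gather(1, labels[:, None])              # cos(theta_y)
    pos_m = torch.cos(torch.acos(pos.clamp(-1, 1)) + m)     # cos(theta_y + m)
    t = momentum * t + (1 - momentum) * pos.mean().item()   # difficulty statistic
    hard = cos_theta > pos_m                                # "hard" negatives
    neg = torch.where(hard, cos_theta * (t + cos_theta), cos_theta)
    logits = neg.scatter(1, labels[:, None], pos_m)         # put margined positive back
    return F.cross_entropy(s * logits, labels), t

cos_theta = F.normalize(torch.randn(8, 100), dim=1)  # stand-in cosine similarities
loss, t = curricular_margin_loss(cos_theta, torch.randint(0, 100, (8,)), t=0.0)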
32. Creating Something from Nothing: Unsupervised Knowledge Distillation for Cross-Modal Hashing [PDF]
Hengtong Hu, Lingxi Xie, Richang Hong, Qi Tian
Abstract: In recent years, cross-modal hashing (CMH) has attracted increasing attention, mainly because of its potential to map content from different modalities, especially vision and language, into the same space, making cross-modal data retrieval efficient. There are two main frameworks for CMH, differing from each other in whether semantic supervision is required. Compared to the unsupervised methods, the supervised methods often enjoy more accurate results, but require much heavier labor in data annotation. In this paper, we propose a novel approach that enables guiding a supervised method using outputs produced by an unsupervised method. Specifically, we make use of teacher-student optimization for propagating knowledge. Experiments are performed on two popular CMH benchmarks, i.e., the MIRFlickr and NUS-WIDE datasets. Our approach outperforms all existing unsupervised methods by a large margin.
33. Graph Structured Network for Image-Text Matching [PDF]
Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, Yongdong Zhang
Abstract: Image-text matching has received growing interest since it bridges vision and language. The key challenge lies in how to learn correspondence between image and text. Existing works learn coarse correspondence based on object co-occurrence statistics, while failing to learn fine-grained phrase correspondence. In this paper, we present a novel Graph Structured Matching Network (GSMN) to learn fine-grained correspondence. The GSMN explicitly models object, relation and attribute as a structured phrase, which not only allows to learn correspondence of object, relation and attribute separately, but also benefits to learn fine-grained correspondence of structured phrase. This is achieved by node-level matching and structure-level matching. The node-level matching associates each node with its relevant nodes from another modality, where the node can be object, relation or attribute. The associated nodes then jointly infer fine-grained correspondence by fusing neighborhood associations at structure-level matching. Comprehensive experiments show that GSMN outperforms state-of-the-art methods on benchmarks, with relative Recall@1 improvements of nearly 7% and 2% on Flickr30K and MSCOCO, respectively. Code will be released at: this https URL.
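Node-level matching can be sketched as cross-modal attention: each textual node (an object, relation, or attribute phrase) attends over image nodes by cosine similarity and aggregates its matched visual context. The feature dimensions and temperature are assumptions.

import torch
import torch.nn.functional as F

def node_level_matching(img_nodes, txt_nodes, temperature=0.1):
    # Each text node attends to image nodes; `matched` is its visual context.
    sim = F.normalize(txt_nodes, dim=-1) @ F.normalize(img_nodes, dim=-1).T
    attn = F.softmax(sim / temperature, dim=-1)
    return attn @ img_nodes, sim

txt = torch.randn(5, 256)    # e.g., nodes of a parsed phrase "man riding horse"
img = torch.randn(36, 256)   # region features
matched, sim = node_level_matching(img, txt)

Structure-level matching would then fuse each node's association with those of its graph neighbors before inferring the final image-text score.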
34. An Efficient Agreement Mechanism in CapsNets By Pairwise Product [PDF]
Lei Zhao, Xiaohui Wang, Lei Huang
Abstract: Capsule networks (CapsNets) are capable of modeling visual hierarchical relationships, which is achieved by the "routing-by-agreement" mechanism. This paper proposes a pairwise agreement mechanism to build capsules, inspired by the feature interactions of factorization machines (FMs). The proposed method has a much lower computational complexity. We further propose a new CapsNet architecture that combines the strengths of residual networks in representing low-level visual features with those of CapsNets in modeling the relationships of parts to wholes. We conduct comprehensive experiments to compare the routing algorithms, including dynamic routing, EM routing, and our proposed FM agreement, based on both the original CapsNet architecture and our proposed one, and the results show that our method achieves both excellent performance and efficiency under a variety of situations.
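The factorization-machine device that makes pairwise interactions cheap is the identity sum_{i<j} v_i * v_j = 0.5 * ((sum_i v_i)^2 - sum_i v_i^2), which costs O(n) instead of O(n^2). A sketch applied to capsule votes; how the paper wires this term into routing is simplified away here.

import torch

def fm_pairwise_agreement(votes):          # votes: (batch, n_capsules, dim)
    s = votes.sum(dim=1)                   # sum of votes
    sq = (votes ** 2).sum(dim=1)           # sum of squared votes
    return 0.5 * (s ** 2 - sq)             # elementwise pairwise-product term

votes = torch.randn(2, 10, 16)             # 10 lower-level capsule votes of dim 16
agreement = fm_pairwise_agreement(votes)   # (2, 16), no O(n^2) pair loop needed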
35. Progressive Multi-Stage Learning for Discriminative Tracking [PDF]
Weichao Li, Xi Li, Omar Elfarouk Bourahla, Fuxian Huang, Fei Wu, Wei Liu, Zhiheng Wang, Hongmin Liu
Abstract: Visual tracking is typically solved as a discriminative learning problem that usually requires high-quality samples for online model adaptation. It is a critical and challenging problem to evaluate the training samples collected from previous predictions and employ sample selection by their quality to train the model. To tackle the above problem, we propose a joint discriminative learning scheme with the progressive multi-stage optimization policy of sample selection for robust visual tracking. The proposed scheme presents a novel time-weighted and detection-guided self-paced learning strategy for easy-to-hard sample selection, which is capable of tolerating relatively large intra-class variations while maintaining inter-class separability. Such a self-paced learning strategy is jointly optimized in conjunction with the discriminative tracking process, resulting in robust tracking results. Experiments on the benchmark datasets demonstrate the effectiveness of the proposed learning framework.
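For context, classic hard self-paced learning keeps only samples whose loss falls below a pace parameter that grows over training, moving from easy to hard samples; the paper's time weighting and detection guidance are omitted in this sketch.

import torch

def self_paced_weights(losses, lam):
    # Weight 1 for samples easier than the pace parameter lam, 0 otherwise.
    return (losses < lam).float()

losses = torch.tensor([0.2, 1.5, 0.7, 3.0])
for lam in (0.5, 1.0, 2.0):               # pace schedule: easy -> hard
    print(lam, self_paced_weights(losses, lam).tolist())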
36. Pose-guided Visible Part Matching for Occluded Person ReID [PDF]
Shang Gao, Jingya Wang, Huchuan Lu, Zimo Liu
Abstract: Occluded person re-identification is a challenging task, as appearance varies substantially with various obstacles, especially in crowded scenarios. To address this issue, we propose a Pose-guided Visible Part Matching (PVPM) method that jointly learns discriminative features with pose-guided attention and self-mines the part visibility in an end-to-end framework. Specifically, the proposed PVPM includes two key components: 1) a pose-guided attention (PGA) method for part feature pooling that exploits more discriminative local features; 2) a pose-guided visibility predictor (PVP) that estimates whether a part suffers occlusion or not. As there are no ground-truth training annotations for the occluded parts, we instead exploit the characteristic of part correspondence in positive pairs and self-mine the correspondence scores via graph matching. The generated correspondence scores are then utilized as pseudo-labels for the visibility predictor (PVP). Experimental results on three reported occluded benchmarks show that the proposed method achieves competitive performance against state-of-the-art methods. The source code is available at this https URL
37. Synthesis and Edition of Ultrasound Images via Sketch Guided Progressive Growing GANs [PDF]
Jiamin Liang, Xin Yang, Haoming Li, Yi Wang, Manh The Van, Haoran Dou, Chaoyu Chen, Jinghui Fang, Xiaowen Liang, Zixin Mai, Guowen Zhu, Zhiyi Chen, Dong Ni
Abstract: Ultrasound (US) is widely accepted in the clinic for anatomical structure inspection. However, lacking resources to practice US scanning, novices often struggle to learn the operation skills. Also, in the deep learning era, automated US image analysis is limited by the lack of annotated samples. Efficiently synthesizing realistic, editable and high-resolution US images can solve these problems. The task is challenging, and previous methods can only partially complete it. In this paper, we devise a new framework for US image synthesis. In particular, we first adopt a sketch generative adversarial network (Sgan) to introduce a background sketch upon the object mask in a conditioned generative adversarial network. With enriched sketch cues, Sgan can generate realistic US images with editable and fine-grained structure details. Although effective, Sgan struggles to generate high-resolution US images. To achieve this, we further implant the Sgan into a progressive growing scheme (PGSgan). By smoothly growing both generator and discriminator, PGSgan can gradually synthesize US images from low to high resolution. By synthesizing ovary and follicle US images, our extensive perceptual evaluation, user study and segmentation results prove the promising efficacy and efficiency of the proposed PGSgan.
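The "smoothly growing" step can be illustrated with the standard fade-in used by progressive GANs: the newly added high-resolution block is blended with the upsampled output of the previous stage, with a weight that ramps from 0 to 1 during training. The sketch below is a generic illustration of that trick, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fade_in(low_res_rgb, high_res_rgb, alpha):
    """Blend the previous-resolution output with the new block's output.

    low_res_rgb:  (N, C, H, W)   image from the already-trained stage.
    high_res_rgb: (N, C, 2H, 2W) image from the newly added stage.
    alpha ramps from 0 to 1, so the new layers are introduced
    smoothly instead of abruptly.
    """
    up = F.interpolate(low_res_rgb, scale_factor=2, mode="nearest")
    return (1.0 - alpha) * up + alpha * high_res_rgb

low = torch.randn(1, 3, 32, 32)
high = torch.randn(1, 3, 64, 64)
print(fade_in(low, high, alpha=0.3).shape)  # torch.Size([1, 3, 64, 64])
```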
38. Video Anomaly Detection for Smart Surveillance [PDF] 返回目录
Sijie Zhu, Chen Chen, Waqas Sultani
Abstract: In modern intelligent video surveillance systems, automatic anomaly detection through computer vision analytics plays a pivotal role which not only significantly increases monitoring efficiency but also reduces the burden on live monitoring. Anomalies in videos are broadly defined as events or activities that are unusual and signify irregular behavior. The goal of anomaly detection is to temporally or spatially localize the anomaly events in video sequences. Temporal localization (i.e. indicating the start and end frames of the anomaly event in a video) is referred to as frame-level detection. Spatial localization, which is more challenging, means to identify the pixels within each anomaly frame that correspond to the anomaly event. This setting is usually referred to as pixel-level detection. In this paper, we provide a brief overview of the recent research progress on video anomaly detection and highlight a few future research directions.
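Frame-level detection, as defined above, reduces to turning per-frame anomaly scores into temporal intervals. A minimal sketch, assuming a simple fixed threshold:

```python
def frames_to_intervals(scores, threshold=0.5):
    """Turn per-frame anomaly scores into (start, end) frame intervals.

    scores: list of floats, one anomaly score per frame.
    Returns inclusive frame-index intervals where the score stays above
    the threshold, i.e., frame-level temporal localization.
    """
    intervals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(scores) - 1))
    return intervals

print(frames_to_intervals([0.1, 0.7, 0.9, 0.4, 0.8, 0.8]))
# [(1, 2), (4, 5)]
```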
39. NBDT: Neural-Backed Decision Trees [PDF] 返回目录
Alvin Wan, Lisa Dunlap, Daniel Ho, Jihan Yin, Scott Lee, Henry Jin, Suzanne Petryk, Sarah Adel Bargal, Joseph E. Gonzalez
Abstract: Deep learning is being adopted in settings where accurate and justifiable predictions are required, ranging from finance to medical imaging. While there has been recent work providing post-hoc explanations for model predictions, there has been relatively little work exploring more directly interpretable models that can match state-of-the-art accuracy. Historically, decision trees have been the gold standard in balancing interpretability and accuracy. However, recent attempts to combine decision trees with deep learning have resulted in models that (1) achieve accuracies far lower than those of modern neural networks (e.g. ResNet) even on small datasets (e.g. MNIST), and (2) require significantly different architectures, forcing practitioners to pick between accuracy and interpretability. We forgo this dilemma by creating Neural-Backed Decision Trees (NBDTs) that (1) achieve neural network accuracy and (2) require no architectural changes to a neural network. NBDTs achieve accuracy within 1% of the base neural network on CIFAR10, CIFAR100 and TinyImageNet, using the recently state-of-the-art WideResNet, and within 2% of EfficientNet on ImageNet. This yields state-of-the-art explainable models on ImageNet, with NBDTs improving the baseline by ~14% to 75.30% top-1 accuracy. Furthermore, we show interpretability of our model's decisions both qualitatively and quantitatively via a semi-automatic process. Code and pretrained NBDTs can be found at this https URL.
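The hierarchical inference described above can be made concrete with a toy two-level hierarchy: each internal node makes a soft decision over its children, and a leaf's probability is the product of the decisions along its root-to-leaf path. The class names and random weights below are placeholders; in NBDTs the node weights are induced from the network's final linear layer.

```python
import torch
import torch.nn.functional as F

D = 16
x = torch.randn(D)              # backbone feature for one sample

# two-level hierarchy: root -> {animal, vehicle} -> 2 classes each
root_w = torch.randn(2, D)      # decides animal vs vehicle
animal_w = torch.randn(2, D)    # cat vs dog
vehicle_w = torch.randn(2, D)   # car vs truck

p_root = F.softmax(root_w @ x, dim=0)
p_animal = F.softmax(animal_w @ x, dim=0)
p_vehicle = F.softmax(vehicle_w @ x, dim=0)

# leaf probability = product of soft decisions along the root-to-leaf path
leaf_probs = torch.cat([p_root[0] * p_animal, p_root[1] * p_vehicle])
classes = ["cat", "dog", "car", "truck"]
print(classes[int(leaf_probs.argmax())], leaf_probs.sum())  # sums to 1
```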
40. BCNet: Learning Body and Cloth Shape from A Single Image [PDF] 返回目录
Boyi Jiang, Juyong Zhang, Yang Hong, Jinhao Luo, Ligang Liu, Hujun Bao
Abstract: In this paper, we consider the problem of automatically reconstructing both garment and body shapes from a single near-front-view RGB image. To this end, we propose a layered garment representation on top of SMPL and novelly make the skinning weights of the garment independent of the body mesh, which significantly improves the expressive ability of our garment model. Compared with existing methods, our method can support more garment categories, such as skirts, and recover more accurate garment geometry. To train our model, we construct two large-scale datasets with ground-truth body and garment geometries as well as paired color images. Compared with a single mesh or a non-parametric representation, our method achieves more flexible control with separate meshes, and makes applications such as re-posing, garment transfer, and garment texture mapping possible.
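The role of garment-specific skinning weights is easiest to see in plain linear blend skinning (LBS): each layer (body or garment) is posed with its own per-vertex weight matrix over the same joint transforms. The sketch below is generic LBS, assuming NumPy and homogeneous 4x4 joint transforms, not the authors' code.

```python
import numpy as np

def linear_blend_skinning(verts, weights, joint_transforms):
    """Pose vertices with per-vertex skinning weights (LBS).

    verts:            (V, 3) rest-pose vertices (body or garment layer).
    weights:          (V, J) skinning weights, rows sum to 1.  Giving the
                      garment mesh its own weight matrix, independent of
                      the body mesh, is the key idea paraphrased above.
    joint_transforms: (J, 4, 4) rigid transform per joint.
    """
    V = verts.shape[0]
    homo = np.concatenate([verts, np.ones((V, 1))], axis=1)        # (V, 4)
    # blend each joint's transform by the vertex's weights
    blended = np.einsum("vj,jab->vab", weights, joint_transforms)  # (V,4,4)
    posed = np.einsum("vab,vb->va", blended, homo)                 # (V, 4)
    return posed[:, :3]

# identity transforms leave the mesh at rest pose
V, J = 4, 2
verts = np.random.rand(V, 3)
weights = np.random.rand(V, J)
weights /= weights.sum(axis=1, keepdims=True)
transforms = np.stack([np.eye(4), np.eye(4)])
print(np.allclose(linear_blend_skinning(verts, weights, transforms), verts))
```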
41. Region Proposal Network with Graph Prior and IoU-Balance Loss for Landmark Detection in 3D Ultrasound [PDF] 返回目录
Chaoyu Chen, Xin Yang, Ruobing Huang, Wenlong Shi, Shengfeng Liu, Mingrong Lin, Yuhao Huang, Yong Yang, Yuanji Zhang, Huanjia Luo, Yankai Huang, Yi Xiong, Dong Ni
Abstract: 3D ultrasound (US) can facilitate detailed prenatal examinations for fetal growth monitoring. To analyze a 3D US volume, it is fundamental to identify anatomical landmarks of the evaluated organs accurately. Typical deep learning methods usually regress the coordinates directly or involve heatmap matching. However, these methods struggle to deal with volumes with large sizes and the highly varying positions and orientations of fetuses. In this work, we exploit an object detection framework to detect landmarks in 3D fetal facial US volumes. By regressing multiple parameters of the landmark-centered bounding box (B-box) with strict criteria, the proposed model is able to pinpoint the exact location of the targeted landmarks. Specifically, the model uses a 3D region proposal network (RPN) to generate 3D candidate regions, followed by several 3D classification branches to select the best candidate. It also adopts an IoU-balance loss to improve communication between branches, which benefits the learning process. Furthermore, it leverages a distance-based graph prior to regularize the training, which helps to reduce false-positive predictions. The performance of the proposed framework is evaluated on a 3D US dataset on the task of detecting five key fetal facial landmarks. The results show that the proposed method outperforms some state-of-the-art methods in efficacy and efficiency.
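One plausible form of an IoU-balanced classification loss, sketched below, reweights each positive example's cross-entropy term by its IoU with the matched ground truth, so that well-localized candidates dominate the classification signal. The exponent and exact weighting are assumptions for illustration, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def iou_balanced_cls_loss(logits, labels, ious, eta=1.5, eps=1e-6):
    """Classification loss reweighted by localization quality.

    logits: (N,) raw scores for N candidate boxes.
    labels: (N,) float binary ground-truth labels (0. or 1.).
    ious:   (N,) IoU of each candidate with its matched ground truth.
    Positives with higher IoU get larger weight (iou ** eta), tying the
    classification branch to localization quality.
    """
    ce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    weight = torch.where(labels > 0, ious.clamp(min=eps) ** eta,
                         torch.ones_like(ious))
    return (weight * ce).sum() / (weight.sum() + eps)

logits = torch.randn(5)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])
ious = torch.tensor([0.9, 0.0, 0.3, 0.0, 0.0])
print(iou_balanced_cls_loss(logits, labels, ious))
```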
42. Shared Cross-Modal Trajectory Prediction for Autonomous Driving [PDF] 返回目录
Chiho Choi
Abstract: We propose a framework for predicting future trajectories of traffic agents in highly interactive environments. Based on the fact that autonomous driving vehicles are equipped with various types of sensors (e.g., LiDAR scanner, RGB camera, etc.), our work aims to benefit from the use of multiple input modalities that are complementary to each other. The proposed approach is composed of two stages: (i) feature encoding, where we discover the motion behavior of the target agent with respect to other directly and indirectly observable influences, extracting such behaviors from multiple perspectives such as the top-down and frontal views; (ii) cross-modal embedding, where we embed a set of learned behavior representations into a single cross-modal latent space. We construct a generative model and formulate the objective functions with an additional regularizer specifically designed for future prediction. An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.
43. Semi-Supervised Cervical Dysplasia Classification With Learnable Graph Convolutional Network [PDF] 返回目录
Yanglan Ou, Yuan Xue, Ye Yuan, Tao Xu, Vincent Pisztora, Jia Li, Xiaolei Huang
Abstract: Cervical cancer is the second most prevalent cancer affecting women today. As the early detection of cervical carcinoma relies heavily upon screening and pre-clinical testing, digital cervicography has great potential as a primary or auxiliary screening tool, especially in low-resource regions due to its low cost and easy access. Although an automated cervical dysplasia detection system has been desirable, traditional fully-supervised training of such systems requires large amounts of annotated data which are often labor-intensive to collect. To alleviate the need for much manual annotation, we propose a novel graph convolutional network (GCN) based semi-supervised classification model that can be trained with fewer annotations. In existing GCNs, graphs are constructed with fixed features and can not be updated during the learning process. This limits their ability to exploit new features learned during graph convolution. In this paper, we propose a novel and more flexible GCN model with a feature encoder that adaptively updates the adjacency matrix during learning and demonstrate that this model design leads to improved performance. Our experimental results on a cervical dysplasia classification dataset show that the proposed framework outperforms previous methods under a semi-supervised setting, especially when the labeled samples are scarce.
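A minimal sketch of a graph convolution layer with a learnable adjacency: a small feature encoder produces embeddings whose pairwise similarities, row-normalized, serve as the adjacency matrix, so the graph is re-estimated from features as training proceeds, the property argued for above. The encoder and similarity choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGCNLayer(nn.Module):
    """Graph convolution whose adjacency is re-estimated from features.

    Instead of a fixed, precomputed graph, the adjacency is derived from
    an encoded pairwise similarity of the current node features and is
    therefore updated as learning proceeds.
    """

    def __init__(self, in_dim, out_dim, enc_dim=64):
        super().__init__()
        self.encoder = nn.Linear(in_dim, enc_dim)   # feature encoder
        self.weight = nn.Linear(in_dim, out_dim)

    def forward(self, x):                # x: (N, in_dim) node features
        z = self.encoder(x)              # (N, enc_dim)
        sim = z @ z.t()                  # pairwise similarities
        adj = F.softmax(sim, dim=1)      # row-normalized, learnable graph
        return F.relu(self.weight(adj @ x))

layer = AdaptiveGCNLayer(32, 16)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 16])
```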
44. Boundary-Aware Dense Feature Indicator for Single-Stage 3D Object Detection from Point Clouds [PDF] 返回目录
Guodong Xu, Wenxiao Wang, Zili Liu, Liang Xie, Zheng Yang, Haifeng Liu, Deng Cai
Abstract: 3D object detection based on point clouds has become more and more popular. Some methods propose localizing 3D objects directly from raw point clouds to avoid information loss. However, these methods come with complex structures and significant computational overhead, limiting their broader application in real-time scenarios. Other methods choose to first transform the point cloud data into compact tensors and leverage off-the-shelf 2D detectors to propose 3D objects, which is much faster and achieves state-of-the-art results. However, because of the inconsistency between 2D and 3D data, we argue that the performance of compact tensor-based 3D detectors is restricted if we use 2D detectors without corresponding modification. Specifically, the distribution of point clouds is uneven, with most points gathering on the boundaries of objects, while detectors for 2D data always extract features evenly. Motivated by this observation, we propose the DENse Feature Indicator (DENFI), a universal module that helps 3D detectors focus on the densest region of the point clouds in a boundary-aware manner. Moreover, DENFI is lightweight and guarantees real-time speed when applied to 3D object detectors. Experiments on the KITTI dataset show that DENFI improves the performance of the baseline single-stage detector remarkably, achieving new state-of-the-art performance among previous 3D detectors, including both two-stage and multi-sensor fusion methods, in terms of mAP, with a 34 FPS detection speed.
45. Spatio-Temporal Action Detection with Multi-Object Interaction [PDF] 返回目录
Huijuan Xu, Lizhi Yang, Stan Sclaroff, Kate Saenko, Trevor Darrell
Abstract: Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube". Nowadays, most spatio-temporal action detection datasets (e.g. UCF101-24, AVA, DALY) are annotated with action tubes that contain a single person performing the action, thus the predominant action detection models simply employ a person detection and tracking pipeline for localization. However, when the action is defined as an interaction between multiple objects, such methods may fail since each bounding box in the action tube contains multiple objects instead of one person. In this paper, we study the spatio-temporal action detection problem with multi-object interaction. We introduce a new dataset that is annotated with action tubes containing multi-object interactions. Moreover, we propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously. Our spatial regression may enclose multiple objects participating in the action. During test time, we simply connect the regressed bounding boxes within the predicted temporal duration using a simple heuristic. We report the baseline results of our proposed model on this new dataset, and also show competitive results on the standard benchmark UCF101-24 using only RGB input.
46. Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets without Superior Knowledge [PDF] 返回目录
Long Zhao, Xi Peng, Yuxiao Chen, Mubbasir Kapadia, Dimitris N. Metaxas
Abstract: Cross-modal knowledge distillation deals with transferring knowledge from a model trained with superior modalities (Teacher) to another model trained with weak modalities (Student). Existing approaches require that paired training examples exist in both modalities. However, accessing the data from superior modalities may not always be feasible. For example, in the case of 3D hand pose estimation, depth maps, point clouds, or stereo images usually capture better hand structures than RGB images, but most of them are expensive to collect. In this paper, we propose a novel scheme to train the Student on a Target dataset where the Teacher is unavailable. Our key idea is to generalize the distilled cross-modal knowledge learned from a Source dataset, which contains paired examples from both modalities, to the Target dataset by modeling knowledge as priors on the parameters of the Student. We name our method "Cross-Modal Knowledge Generalization" and demonstrate that our scheme results in competitive performance for 3D hand pose estimation on standard benchmark datasets.
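A simple instantiation of "knowledge as priors" is a quadratic penalty that pulls the Student's parameters on the Target dataset toward values distilled on the Source dataset. The quadratic form and weighting below are assumptions for illustration; the paper's exact regularizer may differ.

```python
import torch

def prior_regularized_loss(task_loss, student_params, prior_params, lam=1e-3):
    """Treat knowledge distilled on the Source dataset as a parameter prior.

    On the Target dataset, where no Teacher exists, the Student minimizes
    its task loss plus a penalty pulling its parameters toward prior
    values learned with paired Source data.
    """
    reg = sum(((p - q) ** 2).sum() for p, q in zip(student_params, prior_params))
    return task_loss + lam * reg

student = torch.nn.Linear(4, 2)
prior = [p.detach().clone() for p in student.parameters()]
loss = prior_regularized_loss(torch.tensor(0.5), student.parameters(), prior)
print(loss)  # 0.5, since the student still equals the prior
```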
47. Manifold-Aware CycleGAN for High Resolution Structural-to-DTI Synthesis [PDF] 返回目录
Benoit Anctil-Robitaille, Christian Desrosiers, Herve Lombaert
Abstract: Unpaired image-to-image translation has been applied successfully to natural images but has received very little attention for manifold-valued data such as in diffusion tensor imaging (DTI). The non-Euclidean nature of DTI prevents current generative adversarial networks (GANs) from generating plausible images and has mostly limited their application to diffusion MRI scalar maps, such as fractional anisotropy (FA) or mean diffusivity (MD). Even though these scalar maps are clinically useful, they mostly ignore fiber orientations and therefore have limited applications for analyzing brain fibers, for instance, impairing fiber tractography. Here, we propose a manifold-aware CycleGAN that learns to generate high-resolution DTI from unpaired T1w images. We formulate the objective as a Wasserstein distance minimization problem between data distributions on a Riemannian manifold of symmetric positive definite 3x3 matrices, SPD(3), using adversarial and cycle-consistency losses. To ensure that the generated diffusion tensors lie on the SPD(3) manifold, we exploit the theoretical properties of the exponential and logarithm maps. We demonstrate that, unlike standard GANs, our method is able to generate realistic high-resolution DTI that can be used to compute diffusion-based metrics and run fiber tractography algorithms. To evaluate our model's performance, we compute the cosine similarity between the generated tensors' principal orientations and their ground-truth orientations, and the mean squared error (MSE) of the derived FA values. We demonstrate that our method produces up to 8 times better FA MSE than a standard CycleGAN and 30% better cosine similarity than a manifold-aware Wasserstein GAN, while synthesizing sharp high-resolution DTI.
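The exponential and logarithm maps mentioned above have closed forms for SPD matrices via eigendecomposition, which is what makes the manifold constraint practical: predict in the tangent (log) space and map back with exp. A minimal sketch:

```python
import torch

def spd_log(S):
    """Matrix logarithm of a symmetric positive definite tensor.

    S: (..., 3, 3) SPD matrices (e.g., diffusion tensors).  Computed via
    eigendecomposition: log(S) = U diag(log(w)) U^T.
    """
    w, U = torch.linalg.eigh(S)
    return U @ torch.diag_embed(torch.log(w)) @ U.transpose(-1, -2)

def spd_exp(L):
    """Inverse map: matrix exponential of a symmetric tensor."""
    w, U = torch.linalg.eigh(L)
    return U @ torch.diag_embed(torch.exp(w)) @ U.transpose(-1, -2)

# round trip on a random SPD(3) matrix
A = torch.randn(3, 3)
S = A @ A.t() + 3 * torch.eye(3)  # make it SPD
print(torch.allclose(spd_exp(spd_log(S)), S, atol=1e-4))  # True
```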
48. The Edge of Depth: Explicit Constraints between Segmentation and Depth [PDF] 返回目录
Shengjie Zhu, Garrick Brazil, Xiaoming Liu
Abstract: In this work we study the mutual benefits of two common computer vision tasks, self-supervised depth estimation and semantic segmentation from images. For example, to help unsupervised monocular depth estimation, constraints from semantic segmentation have been explored implicitly, such as sharing and transforming features. In contrast, we propose to explicitly measure the border consistency between segmentation and depth, and minimize it in a greedy manner by iteratively supervising the network towards a locally optimal solution. This is partially motivated by our observation that semantic segmentation, even trained with limited ground truth (200 images of KITTI), can offer more accurate borders than any (monocular or stereo) image-based depth estimation. Through extensive experiments, our proposed approach advances the state of the art of unsupervised monocular depth estimation on KITTI.
49. Towards Lifelong Self-Supervision For Unpaired Image-to-Image Translation [PDF] 返回目录
Victor Schmidt, Makesh Narsimhan Sreedhar, Mostafa ElAraby, Irina Rish
Abstract: Unpaired Image-to-Image Translation (I2IT) tasks often suffer from a lack of data, a problem which self-supervised learning (SSL) has recently been very popular and successful at tackling. Leveraging auxiliary tasks such as rotation prediction or generative colorization, SSL can produce better and more robust representations in a low-data regime. Training such tasks along with an I2IT task is, however, computationally intractable as model size and the number of tasks grow. On the other hand, learning sequentially could incur catastrophic forgetting of previously learned tasks. To alleviate this, we introduce Lifelong Self-Supervision (LiSS) as a way to pre-train an I2IT model (e.g., CycleGAN) on a set of self-supervised auxiliary tasks. By keeping an exponential moving average of past encoders and distilling the accumulated knowledge, we are able to maintain the network's validation performance on a number of tasks without any form of replay, parameter isolation or retraining techniques typically used in continual learning. We show that models trained with LiSS perform better on past tasks, while also being more robust than the CycleGAN baseline to color bias and entity entanglement (when two entities are very close).
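One way to read the mechanism above: keep an exponential moving average (EMA) of the encoder's weights as a "past" encoder, and add a feature-matching loss that distills its outputs into the current encoder. The EMA decay and the MSE distillation target below are assumptions, not the authors' exact choices.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(ema_model, model, decay=0.999):
    """Keep an exponential moving average of past encoder weights."""
    with torch.no_grad():
        for pe, p in zip(ema_model.parameters(), model.parameters()):
            pe.mul_(decay).add_(p, alpha=1.0 - decay)

def distill_loss(model, ema_model, x):
    """Pull current encoder features toward the EMA ("past") encoder.

    The EMA encoder stores accumulated knowledge from earlier
    self-supervised tasks; matching its features limits catastrophic
    forgetting without replay or parameter isolation.
    """
    with torch.no_grad():
        target = ema_model(x)
    return F.mse_loss(model(x), target)

encoder = torch.nn.Linear(8, 4)
ema_encoder = copy.deepcopy(encoder)
ema_update(ema_encoder, encoder)
print(distill_loss(encoder, ema_encoder, torch.randn(2, 8)))
```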
50. Deep Semantic Matching with Foreground Detection and Cycle-Consistency [PDF] 返回目录
Yun-Chun Chen, Po-Hsiang Huang, Li-Yu Yu, Jia-Bin Huang, Ming-Hsuan Yang, Yen-Yu Lin
Abstract: Establishing dense semantic correspondences between object instances remains a challenging problem due to background clutter, significant scale and pose differences, and large intra-class variations. In this paper, we address weakly supervised semantic matching based on a deep network where only image pairs without manual keypoint correspondence annotations are provided. To facilitate network training with this weaker form of supervision, we 1) explicitly estimate the foreground regions to suppress the effect of background clutter and 2) develop cycle-consistent losses to enforce the predicted transformations across multiple images to be geometrically plausible and consistent. We train the proposed model using the PF-PASCAL dataset and evaluate the performance on the PF-PASCAL, PF-WILLOW, and TSS datasets. Extensive experimental results show that the proposed approach performs favorably against the state-of-the-art methods.
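The cycle-consistent loss can be sketched with a simplification to 2D affine transformations: composing the predicted A-to-B and B-to-A mappings should return sampled points to where they started, and the drift is penalized. The paper's transformations may be richer; this is an illustration of the constraint only.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(theta_ab, theta_ba, points):
    """Penalize forward-backward transformation drift.

    theta_ab, theta_ba: (2, 3) affine matrices predicted for A->B and
    B->A.  points: (N, 2) sample coordinates in image A.  Mapping
    A->B->A should return each point to its starting location.
    """
    homo = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)  # (N,3)
    to_b = homo @ theta_ab.t()                                         # (N,2)
    homo_b = torch.cat([to_b, torch.ones(to_b.shape[0], 1)], dim=1)
    back = homo_b @ theta_ba.t()
    return F.mse_loss(back, points)

pts = torch.rand(64, 2)
identity = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(cycle_consistency_loss(identity, identity, pts))  # ~0
```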
51. Learning Generative Models of Tissue Organization with Supervised GANs [PDF] 返回目录
Ligong Han, Robert F. Murphy, Deva Ramanan
Abstract: A key step in understanding the spatial organization of cells and tissues is the ability to construct generative models that accurately reflect that organization. In this paper, we focus on building generative models of electron microscope (EM) images in which the positions of cell membranes and mitochondria have been densely annotated, and propose a two-stage procedure that produces realistic images using Generative Adversarial Networks (or GANs) in a supervised way. In the first stage, we synthesize a label "image" given a noise "image" as input, which then provides supervision for EM image synthesis in the second stage. The full model naturally generates label-image pairs. We show that accurate synthetic EM images are produced using assessment via (1) shape features and global statistics, (2) segmentation accuracies, and (3) user studies. We also demonstrate further improvements by enforcing a reconstruction loss on intermediate synthetic labels and thus unifying the two stages into one single end-to-end framework.
52. Revisiting Few-shot Activity Detection with Class Similarity Control [PDF] 返回目录
Huijuan Xu, Ximeng Sun, Eric Tzeng, Abir Das, Kate Saenko, Trevor Darrell
Abstract: Many interesting events in the real world are rare, making pre-annotated, machine-learning-ready videos a rarity in consequence. Thus, temporal activity detection models that are able to learn from a few examples are desirable. In this paper, we present a conceptually simple and general yet novel framework for few-shot temporal activity detection based on proposal regression, which detects the start and end times of the activities in untrimmed videos. Our model is end-to-end trainable, takes into account the frame-rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples. We experiment on three large-scale benchmarks for temporal activity detection (the ActivityNet1.2, ActivityNet1.3 and THUMOS14 datasets) in a few-shot setting. We also study the effect on performance of different amounts of overlap with the activities used to pretrain the video classification backbone, and propose corrective measures for future works in this domain. Our code will be made available.
53. EOLO: Embedded Object Segmentation only Look Once [PDF] 返回目录
Longfei Zeng, Mohammed Sabah
Abstract: In this paper, we introduce an anchor-free and single-shot instance segmentation method which is conceptually simple, with 3 independent branches, fully convolutional, and can be easily embedded into mobile and embedded devices. Our method, referred to as EOLO, reformulates the instance segmentation problem as predicting semantic segmentation and distinguishing overlapping objects, through instance center classification and 4D distance regression on each pixel. Moreover, we propose an effective loss function to deal with sampling high-quality center-of-gravity examples and optimizing the 4D distance regression, which can significantly improve the mAP performance. Without any bells and whistles, EOLO achieves 27.7$\%$ mask mAP at IoU50 and reaches 30 FPS on a 1080Ti GPU, with single-model and single-scale training/testing on the challenging COCO2017 dataset. For the first time, we compare the different treatments of instance segmentation in recent methods, in terms of the up-bottom, down-up, and direct-predict paradigms. Then we illustrate our model and present related experiments and results. We hope that the proposed EOLO framework can serve as a fundamental baseline for single-shot instance segmentation in real-time industrial scenarios.
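The "4D distance regression on each pixel" decodes into boxes the way anchor-free detectors do: a pixel classified as an instance center predicts its distances to the left, top, right, and bottom instance boundaries. A minimal decoding sketch (tensor shapes are assumptions):

```python
import torch

def decode_boxes(centers, dists):
    """Recover boxes from per-pixel 4D distance regression.

    centers: (N, 2) pixel coordinates (x, y) classified as instance centers.
    dists:   (N, 4) predicted distances (left, top, right, bottom) from
             each center to its instance boundary.
    Returns (N, 4) boxes as (x1, y1, x2, y2).
    """
    x, y = centers[:, 0], centers[:, 1]
    l, t, r, b = dists.unbind(dim=1)
    return torch.stack([x - l, y - t, x + r, y + b], dim=1)

c = torch.tensor([[50.0, 40.0]])
d = torch.tensor([[10.0, 5.0, 12.0, 8.0]])
print(decode_boxes(c, d))  # tensor([[40., 35., 62., 48.]])
```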
54. StyleRig: Rigging StyleGAN for 3D Control over Portrait Images [PDF] 返回目录
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, Christian Theobalt
Abstract: StyleGAN generates photorealistic portrait images of faces with eyes, teeth, hair and context (neck, shoulders, background), but lacks a rig-like control over semantic face parameters that are interpretable in 3D, such as face pose, expressions, and scene illumination. Three-dimensional morphable face models (3DMMs), on the other hand, offer control over the semantic parameters, but lack photorealism when rendered and only model the face interior, not other parts of a portrait image (hair, mouth interior, background). We present the first method to provide a face rig-like control over a pretrained and fixed StyleGAN via a 3DMM. A new rigging network, RigNet, is trained between the 3DMM's semantic parameters and StyleGAN's input. The network is trained in a self-supervised manner, without the need for manual annotations. At test time, our method generates portrait images with the photorealism of StyleGAN and provides explicit control over the 3D semantic parameters of the face.
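A hedged sketch of what a RigNet-like mapper could look like: an MLP that takes a StyleGAN latent plus target 3DMM parameters and predicts an edited latent. The dimensions and the residual design are assumptions for illustration, not the published architecture.

import torch
import torch.nn as nn

class RigNetSketch(nn.Module):
    """Map a StyleGAN latent plus target 3DMM parameters to an edited latent.

    Dimensions are illustrative: a 512-d latent and a 3DMM vector packing
    pose, expression, and illumination coefficients.
    """
    def __init__(self, latent_dim=512, param_dim=75):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + param_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, w, params):
        # Predict a residual so that a zero edit leaves the latent near-unchanged.
        return w + self.mlp(torch.cat([w, params], dim=-1))

rig = RigNetSketch()
w_edit = rig(torch.randn(4, 512), torch.randn(4, 75))
print(w_edit.shape)  # torch.Size([4, 512])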
55. Graph Domain Adaptation for Alignment-Invariant Brain Surface Segmentation [PDF] 返回目录
Karthik Gopinath, Christian Desrosiers, Herve Lombaert
Abstract: The varying cortical geometry of the brain creates numerous challenges for its analysis. Recent developments have enabled learning surface data directly across multiple brain surfaces via graph convolutions on cortical data. However, current graph learning algorithms do fail when brain surface data are misaligned across subjects, thereby affecting their ability to deal with data from multiple domains. Adversarial training is widely used for domain adaptation to improve the segmentation performance across domains. In this paper, adversarial training is exploited to learn surface data across inconsistent graph alignments. This novel approach comprises a segmentator that uses a set of graph convolution layers to enable parcellation directly across brain surfaces in a source domain, and a discriminator that predicts a graph domain from segmentations. More precisely, the proposed adversarial network learns to generalize a parcellation across both, source and target domains. We demonstrate an 8% mean improvement in performance over a non-adversarial training strategy applied on multiple target domains extracted from MindBoggle, the largest publicly available manually-labeled brain surface dataset.
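A minimal sketch of the adversarial objective described above, assuming a segmentator supervised on the source domain and a discriminator that classifies which domain features come from; the loss weight and label convention are illustrative choices.

import torch
import torch.nn.functional as F

def segmentator_loss(seg_logits_src, labels_src, dom_logits_tgt, lam=0.1):
    """Source-supervised parcellation loss plus an adversarial term that rewards
    the segmentator when target-domain features fool the discriminator into
    predicting 'source' (label 0)."""
    seg = F.cross_entropy(seg_logits_src, labels_src)
    adv = F.binary_cross_entropy_with_logits(
        dom_logits_tgt, torch.zeros_like(dom_logits_tgt))
    return seg + lam * adv

def discriminator_loss(dom_logits_src, dom_logits_tgt):
    """The discriminator learns to tell domains apart: source = 0, target = 1."""
    return (F.binary_cross_entropy_with_logits(dom_logits_src, torch.zeros_like(dom_logits_src))
            + F.binary_cross_entropy_with_logits(dom_logits_tgt, torch.ones_like(dom_logits_tgt)))

# Toy shapes: (batch, classes, surface nodes) logits, per-node labels.
loss = segmentator_loss(torch.randn(2, 32, 100), torch.randint(0, 32, (2, 100)), torch.randn(2, 1))
print(loss.item())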
56. Conditional Channel Gated Networks for Task-Aware Continual Learning [PDF] 返回目录
Davide Abati, Jakub Tomczak, Tijmen Blankevoort, Simone Calderara, Rita Cucchiara, Babak Ehteshami Bejnordi
Abstract: Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems: as they meet the objective of the current training examples, their performance on previous tasks drops drastically. In this work, we introduce a novel framework to tackle this problem with conditional computation. We equip each convolutional layer with task-specific gating modules, selecting which filters to apply on the given input. This way, we achieve two appealing properties. Firstly, the execution patterns of the gates allow us to identify and protect important filters, ensuring no loss in the performance of the model on previously learned tasks. Secondly, by using a sparsity objective, we can promote the selection of a limited set of kernels, allowing the model to retain sufficient capacity to digest new tasks. Existing solutions require, at test time, awareness of the task to which each example belongs. This knowledge, however, may not be available in many practical scenarios. Therefore, we additionally introduce a task classifier that predicts the task label of each example, to deal with settings in which a task oracle is not available. We validate our proposal on four continual learning datasets. Results show that our model consistently outperforms existing methods both in the presence and in the absence of a task oracle. Notably, on the Split SVHN and Imagenet-50 datasets, our model yields up to 23.98% and 17.42% improvement in accuracy w.r.t. competing methods.
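A toy version of a task-conditioned, channel-gated convolution, using a soft sigmoid gate per task as a stand-in for the discrete (e.g. Gumbel-Softmax) gates and sparsity objective of the paper; all sizes are illustrative.

import torch
import torch.nn as nn

class TaskGatedConv(nn.Module):
    """Conv layer whose output channels are gated by a task-specific module."""
    def __init__(self, in_ch, out_ch, n_tasks):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # One lightweight gate vector per task.
        self.gates = nn.Embedding(n_tasks, out_ch)

    def forward(self, x, task_id):
        g = torch.sigmoid(self.gates(task_id))              # (B, out_ch)
        return self.conv(x) * g.unsqueeze(-1).unsqueeze(-1), g

layer = TaskGatedConv(3, 64, n_tasks=5)
y, g = layer(torch.randn(2, 3, 32, 32), torch.tensor([0, 3]))
sparsity_penalty = g.mean()   # added to the loss to encourage few filters per task
print(y.shape)  # torch.Size([2, 64, 32, 32])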
57. HOPE-Net: A Graph-based Model for Hand-Object Pose Estimation [PDF] 返回目录
Bardia Doosti, Shujon Naha, Majid Mirbagheri, David Crandall
Abstract: Hand-object pose estimation (HOPE) aims to jointly detect the poses of both a hand and a held object. In this paper, we propose a lightweight model called HOPE-Net which jointly estimates hand and object pose in 2D and 3D in real-time. Our network uses a cascade of two adaptive graph convolutional neural networks, one to estimate 2D coordinates of the hand joints and object corners, followed by another to convert 2D coordinates to 3D. Our experiments show that through end-to-end training of the full network, we achieve better accuracy for both the 2D and 3D coordinate estimation problems. The proposed 2D to 3D graph convolution-based model could be applied to other 3D landmark detection problems, where it is possible to first predict the 2D keypoints and then transform them to 3D.
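A sketch of the "adaptive graph convolution" cascade, assuming a learnable adjacency over the 21 hand joints plus 8 object corners and a 2D-to-3D lifting stage. The softmax-normalized adjacency and all shapes are illustrative choices, not the published architecture; nonlinearities between stages are omitted for brevity.

import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Graph convolution X' = A X W with a learnable, normalized adjacency A,
    so the keypoint graph itself is adapted during training."""
    def __init__(self, n_nodes, in_dim, out_dim):
        super().__init__()
        self.A = nn.Parameter(torch.eye(n_nodes))
        self.W = nn.Linear(in_dim, out_dim)

    def forward(self, x):                       # x: (B, n_nodes, in_dim)
        return self.W(torch.softmax(self.A, dim=-1) @ x)

n = 21 + 8                                      # 21 hand joints + 8 object corners
lift2d = AdaptiveGraphConv(n, 256, 2)           # image features -> 2D keypoints
lift3d = AdaptiveGraphConv(n, 2, 3)             # 2D keypoints -> 3D keypoints
kp3d = lift3d(lift2d(torch.randn(4, n, 256)))
print(kp3d.shape)                               # torch.Size([4, 29, 3])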
58. In-Domain GAN Inversion for Real Image Editing [PDF] 返回目录
Jiapeng Zhu, Yujun Shen, Deli Zhao, Bolei Zhou
Abstract: Recent work has shown that a variety of controllable semantics emerges in the latent space of the Generative Adversarial Networks (GANs) when being trained to synthesize images. However, it is difficult to use these learned semantics for real image editing. A common practice of feeding a real image to a trained GAN generator is to invert it back to a latent code. However, we find that existing inversion methods typically focus on reconstructing the target image by pixel values yet fail to land the inverted code in the semantic domain of the original latent space. As a result, the reconstructed image cannot well support semantic editing through varying the latent code. To solve this problem, we propose an in-domain GAN inversion approach, which not only faithfully reconstructs the input image but also ensures the inverted code to be semantically meaningful for editing. We first learn a novel domain-guided encoder to project any given image to the native latent space of GANs. We then propose a domain-regularized optimization by involving the encoder as a regularizer to fine-tune the code produced by the encoder, which better recovers the target image. Extensive experiments suggest that our inversion method achieves satisfying real image reconstruction and more importantly facilitates various image editing tasks, such as image interpolation and semantic manipulation, significantly outperforming start-of-the-arts.
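A minimal sketch of the two-stage, domain-regularized optimization described above: start from the domain-guided encoder's code and refine it so the reconstruction matches the image while the encoder keeps the code in-domain. G, E, the loss weights, and the step count are placeholders.

import torch

def domain_regularized_invert(G, E, x, steps=200, lam=2.0, lr=0.01):
    """Optimize a latent code z so G(z) matches image x while the encoder E
    keeps z semantically in-domain: min ||G(z)-x||^2 + lam*||E(G(z))-z||^2.

    G and E are assumed to be pretrained callables; starting from the
    encoder's own guess E(x) mirrors the two-stage scheme in the abstract.
    """
    z = E(x).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        rec = G(z)
        loss = ((rec - x) ** 2).mean() + lam * ((E(rec) - z) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()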
59. PointAR: Efficient Lighting Estimation for Mobile Augmented Reality [PDF] 返回目录
Yiqin Zhao, Tian Guo
Abstract: We propose an efficient lighting estimation pipeline that is suitable to run on modern mobile devices, with resource complexities comparable to state-of-the-art on-device deep learning models. Our pipeline, referred to as PointAR, takes a single RGB-D image captured from the mobile camera and a 2D location in that image, and estimates the 2nd-order spherical harmonics coefficients that can be directly utilized by rendering engines for indoor lighting in the context of augmented reality. Our key insight is to formulate the lighting estimation as a learning problem directly from point clouds, which is in part inspired by the Monte Carlo integration leveraged by real-time spherical harmonics lighting. While existing approaches estimate lighting information with complex deep learning pipelines, our method focuses on reducing the computational complexity. Through both quantitative and qualitative experiments, we demonstrate that PointAR achieves lower lighting estimation errors compared to state-of-the-art methods. Further, our method requires an order of magnitude fewer resources, comparable to those of mobile-specific DNNs.
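For reference, shading with 2nd-order spherical harmonics uses nine basis functions per color channel. A small NumPy sketch of evaluating that basis and applying predicted coefficients; the coefficients below are random stand-ins for what a PointAR-style model would output.

import numpy as np

def sh_basis_order2(n):
    """The nine 2nd-order real spherical harmonics basis values for unit
    normals n of shape (..., 3)."""
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),                 # Y00
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # Y1-1, Y10, Y11
        1.092548 * x * y, 1.092548 * y * z,         # Y2-2, Y2-1
        0.315392 * (3 * z**2 - 1),                  # Y20
        1.092548 * x * z,                           # Y21
        0.546274 * (x**2 - y**2),                   # Y22
    ], axis=-1)

coeffs = np.random.randn(9, 3)                  # predicted (9 bands) x (RGB)
normals = np.array([[0.0, 0.0, 1.0]])
irradiance = sh_basis_order2(normals) @ coeffs  # (1, 3) RGB shading
print(irradiance.shape)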
60. Sign Language Translation with Transformers [PDF] 返回目录
Kayo Yin
Abstract: Sign Language Translation (SLT) first uses a Sign Language Recognition (SLR) system to extract sign language glosses from videos. Then, a translation system generates spoken language translations from the sign language glosses. Though SLT has gathered interest recently, little study has been performed on the translation system. This paper focuses on the translation system and improves performance by utilizing Transformer networks. We report a wide range of experimental results for various Transformer setups and introduce the use of Spatial-Temporal Multi-Cue (STMC) networks in an end-to-end SLT system with Transformer. We perform experiments on RWTH-PHOENIX-Weather 2014T, a challenging SLT benchmark dataset of German sign language, and ASLG-PC12, a dataset involving American Sign Language (ASL) recently used in gloss-to-text translation. Our methodology improves on the current state-of-the-art by over 5 and 7 points respectively in BLEU-4 score on ground truth glosses and by using an STMC network to predict glosses of the RWTH-PHOENIX-Weather 2014T dataset. On the ASLG-PC12 corpus, we report an improvement of over 16 points in BLEU-4. Our findings also demonstrate that end-to-end translation on predicted glosses provides even better performance than translation on ground truth glosses. This shows potential for further improvement in SLT by either jointly training the SLR and translation systems or by revising the gloss annotation system.
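A minimal gloss-to-text translation model built on PyTorch's nn.Transformer, roughly matching the setup described; vocabulary sizes and layer counts are placeholders, and positional encodings are omitted for brevity.

import torch
import torch.nn as nn

class GlossToText(nn.Module):
    """Gloss-sequence to spoken-sentence translation with a vanilla Transformer."""
    def __init__(self, gloss_vocab=1200, text_vocab=3000, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(gloss_vocab, d)
        self.tgt_emb = nn.Embedding(text_vocab, d)
        self.tf = nn.Transformer(d_model=d, nhead=4,
                                 num_encoder_layers=2, num_decoder_layers=2,
                                 batch_first=True)
        self.out = nn.Linear(d, text_vocab)

    def forward(self, gloss_ids, text_ids):
        # Causal mask so each output token only attends to earlier tokens.
        mask = self.tf.generate_square_subsequent_mask(text_ids.size(1))
        h = self.tf(self.src_emb(gloss_ids), self.tgt_emb(text_ids), tgt_mask=mask)
        return self.out(h)

model = GlossToText()
logits = model(torch.randint(0, 1200, (2, 12)), torch.randint(0, 3000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 3000])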
61. Self-Augmentation: Generalizing Deep Networks to Unseen Classes for Few-Shot Learning [PDF] 返回目录
Jin-Woo Seo, Hong-Gyu Jung, Seong-Whan Lee
Abstract: Few-shot learning aims to classify unseen classes with a few training examples. While recent works have shown that standard mini-batch training with a carefully designed training strategy can improve generalization ability for unseen classes, well-known problems in deep networks such as memorizing training statistics have been less explored for few-shot learning. To tackle this issue, we propose self-augmentation that consolidates regional dropout and self-distillation. Specifically, we exploit a data augmentation technique called regional dropout, in which a patch of an image is substituted into other values. Then, we employ a backbone network that has auxiliary branches with its own classifier to enforce knowledge sharing. Lastly, we present a fine-tuning method to further exploit a few training examples for unseen classes. Experimental results show that the proposed method outperforms the state-of-the-art methods for prevalent few-shot benchmarks and improves the generalization ability.
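A sketch of the regional-dropout augmentation ("a patch of an image is substituted into other values"), here realized CutMix-style by substituting the patch from another image in the batch; the patch size and substitution source are assumptions.

import torch

def regional_dropout(batch, patch=16):
    """Replace a random square patch of each image with the corresponding
    patch from another image in the batch."""
    B, _, H, W = batch.shape
    y = torch.randint(0, H - patch + 1, (1,)).item()
    x = torch.randint(0, W - patch + 1, (1,)).item()
    perm = torch.randperm(B)
    out = batch.clone()
    out[:, :, y:y + patch, x:x + patch] = batch[perm, :, y:y + patch, x:x + patch]
    return out

aug = regional_dropout(torch.randn(8, 3, 84, 84))
print(aug.shape)  # torch.Size([8, 3, 84, 84])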
62. MetaPoison: Practical General-purpose Clean-label Data Poisoning [PDF] 返回目录
W. Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, Tom Goldstein
Abstract: Data poisoning--the process by which an attacker takes control of a model by making imperceptible changes to a subset of the training data--is an emerging threat in the context of neural networks. Existing attacks for data poisoning have relied on hand-crafted heuristics. Instead, we pose crafting poisons more generally as a bi-level optimization problem, where the inner level corresponds to training a network on a poisoned dataset and the outer level corresponds to updating those poisons to achieve a desired behavior on the trained model. We then propose MetaPoison, a first-order method to solve this optimization quickly. MetaPoison is effective: it outperforms previous clean-label poisoning methods by a large margin under the same setting. MetaPoison is robust: its poisons transfer to a variety of victims with unknown hyperparameters and architectures. MetaPoison is also general-purpose, working not only in fine-tuning scenarios, but also for end-to-end training from scratch with remarkable success, e.g. causing a target image to be misclassified 90% of the time via manipulating just 1% of the dataset. Additionally, MetaPoison can achieve arbitrary adversary goals not previously possible--like using poisons of one class to make a target image don the label of another arbitrarily chosen class. Finally, MetaPoison works in the real-world. We demonstrate successful data poisoning of models trained on Google Cloud AutoML Vision. Code and premade poisons are provided at this https URL
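A one-step sketch of the first-order bi-level idea, assuming PyTorch 2's torch.func.functional_call: unroll a single differentiable SGD step on the poisoned batch, then backpropagate the adversarial target loss to the poison perturbation. The actual method unrolls several steps over model ensembles; everything here (names, step counts, the single-step unroll) is illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

def metapoison_step(model, delta, x_base, y_base, x_target, y_adv, inner_lr=0.1):
    """Gradient of the target's adversarial loss w.r.t. the poison delta,
    through one virtual SGD step on the poisoned batch."""
    x_poison = (x_base + delta).clamp(0, 1)
    params = list(model.parameters())
    inner_loss = F.cross_entropy(model(x_poison), y_base)
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    # Virtual SGD step, kept in the graph so it stays differentiable w.r.t. delta.
    fast = {n: p - inner_lr * g
            for (n, p), g in zip(model.named_parameters(), grads)}
    logits = torch.func.functional_call(model, fast, (x_target,))
    adv_loss = F.cross_entropy(logits, y_adv)   # want the target labeled y_adv
    return torch.autograd.grad(adv_loss, delta)[0]

model = nn.Linear(10, 2)
delta = torch.zeros(4, 10, requires_grad=True)
g = metapoison_step(model, delta, torch.rand(4, 10), torch.tensor([0, 0, 1, 1]),
                    torch.rand(1, 10), torch.tensor([0]))
print(g.shape)  # torch.Size([4, 10]); used to update delta under a small-norm budget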
63. 3D Deep Learning on Medical Images: A Review [PDF] 返回目录
Satya P. Singh, Lipo Wang, Sukrit Gupta, Haveesh Goli, Parasuraman Padmanabhan, Balázs Gulyás
Abstract: The rapid advancements in machine learning and graphics processing technologies, together with the availability of medical imaging data, have led to a rapid increase in the use of machine learning models in the medical domain. This was exacerbated by the rapid advancements in convolutional neural network (CNN) based architectures, which were adopted by the medical imaging community to assist clinicians in disease diagnosis. Since the grand success of AlexNet in 2012, CNNs have been increasingly used in medical image analysis to improve the efficiency of human clinicians. In recent years, three-dimensional (3D) CNNs have been employed for analysis of medical images. In this paper, we trace the history of how the 3D CNN was developed from its machine learning roots, give a brief mathematical description of 3D CNN and the preprocessing steps required for medical images before feeding them to 3D CNNs. We review the significant research in the field of 3D medical imaging analysis using 3D CNNs (and its variants) in different medical areas such as classification, segmentation, detection, and localization. We conclude by discussing the challenges associated with the use of 3D CNNs in the medical imaging domain (and the use of deep learning models, in general) and possible future trends in the field.
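For readers new to the topic, a minimal volumetric classifier of the kind the review surveys: convolutions over (depth, height, width), e.g. for classifying an MRI volume. Channel counts and input size are illustrative only.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, 2),
)
logits = model(torch.randn(1, 1, 64, 64, 64))   # one single-channel volume
print(logits.shape)  # torch.Size([1, 2])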
64. A theory of independent mechanisms for extrapolation in generative models [PDF] 返回目录
Michel Besserve, Rémy Sun, Dominik Janzing, Bernhard Schölkopf
Abstract: Deep generative models reproduce complex empirical data but cannot extrapolate to novel environments. An intuitive idea to promote extrapolation capabilities is to enforce the architecture to have the modular structure of a causal graphical model, where one can intervene on each module independently of the others in the graph. We develop a framework to formalize this intuition, using the principle of Independent Causal Mechanisms, and show how over-parameterization of generative neural networks can hinder extrapolation capabilities. Our experiments on the generation of human faces show that successive layers of a generator architecture implement independent mechanisms to some extent, allowing meaningful extrapolations. Finally, we illustrate that independence of mechanisms may be enforced during training to improve extrapolation.
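A toy illustration of intervening on one module independently: a generator whose i-th block consumes its own latent slice, so resampling a single slice intervenes on a single candidate mechanism. The architecture is a deliberately simplified stand-in for the paper's setting.

import torch
import torch.nn as nn

class ModularGenerator(nn.Module):
    """Toy generator whose i-th block consumes its own slice of the latent."""
    def __init__(self, n_blocks=4, slice_dim=16, feat_dim=64):
        super().__init__()
        self.slice_dim, self.feat_dim = slice_dim, feat_dim
        self.blocks = nn.ModuleList(
            nn.Linear(feat_dim + slice_dim, feat_dim) for _ in range(n_blocks))

    def forward(self, z):                       # z: (B, n_blocks * slice_dim)
        h = torch.zeros(z.size(0), self.feat_dim)
        for i, block in enumerate(self.blocks):
            zi = z[:, i * self.slice_dim:(i + 1) * self.slice_dim]
            h = torch.relu(block(torch.cat([h, zi], dim=-1)))
        return h

G = ModularGenerator()
z = torch.randn(1, 64)
z_int = z.clone()
z_int[:, 16:32] = torch.randn(1, 16)            # intervene on block 1 only
out, out_int = G(z), G(z_int)                   # compare outputs to probe independence
print(out.shape, out_int.shape)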