
【arXiv Papers】 Computer Vision and Pattern Recognition 2020-03-31

Contents

1. DHP: Differentiable Meta Pruning via HyperNetworks [PDF] Abstract
2. Designing Network Design Spaces [PDF] Abstract
3. Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation [PDF] Abstract
4. Plug-and-Play Algorithms for Large-scale Snapshot Compressive Imaging [PDF] Abstract
5. Vox2Vox: 3D-GAN for Brain Tumour Segmentation [PDF] Abstract
6. Supervised and Unsupervised Detections for Multiple Object Tracking in Traffic Scenes: A Comparative Study [PDF] Abstract
7. TResNet: High Performance GPU-Dedicated Architecture [PDF] Abstract
8. Laplacian Denoising Autoencoder [PDF] Abstract
9. Speech2Action: Cross-modal Supervision for Action Recognition [PDF] Abstract
10. Squeezed Deep 6DoF Object Detection Using Knowledge Distillation [PDF] Abstract
11. Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets [PDF] Abstract
12. Super Resolution for Root Imaging [PDF] Abstract
13. SiTGRU: Single-Tunnelled Gated Recurrent Unit for Abnormality Detection [PDF] Abstract
14. Improving out-of-distribution generalization via multi-task self-supervised pretraining [PDF] Abstract
15. LayoutMP3D: Layout Annotation of Matterport3D [PDF] Abstract
16. Improved Gradient based Adversarial Attacks for Quantized Networks [PDF] Abstract
17. A Comparison of Data Augmentation Techniques in Training Deep Neural Networks for Satellite Image Classification [PDF] Abstract
18. Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [PDF] Abstract
19. RPM-Net: Robust Point Matching using Learned Features [PDF] Abstract
20. DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning [PDF] Abstract
21. Computer Aided Detection for Pulmonary Embolism Challenge (CAD-PE) [PDF] Abstract
22. Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance [PDF] Abstract
23. Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks [PDF] Abstract
24. Context Based Emotion Recognition using EMOTIC Dataset [PDF] Abstract
25. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing [PDF] Abstract
26. Active stereo vision three-dimensional reconstruction by RGB dot pattern projection and ray intersection [PDF] Abstract
27. Real-time Fruit Recognition and Grasp Estimation for Autonomous Apple harvesting [PDF] Abstract
28. Unsupervised Model Personalization while Preserving Privacy and Scalability: An Open Problem [PDF] Abstract
29. PANDA: Prototypical Unsupervised Domain Adaptation [PDF] Abstract
30. Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [PDF] Abstract
31. Architecture Disentanglement for Deep Neural Networks [PDF] Abstract
32. Towards Palmprint Verification On Smartphones [PDF] Abstract
33. Domain-aware Visual Bias Eliminating for Generalized Zero-Shot Learning [PDF] Abstract
34. TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge [PDF] Abstract
35. Memory Aggregation Networks for Efficient Interactive Video Object Segmentation [PDF] Abstract
36. Physical Model Guided Deep Image Deraining [PDF] Abstract
37. MetaFuse: A Pre-trained Fusion Model for Human Pose Estimation [PDF] Abstract
38. Learning Memory-guided Normality for Anomaly Detection [PDF] Abstract
39. Learning to Learn Single Domain Generalization [PDF] Abstract
40. Cross-Domain Document Object Detection: Benchmark Suite and Method [PDF] Abstract
41. Density-Aware Graph for Deep Semi-Supervised Visual Recognition [PDF] Abstract
42. Adversarial Feature Hallucination Networks for Few-Shot Learning [PDF] Abstract
43. Incremental Learning In Online Scenario [PDF] Abstract
44. Gradually Vanishing Bridge for Adversarial Domain Adaptation [PDF] Abstract
45. Space-Time-Aware Multi-Resolution Video Enhancement [PDF] Abstract
46. Learning Interactions and Relationships between Movie Characters [PDF] Abstract
47. Learning a Weakly-Supervised Video Actor-Action Segmentation Model with a Wise Selection [PDF] Abstract
48. Detection of 3D Bounding Boxes of Vehicles Using Perspective Transformation for Accurate Speed Measurement [PDF] Abstract
49. Defect segmentation: Mapping tunnel lining internal defects with ground penetrating radar data using a convolutional neural network [PDF] Abstract
50. High-Order Residual Network for Light Field Super-Resolution [PDF] Abstract
51. Disturbance-immune Weight Sharing for Neural Architecture Search [PDF] Abstract
52. Generative Partial Multi-View Clustering [PDF] Abstract
53. Structure-Preserving Super Resolution with Gradient Guidance [PDF] Abstract
54. Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation [PDF] Abstract
55. Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification [PDF] Abstract
56. Learning by Analogy: Reliable Supervision from Transformations for Unsupervised Optical Flow Estimation [PDF] Abstract
57. Noise Modeling, Synthesis and Classification for Generic Object Anti-Spoofing [PDF] Abstract
58. Omni-sourced Webly-supervised Learning for Video Recognition [PDF] Abstract
59. Multi-Path Region Mining For Weakly Supervised 3D Semantic Segmentation on Point Clouds [PDF] Abstract
60. Fast-MVSNet: Sparse-to-Dense Multi-View Stereo With Learned Propagation and Gauss-Newton Refinement [PDF] Abstract
61. Data-Driven Neuromorphic DRAM-based CNN and RNN Accelerators [PDF] Abstract
62. ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings [PDF] Abstract
63. Spatial Attention Pyramid Network for Unsupervised Domain Adaptation [PDF] Abstract
64. Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds [PDF] Abstract
65. GPS-Net: Graph Property Sensing Network for Scene Graph Generation [PDF] Abstract
66. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose [PDF] Abstract
67. AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization [PDF] Abstract
68. Adaptive Object Detection with Dual Multi-Label Prediction [PDF] Abstract
69. Co-occurrence Background Model with Superpixels for Robust Background Initialization [PDF] Abstract
70. Superpixel Segmentation with Fully Convolutional Networks [PDF] Abstract
71. Refined Plane Segmentation for Cuboid-Shaped Objects by Leveraging Edge Detection [PDF] Abstract
72. One-Shot Domain Adaptation For Face Generation [PDF] Abstract
73. Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning [PDF] Abstract
74. Cross-domain Detection via Graph-induced Prototype Alignment [PDF] Abstract
75. CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Network [PDF] Abstract
76. Polarized Reflection Removal with Perfect Alignment in the Wild [PDF] Abstract
77. CNN-based Density Estimation and Crowd Counting: A Survey [PDF] Abstract
78. Real-MFF Dataset: A Large Realistic Multi-focus Image Dataset with Ground Truth [PDF] Abstract
79. Learning Invariant Representation for Unsupervised Image Restoration [PDF] Abstract
80. Trajectory Poisson multi-Bernoulli filters [PDF] Abstract
81. Deep Fashion3D: A Dataset and Benchmark for 3D Garment Reconstruction from Single Images [PDF] Abstract
82. A Physics-based Noise Formation Model for Extreme Low-light Raw Denoising [PDF] Abstract
83. BiLingUNet: Image Segmentation by Modulating Top-Down and Bottom-Up Visual Processing with Referring Expressions [PDF] Abstract
84. Actor-Transformers for Group Activity Recognition [PDF] Abstract
85. Exploit Clues from Views: Self-Supervised and Regularized Learning for Multiview Object Recognition [PDF] Abstract
86. NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing [PDF] Abstract
87. Semantically Multi-modal Image Synthesis [PDF] Abstract
88. Using the Split Bregman Algorithm to Solve the Self-Repelling Snake Model [PDF] Abstract
89. Inferring Semantic Information with 3D Neural Scene Representations [PDF] Abstract
90. Deep CG2Real: Synthetic-to-Real Translation via Image Disentanglement [PDF] Abstract
91. Designing Color Filters that Make Cameras More Colorimetric [PDF] Abstract
92. Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images [PDF] Abstract
93. Self-Supervised Learning for Domain Adaptation on Point-Clouds [PDF] Abstract
94. Combining Visible and Infrared Spectrum Imagery using Machine Learning for Small Unmanned Aerial System Detection [PDF] Abstract
95. Detection and Description of Change in Visual Streams [PDF] Abstract
96. MCFlow: Monte Carlo Flow Models for Data Imputation [PDF] Abstract
97. On the Evaluation of Prohibited Item Classification and Detection in Volumetric 3D Computed Tomography Baggage Security Screening Imagery [PDF] Abstract
98. SceneCAD: Predicting Object Alignments and Layouts in RGB-D Scans [PDF] Abstract
99. Acceleration of Convolutional Neural Network Using FFT-Based Split Convolutions [PDF] Abstract
100. Source Printer Identification from Document Images Acquired using Smartphone [PDF] Abstract
101. Weakly-supervised land classification for coastal zone based on deep convolutional neural networks by incorporating dual-polarimetric characteristics into training dataset [PDF] Abstract
102. How Not to Give a FLOP: Combining Regularization and Pruning for Efficient Inference [PDF] Abstract
103. BVI-DVC: A Training Database for Deep Video Compression [PDF] Abstract
104. OCmst: One-class Novelty Detection using Convolutional Neural Network and Minimum Spanning Trees [PDF] Abstract
105. Diagnosis of Breast Cancer using Hybrid Transfer Learning [PDF] Abstract
106. Interval Neural Networks as Instability Detectors for Image Reconstructions [PDF] Abstract
107. PointGMM: a Neural GMM Network for Point Clouds [PDF] Abstract
108. Optimizing Geometry Compression using Quantum Annealing [PDF] Abstract
109. InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [PDF] Abstract
110. Can AI help in screening Viral and COVID-19 pneumonia? [PDF] Abstract
111. Unsupervised Deep Learning for MR Angiography with Flexible Temporal Resolution [PDF] Abstract
112. Mutual Learning Network for Multi-Source Domain Adaptation [PDF] Abstract
113. NPENAS: Neural Predictor Guided Evolution for Neural Architecture Search [PDF] Abstract
114. A Benchmark for Point Clouds Registration Algorithms [PDF] Abstract
115. Gradient-based Data Augmentation for Semi-Supervised Learning [PDF] Abstract
116. Adversarial Imitation Attack [PDF] Abstract
117. Predicting the Popularity of Micro-videos with Multimodal Variational Encoder-Decoder Framework [PDF] Abstract
118. DaST: Data-free Substitute Training for Adversarial Attacks [PDF] Abstract
119. Learning to Smooth and Fold Real Fabric Using Dense Object Descriptors Trained on Synthetic Color Images [PDF] Abstract
120. Image compression optimized for 3D reconstruction by utilizing deep neural networks [PDF] Abstract

Abstracts

1. DHP: Differentiable Meta Pruning via HyperNetworks [PDF] Back to contents
  Yawei Li, Shuhang Gu, Kai Zhang, Luc Van Gool, Radu Timofte
Abstract: Network pruning has been the driving force for the efficient inference of neural networks and the alleviation of model storage and transmission burden. Traditional network pruning methods focus on the per-filter influence on the network accuracy by analyzing the filter distribution. With the advent of AutoML and neural architecture search (NAS), pruning has become topical with automatic mechanisms and search-based architecture optimization. However, current automatic designs rely on either reinforcement learning or evolutionary algorithms, which often lack a theoretical convergence guarantee or do not converge within a meaningful time limit. In this paper, we propose a differentiable pruning method via hypernetworks for automatic network pruning and layer-wise configuration optimization. A hypernetwork is designed to generate the weights of the backbone network. The inputs of the hypernetwork, namely the latent vectors, control the output channels of the layers of the backbone network. By applying $\ell_1$ sparsity regularization to the latent vectors and utilizing proximal gradient, sparse latent vectors can be obtained and their zero elements removed. Thus, the corresponding elements of the hypernetwork outputs can also be removed, achieving the effect of network pruning. The latent vectors of all the layers are pruned together, resulting in an automatic layer configuration. Extensive experiments are conducted on various networks for image classification, single image super-resolution, and denoising, and the experimental results validate the proposed method.
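
The pruning step in this formulation reduces to a proximal gradient update on each layer's latent vector: a gradient step on the task loss followed by soft-thresholding, which drives some latent entries exactly to zero so the corresponding channels can be removed. A minimal PyTorch sketch under assumed hyperparameters, with a stand-in quadratic loss in place of the real task loss:

import torch

def soft_threshold(z, t):
    # proximal operator of t * ||z||_1: shrink every entry toward zero
    return torch.sign(z) * torch.clamp(z.abs() - t, min=0.0)

latent = torch.randn(64, requires_grad=True)  # latent vector of one layer (64 channels)
lr, lam = 0.1, 1e-2                           # step size and sparsity weight (assumed)

loss = (latent ** 2).sum()                    # stand-in for the data-fitting task loss
loss.backward()
with torch.no_grad():
    latent -= lr * latent.grad                      # gradient step on the task loss
    latent.copy_(soft_threshold(latent, lr * lam))  # proximal step enforcing sparsity
    latent.grad.zero_()
# channels whose latent entry reaches exactly zero are pruned from the
# hypernetwork output, shrinking the backbone layer automatically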

2. Designing Network Design Spaces [PDF] Back to contents
  Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, Piotr Dollár
Abstract: In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
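
The quantized linear rule at the core of the RegNet parametrization is compact enough to state directly: per-block widths follow u_j = w_0 + w_a * j, are snapped to the nearest power of w_m, and are rounded to a hardware-friendly multiple. A short sketch following that rule (the parameter values below are illustrative, not a published RegNet configuration):

import numpy as np

def regnet_widths(depth, w0, wa, wm, q=8):
    # quantized linear width rule: u_j = w0 + wa * j, snapped to the
    # nearest power of wm, then rounded to a multiple of q
    u = w0 + wa * np.arange(depth)
    s = np.round(np.log(u / w0) / np.log(wm))
    return (np.round(w0 * wm ** s / q) * q).astype(int)

print(regnet_widths(depth=16, w0=48, wa=36.0, wm=2.5))  # widths for 16 blocks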

3. Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation [PDF] Back to contents
  Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, Ping Luo
Abstract: Learning a good image prior is a long-term goal for image restoration and manipulation. While existing methods like deep image prior (DIP) capture low-level image statistics, there are still gaps toward an image prior that captures rich image semantics including color, spatial coherence, textures, and high-level concepts. This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. As shown in Fig.1, the deep generative prior (DGP) provides compelling results to restore missing semantics, e.g., color, patch, resolution, of various degraded images. It also enables diverse image manipulation including random jittering, image morphing, and category transfer. Such highly flexible restoration and manipulation are made possible through relaxing the assumption of existing GAN-inversion methods, which tend to fix the generator. Notably, we allow the generator to be fine-tuned on-the-fly in a progressive manner regularized by feature distance obtained by the discriminator in GAN. We show that these easy-to-implement and practical changes help preserve the reconstruction to remain in the manifold of nature image, and thus lead to more precise and faithful reconstruction for real images. Code is available at this https URL.

4. Plug-and-Play Algorithms for Large-scale Snapshot Compressive Imaging [PDF] Back to contents
  Xin Yuan, Yang Liu, Jinli Suo, Qionghai Dai
Abstract: Snapshot compressive imaging (SCI) aims to capture the high-dimensional (usually 3D) images using a 2D sensor (detector) in a single snapshot. Though enjoying the advantages of low-bandwidth, low-power and low-cost, applying SCI to large-scale problems (HD or UHD videos) in our daily life is still challenging. The bottleneck lies in the reconstruction algorithms; they are either too slow (iterative optimization algorithms) or not flexible to the encoding process (deep learning based end-to-end networks). In this paper, we develop fast and flexible algorithms for SCI based on the plug-and-play (PnP) framework. In addition to the widely used PnP-ADMM method, we further propose the PnP-GAP (generalized alternating projection) algorithm with a lower computational workload and prove the global convergence of PnP-GAP under the SCI hardware constraints. By employing deep denoising priors, we show for the first time that PnP can recover a UHD color video ($3840\times 1644\times 48$ with PSNR above 30dB) from a snapshot 2D measurement. Extensive results on both simulation and real datasets verify the superiority of our proposed algorithm. The code is available at this https URL.
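
The plug-and-play recipe itself fits in a few lines: alternate a projection onto the measurement constraint with a call to an off-the-shelf denoiser acting as the image prior. Below is a toy GAP-style loop on a hypothetical random-mask operator, with a Gaussian filter standing in for the deep denoiser used in the paper:

import numpy as np
from scipy.ndimage import gaussian_filter

def pnp_gap(y, A, At, denoise, iters=50):
    # generalized alternating projection with a plugged-in denoiser:
    # enforce measurement consistency, then apply the denoising prior
    v = At(y)
    for _ in range(iters):
        x = v + At(y - A(v))  # projection step (exact for this masking toy)
        v = denoise(x)        # deep denoiser in the paper; a blur here
    return v

mask = (np.random.rand(32, 32) > 0.5).astype(float)  # toy sensing operator
A = lambda x: mask * x
At = lambda y: mask * y
truth = np.random.rand(32, 32)
rec = pnp_gap(A(truth), A, At, lambda x: gaussian_filter(x, 1.0))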

5. Vox2Vox: 3D-GAN for Brain Tumour Segmentation [PDF] Back to contents
  Marco Domenico Cirillo, David Abramian, Anders Eklund
Abstract: We propose a 3D volume-to-volume Generative Adversarial Network (GAN) for segmentation of brain tumours. The proposed model, called Vox2Vox, generates segmentations from multi-channel 3D MR images. The best results are obtained when the generator loss (a 3D U-Net) is weighted 5 times higher compared to the discriminator loss (a 3D GAN). For the BraTS 2018 training set we obtain (after ensembling 5 models) the following dice scores and Hausdorff 95 percentile distances: 90.66%, 82.54%, 78.71%, and 4.04 mm, 6.07 mm, 5.00 mm, for whole tumour, core tumour and enhancing tumour respectively. The proposed model is shown to compare favorably to the winners of the BraTS 2018 challenge, but a direct comparison is not possible.

6. Supervised and Unsupervised Detections for Multiple Object Tracking in Traffic Scenes: A Comparative Study [PDF] Back to contents
  Hui-Lee Ooi, Guillaume-Alexandre Bilodeau, Nicolas Saunier
Abstract: In this paper, we propose a multiple object tracker, called MF-Tracker, that integrates multiple classical features (spatial distances and colours) and modern features (detection labels and re-identification features) in its tracking framework. Since our tracker can work with detections coming either from unsupervised or supervised object detectors, we also investigated the impact of supervised and unsupervised detection inputs in our method and for tracking road users in general. We also compared our results with existing methods that were applied on the UA-Detrac and the UrbanTracker datasets. Results show that our proposed method performs very well on both datasets with different inputs (MOTA ranging from 0.3491 to 0.5805 for unsupervised inputs on the UrbanTracker dataset and an average MOTA of 0.7638 for supervised inputs on the UA-Detrac dataset) under different circumstances. A well-trained supervised object detector can give better results in challenging scenarios. However, in simpler scenarios, if good training data is not available, an unsupervised method can perform well and can be a good alternative.

7. TResNet: High Performance GPU-Dedicated Architecture [PDF] Back to contents
  Tal Ridnik, Hussam Lawen, Asaf Noy, Itamar Friedman
Abstract: Many deep learning models, developed in recent years, reach higher ImageNet accuracy than ResNet50, with fewer or comparable FLOPS count. While FLOPs are often seen as a proxy for network efficiency, when measuring actual GPU training and inference throughput, vanilla ResNet50 is usually significantly faster than its recent competitors, offering better throughput-accuracy trade-off. In this work, we introduce a series of architecture modifications that aim to boost neural networks' accuracy, while retaining their GPU training and inference efficiency. We first demonstrate and discuss the bottlenecks induced by FLOPs-optimizations. We then suggest alternative designs that better utilize GPU structure and assets. Finally, we introduce a new family of GPU-dedicated models, called TResNet, which achieve better accuracy and efficiency than previous ConvNets. Using a TResNet model, with similar GPU throughput to ResNet50, we reach 80.7% top-1 accuracy on ImageNet. Our TResNet models also transfer well and achieve state-of-the-art accuracy on competitive datasets such as Stanford cars (96.0%), CIFAR-10 (99.0%), CIFAR-100 (91.5%) and Oxford-Flowers (99.1%). Implementation is available at: this https URL
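
Since the paper's central observation is that FLOP counts are a poor proxy for speed, the quantity that matters is measured GPU throughput. A minimal benchmarking sketch, assuming a CUDA device and using a stock ResNet50 purely as an example model:

import time
import torch, torchvision

model = torchvision.models.resnet50().cuda().eval()
x = torch.randn(64, 3, 224, 224, device="cuda")  # batch size is arbitrary
with torch.no_grad():
    for _ in range(10):                # warm-up iterations
        model(x)
    torch.cuda.synchronize()           # wait for queued GPU work to finish
    t0 = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
print(f"{64 * 50 / (time.time() - t0):.0f} images/s")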

8. Laplacian Denoising Autoencoder [PDF] Back to contents
  Jianbo Jiao, Linchao Bao, Yunchao Wei, Shengfeng He, Honghui Shi, Rynson Lau, Thomas S. Huang
Abstract: While deep neural networks have been shown to perform remarkably well in many machine learning tasks, labeling a large amount of ground truth data for supervised training is usually very costly to scale. Therefore, learning robust representations with unlabeled data is critical in relieving human effort and vital for many downstream tasks. Recent advances in unsupervised and self-supervised learning approaches for visual data have benefited greatly from domain knowledge. Here we are interested in a more generic unsupervised learning framework that can be easily generalized to other domains. In this paper, we propose to learn data representations with a novel type of denoising autoencoder, where the noisy input data is generated by corrupting latent clean data in the gradient domain. This can be naturally generalized to span multiple scales with a Laplacian pyramid representation of the input data. In this way, the agent learns more robust representations that exploit the underlying data structures across multiple scales. Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach, compared to its counterpart with single-scale corruption and other approaches. Furthermore, we also demonstrate that the learned representations perform well when transferring to other downstream vision tasks.
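
To make the corruption scheme concrete, the sketch below adds noise to the band-pass levels of a Laplacian pyramid and reconstructs the corrupted image; it illustrates the gradient-domain, multi-scale idea only and is not the authors' exact corruption process:

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_corrupt(img, levels=3, sigma=0.1):
    # decompose into band-pass (Laplacian) layers plus a low-pass residual
    pyramid, cur = [], img
    for _ in range(levels):
        low = gaussian_filter(cur, 1.0)
        pyramid.append(cur - low)   # band-pass detail at this scale
        cur = low[::2, ::2]         # downsample for the next scale
    out = cur
    for band in reversed(pyramid):  # rebuild, corrupting each band
        out = zoom(out, 2, order=1)[:band.shape[0], :band.shape[1]]
        out = out + band + np.random.normal(0.0, sigma, band.shape)
    return out

noisy = laplacian_corrupt(np.random.rand(64, 64))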

9. Speech2Action: Cross-modal Supervision for Action Recognition [PDF] Back to contents
  Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
Abstract: Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example.

10. Squeezed Deep 6DoF Object Detection Using Knowledge Distillation [PDF] Back to contents
  Heitor Felix, Walber M. Rodrigues, David Macêdo, Francisco Simões, Adriano L. I. Oliveira, Veronica Teichrieb, Cleber Zanchettin
Abstract: The detection of objects considering a 6DoF pose is a common requisite to build virtual and augmented reality applications. It is usually a complex task which requires real-time processing and high-precision results for an adequate user experience. Recently, different deep learning techniques have been proposed to detect objects in 6DoF in RGB images, but they rely on high-complexity networks, requiring a computational power that prevents them from working on mobile devices. In this paper, we propose an approach to reduce the complexity of 6DoF detection networks while maintaining accuracy. We used Knowledge Distillation to teach portable Convolutional Neural Networks (CNNs) to learn from a real-time 6DoF detection CNN. The proposed method allows real-time applications using only RGB images while decreasing the hardware requirements. We used the LINEMOD dataset to evaluate the proposed method, and the experimental results show that it reduces the memory requirement by almost 99% in comparison to the original architecture, at the cost of halving the accuracy on one of the metrics. Code is available at this https URL
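
The distillation objective referred to here is the standard one: soften teacher and student outputs with a temperature and mix the resulting KL term with the ordinary supervised loss. A generic sketch for a classification head (the paper applies distillation to a 6DoF detector; the temperature and mixing weight below are illustrative):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # soft targets: KL between temperature-softened distributions, scaled by
    # T^2 so gradient magnitudes stay comparable across temperatures
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)  # usual supervised term
    return alpha * soft + (1.0 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))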

11. Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets [PDF] Back to contents
  Daniel Haase, Manuel Amthor
Abstract: We introduce blueprint separable convolutions (BSConv) as highly efficient building blocks for CNNs. They are motivated by quantitative analyses of kernel properties from trained models, which show the dominance of correlations along the depth axis. Based on our findings, we formulate a theoretical foundation from which we derive efficient implementations using only standard layers. Moreover, our approach provides a thorough theoretical derivation, interpretation, and justification for the application of depthwise separable convolutions (DSCs) in general, which have become the basis of many modern network architectures. Ultimately, we reveal that DSC-based architectures such as MobileNets implicitly rely on cross-kernel correlations, while our BSConv formulation is based on intra-kernel correlations and thus allows for a more efficient separation of regular convolutions. Extensive experiments on large-scale and fine-grained classification datasets show that BSConvs clearly and consistently improve MobileNets and other DSC-based architectures without introducing any further complexity. For fine-grained datasets, we achieve an improvement of up to 13.7 percentage points. In addition, if used as drop-in replacement for standard architectures such as ResNets, BSConv variants also outperform their vanilla counterparts by up to 9.5 percentage points on ImageNet.
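
In code, the unconstrained variant of a blueprint separable convolution is simply a pointwise 1x1 convolution followed by a depthwise k x k convolution, reversing the ordering of the MobileNet depthwise separable block. A minimal PyTorch sketch (the class name is ours, not the authors'):

import torch
import torch.nn as nn

class BSConvU(nn.Module):
    # pointwise 1x1 first (mixing channels), then a depthwise k x k
    # convolution applied to each output channel separately
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.dw = nn.Conv2d(out_ch, out_ch, k, padding=k // 2,
                            groups=out_ch, bias=False)

    def forward(self, x):
        return self.dw(self.pw(x))

y = BSConvU(32, 64)(torch.randn(1, 32, 56, 56))  # 1 x 64 x 56 x 56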

12. Super Resolution for Root Imaging [PDF] Back to contents
  Jose F. Ruiz-Munoz, Alina Zare, Jyothier K. Nimmagadda, Shuang Cui, James E. Baciak
Abstract: High-resolution cameras have become very helpful for plant phenotyping by providing a mechanism for tasks such as target versus background discrimination, and the measurement and analysis of fine-above-ground plant attributes, e.g., the venation network of leaves. However, the acquisition of high-resolution (HR) imagery of roots in situ remains a challenge. We apply super-resolution (SR) convolutional neural networks (CNNs) to boost the resolution capability of a backscatter X-ray system designed to image buried roots. To overcome limited available backscatter X-ray data for training, we compare three alternatives for training: i) non-plant-root images, ii) plant-root images, and iii) pretraining the model with non-plant-root images and fine-tuning with plant-root images and two deep learning approaches i) Fast Super Resolution Convolutional Neural Network and ii) Super Resolution Generative Adversarial Network). We evaluate SR performance using signal to noise ratio (SNR) and intersection over union (IoU) metrics when segmenting the SR images. In our experiments, we observe that the studied SR models improve the quality of the low-resolution images (LR) of plant roots of an unseen dataset in terms of SNR. Likewise, we demonstrate that SR pre-processing boosts the performance of a machine learning system trained to separate plant roots from their background. In addition, we show examples of backscatter X-ray images upscaled by using the SR model. The current technology for non-intrusive root imaging acquires noisy and LR images. In this study, we show that this issue can be tackled by the incorporation of a deep-learning based SR model in the image formation process.

13. SiTGRU: Single-Tunnelled Gated Recurrent Unit for Abnormality Detection [PDF] Back to contents
  Habtamu Fanta, Zhiwen Shao, Lizhuang Ma
Abstract: Abnormality detection is a challenging task due to the dependence on a specific context and the unconstrained variability of practical scenarios. In recent years, it has benefited from the powerful features learnt by deep neural networks, and handcrafted features specialized for abnormality detectors. However, these approaches with large complexity still have limitations in handling long term sequential data (e.g., videos), and their learnt features do not thoroughly capture useful information. Recurrent Neural Networks (RNNs) have been shown to be capable of robustly dealing with temporal data in long term sequences. In this paper, we propose a novel version of Gated Recurrent Unit (GRU), called Single Tunnelled GRU for abnormality detection. Particularly, the Single Tunnelled GRU discards the heavy weighted reset gate from GRU cells that overlooks the importance of past content by only favouring current input to obtain an optimized single gated cell model. Moreover, we substitute the hyperbolic tangent activation in standard GRUs with sigmoid activation, as the former suffers from performance loss in deeper networks. Empirical results show that our proposed optimized GRU model outperforms standard GRU and Long Short Term Memory (LSTM) networks on most metrics for detection and generalization tasks on CUHK Avenue and UCSD datasets. The model is also computationally efficient with reduced training and testing time over standard RNNs.
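
Reading the abstract literally, a single-tunnelled cell keeps only the update gate and replaces the tanh candidate with a sigmoid. A hypothetical PyTorch cell along those lines (a sketch of the described modifications, not the authors' implementation):

import torch
import torch.nn as nn

class SiTGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.update = nn.Linear(input_size + hidden_size, hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h):
        xh = torch.cat([x, h], dim=1)
        z = torch.sigmoid(self.update(xh))       # single (update) gate
        h_tilde = torch.sigmoid(self.cand(xh))   # sigmoid candidate, no reset gate
        return (1 - z) * h + z * h_tilde

cell = SiTGRUCell(16, 32)
h = cell(torch.randn(4, 16), torch.zeros(4, 32))  # next hidden state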

14. Improving out-of-distribution generalization via multi-task self-supervised pretraining [PDF] Back to contents
  Isabela Albuquerque, Nikhil Naik, Junnan Li, Nitish Keskar, Richard Socher
Abstract: Self-supervised feature representations have been shown to be useful for supervised classification, few-shot learning, and adversarial robustness. We show that features obtained using self-supervised learning are comparable to, or better than, supervised learning for domain generalization in computer vision. We introduce a new self-supervised pretext task of predicting responses to Gabor filter banks and demonstrate that multi-task learning of compatible pretext tasks improves domain generalization performance as compared to training individual tasks alone. Features learnt through self-supervision obtain better generalization to unseen domains when compared to their supervised counterpart when there is a larger domain shift between training and test distributions and even show better localization ability for objects of interest. Self-supervised feature representations can also be combined with other domain generalization methods to further boost performance.

15. LayoutMP3D: Layout Annotation of Matterport3D [PDF] Back to contents
  Fu-En Wang, Yu-Hsuan Yeh, Min Sun, Wei-Chen Chiu, Yi-Hsuan Tsai
Abstract: Inferring the information of 3D layout from a single equirectangular panorama is crucial for numerous applications of virtual reality or robotics (e.g., scene understanding and navigation). To achieve this, several datasets are collected for the task of 360 layout estimation. To facilitate the learning algorithms for autonomous systems in indoor scenarios, we consider the Matterport3D dataset with their originally provided depth map ground truths and further release our annotations for layout ground truths from a subset of Matterport3D. As Matterport3D contains accurate depth ground truths from time-of-flight (ToF) sensors, our dataset provides both the layout and depth information, which enables the opportunity to explore the environment by integrating both cues.

16. Improved Gradient based Adversarial Attacks for Quantized Networks [PDF] Back to contents
  Kartik Gupta, Thalaiyasingam Ajanthan
Abstract: Neural network quantization has become increasingly popular due to efficient memory consumption and faster computation resulting from bitwise operations on the quantized networks. Even though they exhibit excellent generalization capabilities, their robustness properties are not well-understood. In this work, we systematically study the robustness of quantized networks against gradient based adversarial attacks and demonstrate that these quantized models suffer from gradient vanishing issues and show a fake sense of security. By attributing gradient vanishing to poor forward-backward signal propagation in the trained network, we introduce a simple temperature scaling approach to mitigate this issue while preserving the decision boundary. Despite being a simple modification to existing gradient based adversarial attacks, experiments on CIFAR-10/100 datasets with VGG-16 and ResNet-18 networks demonstrate that our temperature scaled attacks obtain near-perfect success rate on quantized networks while outperforming original attacks on adversarially trained models as well as floating-point networks.
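
The temperature-scaling fix drops into any gradient-based attack: divide the logits by a temperature before taking the loss so gradients through the flattened decision surface of a quantized network do not vanish. A minimal FGSM sketch with an assumed temperature value and inputs in [0, 1]:

import torch
import torch.nn.functional as F

def fgsm_with_temperature(model, x, y, eps, T=10.0):
    # scaling logits by 1/T sharpens the softmax gradient signal
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x) / T, y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
adv = fgsm_with_temperature(model, torch.rand(4, 3, 8, 8),
                            torch.randint(0, 10, (4,)), eps=0.03)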

17. A Comparison of Data Augmentation Techniques in Training Deep Neural Networks for Satellite Image Classification [PDF] Back to contents
  Mohamed Abdelhack
Abstract: Satellite imagery allows a plethora of applications ranging from weather forecasting to land surveying. The rapid development of computer vision systems could open new horizons for the utilization of satellite data due to the abundance of large volumes of data. However, current state-of-the-art computer vision systems mainly cater to applications involving natural images. While useful, those images exhibit a different distribution from satellite images, which in addition have more spectral channels. This restricts the use of pretrained deep learning models to a subset of spectral channels that is equivalent to natural images, thus discarding valuable information from the other spectral channels. This calls for research effort to optimize deep learning models for satellite imagery, to enable the assessment of their utility in the domain of remote sensing. This study focuses on the topic of image augmentation in the training of deep neural network classifiers. I tested different techniques for image augmentation to train a standard deep neural network on satellite images from EuroSAT. Results show that while some image augmentation techniques commonly used in natural image training can readily be transferred to satellite images, some others could actually lead to a decrease in performance. Additionally, some novel image augmentation techniques that take into account the nature of satellite images could be useful to incorporate in training.

18. Faster than FAST: GPU-Accelerated Frontend for High-Speed VIO [PDF] Back to contents
  Balazs Nagy, Philipp Foehn, Davide Scaramuzza
Abstract: The recent introduction of powerful embedded graphics processing units (GPUs) has allowed for unforeseen improvements in real-time computer vision applications. It has enabled algorithms to run onboard, well above the standard video rates, yielding not only higher information processing capability, but also reduced latency. This work focuses on the applicability of efficient low-level, GPU hardware-specific instructions to improve on existing computer vision algorithms in the field of visual-inertial odometry (VIO). While most steps of a VIO pipeline work on visual features, they rely on image data for detection and tracking, of which both steps are well suited for parallelization. Especially non-maxima suppression and the subsequent feature selection are prominent contributors to the overall image processing latency. Our work first revisits the problem of non-maxima suppression for feature detection specifically on GPUs, and proposes a solution that selects local response maxima, imposes spatial feature distribution, and extracts features simultaneously. Our second contribution introduces an enhanced FAST feature detector that applies the aforementioned non-maxima suppression method. Finally, we compare our method to other state-of-the-art CPU and GPU implementations, where we always outperform all of them in feature tracking and detection, resulting in over 1000fps throughput on an embedded Jetson TX2 platform. Additionally, we demonstrate our work integrated in a VIO pipeline achieving a metric state estimation at ~200fps.

19. RPM-Net: Robust Point Matching using Learned Features [PDF] Back to contents
  Zi Jian Yew, Gim Hee Lee
Abstract: Iterative Closest Point (ICP) solves the rigid point cloud registration problem iteratively in two steps: (1) make hard assignments of spatially closest point correspondences, and then (2) find the least-squares rigid transformation. The hard assignments of closest point correspondences based on spatial distances are sensitive to the initial rigid transformation and noisy/outlier points, which often cause ICP to converge to wrong local minima. In this paper, we propose the RPM-Net -- a less sensitive to initialization and more robust deep learning-based approach for rigid point cloud registration. To this end, our network uses the differentiable Sinkhorn layer and annealing to get soft assignments of point correspondences from hybrid features learned from both spatial coordinates and local geometry. To further improve registration performance, we introduce a secondary network to predict optimal annealing parameters. Unlike some existing methods, our RPM-Net handles missing correspondences and point clouds with partial visibility. Experimental results show that our RPM-Net achieves state-of-the-art performance compared to existing non-deep learning and recent deep learning methods. Our source code is available at the project website this https URL .
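
The differentiable Sinkhorn layer at the heart of the soft assignment is short to write down: alternately normalize the rows and columns of the match-score matrix in log space so it approaches a doubly stochastic soft permutation. A minimal version (RPM-Net additionally handles outliers with slack rows and columns, omitted here):

import torch

def sinkhorn(log_scores, iters=5):
    # log_scores: (batch, N, M) raw correspondence scores
    for _ in range(iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=2, keepdim=True)
        log_scores = log_scores - torch.logsumexp(log_scores, dim=1, keepdim=True)
    return log_scores.exp()  # near doubly stochastic soft assignment

perm = sinkhorn(torch.randn(1, 5, 5))  # toy batch of 5x5 match scores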

20. DeFeat-Net: General Monocular Depth via Simultaneous Unsupervised Representation Learning [PDF] Back to contents
  Jaime Spencer, Richard Bowden, Simon Hadfield
Abstract: In the current monocular depth research, the dominant approach is to employ unsupervised training on large datasets, driven by warped photometric consistency. Such approaches lack robustness and are unable to generalize to challenging domains such as nighttime scenes or adverse weather conditions where assumptions about photometric consistency break down. We propose DeFeat-Net (Depth & Feature network), an approach to simultaneously learn a cross-domain dense feature representation, alongside a robust depth-estimation framework based on warped feature consistency. The resulting feature representation is learned in an unsupervised manner with no explicit ground-truth correspondences required. We show that within a single domain, our technique is comparable to both the current state of the art in monocular depth estimation and supervised feature representation learning. However, by simultaneously learning features, depth and motion, our technique is able to generalize to challenging domains, allowing DeFeat-Net to outperform the current state-of-the-art with around 10% reduction in all error measures on more challenging sequences such as nighttime driving.

21. Computer Aided Detection for Pulmonary Embolism Challenge (CAD-PE) [PDF] Back to contents
  Germán González, Daniel Jimenez-Carretero, Sara Rodríguez-López, Carlos Cano-Espinosa, Miguel Cazorla, Tanya Agarwal, Vinit Agarwal, Nima Tajbakhsh, Michael B. Gotway, Jianming Liang, Mojtaba Masoudi, Noushin Eftekhari, Mahdi Saadatmand, Hamid-Reza Pourreza, Patricia Fraga-Rivas, Eduardo Fraile, Frank J. Rybicki, Ara Kassarjian, Raúl San José Estépar, Maria J. Ledesma-Carbayo
Abstract: Rationale: Computer aided detection (CAD) algorithms for Pulmonary Embolism (PE) algorithms have been shown to increase radiologists' sensitivity with a small increase in specificity. However, CAD for PE has not been adopted into clinical practice, likely because of the high number of false positives current CAD software produces. Objective: To generate a database of annotated computed tomography pulmonary angiographies, use it to compare the sensitivity and false positive rate of current algorithms and to develop new methods that improve such metrics. Methods: 91 Computed tomography pulmonary angiography scans were annotated by at least one radiologist by segmenting all pulmonary emboli visible on the study. 20 annotated CTPAs were open to the public in the form of a medical image analysis challenge. 20 more were kept for evaluation purposes. 51 were made available post-challenge. 8 submissions, 6 of them novel, were evaluated on the 20 evaluation CTPAs. Performance was measured as per embolus sensitivity vs. false positives per scan curve. Results: The best algorithms achieved a per-embolus sensitivity of 75% at 2 false positives per scan (fps) or of 70% at 1 fps, outperforming the state of the art. Deep learning approaches outperformed traditional machine learning ones, and their performance improved with the number of training cases. Significance: Through this work and challenge we have improved the state-of-the art of computer aided detection algorithms for pulmonary embolism. An open database and an evaluation benchmark for such algorithms have been generated, easing the development of further improvements. Implications on clinical practice will need further research.

22. Same Features, Different Day: Weakly Supervised Feature Learning for Seasonal Invariance [PDF] Back to contents
  Jaime Spencer, Richard Bowden, Simon Hadfield
Abstract: "Like night and day" is a commonly used expression to imply that two things are completely different. Unfortunately, this tends to be the case for current visual feature representations of the same scene across varying seasons or times of day. The aim of this paper is to provide a dense feature representation that can be used to perform localization, sparse matching or image retrieval, regardless of the current seasonal or temporal appearance. Recently, there have been several proposed methodologies for deep learning dense feature representations. These methods make use of ground truth pixel-wise correspondences between pairs of images and focus on the spatial properties of the features. As such, they don't address temporal or seasonal variation. Furthermore, obtaining the required pixel-wise correspondence data to train in cross-seasonal environments is highly complex in most scenarios. We propose Deja-Vu, a weakly supervised approach to learning season invariant features that does not require pixel-wise ground truth data. The proposed system only requires coarse labels indicating if two images correspond to the same location or not. From these labels, the network is trained to produce "similar" dense feature maps for corresponding locations despite environmental changes. Code will be made available at: this https URL

23. Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks [PDF] Back to contents
  Thomas Roddick, Roberto Cipolla
Abstract: Autonomous vehicles commonly rely on highly detailed birds-eye-view maps of their environment, which capture both static elements of the scene such as road layout as well as dynamic elements such as other cars and pedestrians. Generating these map representations on the fly is a complex multi-stage process which incorporates many important vision-based elements, including ground plane estimation, road segmentation and 3D object detection. In this work we present a simple, unified approach for estimating maps directly from monocular images using a single end-to-end deep learning architecture. For the maps themselves we adopt a semantic Bayesian occupancy grid framework, allowing us to trivially accumulate information over multiple cameras and timesteps. We demonstrate the effectiveness of our approach by evaluating against several challenging baselines on the NuScenes and Argoverse datasets, and show that we are able to achieve a relative improvement of 9.1% and 22.3% respectively compared to the best-performing existing method.

24. Context Based Emotion Recognition using EMOTIC Dataset [PDF] Back to contents
  Ronak Kosti, Jose M. Alvarez, Adria Recasens, Agata Lapedriza
Abstract: In our everyday lives and social interactions we often try to perceive the emotional states of people. There has been a lot of research in providing machines with a similar capacity for recognizing emotions. From a computer vision perspective, most of the previous efforts have focused on analyzing the facial expressions and, in some cases, also the body pose. Some of these methods work remarkably well in specific settings. However, their performance is limited in natural, unconstrained environments. Psychological studies show that the scene context, in addition to facial expression and body pose, provides important information to our perception of people's emotions. However, the processing of the context for automatic emotion recognition has not been explored in depth, partly due to the lack of proper data. In this paper we present EMOTIC, a dataset of images of people in a diverse set of natural situations, annotated with their apparent emotion. The EMOTIC dataset combines two different types of emotion representation: (1) a set of 26 discrete categories, and (2) the continuous dimensions Valence, Arousal, and Dominance. We also present a detailed statistical and algorithmic analysis of the dataset along with annotators' agreement analysis. Using the EMOTIC dataset we train different CNN models for emotion recognition, combining the information of the bounding box containing the person with the contextual information extracted from the scene. Our results show how scene context provides important information to automatically recognize emotional states and motivate further research in this direction. The dataset and code are open-sourced and available at: this https URL. Link to the peer-reviewed published article: this https URL

25. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing [PDF] Back to contents
  Qibin Hou, Li Zhang, Ming-Ming Cheng, Jiashi Feng
Abstract: Spatial pooling has been proven highly effective in capturing long-range contextual information for pixel-wise prediction tasks, such as scene parsing. In this paper, beyond conventional spatial pooling that usually has a regular shape of NxN, we rethink the formulation of spatial pooling by introducing a new pooling strategy, called strip pooling, which considers a long but narrow kernel, i.e., 1xN or Nx1. Based on strip pooling, we further investigate spatial pooling architecture design by 1) introducing a new strip pooling module that enables backbone networks to efficiently model long-range dependencies, 2) presenting a novel building block with diverse spatial pooling as a core, and 3) systematically comparing the performance of the proposed strip pooling and conventional spatial pooling techniques. Both novel pooling-based designs are lightweight and can serve as an efficient plug-and-play module in existing scene parsing networks. Extensive experiments on popular benchmarks (e.g., ADE20K and Cityscapes) demonstrate that our simple approach establishes new state-of-the-art results. Code is made available at this https URL.
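
A minimal strip-pooling block averages the feature map along each spatial axis to get 1xW and Hx1 strip descriptors, processes each with a 1D convolution, broadcasts them back to HxW, and uses the fused map to modulate the input. The PyTorch sketch below simplifies the paper's strip pooling module to its essentials:

import torch
import torch.nn as nn

class StripPooling(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv_h = nn.Conv1d(ch, ch, 3, padding=1)  # along height
        self.conv_w = nn.Conv1d(ch, ch, 3, padding=1)  # along width
        self.fuse = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        sh = self.conv_h(x.mean(dim=3))         # N x C x H: pooled over width
        sw = self.conv_w(x.mean(dim=2))         # N x C x W: pooled over height
        s = sh.unsqueeze(3) + sw.unsqueeze(2)   # broadcast back to N x C x H x W
        return x * torch.sigmoid(self.fuse(s))  # modulate the input features

out = StripPooling(32)(torch.randn(2, 32, 64, 64))  # same shape as the input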

26. Active stereo vision three-dimensional reconstruction by RGB dot pattern projection and ray intersection [PDF] Back to contents
  Yongcan Shuang, Zhenzhou Wang
Abstract: Active stereo vision is important in reconstructing objects without obvious textures. However, it is still very challenging to extract and match the projected patterns from two camera views automatically and robustly. In this paper, we propose a new pattern extraction method and a new stereo vision matching method based on our novel structured light pattern. Instead of using the widely used 2D disparity to calculate the depths of the objects, we use the ray intersection to compute the 3D shapes directly. Experimental results showed that the proposed approach could reconstruct the 3D shape of the object significantly more robustly than state of the art methods that include the widely used disparity based active stereo vision method, the time of flight method and the structured light method. In addition, experimental results also showed that the proposed approach could reconstruct the 3D motions of the dynamic shapes robustly.

27. Real-time Fruit Recognition and Grasp Estimation for Autonomous Apple harvesting [PDF] Back to contents
  Hanwen Kang, Chao Chen
Abstract: In this research, a fully neural network based visual perception framework for autonomous apple harvesting is proposed. The proposed framework includes a multi-function neural network for fruit recognition and a Pointnet-based grasp estimation network to determine the proper grasp pose to guide the robotic execution. Fruit recognition takes raw RGB images from the RGB-D camera as input to perform fruit detection and instance segmentation, while the Pointnet grasp estimation takes the point cloud of each fruit as input and outputs a predicted grasp pose for each fruit. The proposed framework is validated by using RGB-D images collected from laboratory and orchard environments; a robotic grasping test in a controlled environment is also included in the experiments. Experiments show that the proposed framework can accurately localise fruits and estimate the grasp pose for robotic grasping.

28. Unsupervised Model Personalization while Preserving Privacy and Scalability: An Open Problem [PDF] 返回目录
  Matthias De Lange, Xu Jia, Sarah Parisot, Ales Leonardis, Gregory Slabaugh, Tinne Tuytelaars
Abstract: This work investigates the task of unsupervised model personalization, adapted to continually evolving, unlabeled local user images. We consider the practical scenario where a high capacity server interacts with a myriad of resource-limited edge devices, imposing strong requirements on scalability and local data privacy. We aim to address this challenge within the continual learning paradigm and provide a novel Dual User-Adaptation framework (DUA) to explore the problem. This framework flexibly disentangles user-adaptation into model personalization on the server and local data regularization on the user device, with desirable properties regarding scalability and privacy constraints. First, on the server, we introduce incremental learning of task-specific expert models, subsequently aggregated using a concealed unsupervised user prior. Aggregation avoids retraining, whereas the user prior conceals sensitive raw user data, and grants unsupervised adaptation. Second, local user-adaptation incorporates a domain adaptation point of view, adapting regularizing batch normalization parameters to the user data. We explore various empirical user configurations with different priors in categories and a tenfold of transforms for MIT Indoor Scene recognition, and classify numbers in a combined MNIST and SVHN setup. Extensive experiments yield promising results for data-driven local adaptation and elicit user priors for server adaptation to depend on the model rather than user data. Hence, although user-adaptation remains a challenging open problem, the DUA framework formalizes a principled foundation for personalizing both on server and user device, while maintaining privacy and scalability.
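One plausible reading of the local "regularizing batch normalization parameters" step is re-estimating BN running statistics on the unlabeled user images; the sketch below assumes that reading, and the momentum value and loop structure are arbitrary choices, not the paper's method.

import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_stats(model, user_loader, momentum=0.1):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()
            m.momentum = momentum
    model.train()          # BN layers update running stats in train mode
    for images in user_loader:
        model(images)      # forward passes only; no labels, no gradient steps
    model.eval()
    return model

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
loader = [torch.randn(4, 3, 32, 32) for _ in range(5)]   # stand-in user images
adapt_bn_stats(model, loader)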

29. PANDA: Prototypical Unsupervised Domain Adaptation [PDF] 返回目录
  Dapeng Hu, Jian Liang, Qibin Hou, Hanshu Yan, Yunpeng Chen, Shuicheng Yan, Jiashi Feng
Abstract: Previous adversarial domain alignment methods for unsupervised domain adaptation (UDA) pursue conditional domain alignment via intermediate pseudo labels. However, these pseudo labels are generated by independent instances without considering the global data structure and tend to be noisy, making them unreliable for adversarial domain adaptation. Compared with pseudo labels, prototypes are more reliable to represent the data structure resistant to the domain shift since they are summarized over all the relevant instances. In this work, we attempt to calibrate the noisy pseudo labels with prototypes. Specifically, we first obtain a reliable prototypical representation for each instance by multiplying the soft instance predictions with the global prototypes. Based on the prototypical representation, we propose a novel Prototypical Adversarial Learning (PAL) scheme and exploit it to align both feature representations and intermediate prototypes across domains. Besides, with the intermediate prototypes as a proxy, we further minimize the intra-class variance in the target domain to adaptively improve the pseudo labels. Integrating the three objectives, we develop an unified framework termed PrototypicAl uNsupervised Domain Adaptation (PANDA) for UDA. Experiments show that PANDA achieves state-of-the-art or competitive results on multiple UDA benchmarks including both object recognition and semantic segmentation tasks.
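The prototypical representation described above ("multiplying the soft instance predictions with the global prototypes") reduces to a small matrix product; the sketch below is illustrative, and the shapes are assumptions.

import torch
import torch.nn.functional as F

def prototypical_representation(logits, prototypes):
    """logits: (B, C) class scores; prototypes: (C, D) per-class prototypes."""
    soft = F.softmax(logits, dim=1)     # (B, C) soft instance predictions
    return soft @ prototypes            # (B, D) prototype-weighted features

logits = torch.randn(8, 10)
prototypes = torch.randn(10, 256)
print(prototypical_representation(logits, prototypes).shape)  # (8, 256)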

30. Multi-Objective Matrix Normalization for Fine-grained Visual Recognition [PDF] 返回目录
  Shaobo Min, Hantao Yao, Hongtao Xie, Zheng-Jun Zha, Yongdong Zhang
Abstract: Bilinear pooling achieves great success in fine-grained visual recognition (FGVC). Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features, but some problems, e.g., redundant information and over-fitting, remain to be resolved. In this paper, we propose an efficient Multi-Objective Matrix Normalization (MOMN) method that can simultaneously normalize a bilinear representation in terms of square-root, low-rank, and sparsity. These three regularizers can not only stabilize the second-order information, but also compact the bilinear features and promote model generalization. In MOMN, a core challenge is how to jointly optimize three non-smooth regularizers of different convex properties. To this end, MOMN first formulates them into an augmented Lagrange formula with approximated regularizer constraints. Then, auxiliary variables are introduced to relax different constraints, which allow each regularizer to be solved alternately. Finally, several updating strategies based on gradient descent are designed to obtain consistent convergence and efficient implementation. Consequently, MOMN is implemented with only matrix multiplication, which is well-compatible with GPU acceleration, and the normalized bilinear features are stabilized and discriminative. Experiments on five public benchmarks for FGVC demonstrate that the proposed MOMN is superior to existing normalization-based methods in terms of both accuracy and efficiency. The code is available at this https URL.
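Of MOMN's three regularizers, the square-root term alone can be illustrated with the classic Newton-Schulz iteration for approximate matrix square roots, as sketched below; the low-rank and sparsity terms and the augmented-Lagrangian coupling are omitted, so this is background intuition rather than the paper's full method.

import torch

def newton_schulz_sqrt(A, iters=5):
    """A: (D, D) SPD bilinear matrix; returns an approximate A^(1/2)."""
    norm = A.norm()                      # pre-scale so the iteration converges
    Y = A / norm
    Z = torch.eye(A.size(0))
    for _ in range(iters):               # coupled Newton-Schulz updates
        T = 0.5 * (3.0 * torch.eye(A.size(0)) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * norm.sqrt()

X = torch.randn(64, 8)
A = X @ X.t() + 1e-3 * torch.eye(64)     # SPD bilinear-style matrix
S = newton_schulz_sqrt(A)
# Relative reconstruction error; shrinks as iters grows.
print((S @ S - A).norm().item() / A.norm().item())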

31. Architecture Disentanglement for Deep Neural Networks [PDF] 返回目录
  Jie Hu, Rongrong Ji, Qixiang Ye, Tong Tong, ShengChuan Zhang, Ke Li, Feiyue Huang, Ling Shao
Abstract: Deep Neural Networks (DNNs) are central to deep learning, and understanding their internal working mechanism is crucial if they are to be used for emerging applications in medical and industrial AI. To this end, the current line of research typically involves linking semantic concepts to a DNN's units or layers. However, this fails to capture the hierarchical inference procedure throughout the network. To address this issue, we introduce the novel concept of Neural Architecture Disentanglement (NAD) in this paper. Specifically, we disentangle a pre-trained network into hierarchical paths corresponding to specific concepts, forming the concept feature paths, i.e., the concept flows from the bottom to top layers of a DNN. Such paths further enable us to quantify the interpretability of DNNs according to the learned diversity of human concepts. We select four types of representative architectures ranging from handcrafted to autoML-based, and conduct extensive experiments on object-based and scene-based datasets. Our NAD sheds important light on the information flow of semantic concepts in DNNs, and provides a fundamental metric that will facilitate the design of interpretable network architectures. Code will be available at: this https URL.

32. Towards Palmprint Verification On Smartphones [PDF] 返回目录
  Yingyi Zhang, Lin Zhang, Ruixin Zhang, Shaoxin Li, Jilin Li, Feiyue Huang
Abstract: With the rapid development of mobile devices, smartphones have gradually become an indispensable part of people's lives. Meanwhile, biometric authentication has been corroborated to be an effective method for establishing a person's identity with high confidence. Hence, recently, biometric technologies for smartphones have also become increasingly sophisticated and popular. But it is noteworthy that the application potential of palmprints for smartphones is seriously underestimated. Studies in the past two decades have shown that palmprints have outstanding merits in uniqueness and permanence, and have high user acceptance. However, currently, studies specializing in palmprint verification for smartphones are still quite sporadic, especially when compared to face- or fingerprint-oriented ones. In this paper, aiming to fill the aforementioned research gap, we conducted a thorough study of palmprint verification on smartphones and our contributions are twofold. First, to facilitate the study of palmprint verification on smartphones, we established an annotated palmprint dataset named MPD, which was collected by multi-brand smartphones in two separate sessions with various backgrounds and illumination conditions. As the largest dataset in this field, MPD contains 16,000 palm images collected from 200 subjects. Second, we built a DCNN-based palmprint verification system named DeepMPV+ for smartphones. In DeepMPV+, two key steps, ROI extraction and ROI matching, are both formulated as learning problems and then solved naturally by modern DCNN models. The efficiency and efficacy of DeepMPV+ have been corroborated by extensive experiments. To make our results fully reproducible, the labeled dataset and the relevant source codes have been made publicly available at this https URL.

33. Domain-aware Visual Bias Eliminating for Generalized Zero-Shot Learning [PDF] 返回目录
  Shaobo Min, Hantao Yao, Hongtao Xie, Chaoqun Wang, Zheng-Jun Zha, Yongdong Zhang
Abstract: Recent methods focus on learning a unified semantic-aligned visual representation to transfer knowledge between two domains, while ignoring the effect of semantic-free visual representation in alleviating the biased recognition problem. In this paper, we propose a novel Domain-aware Visual Bias Eliminating (DVBE) network that constructs two complementary visual representations, i.e., semantic-free and semantic-aligned, to treat seen and unseen domains separately. Specifically, we explore cross-attentive second-order visual statistics to compact the semantic-free representation, and design an adaptive margin Softmax to maximize inter-class divergences. Thus, the semantic-free representation becomes discriminative enough to not only predict seen class accurately but also filter out unseen images, i.e., domain detection, based on the predicted class entropy. For unseen images, we automatically search an optimal semantic-visual alignment architecture, rather than manual designs, to predict unseen classes. With accurate domain detection, the biased recognition problem towards the seen domain is significantly reduced. Experiments on five benchmarks for classification and segmentation show that DVBE outperforms existing methods by averaged 5.7% improvement.
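For intuition, an additive margin softmax of the kind the paper adapts can be sketched as below; the paper learns the margin adaptively, whereas the margin m and scale s here are fixed assumptions.

import torch
import torch.nn.functional as F

def margin_softmax_loss(logits, target, m=0.35, s=16.0):
    # logits would typically be cosine similarities of normalized features;
    # subtract a margin from the target-class logit before cross-entropy.
    one_hot = F.one_hot(target, logits.size(1)).float()
    return F.cross_entropy(s * (logits - m * one_hot), target)

logits = torch.randn(8, 50)
target = torch.randint(0, 50, (8,))
print(margin_softmax_loss(logits, target).item())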

34. TapLab: A Fast Framework for Semantic Video Segmentation Tapping into Compressed-Domain Knowledge [PDF] 返回目录
  Junyi Feng, Songyuan Li, Xi Li, Fei Wu, Qi Tian, Ming-Hsuan Yang, Haibin Ling
Abstract: Real-time semantic video segmentation is a challenging task due to the strict requirements of inference speed. Recent approaches mainly devote great efforts to reducing the model size for high efficiency. In this paper, we rethink this problem from a different viewpoint: using knowledge contained in compressed videos. We propose a simple and effective framework, dubbed TapLab, to tap into resources from the compressed domain. Specifically, we design a fast feature warping module using motion vectors for acceleration. To reduce the noise introduced by motion vectors, we design a residual-guided correction module and a residual-guided frame selection module using residuals. Compared with the state-of-the-art fast semantic image segmentation models, our proposed TapLab significantly reduces redundant computations, running around 3 times faster with comparable accuracy for 1024x2048 video. The experimental results show that TapLab achieves 70.6% mIoU on the Cityscapes dataset at 99.8 FPS with a single GPU card. A high-speed version even reaches the speed of 160+ FPS.
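A feature-warping step driven by compressed-domain motion vectors can be sketched with grid_sample as below. This ignores the paper's residual-guided correction and frame selection modules, and the pixel-offset convention is an assumption.

import torch
import torch.nn.functional as F

def warp_with_motion_vectors(feat, mv):
    """feat: (B, C, H, W) key-frame features; mv: (B, 2, H, W) motion in pixels."""
    b, _, h, w = feat.shape
    xs = torch.arange(w).view(1, w).expand(h, w).float()
    ys = torch.arange(h).view(h, 1).expand(h, w).float()
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + mv.permute(0, 2, 3, 1)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2 * grid[..., 0] / (w - 1) - 1
    gy = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

feat = torch.randn(1, 64, 32, 32)
mv = torch.zeros(1, 2, 32, 32)   # zero motion should reproduce the input
print(torch.allclose(warp_with_motion_vectors(feat, mv), feat, atol=1e-5))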

35. Memory Aggregation Networks for Efficient Interactive Video Object Segmentation [PDF] 返回目录
  Jiaxu Miao, Yunchao Wei, Yi Yang
Abstract: Interactive video object segmentation (iVOS) aims at efficiently harvesting high-quality segmentation masks of the target object in a video with user interactions. Most previous state-of-the-art methods tackle iVOS with two independent networks for conducting user interaction and temporal propagation, respectively, leading to inefficiencies during the inference stage. In this work, we propose a unified framework, named Memory Aggregation Networks (MA-Net), to address the challenging iVOS task in a more efficient way. Our MA-Net integrates the interaction and the propagation operations into a single network, which significantly promotes the efficiency of iVOS in the scheme of multi-round interactions. More importantly, we propose a simple yet effective memory aggregation mechanism to record the informative knowledge from the previous interaction rounds, greatly improving the robustness in discovering challenging objects of interest. We conduct extensive experiments on the validation set of the DAVIS Challenge 2018 benchmark. In particular, our MA-Net achieves a J@60 score of 76.1% without any bells and whistles, outperforming the state of the art by more than 2.7%.

36. Physical Model Guided Deep Image Deraining [PDF] 返回目录
  Honghe Zhu, Cong Wang, Yajie Zhang, Zhixun Su, Guohui Zhao
Abstract: Single image deraining is an urgent task because degraded rainy images cause many computer vision systems to fail, such as video surveillance and autonomous driving. Thus, deraining is important and an effective deraining algorithm is needed. In this paper, we propose a novel network based on physical-model-guided learning for single image deraining, which consists of three sub-networks: a rain streaks network, a rain-free network, and a guide-learning network. The concatenation of the rain streaks and the rain-free image, estimated by the rain streaks network and the rain-free network respectively, is input to the guide-learning network to guide further learning, and the direct sum of the two estimated images is constrained to match the input rainy image based on the physical model of a rainy image. Moreover, we further develop the Multi-Scale Residual Block (MSRB) to better utilize multi-scale information, which is proved to boost the deraining performance. Quantitative and qualitative experimental results demonstrate that the proposed method outperforms the state-of-the-art deraining methods. The source code will be available at \url{this https URL}.
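The physical-model constraint (rainy image = rain-free background + rain streaks) amounts to a simple reconstruction term; the sketch below is illustrative, and the choice of an L1 penalty is an assumption.

import torch
import torch.nn.functional as F

def physical_model_loss(rain_streaks, rain_free, rainy_input):
    # O ~ B + R: the direct sum of the two estimates must reproduce the input.
    return F.l1_loss(rain_free + rain_streaks, rainy_input)

O = torch.rand(2, 3, 64, 64)                      # stand-in rainy image
R_hat, B_hat = torch.rand_like(O) * 0.1, O * 0.9  # stand-in network outputs
print(physical_model_loss(R_hat, B_hat, O).item())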

37. MetaFuse: A Pre-trained Fusion Model for Human Pose Estimation [PDF] 返回目录
  Rongchang Xie, Chunyu Wang, Yizhou Wang
Abstract: Cross-view feature fusion is the key to addressing the occlusion problem in human pose estimation. Current fusion methods need to train a separate model for every pair of cameras, making them difficult to scale. In this work, we introduce MetaFuse, a pre-trained fusion model learned from a large number of cameras in the Panoptic dataset. The model can be efficiently adapted or finetuned for a new pair of cameras using a small number of labeled images. The strong adaptation power of MetaFuse is due in large part to the proposed factorization of the original fusion model into two parts: (1) a generic fusion model shared by all cameras, and (2) lightweight camera-dependent transformations. Furthermore, the generic model is learned from many cameras by a meta-learning-style algorithm to maximize its adaptation capability to various camera poses. We observe in experiments that MetaFuse finetuned on the public datasets outperforms the state of the art by a large margin, which validates its value in practice.

38. Learning Memory-guided Normality for Anomaly Detection [PDF] 返回目录
  Hyunjong Park, Jongyoun Noh, Bumsub Ham
Abstract: We address the problem of anomaly detection, that is, detecting anomalous events in a video sequence. Anomaly detection methods based on convolutional neural networks (CNNs) typically leverage proxy tasks, such as reconstructing input video frames, to learn models describing normality without seeing anomalous samples at training time, and quantify the extent of abnormalities using the reconstruction error at test time. The main drawbacks of these approaches are that they do not consider the diversity of normal patterns explicitly, and the powerful representation capacity of CNNs allows to reconstruct abnormal video frames. To address this problem, we present an unsupervised learning approach to anomaly detection that considers the diversity of normal patterns explicitly, while lessening the representation capacity of CNNs. To this end, we propose to use a memory module with a new update scheme where items in the memory record prototypical patterns of normal data. We also present novel feature compactness and separateness losses to train the memory, boosting the discriminative power of both memory items and deeply learned features from normal data. Experimental results on standard benchmarks demonstrate the effectiveness and efficiency of our approach, which outperforms the state of the art.
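The compactness and separateness losses around memory items can be sketched as below: pull each query feature to its nearest memory item, and push the nearest and second-nearest items apart with a margin. Shapes and the margin value are assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def memory_losses(queries, memory, margin=1.0):
    """queries: (N, D) feature vectors; memory: (M, D) memory items."""
    d = torch.cdist(queries, memory)             # (N, M) pairwise distances
    near2, _ = d.topk(2, dim=1, largest=False)   # two closest items per query
    compact = near2[:, 0].pow(2).mean()          # pull query to nearest item
    # Triplet-style margin between nearest and second-nearest items.
    separate = F.relu(near2[:, 0] - near2[:, 1] + margin).mean()
    return compact, separate

q, mem = torch.randn(16, 128), torch.randn(10, 128)
print([v.item() for v in memory_losses(q, mem)])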

39. Learning to Learn Single Domain Generalization [PDF] 返回目录
  Fengchun Qiao, Long Zhao, Xi Peng
Abstract: We are concerned with a worst-case scenario in model generalization, in the sense that a model aims to perform well on many unseen domains while there is only one single domain available for training. We propose a new method named adversarial domain augmentation to solve this Out-of-Distribution (OOD) generalization problem. The key idea is to leverage adversarial training to create "fictitious" yet "challenging" populations, from which a model can learn to generalize with theoretical guarantees. To facilitate fast and desirable domain augmentation, we cast the model training in a meta-learning scheme and use a Wasserstein Auto-Encoder (WAE) to relax the widely used worst-case constraint. Detailed theoretical analysis is provided to testify our formulation, while extensive experiments on multiple benchmark datasets indicate its superior performance in tackling single domain generalization.

40. Cross-Domain Document Object Detection: Benchmark Suite and Method [PDF] 返回目录
  Kai Li, Curtis Wigington, Chris Tensmeyer, Handong Zhao, Nikolaos Barmpalios, Vlad I. Morariu, Varun Manjunatha, Tong Sun, Yun Fu
Abstract: Decomposing images of document pages into high-level semantic regions (e.g., figures, tables, paragraphs), document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding. DOD remains a challenging problem as document objects vary significantly in layout, size, aspect ratio, texture, etc. An additional challenge arises in practice because large labeled training datasets are only available for domains that differ from the target domain. We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain. Documents from the two domains may vary significantly in layout, language, and genre. We establish a benchmark suite consisting of different types of PDF document datasets that can be utilized for cross-domain DOD model training and evaluation. For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files. Moreover, we propose a novel cross-domain DOD model which builds upon the standard detection model and addresses domain shifts by incorporating three novel alignment modules: Feature Pyramid Alignment (FPA) module, Region Alignment (RA) module and Rendering Layer alignment (RLA) module. Extensive experiments on the benchmark suite substantiate the efficacy of the three proposed modules and the proposed method significantly outperforms the baseline methods. The project page is at \url{this https URL}.

41. Density-Aware Graph for Deep Semi-Supervised Visual Recognition [PDF] 返回目录
  Suichan Li, Bin Liu, Dongdong Chen, Qi Chu, Lu Yuan, Nenghai Yu
Abstract: Semi-supervised learning (SSL) has been extensively studied to improve the generalization ability of deep neural networks for visual recognition. To involve the unlabelled data, most existing SSL methods are based on common density-based cluster assumption: samples lying in the same high-density region are likely to belong to the same class, including the methods performing consistency regularization or generating pseudo-labels for the unlabelled images. Despite their impressive performance, we argue three limitations exist: 1) Though the density information is demonstrated to be an important clue, they all use it in an implicit way and have not exploited it in depth. 2) For feature learning, they often learn the feature embedding based on the single data sample and ignore the neighborhood information. 3) For label-propagation based pseudo-label generation, it is often done offline and difficult to be end-to-end trained with feature learning. Motivated by these limitations, this paper proposes to solve the SSL problem by building a novel density-aware graph, based on which the neighborhood information can be easily leveraged and the feature learning and label propagation can also be trained in an end-to-end way. Specifically, we first propose a new Density-aware Neighborhood Aggregation(DNA) module to learn more discriminative features by incorporating the neighborhood information in a density-aware manner. Then a novel Density-ascending Path based Label Propagation(DPLP) module is proposed to generate the pseudo-labels for unlabeled samples more efficiently according to the feature distribution characterized by density. Finally, the DNA module and DPLP module evolve and improve each other end-to-end.

42. Adversarial Feature Hallucination Networks for Few-Shot Learning [PDF] 返回目录
  Kai Li, Yulun Zhang, Kunpeng Li, Yun Fu
Abstract: The recent flourishing of deep learning in various tasks is largely credited to the rich and accessible labeled data. Nonetheless, massive supervision remains a luxury for many real applications, boosting great interest in label-scarce techniques such as few-shot learning (FSL), which aims to learn concepts of new classes with a few labeled samples. A natural approach to FSL is data augmentation, and many recent works have proved the feasibility by proposing various data synthesis models. However, these models fail to well secure the discriminability and diversity of the synthesized data and thus often produce undesirable results. In this paper, we propose Adversarial Feature Hallucination Networks (AFHN), which are based on conditional Wasserstein Generative Adversarial networks (cWGAN) and hallucinate diverse and discriminative features conditioned on the few labeled samples. Two novel regularizers, i.e., the classification regularizer and the anti-collapse regularizer, are incorporated into AFHN to encourage discriminability and diversity of the synthesized features, respectively. An ablation study verifies the effectiveness of the proposed cWGAN-based feature hallucination framework and the proposed regularizers. Comparative results on three common benchmark datasets substantiate the superiority of AFHN over existing data-augmentation-based FSL approaches and other state-of-the-art ones.

43. Incremental Learning In Online Scenario [PDF] 返回目录
  Jiangpeng He, Runyu Mao, Zeman Shao, Fengqing Zhu
Abstract: Modern deep learning approaches have achieved great success in many vision applications by training a model using all available task-specific data. However, there are two major obstacles that make it challenging to implement them in real-life applications: (1) Learning new classes makes the trained model quickly forget old-class knowledge, which is referred to as catastrophic forgetting. (2) As new observations of old classes arrive sequentially over time, the distribution may change in unforeseen ways, making performance degrade dramatically on future data, which is referred to as concept drift. Current state-of-the-art incremental learning methods require a long time to train the model whenever new classes are added, and none of them takes into consideration the new observations of old classes. In this paper, we propose an incremental learning framework that can work in the challenging online learning scenario and handle both new class data and new observations of old classes. We address problem (1) in the online mode by introducing a modified cross-distillation loss together with a two-step learning technique. Our method outperforms the results obtained by current state-of-the-art offline incremental learning methods on the CIFAR-100 and ImageNet-1000 (ILSVRC 2012) datasets under the same experiment protocol but in the online scenario. We also provide a simple yet effective method to mitigate problem (2) by updating the exemplar set using the feature of each new observation of old classes, and demonstrate a real-life application of online food image classification based on our complete framework using the Food-101 dataset.
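As background for the modified cross-distillation loss, a standard temperature-scaled distillation term looks like the following; the paper's actual formulation differs, and the temperature T is an assumption.

import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, T=2.0):
    """Keep the new model's predictions on old classes close to the old model's."""
    log_p = F.log_softmax(new_logits / T, dim=1)
    q = F.softmax(old_logits / T, dim=1)
    return F.kl_div(log_p, q, reduction='batchmean') * T * T

new_logits = torch.randn(8, 10, requires_grad=True)   # current model
old_logits = torch.randn(8, 10)                       # frozen old model
print(distillation_loss(new_logits, old_logits).item())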

44. Gradually Vanishing Bridge for Adversarial Domain Adaptation [PDF] 返回目录
  Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, Qi Tian
Abstract: In unsupervised domain adaptation, rich domain-specific characteristics bring great challenges to learning domain-invariant representations. However, domain discrepancy is assumed to be directly minimized in existing solutions, which is difficult to achieve in practice. Some methods alleviate the difficulty by explicitly modeling domain-invariant and domain-specific parts in the representations, but the adverse influence of the explicit construction lies in the residual domain-specific characteristics in the constructed domain-invariant representations. In this paper, we equip adversarial domain adaptation with a Gradually Vanishing Bridge (GVB) mechanism on both the generator and the discriminator. On the generator, GVB not only reduces the overall transfer difficulty, but also reduces the influence of the residual domain-specific characteristics in domain-invariant representations. On the discriminator, GVB contributes to enhanced discriminating ability and a balanced adversarial training process. Experiments on three challenging datasets show that our GVB methods outperform strong competitors and cooperate well with other adversarial methods. The code is available at this https URL.

45. Space-Time-Aware Multi-Resolution Video Enhancement [PDF] 返回目录
  Muhammad Haris, Greg Shakhnarovich, Norimichi Ukita
Abstract: We consider the problem of space-time super-resolution (ST-SR): increasing spatial resolution of video frames and simultaneously interpolating frames to increase the frame rate. Modern approaches handle these axes one at a time. In contrast, our proposed model called STARnet super-resolves jointly in space and time. This allows us to leverage mutually informative relationships between time and space: higher resolution can provide more detailed information about motion, and higher frame-rate can provide better pixel alignment. The components of our model that generate latent low- and high-resolution representations during ST-SR can be used to finetune a specialized mechanism for just spatial or just temporal super-resolution. Experimental results demonstrate that STARnet improves the performances of space-time, spatial, and temporal video super-resolution by substantial margins on publicly available datasets.

46. Learning Interactions and Relationships between Movie Characters [PDF] 返回目录
  Anna Kukleva, Makarand Tapaswi, Ivan Laptev
Abstract: Interactions between people are often governed by their relationships. On the flip side, social relationships are built upon several interactions. Two strangers are more likely to greet and introduce themselves while becoming friends over time. We are fascinated by this interplay between interactions and relationships, and believe that it is an important aspect of understanding social situations. In this work, we propose neural models to learn and jointly predict interactions, relationships, and the pair of characters that are involved. We note that interactions are informed by a mixture of visual and dialog cues, and present a multimodal architecture to extract meaningful information from them. Localizing the pair of interacting characters in video is a time-consuming process, instead, we train our model to learn from clip-level weak labels. We evaluate our models on the MovieGraphs dataset and show the impact of modalities, use of longer temporal context for predicting relationships, and achieve encouraging performance using weak labels as compared with ground-truth labels. Code is online.

47. Learning a Weakly-Supervised Video Actor-Action Segmentation Model with a Wise Selection [PDF] 返回目录
  Jie Chen, Zhiheng Li, Jiebo Luo, Chenliang Xu
Abstract: We address weakly-supervised video actor-action segmentation (VAAS), which extends general video object segmentation (VOS) to additionally consider action labels of the actors. The most successful methods on VOS synthesize a pool of pseudo-annotations (PAs) and then refine them iteratively. However, they face challenges as to how to select high-quality PAs from a massive pool, how to set an appropriate stop condition for weakly-supervised training, and how to initialize PAs pertaining to VAAS. To overcome these challenges, we propose a general Weakly-Supervised framework with a Wise Selection of training samples and model evaluation criterion (WS^2). Instead of blindly trusting quality-inconsistent PAs, WS^2 employs a learning-based selection to select effective PAs and a novel region integrity criterion as a stopping condition for weakly-supervised training. In addition, a 3D-Conv GCAM is devised to adapt to the VAAS task. Extensive experiments show that WS^2 achieves state-of-the-art performance on both weakly-supervised VOS and VAAS tasks and is on par with the best fully-supervised method on VAAS.

48. Detection of 3D Bounding Boxes of Vehicles Using Perspective Transformation for Accurate Speed Measurement [PDF] 返回目录
  Viktor Kocur, Milan Ftáčnik
Abstract: Detection and tracking of vehicles captured by traffic surveillance cameras is a key component of intelligent transportation systems. We present an improved version of our algorithm for detection of 3D bounding boxes of vehicles, their tracking and subsequent speed estimation. Our algorithm utilizes the known geometry of vanishing points in the surveilled scene to construct a perspective transformation. The transformation enables an intuitive simplification of the problem of detecting 3D bounding boxes to detection of 2D bounding boxes with one additional parameter using a standard 2D object detector. Main contribution of this paper is an improved construction of the perspective transformation which is more robust and fully automatic and an extended experimental evaluation of speed estimation. We test our algorithm on the speed estimation task of the BrnoCompSpeed dataset. We evaluate our approach with different configurations to gauge the relationship between accuracy and computational costs and benefits of 3D bounding box detection over 2D detection. All of the tested configurations run in real-time and are fully automatic. Compared to other published state-of-the-art fully automatic results our algorithm reduces the mean absolute speed measurement error by 32% (1.10 km/h to 0.75 km/h) and the absolute median error by 40% (0.97 km/h to 0.58 km/h).

49. Defect segmentation: Mapping tunnel lining internal defects with ground penetrating radar data using a convolutional neural network [PDF] 返回目录
  Senlin Yang, Zhengfang Wang, Jing Wang, Anthony G. Cohn, Jiaqi Zhang, Peng Jiang, Qingmei Sui
Abstract: This research proposes a Ground Penetrating Radar (GPR) data processing method for non-destructive detection of tunnel lining internal defects, called defect segmentation. To perform this critical step of automatic tunnel lining detection, the method uses a CNN called SegNet combined with the Lovász softmax loss function to map the internal defect structure with GPR synthetic data, which improves the accuracy, automation, and efficiency of defect detection. The novel method we present overcomes several difficulties of traditional GPR data interpretation, as demonstrated by an evaluation on both synthetic and real data -- to verify the method on real data, a test model containing a known defect was designed and built, and GPR data was obtained and analyzed.

50. High-Order Residual Network for Light Field Super-Resolution [PDF] 返回目录
  Nan Meng, Xiaofei Wu, Jianzhuang Liu, Edmund Y. Lam
Abstract: Plenoptic cameras usually sacrifice the spatial resolution of their sub-aperture images (SAIs) to acquire geometry information from different viewpoints. Several methods have been proposed to mitigate such a spatio-angular trade-off, but they seldom make efficient use of the structural properties of the light field (LF) data. In this paper, we propose a novel high-order residual network to learn the geometric features hierarchically from the LF for reconstruction. An important component in the proposed network is the high-order residual block (HRB), which learns the local geometric features by considering the information from all input views. After fully obtaining the local features learned from each HRB, our model extracts the representative geometric features for spatio-angular upsampling through global residual learning. Additionally, a refinement network follows to further enhance the spatial details by minimizing a perceptual loss. Compared with previous work, our model is tailored to the rich structure inherent in the LF, and therefore can reduce the artifacts near non-Lambertian and occlusion regions. Experimental results show that our approach enables high-quality reconstruction even in challenging regions and outperforms state-of-the-art single image or LF reconstruction methods in both quantitative measurements and visual evaluation.

51. Disturbance-immune Weight Sharing for Neural Architecture Search [PDF] 返回目录
  Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yong Guo, Peilin Zhao, Junzhou Huang, Mingkui Tan
Abstract: Neural architecture search (NAS) has gained increasing attention in the community of architecture design. One of the key factors behind the success lies in the training efficiency created by the weight sharing (WS) technique. However, WS-based NAS methods often suffer from a performance disturbance (PD) issue. That is, the training of subsequent architectures inevitably disturbs the performance of previously trained architectures due to the partially shared weights. This leads to inaccurate performance estimation for the previous architectures, which makes it hard to learn a good search strategy. To alleviate the performance disturbance issue, we propose a new disturbance-immune update strategy for model updating. Specifically, to preserve the knowledge learned by previous architectures, we constrain the training of subsequent architectures in an orthogonal space via orthogonal gradient descent. Equipped with this strategy, we propose a novel disturbance-immune training scheme for NAS. We theoretically analyze the effectiveness of our strategy in alleviating the PD risk. Extensive experiments on CIFAR-10 and ImageNet verify the superiority of our method.
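The orthogonal-gradient-descent idea can be sketched as projecting the new architecture's gradient onto the complement of directions stored from previously trained architectures; how those directions are collected and averaged is an assumption here, not the paper's exact procedure.

import torch
import torch.nn.functional as F

def orthogonal_project(grad, basis):
    """grad: (D,) new gradient; basis: list of (D,) orthonormal directions."""
    for b in basis:
        grad = grad - (grad @ b) * b   # remove the component along b
    return grad

g = torch.randn(100)
basis = [F.normalize(torch.randn(100), dim=0)]   # stand-in stored direction
g_proj = orthogonal_project(g.clone(), basis)
print((g_proj @ basis[0]).abs().item() < 1e-5)   # orthogonal to stored direction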

52. Generative Partial Multi-View Clustering [PDF] 返回目录
  Qianqian Wang, Zhengming Ding, Zhiqiang Tao, Quanxue Gao, Yun Fu
Abstract: Nowadays, with the rapid development of data collection sources and feature extraction methods, multi-view data are becoming easy to obtain and have received increasing research attention in recent years. Among multi-view methods, multi-view clustering (MVC) forms a mainstream research direction and is widely used in data analysis. However, existing MVC methods mainly assume that each sample appears in all the views, without considering the incomplete-view case caused by data corruption, sensor failure, equipment malfunction, etc. In this study, we design and build a generative partial multi-view clustering model, named GP-MVC, to address the incomplete multi-view problem by explicitly generating the data of missing views. The main idea of GP-MVC is two-fold. First, multi-view encoder networks are trained to learn common low-dimensional representations, followed by a clustering layer to capture the consistent cluster structure across multiple views. Second, view-specific generative adversarial networks are developed to generate the missing data of one view conditioned on the shared representation given by other views. These two steps promote each other mutually, where learning common representations facilitates data imputation, and the generated data further strengthen the view consistency. Moreover, a weighted adaptive fusion scheme is implemented to exploit the complementary information among different views. Experimental results on four benchmark datasets show the effectiveness of the proposed GP-MVC over the state-of-the-art methods.

53. Structure-Preserving Super Resolution with Gradient Guidance [PDF] 返回目录
  Cheng Ma, Yongming Rao, Yean Cheng, Ce Chen, Jiwen Lu, Jie Zhou
Abstract: Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial network (GAN) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super resolution method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptual-pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic, which can be potentially used for off-the-shelf SR networks. Experimental results show that we achieve the best PI and LPIPS performance and meanwhile comparable PSNR and SSIM compared with state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.
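The second-order (gradient-space) restriction can be sketched with fixed Sobel filters and an L1 loss between gradient maps, as below; the paper's gradient branch and loss weighting are richer than this minimal version.

import torch
import torch.nn.functional as F

def gradient_map(img):
    """img: (B, C, H, W); returns a per-channel gradient magnitude map."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    c = img.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)   # depthwise Sobel filters
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)
    gy = F.conv2d(img, ky, padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def gradient_loss(sr, hr):
    # Second-order restriction: match gradient maps of SR and ground truth.
    return F.l1_loss(gradient_map(sr), gradient_map(hr))

sr, hr = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(gradient_loss(sr, hr).item())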

54. Deep Face Super-Resolution with Iterative Collaboration between Attentive Recovery and Landmark Estimation [PDF] 返回目录
  Cheng Ma, Zhenyu Jiang, Yongming Rao, Jiwen Lu, Jie Zhou
Abstract: Recent works based on deep learning and facial priors have succeeded in super-resolving severely degraded facial images. However, the prior knowledge is not fully exploited in existing methods, since facial priors such as landmark and component maps are always estimated by low-resolution or coarsely super-resolved images, which may be inaccurate and thus affect the recovery performance. In this paper, we propose a deep face super-resolution (FSR) method with iterative collaboration between two recurrent networks which focus on facial image recovery and landmark estimation respectively. In each recurrent step, the recovery branch utilizes the prior knowledge of landmarks to yield higher-quality images which facilitate more accurate landmark estimation in turn. Therefore, the iterative information interaction between two processes boosts the performance of each other progressively. Moreover, a new attentive fusion module is designed to strengthen the guidance of landmark maps, where facial components are generated individually and aggregated attentively for better restoration. Quantitative and qualitative experimental results show the proposed method significantly outperforms state-of-the-art FSR methods in recovering high-quality face images.

55. Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification [PDF] 返回目录
  Devesh Walawalkar, Zhiqiang Shen, Zechun Liu, Marios Savvides
Abstract: Convolutional neural networks (CNN) are capable of learning robust representation with different regularization methods and activations as convolutional layers are spatially correlated. Based on this property, a large variety of regional dropout strategies have been proposed, such as Cutout, DropBlock, CutMix, etc. These methods aim to promote the network to generalize better by partially occluding the discriminative parts of objects. However, all of them perform this operation randomly, without capturing the most important region(s) within an object. In this paper, we propose Attentive CutMix, a naturally enhanced augmentation strategy based on CutMix. In each training iteration, we choose the most descriptive regions based on the intermediate attention maps from a feature extractor, which enables searching for the most discriminative parts in an image. Our proposed method is simple yet effective, easy to implement and can boost the baseline significantly. Extensive experiments on CIFAR-10/100, ImageNet datasets with various CNN architectures (in a unified setting) demonstrate the effectiveness of our proposed method, which consistently outperforms the baseline CutMix and other methods by a significant margin.
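The method can be sketched as below: score grid cells of one image with an attention map from a feature extractor, and paste the top-N cells onto another image. The grid size, N, and the attention source are assumptions about the paper's setup.

import torch
import torch.nn.functional as F

def attentive_cutmix(x_src, x_dst, attn, n_patches=6, grid=7):
    """x_src/x_dst: (C, H, W) images; attn: (h, w) attention map of x_src."""
    c, h, w = x_src.shape
    attn = F.interpolate(attn[None, None], size=(grid, grid), mode='bilinear',
                         align_corners=False)[0, 0]
    idx = attn.flatten().topk(n_patches).indices   # most-attended grid cells
    ph, pw = h // grid, w // grid
    mixed = x_dst.clone()
    for i in idx.tolist():
        r, cidx = divmod(i, grid)
        ys, xs = r * ph, cidx * pw
        mixed[:, ys:ys + ph, xs:xs + pw] = x_src[:, ys:ys + ph, xs:xs + pw]
    return mixed   # label weight ~ n_patches / grid**2 for the source class

src, dst = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
attn = torch.rand(14, 14)   # e.g. spatial mean over CNN feature channels
print(attentive_cutmix(src, dst, attn).shape)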

56. Learning by Analogy: Reliable Supervision from Transformations for Unsupervised Optical Flow Estimation [PDF] 返回目录
  Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, Feiyue Huang
Abstract: Unsupervised learning of optical flow, which leverages the supervision from view synthesis, has emerged as a promising alternative to supervised methods. However, the objective of unsupervised learning is likely to be unreliable in challenging scenes. In this work, we present a framework to use more reliable supervision from transformations. It simply twists the general unsupervised learning pipeline by running another forward pass with transformed data from augmentation, along with using transformed predictions of original data as the self-supervision signal. Besides, we further introduce a lightweight network with multiple frames by a highly-shared flow decoder. Our method consistently gets a leap of performance on several benchmarks with the best accuracy among deep unsupervised methods. Also, our method achieves competitive results to recent fully supervised methods while with much fewer parameters.
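The self-supervision-from-transformations idea can be sketched with a single horizontal flip: run a second forward pass on the flipped pair and supervise it with the appropriately flipped (and sign-corrected) prediction of the original pair. The paper uses a broader family of transforms; the stand-in network below is purely illustrative.

import torch
import torch.nn.functional as F

def flip_consistency_loss(flow_net, img1, img2):
    flow = flow_net(img1, img2)                       # (B, 2, H, W)
    flow_t = flow_net(img1.flip(-1), img2.flip(-1))   # pass on transformed pair
    # Flipping swaps left/right, so the teacher flow is flipped spatially and
    # its horizontal component negated.
    teacher = flow.flip(-1)
    teacher = torch.cat([-teacher[:, :1], teacher[:, 1:]], dim=1)
    return F.l1_loss(flow_t, teacher.detach())        # stop-grad on the teacher

net = lambda a, b: (a - b)[:, :2]                     # stand-in "flow network"
i1, i2 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(flip_consistency_loss(net, i1, i2).item())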

57. Noise Modeling, Synthesis and Classification for Generic Object Anti-Spoofing [PDF] 返回目录
  Joel Stehouwer, Yaojie Liu, Amin Jourabloo, Xiaoming Liu
Abstract: Using printed photographs and replayed videos of biometric modalities, such as iris, fingerprint, and face, are common attacks to fool recognition systems into granting access as the genuine user. With the growth of online person-to-person shopping (e.g., Ebay and Craigslist), such attacks also threaten those services, where the online photo illustration might be captured not from real items but from paper or a digital screen. Thus, the study of anti-spoofing should be extended from modality-specific solutions to generic-object-based ones. In this work, we define and tackle the problem of Generic Object Anti-Spoofing (GOAS) for the first time. One significant cue for detecting these attacks is the noise patterns introduced by the capture sensors and spoof mediums. Different sensor/medium combinations can result in diverse noise patterns. We propose a GAN-based architecture to synthesize and identify the noise patterns from seen and unseen medium/sensor combinations. We show that the procedures of synthesis and identification are mutually beneficial. We further demonstrate that the learned GOAS models can directly contribute to modality-specific anti-spoofing without domain transfer. The code and GOSet dataset are available at this http URL.

58. Omni-sourced Webly-supervised Learning for Video Recognition [PDF] 返回目录
  Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, Dahua Lin
Abstract: We introduce OmniSource, a novel framework for leveraging web data to train video recognition models. OmniSource overcomes the barriers between data formats, such as images, short videos, and long untrimmed videos for webly-supervised learning. First, data samples with multiple formats, curated by task-specific data collection and automatically filtered by a teacher model, are transformed into a unified form. Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning. Several good practices, including data balancing, resampling, and cross-dataset mixup are adopted in joint training. Experiments show that by utilizing data from multiple sources and formats, OmniSource is more data-efficient in training. With only 3.5M images and 800K minutes videos crawled from the internet without human labeling (less than 2% of prior works), our models learned with OmniSource improve Top-1 accuracy of 2D- and 3D-ConvNet baseline models by 3.0% and 3.9%, respectively, on the Kinetics-400 benchmark. With OmniSource, we establish new records with different pretraining strategies for video recognition. Our best models achieve 80.4%, 80.5%, and 83.6 Top-1 accuracies on the Kinetics-400 benchmark respectively for training-from-scratch, ImageNet pre-training and IG-65M pre-training.

59. Multi-Path Region Mining For Weakly Supervised 3D Semantic Segmentation on Point Clouds [PDF]
  Jiacheng Wei, Guosheng Lin, Kim-Hui Yap, Tzu-Yi Hung, Lihua Xie
Abstract: Point clouds provide intrinsic geometric information and surface context for scene understanding. Existing methods for point cloud segmentation require a large amount of fully labeled data. With advanced depth sensors, collecting large-scale 3D datasets is no longer a cumbersome process. However, manually producing point-level labels on a large-scale dataset is time- and labor-intensive. In this paper, we propose a weakly supervised approach to predict point-level results using weak labels on 3D point clouds. We introduce a multi-path region mining module to generate pseudo point-level labels from a classification network trained with weak labels. It mines the localization cues for each class from various aspects of the network features using different attention modules. Then, we use the point-level pseudo labels to train a point cloud segmentation network in a fully supervised manner. To the best of our knowledge, this is the first method that uses cloud-level weak labels on raw 3D space to train a point cloud semantic segmentation network. In our setting, the 3D weak labels only indicate the classes that appear in the input sample. We discuss both scene- and subcloud-level weak labels on raw 3D point cloud data and perform in-depth experiments on them. On the ScanNet dataset, our result trained with subcloud-level labels is comparable to some fully supervised methods.

60. Fast-MVSNet: Sparse-to-Dense Multi-View Stereo With Learned Propagation and Gauss-Newton Refinement [PDF]
  Zehao Yu, Shenghua Gao
Abstract: Almost all previous deep learning-based multi-view stereo (MVS) approaches focus on improving reconstruction quality. Besides quality, efficiency is also a desirable feature for MVS in real scenarios. Towards this end, this paper presents a Fast-MVSNet, a novel sparse-to-dense coarse-to-fine framework, for fast and accurate depth estimation in MVS. Specifically, in our Fast-MVSNet, we first construct a sparse cost volume for learning a sparse and high-resolution depth map. Then we leverage a small-scale convolutional neural network to encode the depth dependencies for pixels within a local region to densify the sparse high-resolution depth map. At last, a simple but efficient Gauss-Newton layer is proposed to further optimize the depth map. On one hand, the high-resolution depth map, the data-adaptive propagation method and the Gauss-Newton layer jointly guarantee the effectiveness of our method. On the other hand, all modules in our Fast-MVSNet are lightweight and thus guarantee the efficiency of our approach. Besides, our approach is also memory-friendly because of the sparse depth representation. Extensive experimental results show that our method is 5$\times$ and 14$\times$ faster than Point-MVSNet and R-MVSNet, respectively, while achieving comparable or even better results on the challenging Tanks and Temples dataset as well as the DTU dataset. Code is available at this https URL.
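The Gauss-Newton layer admits a compact illustration. Under the simplifying assumption that each pixel's photometric residuals depend only on that pixel's depth, one refinement step reduces to the per-pixel scalar update d <- d - (J^T J)^{-1} J^T r. A hedged PyTorch sketch of this step (ours, not the paper's implementation):

```python
import torch

def gauss_newton_refine(depth, residual_fn, eps=1e-6):
    """One Gauss-Newton step on per-pixel depth.
    depth:       (N,) current depth estimates.
    residual_fn: differentiable map depth -> (N, K) photometric residuals,
                 assumed per-pixel (residual i depends only on depth i)."""
    depth = depth.detach().requires_grad_(True)
    r = residual_fn(depth)                                   # (N, K)
    # Because residuals are assumed per-pixel, the gradient of r[:, k].sum()
    # w.r.t. depth recovers the k-th Jacobian column entrywise.
    J = torch.stack([torch.autograd.grad(r[:, k].sum(), depth,
                                         retain_graph=True)[0]
                     for k in range(r.shape[1])], dim=1)     # (N, K)
    step = (J * r).sum(1) / ((J * J).sum(1) + eps)           # (J^T J)^-1 J^T r
    return (depth - step).detach()
```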

61. Data-Driven Neuromorphic DRAM-based CNN and RNN Accelerators [PDF]
  Tobi Delbruck, Shih-Chii Liu
Abstract: The energy consumed by running large deep neural networks (DNNs) on hardware accelerators is dominated by the need for lots of fast memory to store both states and weights. This large required memory is currently only economically viable through DRAM. Although DRAM is high-throughput and low-cost memory (costing 20X less than SRAM), its long random access latency is bad for the unpredictable access patterns in spiking neural networks (SNNs). In addition, accessing data from DRAM costs orders of magnitude more energy than doing arithmetic with that data. SNNs are energy-efficient if local memory is available and few spikes are generated. This paper reports on our developments over the last 5 years of convolutional and recurrent deep neural network hardware accelerators that exploit either spatial or temporal sparsity similar to SNNs but achieve state-of-the-art (SOA) throughput, power efficiency, and latency even with the use of DRAM for the required storage of the weights and states of large DNNs.

62. ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings [PDF]
  Jiahui Huang, Sheng Yang, Tai-Jiang Mu, Shi-Min Hu
Abstract: We present ClusterVO, a stereo Visual Odometry which simultaneously clusters and estimates the motion of both ego and surrounding rigid clusters/objects. Unlike previous solutions relying on batch input or imposing priors on scene structure or dynamic object models, ClusterVO is online, general and thus can be used in various scenarios including indoor scene understanding and autonomous driving. At the core of our system lies a multi-level probabilistic association mechanism and a heterogeneous Conditional Random Field (CRF) clustering approach combining semantic, spatial and motion information to jointly infer cluster segmentations online for every frame. The poses of camera and dynamic objects are instantly solved through a sliding-window optimization. Our system is evaluated on Oxford Multimotion and KITTI dataset both quantitatively and qualitatively, reaching comparable results to state-of-the-art solutions on both odometry and dynamic trajectory recovery.

63. Spatial Attention Pyramid Network for Unsupervised Domain Adaptation [PDF]
  Congcong Li, Dawei Du, Libo Zhang, Longyin Wen, Tiejian Luo, Yanjun Wu, Pengfei Zhu
Abstract: Unsupervised domain adaptation is critical in various computer vision tasks, such as object detection, instance segmentation, and semantic segmentation, as it aims to alleviate performance degradation caused by domain shift. Most previous methods rely on a single-mode distribution of the source and target domains to align them with adversarial learning, leading to inferior results in various scenarios. To that end, in this paper, we design a new spatial attention pyramid network for unsupervised domain adaptation. Specifically, we first build the spatial pyramid representation to capture context information of objects at different scales. Guided by the task-specific information, we combine the dense global structure representation and local texture patterns at each spatial location effectively using the spatial attention mechanism. In this way, the network is enforced to focus on the discriminative regions with context information for domain adaptation. We conduct extensive experiments on various challenging datasets for unsupervised domain adaptation on object detection, instance segmentation, and semantic segmentation, which demonstrate that our method performs favorably against state-of-the-art methods by a large margin. Our source code is available at code_path.

64. Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds [PDF]
  Yongming Rao, Jiwen Lu, Jie Zhou
Abstract: Local and global patterns of an object are closely related. Although each part of an object is incomplete, the underlying attributes about the object are shared among all parts, which makes reasoning about the whole object from a single part possible. We hypothesize that a powerful representation of a 3D object should model the attributes that are shared between parts and the whole object, and be distinguishable from other objects. Based on this hypothesis, we propose to learn point cloud representation by bidirectional reasoning between the local structures at different abstraction hierarchies and the global shape without human supervision. Experimental results on various benchmark datasets demonstrate that the unsupervisedly learned representation is even better than the supervised representation in discriminative power, generalization ability, and robustness. We show that unsupervisedly trained point cloud models can outperform their supervised counterparts on downstream classification tasks. Most notably, by simply increasing the channel width of an SSG PointNet++, our unsupervised model surpasses the state-of-the-art supervised methods on both synthetic and real-world 3D object classification datasets. We expect our observations to offer a new perspective on learning better representation from data structures instead of human annotations for point cloud understanding.

65. GPS-Net: Graph Property Sensing Network for Scene Graph Generation [PDF]
  Xin Lin, Changxing Ding, Jinquan Zeng, Dacheng Tao
Abstract: Scene graph generation (SGG) aims to detect objects in an image along with their pairwise relationships. There are three key properties of scene graphs that have been underexplored in recent works: namely, the edge direction information, the difference in priority between nodes, and the long-tailed distribution of relationships. Accordingly, in this paper, we propose a Graph Property Sensing Network (GPS-Net) that fully explores these three properties for SGG. First, we propose a novel message passing module that augments the node feature with node-specific contextual information and encodes the edge direction information via a tri-linear model. Second, we introduce a node priority sensitive loss to reflect the difference in priority between nodes during training. This is achieved by designing a mapping function that adjusts the focusing parameter in the focal loss. Third, since the frequency of relationships is affected by the long-tailed distribution problem, we mitigate this issue by first softening the distribution and then enabling it to be adjusted for each subject-object pair according to their visual appearance. Systematic experiments demonstrate the effectiveness of the proposed techniques. Moreover, GPS-Net achieves state-of-the-art performance on three popular databases: VG, OI, and VRD, by significant gains under various settings and metrics. The code and models are available at \url{this https URL}.
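The node priority sensitive loss is described as a focal loss whose focusing parameter is adjusted by a mapping function of node priority. A sketch with a hypothetical linear priority-to-gamma mapping (the paper's actual mapping function may differ):

```python
import torch
import torch.nn.functional as F

def priority_focal_loss(logits, targets, priority, gamma0=2.0):
    """Focal loss with a per-node focusing parameter.
    logits: (N, C); targets: (N,) class indices; priority: (N,) in [0, 1].
    The mapping gamma = gamma0 * (1 - priority) is our illustrative choice."""
    gamma = gamma0 * (1.0 - priority)      # high-priority nodes focus less
    logp = F.log_softmax(logits, dim=1)
    logp_t = logp.gather(1, targets[:, None]).squeeze(1)
    p_t = logp_t.exp()
    return (-(1.0 - p_t).pow(gamma) * logp_t).mean()
```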

66. Realistic Face Reenactment via Self-Supervised Disentangling of Identity and Pose [PDF]
  Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, Yong Liu
Abstract: Recent works have shown how realistic talking face images can be obtained under the supervision of geometry guidance, e.g., facial landmarks or boundaries. To alleviate the demand for manual annotations, in this paper, we propose a novel self-supervised hybrid model (DAE-GAN) that learns how to reenact faces naturally given large amounts of unlabeled videos. Our approach combines two deforming autoencoders with the latest advances in conditional generation. On the one hand, we adopt the deforming autoencoder to disentangle identity and pose representations. A strong prior in talking face videos is that each frame can be encoded as two parts: one for the video-specific identity and the other for various poses. Inspired by this, we utilize a multi-frame deforming autoencoder to learn a pose-invariant embedded face for each video. Meanwhile, a multi-scale deforming autoencoder is proposed to extract pose-related information for each frame. On the other hand, the conditional generator allows for enhancing fine details and overall reality. It leverages the disentangled features to generate photo-realistic and pose-alike face images. We evaluate our model on the VoxCeleb1 and RaFD datasets. Experiment results demonstrate the superior quality of reenacted images and the flexibility of transferring facial movements between identities.

67. AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization [PDF]
  Yiming Li, Changhong Fu, Fangqiang Ding, Ziyuan Huang, Geng Lu
Abstract: Most existing trackers based on discriminative correlation filters (DCF) try to introduce a predefined regularization term to improve the learning of target objects, e.g., by suppressing background learning or by restricting the change rate of correlation filters. However, predefined parameters require much effort to tune, and they still fail to adapt to new situations that the designer did not think of. In this work, a novel approach is proposed to learn the spatio-temporal regularization term online, automatically and adaptively. Spatially local response map variation is introduced as spatial regularization to make the DCF focus on learning the trustworthy parts of the object, and global response map variation determines the updating rate of the filter. Extensive experiments on four UAV benchmarks have proven the superiority of our method compared to state-of-the-art CPU- and GPU-based trackers, with a speed of ~60 frames per second running on a single CPU. Our tracker is additionally proposed to be applied to UAV localization. Considerable tests in indoor practical scenarios have proven the effectiveness and versatility of our localization method. The code is available at this https URL.

68. Adaptive Object Detection with Dual Multi-Label Prediction [PDF]
  Zhen Zhao, Yuhong Guo, Haifeng Shen, Jieping Ye
Abstract: In this paper, we propose a novel end-to-end unsupervised deep domain adaptation model for adaptive object detection by exploiting multi-label object recognition as a dual auxiliary task. The model exploits multi-label prediction to reveal the object category information in each image and then uses the prediction results to perform conditional adversarial global feature alignment, such that the multi-modal structure of image features can be tackled to bridge the domain divergence at the global feature level while preserving the discriminability of the features. Moreover, we introduce a prediction consistency regularization mechanism to assist object detection, which uses the multi-label prediction results as an auxiliary regularization information to ensure consistent object category discoveries between the object recognition task and the object detection task. Experiments are conducted on a few benchmark datasets and the results show the proposed model outperforms the state-of-the-art comparison methods.

69. Co-occurrence Background Model with Superpixels for Robust Background Initialization [PDF]
  Wenjun Zhou, Yuheng Deng, Bo Peng, Dong Liang, Shun'ichi Kaneko
Abstract: Background initialization is an important step in many high-level applications of video processing, ranging from video surveillance to video inpainting. However, this process is often affected by practical challenges such as illumination changes, background motion, camera jitter and intermittent movement, etc. In this paper, we develop a co-occurrence background model with superpixel segmentation for robust background initialization. We first introduce a novel co-occurrence background modeling method called Co-occurrence Pixel-Block Pairs (CPB) to generate a reliable initial background model, and superpixel segmentation is utilized to further acquire the spatial texture information of foreground and background. Then, the initial background can be determined by combining the foreground extraction results with the superpixel segmentation information. Experimental results obtained from the dataset of the challenging benchmark (SBMnet) validate its performance under various challenges.

70. Superpixel Segmentation with Fully Convolutional Networks [PDF]
  Fengting Yang, Qian Sun, Hailin Jin, Zihan Zhou
Abstract: In computer vision, superpixels have been widely used as an effective way to reduce the number of image primitives for subsequent processing. But only a few attempts have been made to incorporate them into deep neural networks. One main reason is that the standard convolution operation is defined on regular grids and becomes inefficient when applied to superpixels. Inspired by an initialization strategy commonly adopted by traditional superpixel algorithms, we present a novel method that employs a simple fully convolutional network to predict superpixels on a regular image grid. Experimental results on benchmark datasets show that our method achieves state-of-the-art superpixel segmentation performance while running at about 50fps. Based on the predicted superpixels, we further develop a downsampling/upsampling scheme for deep networks with the goal of generating high-resolution outputs for dense prediction tasks. Specifically, we modify a popular network architecture for stereo matching to simultaneously predict superpixels and disparities. We show that improved disparity estimation accuracy can be obtained on public datasets.
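The described downsampling/upsampling scheme amounts to soft pooling with the predicted pixel-to-superpixel association. A sketch, assuming the association is given as a per-pixel softmax over K superpixels (shapes and names are ours):

```python
import torch

def superpixel_pool_unpool(feat, assoc):
    """Soft superpixel downsample/upsample.
    feat:  (N, C, H, W) dense features.
    assoc: (N, K, H, W) soft assignment of each pixel to K superpixels,
           already softmax-normalized over K."""
    n, c, h, w = feat.shape
    k = assoc.shape[1]
    f = feat.reshape(n, c, h * w)                 # (N, C, HW)
    q = assoc.reshape(n, k, h * w)                # (N, K, HW)
    pooled = torch.bmm(q, f.transpose(1, 2))      # (N, K, C) weighted sums
    pooled = pooled / (q.sum(dim=2, keepdim=True) + 1e-6)   # weighted means
    up = torch.bmm(q.transpose(1, 2), pooled)     # (N, HW, C) scatter back
    return pooled, up.transpose(1, 2).reshape(n, c, h, w)
```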

71. Refined Plane Segmentation for Cuboid-Shaped Objects by Leveraging Edge Detection [PDF]
  Alexander Naumann, Laura Dörr, Niels Ole Salscheider, Kai Furmans
Abstract: Recent advances in the area of plane segmentation from single RGB images show strong accuracy improvements and now allow a reliable segmentation of indoor scenes into planes. Nonetheless, fine-grained details of these segmentation masks are still lacking accuracy, thus restricting the usability of such techniques on a larger scale in numerous applications, such as inpainting for Augmented Reality use cases. We propose a post-processing algorithm to align the segmented plane masks with edges detected in the image. This allows us to increase the accuracy of state-of-the-art approaches, while limiting ourselves to cuboid-shaped objects. Our approach is motivated by logistics, where this assumption is valid and refined planes can be used to perform robust object detection without the need for supervised learning. Results for two baselines and our approach are reported on our own dataset, which we made publicly available. The results show a consistent improvement over the state-of-the-art. The influence of the prior segmentation and the edge detection is investigated and finally, areas for future research are proposed.

72. One-Shot Domain Adaptation For Face Generation [PDF]
  Chao Yang, Ser-Nam Lim
Abstract: In this paper, we propose a framework capable of generating face images that fall into the same distribution as that of a given one-shot example. We leverage a pre-trained StyleGAN model that has already learned the generic face distribution. Given the one-shot target, we develop an iterative optimization scheme that rapidly adapts the weights of the model to shift the output's high-level distribution to the target's. To generate images of the same distribution, we introduce a style-mixing technique that transfers the low-level statistics from the target to faces randomly generated with the model. With that, we are able to generate an unlimited number of faces that inherit from the distribution of both generic human faces and the one-shot example. The newly generated faces can serve as augmented training data for other downstream tasks. Such a setting is appealing as it requires labeling very few, or even one, examples in the target domain, which is often the case for real-world face manipulations that result from a variety of unknown and unique distributions, each with extremely low prevalence. We show the effectiveness of our one-shot approach for detecting face manipulations and compare it with other few-shot domain adaptation methods qualitatively and quantitatively.

73. Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning [PDF]
  Tianlong Chen, Sijia Liu, Shiyu Chang, Yu Cheng, Lisa Amini, Zhangyang Wang
Abstract: Pretrained models from self-supervision are widely used to fine-tune downstream tasks faster or for better accuracy. However, gaining robustness from pretraining has been left unexplored. We introduce adversarial training into self-supervision, to provide general-purpose robust pre-trained models for the first time. We find these robust pre-trained models can benefit subsequent fine-tuning in two ways: i) boosting final model robustness; ii) saving the computation cost, if proceeding towards adversarial fine-tuning. We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins (e.g., 3.83% on robust accuracy and 1.3% on standard accuracy, on the CIFAR-10 dataset), compared with the conventional end-to-end adversarial training baseline. Moreover, we find that different self-supervised pre-trained models have diverse adversarial vulnerability. This inspires us to ensemble several pretraining tasks, which boosts robustness further. Our ensemble strategy contributes to a further improvement of 3.59% on robust accuracy, while maintaining a slightly higher standard accuracy on CIFAR-10. Our codes are available at this https URL.
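The adversarial-training ingredient here is standard PGD. For reference, a generic L-infinity PGD inner loop (not the authors' exact attack budget or schedule):

```python
import torch

def pgd_attack(model, x, y, loss_fn, eps=8/255, alpha=2/255, steps=10):
    """Projected gradient ascent on the loss within the L-inf ball around x,
    with inputs kept in [0, 1]."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```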

74. Cross-domain Detection via Graph-induced Prototype Alignment [PDF]
  Minghao Xu, Hang Wang, Bingbing Ni, Qi Tian, Wenjun Zhang
Abstract: Applying the knowledge of an object detector trained on a specific domain directly to a new domain is risky, as the gap between the two domains can severely degrade the model's performance. Furthermore, since different instances commonly embody distinct modal information in the object detection scenario, the feature alignment of the source and target domains is hard to realize. To mitigate these problems, we propose a Graph-induced Prototype Alignment (GPA) framework to seek category-level domain alignment via elaborate prototype representations. In a nutshell, more precise instance-level features are obtained through graph-based information propagation among region proposals, and, on this basis, the prototype representation of each class is derived for category-level domain alignment. In addition, in order to alleviate the negative effect of class imbalance on domain adaptation, we design a Class-reweighted Contrastive Loss to harmonize the adaptation training process. Combined with Faster R-CNN, the proposed framework conducts feature alignment in a two-stage manner. Comprehensive results on various cross-domain detection tasks demonstrate that our approach outperforms existing methods by a remarkable margin. Our code is available at this https URL.
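At the core of category-level alignment are per-class prototypes computed from instance features. A simplified stand-in using plain mean pooling and an L2 alignment term (the paper instead aggregates via graph-based propagation and uses a class-reweighted contrastive loss):

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Per-class prototypes as means of instance-level features.
    features: (N, D); labels: (N,) class indices."""
    protos = []
    for c in range(num_classes):
        mask = labels == c
        protos.append(features[mask].mean(dim=0) if mask.any()
                      else torch.zeros_like(features[0]))
    return torch.stack(protos)                    # (num_classes, D)

def alignment_loss(src_protos, tgt_protos):
    """Pull same-class source/target prototypes together (plain L2)."""
    return ((src_protos - tgt_protos) ** 2).sum(dim=1).mean()
```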

75. CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Network [PDF]
  Qihang Yu, Yingwei Li, Jieru Mei, Yuyin Zhou, Alan L. Yuille
Abstract: 3D Convolution Neural Networks (CNNs) have been widely applied to 3D scene understanding, such as video analysis and volumetric image recognition. However, 3D networks can easily lead to over-parameterization which incurs expensive computation cost. In this paper, we propose Channel-wise Automatic KErnel Shrinking (CAKES), to enable efficient 3D learning by shrinking standard 3D convolutions into a set of economic operations (e.g., 1D, 2D convolutions). Unlike previous methods, our proposed CAKES performs channel-wise kernel shrinkage, which enjoys the following benefits: 1) encouraging operations deployed in every layer to be heterogeneous, so that they can extract diverse and complementary information to benefit the learning process; and 2) allowing for an efficient and flexible replacement design, which can be generalized to both spatial-temporal and volumetric data. Together with a neural architecture search framework, by applying CAKES to 3D C2FNAS and ResNet50, we achieve the state-of-the-art performance with much fewer parameters and computational costs on both 3D medical imaging segmentation and video action recognition.
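The shrinking idea is easy to picture: different channel groups of one 3D convolution are given different, cheaper kernel shapes. A sketch with a fixed, illustrative 50/25/25 split; the paper searches this configuration per layer rather than fixing it:

```python
import torch
import torch.nn as nn

class CakesBlock(nn.Module):
    """Channel-wise kernel shrinking, sketched: replace one full 3x3x3 3D
    convolution by cheaper per-group operations on channel slices."""
    def __init__(self, channels):
        super().__init__()
        c1, c2 = channels // 2, channels // 4
        c3 = channels - c1 - c2
        self.spatial = nn.Conv3d(c1, c1, (1, 3, 3), padding=(0, 1, 1))  # 2D-like
        self.temporal = nn.Conv3d(c2, c2, (3, 1, 1), padding=(1, 0, 0)) # 1D-like
        self.full = nn.Conv3d(c3, c3, 3, padding=1)                     # keep 3D
        self.splits = (c1, c2, c3)

    def forward(self, x):                          # x: (N, C, T, H, W)
        a, b, c = torch.split(x, self.splits, dim=1)
        return torch.cat([self.spatial(a), self.temporal(b), self.full(c)],
                         dim=1)
```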

76. Polarized Reflection Removal with Perfect Alignment in the Wild [PDF]
  Chenyang Lei, Xuhua Huang, Mengdi Zhang, Qiong Yan, Wenxiu Sun, Qifeng Chen
Abstract: We present a novel formulation for removing reflections from polarized images in the wild. We first identify the misalignment issues of existing reflection removal datasets, where the collected reflection-free images are not perfectly aligned with the input mixed images due to glass refraction. Then we build a new dataset with more than 100 types of glass in which the obtained transmission images are perfectly aligned with the input mixed images. Second, capitalizing on the special relationship between reflection and polarized light, we propose a polarized reflection removal model with a two-stage architecture. In addition, we design a novel perceptual NCC loss that can improve the performance of reflection removal and general image decomposition tasks. We conduct extensive experiments, and the results suggest that our model outperforms state-of-the-art methods on reflection removal.

77. CNN-based Density Estimation and Crowd Counting: A Survey [PDF]
  Guangshuai Gao, Junyu Gao, Qingjie Liu, Qi Wang, Yunhong Wang
Abstract: Accurately estimating the number of objects in a single image is a challenging yet meaningful task and has been applied in many applications such as urban planning and public safety. Among the various object counting tasks, crowd counting is particularly prominent due to its specific significance to social security and development. Fortunately, the techniques developed for crowd counting can be generalized to other related fields such as vehicle counting and environment surveys, if their specific characteristics are set aside. Therefore, many researchers are devoting themselves to crowd counting, and a large body of excellent work has emerged. These works have been helpful for the development of crowd counting. However, the question we should consider is why they are effective for this task. Limited by the cost of time and energy, we cannot analyze all the algorithms. In this paper, we have surveyed over 220 works to comprehensively and systematically study the crowd counting models, mainly CNN-based density map estimation methods. Finally, according to the evaluation metrics, we select the top three performers on their crowd counting datasets and analyze their merits and drawbacks. Through our analysis, we expect to make reasonable inferences and predictions for the future development of crowd counting, and meanwhile, it can also provide feasible solutions for the problem of object counting in other fields. We provide the density maps and prediction results of some mainstream algorithms on the validation set of the NWPU dataset for comparison and testing. Meanwhile, density map generation and evaluation tools are also provided. All the codes and evaluation results are made publicly available at this https URL.
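The CNN-based methods surveyed here regress a density map whose integral equals the object count. The standard way to build the ground-truth map, in its fixed-bandwidth variant (many works use geometry-adaptive kernels instead), is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    """Ground-truth density map for crowd counting: one unit impulse per
    annotated head, blurred by a Gaussian so the map integrates to the count.
    points: iterable of (x, y) = (col, row) head positions; shape: (H, W)."""
    dm = np.zeros(shape, dtype=np.float32)
    for x, y in points:
        if 0 <= int(y) < shape[0] and 0 <= int(x) < shape[1]:
            dm[int(y), int(x)] += 1.0
    return gaussian_filter(dm, sigma)     # sum(dm) ~= number of annotations
```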

78. Real-MFF Dataset: A Large Realistic Multi-focus Image Dataset with Ground Truth [PDF]
  Juncheng Zhang, Qingmin Liao, Shaojun Liu, Haoyu Ma, Wenming Yang, Jing-hao Xue
Abstract: Multi-focus image fusion, a technique to generate an all-in-focus image from two or more source images, can benefit many computer vision tasks. However, currently there is no large and realistic dataset to perform convincing evaluation and comparison of existing multi-focus image fusion methods. For deep learning methods, it is difficult to train a network without a suitable dataset. In this paper, we introduce a large and realistic multi-focus dataset containing 800 pairs of source images with the corresponding ground truth images. The dataset is generated using a light field camera; consequently, the source images as well as the ground truth images are realistic. Moreover, the dataset contains a variety of scenes, including buildings, plants, humans, shopping malls, squares and so on, to serve as a well-founded benchmark for multi-focus image fusion tasks. For illustration, we evaluate 10 typical multi-focus algorithms on this dataset.

79. Learning Invariant Representation for Unsupervised Image Restoration [PDF]
  Wenchao Du, Hu Chen, Hongyu Yang
Abstract: Recently, cross-domain transfer has been applied to unsupervised image restoration tasks. However, directly applying existing frameworks leads to domain-shift problems in translated images due to lack of effective supervision. Instead, we propose an unsupervised learning method that explicitly learns an invariant presentation from noisy data and reconstructs clear observations. To do so, we introduce discrete disentangling representation and adversarial domain adaptation into a general domain-transfer framework, aided by extra self-supervised modules including background and semantic consistency constraints, learning robust representations under dual domain constraints, such as feature and image domains. Experiments on synthetic and real noise removal tasks show the proposed method achieves performance comparable to other state-of-the-art supervised and unsupervised methods, while having faster and more stable convergence than other domain adaptation methods.

80. Trajectory Poisson multi-Bernoulli filters [PDF]
  Ángel F. García-Fernández, Lennart Svensson, Jason L. Williams, Yuxuan Xia, Karl Granström
Abstract: This paper presents two trajectory Poisson multi-Bernoulli (TPMB) filters for multi-target tracking: one to estimate the set of alive trajectories at each time step and another to estimate the set of all trajectories, which includes alive and dead trajectories, at each time step. The filters are based on propagating a Poisson multi-Bernoulli (PMB) density on the corresponding set of trajectories through the filtering recursion. After the update step, the posterior is a PMB mixture (PMBM) so, in order to obtain a PMB density, a Kullback-Leibler divergence minimisation on an augmented space is performed. The developed filters are computationally lighter alternatives to the trajectory PMBM filters, which provide the closed-form recursion for sets of trajectories with Poisson birth model, and are shown to outperform previous multi-target tracking algorithms.

81. Deep Fashion3D: A Dataset and Benchmark for 3D Garment Reconstruction from Single Images [PDF]
  Heming Zhu, Yu Cao, Hang Jin, Weikai Chen, Dong Du, Zhangye Wang, Shuguang Cui, Xiaoguang Han
Abstract: High-fidelity clothing reconstruction is the key to achieving photorealism in a wide range of applications including human digitization, virtual try-on, etc. Recent advances in learning-based approaches have accomplished unprecedented accuracy in recovering unclothed human shape and pose from single images, thanks to the availability of powerful statistical models, e.g. SMPL, learned from a large number of body scans. In contrast, modeling and recovering clothed human and 3D garments remains notoriously difficult, mostly due to the lack of large-scale clothing models available for the research community. We propose to fill this gap by introducing Deep Fashion3D, the largest collection to date of 3D garment models, with the goal of establishing a novel benchmark and dataset for the evaluation of image-based garment reconstruction systems. Deep Fashion3D contains 2078 models reconstructed from real garments, which covers 10 different categories and 563 garment instances. It provides rich annotations including 3D feature lines, 3D body pose and the corresponded multi-view real images. In addition, each garment is randomly posed to enhance the variety of real clothing deformations. To demonstrate the advantage of Deep Fashion3D, we propose a novel baseline approach for single-view garment reconstruction, which leverages the merits of both mesh and implicit representations. A novel adaptable template is proposed to enable the learning of all types of clothing in a single network. Extensive experiments have been conducted on the proposed dataset to verify its significance and usefulness. We will make Deep Fashion3D publicly available upon publication.

82. A Physics-based Noise Formation Model for Extreme Low-light Raw Denoising [PDF]
  Kaixuan Wei, Ying Fu, Jiaolong Yang, Hua Huang
Abstract: Lacking rich and realistic data, learned single image denoising algorithms generalize poorly to real raw images that do not resemble the data used for training. Although the problem can be alleviated by the heteroscedastic Gaussian model for noise synthesis, the noise sources caused by digital camera electronics are still largely overlooked, despite their significant effect on raw measurement, especially under extremely low-light condition. To address this issue, we present a highly accurate noise formation model based on the characteristics of CMOS photosensors, thereby enabling us to synthesize realistic samples that better match the physics of image formation process. Given the proposed noise model, we additionally propose a method to calibrate the noise parameters for available modern digital cameras, which is simple and reproducible for any new device. We systematically study the generalizability of a neural network trained with existing schemes, by introducing a new low-light denoising dataset that covers many modern digital cameras from diverse brands. Extensive empirical results collectively show that by utilizing our proposed noise formation model, a network can reach the capability as if it had been trained with rich real data, which demonstrates the effectiveness of our noise formation model.
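For context, the baseline the paper extends is the heteroscedastic Gaussian model, in which noise variance grows linearly with the signal. A toy synthesizer in that spirit, with placeholder parameters and one extra per-row term; the paper's full model calibrates several more sensor-specific components per camera:

```python
import numpy as np

def synthesize_raw_noise(clean, k_shot=0.01, sigma_read=0.002,
                         sigma_row=0.001, rng=np.random.default_rng()):
    """Sample noisy raw data under a simple shot/read noise model.
    clean: (H, W) linear raw intensities in [0, 1]. Shot noise is the usual
    Gaussian approximation with variance proportional to the signal; the
    parameter values here are placeholders, not calibrated ones."""
    shot = rng.normal(0.0, np.sqrt(np.maximum(k_shot * clean, 1e-12)))
    read = rng.normal(0.0, sigma_read, clean.shape)
    row = rng.normal(0.0, sigma_row, (clean.shape[0], 1))  # row banding
    return clean + shot + read + row
```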

83. BiLingUNet: Image Segmentation by Modulating Top-Down and Bottom-Up Visual Processing with Referring Expressions [PDF]
  Ozan Arkan Can, İlker Kesen, Deniz Yuret
Abstract: We present BiLingUNet, a state-of-the-art model for image segmentation using referring expressions. BiLingUNet uses language to customize visual filters and outperforms approaches that concatenate a linguistic representation to the visual input. We find that using language to modulate both bottom-up and top-down visual processing works better than just making the top-down processing language-conditional. We argue that common 1x1 language-conditional filters cannot represent relational concepts and experimentally demonstrate that wider filters work better. Our model achieves state-of-the-art performance on four referring expression datasets.
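Language-conditional filtering can be sketched as convolution kernels generated from the sentence embedding. Since the paper argues that 1x1 language-conditional filters cannot represent relational concepts, the sketch below generates wider depthwise kernels; the kernel size, names, and depthwise choice are our illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageConditionalConv(nn.Module):
    """Filter visual features with k x k kernels predicted from language."""
    def __init__(self, channels, lang_dim, k=3):
        super().__init__()
        self.channels, self.k = channels, k
        # One depthwise k x k kernel per channel, predicted per sentence.
        self.to_kernels = nn.Linear(lang_dim, channels * k * k)

    def forward(self, feat, lang):             # feat: (N,C,H,W), lang: (N,D)
        n, c, h, w = feat.shape
        kernels = self.to_kernels(lang).view(n * c, 1, self.k, self.k)
        # Grouped conv trick: fold the batch into channels so each sample
        # is convolved with its own language-generated kernels.
        out = F.conv2d(feat.reshape(1, n * c, h, w), kernels,
                       padding=self.k // 2, groups=n * c)
        return out.view(n, c, h, w)
```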

84. Actor-Transformers for Group Activity Recognition [PDF]
  Kirill Gavrilyuk, Ryan Sanford, Mehrsan Javan, Cees G. M. Snoek
Abstract: This paper strives to recognize individual actions and group activities from videos. While existing solutions for this challenging problem explicitly model spatial and temporal relationships based on location of individual actors, we propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition. We feed the transformer with rich actor-specific static and dynamic representations expressed by features from a 2D pose network and 3D CNN, respectively. We empirically study different ways to combine these representations and show their complementary benefits. Experiments show what is important to transform and how it should be transformed. What is more, actor-transformers achieve state-of-the-art results on two publicly available benchmarks for group activity recognition, outperforming the previous best published results by a considerable margin.

85. Exploit Clues from Views: Self-Supervised and Regularized Learning for Multiview Object Recognition [PDF]
  Chih-Hui Ho, Bo Liu, Tz-Ying Wu, Nuno Vasconcelos
Abstract: Multiview recognition has been well studied in the literature and achieves decent performance in object recognition and retrieval tasks. However, most previous works rely on supervised learning and some impractical underlying assumptions, such as the availability of all views at training and inference time. In this work, the problem of multiview self-supervised learning (MV-SSL) is investigated, where only image-to-object association is given. Given this setup, a novel surrogate task for self-supervised learning is proposed by pursuing an "object invariant" representation. This is solved by randomly selecting an image feature of an object as the object prototype, accompanied with multiview consistency regularization, which results in view invariant stochastic prototype embedding (VISPE). Experiments show that the recognition and retrieval results using VISPE outperform those of other self-supervised learning methods on seen and unseen data. VISPE can also be applied to the semi-supervised scenario and demonstrates robust performance with limited data available. Code is available at this https URL

86. NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing [PDF]
  Xin Huang, Zheng Ge, Zequn Jie, Osamu Yoshie
Abstract: Although significant progress has been made in pedestrian detection recently, pedestrian detection in crowded scenes is still challenging. The heavy occlusion between pedestrians imposes great challenges on the standard Non-Maximum Suppression (NMS). A relatively low threshold of intersection over union (IoU) leads to missing highly overlapped pedestrians, while a higher one brings in plenty of false positives. To avoid such a dilemma, this paper proposes a novel Representative Region NMS approach leveraging the less occluded visible parts, effectively removing the redundant boxes without bringing in many false positives. To acquire the visible parts, a novel Paired-Box Model (PBM) is proposed to simultaneously predict the full and visible boxes of a pedestrian. The full and visible boxes constitute a pair serving as the sample unit of the model, thus guaranteeing a strong correspondence between the two boxes throughout the detection pipeline. Moreover, convenient feature integration of the two boxes allows for better performance on both full and visible pedestrian detection tasks. Experiments on the challenging CrowdHuman and CityPersons benchmarks sufficiently validate the effectiveness of the proposed approach for pedestrian detection in crowded situations.
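Given the paired full/visible boxes, the Representative Region NMS idea reduces to a greedy NMS loop keyed on the IoU of the visible boxes while the paired full boxes are kept. A sketch (our own, standard greedy formulation):

```python
import numpy as np

def iou_one_to_many(a, b):
    """IoU of box a (4,) against boxes b (M, 4), boxes as (x1, y1, x2, y2)."""
    lt = np.maximum(a[:2], b[:, :2])
    rb = np.minimum(a[2:], b[:, 2:])
    inter = np.clip(rb - lt, 0, None).prod(axis=1)
    area_a = np.clip(a[2:] - a[:2], 0, None).prod()
    area_b = np.clip(b[:, 2:] - b[:, :2], 0, None).prod(axis=1)
    return inter / (area_a + area_b - inter + 1e-9)

def r2_nms(full, visible, scores, thresh=0.5):
    """Greedy NMS that suppresses by IoU of the *visible* regions while
    returning the paired full-body boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i, order = order[0], order[1:]
        keep.append(i)
        order = order[iou_one_to_many(visible[i], visible[order]) <= thresh]
    return full[keep], scores[keep]
```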

87. Semantically Multi-modal Image Synthesis [PDF]
  Zhen Zhu, Zhiliang Xu, Ansheng You, Xiang Bai
Abstract: In this paper, we focus on semantically multi-modal image synthesis (SMIS) task, namely, generating multi-modal images at the semantic level. Previous work seeks to use multiple class-specific generators, constraining its usage in datasets with a small number of classes. We instead propose a novel Group Decreasing Network (GroupDNet) that leverages group convolutions in the generator and progressively decreases the group numbers of the convolutions in the decoder. Consequently, GroupDNet is armed with much more controllability on translating semantic labels to natural images and has plausible high-quality yields for datasets with many classes. Experiments on several challenging datasets demonstrate the superiority of GroupDNet on performing the SMIS task. We also show that GroupDNet is capable of performing a wide range of interesting synthesis applications. Codes and models are available at: this https URL.

88. Using the Split Bregman Algorithm to Solve the Self-Repelling Snake Model [PDF]
  Huizhu Pan, Jintao Song, Wanquan Liu, Ling Li, Guanglu Zhou, Lu Tan, Shichu Chen
Abstract: Preserving the contour topology during image segmentation is useful in many practical scenarios. By keeping the contours isomorphic, it is possible to prevent over-segmentation and under-segmentation, as well as to adhere to given topologies. The self-repelling snake model (SR) is a variational model that preserves contour topology by combining a non-local repulsion term with the geodesic active contour model (GAC). The SR is traditionally solved using the additive operator splitting (AOS) scheme. Although this solution is stable, the memory requirement grows quickly as the image size increases. In our paper, we propose an alternative solution to the SR using the Split Bregman method. Our algorithm breaks the problem down into simpler subproblems to use lower-order evolution equations and approximation schemes. The memory usage is significantly reduced as a result. Experiments show comparable performance to the original algorithm with shorter iteration times.
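For readers unfamiliar with the scheme: writing the problem in the generic constrained form min_{u,d} |d|_1 + H(u) subject to d = Phi(u), the Split Bregman iteration alternates three inexpensive updates (this is the standard Goldstein-Osher form; the SR-specific H and Phi are as defined in the paper):

```latex
u^{k+1} = \arg\min_u \; H(u) + \frac{\mu}{2}\left\| d^{k} - \Phi(u) - b^{k} \right\|_2^2, \\
d^{k+1} = \operatorname{shrink}\!\left( \Phi(u^{k+1}) + b^{k},\ \tfrac{1}{\mu} \right), \\
b^{k+1} = b^{k} + \Phi(u^{k+1}) - d^{k+1}.
```

Here shrink(z, t) = sign(z) max(|z| - t, 0) is applied elementwise.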

89. Inferring Semantic Information with 3D Neural Scene Representations [PDF]
  Amit Kohli, Vincent Sitzmann, Gordon Wetzstein
Abstract: Biological vision infers multi-modal 3D representations that support reasoning about scene properties such as materials, appearance, affordance, and semantics in 3D. These rich representations enable us humans, for example, to acquire new skills, such as the learning of a new semantic class, with extremely limited supervision. Motivated by this ability of biological vision, we demonstrate that 3D-structure-aware representation learning leads to multi-modal representations that enable 3D semantic segmentation with extremely limited, 2D-only supervision. Building on emerging neural scene representations, which have been developed for modeling the shape and appearance of 3D scenes supervised exclusively by posed 2D images, we are first to demonstrate a representation that jointly encodes shape, appearance, and semantics in a 3D-structure-aware manner. Surprisingly, we find that only a few tens of labeled 2D segmentation masks are required to achieve dense 3D semantic segmentation using a semi-supervised learning strategy. We explore two novel applications for our semantically aware neural scene representation: 3D novel view and semantic label synthesis given only a single input RGB image or 2D label mask, as well as 3D interpolation of appearance and semantics.

90. Deep CG2Real: Synthetic-to-Real Translation via Image Disentanglement [PDF]
  Sai Bi, Kalyan Sunkavalli, Federico Perazzi, Eli Shechtman, Vladimir Kim, Ravi Ramamoorthi
Abstract: We present a method to improve the visual realism of low-quality, synthetic images, e.g. OpenGL renderings. Training an unpaired synthetic-to-real translation network in image space is severely under-constrained and produces visible artifacts. Instead, we propose a semi-supervised approach that operates on the disentangled shading and albedo layers of the image. Our two-stage pipeline first learns to predict accurate shading in a supervised fashion using physically-based renderings as targets, and further increases the realism of the textures and shading with an improved CycleGAN network. Extensive evaluations on the SUNCG indoor scene dataset demonstrate that our approach yields more realistic images compared to other state-of-the-art approaches. Furthermore, networks trained on our generated "real" images predict more accurate depth and normals than domain adaptation approaches, suggesting that improving the visual realism of the images can be more effective than imposing task-specific losses.

91. Designing Color Filters that Make Cameras More Colorimetric [PDF]
  Graham D. Finlayson, Yuteng Zhu
Abstract: When we place a colored filter in front of a camera, the effective camera response functions are equal to the given camera spectral sensitivities multiplied by the filter spectral transmittance. In this paper, we solve for the filter which returns modified sensitivities as close as possible to a linear transformation of the color matching functions of the human visual system. When this linearity condition - sometimes called the Luther condition - is approximately met, the 'camera+filter' system can be used for accurate color measurement. Then, we reformulate our filter design optimisation to make the sensor responses as close to the CIE XYZ tristimulus values as possible, given knowledge of real measured surface and illuminant spectra data. This data-driven method in turn is extended to incorporate constraints on the filter (smoothness and bounded transmission). Also, because how the optimisation is initialised is shown to impact the performance of the solved-for filters, a multi-initialisation optimisation is developed. Experiments demonstrate that, by taking pictures through our optimised color filters, we can make cameras significantly more colorimetric.
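The core optimisation can be prototyped as alternating least squares: fix the filter and solve for the best 3x3 linear map, then fix the map and solve per wavelength for the filter, clipping to a physical transmittance range. A NumPy sketch under these assumptions (not the authors' exact solver or constraint set):

```python
import numpy as np

def design_filter(C, X, iters=100):
    """Alternating least squares for a Luther-condition filter: find a
    transmittance f and 3x3 map M minimising || diag(f) C M - X ||_F,
    with f clipped to the physical range [0, 1].
    C: (n, 3) camera sensitivities sampled at n wavelengths.
    X: (n, 3) CIE colour matching functions on the same sampling."""
    f = np.ones(C.shape[0])
    for _ in range(iters):
        A = C * f[:, None]                         # diag(f) @ C
        M, *_ = np.linalg.lstsq(A, X, rcond=None)  # best 3x3 map given f
        B = C @ M                                  # row-wise: f[i]*B[i] ~ X[i]
        f = (B * X).sum(axis=1) / ((B * B).sum(axis=1) + 1e-12)
        f = np.clip(f, 0.0, 1.0)                   # bounded transmittance
    return f, M
```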

92. Deep 3D Capture: Geometry and Reflectance from Sparse Multi-View Images [PDF]
  Sai Bi, Zexiang Xu, Kalyan Sunkavalli, David Kriegman, Ravi Ramamoorthi
Abstract: We introduce a novel learning-based method to reconstruct the high-quality geometry and complex, spatially-varying BRDF of an arbitrary object from a sparse set of only six images captured by wide-baseline cameras under collocated point lighting. We first estimate per-view depth maps using a deep multi-view stereo network; these depth maps are used to coarsely align the different views. We propose a novel multi-view reflectance estimation network architecture that is trained to pool features from these coarsely aligned images and predict per-view spatially-varying diffuse albedo, surface normals, specular roughness and specular albedo. We do this by jointly optimizing the latent space of our multi-view reflectance network to minimize the photometric error between images rendered with our predictions and the input images. While previous state-of-the-art methods fail on such sparse acquisition setups, we demonstrate, via extensive experiments on synthetic and real data, that our method produces high-quality reconstructions that can be used to render photorealistic images.

93. Self-Supervised Learning for Domain Adaptation on Point-Clouds [PDF]
  Idan Achituve, Haggai Maron, Gal Chechik
Abstract: Self-supervised learning (SSL) makes it possible to learn useful representations from unlabeled data and has been applied effectively for domain adaptation (DA) on images. It is still unknown if and how it can be leveraged for domain adaptation in 3D perception. Here we describe the first study of SSL for DA on point clouds. We introduce a new pretext task, Region Reconstruction, motivated by the deformations encountered in sim-to-real transformation. We also demonstrate how it can be combined with a training procedure motivated by the MixUp method. Evaluations on six domain adaptations across synthetic and real furniture data demonstrate large improvement over previous work.

94. Combining Visible and Infrared Spectrum Imagery using Machine Learning for Small Unmanned Aerial System Detection [PDF]
  Vinicius G. Goecks, Grayson Woods, Niladri Das, John Valasek
Abstract: Advances in machine learning and deep neural networks for object detection, coupled with lower cost and power requirements of cameras, led to promising vision-based solutions for sUAS detection. However, solely relying on the visible spectrum has previously led to reliability issues in low contrast scenarios such as sUAS flying below the treeline and against bright sources of light. Alternatively, due to the relatively high heat signatures emitted from sUAS during flight, a long-wave infrared (LWIR) sensor is able to produce images that clearly contrast the sUAS from its background. However, compared to widely available visible spectrum sensors, LWIR sensors have lower resolution and may produce more false positives when exposed to birds or other heat sources. This research work proposes combining the advantages of the LWIR and visible spectrum sensors using machine learning for vision-based detection of sUAS. Utilizing the heightened background contrast from the LWIR sensor, combined and synchronized with the relatively higher resolution of the visible spectrum sensor, a deep learning model was trained to detect the sUAS through previously difficult environments. More specifically, the approach demonstrated effective detection of multiple sUAS flying above and below the treeline, in the presence of heat sources, and in glare from the sun. Our approach achieved a detection rate of 71.2 +- 8.3%, improving by 69% over LWIR alone and by 30.4% over visible spectrum alone, and achieved a false alarm rate of 2.7 +- 2.6%, a decrease of 74.1% and 47.1% compared to LWIR and visible spectrum alone, respectively, averaged over single and multiple drone scenarios and controlled for the same machine learning object detector confidence threshold of at least 50%. Videos of the solution's performance can be seen at this https URL.

95. Detection and Description of Change in Visual Streams [PDF]
  Davis Gilton, Ruotian Luo, Rebecca Willett, Greg Shakhnarovich
Abstract: This paper presents a framework for the analysis of changes in visual streams: ordered sequences of images, possibly separated by significant time gaps. We propose a new approach to incorporating unlabeled data into training to generate natural language descriptions of change. We also develop a framework for estimating the time of change in a visual stream. We use learned representations for change evidence and consistency of perceived change, and combine these in a regularized graph-cut-based change detector. Experimental evaluation on visual stream datasets, which we release as part of our contribution, shows that representation learning driven by natural language descriptions significantly improves change detection accuracy, compared to methods that do not rely on language.

96. MCFlow: Monte Carlo Flow Models for Data Imputation [PDF]
  Trevor W. Richardson, Wencheng Wu, Lei Lin, Beilei Xu, Edgar A. Bernal
Abstract: We consider the topic of data imputation, a foundational task in machine learning that addresses issues with missing data. To that end, we propose MCFlow, a deep framework for imputation that leverages normalizing flow generative models and Monte Carlo sampling. We address the causality dilemma that arises when training models with incomplete data by introducing an iterative learning scheme which alternately updates the density estimate and the values of the missing entries in the training data. We provide extensive empirical validation of the effectiveness of the proposed method on standard multivariate and image datasets, and benchmark its performance against state-of-the-art alternatives. We demonstrate that MCFlow is superior to competing methods in terms of the quality of the imputed data, as well as with regards to its ability to preserve the semantic structure of the data.
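
Writing out a full normalizing flow is beyond an abstract-level sketch, so the toy below substitutes a multivariate Gaussian for the flow in order to show only the alternating scheme described above: refit the density estimate, then update the missing entries (here by the Gaussian conditional mean rather than Monte Carlo sampling). Everything beyond that alternation is an assumption, not the paper's model.

```python
import numpy as np

def iterative_impute(X, mask, iters=10):
    """Alternate between fitting a density model and refilling missing values.
    X: (n, d) data; mask: (n, d) boolean, True where a value is observed."""
    Xi = np.where(mask, X, np.nanmean(X, axis=0))   # initial fill: column means
    for _ in range(iters):
        mu = Xi.mean(axis=0)                        # refit the stand-in density
        cov = np.cov(Xi, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for i in range(len(Xi)):                    # conditional-mean update of
            m, o = ~mask[i], mask[i]                # each row's missing entries
            if m.any() and o.any():
                K = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
                Xi[i, m] = mu[m] + K @ (Xi[i, o] - mu[o])
    return Xi
```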

97. On the Evaluation of Prohibited Item Classification and Detection in Volumetric 3D Computed Tomography Baggage Security Screening Imagery [PDF]
  Qian Wang, Neelanjan Bhowmik, Toby P. Breckon
Abstract: X-ray Computed Tomography (CT) based 3D imaging is widely used in airports for aviation security screening whilst prior work on prohibited item detection focuses primarily on 2D X-ray imagery. In this paper, we aim to evaluate the possibility of extending the automatic prohibited item detection from 2D X-ray imagery to volumetric 3D CT baggage security screening imagery. To these ends, we take advantage of 3D Convolutional Neural Networks (CNNs) and popular object detection frameworks such as RetinaNet and Faster R-CNN in our work. As the first attempt to use 3D CNNs for volumetric 3D CT baggage security screening, we first evaluate different CNN architectures on the classification of isolated prohibited item volumes and compare against traditional methods which use hand-crafted features. Subsequently, we evaluate the object detection performance of different architectures on volumetric 3D CT baggage images. The results of our experiments on Bottle and Handgun datasets demonstrate that 3D CNN models can achieve comparable performance (98% true positive rate and 1.5% false positive rate) to traditional methods but require significantly less time for inference (0.014s per volume). Furthermore, the extended 3D object detection models achieve promising performance in detecting prohibited items within volumetric 3D CT baggage imagery, with 76% mAP for bottles and 88% mAP for handguns, which shows both the challenge and promise of such threat detection within 3D CT X-ray security imagery.

98. SceneCAD: Predicting Object Alignments and Layouts in RGB-D Scans [PDF]
  Armen Avetisyan, Tatiana Khanova, Christopher Choy, Denver Dash, Angela Dai, Matthias Nießner
Abstract: We present a novel approach to reconstructing lightweight, CAD-based representations of scanned 3D environments from commodity RGB-D sensors. Our key idea is to jointly optimize for both CAD model alignments as well as layout estimations of the scanned scene, explicitly modeling inter-relationships between objects-to-objects and objects-to-layout. Since object arrangement and scene layout are intrinsically coupled, we show that treating the problem jointly significantly helps to produce globally-consistent representations of a scene. Object CAD models are aligned to the scene by establishing dense correspondences between geometry, and we introduce a hierarchical layout prediction approach to estimate layout planes from corners and edges of the scene. To this end, we propose a message-passing graph neural network to model the inter-relationships between objects and layout, guiding generation of a globally consistent object alignment in a scene. By considering the global scene layout, we achieve significantly improved CAD alignments compared to state-of-the-art methods, improving from 41.83% to 58.41% alignment accuracy on SUNCG and from 50.05% to 61.24% on ScanNet, respectively. The resulting CAD-based representations make our method well-suited for applications in content creation such as augmented or virtual reality.

99. Acceleration of Convolutional Neural Network Using FFT-Based Split Convolutions [PDF]
  Kamran Chitsaz, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi, Shahram Shirani
Abstract: Convolutional neural networks (CNNs) have a large number of variables and hence suffer from a complexity problem for their implementation. Different methods and techniques have been developed to alleviate the problem of CNN complexity, such as quantization, pruning, etc. Among the different simplification methods, computation in the Fourier domain is regarded as a new paradigm for the acceleration of CNNs. Recent studies on Fast Fourier Transform (FFT) based CNNs aim at simplifying the computations required for the FFT. However, there is a lot of room for reducing the computational complexity of the FFT. In this paper, a new method for CNN processing in the FFT domain is proposed, which is based on input splitting. There are problems in the computation of the FFT using small kernels, as in CNNs. Splitting can be considered an effective solution for such issues caused by small kernels. With splitting, redundancy in schemes such as overlap-and-add is reduced and efficiency is increased. Hardware implementation of the proposed FFT method, as well as different analyses of the complexity, are performed to demonstrate the performance of the proposed method.
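
For reference, the classic overlap-and-add scheme the abstract alludes to splits the input into blocks, convolves each block in the Fourier domain, and sums the overlapping tails. The sketch below is that textbook baseline in 1D (with an assumed block size), not the paper's modified splitting.

```python
import numpy as np

def overlap_add_conv(x, k, block=128):
    """Overlap-and-add FFT convolution: process x in blocks and add the
    overlapping tails of the per-block linear convolutions."""
    n = block + len(k) - 1                     # linear convolution length per block
    K = np.fft.rfft(k, n)                      # kernel spectrum, computed once
    y = np.zeros(len(x) + len(k) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        tail = np.fft.irfft(np.fft.rfft(seg, n) * K, n)
        end = min(start + n, len(y))
        y[start:end] += tail[:end - start]     # add the overlapping region
    return y
```

For 1D float inputs this matches `np.convolve(x, k)` up to floating-point error, which is what makes the splitting exact rather than approximate.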

100. Source Printer Identification from Document Images Acquired using Smartphone [PDF]
  Sharad Joshi, Suraj Saxena, Nitin Khanna
Abstract: Vast volumes of printed documents continue to be used for various important as well as trivial applications. Such applications often rely on the information provided in the form of printed text documents whose integrity verification poses a challenge due to time constraints and lack of resources. Source printer identification provides essential information about the origin and integrity of a printed document in a fast and cost-effective manner. Even when fraudulent documents are identified, information about their origin can help stop future frauds. If a smartphone camera replaces scanner for the document acquisition process, document forensics would be more economical, user-friendly, and even faster in many applications where remote and distributed analysis is beneficial. Building on existing methods, we propose to learn a single CNN model from the fusion of letter images and their printer-specific noise residuals. In the absence of any publicly available dataset, we created a new dataset consisting of 2250 document images of text documents printed by eighteen printers and acquired by a smartphone camera at five acquisition settings. The proposed method achieves 98.42% document classification accuracy using images of letter 'e' under a 5x2 cross-validation approach. Further, when tested using about half a million letters of all types, it achieves 90.33% and 98.01% letter and document classification accuracies, respectively, thus highlighting the ability to learn a discriminative model without dependence on a single letter type. Also, classification accuracies are encouraging under various acquisition settings, including low illumination and change in angle between the document and camera planes.

101. Weakly-supervised land classification for coastal zone based on deep convolutional neural networks by incorporating dual-polarimetric characteristics into training dataset [PDF]
  Sheng Sun, Armando Marino, Wenze Shui, Zhongwen Hu
Abstract: In this work we explore the performance of DCNNs on semantic segmentation using spaceborne polarimetric synthetic aperture radar (PolSAR) datasets. The semantic segmentation task using PolSAR data can be categorized as weakly supervised learning when the characteristics of SAR data and data annotating procedures are factored in. Datasets are initially analyzed for selecting feasible pre-training images. Then the differences between spaceborne and airborne datasets are examined in terms of spatial resolution and viewing geometry. In this study we used two dual-polarimetric images acquired by TerraSAR-X DLR. A novel method to produce a training dataset with more supervised information is developed. Specifically, a series of typical classified images as well as intensity images serve as training datasets. A field survey is conducted for an area of about 20 square kilometers to obtain a ground truth dataset used for accuracy evaluation. Several transfer learning strategies are devised for the aforementioned training datasets, which are combined in a practicable order. Three DCNN models, including SegNet, U-Net, and LinkNet, are implemented next.

102. How Not to Give a FLOP: Combining Regularization and Pruning for Efficient Inference [PDF]
  Tai Vu, Emily Wen, Roy Nehoran
Abstract: The challenge of speeding up deep learning models during the deployment phase has been a large, expensive bottleneck in the modern tech industry. In this paper, we examine the use of both regularization and pruning for reduced computational complexity and more efficient inference in Deep Neural Networks (DNNs). In particular, we apply mixup and cutout regularizations and soft filter pruning to the ResNet architecture, focusing on minimizing floating point operations (FLOPs). Furthermore, by using regularization in conjunction with network pruning, we show that such a combination makes a substantial improvement over each of the two techniques individually.
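
Mixup, one of the regularizers named above, is compact enough to state directly. The sketch below is the standard formulation (convex combinations of inputs and one-hot labels with a Beta-distributed coefficient); the paper's exact hyperparameters are not assumed.

```python
import numpy as np

def mixup(x, y, alpha=1.0):
    """Standard mixup: blend random pairs of examples and their labels.
    x: (batch, ...) inputs; y: (batch, num_classes) one-hot labels."""
    lam = np.random.beta(alpha, alpha)       # blending coefficient
    idx = np.random.permutation(len(x))      # random pairing within the batch
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]
```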

103. BVI-DVC: A Training Database for Deep Video Compression [PDF]
  Di Ma, Fan Zhang, David R. Bull
Abstract: Deep learning methods are increasingly being applied in the optimisation of video compression algorithms and can achieve significantly enhanced coding gains, compared to conventional approaches. Such approaches often employ Convolutional Neural Networks (CNNs) which are trained on databases with relatively limited content coverage. In this paper, a new extensive and representative video database, BVI-DVC, is presented for training CNN-based coding tools. BVI-DVC contains 800 sequences at various spatial resolutions from 270p to 2160p and has been evaluated on ten existing network architectures for four different coding tools. Experimental results show that the database produces significant improvements in terms of coding gains over three existing (commonly used) image/video training databases, for all tested CNN architectures under the same training and evaluation configurations.

104. OCmst: One-class Novelty Detection using Convolutional Neural Network and Minimum Spanning Trees [PDF]
  Riccardo La Grassa, Ignazio Gallo, Nicola Landro
Abstract: We present a novel model called One Class Minimum Spanning Tree (OCmst) for the novelty detection problem; it uses a Convolutional Neural Network (CNN) as a deep feature extractor together with a graph-based model built on the Minimum Spanning Tree (MST). In a novelty detection scenario, the training data is not polluted by outliers (abnormal class) and the goal is to recognize whether a test instance belongs to the normal class or to the abnormal class. Our approach uses the deep features from the CNN to feed a pair of MSTs built starting from each test instance. To cut down the computational time, we use a parameter $\gamma$ to limit the size of the MSTs grown from the neighbours of the test instance. To prove the effectiveness of the proposed approach we conducted experiments on two publicly available datasets, well known in the literature, and we achieved state-of-the-art results on the CIFAR10 dataset.
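
With standard scientific-Python tools, the graph side of such a model is easy to prototype. The snippet below builds an MST over the $\gamma$ nearest training features to a test instance; this is one plausible reading of the neighbourhood construction, with the CNN feature extractor and the final decision rule omitted.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_from_neighbours(train_feats, test_feat, gamma=20):
    """MST over the gamma nearest training features to one test instance.
    train_feats: (n, d) deep features; test_feat: (d,) query feature."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nn = np.argsort(dists)[:gamma]               # gamma nearest neighbours
    D = squareform(pdist(train_feats[nn]))       # pairwise distances among them
    return nn, minimum_spanning_tree(D)          # indices and sparse MST edges
```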

105. Diagnosis of Breast Cancer using Hybrid Transfer Learning [PDF]
  Subrato Bharati, Prajoy Podder
Abstract: Breast cancer is a common cancer among women. Early detection of breast cancer can considerably increase the survival rate of women. This paper mainly focuses on a transfer learning process to detect breast cancer. A modified VGG (MVGG), a residual network, and a mobile network are proposed and implemented in this paper. The DDSM dataset is used. Experimental results show that our proposed hybrid transfer learning model (a fusion of MVGG16 and ImageNet) provides an accuracy of 88.3%, where the number of epochs is 15. On the other hand, the modified VGG 16 architecture alone (MVGG 16) provides an accuracy of 80.8% and MobileNet provides an accuracy of 77.2%. The proposed hybrid pre-trained network therefore clearly outperforms the single architectures. This architecture can be considered an effective tool for radiologists to reduce the false negative and false positive rates, thereby improving the efficiency of mammography analysis.

106. Interval Neural Networks as Instability Detectors for Image Reconstructions [PDF]
  Jan Macdonald, Maximilian März, Luis Oala, Wojciech Samek
Abstract: This work investigates the detection of instabilities that may occur when utilizing deep learning models for image reconstruction tasks. Although neural networks often empirically outperform traditional reconstruction methods, their usage for sensitive medical applications remains controversial. Indeed, in a recent series of works, it has been demonstrated that deep learning approaches are susceptible to various types of instabilities, caused for instance by adversarial noise or out-of-distribution features. It is argued that this phenomenon can be observed regardless of the underlying architecture and that there is no easy remedy. Based on this insight, the present work demonstrates on two use cases how uncertainty quantification methods can be employed as instability detectors. In particular, it is shown that the recently proposed Interval Neural Networks are highly effective in revealing instabilities of reconstructions. Such an ability is crucial to ensure a safe use of deep learning-based methods for medical image reconstruction.

107. PointGMM: a Neural GMM Network for Point Clouds [PDF]
  Amir Hertz, Rana Hanocka, Raja Giryes, Daniel Cohen-Or
Abstract: Point clouds are a popular representation for 3D shapes. However, they encode a particular sampling without accounting for shape priors or non-local information. We advocate for the use of a hierarchical Gaussian mixture model (hGMM), which is a compact, adaptive and lightweight representation that probabilistically defines the underlying 3D surface. We present PointGMM, a neural network that learns to generate hGMMs which are characteristic of the shape class, and also coincide with the input point cloud. PointGMM is trained over a collection of shapes to learn a class-specific prior. The hierarchical representation has two main advantages: (i) coarse-to-fine learning, which avoids converging to poor local-minima; and (ii) (an unsupervised) consistent partitioning of the input shape. We show that as a generative model, PointGMM learns a meaningful latent space which enables generating consistent interpolations between existing shapes, as well as synthesizing novel shapes. We also present a novel framework for rigid registration using PointGMM, that learns to disentangle orientation from structure of an input shape.

108. Optimizing Geometry Compression using Quantum Annealing [PDF]
  Sebastian Feld, Markus Friedrich, Claudia Linnhoff-Popien
Abstract: The compression of geometry data is an important aspect of bandwidth-efficient data transfer for distributed 3D computer vision applications. We propose a quantum-enabled lossy 3D point cloud compression pipeline based on the constructive solid geometry (CSG) model representation. Key parts of the pipeline are mapped to NP-complete problems for which an efficient Ising formulation suitable for execution on a Quantum Annealer exists. We describe existing Ising formulations for the maximum clique search problem and the smallest exact cover problem, both of which are important building blocks of the proposed compression pipeline. Additionally, we discuss the properties of the overall pipeline regarding result optimality and the described Ising formulations.
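
As a concrete example of such a mapping, the well-known QUBO formulation of maximum clique (following Lucas, "Ising formulations of many NP problems", 2014) rewards selected vertices and penalizes selecting any pair of vertices not joined by an edge. The sketch below builds that QUBO in the dictionary form accepted by common annealing samplers; the penalty weights A and B are illustrative, with only B > A required.

```python
import itertools
import networkx as nx

def max_clique_qubo(G, A=1.0, B=2.0):
    """QUBO whose minima select maximum cliques of G when B > A.
    Returns {(u, v): weight} over binary variables x_v in {0, 1}."""
    Q = {(v, v): -A for v in G.nodes}        # reward for including each vertex
    for u, v in itertools.combinations(G.nodes, 2):
        if not G.has_edge(u, v):
            Q[(u, v)] = B                    # penalty for any non-edge pair
    return Q
```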

109. InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [PDF]
  Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang
Abstract: Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which has a strong capability for modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance downgrade in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and that our method can achieve performance comparable to BERT in single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and recently we deployed the model online for topic-based recommendation.

110. Can AI help in screening Viral and COVID-19 pneumonia? [PDF]
  Muhammad E. H. Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, Muhammad Abdul Kadir, Zaid Bin Mahbub, Khandakar R. Islam, Muhammad Salman Khan, Atif Iqbal, Nasser Al-Emadi, Mamun Bin Ibne Reaz
Abstract: Coronavirus disease (COVID-19) is a pandemic disease, which has already infected more than half a million people and caused more than 30 thousand fatalities. The aim of this paper is to automatically detect COVID-19 pneumonia patients using digital x-ray images while maximizing detection accuracy using image pre-processing and deep-learning techniques. A public database was created by the authors using three public databases and also by collecting images from recently published articles. The database contains a mixture of 190 COVID-19, 1345 viral pneumonia, and 1341 normal chest x-ray images. An augmented training set was created with 2500 images of each category for training and validating four different pre-trained deep Convolutional Neural Networks (CNNs). These networks were tested for the classification of two different schemes (normal and COVID-19 pneumonia; normal, viral and COVID-19 pneumonia). The classification accuracy, sensitivity, specificity and precision for the two schemes were 98.3%, 96.7%, 100%, 100% and 98.3%, 96.7%, 99%, 100%, respectively. The high accuracy of this computer-aided diagnostic tool can significantly improve the speed and accuracy of diagnosing cases with COVID-19. This would be highly useful in this pandemic, where disease burden and the need for preventive measures are at odds with available resources.

111. Unsupervised Deep Learning for MR Angiography with Flexible Temporal Resolution [PDF]
  Eunju Cha, Hyungjin Chung, Eung Yeop Kim, Jong Chul Ye
Abstract: Time-resolved MR angiography (tMRA) has been widely used for dynamic contrast enhanced MRI (DCE-MRI) due to its highly accelerated acquisition. In tMRA, the periphery of the k-space data are sparsely sampled so that neighbouring frames can be merged to construct one temporal frame. However, this view-sharing scheme fundamentally limits the temporal resolution, and it is not possible to change the view-sharing number to achieve different spatio-temporal resolution trade-off. Although many deep learning approaches have been recently proposed for MR reconstruction from sparse samples, the existing approaches usually require matched fully sampled k-space reference data for supervised training, which is not suitable for tMRA. This is because high spatio-temporal resolution ground-truth images are not available for tMRA. To address this problem, here we propose a novel unsupervised deep learning using optimal transport driven cycle-consistent generative adversarial network (cycleGAN). In contrast to the conventional cycleGAN with two pairs of generator and discriminator, the new architecture requires just a single pair of generator and discriminator, which makes the training much simpler and improves the performance. Reconstruction results using in vivo tMRA data set confirm that the proposed method can immediately generate high quality reconstruction results at various choices of view-sharing numbers, allowing us to exploit better trade-off between spatial and temporal resolution in time-resolved MR angiography.

112. Mutual Learning Network for Multi-Source Domain Adaptation [PDF]
  Zhenpeng Li, Zhen Zhao, Yuhong Guo, Haifeng Shen, Jieping Ye
Abstract: Early Unsupervised Domain Adaptation (UDA) methods have mostly assumed the setting of a single source domain, where all the labeled source data come from the same distribution. However, in practice the labeled data can come from multiple source domains with different distributions. In such scenarios, the single source domain adaptation methods can fail due to the existence of domain shifts across different source domains, and multi-source domain adaptation methods need to be designed. In this paper, we propose a novel multi-source domain adaptation method, Mutual Learning Network for Multiple Source Domain Adaptation (ML-MSDA). Under the framework of mutual learning, the proposed method pairs the target domain with each single source domain to train a conditional adversarial domain adaptation network as a branch network, while taking the pair of the combined multi-source domain and target domain to train a conditional adversarial adaptive network as the guidance network. The multiple branch networks are aligned with the guidance network to achieve mutual learning by enforcing JS-divergence regularization over their prediction probability distributions on the corresponding target data. We conduct extensive experiments on multiple multi-source domain adaptation benchmark datasets. The results show the proposed ML-MSDA method outperforms the comparison methods and achieves state-of-the-art performance.
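
The JS-divergence agreement term mentioned above has a standard closed form over two categorical prediction vectors; a minimal NumPy version is given below, with batch reduction and any loss weighting left out as assumptions.

```python
import numpy as np

def js_regularizer(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two predicted class distributions
    (rows sum to 1), the kind of agreement term used to align a branch
    network with the guidance network."""
    p, q = p + eps, q + eps                  # avoid log(0)
    m = 0.5 * (p + q)                        # mixture distribution
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)
```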

113. NPENAS: Neural Predictor Guided Evolution for Neural Architecture Search [PDF]
  Chen Wei, Chuang Niu, Yiping Tang, Jimin Liang
Abstract: Neural architecture search (NAS) is a promising method for automatically finding excellent architectures. Commonly used search strategies such as evolutionary algorithms, Bayesian optimization, and predictor methods employ a predictor to rank sampled architectures. In this paper, we propose two predictor-based algorithms, NPUBO and NPENAS, for neural architecture search. Firstly, we propose NPUBO, which takes a neural predictor with uncertainty estimation as a surrogate model for Bayesian optimization. Secondly, we propose a simple and effective predictor guided evolution algorithm (NPENAS), which uses a neural predictor to guide the evolutionary algorithm in performing selection and mutation. Finally, we analyse the architecture sampling pipeline and find that the commonly used random sampling pipeline tends to generate architectures in a subspace of the real underlying search space. Our proposed methods can find architectures that achieve high test accuracy, comparable with recently proposed methods on the NAS-Bench-101 and NAS-Bench-201 datasets, using fewer training and evaluation samples. Code will be publicly available after all the experiments are finished.

114. A Benchmark for Point Clouds Registration Algorithms [PDF]
  Simone Fontana, Daniele Cattaneo, Augusto Luis Ballardini, Matteo Vaghi, Domenico Giorgio Sorrenti
Abstract: Point clouds registration is a fundamental step of many point clouds processing pipelines; however, most algorithms are tested on data collected ad-hoc and not shared with the research community. These data often cover only a very limited set of use cases; therefore, the results cannot be generalised. Public datasets proposed until now, taken individually, cover only a few kinds of environment and mostly a single sensor. For these reasons, we developed a benchmark, for localization and mapping applications, using multiple publicly available datasets. In this way, we have been able to cover many kinds of environments and many kinds of sensors that can produce point clouds. Furthermore, the ground truth has been thoroughly inspected and evaluated to ensure its quality. For some of the datasets, the accuracy of the ground truth system was not reported by the original authors; therefore we estimated it with our own novel method, based on an iterative registration algorithm. Along with the data, we provide a broad set of registration problems, chosen to cover different types of initial misalignment, various degrees of overlap, and different kinds of registration problems. Lastly, we propose a metric to measure the performance of registration algorithms: it combines the commonly used rotation and translation errors, to allow an objective comparison of the alignments. This work aims at encouraging authors to use a public and shared benchmark, instead of data collected ad-hoc, to ensure objectivity and repeatability, two fundamental characteristics in any scientific field.
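
The two ingredients the proposed metric combines have standard definitions: the geodesic angle of the residual rotation and the Euclidean norm of the translation error. The snippet below computes both; how the paper actually weighs them into a single score is not reproduced here.

```python
import numpy as np

def registration_errors(R_est, t_est, R_gt, t_gt):
    """Rotation error (degrees) and translation error (units of t) of a rigid
    registration estimate (R_est, t_est) against ground truth (R_gt, t_gt)."""
    dR = R_est.T @ R_gt                                 # residual rotation matrix
    cos_angle = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)), np.linalg.norm(t_est - t_gt)
```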

115. Gradient-based Data Augmentation for Semi-Supervised Learning [PDF]
  Hiroshi Kaizuka
Abstract: In semi-supervised learning (SSL), a technique called consistency regularization (CR) achieves high performance. It has been proved that the diversity of the data used in CR is extremely important for obtaining a model with high discrimination performance. We propose a new data augmentation, Gradient-based Data Augmentation (GDA), which is deterministically calculated from the gradient, with respect to the image pixel values, of the posterior probability distribution that is the model output. We aim to secure effective data diversity for CR by utilizing three types of GDA. On the other hand, it has been demonstrated that the mixup method for labeled data and unlabeled data is also effective in SSL. We propose an SSL method named MixGDA that combines various mixup methods and GDA. The discrimination performance achieved by MixGDA is evaluated against the 13-layer CNN that is used as the standard in SSL research. As a result, for CIFAR-10 (4000 labels), MixGDA matches the best performance reported so far. For SVHN (250 labels, 500 labels and 1000 labels) and CIFAR-100 (10000 labels), MixGDA achieves state-of-the-art performance.
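
The abstract specifies only the ingredient common to the three GDA variants: a deterministic perturbation computed from the pixel-value gradient of the model's posterior. The PyTorch sketch below shows that generic mechanism, using the entropy of the posterior as the scalar to differentiate; the choice of scalar, the step size, and the sign step are assumptions, not the paper's definitions.

```python
import torch

def gradient_augment(model, x, epsilon=0.05):
    """Perturb inputs along the pixel gradient of a scalar derived from the
    model's posterior distribution. x: (batch, C, H, W) image tensor."""
    x = x.clone().requires_grad_(True)
    probs = torch.softmax(model(x), dim=1)               # posterior over classes
    entropy = -(probs * torch.log(probs + 1e-12)).sum()  # scalar objective
    grad, = torch.autograd.grad(entropy, x)
    return (x + epsilon * grad.sign()).detach()          # deterministic given x
```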

116. Adversarial Imitation Attack [PDF]
  Mingyi Zhou, Jing Wu, Yipeng Liu, Shuaicheng Liu, Xiang Zhang, Ce Zhu
Abstract: Deep learning models are known to be vulnerable to adversarial examples. A practical adversarial attack should require as little knowledge of the attacked model as possible. Current substitute attacks need pre-trained models to generate adversarial examples, and their attack success rates heavily rely on the transferability of adversarial examples. Current score-based and decision-based attacks require many queries to the attacked models. In this study, we propose a novel adversarial imitation attack. First, it produces a replica of the attacked model via a two-player game similar to generative adversarial networks (GANs). The objective of the generative model is to generate examples on which the imitation model returns outputs different from those of the attacked model. The objective of the imitation model is to output the same labels as the attacked model for the same inputs. Then, the adversarial examples generated by the imitation model are utilized to fool the attacked model. Compared with the current substitute attacks, imitation attacks can use less training data to produce a replica of the attacked model and improve the transferability of adversarial examples. Experiments demonstrate that our imitation attack requires less training data than the black-box substitute attacks, but achieves an attack success rate close to the white-box attack on unseen data with no query.

117. Predicting the Popularity of Micro-videos with Multimodal Variational Encoder-Decoder Framework [PDF]
  Yaochen Zhu, Jiayi Xie, Zhenzhong Chen
Abstract: As an emerging type of user-generated content, micro-video drastically enriches people's entertainment experiences and social interactions. However, the popularity pattern of an individual micro-video still remains elusive among researchers. One of the major challenges is that the potential popularity of a micro-video tends to fluctuate under the impact of various external factors, which makes it full of uncertainties. In addition, since micro-videos are mainly uploaded by individuals that lack professional techniques, multiple types of noise could exist that obscure useful information. In this paper, we propose a multimodal variational encoder-decoder (MMVED) framework for micro-video popularity prediction tasks. MMVED learns a stochastic Gaussian embedding of a micro-video that is informative about its popularity level while simultaneously preserving the inherent uncertainties. Moreover, through the optimization of a deep variational information bottleneck lower-bound (IBLBO), the learned hidden representation is shown to be maximally expressive about the popularity target while maximally compressive to the noise in micro-video features. Furthermore, the Bayesian product-of-experts principle is applied to the multimodal encoder, where the decision to keep or discard information is made comprehensively with all available modalities. Extensive experiments conducted on a public dataset and a dataset we collected from Xigua demonstrate the effectiveness of the proposed MMVED framework.

118. DaST: Data-free Substitute Training for Adversarial Attacks [PDF]
  Mingyi Zhou, Jing Wu, Yipeng Liu, Shuaicheng Liu, Ce Zhu
Abstract: Machine learning models are vulnerable to adversarial examples. For the black-box setting, current substitute attacks need pre-trained models to generate adversarial examples. However, pre-trained models are hard to obtain in real-world tasks. In this paper, we propose a data-free substitute training method (DaST) to obtain substitute models for adversarial black-box attacks without the requirement of any real data. To achieve this, DaST utilizes specially designed generative adversarial networks (GANs) to train the substitute models. In particular, we design a multi-branch architecture and label-control loss for the generative model to deal with the uneven distribution of synthetic samples. The substitute model is then trained on the synthetic samples generated by the generative model, which are subsequently labeled by the attacked model. The experiments demonstrate that the substitute models produced by DaST achieve performance competitive with baseline models trained on the same training set as the attacked models. Additionally, to evaluate the practicability of the proposed method on a real-world task, we attack an online machine learning model on the Microsoft Azure platform. The remote model misclassifies 98.35% of the adversarial examples crafted by our method. To the best of our knowledge, we are the first to train a substitute model for adversarial attacks without any real data.

119. Learning to Smooth and Fold Real Fabric Using Dense Object Descriptors Trained on Synthetic Color Images [PDF]
  Aditya Ganapathi, Priya Sundaresan, Brijen Thananjeyan, Ashwin Balakrishna, Daniel Seita, Jennifer Grannen, Minho Hwang, Ryan Hoque, Joseph E. Gonzalez, Nawid Jamali, Katsu Yamane, Soshi Iba, Ken Goldberg
Abstract: Robotic fabric manipulation is challenging due to the infinite dimensional configuration space and complex dynamics. In this paper, we learn visual representations of deformable fabric by training dense object descriptors that capture correspondences across images of fabric in various configurations. The learned descriptors capture higher-level geometric structure, facilitating the design of explainable policies. We demonstrate that the learned representation facilitates multistep fabric smoothing and folding tasks on two real physical systems, the da Vinci surgical robot and the ABB YuMi, given high-level demonstrations from a supervisor. The system achieves a 78.8% average task success rate across six fabric manipulation tasks. See this https URL for supplementary material and videos.

120. Image compression optimized for 3D reconstruction by utilizing deep neural networks [PDF]
  Alex Golts, Yoav Y. Schechner
Abstract: Computer vision tasks are often expected to be executed on compressed images. Classical image compression standards like JPEG 2000 are widely used. However, they do not account for the specific end-task at hand. Motivated by works on recurrent neural network (RNN)-based image compression and three-dimensional (3D) reconstruction, we propose unified network architectures to solve both tasks jointly. These joint models provide image compression tailored for the specific task of 3D reconstruction. Images compressed by our proposed models yield 3D reconstruction performance superior to that obtained with JPEG 2000 compression. Our models significantly extend the range of compression rates for which 3D reconstruction is possible. We also show that this can be done highly efficiently: compression is obtained at almost no additional cost on top of the computation already required for performing the 3D reconstruction task.