Contents
1. More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification [PDF] Abstract
2. Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network [PDF] Abstract
3. DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features [PDF] Abstract
4. Look here! A parametric learning based approach to redirect visual attention [PDF] Abstract
5. DAWN: Vehicle Detection in Adverse Weather Nature Dataset [PDF] Abstract
6. Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [PDF] Abstract
7. Full Reference Screen Content Image Quality Assessment by Fusing Multi-level Structure Similarity [PDF] Abstract
8. Towards Unsupervised Crowd Counting via Regression-Detection Bi-knowledge Transfer [PDF] Abstract
9. Improving the Performance of Fine-Grain Image Classifiers via Generative Data Augmentation [PDF] Abstract
10. Attention-based Fully Gated CNN-BGRU for Russian Handwritten Text [PDF] Abstract
11. Anomaly localization by modeling perceptual features [PDF] Abstract
12. LogoDet-3K: A Large-Scale Image Dataset for Logo Detection [PDF] Abstract
13. Image-based Portrait Engraving [PDF] Abstract
14. TF-NAS: Rethinking Three Search Freedoms of Latency-Constrained Differentiable Neural Architecture Search [PDF] Abstract
15. Factor Graph based 3D Multi-Object Tracking in Point Clouds [PDF] Abstract
16. Guided Collaborative Training for Pixel-wise Semi-Supervised Learning [PDF] Abstract
17. Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds [PDF] Abstract
19. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [PDF] Abstract
22. Pixel-level Corrosion Detection on Metal Constructions by Fusion of Deep Learning Semantic and Contour Segmentation [PDF] Abstract
24. RAF-AU Database: In-the-Wild Facial Expressions with Subjective Emotion Judgement and Objective AU Annotations [PDF] Abstract
25. Balanced Depth Completion between Dense Depth Inference and Sparse Range Measurements via KISS-GP [PDF] Abstract
29. Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? [PDF] Abstract
38. Dynamic Object Removal and Spatio-Temporal RGB-D Inpainting via Geometry-Aware Adversarial Learning [PDF] Abstract
42. Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene [PDF] Abstract
44. Little Motion, Big Results: Using Motion Magnification to Reveal Subtle Tremors in Infants [PDF] Abstract
46. Renal Cell Carcinoma Detection and Subtyping with Minimal Point-Based Annotation in Whole-Slide Images [PDF] Abstract
51. A Longitudinal Method for Simultaneous Whole-Brain and Lesion Segmentation in Multiple Sclerosis [PDF] Abstract
56. Automatic assembly of aero engine low pressure turbine shaft based on 3D vision measurement [PDF] Abstract
Abstracts
1. More Diverse Means Better: Multimodal Deep Learning Meets Remote Sensing Imagery Classification [PDF] Back to Contents
Danfeng Hong, Lianru Gao, Naoto Yokoya, Jing Yao, Jocelyn Chanussot, Qian Du, Bing Zhang
Abstract: Classification and identification of the materials lying over or beneath the Earth's surface have long been a fundamental but challenging research topic in geoscience and remote sensing (RS), and have garnered growing attention owing to recent advances in deep learning techniques. Although deep networks have been successfully applied in single-modality-dominated classification tasks, their performance inevitably hits a bottleneck in complex scenes that require fine-grained classification, due to limited information diversity. In this work, we provide a baseline solution to this difficulty by developing a general multimodal deep learning (MDL) framework. In particular, we also investigate a special case of multi-modality learning (MML) -- cross-modality learning (CML), which exists widely in RS image classification applications. By focusing on "what", "where", and "how" to fuse, we show different fusion strategies as well as how to train deep networks and build the network architecture. Specifically, five fusion architectures are introduced and developed, and further unified in our MDL framework. More significantly, our framework is not limited to pixel-wise classification tasks but is also applicable to spatial information modeling with convolutional neural networks (CNNs). To validate the effectiveness and superiority of the MDL framework, extensive experiments in the MML and CML settings are conducted on two different multimodal RS datasets. Furthermore, the codes and datasets will be available at this https URL, contributing to the RS community.
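The abstract does not spell out the five fusion architectures; as a point of reference, below is a minimal sketch of one generic variant, feature-level fusion by concatenation. The encoder modules, feature dimension, and names are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MiddleFusion(nn.Module):
    """Concatenate per-modality features, then classify -- one of the
    simplest points in the fusion design space the abstract refers to."""
    def __init__(self, enc_a: nn.Module, enc_b: nn.Module,
                 feat_dim: int, n_classes: int):
        super().__init__()
        self.enc_a, self.enc_b = enc_a, enc_b          # modality encoders (assumed)
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, xa, xb):
        fa, fb = self.enc_a(xa), self.enc_b(xb)        # (B, feat_dim) each
        return self.head(torch.cat([fa, fb], dim=-1))  # fusion by concatenation
```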
2. Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network [PDF] Back to Contents
Anh-Huy Phan, Konstantin Sobolev, Konstantin Sozykin, Dmitry Ermilov, Julia Gusak, Petr Tichavsky, Valeriy Glukhov, Ivan Oseledets, Andrzej Cichocki
Abstract: Most state-of-the-art deep neural networks are overparameterized and exhibit a high computational cost. A straightforward approach to this problem is to replace convolutional kernels with their low-rank tensor approximations, for which the Canonical Polyadic (CP) tensor decomposition is one of the best-suited models. However, fitting the convolutional tensors by numerical optimization algorithms often encounters diverging components, i.e., extremely large rank-one tensors that cancel each other. Such degeneracy often causes non-interpretable results and numerical instability during neural network fine-tuning. This paper is the first study of degeneracy in the tensor decomposition of convolutional kernels. We present a novel method that stabilizes the low-rank approximation of convolutional kernels and ensures efficient compression while preserving the high-quality performance of the neural networks. We evaluate our approach on popular CNN architectures for image classification and show that our method results in much lower accuracy degradation and provides consistent performance.
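For orientation, the standard CP-based replacement (following Lebedev et al.) factors a K×K convolution into a 1×1, a K×1 depthwise, a 1×K depthwise, and a final 1×1 convolution. The sketch below shows only that structure; the paper's contribution is a stable way of fitting the factors, which is not implemented here, so the weights stay randomly initialized.

```python
import torch.nn as nn

def cp_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Structure of a rank-`rank` CP replacement for `conv`.
    In practice the four weights would be filled from a (stabilized)
    CP decomposition of conv.weight."""
    kh, kw = conv.kernel_size
    ph, pw = conv.padding
    return nn.Sequential(
        nn.Conv2d(conv.in_channels, rank, 1, bias=False),            # mix input channels
        nn.Conv2d(rank, rank, (kh, 1), padding=(ph, 0),
                  groups=rank, bias=False),                          # vertical 1D filter
        nn.Conv2d(rank, rank, (1, kw), padding=(0, pw),
                  groups=rank, bias=False),                          # horizontal 1D filter
        nn.Conv2d(rank, conv.out_channels, 1,
                  bias=conv.bias is not None),                       # mix output channels
    )
```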
3. DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features [PDF] Back to Contents
Dongjiang Li, Xuesong Shi, Qiwei Long, Shenghui Liu, Wei Yang, Fangshi Wang, Qi Wei, Fei Qiao
Abstract: A robust and efficient Simultaneous Localization and Mapping (SLAM) system is essential for robot autonomy. For visual SLAM algorithms, though the theoretical framework has been well established for most aspects, feature extraction and association are still empirically designed in most cases, and can be vulnerable in complex environments. This paper shows that feature extraction with deep convolutional neural networks (CNNs) can be seamlessly incorporated into a modern SLAM framework. The proposed SLAM system utilizes a state-of-the-art CNN to detect keypoints in each image frame, and to give not only keypoint descriptors, but also a global descriptor of the whole image. These local and global features are then used by different SLAM modules, resulting in much more robustness against environmental changes and viewpoint changes compared with using hand-crafted features. We also train a visual vocabulary of local features with a Bag of Words (BoW) method. Based on the local features, global features, and the vocabulary, a highly reliable loop closure detection method is built. Experimental results show that all the proposed modules significantly outperform the baseline, and the full system achieves much lower trajectory errors and much higher correct rates on all evaluated data. Furthermore, by optimizing the CNN with the Intel OpenVINO toolkit and utilizing the Fast BoW library, the system benefits greatly from the SIMD (single-instruction-multiple-data) techniques in modern CPUs. The full system can run in real-time without any GPU or other accelerators. The code is public at this https URL.
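As a rough illustration of how a CNN's global image descriptor supports loop-closure candidate retrieval, the sketch below ranks past keyframes by cosine similarity. The threshold and array shapes are assumptions, and the paper's BoW vocabulary over local features is omitted.

```python
import numpy as np

def loop_candidates(query_desc: np.ndarray, keyframe_descs: np.ndarray,
                    threshold: float = 0.85) -> np.ndarray:
    """Return indices of past keyframes whose global descriptor is
    cosine-similar to the current frame's descriptor."""
    kf = keyframe_descs / np.linalg.norm(keyframe_descs, axis=1, keepdims=True)
    q = query_desc / np.linalg.norm(query_desc)
    return np.where(kf @ q > threshold)[0]
```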
4. Look here! A parametric learning based approach to redirect visual attention [PDF] Back to Contents
Youssef Alami Mejjati, Celso F. Gomez, Kwang In Kim, Eli Shechtman, Zoya Bylinskii
Abstract: Across photography, marketing, and website design, being able to direct the viewer's attention is a powerful tool. Motivated by professional workflows, we introduce an automatic method to make an image region more attention-capturing via subtle image edits that maintain realism and fidelity to the original. From an input image and a user-provided mask, our GazeShiftNet model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions separately. We present the results of quantitative and qualitative experiments that demonstrate improvements over prior state-of-the-art. In contrast to existing attention shifting algorithms, our global parametric approach better preserves image semantics and avoids typical generative artifacts. Our edits enable inference at interactive rates on any image size, and easily generalize to videos. Extensions of our model allow for multi-style edits and the ability to both increase and attenuate attention in an image region. Furthermore, users can customize the edited images by dialing the edits up or down via interpolations in parameter space. This paper presents a practical tool that can simplify future image editing pipelines.
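To make "distinct global parametric transformations applied to the foreground and background" concrete, here is a toy compositing sketch. The gain/gamma parameterization is a hypothetical transform family, not the one the model actually predicts.

```python
import numpy as np

def apply_region_edits(img, mask, fg_params, bg_params):
    """img: (H, W, 3) floats in [0, 1]; mask: (H, W) soft foreground mask.
    Each params dict holds a hypothetical global edit (gain/gamma)."""
    def edit(x, p):
        return np.clip(p["gain"] * np.power(x, p["gamma"]), 0.0, 1.0)
    m = mask[..., None]
    return m * edit(img, fg_params) + (1.0 - m) * edit(img, bg_params)

# e.g. brighten the foreground slightly while flattening the background:
# out = apply_region_edits(img, mask, {"gain": 1.1, "gamma": 0.9},
#                          {"gain": 0.95, "gamma": 1.1})
```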
5. DAWN: Vehicle Detection in Adverse Weather Nature Dataset [PDF] Back to Contents
Mourad A. Kenk, Mahmoud Hassaballah
Abstract: Recently, self-driving vehicles have been introduced with several automated features including lane-keep assistance, queuing assistance in traffic jams, parking assistance and crash avoidance. These self-driving vehicles and intelligent visual traffic surveillance systems mainly depend on cameras and sensor fusion systems. Adverse weather conditions such as heavy fog, rain, snow, and sandstorms are considered dangerous restrictions on the functionality of cameras, seriously impacting the performance of the computer vision algorithms adopted for scene understanding (i.e., vehicle detection, tracking, and recognition in traffic scenes). For example, reflections from rain and ice over roads could cause massive detection errors that affect the performance of intelligent visual traffic systems. Additionally, scene understanding and vehicle detection algorithms are mostly evaluated using datasets that contain certain types of synthetic images plus a few real-world images. Thus, it is uncertain how these algorithms would perform on unclear images acquired in the wild and how the progress of these algorithms is standardized in the field. To this end, we present a new dataset (benchmark) consisting of real-world images collected under various adverse weather conditions, called DAWN. This dataset emphasizes a diverse traffic environment (urban, highway and freeway) as well as a rich variety of traffic flow. The DAWN dataset comprises a collection of 1000 images from real-traffic environments, which are divided into four sets of weather conditions: fog, snow, rain and sandstorms. The dataset is annotated with object bounding boxes for autonomous driving and video surveillance scenarios. This data helps in interpreting the effects caused by adverse weather conditions on the performance of vehicle detection systems.
6. Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [PDF] Back to Contents
Zhenyu Wu, Shuai Li, Chenglizhao Chen, Aimin Hao, Hong Qin
Abstract: Real human attention is an interactive activity between our visual system and our brain, using both low-level visual stimulus and high-level semantic information. Previous image salient object detection (SOD) works conduct their saliency predictions in a multi-task manner, i.e., performing pixel-wise saliency regression and segmentation-like saliency refinement at the same time, which degrades the ability of their feature backbones to reveal semantic information. However, given an image, we tend to pay more attention to regions that are semantically salient, even when these regions are perceptually not the most salient ones at first glance. In this paper, we divide the SOD problem into two sequential tasks: 1) we propose a lightweight, weakly supervised deep network to coarsely locate the semantically salient regions first; 2) then, as a post-processing procedure, we selectively fuse multiple off-the-shelf deep models on these semantically salient regions for pixel-wise saliency refinement. In sharp contrast to state-of-the-art (SOTA) methods that focus on learning pixel-wise saliency in a "single image" using mainly perceptual clues, our method investigates the "object-level semantic ranks between multiple images", a methodology more consistent with the real human attention mechanism. Our method is simple yet effective, and it is the first attempt to consider salient object detection mainly as an object-level semantic re-ranking problem.
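A minimal sketch of the second stage's idea, assuming plain averaging as a stand-in for the paper's selective fusion: combine several off-the-shelf saliency maps and gate them by the coarse semantic region from stage one.

```python
import numpy as np

def refine_saliency(region_mask: np.ndarray, maps: list) -> np.ndarray:
    """region_mask: (H, W) coarse object-level mask from the weakly
    supervised network; maps: (H, W) outputs of off-the-shelf SOD models."""
    fused = np.mean(np.stack(maps, axis=0), axis=0)  # naive stand-in for selective fusion
    return fused * region_mask                        # restrict to the semantic region
```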
7. Full Reference Screen Content Image Quality Assessment by Fusing Multi-level Structure Similarity [PDF] Back to Contents
Chenglizhao Chen, Hongmeng Zhao, Huan Yang, Chong Peng, Teng Yu
Abstract: Screen content images (SCIs) usually comprise various content types with sharp edges, in which artifacts or distortions can be well sensed by the vanilla structure similarity measurement in a full-reference manner. Nonetheless, almost all of the current SOTA structure similarity metrics are "locally" formulated in a single-level manner, while the true human visual system (HVS) works in a multi-level manner; such a mismatch could eventually prevent these metrics from achieving trustworthy quality assessment. To ameliorate this, the paper advocates a novel solution that measures structure similarity "globally" from the perspective of sparse representation. To perform multi-level quality assessment in accordance with the real HVS, the above-mentioned global metric is integrated with conventional local metrics via a newly devised selective deep fusion network. To validate its efficacy and effectiveness, we have compared our method with 12 SOTA methods over two widely-used large-scale public SCI datasets, and the quantitative results indicate that our method yields significantly higher consistency with subjective quality scores than the currently leading works. Both the source code and data are publicly available to gain widespread acceptance and facilitate new advancement and its validation.
8. Towards Unsupervised Crowd Counting via Regression-Detection Bi-knowledge Transfer [PDF] Back to Contents
Yuting Liu, Zheng Wang, Miaojing Shi, Shin'ichi Satoh, Qijun Zhao, Hongyu Yang
Abstract: Unsupervised crowd counting is a challenging yet not largely explored task. In this paper, we explore it in a transfer learning setting where we learn to detect and count persons in an unlabeled target set by transferring bi-knowledge learnt from regression- and detection-based models in a labeled source set. The dual source knowledge of the two models is heterogeneous and complementary as they capture different modalities of the crowd distribution. We formulate the mutual transformations between the outputs of regression- and detection-based models as two scene-agnostic transformers which enable knowledge distillation between the two models. Given the regression- and detection-based models and their mutual transformers learnt in the source, we introduce an iterative self-supervised learning scheme with regression-detection bi-knowledge transfer in the target. Extensive experiments on standard crowd counting benchmarks, ShanghaiTech, UCF\_CC\_50, and UCF\_QNRF demonstrate a substantial improvement of our method over other state-of-the-arts in the transfer learning setting.
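One direction of the regression-detection bridge can be pictured as rendering detections into a density map so that both model families produce comparable outputs. The Gaussian-splatting sketch below is an illustration under that reading, not the paper's learned scene-agnostic transformer.

```python
import numpy as np

def boxes_to_density(boxes, shape, sigma: float = 4.0) -> np.ndarray:
    """boxes: iterable of (x0, y0, x1, y1) person detections; each one
    contributes a unit-mass Gaussian at its center, so the resulting
    map integrates to the detected count."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    density = np.zeros(shape, dtype=np.float64)
    for x0, y0, x1, y1 in boxes:
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        density += g / g.sum()
    return density
```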
9. Improving the Performance of Fine-Grain Image Classifiers via Generative Data Augmentation [PDF] Back to Contents
Shashank Manjunath, Aitzaz Nathaniel, Jeff Druce, Stan German
Abstract: Recent advances in machine learning (ML) and computer vision tools have enabled applications in a wide variety of arenas such as financial analytics, medical diagnostics, and even within the Department of Defense. However, their widespread implementation in real-world use cases poses several challenges: (1) many applications are highly specialized, and hence operate in a \emph{sparse data} domain; (2) ML tools are sensitive to their training sets and typically require cumbersome, labor-intensive data collection and data labelling processes; and (3) ML tools can be extremely "black box," offering users little to no insight into the decision-making process or how new data might affect prediction performance. To address these challenges, we have designed and developed Data Augmentation from Proficient Pre-Training of Robust Generative Adversarial Networks (DAPPER GAN), an ML analytics support tool that automatically generates novel views of training images in order to improve downstream classifier performance. DAPPER GAN leverages high-fidelity embeddings generated by a StyleGAN2 model (trained on the LSUN cars dataset) to create novel imagery for previously unseen classes. We experimentally evaluate this technique on the Stanford Cars dataset, demonstrating improved vehicle make and model classification accuracy and reduced requirements for real data using our GAN based data augmentation framework. The method's validity was supported through an analysis of classifier performance on both augmented and non-augmented datasets, achieving comparable or better accuracy with up to 30\% less real data across visually similar classes. To support this method, we developed a novel augmentation method that can manipulate semantically meaningful dimensions (e.g., orientation) of the target object in the embedding space.
10. Attention-based Fully Gated CNN-BGRU for Russian Handwritten Text [PDF] Back to Contents
Abdelrahman Abdallah, Mohamed Hamada, Daniyar Nurseitov
Abstract: This research approaches the task of handwritten text recognition with attention-based encoder-decoder networks trained on the Kazakh and Russian languages. We developed a novel deep neural network model based on a fully gated CNN, supported by multiple bidirectional GRU and attention mechanisms, that achieves a 0.045 Character Error Rate (CER), 0.192 Word Error Rate (WER) and 0.253 Sequence Error Rate (SER) on the first test dataset, and 0.064 CER, 0.24 WER and 0.361 SER on the second test dataset. We also propose fully gated layers that take advantage of multiplying the tanh output feature with the input feature; this proposed design achieves better results. We experimented with our model on the Handwritten Kazakh \& Russian Database (HKR). Our research is the first work on the HKR dataset and demonstrates state-of-the-art results compared to most of the other existing models.
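A toy version of the encoder the abstract describes, assuming a gated activation (tanh feature multiplied by a sigmoid gate) followed by a bidirectional GRU over the width axis; the layer sizes, the height pooling, and the missing attention decoder are simplifications.

```python
import torch
import torch.nn as nn

class GatedConvBiGRU(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.feat = nn.Conv2d(1, 64, 3, padding=1)
        self.gate = nn.Conv2d(1, 64, 3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d((1, None))   # collapse the height axis
        self.bgru = nn.GRU(64, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):                              # x: (B, 1, H, W) line image
        f = torch.tanh(self.feat(x)) * torch.sigmoid(self.gate(x))  # gated activation
        f = self.pool(f).squeeze(2).transpose(1, 2)    # (B, W, 64) sequence over width
        out, _ = self.bgru(f)                          # (B, W, 2*hidden) for a decoder head
        return out
```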
11. Anomaly localization by modeling perceptual features [PDF] Back to Contents
David Dehaene, Pierre Eline
Abstract: Although unsupervised generative modeling of an image dataset using a Variational AutoEncoder (VAE) has been used to detect anomalous images, or anomalous regions in images, recent works have shown that this method often identifies images or regions that do not concur with human perception, even questioning the usability of generative models for robust anomaly detection. Here, we argue that those issues can emerge from having a simplistic model of the anomaly distribution and we propose a new VAE-based model expressing a more complex anomaly model that is also closer to human perception. This Feature-Augmented VAE is trained by not only reconstructing the input image in pixel space, but also in several different feature spaces, which are computed by a convolutional neural network trained beforehand on a large image dataset. It achieves clear improvement over state-of-the-art methods on the MVTec anomaly detection and localization datasets.
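The training objective can be sketched as a standard VAE loss plus reconstruction terms in the feature spaces of a frozen, pre-trained CNN. The loss weights and the `feat_net` interface (returning a list of feature maps) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def favae_loss(x, x_hat, mu, logvar, feat_net, beta=1.0, gamma=1.0):
    """Pixel-space + feature-space reconstruction, plus the KL term."""
    rec_pix = F.mse_loss(x_hat, x)
    with torch.no_grad():
        target_feats = feat_net(x)            # frozen perceptual network
    recon_feats = feat_net(x_hat)             # gradients flow into x_hat
    rec_feat = sum(F.mse_loss(r, t) for r, t in zip(recon_feats, target_feats))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec_pix + gamma * rec_feat + beta * kl
```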
12. LogoDet-3K: A Large-Scale Image Dataset for Logo Detection [PDF] Back to Contents
Jing Wang, Weiqing Min, Sujuan Hou, Shengnan Ma, Yuanjie Zheng, Shuqiang Jiang
Abstract: Logo detection has been gaining considerable attention because of its wide range of applications in the multimedia field, such as copyright infringement detection, brand visibility monitoring, and product brand management on social media. In this paper, we introduce LogoDet-3K, the largest logo detection dataset with full annotation, which has 3,000 logo categories, about 200,000 manually annotated logo objects and 158,652 images. LogoDet-3K creates a more challenging benchmark for logo detection, due to its higher comprehensive coverage and wider variety in both logo categories and annotated objects compared with existing datasets. We describe the collection and annotation process of our dataset, and analyze its scale and diversity in comparison to other datasets for logo detection. We further propose a strong baseline method, Logo-Yolo, which incorporates the Focal loss and CIoU loss into the state-of-the-art YOLOv3 framework for large-scale logo detection. Logo-Yolo can solve the problems of multi-scale objects, logo sample imbalance and inconsistent bounding-box regression. It obtains about a 4% improvement in average performance compared with YOLOv3, and greater improvements compared with several reported deep detection models on LogoDet-3K. The evaluations on three other existing datasets further verify the effectiveness of our method, and demonstrate the better generalization ability of LogoDet-3K on logo detection and retrieval tasks. The LogoDet-3K dataset is used to promote large-scale logo-related research and it can be found at this https URL.
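The CIoU loss the abstract refers to is a standard definition: the IoU penalized by the normalized center distance and an aspect-ratio consistency term. For reference, a single-pair sketch (well-formed boxes assumed; batching and reduction omitted):

```python
import math

def ciou_loss(box, gt):
    """box, gt: (x0, y0, x1, y1). Returns 1 - CIoU for one pair."""
    x0, y0, x1, y1 = box
    g0, g1, g2, g3 = gt
    inter = max(0.0, min(x1, g2) - max(x0, g0)) * max(0.0, min(y1, g3) - max(y0, g1))
    union = (x1 - x0) * (y1 - y0) + (g2 - g0) * (g3 - g1) - inter
    iou = inter / union
    # squared center distance over squared diagonal of the enclosing box
    rho2 = ((x0 + x1 - g0 - g2) ** 2 + (y0 + y1 - g1 - g3) ** 2) / 4.0
    c2 = (max(x1, g2) - min(x0, g0)) ** 2 + (max(y1, g3) - min(y0, g1)) ** 2
    # aspect-ratio consistency term
    v = (4.0 / math.pi ** 2) * (math.atan((g2 - g0) / (g3 - g1))
                                - math.atan((x1 - x0) / (y1 - y0))) ** 2
    alpha = v / (1.0 - iou + v + 1e-9)
    return 1.0 - iou + rho2 / c2 + alpha * v
```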
13. Image-based Portrait Engraving [PDF] Back to Contents
Paul L. Rosin, Yu-Kun Lai
Abstract: This paper describes a simple image-based method that applies engraving stylisation to portraits using ordered dithering. Face detection is used to estimate a rough proxy geometry of the head consisting of a cylinder, which is used to warp the dither matrix, causing the engraving lines to curve around the face for better stylisation. Finally, an application of the approach to colour engraving is demonstrated.
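For reference, basic ordered dithering with a recursively built Bayer threshold matrix, the ingredient the method then warps with the cylindrical head proxy (the warp itself is omitted here):

```python
import numpy as np

def bayer(n: int) -> np.ndarray:
    """2^n x 2^n Bayer matrix with thresholds in (0, 1)."""
    m = np.array([[0, 2], [3, 1]])
    for _ in range(n - 1):
        m = np.block([[4 * m,     4 * m + 2],
                      [4 * m + 3, 4 * m + 1]])
    return (m + 0.5) / m.size

def ordered_dither(gray: np.ndarray, n: int = 3) -> np.ndarray:
    """gray: (H, W) intensities in [0, 1]; tile thresholds over the image."""
    t = bayer(n)
    h, w = gray.shape
    reps = (h // t.shape[0] + 1, w // t.shape[1] + 1)
    return (gray > np.tile(t, reps)[:h, :w]).astype(np.uint8)
```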
14. TF-NAS: Rethinking Three Search Freedoms of Latency-Constrained Differentiable Neural Architecture Search [PDF] Back to Contents
Yibo Hu, Xiang Wu, Ran He
Abstract: With the flourish of differentiable neural architecture search (NAS), automatically searching latency-constrained architectures gives a new perspective to reduce human labor and expertise. However, the searched architectures are usually suboptimal in accuracy and may have large jitters around the target latency. In this paper, we rethink three freedoms of differentiable NAS, i.e. operation-level, depth-level and width-level, and propose a novel method, named Three-Freedom NAS (TF-NAS), to achieve both good classification accuracy and precise latency constraint. For the operation-level, we present a bi-sampling search algorithm to moderate the operation collapse. For the depth-level, we introduce a sink-connecting search space to ensure the mutual exclusion between skip and other candidate operations, as well as eliminate the architecture redundancy. For the width-level, we propose an elasticity-scaling strategy that achieves precise latency constraint in a progressively fine-grained manner. Experiments on ImageNet demonstrate the effectiveness of TF-NAS. Particularly, our searched TF-NAS-A obtains 76.9% top-1 accuracy, achieving state-of-the-art results with less latency. The total search time is only 1.8 days on 1 Titan RTX GPU. Code is available at this https URL.
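A common way to keep a latency constraint differentiable in this setting is to take the softmax-weighted sum of profiled per-op latencies. The sketch below shows that generic formulation only; it does not model the paper's bi-sampling, sink-connecting space, or elasticity-scaling refinements.

```python
import torch

def latency_loss(arch_logits, op_latency, target_ms, lam=0.1):
    """arch_logits, op_latency: (num_layers, num_ops); op_latency holds
    profiled per-candidate-op latencies. The expected network latency is
    differentiable w.r.t. the architecture parameters."""
    probs = torch.softmax(arch_logits, dim=-1)
    expected = (probs * op_latency).sum()
    return lam * torch.abs(expected - target_ms)
```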
15. Factor Graph based 3D Multi-Object Tracking in Point Clouds [PDF] Back to Contents
Johannes Pöschmann, Tim Pfeifer, Peter Protzel
Abstract: Accurate and reliable tracking of multiple moving objects in 3D space is an essential component of urban scene understanding. This is a challenging task because it requires the assignment of detections in the current frame to the predicted objects from the previous one. Existing filter-based approaches tend to struggle if this initial assignment is not correct, which can happen easily. We propose a novel optimization-based approach that does not rely on explicit and fixed assignments. Instead, we represent the result of an off-the-shelf 3D object detector as Gaussian mixture model, which is incorporated in a factor graph framework. This gives us the flexibility to assign all detections to all objects simultaneously. As a result, the assignment problem is solved implicitly and jointly with the 3D spatial multi-object state estimation using non-linear least squares optimization. Despite its simplicity, the proposed algorithm achieves robust and reliable tracking results and can be applied for offline as well as online tracking. We demonstrate its performance on the real world KITTI tracking dataset and achieve better results than many state-of-the-art algorithms. Especially the consistency of the estimated tracks is superior offline as well as online.
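The key move, scoring a predicted object state against all detections at once instead of committing to a single assignment, amounts to evaluating a Gaussian-mixture likelihood. An isotropic-covariance sketch (a real system would use full covariances, and detector confidences as mixture weights):

```python
import numpy as np

def gmm_neg_log_lik(pred_pos, det_centers, var=0.5, weights=None):
    """pred_pos: (3,) predicted object position; det_centers: (N, 3)
    detection centers forming the mixture. Returns the factor cost
    (negative log-likelihood)."""
    n, d = det_centers.shape
    w = np.full(n, 1.0 / n) if weights is None else weights
    quad = np.sum((det_centers - pred_pos) ** 2, axis=1) / var
    norm = (2.0 * np.pi * var) ** (d / 2.0)
    return -np.log(np.sum(w * np.exp(-0.5 * quad)) / norm + 1e-12)
```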
16. Guided Collaborative Training for Pixel-wise Semi-Supervised Learning [PDF] Back to Contents
Zhanghan Ke, Di Qiu, Kaican Li, Qiong Yan, Rynson W.H. Lau
Abstract: We investigate the generalization of semi-supervised learning (SSL) to diverse pixel-wise tasks. Although SSL methods have achieved impressive results in image classification, their performance when applied to pixel-wise tasks is unsatisfactory due to their need for dense outputs. In addition, existing pixel-wise SSL approaches are only suitable for certain tasks as they usually require the use of task-specific properties. In this paper, we present a new SSL framework, named Guided Collaborative Training (GCT), for pixel-wise tasks, with two main technical contributions. First, GCT addresses the issues caused by the dense outputs through a novel flaw detector. Second, the modules in GCT learn from unlabeled data collaboratively through two newly proposed constraints that are independent of task-specific properties. As a result, GCT can be applied to a wide range of pixel-wise tasks without structural adaptation. Our extensive experiments on four challenging vision tasks, including semantic segmentation, real image denoising, portrait image matting, and night image enhancement, show that GCT outperforms state-of-the-art SSL methods by a large margin. Our code is available at: this https URL.
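The abstract does not spell out the constraints, so the following is only a hypothetical sketch of a flaw-weighted consistency term between two collaborating models; pred_a/pred_b (dense predictions) and flaw_a/flaw_b (per-pixel flaw scores) are assumed names.

import numpy as np

def flaw_weighted_consistency(pred_a, pred_b, flaw_a, flaw_b):
    # Hypothetical per-pixel constraint: wherever the flaw detector judges
    # model A more reliable (lower flaw score), model B is pulled toward A,
    # and vice versa -- no task-specific property is needed.
    a_more_reliable = flaw_a < flaw_b
    diff = (pred_a - pred_b) ** 2
    loss_b = diff[a_more_reliable].mean() if a_more_reliable.any() else 0.0
    loss_a = diff[~a_more_reliable].mean() if (~a_more_reliable).any() else 0.0
    return loss_a + loss_b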
17. Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds [PDF] Back to Contents
Zichuan Xu, Jiangkai Wu, Qiufen Xia, Pan Zhou, Jiankang Ren, Huizhi Liang
Abstract: With the development of deep learning technologies, attribute recognition and person re-identification (re-ID) have attracted extensive attention and achieved continuous improvement by executing computing-intensive deep neural networks in cloud datacenters. However, datacenter deployment cannot meet the real-time requirements of attribute recognition and person re-ID, due to the prohibitive delay of backhaul networks and the large data transmissions from cameras to datacenters. A feasible solution is thus to employ mobile edge clouds (MEC) in the proximity of cameras and enable distributed inference. In this paper, we design novel models for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system. We also investigate the problem of distributed inference in the MEC-enabled camera network. To this end, we first propose a novel inference framework with a set of distributed modules, jointly considering attribute recognition and person re-ID. We then devise a learning-based algorithm for distributing the modules of the proposed inference framework, considering the dynamic MEC-enabled camera network with uncertainties. We finally evaluate the performance of the proposed algorithm through both simulations with real datasets and a system implementation in a real testbed. Evaluation results show that the performance of the proposed algorithm with the distributed inference framework is promising: it reaches accuracies of up to 92.9% for attribute recognition and 96.6% for person identification, while significantly reducing the inference delay by at least 40.6% compared with existing methods.
18. PAM: Point-wise Attention Module for 6D Object Pose Estimation [PDF] Back to Contents
Myoungha Song, Jeongho Lee, Donghwan Kim
Abstract: 6D pose estimation refers to object recognition together with the estimation of 3D rotation and 3D translation. The key to estimating the 6D pose is extracting features rich enough to determine the pose in any environment. Previous methods utilized depth information in the refinement process or were designed as heterogeneous architectures for each data space to extract features. However, these methods are limited in that they cannot extract sufficient features. Therefore, this paper proposes a Point-wise Attention Module (PAM) that can efficiently extract powerful features from RGB-D. In our module, the attention map is formed through a Geometric Attention Path (GAP) and a Channel Attention Path (CAP). GAP is designed to attend to the important parts of the geometric information, and CAP is designed to attend to the important parts of the channel information. We show that the attention module efficiently creates feature representations without significantly increasing computational complexity. Experimental results show that the proposed method outperforms existing methods on the YCB Video and LineMod benchmarks. In addition, the attention module was applied to a classification task, and it was confirmed that the performance significantly improved compared to the existing model.
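A toy stand-in for the described two-path attention (the paper's GAP and CAP are learned; the distance-based geometric gate and squeeze-style channel gate below are assumptions for illustration):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_attention(features, xyz):
    # features: (N, C) per-point features; xyz: (N, 3) point coordinates.
    # Geometric path: attend over other points, favouring near neighbours.
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)   # (N, N)
    gap = softmax(-d2, axis=1)
    # Channel path: squeeze-and-excitation style channel gates.
    cap = 1.0 / (1.0 + np.exp(-features.mean(axis=0)))        # (C,)
    return (gap @ features) * cap                             # (N, C)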
19. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [PDF] Back to Contents
Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet
Abstract: Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the problem of accurate cross-media retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. In particular, we present an approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words respectively, in order to preserve the informative richness of both modalities. The proposed approach obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k. Moreover, on MS-COCO it also outperforms current approaches on the sentence retrieval task. Given our long-term interest in scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. In fact, cross-attention links preclude separately extracting the visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way towards effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against the eight best methods in this research area. On the MS-COCO 1K test set, we obtain improvements of 3.5% and 1.2% in Recall@1 on the image and sentence retrieval tasks, respectively. The code used for the experiments is publicly available on GitHub at this https URL.
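The abstract implies a late, fine-grained alignment between independently encoded regions and words. A common pooling of this kind (max over regions per word, mean over words), which is only a plausible approximation of TERAN's scoring, looks like this:

import numpy as np

def image_sentence_score(regions, words):
    # regions: (R, D) region embeddings; words: (W, D) word embeddings.
    # Cosine similarity between every region and word, max-pooled over
    # regions per word, then averaged over words. Because the two encoders
    # never cross-attend, regions can be indexed offline.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T                                    # (W, R)
    return sim.max(axis=1).mean()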
20. Defending Adversarial Examples via DNN Bottleneck Reinforcement [PDF] Back to Contents
Wenqing Liu, Miaojing Shi, Teddy Furon, Li Li
Abstract: This paper presents a DNN bottleneck reinforcement scheme to alleviate the vulnerability of Deep Neural Networks (DNNs) to adversarial attacks. Typical DNN classifiers encode the input image into a compressed latent representation more suitable for inference. This information bottleneck makes a trade-off between the image-specific structure and the class-specific information in an image. By reinforcing the former while maintaining the latter, any redundant information, be it adversarial or not, should be removed from the latent representation. Hence, this paper proposes to jointly train an auto-encoder (AE) sharing the same encoding weights with the visual classifier. In order to reinforce the information bottleneck, we introduce a multi-scale low-pass objective and multi-scale high-frequency communication for better frequency steering in the network. Unlike existing approaches, our scheme is the first reforming defense that keeps the classifier structure untouched without appending any pre-processing head, and it is trained with clean images only. Extensive experiments on MNIST, CIFAR-10 and ImageNet demonstrate the strong defense of our method against various adversarial attacks.
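A minimal sketch of the joint objective as described (the multi-scale low-pass and high-frequency terms are omitted; lam and all names are illustrative): the classifier head and an auto-encoder decoder consume the same latent code from a shared encoder, so the code must both classify well and reconstruct the image.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_bottleneck_loss(logits, labels, recon, images, lam=0.5):
    # Classification and reconstruction share one encoder, squeezing
    # redundant (possibly adversarial) information out of the latent code.
    ce = -np.mean(np.log(softmax(logits)[np.arange(len(labels)), labels] + 1e-12))
    mse = np.mean((recon - images) ** 2)
    return ce + lam * mse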
21. A Zero-Shot Sketch-based Inter-Modal Object Retrieval Scheme for Remote Sensing Images [PDF] Back to Contents
Ushasi Chaudhuri, Biplab Banerjee, Avik Bhattacharya, Mihai Datcu
Abstract: Conventional retrieval methods in remote sensing (RS) are often based on a uni-modal data retrieval framework. In this work, we propose a novel inter-modal, triplet-based zero-shot retrieval scheme utilizing a sketch-based representation of RS data. The proposed scheme performs efficiently even when the sketch representations are only marginally prototypical of the image. We conducted experiments on a new bi-modal image-sketch dataset, called Earth on Canvas (EoC), conceived during this study. We perform thorough benchmarking on this dataset and demonstrate that the proposed network outperforms other state-of-the-art methods for zero-shot sketch-based retrieval in remote sensing.
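The abstract names an inter-modal triplet scheme; a standard triplet hinge over sketch and image embeddings (margin value illustrative) captures the basic objective:

import numpy as np

def triplet_loss(sketch, pos_image, neg_image, margin=0.2):
    # Pull a sketch embedding toward its matching RS image embedding and
    # push it away from a non-matching one by at least the margin.
    d_pos = np.sum((sketch - pos_image) ** 2)
    d_neg = np.sum((sketch - neg_image) ** 2)
    return max(0.0, d_pos - d_neg + margin)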
22. Pixel-level Corrosion Detection on Metal Constructions by Fusion of Deep Learning Semantic and Contour Segmentation [PDF] Back to Contents
Iason Katsamenis, Eftychios Protopapadakis, Anastasios Doulamis, Nikolaos Doulamis, Athanasios Voulodimos
Abstract: Corrosion detection on metal constructions is a major challenge in civil engineering for quick, safe and effective inspection. Existing image analysis approaches tend to place bounding boxes around the defective region, which is not adequate either for structural analysis or for pre-fabrication, an innovative construction concept that reduces maintenance cost and time and improves safety. In this paper, we apply three semantic-segmentation-oriented deep learning models (FCN, U-Net and Mask R-CNN) for corrosion detection, which perform better in terms of accuracy and time and require a smaller number of annotated samples compared to other deep models, e.g., CNNs. However, the final images derived are still not sufficiently accurate for structural analysis and pre-fabrication. Thus, we adopt a novel data projection scheme that fuses the results of color segmentation, yielding accurate but over-segmented contours of a region, with a processed area of the deep masks, resulting in high-confidence corroded pixels.
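One plausible reading of the projection scheme (the exact processing of the deep masks is not given, so the dilation step below is an assumption):

import numpy as np
from scipy.ndimage import binary_dilation

def fuse_corrosion_masks(color_mask, deep_mask, grow=2):
    # Grow the deep semantic mask slightly to absorb boundary uncertainty,
    # then keep only color-segmented pixels inside it: sharp contours from
    # color segmentation, semantic confidence from the deep mask.
    processed = binary_dilation(deep_mask, iterations=grow)
    return np.logical_and(color_mask, processed)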
23. Representative Graph Neural Network [PDF] Back to Contents
Changqian Yu, Yifan Liu, Changxin Gao, Chunhua Shen, Nong Sang
Abstract: The non-local operation is widely explored to model long-range dependencies. However, the redundant computation in this operation leads to prohibitive complexity. In this paper, we present a Representative Graph (RepGraph) layer to dynamically sample a few representative features, which dramatically reduces redundancy. Instead of propagating messages from all positions, our RepGraph layer computes the response of one node with merely a few representative nodes. The locations of the representative nodes come from a learned spatial offset matrix. The RepGraph layer is flexible to integrate into many visual architectures and to combine with other operations. Applied to semantic segmentation, without any bells and whistles, our RepGraph network can compete with or perform favourably against state-of-the-art methods on three challenging benchmarks: the ADE20K, Cityscapes, and PASCAL-Context datasets. In the task of object detection, our RepGraph layer can also improve performance on the COCO dataset compared to the non-local operation. Code is available at this https URL.
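A minimal sketch of the sampling idea (the paper learns the representative locations via a spatial offset matrix; random sampling below is a stand-in):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def representative_response(features, num_reps=8, seed=0):
    # Each node attends to a handful of representative features rather than
    # all N positions, so the cost is O(N * num_reps) instead of O(N^2).
    rng = np.random.default_rng(seed)
    reps = features[rng.choice(len(features), num_reps, replace=False)]
    attn = softmax(features @ reps.T, axis=1)        # (N, num_reps)
    return attn @ reps                               # (N, C)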
24. RAF-AU Database: In-the-Wild Facial Expressions with Subjective Emotion Judgement and Objective AU Annotations [PDF] Back to Contents
Wenjing Yan, Shan Li, Chengtao Que, JiQuan Pei, Weihong Deng
Abstract: Much of the work on automatic facial expression recognition relies on databases containing a certain number of emotion classes and their exaggerated facial configurations (generally six prototypical facial expressions), based on Ekman's Basic Emotion Theory. However, recent studies have revealed that facial expressions in our human life can be blended from multiple basic emotions. Moreover, the emotion labels for these in-the-wild facial expressions cannot easily be annotated solely on pre-defined AU patterns. How to analyze the action units for such complex expressions is still an open question. To address this issue, we develop the RAF-AU database, which employs a sign-based (i.e., AUs) and judgement-based (i.e., perceived emotion) approach to annotating blended facial expressions in the wild. We first reviewed the annotation methods in existing databases and identified crowdsourcing as a promising strategy for labeling in-the-wild facial expressions. Then, RAF-AU was finely annotated by experienced coders, on which we also conducted a preliminary investigation of which key AUs contribute most to a perceived emotion, and of the relationship between AUs and facial expressions. Finally, we provide a baseline for AU recognition in RAF-AU using popular features and multi-label learning methods.
25. Balanced Depth Completion between Dense Depth Inference and Sparse Range Measurements via KISS-GP [PDF] Back to Contents
Sungho Yoon, Ayoung Kim
Abstract: Estimating a dense and accurate depth map is the key requirement for autonomous driving and robotics. Recent advances in deep learning have allowed full-resolution depth estimation from a single image. Despite this impressive result, many deep-learning-based monocular depth estimation (MDE) algorithms have failed to maintain their accuracy, yielding meter-level estimation errors. In many robotics applications, accurate but sparse measurements are readily available from Light Detection and Ranging (LiDAR). Although highly accurate, their sparsity limits full-resolution depth map reconstruction. Targeting the problem of dense and accurate depth map recovery, this paper introduces the fusion of these two modalities as a depth completion (DC) problem by dividing the roles of depth inference and depth regression. Utilizing a state-of-the-art MDE and our Gaussian process (GP) based depth-regression method, we propose a general solution that can flexibly work with various MDE modules by enhancing their depth with sparse range measurements. To overcome the major limitation of GPs, we adopt Kernel Interpolation for Scalable Structured (KISS) GP and reduce the computational complexity from O(N^3) to O(N). Our experiments demonstrate that the accuracy and robustness of our method outperform state-of-the-art unsupervised methods for sparse and biased measurements.
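A minimal sketch of the fusion, assuming the GP regresses the residual between sparse LiDAR depths and the MDE prediction (an exact GP with an RBF kernel is shown for clarity; the paper replaces it with KISS-GP to cut the cost from O(N^3) to O(N); ell and noise are illustrative):

import numpy as np

def rbf(a, b, ell=10.0):
    # Squared-exponential kernel over pixel coordinates.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_correct_depth(mde_depth, sparse_uv, sparse_depth, query_uv, noise=1e-2):
    # mde_depth: (H, W) dense MDE output; sparse_uv: (M, 2) integer (u, v)
    # pixels with LiDAR depths sparse_depth: (M,); query_uv: (Q, 2) pixels.
    residual = sparse_depth - mde_depth[sparse_uv[:, 1], sparse_uv[:, 0]]
    K = rbf(sparse_uv.astype(float), sparse_uv.astype(float))
    alpha = np.linalg.solve(K + noise * np.eye(len(K)), residual)
    correction = rbf(query_uv.astype(float), sparse_uv.astype(float)) @ alpha
    return mde_depth[query_uv[:, 1], query_uv[:, 0]] + correction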
26. Towards Geometry Guided Neural Relighting with Flash Photography [PDF] Back to Contents
Di Qiu, Jin Zeng, Zhanghan Ke, Wenxiu Sun, Chengxi Yang
Abstract: Previous image-based relighting methods require capturing multiple images to acquire high-frequency lighting effects under different lighting conditions, which requires nontrivial effort and may be unrealistic in certain practical use scenarios. While such approaches rely entirely on cleverly sampling the color images under different lighting conditions, little has been done to utilize the geometric information that crucially influences the high-frequency features in the images, such as glossy highlights and cast shadows. We therefore propose a framework for image relighting from a single flash photograph with its corresponding depth map using deep learning. By incorporating the depth map, our approach is able to extrapolate realistic high-frequency effects under novel lighting via geometry-guided image decomposition from the flashlight image, and to predict the cast shadow map from the shadow-encoding transformed depth map. Moreover, the single-image-based setup greatly simplifies the data capture process. We experimentally validate the advantage of our geometry-guided approach over state-of-the-art image-based approaches in intrinsic image decomposition and image relighting, and also demonstrate our performance on real mobile phone photo examples.
27. HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation [PDF] Back to Contents
Meng Wei, Chun Yuan, Xiaoyu Yue, Kuo Zhong
Abstract: Scene graph generation aims to produce structured representations for images, which requires understanding the relations between objects. Due to the continuous nature of deep neural networks, the prediction of scene graphs is divided into object detection and relation classification. However, the independent relation classes cannot separate the visual features well. Although some methods organize the visual features into graph structures and use message passing to learn contextual information, they still suffer from drastic intra-class variations and unbalanced data distributions. One important factor is that they learn an unstructured output space that ignores the inherent structures of scene graphs. Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. First, we propose a novel structure-aware embedding-to-classifier (SEC) module to incorporate both local and global structural information of relationships into the output space. Specifically, a set of context embeddings is learned via local-graph-based message passing and then mapped to a global-structure-based classification space. Second, since learning too many context-specific classification subspaces can suffer from data sparsity issues, we propose a hierarchical semantic aggregation (HSA) module to reduce the number of subspaces by introducing higher order structural information. HSA is also a fast and flexible tool for automatically searching a semantic object hierarchy based on relational knowledge graphs. Extensive experiments show that the proposed HOSE-Net achieves state-of-the-art performance on two popular benchmarks, Visual Genome and VRD.
28. ASAP-Net: Attention and Structure Aware Point Cloud Sequence Segmentation [PDF] Back to Contents
Hanwen Cao, Yongyi Lu, Cewu Lu, Bo Pang, Gongshen Liu, Alan Yuille
Abstract: Recent works on point clouds show that multi-frame spatio-temporal modeling outperforms single-frame versions by utilizing cross-frame information. In this paper, we further improve spatio-temporal point cloud feature learning with a flexible module called ASAP that considers both attention and structure information across frames, which we find to be two important factors for successful segmentation in dynamic point clouds. Firstly, our ASAP module contains a novel attentive temporal embedding layer to fuse the relatively informative local features across frames in a recurrent fashion. Secondly, an efficient spatio-temporal correlation method is proposed to exploit more local structure for embedding, meanwhile enforcing temporal consistency and reducing computational complexity. Finally, we show the generalization ability of the proposed ASAP module with different backbone networks for point cloud sequence segmentation. Our ASAP-Net (backbone plus ASAP module) outperforms baselines and previous methods on both the Synthia and SemanticKITTI datasets (+3.4 to +15.2 mIoU points with different backbones). Code is available at this https URL
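A loose sketch of recurrent attentive temporal fusion over per-point features (the sigmoid gate derived from feature similarity is an assumption standing in for the learned attention):

import numpy as np

def attentive_temporal_fusion(feat_prev, feat_curr):
    # feat_prev: (N, C) accumulated past features; feat_curr: (N, C)
    # current-frame features for the same points. Each point weighs the
    # two frames with an attention gate instead of a fixed average.
    sim = (feat_prev * feat_curr).sum(axis=-1, keepdims=True)
    w = 1.0 / (1.0 + np.exp(-sim))           # sigmoid gate in [0, 1]
    return w * feat_curr + (1.0 - w) * feat_prev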
29. Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? [PDF] Back to Contents
Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu
Abstract: Detecting Graphical User Interface (GUI) elements in GUI images is a domain-specific object detection task. It supports many software engineering tasks, such as GUI animation and testing, GUI search and code generation. Existing studies for GUI element detection directly borrow mature methods from the computer vision (CV) domain, including old-fashioned ones that rely on traditional image processing features (e.g., Canny edges, contours), and deep learning models that learn to detect from large-scale GUI data. Unfortunately, these CV methods were not originally designed with an awareness of the unique characteristics of GUIs and GUI elements, or of the high localization accuracy required by the GUI element detection task. We conduct the first large-scale empirical study of seven representative GUI element detection methods on over 50k GUI images to understand the capabilities, limitations and effective designs of these methods. This study not only sheds light on the technical challenges to be addressed but also informs the design of new GUI element detection methods. We accordingly design a new GUI-specific old-fashioned method for non-text GUI element detection, which adopts a novel top-down coarse-to-fine strategy, and combine it with a mature deep learning model for GUI text detection. Our evaluation on 25,000 GUI images shows that our method significantly advances the state-of-the-art performance in GUI element detection.
30. Open Set Recognition with Conditional Probabilistic Generative Models [PDF] Back to Contents
Xin Sun, Chi Zhang, Guosheng Lin, Keck-Voon Ling
Abstract: Deep neural networks have made breakthroughs in a wide range of visual understanding tasks. A typical challenge that hinders their real-world application is that unknown samples may be fed into the system during the testing phase, but traditional deep neural networks will wrongly recognize these unknown samples as one of the known classes. Open set recognition (OSR) is a potential solution to this problem, where the open set classifier should have the flexibility to reject unknown samples while maintaining high classification accuracy on known classes. Probabilistic generative models, such as Variational Autoencoders (VAE) and Adversarial Autoencoders (AAE), are popular methods for detecting unknowns, but they cannot provide discriminative representations for known classification. In this paper, we propose a novel framework, called Conditional Probabilistic Generative Models (CPGM), for open set recognition. The core insight of our work is to add discriminative information to the probabilistic generative models, such that the proposed models can not only detect unknown samples but also classify known classes, by forcing different latent features to approximate conditional Gaussian distributions. We discuss many model variants and provide comprehensive experiments to study their characteristics. Experimental results on multiple benchmark datasets reveal that the proposed method significantly outperforms the baselines and achieves new state-of-the-art performance.
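A minimal sketch of the discriminative constraint and the resulting rejection rule, assuming per-class latent means mus and a shared isotropic sigma (tau is an illustrative threshold):

import numpy as np

def conditional_gaussian_loss(z, labels, mus, sigma=1.0):
    # Encourage latent codes z (B, D) to match their class-conditional
    # Gaussian N(mu_y, sigma^2 I); mus: (K, D) class means.
    diff = z - mus[labels]
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1)) / sigma ** 2

def is_unknown(z, mus, sigma=1.0, tau=10.0):
    # At test time, a sample far from every class mean (low likelihood
    # under all conditional Gaussians) is rejected as unknown.
    d2 = ((z[None, :] - mus) ** 2).sum(axis=1)
    return d2.min() / (2 * sigma ** 2) > tau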
31. Facial Expression Retargeting from Human to Avatar Made Easy [PDF] Back to Contents
Juyong Zhang, Keyu Chen, Jianmin Zheng
Abstract: Facial expression retargeting from humans to virtual characters is a useful technique in computer graphics and animation. Traditional methods use markers or blendshapes to construct a mapping between the human and avatar faces. However, these approaches require a tedious 3D modeling process, and the performance relies on the modelers' experience. In this paper, we propose a brand-new solution to this cross-domain expression transfer problem via nonlinear expression embedding and expression domain translation. We first build low-dimensional latent spaces for the human and avatar facial expressions with a variational autoencoder. Then we construct correspondences between the two latent spaces guided by geometric and perceptual constraints. Specifically, we design geometric correspondences to reflect geometric matching, and utilize a triplet data structure to express users' perceptual preference for avatar expressions. A user-friendly method is proposed to automatically generate triplets, allowing users to easily and efficiently annotate the correspondences. Using both geometric and perceptual correspondences, we train a network for expression domain translation from human to avatar. Extensive experimental results and user studies demonstrate that even nonprofessional users can apply our method to generate high-quality facial expression retargeting results with less time and effort.
32. Local Temperature Scaling for Probability Calibration [PDF] 返回目录
Zhipeng Ding, Xu Han, Peirong Liu, Marc Niethammer
Abstract: For semantic segmentation, label probabilities are often uncalibrated as they are typically only the by-product of a segmentation task. Intersection over Union (IoU) and Dice score are often used as criteria for segmentation success, while metrics related to label probabilities are rarely explored. On the other hand, probability calibration approaches have been studied, which aim at matching probability outputs with experimentally observed errors, but they mainly focus on classification tasks, not on semantic segmentation. Thus, we propose a learning-based calibration method that focuses on multi-label semantic segmentation. Specifically, we adopt a tree-like convolution neural network to predict local temperature values for probability calibration. One advantage of our approach is that it does not change prediction accuracy, hence allowing for calibration as a post-processing step. Experiments on the COCO and LPBA40 datasets demonstrate improved calibration performance over different metrics. We also demonstrate the performance of our method for multi-atlas brain segmentation from magnetic resonance images.
33. Inter-Image Communication for Weakly Supervised Localization [PDF] 返回目录
Xiaolin Zhang, Yunchao Wei, Yi Yang
Abstract: Weakly supervised localization aims at finding target object regions using only image-level supervision. However, localization maps extracted from classification networks are often not accurate due to the lack of fine pixel-level supervision. In this paper, we propose to leverage pixel-level similarities across different objects for learning more accurate object locations in a complementary way. Particularly, two kinds of constraints are proposed to prompt the consistency of object features within the same categories. The first constraint is to learn the stochastic feature consistency among discriminative pixels that are randomly sampled from different images within a batch. The discriminative information embedded in one image can be leveraged to benefit its counterpart with inter-image communication. The second constraint is to learn the global consistency of object features throughout the entire dataset. We learn a feature center for each category and realize the global feature consistency by forcing the object features to approach class-specific centers. The global centers are actively updated with the training process. The two constraints can benefit each other to learn consistent pixel-level features within the same categories, and finally improve the quality of localization maps. We conduct extensive experiments on two popular benchmarks, i.e., ILSVRC and CUB-200-2011. Our method achieves the Top-1 localization error rate of 45.17% on the ILSVRC validation set, surpassing the current state-of-the-art method by a large margin. The code is available at this https URL.
34. Learning to Caricature via Semantic Shape Transform [PDF] 返回目录
Wenqing Chu, Wei-Chih Hung, Yi-Hsuan Tsai, Yu-Ting Chang, Yijun Li, Deng Cai, Ming-Hsuan Yang
Abstract: Caricature is an artistic drawing created to abstract or exaggerate facial features of a person. Rendering visually pleasing caricatures is a difficult task that requires professional skills, and thus it is of great interest to design a method to automatically generate such drawings. To deal with large shape changes, we propose an algorithm based on a semantic shape transform to produce diverse and plausible shape exaggerations. Specifically, we predict pixel-wise semantic correspondences and perform image warping on the input photo to achieve dense shape transformation. We show that the proposed framework is able to render visually pleasing shape exaggerations while maintaining their facial structures. In addition, our model allows users to manipulate the shape via the semantic map. We demonstrate the effectiveness of our approach on a large photograph-caricature benchmark dataset with comparisons to the state-of-the-art methods.
35. BiHand: Recovering Hand Mesh with Multi-stage Bisected Hourglass Networks [PDF] 返回目录
Lixin Yang, Jiasen Li, Wenqiang Xu, Yiqun Diao, Cewu Lu
Abstract: 3D hand estimation has been a long-standing research topic in computer vision. A recent trend aims not only to estimate the 3D hand joint locations but also to recover the mesh model. However, achieving those goals from a single RGB image remains challenging. In this paper, we introduce an end-to-end learnable model, BiHand, which consists of three cascaded stages, namely the 2D seeding stage, 3D lifting stage, and mesh generation stage. At the output of BiHand, the full hand mesh will be recovered using the joint rotations and shape parameters predicted by the network. Inside each stage, BiHand adopts a novel bisecting design which allows the networks to encapsulate two closely related pieces of information (e.g., 2D keypoints and silhouette in the 2D seeding stage, 3D joints and depth map in the 3D lifting stage, and joint rotations and shape parameters in the mesh generation stage) in a single forward pass. As the information represents different geometry or structure details, bisecting the data flow can facilitate optimization and increase robustness. For quantitative evaluation, we conduct experiments on two public benchmarks, namely the Rendered Hand Dataset (RHD) and the Stereo Hand Pose Tracking Benchmark (STB). Extensive experiments show that our model can achieve superior accuracy in comparison with state-of-the-art methods, and can produce appealing 3D hand meshes under several severe conditions.
36. Select Good Regions for Deblurring based on Convolutional Neural Networks [PDF] 返回目录
Hang Yang, Xiaotian Wu, Xinglong Sun
Abstract: The goal of blind image deblurring is to recover a sharp image from a single blurred input with an unknown blur kernel. Most image deblurring approaches focus on developing image priors; however, not enough attention has been paid to the influence of image details and structures on the blur kernel estimation. What image structure is useful, and how should a good deblurring region be chosen? In this work, we propose a deep-neural-network-based method for selecting good regions from which to estimate the blur kernel. First, we construct labeled image patches and train a deep neural network; the learned model is then applied to determine which region of the image is most suitable for deblurring. Experimental results illustrate that the proposed approach is effective and is able to select good regions for image deblurring.
37. Online Graph Completion: Multivariate Signal Recovery in Computer Vision [PDF] 返回目录
Won Hwa Kim, Mona Jalal, Seongjae Hwang, Sterling C. Johnson, Vikas Singh
Abstract: The adoption of "human-in-the-loop" paradigms in computer vision and machine learning is leading to various applications where the actual data acquisition (e.g., human supervision) and the underlying inference algorithms are closely intertwined. While classical work in active learning provides effective solutions when the learning module involves classification and regression tasks, many practical issues such as partially observed measurements, financial constraints and even additional distributional or structural aspects of the data typically fall outside the scope of this treatment. For instance, with sequential acquisition of partial measurements of data that manifest as a matrix (or tensor), novel strategies for completion (or collaborative filtering) of the remaining entries have only been studied recently. Motivated by vision problems where we seek to annotate a large dataset of images via a crowdsourced platform or alternatively, complement results from a state-of-the-art object detector using human feedback, we study the "completion" problem defined on graphs, where requests for additional measurements must be made sequentially. We design the optimization model in the Fourier domain of the graph describing how ideas based on adaptive submodularity provide algorithms that work well in practice. On a large set of images collected from Imgur, we see promising results on images that are otherwise difficult to categorize. We also show applications to an experimental design problem in neuroimaging.
38. Dynamic Object Removal and Spatio-Temporal RGB-D Inpainting via Geometry-Aware Adversarial Learning [PDF] 返回目录
Borna Bešić, Abhinav Valada
Abstract: Dynamic objects have a significant impact on the robot's perception of the environment which degrades the performance of essential tasks such as localization and mapping. In this work, we address this problem by synthesizing plausible color, texture and geometry in regions occluded by dynamic objects. We propose the novel geometry-aware DynaFill architecture that follows a coarse-to-fine topology and incorporates our gated recurrent feedback mechanism to adaptively fuse information from previous timesteps. We optimize our architecture using adversarial training to synthesize fine realistic textures which enables it to hallucinate color and depth structure in occluded regions online in a spatially and temporally coherent manner, without relying on future frame information. Casting our inpainting problem as an image-to-image translation task, our model also corrects regions correlated with the presence of dynamic objects in the scene, such as shadows or reflections. We introduce a large-scale hyperrealistic dataset with RGB-D images, semantic segmentation labels, camera poses as well as ground truth RGB-D information of occluded regions. Extensive quantitative and qualitative evaluations show that our approach achieves state-of-the-art performance, even in challenging weather conditions. Furthermore, we present results for retrieval-based visual localization with the synthesized images that demonstrate the utility of our approach.
39. Audio- and Gaze-driven Facial Animation of Codec Avatars [PDF] 返回目录
Alexander Richard, Colin Lea, Shugao Ma, Juergen Gall, Fernando de la Torre, Yaser Sheikh
Abstract: Codec Avatars are a recent class of learned, photorealistic face models that accurately represent the geometry and texture of a person in 3D (i.e., for virtual reality), and are almost indistinguishable from video. In this paper we describe the first approach to animate these parametric models in real-time which could be deployed on commodity virtual reality hardware using audio and/or eye tracking. Our goal is to display expressive conversations between individuals that exhibit important social signals such as laughter and excitement solely from latent cues in our lossy input signals. To this end we collected over 5 hours of high frame rate 3D face scans across three participants including traditional neutral speech as well as expressive and conversational speech. We investigate a multimodal fusion approach that dynamically identifies which sensor encoding should animate which parts of the face at any time. See the supplemental video which demonstrates our ability to generate full face motion far beyond the typically neutral lip articulations seen in competing work: this https URL
40. VI-Net: View-Invariant Quality of Human Movement Assessment [PDF] 返回目录
Faegheh Sardari, Adeline Paiement, Sion Hannuna, Majid Mirmehdi
Abstract: We propose a view-invariant method towards the assessment of the quality of human movements which does not rely on skeleton data. Our end-to-end convolutional neural network consists of two stages, where at first a view-invariant trajectory descriptor for each body joint is generated from RGB images, and then the collection of trajectories for all joints are processed by an adapted, pre-trained 2D CNN (e.g. VGG-19 or ResNeXt-50) to learn the relationship amongst the different body parts and deliver a score for the movement quality. We release the only publicly-available, multi-view, non-skeleton, non-mocap, rehabilitation movement dataset (QMAR), and provide results for both cross-subject and cross-view scenarios on this dataset. We show that VI-Net achieves average rank correlation of 0.66 on cross-subject and 0.65 on unseen views when trained on only two views. We also evaluate the proposed method on the single-view rehabilitation dataset KIMORE and obtain 0.66 rank correlation against a baseline of 0.62.
41. Retrieval Guided Unsupervised Multi-domain Image-to-Image Translation [PDF] 返回目录
Raul Gomez, Yahui Liu, Marco De Nadai, Dimosthenis Karatzas, Bruno Lepri, Nicu Sebe
Abstract: Image to image translation aims to learn a mapping that transforms an image from one visual domain to another. Recent works assume that image descriptors can be disentangled into a domain-invariant content representation and a domain-specific style representation. Thus, translation models seek to preserve the content of source images while changing the style to a target visual domain. However, synthesizing new images is extremely challenging especially in multi-domain translations, as the network has to compose content and style to generate reliable and diverse images in multiple domains. In this paper we propose the use of an image retrieval system to assist the image-to-image translation task. First, we train an image-to-image translation model to map images to multiple domains. Then, we train an image retrieval model using real and generated images to find images similar to a query one in content but in a different domain. Finally, we exploit the image retrieval system to fine-tune the image-to-image translation model and generate higher quality images. Our experiments show the effectiveness of the proposed solution and highlight the contribution of the retrieval network, which can benefit from additional unlabeled data and help image-to-image translation models in the presence of scarce data.
42. Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene [PDF] 返回目录
Xinke Li, Chongshou Li, Zekun Tong, Andrew Lim, Junsong Yuan, Yuwei Wu, Jing Tang, Raymond Huang
Abstract: Learning on 3D scene-based point clouds has received extensive attention owing to its promising applications in many fields, and well-annotated and multisource datasets can catalyze the development of those data-driven approaches. To facilitate the research of this area, we present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks and also an effective learning framework for its hierarchical segmentation task. The dataset was generated via the photogrammetric processing on unmanned aerial vehicle (UAV) images of the National University of Singapore (NUS) campus, and has been point-wise annotated with both hierarchical and instance-based labels. Based on it, we formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies. To solve this problem, a two-stage method including multi-task (MT) learning and hierarchical ensemble (HE) with consistency consideration is proposed. Experimental results demonstrate the superiority of the proposed method and the potential advantages of our hierarchical annotations. In addition, we benchmark results of semantic and instance segmentation, which are accessible online at https://3d.dataset.site with the dataset and all source codes.
43. Image segmentation via Cellular Automata [PDF] 返回目录
Mark Sandler, Andrey Zhmoginov, Liangcheng Luo, Alexander Mordvintsev, Ettore Randazzo, Blaise Agúera y Arcas
Abstract: In this paper, we propose a new approach for building cellular automata to solve real-world segmentation problems. We design and train a cellular automaton that can successfully segment high-resolution images. We consider a colony that densely inhabits the pixel grid, and all cells are governed by a randomized update that uses the current state, the color, and the state of the $3\times 3$ neighborhood. The space of possible rules is defined by a small neural network. The update rule is applied repeatedly in parallel to a large random subset of cells and after convergence is used to produce segmentation masks that are then back-propagated to learn the optimal update rules using standard gradient descent methods. We demonstrate that such models can be learned efficiently with only limited trajectory length and that they show remarkable ability to organize the information to produce a globally consistent segmentation result, using only local information exchange. From a practical perspective, our approach allows us to build very efficient models -- our smallest automata use less than 10,000 parameters to solve complex segmentation tasks.
44. Little Motion, Big Results: Using Motion Magnification to Reveal Subtle Tremors in Infants [PDF] 返回目录
Girik Malik, Ish K. Gulati
Abstract: Detecting tremors is challenging for both humans and machines. Infants exposed to opioids during pregnancy often show signs and symptoms of withdrawal after birth, which are easy to miss with the human eye. The constellation of clinical features, termed Neonatal Abstinence Syndrome (NAS), includes tremors, seizures, irritability, etc. The current standard of care uses the Finnegan Neonatal Abstinence Syndrome Scoring System (FNASS), based on subjective evaluations. Monitoring with FNASS requires highly skilled nursing staff, making continuous monitoring difficult. In this paper we propose an automated tremor detection system using amplified motion signals. We demonstrate its applicability on bedside video of an infant exhibiting signs of NAS. Further, we test different modes of deep convolutional network based motion magnification, and identify that dynamic mode works best in the clinical setting, being invariant to common orientational changes. We propose a strategy for discharge and follow-up for NAS patients, using motion magnification to supplement the existing protocols. Overall our study suggests methods for bridging the gap in current practices, training and resource utilization.
45. PX-NET: Simple, Efficient Pixel-Wise Training of Photometric Stereo Networks [PDF] 返回目录
Fotios Logothetis, Ignas Budvytis, Roberto Mecca, Roberto Cipolla
Abstract: Retrieving accurate 3D reconstructions of objects from the way they reflect light is a very challenging task in computer vision. Despite more than four decades since the definition of the Photometric Stereo problem, most of the literature has had limited success when global illumination effects such as cast shadows, self-reflections and ambient light come into play, especially for specular surfaces. Recent approaches have leveraged the power of deep learning in conjunction with computer graphics in order to cope with the need of a vast number of training data in order to invert the image irradiance equation and retrieve the geometry of the object. However, rendering global illumination effects is a slow process which can limit the amount of training data that can be generated. In this work we propose a novel pixel-wise training procedure for normal prediction by replacing the training data of globally rendered images with independent per-pixel renderings. We show that robustness to global physical effects can be achieved via data-augmentation which greatly simplifies and speeds up the data creation procedure. Our network, PX-NET, achieves the state-of-the-art performance on synthetic datasets, as well as the DiLiGenT real dataset.
46. Renal Cell Carcinoma Detection and Subtyping with Minimal Point-Based Annotation in Whole-Slide Images [PDF] 返回目录
Zeyu Gao, Pargorn Puttapirat, Jiangbo Shi, Chen Li
Abstract: Obtaining a large amount of labeled data in medical imaging is laborious and time-consuming, especially for histopathology. However, it is much easier and cheaper to get unlabeled data from whole-slide images (WSIs). Semi-supervised learning (SSL) is an effective way to utilize unlabeled data and alleviate the need for labeled data. For this reason, we propose a framework that employs an SSL method to accurately detect cancerous regions with a novel annotation method called Minimal Point-Based annotation, and then utilizes the predicted results with an innovative hybrid loss to train a classification model for subtyping. The annotator only needs to mark a few points in each WSI and label them as cancerous or not. Experiments on three significant subtypes of renal cell carcinoma (RCC) proved that the performance of the classifier trained with the Min-Point annotated dataset is comparable to that of a classifier trained with the segmentation-annotated dataset for cancer region detection. The subtyping model also outperforms a model trained with only diagnostic labels by 12% in terms of F1-score when testing on WSIs.
47. Learning to Learn from Mistakes: Robust Optimization for Adversarial Noise [PDF] 返回目录
Alex Serban, Erik Poll, Joost Visser
Abstract: Sensitivity to adversarial noise hinders deployment of machine learning algorithms in security-critical applications. Although many adversarial defenses have been proposed, robustness to adversarial noise remains an open problem. The most compelling defense, adversarial training, requires a substantial increase in processing time and it has been shown to overfit on the training data. In this paper, we aim to overcome these limitations by training robust models in low data regimes and transfer adversarial knowledge between different models. We train a meta-optimizer which learns to robustly optimize a model using adversarial examples and is able to transfer the knowledge learned to new models, without the need to generate new adversarial examples. Experimental results show the meta-optimizer is consistent across different architectures and data sets, suggesting it is possible to automatically patch adversarial vulnerabilities.
48. Compression of Deep Learning Models for Text: A Survey [PDF] 返回目录
Manish Gupta, Puneet Agrawal
Abstract: In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanks to deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTMs) networks, and Transformer based models like Bidirectional Encoder Representations from Transformers (BERT). But these models are humongous in size. On the other hand, real world applications demand small model size, low response times and low computational power wattage. In this survey, we discuss six different types of methods (Pruning, Quantization, Knowledge Distillation, Parameter Sharing, Tensor Decomposition, and Linear Transformer based methods) for compression of such models to enable their deployment in real industry NLP projects. Given the critical need of building applications with efficient and small models, and the large amount of recently published work in this area, we believe that this survey organizes the plethora of work done by the 'deep learning for NLP' community in the past few years and presents it as a coherent story.
49. Large-Scale Analysis of Iliopsoas Muscle Volumes in the UK Biobank [PDF] 返回目录
Julie Fitzpatrick, Nicolas Basty, Madeleine Cule, Yi Liu, Jimmy D. Bell, E. Louise Thomas, Brandon Whitcher
Abstract: Psoas muscle measurements are frequently used as markers of sarcopenia and predictors of health. Manually measured cross-sectional areas are most commonly used, but there is a lack of consistency regarding the position of the measurement, and manual annotations are not practical for large population studies. We have developed a fully automated method to measure iliopsoas muscle volume (comprised of the psoas and iliacus muscles) using a convolutional neural network. Magnetic resonance images were obtained from the UK Biobank for 5,000 male and female participants, balanced for age, gender and BMI. Ninety manual annotations were available for model training and validation. The model showed excellent performance against out-of-sample data (Dice score coefficient of 0.912 +/- 0.018). Iliopsoas muscle volumes were successfully measured in all 5,000 participants. Iliopsoas volume was greater in male compared with female subjects. There was a small but significant asymmetry between left and right iliopsoas muscle volumes. We also found that iliopsoas volume was significantly related to height, BMI and age, and that there was an acceleration in muscle volume decrease in men with age. Our method provides a robust technique for measuring iliopsoas muscle volume that can be applied to large cohorts.
50. An Inter- and Intra-Band Loss for Pansharpening Convolutional Neural Networks [PDF] 返回目录
Jiajun Cai, Bo Huang
Abstract: Pansharpening aims to fuse panchromatic and multispectral satellite images to generate images with both high spatial and high spectral resolution. Following the success of deep learning in the computer vision field, many researchers have proposed convolutional neural networks (CNNs) for the pansharpening task. These pansharpening networks focus on various distinctive CNN structures, and most are trained with an L2 loss between the fused images and simulated desired multispectral images. However, the L2 loss directly minimizes the per-band difference in spectral information and thus ignores inter-band relations during training. In this letter, we propose a novel inter- and intra-band (IIB) loss to overcome this drawback of the original L2 loss. The proposed IIB loss effectively preserves both inter- and intra-band relations and can be directly applied to different pansharpening CNNs.
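The abstract does not spell out the exact IIB formulation; the sketch below is one plausible reading, assuming the intra-band term is the usual per-band L2 and the inter-band term penalizes mismatches between band-difference maps of the fused and reference images (the weighting `alpha` is also an assumption).

```python
import numpy as np

def iib_loss(fused: np.ndarray, ref: np.ndarray, alpha: float = 0.5) -> float:
    """Hypothetical inter- and intra-band loss for (bands, H, W) images."""
    # Intra-band term: the usual L2 difference, computed band by band.
    intra = np.mean((fused - ref) ** 2)
    # Inter-band term: band-to-band difference maps couple adjacent bands,
    # so matching them preserves inter-band relations.
    fused_diff = fused[1:] - fused[:-1]          # shape (bands-1, H, W)
    ref_diff = ref[1:] - ref[:-1]
    inter = np.mean((fused_diff - ref_diff) ** 2)
    return float(alpha * intra + (1.0 - alpha) * inter)

fused = np.random.rand(4, 8, 8)   # toy 4-band fused image
ref = np.random.rand(4, 8, 8)     # toy reference multispectral image
print(iib_loss(fused, ref))
```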
51. A Longitudinal Method for Simultaneous Whole-Brain and Lesion Segmentation in Multiple Sclerosis [PDF] 返回目录
Stefano Cerri, Andrew Hoopes, Douglas N. Greve, Mark Mühlau, Koen Van Leemput
Abstract: In this paper, we propose a novel method for the segmentation of longitudinal brain MRI scans of patients suffering from Multiple Sclerosis. The method builds upon an existing cross-sectional method for simultaneous whole-brain and lesion segmentation, introducing subject-specific latent variables to encourage temporal consistency between longitudinal scans. It is very generally applicable, as it makes no prior assumptions about the scanner, the MRI protocol, or the number and timing of longitudinal follow-up scans. Preliminary experiments on three longitudinal datasets indicate that the proposed method produces more reliable segmentations and detects disease effects better than the cross-sectional method it is based upon.
52. FATNN: Fast and Accurate Ternary Neural Networks [PDF] 返回目录
Peng Chen, Bohan Zhuang, Chunhua Shen
Abstract: Ternary Neural Networks (TNNs) have received much attention because they are potentially orders of magnitude faster at inference, and more power efficient, than full-precision counterparts. However, 2 bits are required to encode the ternary representation even though only 3 quantization levels are used. As a result, conventional TNNs have memory consumption and speed similar to standard 2-bit models but worse representational capability. Moreover, there is still a significant accuracy gap between TNNs and full-precision networks, hampering their deployment in real applications. To tackle these two challenges, in this work we first show that, under some mild constraints, the computational complexity of the ternary inner product can be reduced by 2x. Second, to mitigate the performance gap, we elaborately design an implementation-dependent ternary quantization algorithm. The proposed framework is termed Fast and Accurate Ternary Neural Networks (FATNN). Experiments on image classification demonstrate that FATNN surpasses the state of the art by a significant margin in accuracy. More importantly, we evaluate speedups relative to various precisions on several platforms, which serves as a strong benchmark for further research.
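FATNN's quantizer is implementation-dependent and not detailed in the abstract; as background, a standard threshold-based ternary quantizer (the 0.7 threshold heuristic follows common ternary-weight-network practice and is an assumption here) looks like:

```python
import numpy as np

def ternarize(w: np.ndarray, delta_factor: float = 0.7):
    """Map weights to {-1, 0, +1} times a per-tensor scale."""
    delta = delta_factor * np.mean(np.abs(w))    # common threshold heuristic
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    nonzero = t != 0
    # Best least-squares scale for the surviving weights:
    scale = float(np.mean(np.abs(w[nonzero]))) if nonzero.any() else 0.0
    return t, scale

w = np.random.randn(64)
t, scale = ternarize(w)
print(np.unique(t), round(scale, 3))   # levels {-1, 0, 1} and the scale
```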
53. Self-supervised Light Field View Synthesis Using Cycle Consistency [PDF] 返回目录
Yang Chen, Martin Alain, Aljosa Smolic
Abstract: High angular resolution is advantageous for practical applications of light fields. In order to enhance the angular resolution of light fields, view synthesis methods can be used to generate dense intermediate views from sparse light field input. Most successful view synthesis methods are learning-based approaches that require a large amount of training data paired with ground truth. However, collecting such large datasets for light fields is challenging compared to natural images or videos. To tackle this problem, we propose a self-supervised light field view synthesis framework with cycle consistency. The proposed method transfers prior knowledge learned from high-quality natural video datasets to the light field view synthesis task, reducing the need for labeled light field data. A cycle consistency constraint is used to build a bidirectional mapping that enforces the generated views to be consistent with the input views. Derived from this key concept, two loss functions, cycle loss and reconstruction loss, are used to fine-tune the pre-trained model of a state-of-the-art video interpolation method. The proposed method is evaluated on various datasets to validate its robustness, and results show that it not only achieves competitive performance compared to supervised fine-tuning, but also outperforms state-of-the-art light field view synthesis methods, especially when generating multiple intermediate views. Moreover, our generic light field view synthesis framework can be applied to any pre-trained video interpolation model.
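As a sketch of how the two named objectives combine, assuming hypothetical synthesis networks `forward_synth` (sparse views to intermediate views) and `backward_synth` (the reverse mapping), the losses could be wired up as follows; the L1 metric and the weighting `lam` are assumptions of this sketch.

```python
import numpy as np

def l1(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean(np.abs(a - b)))

def synthesis_loss(inputs, target, forward_synth, backward_synth, lam=1.0):
    """Reconstruction loss plus cycle-consistency loss."""
    synthesized = forward_synth(inputs)
    recon = l1(synthesized, target)          # match a reference view
    recovered = backward_synth(synthesized)  # map back toward the inputs
    cycle = l1(recovered, inputs)            # inputs should be recovered
    return recon + lam * cycle

# Toy invertible stand-ins for the learned networks:
fwd = lambda x: 0.99 * x
bwd = lambda x: x / 0.99
views = np.random.rand(2, 16, 16)
print(synthesis_loss(views, views.mean(axis=0), fwd, bwd))
```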
54. End-to-End Rate-Distortion Optimization for Bi-Directional Learned Video Compression [PDF] 返回目录
M. Akin Yilmaz, A. Murat Tekalp
Abstract: Conventional video compression methods employ a linear transform and a block motion model, and the steps of motion estimation, mode and quantization parameter selection, and entropy coding are optimized individually due to the combinatorial nature of the end-to-end optimization problem. Learned video compression allows end-to-end rate-distortion optimized training of all nonlinear modules, the quantization parameter, and the entropy model simultaneously. While previous work on learned video compression considered training a sequential video codec by end-to-end optimization of a cost averaged over pairs of successive frames, it is well known in conventional video compression that hierarchical, bi-directional coding outperforms sequential compression. In this paper, we propose, for the first time, end-to-end optimization of a hierarchical, bi-directional motion-compensated learned codec by accumulating the cost function over fixed-size groups of pictures (GOP). Experimental results show that the rate-distortion performance of our proposed learned bi-directional GOP coder outperforms the state-of-the-art end-to-end optimized learned sequential compression, as expected.
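Schematically, accumulating the rate-distortion cost over a GOP rather than over frame pairs looks like the sketch below; the MSE distortion, the entropy-model bit estimates, and the Lagrange multiplier value are stand-in assumptions.

```python
import numpy as np

def gop_rd_loss(frames, coded, bits_per_frame, lmbda: float = 0.01) -> float:
    """Rate-distortion cost accumulated over one group of pictures (GOP)."""
    distortion = sum(float(np.mean((f - c) ** 2)) for f, c in zip(frames, coded))
    rate = float(sum(bits_per_frame))       # bits estimated by the entropy model
    return (distortion + lmbda * rate) / len(frames)   # average cost per frame

frames = [np.random.rand(8, 8) for _ in range(4)]            # toy GOP
coded = [f + 0.01 * np.random.randn(8, 8) for f in frames]   # toy reconstructions
bits = [512.0, 128.0, 64.0, 64.0]  # hierarchical coding spends fewer bits on B-frames
print(gop_rd_loss(frames, coded, bits))
```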
55. Learned Proximal Networks for Quantitative Susceptibility Mapping [PDF] 返回目录
Kuo-Wei Lai, Manisha Aggarwal, Peter van Zijl, Xu Li, Jeremias Sulam
Abstract: Quantitative Susceptibility Mapping (QSM) estimates tissue magnetic susceptibility distributions from Magnetic Resonance (MR) phase measurements by solving an ill-posed dipole inversion problem. Conventional single-orientation QSM methods usually employ regularization strategies to stabilize the inversion, but may suffer from streaking artifacts or over-smoothing. Multiple-orientation QSM, such as calculation of susceptibility through multiple orientation sampling (COSMOS), can give a well-conditioned inversion and an artifact-free solution but incurs high acquisition costs. On the other hand, Convolutional Neural Networks (CNNs) show great potential for medical image reconstruction, albeit often with limited interpretability. Here, we present a Learned Proximal Convolutional Neural Network (LP-CNN) for solving the ill-posed QSM dipole inversion problem in an iterative proximal gradient descent fashion. This approach combines the strengths of data-driven restoration priors with the clear interpretability of iterative solvers that can take into account the physical model of dipole convolution. During training, our LP-CNN learns an implicit regularizer via its proximal operator, enabling the decoupling between the forward operator and the data-driven parameters in the reconstruction algorithm. More importantly, this framework is believed to be the first deep learning QSM approach that can naturally handle an arbitrary number of phase input measurements without the need for any ad-hoc rotation or re-training. We demonstrate that the LP-CNN provides state-of-the-art reconstruction results compared to both traditional and deep learning methods while allowing for more flexibility in the reconstruction process.
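The iteration the LP-CNN unrolls is proximal gradient descent with the proximal operator replaced by a trained network; a schematic version follows, where the operator `A`, its adjoint `At`, and the `prox_net` callable are stand-ins (a toy soft-thresholding prox substitutes for the learned CNN, and a toy linear system stands in for dipole inversion).

```python
import numpy as np

def learned_pgd(y, A, At, prox_net, num_iters: int = 200, eta: float = 0.1):
    """Schematic learned proximal gradient descent for min_x 0.5*||A(x) - y||^2."""
    x = At(y)                               # crude initialization
    for _ in range(num_iters):
        grad = At(A(x) - y)                 # gradient of the data-fidelity term
        x = prox_net(x - eta * grad)        # learned proximal (regularization) step
    return x

# Toy linear inverse problem in place of dipole inversion:
rng = np.random.default_rng(0)
M = rng.standard_normal((32, 32)) / 8
A, At = (lambda x: M @ x), (lambda r: M.T @ r)
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - 1e-3, 0.0)  # stand-in prox
x_true = rng.standard_normal(32)
x_hat = learned_pgd(M @ x_true, A, At, prox)
print(float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)))  # relative error
```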
56. Automatic assembly of aero engine low pressure turbine shaft based on 3D vision measurement [PDF] 返回目录
Jiaxiang Wang, Kunyong Chen
Abstract: To address the low degree of automation in aero-engine turbine shaft assembly and the difficulty of non-contact high-precision measurement, this paper proposes a structured-light binocular measurement technique for key components of an aero-engine. Combined with 3D point cloud processing and an assembly-position matching algorithm, it achieves high-precision measurement of the shaft-hole assembly pose during turbine shaft docking. First, the screw-thread curve on the bolt surface is segmented based on PCA projection and edge point cloud clustering, and a Hough transform is used to fit a model of the three-dimensional thread curve. Then, a preprocessed two-dimensional convex hull is constructed to segment the key hole-location features, and the mounting surface and hole locations obtained by segmentation are fitted using the RANSAC method. Finally, geometric feature matching is used to establish an evaluation index for turbine shaft assembly and optimize the pose. The final measurement error of mounting-surface matching is less than 0.05 mm, and that of mounting-hole matching based on minimum ance optimization is less than 0.1 degree. The measurement algorithm was implemented on an automatic assembly test bed for a certain type of aero-engine low-pressure turbine rotor. Within the narrow installation space, the assembly steps of the turbine shaft, such as automatic alignment and docking of the shaft hole, automatic heating and temperature measurement of the installation seam, and automatic tightening of the two guns, are realized with guidance, real-time inspection, and evaluation of the assembly results.
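One step of this pipeline, robustly fitting the mounting surface as a plane, is classical RANSAC; a compact sketch follows, with the inlier threshold and iteration count as illustrative assumptions.

```python
import numpy as np

def ransac_plane(points: np.ndarray, threshold: float = 0.05,
                 iters: int = 500, seed: int = 0):
    """Fit a plane n.p + d = 0 to an (N, 3) point cloud, robust to outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_model = None
    for _ in range(iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                      # degenerate (near-collinear) sample
            continue
        n /= norm
        d = -float(n @ p0)
        inliers = np.abs(points @ n + d) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_model = inliers, (n, d)
    return best_model, best_inliers

# Noisy z = 0 plane with a few gross outliers:
pts = np.random.rand(200, 3); pts[:, 2] = 0.01 * np.random.randn(200)
pts[:10, 2] += 5.0
model, inliers = ransac_plane(pts)
print(inliers.sum(), "inliers of", len(pts))
```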