Contents
5. Visual Memorability for Robotic Interestingness Prediction via Unsupervised Online Learning [PDF] Abstract
7. Noise-Sampling Cross Entropy Loss: Improving Disparity Regression Via Cost Volume Aware Regularizer [PDF] Abstract
10. Evaluating Performance of an Adult Pornography Classifier for Child Sexual Abuse Detection [PDF] Abstract
11. Towards Better Graph Representation: Two-Branch Collaborative Graph Neural Networks for Multimodal Marketing Intention Detection [PDF] Abstract
13. A global method to identify trees inside and outside of forests with medium-resolution satellite imagery [PDF] Abstract
14. Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese [PDF] Abstract
19. Intracranial Hemorrhage Detection Using Neural Network Based Methods With Federated Learning [PDF] Abstract
20. Atom Search Optimization with Simulated Annealing -- a Hybrid Metaheuristic Approach for Feature Selection [PDF] Abstract
21. Deep Learning Based Vehicle Tracking System Using License Plate Detection And Recognition [PDF] Abstract
23. Supervision and Source Domain Impact on Representation Learning: A Histopathology Case Study [PDF] Abstract
24. Synthetic Image Augmentation for Damage Region Segmentation using Conditional GAN with Structure Edge [PDF] Abstract
26. A model-based Gait Recognition Method based on Gait Graph Convolutional Networks and Joints Relationship Pyramid Mapping [PDF] Abstract
30. DDD20 End-to-End Event Camera Driving Dataset: Fusing Frames and Events with Deep Learning for Improved Steering Prediction [PDF] Abstract
35. Feature Transformation Ensemble Model with Batch Spectral Regularization for Cross-Domain Few-Shot Classification [PDF] Abstract
42. Impact of multiple modalities on emotion recognition: investigation into 3d facial landmarks, action units, and physiological data [PDF] Abstract
47. FuCiTNet: Improving the generalization of deep learning networks by the fusion of learned class-inherent transformations [PDF] Abstract
53. FA-GANs: Facial Attractiveness Enhancement with Generative Adversarial Networks on Frontal Faces [PDF] Abstract
57. VPR-Bench: An Open-Source Visual Place Recognition Evaluation Framework with Quantifiable Viewpoint and Appearance Change [PDF] Abstract
63. A Deep Learning based Wearable Healthcare IoT Device for AI-enabled Hearing Assistance Automation [PDF] Abstract
67. Non-Linearities Improve OrigiNet based on Active Imaging for Micro Expression Recognition [PDF] Abstract
83. Building BROOK: A Multi-modal and Facial Video Database for Human-Vehicle Interaction Research [PDF] Abstract
88. Deep Learning and Bayesian Deep Learning Based Gender Prediction in Multi-Scale Brain Functional Connectivity [PDF] Abstract
89. Improving Robustness using Joint Attention Network For Detecting Retinal Degeneration From Optical Coherence Tomography Images [PDF] Abstract
93. Multi-level Feature Fusion-based CNN for Local Climate Zone Classification from Sentinel-2 Images: Benchmark Results on the So2Sat LCZ42 Dataset [PDF] Abstract
97. A Learning-from-noise Dilated Wide Activation Network for denoising Arterial Spin Labeling (ASL) Perfusion Images [PDF] Abstract
Abstracts
1. Joint Multi-Dimension Pruning [PDF] Back to contents
Zechun Liu, Xiangyu Zhang, Zhiqiang Shen, Zhe Li, Yichen Wei, Kwang-Ting Cheng, Jian Sun
Abstract: We present joint multi-dimension pruning (named JointPruning), a new perspective on pruning a network along three crucial dimensions simultaneously: spatial, depth, and channel. The joint strategy enables searching for a better configuration than previous studies that focused solely on an individual dimension, as our method is optimized collaboratively across the three dimensions in a single end-to-end training run. Moreover, each dimension we consider can reach better performance by colluding with the other two. Our method is realized by an adapted stochastic gradient estimation. Extensive experiments on the large-scale ImageNet dataset across a variety of network architectures (MobileNet V1&V2 and ResNet) demonstrate the effectiveness of our proposed method. For instance, we achieve significant margins of 2.5% and 2.6% improvement over the state-of-the-art approach on the already compact MobileNet V1&V2 under an extremely large compression ratio.
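The abstract does not spell out the adapted stochastic gradient estimation; the following is a minimal sketch, under our own assumptions, of how continuous pruning ratios for the three dimensions could be optimized jointly with a score-function (REINFORCE-style) estimator. The reward is a placeholder for validation accuracy minus a FLOPs penalty; none of this is the authors' code.

```python
import torch

# Hypothetical sketch, not the paper's implementation.
ratios = torch.full((3,), 0.7, requires_grad=True)  # spatial, depth, channel
sigma = 0.05                                        # assumed sampling noise

def reward(cfg):
    # Placeholder: validation accuracy minus a FLOPs penalty would go here.
    return -((cfg - 0.5) ** 2).sum()

opt = torch.optim.SGD([ratios], lr=0.1)
for _ in range(200):
    # Sample a concrete pruning configuration around the current ratios.
    cfg = (ratios + sigma * torch.randn(3)).clamp(0.1, 1.0).detach()
    # Score-function estimator: raise the log-density of high-reward samples.
    logp = -((cfg - ratios) ** 2).sum() / (2 * sigma ** 2)
    loss = -reward(cfg) * logp
    opt.zero_grad()
    loss.backward()
    opt.step()
```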
2. Portrait Shadow Manipulation [PDF] Back to contents
Xuaner Cecilia Zhang, Jonathan T. Barron, Yun-Ta Tsai, Rohit Pandey, Xiuming Zhang, Ren Ng, David E. Jacobs
Abstract: Casually-taken portrait photographs often suffer from unflattering lighting and shadowing because of suboptimal conditions in the environment. Aesthetic qualities such as the position and softness of shadows and the lighting ratio between the bright and dark parts of the face are frequently determined by the constraints of the environment rather than by the photographer. Professionals address this issue by adding light shaping tools such as scrims, bounce cards, and flashes. In this paper, we present a computational approach that gives casual photographers some of this control, thereby allowing poorly-lit portraits to be relit post-capture in a realistic and easily-controllable way. Our approach relies on a pair of neural networks---one to remove foreign shadows cast by external objects, and another to soften facial shadows cast by the features of the subject and to add a synthetic fill light to improve the lighting ratio. To train our first network we construct a dataset of real-world portraits wherein synthetic foreign shadows are rendered onto the face, and we show that our network learns to remove those unwanted shadows. To train our second network we use a dataset of Light Stage scans of human subjects to construct input/output pairs of input images harshly lit by a small light source, and variably softened and fill-lit output images of each face. We propose a way to explicitly encode facial symmetry and show that our dataset and training procedure enable the model to generalize to images taken in the wild. Together, these networks enable the realistic and aesthetically pleasing enhancement of shadows and lights in real-world portrait images.
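The abstract does not describe how facial symmetry is encoded; one plausible scheme, shown purely as a hypothetical sketch, is to channel-stack each portrait with its horizontal mirror so that symmetric facial regions are aligned for the network:

```python
import torch

def symmetry_input(img: torch.Tensor) -> torch.Tensor:
    # img: (B, 3, H, W) portraits -> (B, 6, H, W) input in which each pixel
    # (x, y) is stacked with its mirrored counterpart (W - 1 - x, y).
    # An assumed encoding, not necessarily the paper's.
    return torch.cat([img, torch.flip(img, dims=[-1])], dim=1)
```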
3. Generative Tweening: Long-term Inbetweening of 3D Human Motions [PDF] Back to contents
Yi Zhou, Jingwan Lu, Connelly Barnes, Jimei Yang, Sitao Xiang, Hao Li
Abstract: The ability to generate complex and realistic human body animations at scale, while following specific artistic constraints, has been a fundamental goal for the game and animation industry for decades. Popular techniques include key-framing, physics-based simulation, and database methods via motion graphs. Recently, motion generators based on deep learning have been introduced. Although these learning models can automatically generate highly intricate stylized motions of arbitrary length, they still lack user control. To this end, we introduce the problem of long-term inbetweening, which involves automatically synthesizing complex motions over a long time interval given very sparse keyframes by users. We identify a number of challenges related to this problem, including maintaining biomechanical and keyframe constraints, preserving natural motions, and designing the entire motion sequence holistically while considering all constraints. We introduce a biomechanically constrained generative adversarial network that performs long-term inbetweening of human motions, conditioned on keyframe constraints. This network uses a novel two-stage approach where it first predicts local motion in the form of joint angles, and then predicts global motion, i.e. the global path that the character follows. Since there are typically a number of possible motions that could satisfy the given user constraints, we also enable our network to generate a variety of outputs with a scheme that we call Motion DNA. This approach allows the user to manipulate and influence the output content by feeding seed motions (DNA) to the network. Trained with 79 classes of captured motion data, our network performs robustly on a variety of highly complex motion styles.
4. MMFashion: An Open-Source Toolbox for Visual Fashion Analysis [PDF] Back to contents
Xin Liu, Jiancheng Li, Jiaqi Wang, Ziwei Liu
Abstract: We present MMFashion, a comprehensive, flexible and user-friendly open-source visual fashion analysis toolbox based on PyTorch. This toolbox supports a wide spectrum of fashion analysis tasks, including Fashion Attribute Prediction, Fashion Recognition and Retrieval, Fashion Landmark Detection, Fashion Parsing and Segmentation, and Fashion Compatibility and Recommendation. It covers almost all the mainstream tasks in the fashion analysis community. MMFashion has several appealing properties. Firstly, MMFashion follows the principle of modular design. The framework is decomposed into different components so that it is easily extensible for diverse customized modules. In addition, detailed documentation, demo scripts and off-the-shelf models are available, which ease the burden on layman users wanting to leverage the recent advances in deep learning-based fashion analysis. Our proposed MMFashion is currently the most complete platform for visual fashion analysis in the deep learning era, with more functionalities to be added. This toolbox and the benchmark could serve the flourishing research community by providing a flexible toolkit to deploy existing models and develop new ideas and approaches. We welcome all contributions to this still-growing effort towards open science: this https URL.
5. Visual Memorability for Robotic Interestingness Prediction via Unsupervised Online Learning [PDF] Back to contents
Chen Wang, Wenshan Wang, Yuheng Qiu, Yafei Hu, Sebastian Scherer
Abstract: In this paper, we aim to solve the problem of interesting scene prediction for mobile robots. This area is currently under-explored but is crucial for many practical applications such as autonomous exploration and decision making. First, we expect a robot to detect novel and interesting scenes in unknown environments and lose interest over time after repeatedly observing similar objects. Second, we expect the robots to learn from unbalanced data in a short time, as the robots normally only know the uninteresting scenes before they are deployed. Inspired by those industrial demands, we first propose a novel translation-invariant visual memory for recalling and identifying interesting scenes, then design a three-stage architecture of long-term, short-term, and online learning for human-like experience, environmental knowledge, and online adaptation, respectively. It is demonstrated that our approach is able to learn online and find interesting scenes for practical exploration tasks. It also achieves a much higher accuracy than the state-of-the-art algorithm on very challenging robotic interestingness prediction datasets.
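As a rough illustration of the recall-and-identify idea (a simplification for this digest, not the paper's translation-invariant memory), a feature memory can score a scene as interesting when it is far from everything already stored, so repeated scenes lose interest as they are written back:

```python
import torch
import torch.nn.functional as F

class VisualMemory:
    def __init__(self, capacity: int = 256):
        self.mem = None            # (N, dim) stored scene features
        self.capacity = capacity

    def interestingness(self, feat: torch.Tensor) -> float:
        # Novelty = one minus the best cosine match against memory.
        if self.mem is None:
            return 1.0
        sim = F.cosine_similarity(self.mem, feat.unsqueeze(0), dim=1).max()
        return float(1.0 - sim)

    def write(self, feat: torch.Tensor) -> None:
        new = feat.unsqueeze(0)
        self.mem = new if self.mem is None else torch.cat([self.mem, new])
        self.mem = self.mem[-self.capacity:]   # online: drop the oldest
```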
6. Hierarchical and Efficient Learning for Person Re-Identification [PDF] Back to contents
Jiangning Zhang, Liang Liu, Chao Xu, Yong Liu
Abstract: Recent works in the person re-identification task mainly focus on model accuracy while ignoring factors related to efficiency, e.g. model size and latency, which are critical for practical application. In this paper, we propose a novel Hierarchical and Efficient Network (HENet) that learns an ensemble of hierarchical global, partial, and recovery features under the supervision of multiple loss combinations. To further improve robustness against irregular occlusion, we propose a new dataset augmentation approach, dubbed Random Polygon Erasing (RPE), which erases a random irregular area of the input image to imitate a missing body part. We also propose an Efficiency Score (ES) metric to evaluate model efficiency. Extensive experiments on the Market1501, DukeMTMC-ReID, and CUHK03 datasets show the efficiency and superiority of our approach compared with epoch-making methods.
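The abstract gives no implementation details for Random Polygon Erasing; a sketch under assumed hyper-parameters (probability, vertex count, and polygon size are guesses) could look like this:

```python
import random
import numpy as np
from PIL import Image, ImageDraw

def random_polygon_erase(img: Image.Image, p: float = 0.5,
                         vertices=(3, 8), scale: float = 0.3) -> Image.Image:
    # With probability p, fill a random irregular polygon with a random
    # color to imitate a missing body part (hyper-parameters assumed).
    if random.random() > p:
        return img
    w, h = img.size
    cx, cy = random.uniform(0, w), random.uniform(0, h)
    r = scale * min(w, h)
    angles = sorted(random.uniform(0, 2 * np.pi)
                    for _ in range(random.randint(*vertices)))
    pts = [(cx + random.uniform(0.2, 1.0) * r * np.cos(a),
            cy + random.uniform(0.2, 1.0) * r * np.sin(a)) for a in angles]
    fill = tuple(random.randint(0, 255) for _ in range(3))
    ImageDraw.Draw(img).polygon(pts, fill=fill)
    return img
```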
7. Noise-Sampling Cross Entropy Loss: Improving Disparity Regression Via Cost Volume Aware Regularizer [PDF] Back to contents
Yang Chen, Zongqing Lu, Xuechen Zhang, Lei Chen, Qingmin Liao
Abstract: Recent end-to-end deep neural networks for disparity regression have achieved state-of-the-art performance. However, many well-acknowledged specific properties of disparity estimation are omitted in these deep learning algorithms. In particular, the matching cost volume, one of the most important procedures, is treated as a normal intermediate feature for the subsequent softargmin regression, lacking the explicit constraints found in traditional algorithms. In this paper, inspired by the previous canonical definition of the cost volume, we propose a noise-sampling cross entropy loss function to regularize the cost volume produced by deep neural networks to be unimodal and coherent. Extensive experiments validate that the proposed noise-sampling cross entropy loss can not only help neural networks learn a more informative cost volume, but also lead to better stereo matching performance compared with several representative algorithms.
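For context, such networks regress disparity with a soft-argmin over the cost volume; the sketch below (our reading of the abstract, with an assumed Gaussian-shaped target) pairs that regression with a cross entropy that pushes the cost volume toward a unimodal distribution around the ground truth:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume):
    # cost_volume: (B, D, H, W); standard soft-argmin disparity regression.
    prob = F.softmax(-cost_volume, dim=1)
    disp = torch.arange(cost_volume.size(1), device=cost_volume.device)
    return (prob * disp.view(1, -1, 1, 1)).sum(dim=1)

def unimodal_cross_entropy(cost_volume, gt_disp, sigma=1.0):
    # Cross entropy against a unimodal Gaussian centred on the ground-truth
    # disparity (sigma assumed); not necessarily the paper's exact loss.
    D = cost_volume.size(1)
    disp = torch.arange(D, device=cost_volume.device).view(1, -1, 1, 1)
    target = F.softmax(-(disp - gt_disp.unsqueeze(1)) ** 2 / (2 * sigma ** 2), dim=1)
    log_prob = F.log_softmax(-cost_volume, dim=1)
    return -(target * log_prob).sum(dim=1).mean()
```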
8. Classification of Spam Emails through Hierarchical Clustering and Supervised Learning [PDF] Back to contents
Francisco Jáñez-Martino, Eduardo Fidalgo, Santiago González-Martínez, Javier Velasco-Mata
Abstract: Spammers take advantage of email popularity to send indiscriminately unsolicited emails. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them through new strategies, like word obfuscation or image-based spam. For the first time in the literature, we propose to classify spam email into categories to improve the handling of already detected spam emails, instead of just using a binary model. First, we applied a hierarchical clustering algorithm to create SPEMC-$11$K (SPam EMail Classification), the first multi-class dataset, which contains three types of spam emails: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-$11$K to evaluate the combination of TF-IDF and BOW encodings with Naïve Bayes, Decision Trees and SVM classifiers. Finally, for the task of multi-class spam classification we recommend the use of (i) TF-IDF combined with SVM for the best micro F1 score performance, $95.39\%$, and (ii) TF-IDF along with NB for the fastest spam classification, analyzing an email in $2.13$ ms.
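The recommended TF-IDF + SVM configuration maps directly onto a standard scikit-learn pipeline; the toy emails and category labels below are placeholders, not samples from SPEMC-11K:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus with the paper's three spam categories.
emails = ["cheap meds shipped discreetly",
          "your account prize is waiting",
          "hot singles in your area"]
labels = ["health_and_technology", "personal_scam", "sexual_content"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(emails, labels)
print(clf.predict(["discount pills delivered today"]))
```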
9. A Statistical Story of Visual Illusions [PDF] Back to contents
Elad Hirsch, Ayellet Tal
Abstract: This paper explores the wholly empirical paradigm of visual illusions, which was introduced two decades ago in neuroscience. This data-driven approach attempts to explain visual illusions by the likelihood of patches in real-world images. Neither the data nor the tools existed at the time to extensively support this paradigm. In the era of big data and deep learning, at last, it becomes possible. This paper introduces a tool that computes the likelihood of patches, given a large dataset to learn from. Given this tool, we present an approach that manages to support the paradigm and explain visual illusions in a unified manner. Furthermore, we show how to generate (or enhance) visual illusions in natural images, by applying the same principles (and tool) in reverse.
10. Evaluating Performance of an Adult Pornography Classifier for Child Sexual Abuse Detection [PDF] Back to contents
Mhd Wesam Al-Nabki, Eduardo Fidalgo, Roberto A. Vasco-Carofilis, Francisco Jañez-Martino, Javier Velasco-Mata
Abstract: The information technology revolution has made pornographic material reachable by everyone, including minors, who are the most vulnerable should they be abused. Accuracy and time performance are features desired by forensic tools oriented to child sexual abuse detection, whose main components may rely on image or video classifiers. In this paper, we identify the hardware and software requirements that may affect the performance of a forensic tool. We evaluated the adult pornography classifier proposed by Yahoo, based on Deep Learning, on two different operating systems and four hardware configurations, with two and four different CPUs and GPUs, respectively. The classification speed on the Ubuntu operating system is $~5$ and $~2$ times faster than on Windows 10 when a CPU and a GPU are used, respectively. We demonstrate the superiority of a GPU-based machine over a CPU-based one, being $7$ to $8$ times faster. Finally, we prove that the upward and downward interpolation process conducted while resizing the input images does not influence the performance of the selected prediction model.
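The resizing experiment amounts to standard up- and down-sampling of inputs before inference; with Pillow, for example (the file name, target sizes, and filter are illustrative, not the paper's settings):

```python
from PIL import Image

img = Image.open("frame.jpg")                   # placeholder input image
up = img.resize((299, 299), Image.BILINEAR)     # upward interpolation
down = img.resize((128, 128), Image.BILINEAR)   # downward interpolation
```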
11. Towards Better Graph Representation: Two-Branch Collaborative Graph Neural Networks for Multimodal Marketing Intention Detection [PDF] Back to contents
Lu Zhang, Jian Zhang, Zhibin Li, Jingsong Xu
Abstract: Inspired by the fact that spreading and collecting information through the Internet has become the norm, more and more people choose to post for-profit content (images and text) on social networks. Due to the difficulty of network censorship, malicious marketing may be capable of harming society. Therefore, it is meaningful to detect marketing intentions online automatically. However, gaps between multimodal data make it difficult to fuse images and texts for content marketing detection. To this end, this paper proposes Two-Branch Collaborative Graph Neural Networks to collaboratively represent multimodal data with Graph Convolution Networks (GCNs) in an end-to-end fashion. We first separately embed groups of images and texts with GCN layers from two views and further adopt the proposed multimodal fusion strategy to learn the graph representation collaboratively. Experimental results demonstrate that our proposed method achieves superior graph classification performance for marketing intention detection.
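Both branches build on the standard graph-convolution propagation rule H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W); a minimal dense implementation for reference (the paper's exact layer configuration is not given in the abstract):

```python
import torch

def gcn_layer(A: torch.Tensor, H: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # A: (N, N) adjacency, H: (N, F_in) node features, W: (F_in, F_out) weights.
    A_hat = A + torch.eye(A.size(0))                       # add self-loops
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    return torch.relu(A_norm @ H @ W)
```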
12. A Biologically Inspired Feature Enhancement Framework for Zero-Shot Learning [PDF] Back to contents
Zhongwu Xie, Weipeng Cao, Xizhao Wang, Zhong Ming, Jingjing Zhang, Jiyong Zhang
Abstract: Most Zero-Shot Learning (ZSL) algorithms currently use pre-trained models as their feature extractors, which are usually trained on the ImageNet data set using deep neural networks. The richness of the feature information embedded in the pre-trained models can help the ZSL model extract more useful features from its limited training samples. However, sometimes the difference between the training data set of the current ZSL task and the ImageNet data set is too large, in which case using pre-trained models may provide no obvious help or may even negatively impact the performance of the ZSL model. To solve this problem, this paper proposes a biologically inspired feature enhancement framework for ZSL. Specifically, we design a dual-channel learning framework that uses auxiliary data sets to enhance the feature extractor of the ZSL model and propose a novel method to guide the selection of the auxiliary data sets based on knowledge of biological taxonomy. Extensive experimental results show that our proposed method can effectively improve the generalization ability of the ZSL model and achieve state-of-the-art results on three benchmark ZSL tasks. We also explain the experimental phenomena through feature visualization.
13. A global method to identify trees inside and outside of forests with medium-resolution satellite imagery [PDF] Back to contents
John Brandt, Fred Stolle
Abstract: Scattered trees outside of dense forests are very important for carbon sequestration, supporting livelihoods, maintaining ecosystem integrity, and climate change adaptation and mitigation. In contrast to trees inside of forests, not much is known about the spatial extent and distribution of scattered trees at a global scale. Due to the very high cost of high-resolution satellite imagery, global monitoring systems rely on medium-resolution satellites to monitor land use and land use change. However, detecting and monitoring scattered trees with an open canopy using medium-resolution satellites is difficult because individual trees often cover a smaller footprint than the satellite's resolution. Here we present a globally consistent method to identify trees inside and outside of forests with medium-resolution optical and radar imagery. Biweekly cloud-free, pansharpened 10 meter Sentinel-2 optical imagery and Sentinel-1 radar imagery are used to train a fully convolutional network, consisting of a convolutional gated recurrent unit layer and a feature pyramid attention layer. Tested across more than 215,000 Sentinel-1 and Sentinel-2 pixels distributed from -60 to +60 latitude, the proposed model exceeds 75 percent user's and producer's accuracy identifying trees in hectares with a low to medium density (less than 40 percent) of canopy cover, and 95 percent user's and producer's accuracy in hectares with dense (greater than 40 percent) canopy cover. When applied across large, heterogeneous landscapes, the results demonstrate the potential to map trees in high detail and with consistent accuracy over diverse landscapes across the globe. This information is important for understanding current land cover and can be used to detect changes in land cover such as agroforestry, buffer zones around biological hotspots, and expansion or encroachment of forests.
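User's and producer's accuracy are the remote-sensing names for precision and recall of the tree class; a worked check on assumed confusion-matrix counts:

```python
# Illustrative counts for the "tree" class, not the paper's numbers.
tp, fp, fn = 80, 15, 10
users_accuracy = tp / (tp + fp)       # precision: 80 / 95 ≈ 0.842
producers_accuracy = tp / (tp + fn)   # recall:    80 / 90 ≈ 0.889
```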
14. Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese [PDF] Back to contents
Marek Rychlik, Dwight Nwaigwe, Yan Han, Dylan Murphy
Abstract: We report upon the results of a research and prototype building project, Worldly OCR, dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts Farsi and Pashto, and Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive, but features an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. The target audience of this paper is a general audience with interest in Digital Humanities or in retrieval of accurate full-text and metadata from digital images.
15. A Detailed Look At CNN-based Approaches In Facial Landmark Detection [PDF] Back to contents
Chih-Fan Hsu, Chia-Ching Lin, Ting-Yang Hung, Chin-Laung Lei, Kuan-Ta Chen
Abstract: Facial landmark detection has been studied over decades. Numerous neural network (NN)-based approaches have been proposed for detecting landmarks, especially convolutional neural network (CNN)-based approaches. In general, CNN-based approaches can be divided into regression and heatmap approaches. However, no research has systematically studied the characteristics of these different approaches. In this paper, we investigate both CNN-based approaches, generalize their advantages and disadvantages, and introduce a variation of the heatmap approach, a pixel-wise classification (PWC) model. To the best of our knowledge, using the PWC model to detect facial landmarks has not been comprehensively studied. We further design a hybrid loss function and a discrimination network that strengthen the landmarks' interrelationship implied in the PWC model to improve the detection accuracy without modifying the original model architecture. Six common facial landmark datasets, AFW, Helen, LFPW, 300-W, IBUG, and COFW, are adopted to train or evaluate our model. A comprehensive evaluation is conducted, and the result shows that the proposed model outperforms other models on all tested datasets.
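The regression/heatmap split comes down to the training target: regression models predict (x, y) coordinates directly, while heatmap (and pixel-wise classification) models predict a per-pixel map that is decoded back to a coordinate. A minimal Gaussian-heatmap target, with an assumed sigma:

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, cx: float, cy: float,
                     sigma: float = 2.0) -> np.ndarray:
    # Per-pixel target that peaks at the landmark location (cx, cy).
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap(64, 64, cx=20.5, cy=33.0)
pred_y, pred_x = np.unravel_index(hm.argmax(), hm.shape)  # decode landmark
```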
16. Preterm infants' pose estimation with spatio-temporal features [PDF] Back to contents
Sara Moccia, Lucia Migliorelli, Virgilio Carnielli, Emanuele Frontoni
Abstract: Objective: Preterm infants' limb monitoring in neonatal intensive care units (NICUs) is of primary importance for assessing infants' health status and motor/cognitive development. Herein, we propose a new approach to preterm infants' limb pose estimation that features spatio-temporal information to detect and track limb joints from depth videos with high reliability. Methods: Limb-pose estimation is performed using a deep-learning framework consisting of a detection and a regression convolutional neural network (CNN) for rough and precise joint localization, respectively. The CNNs are implemented to encode connectivity in the temporal direction through 3D convolution. Assessment of the proposed framework is performed through a comprehensive study with sixteen depth videos acquired in actual clinical practice from sixteen preterm infants (the babyPose dataset). Results: When applied to pose estimation, the median root mean squared distance, computed among all limbs, between the estimated and the ground-truth pose was 9.06 pixels, outperforming approaches based on spatial features only (11.27 pixels). Conclusion: Results showed that the spatio-temporal features had a significant influence on the pose-estimation performance, especially in challenging cases (e.g., homogeneous image intensity). Significance: This paper significantly enhances the state of the art in automatic assessment of preterm infants' health status by introducing the use of spatio-temporal features for limb detection and tracking, and by being the first study to use depth videos acquired in actual clinical practice for limb-pose estimation. The babyPose dataset has been released as the first annotated dataset for infants' pose estimation.
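The reported 9.06-pixel result is a median root mean squared distance computed among all limbs; a small sketch of that metric, with assumed array shapes:

```python
import numpy as np

def median_rmsd(pred: np.ndarray, gt: np.ndarray) -> float:
    # pred, gt: (n_limbs, n_joints, 2) joint coordinates in pixels (assumed).
    dist = np.linalg.norm(pred - gt, axis=-1)    # per-joint Euclidean distance
    rmsd = np.sqrt((dist ** 2).mean(axis=-1))    # RMS distance per limb
    return float(np.median(rmsd))                # median across limbs
```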
17. Character Matters: Video Story Understanding with Character-Aware Relations [PDF] 返回目录
Shijie Geng, Ji Zhang, Zuohui Fu, Peng Gao, Hang Zhang, Gerard de Melo
Abstract: Different from short videos and GIFs, video stories contain clear plots and lists of principal characters. Without identifying the connection between appearing people and character names, a model is not able to obtain a genuine understanding of the plots. Video Story Question Answering (VSQA) offers an effective way to benchmark the higher-level comprehension abilities of a model. However, current VSQA methods merely extract generic visual features from a scene. With such an approach, they remain prone to learning just superficial correlations. In order to attain a genuine understanding of who did what to whom, we propose a novel model that continuously refines character-aware relations. This model specifically considers the characters in a video story, as well as the relations connecting different characters and objects. Based on these signals, our framework enables weakly-supervised face naming through multi-instance co-occurrence matching and supports high-level reasoning utilizing Transformer structures. We train and test our model on the six diverse TV shows in the TVQA dataset, which is by far the largest and the only publicly available dataset for VSQA. We validate our proposed approach on the TVQA dataset through an extensive ablation study.
18. Multi-Task Learning in Histo-pathology for Widely Generalizable Model [PDF] 返回目录
Jevgenij Gamper, Navid Alemi Kooohbanani, Nasir Rajpoot
Abstract: In this work we show preliminary results of deep multi-task learning in the area of computational pathology. We combine 11 tasks ranging from patch-wise oral cancer classification, one of the most prevalent cancers in the developing world, to multi-tissue nuclei instance segmentation and classification.
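A minimal sketch of the shared-backbone multi-task pattern the abstract describes: one encoder feeds several task-specific heads that are trained jointly. The two heads, their class counts, and the toy encoder below are illustrative stand-ins for the 11 tasks in the paper.

```python
import torch
import torch.nn as nn

# Toy shared encoder: every task sees the same representation.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
# One head per task; names and class counts are hypothetical.
heads = nn.ModuleDict({
    "oral_cancer": nn.Linear(32, 2),   # patch-wise binary classification
    "nuclei_type": nn.Linear(32, 5),   # nuclei classification
})

patches = torch.randn(4, 3, 224, 224)
shared = encoder(patches)
outputs = {task: head(shared) for task, head in heads.items()}
# A joint objective would sum per-task losses, e.g. with task weights.
```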
19. Intracranial Hemorrhage Detection Using Neural Network Based Methods With Federated Learning [PDF] 返回目录
Utkarsh Chandra Srivastava, Dhruv Upadhyay, Vinayak Sharma
Abstract: Intracranial hemorrhage, bleeding that occurs inside the cranium, is a serious health problem requiring rapid and often intensive medical treatment. Such a condition is traditionally diagnosed by highly-trained specialists analyzing computed tomography (CT) scan of the patient and identifying the location and type of hemorrhage if one exists. We propose a neural network approach to find and classify the condition based upon the CT scan. The model architecture implements a time distributed convolutional network. We observed accuracy above 92% from such an architecture, provided enough data. We propose further extensions to our approach involving the deployment of federated learning. This would be helpful in pooling learned parameters without violating the inherent privacy of the data involved.
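The parameter pooling the authors mention is in the spirit of federated averaging: each site trains locally and only model weights are shared and aggregated. A minimal sketch, assuming simple FedAvg-style averaging rather than the authors' exact protocol; the two "site" models are hypothetical.

```python
import torch
import torch.nn as nn

def federated_average(state_dicts):
    """Average parameters from several clients without sharing their data."""
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

# Two hypothetical hospital models trained locally on private CT scans.
site_a, site_b = nn.Linear(16, 2), nn.Linear(16, 2)
global_model = nn.Linear(16, 2)
global_model.load_state_dict(
    federated_average([site_a.state_dict(), site_b.state_dict()]))
```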
20. Atom Search Optimization with Simulated Annealing -- a Hybrid Metaheuristic Approach for Feature Selection [PDF] 返回目录
Kushal Kanti Ghosh, Ritam Guha, Soulib Ghosh, Suman Kumar Bera, Ram Sarkar
Abstract: 'Hybrid meta-heuristics' is one of the most interesting recent trends in the field of optimization and feature selection (FS). In this paper, we propose a binary variant of Atom Search Optimization (ASO) and its hybrid with Simulated Annealing, called ASO-SA, for FS. In order to map the real values used by ASO to the binary domain of FS, we use two different transfer functions: S-shaped and V-shaped. We hybridize this technique with a local search technique called SA. We apply the proposed feature selection methods on 25 datasets from 4 different categories: UCI, handwritten digit recognition, text/non-text separation, and facial emotion recognition. We use 3 different classifiers (K-Nearest Neighbor, Multi-Layer Perceptron and Random Forest) to evaluate the strength of the features selected by binary ASO and ASO-SA, and compare the results with some recent wrapper-based algorithms. The experimental results confirm the superiority of the proposed method both in terms of classification accuracy and the number of selected features.
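The S-shaped and V-shaped transfer functions map a continuous atom position to a probability used for binarization. A sketch with the common sigmoid and |tanh| choices; the paper's exact functional forms and update rule may differ.

```python
import numpy as np

def s_shaped(x):
    return 1.0 / (1.0 + np.exp(-x))      # sigmoid: monotone in x

def v_shaped(x):
    return np.abs(np.tanh(x))            # symmetric around zero

rng = np.random.default_rng(0)
positions = rng.normal(size=10)          # continuous ASO atom positions
# S-shaped rule: include feature i with probability s_shaped(x_i).
mask = (rng.random(10) < s_shaped(positions)).astype(int)
print(mask)                              # binary feature-selection vector
```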
21. Deep Learning Based Vehicle Tracking System Using License Plate Detection And Recognition [PDF] 返回目录
Lalit Lakshmanan, Yash Vora, Raj Ghate
Abstract: Vehicle tracking is an integral part of intelligent traffic management systems. Previous implementations of vehicle tracking used Global Positioning System (GPS) based systems that gave the location of an individual's vehicle on their smartphones. The proposed system uses a novel approach to vehicle tracking using the Vehicle License Plate detection and Recognition (VLPR) technique, which can be integrated on a large scale with traffic management systems. Initial methods of implementing VLPR used simple image processing techniques which were quite experimental and heuristic. With the onset of deep learning and computer vision, one can create robust VLPR systems that produce results close to human efficiency. Previous implementations based on deep learning made use of object detection and support vector machines for detection, and a heuristic image-processing-based approach for recognition. The proposed system makes use of a scene text detection model architecture for license plate detection and uses the optical character recognition (OCR) engine Tesseract for recognition. The proposed system obtained extraordinary results when it was tested on a highway video using an NVIDIA GeForce RTX 2080 Ti GPU: results were obtained at a speed of 30 frames per second with accuracy close to human level.
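For the recognition stage, Tesseract can be invoked through the pytesseract wrapper once the detector has produced a plate crop. A sketch under the assumption that a crop is available; the file name and bounding box are hypothetical placeholders, and the Tesseract binary must be installed on the system.

```python
import pytesseract
from PIL import Image

frame = Image.open("frame_000123.png")       # hypothetical video frame
plate = frame.crop((420, 310, 640, 370))     # box from the text detector
# --psm 7 tells Tesseract to treat the crop as a single line of text,
# a common setting for license plates.
text = pytesseract.image_to_string(plate, config="--psm 7")
print(text.strip())
```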
22. End-to-End Lane Marker Detection via Row-wise Classification [PDF] 返回目录
Seungwoo Yoo, Heeseok Lee, Heesoo Myeong, Sungrack Yun, Hyoungwoo Park, Janghoon Cho, Duck Hoon Kim
Abstract: In autonomous driving, detecting reliable and accurate lane marker positions is a crucial yet challenging task. The conventional approaches for the lane marker detection problem perform a pixel-level dense prediction task followed by sophisticated post-processing, which is inevitable since lane markers are typically represented by a collection of line segments without thickness. In this paper, we propose a method performing direct lane marker vertex prediction in an end-to-end manner, i.e., without any post-processing step that is required in the pixel-level dense prediction task. Specifically, we translate the lane marker detection problem into a row-wise classification task, which takes advantage of the innate shape of lane markers but, surprisingly, has not been explored well. In order to compactly extract sufficient information about lane markers, which spread from the left to the right in an image, we devise a novel layer that successively compresses horizontal components, enabling an end-to-end lane marker detection system where the final lane marker positions are simply obtained via argmax operations at test time. Experimental results demonstrate the effectiveness of the proposed method, which is on par with or outperforms the state-of-the-art methods on two popular lane marker detection benchmarks, i.e., TuSimple and CULane.
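The row-wise formulation can be made concrete in a few lines: for every image row the network emits scores over W horizontal bins plus a "no marker" bin, and the final positions fall out of a per-row argmax. The shapes below are illustrative, not the paper's exact configuration.

```python
import torch

B, H, W = 2, 288, 800                    # batch, rows, horizontal bins
logits = torch.randn(B, H, W + 1)        # last bin means "no marker here"
cols = logits.softmax(dim=-1).argmax(dim=-1)   # (B, H): one column per row
present = cols < W                       # rows predicted to contain a marker
# (row, cols[row]) pairs on present rows are the lane marker vertices,
# obtained without any pixel-level post-processing.
```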
23. Supervision and Source Domain Impact on Representation Learning: A Histopathology Case Study [PDF] 返回目录
Milad Sikaroudi, Amir Safarpoor, Benyamin Ghojogh, Sobhan Shafiei, Mark Crowley, H.R. Tizhoosh
Abstract: As many algorithms depend on a suitable representation of data, learning unique features is considered a crucial task. Although supervised techniques using deep neural networks have boosted the performance of representation learning, the need for a large set of labeled data limits the application of such methods. As an example, producing high-quality delineations of regions of interest in the field of pathology is a tedious and time-consuming task due to the large image dimensions. In this work, we explored the performance of a deep neural network and triplet loss in the area of representation learning. We investigated the notion of similarity and dissimilarity in pathology whole-slide images and compared different setups, from unsupervised and semi-supervised to supervised learning, in our experiments. Additionally, different approaches were tested, applying few-shot learning on two publicly available pathology image datasets. We achieved high accuracy and generalization when the learned representations were applied to two different pathology datasets.
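The triplet loss at the core of this representation learning setup pulls an anchor toward a positive of the same class and pushes it away from a negative, up to a margin. A standard formulation is sketched below (PyTorch also ships it as nn.TripletMarginLoss); the embeddings are stand-ins.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = F.pairwise_distance(anchor, positive)   # same-class distance
    d_neg = F.pairwise_distance(anchor, negative)   # different-class distance
    return F.relu(d_pos - d_neg + margin).mean()    # hinge on the gap

a, p, n = (torch.randn(8, 128) for _ in range(3))   # stand-in embeddings
loss = triplet_loss(a, p, n)
```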
24. Synthetic Image Augmentation for Damage Region Segmentation using Conditional GAN with Structure Edge [PDF] 返回目录
Takato Yasuno, Michihiro Nakajima, Tomoharu Sekiguchi, Kazuhiro Noda, Kiyoshi Aoyanagi, Sakura Kato
Abstract: Social infrastructure is aging, and its predictive maintenance has become an important issue. To monitor the state of infrastructure, bridge inspection is performed by the human eye or by drone. For diagnosis, primary damage regions are recognized as repair targets. However, severe degradation rarely occurs, and the damage regions of interest are often narrow, so their ratio per image amounts to an extremely small pixel count, empirically 0.6 to 1.5 percent. Both the scarcity and the imbalance of the damage regions of interest limit damage detection performance. If an additional dataset of damaged images can be generated, it may improve the accuracy of damage region segmentation algorithms. We propose a synthetic augmentation procedure to generate damaged images using an image-to-image translation that maps a tri-categorical label, consisting of both a semantic label and a structure edge, to the real damage image. We use the Sobel gradient operator to enhance the structure edge. For bridge inspection, we apply the method to RC concrete structures using 208 eye-inspection photos in which rebar exposure occurred, from which 840 block images of size 224 by 224 are prepared. We applied popular per-pixel segmentation algorithms such as FCN-8s, SegNet, and DeepLabv3+Xception-v2. We demonstrate that re-training on a dataset extended with the synthetic augmentation procedure yields higher accuracy in terms of mean IoU, damage region of interest IoU, precision, recall, and BF score when predicting on test images.
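The Sobel gradient operator used for structure-edge enhancement is standard; a small sketch of computing the gradient magnitude of a grayscale block (SciPy is one of several libraries that provide it, and the 224x224 patch below is a stand-in).

```python
import numpy as np
from scipy import ndimage

def sobel_magnitude(gray):
    gx = ndimage.sobel(gray, axis=1)     # horizontal gradient
    gy = ndimage.sobel(gray, axis=0)     # vertical gradient
    return np.hypot(gx, gy)              # edge strength per pixel

patch = np.random.rand(224, 224)         # stand-in for a 224x224 block image
edges = sobel_magnitude(patch)           # could serve as the edge channel
                                         # of the tri-categorical label
```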
25. A Spontaneous Driver Emotion Facial Expression (DEFE) Dataset for Intelligent Vehicles [PDF] 返回目录
Wenbo Li, Yaodong Cui, Yintao Ma, Xingxin Chen, Guofa Li, Gang Guo, Dongpu Cao
Abstract: In this paper, we introduce a new dataset, the driver emotion facial expression (DEFE) dataset, for driver spontaneous emotion analysis. The dataset includes facial expression recordings from 60 participants during driving. After watching a selected video-audio clip to elicit a specific emotion, each participant completed the driving tasks in the same driving scenario and rated their emotional responses during the driving processes from the aspects of dimensional emotion and discrete emotion. We also conducted classification experiments to recognize the scales of arousal, valence, and dominance, as well as the emotion category and intensity, to establish baseline results for the proposed dataset. In addition, this paper compares and discusses the differences in facial expressions between driving and non-driving scenarios. The results show that there were significant differences in the presence of AUs (Action Units) in facial expressions between driving and non-driving scenarios, indicating that human emotional expressions in driving scenarios differ from those in other life scenarios. Therefore, publishing a human emotion dataset specifically for drivers is necessary for traffic safety improvement. The proposed dataset will be publicly available so that researchers worldwide can use it to develop and examine their driver emotion analysis methods. To the best of our knowledge, this is currently the only public driver facial expression dataset.
26. A model-based Gait Recognition Method based on Gait Graph Convolutional Networks and Joints Relationship Pyramid Mapping [PDF] 返回目录
Na Li, Xinbo Zhao, Chong Ma
Abstract: Gait is a unique biometric feature that can be recognized at a distance and can therefore be widely applied in public security. In this paper, we propose a novel model-based gait recognition method, JointsGait, which extracts gait information from human body joints. Early gait recognition methods are mainly based on appearance. Appearance-based features are usually extracted from human body silhouettes, which are not invariant to changes in clothing and can be subject to drastic variations due to camera motion or other external factors. In contrast to previous approaches, JointsGait first extracts spatio-temporal features using gait graph convolutional networks constructed from 18 2-D joints, which are less affected by external factors. Then, Joints Relationship Pyramid Mapping (JRPM) is proposed to map spatio-temporal gait features into a discriminative feature space with biological advantages, according to physical structure and walking habits at various scales. Finally, we investigate a fusion loss strategy to help the joint features be insensitive to cross-view variations. Our method is evaluated on the large CASIA-B dataset. The experimental results show that JointsGait achieves state-of-the-art performance and is less affected by view variations. Its recognition accuracy is higher than that of the latest model-based method, PoseGait, under all walking conditions, and it even outperforms most state-of-the-art appearance-based methods, especially when there is clothing variation.
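One graph-convolution step over the 18 2-D joints can be sketched as feature aggregation along a normalized skeleton adjacency. The edge list below is a partial, hypothetical skeleton and the layer is a generic GCN step, not the paper's exact graph or network.

```python
import torch
import torch.nn as nn

N = 18                                    # joints, each an (x, y) coordinate
edges = [(0, 1), (1, 2), (2, 3), (1, 5)]  # partial, hypothetical skeleton
A = torch.eye(N)                          # self-loops
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A = A / A.sum(dim=1, keepdim=True)        # row-normalized adjacency

lin = nn.Linear(2, 64)
joints = torch.randn(N, 2)                # one frame of 2-D joint positions
h = torch.relu(A @ lin(joints))           # (18, 64) per-joint gait features
```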
27. Learn Class Hierarchy using Convolutional Neural Networks [PDF] 返回目录
Riccardo La Grassa, Ignazio Gallo, Nicola Landro
Abstract: A large amount of research on Convolutional Neural Networks has focused on flat classification in the multi-class domain. In the real world, many problems are naturally expressed as problems of hierarchical classification, in which the classes to be predicted are organized in a hierarchy of classes. In this paper, we propose a new architecture for the hierarchical classification of images, introducing a stack of deep linear layers that combines cross-entropy loss functions with a center loss. The proposed architecture can extend any neural network model and simultaneously optimizes loss functions to discover local hierarchical class relationships and a loss function to discover global information from the whole class hierarchy while penalizing class hierarchy violations. We experimentally show that our hierarchical classifier offers advantages over traditional classification approaches that find application in computer vision tasks.
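The combination of cross-entropy with a center loss can be written compactly: each class keeps a learnable center, features are pulled toward the center of their class, and cross-entropy handles discrimination. A sketch with an illustrative weighting factor, not the paper's exact hierarchical objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, feat_dim = 10, 64
centers = nn.Parameter(torch.randn(num_classes, feat_dim))

def joint_loss(features, logits, labels, lam=0.1):
    ce = F.cross_entropy(logits, labels)
    # Squared distance of each feature to its class center.
    center = ((features - centers[labels]) ** 2).sum(dim=1).mean()
    return ce + lam * center             # lam trades off intra-class compactness

loss = joint_loss(torch.randn(8, 64), torch.randn(8, 10),
                  torch.randint(0, 10, (8,)))
```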
28. Decoder Modulation for Indoor Depth Completion [PDF] 返回目录
Dmitry Senushkin, Ilia Belikov, Anton Konushin
Abstract: Accurate depth map estimation is an essential step in scene spatial mapping for AR applications and 3D modeling. Current depth sensors provide time-synchronized depth and color images in real time, but have limited range and suffer from missing and erroneous depth values on transparent or glossy surfaces. We investigate the task of depth completion, which aims at improving the accuracy of depth measurements and recovering the missing depth values using additional information from corresponding color images. Surprisingly, we find that a simple baseline model based on a modern encoder-decoder architecture for semantic segmentation achieves state-of-the-art accuracy on standard depth completion benchmarks. Then, we show that the accuracy can be further improved by taking into account a mask of missing depth values. The main contributions of our work are two-fold. First, we propose a modified decoder architecture, where features from raw depth and color are modulated by features from the mask via Spatially-Adaptive Denormalization (SPADE). Second, we introduce a new loss function for depth estimation based on a direct comparison of log depth predictions with ground truth values. The resulting model outperforms the current state of the art by a large margin on the challenging Matterport3D dataset.
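The log-depth comparison can be sketched as an L1 penalty between log predictions and log ground truth, restricted to pixels where the sensor reported a valid depth. This is a plausible reading of the abstract; the exact loss in the paper may differ in detail.

```python
import torch

def log_depth_l1(pred, gt, valid, eps=1e-6):
    # Compare depths in log space so relative errors are weighted evenly
    # across near and far pixels; skip missing ground-truth values.
    diff = torch.log(pred.clamp(min=eps)) - torch.log(gt.clamp(min=eps))
    return diff.abs()[valid].mean()

pred = torch.rand(1, 1, 4, 4) + 0.1
gt = torch.rand(1, 1, 4, 4) + 0.1
valid = gt > 0.2                          # stand-in mask of known depths
loss = log_depth_l1(pred, gt, valid)
```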
29. End-to-End Lip Synchronisation [PDF] 返回目录
You Jin Kim, Hee Soo Heo, Soo-Whan Chung, Bong-Jin Lee
Abstract: The goal of this work is to synchronise audio and video of a talking face using deep neural network models. Existing works have trained networks on proxy tasks such as cross-modal similarity learning, and then computed similarities between audio and video frames using a sliding window approach. While these methods demonstrate satisfactory performance, the networks are not trained directly on the task. To this end, we propose an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream. The similarity matrix between the two modalities is first computed from the features, then the inference of the offset can be considered to be a pattern recognition problem where the matrix is considered equivalent to an image. The feature extractor and the classifier are trained jointly. We demonstrate that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
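The similarity matrix between the two streams is simply the pairwise cosine similarity of per-frame audio and video features; the paper treats this matrix like an image whose (off-)diagonal structure reveals the offset. A sketch of building it, with illustrative dimensions; the diagonal-mean scoring at the end illustrates the sliding-window baseline the end-to-end network replaces.

```python
import torch
import torch.nn.functional as F

T, D = 50, 512                            # frames and feature dimension
audio = F.normalize(torch.randn(T, D), dim=1)
video = F.normalize(torch.randn(T, D), dim=1)
sim = audio @ video.t()                   # (T, T) cosine-similarity "image"
# A small network over `sim` can regress the A/V offset directly; the
# naive alternative scores each candidate shift by its diagonal mean:
scores = torch.stack([torch.diagonal(sim, k).mean() for k in range(-5, 6)])
offset = int(scores.argmax()) - 5
```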
30. DDD20 End-to-End Event Camera Driving Dataset: Fusing Frames and Events with Deep Learning for Improved Steering Prediction [PDF] 返回目录
Yuhuang Hu, Jonathan Binas, Daniel Neil, Shih-Chii Liu, Tobi Delbruck
Abstract: Neuromorphic event cameras are useful for dynamic vision problems under difficult lighting conditions. To enable studies of using event cameras in automobile driving applications, this paper reports a new end-to-end driving dataset called DDD20. The dataset was captured with a DAVIS camera that concurrently streams both dynamic vision sensor (DVS) brightness change events and active pixel sensor (APS) intensity frames. DDD20 is the longest event camera end-to-end driving dataset to date, with 51 h of DAVIS event+frame camera data and vehicle human control data collected from 4000 km of highway and urban driving under a variety of lighting conditions. Using DDD20, we report the first study of fusing brightness change events and intensity frame data using a deep learning approach to predict the instantaneous human steering wheel angle. Over all day and night conditions, the explained variance for human steering prediction from a ResNet-32 is significantly better from the fused DVS+APS frames (0.88) than using either DVS (0.67) or APS (0.77) data alone.
31. Omni-supervised Facial Expression Recognition: A Simple Baseline [PDF] 返回目录
Ping Liu, Yunchao Wei, Zibo Meng, Weihong Deng, Joey Tianyi Zhou, Yi Yang
Abstract: In this paper, we aim to advance the performance of facial expression recognition (FER) by exploiting omni-supervised learning. Current state-of-the-art FER approaches usually aim to recognize facial expressions in a controlled environment by training models with a limited number of samples. To enhance the robustness of the learned models for various scenarios, we propose to perform omni-supervised learning by exploiting the labeled samples together with a large number of unlabeled data. Particularly, we first employ MS-Celeb-1M as the facial pool, which includes around 5,822K unlabeled facial images. Then, a primitive model learned on a small number of labeled samples is adopted to select samples with high confidence from the facial pool by conducting feature-based similarity comparison. We find that the new dataset constructed in such an omni-supervised manner can significantly improve the generalization ability of the learned FER model and consequently boost performance. However, as more training samples are used, more computation resources and training time are required, which is usually not affordable in many circumstances. To relieve the requirement of computational resources, we further adopt a dataset distillation strategy to distill the target task-related knowledge from the newly mined samples and compress it into a very small set of images. This distilled dataset is capable of boosting the performance of FER with little additional computational cost. We perform extensive experiments on five popular benchmarks and a newly constructed dataset, where consistent gains can be achieved under various settings using the proposed framework. We hope this work will serve as a solid baseline and help ease future research in FER.
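The feature-based selection step can be sketched as nearest-prototype matching: class prototypes are computed from the labeled set, and unlabeled faces whose cosine similarity to a prototype clears a threshold are pseudo-labeled. The dimensions, toy labels, and threshold below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

n_classes = 7                                        # expression classes
labeled = F.normalize(torch.randn(100, 256), dim=1)  # labeled-set features
labels = torch.arange(100) % n_classes               # toy labels, all classes
pool = F.normalize(torch.randn(5000, 256), dim=1)    # unlabeled facial pool

protos = torch.stack([labeled[labels == c].mean(0) for c in range(n_classes)])
protos = F.normalize(protos, dim=1)
sim = pool @ protos.t()                              # (5000, n_classes)
conf, pseudo = sim.max(dim=1)
keep = conf > 0.6                                    # high-confidence samples
new_data = pool[keep], pseudo[keep]                  # added to the training set
```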
32. Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction [PDF] 返回目录
Cunjun Yu, Xiao Ma, Jiawei Ren, Haiyu Zhao, Shuai Yi
Abstract: Understanding crowd motion dynamics is critical to real-world applications, e.g., surveillance systems and autonomous driving. This is challenging because it requires effectively modeling the socially aware crowd spatial interaction and complex temporal dependencies. We believe attention is the most important factor for trajectory prediction. In this paper, we present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction by only attention mechanisms. STAR models intra-graph crowd interaction by TGConv, a novel Transformer-based graph convolution mechanism. The inter-graph temporal dependencies are modeled by separate temporal Transformers. STAR captures complex spatio-temporal interactions by interleaving between spatial and temporal Transformers. To calibrate the temporal prediction for the long-lasting effect of disappeared pedestrians, we introduce a read-writable external memory module, consistently being updated by the temporal Transformer. We show STAR outperforms the state-of-the-art models on 4 out of 5 real-world pedestrian trajectory prediction datasets, and achieves comparable performance on the rest.
33. VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization [PDF] 返回目录
Cheng Gong, Yao Chen, Ye Lu, Tao Li, Cong Hao, Deming Chen
Abstract: Quantization has been proven to be an effective method for reducing the computing and/or storage cost of DNNs. However, the trade-off between the quantization bitwidth and final accuracy is complex and non-convex, which makes it difficult to optimize directly. Minimizing the direct quantization loss (DQL) of the coefficient data is an effective local optimization method, but previous works often neglect the accurate control of the DQL, resulting in a higher loss of final DNN model accuracy. In this paper, we propose a novel metric called Vector Loss. Based on this new metric, we develop a new quantization solution called VecQ, which can guarantee minimal direct quantization loss and better model accuracy. In addition, in order to speed up the proposed quantization process during model training, we accelerate the quantization process with a parameterized probability estimation method and template-based derivation calculation. We evaluate our proposed algorithm on the MNIST, CIFAR, ImageNet, IMDB movie review and THUCNews text data sets with numerical DNN models. The results demonstrate that our proposed quantization solution is more accurate and effective than the state-of-the-art approaches, yet with more flexible bitwidth support. Moreover, the evaluation of our quantized models on Salient Object Detection (SOD) tasks maintains comparable feature extraction quality with up to 16$\times$ weight size reduction.
34. Context-aware and Scale-insensitive Temporal Repetition Counting [PDF] 返回目录
Huaidong Zhang, Xuemiao Xu, Guoqiang Han, Shengfeng He
Abstract: Temporal repetition counting aims to estimate the number of cycles of a given repetitive action. Existing deep learning methods assume repetitive actions are performed at a fixed time-scale, an assumption that does not hold for the complex repetitive actions of real life. In this paper, we tailor a context-aware and scale-insensitive framework to tackle the challenges in repetition counting caused by unknown and diverse cycle-lengths. Our approach combines two key insights: (1) Cycle lengths of different actions are unpredictable and require large-scale searching, but, once a coarse cycle length is determined, the variation between repetitions can be overcome by regression. (2) Determining the cycle length cannot rely on a short fragment of video alone but requires contextual understanding. The first point is implemented by a coarse-to-fine cycle refinement method. It avoids the heavy computation of exhaustively searching all cycle lengths in the video and instead propagates the coarse prediction for further refinement in a hierarchical manner. Secondly, we propose a bidirectional cycle length estimation method for context-aware prediction. It is a regression network that takes two consecutive coarse cycles as input and predicts the locations of the previous and next repetitive cycles. To support training and evaluation in temporal repetition counting, we construct a new, largest-to-date benchmark containing 526 videos with diverse repetitive actions. Extensive experiments show that the proposed network, trained on a single dataset, outperforms state-of-the-art methods on several benchmarks, indicating that the proposed framework is general enough to capture repetition patterns across domains.
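For contrast with the paper's coarse-to-fine search, the fixed-time-scale baseline it argues against can be sketched in a few lines: estimate one global cycle length from the autocorrelation of a per-frame similarity signal. The synthetic features and the minimum-lag cutoff below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def fixed_scale_cycle_length(features, min_lag=5):
    # Baseline with a single global cycle length: correlate every frame
    # with the first frame and pick the lag maximizing autocorrelation.
    signal = features @ features[0]
    signal = signal - signal.mean()
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    return int(np.argmax(ac[min_lag:]) + min_lag)

T, D = 120, 16
phase = np.linspace(0, 12 * np.pi, T)            # ~6 cycles, period ~20 frames
features = np.stack([np.sin(phase + k) for k in range(D)], axis=1)
cycle = fixed_scale_cycle_length(features)
print("cycle length:", cycle, "-> count:", T // cycle)
```

Such a baseline breaks down as soon as the cycle length drifts within the video, which is exactly the case the paper's hierarchical refinement and bidirectional regression are designed to handle.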
35. Feature Transformation Ensemble Model with Batch Spectral Regularization for Cross-Domain Few-Shot Classification [PDF] 返回目录
Bingyu Liu, Zhen Zhao, Zhenpeng Li, Jianan Jiang, Yuhong Guo, Haifeng Shen, Jieping Ye
Abstract: Deep learning models often require large amounts of annotated data to obtain good performance. In real-world cases, collecting and annotating data is easy in some domains while hard in others. A practical way to tackle this problem is using label-rich datasets with large amounts of labeled data to help improve prediction performance on label-poor datasets with little annotated data. Cross-domain few-shot learning (CD-FSL) is one such transfer learning setting. In this paper, we propose a feature transformation ensemble model with batch spectral regularization and label propagation for the CD-FSL challenge. Specifically, we propose to construct an ensemble prediction model by performing multiple diverse feature transformations after a shared feature extraction network. On each branch prediction network of the model, we use a batch spectral regularization term to suppress the singular values of the feature matrix during pre-training to improve the generalization ability of the model. The proposed model can then be fine-tuned in the target domain to address few-shot classification. We also apply label propagation and data augmentation to further mitigate the shortage of labeled data in target domains. Experiments are conducted on a number of CD-FSL benchmark tasks with four target domains, and the results demonstrate the superiority of our proposed method.
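A minimal sketch of the batch spectral regularization idea, assuming a top-k penalty on the singular values of the mini-batch feature matrix; the exact weighting over the spectrum used in the paper may differ.

```python
import torch

def batch_spectral_regularization(features, k=3):
    # features: (batch, dim) output of the shared feature extractor.
    # Penalize the squared leading singular values of the batch matrix;
    # the paper's exact weighting may differ from this top-k form.
    s = torch.linalg.svdvals(features)   # singular values, descending order
    return (s[:k] ** 2).sum()

features = torch.randn(32, 512, requires_grad=True)
task_loss = features.pow(2).mean()       # stand-in for the few-shot task loss
loss = task_loss + 1e-3 * batch_spectral_regularization(features)
loss.backward()                          # gradients flow through the SVD
```

The intuition is that large singular values concentrate the batch features in a few source-specific directions; damping them flattens the spectrum and encourages representations that transfer across domains.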
36. Large-Scale Object Detection in the Wild from Imbalanced Multi-Labels [PDF] 返回目录
Junran Peng, Xingyuan Bu, Ming Sun, Zhaoxiang Zhang, Tieniu Tan, Junjie Yan
Abstract: Training with more data has always been the most stable and effective way of improving performance in the deep learning era. As the largest object detection dataset so far, Open Images brings great opportunities and challenges for object detection in general and sophisticated scenarios. However, owing to its semi-automatic collecting and labeling pipeline for dealing with the huge data scale, the Open Images dataset suffers from label-related problems: objects may explicitly or implicitly have multiple labels, and the label distribution is extremely imbalanced. In this work, we quantitatively analyze these label problems and provide a simple but effective solution. We design a concurrent softmax to handle the multi-label problems in object detection and propose a soft-sampling method with a hybrid training scheduler to deal with the label imbalance. Overall, our method yields a dramatic improvement of 3.34 points, leading to the best single model with 60.90 mAP on the public object detection test set of Open Images. Our ensembled result achieves 67.17 mAP, which is 4.29 points higher than the best result of the Open Images public test 2018.
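A hedged reading of the concurrent-softmax idea, sketched below: when normalizing the softmax for a sample of class i, classes known to co-occur with i are down-weighted so the model is not pushed to suppress them. The co-occurrence matrix `concur` and the exact weighting are assumptions for illustration; the paper's definition may differ in its details.

```python
import torch

def concurrent_softmax_loss(logits, targets, concur):
    # logits: (N, C) scores; targets: (N,) ground-truth class indices;
    # concur: (C, C) with concur[i, j] ~ how often label j co-occurs with
    # label i (entries in [0, 1], diagonal 0). Co-occurring classes are
    # down-weighted in the normalizer instead of being treated as negatives.
    logits = logits - logits.max(dim=1, keepdim=True).values  # stability
    exp = logits.exp()
    w = 1.0 - concur[targets]                     # (N, C) per-sample weights
    idx = torch.arange(len(targets))
    w[idx, targets] = 1.0                         # keep the target's own term
    p = exp[idx, targets] / (w * exp).sum(dim=1)
    return -p.clamp_min(1e-12).log().mean()

C = 5
concur = torch.zeros(C, C)
concur[0, 1] = concur[1, 0] = 0.9                # classes 0 and 1 co-occur
logits = torch.randn(8, C, requires_grad=True)
targets = torch.randint(0, C, (8,))
concurrent_softmax_loss(logits, targets, concur).backward()
```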
37. Cross-Task Transfer for Multimodal Aerial Scene Recognition [PDF] 返回目录
Di Hu, Xuhong Li, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, Dejing Dou
Abstract: Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images with powerful models and efficient algorithms yields good performance on scene recognition, additional information is always a bonus, for example, the corresponding audio information. In this paper, for improving the performance on the aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on an observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from the sound events to improve the performance on the aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition. The source code is publicly available for reproducibility purposes.this https URL.
38. Single-sample writers -- "Document Filter" and their impacts on writer identification [PDF] 返回目录
Fabio Pinhelli, Alceu S. Britto Jr, Luiz S. Oliveira, Yandre M. G. Costa, Diego Bertolini
Abstract: Writing can be used as an important biometric modality that allows an individual to be unequivocally identified. This is possible because the writings of two different persons present differences that can be explored both in terms of graphometric properties and by treating the manuscript as a digital image, using image processing techniques that can properly capture different visual attributes of the image (e.g. texture). In this work, we perform a detailed study in which we dissect whether the use of a database with only a single sample taken from some writers may skew the results obtained in the experimental protocol. In this sense, we propose here what we call the "document filter". The "document filter" protocol is supposed to be used as a preprocessing technique, in such a way that all the data taken from fragments of the same document are placed either in the training set or in the test set. The rationale behind it is that the classifier must capture the features of the writer himself or herself, and not features regarding other particularities which could affect the writing in a specific document (i.e. emotional state of the writer, pen used, paper type, etc.). By analyzing the literature, one can find several works dealing with the writer identification problem. However, the performance of writer identification systems must be evaluated while also taking into account the occurrence of writer volunteers who contributed a single sample during the creation of the manuscript databases. To address the open issue investigated here, a comprehensive set of experiments was performed on the IAM, BFL and CVL databases. They show that, in the most extreme case, the recognition rate obtained using the "document filter" protocol drops from 81.80% to 50.37%.
39. T-VSE: Transformer-Based Visual Semantic Embedding [PDF] 返回目录
Muhammet Bastan, Arnau Ramisa, Mehmet Tek
Abstract: Transformer models have recently achieved impressive performance on NLP tasks, owing to new algorithms for self-supervised pre-training on very large text corpora. In contrast, recent literature suggests that simple average word models outperform more complicated language models, e.g., RNNs and Transformers, on cross-modal image/text search tasks on standard benchmarks, like MS COCO. In this paper, we show that dataset scale and training strategy are critical and demonstrate that transformer-based cross-modal embeddings outperform word average and RNN-based embeddings by a large margin, when trained on a large dataset of e-commerce product image-title pairs.
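The "simple average word model" baseline the abstract contrasts against is only a few lines of PyTorch: mean-pool learned word vectors, project, and L2-normalize into the shared image-text space. The vocabulary size, embedding dimension, and projection head below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AvgWordTextEncoder(nn.Module):
    # Mean-pooled word embeddings projected into a shared embedding space,
    # to be trained against image embeddings with a ranking loss (not shown).
    # All sizes are assumptions for illustration.
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids, offsets):
        return F.normalize(self.proj(self.emb(token_ids, offsets)), dim=-1)

enc = AvgWordTextEncoder()
tokens = torch.tensor([3, 17, 42, 7, 9])     # two product titles, flattened
offsets = torch.tensor([0, 3])               # title 1 = first 3 tokens
print(enc(tokens, offsets).shape)            # torch.Size([2, 256])
```

The paper's claim is that a Transformer encoder in place of this mean pooling wins by a large margin once the training set is big enough, which is why dataset scale is singled out as the deciding factor.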
40. Detecting Forged Facial Videos using convolutional neural network [PDF] 返回目录
Neilesh Sambhu, Shaun Canavan
Abstract: In this paper, we propose to detect forged videos of faces in online videos. To facilitate this detection, we propose to use smaller (fewer parameters to learn) convolutional neural networks (CNNs) for a data-driven approach to forged video detection. To validate our approach, we investigate the FaceForensics public dataset, detailing both frame-based and video-based results. The proposed method is shown to outperform the current state of the art. We also perform an ablation study, analyzing the impact of batch size, number of filters, and number of network layers on the accuracy of detecting forged videos.
41. Facial Action Unit Detection using 3D Facial Landmarks [PDF] 返回目录
Saurabh Hinduja, Shaun Canavan
Abstract: In this paper, we propose to detect facial action units (AU) using 3D facial landmarks. Specifically, we train a 2D convolutional neural network (CNN) on 3D facial landmarks, tracked using a shape index-based statistical shape model, for binary and multi-class AU detection. We show that the proposed approach is able to accurately model AU occurrences, as the movement of the facial landmarks corresponds directly to the movement of the AUs. By training a CNN on 3D landmarks, we can achieve accurate AU detection on two state-of-the-art emotion datasets, namely BP4D and BP4D+. Using the proposed method, we detect multiple AUs on over 330,000 frames, reporting improved results over state-of-the-art methods.
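One plausible way to feed 3D landmarks to a 2D CNN is sketched below: stack the x/y/z coordinates of L tracked landmarks over T frames into a 3-channel T-by-L map. This input layout, the network width, and the 12-AU head are assumptions for illustration, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LandmarkAUNet(nn.Module):
    # Small 2D CNN over a (3, T, L) tensor: channels = x/y/z coordinates,
    # spatial axes = time and landmark index. Illustrative layout only.
    def __init__(self, num_aus=12):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_aus)    # one logit per action unit

    def forward(self, x):
        return self.head(self.body(x).flatten(1))

net = LandmarkAUNet()
clips = torch.randn(2, 3, 16, 83)             # 16 frames, 83 landmarks
logits = net(clips)                           # (2, 12) multi-label AU logits
loss = F.binary_cross_entropy_with_logits(logits, torch.rand(2, 12).round())
```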
42. Impact of multiple modalities on emotion recognition: investigation into 3d facial landmarks, action units, and physiological data [PDF] 返回目录
Diego Fabiano, Manikandan Jaishanker, Shaun Canavan
Abstract: To fully understand the complexities of human emotion, the integration of multiple physical features from different modalities can be advantageous. Considering this, we present an analysis of 3D facial data, action units, and physiological data as it relates to their impact on emotion recognition. We analyze each modality independently, as well as the fusion of each for recognizing human emotion. This analysis includes which features are most important for specific emotions (e.g. happy). Our analysis indicates that both 3D facial landmarks and physiological data are encouraging for expression/emotion recognition. On the other hand, while action units can positively impact emotion recognition when fused with other modalities, the results suggest it is difficult to detect emotion using them in a unimodal fashion.
43. Subject Identification Across Large Expression Variations Using 3D Facial Landmarks [PDF] 返回目录
Sk Rahatul Jannat, Diego Fabiano, Shaun Canavan, Tempestt Neal
Abstract: Landmark localization is an important first step towards geometric based vision research including subject identification. Considering this, we propose to use 3D facial landmarks for the task of subject identification over a range of expressed emotion. Landmarks are detected using a Temporal Deformable Shape Model and used to train a Support Vector Machine (SVM), Random Forest (RF), and Long Short-term Memory (LSTM) neural network for subject identification. As we are interested in subject identification with large variations in expression, we conducted experiments on 3 emotion-based databases, namely the BU-4DFE, BP4D, and BP4D+ 3D/4D face databases. We show that our proposed method outperforms current state of the art methods for subject identification on BU-4DFE and BP4D. To the best of our knowledge, this is the first work to investigate subject identification on the BP4D+, resulting in a baseline for the community.
44. A Survey on Unknown Presentation Attack Detection for Fingerprint [PDF] 返回目录
Jag Mohan Singh, Ahmed Madhun, Guoqiang Li, Raghavendra Ramachandra
Abstract: Fingerprint recognition systems are widely deployed in various real-life applications as they have achieved high accuracy. Widely used applications include border control, automated teller machines (ATMs), and attendance monitoring systems. However, these critical systems are prone to spoofing attacks (a.k.a. presentation attacks, PAs). A PA against a fingerprint system can be performed by presenting gummy fingers made from different materials such as silicone, gelatine, play-doh, ecoflex, 2D printed paper, 3D printed material, or latex. Biometrics researchers have developed Presentation Attack Detection (PAD) methods as a countermeasure to PAs. PAD is usually done by training a machine learning classifier for known attacks on a given dataset, and such classifiers achieve high accuracy in this task. However, generalizing to unknown attacks is an essential problem for applicability to real-world systems, mainly because attacks cannot be exhaustively listed in advance. In this paper, we present a comprehensive survey of existing PAD algorithms for fingerprint recognition systems, specifically from the standpoint of detecting unknown PAs. We categorize PAD algorithms, point out their advantages and disadvantages, and outline future directions for this area.
45. AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction [PDF] 返回目录
Alessia Bertugli, Simone Calderara, Pasquale Coscia, Lamberto Ballan, Rita Cucchiara
Abstract: Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video-surveillance applications. An important aspect of this task is the inherently multi-modal nature of human paths, which admits multiple socially-acceptable futures when human interactions are involved. To this end, we propose a new generative model for multi-future trajectory prediction based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning relies on prior belief maps, representing the most likely moving directions and forcing the model to consider the collective agents' motion. Human interactions are modeled in a structured way with a graph attention mechanism, providing an online attentive hidden-state refinement of the recurrent estimation. Compared to sequence-to-sequence methods, our model operates step-by-step, generating more refined and accurate predictions. To corroborate our model, we perform extensive experiments on publicly-available datasets (ETH, UCY and Stanford Drone Dataset) and demonstrate its effectiveness compared to state-of-the-art methods.
46. A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer [PDF] 返回目录
Vladimir Iashin, Esa Rahtu
Abstract: Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting only visual features, while completely neglecting the audio track. Only a few prior works have utilized both modalities, yet they either show poor results or demonstrate their importance only on a dataset from a specific domain. In this paper, we introduce the Bi-modal Transformer, which generalizes the Transformer architecture for a bi-modal input. We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, and the module is capable of taking any two modalities as input in a sequence-to-sequence task. We show that a bi-modal encoder, pre-trained together with a bi-modal decoder for captioning, can be used as a feature extractor for a simple proposal generation module. The performance is demonstrated on the challenging ActivityNet Captions dataset, where our model achieves outstanding performance.
47. FuCiTNet: Improving the generalization of deep learning networks by the fusion of learned class-inherent transformations [PDF] 返回目录
Manuel Rey-Area, Emilio Guirado, Siham Tabik, Javier Ruiz-Hidalgo
Abstract: It is widely known that very small datasets produce overfitting in Deep Neural Networks (DNNs), i.e., the network becomes highly biased toward the data it has been trained on. This issue is often alleviated using transfer learning, regularization techniques and/or data augmentation. This work presents a new approach, independent of but complementary to the previously mentioned techniques, for improving the generalization of DNNs on very small datasets in which the involved classes share many visual features. The proposed methodology, called FuCiTNet (Fusion Class inherent Transformations Network), inspired by GANs, creates as many generators as classes in the problem. Each generator, $k$, learns the transformations that bring the input image into the k-class domain. We introduce a classification loss in the generators to drive the learning of specific k-class transformations. Our experiments demonstrate that the proposed transformations improve the generalization of the classification model on three diverse datasets.
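A compact sketch of the per-class-generator idea under stated assumptions: one lightweight generator per class, with a shared classifier's loss on generator k's output (target class k) driving the learning of k-class transformations. The adversarial discriminators of the full FuCiTNet are omitted, and all architecture sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGenerator(nn.Module):
    # Lightweight image-to-image generator; illustrative architecture only.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

num_classes = 3
generators = nn.ModuleList(TinyGenerator() for _ in range(num_classes))
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))

x = torch.randn(4, 3, 32, 32)
# Classification loss on generator k's output, with target class k, drives
# the learning of k-class-inherent transformations.
cls_loss = sum(
    F.cross_entropy(classifier(g(x)), torch.full((4,), k, dtype=torch.long))
    for k, g in enumerate(generators)
)
cls_loss.backward()
```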
48. Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation [PDF] 返回目录
Boris Knyazev, Harm de Vries, Cătălina Cangea, Graham W. Taylor, Aaron Courville, Eugene Belilovsky
Abstract: Scene graph generation (SGG) aims to predict graph-structured descriptions of input images, in the form of objects and relationships between them. This task is becoming increasingly useful for progress at the interface of vision and language. Here, it is important - yet challenging - to perform well on novel (zero-shot) or rare (few-shot) compositions of objects and relationships. In this paper, we identify two key issues that limit such generalization. Firstly, we show that the standard loss used in this task is unintentionally a function of scene graph density. This leads to the neglect of individual edges in large sparse graphs during training, even though these contain diverse few-shot examples that are important for generalization. Secondly, the frequency of relationships can create a strong bias in this task, such that a blind model predicting the most frequent relationship achieves good performance. Consequently, some state-of-the-art models exploit this bias to improve results. We show that such models can suffer the most in their ability to generalize to rare compositions, evaluating two different models on the Visual Genome dataset and its more recent, improved version, GQA. To address these issues, we introduce a density-normalized edge loss, which provides more than a two-fold improvement in certain generalization metrics. Compared to other works in this direction, our enhancements require only a few lines of code and no added computational cost. We also highlight the difficulty of accurately evaluating models using existing metrics, especially on zero/few shots, and introduce a novel weighted metric.
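A minimal sketch of a density-normalized edge loss, assuming the normalizer is the number of annotated (foreground) edges per image rather than all candidate edges; the paper's exact normalizer may differ.

```python
import torch
import torch.nn.functional as F

def density_normalized_edge_loss(edge_logits, edge_labels):
    # edge_logits: (E, R) predicate scores for the E candidate edges of one
    # image (R relationship classes, class 0 = background / no edge).
    # edge_labels: (E,) ground-truth predicate index per candidate edge.
    # Averaging cross-entropy over all E candidates makes the loss a
    # function of graph density; dividing by the labeled-edge count instead
    # keeps sparse graphs from being washed out during training.
    per_edge = F.cross_entropy(edge_logits, edge_labels, reduction="none")
    num_fg = (edge_labels > 0).sum().clamp_min(1)
    return per_edge.sum() / num_fg

logits = torch.randn(50, 51, requires_grad=True)   # 50 candidates, 50 predicates + bg
labels = torch.zeros(50, dtype=torch.long)
labels[:3] = torch.tensor([5, 12, 7])              # only 3 annotated edges
density_normalized_edge_loss(logits, labels).backward()
```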
49. Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis [PDF] 返回目录
K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
Abstract: Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in unconstrained, large vocabulary settings. To this end, we collect and release a large-scale benchmark dataset, the first of its kind, specifically to train and evaluate the single-speaker lip to speech task in natural settings. We propose a novel approach with key design choices to achieve accurate, natural lip to speech synthesis in such unconstrained scenarios for the first time. Extensive evaluation using quantitative, qualitative metrics and human evaluation shows that our method is four times more intelligible than previous works in this space. Please check out our demo video for a quick overview of the paper, method, and qualitative results. this https URL
50. Hyperspectral Image Classification Based on Sparse Modeling of Spectral Blocks [PDF] 返回目录
Saeideh Ghanbari Azar, Saeed Meshgini, Tohid Yousefi Rezaii, Soosan Beheshti
Abstract: Hyperspectral images provide abundant spatial and spectral information that is very valuable for material detection in diverse areas of practical science. The high dimensionality of the data leads to many processing challenges, which can be addressed via the existing spatial and spectral redundancies. In this paper, a sparse modeling framework is proposed for hyperspectral image classification. Spectral blocks are introduced to be used along with spatial groups to jointly exploit spectral and spatial redundancies. To reduce the computational complexity of sparse modeling, spectral blocks are used to break the high-dimensional optimization problems into small-size sub-problems that are faster to solve. Furthermore, the proposed sparse structure enables extracting the most discriminative spectral blocks, further reducing the computational burden. Experiments on three benchmark datasets, including the Pavia University and Indian Pines images, verify that the proposed method leads to a robust sparse modeling of hyperspectral images and improves the classification accuracy compared to several state-of-the-art methods. Moreover, the experiments demonstrate that the proposed method requires less processing time.
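A minimal sketch of the spectral-block idea, assuming per-block sparse coding with scikit-learn's orthogonal matching pursuit: splitting the B bands into blocks turns one large sparse-coding problem into several small, faster ones. The block size, sparsity level, and dictionary construction are illustrative assumptions; the classifier built on the codes is omitted.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def blockwise_sparse_codes(pixel, dictionary, block=20, n_nonzero=3):
    # pixel: (B,) spectrum of one pixel; dictionary: (B, K) training
    # spectra as atoms. Each spectral block is sparse-coded independently,
    # which is the complexity argument sketched in the abstract.
    codes = []
    for s in range(0, len(pixel), block):
        D, y = dictionary[s:s + block], pixel[s:s + block]
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero).fit(D, y)
        codes.append(omp.coef_)
    return np.concatenate(codes)

B, K = 200, 60
D = np.random.randn(B, K)
x = D @ (np.random.rand(K) * (np.random.rand(K) < 0.05))   # sparse ground truth
print(blockwise_sparse_codes(x, D).shape)                  # (600,) = 10 blocks x 60 atoms
```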
51. Co-occurrence Based Texture Synthesis [PDF] 返回目录
Anna Darzi, Itai Lang, Ashutosh Taklikar, Hadar Averbuch-Elor, Shai Avidan
Abstract: We model local texture patterns using the co-occurrence statistics of pixel values. We then train a generative adversarial network, conditioned on co-occurrence statistics, to synthesize new textures from the co-occurrence statistics and a random noise seed. Co-occurrences have long been used to measure similarity between textures. That is, two textures are considered similar if their corresponding co-occurrence matrices are similar. By the same token, we show that multiple textures generated from the same co-occurrence matrix are similar to each other. This gives rise to a new texture synthesis algorithm. We show that co-occurrences offer a stable, intuitive and interpretable latent representation for texture synthesis. Our technique can be used to generate a smooth texture morph between two textures, by interpolating between their corresponding co-occurrence matrices. We further show an interactive texture tool that allows a user to adjust local characteristics of the synthesized texture image using the co-occurrence values directly.
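Since the generator is conditioned on co-occurrence statistics, it helps to see how such a matrix is collected. Below is a plain numpy sketch of a symmetric, normalized co-occurrence matrix of quantized pixel values at a fixed offset; the paper collects these statistics with its own quantization and weighting choices, which are not reproduced here.

```python
import numpy as np

def cooccurrence(img, levels=8, offset=(0, 1)):
    # img: 2D grayscale array in [0, 1); offset: displacement between the
    # pixel pairs being counted. Returns a (levels, levels) matrix summing
    # to 1. Global version; the paper works over local windows.
    q = np.minimum((img * levels).astype(int), levels - 1)
    dy, dx = offset
    a = q[max(dy, 0):q.shape[0] + min(dy, 0), max(dx, 0):q.shape[1] + min(dx, 0)]
    b = q[max(-dy, 0):q.shape[0] + min(-dy, 0), max(-dx, 0):q.shape[1] + min(-dx, 0)]
    m = np.zeros((levels, levels))
    np.add.at(m, (a.ravel(), b.ravel()), 1.0)    # count each shifted pair
    m = m + m.T                                  # make it symmetric
    return m / m.sum()

tex = np.random.rand(64, 64)
M = cooccurrence(tex)
print(M.shape, M.sum())                          # (8, 8) 1.0
```

Two textures with similar matrices of this kind are considered similar, which is the property the paper turns around: sampling many textures from one matrix and a noise seed.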
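The conditioning signal is easy to reproduce: a co-occurrence matrix simply counts how often pixel value i appears next to value j at a fixed offset. A minimal numpy sketch (not the authors' code; the quantization level count and the offset are illustrative):

```python
# Count co-occurring quantized pixel values at a fixed (non-negative) offset.
import numpy as np

def cooccurrence(img, levels=8, dy=0, dx=1):
    """img: 2-D array of pixel values; returns a (levels, levels) joint distribution."""
    q = np.clip((img / (img.max() + 1e-8) * levels).astype(int), 0, levels - 1)
    src = q[: q.shape[0] - dy, : q.shape[1] - dx]   # reference pixels
    dst = q[dy:, dx:]                               # their offset neighbours
    C = np.zeros((levels, levels))
    np.add.at(C, (src.ravel(), dst.ravel()), 1)     # accumulate pair counts
    return C / C.sum()                              # normalise to a distribution
```

Two textures are then compared by comparing these matrices, and a synthesized texture is judged faithful if it reproduces the matrix of its source.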
52. Neural Networks for Fashion Image Classification and Visual Search [PDF] 返回目录
Fengzi Li, Shashi Kant, Shunichi Araki, Sumer Bangera, Swapna Samir Shukla
Abstract: We discuss two potentially challenging problems faced by the ecommerce industry. The first concerns sellers uploading pictures of products to the platform for sale and the manual tagging this entails: mistagging gives rise to misclassifications, which can keep a product out of search results. The second concerns a potential bottleneck in placing orders when a customer does not know the right keywords but has a visual impression of an image. An image-based search algorithm can unleash the true potential of ecommerce by enabling customers to photograph an object and search for similar products without the need for typing. In this paper, we explore machine learning algorithms which can help us solve both of these problems.
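As a point of reference for the visual search problem, one common realisation (a generic sketch under assumed names, not the specific pipeline explored in this paper) embeds every catalogue image with a CNN offline and ranks products by cosine similarity at query time:

```python
# Generic embedding-based retrieval: nearest catalogue items to a query photo.
import numpy as np

def search(query_vec, catalog_vecs, top_k=5):
    """query_vec: (D,) query embedding; catalog_vecs: (N, D) precomputed embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    c = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity to the query
    return np.argsort(-scores)[:top_k]  # indices of the most similar products
```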
53. FA-GANs: Facial Attractiveness Enhancement with Generative Adversarial Networks on Frontal Faces [PDF] 返回目录
Jingwu He, Chuan Wang, Yang Zhang, Jie Guo, Yanwen Guo
Abstract: Facial attractiveness enhancement has been an interesting application in Computer Vision and Graphics in recent years. It aims to generate a more attractive face via manipulations of image and geometry structure while preserving face identity. In this paper, we propose the first Generative Adversarial Networks (GANs) for enhancing facial attractiveness in both geometry and appearance aspects, which we call "FA-GANs". FA-GANs contain two branches and enhance facial attractiveness from two perspectives: facial geometry and facial appearance. Each branch consists of an individual GAN, with the appearance branch adjusting the facial image and the geometry branch adjusting the facial landmarks, respectively. Unlike traditional facial manipulation methods that learn from paired faces, which are infeasible to collect since before-and-after images of the same individual are rarely available, we achieve this by learning the features of attractive faces through unsupervised adversarial learning. The proposed FA-GANs are able to extract attractiveness features and impose them on the enhancement results. To better enhance faces, the geometry and appearance networks refine facial attractiveness by adjusting the geometric layout of faces and the appearance of faces independently. To the best of our knowledge, we are the first to enhance facial attractiveness with GANs in both geometry and appearance aspects. The experimental results suggest that our FA-GANs can generate compelling perceptual results in both geometry structure and facial appearance, and outperform current state-of-the-art methods.
54. Three-Filters-to-Normal: An Accurate and Ultrafast Surface Normal Estimator [PDF] 返回目录
Rui Fan, Hengli Wang, Bohuan Xue, Huaiyang Huang, Yuan Wang, Ming Liu, Ioannis Pitas
Abstract: Over the past decade, significant efforts have been made to improve the trade-off between speed and accuracy of surface normal estimators (SNEs). This paper introduces an accurate and ultrafast SNE for structured range data. The proposed approach computes surface normals by simply performing three filtering operations, namely, two image gradient filters (in the horizontal and vertical directions, respectively) and a mean/median filter, on an inverse depth image or a disparity image. Despite its simplicity, no similar method exists in the literature. In our experiments, we created three large-scale synthetic datasets (easy, medium and hard) using 24 3-dimensional (3D) mesh models. Each mesh model is used to generate 1800--2500 pairs of 480x640 pixel depth images and the corresponding surface normal ground truth from different views. The average angular errors on the easy, medium and hard datasets are 1.6 degrees, 5.6 degrees and 15.3 degrees, respectively. Our C++ and CUDA implementations achieve processing speeds of over 260 Hz and 21 kHz, respectively. Our proposed SNE achieves better overall performance than all other existing computer vision-based SNEs. Our datasets and source code are publicly available at: this http URL.
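A hedged sketch of the three-filter recipe described above (a simplified reading, not the authors' released code): two image gradient filters on the inverse depth give the x/y normal components up to focal-length scaling, and the third (mean) filter smooths a local estimate of the z component. Borders and sign orientation are ignored for brevity.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def three_filter_normals(Z, fx, fy, cu, cv):
    """Z: (H, W) depth map; fx, fy, cu, cv: pinhole intrinsics. Returns unit normals."""
    H, W = Z.shape
    v, u = np.mgrid[0:H, 0:W].astype(float)
    inv = 1.0 / np.clip(Z, 1e-6, None)               # inverse depth image
    nx = fx * sobel(inv, axis=1)                     # filter 1: horizontal gradient
    ny = fy * sobel(inv, axis=0)                     # filter 2: vertical gradient
    X, Y = (u - cu) * Z / fx, (v - cv) * Z / fy      # back-projected 3-D points
    dX, dY, dZ = [np.roll(a, -1, axis=1) - a for a in (X, Y, Z)]
    dZ = np.where(np.abs(dZ) < 1e-6, 1e-6, dZ)       # avoid division by zero
    nz = uniform_filter(-(nx * dX + ny * dY) / dZ, size=3)  # filter 3: mean filter
    n = np.stack([nx, ny, nz], axis=-1)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)
```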
55. High-dimensional Convolutional Networks for Geometric Pattern Recognition [PDF] 返回目录
Christopher Choy, Junha Lee, Rene Ranftl, Jaesik Park, Vladlen Koltun
Abstract: Many problems in science and engineering can be formulated in terms of geometric patterns in high-dimensional spaces. We present high-dimensional convolutional networks (ConvNets) for pattern recognition problems that arise in the context of geometric registration. We first study the effectiveness of convolutional networks in detecting linear subspaces in high-dimensional spaces with up to 32 dimensions: much higher dimensionality than prior applications of ConvNets. We then apply high-dimensional ConvNets to 3D registration under rigid motions and image correspondence estimation. Experiments indicate that our high-dimensional ConvNets outperform prior approaches that relied on deep networks based on global pooling operators.
56. Train in Germany, Test in The USA: Making 3D Object Detectors Generalize [PDF] 返回目录
Yan Wang, Xiangyu Chen, Yurong You, Li Erran, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao
Abstract: In the domain of autonomous driving, deep learning has substantially improved 3D object detection accuracy for LiDAR and stereo camera data alike. While deep networks are great at generalization, they are also notorious for over-fitting to all kinds of spurious artifacts, such as brightness, car sizes and models, that may appear consistently throughout the data. In fact, most datasets for autonomous driving are collected within a narrow subset of cities within one country, typically under similar weather conditions. In this paper we consider the task of adapting 3D object detectors from one dataset to another. We observe that, naively, this appears to be a very challenging task, resulting in drastic drops in accuracy. We provide extensive experiments to investigate the true adaptation challenges and arrive at a surprising conclusion: the primary adaptation hurdle to overcome is the difference in car sizes across geographic areas. A simple correction based on the average car size closes much of the adaptation gap. Our proposed method is simple and easily incorporated into most 3D object detection frameworks. It provides a first baseline for 3D object detection adaptation across countries, and gives hope that the underlying problem may be more within grasp than one might have believed. Our code is available at this https URL.
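The abstract's "simple correction" reduces to a few lines. In this sketch the function name and the mean car dimensions are made-up placeholders, not the paper's statistics:

```python
import numpy as np

def correct_box_sizes(pred_lwh, source_mean_lwh, target_mean_lwh):
    """Rescale predicted (length, width, height) of detected cars by the
    ratio of average car dimensions between geographies."""
    scale = np.asarray(target_mean_lwh) / np.asarray(source_mean_lwh)
    return np.asarray(pred_lwh) * scale

# Placeholder statistics only, for illustration:
boxes = correct_box_sizes([[4.0, 1.7, 1.4]],
                          source_mean_lwh=[4.2, 1.8, 1.5],
                          target_mean_lwh=[4.9, 1.9, 1.6])
```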
57. VPR-Bench: An Open-Source Visual Place Recognition Evaluation Framework with Quantifiable Viewpoint and Appearance Change [PDF] 返回目录
Mubariz Zaffar, Shoaib Ehsan, Michael Milford, David Flynn, Klaus McDonald-Maier
Abstract: Visual Place Recognition (VPR) is the process of recognising a previously visited place using visual information, often under varying appearance conditions and viewpoint changes and with computational constraints. VPR is a critical component of many autonomous navigation systems, ranging from autonomous vehicles to drones. While the concept of place recognition has been around for many years, VPR research has grown rapidly as a field over the past decade, due both to improvements in camera hardware and to the field's suitability for deep learning-based techniques. With this growth, however, has come field fragmentation, a lack of standardisation, and a disconnect between current performance metrics and the actual utility of a VPR technique at application deployment. In this paper we address these key challenges through a new comprehensive open-source evaluation framework, dubbed 'VPR-Bench'. VPR-Bench introduces two much-needed capabilities for researchers: firstly, quantification of viewpoint and illumination variation, replacing what has largely been assessed qualitatively in the past, and secondly, new metrics: 'Extended Precision' (EP), 'Performance-Per-Compute-Unit' (PCU) and 'Number of Prospective Place Matching Candidates' (NPPMC). These new metrics complement the limitations of traditional Precision-Recall curves by providing measures that are more informative for the wide range of potential VPR applications. Mechanistically, we develop new unified templates that facilitate the implementation, deployment and evaluation of a wide range of VPR techniques and datasets. We incorporate the most comprehensive combination of state-of-the-art VPR techniques and datasets to date into VPR-Bench and demonstrate how it provides a rich range of previously inaccessible insights, such as the nuanced relationships among viewpoint invariance, different types of VPR techniques, and datasets.
58. From Boundaries to Bumps: when closed (extremal) contours are critical [PDF] 返回目录
Benjamin Kunsberg, Steven W. Zucker
Abstract: Invariants underlying shape inference are elusive: a variety of shapes can give rise to the same image, and a variety of images can be rendered from the same shape. The occluding contour is a rare exception: it has both image salience, in terms of isophotes, and surface meaning, in terms of surface normal. We relax the notion of occluding contour to define closed extremal curves, a new shape invariant that exists at the topological level. They surround bumps, a common but ill-specified interior shape component, and formalize the qualitative nature of bump perception. Extremal curves are biologically computable, unify shape inferences from shading, texture, and specular materials, and predict new phenomena in bump perception.
59. Mutual Information Maximization for Robust Plannable Representations [PDF] 返回目录
Yiming Ding, Ignasi Clavera, Pieter Abbeel
Abstract: Extending the capabilities of robotics to real-world complex, unstructured environments requires developing better perception systems while maintaining low sample complexity. When dealing with high-dimensional state spaces, current methods are either model-free, or model-based with reconstruction objectives. The sample inefficiency of the former constitutes a major barrier to applying them in the real world. The latter, while exhibiting low sample complexity, learn latent spaces that must reconstruct every single detail of the scene. In real environments, the task typically concerns only a small fraction of the scene. Reconstruction objectives suffer in such scenarios, as they capture all the unnecessary components. In this work, we present MIRO, an information-theoretic representation learning algorithm for model-based reinforcement learning. We design a latent space that maximizes the mutual information with future information while being able to capture all the information needed for planning. We show that our approach is more robust than reconstruction objectives in the presence of distractors and cluttered scenes.
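One standard way to maximise mutual information between latent codes and future observations is an InfoNCE-style contrastive bound; the sketch below is illustrative only, and the paper's exact estimator may differ:

```python
# InfoNCE: each latent's true future is a positive, all others are negatives.
import numpy as np

def infonce_loss(z, z_future):
    """z, z_future: (N, D) paired latent codes; lower loss = tighter MI bound."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_future = z_future / np.linalg.norm(z_future, axis=1, keepdims=True)
    logits = z @ z_future.T                      # (N, N) similarity scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))        # positives sit on the diagonal
```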
60. Analytic Signal Phase in $N-D$ by Linear Symmetry Tensor--fingerprint modeling [PDF] 返回目录
Josef Bigun, Fernando Alonso-Fernandez
Abstract: We reveal that the Analytic Signal phase and its gradient have a hitherto unstudied discontinuity in $2$-D and higher dimensions. This shortcoming can result in severe artifacts, whereas the problem does not exist in $1$-D signals. Direct use of the Gabor phase, or its gradient, in computer vision and biometric recognition, e.g., as done in influential studies \cite{fleet90,wiskott1997face}, may produce undesired results that will go unnoticed unless special images similar to ours reveal them. Instead of the Analytic Signal phase, we suggest the use of the Linear Symmetry phase, relying on more than one set of Gabor filters but with a negligible computational add-on, as a remedy. Gradient magnitudes of this phase are continuous, in contrast to those of the analytic signal, whereas continuity of the gradient direction of the phase is guaranteed if the Linear Symmetry Tensor replaces the gradient vector. The suggested phase also has a built-in automatic scale estimator, useful for robust detection of patterns by multi-scale processing. We show crucial concepts on synthesized fingerprint images, where the ground truth regarding instantaneous frequency (scale \& direction) and phase is known, with favorable results. A comparison to a baseline alternative is also reported. To that end, a novel multi-scale minutia model, in which the location, direction, and scale parameters of minutiae are steerable without creating uncontrollable minutiae, is also presented. This is a useful tool for reducing the development time of minutia detection methods with explainable behavior. A revealed consequence is that minutia directions are not determined by the linear phase alone, but also by each other, and this influence must be corrected to obtain steerability and accurate ground truths. Essential conclusions are readily transferable to $N$-D, and to unrelated applications, e.g., optical flow or disparity estimation in stereo.
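The Linear Symmetry Tensor is the classical structure tensor, and its continuity advantage is easy to see in code: orientation comes out in double-angle form, so it has no 180-degree phase jump. A minimal 2-D sketch (filter choices and parameter values are illustrative):

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def linear_symmetry_tensor(img, sigma=2.0):
    """Returns per-pixel orientation (double-angle form) and coherence."""
    gx, gy = sobel(img, axis=1), sobel(img, axis=0)
    # Smooth the outer product of the gradient (the tensor components).
    Jxx = gaussian_filter(gx * gx, sigma)
    Jxy = gaussian_filter(gx * gy, sigma)
    Jyy = gaussian_filter(gy * gy, sigma)
    orientation = 0.5 * np.arctan2(2 * Jxy, Jxx - Jyy)   # continuous, no phase jump
    coherence = np.hypot(Jxx - Jyy, 2 * Jxy) / (Jxx + Jyy + 1e-12)
    return orientation, coherence
```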
61. Single-Stage Semantic Segmentation from Image Labels [PDF] 返回目录
Nikita Araslanov, Stefan Roth
Abstract: Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e., with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage -- training one segmentation network on image labels -- which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.
62. Universal Adversarial Perturbations: A Survey [PDF] 返回目录
Ashutosh Chaubey, Nikhil Agrawal, Kavya Barnwal, Keerat K. Guliani, Pramod Mehta
Abstract: Over the past decade, Deep Learning has emerged as a useful and efficient tool to solve a wide variety of complex learning problems, ranging from image classification to human pose estimation, which are challenging to solve using statistical machine learning algorithms. However, despite their superior performance, deep neural networks are susceptible to adversarial perturbations, which can cause a network's prediction to change without making perceptible changes to the input image, thus creating severe security issues at the time of deployment of such systems. Recent works have shown the existence of Universal Adversarial Perturbations, which, when added to any image in a dataset, cause it to be misclassified when passed through a target model. Such perturbations are more practical to deploy since there is minimal computation done during the actual attack. Several techniques have also been proposed to defend neural networks against these perturbations. In this paper, we attempt to provide a detailed discussion on the various data-driven and data-independent methods for generating universal perturbations, along with measures to defend against such perturbations. We also cover the applications of such universal perturbations in various deep learning tasks.
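The data-driven recipe surveyed here can be caricatured in a few lines: accumulate per-image adversarial steps into a single perturbation and keep projecting it onto an L-infinity ball so it stays quasi-imperceptible. The linear "model" below is a stand-in for a trained network, and all sizes are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32 * 32))         # stand-in for a trained classifier

def predict(x):
    return int(np.argmax(W @ x))

images = rng.normal(size=(200, 32 * 32))
v = np.zeros(32 * 32)                      # the single, image-agnostic perturbation
eps, step = 0.1, 0.01
for x in images:
    if predict(x + v) == predict(x):       # this image is not fooled yet
        grad = W[predict(x + v)]           # gradient of the current class score
        v -= step * np.sign(grad)          # push that score down (FGSM-like step)
        v = np.clip(v, -eps, eps)          # project back onto the L-inf ball
```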
63. A Deep Learning based Wearable Healthcare IoT Device for AI-enabled Hearing Assistance Automation [PDF] 返回目录
Fraser Young, L Zhang, Richard Jiang, Han Liu, Conor Wall
Abstract: With the recent boom in artificial intelligence (AI), particularly deep learning techniques, digital healthcare is one of the prevalent areas that could benefit from AI-enabled functionality. This research presents a novel AI-enabled Internet of Things (IoT) device, built on the ESP-8266 platform, capable of assisting those who suffer from hearing impairment or deafness to communicate with others in conversations. In the proposed solution, a server application is created that leverages Google's online speech recognition service to convert received conversations into text, which is then shown on a micro-display attached to glasses, enabling deaf people to follow and take part in conversation as normal with the general population. Furthermore, in order to raise alerts about traffic or other dangerous scenarios, an 'urban-emergency' classifier is developed using a deep learning model, Inception-v4, with transfer learning to detect and recognize alerting or alarming sounds, such as a horn or a fire alarm, with text generated to alert the prospective user. The training of Inception-v4 was carried out on a consumer desktop PC and then implemented into the AI-based IoT application. The empirical results indicate that the developed prototype system achieves an accuracy rate of 92% for sound recognition and classification with real-time performance.
64. Visual Relationship Detection using Scene Graphs: A Survey [PDF] 返回目录
Aniket Agarwal, Ayush Mangal, Vipul
Abstract: Understanding a scene by decoding the visual relationships depicted in an image has long been a studied problem. While recent advances in deep learning and the usage of deep neural networks have achieved near-human accuracy on many tasks, there still exists a considerable gap between human- and machine-level performance when it comes to various visual relationship detection tasks. Building on earlier tasks like object recognition, segmentation and captioning, which focused on a relatively coarser image understanding, newer tasks have been introduced recently to deal with a finer level of image understanding. A Scene Graph is one such technique to better represent a scene and the various relationships present in it. With its wide range of applications in various tasks like Visual Question Answering, Semantic Image Retrieval and Image Generation, among many others, it has proved to be a useful tool for deeper and better visual relationship understanding. In this paper, we present a detailed survey of the various techniques for scene graph generation, their efficacy in representing visual relationships, and how they have been used to solve various downstream tasks. We also attempt to analyze the various directions in which the field might advance in the future. As one of the first papers to give a detailed survey on this topic, we also hope to give a succinct introduction to scene graphs and to guide practitioners in developing approaches for their applications.
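For readers new to the representation: a scene graph is simply objects as nodes and pairwise relationships as labelled edges. A minimal illustration follows; the field names mimic the common Visual Genome-style convention and are not prescribed by this survey:

```python
# Objects as nodes, relationships as labelled edges; bbox = (x1, y1, x2, y2).
scene_graph = {
    "objects": [
        {"id": 0, "label": "person", "bbox": (10, 20, 120, 300)},
        {"id": 1, "label": "horse",  "bbox": (100, 80, 400, 350)},
    ],
    "relationships": [
        {"subject": 0, "predicate": "riding", "object": 1},
    ],
}
```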
65. Towards in-store multi-person tracking using head detection and track heatmaps [PDF] 返回目录
Aibek Musaev, Jiangping Wang, Liang Zhu, Cheng Li, Yi Chen, Jialin Liu, Wanqi Zhang, Juan Mei, De Wang
Abstract: Computer vision algorithms are being implemented across a breadth of industries to enable technological innovations. In this paper, we study the problem of computer vision based customer tracking in the retail industry. To this end, we introduce a dataset collected from a camera in an office environment where participants mimic various behaviors of customers in a supermarket. In addition, we describe an illustrative example of the use of this dataset for tracking participants based on a head tracking model, in an effort to minimize errors due to occlusion. Furthermore, we propose a model for recognizing customers and staff based on their movement patterns. The model is evaluated using a real-world dataset collected in a supermarket over a 24-hour period, achieving 98% accuracy during training and 93% accuracy during evaluation.
66. Deep Lighting Environment Map Estimation from Spherical Panoramas [PDF] 返回目录
Vasileios Gkitsas, Nikolaos Zioulis, Federico Alvarez, Dimitrios Zarpalas, Petros Daras
Abstract: Estimating a scene's lighting is a very important task when compositing synthetic content within real environments, with applications in mixed reality and post-production. In this work we present a data-driven model that estimates an HDR lighting environment map from a single LDR monocular spherical panorama. In addition to being a challenging and ill-posed problem, the lighting estimation task also suffers from a lack of facile illumination ground truth data, a fact that hinders the applicability of data-driven methods. We approach this problem differently, exploiting the availability of surface geometry to employ image-based relighting as a data generator and supervision mechanism. This relies on a global Lambertian assumption that helps us overcome issues related to pre-baked lighting. We relight our training data and complement the model's supervision with a photometric loss, enabled by a differentiable image-based relighting technique. Finally, since we predict spherical spectral coefficients, we show that by imposing a distribution prior on the predicted coefficients, we can greatly boost performance. Code and models available at this https URL.
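Assuming the predicted "spherical spectral coefficients" are low-order spherical harmonics, as is standard in lighting estimation, nine coefficients per colour channel already reconstruct a smooth environment map. A sketch with the usual real-SH normalisation constants; the coefficients themselves would come from the network:

```python
import numpy as np

def sh_basis(d):
    """First 9 real spherical harmonics evaluated at unit directions d: (..., 3)."""
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),                  # l = 0
        0.488603 * y, 0.488603 * z, 0.488603 * x,    # l = 1
        1.092548 * x * y, 1.092548 * y * z,          # l = 2
        0.315392 * (3 * z ** 2 - 1),
        1.092548 * x * z, 0.546274 * (x ** 2 - y ** 2),
    ], axis=-1)

def radiance(d, coeffs):           # coeffs: (9,) predicted per colour channel
    return sh_basis(d) @ coeffs    # smooth reconstruction of the lighting map
```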
67. Non-Linearities Improve OrigiNet based on Active Imaging for Micro Expression Recognition [PDF] 返回目录
Monu Verma, Santosh Kumar Vipparthi, Girdhari Singh
Abstract: Micro expression recognition (MER) is a very challenging task, as the expressions are very short-lived in nature and demand feature modeling that involves both spatial and temporal dynamics. Existing MER systems exploit CNN networks to spot the significant features of minor muscle movements and subtle changes. However, existing networks fail to establish a relationship between spatial features of facial appearance and temporal variations of facial dynamics. Thus, these networks are not able to effectively capture minute variations and subtle changes in expressive regions. To address these issues, we introduce an active imaging concept to segregate active changes in the expressive regions of a video into a single frame while preserving facial appearance information. Moreover, we propose a shallow CNN network: a hybrid local receptive field based augmented learning network (OrigiNet) that efficiently learns significant features of the micro-expressions in a video. In this paper, we propose a new refined rectified linear unit (RReLU), which overcomes the problems of vanishing gradients and dying ReLUs. RReLU extends the range of derivatives compared to existing activation functions. The RReLU not only injects a nonlinearity but also captures the true edges by imposing additive and multiplicative properties. Furthermore, we present an augmented feature learning block to improve the learning capabilities of the network by embedding two parallel fully connected layers. The performance of the proposed OrigiNet is evaluated by conducting leave-one-subject-out experiments on four comprehensive ME datasets. The experimental results demonstrate that OrigiNet outperforms state-of-the-art techniques with less computational complexity.
68. JNCD-Based Perceptual Compression of RGB 4:4:4 Image Data [PDF] 返回目录
Lee Prangnell, Victor Sanchez
Abstract: In contemporary lossy image coding applications, a desired aim is to decrease bits per pixel as much as possible without inducing perceptually conspicuous distortions in RGB image data. In this paper, we propose a novel color-based perceptual compression technique, named RGB-PAQ. RGB-PAQ is based on CIELAB Just Noticeable Color Difference (JNCD) and Human Visual System (HVS) spectral sensitivity. We utilize CIELAB JNCD and HVS spectral sensitivity modeling to separately adjust quantization levels at the Coding Block (CB) level. In essence, our method is designed to capitalize on the inability of the HVS to perceptually differentiate photons in very similar wavelength bands. In terms of application, the proposed technique can be used with RGB (4:4:4) image data of various bit depths and spatial resolutions including, for example, true color and deep color images in HD and Ultra HD resolutions. In the evaluations, we compare RGB-PAQ with a set of anchor methods; namely, HEVC, JPEG, JPEG 2000 and Google WebP. Compared with HEVC HM RExt, RGB-PAQ achieves up to 77.8% bit reductions. The subjective evaluations confirm that the compression artifacts induced by RGB-PAQ are either imperceptible (MOS = 5) or near-imperceptible (MOS = 4) in the vast majority of cases.
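As a toy illustration of the JNCD idea (not the actual RGB-PAQ algorithm, and using a plain RGB channel difference instead of a proper CIELAB distance), one can pick the coarsest per-block quantization step whose round-trip error stays below a just-noticeable threshold:

```python
import numpy as np

def block_quant_step(block_rgb, jnd=2.3, steps=(1, 2, 4, 8, 16, 32)):
    """Hypothetical helper: choose the coarsest quantization step for
    a coding block whose round-trip error stays under a just-noticeable
    threshold. A real codec would measure the error in CIELAB."""
    best = steps[0]
    for q in steps:  # steps are ascending, so error grows with q
        recon = np.round(block_rgb / q) * q
        if np.abs(recon - block_rgb).max() <= jnd:
            best = q  # a coarser step is still imperceptible
    return best

block = np.random.randint(0, 256, size=(8, 8, 3)).astype(np.float32)
print(block_quant_step(block))  # a bigger step means fewer bits for this CB
```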
69. Deep feature fusion for self-supervised monocular depth prediction [PDF] 返回目录
Vinay Kaushik, Brejesh Lall
Abstract: Recent advances in end-to-end unsupervised learning have significantly improved the performance of monocular depth prediction and alleviated the requirement for ground-truth depth. Although a plethora of work has been done to enforce various structural constraints by incorporating multiple losses utilising smoothness, left-right consistency, regularisation and matching surface normals, few of them take into consideration the multi-scale structures present in real-world images. Most works utilise a VGG16 or ResNet50 model pre-trained on ImageNet for predicting depth. We propose a deep feature fusion method utilising features at multiple scales for learning self-supervised depth from scratch. Our fusion network selects features from both upper and lower levels at every level in the encoder network, thereby creating multiple feature pyramid sub-networks that are fed to the decoder after applying the CoordConv solution. We also propose a refinement module that learns higher-scale residual depth from a combination of higher-level deep features and lower-level residual depth, using a pixel-shuffling framework that super-resolves the lower-level residual depth. We select the KITTI dataset for evaluation and show that our proposed architecture can produce better or comparable results for depth prediction.
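The two named building blocks, CoordConv and pixel shuffling, are standard and easy to sketch in PyTorch; the dimensions below are illustrative, and the paper's full fusion architecture is not reproduced:

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    """Minimal CoordConv: append normalized x/y coordinate channels
    before a standard convolution, as the fusion network does before
    feeding the decoder."""
    def __init__(self, in_ch, out_ch, **kw):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 2, out_ch, **kw)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

# Pixel shuffling (sub-pixel convolution) super-resolves a coarse
# residual-depth map, the role it plays in the refinement module.
upsample = nn.Sequential(nn.Conv2d(32, 1 * 4, 3, padding=1), nn.PixelShuffle(2))

feat = torch.randn(2, 32, 48, 160)
print(CoordConv2d(32, 32, kernel_size=3, padding=1)(feat).shape)  # (2, 32, 48, 160)
print(upsample(feat).shape)  # (2, 1, 96, 320): 2x super-resolved depth
```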
70. Attribute2Font: Creating Fonts You Want From Attributes [PDF] 返回目录
Yizhi Wang, Yue Gao, Zhouhui Lian
Abstract: Font design is still considered an exclusive privilege of professional designers, whose creativity is not possessed by existing software systems. Nevertheless, we also notice that most commercial font products are in fact manually designed by following specific requirements on some attributes of glyphs, such as italic, serif, cursive, width, angularity, etc. Inspired by this fact, we propose a novel model, Attribute2Font, to automatically create fonts by synthesizing visually pleasing glyph images according to user-specified attributes and their corresponding values. To the best of our knowledge, our model is the first in the literature capable of generating glyph images in new font styles, instead of retrieving existing fonts, according to given values of specified font attributes. Specifically, Attribute2Font is trained to perform font style transfer between any two fonts conditioned on their attribute values. After training, our model can generate glyph images in accordance with an arbitrary set of font attribute values. Furthermore, a novel unit named the Attribute Attention Module is designed to make the generated glyph images better embody the prominent font attributes. Considering that annotations of font attribute values are extremely expensive to obtain, a semi-supervised learning scheme is also introduced to exploit a large number of unlabeled fonts. Experimental results demonstrate that our model achieves impressive performance on many tasks, such as creating glyph images in new font styles, editing existing fonts, and interpolating among different fonts.
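A minimal sketch of attribute-conditioned feature modulation follows; the actual Attribute Attention Module's design is not specified in the abstract, so per-channel attention weights derived from the attribute vector are an assumption, and the attribute count is arbitrary:

```python
import torch
import torch.nn as nn

class AttributeAttentionSketch(nn.Module):
    """Hypothetical attribute-conditioned modulation: per-channel
    attention weights computed from the attribute vector rescale the
    glyph feature maps so they better reflect the requested attributes."""
    def __init__(self, n_attrs=16, channels=64):
        super().__init__()
        self.to_weights = nn.Sequential(nn.Linear(n_attrs, channels), nn.Sigmoid())

    def forward(self, feat, attrs):
        w = self.to_weights(attrs)[:, :, None, None]  # (B, C, 1, 1)
        return feat * w  # emphasize attribute-relevant channels

feat, attrs = torch.randn(2, 64, 16, 16), torch.rand(2, 16)
print(AttributeAttentionSketch()(feat, attrs).shape)  # (2, 64, 16, 16)
```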
71. COCAS: A Large-Scale Clothes Changing Person Dataset for Re-identification [PDF] 返回目录
Shijie Yu, Shihua Li, Dapeng Chen, Rui Zhao, Junjie Yan, Yu Qiao
Abstract: Recent years have witnessed great progress in person re-identification (re-id). Several academic benchmarks such as Market1501, CUHK03 and DukeMTMC play important roles in promoting re-id research. To the best of our knowledge, all existing benchmarks assume the same person will wear the same clothes. In real-world scenarios, however, people change clothes very often. To address the clothes-changing person re-id problem, we construct a novel large-scale re-id benchmark named ClOthes ChAnging Person Set (COCAS), which provides multiple images of the same identity in different clothes. COCAS contains 62,382 body images of 5,266 persons in total. Based on COCAS, we introduce a new person re-id setting for the clothes-changing problem, where the query includes both a clothes template and a person image wearing different clothes. Moreover, we propose a two-branch network named the Biometric-Clothes Network (BC-Net), which can effectively integrate biometric and clothes features for re-id under our setting. Experiments show that clothes-changing re-id with clothes templates is feasible.
72. Partial Domain Adaptation Using Graph Convolutional Networks [PDF] 返回目录
Seunghan Yang, Youngeun Kim, Dongki Jung, Changick Kim
Abstract: Partial domain adaptation (PDA), in which we assume the target label space is included in the source label space, is a general version of standard domain adaptation. Since the target label space is unknown, the main challenge of PDA is to reduce the learning impact of irrelevant source samples, named outliers, which do not belong to the target label space. Although existing partial domain adaptation methods effectively down-weight the importance of outliers, they do not consider the data structure of each domain and do not directly align the feature distributions of the same class in the source and target domains, which may lead to misalignment of category-level distributions. To overcome these problems, we propose a graph partial domain adaptation (GPDA) network, which exploits Graph Convolutional Networks to jointly consider data structure and the feature distribution of each class. Specifically, we propose a label relational graph to align the distributions of the same category in the two domains and introduce moving average centroid separation for learning networks from the label relational graph. We demonstrate that considering data structure and the distribution of each category is effective for PDA, and our GPDA network achieves state-of-the-art performance on the Digit and Office-31 datasets.
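A minimal sketch of the moving-average centroid mechanism described above, with a simple squared-distance alignment term standing in for GPDA's actual loss weighting:

```python
import torch

def update_centroids(centroids, feats, labels, momentum=0.9):
    """Exponential-moving-average class centroids, the mechanism used
    to align same-class feature distributions across domains."""
    for c in labels.unique():
        centroids[c] = momentum * centroids[c] + \
                       (1 - momentum) * feats[labels == c].mean(dim=0)
    return centroids

def centroid_alignment_loss(src_centroids, tgt_centroids):
    # Pull same-class source and target centroids together.
    return ((src_centroids - tgt_centroids) ** 2).sum(dim=1).mean()

C, D = 10, 256  # illustrative class count and feature dimension
src_c, tgt_c = torch.zeros(C, D), torch.zeros(C, D)
feats, labels = torch.randn(32, D), torch.randint(0, C, (32,))  # a source batch
src_c = update_centroids(src_c, feats, labels)
print(centroid_alignment_loss(src_c, tgt_c))
```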
73. FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network [PDF] 返回目录
Francesco Piccoli, Rajarathnam Balakrishnan, Maria Jesus Perez, Moraldeepsingh Sachdeo, Carlos Nunez, Matthew Tang, Kajsa Andreasson, Kalle Bjurek, Ria Dass Raj, Ebba Davidsson, Colin Eriksson, Victor Hagman, Jonas Sjoberg, Ying Li, L. Srikar Muppirisetty, Sohini Roychowdhury
Abstract: Pedestrian intention recognition is very important for developing robust and safe autonomous driving (AD) and advanced driver assistance system (ADAS) functionalities for urban driving. In this work, we develop an end-to-end pedestrian intention framework that performs well in day- and night-time scenarios. Our framework relies on object detection bounding boxes combined with skeletal features of human pose. We study early, late, and combined (early and late) fusion mechanisms to exploit the skeletal features, reduce false positives, and improve the intention prediction performance. The early fusion mechanism results in an AP of 0.89 and precision/recall of 0.79/0.89 for pedestrian intention classification. Furthermore, we propose three new metrics to properly evaluate pedestrian intention systems. Under these new evaluation metrics for intention prediction, the proposed end-to-end network predicts pedestrian intention accurately up to half a second ahead of the actual risky maneuver.
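Schematically, early fusion concatenates the two feature streams before a joint head, while late fusion merges per-stream predictions at the output; the dimensions below are illustrative stand-ins, not the paper's:

```python
import torch
import torch.nn as nn

# Illustrative dimensions only: 128-d box-stream features and 64-d
# skeletal (pose) features per pedestrian, with a binary
# crossing / not-crossing intention output.
bbox_feat, skel_feat = torch.randn(8, 128), torch.randn(8, 64)

# Early fusion: concatenate the raw features, then one joint head.
early_head = nn.Sequential(nn.Linear(128 + 64, 64), nn.ReLU(), nn.Linear(64, 2))
early_logits = early_head(torch.cat([bbox_feat, skel_feat], dim=1))

# Late fusion: independent per-stream heads, merged at the output.
late_logits = nn.Linear(128, 2)(bbox_feat) + nn.Linear(64, 2)(skel_feat)

print(early_logits.shape, late_logits.shape)  # both (8, 2)
```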
74. WW-Nets: Dual Neural Networks for Object Detection [PDF] 返回目录
Mohammad K. Ebrahimpour, J. Ben Falandays, Samuel Spevack, Ming-Hsuan Yang, David C. Noelle
Abstract: We propose a new deep convolutional neural network framework that uses object location knowledge implicit in network connection weights to guide selective attention in object detection tasks. Our approach is called What-Where Nets (WW-Nets), and it is inspired by the structure of human visual pathways. In the brain, vision incorporates two separate streams, one in the temporal lobe and the other in the parietal lobe, called the ventral stream and the dorsal stream, respectively. The ventral pathway from primary visual cortex is dominated by "what" information, while the dorsal pathway is dominated by "where" information. Inspired by this structure, we have proposed an object detection framework involving the integration of a "What Network" and a "Where Network". The aim of the What Network is to provide selective attention to the relevant parts of the input image. The Where Network uses this information to locate and classify objects of interest. In this paper, we compare this approach to state-of-the-art algorithms on the PASCAL VOC 2007 and 2012 and COCO object detection challenge datasets. Also, we compare our approach to human "ground-truth" attention. We report the results of an eye-tracking experiment on human subjects using images from PASCAL VOC 2007, and we demonstrate interesting relationships between human overt attention and information processing in our WW-Nets. Finally, we provide evidence that our proposed method performs favorably in comparison to other object detection approaches, often by a large margin. The code and the eye-tracking ground-truth dataset can be found at: this https URL.
75. Transformation Based Deep Anomaly Detection in Astronomical Images [PDF] 返回目录
Esteban Reyes, Pablo A. Estévez
Abstract: In this work, we propose several enhancements to a geometric-transformation-based model for anomaly detection in images (GeoTransform). The model assumes that the anomaly class is unknown and that only inlier samples are available for training. We introduce new filter-based transformations, useful for detecting anomalies in astronomical images, that highlight artifact properties to make them more easily distinguishable from real objects. In addition, we propose a transformation selection strategy that allows us to find indistinguishable pairs of transformations. This results in an improvement of the area under the Receiver Operating Characteristic curve (AUROC) and accuracy, as well as in a dimensionality reduction. The models were tested on astronomical images from the High Cadence Transient Survey (HiTS) and Zwicky Transient Facility (ZTF) datasets. The best models obtained an average AUROC of 99.20% for HiTS and 91.39% for ZTF. The improvement over the original GeoTransform algorithm and baseline methods such as One-Class Support Vector Machine and deep learning based methods is significant both statistically and in practice.
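For readers unfamiliar with the GeoTransform family, a minimal scoring sketch follows: apply a bank of transformations and average the classifier's probability of the correct transform, with low scores flagging anomalies. The paper's new filter-based transforms and selection strategy are not reproduced; the tiny model and stamp size are stand-ins:

```python
import torch
import torch.nn.functional as F

def transformation_score(model, x, transforms):
    """GeoTransform-style inlier score: a classifier trained on inliers
    to identify which transform was applied should recognize it
    confidently; anomalies yield low average probability."""
    probs = []
    for k, t in enumerate(transforms):
        logits = model(t(x))                      # (B, num_transforms)
        probs.append(F.softmax(logits, dim=1)[:, k])
    return torch.stack(probs, dim=0).mean(dim=0)  # (B,), low = anomalous

transforms = [lambda x: x,
              lambda x: torch.rot90(x, 1, dims=(2, 3)),
              lambda x: torch.flip(x, dims=(3,))]
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 21 * 21, len(transforms)))
x = torch.randn(4, 3, 21, 21)  # illustrative image stamps
print(transformation_score(model, x, transforms))
```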
76. C3VQG: Category Consistent Cyclic Visual Question Generation [PDF] 返回目录
Shagun Uppal, Anish Madan, Sarthak Bhagat, Yi Yu, Rajiv Ratn Shah
Abstract: Visual Question Generation (VQG) is the task of generating natural questions based on an image. Popular methods in the past have explored image-to-sequence architectures trained with maximum likelihood, which often lead to generic questions. While generative models try to exploit more concepts in an image, they still require ground-truth questions and answers (and categories in some cases). In this paper, we try to exploit the different visual cues and concepts in an image to generate questions using a variational autoencoder without the need for ground-truth answers. In this work, we therefore address two shortcomings of current VQG approaches by minimizing the level of supervision and replacing generic questions with category-relevant generations. We thus eliminate the need for expensive answer annotations, reducing the supervision required for this task, and use question categories instead. Using different categories enables us to exploit different concepts, as the inference requires only the image and the category. We maximize the mutual information between the image, question, and question category in the latent space of our VAE. We also propose a novel \textit{category consistent cyclic loss} that motivates the model to generate consistent predictions with respect to the question category, reducing its redundancies and irregularities. Additionally, we impose supplementary constraints on the latent space of our generative model to provide structure based on categories and enhance generalization by encapsulating decorrelated features within each dimension. Finally, we compare our qualitative as well as quantitative results to the state of the art in VQG.
77. Disentangling in Latent Space by Harnessing a Pretrained Generator [PDF] 返回目录
Yotam Nitzan, Amit Bermano, Yangyan Li, Daniel Cohen-Or
Abstract: Learning disentangled representations of data is a fundamental problem in artificial intelligence. Specifically, disentangled latent representations allow generative models to control and compose the disentangled factors in the synthesis process. Current methods, however, require extensive supervision and training, or instead noticeably compromise quality. In this paper, we present a method that learns how to represent data in a disentangled way, with minimal supervision, realized solely using available pre-trained networks. Our key insight is to decouple the processes of disentanglement and synthesis by employing a leading pre-trained unconditional image generator, such as StyleGAN. By learning to map into its latent space, we leverage both its state-of-the-art quality generative power and its rich and expressive latent space, without the burden of training it. We demonstrate our approach on the complex and high-dimensional domain of human heads. We evaluate our method qualitatively and quantitatively, and exhibit its success with de-identification operations and with temporal identity coherency in image sequences. Through this extensive experimentation, we show that our method successfully disentangles identity from other facial attributes, surpassing existing methods, even though they require more training and supervision.
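The decoupling idea reduces to training a small mapper into the latent space of a frozen generator. The sketch below uses a linear stand-in for StyleGAN and a placeholder loss, only to show that gradients reach the mapper and not the generator:

```python
import torch
import torch.nn as nn

# A linear layer stands in for the frozen pretrained generator G;
# only the small mapper into G's latent space is trained.
G = nn.Linear(512, 3 * 64 * 64)
for p in G.parameters():
    p.requires_grad_(False)

mapper = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

feats = torch.randn(4, 512)                 # e.g. identity features of inputs
imgs = G(mapper(feats)).view(4, 3, 64, 64)  # synthesis stays frozen
imgs.abs().mean().backward()                # placeholder reconstruction loss

print(all(p.grad is None for p in G.parameters()))           # True: G untouched
print(all(p.grad is not None for p in mapper.parameters()))  # True: mapper trains
```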
78. Semantic Photo Manipulation with a Generative Image Prior [PDF] 返回目录
David Bau, Hendrik Strobelt, William Peebles, Jonas, Bolei Zhou, Jun-Yan Zhu, Antonio Torralba
Abstract: Despite the recent success of GANs in synthesizing images conditioned on inputs such as a user sketch, text, or semantic labels, manipulating the high-level attributes of an existing natural photograph with GANs is challenging for two reasons. First, it is hard for GANs to precisely reproduce an input image. Second, after manipulation, the newly synthesized pixels often do not fit the original image. In this paper, we address these issues by adapting the image prior learned by GANs to image statistics of an individual image. Our method can accurately reconstruct the input image and synthesize new content, consistent with the appearance of the input image. We demonstrate our interactive system on several semantic image editing tasks, including synthesizing new objects consistent with background, removing unwanted objects, and changing the appearance of an object. Quantitative and qualitative comparisons against several existing methods demonstrate the effectiveness of our method.
79. Deep Snow: Synthesizing Remote Sensing Imagery with Generative Adversarial Nets [PDF] 返回目录
Christopher X. Ren, Amanda Ziemann, James Theiler, Alice M. S. Durieux
Abstract: In this work we demonstrate that generative adversarial networks (GANs) can be used to generate realistic pervasive changes in remote sensing imagery, even in an unpaired training setting. We investigate some transformation quality metrics based on deep embedding of the generated and real images which enable visualization and understanding of the training dynamics of the GAN, and may provide a useful measure in terms of quantifying how distinguishable the generated images are from real images. We also identify some artifacts introduced by the GAN in the generated images, which are likely to contribute to the differences seen between the real and generated samples in the deep embedding feature space even in cases where the real and generated samples appear perceptually similar.
80. Deep Implicit Volume Compression [PDF] 返回目录
Danhang Tang, Saurabh Singh, Philip A. Chou, Christian Haene, Mingsong Dou, Sean Fanello, Jonathan Taylor, Philip Davidson, Onur G. Guleryuz, Yinda Zhang, Shahram Izadi, Andrea Tagliasacchi, Sofien Bouaziz, Cem Keskin
Abstract: We describe a novel approach for compressing truncated signed distance fields (TSDF) stored in 3D voxel grids, and their corresponding textures. To compress the TSDF, our method relies on a block-based neural network architecture trained end-to-end, achieving state-of-the-art rate-distortion trade-off. To prevent topological errors, we losslessly compress the signs of the TSDF, which also upper bounds the reconstruction error by the voxel size. To compress the corresponding texture, we designed a fast block-based UV parameterization, generating coherent texture maps that can be effectively compressed using existing video compression algorithms. We demonstrate the performance of our algorithms on two 4D performance capture datasets, reducing bitrate by 66% for the same distortion, or alternatively reducing the distortion by 50% for the same bitrate, compared to the state-of-the-art.
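The sign bound is easy to see in code: if the per-voxel signs round-trip losslessly, every zero crossing of the TSDF (i.e., the surface) stays inside the same voxel cell, so the geometric error cannot exceed the voxel size. A minimal NumPy sketch of bit-packing the signs, before any entropy coding:

```python
import numpy as np

# Signs of a TSDF determine inside/outside; compressing them
# losslessly pins each zero crossing to its voxel cell.
tsdf = np.random.randn(64, 64, 64).astype(np.float32)
signs = tsdf >= 0

packed = np.packbits(signs)  # 1 bit per voxel before entropy coding
restored = np.unpackbits(packed)[: tsdf.size].reshape(tsdf.shape).astype(bool)
assert (restored == signs).all()  # lossless round trip

print(packed.nbytes, "bytes of sign data for", tsdf.size, "voxels")
```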
81. Learning Spatial-Spectral Prior for Super-Resolution of Hyperspectral Imagery [PDF] 返回目录
Junjun Jiang, He Sun, Xianming Liu, Jiayi Ma
Abstract: Recently, the single gray/RGB image super-resolution reconstruction task has been extensively studied, and significant progress has been made by leveraging advanced machine learning techniques based on deep convolutional neural networks (DCNNs). However, there has been limited technical development focusing on single hyperspectral image super-resolution due to the high-dimensional and complex spectral patterns in hyperspectral images. In this paper, we make a step forward by investigating how to adapt state-of-the-art residual-learning-based single gray/RGB image super-resolution approaches for computationally efficient single hyperspectral image super-resolution, referred to as SSPSR. Specifically, we introduce a spatial-spectral prior network (SSPN) to fully exploit the spatial information and the correlation between the spectra of the hyperspectral data. Considering that hyperspectral training samples are scarce and the spectral dimension of hyperspectral image data is very high, it is nontrivial to train a stable and effective deep network. Therefore, a group convolution (with shared network parameters) and progressive upsampling framework is proposed. This not only alleviates the difficulty of feature extraction due to the high dimensionality of the hyperspectral data, but also makes the training process more stable. To exploit the spatial and spectral prior, we design a spatial-spectral block (SSB), which consists of a spatial residual module and a spectral attention residual module. Experimental results on several hyperspectral images demonstrate that the proposed SSPSR method enhances the details of the recovered high-resolution hyperspectral images and outperforms the state of the art. The source code is available at this https URL.
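A minimal sketch of the two stabilizing ingredients, grouped spectral convolution and progressive pixel-shuffle upsampling; note that torch's groups argument partitions the bands but does not by itself share parameters across groups as the paper's design does, and the band count here is illustrative:

```python
import torch
import torch.nn as nn

bands, groups = 128, 8  # illustrative spectral dimension and grouping
x = torch.randn(1, bands, 16, 16)

# Grouped spectral convolution: each branch sees bands // groups bands,
# cutting the weight count roughly by the group factor.
grouped = nn.Conv2d(bands, bands, 3, padding=1, groups=groups)
dense = nn.Conv2d(bands, bands, 3, padding=1)
print(sum(p.numel() for p in grouped.parameters()),
      sum(p.numel() for p in dense.parameters()))  # ~1/8 of the dense weights

# Progressive upsampling: two 2x pixel-shuffle stages instead of one 4x jump.
def up2x():
    return nn.Sequential(nn.Conv2d(bands, bands * 4, 3, padding=1),
                         nn.PixelShuffle(2))

y = nn.Sequential(up2x(), up2x())(grouped(x))
print(y.shape)  # (1, 128, 64, 64): 4x spatial super-resolution
```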
82. Improving Named Entity Recognition in Tor Darknet with Local Distance Neighbor Feature [PDF] 返回目录
Mhd Wesam Al-Nabki, Francisco Jañez-Martino, Roberto A. Vasco-Carofilis, Eduardo Fidalgo, Javier Velasco-Mata
Abstract: Named entity recognition in noisy user-generated texts is a difficult task usually enhanced by incorporating an external resource of information, such as gazetteers. However, gazetteers are task-specific, and they are expensive to build and maintain. This paper adopts and improves the approach of Aguilar et al. by presenting a novel feature, called Local Distance Neighbor, which substitutes gazetteers. We tested the new approach on the W-NUT-2017 dataset, obtaining state-of-the-art results for the Group, Person and Product categories of Named Entities. Next, we added 851 manually labeled samples to the W-NUT-2017 dataset to account for named entities in the Tor Darknet related to weapons and drug selling. Finally, our proposal achieved entity and surface F1 scores of 52.96% and 50.57% on this extended dataset, demonstrating its usefulness for Law Enforcement Agencies to detect named entities in Tor hidden services.
83. Building BROOK: A Multi-modal and Facial Video Database for Human-Vehicle Interaction Research [PDF] 返回目录
Xiangjun Peng, Zhentao Huang, Xu Sun
Abstract: With the growing popularity of Autonomous Vehicles, more opportunities have bloomed in the context of Human-Vehicle Interactions. However, the lack of comprehensive and concrete database support for such a specific use case limits relevant studies across the whole design space. In this paper, we present our work-in-progress BROOK, a public multi-modal database with facial video records, which could be used to characterize drivers' affective states and driving styles. We first explain how we over-engineered such a database in detail, and what we have gained through a ten-month study. Then we showcase a Neural Network-based predictor, leveraging BROOK, which supports multi-modal prediction (including physiological data such as heart rate and skin conductance, and driving status data such as speed) through facial videos. Finally, we discuss related issues when building such a database and our future directions in the context of BROOK. We believe BROOK is an essential building block for future Human-Vehicle Interaction Research.
84. Universalization of any adversarial attack using very few test examples [PDF] 返回目录
Sandesh Kamath, Amit Deshpande, K V Subrahmanyam
Abstract: Deep learning models are known to be vulnerable not only to input-dependent adversarial attacks but also to input-agnostic or universal adversarial attacks. Dezfooli et al. \cite{Dezfooli17,Dezfooli17anal} construct a universal adversarial attack on a given model by looking at a large number of training data points and the geometry of the decision boundary near them. Subsequent work \cite{Khrulkov18} constructs a universal attack by looking only at test examples and intermediate layers of the given model. In this paper, we propose a simple universalization technique that takes any input-dependent adversarial attack and constructs a universal attack by looking only at very few adversarial test examples. We do not require details of the given model and have negligible computational overhead for universalization. We theoretically justify our universalization technique by a spectral property common to many input-dependent adversarial perturbations, e.g., gradients, the Fast Gradient Sign Method (FGSM) and DeepFool. Using matrix concentration inequalities and spectral perturbation bounds, we show that the top singular vector of input-dependent adversarial directions on a small test sample gives an effective and simple universal adversarial attack. For VGG16 and VGG19 models trained on ImageNet, our simple universalization of Gradient, FGSM, and DeepFool perturbations using a test sample of 64 images gives fooling rates comparable to state-of-the-art universal attacks \cite{Dezfooli17,Khrulkov18} for reasonable norms of perturbation.
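The core construction is only a few lines of code: stack normalized input-gradient directions from a handful of test points and take the top right singular vector. The toy model and dimensions below are stand-ins; the paper applies this to FGSM/DeepFool directions on VGG networks:

```python
import torch

def universal_direction(model, loss_fn, xs, ys):
    """Universalize per-example gradient attacks: the top right singular
    vector of a small batch of input-gradient directions serves as an
    input-agnostic perturbation direction."""
    xs = xs.clone().requires_grad_(True)
    loss = loss_fn(model(xs), ys)
    grads = torch.autograd.grad(loss, xs)[0].flatten(1)  # (n, d)
    grads = grads / grads.norm(dim=1, keepdim=True)      # unit directions
    _, _, vh = torch.linalg.svd(grads, full_matrices=False)
    return vh[0]                                         # (d,) universal direction

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
xs, ys = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
v = universal_direction(model, torch.nn.CrossEntropyLoss(), xs, ys)

eps = 0.5  # apply as a fixed perturbation of norm eps to any input
x_adv = xs + eps * v.view(1, 3, 32, 32)
print(v.shape, x_adv.shape)
```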
85. Learning to Model and Calibrate Optics via a Differentiable Wave Optics Simulator [PDF] 返回目录
Josue Page, Paolo Favaro
Abstract: We present a novel learning-based method to build a differentiable computational model of a real fluorescence microscope. Our model can be used to calibrate a real optical setup directly from data samples and to engineer point spread functions by specifying the desired input-output data. This approach is poised to drastically improve the design of microscopes, because the parameters of current models of optical setups cannot be easily fit to real data. Inspired by the recent progress in deep learning, our solution is to build a differentiable wave optics simulator as a composition of trainable modules, each computing light wave-front (WF) propagation due to a specific optical element. We call our differentiable modules WaveBlocks and show reconstruction results in the case of lenses, wave propagation in air, camera sensors and diffractive elements (e.g., phase-masks).
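To illustrate the "trainable propagation module" idea, here is a hedged sketch of one possible WaveBlock: differentiable free-space propagation via the angular-spectrum transfer function, with the propagation distance as a learnable parameter. The paper's actual modules may be parameterized quite differently.

```python
import math
import torch

class FreeSpaceProp(torch.nn.Module):
    """Differentiable angular-spectrum free-space propagation with a
    trainable distance z; a sketch of one possible WaveBlock."""
    def __init__(self, n, pitch, wavelength, z_init):
        super().__init__()
        self.z = torch.nn.Parameter(torch.tensor(float(z_init)))
        fx = torch.fft.fftfreq(n, d=pitch)
        fy2, fx2 = torch.meshgrid(fx, fx, indexing="ij")
        # squared longitudinal spatial frequency: (1/lambda)^2 - fx^2 - fy^2
        self.register_buffer("kz2", (1.0 / wavelength) ** 2 - fx2**2 - fy2**2)

    def forward(self, wf):  # wf: complex wavefront, shape (n, n)
        kz = torch.sqrt(torch.clamp(self.kz2, min=0.0))
        mask = (self.kz2 > 0).to(wf.dtype)               # drop evanescent waves
        H = torch.exp(2j * math.pi * kz * self.z) * mask  # transfer function
        return torch.fft.ifft2(torch.fft.fft2(wf) * H)
```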
86. Bayesian convolutional neural network based MRI brain extraction on nonhuman primates [PDF] 返回目录
Gengyan Zhao, Fang Liu, Jonathan A. Oler, Mary E. Meyerand, Ned H. Kalin, Rasmus M. Birn
Abstract: Brain extraction or skull stripping of magnetic resonance images (MRI) is an essential step in neuroimaging studies, the accuracy of which can severely affect subsequent image processing procedures. Current automatic brain extraction methods demonstrate good results on human brains, but are often far from satisfactory on nonhuman primates, which are a necessary part of neuroscience research. To overcome the challenges of brain extraction in nonhuman primates, we propose a fully-automated brain extraction pipeline combining deep Bayesian convolutional neural network (CNN) and fully connected three-dimensional (3D) conditional random field (CRF). The deep Bayesian CNN, Bayesian SegNet, is used as the core segmentation engine. As a probabilistic network, it is not only able to perform accurate high-resolution pixel-wise brain segmentation, but also capable of measuring the model uncertainty by Monte Carlo sampling with dropout in the testing stage. Then, fully connected 3D CRF is used to refine the probability result from Bayesian SegNet in the whole 3D context of the brain volume. The proposed method was evaluated with a manually brain-extracted dataset comprising T1w images of 100 nonhuman primates. Our method outperforms six popular publicly available brain extraction packages and three well-established deep learning based methods with a mean Dice coefficient of 0.985 and a mean average symmetric surface distance of 0.220 mm. A better performance against all the compared methods was verified by statistical tests (all p-values < 10^-4, two-sided, Bonferroni corrected). The maximum uncertainty of the model on nonhuman primate brain extraction has a mean value of 0.116 across all 100 subjects...
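The Monte Carlo dropout step used for uncertainty estimation follows a standard recipe that can be sketched generically: keep dropout active at test time, average repeated stochastic forward passes for the segmentation, and take the per-pixel variance as the uncertainty map. The helper below is an illustration, not the authors' code.

```python
import torch

def mc_dropout_predict(net, x, n_samples=20):
    """MC-dropout inference: dropout stays active, predictions are averaged,
    and per-pixel variance serves as the uncertainty map. Generic sketch."""
    net.eval()
    for m in net.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()  # re-enable only the dropout layers
    with torch.no_grad():
        probs = torch.stack([net(x).softmax(dim=1) for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0)  # segmentation, uncertainty
```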
87. Deep Convolutional Sparse Coding Networks for Image Fusion [PDF] 返回目录
Shuang Xu, Zixiang Zhao, Yicheng Wang, Chunxia Zhang, Junmin Liu, Jiangshe Zhang
Abstract: Image fusion is a significant problem in many fields including digital photography, computational imaging and remote sensing, to name but a few. Recently, deep learning has emerged as an important tool for image fusion. This paper presents three deep convolutional sparse coding (CSC) networks for three kinds of image fusion tasks (i.e., infrared and visible image fusion, multi-exposure image fusion, and multi-modal image fusion). The CSC model and the iterative shrinkage and thresholding algorithm are generalized into dictionary convolution units. As a result, all hyper-parameters are learned from data. Our extensive experiments and comprehensive comparisons reveal the superiority of the proposed networks with regard to quantitative evaluation and visual inspection.
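The generalization of ISTA into "dictionary convolution units" can be sketched as an unrolled iteration in which the analysis/synthesis filters and the shrinkage threshold are all learned. The module below is a minimal sketch of one such unit; the paper's exact parameterization may differ.

```python
import torch

class ConvISTAUnit(torch.nn.Module):
    """One unrolled ISTA step with convolutional dictionaries:
    z <- soft_threshold(z - D^T(D z - x), theta), where the filters D, D^T
    and the threshold theta are learned from data. Sketch of one unit."""
    def __init__(self, channels=64):
        super().__init__()
        self.D = torch.nn.Conv2d(channels, 1, 3, padding=1, bias=False)
        self.Dt = torch.nn.Conv2d(1, channels, 3, padding=1, bias=False)
        self.theta = torch.nn.Parameter(torch.full((1, channels, 1, 1), 0.01))

    def forward(self, z, x):
        z = z - self.Dt(self.D(z) - x)  # gradient step on the data term
        return torch.sign(z) * torch.relu(z.abs() - self.theta)  # shrinkage
```

Stacking several such units, each receiving the observed image x, gives an unrolled convolutional sparse coding network trained end-to-end.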
88. Deep Learning and Bayesian Deep Learning Based Gender Prediction in Multi-Scale Brain Functional Connectivity [PDF] 返回目录
Gengyan Zhao, Gyujoon Hwang, Cole J. Cook, Fang Liu, Mary E. Meyerand, Rasmus M. Birn
Abstract: Brain gender differences have been known for a long time and are the possible reason for many psychological, psychiatric and behavioral differences between males and females. Predicting genders from brain functional connectivity (FC) can build the relationship between brain activities and gender, and extracting important gender related FC features from the prediction model offers a way to investigate the brain gender difference. Current predictive models applied to gender prediction demonstrate good accuracies, but usually extract individual functional connections instead of connectivity patterns in the whole connectivity matrix as features. In addition, current models often omit the effect of the input brain FC scale on prediction and cannot give any model uncertainty information. Hence, in this study we propose to predict gender from multiple scales of brain FC with deep learning, which can extract full FC patterns as features. We further develop the understanding of the feature extraction mechanism in deep neural network (DNN) and propose a DNN feature ranking method to extract the highly important features based on their contributions to the prediction. Moreover, we apply Bayesian deep learning to the brain FC gender prediction, which as a probabilistic model can not only make accurate predictions but also generate model uncertainty for each prediction. Experiments were done on the high-quality Human Connectome Project S1200 release dataset comprising the resting state functional MRI data of 1003 healthy adults. First, DNN reaches 83.0%, 87.6%, 92.0%, 93.5% and 94.1% accuracies respectively with the FC input derived from 25, 50, 100, 200, 300 independent component analysis (ICA) components. DNN outperforms the conventional machine learning methods on the 25-ICA-component scale FC, but the linear machine learning method catches up as the number of ICA components increases...
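The DNN feature-ranking step is described only at a high level; one common way to score input features by their contribution to the prediction is gradient-times-input attribution, sketched below purely as an illustration (the paper's exact ranking rule may differ).

```python
import torch

def rank_fc_features(net, fc_inputs, k=20):
    """Score each functional-connectivity feature by the mean magnitude of
    gradient-times-input over the samples; return the top-k indices.
    Illustrative attribution, not the paper's exact ranking method."""
    fc_inputs = fc_inputs.clone().requires_grad_(True)
    net(fc_inputs).sum().backward()  # assumes a scalar output per subject
    scores = (fc_inputs.grad * fc_inputs).abs().mean(dim=0)
    return torch.topk(scores, k).indices
```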
89. Improving Robustness using Joint Attention Network For Detecting Retinal Degeneration From Optical Coherence Tomography Images [PDF] 返回目录
Sharif Amit Kamran, Alireza Tavakkoli, Stewart Lee Zuckerbrod
Abstract: Noisy data and the similarity in the ocular appearances caused by different ophthalmic pathologies pose significant challenges for an automated expert system to accurately detect retinal diseases. In addition, the lack of knowledge transferability and the need for unreasonably large datasets limit clinical application of current machine learning systems. To increase robustness, a better understanding of how the retinal subspace deformations lead to various levels of disease severity needs to be utilized for prioritizing disease-specific model details. In this paper we propose the use of disease-specific feature representation as a novel architecture comprised of two joint networks -- one for supervised encoding of disease model and the other for producing attention maps in an unsupervised manner to retain disease specific spatial information. Our experimental results on publicly available datasets show the proposed joint-network significantly improves the accuracy and robustness of state-of-the-art retinal disease classification networks on unseen datasets.
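As a generic illustration of the two-branch idea (a supervised classifier gated by an unsupervised spatial attention map), consider the sketch below; the module shapes and names are assumptions, not the authors' architecture.

```python
import torch

class JointAttentionNet(torch.nn.Module):
    """Illustrative two-branch design: an attention branch (trained without
    labels, e.g. with a reconstruction objective) produces a spatial map
    that gates the supervised classifier's features."""
    def __init__(self, backbone, attn_branch, n_classes):
        super().__init__()
        self.backbone, self.attn = backbone, attn_branch
        self.fc = torch.nn.LazyLinear(n_classes)

    def forward(self, x):
        feats = self.backbone(x)         # (B, C, H, W) feature maps
        a = torch.sigmoid(self.attn(x))  # (B, 1, H, W) attention map
        return self.fc((feats * a).mean(dim=(2, 3)))  # gated, pooled, classified
```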
90. Various Total Variation for Snapshot Video Compressive Imaging [PDF] 返回目录
Xin Yuan
Abstract: Sampling high-dimensional images is challenging due to limited availability of sensors; scanning is usually necessary in these cases. To mitigate this challenge, snapshot compressive imaging (SCI) was proposed to capture the high-dimensional (usually 3D) images using a 2D sensor (detector). Via novel optical design, the {\em measurement} captured by the sensor is an encoded image of multiple frames of the 3D desired signal. Following this, reconstruction algorithms are employed to retrieve the high-dimensional data. Though various algorithms have been proposed, the total variation (TV) based method is still the most efficient one due to a good trade-off between computational time and performance. This paper aims to answer the question: which TV penalty (anisotropic TV, isotropic TV, or vectorized TV) works best for video SCI reconstruction? Various TV denoising and projection algorithms are developed and tested for video SCI reconstruction on both simulation and real datasets.
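The two spatial TV penalties being compared are easy to state in code; a minimal sketch for a video volume follows (the vectorized variant, which couples the frames, is omitted).

```python
import torch

def tv_penalty(v, kind="isotropic"):
    """Anisotropic vs. isotropic total variation of a video volume (T, H, W)."""
    dh = v[:, 1:, :] - v[:, :-1, :]   # vertical differences
    dw = v[:, :, 1:] - v[:, :, :-1]   # horizontal differences
    if kind == "anisotropic":         # sum of absolute gradient components
        return dh.abs().sum() + dw.abs().sum()
    # isotropic: l2 norm of the spatial gradient at each pixel
    return torch.sqrt(dh[:, :, :-1] ** 2 + dw[:, :-1, :] ** 2 + 1e-12).sum()
```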
91. Extreme Low-Light Imaging with Multi-granulation Cooperative Networks [PDF] 返回目录
Keqi Wang, Peng Gao, Steven Hoi, Qian Guo, Yuhua Qian
Abstract: Low-light imaging is challenging since images may appear dark and noisy due to the low signal-to-noise ratio, complex image content, and the variety of shooting scenes in extreme low-light conditions. Many methods have been proposed to enhance the imaging quality under extreme low-light conditions, but it remains difficult to obtain satisfactory results, especially when they attempt to retain high dynamic range (HDR). In this paper, we propose a novel method of multi-granulation cooperative networks (MCN) with bidirectional information flow to enhance extreme low-light images, and design an illumination map estimation function (IMEF) to preserve high dynamic range (HDR). To facilitate this research, we also contribute a new benchmark dataset of real-world Dark High Dynamic Range (DHDR) images to evaluate the performance of high dynamic range preservation in low-light environments. Experimental results show that the proposed method outperforms the state-of-the-art approaches in terms of both visual effects and quantitative analysis.
92. Revisiting Agglomerative Clustering [PDF] 返回目录
Eric K. Tokuda, Cesar H. Comin, Luciano da F. Costa
Abstract: In data clustering, emphasis is often placed in finding groups of points. An equally important subject concerns the avoidance of false positives. As it could be expected, these two goals oppose one another, in the sense that emphasis on finding clusters tends to imply in higher probability of obtaining false positives. The present work addresses this problem considering some traditional agglomerative methods, namely single, average, median, complete, centroid and Ward's applied to unimodal and bimodal datasets following uniform, gaussian, exponential and power-law distributions. More importantly, we adopt a generic model of clusters involving a higher density core surrounded by a transition zone, followed by a sparser set of outliers. Combined with preliminary specification of the size of the expected clusters, this model paved the way to the implementation of an objective means for identifying the clusters from dendrograms. In addition, the adopted model also allowed the relevance of the detected clusters to be estimated in terms of the height of the subtrees corresponding to the identified clusters. More specifically, the lower this height, the more compact and relevant the clusters tend to be. Several interesting results have been obtained, including the tendency of several of the considered methods to detect two clusters in unimodal data. The single-linkage method has been found to provide the best resilience to this tendency. In addition, several methods tended to detect clusters that do not correspond directly to the cores, therefore characterized by lower relevance. The possibility of identifying the type of distribution of points from the adopted measurements was also investigated.
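A small sketch of the procedure the abstract outlines: cut the dendrogram using a rough expected cluster size, then score each cluster by its compactness. The mean distance to the centroid below is a simple stand-in for the subtree-height relevance criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_and_score(points, expected_size, method="single"):
    """Cut the dendrogram at a cluster count implied by a rough expected
    cluster size, then score each cluster by its compactness
    (lower spread = more compact = more relevant)."""
    Z = linkage(points, method=method)  # also: average, median, complete, ...
    n_clusters = max(1, len(points) // expected_size)
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    relevance = {}
    for c in np.unique(labels):
        pts = points[labels == c]
        relevance[c] = np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()
    return labels, relevance
```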
93. Multi-level Feature Fusion-based CNN for Local Climate Zone Classification from Sentinel-2 Images: Benchmark Results on the So2Sat LCZ42 Dataset [PDF] 返回目录
Chunping Qiu, Xiaochong Tong, Michael Schmitt, Benjamin Bechtel, Xiao Xiang Zhu
Abstract: As a unique classification scheme for urban forms and functions, the local climate zone (LCZ) system provides essential general information for any studies related to urban environments, especially on a large scale. Remote sensing data-based classification approaches are the key to large-scale mapping and monitoring of LCZs. The potential of deep learning-based approaches is not yet fully explored, even though advanced convolutional neural networks (CNNs) continue to push the frontiers for various computer vision tasks. One reason is that published studies are based on different datasets, usually at a regional scale, which makes it impossible to fairly and consistently compare the potential of different CNNs for real-world scenarios. This study is based on the big So2Sat LCZ42 benchmark dataset dedicated to LCZ classification. Using this dataset, we studied a range of CNNs of varying sizes. In addition, we proposed a CNN to classify LCZs from Sentinel-2 images, Sen2LCZ-Net. Using this base network, we propose fusing multi-level features using the extended Sen2LCZ-Net-MF. With this proposed simple network architecture and the highly competitive benchmark dataset, we obtain results that are better than those obtained by the state-of-the-art CNNs, while requiring less computation with fewer layers and parameters. Large-scale LCZ classification examples of completely unseen areas are presented, demonstrating the potential of our proposed Sen2LCZ-Net-MF as well as the So2Sat LCZ42 dataset. We also intensively investigated the influence of network depth and width and the effectiveness of the design choices made for Sen2LCZ-Net-MF. Our work will provide important baselines for future CNN-based algorithm developments for both LCZ classification and other urban land cover land use classification.
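The multi-level fusion pattern itself can be sketched generically: pool feature maps from several depths of the backbone and concatenate them before the classifier. This illustrates the pattern only and is not the exact Sen2LCZ-Net-MF design.

```python
import torch

class MultiLevelFusion(torch.nn.Module):
    """Pool feature maps at several depths of a backbone and concatenate
    them before classification. Generic multi-level fusion sketch."""
    def __init__(self, stages, n_classes):
        super().__init__()
        self.stages = torch.nn.ModuleList(stages)  # successive conv stages
        self.fc = torch.nn.LazyLinear(n_classes)

    def forward(self, x):
        pooled = []
        for stage in self.stages:
            x = stage(x)
            pooled.append(x.mean(dim=(2, 3)))      # global average pooling
        return self.fc(torch.cat(pooled, dim=1))   # fuse all levels
```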
94. The Power of Triply Complementary Priors for Image Compressive Sensing [PDF] 返回目录
Zhiyuan Zha, Xin Yuan, Joey Tianyi Zhou, Jiantao Zhou, Bihan Wen, Ce Zhu
Abstract: Recent works that utilized deep models have achieved superior results in various image restoration applications. Such an approach is typically supervised, which requires a corpus of training images with a distribution similar to the images to be recovered. On the other hand, the shallow methods, which are usually unsupervised, still deliver promising performance in many inverse problems, \eg, image compressive sensing (CS), as they can effectively leverage non-local self-similarity priors of natural images. However, most of such methods are patch-based, leading to restored images with various ringing artifacts due to naive patch aggregation. Using either approach alone usually limits performance and generalizability in image restoration tasks. In this paper, we propose a joint low-rank and deep (LRD) image model, which contains a pair of triply complementary priors, namely \textit{external} and \textit{internal}, \textit{deep} and \textit{shallow}, and \textit{local} and \textit{non-local} priors. We then propose a novel hybrid plug-and-play (H-PnP) framework based on the LRD model for image CS. To make the optimization tractable, a simple yet effective algorithm is proposed to solve the proposed H-PnP based image CS problem. Extensive experimental results demonstrate that the proposed H-PnP algorithm significantly outperforms state-of-the-art techniques for image CS recovery such as SCSNet and WNNM.
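For context, the plug-and-play template that H-PnP builds on can be sketched in a few lines: alternate a data-fidelity update for y = A x with a plugged-in denoiser acting as the prior. The sketch below is the generic PnP-ADMM loop, not the hybrid triply-complementary prior of the paper.

```python
import numpy as np

def pnp_admm(y, A, At, denoise, rho=1.0, step=0.1, iters=30):
    """Generic plug-and-play ADMM for compressive sensing y = A x:
    a gradient-based data-fidelity step alternates with a plugged-in
    denoiser acting as the image prior. Sketch of the PnP template."""
    x = At(y)
    z = x.copy()
    u = np.zeros_like(x)
    for _ in range(iters):
        for _ in range(5):  # inexact x-update by gradient descent
            x = x - step * (At(A(x) - y) + rho * (x - z + u))
        z = denoise(x + u)  # prior step: any off-the-shelf denoiser
        u = u + x - z       # dual variable update
    return x
```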
95. Multi-scale Grouped Dense Network for VVC Intra Coding [PDF] 返回目录
Xin Li, Simeng Sun, Zhizheng Zhang, Zhibo Chen
Abstract: The Versatile Video Coding (H.266/VVC) standard achieves better image quality at the same bit budget than any other conventional image codec, such as BPG, JPEG, etc. However, it is still attractive and challenging to improve the image quality at high compression ratios on the basis of traditional coding techniques. In this paper, we design a multi-scale grouped dense network (MSGDN) to further reduce compression artifacts by combining multi-scale and grouped dense blocks, which are integrated as a post-processing network for VVC intra coding. Besides, to improve the subjective quality of the compressed image, we also present a generative adversarial network (MSGDN-GAN) by utilizing our MSGDN as the generator. In extensive experiments on the validation set, our MSGDN trained with MSE loss yields a PSNR of 32.622 on average (as team IMC) at a bit-rate of 0.15 in the Lowrate track. Moreover, our MSGDN-GAN achieves better subjective performance.
96. Joint Progressive Knowledge Distillation and Unsupervised Domain Adaptation [PDF] 返回目录
Le Thanh Nguyen-Meidine, Eric Granger, Madhu Kiran, Jose Dolz, Louis-Antoine Blais-Morin
Abstract: Currently, the divergence in distributions of design and operational data, and large computational complexity are limiting factors in the adoption of CNNs in real-world applications. For instance, person re-identification systems typically rely on a distributed set of cameras, where each camera has different capture conditions. This can translate to a considerable shift between source (e.g. lab setting) and target (e.g. operational camera) domains. Given the cost of annotating image data captured for fine-tuning in each target domain, unsupervised domain adaptation (UDA) has become a popular approach to adapt CNNs. Moreover, state-of-the-art deep learning models that provide a high level of accuracy often rely on architectures that are too complex for real-time applications. Although several compression and UDA approaches have recently been proposed to overcome these limitations, they do not allow optimizing a CNN to simultaneously address both. In this paper, we propose an unexplored direction -- the joint optimization of CNNs to provide a compressed model that is adapted to perform well for a given target domain. In particular, the proposed approach performs unsupervised knowledge distillation (KD) from a complex teacher model to a compact student model, by leveraging both source and target data. It also improves upon existing UDA techniques by progressively teaching the student about domain-invariant features, instead of directly adapting a compact model on target domain data. Our method is compared against state-of-the-art compression and UDA techniques, using two popular classification datasets for UDA -- Office31 and ImageClef-DA. In both datasets, results indicate that our method can achieve the highest level of accuracy while requiring a comparable or lower time complexity.
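The distillation component can be illustrated with the standard softened-label KD loss computed on unlabeled target-domain images; the progressive teaching schedule and the UDA losses of the paper are not shown.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Softened-label knowledge distillation: the student matches the
    teacher's temperature-scaled class distribution."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```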
97. A Learning-from-noise Dilated Wide Activation Network for denoising Arterial Spin Labeling (ASL) Perfusion Images [PDF] 返回目录
Danfeng Xie, Yiran Li, Hanlu Yang, Li Bai, Lei Zhang, Ze Wang
Abstract: Arterial spin labeling (ASL) perfusion MRI provides a non-invasive way to quantify cerebral blood flow (CBF) but it still suffers from a low signal-to-noise-ratio (SNR). Using deep machine learning (DL), several groups have shown encouraging denoising results. Interestingly, the improvement was obtained when the deep neural network was trained using noise-contaminated surrogate reference because of the lack of golden standard high quality ASL CBF images. More strikingly, the output of these DL ASL networks (ASLDN) showed even higher SNR than the surrogate reference. This phenomenon indicates a learning-from-noise capability of deep networks for ASL CBF image denoising, which can be further enhanced by network optimization. In this study, we proposed a new ASLDN to test whether similar or even better ASL CBF image quality can be achieved in the case of highly noisy training reference. Different experiments were performed to validate the learning-from-noise hypothesis. The results showed that the learning-from-noise strategy produced better output quality than ASLDN trained with relatively high SNR reference.
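The learning-from-noise setup resembles Noise2Noise-style training: both the input and the reference are noisy, and with roughly zero-mean noise the L2-optimal output still tends toward the clean mean. A sketch of such a training loop follows; the dilated wide-activation architecture itself is not reproduced.

```python
import torch

def train_from_noise(net, loader, epochs=10, lr=1e-4):
    """Noise2Noise-style loop: both input and reference are noisy; with
    zero-mean noise the L2-optimal output still approaches the clean mean."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy_in, noisy_ref in loader:  # noisy surrogate reference
            loss = torch.nn.functional.mse_loss(net(noisy_in), noisy_ref)
            opt.zero_grad()
            loss.backward()
            opt.step()
```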