
[arXiv Papers] Computer Vision and Pattern Recognition 2020-05-26

Contents

1. Neural Topological SLAM for Visual Navigation [PDF] Abstract
2. Automating the Surveillance of Mosquito Vectors from Trapped Specimens Using Computer Vision Techniques [PDF] Abstract
3. AGVNet: Attention Guided Velocity Learning for 3D Human Motion Prediction [PDF] Abstract
4. Learning to Simulate Dynamic Environments with GameGAN [PDF] Abstract
5. Egocentric Human Segmentation for Mixed Reality [PDF] Abstract
6. Visual Attention: Deep Rare Features [PDF] Abstract
7. A Joint Pixel and Feature Alignment Framework for Cross-dataset Palmprint Recognition [PDF] Abstract
8. A Preliminary Study for Identification of Additive Manufactured Objects with Transmitted Images [PDF] Abstract
9. Multi-Margin based Decorrelation Learning for Heterogeneous Face Recognition [PDF] Abstract
10. Interlayer and Intralayer Scale Aggregation for Scale-invariant Crowd Counting [PDF] Abstract
11. Visual Localization Using Semantic Segmentation and Depth Prediction [PDF] Abstract
12. Rethinking of Pedestrian Attribute Recognition: Realistic Datasets with Efficient Method [PDF] Abstract
13. Adaptive Adversarial Logits Pairing [PDF] Abstract
14. Recognizing Families through Images with Pretrained Encoder [PDF] Abstract
15. Deep Convolutional Neural Network-based Bernoulli Heatmap for Head Pose Estimation [PDF] Abstract
16. Deep learning approach to describe and classify fungi microscopic images [PDF] Abstract
17. Domain Specific, Semi-Supervised Transfer Learning for Medical Imaging [PDF] Abstract
18. High-Resolution Image Inpainting with Iterative Confidence Feedback and Guided Upsampling [PDF] Abstract
19. Networks with pixels embedding: a method to improve noise resistance in images classification [PDF] Abstract
20. Benefits of temporal information for appearance-based gaze estimation [PDF] Abstract
21. Master-Auxiliary: an efficient aggregation strategy for video anomaly detection [PDF] Abstract
22. Robust Object Detection under Occlusion with Context-Aware CompositionalNets [PDF] Abstract
23. ShapeAdv: Generating Shape-Aware Adversarial 3D Point Clouds [PDF] Abstract
24. RAPiD: Rotation-Aware People Detection in Overhead Fisheye Images [PDF] Abstract
25. Unsupervised Geometric Disentanglement for Surfaces via CFAN-VAE [PDF] Abstract
26. One-Shot Unsupervised Cross-Domain Detection [PDF] Abstract
27. Revisiting Street-to-Aerial View Image Geo-localization and Orientation Estimation [PDF] Abstract
28. Hierarchical Feature Embedding for Attribute Recognition [PDF] Abstract
29. Invariant 3D Shape Recognition using Predictive Modular Neural Networks [PDF] Abstract
30. Underwater object detection using Invert Multi-Class Adaboost with deep learning [PDF] Abstract
31. Self-supervised Robust Object Detectors from Partially Labelled datasets [PDF] Abstract
32. ProAlignNet : Unsupervised Learning for Progressively Aligning Noisy Contours [PDF] Abstract
33. AnimGAN: A Spatiotemporally-Conditioned Generative Adversarial Network for Character Animation [PDF] Abstract
34. Self-Training for Domain Adaptive Scene Text Detection [PDF] Abstract
35. Attention-guided Context Feature Pyramid Network for Object Detection [PDF] Abstract
36. Delving into the Imbalance of Positive Proposals in Two-stage Object Detection [PDF] Abstract
37. Learning from Naturalistic Driving Data for Human-like Autonomous Highway Driving [PDF] Abstract
38. Fine-Grain Few-Shot Vision via Domain Knowledge as Hyperspherical Priors [PDF] Abstract
39. S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation [PDF] Abstract
40. Hashing-based Non-Maximum Suppression for Crowded Object Detection [PDF] Abstract
41. Approaching Bio Cellular Classification for Malaria Infected Cells Using Machine Learning and then Deep Learning to compare & analyze K-Nearest Neighbours and Deep CNNs [PDF] Abstract
42. Novel Human-Object Interaction Detection via Adversarial Domain Generalization [PDF] Abstract
43. One of these (Few) Things is Not Like the Others [PDF] Abstract
44. Machine Vision using Diffractive Spectral Encoding [PDF] Abstract
45. Image Translation by Latent Union of Subspaces for Cross-Domain Plaque Detection [PDF] Abstract
46. Gleason Grading of Histology Prostate Images through Semantic Segmentation via Residual U-Net [PDF] Abstract
47. Stable and expressive recurrent vision models [PDF] Abstract
48. Attention-based Neural Bag-of-Features Learning for Sequence Data [PDF] Abstract
49. JSSR: A Joint Synthesis, Segmentation, and Registration System for 3D Multi-Modal Image Alignment of Large-scale Pathological CT Scans [PDF] Abstract
50. NENET: An Edge Learnable Network for Link Prediction in Scene Text [PDF] Abstract
51. The efficiency of deep learning algorithms for detecting anatomical reference points on radiological images of the head profile [PDF] Abstract
52. An interpretable automated detection system for FISH-based HER2 oncogene amplification testing in histo-pathological routine images of breast and gastric cancer diagnostics [PDF] Abstract
53. Eye Gaze Controlled Robotic Arm for Persons with SSMI [PDF] Abstract
54. Hyperspectral Image Classification with Attention Aided CNNs [PDF] Abstract
55. Keypoints Localization for Joint Vertebra Detection and Fracture Severity Quantification [PDF] Abstract
56. A Bayesian-inspired, deep learning, semi-supervised domain adaptation technique for land cover mapping [PDF] Abstract
57. mr2NST: Multi-Resolution and Multi-Reference Neural Style Transfer for Mammography [PDF] Abstract
58. Bayesian Conditional GAN for MRI Brain Image Synthesis [PDF] Abstract
59. An efficient iterative method for reconstructing surface from point clouds [PDF] Abstract
60. Vision-based control of a knuckle boom crane with online cable length estimation [PDF] Abstract
61. Multi-view Alignment and Generation in CCA via Consistent Latent Encoding [PDF] Abstract
62. A Lightweight CNN and Joint Shape-Joint Space (JS2) Descriptor for Radiological Osteoarthritis Detection [PDF] Abstract
63. Learning Camera Miscalibration Detection [PDF] Abstract
64. PoliteCamera: Respecting Strangers' Privacy in Mobile Photographing [PDF] Abstract
65. MVStylizer: An Efficient Edge-Assisted Video Photorealistic Style Transfer System for Mobile Phones [PDF] Abstract
66. Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study [PDF] Abstract
67. Coronavirus: Comparing COVID-19, SARS and MERS in the eyes of AI [PDF] Abstract
68. Multi-view polarimetric scattering cloud tomography and retrieval of droplet size [PDF] Abstract
69. SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding [PDF] Abstract
70. Pulmonary Nodule Malignancy Classification Using its Temporal Evolution with Two-Stream 3D Convolutional Neural Networks [PDF] Abstract

Abstracts

1. Neural Topological SLAM for Visual Navigation [PDF] Back to Contents
  Devendra Singh Chaplot, Ruslan Salakhutdinov, Abhinav Gupta, Saurabh Gupta
Abstract: This paper studies the problem of image-goal navigation which involves navigating to the location indicated by a goal image in a novel previously unseen environment. To tackle this problem, we design topological representations for space that effectively leverage semantics and afford approximate geometric reasoning. At the heart of our representations are nodes with associated semantic features, that are interconnected using coarse geometric information. We describe supervised learning-based algorithms that can build, maintain and use such representations under noisy actuation. Experimental study in visually and physically realistic simulation suggests that our method builds effective representations that capture structural regularities and efficiently solve long-horizon navigation problems. We observe a relative improvement of more than 50% over existing methods that study this task.

2. Automating the Surveillance of Mosquito Vectors from Trapped Specimens Using Computer Vision Techniques [PDF] Back to Contents
  Mona Minakshi, Pratool Bharti, Willie B. McClinton III, Jamshidbek Mirzakhalov, Ryan M. Carney, Sriram Chellappan
Abstract: Among all animals, mosquitoes are responsible for the most deaths worldwide. Interestingly, not all types of mosquitoes spread diseases, but rather, a select few alone are competent enough to do so. In the case of any disease outbreak, an important first step is surveillance of vectors (i.e., those mosquitoes capable of spreading diseases). To do this today, public health workers lay several mosquito traps in the area of interest. Hundreds of mosquitoes will get trapped. Naturally, among these hundreds, taxonomists have to identify only the vectors to gauge their density. This process today is manual, requires complex expertise/ training, and is based on visual inspection of each trapped specimen under a microscope. It is long, stressful and self-limiting. This paper presents an innovative solution to this problem. Our technique assumes the presence of an embedded camera (similar to those in smart-phones) that can take pictures of trapped mosquitoes. Our techniques proposed here will then process these images to automatically classify the genus and species type. Our CNN model based on Inception-ResNet V2 and Transfer Learning yielded an overall accuracy of 80% in classifying mosquitoes when trained on 25,867 images of 250 trapped mosquito vector specimens captured via many smart-phone cameras. In particular, the accuracy of our model in classifying Aedes aegypti and Anopheles stephensi mosquitoes (both of which are deadly vectors) is amongst the highest. We present important lessons learned and practical impact of our techniques towards the end of the paper.
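
As an illustration of the classification setup described above, here is a minimal transfer-learning sketch with an ImageNet-pretrained Inception-ResNet V2 backbone in Keras. The number of output classes, the frozen backbone, and the training hyperparameters are illustrative assumptions; the abstract does not specify the authors' exact configuration.

```python
import tensorflow as tf

NUM_CLASSES = 9  # hypothetical: the abstract does not state the number of genus/species classes

# ImageNet-pretrained Inception-ResNet V2 backbone with the classifier head removed
base = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = False  # freeze the backbone; the fine-tuning schedule is an assumption

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```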

3. AGVNet: Attention Guided Velocity Learning for 3D Human Motion Prediction [PDF] Back to Contents
  Xiaoli Liu, Jianqin Yin, Jun Liu
Abstract: Human motion prediction plays a vital role in human-robot interaction with various applications such as family service robot. Most of the existing works did not explicitly model velocities of skeletal motion that carried rich motion dynamics, which is critical to predict future poses. In this paper, we propose a novel feedforward network, AGVNet (Attention Guided Velocity Learning Network), to predict future poses, which explicitly models the velocities at both Encoder and Decoder. Specifically, a novel two-stream Encoder is proposed to encode the skeletal motion in both velocity space and position space. Then, a new feedforward Decoder is presented to predict future velocities instead of position poses, which enables the network to predict multiple future velocities recursively like RNN based Decoder. Finally, a novel loss, ATPL (Attention Temporal Prediction Loss), is designed to pay more attention to the early predictions, which can efficiently guide the recursive model to achieve more accurate predictions. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets (i.e. Human3.6M and 3DPW) for human motion prediction, which demonstrates the effectiveness of our proposed method. The code will be available if the paper is accepted.
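
The ATPL idea of paying more attention to early predictions can be sketched as a per-frame loss with decreasing weights. The geometric decay below is an illustrative assumption; the abstract does not give the exact weighting.

```python
import torch

def atpl_sketch(pred, target, decay=0.8):
    """Weighted per-frame L2 loss: early predicted frames count more.

    pred, target: (batch, T, joints, 3) pose sequences. The geometric
    decay schedule is an assumption, not the paper's exact ATPL."""
    T = pred.shape[1]
    weights = decay ** torch.arange(T, dtype=pred.dtype, device=pred.device)
    weights = weights / weights.sum()                       # normalize to sum to 1
    per_frame = ((pred - target) ** 2).mean(dim=(0, 2, 3))  # (T,) error per frame
    return (weights * per_frame).sum()
```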

4. Learning to Simulate Dynamic Environments with GameGAN [PDF] Back to Contents
  Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, Sanja Fidler
Abstract: Simulation is a crucial component of any robotic system. In order to simulate correctly, we need to write complex rules of the environment: how dynamic agents behave, and how the actions of each of the agents affect the behavior of others. In this paper, we aim to learn a simulator by simply watching an agent interact with an environment. We focus on graphics games as a proxy of the real environment. We introduce GameGAN, a generative model that learns to visually imitate a desired game by ingesting screenplay and keyboard actions during training. Given a key pressed by the agent, GameGAN "renders" the next screen using a carefully designed generative adversarial network. Our approach offers key advantages over existing work: we design a memory module that builds an internal map of the environment, allowing for the agent to return to previously visited locations with high visual consistency. In addition, GameGAN is able to disentangle static and dynamic components within an image making the behavior of the model more interpretable, and relevant for downstream tasks that require explicit reasoning over dynamic elements. This enables many interesting applications such as swapping different components of the game to build new games that do not exist.

5. Egocentric Human Segmentation for Mixed Reality [PDF] Back to Contents
  Andrija Gajic, Ester Gonzalez-Sosa, Diego Gonzalez-Morin, Marcos Escudero-Viñolo, Alvaro Villegas
Abstract: The objective of this work is to segment human body parts from egocentric video using semantic segmentation networks. Our contribution is two-fold: i) we create a semi-synthetic dataset composed of more than 15,000 realistic images and associated pixel-wise labels of egocentric human body parts, such as arms or legs including different demographic factors; ii) building upon the ThunderNet architecture, we implement a deep learning semantic segmentation algorithm that is able to perform beyond real-time requirements (16 ms for 720 x 720 images). It is believed that this method will enhance sense of presence of Virtual Environments and will constitute a more realistic solution to the standard virtual avatars.

6. Visual Attention: Deep Rare Features [PDF] Back to Contents
  Mancas Matei, Kong Phutphalla, Gosselin Bernard
Abstract: Human visual system is modeled in engineering field providing feature-engineered methods which detect contrasted/surprising/unusual data into images. This data is "interesting" for humans and leads to numerous applications. Deep learning (DNNs) drastically improved the algorithms efficiency on the main benchmark datasets. However, DNN-based models are counter-intuitive: surprising or unusual data is by definition difficult to learn because of its low occurrence probability. In reality, DNNs models mainly learn top-down features such as faces, text, people, or animals which usually attract human attention, but they have low efficiency in extracting surprising or unusual data in the images. In this paper, we propose a model called DeepRare2019 (DR) which uses the power of DNNs feature extraction and the genericity of feature-engineered algorithms. DR 1) does not need any training, 2) it takes less than a second per image on CPU only and 3) our tests on three very different eye-tracking datasets show that DR is generic and is always in the top-3 models on all datasets and metrics while no other model exhibits such a regularity and genericity. DeepRare2019 code can be found at this https URL

7. A Joint Pixel and Feature Alignment Framework for Cross-dataset Palmprint Recognition [PDF] Back to Contents
  Huikai Shao, Dexing Zhong
Abstract: Deep learning-based palmprint recognition algorithms have shown great potential. Most of them are mainly focused on identifying samples from the same dataset. However, they may be not suitable for a more convenient case that the images for training and test are from different datasets, such as collected by embedded terminals and smartphones. Therefore, we propose a novel Joint Pixel and Feature Alignment (JPFA) framework for such cross-dataset palmprint recognition scenarios. Two stage-alignment is applied to obtain adaptive features in source and target datasets. 1) Deep style transfer model is adopted to convert source images into fake images to reduce the dataset gaps and perform data augmentation on pixel level. 2) A new deep domain adaptation model is proposed to extract adaptive features by aligning the dataset-specific distributions of target-source and target-fake pairs on feature level. Adequate experiments are conducted on several benchmarks including constrained and unconstrained palmprint databases. The results demonstrate that our JPFA outperforms other models to achieve the state-of-the-arts. Compared with baseline, the accuracy of cross-dataset identification is improved by up to 28.10% and the Equal Error Rate (EER) of cross-dataset verification is reduced by up to 4.69%. To make our results reproducible, the codes are publicly available at this http URL.

8. A Preliminary Study for Identification of Additive Manufactured Objects with Transmitted Images [PDF] Back to Contents
  Kenta Yamamoto, Ryota Kawamura, Kazuki Takazawa, Hiroyuki Osone, Yoichi Ochiai
Abstract: Additive manufacturing has the potential to become a standard method for manufacturing products, and product information is indispensable for the item distribution system. While most products are given barcodes to the exterior surfaces, research on embedding barcodes inside products is underway. This is because additive manufacturing makes it possible to carry out manufacturing and information adding at the same time, and embedding information inside does not impair the exterior appearance of the product. However, products that have not been embedded information can not be identified, and embedded information can not be rewritten later. In this study, we have developed a product identification system that does not require embedding barcodes inside. This system uses a transmission image of the product which contains information of each product such as different inner support structures and manufacturing errors. We have shown through experiments that if datasets of transmission images are available, objects can be identified with an accuracy of over 90%. This result suggests that our approach can be useful for identifying objects without embedded information.

9. Multi-Margin based Decorrelation Learning for Heterogeneous Face Recognition [PDF] Back to Contents
  Bing Cao, Nannan Wang, Xinbo Gao, Jie Li, Zhifeng Li
Abstract: Heterogeneous face recognition (HFR) refers to matching face images acquired from different domains with wide applications in security scenarios. This paper presents a deep neural network approach namely Multi-Margin based Decorrelation Learning (MMDL) to extract decorrelation representations in a hyperspherical space for cross-domain face images. The proposed framework can be divided into two components: heterogeneous representation network and decorrelation representation learning. First, we employ a large scale of accessible visual face images to train heterogeneous representation network. The decorrelation layer projects the output of the first component into decorrelation latent subspace and obtains decorrelation representation. In addition, we design a multi-margin loss (MML), which consists of quadruplet margin loss (QML) and heterogeneous angular margin loss (HAML), to constrain the proposed framework. Experimental results on two challenging heterogeneous face databases show that our approach achieves superior performance on both verification and recognition tasks, comparing with state-of-the-art methods.

10. Interlayer and Intralayer Scale Aggregation for Scale-invariant Crowd Counting [PDF] Back to Contents
  Mingjie Wang, Hao Cai, Jun Zhou, Minglun Gong
Abstract: Crowd counting is an important vision task, which faces challenges on continuous scale variation within a given scene and huge density shift both within and across images. These challenges are typically addressed using multi-column structures in existing methods. However, such an approach does not provide consistent improvement and transferability due to limited ability in capturing multi-scale features, sensitiveness to large density shift, and difficulty in training multi-branch models. To overcome these limitations, a Single-column Scale-invariant Network (ScSiNet) is presented in this paper, which extracts sophisticated scale-invariant features via the combination of interlayer multi-scale integration and a novel intralayer scale-invariant transformation (SiT). Furthermore, in order to enlarge the diversity of densities, a randomly integrated loss is presented for training our single-branch method. Extensive experiments on public datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in counting accuracy and achieves remarkable transferability and scale-invariant property.

11. Visual Localization Using Semantic Segmentation and Depth Prediction [PDF] Back to Contents
  Huanhuan Fan, Yuhao Zhou, Ang Li, Shuang Gao, Jijunnan Li, Yandong Guo
Abstract: In this paper, we propose a monocular visual localization pipeline leveraging semantic and depth cues. We apply semantic consistency evaluation to rank the image retrieval results and a practical clustering technique to reject estimation outliers. In addition, we demonstrate a substantial performance boost achieved with a combination of multiple feature extractors. Furthermore, by using depth prediction with a deep neural network, we show that a significant amount of falsely matched keypoints are identified and eliminated. The proposed pipeline outperforms most of the existing approaches at the Long-Term Visual Localization benchmark 2020.

12. Rethinking of Pedestrian Attribute Recognition: Realistic Datasets with Efficient Method [PDF] Back to Contents
  Jian Jia, Houjing Huang, Wenjie Yang, Xiaotang Chen, Kaiqi Huang
Abstract: Despite various methods are proposed to make progress in pedestrian attribute recognition, a crucial problem on existing datasets is often neglected, namely, a large number of identical pedestrian identities in train and test set, which is not consistent with practical application. Thus, images of the same pedestrian identity in train set and test set are extremely similar, leading to overestimated performance of state-of-the-art methods on existing datasets. To address this problem, we propose two realistic datasets PETA_zs and RAPv2_zs following zero-shot setting of pedestrian identities based on PETA and RAPv2 datasets. Furthermore, compared to our strong baseline method, we have observed that recent state-of-the-art methods can not make performance improvement on PETA, RAPv2, PETA_zs and RAPv2_zs. Thus, through solving the inherent attribute imbalance in pedestrian attribute recognition, an efficient method is proposed to further improve the performance. Experiments on existing and proposed datasets verify the superiority of our method by achieving state-of-the-art performance.
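
The zero-shot construction described above, with no pedestrian identity shared between train and test, amounts to partitioning identities before splitting. A minimal sketch; the (image, identity, attributes) sample schema and the split ratio are illustrative assumptions.

```python
import random
from collections import defaultdict

def zero_shot_identity_split(samples, test_ratio=0.4, seed=0):
    """Partition identities first so no pedestrian ID appears in both
    train and test, unlike the random splits of the original datasets."""
    by_id = defaultdict(list)
    for sample in samples:             # sample = (image_path, identity, attributes)
        by_id[sample[1]].append(sample)
    ids = sorted(by_id)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_ratio)
    test = [s for i in ids[:n_test] for s in by_id[i]]
    train = [s for i in ids[n_test:] for s in by_id[i]]
    return train, test
```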

13. Adaptive Adversarial Logits Pairing [PDF] Back to Contents
  Shangxi Wu, Jitao Sang, Kaiyuan Xu, Guanhua Zheng, Changsheng Xu
Abstract: Adversarial examples provide an opportunity as well as impose a challenge for understanding image classification systems. Based on the analysis of state-of-the-art defense solution Adversarial Logits Pairing (ALP), we observed in this work that: (1) The inference of adversarially robust models tends to rely on fewer high-contribution features compared with vulnerable ones. (2) The training target of ALP doesn't fit well to a noticeable part of samples, where the logits pairing loss is overemphasized and obstructs minimizing the classification loss. Motivated by these observations, we designed an Adaptive Adversarial Logits Pairing (AALP) solution by modifying the training process and training target of ALP. Specifically, AALP consists of an adaptive feature optimization module with Guided Dropout to systematically pursue few high-contribution features, and an adaptive sample weighting module by setting sample-specific training weights to balance between logits pairing loss and classification loss. The proposed AALP solution demonstrates superior defense performance on multiple datasets with extensive experiments.
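
For context, the baseline ALP objective that AALP adapts pairs the logits of clean and adversarial inputs alongside a classification loss. A minimal PyTorch sketch with a fixed pairing weight; AALP itself replaces this with adaptive, sample-specific weighting, which is omitted here.

```python
import torch.nn.functional as F

def alp_loss_sketch(model, x_clean, x_adv, y, lam=0.5):
    """Classification loss on adversarial examples plus an L2 term that
    pulls clean and adversarial logits together. lam is fixed here; AALP
    balances the pairing and classification terms per sample instead."""
    logits_clean = model(x_clean)
    logits_adv = model(x_adv)
    ce = F.cross_entropy(logits_adv, y)
    pairing = ((logits_clean - logits_adv) ** 2).mean()
    return ce + lam * pairing
```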

14. Recognizing Families through Images with Pretrained Encoder [PDF] Back to Contents
  Tuan-Duy H. Nguyen, Huu-Nghia H. Nguyen, Hieu Dao
Abstract: Kinship verification and kinship retrieval are emerging tasks in computer vision. Kinship verification aims at determining whether two facial images are from related people or not, while kinship retrieval is the task of retrieving possible related facial images to a person from a gallery of images. They introduce unique challenges because of the hidden relations and features that carry inherent characteristics between the facial images. We employ 3 methods, FaceNet, Siamese VGG-Face, and a combination of FaceNet and VGG-Face models as feature extractors, to achieve the 9th standing for kinship verification and the 5th standing for kinship retrieval in the Recognizing Family in The Wild 2020 competition. We then further experimented using StyleGAN2 as another encoder, with no improvement in the result.

15. Deep Convolutional Neural Network-based Bernoulli Heatmap for Head Pose Estimation [PDF] Back to Contents
  Zhongxu Hu, Yang Xing, Chen Lv, Peng Hang, Jie Liu
Abstract: Head pose estimation is a crucial problem for many tasks, such as driver attention, fatigue detection, and human behaviour analysis. It is well known that neural networks are better at handling classification problems than regression problems. It is an extremely nonlinear process to let the network output the angle value directly for optimization learning, and the weight constraint of the loss function will be relatively weak. This paper proposes a novel Bernoulli heatmap for head pose estimation from a single RGB image. Our method can achieve the positioning of the head area while estimating the angles of the head. The Bernoulli heatmap makes it possible to construct fully convolutional neural networks without fully connected layers and provides a new idea for the output form of head pose estimation. A deep convolutional neural network (CNN) structure with multiscale representations is adopted to maintain high-resolution information and low-resolution information in parallel. This kind of structure can maintain rich, high-resolution representations. In addition, channelwise fusion is adopted to make the fusion weights learnable instead of simple addition with equal weights. As a result, the estimation is spatially more precise and potentially more accurate. The effectiveness of the proposed method is empirically demonstrated by comparing it with other state-of-the-art methods on public datasets.
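
The channelwise fusion contrasted above with equal-weight addition can be sketched as a learnable per-channel gate between two branches. The sigmoid parameterization is an assumption about the form of the learnable weights.

```python
import torch
import torch.nn as nn

class ChannelwiseFusion(nn.Module):
    """Learnable per-channel mixing of two feature maps instead of
    simple addition with equal weights (parameterization assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, a, b):
        w = torch.sigmoid(self.logit)   # per-channel weight in (0, 1)
        return w * a + (1.0 - w) * b

fuse = ChannelwiseFusion(64)
out = fuse(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```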

16. Deep learning approach to describe and classify fungi microscopic images [PDF] Back to Contents
  Bartosz Zieliński, Agnieszka Sroka-Oleksiak, Dawid Rymarczyk, Adam Piekarczyk, Monika Brzychczy-Włoch
Abstract: Preliminary diagnosis of fungal infections can rely on microscopic examination. However, in many cases, it does not allow unambiguous identification of the species by microbiologist due to their visual similarity. Therefore, it is usually necessary to use additional biochemical tests. That involves additional costs and extends the identification process up to 10 days. Such a delay in the implementation of targeted therapy may be grave in consequence as the mortality rate for immunosuppressed patients is high. In this paper, we apply a machine learning approach based on deep neural networks and Fisher Vector (advanced bag-of-words method) to classify microscopic images of various fungi species. Our approach has the potential to make the last stage of biochemical identification redundant, shortening the identification process by 2-3 days, and reducing the cost of the diagnosis.

17. Domain Specific, Semi-Supervised Transfer Learning for Medical Imaging [PDF] Back to Contents
  Jitender Singh Virk, Deepti R. Bathula
Abstract: Limited availability of annotated medical imaging data poses a challenge for deep learning algorithms. Although transfer learning minimizes this hurdle in general, knowledge transfer across disparate domains is shown to be less effective. On the other hand, smaller architectures were found to be more compelling in learning better features. Consequently, we propose a lightweight architecture that uses mixed asymmetric kernels (MAKNet) to reduce the number of parameters significantly. Additionally, we train the proposed architecture using semi-supervised learning to provide pseudo-labels for a large medical dataset to assist with transfer learning. The proposed MAKNet provides better classification performance with 60-70% fewer parameters than popular architectures. Experimental results also highlight the importance of domain-specific knowledge for effective transfer learning.

18. High-Resolution Image Inpainting with Iterative Confidence Feedback and Guided Upsampling [PDF] Back to Contents
  Yu Zeng, Zhe Lin, Jimei Yang, Jianming Zhang, Eli Shechtman, Huchuan Lu
Abstract: Existing image inpainting methods often produce artifacts when dealing with large holes in real applications. To address this challenge, we propose an iterative inpainting method with a feedback mechanism. Specifically, we introduce a deep generative model which not only outputs an inpainting result but also a corresponding confidence map. Using this map as feedback, it progressively fills the hole by trusting only high-confidence pixels inside the hole at each iteration and focuses on the remaining pixels in the next iteration. As it reuses partial predictions from the previous iterations as known pixels, this process gradually improves the result. In addition, we propose a guided upsampling network to enable generation of high-resolution inpainting results. We achieve this by extending the Contextual Attention module [1] to borrow high-resolution feature patches in the input image. Furthermore, to mimic real object removal scenarios, we collect a large object mask dataset and synthesize more realistic training data that better simulates user inputs. Experiments show that our method significantly outperforms existing methods in both quantitative and qualitative evaluations. More results and Web APP are available at this https URL.
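
The confidence-feedback loop can be sketched as follows; the model interface returning a (prediction, confidence) pair, the threshold, and the iteration count are illustrative assumptions based on the abstract.

```python
import torch

def iterative_inpaint_sketch(model, image, mask, n_iters=4, conf_thresh=0.5):
    """At each pass, commit only high-confidence pixels inside the hole
    as known content, shrink the mask, and repeat on what remains.

    mask: 1 inside the hole, 0 for known pixels (shape (B, 1, H, W))."""
    for _ in range(n_iters):
        pred, conf = model(image, mask)            # assumed model interface
        trusted = (conf > conf_thresh) & (mask > 0.5)
        image = torch.where(trusted, pred, image)  # trust confident pixels only
        mask = torch.where(trusted, torch.zeros_like(mask), mask)
        if mask.sum() == 0:                        # hole fully filled
            break
    return image
```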

19. Networks with pixels embedding: a method to improve noise resistance in images classification [PDF] Back to Contents
  Chi-Chun Zhou, Hai-Long Tu, Yi Liua, Fu-Lin Zhang
Abstract: In the task of images classification, usually, the network is sensitive to noises. For example, an image of cat with noises might be misclassified as an ostrich. Conventionally, to overcome the problem of noises, one uses the technique of data enhancement, that is, to teach the network to distinguish noises by adding more images with noises in the training dataset. In this work, we provide a noise-resistance network in images classification by introducing a technique of pixels embedding. We test the network with pixels embedding, which is abbreviated as the network with PE, on the mnist database of handwritten digits. It shows that the network with PE outperforms the conventional network on images with noises. The technique of pixels embedding can be used in many tasks of images classification to improve noise resistance.
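
A pixels-embedding front end in the spirit of the abstract can be sketched by quantizing intensities and looking each value up in a learned table, analogous to word embeddings. The bin count and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PixelEmbedding(nn.Module):
    """Map each quantized pixel intensity to a learned vector before
    the usual convolutional classifier (sizes assumed for illustration)."""
    def __init__(self, n_bins=256, dim=8):
        super().__init__()
        self.n_bins = n_bins
        self.embed = nn.Embedding(n_bins, dim)

    def forward(self, x):               # x: (B, 1, H, W), values in [0, 1]
        idx = (x.clamp(0, 1) * (self.n_bins - 1)).long().squeeze(1)
        e = self.embed(idx)              # (B, H, W, dim)
        return e.permute(0, 3, 1, 2)     # (B, dim, H, W) for downstream convs

feats = PixelEmbedding()(torch.rand(4, 1, 28, 28))  # MNIST-sized input
```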

20. Benefits of temporal information for appearance-based gaze estimation [PDF] Back to Contents
  Cristina Palmero, Oleg V. Komogortsev, Sachin S. Talathi
Abstract: State-of-the-art appearance-based gaze estimation methods, usually based on deep learning techniques, mainly rely on static features. However, temporal trace of eye gaze contains useful information for estimating a given gaze point. For example, approaches leveraging sequential eye gaze information when applied to remote or low-resolution image scenarios with off-the-shelf cameras are showing promising results. The magnitude of contribution from temporal gaze trace is yet unclear for higher resolution/frame rate imaging systems, in which more detailed information about an eye is captured. In this paper, we investigate whether temporal sequences of eye images, captured using a high-resolution, high-frame rate head-mounted virtual reality system, can be leveraged to enhance the accuracy of an end-to-end appearance-based deep-learning model for gaze estimation. Performance is compared against a static-only version of the model. Results demonstrate statistically-significant benefits of temporal information, particularly for the vertical component of gaze.

21. Master-Auxiliary: an efficient aggregation strategy for video anomaly detection [PDF] Back to Contents
  Zhiguo Wang, Zhongliang Yang, Yujin Zhang
Abstract: The aim of surveillance video anomaly detection is to detect events that rarely or never happened in a specified scene. Different detectors can detect different anomalies. This paper proposes an efficient strategy to aggregate multiple detectors together. At first, the aggregation strategy chooses one detector as master detector, and sets the other detectors as auxiliary detectors. Then, the aggregation strategy extracts credible information from auxiliary detectors, which includes credible abnormal (Cred-a) frames and credible normal (Cred-n) frames, and counts their Cred-a and Cred-n frequencies. Finally, the aggregation strategy utilizes the Cred-a and Cred-n frequencies to calculate soft weights in a voting manner, and uses the soft weights to assist the master detector. Experiments are carried out on multiple datasets. Compared with existing aggregation strategies, the proposed strategy achieves state-of-the-art performance.
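
One reading of the Cred-a/Cred-n voting above is that per-frame vote frequencies from the auxiliary detectors form soft weights that nudge the master detector's anomaly scores. The sketch below is an interpretation under stated assumptions (thresholds and mixing factor), not the paper's exact aggregation rule.

```python
import numpy as np

def aggregate_sketch(master, auxiliaries, a_thresh=0.8, n_thresh=0.2, alpha=0.3):
    """master: (n_frames,) anomaly scores in [0, 1];
    auxiliaries: list of (n_frames,) score arrays from auxiliary detectors."""
    aux = np.stack(auxiliaries)                 # (n_detectors, n_frames)
    cred_a = (aux > a_thresh).mean(axis=0)      # credible-abnormal vote frequency
    cred_n = (aux < n_thresh).mean(axis=0)      # credible-normal vote frequency
    soft = cred_a - cred_n                      # soft vote in [-1, 1]
    return np.clip(master + alpha * soft, 0.0, 1.0)
```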

22. Robust Object Detection under Occlusion with Context-Aware CompositionalNets [PDF] Back to Contents
  Angtian Wang, Yihong Sun, Adam Kortylewski, Alan Yuille
Abstract: Detecting partially occluded objects is a difficult task. Our experimental results show that deep learning approaches, such as Faster R-CNN, are not robust at object detection under occlusion. Compositional convolutional neural networks (CompositionalNets) have been shown to be robust at classifying occluded objects by explicitly representing the object as a composition of parts. In this work, we propose to overcome two limitations of CompositionalNets which will enable them to detect partially occluded objects: 1) CompositionalNets, as well as other DCNN architectures, do not explicitly separate the representation of the context from the object itself. Under strong object occlusion, the influence of the context is amplified which can have severe negative effects for detection at test time. In order to overcome this, we propose to segment the context during training via bounding box annotations. We then use the segmentation to learn a context-aware CompositionalNet that disentangles the representation of the context and the object. 2) We extend the part-based voting scheme in CompositionalNets to vote for the corners of the object's bounding box, which enables the model to reliably estimate bounding boxes for partially occluded objects. Our extensive experiments show that our proposed model can detect objects robustly, increasing the detection performance of strongly occluded vehicles from PASCAL3D+ and MS-COCO by 41% and 35% respectively in absolute performance relative to Faster R-CNN.

23. ShapeAdv: Generating Shape-Aware Adversarial 3D Point Clouds [PDF] Back to Contents
  Kibok Lee, Zhuoyuan Chen, Xinchen Yan, Raquel Urtasun, Ersin Yumer
Abstract: We introduce ShapeAdv, a novel framework to study shape-aware adversarial perturbations that reflect the underlying shape variations (e.g., geometric deformations and structural differences) in the 3D point cloud space. We develop shape-aware adversarial 3D point cloud attacks by leveraging the learned latent space of a point cloud auto-encoder where the adversarial noise is applied in the latent space. Specifically, we propose three different variants including an exemplar-based one by guiding the shape deformation with auxiliary data, such that the generated point cloud resembles the shape morphing between objects in the same category. Different from prior works, the resulting adversarial 3D point clouds reflect the shape variations in the 3D point cloud space while still being close to the original one. In addition, experimental evaluations on the ModelNet40 benchmark demonstrate that our adversaries are more difficult to defend with existing point cloud defense methods and exhibit a higher attack transferability across classifiers. Our shape-aware adversarial attacks are orthogonal to existing point cloud based attacks and shed light on the vulnerability of 3D deep neural networks.

24. RAPiD: Rotation-Aware People Detection in Overhead Fisheye Images [PDF] Back to Contents
  Zhihao Duan, M. Ozan Tezcan, Hayato Nakamura, Prakash Ishwar, Janusz Konrad
Abstract: Recent methods for people detection in overhead, fisheye images either use radially-aligned bounding boxes to represent people, assuming people always appear along image radius or require significant pre-/post-processing which radically increases computational complexity. In this work, we develop an end-to-end rotation-aware people detection method, named RAPiD, that detects people using arbitrarily-oriented bounding boxes. Our fully-convolutional neural network directly regresses the angle of each bounding box using a periodic loss function, which accounts for angle periodicities. We have also created a new dataset with spatio-temporal annotations of rotated bounding boxes, for people detection as well as other vision tasks in overhead fisheye videos. We show that our simple, yet effective method outperforms state-of-the-art results on three fisheye-image datasets. Code and dataset are available at this http URL .
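
The periodic angle loss that enables direct regression of arbitrary box orientations can be sketched by wrapping the angular error before squaring it. RAPiD's actual loss differs in detail; this wrapped-L2 form and the 2π period are illustrative assumptions.

```python
import math
import torch

def periodic_angle_loss_sketch(pred, target, period=2 * math.pi):
    """Wrap the angular error into [-period/2, period/2) before squaring,
    so predictions off by a full period incur no penalty. Angles in radians."""
    diff = torch.remainder(pred - target + period / 2, period) - period / 2
    return (diff ** 2).mean()

# An error of ~2*pi wraps to ~0: these angles point in nearly the same direction.
loss = periodic_angle_loss_sketch(torch.tensor([3.10]), torch.tensor([-3.10]))
```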

25. Unsupervised Geometric Disentanglement for Surfaces via CFAN-VAE [PDF] Back to Contents
  N. Joseph Tatro, Stefan C. Schonsheck, Rongjie Lai
Abstract: For non-Euclidean data such as meshes of humans, a prominent task for generative models is geometric disentanglement; the separation of latent codes for intrinsic (i.e. identity) and extrinsic (i.e. pose) geometry. This work introduces a novel mesh feature, the conformal factor and normal feature (CFAN), for use in mesh convolutional autoencoders. We further propose CFAN-VAE, a novel architecture that disentangles identity and pose using the CFAN feature and parallel transport convolution. CFAN-VAE achieves this geometric disentanglement in an unsupervised way, as it does not require label information on the identity or pose during training. Our comprehensive experiments, including reconstruction, interpolation, generation, and canonical correlation analysis, validate the effectiveness of the unsupervised geometric disentanglement. We also successfully detect and recover geometric disentanglement in mesh convolutional autoencoders that encode xyz-coordinates directly by registering its latent space to that of CFAN-VAE.

26. One-Shot Unsupervised Cross-Domain Detection [PDF] Back to Contents
  Antonio D'Innocente, Francesco Cappio Borlino, Silvia Bucci, Barbara Caputo, Tatiana Tommasi
Abstract: Despite impressive progress in object detection over the last years, it is still an open challenge to reliably detect objects across visual domains. Although the topic has attracted attention recently, current approaches all rely on the ability to access a sizable amount of target data for use at training time. This is a heavy assumption, as often it is not possible to anticipate the domain where a detector will be used, nor to access it in advance for data acquisition. Consider for instance the task of monitoring image feeds from social media: as every image is created and uploaded by a different user it belongs to a different target domain that is impossible to foresee during training. This paper addresses this setting, presenting an object detection algorithm able to perform unsupervised adaption across domains by using only one target sample, seen at test time. We achieve this by introducing a multi-task architecture that one-shot adapts to any incoming sample by iteratively solving a self-supervised task on it. We further enhance this auxiliary adaptation with cross-task pseudo-labeling. A thorough benchmark analysis against the most recent cross-domain detection methods and a detailed ablation study show the advantage of our method, which sets the state-of-the-art in the defined one-shot scenario.

27. Revisiting Street-to-Aerial View Image Geo-localization and Orientation Estimation [PDF] Back to Contents
  Sijie Zhu, Taojiannan Yang, Chen Chen
Abstract: Street-to-aerial image geo-localization, which matches a query street-view image to the GPS-tagged aerial images in a reference set, has attracted increasing attention recently. In this paper, we revisit this problem and point out the ignored issue about image alignment information. We show that the performance of a simple Siamese network is highly dependent on the alignment setting and the comparison of previous works can be unfair if they have different assumptions. Instead of focusing on the feature extraction under the alignment assumption, we show that improvements in metric learning techniques significantly boost the performance regardless of the alignment. Without leveraging the alignment information, our pipeline outperforms previous works on both panorama and cropped datasets. Furthermore, we conduct visualization to help understand the learned model and the effect of alignment information using Grad-CAM. With our discovery on the approximate rotation-invariant activation maps, we propose a novel method to estimate the orientation/alignment between a pair of cross-view images with unknown alignment information. It achieves state-of-the-art results on the CVUSA dataset.

28. Hierarchical Feature Embedding for Attribute Recognition [PDF] Back to Contents
  Jie Yang, Jiarou Fan, Yiru Wang, Yige Wang, Weihao Gan, Lin Liu, Wei Wu
Abstract: Attribute recognition is a crucial but challenging task due to viewpoint changes, illumination variations and appearance diversities, etc. Most of previous work only consider the attribute-level feature embedding, which might perform poorly in complicated heterogeneous conditions. To address this problem, we propose a hierarchical feature embedding (HFE) framework, which learns a fine-grained feature embedding by combining attribute and ID information. In HFE, we maintain the inter-class and intra-class feature embedding simultaneously. Not only samples with the same attribute but also samples with the same ID are gathered more closely, which could restrict the feature embedding of visually hard samples with regard to attributes and improve the robustness to variant conditions. We establish this hierarchical structure by utilizing HFE loss consisted of attribute-level and ID-level constraints. We also introduce an absolute boundary regularization and a dynamic loss weight as supplementary components to help build up the feature embedding. Experiments show that our method achieves the state-of-the-art results on two pedestrian attribute datasets and a facial attribute dataset.

29. Invariant 3D Shape Recognition using Predictive Modular Neural Networks [PDF] Back to Contents
  Vasileios Petridis
Abstract: In this paper PREMONN (PREdictive MOdular Neural Networks) model/architecture is generalized to functions of two variables and to non-Euclidean spaces. It is presented in the context of 3D invariant shape recognition and texture recognition. PREMONN uses local relation, it is modular and exhibits incremental learning. The recognition process can start at any point on a shape or texture, so a reference point is not needed. Its local relation characteristic enables it to recognize shape and texture even in presence of occlusion. The analysis is mainly mathematical. However, we present some experimental results. The methods presented in this paper can be applied to many problems such as gesture recognition, action recognition, dynamic texture recognition etc.

30. Underwater object detection using Invert Multi-Class Adaboost with deep learning [PDF] Back to Contents
  Long Chen, Zhihua Liu, Lei Tong, Zheheng Jiang, Shengke Wang, Junyu Dong, Huiyu Zhou
Abstract: In recent years, deep learning based methods have achieved promising performance in standard object detection. However, these methods lack sufficient capabilities to handle underwater object detection due to these challenges: (1) Objects in real applications are usually small and their images are blurry, and (2) images in the underwater datasets and real applications accompany heterogeneous noise. To address these two problems, we first propose a novel neural network architecture, namely Sample-WeIghted hyPEr Network (SWIPENet), for small object detection. SWIPENet consists of high resolution and semantic rich Hyper Feature Maps which can significantly improve small object detection accuracy. In addition, we propose a novel sample-weighted loss function which can model sample weights for SWIPENet, which uses a novel sample re-weighting algorithm, namely Invert Multi-Class Adaboost (IMA), to reduce the influence of noise on the proposed SWIPENet. Experiments on two underwater robot picking contest datasets URPC2017 and URPC2018 show that the proposed SWIPENet+IMA framework achieves better performance in detection accuracy against several state-of-the-art object detection approaches.
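
The "invert" intuition behind IMA, as described above, is the opposite of AdaBoost's usual update: instead of up-weighting misclassified samples, it reduces their influence, treating persistent errors as likely noise. The update rule and beta below are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def invert_weight_update_sketch(weights, correct, beta=0.5):
    """weights: (n,) current sample weights; correct: (n,) bool mask of
    samples the current detector classifies correctly. Misclassified
    samples are shrunk (suspected noise) rather than amplified."""
    w = weights * np.where(correct, 1.0, beta)
    return w / w.sum()   # renormalize to a distribution
```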

31. Self-supervised Robust Object Detectors from Partially Labelled datasets [PDF] Back to Contents
  Mahdieh Abbasi, Denis Laurendeau, Christian Gagne
Abstract: In the object detection task, merging various datasets from similar contexts but with different sets of Objects of Interest (OoI) is an inexpensive way (in terms of labor cost) for crafting a large-scale dataset covering a wide range of objects. Moreover, merging datasets allows us to train one integrated object detector, instead of training several ones, which in turn resulting in the reduction of computational and time costs. However, merging the datasets from similar contexts causes samples with partial labeling as each constituent dataset is originally annotated for its own set of OoI and ignores to annotate those objects that are become interested after merging the datasets. With the goal of training one integrated robust object detector with high generalization performance, we propose a training framework to overcome missing-label challenge of the merged datasets. More specifically, we propose a computationally efficient self-supervised framework to create on-the-fly pseudo-labels for the unlabelled positive instances in the merged dataset in order to train the object detector jointly on both ground truth and pseudo labels. We evaluate our proposed framework for training Yolo on a simulated merged dataset with missing rate ≈48% using VOC2012 and VOC2007. We empirically show that generalization performance of Yolo trained on both ground truth and the pseudo-labels created by our method is on average 4% higher than the ones trained only with the ground truth labels of the merged dataset.

32. ProAlignNet : Unsupervised Learning for Progressively Aligning Noisy Contours [PDF] Back to Contents
  VSR Veeravasarapu, Abhishek Goel, Deepak Mittal, Maneesh Singh
Abstract: Contour shape alignment is a fundamental but challenging problem in computer vision, especially when the observations are partial, noisy, and largely misaligned. Recent ConvNet-based architectures that were proposed to align image structures tend to fail with contour representation of shapes, mostly due to the use of proximity-insensitive pixel-wise similarity measures as loss functions in their training processes. This work presents a novel ConvNet, "ProAlignNet" that accounts for large scale misalignments and complex transformations between the contour shapes. It infers the warp parameters in a multi-scale fashion with progressively increasing complex transformations over increasing scales. It learns, without supervision, to align contours, agnostic to noise and missing parts, by training with a novel loss function which is derived as an upper bound of a proximity-sensitive and local shape-dependent similarity metric that uses the classical Morphological Chamfer Distance Transform. We evaluate the reliability of these proposals on a simulated MNIST noisy contours dataset via some basic sanity check experiments. Next, we demonstrate the effectiveness of the proposed models in two real-world applications of (i) aligning geo-parcel data to aerial image maps and (ii) refining coarsely annotated segmentation labels. In both applications, the proposed models consistently perform superior to state-of-the-art methods.

33. AnimGAN: A Spatiotemporally-Conditioned Generative Adversarial Network for Character Animation [PDF]
  Maryam Sadat Mirzaei, Kourosh Meshgi, Etienne Frigo, Toyoaki Nishida
Abstract: Producing realistic character animations is one of the essential tasks in human-AI interaction. Viewing a humanoid's motion as a sequence of poses, the task can be cast as a sequence generation problem with spatiotemporal smoothness and realism constraints. Additionally, we wish to control the behavior of AI agents by telling them what to do and, more specifically, how to do it. We propose a spatiotemporally-conditioned GAN that generates a sequence similar to a given sequence in terms of semantics and spatiotemporal dynamics. Using an LSTM-based generator and a graph ConvNet discriminator, the system is trained end-to-end on a large gathered dataset of gestures, expressions, and actions. Experiments showed that, compared to a traditional conditional GAN, our method creates plausible, realistic, and semantically relevant humanoid animation sequences that match user expectations.

34. Self-Training for Domain Adaptive Scene Text Detection [PDF]
  Yudi Chen, Wei Wang, Yu Zhou, Fei Yang, Dongbao Yang, Weiping Wang
Abstract: Though deep learning based scene text detection has achieved great progress, well-trained detectors suffer severe performance degradation on different domains. In general, a tremendous amount of data is indispensable to train a detector in the target domain. However, data collection and annotation are expensive and time-consuming. To address this problem, we propose a self-training framework that automatically mines hard examples with pseudo-labels from unannotated videos or images. To reduce the noise of hard examples, a novel text mining module is implemented based on the fusion of detection and tracking results. Then, an image-to-video generation method is designed for the cases where videos are unavailable and only images can be used. Experimental results on standard benchmarks, including ICDAR2015, MSRA-TD500 and ICDAR2017 MLT, demonstrate the effectiveness of our self-training method. A simple Mask R-CNN adapted with self-training and fine-tuned on real data can achieve results comparable or even superior to state-of-the-art methods.

35. Attention-guided Context Feature Pyramid Network for Object Detection [PDF]
  Junxu Cao, Qi Chen, Jun Guo, Ruichao Shi
Abstract: For object detection, how to address the contradictory requirement between feature map resolution and receptive field on high-resolution inputs still remains an open question. In this paper, to tackle this issue, we build a novel architecture, called Attention-guided Context Feature Pyramid Network (AC-FPN), that exploits discriminative information from various large receptive fields via integrating attention-guided multi-path features. The model contains two modules. The first one is Context Extraction Module (CEM) that explores large contextual information from multiple receptive fields. As redundant contextual relations may mislead localization and recognition, we also design the second module named Attention-guided Module (AM), which can adaptively capture the salient dependencies over objects by using the attention mechanism. AM consists of two sub-modules, i.e., Context Attention Module (CxAM) and Content Attention Module (CnAM), which focus on capturing discriminative semantics and locating precise positions, respectively. Most importantly, our AC-FPN can be readily plugged into existing FPN-based models. Extensive experiments on object detection and instance segmentation show that existing models with our proposed CEM and AM significantly surpass their counterparts without them, and our model successfully obtains state-of-the-art results. We have released the source code at this https URL.
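
The abstract does not spell out the layers of CEM, but the standard way to pool context from multiple receptive fields is a bank of parallel dilated convolutions; a hypothetical sketch:

    import torch
    import torch.nn as nn

    class ContextExtraction(nn.Module):
        """Parallel dilated 3x3 convolutions (one receptive field each),
        fused by a 1x1 convolution. Illustrative of CEM's role only."""
        def __init__(self, channels, rates=(1, 3, 6)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
                for r in rates])
            self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

        def forward(self, x):
            feats = [torch.relu(b(x)) for b in self.branches]
            return self.fuse(torch.cat(feats, dim=1))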

36. Delving into the Imbalance of Positive Proposals in Two-stage Object Detection [PDF]
  Zheng Ge, Zequn Jie, Xin Huang, Chengzheng Li, Osamu Yoshie
Abstract: The imbalance issue is a major yet unsolved bottleneck for current object detection models. In this work, we observe two crucial yet never-discussed imbalance issues. The first imbalance lies in the large number of low-quality RPN proposals, which makes the R-CNN module (i.e., post-classification layers) highly biased towards negative proposals in the early training stage. The second imbalance stems from the unbalanced ground-truth numbers across different testing images, resulting in an imbalance in the number of potentially existing positive proposals at testing time. To tackle these two imbalance issues, we incorporate two innovations into Faster R-CNN: 1) an R-CNN Gradient Annealing (RGA) strategy to enhance the impact of positive proposals in the early training stage; 2) a set of Parallel R-CNN Modules (PRM) with different positive/negative sampling ratios trained on one shared backbone. Together, our RGA and PRM bring a 2.0% AP improvement on COCO minival. Experiments on CrowdHuman further validate the effectiveness of our innovations across various kinds of object detection tasks.
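
The abstract does not give RGA's exact schedule; one plausible reading is a decaying up-weighting of the positive-proposal loss, sketched here purely as an assumption:

    def annealed_positive_weight(step, warmup_steps=10000, w_max=2.0):
        """Hypothetical annealing: positive proposals are up-weighted early
        in training, decaying linearly to 1.0 so that late training matches
        the standard Faster R-CNN objective."""
        if step >= warmup_steps:
            return 1.0
        return w_max - (w_max - 1.0) * (step / warmup_steps)

    # rcnn_loss = annealed_positive_weight(step) * pos_loss + neg_loss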

37. Learning from Naturalistic Driving Data for Human-like Autonomous Highway Driving [PDF]
  Donghao Xu, Zhezhang Ding, Xu He, Huijing Zhao, Mathieu Moze, François Aioun, Franck Guillemard
Abstract: Driving in a human-like manner is important for an autonomous vehicle to be a smart and predictable traffic participant. To achieve this goal, the parameters of the motion planning module should be carefully tuned, which requires great effort and expert knowledge. In this study, a method for learning the cost parameters of a motion planner from naturalistic driving data is proposed. Learning is achieved by encouraging the selected trajectory to approximate the human driving trajectory under the same traffic situation. The employed motion planner follows a widely accepted methodology that first samples candidate trajectories in the trajectory space, then selects the one with minimal cost as the planned trajectory. Moreover, in addition to traditional factors such as comfort, efficiency and safety, the cost function is designed to incorporate the behavioral incentives of a human driver, so that lane-change decision and motion planning are coupled into one framework. Two types of lane incentive cost -- heuristic and learning-based -- are proposed and implemented. To verify the validity of the proposed method, a dataset is developed from the naturalistic trajectory data of human drivers collected on motorways in Beijing, containing samples of lane changes to the left and right lanes, as well as car-following. Experiments are conducted on both lane-change decision and motion planning, and promising results are achieved.

38. Fine-Grain Few-Shot Vision via Domain Knowledge as Hyperspherical Priors [PDF]
  Bijan Haney, Alexander Lavin
Abstract: Prototypical networks have been shown to perform well at few-shot learning tasks in computer vision. Yet these networks struggle when classes are very similar to each other (fine-grain classification) and currently have no way of taking into account prior knowledge (through the use of tabular data). Using a spherical latent space to encode prototypes, we can achieve few-shot fine-grain classification by maximally separating the classes while incorporating domain knowledge as informative priors. We describe how to construct a hypersphere of prototypes that embed a-priori domain information, and demonstrate the effectiveness of the approach on challenging benchmark datasets for fine-grain classification, with top results for one-shot classification and 5x speedups in training time.

39. S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation [PDF]
  Yizhe Zhu, Martin Renqiang Min, Asim Kadav, Hans Peter Graf
Abstract: We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., videos and audio) under self-supervision. Specifically, we exploit the benefits of some readily accessible supervisory signals from the input data itself or some off-the-shelf functional models, and accordingly design auxiliary tasks for our model to utilize these signals. With the supervision of these signals, our model can easily disentangle the representation of an input sequence into static factors and dynamic factors (i.e., time-invariant and time-varying parts). Comprehensive experiments across videos and audio verify the effectiveness of our model on representation disentanglement and generation of sequential data, and demonstrate that our model with self-supervision performs comparably to, if not better than, the fully-supervised model with ground-truth labels, and outperforms state-of-the-art unsupervised models by a large margin.

40. Hashing-based Non-Maximum Suppression for Crowded Object Detection [PDF]
  Jianfeng Wang, Xi Yin, Lijuan Wang, Lei Zhang
Abstract: In this paper, we propose an algorithm named hashing-based non-maximum suppression (HNMS) to efficiently suppress non-maximum boxes for object detection. Non-maximum suppression (NMS) is an essential component that suppresses boxes at closely located positions with similar shapes. Its time cost tends to be huge when the number of boxes becomes large, especially for crowded scenes. The basic idea of HNMS is to first map each box to a discrete code (hash cell) and then remove the boxes with lower confidence if they fall into the same cell. Taking intersection-over-union (IoU) as the metric, we propose a simple yet effective hashing algorithm, named IoUHash, which guarantees that the boxes within the same cell are close enough by a lower IoU bound. For two-stage detectors, we replace NMS in the region proposal network with HNMS, and observe significant speed-ups with comparable accuracy. For one-stage detectors, HNMS is used as a pre-filter to speed up the suppression by a large margin. Extensive experiments are conducted on the CARPK, SKU-110K and CrowdHuman datasets to demonstrate the efficiency and effectiveness of HNMS. Code is released at \url{this https URL}.
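
A simplified sketch of the mechanism (the paper's IoUHash guarantees an IoU lower bound for boxes sharing a cell; the centre-plus-log-size hash below does not, and is only meant to show the single pass over the boxes):

    import math

    def hnms(boxes, scores, cell=64.0):
        """Keep only the highest-scoring box per hash cell; boxes are
        (x1, y1, x2, y2), hashed by centre cell and coarse size bucket."""
        best = {}
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
            key = (int(cx // cell), int(cy // cell),
                   int(math.log2(max(x2 - x1, 1.0))),
                   int(math.log2(max(y2 - y1, 1.0))))
            if key not in best or scores[i] > scores[best[key]]:
                best[key] = i
        return sorted(best.values())  # indices of surviving boxes

Because each box is touched once, the cost is linear in the number of boxes, versus the quadratic pairwise comparisons of greedy NMS.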

41. Approaching Bio Cellular Classification for Malaria Infected Cells Using Machine Learning and then Deep Learning to compare & analyze K-Nearest Neighbours and Deep CNNs [PDF]
  Rishabh Malhotra, Dhron Joshi, Ku Young Shin
Abstract: Malaria is a deadly disease which claims the lives of hundreds of thousands of people every year. Computational methods have proven useful in the medical industry by providing effective means of classifying diagnostic images and identifying disease. This paper examines different machine learning methods in the context of classifying the presence of malaria in cell images. Numerous machine learning methods can be applied to the same problem; whether one method is better suited to a problem depends heavily on the problem itself and on the implementation of the model. In particular, convolutional neural networks and k-nearest neighbours are analyzed and contrasted with regard to their application to classifying the presence of malaria and each model's empirical performance. Here, we implement two classification models: a convolutional neural network, and the k-nearest neighbours algorithm. The two algorithms are compared based on validation accuracy. In our implementation, the CNN (95%) performed 25% better than kNN (75%).

42. Novel Human-Object Interaction Detection via Adversarial Domain Generalization [PDF]
  Yuhang Song, Wenbo Li, Lei Zhang, Jianwei Yang, Emre Kiciman, Hamid Palangi, Jianfeng Gao, C.-C. Jay Kuo, Pengchuan Zhang
Abstract: We study in this paper the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios. The challenge mainly stems from the large compositional space of objects and predicates, which leads to the lack of sufficient training data for all the object-predicate combinations. As a result, most existing HOI methods heavily rely on object priors and can hardly generalize to unseen combinations. To tackle this problem, we propose a unified framework of adversarial domain generalization to learn object-invariant features for predicate prediction. To measure the performance improvement, we create a new split of the HICO-DET dataset, where the HOIs in the test set are all unseen triplet categories in the training set. Our experiments show that the proposed framework significantly increases the performance by up to 50% on the new split of HICO-DET dataset and up to 125% on the UnRel dataset for auxiliary evaluation in detecting novel HOIs.

43. One of these (Few) Things is Not Like the Others [PDF]
  Nat Roth, Justin Wagle
Abstract: To perform well, most deep learning based image classification systems require large amounts of data and computing resources. These constraints make it difficult to quickly personalize to individual users or train models outside of fairly powerful machines. To deal with these problems, there has been a large body of research into teaching machines to learn to classify images based on only a handful of training examples, a field known as few-shot learning. Few-shot learning research traditionally makes the simplifying assumption that all images belong to one of a fixed number of previously seen groups. However, many image datasets, such as a camera roll on a phone, will be noisy and contain images that may not be relevant or fit into any clear group. We propose a model which can both classify new images based on a small number of examples and recognize images which do not belong to any previously seen group. We adapt previous few-shot learning work to include a simple mechanism for learning a cutoff that determines whether an image should be excluded or classified. We examine how well our method performs in a realistic setting, benchmarking the approach on a noisy and ambiguous dataset of images. We evaluate performance over a spectrum of model architectures, including setups small enough to be run on low powered devices, such as mobile phones or web browsers. We find that this task of excluding irrelevant images poses significant extra difficulty beyond that of the traditional few-shot task. We decompose the sources of this error, and suggest future improvements that might alleviate this difficulty.

44. Machine Vision using Diffractive Spectral Encoding [PDF]
  Jingxi Li, Deniz Mengu, Nezih T. Yardimci, Yi Luo, Xurong Li, Muhammed Veli, Yair Rivenson, Mona Jarrahi, Aydogan Ozcan
Abstract: Machine vision systems mostly rely on lens-based optical imaging architectures that relay the spatial information of objects onto high pixel-count opto-electronic sensor arrays, followed by digital processing of this information. Here, we demonstrate an optical machine vision system that uses trainable matter in the form of diffractive layers to transform and encode the spatial information of objects into the power spectrum of the diffracted light, which is used to perform optical classification of objects with a single-pixel spectroscopic detector. Using a time-domain spectroscopy setup with a plasmonic nanoantenna-based detector, we experimentally validated this framework at terahertz spectrum to optically classify the images of handwritten digits by detecting the spectral power of the diffracted light at ten distinct wavelengths, each representing one class/digit. We also report the coupling of this spectral encoding achieved through a diffractive optical network with a shallow electronic neural network, separately trained to reconstruct the images of handwritten digits based on solely the spectral information encoded in these ten distinct wavelengths within the diffracted light. These reconstructed images demonstrate task-specific image decompression and can also be cycled back as new inputs to the same diffractive network to improve its optical object classification. This unique framework merges the power of deep learning with the spatial and spectral processing capabilities of trainable matter, and can also be extended to other spectral-domain measurement systems to enable new 3D imaging and sensing modalities integrated with spectrally encoded classification tasks performed through diffractive networks.

45. Image Translation by Latent Union of Subspaces for Cross-Domain Plaque Detection [PDF]
  Yingying Zhu, Daniel C. Elton, Sungwon Lee, Perry J. Pickhardt, Ronald M. Summers
Abstract: Calcified plaque in the aorta and pelvic arteries is associated with coronary artery calcification and is a strong predictor of heart attack. Current calcified plaque detection models show poor generalizability to different domains (i.e., pre-contrast vs. post-contrast CT scans). Many recent works have shown how cross-domain object detection can be improved using an image translation model which translates between domains using a single shared latent space. However, while current image translation models do a good job preserving global/intermediate-level structures, they often have trouble preserving tiny structures. In medical imaging applications, preserving small structures is important since these structures can carry information highly relevant for disease diagnosis. Recent works on image reconstruction show that complex real-world images are better reconstructed using a union-of-subspaces approach. Since small image patches are used to train the image translation model, it makes sense to enforce that each patch be represented by a linear combination of subspaces which may correspond to the different parts of the body present in that patch. Motivated by this, we propose an image translation network using a shared union-of-subspaces constraint and show that our approach preserves subtle structures (plaques) better than the conventional method. We further apply our method to a cross-domain plaque detection task and show significant improvement compared to the state-of-the-art method.

46. Gleason Grading of Histology Prostate Images through Semantic Segmentation via Residual U-Net [PDF]
  Amartya Kalapahar, Julio Silva-Rodríguez, Adrián Colomer, Fernando López-Mir, Valery Naranjo
Abstract: Worldwide, prostate cancer is one of the main cancers affecting men. The final diagnosis of prostate cancer is based on the visual detection of Gleason patterns in prostate biopsy by pathologists. Computer-aided-diagnosis systems allow delineating and classifying the cancerous patterns in the tissue via computer-vision algorithms in order to support the physicians' task. The methodological core of this work is a U-Net convolutional neural network for image segmentation, modified with residual blocks, able to segment cancerous tissue according to the full Gleason system. This model outperforms other well-known architectures and reaches a pixel-level Cohen's quadratic Kappa of 0.52, at the level of previous image-level works in the literature, while also providing a detailed localisation of the patterns.

47. Stable and expressive recurrent vision models [PDF]
  Drew Linsley, Alekh Karkada Ashok, Lakshmi Narasimhan Govindarajan, Rex Liu, Thomas Serre
Abstract: Primate vision depends on recurrent processing for reliable perception (Gilbert & Li, 2013). At the same time, there is a growing body of literature demonstrating that recurrent connections improve the learning efficiency and generalization of vision models on classic computer vision challenges. Why then, are current large-scale challenges dominated by feedforward networks? We posit that the effectiveness of recurrent vision models is bottlenecked by the widespread algorithm used for training them, "back-propagation through time" (BPTT), which has O(N) memory-complexity for training an N step model. Thus, recurrent vision model design is bounded by memory constraints, forcing a choice between rivaling the enormous capacity of leading feedforward models or trying to compensate for this deficit through granular and complex dynamics. Here, we develop a new learning algorithm, "contractor recurrent back-propagation" (C-RBP), which alleviates these issues by achieving constant O(1) memory-complexity with steps of recurrent processing. We demonstrate that recurrent vision models trained with C-RBP can detect long-range spatial dependencies in a synthetic contour tracing task that BPTT-trained models cannot. We further demonstrate that recurrent vision models trained with C-RBP to solve the large-scale Panoptic Segmentation MS-COCO challenge outperform the leading feedforward approach. C-RBP is a general-purpose learning algorithm for any application that can benefit from expansive recurrent dynamics. Code and data are available at this https URL.
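
The memory argument can be made concrete with the classical recurrent back-propagation identity that C-RBP builds on (a sketch; C-RBP's contractor constraint is not reproduced here). For a recurrent update $h_{t+1} = F(h_t, x; \theta)$ that converges to a stable fixed point $h^* = F(h^*, x; \theta)$, implicit differentiation gives

    \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial h^*} \left(I - J_F(h^*)\right)^{-1} \frac{\partial F(h^*, x; \theta)}{\partial \theta},

where $J_F$ is the Jacobian of $F$ with respect to $h$. Evaluating the gradient by unrolling (BPTT) stores all $N$ intermediate states, hence $O(N)$ memory; the fixed-point form needs only $h^*$, hence $O(1)$.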

48. Attention-based Neural Bag-of-Features Learning for Sequence Data [PDF]
  Dat Thanh Tran, Nikolaos Passalis, Anastasios Tefas, Moncef Gabbouj, Alexandros Iosifidis
Abstract: In this paper, we propose 2D-Attention (2DA), a generic attention formulation for sequence data, which acts as a complementary computation block that can detect and focus on relevant sources of information for the given learning objective. The proposed attention module is incorporated into the recently proposed Neural Bag-of-Features (NBoF) model to enhance its learning capacity. Since 2DA acts as a plug-in layer, injecting it into different computation stages of the NBoF model results in different 2DA-NBoF architectures, each of which possesses a unique interpretation. We conducted extensive experiments on financial forecasting, audio analysis and medical diagnosis problems to benchmark the proposed formulations against existing methods, including the widely used Gated Recurrent Units. Our empirical analysis shows that the proposed attention formulations not only improve the performance of NBoF models but also make them resilient to noisy data.
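
A toy rendering of the 2D idea, attending over both axes of a $(T, D)$ sequence (the parameterization is an assumption for illustration; the paper's exact formulation may differ):

    import numpy as np

    def softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_2d(X, W_t, W_d):
        """X: (T, D) sequence; W_t, W_d: learnable (D, D) matrices.
        One mask emphasizes time steps, the other feature dimensions."""
        A_t = softmax(X @ W_t, axis=0)   # attention across time
        A_d = softmax(X @ W_d, axis=1)   # attention across features
        return X * (A_t + A_d) / 2.0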

49. JSSR: A Joint Synthesis, Segmentation, and Registration System for 3D Multi-Modal Image Alignment of Large-scale Pathological CT Scans [PDF]
  Fengze Liu, Jingzheng Cai, Yuankai Huo, Le Lu, Adam P Harrison
Abstract: Multi-modal image registration is a challenging yet important clinical task in many real applications and scenarios. For medical-imaging-based diagnosis, deformable registration among different image modalities is often required as a first step in order to provide complementary visual information. During registration, semantic information is the key to matching homologous points and pixels. Nevertheless, many conventional registration methods are incapable of capturing high-level semantic anatomical dense correspondences. In this work, we propose a novel multi-task learning system, JSSR, based on an end-to-end 3D convolutional neural network composed of a generator, a register and a segmentor, for the tasks of synthesis, registration and segmentation, respectively. The system is optimized to satisfy the implicit constraints between the different tasks in an unsupervised manner. It first synthesizes the source-domain images into the target domain; an intra-modal registration is then applied to the synthesized and target images. We then obtain semantic segmentation by applying segmentors to the synthesized and target images, which are aligned by the same deformation field generated by the register. Supervision from another fully-annotated dataset is used to regularize the segmentors. We extensively evaluate our JSSR system on a large-scale medical image dataset containing 1,485 patient CT imaging studies of four different phases (i.e., 5,940 3D CT scans with pathological livers) on the registration, segmentation and synthesis tasks. After joint training, performance on the registration and segmentation tasks improves by $0.9\%$ and $1.9\%$, respectively, over a highly competitive and accurate baseline. The registration part also consistently outperforms conventional state-of-the-art multi-modal registration methods.

50. NENET: An Edge Learnable Network for Link Prediction in Scene Text [PDF]
  Mayank Kumar Singh, Sayan Banerjee, Shubhasis Chaudhuri
Abstract: Text detection in scenes based on deep neural networks has shown promising results. Instead of using word bounding-box regression, recent state-of-the-art methods have started focusing on character bounding boxes and pixel-level prediction. This necessitates linking adjacent characters, which we address in this paper using a novel Graph Neural Network (GNN) architecture that allows us to learn both node and edge features, as opposed to only the node features of a typical GNN. The main advantage of using a GNN for link prediction lies in its ability to connect characters that are spatially separated and arbitrarily oriented. We demonstrate our approach on the well-known SynthText dataset, achieving top results compared to state-of-the-art methods.

51. The efficiency of deep learning algorithms for detecting anatomical reference points on radiological images of the head profile [PDF]
  Konstantin Dobratulin, Andrey Gaidel, Irina Aupova, Anna Ivleva, Aleksandr Kapishnikov, Pavel Zelter
Abstract: In this article we investigate the efficiency of deep learning algorithms for detecting anatomical reference points on radiological images of the head in lateral projection, using a fully convolutional neural network and a fully convolutional neural network with an extended architecture for biomedical image segmentation, U-Net. We compare the reference-point detection results of each of the selected neural network architectures with each other and with the reference points detected by orthodontists. Based on the obtained results, we conclude that the U-Net neural network detects anatomical reference points more accurately than a plain fully convolutional neural network. The reference points detected by the U-Net are also closer to the average of the reference points detected by a group of orthodontists.

52. An interpretable automated detection system for FISH-based HER2 oncogene amplification testing in histo-pathological routine images of breast and gastric cancer diagnostics [PDF]
  Sarah Schmell, Falk Zakrzewski, Walter de Back, Martin Weigert, Uwe Schmidt, Torsten Wenke, Silke Zeugner, Robert Mantey, Christian Sperling, Ingo Roeder, Pia Hoenscheid, Daniela Aust, Gustavo Baretton
Abstract: Histo-pathological diagnostics are an inherent part of everyday clinical work but are particularly laborious and associated with time-consuming manual analysis of image data. In order to cope with increasing diagnostic case numbers, driven by the current growth and demographic change of the global population and the progress in personalized medicine, pathologists ask for assistance. Profiting from digital pathology and the use of artificial intelligence, individual solutions can be offered (e.g. detecting labeled cancer tissue sections). Testing the human epidermal growth factor receptor 2 (HER2) oncogene amplification status via fluorescence in situ hybridization (FISH) is recommended for breast and gastric cancer diagnostics and is regularly performed at clinics. Here, we develop an interpretable, deep learning (DL)-based pipeline which automates the evaluation of FISH images with respect to HER2 gene amplification testing. It mimics the pathological assessment and relies on the detection and localization of interphase nuclei based on instance segmentation networks. Furthermore, it localizes and classifies fluorescence signals within each nucleus with the help of image classification and object detection convolutional neural networks (CNNs). Finally, the pipeline classifies the whole image with respect to its HER2 amplification status. Visualizing the pixels on which the networks' decisions are based complements an essential part to enable interpretability by pathologists.

53. Eye Gaze Controlled Robotic Arm for Persons with SSMI [PDF]
  Vinay Krishna Sharma, L.R.D. Murthy, KamalPreet Singh Saluja, Vimal Mollyn, Gourav Sharma, Pradipta Biswas
Abstract: Background: People with severe speech and motor impairment (SSMI) often use a technique called eye pointing to communicate with the outside world. One of their parents, caretakers or teachers holds a printed board in front of them and, by analyzing their eye gaze manually, interprets their intentions. This technique is often error-prone and time-consuming, and depends on a single caretaker. Objective: We aimed to automate the eye-tracking process electronically by using a commercially available tablet, computer or laptop, without requiring any dedicated hardware for eye gaze tracking. The eye gaze tracker is used to develop a video see-through based AR (augmented reality) display that controls a robotic device with eye gaze, deployed for a fabric printing task. Methodology: We undertook a user-centred design process and separately evaluated the webcam-based gaze tracker and the video see-through based human-robot interaction involving users with SSMI. We also report a user study on manipulating a robotic arm with the webcam-based eye gaze tracker. Results: Using our bespoke eye gaze controlled interface, able-bodied users can select one of nine regions of the screen at a median of less than 2 secs, and users with SSMI can do so at a median of 4 secs. Using the eye gaze controlled human-robot AR display, users with SSMI could undertake a representative pick-and-drop task in an average duration of less than 15 secs, and reach a randomly designated target within 60 secs using a COTS eye tracker or within an average time of 2 mins using the webcam-based eye gaze tracker.

54. Hyperspectral Image Classification with Attention Aided CNNs [PDF]
  Renlong Hang, Zhu Li, Qingshan Liu, Pedram Ghamisi, Shuvra S. Bhattacharyya
Abstract: Convolutional neural networks (CNNs) have been widely used for hyperspectral image classification. As a common process, small cubes are firstly cropped from the hyperspectral image and then fed into CNNs to extract spectral and spatial features. It is well known that different spectral bands and spatial positions in the cubes have different discriminative abilities. If fully explored, this prior information will help improve the learning capacity of CNNs. Along this direction, we propose an attention aided CNN model for spectral-spatial classification of hyperspectral images. Specifically, a spectral attention sub-network and a spatial attention sub-network are proposed for spectral and spatial classification, respectively. Both of them are based on the traditional CNN model, and incorporate attention modules to aid networks focus on more discriminative channels or positions. In the final classification phase, the spectral classification result and the spatial classification result are combined together via an adaptively weighted summation method. To evaluate the effectiveness of the proposed model, we conduct experiments on three standard hyperspectral datasets. The experimental results show that the proposed model can achieve superior performance compared to several state-of-the-art CNN-related models.
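
The spectral branch is in the spirit of channel attention; a minimal squeeze-and-excitation style sketch (not the paper's exact module):

    import torch
    import torch.nn as nn

    class SpectralAttention(nn.Module):
        """Reweight the spectral bands of a hyperspectral cube by a score
        learned from each band's global average response."""
        def __init__(self, bands, reduction=4):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(bands, bands // reduction), nn.ReLU(),
                nn.Linear(bands // reduction, bands), nn.Sigmoid())

        def forward(self, x):                 # x: (batch, bands, H, W)
            w = self.mlp(x.mean(dim=(2, 3)))  # (batch, bands) band scores
            return x * w[:, :, None, None]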

55. Keypoints Localization for Joint Vertebra Detection and Fracture Severity Quantification [PDF]
  Maxim Pisov, Vladimir Kondratenko, Alexey Zakharov, Alexey Petraikin, Victor Gombolevskiy, Sergey Morozov, Mikhail Belyaev
Abstract: Vertebral body compression fractures are reliable early signs of osteoporosis. Though these fractures are visible on Computed Tomography (CT) images, they are frequently missed by radiologists in clinical settings. Prior research on automatic methods of vertebral fracture classification proves its reliable quality; however, existing methods provide hard-to-interpret outputs and sometimes fail to process cases with severe abnormalities such as highly pathological vertebrae or scoliosis. We propose a new two-step algorithm to localize the vertebral column in 3D CT images and then to simultaneously detect individual vertebrae and quantify fractures in 2D. We train neural networks for both steps using a simple 6-keypoints based annotation scheme, which corresponds precisely to current medical recommendation. Our algorithm has no exclusion criteria, processes 3D CT in 2 seconds on a single GPU, and provides an intuitive and verifiable output. The method approaches expert-level performance and demonstrates state-of-the-art results in vertebrae 3D localization (the average error is 1 mm), vertebrae 2D detection (precision is 0.99, recall is 1), and fracture identification (ROC AUC at the patient level is 0.93).

56. A Bayesian-inspired, deep learning, semi-supervised domain adaptation technique for land cover mapping [PDF]
  Benjamin Lucas, Charlotte Pelletier, Daniel Schmidt, Geoffrey I. Webb, François Petitjean
Abstract: Land cover maps are a vital input variable to many types of environmental research and management. While they can be produced automatically by machine learning techniques, these techniques require substantial training data to achieve high levels of accuracy, which are not always available. One technique researchers use when labelled training data are scarce is domain adaptation (DA) -- where data from an alternate region, known as the source domain, are used to train a classifier and this model is adapted to map the study region, or target domain. The scenario we address in this paper is known as semi-supervised DA, where some labelled samples are available in the target domain. In this paper we present Sourcerer, a Bayesian-inspired, deep learning-based, semi-supervised DA technique for producing land cover maps from SITS data. The technique takes a convolutional neural network trained on a source domain and then trains further on the available target domain with a novel regularizer applied to the model weights. The regularizer adjusts the degree to which the model is modified to fit the target data, limiting the degree of change when the target data are few in number and increasing it as target data quantity increases. Our experiments on Sentinel-2 time series images compare Sourcerer with two state-of-the-art semi-supervised domain adaptation techniques and four baseline models. We show that on two different source-target domain pairings Sourcerer outperforms all other methods for any quantity of labelled target data available. In fact, the results on the more difficult target domain show that the starting accuracy of Sourcerer (when no labelled target data are available), 74.2%, is greater than the next-best state-of-the-art method trained on 20,000 labelled target instances.
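
The abstract describes a regularizer that ties the model to the source-trained weights with a strength that decays as target data accumulate; a hypothetical rendering (the 1/(1 + n/k) scaling is an illustrative choice, not the paper's schedule):

    def sourcerer_penalty(params, source_params, n_target, k=1e4):
        """Penalize deviation from the source-domain weights; with few
        target samples the pull is strong, and it relaxes as n_target
        grows. Works on any framework's weight tensors."""
        lam = 1.0 / (1.0 + n_target / k)
        return lam * sum(((p - p0) ** 2).sum()
                         for p, p0 in zip(params, source_params))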

57. mr2NST: Multi-Resolution and Multi-Reference Neural Style Transfer for Mammography [PDF]
  Sheng Wang, Jiayu Huo, Xi Ouyang, Jifei Che, Xuhua Ren, Zhong Xue, Qian Wang, Jie-Zhi Cheng
Abstract: Computer-aided diagnosis with deep learning techniques has been shown to be helpful for the diagnosis of mammography in many clinical studies. However, the image styles of different vendors are very distinctive, and there may exist a domain gap among vendors that could potentially compromise the universal applicability of a single deep learning model. In this study, we explicitly address the style-variation issue with the proposed multi-resolution and multi-reference neural style transfer (mr2NST) network. mr2NST can normalize the styles of different vendors to the same style baseline at very high resolution. We show that the image quality of the transferred images is comparable to that of the original images of the target domain (vendor) in terms of NIMA scores. Meanwhile, the mr2NST results are also shown to be helpful for lesion detection in mammograms.

58. Bayesian Conditional GAN for MRI Brain Image Synthesis [PDF]
  Gengyan Zhao, Mary E. Meyerand, Rasmus M. Birn
Abstract: As a powerful technique in medical imaging, image synthesis is widely used in applications such as denoising, super-resolution and modality transformation. Recently, the revival of deep neural networks has made immense progress in the field of medical imaging. Although many deep-learning-based models have been proposed to improve image synthesis accuracy, the evaluation of model uncertainty, which is highly important for medical applications, has been a missing part. In this work, we propose to use a Bayesian conditional generative adversarial network (GAN) with concrete dropout to improve image synthesis accuracy. Meanwhile, an uncertainty calibration approach is included in the whole pipeline to make the uncertainty generated by the Bayesian network interpretable. The method is validated on T1w-to-T2w MR image translation with a brain tumor dataset of 102 subjects. Compared with a conventional Bayesian neural network with Monte Carlo dropout, the proposed method reaches a significantly lower RMSE with a p-value of 0.0186. Improvement in the calibration of the generated uncertainty by the uncertainty recalibration method is also illustrated.

59. An efficient iterative method for reconstructing surface from point clouds [PDF]
  Dong Wang
Abstract: Surface reconstruction from point clouds is a fundamental step in many computer vision applications. In this paper, we develop an efficient iterative method on a variational model for surface reconstruction from point clouds. The surface is implicitly represented by indicator functions, and the energy functional is then approximated based on such representations using heat kernel convolutions. We then develop a novel iterative method to minimize the approximate energy and prove the energy-decaying property during each iteration. We further use asymptotic expansion to establish a connection between the proposed algorithm and active contour models. Extensive numerical experiments are performed in both 2- and 3-dimensional Euclidean spaces to show that the proposed method is simple, efficient, and accurate.
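
Diffusion-then-threshold schemes of this family (MBO-style) are easy to sketch; the update below is illustrative only and differs in detail from the paper's energy and iteration:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def mbo_step(u, pts_mask, sigma=1.0, mu=0.5):
        """u: boolean voxel indicator of the current interior estimate;
        pts_mask: boolean grid marking voxels that contain input points.
        Heat-kernel convolution smooths the region (approximating
        curvature motion); the data term rewards covering the points."""
        v = gaussian_filter(u.astype(float), sigma=sigma)
        v = v + mu * pts_mask
        return v > 0.5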

60. Vision-based control of a knuckle boom crane with online cable length estimation [PDF]
  Geir Ole Tysse, Andrej Cibicik, Olav Egeland
Abstract: A vision-based controller for a knuckle boom crane is presented. The controller is used to control the motion of the crane tip and at the same time compensate for payload oscillations. The oscillations of the payload are measured with three cameras that are fixed to the crane king and are used to track two spherical markers fixed to the payload cable. Based on color and size information, each camera identifies the image points corresponding to the markers. The payload angles are then determined using linear triangulation of the image points. An extended Kalman filter is used for estimation of payload angles and angular velocity. The length of the payload cable is also estimated using a least squares technique with projection. The crane is controlled by a linear cascade controller where the inner control loop is designed to damp out the pendulum oscillation, and the crane tip is controlled by the outer loop. The control variable of the controller is the commanded crane tip acceleration, which is converted to a velocity command using a velocity loop. The performance of the control system is studied experimentally using a scaled laboratory version of a knuckle boom crane.
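
The triangulation step is classical; a two-view linear (DLT) sketch, which extends to the paper's three cameras by stacking two more rows per extra view:

    import numpy as np

    def triangulate(P1, P2, x1, x2):
        """P1, P2: 3x4 camera projection matrices; x1, x2: the marker's
        pixel coordinates in each view. Returns the 3D point that
        minimizes the algebraic reprojection error."""
        A = np.vstack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1]])
        _, _, vt = np.linalg.svd(A)
        X = vt[-1]
        return X[:3] / X[3]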

61. Multi-view Alignment and Generation in CCA via Consistent Latent Encoding [PDF]
  Yaxin Shi, Yuangang Pan, Donna Xu, Ivor W. Tsang
Abstract: Multi-view alignment, achieving one-to-one correspondence of multi-view inputs, is critical in many real-world multi-view applications, especially for cross-view data analysis problems. Recently, an increasing number of works study this alignment problem with Canonical Correlation Analysis (CCA). However, existing CCA models are prone to misalign the multiple views due to either the neglect of uncertainty or the inconsistent encoding of the multiple views. To tackle these two issues, this paper studies multi-view alignment from the Bayesian perspective. Delving into the impairments of inconsistent encodings, we propose to recover correspondence of the multi-view inputs by matching the marginalization of the joint distribution of multi-view random variables under different forms of factorization. To realize our design, we present Adversarial CCA (ACCA) which achieves consistent latent encodings by matching the marginalized latent encodings through the adversarial training paradigm. Our analysis based on conditional mutual information reveals that ACCA is flexible for handling implicit distributions. Extensive experiments on correlation analysis and cross-view generation under noisy input settings demonstrate the superiority of our model.

62. A Lightweight CNN and Joint Shape-Joint Space (JS2) Descriptor for Radiological Osteoarthritis Detection [PDF]
  Neslihan Bayramoglu, Miika T. Nieminen, Simo Saarakkala
Abstract: Knee osteoarthritis (OA) is a very common progressive and degenerative musculoskeletal disease worldwide that creates a heavy burden on patients, through reduced quality of life, and on society, through its financial impact. Therefore, any attempt to reduce the burden of the disease could help both patients and society. In this study, we propose a fully automated novel method, based on the combination of joint shape and convolutional neural network (CNN) based bone texture features, to distinguish between knee radiographs with and without radiographic osteoarthritis. Moreover, we report the first attempt at describing bone texture using a CNN. Knee radiographs from the Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis (MOST) studies were used in the experiments. Our models were trained on 8953 knee radiographs from OAI and evaluated on 3445 knee radiographs from MOST. Our results demonstrate that fusing the proposed shape and texture parameters achieves state-of-the-art performance in radiographic OA detection, yielding an area under the ROC curve (AUC) of 95.21%.

63. Learning Camera Miscalibration Detection [PDF]
  Andrei Cramariuc, Aleksandar Petrov, Rohit Suri, Mayank Mittal, Roland Siegwart, Cesar Cadena
Abstract: Self-diagnosis and self-repair are some of the key challenges in deploying robotic platforms for long-term real-world applications. One of the issues that can occur to a robot is miscalibration of its sensors due to aging, environmental transients, or external disturbances. Precise calibration lies at the core of a variety of applications, due to the need to accurately perceive the world. However, while a lot of work has focused on calibrating the sensors, not much has been done towards identifying when a sensor needs to be recalibrated. This paper focuses on a data-driven approach to learn the detection of miscalibration in vision sensors, specifically RGB cameras. Our contributions include a proposed miscalibration metric for RGB cameras and a novel semi-synthetic dataset generation pipeline based on this metric. Additionally, by training a deep convolutional neural network, we demonstrate the effectiveness of our pipeline to identify whether a recalibration of the camera's intrinsic parameters is required or not. The code is available at this http URL.
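
One natural miscalibration measure for intrinsics is the mean pixel displacement between projections under the true and the estimated camera matrix; the sketch below is an assumed stand-in for the paper's metric:

    import numpy as np

    def avg_pixel_error(K_true, K_est, pts):
        """pts: (N, 2) normalized image-plane coordinates (z = 1).
        Projects them with both intrinsic matrices and averages the
        per-point pixel displacement."""
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])
        proj = lambda K: (pts_h @ K.T)[:, :2]
        return np.linalg.norm(proj(K_true) - proj(K_est), axis=1).mean()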

64. PoliteCamera: Respecting Strangers' Privacy in Mobile Photographing [PDF]
  Ang Li, Wei Du, Qinghua Li
Abstract: The camera is a standard on-board sensor of modern mobile phones. It makes photo taking popular due to its convenience and high resolution. However, when users take a photo of scenery, a building or a target person, a stranger may also be unintentionally captured in the photo. Such photos expose the location and activity of strangers, and hence may breach their privacy. In this paper, we propose a cooperative mobile photographing scheme called PoliteCamera to protect strangers' privacy. Through cooperation between the photographer and a stranger, the stranger's face in a photo can be automatically blurred upon his request when the photo is taken. Since multiple strangers near the photographer might send out blurring requests even though not all of them are in the photo, an adapted balanced convolutional neural network (ABCNN) is proposed to determine whether the requesting stranger is in the photo based on facial attributes. Evaluations demonstrate that the ABCNN can accurately predict facial attributes and that PoliteCamera can provide accurate privacy protection for strangers.

65. MVStylizer: An Efficient Edge-Assisted Video Photorealistic Style Transfer System for Mobile Phones [PDF]
  Ang Li, Chunpeng Wu, Yiran Chen, Bin Ni
Abstract: Recent research has made great progress in realizing neural style transfer of images, i.e., transforming an image into a desired style. Many users use their mobile phones to record their daily life, and then edit and share the captured images and videos with other users. However, directly applying existing style transfer approaches to videos, i.e., transferring the style of a video frame by frame, requires an extremely large amount of computation resources. It is still technically unaffordable to perform style transfer of videos on mobile phones. To address this challenge, we propose MVStylizer, an efficient edge-assisted photorealistic video style transfer system for mobile phones. Instead of performing stylization frame by frame, only key frames in the original video are processed by a pre-trained deep neural network (DNN) on edge servers, while the remaining intermediate frames are generated by our optical-flow-based frame interpolation algorithm on mobile phones. A meta-smoothing module is also proposed to simultaneously upscale a stylized frame to arbitrary resolution and remove style-transfer-related distortions in the upscaled frames. In addition, to continuously enhance the performance of the DNN model on the edge server, we adopt a federated learning scheme that keeps retraining each DNN model on the edge server with data collected from mobile clients and syncing with a global DNN model on the cloud server. Such a scheme effectively leverages the diversity of data collected from various mobile clients and efficiently improves system performance. Our experiments demonstrate that MVStylizer can generate stylized videos with even better visual quality compared to the state-of-the-art method while achieving a 75.5$\times$ speedup for 1920$\times$1080 videos.
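
The client-side interpolation idea can be sketched as backward-warping the stylized keyframe with dense optical flow. Farneback flow is used here as a stand-in for the paper's algorithm, and all names are illustrative.

```python
import cv2
import numpy as np

def propagate_style(key_frame, stylized_key, intermediate):
    """Warp the stylized keyframe onto an intermediate frame via optical flow."""
    g_key = cv2.cvtColor(key_frame, cv2.COLOR_BGR2GRAY)
    g_mid = cv2.cvtColor(intermediate, cv2.COLOR_BGR2GRAY)
    # Flow from the intermediate frame back to the keyframe, so each
    # output pixel knows where to sample the stylized keyframe.
    flow = cv2.calcOpticalFlowFarneback(g_mid, g_key, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g_mid.shape
    grid_y, grid_x = np.mgrid[0:h, 0:w].astype(np.float32)
    map_x = grid_x + flow[..., 0]
    map_y = grid_y + flow[..., 1]
    return cv2.remap(stylized_key, map_x, map_y, cv2.INTER_LINEAR)
```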

66. Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study [PDF] 返回目录
  Himanshu Sharma, Elise Jennings
Abstract: Bayesian neural networks (BNNs) are a promising method of obtaining statistical uncertainties for neural network predictions, but they come with a higher computational overhead which can limit their practical usage. This work explores the use of high performance computing with distributed training to address the challenges of training BNNs at scale. We present a performance and scalability comparison of training the VGG-16 and ResNet-18 models on a Cray-XC40 cluster. We demonstrate that network pruning can speed up inference without accuracy loss and provide an open source software package, BPrune, to automate this pruning. For certain models we find that pruning up to 80% of the network results in only a 7.0% loss in accuracy. With the development of new hardware accelerators for deep learning, BNNs are of considerable interest for benchmarking performance. This analysis of training a BNN at scale outlines the limitations and benefits compared to a conventional neural network.
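
A minimal sketch of the magnitude-pruning experiment described above, using PyTorch's built-in pruning utilities on an ordinary ResNet-18 as a stand-in; BPrune itself targets Bayesian networks, so this shows only the general recipe, not that package's API.

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(num_classes=10)

# Globally prune 80% of convolutional weights by L1 magnitude, mirroring
# the "up to 80% of the network" setting reported in the abstract.
params = [(m, "weight") for m in model.modules()
          if isinstance(m, torch.nn.Conv2d)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured,
                          amount=0.8)

# Make the pruning permanent; accuracy would then be re-measured on a
# held-out set to check for the reported ~7% loss.
for module, name in params:
    prune.remove(module, name)
```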

67. Coronavirus: Comparing COVID-19, SARS and MERS in the eyes of AI [PDF] 返回目录
  Anas Tahir, Yazan Qiblawey, Amith Khandakar, Tawsifur Rahman, Uzair Khurshid, Farayi Musharavati, Serkan Kiranyaz, Muhammad E. H. Chowdhury
Abstract: Novel Coronavirus disease (COVID-19) is an extremely contagious and quickly spreading coronavirus disease. The Severe Acute Respiratory Syndrome (SARS)-CoV outbreak in 2002, the Middle East Respiratory Syndrome (MERS)-CoV outbreak in 2011, and the current COVID-19 pandemic all stem from the same family of coronaviruses. The fatality rates of SARS and MERS were higher than that of COVID-19; however, their spread was limited to a few countries, while COVID-19 has affected more than two hundred countries of the world. In this work, the authors used deep machine learning algorithms along with innovative image pre-processing techniques to distinguish COVID-19 images from SARS and MERS images. Several deep learning algorithms were trained and tested, and the four best-performing algorithms are reported: SqueezeNet, ResNet18, Inceptionv3 and DenseNet201. The original, contrast-limited-adaptive-histogram-equalized and complemented images were used individually and in concatenation as the inputs to the networks. It was observed that Inceptionv3 outperforms all networks with the 3-channel concatenation technique and provides excellent sensitivities of 99.5%, 93.1% and 97% for classifying COVID-19, MERS and SARS images respectively. Investigating the deep-layer activation maps of the correctly classified and misclassified images, it was observed that some overlapping features between COVID-19 and MERS images were identified by the deep layers of the network. Interestingly, these features were present in MERS images, and 10 out of 144 MERS images were misclassified as COVID-19, while only one out of 423 COVID-19 images was misclassified as MERS. None of the MERS images was misclassified as SARS, and only one COVID-19 image was misclassified as SARS. Therefore, it can be summarized that SARS images are significantly different from MERS and COVID-19 images in the eyes of AI, while there are some overlapping features between MERS and COVID-19.
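
The 3-channel input described above (original, CLAHE-equalized and complemented images stacked together) can be sketched as follows; the CLAHE parameters here are illustrative, not the paper's.

```python
import cv2
import numpy as np

def make_three_channel(gray):
    """gray: HxW uint8 X-ray image -> HxWx3 network input."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(gray)          # contrast-limited equalization
    complemented = 255 - gray              # image complement
    return np.dstack([gray, equalized, complemented])
```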

68. Multi-view polarimetric scattering cloud tomography and retrieval of droplet size [PDF] 返回目录
  Aviad Levis, Yoav Y. Schechner, Anthony B. Davis, Jesse Loveridge
Abstract: Tomography aims to recover a three-dimensional (3D) density map of a medium or an object. In medical imaging, it is extensively used for diagnostics via X-ray computed tomography (CT). Optical diffusion tomography is an alternative to X-ray CT that uses multiply scattered light to deliver coarse density maps for soft tissues. We define and derive tomography of cloud droplet distributions via passive remote sensing. We use multi-view polarimetric images to fit a 3D polarized radiative transfer (RT) forward model. Our motivation is 3D volumetric probing of vertically-developed convectively-driven clouds that are ill-served by current methods in operational passive remote sensing. These techniques are based on strictly 1D RT modeling and applied to a single cloudy pixel, where cloud geometry is assumed to be that of a plane-parallel slab. Incident unpolarized sunlight, once scattered by cloud-droplets, changes its polarization state according to droplet size. Therefore, polarimetric measurements in the rainbow and glory angular regions can be used to infer the droplet size distribution. This work defines and derives a framework for a full 3D tomography of cloud droplets for both their mass concentration in space and their distribution across a range of sizes. This 3D retrieval of key microphysical properties is made tractable by our novel approach that involves a restructuring and differentiation of an open-source polarized 3D RT code to accommodate a special two-step optimization technique. Physically-realistic synthetic clouds are used to demonstrate the methodology with rigorous uncertainty quantification.

69. SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding [PDF] 返回目录
  Li Zhang, Han Wang, Lingxiao Li
Abstract: Pair-based metric learning has been widely adopted to learn sentence embeddings in many NLP tasks, such as semantic text similarity, due to its efficiency in computation. Most existing works employ a sequence encoder model and utilize a limited set of sentence pairs with a pair-based loss to learn discriminative sentence representations. However, it is known that the sentence representation can be biased when the sampled sentence pairs deviate from the true distribution of all sentence pairs. In this paper, our theoretical analysis shows that existing works severely suffer from the lack of a good pair sampling and instance weighting strategy. Instead of one-time pair selection and learning on equally weighted pairs, we propose a unified locality-weighting and learning framework to learn task-specific sentence embeddings. Our model, SentPWNet, exploits the neighboring spatial distribution of each sentence as a locality weight to indicate the informativeness of each sentence pair. The weights are updated along with the pair-loss optimization in each round, ensuring that the model keeps learning from the most informative sentence pairs. Extensive experiments on four publicly available datasets and a self-collected place-search benchmark with 1.4 million places clearly demonstrate that our model consistently outperforms existing sentence embedding methods with comparable efficiency.
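
One way to realize locality weighting, sketched under the assumption that pair informativeness can be approximated by local embedding density; the paper's exact weighting scheme may differ, and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def locality_weights(emb, k=5):
    """emb: NxD sentence embeddings -> one normalized weight per sentence."""
    dist = torch.cdist(emb, emb)                          # NxN distances
    knn = dist.topk(k + 1, largest=False).values[:, 1:]   # drop self-distance
    density = 1.0 / (knn.mean(dim=1) + 1e-8)              # denser -> heavier
    return density / density.sum()

def weighted_pair_loss(emb, pos_idx, margin=0.5):
    """Margin loss over positive pairs (i, pos_idx[i]), locality-weighted."""
    w = locality_weights(emb)
    d_pos = F.pairwise_distance(emb, emb[pos_idx])
    return (w * torch.clamp(d_pos - margin, min=0.0)).sum()
```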

70. Pulmonary Nodule Malignancy Classification Using its Temporal Evolution with Two-Stream 3D Convolutional Neural Networks [PDF] 返回目录
  Xavier Rafael-Palou, Anton Aubanell, Ilaria Bonavita, Mario Ceresa, Gemma Piella, Vicent Ribas, Miguel A. González Ballester
Abstract: Nodule malignancy assessment is a complex, time-consuming and error-prone task. Current clinical practice requires measuring changes in the size and density of the nodule at different time-points. State-of-the-art solutions rely on 3D convolutional neural networks built on pulmonary nodules obtained from a single CT scan per patient. In this work, we propose a two-stream 3D convolutional neural network that predicts malignancy by jointly analyzing two pulmonary nodule volumes from the same patient taken at different time-points. The best results achieve an F1-score of 77% on the test set, an improvement of 9% and 12% in F1-score with respect to the same network trained with images from a single time-point.
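
A minimal PyTorch sketch of the two-stream idea: one 3D-convolutional stream per time-point, with a shared classification head over the concatenated features. Layer sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStreamNodule3D(nn.Module):
    def __init__(self):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.stream_t0 = stream()          # nodule volume at time-point 0
        self.stream_t1 = stream()          # nodule volume at time-point 1
        self.head = nn.Linear(64, 2)       # benign vs. malignant logits

    def forward(self, vol_t0, vol_t1):     # each: (B, 1, D, H, W)
        feats = torch.cat([self.stream_t0(vol_t0),
                           self.stream_t1(vol_t1)], dim=1)
        return self.head(feats)
```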
