
[arXiv papers] Computer Vision and Pattern Recognition 2020-08-04

Contents

1. Unsupervised 3D Learning for Shape Analysis via Multiresolution Instance Discrimination [PDF] Abstract
2. Memory-augmented Dense Predictive Coding for Video Representation Learning [PDF] Abstract
3. Improving One-stage Visual Grounding by Recursive Sub-query Construction [PDF] Abstract
4. Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition [PDF] Abstract
5. Project to Adapt: Domain Adaptation for Depth Completion from Noisy and Sparse Sensor Data [PDF] Abstract
6. From Design Draft to Real Attire: Unaligned Fashion Image Translation [PDF] Abstract
7. RareAct: A video dataset of unusual interactions [PDF] Abstract
8. Teacher-Student Training and Triplet Loss for Facial Expression Recognition under Occlusion [PDF] Abstract
9. Segmenting overlapped objects in images. A study to support the diagnosis of sickle cell disease [PDF] Abstract
10. An Exploration of Target-Conditioned Segmentation Methods for Visual Object Trackers [PDF] Abstract
11. SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning [PDF] Abstract
12. End-to-end Full Projector Compensation [PDF] Abstract
13. Deep Traffic Sign Detection and Recognition Without Target Domain Real Images [PDF] Abstract
14. Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation [PDF] Abstract
15. Frame-To-Frame Consistent Semantic Segmentation [PDF] Abstract
16. Pre-training for Video Captioning Challenge 2020 Summary [PDF] Abstract
17. Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection [PDF] Abstract
18. AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods [PDF] Abstract
19. Traffic Prediction Framework for OpenStreetMap using Deep Learning based Complex Event Processing and Open Traffic Cameras [PDF] Abstract
20. Active Object Search [PDF] Abstract
21. Explainable Face Recognition [PDF] Abstract
22. Color Texture Image Retrieval Based on Copula Multivariate Modeling in the Shearlet Domain [PDF] Abstract
23. Fusion of Deep and Non-Deep Methods for Fast Super-Resolution of Satellite Images [PDF] Abstract
24. Adversarial Graph Representation Adaptation for Cross-Domain Facial Expression Recognition [PDF] Abstract
25. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark [PDF] Abstract
26. Deep Network Ensemble Learning applied to Image Classification using CNN Trees [PDF] Abstract
27. Defining Traffic States using Spatio-temporal Traffic Graphs [PDF] Abstract
28. DSC IIT-ISM at SemEval-2020 Task 8: Bi-Fusion Techniques for Deep Meme Emotion Analysis [PDF] Abstract
29. Rethinking Image Deraining via Rain Streaks and Vapors [PDF] Abstract
30. Experimental results on palmvein-based personal recognition by multi-snapshot fusion of textural features [PDF] Abstract
31. Generating Visually Aligned Sound from Videos [PDF] Abstract
32. Tell me what this is: Few-Shot Incremental Object Learning by a Robot [PDF] Abstract
33. Partially Supervised Multi-Task Network for Single-View Dietary Assessment [PDF] Abstract
34. Kinematics of motion tracking using computer vision [PDF] Abstract
35. CASNet: Common Attribute Support Network for image instance and panoptic segmentation [PDF] Abstract
36. Adding Seemingly Uninformative Labels Helps in Low Data Regimes [PDF] Abstract
37. Real-Time Point Cloud Fusion of Multi-LiDAR Infrastructure Sensor Setups with Unknown Spatial Location and Orientation [PDF] Abstract
38. Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction [PDF] Abstract
39. DCSFN: Deep Cross-scale Fusion Network for Single Image Rain Removal [PDF] Abstract
40. GmFace: A Mathematical Model for Face Image Representation Using Multi-Gaussian [PDF] Abstract
41. The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) [PDF] Abstract
42. AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting [PDF] Abstract
43. Deep Complementary Joint Model for Complex Scene Registration and Few-shot Segmentation on Medical Images [PDF] Abstract
44. Anti-Bandit Neural Architecture Search for Model Defense [PDF] Abstract
45. Adversarial Semantic Data Augmentation for Human Pose Estimation [PDF] Abstract
46. The pursuit of beauty: Converting image labels to meaningful vectors [PDF] Abstract
47. PIC-Net: Point Cloud and Image Collaboration Network for Large-Scale Place Recognition [PDF] Abstract
48. Self-supervised Object Tracking with Cycle-consistent Siamese Networks [PDF] Abstract
49. Deep Photo Cropper and Enhancer [PDF] Abstract
50. Learning to Purify Noisy Labels via Meta Soft Label Corrector [PDF] Abstract
51. Robust Collaborative Learning of Patch-level and Image-level Annotations for Diabetic Retinopathy Grading from Fundus Image [PDF] Abstract
52. Semi-supervised deep learning based on label propagation in a 2D embedded space [PDF] Abstract
53. Efficient Deep Learning of Non-local Features for Hyperspectral Image Classification [PDF] Abstract
54. Integrated monitoring of ice in selected Swiss lakes. Final project report [PDF] Abstract
55. HyperFaceNet: A Hyperspectral Face Recognition Method Based on Deep Fusion [PDF] Abstract
56. Tensor Low-Rank Reconstruction for Semantic Segmentation [PDF] Abstract
57. SymmetryNet: Learning to Predict Reflectional and Rotational Symmetries of 3D Shapes from Single-View RGB-D Images [PDF] Abstract
58. Mask Point R-CNN [PDF] Abstract
59. Video Super-Resolution with Recurrent Structure-Detail Network [PDF] Abstract
60. Stochastic Bundle Adjustment for Efficient and Scalable 3D Reconstruction [PDF] Abstract
61. Blind Face Restoration via Deep Multi-scale Component Dictionaries [PDF] Abstract
62. Removing Backdoor-Based Watermarks in Neural Networks with Limited Data [PDF] Abstract
63. SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space [PDF] Abstract
64. Point Cloud Completion by Learning Shape Priors [PDF] Abstract
65. Looking in the Right place for Anomalies: Explainable AI through Automatic Location Learning [PDF] Abstract
66. Animating Through Warping: an Efficient Method for High-Quality Facial Expression Animation [PDF] Abstract
67. Self-supervised Visual Attribute Learning for Fashion Compatibility [PDF] Abstract
68. Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning [PDF] Abstract
69. PERCH 2.0: Fast and Accurate GPU-based Perception via Search for Object Pose Estimation [PDF] Abstract
70. Improving Skeleton-based Action Recognition with Robust Spatial and Temporal Features [PDF] Abstract
71. Self-supervised Learning of Point Clouds via Orientation Estimation [PDF] Abstract
72. Accurate and Efficient Intracranial Hemorrhage Detection and Subtype Classification in 3D CT Scans with Convolutional and Long Short-Term Memory Neural Networks [PDF] Abstract
73. Eigen-CAM: Class Activation Map using Principal Components [PDF] Abstract
74. From Shadow Segmentation to Shadow Removal [PDF] Abstract
75. Distilling Visual Priors from Self-Supervised Learning [PDF] Abstract
76. Meta-DRN: Meta-Learning for 1-Shot Image Segmentation [PDF] Abstract
77. An Explainable Machine Learning Model for Early Detection of Parkinson's Disease using LIME on DaTscan Imagery [PDF] Abstract
78. RGB-D Salient Object Detection: A Survey [PDF] Abstract
79. Regularization by Denoising via Fixed-Point Projection (RED-PRO) [PDF] Abstract
80. Unsupervised Deep Cross-modality Spectral Hashing [PDF] Abstract
81. Efficient Adversarial Attacks for Visual Object Tracking [PDF] Abstract
82. HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [PDF] Abstract
83. PanoNet: Real-time Panoptic Segmentation through Position-Sensitive Feature Embedding [PDF] Abstract
84. Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition [PDF] Abstract
85. Contrastive Explanations in Neural Networks [PDF] Abstract
86. State-of-The-Art Fuzzy Active Contour Models for Image Segmentation [PDF] Abstract
87. Land Cover Classification from Remote Sensing Images Based on Multi-Scale Fully Convolutional Network [PDF] Abstract
88. TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video [PDF] Abstract
89. L-CNN: A Lattice cross-fusion strategy for multistream convolutional neural networks [PDF] Abstract
90. Actor-Action Video Classification CSC 249/449 Spring 2020 Challenge Report [PDF] Abstract
91. White-Box Evaluation of Fingerprint Recognition Systems [PDF] Abstract
92. Utilising Visual Attention Cues for Vehicle Detection and Tracking [PDF] Abstract
93. KAPLAN: A 3D Point Descriptor for Shape Completion [PDF] Abstract
94. Deep Depth Estimation from Visual-Inertial SLAM [PDF] Abstract
95. Learning to Rank for Active Learning: A Listwise Approach [PDF] Abstract
96. Dynamic Object Tracking and Masking for Visual SLAM [PDF] Abstract
97. Towards Leveraging End-of-Life Tools as an Asset: Value Co-Creation based on Deep Learning in the Machining Industry [PDF] Abstract
98. Improving Generative Adversarial Networks with Local Coordinate Coding [PDF] Abstract
99. FaultFace: Deep Convolutional Generative Adversarial Network (DCGAN) based Ball-Bearing Failure Detection Method [PDF] Abstract
100. Automated Segmentation of Brain Gray Matter Nuclei on Quantitative Susceptibility Mapping Using Deep Convolutional Neural Network [PDF] Abstract
101. Shape Adaptor: A Learnable Resizing Module [PDF] Abstract
102. Retinal Image Segmentation with a Structure-Texture Demixing Network [PDF] Abstract
103. Adaptive Hierarchical Decomposition of Large Deep Networks [PDF] Abstract
104. Multi-Scale Deep Compressive Imaging [PDF] Abstract
105. IntroVAC: Introspective Variational Classifiers for Learning Interpretable Latent Subspaces [PDF] Abstract
106. High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands [PDF] Abstract
107. The Rate-Distortion-Accuracy Tradeoff: JPEG Case Study [PDF] Abstract
108. Video Question Answering on Screencast Tutorials [PDF] Abstract
109. Towards Robust Visual Tracking for Unmanned Aerial Vehicle with Tri-Attentional Correlation Filters [PDF] Abstract
110. Differentiable Feature Aggregation Search for Knowledge Distillation [PDF] Abstract
111. Multi-level Wavelet-based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video [PDF] Abstract
112. Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [PDF] Abstract
113. Vision and Inertial Sensing Fusion for Human Action Recognition: A Review [PDF] Abstract
114. Exploring Multi-Scale Feature Propagation and Communication for Image Super Resolution [PDF] Abstract
115. Joint Generative Learning and Super-Resolution For Real-World Camera-Screen Degradation [PDF] Abstract
116. Diabetic Retinopathy Diagnosis based on Convolutional Neural Network [PDF] Abstract
117. CorrSigNet: Learning CORRelated Prostate Cancer SIGnatures from Radiology and Pathology Images for Improved Computer Aided Diagnosis [PDF] Abstract
118. F*: An Interpretable Transformation of the F-measure [PDF] Abstract

Abstracts

1. Unsupervised 3D Learning for Shape Analysis via Multiresolution Instance Discrimination [PDF] Back to contents
  Peng-Shuai Wang, Yu-Qi Yang, Qian-Fang Zou, Zhirong Wu, Yang Liu, Xin Tong
Abstract: Although unsupervised feature learning has demonstrated its advantages in reducing the workload of data labeling and network design in many fields, existing unsupervised 3D learning methods still cannot offer a generic network for various shape analysis tasks with performance competitive with supervised methods. In this paper, we propose an unsupervised method for learning a generic and efficient shape encoding network for different shape analysis tasks. The key idea of our method is to jointly encode and learn shape and point features from unlabeled 3D point clouds. For this purpose, we adapt HR-Net to octree-based convolutional neural networks for jointly encoding shape and point features with fused multiresolution subnetworks and design a simple-yet-efficient Multiresolution Instance Discrimination (MID) loss for jointly learning the shape and point features. Our network takes a 3D point cloud as input and outputs both shape and point features. After training, the network is concatenated with simple task-specific back-end layers and fine-tuned for different shape analysis tasks. We evaluate the efficacy and generality of our method and validate our network and loss design with a set of shape analysis tasks, including shape classification, semantic shape segmentation, as well as shape registration tasks. With simple back-ends, our network demonstrates the best performance among all unsupervised methods and achieves performance competitive with supervised methods, especially in tasks with a small labeled dataset. For fine-grained shape segmentation, our method even surpasses existing supervised methods by a large margin.

2. Memory-augmented Dense Predictive Coding for Video Representation Learning [PDF] Back to contents
  Tengda Han, Weidi Xie, Andrew Zisserman
Abstract: The objective of this paper is self-supervised learning from video, in particular for representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) for the task. It is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing multiple hypotheses to be made efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude fewer training data.
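
To make the convex-combination idea concrete, here is a minimal sketch (not the authors' code; the slot count, feature dimension, and plain dot-product attention are assumptions) in which a predicted query attends over a bank of learned compressed memories and the hypothesised future state is a softmax-weighted, hence convex, combination of the slots:

```python
# Minimal MemDPC-style sketch: convex combination over a compressed memory bank.
import torch
import torch.nn.functional as F

num_slots, dim = 1024, 256                                 # assumed sizes
memory = torch.nn.Parameter(torch.randn(num_slots, dim))   # compressed memories

def predict_future(query: torch.Tensor) -> torch.Tensor:
    """query: (batch, dim) prediction from the temporal aggregator."""
    attn = F.softmax(query @ memory.t(), dim=-1)  # (batch, num_slots), rows sum to 1
    return attn @ memory                          # convex combination of memory slots

pred = predict_future(torch.randn(8, dim))        # (8, 256) hypothesised future states
```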

3. Improving One-stage Visual Grounding by Recursive Sub-query Construction [PDF] Back to contents
  Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo
Abstract: We improve one-stage visual grounding by addressing current limitations on grounding long and complex queries. Existing one-stage methods encode the entire language query as a single sentence embedding vector, e.g., taking the embedding from BERT or the hidden state from LSTM. This single vector representation is prone to overlooking the detailed descriptions in the query. To address this query modeling deficiency, we propose a recursive sub-query construction framework, which reasons between image and query for multiple rounds and reduces the referring ambiguity step by step. We show that our new one-stage method obtains 5.0%, 4.5%, 7.5%, and 12.8% absolute improvements over the state-of-the-art one-stage baseline on ReferItGame, RefCOCO, RefCOCO+, and RefCOCOg, respectively. In particular, superior performance on longer and more complex queries validates the effectiveness of our query modeling.

4. Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition [PDF] Back to contents
  Jiawei Chen, Jenson Hsiao, Chiu Man Ho
Abstract: Human action recognition is regarded as a key cornerstone in domains such as surveillance or video understanding. Despite recent progress in the development of end-to-end solutions for video-based action recognition, achieving state-of-the-art performance still requires using auxiliary hand-crafted motion representations, e.g., optical flow, which are usually computationally demanding. In this work, we propose to use residual frames (i.e., differences between adjacent RGB frames) as an alternative "lightweight" motion representation, which carries salient motion information and is computationally efficient. In addition, we develop a new pseudo-3D convolution module which decouples 3D convolution into 2D and 1D convolution. The proposed module exploits residual information in the feature space to better structure motions, and is equipped with a self-attention mechanism that assists to recalibrate the appearance and motion features. Empirical results confirm the efficiency and effectiveness of residual frames as well as the proposed pseudo-3D convolution module.
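
A minimal sketch (assumed kernel sizes and channel counts, not the paper's exact module) of the two ingredients above: residual frames computed as differences of adjacent RGB frames, and a 3D convolution decoupled into a spatial-only part and a temporal-only part:

```python
# Sketch of a (2D + 1D) decoupled "pseudo-3D" block applied to residual frames.
import torch
import torch.nn as nn

class Pseudo3DBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), padding=(0, 1, 1))    # 2D part
        self.temporal = nn.Conv3d(out_ch, out_ch, (3, 1, 1), padding=(1, 0, 0))  # 1D part

    def forward(self, x):                            # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x).relu()).relu()

clip = torch.randn(2, 3, 16, 112, 112)               # 16 RGB frames
residual = clip[:, :, 1:] - clip[:, :, :-1]          # residual frames: T-1 differences
feats = Pseudo3DBlock(3, 64)(residual)               # (2, 64, 15, 112, 112)
```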

5. Project to Adapt: Domain Adaptation for Depth Completion from Noisy and Sparse Sensor Data [PDF] Back to contents
  Adrian Lopez-Rodriguez, Benjamin Busam, Krystian Mikolajczyk
Abstract: Depth completion aims to predict a dense depth map from a sparse depth input. The acquisition of dense ground truth annotations for depth completion settings can be difficult and, at the same time, a significant domain gap between real LiDAR measurements and synthetic data has prevented successful training of models in virtual settings. We propose a domain adaptation approach for sparse-to-dense depth completion that is trained from synthetic data, without annotations in the real domain or additional sensors. Our approach simulates the real sensor noise in an RGB+LiDAR set-up, and consists of three modules: simulating the real LiDAR input in the synthetic domain via projections, filtering the real noisy LiDAR for supervision and adapting the synthetic RGB image using a CycleGAN approach. We extensively evaluate these modules against the state-of-the-art in the KITTI depth completion benchmark, showing significant improvements.

6. From Design Draft to Real Attire: Unaligned Fashion Image Translation [PDF] Back to contents
  Yu Han, Shuai Yang, Wenjing Wang, Jiaying Liu
Abstract: Fashion manipulation has attracted growing interest due to its great application value, which has inspired much research on fashion images. However, little attention has been paid to fashion design drafts. In this paper, we study a new unaligned translation problem between design drafts and real fashion items, whose main challenge lies in the huge misalignment between the two modalities. We first collect paired design drafts and real fashion item images without pixel-wise alignment. To solve the misalignment problem, our main idea is to train a sampling network to adaptively adjust the input to an intermediate state with structure alignment to the output. Moreover, built upon the sampling network, we present a design draft to real fashion item translation network (D2RNet), where two separate translation streams that focus on texture and shape, respectively, are combined tactfully to get both benefits. D2RNet is able to generate realistic garments with both texture and shape consistency to their design drafts. We show that this idea can be effectively applied to the reverse translation problem and present R2DNet accordingly. Extensive experiments on unaligned fashion design translation demonstrate the superiority of our method over state-of-the-art methods. Our project website is available at: this https URL .

7. RareAct: A video dataset of unusual interactions [PDF] Back to contents
  Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman
Abstract: This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes". RareAct aims at evaluating the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions which were obtained by combining verbs and nouns rarely co-occurring together in the large-scale textual corpus from HowTo100M, but that frequently appear separately. We provide benchmarks using a state-of-the-art HowTo100M pretrained video and text model and show that zero-shot and few-shot compositionality of actions remains a challenging and unsolved task.

8. Teacher-Student Training and Triplet Loss for Facial Expression Recognition under Occlusion [PDF] Back to contents
  Mariana-Iuliana Georgescu, Radu Tudor Ionescu
Abstract: In this paper, we study the task of facial expression recognition under strong occlusion. We are particularly interested in cases where 50% of the face is occluded, e.g. when the subject wears a Virtual Reality (VR) headset. While previous studies show that pre-training convolutional neural networks (CNNs) on fully-visible (non-occluded) faces improves the accuracy, we propose to employ knowledge distillation to achieve further improvements. First, we employ the classic teacher-student training strategy, in which the teacher is a CNN trained on fully-visible faces and the student is a CNN trained on occluded faces. Second, we propose a new approach for knowledge distillation based on triplet loss. During training, the goal is to reduce the distance between an anchor embedding, produced by a student CNN that takes occluded faces as input, and a positive embedding (from the same class as the anchor), produced by a teacher CNN trained on fully-visible faces, so that it becomes smaller than the distance between the anchor and a negative embedding (from a different class than the anchor), produced by the student CNN. Third, we propose to combine the distilled embeddings obtained through the classic teacher-student strategy and our novel teacher-student strategy based on triplet loss into a single embedding vector. We conduct experiments on two benchmarks, FER+ and AffectNet, with two CNN architectures, VGG-f and VGG-face, showing that knowledge distillation can bring significant improvements over the state-of-the-art methods designed for occluded faces in the VR setting.
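
A sketch of the triplet-based distillation objective as described (my reading, with an assumed margin and Euclidean distance): the student's embedding of an occluded face is pulled toward the teacher's embedding of the fully-visible face of the same class and pushed away from a student embedding of a different class:

```python
# Triplet-style distillation loss: max(0, d(a, p) - d(a, n) + margin).
import torch
import torch.nn.functional as F

def triplet_distill_loss(anchor_student, positive_teacher, negative_student,
                         margin: float = 0.2):       # margin value is assumed
    d_pos = F.pairwise_distance(anchor_student, positive_teacher)
    d_neg = F.pairwise_distance(anchor_student, negative_student)
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (torch.randn(16, 512) for _ in range(3))   # illustrative embeddings
loss = triplet_distill_loss(a, p, n)
```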

9. Segmenting overlapped objects in images. A study to support the diagnosis of sickle cell disease [PDF] Back to contents
  Miquel Miró-Nicolau, Biel Moyà-Alcover, Manuel González-Hidalgo, Antoni Jaume-i-Capó
Abstract: Overlapped objects are found in many kinds of images and are a source of problems due to the partial information they provide. Multiple types of algorithms, from simple and naive methods to more complex ones, are used to address this problem. In this work we propose a new method for the segmentation of overlapped objects. Finally, we compare the results of this algorithm with the state of the art in two experiments: one with a new dataset developed specially for this work, and one with red blood smears from sickle-cell disease patients.

10. An Exploration of Target-Conditioned Segmentation Methods for Visual Object Trackers [PDF] Back to contents
  Matteo Dunnhofer, Niki Martinel, Christian Micheloni
Abstract: Visual object tracking is the problem of predicting a target object's state in a video. Generally, bounding-boxes have been used to represent states, and a surge of effort has been spent by the community to produce efficient causal algorithms capable of locating targets with such representations. As the field is moving towards binary segmentation masks to define objects more precisely, in this paper we propose to extensively explore target-conditioned segmentation methods available in the computer vision community, in order to transform any bounding-box tracker into a segmentation tracker. Our analysis shows that such methods allow trackers to compete with recently proposed segmentation trackers, while running in quasi real-time.

11. SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning [PDF] Back to contents
  Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei
Abstract: A steady momentum of innovations and breakthroughs has convincingly pushed the limits of unsupervised image representation learning. Compared to static 2D images, video has one more dimension (time). The inherent supervision existing in such sequential structure offers a fertile ground for building unsupervised learning models. In this paper, we compose a trilogy of exploring the basic and generic supervision in the sequence from spatial, spatiotemporal and sequential perspectives. We materialize the supervisory signals through determining whether a pair of samples is from one frame or from one video, and whether a triplet of samples is in the correct temporal order. We uniquely regard the signals as the foundation in contrastive learning and derive a particular form named Sequence Contrastive Learning (SeCo). SeCo shows superior results under the linear protocol on action recognition (Kinetics), untrimmed activity recognition (ActivityNet) and object tracking (OTB-100). More remarkably, SeCo demonstrates considerable improvements over recent unsupervised pre-training techniques, and leads the accuracy by 2.96% and 6.47% against fully-supervised ImageNet pre-training in action recognition task on UCF101 and HMDB51, respectively.
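
Instance-discrimination signals of this kind are typically optimised with an InfoNCE-style contrastive loss; here is a minimal sketch (temperature, shapes, and the positive/negative construction are assumptions, not SeCo's exact formulation), where each anchor must identify its positive, e.g. a sample from the same frame, among negatives drawn from other frames or videos:

```python
# InfoNCE sketch: the positive sits at index 0 of each row of logits.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau: float = 0.07):
    """anchor, positive: (B, D); negatives: (B, K, D); all L2-normalised."""
    pos = (anchor * positive).sum(-1, keepdim=True)         # (B, 1) similarities
    neg = torch.einsum('bd,bkd->bk', anchor, negatives)     # (B, K) similarities
    logits = torch.cat([pos, neg], dim=1) / tau
    targets = torch.zeros(len(anchor), dtype=torch.long)    # positive is class 0
    return F.cross_entropy(logits, targets)

a = F.normalize(torch.randn(8, 128), dim=-1)
p = F.normalize(torch.randn(8, 128), dim=-1)
n = F.normalize(torch.randn(8, 16, 128), dim=-1)
loss = info_nce(a, p, n)
```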

12. End-to-end Full Projector Compensation [PDF] Back to contents
  Bingyao Huang, Tao Sun, Haibin Ling
Abstract: Full projector compensation aims to modify a projector input image to compensate for both geometric and photometric disturbance of the projection surface. Traditional methods usually solve the two parts separately and may suffer from suboptimal solutions. In this paper, we propose the first end-to-end differentiable solution, named CompenNeSt++, to solve the two problems jointly. First, we propose a novel geometric correction subnet, named WarpingNet, which is designed with a cascaded coarse-to-fine structure to learn the sampling grid directly from sampling images. Second, we propose a novel photometric compensation subnet, named CompenNeSt, which is designed with a siamese architecture to capture the photometric interactions between the projection surface and the projected images, and to use such information to compensate the geometrically corrected images. By concatenating WarpingNet with CompenNeSt, CompenNeSt++ accomplishes full projector compensation and is end-to-end trainable. Third, to improve practicability, we propose a novel synthetic data-based pre-training strategy to significantly reduce the number of training images and training time. Moreover, we construct the first setup-independent full compensation benchmark to facilitate future studies. In thorough experiments, our method shows clear advantages over prior art with promising compensation quality and meanwhile being practically convenient.

13. Deep Traffic Sign Detection and Recognition Without Target Domain Real Images [PDF] Back to contents
  Lucas Tabelini, Rodrigo Berriel, Thiago M. Paixão, Alberto F. De Souza, Claudine Badue, Nicu Sebe, Thiago Oliveira-Santos
Abstract: Deep learning has been successfully applied to several problems related to autonomous driving, often relying on large databases of real target-domain images for proper training. The acquisition of such real-world data is not always possible in the self-driving context, and sometimes their annotation is not feasible. Moreover, in many tasks, there is an intrinsic data imbalance that most learning-based methods struggle to cope with. Particularly, traffic sign detection is a challenging problem in which these three issues are seen altogether. To address these challenges, we propose a novel database generation method that requires only (i) arbitrary natural images, i.e., it requires no real image from the target domain, and (ii) templates of the traffic signs. The method does not aim to replace training with real data, but to serve as a compatible alternative when real data are not available. The effortlessly generated database is shown to be effective for the training of a deep detector on traffic signs from multiple countries. On large data sets, training with a fully synthetic data set almost matches the performance of training with a real one. When compared to training with a smaller data set of real images, training with synthetic images increased the accuracy by 12.25%. The proposed method also improves the performance of the detector when target-domain data are available.
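
As an illustration of this generation recipe, here is a small sketch (my assumptions throughout: a binary mask delineating the sign, a background strictly larger than the template, and no photometric or geometric augmentation) that composites a traffic-sign template onto an arbitrary natural image at a random location:

```python
# Paste a sign template onto a natural-image background via its binary mask.
import numpy as np

def composite(background: np.ndarray, template: np.ndarray,
              mask: np.ndarray) -> np.ndarray:
    """background: (H, W, 3); template: (h, w, 3); mask: (h, w), h < H, w < W."""
    H, W, _ = background.shape
    h, w, _ = template.shape
    y = np.random.randint(0, H - h)                 # random top-left corner
    x = np.random.randint(0, W - w)
    out = background.copy()
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = np.where(mask[..., None] > 0, template, region)
    return out
```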

14. Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation [PDF] Back to contents
  Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, Daniel Cohen-Or
Abstract: We present a generic image-to-image translation framework, Pixel2Style2Pixel (pSp). Our pSp framework is based on a novel encoder network that directly generates a series of style vectors which are fed into a pretrained StyleGAN generator, forming the extended W+ latent space. We first show that our encoder can directly embed real images into W+, with no additional optimization. We further introduce a dedicated identity loss which is shown to achieve improved performance in the reconstruction of an input image. We demonstrate pSp to be a simple architecture that, by leveraging a well-trained, fixed generator network, can be easily applied on a wide-range of image-to-image translation tasks. Solving these tasks through the style representation results in a global approach that does not rely on a local pixel-to-pixel correspondence and further supports multi-modal synthesis via the resampling of styles. Notably, we demonstrate that pSp can be trained to align a face image to a frontal pose without any labeled data, generate multi-modal results for ambiguous tasks such as conditional face generation from segmentation maps, and construct high-resolution images from corresponding low-resolution images.

15. Frame-To-Frame Consistent Semantic Segmentation [PDF] Back to contents
  Manuel Rebol, Patrick Knöbelreiter
Abstract: In this work, we aim for temporally consistent semantic segmentation throughout frames in a video. Many semantic segmentation algorithms process images individually, which leads to an inconsistent scene interpretation due to illumination changes, occlusions and other variations over time. To achieve a temporally consistent prediction, we train a convolutional neural network (CNN) which propagates features through consecutive frames in a video using a convolutional long short term memory (ConvLSTM) cell. Besides the temporal feature propagation, we penalize inconsistencies in our loss function. We show in our experiments that the performance improves when utilizing video information compared to single frame prediction. The mean intersection over union (mIoU) metric on the Cityscapes validation set increases from 45.2% for the single frames to 57.9% for video data after implementing the ConvLSTM to propagate features through time on the ESPNet. Most importantly, inconsistency decreases from 4.5% to 1.3%, which is a reduction of 71.1%. Our results indicate that the added temporal information produces a frame-to-frame consistent and more accurate image understanding compared to single frame processing.
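
A minimal sketch of how such an inconsistency penalty can enter the loss (an assumed formulation, not necessarily the paper's exact term): compare the per-pixel class distributions of consecutive frames and add their discrepancy to the usual per-frame segmentation loss:

```python
# Per-frame cross-entropy plus a temporal-consistency penalty between frames.
import torch
import torch.nn.functional as F

def consistency_penalty(logits_t, logits_t1):
    """logits_*: (B, num_classes, H, W) for two consecutive frames."""
    p_t = F.softmax(logits_t, dim=1)
    p_t1 = F.softmax(logits_t1, dim=1)
    return (p_t - p_t1).abs().mean()              # small when predictions agree

l_t = torch.randn(2, 19, 64, 128)                 # 19 Cityscapes classes
l_t1 = torch.randn(2, 19, 64, 128)
labels = torch.randint(0, 19, (2, 64, 128))
total = F.cross_entropy(l_t, labels) + 0.1 * consistency_penalty(l_t, l_t1)  # weight 0.1 assumed
```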

16. Pre-training for Video Captioning Challenge 2020 Summary [PDF] Back to contents
  Yingwei Pan, Jun Xu, Yehao Li, Ting Yao, Tao Mei
Abstract: The Pre-training for Video Captioning Challenge 2020 Summary: results and challenge participants' technical reports.

17. Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection [PDF] Back to contents
  Duygu Sarikaya, Jason J. Corso, Khurshid A. Guru
Abstract: Video understanding of robot-assisted surgery (RAS) videos is an active research area. Modeling the gestures and skill level of surgeons presents an interesting problem. The insights drawn may be applied in effective skill acquisition, objective skill assessment, real-time feedback, and human-robot collaborative surgeries. We propose a solution to the tool detection and localization open problem in RAS video understanding, using a strictly computer vision approach and the recent advances of deep learning. We propose an architecture using multimodal convolutional neural networks for fast detection and localization of tools in RAS videos. To our knowledge, this approach will be the first to incorporate deep neural networks for tool detection and localization in RAS videos. Our architecture applies a Region Proposal Network (RPN), and a multi-modal two stream convolutional network for object detection, to jointly predict objectness and localization on a fusion of image and temporal motion cues. Our results with an Average Precision (AP) of 91% and a mean computation time of 0.1 seconds per test frame detection indicate that our study is superior to conventionally used methods for medical imaging while also emphasizing the benefits of using RPN for precision and efficiency. We also introduce a new dataset, ATLAS Dione, for RAS video understanding. Our dataset provides video data of ten surgeons from Roswell Park Cancer Institute (RPCI) (Buffalo, NY) performing six different surgical tasks on the daVinci Surgical System (dVSS®) with annotations of robotic tools per frame.

18. AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods [PDF] Back to contents
  Ozge Mercanoglu Sincan, Hacer Yalim Keles
Abstract: Sign language recognition is a challenging problem where signs are identified by simultaneous local and global articulations of multiple sources, i.e. hand shape and orientation, hand movements, body posture and facial expressions. Solving this problem computationally for a large vocabulary of signs in real life settings is still a challenge, even with the state-of-the-art models. In this study, we present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark and provide baseline models for performance evaluations. Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples in total. Samples contain a wide variety of backgrounds recorded in indoor and outdoor environments. Moreover, spatial positions and the postures of signers also vary in the recordings. Each sample is recorded with Microsoft Kinect v2 and contains color image (RGB), depth and skeleton data modalities. We prepared benchmark training and test sets for user independent assessments of the models. We trained several deep learning based models and provide empirical evaluations using the benchmark; we used Convolutional Neural Networks (CNNs) to extract features, unidirectional and bidirectional Long Short-Term Memory (LSTM) models to characterize temporal information. We also incorporated feature pooling modules and temporal attention to our models to improve the performances. Using the benchmark test set, we obtained 62.02% accuracy with RGB+Depth data and 47.62% accuracy with RGB only data with the CNN+FPM+BLSTM+Attention model. Our dataset will be made publicly available at this https URL.

19. Traffic Prediction Framework for OpenStreetMap using Deep Learning based Complex Event Processing and Open Traffic Cameras [PDF] Back to contents
  Piyush Yadav, Dipto Sarkar, Dhaval Salwala, Edward Curry
Abstract: Displaying near-real-time traffic information is a useful feature of digital navigation maps. However, most commercial providers rely on privacy-compromising measures such as deriving location information from cellphones to estimate traffic. The lack of an open-source traffic estimation method using open data platforms is a bottleneck for building sophisticated navigation services on top of OpenStreetMap (OSM). We propose a deep learning-based Complex Event Processing (CEP) method that relies on publicly available video camera streams for traffic estimation. The proposed framework performs near-real-time object detection and objects property extraction across camera clusters in parallel to derive multiple measures related to traffic with the results visualized on OpenStreetMap. The estimation of object properties (e.g. vehicle speed, count, direction) provides multidimensional data that can be leveraged to create metrics and visualization for congestion beyond commonly used density-based measures. Our approach couples both flow and count measures during interpolation by considering each vehicle as a sample point and their speed as weight. We demonstrate multidimensional traffic metrics (e.g. flow rate, congestion estimation) over OSM by processing 22 traffic cameras from London streets. The system achieves a near-real-time performance of 1.42 seconds median latency and an average F-score of 0.80.

20. Active Object Search [PDF] Back to contents
  Jie Wu, Tianshui Chen, Lishan Huang, Hefeng Wu, Guanbin Li, Ling Tian, Liang Lin
Abstract: In this work, we investigate an Active Object Search (AOS) task that is not explicitly addressed in the literature; it aims to actively perform as few action steps as possible to search for and locate the target object in a 3D indoor scene. Different from classic object detection that passively receives visual information, this task encourages an intelligent agent to perform active search via reasonable action planning; thus it can better recall the target objects, especially in challenging situations where the target is far from the agent, blocked by an obstacle, or out of view. To handle this cross-modal task, we formulate a reinforcement learning framework that consists of a 3D object detector, a state controller and a cross-modal action planner that work cooperatively to find the target object with minimal action steps. During training, we design a novel cost-sensitive active search reward that penalizes inaccurate object search and redundant action steps. To evaluate this novel task, we construct an Active Object Search (AOS) benchmark that contains 5,845 samples from 30 diverse indoor scenes. We conduct extensive qualitative and quantitative evaluations on this benchmark to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute most to addressing this task.

21. Explainable Face Recognition [PDF] Back to contents
  Jonathan R. Williford, Brandon B. May, Jeffrey Byrne
Abstract: Explainable face recognition is the problem of explaining why a facial matcher matches faces. In this paper, we provide the first comprehensive benchmark and baseline evaluation for explainable face recognition. We define a new evaluation protocol called the ``inpainting game'', which is a curated set of 3648 triplets (probe, mate, nonmate) of 95 subjects, which differ by synthetically inpainting a chosen facial characteristic like the nose, eyebrows or mouth creating an inpainted nonmate. An explainable face matcher is tasked with generating a network attention map which best explains which regions in a probe image match with a mated image, and not with an inpainted nonmate for each triplet. This provides ground truth for quantifying what image regions contribute to face matching. Furthermore, we provide a comprehensive benchmark on this dataset comparing five state of the art methods for network attention in face recognition on three facial matchers. This benchmark includes two new algorithms for network attention called subtree EBP and Density-based Input Sampling for Explanation (DISE) which outperform the state of the art by a wide margin. Finally, we show qualitative visualization of these network attention techniques on novel images, and explore how these explainable face recognition models can improve transparency and trust for facial matchers.

22. Color Texture Image Retrieval Based on Copula Multivariate Modeling in the Shearlet Domain [PDF] Back to contents
  Sadegh Etemad, Maryam Amirmazlaghani
Abstract: In this paper, a color texture image retrieval framework is proposed based on Shearlet-domain modeling using the Copula multivariate model. In the proposed framework, a Gaussian Copula is used to model the dependencies between different sub-bands of the Non-Subsampled Shearlet Transform (NSST), and non-Gaussian models are used for marginal modeling of the coefficients. Six different schemes are proposed for modeling NSST coefficients based on the four types of neighborhood defined; moreover, closed forms of the Kullback-Leibler Divergence (KLD) are calculated in different situations for the Gaussian Copula and non-Gaussian functions in order to investigate the similarities in the proposed retrieval framework. The Jeffrey divergence (JD) criterion, which is a symmetrical version of the KLD, is used for investigating similarities in the proposed framework. We have implemented our experiments on four texture image retrieval benchmark datasets, the results of which show the superiority of the proposed framework over existing state-of-the-art methods. In addition, the retrieval time of the proposed framework is analyzed over the two steps of feature extraction and similarity matching, which shows that the proposed framework enjoys an appropriate retrieval time.
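
For reference, the Jeffrey divergence used above is the symmetrised Kullback-Leibler divergence (standard definitions; the paper's contribution is deriving closed forms of the KLD for its Gaussian-Copula and marginal models):

```latex
% Kullback-Leibler divergence and its symmetrised Jeffrey-divergence form.
D_{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx,
\qquad
JD(p, q) = D_{KL}(p \,\|\, q) + D_{KL}(q \,\|\, p).
```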

23. Fusion of Deep and Non-Deep Methods for Fast Super-Resolution of Satellite Images [PDF] Back to contents
  Gaurav Kumar Nayak, Saksham Jain, R Venkatesh Babu, Anirban Chakraborty
Abstract: In the emerging commercial space industry there is a drastic increase in access to low cost satellite imagery. The price for satellite images depends on the sensor quality and revisit rate. This work proposes to bridge the gap between image quality and the price by improving the image quality via super-resolution (SR). Recently, a number of deep SR techniques have been proposed to enhance satellite images. However, none of these methods utilize the region-level context information, giving equal importance to each region in the image. This, along with the fact that most state-of-the-art SR methods are complex and cumbersome deep models, means the time taken to process very large satellite images can be impractically high. We propose to handle this challenge by designing an SR framework that analyzes the regional information content on each patch of the low-resolution image and judiciously chooses to use more computationally complex deep models to super-resolve more structure-rich regions on the image, while using less resource-intensive non-deep methods on non-salient regions. Through extensive experiments on a large satellite image, we show a substantial decrease in inference time while achieving performance similar to that of existing deep SR methods over several evaluation measures like PSNR, MSE and SSIM.
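
A toy dispatcher illustrating the patch-routing idea (assumptions: patch variance as the regional information measure, bicubic upscaling as the non-deep path, and deep_sr as a placeholder for any pretrained SR model):

```python
# Route structure-rich patches to a deep SR model, smooth patches to bicubic.
import numpy as np
import cv2  # OpenCV, used here only for bicubic resizing

def hybrid_sr(patch: np.ndarray, deep_sr, scale: int = 4, thresh: float = 50.0):
    if patch.var() > thresh:                      # structure-rich: expensive path
        return deep_sr(patch)
    h, w = patch.shape[:2]                        # smooth region: cheap path
    return cv2.resize(patch, (w * scale, h * scale),
                      interpolation=cv2.INTER_CUBIC)
```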

24. Adversarial Graph Representation Adaptation for Cross-Domain Facial Expression Recognition [PDF] Back to contents
  Yuan Xie, Tianshui Chen, Tao Pu, Hefeng Wu, Liang Lin
Abstract: Data inconsistency and bias are inevitable among different facial expression recognition (FER) datasets due to subjective annotating process and different collecting conditions. Recent works resort to adversarial mechanisms that learn domain-invariant features to mitigate domain shift. However, most of these works focus on holistic feature adaptation, and they ignore local features that are more transferable across different datasets. Moreover, local features carry more detailed and discriminative content for expression recognition, and thus integrating local features may enable fine-grained adaptation. In this work, we propose a novel Adversarial Graph Representation Adaptation (AGRA) framework that unifies graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation. To achieve this, we first build a graph to correlate holistic and local regions within each domain and another graph to correlate these regions across different domains. Then, we learn the per-class statistical distribution of each domain and extract holistic-local features from the input image to initialize the corresponding graph nodes. Finally, we introduce two stacked graph convolution networks to propagate holistic-local feature within each domain to explore their interaction and across different domains for holistic-local feature co-adaptation. In this way, the AGRA framework can adaptively learn fine-grained domain-invariant features and thus facilitate cross-domain expression recognition. We conduct extensive and fair experiments on several popular benchmarks and show that the proposed AGRA framework achieves superior performance over previous state-of-the-art methods.

25. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark [PDF] Back to contents
  Qiao Liu, Xin Li, Zhenyu He, Chenglong Li, Jun Li, Zikun Zhou, Di Yuan, Jing Li, Kai Yang, Nana Fan, Feng Zheng
Abstract: In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we re-train several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at this https URL.

26. Deep Network Ensemble Learning applied to Image Classification using CNN Trees [PDF] Back to contents
  Abdul Mueed Hafiz, Ghulam Mohiuddin Bhat
Abstract: Traditional machine learning approaches may fail to perform satisfactorily when dealing with complex data. In this context, the importance of data mining evolves with respect to building an efficient knowledge discovery and mining framework. Ensemble learning is aimed at the integration of fusion, modeling and mining of data into a unified model. However, traditional ensemble learning methods are complex and have optimization or tuning problems. In this paper, we propose a simple, sequential, efficient, ensemble learning approach using multiple deep networks. The deep network used in the ensembles is ResNet50. The model draws inspiration from binary decision/classification trees. The proposed approach is compared against the baseline, viz. the single-classifier approach, i.e., using a single multiclass ResNet50, on the ImageNet and Natural Images datasets. Our approach outperforms the baseline on all experiments on the ImageNet dataset. Code is available at this https URL

27. Defining Traffic States using Spatio-temporal Traffic Graphs [PDF] Back to contents
  Debaditya Roy, K. Naveen Kumar, C. Krishna Mohan
Abstract: Intersections are one of the main sources of congestion and hence, it is important to understand traffic behavior at intersections. Particularly, in developing countries with high vehicle density, mixed traffic types, and lane-less driving behavior, it is difficult to distinguish between congested and normal traffic behavior. In this work, we propose a way to understand the traffic state of smaller spatial regions at intersections using traffic graphs. The way these traffic graphs evolve over time reveals different traffic states: a) congestion is forming (clumping), b) congestion is dispersing (unclumping), or c) traffic is flowing normally (neutral). We train a spatio-temporal deep network to identify these changes. Also, we introduce a large dataset called EyeonTraffic (EoT) containing 3 hours of aerial videos collected at 3 busy intersections in Ahmedabad, India. Our experiments on the EoT dataset show that the traffic graphs can help in correctly identifying congestion-prone behavior in different spatial regions of an intersection.

28. DSC IIT-ISM at SemEval-2020 Task 8: Bi-Fusion Techniques for Deep Meme Emotion Analysis [PDF] Back to contents
  Pradyumna Gupta, Himanshu Gupta, Aman Sinha
Abstract: Memes have become a ubiquitous social media entity and the processing and analysis of such multimodal data is currently an active area of research. This paper presents our work on the Memotion Analysis shared task of SemEval 2020, which involves the sentiment and humor analysis of memes. We propose a system which uses different bimodal fusion techniques to leverage the inter-modal dependency for sentiment and humor classification tasks. Out of all our experiments, the best system improved the baseline with macro F1 scores of 0.357 on Sentiment Classification (Task A), 0.510 on Humor Classification (Task B) and 0.312 on Scales of Semantic Classes (Task C).

29. Rethinking Image Deraining via Rain Streaks and Vapors [PDF]
  Yinglong Wang, Yibing Song, Chao Ma, Bing Zeng
Abstract: Single image deraining regards an input image as a fusion of a background image, a transmission map, rain streaks, and atmosphere light. While advanced models have been proposed for image restoration (i.e., background image generation), they regard rain streaks as having the same properties as the background rather than as a transmission medium. As vapors (i.e., rain streak accumulation or fog-like rain) are conveyed in the transmission map to model the veiling effect, this fusion of rain streaks and vapors does not naturally reflect rain image formation. In this work, we reformulate rain streaks as a transmission medium, together with vapors, to model rain imaging. We propose an encoder-decoder CNN named SNet to learn the transmission map of rain streaks. As rain streaks appear with various shapes and directions, we use ShuffleNet units within SNet to capture their anisotropic representations. As vapors are brought by rain streaks, we propose a VNet containing spatial pyramid pooling (SPP) to predict the transmission map of vapors at multiple scales based on that of rain streaks. Meanwhile, we use an encoder CNN named ANet to estimate atmosphere light. SNet, VNet, and ANet are jointly trained to predict transmission maps and atmosphere light for rain image restoration. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed visual model in predicting rain streaks and vapors. The proposed deraining method performs favorably against state-of-the-art deraining approaches.
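The fusion stated in the first sentence can be illustrated with the standard scattering form I = B * T + A * (1 - T), with rain streaks and vapors folded into the transmission map T as the reformulation suggests; whether the paper uses exactly this composition is an assumption.
```python
# Toy rendering of the stated fusion, using the standard scattering form
# I = B * T + A * (1 - T), with streaks and vapors folded into T.
import numpy as np

def compose_rain_image(background, transmission, atmosphere):
    """background: (H, W, 3) in [0, 1]; transmission: (H, W, 1); atmosphere: scalar."""
    return background * transmission + atmosphere * (1.0 - transmission)

H, W = 64, 64
B = np.random.rand(H, W, 3)                # clean background
T = 0.5 + 0.5 * np.random.rand(H, W, 1)    # stand-in streak/vapor transmission
I = compose_rain_image(B, T, atmosphere=0.9)
print(I.shape, float(I.min()), float(I.max()))
```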

30. Experimental results on palmvein-based personal recognition by multi-snapshot fusion of textural features [PDF]
  Mohanad Abukmeil, Gian Luca Marcialis
Abstract: In this paper, we investigate multi-snapshot fusion of textural features for palmvein recognition, including identification and verification. Although the literature has proposed several approaches for palmvein recognition, performance is still affected by identification and verification errors. As is well known, palmveins are usually described by line-based methods which enhance the vein flow, which is claimed to be unique from person to person. However, palmvein images are also characterized by texture, which can be captured by textural features relying on recent and efficient hand-crafted algorithms such as Local Binary Patterns, Local Phase Quantization, Local Ternary Pattern, Local Directional Pattern, and Binarized Statistical Image Features (LBP, LPQ, LTP, LDP and BSIF, respectively), among others. Moreover, these features can easily be managed with feature-level fusion when more than one sample can be acquired for recognition. Therefore, multi-snapshot fusion can be adopted to exploit the complementarity of these features. Our goal is to show that this is confirmed for palmvein recognition, thus making it possible to achieve very high recognition rates on a well-known benchmark data set.

31. Generating Visually Aligned Sound from Videos [PDF]
  Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan
Abstract: We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds generated outside a camera's view cannot be inferred from video content. The model may be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this challenge, we propose a framework named REGNET. In this framework, we first extract appearance and motion features from video frames to better distinguish the object that emits sound from complex background information. We then introduce an innovative audio forwarding regularizer that directly takes the real sound as input and outputs bottlenecked sound features. Using both visual and bottlenecked sound features during training provides stronger supervision for sound prediction. The audio forwarding regularizer can control the irrelevant sound component and thus prevent the model from learning an incorrect mapping between video frames and sound emitted by objects out of the screen. During testing, the audio forwarding regularizer is removed to ensure that REGNET can produce purely aligned sound from visual features alone. Extensive evaluations based on Amazon Mechanical Turk demonstrate that our method significantly improves both temporal and content-wise alignment. Remarkably, our generated sound can fool humans with a 68.12% success rate. Code and pre-trained models are publicly available at this https URL

32. Tell me what this is: Few-Shot Incremental Object Learning by a Robot [PDF]
  Ali Ayub, Alan R. Wagner
Abstract: For many applications, robots will need to be incrementally trained to recognize the specific objects needed for an application. This paper presents a practical system for incrementally training a robot to recognize different object categories using only a small set of visual examples provided by a human. The paper uses a recently developed state-of-the-art method for few-shot incremental learning of objects. After learning the object classes incrementally, the robot performs a table-cleaning task, organizing objects into categories specified by the human. We also demonstrate the system's ability to learn arrangements of objects and predict missing or incorrectly placed objects. Experimental evaluations demonstrate that our approach achieves nearly the same performance as a system trained with all examples at one time (batch training), which constitutes a theoretical upper bound.

33. Partially Supervised Multi-Task Network for Single-View Dietary Assessment [PDF]
  Ya Lu, Thomai Stathopoulou, Stavroula Mougiakakou
Abstract: Food volume estimation is an essential step in the pipeline of dietary assessment and demands precise depth estimation of the food surface and the table plane. Existing methods based on computer vision require either multi-image input or additional depth maps, reducing the convenience of implementation and their practical significance. Despite recent advances in unsupervised depth estimation from a single image, the achieved performance in the case of large texture-less areas needs to be improved. In this paper, we propose a network architecture that jointly performs geometric understanding (i.e., depth prediction and 3D plane estimation) and semantic prediction on a single food image, enabling robust and accurate food volume estimation regardless of the texture characteristics of the target plane. For training the network, only monocular videos with semantic ground truth are required, while depth map and 3D plane ground truth are no longer needed. Experimental results on two separate food image databases demonstrate that our method performs robustly on texture-less scenarios and is superior to unsupervised networks and structure-from-motion-based approaches, while achieving performance comparable to fully-supervised methods.

34. Kinematics of motion tracking using computer vision [PDF]
  José L. Escalona
Abstract: This paper describes the kinematics of the motion tracking of a rigid body using video recording. The novelty of the paper lies in adapting the methods and nomenclature used in Computer Vision to those used in Multibody System Dynamics. That way, the equations presented here can be used, for example, for inverse-dynamics multibody simulations driven by the motion tracking of selected bodies. This paper also adapts the well-known Zhang calibration method to the presented nomenclature.
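For reference, Zhang's calibration method is what OpenCV's calibrateCamera implements, so a minimal checkerboard-based calibration looks roughly like the sketch below; the board size and image paths are placeholders, and real checkerboard photos are needed for it to run end to end.
```python
# Placeholder sketch of Zhang-style calibration via OpenCV; paths are fake.
import cv2
import numpy as np

board = (9, 6)                                        # inner corners per row/column
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for fname in ["calib_01.png", "calib_02.png"]:        # hypothetical image paths
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(K)                                              # intrinsics via Zhang's method
```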

35. CASNet: Common Attribute Support Network for image instance and panoptic segmentation [PDF]
  Xiaolong Liu, Yuqing Hou, Anbang Yao, Yurong Chen, Keqiang Li
Abstract: Instance segmentation and panoptic segmentation have received more and more attention in recent years. In comparison with bounding-box-based object detection and semantic segmentation, instance segmentation can provide more analytical results at the pixel level. Given the insight that pixels belonging to one instance share one or more common attributes of that instance, we propose a one-stage instance segmentation network named Common Attribute Support Network (CASNet), which realizes instance segmentation by predicting and clustering common attributes. CASNet is designed in a fully convolutional manner and can be trained and run end to end. Moreover, CASNet predicts instances without overlaps and holes, a problem that exists in most current instance segmentation algorithms. Furthermore, it can easily be extended to panoptic segmentation through minor modifications with little computation overhead. CASNet builds a bridge between semantic and instance segmentation, going from finding pixel class IDs to obtaining class and instance IDs through operations on common attributes. In experiments on instance and panoptic segmentation, CASNet obtains mAP 32.8% and PQ 59.0% on the Cityscapes validation dataset with joint training, and mAP 36.3% and PQ 66.1% with a separated training mode. For panoptic segmentation, CASNet achieves state-of-the-art performance on the Cityscapes validation dataset.

36. Adding Seemingly Uninformative Labels Helps in Low Data Regimes [PDF]
  Christos Matsoukas, Albert Bou I Hernandez, Yue Liu, Karin Dembrower, Gisele Miranda, Emir Konuk, Johan Fredin Haslum, Athanasios Zouzos, Peter Lindholm, Fredrik Strand, Kevin Smith
Abstract: Evidence suggests that networks trained on large datasets generalize well not solely because of the numerous training examples, but also because of the class diversity, which encourages learning of enriched features. This raises the question of whether this remains true when data is scarce: is there an advantage to learning with additional labels in low-data regimes? In this work, we consider a task that requires difficult-to-obtain expert annotations: tumor segmentation in mammography images. We show that, in low-data settings, performance can be improved by complementing the expert annotations with seemingly uninformative labels from non-expert annotators, turning the task into a multi-class problem. We reveal that these gains increase when less expert data is available, and uncover several interesting properties through further studies. We demonstrate our findings on CSAW-S, a new dataset that we introduce here, and confirm them on two public datasets.

37. Real-Time Point Cloud Fusion of Multi-LiDAR Infrastructure Sensor Setups with Unknown Spatial Location and Orientation [PDF]
  Laurent Kloeker, Christian Kotulla, Lutz Eckstein
Abstract: The use of infrastructure sensor technology for traffic detection has already been proven several times. However, extrinsic sensor calibration is still a challenge for the operator. While previous approaches are unable to calibrate the sensors without the use of reference objects in the sensor field of view (FOV), we present an algorithm that is completely detached from external assistance and runs fully automatically. Our method focuses on the high-precision fusion of LiDAR point clouds and is evaluated in simulation as well as on real measurements. We set the LiDARs in a continuous pendulum motion in order to simulate real-world operation as closely as possible and to increase the demands on the algorithm. However, it does not receive any information about the initial spatial location and orientation of the LiDARs throughout the entire measurement period. Experiments in simulation as well as with real measurements have shown that our algorithm performs a continuous point cloud registration of up to four 64-layer LiDARs in real time. The average resulting translational error is within a few centimeters, and the average rotational error is below 0.15 degrees.

38. Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction [PDF]
  Chaofan Tao, Qinhong Jiang, Lixin Duan, Ping Luo
Abstract: Multi-agent motion prediction is challenging because it aims to foresee the future trajectories of multiple agents (e.g. pedestrians) simultaneously in a complicated scene. Existing work addressed this challenge either by learning social spatial interactions represented by the positions of a group of pedestrians while ignoring their temporal coherence (i.e. dependencies between different long trajectories), or by understanding the complicated scene layout (e.g. scene segmentation) to ensure safe navigation. However, unlike previous work that treated spatial interaction, temporal coherence, and scene layout in isolation, this paper designs a new mechanism, the Dynamic and Static Context-aware Motion Predictor (DSCMP), to integrate this rich information into a long short-term memory (LSTM) network. It has three appealing benefits. (1) DSCMP models the dynamic interactions between agents by learning both their spatial positions and temporal coherence, as well as understanding the contextual scene layout. (2) Different from previous LSTM models that predict motions by propagating hidden features frame by frame, which limits the capacity to learn correlations between long trajectories, we carefully design a differentiable queue mechanism in DSCMP that is able to explicitly memorize and learn the correlations between long trajectories. (3) DSCMP captures the context of the scene by inferring latent variables, which enables multimodal predictions with meaningful semantic scene layout. Extensive experiments show that DSCMP outperforms state-of-the-art methods by large margins, with 9.05% and 7.62% relative improvements on the ETH-UCY and SDD datasets respectively.

39. DCSFN: Deep Cross-scale Fusion Network for Single Image Rain Removal [PDF]
  Cong Wang, Xiaoying Xing, Zhixun Su, Junyang Chen
Abstract: Rain removal is an important but challenging computer vision task, as rain streaks can severely degrade the visibility of images and may cause other vision or multimedia tasks to fail. Previous works mainly focused on feature extraction and processing or on neural network structure; while current rain removal methods can already achieve remarkable results, training based on a single network structure without considering cross-scale relationships may cause information drop-out. In this paper, we explore a cross-scale manner between networks together with an inner-scale fusion operation to solve the image rain removal task. Specifically, to learn features at different scales, we propose a multi-sub-network structure, where these sub-networks are fused in a cross-scale manner by a Gated Recurrent Unit to inner-learn and make full use of information at different scales within these sub-networks. Further, we design an inner-scale connection block to utilize multi-scale information and a feature fusion path between different scales to improve the rain representation ability, and we introduce a dense block with skip connections to inner-connect these blocks. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our proposed method, which outperforms state-of-the-art methods. The source code will be available at this https URL.

40. GmFace: A Mathematical Model for Face Image Representation Using Multi-Gaussian [PDF]
  Liping Zhang, Weijun Li, Lina Yu, Xiaoli Dong, Linjun Sun, Xin Ning, Jian Xu, Hong Qin
Abstract: Establishing mathematical models is a ubiquitous and effective way to understand the objective world. Due to complex physiological structures and dynamic behaviors, mathematical representation of the human face is an especially challenging task. In this paper, a mathematical model for face image representation called GmFace is proposed in the form of a multi-Gaussian function. The model utilizes the advantages of the two-dimensional Gaussian function, which provides a symmetric bell surface whose shape can be controlled by parameters. GmNet is then designed using Gaussian functions as neurons, with parameters that correspond to each of the parameters of GmFace, in order to transform the problem of solving GmFace parameters into a network optimization problem for GmNet. The face modeling process can be described by the following steps: (1) GmNet initialization; (2) feeding GmNet with face image(s); (3) training GmNet until convergence; (4) drawing out the parameters of GmNet (which are the same as those of GmFace); (5) recording the face model GmFace. Furthermore, using GmFace, several face image transformation operations can be realized mathematically through simple parameter computation.
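Steps (1)-(4) can be condensed into a small gradient-descent sketch: a sum of parameterized 2D Gaussians is rendered and fitted to an image by minimizing the reconstruction error. The number of Gaussians, the initialization and the optimizer settings below are illustrative assumptions.
```python
# Fitting a sum of K parameterized 2D Gaussians to an image by gradient
# descent; K, initialization and optimizer are illustrative choices.
import torch

H, W, K = 64, 64, 32
img = torch.rand(H, W)                            # stand-in for a face image
ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                        torch.arange(W, dtype=torch.float32), indexing="ij")

amp = torch.randn(K, requires_grad=True)          # per-Gaussian amplitude
cx = (torch.rand(K) * W).requires_grad_()         # centers
cy = (torch.rand(K) * H).requires_grad_()
sx = (5 + 10 * torch.rand(K)).requires_grad_()    # scales
sy = (5 + 10 * torch.rand(K)).requires_grad_()

def render():
    dx2 = (xs[None] - cx[:, None, None]) ** 2 / (2 * sx[:, None, None] ** 2)
    dy2 = (ys[None] - cy[:, None, None]) ** 2 / (2 * sy[:, None, None] ** 2)
    return (amp[:, None, None] * torch.exp(-(dx2 + dy2))).sum(dim=0)

opt = torch.optim.Adam([amp, cx, cy, sx, sy], lr=0.5)
for step in range(200):                           # "train GmNet until convergence"
    opt.zero_grad()
    loss = ((render() - img) ** 2).mean()
    loss.backward()
    opt.step()
print(float(loss))                                # reconstruction error of the fit
```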

41. The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020) [PDF]
  Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shizhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
Abstract: We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval: the task of searching for content within a corpus of videos using natural language queries. This report summarizes the results of the first edition of the challenge together with the findings of the participants.

42. AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting [PDF]
  Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, ZhiBo Yang, Tong Lu, Chunhua Shen, Ping Luo
Abstract: Scene text spotting aims to detect and recognize entire words or sentences with multiple characters in natural images. It is still challenging because ambiguity often occurs when the spacing between characters is large or the characters are evenly spread over multiple rows and columns, creating many visually plausible groupings of the characters (e.g. "BERLIN" is incorrectly detected as "BERL" and "IN" in Fig. 1(c)). Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. The proposed AE TextSpotter has three important benefits. 1) The linguistic representation is learned together with the visual representation in a single framework. To our knowledge, this is the first time text detection has been improved by using a language model. 2) A carefully designed language module is utilized to reduce the detection confidence of incorrect text lines, making them easy to prune in the detection stage. 3) Extensive experiments show that AE TextSpotter outperforms other state-of-the-art methods by a large margin. For example, we carefully selected a set of extremely ambiguous samples from the IC19-ReCTS dataset, on which our approach surpasses other methods by more than 4%.

43. Deep Complementary Joint Model for Complex Scene Registration and Few-shot Segmentation on Medical Images [PDF]
  Yuting He, Tiantian Li, Guanyu Yang, Youyong Kong, Yang Chen, Huazhong Shu, Jean-Louis Coatrieux, Jean-Louis Dillenseger, Shuo Li
Abstract: Deep-learning-based medical image registration and segmentation joint models utilize complementarity (augmentation data or weakly supervised data from registration, region constraints from segmentation) to bring mutual improvement in complex scenes and few-shot situations. However, further adoption of joint models is hindered because: 1) the diversity of augmentation data is reduced, limiting further enhancement of segmentation; 2) misaligned regions in weakly supervised data disturb the training process; and 3) a lack of label-based region constraints in few-shot situations limits registration performance. We propose a novel Deep Complementary Joint Model (DeepRS) for complex scene registration and few-shot segmentation. We embed a perturbation factor in the registration to increase the activity of deformation, thus maintaining augmentation data diversity. We use a pixel-wise discriminator to extract alignment confidence maps which highlight aligned regions in weakly supervised data, so that the disturbance from misaligned regions is suppressed via weighting. The outputs of the segmentation model are utilized to implement deep-based region constraints, relieving the label requirements and yielding fine registration. Extensive experiments on the CT dataset of the MM-WHS 2017 Challenge show great advantages of our DeepRS, which outperforms existing state-of-the-art models.

44. Anti-Bandit Neural Architecture Search for Model Defense [PDF]
  Hanlin Chen, Baochang Zhang, Song Xue, Xuan Gong, Hong Liu, Rongrong Ji, David Doermann
Abstract: Deep convolutional neural networks (DCNNs) have dominated as the best performers in machine learning, but can be challenged by adversarial attacks. In this paper, we defend against adversarial attacks using neural architecture search (NAS), based on a comprehensive search over denoising blocks, weight-free operations, Gabor filters and convolutions. The resulting anti-bandit NAS (ABanditNAS) incorporates a new operation evaluation measure and search process based on the lower and upper confidence bounds (LCB and UCB). Unlike the conventional bandit algorithm, which uses UCB for evaluation only, we use UCB to abandon arms for search efficiency and LCB for a fair competition between arms. Extensive experiments demonstrate that ABanditNAS is faster than other NAS methods, while achieving an 8.73% improvement over prior art on CIFAR-10 under PGD-7.
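The UCB/LCB interplay described above is in the spirit of successive elimination, sketched below on simulated rewards: every surviving arm (candidate operation) is pulled each round, and an arm is abandoned once its UCB falls below the best arm's LCB. The rewards, constants and schedule are toy assumptions, not ABanditNAS's actual procedure.
```python
# Toy successive-elimination sketch: every surviving arm is pulled once per
# round (fair competition); an arm is abandoned when its UCB drops below the
# best LCB among all arms.
import math
import random

stats = {f"op{i}": {"mean": 0.0, "n": 0} for i in range(8)}
true_reward = {a: random.random() for a in stats}        # hidden quality of each op

def pad(s, t, c=0.8):
    return c * math.sqrt(math.log(t + 1) / max(s["n"], 1))

for t in range(1, 300):
    for a, s in stats.items():                           # one pull per surviving arm
        r = true_reward[a] + random.gauss(0.0, 0.1)
        s["n"] += 1
        s["mean"] += (r - s["mean"]) / s["n"]
    best_lcb = max(s["mean"] - pad(s, t) for s in stats.values())
    stats = {a: s for a, s in stats.items()              # abandon via UCB
             if s["mean"] + pad(s, t) >= best_lcb}
    if len(stats) == 1:
        break
print("surviving operation(s):", sorted(stats))
```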

45. Adversarial Semantic Data Augmentation for Human Pose Estimation [PDF]
  Yanrui Bin, Xuan Cao, Xinya Chen, Yanhao Ge, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Changxin Gao, Nong Sang
Abstract: Human pose estimation is the task of localizing body keypoints from still images. State-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion and nearby persons. To enlarge the number of challenging cases, previous methods augmented images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearance and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularities. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations. Given an off-the-shelf pose estimation network as the discriminator, the generator seeks the most confusing transformation to increase the loss of the discriminator, while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks.
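The core pasting operation behind SDA can be sketched in a few lines: composite a segmented body-part crop onto a target image through its binary mask. In ASDA the generator would choose the part, location and scale; here the offset is fixed by hand and all data is fake.
```python
# Minimal semantic pasting: composite a body-part crop through its mask.
import numpy as np

def paste_part(image, part, mask, top, left):
    """image: (H, W, 3); part: (h, w, 3); mask: (h, w) binary."""
    h, w = mask.shape
    region = image[top:top + h, left:left + w]
    region[mask > 0] = part[mask > 0]       # writes through the view into image
    return image

img = np.zeros((256, 256, 3), dtype=np.uint8)
limb = np.full((40, 20, 3), 200, dtype=np.uint8)
mask = np.ones((40, 20), dtype=np.uint8)
out = paste_part(img, limb, mask, top=100, left=120)
print(out[110, 125])                        # -> [200 200 200]
```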

46. The pursuit of beauty: Converting image labels to meaningful vectors [PDF]
  Savvas Karatsiolis, Andreas Kamilaris
Abstract: A challenge for the computer vision community is to understand the semantics of an image, in order to allow image reconstruction based on existing high-level features or to better analyze (semi-)labelled datasets. Towards addressing this challenge, this paper introduces a method, called Occlusion-based Latent Representations (OLR), for converting image labels to meaningful representations that capture a significant amount of data semantics. Besides being informationally rich, these representations compose a disentangled low-dimensional latent space where each image label is encoded into a separate vector. We evaluate the quality of these representations in a series of experiments whose results suggest that the proposed model can capture data concepts and discover data interrelations.

47. PIC-Net: Point Cloud and Image Collaboration Network for Large-Scale Place Recognition [PDF]
  Yuheng Lu, Fan Yang, Fangping Chen, Don Xie
Abstract: Place recognition is one of the hot research fields in automation technology and is still an open issue. Camera and LiDAR are the two mainstream sensors used in this task: camera-based methods are easily affected by illumination and season changes, while LiDAR cannot capture data as rich as images can. In this paper, we propose PIC-Net (Point cloud and Image Collaboration Network), which uses an attention mechanism to fuse the features of image and point cloud and mines the complementary information between the two. Furthermore, in order to improve recognition performance at night, we transform night images into a daytime style. Comparison results show that the collaboration of image and point cloud outperforms both image-based and point-cloud-based methods, and that the attention strategy and day-night transformation can further improve performance.

48. Self-supervised Object Tracking with Cycle-consistent Siamese Networks [PDF]
  Weihao Yuan, Michael Yu Wang, Qifeng Chen
Abstract: Self-supervised learning for visual object tracking possesses valuable advantages over supervised learning, such as not requiring laborious human annotations and allowing online training. In this work, we exploit an end-to-end Siamese network in a cycle-consistent self-supervised framework for object tracking. Self-supervision can be performed by taking advantage of the cycle consistency between forward and backward tracking. To better leverage the end-to-end learning of deep networks, we propose to integrate a Siamese region proposal and mask regression network into our tracking framework, so that a fast and more accurate tracker can be learned without the annotation of each frame. Experiments on the VOT dataset for visual object tracking and on the DAVIS dataset for video object segmentation propagation show that our method outperforms prior approaches on both tasks.
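The self-supervision signal reduces to a round trip: track forward through a clip, track back, and penalize the deviation from the starting box. The sketch below uses a toy stand-in tracker to show how gradients flow through the cycle; the tracker interface is an assumption, not the paper's Siamese network.
```python
# Toy cycle-consistency loss: forward then backward tracking should return
# to the initial box; everything stays differentiable for self-supervision.
import torch
import torch.nn.functional as F

def cycle_loss(tracker, frames, init_box):
    """frames: list of (C, H, W) tensors; init_box: (4,) box in frames[0]."""
    box = init_box
    for f in frames[1:]:                    # forward pass: frame 0 -> N
        box = tracker(f, box)
    for f in reversed(frames[:-1]):         # backward pass: frame N -> 0
        box = tracker(f, box)
    return F.mse_loss(box, init_box)

net = torch.nn.Linear(5, 4)                 # toy "tracker" parameters
tracker = lambda f, b: b + net(torch.cat([b, f.mean().unsqueeze(0)]))
frames = [torch.rand(3, 32, 32) for _ in range(4)]
loss = cycle_loss(tracker, frames, torch.tensor([8.0, 8.0, 16.0, 16.0]))
loss.backward()                             # self-supervised gradient for the tracker
print(float(loss))
```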

49. Deep Photo Cropper and Enhancer [PDF]
  Aaron Ott, Amir Mazaheri, Niels D. Lobo, Mubarak Shah
Abstract: This paper introduces a new type of image enhancement problem. Compared to traditional image enhancement methods, which mostly deal with pixel-wise modifications of a given photo, our proposed task is to crop an image which is embedded within a photo and enhance the quality of the cropped image. We split our proposed approach into two deep networks: a deep photo cropper and a deep image enhancer. In the photo cropper network, we employ a spatial transformer to extract the embedded image. In the photo enhancer, we employ super-resolution to increase the number of pixels in the embedded image and reduce the effects of stretching and distortion of pixels. We use a cosine distance loss between image features and ground truth for the cropper and a mean square loss for the enhancer. Furthermore, we propose a new dataset to train and test the proposed method. Finally, we analyze the proposed method with respect to qualitative and quantitative evaluations.
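The two losses named above have straightforward forms; below is a sketch under the assumption of batched feature vectors for the cropper and RGB tensors for the enhancer (the reduction choices are guesses).
```python
# Sketch of the two losses: cosine distance for the cropper, MSE for the
# enhancer; shapes and reductions are illustrative assumptions.
import torch
import torch.nn.functional as F

def cropper_loss(pred_feat, gt_feat):
    # cosine distance = 1 - cosine similarity, averaged over the batch
    return (1.0 - F.cosine_similarity(pred_feat, gt_feat, dim=1)).mean()

def enhancer_loss(pred_img, gt_img):
    return F.mse_loss(pred_img, gt_img)

f_pred, f_gt = torch.randn(4, 512), torch.randn(4, 512)
x_pred, x_gt = torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128)
print(float(cropper_loss(f_pred, f_gt)), float(enhancer_loss(x_pred, x_gt)))
```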

50. Learning to Purify Noisy Labels via Meta Soft Label Corrector [PDF]
  Yichen Wu, Jun Shu, Qi Xie, Qian Zhao, Deyu Meng
Abstract: Recent deep neural networks (DNNs) can easily overfit to biased training data with noisy labels. Label correction strategies are commonly used to alleviate this issue by designing a method to identify suspected noisy labels and then correct them. Current approaches to correcting corrupted labels usually need certain pre-defined label correction rules or manually preset hyper-parameters. These fixed settings are hard to apply in practice, since accurate label correction usually depends on the concrete problem, the training data, and the temporal information hidden in the dynamic iterations of the training process. To address this issue, we propose a meta-learning model which estimates soft labels through a meta-gradient descent step under the guidance of noise-free meta data. By viewing the label correction procedure as a meta-process and using a meta-learner to automatically correct labels, we can adaptively obtain rectified soft labels iteratively according to the current training problem, without manually preset hyper-parameters. Besides, our method is model-agnostic, and we can combine it with any other existing model with ease. Comprehensive experiments substantiate the superiority of our method over current SOTA label correction strategies on both synthetic and real-world problems with noisy labels.

51. Robust Collaborative Learning of Patch-level and Image-level Annotations for Diabetic Retinopathy Grading from Fundus Image [PDF]
  Yehui Yang, Fangxin Shang, Binghong Wu, Dalu Yang, Lei Wang, Yanwu Xu, Wensheng Zhang, Tianzhu Zhang
Abstract: Currently, diabetic retinopathy (DR) grading from fundus images has attracted increasing interest in both academic and industrial communities. Most convolutional neural network (CNN) based algorithms treat DR grading as a classification task via image-level annotations. However, they have not fully explored the valuable information in DR-related lesions. In this paper, we present a robust framework that can collaboratively utilize both patch-level lesion and image-level grade annotations for DR severity grading. By optimizing the entire framework end to end, the fine-grained lesion and image-level grade information can be bidirectionally exchanged to exploit more discriminative features for DR grading. Compared with recent state-of-the-art algorithms and three ophthalmologists with over nine years of clinical experience, the proposed algorithm shows favorable performance. Tested on datasets from totally different scenarios and distributions (in terms of, e.g., labels and cameras), our algorithm proves robust against the image quality and distribution problems that commonly exist in real-world practice. Extensive ablation studies dissect the proposed framework and indicate the effectiveness and necessity of each motivation. The code and some valuable annotations are now publicly available.

52. Semi-supervised deep learning based on label propagation in a 2D embedded space [PDF]
  Barbara Caroline Benato, Jancarlo Ferreira Gomes, Alexandru Cristian Telea, Alexandre Xavier Falcão
Abstract: While convolutional neural networks need large labeled training sets, expert human supervision of such datasets can be very laborious. Proposed solutions propagate labels from a small set of supervised images to a large set of unsupervised ones, in order to obtain sufficient truly-and-artificially labeled samples to train a deep neural network model. Yet, such solutions need many supervised images for validation. We present a loop in which a deep neural network (VGG-16) is trained on a set with more correctly labeled samples at each iteration; this set is created by using t-SNE to project the features of the network's last max-pooling layer into a 2D embedded space, in which labels are propagated using the Optimum-Path Forest semi-supervised classifier. As the labeled set improves along the iterations, it improves the features of the neural network. We show that this can significantly improve classification results on test data (using only 1% to 5% of supervised samples) on three private challenging datasets and two public ones.
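The loop can be condensed as below, with scikit-learn's LabelSpreading standing in for the Optimum-Path Forest classifier (which is not in scikit-learn) and random vectors standing in for VGG-16 max-pooling features; both substitutions, and the iteration count, are assumptions.
```python
# Condensed sketch of the project -> propagate -> retrain loop; -1 marks
# unsupervised samples, and the network retraining step is elided.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 512))           # deep features of all images
labels = np.full(300, -1)
labels[:30] = rng.integers(0, 3, size=30)     # the small supervised set

for it in range(3):                           # the outer training loop
    emb2d = TSNE(n_components=2, random_state=0).fit_transform(feats)
    prop = LabelSpreading(kernel="knn", n_neighbors=7).fit(emb2d, labels)
    pseudo = prop.transduction_               # propagated labels for every sample
    # ... retrain the network on (images, pseudo), then recompute `feats` ...

print(np.bincount(pseudo))
```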

53. Efficient Deep Learning of Non-local Features for Hyperspectral Image Classification [PDF]
  Yu Shen, Sijie Zhu, Chen Chen, Qian Du, Liang Xiao, Jianyu Chen, Delu Pan
Abstract: Deep-learning-based methods, such as Convolutional Neural Networks (CNNs), have demonstrated their efficiency in hyperspectral image (HSI) classification. These methods can automatically learn spectral-spatial discriminative features within local patches. However, each pixel in an HSI is not only related to its nearby pixels but also has connections to pixels far away from itself. Therefore, to incorporate long-range contextual information, a deep fully convolutional network (FCN) with an efficient non-local module, named ENL-FCN, is proposed for HSI classification. In the proposed framework, a deep FCN takes an entire HSI as input and extracts spectral-spatial information in a local receptive field. The efficient non-local module is embedded in the network as a learning unit to capture long-range contextual information. Different from traditional non-local neural networks, the long-range contextual information is extracted along a specially designed criss-cross path for computation efficiency. Furthermore, by using a recurrent operation, each pixel's response is aggregated from all pixels of the HSI. The benefits of our proposed ENL-FCN are threefold: 1) long-range contextual information is incorporated effectively; 2) the efficient module can be freely embedded in a deep neural network in a plug-and-play fashion; and 3) it has far fewer learning parameters and requires less computational resources. Experiments conducted on three popular HSI datasets demonstrate that the proposed method achieves state-of-the-art classification performance at lower computational cost than several leading deep neural networks for HSI.

54. Integrated monitoring of ice in selected Swiss lakes. Final project report [PDF]
  Manu Tom, Melanie Suetterlin, Damien Bouffard, Mathias Rothermel, Stefan Wunderle, Emmanuel Baltsavias
Abstract: Various lake observables, including lake ice, are related to climate and climate change and provide a good opportunity for long-term monitoring. Lakes (and, as part of them, lake ice) are therefore considered an Essential Climate Variable (ECV) of the Global Climate Observing System (GCOS). Following the need for integrated multi-temporal monitoring of lake ice in Switzerland, MeteoSwiss, in the framework of GCOS Switzerland, supported this 2-year project to explore not only the use of satellite images but also the possibilities of webcams and in-situ measurements. The aim of this project is to monitor some target lakes and detect the extent of ice and especially the ice-on/off dates, with a focus on the integration of various input data and processing methods. The target lakes are St. Moritz, Silvaplana, Sils, Sihl, Greifen and Aegeri, of which only the first four were mainly frozen during the observation period and thus processed. The observation period was mainly the winter of 2016-17. During the project, various approaches were developed, implemented, tested and compared. Firstly, low spatial resolution (250 - 1000 m) but high temporal resolution (1 day) satellite images from the optical sensors MODIS and VIIRS were used. Secondly, as a pilot project, the use of existing public webcams was investigated for (a) validation of results from satellite data, and (b) independent estimation of lake ice, especially for small lakes like St. Moritz that could not possibly be monitored in the satellite images. Thirdly, in-situ measurements were made in order to characterize the development of the temperature profiles, and partly pressure, before freezing and under the ice cover until melting. This report presents the results of the project work.

55. HyperFaceNet: A Hyperspectral Face Recognition Method Based on Deep Fusion [PDF]
  Zhicheng Cao, Xi Cen, Liaojun Pang
Abstract: Face recognition has already been well studied under visible light and infrared, in both intra-spectral and cross-spectral cases. However, how to fuse different light bands, i.e., hyperspectral face recognition, is still an open research problem; it has the advantages of richer information retention and all-weather functionality over single-band face recognition. Among the very few works on hyperspectral face recognition, traditional non-deep-learning techniques are largely used. In this paper, we bring deep learning to the topic of hyperspectral face recognition and propose a new fusion model (termed HyperFaceNet) especially for hyperspectral faces. The proposed fusion model is characterized by residual dense learning, a feedback-style encoder and a recognition-oriented loss function. In our experiments, our method is shown to achieve higher recognition rates than face recognition using either visible light or infrared alone. Moreover, our fusion model is shown to be superior to other general-purpose image fusion methods, including the state of the art, in terms of both image quality and recognition performance.

56. Tensor Low-Rank Reconstruction for Semantic Segmentation [PDF]
  Wanli Chen, Xinge Zhu, Ruoqi Sun, Junjun He, Ruiyu Li, Xiaoyong Shen, Bei Yu
Abstract: Context information plays an indispensable role in the success of semantic segmentation. Recently, non-local self-attention-based methods have proved effective for context information collection. Since the desired context consists of spatial-wise and channel-wise attention, a 3D representation is an appropriate formulation. However, these non-local methods describe 3D context information based on a 2D similarity matrix, where the space compression may lead to missing channel-wise attention. An alternative is to model the contextual information directly without compression. However, this effort confronts a fundamental difficulty, namely the high-rank property of context information. In this paper, we propose a new approach to model the 3D context representations, which not only avoids the space compression but also tackles the high-rank difficulty. Here, inspired by canonical-polyadic decomposition theory for tensors (i.e., a high-rank tensor can be expressed as a combination of rank-1 tensors), we design a low-rank-to-high-rank context reconstruction framework (i.e., RecoNet). Specifically, we first introduce a tensor generation module (TGM), which generates a number of rank-1 tensors to capture fragments of the context feature. Then we use these rank-1 tensors to recover the high-rank context features through our proposed tensor reconstruction module (TRM). Extensive experiments show that our method achieves state-of-the-art performance on various public datasets. Additionally, our proposed method has more than 100 times lower computational cost than conventional non-local-based methods.
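The decomposition the framework leans on is easy to make concrete: a C x H x W context tensor is approximated by a sum of rank-1 tensors, each the outer product of one vector per mode. A worked illustration (shapes are arbitrary):
```python
# A C x H x W tensor approximated as a sum of r rank-1 tensors, each an
# outer product of one vector per mode (the CP form the abstract cites).
import torch

C, H, W, r = 16, 8, 8, 4
u = torch.randn(r, C)                       # channel-mode vectors
v = torch.randn(r, H)                       # height-mode vectors
w = torch.randn(r, W)                       # width-mode vectors

recon = torch.einsum("kc,kh,kw->chw", u, v, w)    # sum of r rank-1 tensors
print(recon.shape)                                # torch.Size([16, 8, 8])

# Sanity check: a single term really is rank-1 when flattened to a matrix.
t1 = torch.einsum("c,h,w->chw", u[0], v[0], w[0])
print(torch.linalg.matrix_rank(t1.reshape(C, H * W)).item())  # 1
```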

57. SymmetryNet: Learning to Predict Reflectional and Rotational Symmetries of 3D Shapes from Single-View RGB-D Images [PDF]
  Yifei Shi, Junwen Huang, Hongjia Zhang, Xin Xu, Szymon Rusinkiewicz, Kai Xu
Abstract: We study the problem of symmetry detection for 3D shapes from single-view RGB-D images, where severe missing data renders geometric detection approaches infeasible. We propose an end-to-end deep neural network which is able to predict both reflectional and rotational symmetries of 3D objects present in the input RGB-D image. Directly training a deep model for symmetry prediction, however, can quickly run into the issue of overfitting. We adopt a multi-task learning approach: aside from symmetry axis prediction, our network is also trained to predict symmetry correspondences. In particular, given the 3D points present in the RGB-D image, our network outputs for each 3D point its symmetric counterpart corresponding to a specific predicted symmetry. In addition, our network is able to detect, for a given shape, multiple symmetries of different types. We also contribute a benchmark for 3D symmetry detection based on single-view RGB-D images. Extensive evaluation on the benchmark demonstrates the strong generalization ability of our method, in terms of high accuracy of both symmetry axis prediction and counterpart estimation. In particular, our method is robust in handling unseen object instances with large variations in shape, multi-symmetry compositions, as well as novel object categories.
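For a reflectional symmetry, the counterpart relation the network is trained to predict is plain mirror geometry: given a plane through point c with unit normal n, the counterpart of p is p - 2((p - c) . n) n. A minimal check:
```python
# Mirror a 3D point across a symmetry plane (point c, unit normal n).
import numpy as np

def reflect(p, c, n):
    n = n / np.linalg.norm(n)
    return p - 2.0 * np.dot(p - c, n) * n

p = np.array([1.0, 2.0, 3.0])
c = np.zeros(3)                             # a point on the plane
n = np.array([1.0, 0.0, 0.0])               # normal of the yz-plane
print(reflect(p, c, n))                     # -> [-1.  2.  3.]
```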

58. Mask Point R-CNN [PDF]
  Wenchao Zhang, Chong Fu, Mai Zhu
Abstract: The attributes of object contours have great significance for the instance segmentation task. However, most current popular deep neural networks do not pay much attention to target edge information. Inspired by the human annotation process when making instance segmentation datasets, in this paper we propose Mask Point R-CNN, which aims to promote the network's attention to target edge information and heightens the information propagated between multiple tasks by using features of different attributes. Specifically, we present an auxiliary task for Mask R-CNN, utilizing keypoint detection technology to construct the target edge contour, and enhancing the sensitivity of the network to object edges through multi-task learning and feature fusion. These improvements are easy to implement and incur a small amount of additional computing overhead. In extensive evaluations on the Cityscapes dataset, our approach outperforms vanilla Mask R-CNN by 5.4 on the validation subset and 5.0 on the test subset.

59. Video Super-Resolution with Recurrent Structure-Detail Network [PDF]
  Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, Qi Tian
Abstract: Most video super-resolution methods super-resolve a single reference frame with the help of neighboring frames in a temporal sliding window, and are less efficient than recurrent-based methods. In this work, we propose a novel recurrent video super-resolution method which is both effective and efficient in exploiting previous frames to super-resolve the current frame. It divides the input into structure and detail components, which are fed to a recurrent unit composed of several proposed two-stream structure-detail blocks. In addition, a hidden state adaptation module that allows the current frame to selectively use information from the hidden state is introduced to enhance robustness to appearance change and error accumulation. Extensive ablation studies validate the effectiveness of the proposed modules. Experiments on several benchmark datasets demonstrate the superior performance of the proposed method compared to state-of-the-art methods on video super-resolution.
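One plausible reading of the structure-detail split is a low-pass/residual decomposition: a blurred copy carries structure and the residual carries detail, so the two components sum back to the input. The Gaussian filter below is an assumption about the exact decomposition, not the paper's definition.
```python
# Low-pass/residual split: the blurred copy is "structure", the residual is
# "detail", and the two sum back to the input by construction.
import torch
import torch.nn.functional as F

def structure_detail(x, k=5, sigma=1.5):
    """x: (B, C, H, W) -> (structure, detail) with x = structure + detail."""
    coords = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = (g[:, None] * g[None, :]).expand(x.shape[1], 1, k, k).contiguous()
    structure = F.conv2d(x, kernel, padding=k // 2, groups=x.shape[1])
    return structure, x - structure

x = torch.rand(1, 3, 64, 64)
s, d = structure_detail(x)
print(torch.allclose(s + d, x))             # True
```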

60. Stochastic Bundle Adjustment for Efficient and Scalable 3D Reconstruction [PDF]
  Lei Zhou, Zixin Luo, Mingmin Zhen, Tianwei Shen, Shiwei Li, Zhuofei Huang, Tian Fang, Long Quan
Abstract: Current bundle adjustment solvers such as the Levenberg-Marquardt (LM) algorithm are limited by the bottleneck of solving the Reduced Camera System (RCS), whose dimension is proportional to the number of cameras. When the problem is scaled up, this step is neither efficient in computation nor manageable for a single compute node. In this work, we propose a stochastic bundle adjustment algorithm which seeks to decompose the RCS approximately inside the LM iterations to improve efficiency and scalability. It first reformulates the quadratic programming problem of an LM iteration based on a clustering of the visibility graph, by introducing equality constraints across clusters. Then, we propose to relax it into a chance-constrained problem and solve it through a sampled convex program. The relaxation is intended to eliminate the interdependence between clusters embodied by the constraints, so that a large RCS can be decomposed into independent linear sub-problems. Numerical experiments on unordered Internet image sets and sequential SLAM image sets, as well as distributed experiments on large-scale datasets, have demonstrated the high efficiency and scalability of the proposed approach. Codes are released at this https URL.

61. Blind Face Restoration via Deep Multi-scale Component Dictionaries [PDF]
  Xiaoming Li, Chaofeng Chen, Shangchen Zhou, Xianhui Lin, Wangmeng Zuo, Lei Zhang
Abstract: Recent reference-based face restoration methods have received considerable attention due to their great capability in recovering high-frequency details on real low-quality images. However, most of these methods require a high-quality reference image of the same identity, making them applicable only in limited scenes. To address this issue, this paper proposes a deep face dictionary network (termed DFDNet) to guide the restoration process of degraded observations. To begin with, we use K-means to generate deep dictionaries for perceptually significant face components (i.e., left/right eyes, nose and mouth) from high-quality images. Next, with the degraded input, we match and select the most similar component features from the corresponding dictionaries and transfer the high-quality details to the input via the proposed dictionary feature transfer (DFT) block. In particular, component AdaIN is leveraged to eliminate the style diversity between the input and dictionary features (e.g., illumination), and a confidence score is proposed to adaptively fuse the dictionary feature into the input. Finally, multi-scale dictionaries are adopted in a progressive manner to enable coarse-to-fine restoration. Experiments show that our proposed method can achieve plausible performance in both quantitative and qualitative evaluation and, more importantly, can generate realistic and promising results on real degraded images without requiring a reference of the same identity. The source code and models are available at this https URL.
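Component AdaIN in its standard form renormalizes one feature map to the channel-wise mean/std statistics of another; a sketch is below, with the caveat that whether DFDNet applies exactly this per-component variant is inferred from the abstract.
```python
# Standard AdaIN: renormalize `content` to the channel-wise statistics of
# `style`; here a dictionary feature is matched to the input's style.
import torch

def adain(content, style, eps=1e-5):
    """content, style: (B, C, H, W) feature maps."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

inp = torch.randn(2, 64, 32, 32)            # degraded input features
dic = torch.randn(2, 64, 32, 32)            # matched dictionary features
print(adain(dic, inp).shape)                # dictionary content, input style
```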

62. Removing Backdoor-Based Watermarks in Neural Networks with Limited Data [PDF] 返回目录
  Xuankai Liu, Fengting Li, Bihan Wen, Qi Li
Abstract: Deep neural networks have been widely applied and have achieved great success in various fields. As training deep models usually consumes massive data and computational resources, trading trained deep models is in high demand and lucrative nowadays. Unfortunately, naive trading schemes typically involve potential risks related to copyright and trustworthiness issues, e.g., a sold model can be illegally resold to others without further authorization to reap huge profits. To tackle this problem, various watermarking techniques have been proposed to protect model intellectual property, among which backdoor-based watermarking is the most commonly used. However, the robustness of these watermarking approaches is not well evaluated under realistic settings, such as limited availability of in-distribution data and no knowledge of the watermarking patterns. In this paper, we benchmark the robustness of watermarking, and propose a novel backdoor-based watermark removal framework using limited data, dubbed WILD. The proposed WILD removes the watermarks of deep models with only a small portion of training data, and the output model performs the same as models trained from scratch without watermarks injected. In particular, a novel data augmentation method is utilized to mimic the behavior of watermark triggers. Combined with distribution alignment between the normal and perturbed (e.g., occluded) data in the feature space, our approach generalizes well to all typical types of trigger content. The experimental results demonstrate that our approach can effectively remove the watermarks without compromising the deep model's performance on the original task, given only limited access to the training data.

63. SeqDialN: Sequential Visual Dialog Networks in Joint Visual-Linguistic Representation Space [PDF] 返回目录
  Liu Yang, Fanqi Meng, Ming-Kuang Daniel Wu, Vicent Ying, Xianchao Xu
Abstract: In this work, we formulate a visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round. Based on this formulation, we consider the visual dialog task as a sequence problem consisting of ordered visual-linguistic vectors. For featurization, we use a Dense Symmetric Co-Attention network as a lightweight vision-language joint representation generator to fuse multimodal features (i.e., image and text), yielding better computation and data efficiency. For inference, we propose two Sequential Dialog Networks (SeqDialN): the first uses an LSTM for information propagation (IP) and the second uses a modified Transformer for multi-step reasoning (MR). Our architecture separates the complexity of multimodal feature fusion from that of inference, which allows a simpler design of the inference engine. The IP-based SeqDialN is our baseline, with a simple 2-layer LSTM design that achieves decent performance. The MR-based SeqDialN, on the other hand, recurrently refines the semantic question/history representations through the self-attention stack of the Transformer and produces promising results on the visual dialog task. On the VisDial v1.0 test-std dataset, our best single generative SeqDialN achieves 62.54% NDCG and 48.63% MRR; our ensemble generative SeqDialN achieves 63.78% NDCG and 49.98% MRR, setting a new state of the art for generative visual dialog models. We fine-tune the discriminative SeqDialN with dense annotations and boost the performance up to 72.41% NDCG and 55.11% MRR. In this work, we discuss the extensive experiments we have conducted to demonstrate the effectiveness of our model components. We also provide visualizations of the reasoning process over the relevant conversation rounds and discuss our fine-tuning methods. Our code is available at this https URL

64. Point Cloud Completion by Learning Shape Priors [PDF] 返回目录
  Xiaogang Wang, Marcelo H Ang Jr, Gim Hee Lee
Abstract: In view of the difficulty of reconstructing object details in point cloud completion, we propose a shape prior learning method for object completion. The shape priors include geometric information from both the complete and the partial point clouds. We design a feature alignment strategy to learn the shape prior from complete points, and a coarse-to-fine strategy to incorporate the partial prior in the fine stage. To learn the prior over complete objects, we first train a point cloud auto-encoder to extract latent embeddings from complete points. Then we learn a mapping that transfers point features from partial points to those of complete points by optimizing feature alignment losses. The feature alignment losses consist of an L2 distance and an adversarial loss obtained via a Maximum Mean Discrepancy Generative Adversarial Network (MMD-GAN). The L2 distance optimizes the partial features towards the complete ones in the feature space, and the MMD-GAN decreases the statistical distance between the two point features in a Reproducing Kernel Hilbert Space. We achieve state-of-the-art performance on the point cloud completion task. Our code is available at this https URL.
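The described feature alignment loss, an L2 term plus a distribution-matching term, is easy to write down. The sketch below uses a fixed Gaussian-kernel MMD as a stand-in for the adversarially learned MMD-GAN critic, and the weight w_mmd is an assumed hyperparameter.

    import torch

    def gaussian_mmd(x, y, sigma=1.0):
        # Biased MMD estimate between feature sets x, y of shape (N, D).
        def k(a, b):
            return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
        return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    def alignment_loss(partial_feat, complete_feat, w_mmd=0.1):
        # L2 pulls each partial feature toward its complete counterpart;
        # the MMD term matches the two feature distributions as a whole.
        l2 = torch.nn.functional.mse_loss(partial_feat, complete_feat)
        return l2 + w_mmd * gaussian_mmd(partial_feat, complete_feat)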

65. Looking in the Right place for Anomalies: Explainable AI through Automatic Location Learning [PDF] 返回目录
  Satyananda Kashyap, Alexandros Karargyris, Joy Wu, Yaniv Gur, Arjun Sharma, Ken C. L. Wong, Mehdi Moradi, Tanveer Syeda-Mahmood
Abstract: Deep learning has now become the de facto approach to the recognition of anomalies in medical imaging. The 'black box' way such models classify medical images into anomaly labels poses problems for their acceptance, particularly with clinicians. Current explainable AI methods offer justifications through visualizations such as heat maps, but cannot guarantee that the network is focusing on a relevant image region fully containing the anomaly. In this paper, we develop an approach to explainable AI in which the anomaly is assured to overlap the expected location when present. This is made possible by automatically extracting location-specific labels from textual reports and learning the association of expected locations to labels using a hybrid combination of Bi-Directional Long Short-Term Memory Recurrent Neural Networks (Bi-LSTM) and DenseNet-121. Using this expected location to bias the subsequent attention-guided inference network, based on ResNet101, results in the isolation of the anomaly at the expected location when present. The method is evaluated on a large chest X-ray dataset.

66. Animating Through Warping: an Efficient Method for High-Quality Facial Expression Animation [PDF] 返回目录
  Zili Yi, Qiang Tang, Vishnu Sanjay Ramiya Srinivasan, Zhan Xu
Abstract: Advances in deep neural networks have considerably improved the art of animating a still image without operating in the 3D domain. However, prior methods can only animate small images (typically no larger than 512x512) due to memory limitations, difficulty of training, and the lack of high-resolution (HD) training datasets, which significantly reduces their potential for applications in movie production and interactive systems. Motivated by the idea that HD images can be generated by adding high-frequency residuals to low-resolution results produced by a neural network, we propose a novel framework known as Animating Through Warping (ATW) to enable efficient animation of HD images. Specifically, the proposed framework consists of two modules: a novel two-stage neural-network generator and a novel post-processing module, ResWarp. The framework only requires the generator to be trained on small images and can do inference on an image of any size. During inference, an HD input image is decomposed into a low-resolution component (128x128) and its corresponding high-frequency residuals. The generator predicts the low-resolution result as well as the motion field that warps the input face to the desired status (e.g., expression categories or action units). Finally, the ResWarp module warps the residuals based on the motion field and adds the warped residuals to the naively up-sampled low-resolution results to generate the final HD output. Experiments show the effectiveness and efficiency of our method in generating high-resolution animations. Our proposed framework successfully animates a 4K facial image, which had never been achieved by prior neural models. In addition, our method generally guarantees the temporal coherency of the generated animations. Source codes will be made publicly available.

67. Self-supervised Visual Attribute Learning for Fashion Compatibility [PDF] 返回目录
  Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan A Plummer
Abstract: Many self-supervised learning (SSL) methods have been successful in learning semantically meaningful visual representations by solving pretext tasks. However, state-of-the-art SSL methods focus on object recognition or detection tasks, which aim to learn object shapes but ignore visual attributes such as color and texture through color distortion augmentation. Yet learning these visual attributes could be more important than learning object shapes for other vision tasks, such as fashion compatibility. To address this deficiency, we propose Self-supervised Tasks for Outfit Compatibility (STOC), which requires no supervision. Specifically, STOC aims to learn the colors and textures of fashion items and to embed similar items nearby. STOC outperforms state-of-the-art SSL by 9.5% and a supervised Siamese Network by 3% on a fill-in-the-blank outfit completion task on our unsupervised benchmark.

68. Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning [PDF] 返回目录
  Wentao Bao, Qi Yu, Yu Kong
Abstract: Traffic accident anticipation aims to predict accidents from dashcam videos as early as possible, which is critical to safety-guaranteed self-driving systems. With cluttered traffic scenes and limited visual cues, it is highly challenging to predict, from early observed frames, how soon an accident will occur. Most existing approaches are developed to learn features of accident-relevant agents for accident anticipation, while ignoring the features of their spatial and temporal relations. Besides, current deterministic deep neural networks can be overconfident in false predictions, leading to a high risk of traffic accidents caused by self-driving systems. In this paper, we propose an uncertainty-based accident anticipation model with spatio-temporal relational learning. It sequentially predicts the probability of traffic accident occurrence from dashcam videos. Specifically, we propose to take advantage of graph convolution and recurrent networks for relational feature learning, and leverage Bayesian neural networks to address the intrinsic variability of latent relational representations. The derived uncertainty-based ranking loss is found to significantly boost model performance by improving the quality of relational features. In addition, we collect a new Car Crash Dataset (CCD) for traffic accident anticipation, which contains environmental attributes and accident reason annotations. Experimental results on both public and the newly compiled datasets show state-of-the-art performance of our model. Our code and the CCD dataset are available at this https URL.

69. PERCH 2.0: Fast and Accurate GPU-based Perception via Search for Object Pose Estimation [PDF] 返回目录
  Aditya Agarwal, Yupeng Han, Maxim Likhachev
Abstract: Pose estimation of known objects is fundamental to tasks such as robotic grasping and manipulation. The need for reliable grasping imposes stringent accuracy requirements on pose estimation in cluttered, occluded scenes in dynamic environments. Modern methods employ large sets of training data to learn features in order to find correspondences between 3D models and observed data. However, these methods require extensive annotation of ground truth poses. An alternative is to use algorithms that search for the best explanation of the observed scene in a space of possible rendered scenes. A recently developed algorithm, PERCH (PErception Via SeaRCH), does so by using depth data to converge to a globally optimal solution via a search over a specially constructed tree. While PERCH offers strong guarantees on accuracy, the current formulation suffers from low scalability owing to its high runtime. In addition, the sole reliance on depth data for pose estimation restricts the algorithm to scenes where no two objects have the same shape. In this work, we propose PERCH 2.0, a novel perception-via-search strategy that takes advantage of GPU acceleration and RGB data. We show that our approach can achieve a speedup of 100x over PERCH, as well as better accuracy than state-of-the-art data-driven approaches on 6-DoF pose estimation, without the need for annotating ground truth poses in the training data. Our code and video are available at this https URL.

70. Improving Skeleton-based Action Recognition with Robust Spatial and Temporal Features [PDF] 返回目录
  Zeshi Yang, Kangkang Yin
Abstract: Recently, skeleton-based action recognition has made significant progress in the computer vision community. Most state-of-the-art algorithms are based on Graph Convolutional Networks (GCN), and target improving the network structure of the backbone GCN layers. In this paper, we propose a novel mechanism to learn more robust discriminative features in space and time. More specifically, we add a Discriminative Feature Learning (DFL) branch to the last layers of the network to extract discriminative spatial and temporal features to help regularize the learning. We also formally advocate the use of Direction-Invariant Features (DIF) as input to the neural networks. We show that action recognition accuracy can be improved when these robust features are learned and used. We compare our results with those of ST-GCN and related methods on four datasets: NTU-RGBD60, NTU-RGBD120, SYSU 3DHOI and Skeleton-Kinetics.

71. Self-supervised Learning of Point Clouds via Orientation Estimation [PDF] 返回目录
  Omid Poursaeed, Tianxing Jiang, Quintessa Qiao, Nayun Xu, Vladimir G. Kim
Abstract: Point clouds provide a compact and efficient representation of 3D shapes. While deep neural networks have achieved impressive results on point cloud learning tasks, they require massive amounts of manually labeled data, which can be costly and time-consuming to collect. In this paper, we leverage 3D self-supervision for learning downstream tasks on point clouds with fewer labels. A point cloud can be rotated in infinitely many ways, which provides a rich label-free source for self-supervision. We consider the auxiliary task of predicting rotations that in turn leads to useful features for other tasks such as shape classification and 3D keypoint prediction. Using experiments on ShapeNet and ModelNet, we demonstrate that our approach outperforms the state-of-the-art. Moreover, features learned by our model are complementary to other self-supervised methods and combining them leads to further performance improvement.
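A compact way to read the pretext task: rotate each cloud by a random angle, and train the network to predict which rotation was applied. The sketch below discretizes rotations about the up axis into n_bins classes; both the binning and the single-axis choice are simplifying assumptions (the paper studies richer rotation spaces).

    import math
    import torch

    def make_rotation_batch(points, n_bins=18):
        # points: (B, P, 3). Returns rotated clouds and rotation labels.
        labels = torch.randint(0, n_bins, (points.size(0),))
        angles = labels.float() * (2 * math.pi / n_bins)
        c, s = torch.cos(angles), torch.sin(angles)
        zero, one = torch.zeros_like(c), torch.ones_like(c)
        R = torch.stack([c, -s, zero,
                         s,  c, zero,
                         zero, zero, one], dim=1).view(-1, 3, 3)
        # Train a classifier on `labels` as the self-supervised objective.
        return points @ R.transpose(1, 2), labels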

72. Accurate and Efficient Intracranial Hemorrhage Detection and Subtype Classification in 3D CT Scans with Convolutional and Long Short-Term Memory Neural Networks [PDF] 返回目录
  Mihail Burduja, Radu Tudor Ionescu, Nicolae Verga
Abstract: In this paper, we present our system for the RSNA Intracranial Hemorrhage Detection challenge. The proposed system is based on a lightweight deep neural network architecture composed of a convolutional neural network (CNN) that takes as input individual CT slices, and a Long Short-Term Memory (LSTM) network that takes as input the feature embeddings provided by the CNN. For efficient processing, we consider various feature selection methods to produce a subset of useful CNN features for the LSTM. Furthermore, we downsample the CT slices by a factor of 2, allowing us to train the model faster. Even though our model is designed to balance speed and accuracy, we report a weighted mean log loss of 0.04989 on the final test set, which placed us in the top 30 (top 2%) of a total of 1345 participants. Although our computing infrastructure does not allow it, processing CT slices at their original scale is likely to improve performance. In order to enable others to reproduce our results, we provide our code as open source at this https URL. After the challenge, we conducted a subjective intracranial hemorrhage detection assessment by radiologists, indicating that the performance of our deep model is on par with that of doctors specialized in reading CT scans. Another contribution of our work is the integration of Grad-CAM visualizations into our system, providing useful explanations for its predictions. We therefore consider our system a viable option when a fast diagnosis or a second opinion on intracranial hemorrhage detection is needed.
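The two-stage design (a CNN embeds each slice, an LSTM reads the slice sequence) fits in a short module. The backbone, feature size, and hidden size below are placeholders; the RSNA task's six per-slice labels (five subtypes plus 'any') fix the output width.

    import torch.nn as nn

    class SliceSequenceModel(nn.Module):
        def __init__(self, cnn, feat_dim=256, hidden=128, n_labels=6):
            super().__init__()
            self.cnn = cnn                  # any slice-level feature extractor
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_labels)

        def forward(self, slices):          # (B, T, C, H, W) CT volumes
            b, t = slices.shape[:2]
            feats = self.cnn(slices.flatten(0, 1)).view(b, t, -1)
            out, _ = self.lstm(feats)       # context across neighboring slices
            return self.head(out)           # per-slice hemorrhage logits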

73. Eigen-CAM: Class Activation Map using Principal Components [PDF] 返回目录
  Mohammed Bany Muhammad, Mohammed Yeasin
Abstract: Deep neural networks are ubiquitous due to the ease of developing models and their influence on other domains. At the heart of this progress are convolutional neural networks (CNNs), which are capable of learning representations or features given a set of data. Making sense of such complex models (i.e., millions of parameters and hundreds of layers) remains challenging for developers as well as end-users. This is partially due to the lack of tools or interfaces capable of providing interpretability and transparency. A growing body of literature, for example on class activation maps (CAM), focuses on making sense of what a model learns from the data or why it behaves poorly in a given task. This paper builds on previous ideas to cope with the increasing demand for interpretable, robust, and transparent models. Our approach provides a simpler and more intuitive (or familiar) way of generating CAMs. The proposed Eigen-CAM computes and visualizes the principal components of the learned features/representations from the convolutional layers. Empirical studies were performed to compare Eigen-CAM with state-of-the-art methods (such as Grad-CAM, Grad-CAM++, CNN-fixations) by evaluating on benchmark datasets for tasks such as weakly-supervised localization and localizing objects in the presence of adversarial noise. Eigen-CAM was found to be robust against classification errors made by the fully connected layers in CNNs, and does not rely on the backpropagation of gradients, class relevance scores, maximum activation locations, or any other form of feature weighting. In addition, it works with all CNN models without the need to modify layers or retrain models. Empirical results show up to a 12% improvement over the best of the compared methods on weakly supervised object localization.
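Because Eigen-CAM needs neither gradients nor class scores, it reduces to one SVD over the chosen convolutional layer's output. A minimal NumPy sketch, assuming a (H, W, C) activation tensor; the final ReLU and normalization are presentation choices.

    import numpy as np

    def eigen_cam(activations):
        # activations: (H, W, C) output of a convolutional layer.
        h, w, c = activations.shape
        flat = activations.reshape(h * w, c)
        # First right singular vector = dominant principal direction
        # of the learned representation over channels.
        _, _, vt = np.linalg.svd(flat, full_matrices=False)
        cam = (flat @ vt[0]).reshape(h, w)     # project activations onto it
        cam = np.maximum(cam, 0)               # keep positive evidence
        return cam / (cam.max() + 1e-8)        # normalize for visualization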

74. From Shadow Segmentation to Shadow Removal [PDF] 返回目录
  Hieu Le, Dimitris Samaras
Abstract: The requirement for paired shadow and shadow-free images limits the size and diversity of shadow removal datasets and hinders the possibility of training large-scale, robust shadow removal algorithms. We propose a shadow removal method that can be trained using only shadow and non-shadow patches cropped from the shadow images themselves. Our method is trained via an adversarial framework, following a physical model of shadow formation. Our central contribution is a set of physics-based constraints that enables this adversarial training. Our method achieves competitive shadow removal results compared to state-of-the-art methods that are trained with fully paired shadow and shadow-free images. The advantages of our training regime are even more pronounced in shadow removal for videos. Our method can be fine-tuned on a testing video with only the shadow masks generated by a pre-trained shadow detector and outperforms state-of-the-art methods on this challenging test. We illustrate the advantages of our method on our proposed video shadow removal dataset.

75. Distilling Visual Priors from Self-Supervised Learning [PDF] 返回目录
  Bingchen Zhao, Xin Wen
Abstract: Convolutional Neural Networks (CNNs) are prone to overfit small training datasets. We present a novel two-phase pipeline that leverages self-supervised learning and knowledge distillation to improve the generalization ability of CNN models for image classification under the data-deficient setting. The first phase is to learn a teacher model which possesses rich and generalizable visual representations via self-supervised learning, and the second phase is to distill the representations into a student model in a self-distillation manner, and meanwhile fine-tune the student model for the image classification task. We also propose a novel margin loss for the self-supervised contrastive learning proxy task to better learn the representation under the data-deficient scenario. Together with other tricks, we achieve competitive performance in the VIPriors image classification challenge.

76. Meta-DRN: Meta-Learning for 1-Shot Image Segmentation [PDF] 返回目录
  Atmadeep Banerjee
Abstract: Modern deep learning models have revolutionized the field of computer vision. But, a significant drawback of most of these models is that they require a large number of labelled examples to generalize properly. Recent developments in few-shot learning aim to alleviate this requirement. In this paper, we propose a novel lightweight CNN architecture for 1-shot image segmentation. The proposed model is created by taking inspiration from well-performing architectures for semantic segmentation and adapting it to the 1-shot domain. We train our model using 4 meta-learning algorithms that have worked well for image classification and compare the results. For the chosen dataset, our proposed model has a 70% lower parameter count than the benchmark, while having better or comparable mean IoU scores using all 4 of the meta-learning algorithms.

77. An Explainable Machine Learning Model for Early Detection of Parkinson's Disease using LIME on DaTscan Imagery [PDF] 返回目录
  Pavan Rajkumar Magesh, Richard Delwin Myloth, Rijo Jackson Tom
Abstract: Parkinson's disease (PD) is a degenerative and progressive neurological condition. Early diagnosis can improve treatment for patients and is performed through dopaminergic imaging techniques like the SPECT DaTscan. In this study, we propose a machine learning model that accurately classifies any given DaTscan as having Parkinson's disease or not, in addition to providing a plausible reason for the prediction. This kind of reasoning is done through the use of visual indicators generated using Local Interpretable Model-Agnostic Explainer (LIME) methods. DaTscans were drawn from the Parkinson's Progression Markers Initiative database and trained on a CNN (VGG16) using transfer learning, yielding an accuracy of 95.2%, a sensitivity of 97.5%, and a specificity of 90.9%. Keeping model interpretability of paramount importance, especially in the healthcare field, this study utilises LIME explanations to distinguish PD from non-PD, using visual superpixels on the DaTscans. It can be concluded that the proposed system, together with its measured interpretability and accuracy, may effectively aid medical workers in the early diagnosis of Parkinson's Disease.

78. RGB-D Salient Object Detection: A Survey [PDF] 返回目录
  Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng, Jianbing Shen, Ling Shao
Abstract: Salient object detection (SOD), which simulates the human visual perception system to locate the most attractive object(s) in a scene, has been widely applied to various computer vision tasks. Now, with the advent of depth sensors, depth maps with rich spatial information, which can be beneficial in boosting the performance of SOD, can easily be captured. Although various RGB-D based SOD models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this topic remains lacking. In this paper, we provide a comprehensive survey of RGB-D based SOD models from various perspectives, and review the related benchmark datasets in detail. Further, considering that the light field can also provide depth maps, we review SOD models and popular benchmark datasets from this domain as well. Moreover, to investigate the SOD ability of existing models, we carry out a comprehensive evaluation, as well as an attribute-based evaluation of several representative RGB-D based SOD models. Finally, we discuss several challenges and open directions of RGB-D based SOD for future research. All collected models, benchmark datasets, source code links, datasets constructed for attribute-based evaluation, and codes for evaluation will be made publicly available at this https URL

79. Regularization by Denoising via Fixed-Point Projection (RED-PRO) [PDF] 返回目录
  Regev Cohen, Michael Elad, Peyman Milanfar
Abstract: Inverse problems in image processing are typically cast as optimization tasks, consisting of data fidelity and stabilizing regularization terms. A recent regularization strategy of great interest utilizes the power of denoising engines. Two such methods are the Plug-and-Play Prior (PnP) and Regularization by Denoising (RED). While both have shown state-of-the-art results in various recovery tasks, their theoretical justification is incomplete. In this paper, we aim to enrich the understanding of RED and its connection to PnP. Towards that end, we reformulate RED as a convex optimization problem utilizing a projection (RED-PRO) onto the fixed-point set of demicontractive denoisers. We offer a simple iterative solution to this problem, and establish a novel unification of RED-PRO and PnP, while providing guarantees for their convergence to the globally optimal solution. We also present several relaxations of RED-PRO that allow for handling denoisers with limited fixed-point sets. Finally, we demonstrate RED-PRO for the tasks of image deblurring and super-resolution, showing improved results with respect to the original RED framework.
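One concrete instance of minimizing a data-fidelity term over the fixed-point set of a denoiser is a hybrid steepest-descent iteration: repeatedly apply the denoiser, then take a diminishing gradient step on the fidelity. The sketch below assumes a linear measurement model 0.5*||Ax - y||^2 and an illustrative step-size schedule; it is not the paper's exact scheme.

    import numpy as np

    def red_pro_sketch(y, A, denoiser, steps=200, mu0=1.0):
        x = A.T @ y                           # simple initialization
        for k in range(steps):
            z = denoiser(x)                   # pull x toward Fix(D)
            grad = A.T @ (A @ z - y)          # fidelity gradient at z
            x = z - (mu0 / (k + 1)) * grad    # diminishing step size
        return x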

80. Unsupervised Deep Cross-modality Spectral Hashing [PDF] 返回目录
  Tuan Hoang, Thanh-Toan Do, Tam V. Nguyen, Ngai-Man Cheung
Abstract: This paper presents a novel framework, namely Deep Cross-modality Spectral Hashing (DCSH), to tackle the unsupervised learning problem of binary hash codes for efficient cross-modal retrieval. The framework is a two-step hashing approach which decouples the optimization into (1) binary optimization and (2) hashing function learning. In the first step, we propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations. While the former is capable of well preserving the local structure of each modality, the latter reveals the hidden patterns from all modalities. In the second step, to learn mapping functions from informative data inputs (images and word embeddings) to binary codes obtained from the first step, we leverage the powerful CNN for images and propose a CNN-based deep architecture to learn text modality. Quantitative evaluations on three standard benchmark datasets demonstrate that the proposed DCSH method consistently outperforms other state-of-the-art methods.

81. Efficient Adversarial Attacks for Visual Object Tracking [PDF] 返回目录
  Siyuan Liang, Xingxing Wei, Siyuan Yao, Xiaochun Cao
Abstract: Visual object tracking is an important task that requires the tracker to find objects quickly and accurately. The existing state-of-the-art object trackers, i.e., Siamese-based trackers, use DNNs to attain high accuracy. However, the robustness of visual tracking models is seldom explored. In this paper, we analyze the weakness of object trackers based on the Siamese network and then extend adversarial examples to visual object tracking. We present an end-to-end network, FAN (Fast Attack Network), which uses a novel drift loss combined with the embedded feature loss to attack Siamese network based trackers. On a single GPU, FAN is efficient in training speed and has a strong attack performance. FAN can generate an adversarial example in 10 ms, achieving effective targeted attacks (at least a 40% drop rate on OTB) and untargeted attacks (at least a 70% drop rate on OTB).

82. HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation [PDF] 返回目录
  Jiefeng Li, Can Wang, Wentao Liu, Chen Qian, Cewu Lu
Abstract: Remarkable progress has been made in 3D human pose estimation from a monocular RGB camera. However, only a few studies have explored 3D multi-person cases. In this paper, we attempt to address the lack of a global perspective in top-down approaches by introducing a novel form of supervision, Hierarchical Multi-person Ordinal Relations (HMOR). HMOR encodes interaction information as ordinal relations of depths and angles hierarchically, which captures body-part and joint-level semantics while maintaining global consistency. In our approach, an integrated top-down model is designed to leverage these ordinal relations in the learning process. The integrated model estimates human bounding boxes, human depths, and root-relative 3D poses simultaneously, with a coarse-to-fine architecture to improve the accuracy of depth estimation. The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets. In addition to its superior performance, our method has lower computational complexity and fewer model parameters.
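Supervising ordinal depth relations amounts to a pairwise ranking penalty: whenever the predicted depth gap between two people disagrees in sign with the ground-truth gap, a hinge loss fires. The sketch shows one level of the hierarchy (person depths); the margin and the extension to part and joint levels are assumptions.

    import torch

    def ordinal_depth_loss(pred_depths, gt_depths, margin=0.0):
        # pred_depths, gt_depths: (N,) root depths of N detected people.
        d = pred_depths.unsqueeze(0) - pred_depths.unsqueeze(1)  # predicted gaps
        g = gt_depths.unsqueeze(0) - gt_depths.unsqueeze(1)      # true gaps
        sign = torch.sign(g)                    # desired ordering per pair
        hinge = torch.relu(margin - sign * d)   # fires on wrong orderings
        return hinge[sign != 0].mean()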

83. PanoNet: Real-time Panoptic Segmentation through Position-Sensitive Feature Embedding [PDF] 返回目录
  Xia Chen, Jianren Wang, Martial Hebert
Abstract: We propose a simple, fast, and flexible framework to generate semantic and instance masks simultaneously for panoptic segmentation. Our method, called PanoNet, incorporates a clean and natural structural design that tackles the problem purely as a segmentation task, without the time-consuming detection process. We also introduce position-sensitive embedding for instance grouping, accounting for both an object's appearance and its spatial location. Overall, PanoNet yields high panoptic-quality results on high-resolution Cityscapes images in real time, significantly faster than all other methods with comparable performance. Our approach satisfies the practical speed and memory requirements of many applications such as autonomous driving and augmented reality.

84. Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition [PDF] 返回目录
  Haocong Rao, Shihao Xu, Xiping Hu, Jun Cheng, Bin Hu
Abstract: Action recognition via 3D skeleton data is an emerging important topic in recent years. Most existing methods either extract hand-crafted descriptors or learn action representations via supervised learning paradigms that require massive labeled data. In this paper, we for the first time propose a contrastive action learning paradigm named AS-CAL that can leverage different augmentations of unlabeled skeleton data to learn action representations in an unsupervised manner. Specifically, we first propose to contrast the similarity between augmented instances (query and key) of the input skeleton sequence, which are transformed by multiple novel augmentation strategies, to learn the inherent action patterns ("pattern-invariance") of different skeleton transformations. Second, to encourage learning this pattern-invariance with more consistent action representations, we propose a momentum LSTM, implemented as the momentum-based moving average of the LSTM-based query encoder, to encode the long-term action dynamics of the key sequence. Third, we introduce a queue to store the encoded keys, which allows our model to flexibly reuse preceding keys and build a more consistent dictionary to improve contrastive learning. Last, by temporally averaging the hidden states of the action learned by the query encoder, a novel representation named Contrastive Action Encoding (CAE) is proposed to represent human actions effectively. Extensive experiments show that our approach typically improves existing hand-crafted methods by 10-50% in top-1 accuracy, and it can achieve comparable or even superior performance to numerous supervised learning methods.
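The momentum encoder and key queue follow the familiar momentum-contrast recipe, applied here with LSTM encoders over augmented skeleton sequences. The sketch below shows the two generic pieces the abstract names, the momentum update and the queue-based InfoNCE loss; the encoder architectures and temperature are assumptions.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def momentum_update(query_enc, key_enc, m=0.999):
        # Key encoder = momentum-based moving average of the query encoder.
        for q, k in zip(query_enc.parameters(), key_enc.parameters()):
            k.data.mul_(m).add_(q.data, alpha=1.0 - m)

    def infonce_loss(q, k, queue, tau=0.07):
        # q, k: (N, D) query/key embeddings; queue: (K, D) stored keys.
        q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
        pos = (q * k).sum(dim=1, keepdim=True)           # positives (N, 1)
        neg = q @ queue.t()                              # negatives (N, K)
        logits = torch.cat([pos, neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long,
                             device=logits.device)       # positive at index 0
        return F.cross_entropy(logits, labels)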

85. Contrastive Explanations in Neural Networks [PDF] 返回目录
  Mohit Prabhushankar, Gukyeong Kwon, Dogancan Temel, Ghassan AlRegib
Abstract: Visual explanations are logical arguments based on visual features that justify the predictions made by neural networks. Current modes of visual explanations answer questions of the form "Why P?". These "Why" questions operate under broad contexts, thereby providing answers that are irrelevant in some cases. We propose to constrain these "Why" questions based on some context Q, so that our explanations answer contrastive questions of the form "Why P, rather than Q?". In this paper, we formalize the structure of contrastive visual explanations for neural networks. We define contrast based on neural networks and propose a methodology to extract defined contrasts. We then use the extracted contrasts as a plug-in on top of existing "Why P?" techniques, specifically Grad-CAM. We demonstrate their value in analyzing both networks and data in applications of large-scale recognition, fine-grained recognition, subsurface seismic analysis, and image quality assessment.
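Concretely, a contrastive variant of Grad-CAM backpropagates the score difference between the predicted class P and a contrast class Q instead of the class score alone. A hedged PyTorch sketch, assuming access to the final conv features and the logits:

    import torch

    def contrast_cam(features, logits, p, q):
        # features: (B, C, H, W) conv activations with requires_grad=True;
        # attributes the score gap logits[p] - logits[q], i.e. answers
        # "Why p rather than q?".
        score = logits[:, p] - logits[:, q]
        grads, = torch.autograd.grad(score.sum(), features)
        weights = grads.mean(dim=(2, 3), keepdim=True)     # GAP of gradients
        cam = torch.relu((weights * features).sum(dim=1))  # Grad-CAM style map
        return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)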

86. State-of-The-Art Fuzzy Active Contour Models for Image Segmentation [PDF] 返回目录
  Ajoy Mondal, Kuntal Ghosh
Abstract: Image segmentation is the initial step for every image analysis task. A large variety of segmentation algorithms has been proposed in the literature over several decades, with mixed success. Among them, fuzzy energy based active contour models have attracted researchers' attention during the last decade, resulting in the development of various methods. A good segmentation algorithm should perform well on a large number of images containing noise, blur, low contrast, region inhomogeneity, etc. However, the performance of most existing fuzzy energy based active contour models has typically been evaluated on only a limited number of images. In this article, our aim is to review the existing fuzzy active contour models from a theoretical point of view and also to evaluate them experimentally on a large set of images under various conditions. The analysis over a large variety of images provides objective insight into the strengths and weaknesses of various fuzzy active contour models. Finally, we discuss several issues and future research directions on this particular topic.

87. Land Cover Classification from Remote Sensing Images Based on Multi-Scale Fully Convolutional Network [PDF] 返回目录
  Rui Li, Shunyi Zheng, Chenxi Duan
Abstract: In this paper, a Multi-Scale Fully Convolutional Network (MSFCN) with multi-scale convolutional kernels is proposed to exploit discriminative representations from two-dimensional (2D) satellite images.

88. TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video [PDF] 返回目录
  Tiancheng Zhi, Christoph Lassner, Tony Tung, Carsten Stoll, Srinivasa G. Narasimhan, Minh Vo
Abstract: We present TexMesh, a novel approach to reconstruct detailed human meshes with high-resolution full-body texture from RGB-D video. TexMesh enables high quality free-viewpoint rendering of humans. Given the RGB frames, the captured environment map, and the coarse per-frame human mesh from RGB-D tracking, our method reconstructs spatiotemporally consistent and detailed per-frame meshes along with a high-resolution albedo texture. By using the incident illumination we are able to accurately estimate local surface geometry and albedo, which allows us to further use photometric constraints to adapt a synthetically trained model to real-world sequences in a self-supervised manner for detailed surface geometry and high-resolution texture estimation. In practice, we train our models on a short example sequence for self-adaptation, and the model then runs at interactive framerate. We validate TexMesh on synthetic and real-world data, and show it outperforms the state of the art quantitatively and qualitatively.

89. L-CNN: A Lattice cross-fusion strategy for multistream convolutional neural networks [PDF] 返回目录
  Ana Paula G. S. de Almeida, Flavio de Barros Vidal
Abstract: This paper proposes a fusion strategy for multistream convolutional networks, the Lattice Cross Fusion. This approach crosses signals from convolution layers, performing mathematical operation-based fusions right before pooling layers. Results on a purposely degraded CIFAR-10, a popular image classification data set, with a modified AlexNet-LCNN version, show that this novel method outperforms the baseline single-stream network by 46%, with faster convergence, stability, and robustness.
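The abstract specifies only that signals are crossed between streams with mathematical operations right before pooling. The sketch below realizes one plausible instance (sum and product crossings); the particular operations are an assumption, not necessarily the paper's.

    import torch.nn as nn

    class LatticeCrossFusion(nn.Module):
        def __init__(self):
            super().__init__()
            self.pool = nn.MaxPool2d(2)

        def forward(self, a, b):           # same-shape maps from two streams
            fused_a = self.pool(a + b)     # stream A receives B's signal
            fused_b = self.pool(a * b)     # stream B receives a product term
            return fused_a, fused_b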

90. Actor-Action Video Classification CSC 249/449 Spring 2020 Challenge Report [PDF] 返回目录
  Jing Shi, Zhiheng Li, Haitian Zheng, Yihang Xu, Tianyou Xiao, Weitao Tan, Xiaoning Guo, Sizhe Li, Bin Yang, Zhexin Xu, Ruitao Lin, Zhongkai Shangguan, Yue Zhao, Jingwen Wang, Rohan Sharma, Surya Iyer, Ajinkya Deshmukh, Raunak Mahalik, Srishti Singh, Jayant G Rohra, Yipeng Zhang, Tongyu Yang, Xuan Wen, Ethan Fahnestock, Bryce Ikeda, Ian Lawson, Alan Finkelstein, Kehao Guo, Richard Magnotti, Andrew Sexton, Jeet Ketan Thaker, Oscar Su, Chenliang Xu
Abstract: This technical report summarizes and compiles the submissions to the Actor-Action video classification challenge, held as a final project in the CSC 249/449 Machine Vision course (Spring 2020) at the University of Rochester.

91. White-Box Evaluation of Fingerprint Recognition Systems [PDF] 返回目录
  Steven A. Grosz, Joshua J. Engelsma, Anil K. Jain
Abstract: Typical evaluations of fingerprint recognition systems consist of end-to-end black-box evaluations, which assess performance in terms of overall identification or authentication accuracy. However, these black-box tests of system performance do not reveal insights into the performance of the individual modules, including image acquisition, feature extraction, and matching. On the other hand, white-box evaluations, the topic of this paper, measure the individual performance of each constituent module in isolation. While a few studies have conducted white-box evaluations of the fingerprint reader, feature extractor, and matching components, no existing study has provided a full-system, white-box analysis of the uncertainty introduced at each stage of a fingerprint recognition system. In this work, we extend previous white-box evaluations of fingerprint recognition system components and provide a unified, in-depth analysis of fingerprint recognition system performance based on the aggregated white-box evaluation results. In particular, we analyze the uncertainty introduced at each stage of the fingerprint recognition system due to adverse capture conditions (i.e., varying illumination, moisture, and pressure) at the time of acquisition. Our experiments show that a system that performs better overall, in terms of black-box recognition performance, does not necessarily perform best at each module in the fingerprint recognition system pipeline, which can only be seen with a white-box analysis of each sub-module. Findings such as these enable researchers to better focus their efforts in improving fingerprint recognition systems.

92. Utilising Visual Attention Cues for Vehicle Detection and Tracking [PDF] 返回目录
  Feiyan Hu, Venkatesh G M, Noel E. O'Connor, Alan F. Smeaton, Suzanne Little
Abstract: Advanced Driver-Assistance Systems (ADAS) have been attracting attention from many researchers. Vision-based sensors are the closest way to emulate human driver visual behavior while driving. In this paper, we explore possible ways to use visual attention (saliency) for object detection and tracking. We investigate: 1) how a visual attention map, such as a subjectness attention or saliency map, and an objectness attention map can facilitate region proposal generation in a two-stage object detector; 2) how a visual attention map can be used for tracking multiple objects. We propose a neural network that can simultaneously detect objects and generate objectness and subjectness maps to save computational power. We further exploit the visual attention map during tracking using a sequential Monte Carlo probability hypothesis density (PHD) filter. The experiments are conducted on the KITTI and DETRAC datasets. The use of visual attention and hierarchical features has shown a considerable improvement of approximately 8% in object detection, which effectively increased tracking performance by approximately 4% on the KITTI dataset.

93. KAPLAN: A 3D Point Descriptor for Shape Completion [PDF] 返回目录
  Audrey Richard, Ian Cherabier, Martin R. Oswald, Marc Pollefeys, Konrad Schindler
Abstract: We present a novel 3D shape completion method that operates directly on unstructured point clouds, thus avoiding resource-intensive data structures like voxel grids. To this end, we introduce KAPLAN, a 3D point descriptor that aggregates local shape information via a series of 2D convolutions. The key idea is to project the points in a local neighborhood onto multiple planes with different orientations. In each of those planes, point properties like normals or point-to-plane distances are aggregated into a 2D grid and abstracted into a feature representation with an efficient 2D convolutional encoder. Since all planes are encoded jointly, the resulting representation nevertheless can capture their correlations and retains knowledge about the underlying 3D shape, without expensive 3D convolutions. Experiments on public datasets show that KAPLAN achieves state-of-the-art performance for 3D shape completion.
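The descriptor's core step, projecting a local neighborhood onto several oriented planes and rasterizing a point property into a 2D grid per plane, can be sketched directly. Here the aggregated property is a simple occupancy count; the grid size, radius, and plane parameterization are illustrative assumptions.

    import numpy as np

    def kaplan_grids(neighbors, planes, grid=16, radius=1.0):
        # neighbors: (N, 3) points centered on the query point;
        # planes: list of (u, v) orthonormal in-plane axis pairs.
        out = []
        for u, v in planes:
            uv = np.stack([neighbors @ u, neighbors @ v], axis=1)  # 2D coords
            ij = np.clip(((uv / radius + 1) / 2 * grid).astype(int),
                         0, grid - 1)
            img = np.zeros((grid, grid))
            np.add.at(img, (ij[:, 0], ij[:, 1]), 1.0)   # occupancy per cell
            out.append(img)
        return np.stack(out)    # the stack feeds a 2D convolutional encoder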

94. Deep Depth Estimation from Visual-Inertial SLAM [PDF] 返回目录
  Kourosh Sartipi, Tien Do, Tong Ke, Khiem Vuong, Stergios I. Roumeliotis
Abstract: This paper addresses the problem of learning to complete a scene's depth from sparse depth points and images of indoor scenes. Specifically, we study the case in which the sparse depth is computed from a visual-inertial simultaneous localization and mapping (VI-SLAM) system. The resulting point cloud has low density, is noisy, and has a non-uniform spatial distribution, as compared to the input from active depth sensors, e.g., LiDAR or Kinect. Since the VI-SLAM produces point clouds only over textured areas, we compensate for the missing depth of the low-texture surfaces by leveraging their planar structures and their surface normals, which are an important intermediate representation. The pre-trained surface normal network, however, suffers from large performance degradation when there is a significant difference in the viewing direction (especially the roll angle) of the test image as compared to the training images. To address this limitation, we use the available gravity estimate from the VI-SLAM to warp the input image to the orientation prevailing in the training dataset. This results in a significant performance gain for the surface normal estimate, and thus for the dense depth estimates. Finally, we show that our method outperforms other state-of-the-art approaches on both training (ScanNet and NYUv2) and testing (collected with Azure Kinect) datasets.

95. Learning to Rank for Active Learning: A Listwise Approach [PDF] 返回目录
  Minghan Li, Xialei Liu, Joost van de Weijer, Bogdan Raducanu
Abstract: Active learning emerged as an alternative to alleviate the effort of labeling huge amounts of data for data-hungry applications (such as image/video indexing and retrieval, autonomous driving, etc.). The goal of active learning is to automatically select a number of unlabeled samples for annotation (according to a budget), based on an acquisition function, which indicates how valuable a sample is for training the model. The learning loss method is a task-agnostic approach which attaches a module that learns to predict the target loss of unlabeled data, and selects data with the highest loss for labeling. In this work, we follow this strategy, but we define the acquisition function as a learning-to-rank problem and rethink the structure of the loss prediction module, using a simple but effective listwise approach. Experimental results on four datasets demonstrate that our method outperforms recent state-of-the-art active learning approaches for both image classification and regression tasks.
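A listwise formulation can be as simple as a ListNet-style cross-entropy between the permutation distribution implied by the predicted acquisition scores and the one implied by the actual target losses. The sketch below is that generic objective, not necessarily the authors' exact loss.

    import torch

    def listwise_rank_loss(pred_scores, true_losses):
        # pred_scores, true_losses: (N,) over a mini-batch of samples.
        p_true = torch.softmax(true_losses, dim=0)      # target top-1 distribution
        log_p_pred = torch.log_softmax(pred_scores, dim=0)
        return -(p_true * log_p_pred).sum()             # cross-entropy over ranks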

96. Dynamic Object Tracking and Masking for Visual SLAM [PDF] 返回目录
  Jonathan Vincent, Mathieu Labbé, Jean-Samuel Lauzon, François Grondin, Pier-Marc Comtois-Rivet, François Michaud
Abstract: In dynamic environments, the performance of visual SLAM techniques can be impaired by visual features taken from moving objects. One solution is to identify those objects so that their visual features can be removed for localization and mapping. This paper presents a simple and fast pipeline that uses deep neural networks, extended Kalman filters, and visual SLAM to improve both localization and mapping in dynamic environments (around 14 fps on a GTX 1080). Results on the dynamic sequences from the TUM dataset, using RTAB-Map as the visual SLAM system, suggest that the approach achieves localization performance similar to other state-of-the-art methods, while also providing the positions of the tracked dynamic objects, a 3D map free of those objects, and better loop closure detection, with the whole pipeline able to run on a robot moving at moderate speed.

97. Towards Leveraging End-of-Life Tools as an Asset: Value Co-Creation based on Deep Learning in the Machining Industry [PDF]
  Jannis Walk, Niklas Kühl, Jonathan Schäfer
Abstract: Sustainability is the key concept in the management of products that have reached their end of life. We propose that end-of-life products have -- besides their value as recyclable assets -- additional value for producer and consumer. We argue this is especially true for the machining industry, where we illustrate an automatic characterization of worn cutting tools to foster value co-creation between tool manufacturer and tool user (customer) in the future. In the work at hand, we present a deep-learning-based computer vision system for the automatic classification of worn tools with respect to flank wear and chipping. The resulting Matthews Correlation Coefficients of 0.878 and 0.644 confirm the feasibility of our system, which is based on the VGG-16 network and Gradient Boosting. Based on these first results we derive a research agenda which addresses the need for a more holistic tool characterization by semantic segmentation and assesses the perceived business impact and usability for different user groups.
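For reference, the Matthews Correlation Coefficient summarizes a binary confusion matrix as MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), ranging from -1 to 1 with 0 meaning chance level. A minimal sketch with scikit-learn and made-up labels (the data here is hypothetical, not from the paper):

from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # e.g. chipping present / absent
y_pred = [1, 0, 0, 0, 1, 0, 1, 1]   # classifier output
print(matthews_corrcoef(y_true, y_pred))  # 0.5 for this toy example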

98. Improving Generative Adversarial Networks with Local Coordinate Coding [PDF]
  Jiezhang Cao, Yong Guo, Qingyao Wu, Chunhua Shen, Junzhou Huang, Mingkui Tan
Abstract: Generative adversarial networks (GANs) have shown remarkable success in generating realistic data from some predefined prior distribution (e.g., Gaussian noises). However, such prior distribution is often independent of real data and thus may lose semantic information (e.g., geometric structure or content in images) of data. In practice, the semantic information might be represented by some latent distribution learned from data. However, such latent distribution may incur difficulties in data sampling for GANs. In this paper, rather than sampling from the predefined prior distribution, we propose an LCCGAN model with local coordinate coding (LCC) to improve the performance of generating data. First, we propose an LCC sampling method in LCCGAN to sample meaningful points from the latent manifold. With the LCC sampling method, we can exploit the local information on the latent manifold and thus produce new data with promising quality. Second, we propose an improved version, namely LCCGAN++, by introducing a higher-order term in the generator approximation. This term is able to achieve better approximation and thus further improve the performance. More critically, we derive the generalization bound for both LCCGAN and LCCGAN++ and prove that a low-dimensional input is sufficient to achieve good generalization performance. Extensive experiments on four benchmark datasets demonstrate the superiority of the proposed method over existing GANs.
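The LCC idea of sampling from the latent manifold can be caricatured as drawing convex combinations of a few anchor ("basis") latent codes, so every sampled point stays on the locally linear patch those anchors span. A rough sketch under that reading, not the paper's exact procedure:

import numpy as np

rng = np.random.default_rng(0)

def lcc_sample(anchors, n_samples, k=4):
    # anchors: (N, d) latent basis points learned from data (hypothetical).
    samples = []
    for _ in range(n_samples):
        idx = rng.choice(len(anchors), size=k, replace=False)  # local basis
        w = rng.dirichlet(np.ones(k))        # convex weights, sum to 1
        samples.append(w @ anchors[idx])     # point on the local patch
    return np.stack(samples)

z = lcc_sample(rng.normal(size=(100, 32)), n_samples=8)  # feed z to the generator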

99. FaultFace: Deep Convolutional Generative Adversarial Network (DCGAN) based Ball-Bearing Failure Detection Method [PDF]
  Jairo Viola, YangQuan Chen, Jing Wang
Abstract: Failure detection is employed in industry to improve system performance and reduce costs due to unexpected malfunction events. Thus, a good dataset of the system is desirable for designing an automated failure detection system. However, industrial process datasets are unbalanced and contain little information about failure behavior, due to the uniqueness of these events and the high cost of running the system just to obtain information about the undesired behaviors. For this reason, performing correct training and validation of automated failure detection methods is challenging. This paper proposes a methodology called FaultFace for failure detection on ball-bearing joints of rotational shafts, using deep learning techniques to create balanced datasets. The FaultFace methodology uses 2D representations of vibration signals, called faceportraits, obtained by time-frequency transformation techniques. From the obtained faceportraits, a Deep Convolutional Generative Adversarial Network is employed to produce new faceportraits of the nominal and failure behaviors to obtain a balanced dataset. A Convolutional Neural Network is then trained for fault detection on the balanced dataset. The FaultFace methodology is compared with other deep learning techniques to evaluate its performance for fault detection with unbalanced datasets. The obtained results show that the FaultFace methodology performs well at failure detection on unbalanced datasets.
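A faceportrait is, in essence, a time-frequency image of a vibration signal. The sketch below uses a plain log-power spectrogram as the time-frequency transform; the paper's actual transform and parameters (sampling rate, window sizes) are not specified in the abstract, so these values are assumptions:

import numpy as np
from scipy.signal import spectrogram

def faceportrait(vibration, fs=12_000):
    f, t, Sxx = spectrogram(vibration, fs=fs, nperseg=256, noverlap=128)
    img = 10 * np.log10(Sxx + 1e-12)                    # log-power
    return (img - img.min()) / (img.max() - img.min())  # normalize to [0, 1]

The resulting 2D arrays are what the DCGAN learns to synthesize and what the CNN classifier consumes.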

100. Automated Segmentation of Brain Gray Matter Nuclei on Quantitative Susceptibility Mapping Using Deep Convolutional Neural Network [PDF]
  Chao Chai, Pengchong Qiao, Bin Zhao, Huiying Wang, Guohua Liu, Hong Wu, E Mark Haacke, Wen Shen, Chen Cao, Xinchen Ye, Zhiyang Liu, Shuang Xia
Abstract: Abnormal iron accumulation in the brain subcortical nuclei has been reported to be correlated with various neurodegenerative diseases, and it can be measured through the magnetic susceptibility from quantitative susceptibility mapping (QSM). To quantitatively measure the magnetic susceptibility, the nuclei should be accurately segmented, which is a tedious task for clinicians. In this paper, we propose a double-branch residual-structured U-Net (DB-ResUNet) based on a 3D convolutional neural network (CNN) to automatically segment such brain gray matter nuclei. To better trade off segmentation accuracy against memory efficiency, the proposed DB-ResUNet feeds high-resolution image patches into the local branch and low-resolution patches with a larger field of view into the global branch. Experimental results revealed that by jointly using QSM and T$_\text{1}$ weighted imaging (T$_\text{1}$WI) as inputs, the proposed method achieves better segmentation accuracy than its single-branch counterpart, as well as the conventional atlas-based method and the classical 3D-UNet structure. The susceptibility values and volumes were also measured, indicating that the measurements from the proposed DB-ResUNet correlate highly with values from the manually annotated regions of interest.

101. Shape Adaptor: A Learnable Resizing Module [PDF]
  Shikun Liu, Zhe Lin, Yilin Wang, Jianming Zhang, Federico Perazzi, Edward Johns
Abstract: We present a novel resizing module for neural networks: shape adaptor, a drop-in enhancement built on top of traditional resizing layers, such as pooling, bilinear sampling, and strided convolution. Whilst traditional resizing layers have fixed and deterministic reshaping factors, our module allows for a learnable reshaping factor. Our implementation enables shape adaptors to be trained end-to-end without any additional supervision, through which network architectures can be optimised for each individual task, in a fully automated way. We performed experiments across seven image classification datasets, and results show that by simply using a set of our shape adaptors instead of the original resizing layers, performance increases consistently over human-designed networks, across all datasets. Additionally, we show the effectiveness of shape adaptors on two other applications: network compression and transfer learning. The source code is available at: this http URL.
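Conceptually, a shape adaptor replaces a fixed-stride resizing layer with a soft blend of two reshaping branches at a scale controlled by a trainable parameter, so gradients from the task loss choose the reshaping factor. A simplified PyTorch sketch of that idea follows; the official formulation and code may differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeAdaptor(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable mixing logit

    def forward(self, x):
        a = torch.sigmoid(self.alpha)        # in (0, 1)
        scale = 1.0 - 0.5 * a                # learned factor between 0.5 and 1
        size = [max(1, int(s * scale.item())) for s in x.shape[-2:]]
        pooled = F.interpolate(F.max_pool2d(x, 2), size=size, mode='bilinear',
                               align_corners=False)    # resizing branch
        identity = F.interpolate(x, size=size, mode='bilinear',
                                 align_corners=False)  # identity branch
        return a * pooled + (1 - a) * identity         # gradients reach alpha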

102. Retinal Image Segmentation with a Structure-Texture Demixing Network [PDF]
  Shihao Zhang, Huazhu Fu, Yanwu Xu, Yanxia Liu, Mingkui Tan
Abstract: Retinal image segmentation plays an important role in automatic disease diagnosis. The task is very challenging because complex structure and texture information are mixed in a retinal image, and distinguishing between the two is difficult. Existing methods handle texture and structure jointly, which may bias models toward recognizing textures and thus result in inferior segmentation performance. To address this, we propose a segmentation strategy that seeks to separate the structure and texture components and thereby significantly improve performance. To this end, we design a structure-texture demixing network (STD-Net) that can process structures and textures differently and more effectively. Extensive experiments on two retinal image segmentation tasks (i.e., blood vessel segmentation, and optic disc and cup segmentation) demonstrate the effectiveness of the proposed method.

103. Adaptive Hierarchical Decomposition of Large Deep Networks [PDF]
  Sumanth Chennupati, Sai Nooka, Shagan Sah, Raymond W Ptucha
Abstract: Deep learning has recently demonstrated its ability to rival the human brain for visual object recognition. As datasets get larger, a natural question to ask is whether existing deep learning architectures can be extended to handle the 50K+ classes thought to be perceptible by a typical human. Most deep learning architectures concentrate on splitting diverse categories, while ignoring the similarities amongst them. This paper introduces a framework that automatically analyzes and configures a family of smaller deep networks as a replacement for a single, larger network. Class similarities guide the creation of a family of coarse-to-fine classifiers which solve categorical problems more effectively than a single large classifier. The resulting smaller networks are highly scalable, parallel and more practical to train, and achieve higher classification accuracy. This paper also proposes a method to adaptively select the configuration of the hierarchical family of classifiers using linkage statistics from overall and sub-classification confusion matrices. Depending on the number of classes and the complexity of the problem, a deep learning model is selected and its complexity is determined. Numerous experiments on network classes, layers, and architecture configurations validate our results.
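The linkage-statistics idea can be sketched as hierarchical clustering over a confusion-derived distance, so that classes a flat classifier confuses end up in the same sub-network. A rough illustration; the paper's exact statistics may differ:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def split_classes(confusion, n_groups=2):
    # confusion: (C, C) confusion matrix of an initial flat classifier.
    sim = (confusion + confusion.T) / 2.0   # symmetric confusion similarity
    dist = sim.max() - sim                  # more confusion => closer classes
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=n_groups, criterion='maxclust')  # class -> group id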

104. Multi-Scale Deep Compressive Imaging [PDF]
  Thuong Nguyen Canh, Byeungwoo Jeon
Abstract: Recently, deep learning-based compressive imaging (DCI) has surpassed conventional compressive imaging in reconstruction quality and running time. While multi-scale sampling has shown superior performance over single-scale, research in DCI has been limited to single-scale sampling. Despite being trained with single-scale images, DCI tends to favor low-frequency components, similar to conventional multi-scale sampling, especially at low subrates. From this perspective, it would be easier for the network to learn multi-scale features with a multi-scale sampling architecture. In this work, we propose a multi-scale deep compressive imaging (MS-DCI) framework which jointly learns to decompose, sample, and reconstruct images at multiple scales. A three-phase end-to-end training scheme, with an initial reconstruction phase and two enhancement phases, was introduced to demonstrate the efficiency of multi-scale sampling and further improve reconstruction performance. We analyzed the decomposition methods (including pyramid, wavelet, and scale-space), sampling matrices, and measurements, and showed the empirical benefit of MS-DCI, which consistently outperforms both conventional and deep learning-based approaches.

105. IntroVAC: Introspective Variational Classifiers for Learning Interpretable Latent Subspaces [PDF]
  Marco Maggipinto, Matteo Terzi, Gian Antonio Susto
Abstract: Learning useful representations of complex data has been the subject of extensive research for many years. With the diffusion of Deep Neural Networks, Variational Autoencoders have gained lots of attention since they provide an explicit model of the data distribution based on an encoder/decoder architecture which is able to both generate images and encode them in a low-dimensional subspace. However, the latent space is not easily interpretable, and the generation capabilities show some limitations, since images typically look blurry and lack details. In this paper, we propose the Introspective Variational Classifier (IntroVAC), a model that learns interpretable latent subspaces by exploiting information from an additional label and provides improved image quality thanks to an adversarial training strategy. We show that IntroVAC is able to learn meaningful directions in the latent space, enabling fine-grained manipulation of image attributes. We validate our approach on the CelebA dataset.

106. High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands [PDF]
  Dibakar Gope, Jesse Beu, Matthew Mattina
Abstract: Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operands, they cannot exploit the benefit of the narrower bit-width of one of the operands. We propose a new SIMD matrix multiplication instruction that uses mixed precision on its inputs (8- and 4-bit operands) and accumulates product values into narrower 16-bit output accumulators, in turn allowing the SIMD operation at 128-bit vector width to process a greater number of data elements per instruction, improving processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-size SIMD instruction offers a 2x improvement in matrix multiplication throughput compared to existing symmetric-operand-size instructions, while causing negligible (0.05%) overflow from the 16-bit accumulators for representative machine learning workloads. The asymmetric-operand-size instruction can not only improve matrix multiplication throughput in CPUs, but can also effectively support multiply-and-accumulate (MAC) operations between 8- and 4-bit operands in state-of-the-art DNN hardware accelerators (e.g., the systolic array microarchitecture in the Google TPU) and offer similar improvements in matrix multiply performance seamlessly, without violating the various implementation constraints. We demonstrate how a systolic array architecture designed for symmetric-operand-size instructions could be modified to support an asymmetric-operand-size instruction.
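The overflow argument can be checked numerically: an int8 x int4 product needs at most 11 bits (|-128 * -8| = 1024), so a 16-bit accumulator tolerates roughly 32 worst-case-magnitude products, and far more in practice since typical values are small. A numpy sketch of the accumulation behavior, illustrative only and not the proposed ISA:

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(1, 64)).astype(np.int16)  # int8 value range
w = rng.integers(-8, 8, size=(64, 1)).astype(np.int16)      # int4 value range

acc16 = a @ w                                    # int16 accumulation, wraps on overflow
acc32 = a.astype(np.int32) @ w.astype(np.int32)  # wide reference accumulation
print(acc16.item(), acc32.item())                # differ only if int16 overflowed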

107. The Rate-Distortion-Accuracy Tradeoff: JPEG Case Study [PDF]
  Xiyang Luo, Hossein Talebi, Feng Yang, Michael Elad, Peyman Milanfar
Abstract: Handling digital images is almost always accompanied by lossy compression in order to facilitate efficient transmission and storage. This introduces an unavoidable tension between the allocated bit budget (rate) and the faithfulness of the resulting image to the original one (distortion). An additional complicating consideration is the effect of the compression on recognition performance by given classifiers (accuracy). This work aims to explore this rate-distortion-accuracy tradeoff. As a case study, we focus on the design of the quantization tables in the JPEG compression standard. We offer a novel optimal tuning of these tables via continuous optimization, leveraging a differentiable implementation of both the JPEG encoder-decoder and an entropy estimator. This enables us to offer a unified framework that considers the interplay between rate, distortion and classification accuracy. On all these fronts, we report a substantial boost in performance by a simple and easily implemented modification of these tables.
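Continuous optimization of the tables requires the hard rounding inside JPEG quantization to admit gradients. One standard trick is a straight-through estimator, sketched below; whether the paper uses this particular relaxation is an assumption:

import torch

def ste_round(x):
    # round() in the forward pass, identity gradient in the backward pass
    return x + (torch.round(x) - x).detach()

def jpeg_quantize_dequantize(dct_block, q_table):
    # dct_block: 8x8 DCT coefficients; q_table: learnable 8x8 table
    q = q_table.clamp(min=1.0)           # keep table entries valid
    return ste_round(dct_block / q) * q  # gradients flow into q_table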

108. Video Question Answering on Screencast Tutorials [PDF]
  Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin
Abstract: This paper presents a new video question answering task on screencast tutorials. We introduce a dataset of question, answer and context triples from tutorial videos for a software product. Unlike other video question answering works, all the answers in our dataset are grounded in the domain knowledge base. A one-shot recognition algorithm is designed to extract the visual cues, which helps enhance the performance of video question answering. We also propose several baseline neural network architectures based on various aspects of video contexts from the dataset. The experimental results demonstrate that our proposed models significantly improve question answering performance by incorporating multi-modal contexts and domain knowledge.

109. Towards Robust Visual Tracking for Unmanned Aerial Vehicle with Tri-Attentional Correlation Filters [PDF]
  Yujie He, Changhong Fu, Fuling Lin, Yiming Li, Peng Lu
Abstract: Object tracking has been broadly applied to unmanned aerial vehicle (UAV) tasks in recent years. However, existing algorithms still face difficulties such as partial occlusion, cluttered backgrounds, and other challenging visual factors. Inspired by cutting-edge attention mechanisms, a novel object tracking framework is proposed to leverage multi-level visual attention. Three primary attention mechanisms, i.e., contextual attention, dimensional attention, and spatiotemporal attention, are integrated into the training and detection stages of a correlation filter-based tracking pipeline. Therefore, the proposed tracker is equipped with robust discriminative power against challenging factors while maintaining high operational efficiency in UAV scenarios. Quantitative and qualitative experiments on two well-known benchmarks with 173 challenging UAV video sequences demonstrate the effectiveness of the proposed framework. The proposed tracking algorithm favorably outperforms 12 state-of-the-art methods, yielding a 4.8% relative gain on UAVDT and an 8.2% relative gain on UAV123@10fps against the baseline tracker, while operating at a speed of $\sim$28 frames per second.
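For background (standard discriminative correlation filter material, not specific to this paper): such trackers learn a filter $w$ by ridge regression over all cyclic shifts of the template $x$ toward a desired response $y$, which diagonalizes under the DFT and admits a closed-form per-frequency solution,

$$\min_{w} \; \|w \ast x - y\|^2 + \lambda \|w\|^2 \quad\Longrightarrow\quad \hat{w} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda},$$

where $\ast$ is circular convolution, hats denote the DFT, $^{*}$ complex conjugation, and $\odot$ element-wise multiplication. The attention mechanisms described above modulate the samples and responses entering this objective.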

110. Differentiable Feature Aggregation Search for Knowledge Distillation [PDF]
  Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang, Cong Yao, Kaigui Bian, Jian Tang
Abstract: Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network under the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes with costly computation resources. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation, which imitates multi-teacher distillation in a single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the search problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while simultaneously guaranteeing both expressivity and learnability of the feature aggregation. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.

111. Multi-level Wavelet-based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video [PDF]
  Jianyi Wang, Xin Deng, Mai Xu, Congyong Chen, Yuhang Song
Abstract: The past few years have witnessed fast development in video quality enhancement via deep learning. Existing methods mainly focus on enhancing the objective quality of compressed video while ignoring its perceptual quality. In this paper, we focus on enhancing the perceptual quality of compressed video. Our main observation is that enhancing the perceptual quality mostly relies on recovering high-frequency sub-bands in the wavelet domain. Accordingly, we propose a novel generative adversarial network (GAN) based on the multi-level wavelet packet transform (WPT) to enhance the perceptual quality of compressed video, called the multi-level wavelet-based GAN (MW-GAN). In MW-GAN, we first apply motion compensation with a pyramid architecture to obtain temporal information. Then, we propose a wavelet reconstruction network with wavelet-dense residual blocks (WDRB) to recover the high-frequency details. In addition, the adversarial loss of MW-GAN is applied via the WPT to further encourage the recovery of high-frequency details in video frames. Experimental results demonstrate the superiority of our method.
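To make the sub-band view concrete, the snippet below runs a two-level 2D wavelet packet transform with PyWavelets; the choice of the 'haar' wavelet and the input size are assumptions, since the abstract does not pin them down:

import numpy as np
import pywt

frame = np.random.rand(64, 64).astype(np.float32)   # stand-in for a luma frame

wp = pywt.WaveletPacket2D(data=frame, wavelet='haar', maxlevel=2)
for node in wp.get_level(2):
    print(node.path, node.data.shape)   # 16 sub-bands, 16x16 each

Paths containing 'h', 'v' or 'd' are the high-frequency sub-bands that compression blurs and that a perceptual-quality GAN would target.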

112. Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [PDF]
  Iman Nematollahi, Oier Mees, Lukas Hermann, Wolfram Burgard
Abstract: A key challenge for an agent learning to interact with the world is to reason about physical properties of objects and to foresee their dynamics under the effect of applied forces. In order to scale learning through interaction to many objects and scenes, robots should be able to improve their own performance from real-world experience without requiring human supervision. To this end, we propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images. Unlike previous approaches, our method does not require ground-truth data associations provided by a tracker or any pre-trained perception network. To learn from unlabeled real-world interaction data, we enforce consistency of estimated 3D clouds, actions and 2D images with observed ones. Our joint forward and inverse network learns to segment a scene into salient object parts and predicts their 3D motion under the effect of applied actions. Moreover, our object-centric model outputs action-conditioned 3D scene flow, object masks and 2D optical flow as emergent properties. Our extensive evaluation both in simulation and with real-world data demonstrates that our formulation leads to effective, interpretable models that can be used for visuomotor control and planning. Videos, code and dataset are available at this http URL

113. Vision and Inertial Sensing Fusion for Human Action Recognition: A Review [PDF]
  Sharmin Majumder, Nasser Kehtarnavaz
Abstract: Human action recognition is used in many applications such as video surveillance, human computer interaction, assistive living, and gaming. Many papers have appeared in the literature showing that the fusion of vision and inertial sensing improves recognition accuracies compared to situations where each sensing modality is used individually. This paper provides a survey of the papers in which vision and inertial sensing are used simultaneously within a fusion framework in order to perform human action recognition. The surveyed papers are categorized in terms of fusion approaches, features, classifiers, and the multimodality datasets considered. Challenges as well as possible future directions are also stated for deploying the fusion of these two sensing modalities under realistic conditions.

114. Exploring Multi-Scale Feature Propagation and Communication for Image Super Resolution [PDF]
  Ruicheng Feng, Weipeng Guan, Yu Qiao, Chao Dong
Abstract: Multi-scale techniques have achieved great success in a wide range of computer vision tasks. However, while this technique is incorporated in existing works, there is still no comprehensive investigation of variants of multi-scale convolution in image super-resolution. In this work, we present a unified formulation over widely-used multi-scale structures. With this framework, we systematically explore the two factors of multi-scale convolution -- feature propagation and cross-scale communication. Based on the investigation, we propose a generic and efficient multi-scale convolution unit -- Multi-Scale cross-Scale Share-weights convolution (MS$^3$-Conv). Extensive experiments demonstrate that the proposed MS$^3$-Conv can achieve better SR performance than the standard convolution with fewer parameters and lower computational cost. Beyond quantitative analysis, we comprehensively study the visual quality, which shows that MS$^3$-Conv behaves better at recovering high-frequency details.

115. Joint Generative Learning and Super-Resolution For Real-World Camera-Screen Degradation [PDF]
  Guanghao Yin, Shouqian Sun, Chao Li, Xin Min
Abstract: In real-world single image super-resolution (SISR) tasks, the low-resolution image suffers more complicated degradations than mere downsampling by unknown kernels. However, existing SISR methods are generally studied with synthetic low-resolution generation such as bicubic interpolation (BI), which greatly limits their performance. Recently, some researchers have investigated real-world SISR from the perspective of cameras and smartphones. However, besides the acquisition equipment, the display device also introduces more complicated degradations. In this paper, we focus on camera-screen degradation and build a real-world dataset (Cam-ScreenSR), where the HR images are the original ground truths from the previous DIV2K dataset and the corresponding LR images are camera-captured versions of the HRs displayed on a screen. We conduct extensive experiments to demonstrate that incorporating more real degradations helps improve the generalization of SISR models. Moreover, we propose a joint two-stage model. First, the downsampling degradation GAN (DD-GAN) is trained to model the degradation and produces a greater variety of LR images, which is validated to be effective for data augmentation. Then the dual residual channel attention network (DuRCAN) learns to recover the SR image. A weighted combination of the L1 loss and the proposed Laplacian loss is applied to sharpen the high-frequency edges. Extensive experimental results on both typical synthetic and complicated real-world degradations validate that the proposed method outperforms existing SOTA models with fewer parameters, faster speed and better visual results. Moreover, on real captured photographs, our model also delivers the best visual quality, with sharper edges, fewer artifacts and, in particular, appropriate color enhancement, which previous methods have not accomplished.

116. Diabetic Retinopathy Diagnosis based on Convolutional Neural Network [PDF]
  Mohammed hamzah abed, Lamia Abed Noor Muhammed, Sarah Hussein Toman
Abstract: Diabetic Retinopathy (DR) is a common disease, arising with age or as a result of diabetes, and it can cause blindness. Therefore, diagnosing this disease, especially at an early stage, can prevent its effects for many patients. To achieve this diagnosis, the eye retina must be examined continuously. Computer-aided tools based on computer vision techniques can therefore be used in this field. Different works have been carried out using various machine learning techniques. The Convolutional Neural Network is one of the most promising methods, and it is employed in this paper for Diabetic Retinopathy detection. The proposed work also includes visual enhancement in the pre-processing phase; the CNN model is then trained for the recognition and classification phase, to diagnose healthy and unhealthy retina images. Three public datasets, DiaretDB0, DiaretDB1 and DrimDB, were used in practical testing. The implementation of this work is based on MATLAB R2019a, the Deep Learning Toolbox and the Deep Network Designer, which were used to design the architecture of the convolutional neural network and train it. The results were evaluated with different metrics, one of which is accuracy. The best accuracies achieved were 100% for DiaretDB0, 99.495% for DiaretDB1 and 97.55% for DrimDB.

117. CorrSigNet: Learning CORRelated Prostate Cancer SIGnatures from Radiology and Pathology Images for Improved Computer Aided Diagnosis [PDF]
  Indrani Bhattacharya, Arun Seetharaman, Wei Shao, Rewa Sood, Christian A. Kunder, Richard E. Fan, Simon John Christoph Soerensen, Jeffrey B. Wang, Pejman Ghanouni, Nikola C. Teslovich, James D. Brooks, Geoffrey A. Sonn, Mirabela Rusu
Abstract: Magnetic Resonance Imaging (MRI) is widely used for screening and staging prostate cancer. However, many prostate cancers have subtle features which are not easily identifiable on MRI, resulting in missed diagnoses and alarming variability in radiologist interpretation. Machine learning models have been developed in an effort to improve cancer identification, but current models localize cancer using MRI-derived features, while failing to consider the disease pathology characteristics observed on resected tissue. In this paper, we propose CorrSigNet, an automated two-step model that localizes prostate cancer on MRI by capturing the pathology features of cancer. First, the model learns MRI signatures of cancer that are correlated with corresponding histopathology features, using Common Representation Learning. Second, the model uses the learned correlated MRI features to train a Convolutional Neural Network to localize prostate cancer. The histopathology images are used only in the first step to learn the correlated features. Once learned, these correlated features can be extracted from the MRI of new patients (without histopathology or surgery) to localize cancer. We trained and validated our framework on a unique dataset of 75 patients (806 slices) who underwent MRI followed by prostatectomy surgery. We tested our method on an independent test set of 20 prostatectomy patients (139 slices, 24 cancerous lesions, 1.12M pixels) and achieved a per-pixel sensitivity of 0.81, specificity of 0.71, AUC of 0.86 and a per-lesion AUC of $0.96 \pm 0.07$, outperforming the current state-of-the-art accuracy in predicting prostate cancer using MRI.

118. F*: An Interpretable Transformation of the F-measure [PDF]
  David J. Hand, Peter Christen, Nishadi Kirielle
Abstract: The F-measure is widely used to assess the performance of classification algorithms. However, some researchers find it lacking in intuitive interpretation, questioning the appropriateness of combining two aspects of performance as conceptually distinct as precision and recall, and also questioning whether the harmonic mean is the best way to combine them. To ease this concern, we describe a simple transformation of the F-measure, which we call F* (F-star), that has an immediate practical interpretation.
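To make the interpretation concrete: writing the F-measure in terms of true positives (TP), false positives (FP) and false negatives (FN), one natural candidate for such a transformation is $F^{*} = F/(2-F)$, which reduces to the fraction of items involved in the positive class (truly or by prediction) that are handled correctly. The algebra below is standard, though the paper should be consulted for the authors' exact definition:

$$F = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad F^{*} = \frac{F}{2-F} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}.$$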