[arXiv papers] Computer Vision and Pattern Recognition 2020-03-13

Contents

1. Cascade EF-GAN: Progressive Facial Expression Editing with Local Focuses [PDF] Abstract
2. SASL: Saliency-Adaptive Sparsity Learning for Neural Network Acceleration [PDF] Abstract
3. Towards Photo-Realistic Virtual Try-On by Adaptively Generating$\leftrightarrow$Preserving Image Content [PDF] Abstract
4. End-to-End Learning Local Multi-view Descriptors for 3D Point Clouds [PDF] Abstract
5. CPS: Class-level 6D Pose and Shape Estimation From Monocular Images [PDF] Abstract
6. Top-1 Solution of Multi-Moments in Time Challenge 2019 [PDF] Abstract
7. Cross-modal Multi-task Learning for Graphic Recognition of Caricature Face [PDF] Abstract
8. SynCGAN: Using learnable class specific priors to generate synthetic data for improving classifier performance on cytological images [PDF] Abstract
9. EDC3: Ensemble of Deep-Classifiers using Class-specific Copula functions to Improve Semantic Image Segmentation [PDF] Abstract
10. Deformation Flow Based Two-Stream Network for Lip Reading [PDF] Abstract
11. Fairness by Learning Orthogonal Disentangled Representations [PDF] Abstract
12. Low-Rank and Total Variation Regularization and Its Application to Image Recovery [PDF] Abstract
13. Skeleton Based Action Recognition using a Stacked Denoising Autoencoder with Constraints of Privileged Information [PDF] Abstract
14. ARAE: Adversarially Robust Training of Autoencoders Improves Novelty Detection [PDF] Abstract
15. Conditional Convolutions for Instance Segmentation [PDF] Abstract
16. Open Source Computer Vision-based Layer-wise 3D Printing Analysis [PDF] Abstract
17. Towards High-Fidelity 3D Face Reconstruction from In-the-Wild Images Using Graph Convolutional Networks [PDF] Abstract
18. Highly Efficient Salient Object Detection with 100K Parameters [PDF] Abstract
19. Understanding Crowd Flow Movements Using Active-Langevin Model [PDF] Abstract
20. Beyond the Camera: Neural Networks in World Coordinates [PDF] Abstract
21. Arbitrary-Oriented Object Detection with Circular Smooth Label [PDF] Abstract
22. Learning to Segment 3D Point Clouds in 2D Image Space [PDF] Abstract
23. Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting [PDF] Abstract
24. ZSTAD: Zero-Shot Temporal Activity Detection [PDF] Abstract
25. Extended Batch Normalization [PDF] Abstract
26. SeqXY2SeqZ: Structure Learning for 3D Shapes by Sequentially Predicting 1D Occupancy Segments From 2D Coordinates [PDF] Abstract
27. Memory-efficient Learning for Large-scale Computational Imaging [PDF] Abstract
28. Frequency-Tuned Universal Adversarial Attacks [PDF] Abstract
29. VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions [PDF] Abstract
30. Softmax Splatting for Video Frame Interpolation [PDF] Abstract
31. Confidence Guided Stereo 3D Object Detection with Split Depth Estimation [PDF] Abstract
32. Unified Image and Video Saliency Modeling [PDF] Abstract
33. Deep Vectorization of Technical Drawings [PDF] Abstract
34. Optimal HDR and Depth from Dual Cameras [PDF] Abstract
35. Truncated Inference for Latent Variable Optimization Problems: Application to Robust Estimation and Learning [PDF] Abstract
36. Explaining Away Attacks Against Neural Networks [PDF] Abstract
37. Intensity Scan Context: Coding Intensity and Geometry Relations for Loop Closure Detection [PDF] Abstract
38. AirSim Drone Racing Lab [PDF] Abstract
39. DeepURL: Deep Pose Estimation Framework for Underwater Relative Localization [PDF] Abstract

Abstracts

1. Cascade EF-GAN: Progressive Facial Expression Editing with Local Focuses [PDF] Back to contents
  Rongliang Wu, Gongjie Zhang, Shijian Lu, Tao Chen
Abstract: Recent advances in Generative Adversarial Nets (GANs) have shown remarkable improvements for facial expression editing. However, current methods are still prone to generate artifacts and blurs around expression-intensive regions, and often introduce undesired overlapping artifacts while handling large-gap expression transformations such as transformation from furious to laughing. To address these limitations, we propose Cascade Expression Focal GAN (Cascade EF-GAN), a novel network that performs progressive facial expression editing with local expression focuses. The introduction of the local focus enables the Cascade EF-GAN to better preserve identity-related features and details around eyes, noses and mouths, which further helps reduce artifacts and blurs within the generated facial images. In addition, an innovative cascade transformation strategy is designed by dividing a large facial expression transformation into multiple small ones in cascade, which helps suppress overlapping artifacts and produce more realistic editing while dealing with large-gap expression transformations. Extensive experiments over two publicly available facial expression datasets show that our proposed Cascade EF-GAN achieves superior performance for facial expression editing.

2. SASL: Saliency-Adaptive Sparsity Learning for Neural Network Acceleration [PDF] Back to contents
  Jun Shi, Jianfeng Xu, Kazuyuki Tasaka, Zhibo Chen
Abstract: Accelerating the inference speed of CNNs is critical to their deployment in real-world applications. Among all the pruning approaches, those implementing a sparsity learning framework have shown to be effective as they learn and prune the models in an end-to-end data-driven manner. However, these works impose the same sparsity regularization on all filters indiscriminately, which can hardly result in an optimal structure-sparse network. In this paper, we propose a Saliency-Adaptive Sparsity Learning (SASL) approach for further optimization. A novel and effective estimation of each filter, i.e., saliency, is designed, which is measured from two aspects: the importance for the prediction performance and the consumed computational resources. During sparsity learning, the regularization strength is adjusted according to the saliency, so our optimized format can better preserve the prediction performance while zeroing out more computation-heavy filters. The calculation for saliency introduces minimum overhead to the training process, which means our SASL is very efficient. During the pruning phase, in order to optimize the proposed data-dependent criterion, a hard sample mining strategy is utilized, which shows higher effectiveness and efficiency. Extensive experiments demonstrate the superior performance of our method. Notably, on ILSVRC-2012 dataset, our approach can reduce 49.7% FLOPs of ResNet-50 with very negligible 0.39% top-1 and 0.05% top-5 accuracy degradation.
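
As a rough illustration, the saliency-modulated penalty might look like the following sketch; the saliency definition and scaling rule here are assumptions for illustration, not the paper's exact formulation:

    import torch

    def saliency_adaptive_penalty(gammas, importance, flops, base_lambda=1e-4):
        """Sketch of a saliency-adaptive sparsity penalty (assumed form).

        gammas:     per-filter scaling factors (e.g. BN gammas) being sparsified
        importance: per-filter contribution to prediction performance
        flops:      per-filter computational cost, normalized to (0, 1]
        Filters that matter little but cost much receive a stronger L1 penalty.
        """
        saliency = importance / (flops + 1e-8)      # high saliency = worth keeping
        strength = base_lambda / (saliency + 1e-8)  # weaker penalty when salient
        return (strength * gammas.abs()).sum()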

3. Towards Photo-Realistic Virtual Try-On by Adaptively Generating$\leftrightarrow$Preserving Image Content [PDF] Back to contents
  Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, Ping Luo
Abstract: Image visual try-on aims at transferring a target clothing image onto a reference person, and has become a hot topic in recent years. Prior arts usually focus on preserving the character of a clothing image (e.g. texture, logo, embroidery) when warping it to arbitrary human pose. However, it remains a big challenge to generate photo-realistic try-on images when large occlusions and human poses are presented in the reference person. To address this issue, we propose a novel visual try-on network, namely Adaptive Content Generating and Preserving Network (ACGPN). In particular, ACGPN first predicts semantic layout of the reference image that will be changed after try-on (e.g. long sleeve shirt$\rightarrow$arm, arm$\rightarrow$jacket), and then determines whether its image content needs to be generated or preserved according to the predicted semantic layout, leading to photo-realistic try-on and rich clothing details. ACGPN generally involves three major modules. First, a semantic layout generation module utilizes semantic segmentation of the reference image to progressively predict the desired semantic layout after try-on. Second, a clothes warping module warps clothing images according to the generated semantic layout, where a second-order difference constraint is introduced to stabilize the warping process during training. Third, an inpainting module for content fusion integrates all information (e.g. reference image, semantic layout, warped clothes) to adaptively produce each semantic part of human body. In comparison to the state-of-the-art methods, ACGPN can generate photo-realistic images with much better perceptual quality and richer fine-details.

4. End-to-End Learning Local Multi-view Descriptors for 3D Point Clouds [PDF] Back to contents
  Lei Li, Siyu Zhu, Hongbo Fu, Ping Tan, Chiew-Lan Tai
Abstract: In this work, we propose an end-to-end framework to learn local multi-view descriptors for 3D point clouds. To adopt a similar multi-view representation, existing studies use hand-crafted viewpoints for rendering in a preprocessing stage, which is detached from the subsequent descriptor learning stage. In our framework, we integrate the multi-view rendering into neural networks by using a differentiable renderer, which allows the viewpoints to be optimizable parameters for capturing more informative local context of interest points. To obtain discriminative descriptors, we also design a soft-view pooling module to attentively fuse convolutional features across views. Extensive experiments on existing 3D registration benchmarks show that our method outperforms existing local descriptors both quantitatively and qualitatively.

5. CPS: Class-level 6D Pose and Shape Estimation From Monocular Images [PDF] Back to contents
  Fabian Manhardt, Manuel Nickel, Sven Meier, Luca Minciullo, Nassir Navab
Abstract: Contemporary monocular 6D pose estimation methods can only cope with a handful of object instances. This naturally limits possible applications as, for instance, robots need to work with hundreds of different objects in a real environment. In this paper, we propose the first deep learning approach for class-wise monocular 6D pose estimation, coupled with metric shape retrieval. We propose a new loss formulation which directly optimizes over all parameters, i.e. 3D orientation, translation, scale and shape at the same time. Instead of decoupling each parameter, we transform the regressed shape, in the form of a point cloud, to 3D and directly measure its metric misalignment. We experimentally demonstrate that we can retrieve precise metric point clouds from a single image, which can also be further processed for e.g. subsequent rendering. Moreover, we show that our new 3D point cloud loss outperforms all baselines and gives overall good results despite the inherent ambiguity due to monocular data.

6. Top-1 Solution of Multi-Moments in Time Challenge 2019 [PDF] Back to contents
  Manyuan Zhang, Hao Shao, Guanglu Song, Yu Liu, Junjie Yan
Abstract: In this technical report, we briefly introduce the solutions of our team 'Efficient' for the Multi-Moments in Time challenge in ICCV 2019. We first conduct several experiments with popular Image-Based action recognition methods TRN, TSN, and TSM. Then a novel temporal interlacing network is proposed towards fast and accurate recognition. Besides, the SlowFast network and its variants are explored. Finally, we ensemble all the above models and achieve 67.22\% on the validation set and 60.77\% on the test set, which ranks 1st on the final leaderboard. In addition, we release a new code repository for video understanding which unifies state-of-the-art 2D and 3D methods based on PyTorch. The solution of the challenge is also included in the repository, which is available at this https URL.

7. Cross-modal Multi-task Learning for Graphic Recognition of Caricature Face [PDF] Back to contents
  Zuheng Ming, Jean-Christophe Burie, Muhammad Muzzamil Luqman
Abstract: Face recognition of realistic visual images has been well studied and made a significant progress in the recent decade. Unlike the realistic visual images, the face recognition of the caricatures is far from the performance of the visual images. This is largely due to the extreme non-rigid distortions of the caricatures introduced by exaggerating the facial features to strengthen the characters. The heterogeneous modalities of the caricatures and the visual images result the caricature-visual face recognition is a cross-modal problem. In this paper, we propose a method to conduct caricature-visual face recognition via multi-task learning. Rather than the conventional multi-task learning with fixed weights of tasks, this work proposes an approach to learn the weights of tasks according to the importance of tasks. The proposed multi-task learning with dynamic tasks weights enables to appropriately train the hard task and easy task instead of being stuck in the over-training easy task as conventional methods. The experimental results demonstrate the effectiveness of the proposed dynamic multi-task learning for cross-modal caricature-visual face recognition. The performances on the datasets CaVI and WebCaricature show the superiority over the state-of-art methods.

8. SynCGAN: Using learnable class specific priors to generate synthetic data for improving classifier performance on cytological images [PDF] Back to contents
  Soumyajyoti Dey, Soham Das, Swarnendu Ghosh, Shyamali Mitra, Sukanta Chakrabarty, Nibaran Das
Abstract: One of the most challenging aspects of medical image analysis is the lack of a high quantity of annotated data. This makes it difficult for deep learning algorithms to perform well due to a lack of variations in the input space. While generative adversarial networks have shown promise in the field of synthetic data generation, but without a carefully designed prior the generation procedure can not be performed well. In the proposed approach we have demonstrated the use of automatically generated segmentation masks as learnable class-specific priors to guide a conditional GAN for the generation of patho-realistic samples for cytology image. We have observed that augmentation of data using the proposed pipeline called "SynCGAN" improves the performance of state of the art classifiers such as ResNet-152, DenseNet-161, Inception-V3 significantly.

9. EDC3: Ensemble of Deep-Classifiers using Class-specific Copula functions to Improve Semantic Image Segmentation [PDF] Back to contents
  Somenath Kuiry, Nibaran Das, Alaka Das, Mita Nasipuri
Abstract: In the literature, many fusion techniques are registered for the segmentation of images, but they primarily focus on observed output or belief score or probability score of the output classes. In the present work, we have utilized inter source statistical dependency among different classifiers for ensembling of different deep learning techniques for semantic segmentation of images. For this purpose, in the present work, a class-wise Copula-based ensembling method is newly proposed for solving the multi-class segmentation problem. Experimentally, it is observed that the performance has improved more for semantic image segmentation using the proposed class-specific Copula function than the traditionally used single Copula function for the problem. The performance is also compared with three state-of-the-art ensembling methods.

10. Deformation Flow Based Two-Stream Network for Lip Reading [PDF] Back to contents
  Jingyun Xiao, Shuang Yang, Yuanhang Zhang, Shiguang Shan, Xilin Chen
Abstract: Lip reading is the task of recognizing the speech content by analyzing movements in the lip region when people are speaking. Observing on the continuity in adjacent frames in the speaking process, and the consistency of the motion patterns among different speakers when they pronounce the same phoneme, we model the lip movements in the speaking process as a sequence of apparent deformations in the lip region. Specifically, we introduce a Deformation Flow Network (DFN) to learn the deformation flow between adjacent frames, which directly captures the motion information within the lip region. The learned deformation flow is then combined with the original grayscale frames with a two-stream network to perform lip reading. Different from previous two-stream networks, we make the two streams learn from each other in the learning process by introducing a bidirectional knowledge distillation loss to train the two branches jointly. Owing to the complementary cues provided by different branches, the two-stream network shows a substantial improvement over using either single branch. A thorough experimental evaluation on two large-scale lip reading benchmarks is presented with detailed analysis. The results accord with our motivation, and show that our method achieves state-of-the-art or comparable performance on these two challenging datasets.

11. Fairness by Learning Orthogonal Disentangled Representations [PDF] Back to contents
  Mhd Hasan Sarhan, Nassir Navab, Abouzar Eslami, Shadi Albarqouni
Abstract: Learning discriminative powerful representations is a crucial step for machine learning systems. Introducing invariance against arbitrary nuisance or sensitive attributes while performing well on specific tasks is an important problem in representation learning. This is mostly approached by purging the sensitive information from learned representations. In this paper, we propose a novel disentanglement approach to invariant representation problem. We disentangle the meaningful and sensitive representations by enforcing orthogonality constraints as a proxy for independence. We explicitly enforce the meaningful representation to be agnostic to sensitive information by entropy maximization. The proposed approach is evaluated on five publicly available datasets and compared with state of the art methods for learning fairness and invariance achieving the state of the art performance on three datasets and comparable performance on the rest. Further, we perform an ablative study to evaluate the effect of each component.
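
A minimal sketch of the two constraints, assuming squared pairwise dot products for the orthogonality proxy and a uniform-target entropy term (the paper's exact loss forms may differ):

    import torch
    import torch.nn.functional as F

    def disentanglement_losses(z_target, z_sensitive, sensitive_logits):
        """Illustrative losses: orthogonality as an independence proxy, plus
        entropy maximization so z_target stays agnostic to the sensitive
        attribute. Shapes: z_* are (B, D); sensitive_logits is (B, K) from a
        hypothetical classifier reading the sensitive attribute off z_target."""
        ortho = (z_target * z_sensitive).sum(dim=1).pow(2).mean()
        p = F.softmax(sensitive_logits, dim=1)
        entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
        return ortho - entropy  # minimize: orthogonality penalty down, entropy up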

12. Low-Rank and Total Variation Regularization and Its Application to Image Recovery [PDF] Back to contents
  Pawan Goyal, Hussam Al Daas, Peter Benner
Abstract: In this paper, we study the problem of image recovery from given partial (corrupted) observations. Recovering an image using a low-rank model has been an active research area in data analysis and machine learning. But often, images are not only of low-rank but they also exhibit sparsity in a transformed space. In this work, we propose a new problem formulation in such a way that we seek to recover an image that is of low-rank and has sparsity in a transformed domain. We further discuss various non-convex non-smooth surrogates of the rank function, leading to a relaxed problem. Then, we present an efficient iterative scheme to solve the relaxed problem that essentially employs the (weighted) singular value thresholding at each iteration. Furthermore, we discuss the convergence properties of the proposed iterative method. We perform extensive experiments, showing that the proposed algorithm outperforms state-of-the-art methodologies in recovering images.
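
The workhorse of the iterative scheme, (weighted) singular value thresholding, is a standard operation; a sketch, under the assumption that larger weights mean stronger shrinkage:

    import numpy as np

    def weighted_svt(X, weights, tau):
        """One weighted singular value thresholding step.

        X:       current matrix iterate
        weights: per-singular-value weights from the non-convex rank
                 surrogate (assumed aligned with the sorted spectrum)
        tau:     threshold / step size
        """
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s_shrunk = np.maximum(s - tau * weights, 0.0)  # soft-threshold the spectrum
        return (U * s_shrunk) @ Vt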

13. Skeleton Based Action Recognition using a Stacked Denoising Autoencoder with Constraints of Privileged Information [PDF] Back to contents
  Zhize Wu, Thomas Weise, Le Zou, Fei Sun, Ming Tan
Abstract: Recently, with the availability of cost-effective depth cameras coupled with real-time skeleton estimation, the interest in skeleton-based human action recognition is renewed. Most of the existing skeletal representation approaches use either the joint location or the dynamics model. Differing from the previous studies, we propose a new method called Denoising Autoencoder with Temporal and Categorical Constraints (DAE_CTC)} to study the skeletal representation in a view of skeleton reconstruction. Based on the concept of learning under privileged information, we integrate action categories and temporal coordinates into a stacked denoising autoencoder in the training phase, to preserve category and temporal feature, while learning the hidden representation from a skeleton. Thus, we are able to improve the discriminative validity of the hidden representation. In order to mitigate the variation resulting from temporary misalignment, a new method of temporal registration, called Locally-Warped Sequence Registration (LWSR), is proposed for registering the sequences of inter- and intra-class actions. We finally represent the sequences using a Fourier Temporal Pyramid (FTP) representation and perform classification using a combination of LWSR registration, FTP representation, and a linear Support Vector Machine (SVM). The experimental results on three action data sets, namely MSR-Action3D, UTKinect-Action, and Florence3D-Action, show that our proposal performs better than many existing methods and comparably to the state of the art.

14. ARAE: Adversarially Robust Training of Autoencoders Improves Novelty Detection [PDF] Back to contents
  Mohammadreza Salehi, Atrin Arya, Barbod Pajoum, Mohammad Otoofi, Amirreza Shaeiri, Mohammad Hossein Rohban, Hamid R. Rabiee
Abstract: Autoencoders (AE) have recently been widely employed to approach the novelty detection problem. Trained only on the normal data, the AE is expected to reconstruct the normal data effectively while fail to regenerate the anomalous data, which could be utilized for novelty detection. However, in this paper, it is demonstrated that this does not always hold. AE often generalizes so perfectly that it can also reconstruct the anomalous data well. To address this problem, we propose a novel AE that can learn more semantically meaningful features. Specifically, we exploit the fact that adversarial robustness promotes learning of meaningful features. Therefore, we force the AE to learn such features by penalizing networks with a bottleneck layer that is unstable against adversarial perturbations. We show that despite using a much simpler architecture in comparison to the prior methods, the proposed AE outperforms or is competitive to state-of-the-art on three benchmark datasets.
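
One plausible reading of "penalizing networks with a bottleneck layer that is unstable against adversarial perturbations" is an inner maximization of the latent discrepancy, as in the hedged PGD-style sketch below; the hyper-parameters and exact objective are assumptions:

    import torch

    def latent_instability_loss(encoder, x, eps=0.05, step=0.01, iters=5):
        """Find an input perturbation that maximally moves the bottleneck
        code, then penalize that movement so the AE learns robust features."""
        z_clean = encoder(x).detach()
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(iters):
            dist = (encoder(x + delta) - z_clean).pow(2).mean()
            grad, = torch.autograd.grad(dist, delta)
            delta = (delta + step * grad.sign()).clamp(-eps, eps)
            delta = delta.detach().requires_grad_(True)
        return (encoder(x + delta) - z_clean).pow(2).mean()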

15. Conditional Convolutions for Instance Segmentation [PDF] Back to contents
  Zhi Tian, Chunhua Shen, Hao Chen
Abstract: We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). Top-performing instance segmentation methods such as Mask R-CNN rely on ROI operations (typically ROIPool or ROIAlign) to obtain the final instance masks. In contrast, we propose to solve instance segmentation from a new perspective. Instead of using instance-wise ROIs as inputs to a network of fixed weights, we employ dynamic instance-aware networks, conditioned on instances. CondInst enjoys two advantages: 1) Instance segmentation is solved by a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference. We demonstrate a simpler instance segmentation method that can achieve improved performance in both accuracy and inference speed. On the COCO dataset, we outperform a few recent methods including well-tuned Mask RCNN baselines, without longer training schedules needed. Code is available: this https URL
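
The compact dynamic mask head (three conv layers of 8 channels each, with weights produced per instance by a controller) can be sketched with functional convolutions; the 1x1 kernels and flat parameter packing below are assumptions:

    import torch
    import torch.nn.functional as F

    def dynamic_mask_head(features, params, channels=8):
        """Apply instance-generated conv weights to shared mask features.

        features: (1, C, H, W) mask-branch features for one instance
        params:   flat vector holding all generated weights and biases
        """
        c_in = features.shape[1]
        sizes = [(channels, c_in), (channels, channels), (1, channels)]
        x, offset = features, 0
        for i, (c_out, c_prev) in enumerate(sizes):
            w_n = c_out * c_prev
            w = params[offset:offset + w_n].view(c_out, c_prev, 1, 1)
            b = params[offset + w_n:offset + w_n + c_out]
            offset += w_n + c_out
            x = F.conv2d(x, w, b)
            if i < len(sizes) - 1:
                x = F.relu(x)
        return x.sigmoid()  # per-pixel mask probability for this instance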

16. Open Source Computer Vision-based Layer-wise 3D Printing Analysis [PDF] Back to contents
  Aliaksei L. Petsiuk, Joshua M. Pearce
Abstract: The paper describes an open source computer vision-based hardware structure and software algorithm, which analyzes layer-wise the 3-D printing processes, tracks printing errors, and generates appropriate printer actions to improve reliability. This approach is built upon multiple-stage monocular image examination, which allows monitoring both the external shape of the printed object and internal structure of its layers. Starting with the side-view height validation, the developed program analyzes the virtual top view for outer shell contour correspondence using the multi-template matching and iterative closest point algorithms, as well as inner layer texture quality clustering the spatial-frequency filter responses with Gaussian mixture models and segmenting structural anomalies with the agglomerative hierarchical clustering algorithm. This allows evaluation of both global and local parameters of the printing modes. The experimentally-verified analysis time per layer is less than one minute, which can be considered a quasi-real-time process for large prints. The systems can work as an intelligent printing suspension tool designed to save time and material. However, the results show the algorithm provides a means to systematize in situ printing data as a first step in a fully open source failure correction algorithm for additive manufacturing.

17. Towards High-Fidelity 3D Face Reconstruction from In-the-Wild Images Using Graph Convolutional Networks [PDF] Back to contents
  Jiangke Lin, Yi Yuan, Tianjia Shao, Kun Zhou
Abstract: 3D Morphable Model (3DMM) based methods have achieved great success in recovering 3D face shapes from single-view images. However, the facial textures recovered by such methods lack the fidelity as exhibited in the input images. Recent work demonstrates high-quality facial texture recovering with generative networks trained from a large-scale database of high-resolution UV maps of face textures, which is hard to prepare and not publicly available. In this paper, we introduce a method to reconstruct 3D facial shapes with high-fidelity textures from single-view images in-the-wild, without the need to capture a large-scale face texture database. The main idea is to refine the initial texture generated by a 3DMM based method with facial details from the input image. To this end, we propose to use graph convolutional networks to reconstruct the detailed colors for the mesh vertices instead of reconstructing the UV map. Experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

18. Highly Efficient Salient Object Detection with 100K Parameters [PDF] Back to contents
  Shang-Hua Gao, Yong-Qiang Tan, Ming-Ming Cheng, Chengze Lu, Yunpeng Chen, Shuicheng Yan
Abstract: Salient object detection models often demand a considerable amount of computation cost to make precise prediction for each pixel, making them hardly applicable on low-power devices. In this paper, we aim to relieve the contradiction between computation cost and model performance by improving the network efficiency to a higher degree. We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features, while reducing the representation redundancy by a novel dynamic weight decay scheme. The effective dynamic weight decay scheme stably boosts the sparsity of parameters during training, supports learnable number of channels for each scale in gOctConv, allowing 80% of parameters reduce with negligible performance drop. Utilizing gOctConv, we build an extremely light-weighted model, namely CSNet, which achieves comparable performance with about 0.2% parameters (100k) of large models on popular salient object detection benchmarks.

19. Understanding Crowd Flow Movements Using Active-Langevin Model [PDF] Back to contents
  Shreetam Behera, Debi Prosad Dogra, Malay Kumar Bandyopadhyay, Partha Pratim Roy
Abstract: Crowd flow describes the elementary group behavior of crowds. Understanding the dynamics behind these movements can help to identify various abnormalities in crowds. However, developing a crowd model describing these flows is a challenging task. In this paper, a physics-based model is proposed to describe the movements in dense crowds. The crowd model is based on active Langevin equation where the motion points are assumed to be similar to active colloidal particles in fluids. The model is further augmented with computer-vision techniques to segment both linear and non-linear motion flows in a dense crowd. The evaluation of the active Langevin equation-based crowd segmentation has been done on publicly available crowd videos and on our own videos. The proposed method is able to segment the flow with lesser optical flow error and better accuracy in comparison to existing state-of-the-art methods.
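
For context, a generic active Langevin equation for a self-propelled particle $i$ reads (this is the standard family of models the paper builds on; its exact terms for crowd motion may differ):

    \dot{\mathbf{r}}_i(t) = v_0\,\mathbf{e}_i(t) + \mu\,\mathbf{F}_i(\mathbf{r}_1,\dots,\mathbf{r}_N) + \sqrt{2D}\,\boldsymbol{\xi}_i(t)

where $v_0$ is the self-propulsion speed along heading $\mathbf{e}_i$, $\mathbf{F}_i$ collects interaction forces, $D$ is a diffusion coefficient, and $\boldsymbol{\xi}_i$ is Gaussian white noise.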

20. Beyond the Camera: Neural Networks in World Coordinates [PDF] Back to contents
  Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari
Abstract: Eye movement and strategic placement of the visual field onto the retina, gives animals increased resolution of the scene and suppresses distracting information. This fundamental system has been missing from video understanding with deep networks, typically limited to 224 by 224 pixel content locked to the camera frame. We propose a simple idea, WorldFeatures, where each feature at every layer has a spatial transformation, and the feature map is only transformed as needed. We show that a network built with these WorldFeatures, can be used to model eye movements, such as saccades, fixation, and smooth pursuit, even in a batch setting on pre-recorded video. That is, the network can for example use all 224 by 224 pixels to look at a small detail one moment, and the whole scene the next. We show that typical building blocks, such as convolutions and pooling, can be adapted to support WorldFeatures using available tools. Experiments are presented on the Charades, Olympic Sports, and Caltech-UCSD Birds-200-2011 datasets, exploring action recognition, fine-grained recognition, and video stabilization.

21. Arbitrary-Oriented Object Detection with Circular Smooth Label [PDF] Back to contents
  Xue Yang, Junchi Yan
Abstract: Arbitrary-oriented object detection has recently attracted increasing attention in vision for their importance in aerial imagery, scene text, and face etc. In this paper, we show that existing regression-based rotation detectors suffer the problem of discontinuous boundaries, which is directly caused by angular periodicity or corner ordering. By a careful study, we find the root cause is that the ideal predictions are beyond the defined range. We design a new rotation detection baseline, to address the boundary problem by transforming angular prediction from a regression problem to a classification task with little accuracy loss, whereby high-precision angle classification is devised in contrast to previous works using coarse-granularity in rotation detection. We also propose a circular smooth label (CSL) technique to handle the periodicity of the angle and increase the error tolerance to adjacent angles. We further introduce four window functions in CSL and explore the effect of different window radius sizes on detection performance. Extensive experiments and visual analysis on two large-scale public datasets for aerial images i.e. DOTA, HRSC2016, as well as scene text dataset ICDAR2015 and MLT, show the effectiveness of our approach. The code will be released at this https URL.
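
Constructing a Circular Smooth Label is straightforward; the sketch below uses a Gaussian window (one of several window functions the paper explores) with assumed defaults for bin count and radius:

    import numpy as np

    def circular_smooth_label(gt_bin, num_bins=180, radius=6):
        """Soft angle-classification target: a window centered on the
        ground-truth angle bin, wrapped circularly so that, e.g., 179
        degrees and 0 degrees are treated as adjacent."""
        bins = np.arange(num_bins)
        d = np.minimum(np.abs(bins - gt_bin), num_bins - np.abs(bins - gt_bin))
        label = np.exp(-d.astype(float) ** 2 / (2 * radius ** 2))
        label[d > radius] = 0.0  # zero outside the window radius
        return label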

22. Learning to Segment 3D Point Clouds in 2D Image Space [PDF] Back to contents
  Yecheng Lyu, Xinming Huang, Ziming Zhang
Abstract: In contrast to the literature where local patterns in 3D point clouds are captured by customized convolutional operators, in this paper we study the problem of how to effectively and efficiently project such point clouds into a 2D image space so that traditional 2D convolutional neural networks (CNNs) such as U-Net can be applied for segmentation. To this end, we are motivated by graph drawing and reformulate it as an integer programming problem to learn the topology-preserving graph-to-grid mapping for each individual point cloud. To accelerate the computation in practice, we further propose a novel hierarchical approximate algorithm. With the help of the Delaunay triangulation for graph construction from point clouds and a multi-scale U-Net for segmentation, we manage to demonstrate the state-of-the-art performance on ShapeNet and PartNet, respectively, with significant improvement over the literature. Code is available at this https URL.

23. Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting [PDF] Back to contents
  Pongpisit Thanasutives, Ken-ichi Fukui, Masayuki Numao, Boonserm Kijsirikul
Abstract: In this paper, we proposed two modified neural network architectures based on SFANet and SegNet respectively for accurate and efficient crowd counting. Inspired by SFANet, the first model is attached with two novel multi-scale-aware modules called, ASSP and CAN. This model is called M-SFANet. The encoder of M-SFANet is enhanced with ASSP containing parallel atrous convolution with different sampling rates and hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage contextual module called CAN which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on SFANet decoder structure, M-SFANet decoder has dual paths, for density map generation and attention map generation. The second model is called M-SegNet. For M-SegNet, we simply change bilinear upsampling used in SFANet to max unpooling originally from SegNet and propose the faster model while providing competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module in order to not increase the complexity. Both models are encoder-decoder based architectures and end-to-end trainable. We also conduct extensive experiments on four crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that could outperform some of state-of-the-art crowd counting methods.

24. ZSTAD: Zero-Shot Temporal Activity Detection [PDF] Back to contents
  Lingling Zhang, Xiaojun Chang, Jun Liu, Minnan Luo, Sen Wang, Zongyuan Ge, Alexander Hauptmann
Abstract: An integral part of video analysis and surveillance is temporal activity detection, which means to simultaneously recognize and localize activities in long untrimmed videos. Currently, the most effective methods of temporal activity detection are based on deep learning, and they typically perform very well with large scale annotated videos for training. However, these methods are limited in real applications due to the unavailable videos about certain activity classes and the time-consuming data annotation. To solve this challenging problem, we propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected. We design an end-to-end deep network based on R-C3D as the architecture for this solution. The proposed network is optimized with an innovative loss function that considers the embeddings of activity labels and their super-classes while learning the common semantics of seen and unseen activities. Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.

25. Extended Batch Normalization [PDF] Back to contents
  Chunjie Luo, Jianfeng Zhan, Lei Wang, Wanling Gao
Abstract: Batch normalization (BN) has become a standard technique for training the modern deep networks. However, its effectiveness diminishes when the batch size becomes smaller, since the batch statistics estimation becomes inaccurate. That hinders batch normalization's usage for 1) training larger model which requires small batches constrained by memory consumption, 2) training on mobile or embedded devices of which the memory resource is limited. In this paper, we propose a simple but effective method, called extended batch normalization (EBN). For NCHW format feature maps, extended batch normalization computes the mean along the (N, H, W) dimensions, as the same as batch normalization, to maintain the advantage of batch normalization. To alleviate the problem caused by small batch size, extended batch normalization computes the standard deviation along the (N, C, H, W) dimensions, thus enlarges the number of samples from which the standard deviation is computed. We compare extended batch normalization with batch normalization and group normalization on the datasets of MNIST, CIFAR-10/100, STL-10, and ImageNet, respectively. The experiments show that extended batch normalization alleviates the problem of batch normalization with small batch size while achieving close performances to batch normalization with large batch size.
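
Because the abstract pins down the reduction dimensions exactly, the core of Extended Batch Normalization can be sketched in a few lines (affine parameters and running statistics omitted for brevity):

    import torch

    def extended_batch_norm(x, eps=1e-5):
        """EBN for an NCHW tensor: per-channel mean over (N, H, W) as in
        standard BN, but a single shared std computed over (N, C, H, W),
        which pools many more samples when the batch is small."""
        mean = x.mean(dim=(0, 2, 3), keepdim=True)      # shape (1, C, 1, 1)
        std = (x - mean).pow(2).mean().add(eps).sqrt()  # one scalar
        return (x - mean) / std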

26. SeqXY2SeqZ: Structure Learning for 3D Shapes by Sequentially Predicting 1D Occupancy Segments From 2D Coordinates [PDF] Back to contents
  Zhizhong Han, Guanhui Qiao, Yu-Shen Liu, Matthias Zwicker
Abstract: Structure learning for 3D shapes is vital for 3D computer vision. State-of-the-art methods show promising results by representing shapes using implicit functions in 3D that are learned using discriminative neural networks. However, learning implicit functions requires dense and irregular sampling in 3D space, which also makes the sampling methods affect the accuracy of shape reconstruction during test. To avoid dense and irregular sampling in 3D, we propose to represent shapes using 2D functions, where the output of the function at each 2D location is a sequence of line segments inside the shape. Our approach leverages the power of functional representations, but without the disadvantage of 3D sampling. Specifically, we use a voxel tubelization to represent a voxel grid as a set of tubes along any one of the X, Y, or Z axes. Each tube can be indexed by its 2D coordinates on the plane spanned by the other two axes. We further simplify each tube into a sequence of occupancy segments. Each occupancy segment consists of successive voxels occupied by the shape, which leads to a simple representation of its 1D start and end location. Given the 2D coordinates of the tube and a shape feature as condition, this representation enables us to learn 3D shape structures by sequentially predicting the start and end locations of each occupancy segment in the tube. We implement this approach using a Seq2Seq model with attention, called SeqXY2SeqZ, which learns the mapping from a sequence of 2D coordinates along two arbitrary axes to a sequence of 1D locations along the third axis. SeqXY2SeqZ not only benefits from the regularity of voxel grids in training and testing, but also achieves high memory efficiency. Our experiments show that SeqXY2SeqZ outperforms the state-ofthe-art methods under widely used benchmarks.
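
Extracting the 1D occupancy segments of a tube is a run-length computation; a minimal sketch, assuming a dense boolean voxel grid indexed as (X, Y, Z):

    import numpy as np

    def occupancy_segments(voxels, x, y):
        """Return the occupancy segments of the tube at 2D coordinate (x, y)
        as (start, end) pairs of consecutive occupied voxels along Z."""
        tube = voxels[x, y, :].astype(int)
        edges = np.diff(np.concatenate(([0], tube, [0])))
        starts = np.where(edges == 1)[0]
        ends = np.where(edges == -1)[0] - 1  # inclusive end index
        return list(zip(starts, ends))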

27. Memory-efficient Learning for Large-scale Computational Imaging [PDF] Back to contents
  Michael Kellman, Kevin Zhang, Jon Tamir, Emrah Bostan, Michael Lustig, Laura Waller
Abstract: Critical aspects of computational imaging systems, such as experimental design and image priors, can be optimized through deep networks formed by the unrolled iterations of classical model-based reconstructions (termed physics-based networks). However, for real-world large-scale inverse problems, computing gradients via backpropagation is infeasible due to memory limitations of graphics processing units. In this work, we propose a memory-efficient learning procedure that exploits the reversibility of the network's layers to enable data-driven design for large-scale computational imaging systems. We demonstrate our method on a small-scale compressed sensing example, as well as two large-scale real-world systems: multi-channel magnetic resonance imaging and super-resolution optical microscopy.
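
The reversibility the paper exploits can be illustrated with a generic additive-coupling block, where inputs are recomputed exactly from outputs so activations need not be stored (a sketch of the principle, not the authors' specific unrolled-reconstruction networks):

    import torch

    class ReversibleBlock(torch.nn.Module):
        """y1 = x1 + f(x2); y2 = x2 + g(y1). The inverse recovers (x1, x2)
        from (y1, y2), enabling backprop without caching activations."""
        def __init__(self, f, g):
            super().__init__()
            self.f, self.g = f, g

        def forward(self, x1, x2):
            y1 = x1 + self.f(x2)
            y2 = x2 + self.g(y1)
            return y1, y2

        def inverse(self, y1, y2):
            x2 = y2 - self.g(y1)  # recompute inputs instead of storing them
            x1 = y1 - self.f(x2)
            return x1, x2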

28. Frequency-Tuned Universal Adversarial Attacks [PDF] Back to contents
  Yingpeng Deng, Lina J. Karam
Abstract: Researchers have shown that the predictions of a convolutional neural network (CNN) for an image set can be severely distorted by one single image-agnostic perturbation, or universal perturbation, usually with an empirically fixed threshold in the spatial domain to restrict its perceivability. However, by considering the human perception, we propose to adopt JND thresholds to guide the perceivability of universal adversarial perturbations. Based on this, we propose a frequency-tuned universal attack method to compute universal perturbations and show that our method can realize a good balance between perceivability and effectiveness in terms of fooling rate by adapting the perturbations to the local frequency content. Compared with existing universal adversarial attack techniques, our frequency-tuned attack method can achieve cutting-edge quantitative results. We demonstrate that our approach can significantly improve the performance of the baseline on both white-box and black-box attacks.

29. VSGNet: Spatial Attention Network for Detecting Human Object Interactions Using Graph Convolutions [PDF] Back to contents
  Oytun Ulutan, A S M Iftekhar, B.S. Manjunath
Abstract: Comprehensive visual understanding requires detection frameworks that can effectively learn and utilize object interactions while analyzing objects individually. This is the main objective in Human-Object Interaction (HOI) detection task. In particular, relative spatial reasoning and structural connections between objects are essential cues for analyzing interactions, which is addressed by the proposed Visual-Spatial-Graph Network (VSGNet) architecture. VSGNet extracts visual features from the human-object pairs, refines the features with spatial configurations of the pair, and utilizes the structural connections between the pair via graph convolutions. The performance of VSGNet is thoroughly evaluated using the Verbs in COCO (V-COCO) and HICO-DET datasets. Experimental results indicate that VSGNet outperforms state-of-the-art solutions by 8% or 4 mAP in V-COCO and 16% or 3 mAP in HICO-DET.

30. Softmax Splatting for Video Frame Interpolation [PDF] Back to contents
  Simon Niklaus, Feng Liu
Abstract: Differentiable image sampling in the form of backward warping has seen broad adoption in tasks like depth estimation and optical flow prediction. In contrast, how to perform forward warping has seen less attention, partly due to additional challenges such as resolving the conflict of mapping multiple pixels to the same target location in a differentiable way. We propose softmax splatting to address this paradigm shift and show its effectiveness on the application of frame interpolation. Specifically, given two input frames, we forward-warp the frames and their feature pyramid representations based on an optical flow estimate using softmax splatting. In doing so, the softmax splatting seamlessly handles cases where multiple source pixels map to the same target location. We then use a synthesis network to predict the interpolation result from the warped representations. Our softmax splatting allows us to not only interpolate frames at an arbitrary time but also to fine tune the feature pyramid and the optical flow. We show that our synthesis approach, empowered by softmax splatting, achieves new state-of-the-art results for video frame interpolation.
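
The conflict-resolution idea is easy to see in a simplified nearest-pixel version (the paper splats bilinearly; this sketch only shows how exp(importance) weights arbitrate when several source pixels land on one target):

    import numpy as np

    def softmax_splat_nearest(frame, flow, importance):
        """Forward-warp `frame` by `flow`, weighting each splat by
        exp(importance) and normalizing per target pixel.

        frame: (H, W, C), flow: (H, W, 2) as (dx, dy), importance: (H, W)
        """
        H, W, C = frame.shape
        num = np.zeros((H, W, C))
        den = np.zeros((H, W, 1))
        ys, xs = np.mgrid[0:H, 0:W]
        tx = np.rint(xs + flow[..., 0]).astype(int)
        ty = np.rint(ys + flow[..., 1]).astype(int)
        valid = (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H)
        w = np.exp(importance)[..., None]
        np.add.at(num, (ty[valid], tx[valid]), (w * frame)[valid])
        np.add.at(den, (ty[valid], tx[valid]), w[valid])
        return num / np.maximum(den, 1e-8)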

31. Confidence Guided Stereo 3D Object Detection with Split Depth Estimation [PDF] Back to contents
  Chengyao Li, Jason Ku, Steven L. Waslander
Abstract: Accurate and reliable 3D object detection is vital to safe autonomous driving. Despite recent developments, the performance gap between stereo-based methods and LiDAR-based methods is still considerable. Accurate depth estimation is crucial to the performance of stereo-based 3D object detection methods, particularly for those pixels associated with objects in the foreground. Moreover, stereo-based methods suffer from high variance in the depth estimation accuracy, which is often not considered in the object detection pipeline. To tackle these two issues, we propose CG-Stereo, a confidence-guided stereo 3D object detection pipeline that uses separate decoders for foreground and background pixels during depth estimation, and leverages the confidence estimation from the depth estimation network as a soft attention mechanism in the 3D object detector. Our approach outperforms all state-of-the-art stereo-based 3D detectors on the KITTI benchmark.

32. Unified Image and Video Saliency Modeling [PDF] Back to contents
  Richard Droste, Jianbo Jiao, J. Alison Noble
Abstract: Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. On the one hand, image saliency modeling is a well-studied problem and progress on benchmarks like \mbox{SALICON} and MIT300 is slowing. For video saliency prediction on the other hand, rapid gains have been achieved on the recent DHF1K benchmark through network architectures that are optimized for this task. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We find that it is crucial to model the domain shift between image and video saliency data and between different video saliency datasets for effective joint modeling. We identify different sources of domain shift and address them through four novel domain adaptation techniques - Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN - in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train the entire network simultaneously with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, as well as the image saliency datasets SALICON and MIT300. With one set of parameters, our method achieves state-of-the-art performance on all video saliency datasets and is on par with the state-of-the-art for image saliency prediction, despite a 5 to 20-fold reduction in model size and the fastest runtime among all competing deep models. We provide retrospective analyses and ablation studies which demonstrate the importance of the domain shift modeling. The code is available at this https URL.

33. Deep Vectorization of Technical Drawings [PDF] Back to contents
  Vage Egiazarian, Oleg Voynov, Alexey Artemov, Denis Volkhonskiy, Aleksandr Safin, Maria Taktasheva, Denis Zorin, Evgeny Burnaev
Abstract: We present a new method for vectorization of technical line drawings, such as floor plans, architectural drawings, and 2D CAD images. Our method includes (1) a deep learning-based cleaning stage to eliminate the background and imperfections in the image and fill in missing parts, (2) a transformer-based network to estimate vector primitives, and (3) an optimization procedure to obtain the final primitive configurations. We train the networks on synthetic data, renderings of vector line drawings, and manually vectorized scans of line drawings. Our method quantitatively and qualitatively outperforms a number of existing techniques on a collection of representative technical drawings.
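The three-stage structure chains naturally as a function composition; in the sketch below each learned or optimized stage is replaced by a trivial placeholder (thresholding, a dummy primitive, endpoint rounding) purely to show how the stages connect, so none of it reflects the paper's actual networks.

```python
import numpy as np

def clean(image):
    """Stage 1 placeholder: binarize to suppress background clutter
    (the paper uses a learned cleaning network instead)."""
    return (image < 0.5).astype(np.float32)

def estimate_primitives(patch):
    """Stage 2 placeholder: emit line segments as (x0, y0, x1, y1) rows
    (the paper predicts these with a transformer-based network)."""
    return np.array([[0.0, 0.0, patch.shape[1] - 1.0, patch.shape[0] - 1.0]])

def refine(primitives):
    """Stage 3 placeholder: snap endpoints to integer pixels as a trivial
    stand-in for the paper's optimization over primitive parameters."""
    return np.round(primitives)

image = np.random.rand(64, 64)
print(refine(estimate_primitives(clean(image))))  # [[ 0.  0. 63. 63.]]
```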

34. Optimal HDR and Depth from Dual Cameras [PDF] 返回目录
  Pradyumna Chari, Anil Kumar Vadathya, Kaushik Mitra
Abstract: Dual camera systems have assisted in the proliferation of various applications, such as optical zoom, low-light imaging and High Dynamic Range (HDR) imaging. In this work, we explore an optimal method for capturing the scene HDR and disparity map using dual camera setups. Hasinoff et al. (2010) developed a noise-optimal framework for HDR capture from a single camera. We generalize this to the dual-camera setup for estimating both the HDR and the disparity map. It may seem that dual camera systems can capture HDR in a shorter time. However, disparity estimation is a necessary step, which requires overlap among the images captured by the two cameras. This may lead to an increase in the capture time. To address this conflicting requirement, we propose a novel framework to find the optimal exposure and ISO sequence by minimizing the capture time under the constraints of an upper bound on the disparity error and a lower bound on the per-exposure SNR. We show that the resulting optimization problem is non-convex in general and propose an appropriate initialization technique. To obtain the HDR and disparity map from the optimal capture sequence, we propose a pipeline which alternates between estimating the camera ICRFs and the scene disparity map. We demonstrate that our optimal capture sequence leads to better results than other possible capture sequences. Our results are also close to those obtained by capturing the full stereo stack spanning the entire dynamic range. Finally, we present for the first time a stereo HDR dataset consisting of dense ISO and exposure stacks captured with a smartphone dual camera. The dataset consists of 6 scenes, with an average of 142 exposure-ISO images per scene.
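As a toy illustration of the capture-sequence search, the snippet below enumerates short exposure/ISO sequences and keeps the fastest one whose worst per-shot SNR clears a floor; the sensor noise model, the constants, and the omission of the disparity-error constraint are all simplifications of the paper's non-convex formulation.

```python
import itertools

EXPOSURES = [1 / 1000, 1 / 250, 1 / 60, 1 / 15]  # seconds
ISOS = [100, 400, 1600]

def snr(exposure, iso, radiance=1.0, read_noise=2.0, gain_noise=0.01):
    """Very simplified sensor model: shot noise + read noise + gain noise."""
    signal = radiance * exposure * iso
    noise = (signal + read_noise ** 2 + (gain_noise * iso) ** 2) ** 0.5
    return signal / noise

settings = [(e, i) for e in EXPOSURES for i in ISOS]
best = None
for n in (2, 3):  # try 2- and 3-shot capture sequences
    for seq in itertools.combinations(settings, n):
        if min(snr(e, i) for e, i in seq) < 1.0:  # per-exposure SNR floor
            continue
        t = sum(e for e, _ in seq)  # capture time ~ total exposure time
        if best is None or t < best[0]:
            best = (t, seq)
print(best)  # fastest feasible (time, ((exposure, ISO), ...)) sequence
```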

35. Truncated Inference for Latent Variable Optimization Problems: Application to Robust Estimation and Learning [PDF] 返回目录
  Christopher Zach, Huu Le
Abstract: Optimization problems with an auxiliary latent variable structure in addition to the main model parameters occur frequently in computer vision and machine learning. The additional latent variables make the underlying optimization task expensive, either in terms of memory (by maintaining the latent variables) or in terms of runtime (repeated exact inference of latent variables). We aim to remove the need to maintain the latent variables and propose two formally justified methods that dynamically adapt the required accuracy of latent variable inference. These methods have applications in large-scale robust estimation and in learning energy-based models from labeled data.
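A toy instance of the idea: robust line fitting with per-point latent weights, where the inner latent update is truncated to a single step per outer iteration instead of being solved to convergence. The Geman-McClure-style weight formula is an assumed robust kernel, not necessarily the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + 0.1 * rng.normal(size=100)  # ground-truth slope: 2.0
y[:10] += 5.0                             # gross outliers

theta = 0.0              # model parameter (slope of y = theta * x)
w = np.ones_like(x)      # latent per-point weights
for _ in range(50):
    # Truncated inner inference: a single latent-weight update
    # (Geman-McClure-style) instead of solving for w to convergence.
    r = y - theta * x
    w = 1.0 / (1.0 + r ** 2) ** 2
    # Outer step: closed-form weighted least-squares update of theta.
    theta = np.sum(w * x * y) / np.sum(w * x * x)
print(round(theta, 3))   # close to 2.0 despite the outliers
```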

36. Explaining Away Attacks Against Neural Networks [PDF] 返回目录
  Sean Saito, Jin Wang
Abstract: We investigate the problem of identifying adversarial attacks on image-based neural networks. We present intriguing experimental results showing significant discrepancies between the explanations generated for the predictions of a model on clean and adversarial data. Utilizing this intuition, we propose a framework which can identify whether a given input is adversarial based on the explanations given by the model. Code for our experiments can be found here: this https URL.
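A toy stand-in for the detection idea: compute a gradient-based explanation and flag inputs whose explanation looks atypical. The entropy statistic, the threshold, and the untrained stand-in classifier below are all placeholders; the paper instead identifies adversarial inputs from the explanations themselves.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier

def saliency(x):
    """Gradient-of-score explanation for the predicted class."""
    x = x.clone().requires_grad_(True)
    model(x).max(dim=1).values.sum().backward()
    return x.grad.abs()

def looks_adversarial(x, threshold=6.5):
    """Flag inputs whose explanation is unusually diffuse (toy statistic)."""
    s = saliency(x)
    s = s / (s.sum(dim=(1, 2, 3), keepdim=True) + 1e-8)  # normalize per sample
    entropy = -(s * (s + 1e-8).log()).sum(dim=(1, 2, 3))
    return entropy > threshold

x = torch.rand(4, 1, 28, 28)
print(looks_adversarial(x))  # per-sample boolean flags
```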

37. Intensity Scan Context: Coding Intensity and Geometry Relations for Loop Closure Detection [PDF] 返回目录
  Han Wang, Chen Wang, Lihua Xie
Abstract: Loop closure detection is an essential and challenging problem in simultaneous localization and mapping (SLAM). It is often tackled with a light detection and ranging (LiDAR) sensor due to its view-point and illumination invariant properties. Existing works on 3D loop closure detection often leverage the matching of local or global geometry-only descriptors, without considering the intensity readings. In this paper we explore the intensity property of LiDAR scans and show that it can be effective for place recognition. Concretely, we propose a novel global descriptor, intensity scan context (ISC), that explores both geometry and intensity characteristics. To improve the efficiency of loop closure detection, an efficient two-stage hierarchical re-identification process is proposed, consisting of a binary-operation-based fast geometric relation retrieval and an intensity structure re-identification. Thorough experiments, including both local experiments and tests on public datasets, have been conducted to evaluate the performance of the proposed method. Our method achieves a higher recall rate and recall precision than existing geometry-only methods.
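A hedged sketch of my reading of the descriptor construction: discretize each scan into a ring-by-sector polar grid over the ground plane and keep the strongest intensity per cell. The bin counts and the max-pooling rule are assumptions rather than the paper's exact choices.

```python
import numpy as np

def intensity_scan_context(points, intensities,
                           num_rings=20, num_sectors=60, max_range=80.0):
    """Ring x sector polar grid over the x-y plane, holding the maximum
    reflectance intensity observed in each cell."""
    r = np.hypot(points[:, 0], points[:, 1])
    theta = np.arctan2(points[:, 1], points[:, 0]) + np.pi  # map to [0, 2*pi]
    ring = np.minimum((r / max_range * num_rings).astype(int), num_rings - 1)
    sector = np.minimum((theta / (2 * np.pi) * num_sectors).astype(int),
                        num_sectors - 1)
    desc = np.zeros((num_rings, num_sectors))
    np.maximum.at(desc, (ring, sector), intensities)  # max-pool per cell
    return desc

# Toy scan: 1000 random points with random reflectance intensities.
pts = np.random.uniform(-50, 50, (1000, 3))
desc = intensity_scan_context(pts, np.random.rand(1000))
print(desc.shape)  # (20, 60); two scans can be compared column-shift-wise
```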

38. AirSim Drone Racing Lab [PDF] 返回目录
  Ratnesh Madaan, Nicholas Gyde, Sai Vemprala, Matthew Brown, Keiko Nagami, Tim Taubner, Eric Cristofalo, Davide Scaramuzza, Mac Schwager, Ashish Kapoor
Abstract: Autonomous drone racing is a challenging research problem at the intersection of computer vision, planning, state estimation, and control. We introduce AirSim Drone Racing Lab, a simulation framework for enabling fast prototyping of algorithms for autonomy and enabling machine learning research in this domain, with the goal of reducing the time, money, and risks associated with field robotics. Our framework enables generation of racing tracks in multiple photo-realistic environments and orchestration of drone races; it comes with a suite of gate assets, allows for multiple sensor modalities (monocular, depth, neuromorphic events, optical flow) and different camera models, and supports benchmarking of planning, control, computer vision, and learning-based algorithms. We used our framework to host a simulation-based drone racing competition at NeurIPS 2019. The competition binaries are available at our github repository.
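A hedged usage sketch with the standard AirSim Python client (pip install airsim), which the racing lab builds on; it needs a running simulator binary to actually execute, the target coordinates are placeholders, and ADRL's race-specific calls are deliberately not shown since their exact names are not given in the abstract.

```python
import airsim

client = airsim.MultirotorClient()   # connect to a running AirSim/ADRL binary
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)

client.takeoffAsync().join()
# Fly toward where a first gate might be placed (made-up coordinates;
# AirSim uses NED, so z = -5 means 5 m above the spawn point).
client.moveToPositionAsync(10, 0, -5, velocity=5).join()

client.armDisarm(False)
client.enableApiControl(False)
```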

39. DeepURL: Deep Pose Estimation Framework for Underwater Relative Localization [PDF] 返回目录
  Bharat Joshi, Md Modasshir, Travis Manderson, Hunter Damron, Marios Xanthidis, Alberto Quattrini Li, Ioannis Rekleitis, Gregory Dudek
Abstract: In this paper, we propose a real-time deep-learning approach for determining the 6D relative pose of Autonomous Underwater Vehicles (AUV) from a single image. A team of autonomous robots localizing themselves in a communication-constrained underwater environment is essential for many applications such as underwater exploration, mapping, multi-robot convoying, and other multi-robot tasks. Due to the profound difficulty of collecting ground truth images with accurate 6D poses underwater, this work utilizes rendered images from the Unreal Game Engine simulation for training. An image translation network is employed to bridge the gap between the rendered and the real images, producing synthetic images for training. The proposed method predicts the 6D pose of an AUV from a single image as 2D image keypoints representing the 8 corners of the 3D model of the AUV; the 6D pose in camera coordinates is then determined using RANSAC-based PnP. Experimental results in underwater environments (swimming pool and ocean) with different cameras demonstrate the robustness of the proposed technique, where the trained system decreased the translation error by 75.5% and
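A hedged sketch of the final geometric step only: recovering a 6D pose from the 8 predicted corner keypoints with OpenCV's RANSAC-based PnP. The AUV box dimensions and camera intrinsics are made-up placeholders, and the 2D keypoints (which DeepURL predicts with a network) are synthesized here by projecting a known pose so the recovery can be checked.

```python
import cv2
import numpy as np

L, W, H = 1.0, 0.3, 0.3  # assumed AUV bounding-box dimensions (meters)
corners_3d = np.array([[x, y, z]
                       for x in (-L / 2, L / 2)
                       for y in (-W / 2, W / 2)
                       for z in (-H / 2, H / 2)], dtype=np.float64)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])  # placeholder intrinsics

# Stand-in for the network's keypoint predictions: project a known pose.
rvec_true = np.array([[0.1], [0.2], [0.0]])
tvec_true = np.array([[0.0], [0.0], [5.0]])
corners_2d, _ = cv2.projectPoints(corners_3d, rvec_true, tvec_true, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(corners_3d, corners_2d, K, None)
print(ok, tvec.ravel())  # should recover a translation near [0, 0, 5]
```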
