
[arXiv papers] Computer Vision and Pattern Recognition 2020-07-15

Contents

1. Multiview Detection with Feature Perspective Transformation [PDF] Abstract
2. Transposer: Universal Texture Synthesis Using Feature Maps as Transposed Convolution Filter [PDF] Abstract
3. Modeling Artistic Workflows for Image Generation and Editing [PDF] Abstract
4. Multitask Learning Strengthens Adversarial Robustness [PDF] Abstract
5. MeTRAbs: Metric-Scale Truncation-Robust Heatmaps for Absolute 3D Human Pose Estimation [PDF] Abstract
6. Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis [PDF] Abstract
7. Alpha-Net: Architecture, Models, and Applications [PDF] Abstract
8. Learning Accurate and Human-Like Driving using Semantic Maps and Attention [PDF] Abstract
9. CenterNet3D: An Anchor free Object Detector for Autonomous Driving [PDF] Abstract
10. Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search [PDF] Abstract
11. Deep Heterogeneous Autoencoder for Subspace Clustering of Sequential Data [PDF] Abstract
12. Wavelet-Based Dual-Branch Network for Image Demoireing [PDF] Abstract
13. Towards Dense People Detection with Deep Learning and Depth images [PDF] Abstract
14. An Uncertainty-based Human-in-the-loop System for Industrial Tool Wear Analysis [PDF] Abstract
15. Re-ranking for Writer Identification and Writer Retrieval [PDF] Abstract
16. Pasadena: Perceptually Aware and Stealthy Adversarial Denoise Attack [PDF] Abstract
17. Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation [PDF] Abstract
18. UDBNET: Unsupervised Document Binarization Network via Adversarial Game [PDF] Abstract
19. Towards Realistic 3D Embedding via View Alignment [PDF] Abstract
20. Unsupervised Human 3D Pose Representation with Viewpoint and Pose Disentanglement [PDF] Abstract
21. RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [PDF] Abstract
22. Video Object Segmentation with Episodic Graph Memory Networks [PDF] Abstract
23. Correlation filter tracking with adaptive proposal selection for accurate scale estimation [PDF] Abstract
24. Pose2RGBD. Generating Depth and RGB images from absolute positions [PDF] Abstract
25. Improving Face Recognition by Clustering Unlabeled Faces in the Wild [PDF] Abstract
26. P-KDGAN: Progressive Knowledge Distillation with GANs for One-class Novelty Detection [PDF] Abstract
27. Learning Semantics-enriched Representation via Self-discovery, Self-classification, and Self-restoration [PDF] Abstract
28. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance [PDF] Abstract
29. REPrune: Filter Pruning via Representative Election [PDF] Abstract
30. Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations [PDF] Abstract
31. A Graph-based Interactive Reasoning for Human-Object Interaction Detection [PDF] Abstract
32. AQD: Towards Accurate Quantized Object Detection [PDF] Abstract
33. 360$^\circ$ Depth Estimation from Multiple Fisheye Images with Origami Crown Representation of Icosahedron [PDF] Abstract
34. Joint Layout Analysis, Character Detection and Recognition for Historical Document Digitization [PDF] Abstract
35. Knowledge Distillation for Multi-task Learning [PDF] Abstract
36. JSENet: Joint Semantic Segmentation and Edge Detection Network for 3D Point Clouds [PDF] Abstract
37. Visual Tracking by TridentAlign and Context Embedding [PDF] Abstract
38. Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets [PDF] Abstract
39. Alleviating Over-segmentation Errors by Detecting Action Boundaries [PDF] Abstract
40. BUNET: Blind Medical Image Segmentation Based on Secure UNET [PDF] Abstract
41. Topology-Change-Aware Volumetric Fusion for Dynamic Scene Reconstruction [PDF] Abstract
42. Socially and Contextually Aware Human Motion and Pose Forecasting [PDF] Abstract
43. Face to Purchase: Predicting Consumer Choices with Structured Facial and Behavioral Traits Embedding [PDF] Abstract
44. Top-Related Meta-Learning Method for Few-Shot Detection [PDF] Abstract
45. A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection [PDF] Abstract
46. DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection [PDF] Abstract
47. TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning [PDF] Abstract
48. Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner [PDF] Abstract
49. Vehicle Trajectory Prediction by Transfer Learning of Semi-Supervised Models [PDF] Abstract
50. Semi-supervised Learning with a Teacher-student Network for Generalized Attribute Prediction [PDF] Abstract
51. Patch-wise Attack for Fooling Deep Neural Network [PDF] Abstract
52. Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting [PDF] Abstract
53. JNR: Joint-based Neural Rig Representation for Compact 3D Face Modeling [PDF] Abstract
54. Water level prediction from social media images with a multi-task ranking approach [PDF] Abstract
55. DETCID: Detection of Elongated Touching Cells with Inhomogeneous Illumination using a Deep Adversarial Network [PDF] Abstract
56. Embedded Encoder-Decoder in Convolutional Networks Towards Explainable AI [PDF] Abstract
57. A Bayesian Evaluation Framework for Ground Truth-Free Visual Recognition Tasks [PDF] Abstract
58. Measuring Performance of Generative Adversarial Networks on Devanagari Script [PDF] Abstract
59. Deep Image Orientation Angle Detection [PDF] Abstract
60. CPL-SLAM: Efficient and Certifiably Correct Planar Graph-Based SLAM Using the Complex Number Representation [PDF] Abstract
61. Domain Adaptation for Robust Workload Classification using fNIRS [PDF] Abstract
62. Unsupervised object-centric video generation and decomposition in 3D [PDF] Abstract
63. UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models [PDF] Abstract
64. Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks [PDF] Abstract
65. Improving Pixel Embedding Learning through Intermediate Distance Regression Supervision for Instance Segmentation [PDF] Abstract
66. Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization [PDF] Abstract
67. Storing Encoded Episodes as Concepts for Continual Learning [PDF] Abstract
68. Deep Doubly Supervised Transfer Network for Diagnosis of Breast Cancer with Imbalanced Ultrasound Imaging Modalities [PDF] Abstract
69. Path Signatures on Lie Groups [PDF] Abstract
70. Dense Crowds Detection and Counting with a Lightweight Architecture [PDF] Abstract
71. A new approach to descriptors generation for image retrieval by analyzing activations of deep neural network layers [PDF] Abstract
72. AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation [PDF] Abstract
73. Closed-Form Factorization of Latent Semantics in GANs [PDF] Abstract
74. Towards causal benchmarking of bias in face analysis algorithms [PDF] Abstract
75. Cross-Domain Medical Image Translation by Shared Latent Gaussian Mixture Model [PDF] Abstract
76. Conditional Image Retrieval [PDF] Abstract
77. MFRNet: A New CNN Architecture for Post-Processing and In-loop Filtering [PDF] Abstract
78. Nodule2vec: a 3D Deep Learning System for Pulmonary Nodule Retrieval Using Semantic Representation [PDF] Abstract
79. A Weakly Supervised Region-Based Active Learning Method for COVID-19 Segmentation in CT Images [PDF] Abstract
80. Automated Synthetic-to-Real Generalization [PDF] Abstract
81. Lifelong Learning using Eigentasks: Task Separation, Skill Acquisition, and Selective Transfer [PDF] Abstract
82. Our Evaluation Metric Needs an Update to Encourage Generalization [PDF] Abstract
83. From Symmetry to Geometry: Tractable Nonconvex Problems [PDF] Abstract
84. An Interpretable Baseline for Time Series Classification Without Intensive Learning [PDF] Abstract
85. Landslide Segmentation with U-Net: Evaluating Different Sampling Methods and Patch Sizes [PDF] Abstract
86. T-Basis: a Compact Representation for Neural Networks [PDF] Abstract
87. Inferring the 3D Standing Spine Posture from 2D Radiographs [PDF] Abstract
88. FocusLiteNN: High Efficiency Focus Quality Assessment for Digital Pathology [PDF] Abstract
89. Batch-level Experience Replay with Review for Continual Learning [PDF] Abstract

Abstracts

1. Multiview Detection with Feature Perspective Transformation [PDF] Back to contents
  Yunzhong Hou, Liang Zheng, Stephen Gould
Abstract: Incorporating multiple camera views for detection alleviates the impact of occlusions in crowded scenes. In a multiview system, we need to answer two important questions when dealing with ambiguities that arise from occlusions. First, how should we aggregate cues from the multiple views? Second, how should we aggregate unreliable 2D and 3D spatial information that has been tainted by occlusions? To address these questions, we propose a novel multiview detection system, MVDet. For multiview aggregation, existing methods combine anchor box features from the image plane, which potentially limits performance due to inaccurate anchor box shapes and sizes. In contrast, we take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane (bird's eye view). To resolve any remaining spatial ambiguity, we apply large kernel convolutions on the ground plane feature map and infer locations from detection peaks. Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset, outperforming the state-of-the-art by 14.1%. We also provide detailed analysis of MVDet on a newly introduced synthetic dataset, MultiviewX, which allows us to control the level of occlusion. Code and MultiviewX dataset are available at this https URL.
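
To make the ground-plane aggregation concrete, here is a minimal PyTorch sketch of warping per-camera feature maps onto a bird's-eye-view grid with a homography and applying a large-kernel convolution on the result. The homographies, feature shapes, and kernel size are illustrative placeholders, not MVDet's actual configuration:

```python
import torch
import torch.nn.functional as F

def warp_to_ground(feat, H, out_hw):
    """Warp one camera's feature map (1, C, h, w) onto a ground-plane grid via homography H."""
    Hg, Wg = out_hw
    _, C, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(Hg), torch.arange(Wg), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float().reshape(-1, 3)
    uvw = grid @ H.T                                   # project ground cells into the image
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)
    uv[:, 0] = uv[:, 0] / (w - 1) * 2 - 1              # normalize to [-1, 1] for grid_sample
    uv[:, 1] = uv[:, 1] / (h - 1) * 2 - 1
    return F.grid_sample(feat, uv.reshape(1, Hg, Wg, 2), align_corners=True)

feats = [torch.randn(1, 64, 90, 160) for _ in range(7)]      # per-view CNN feature maps
Hs = [torch.eye(3) for _ in range(7)]                        # placeholder homographies
ground = torch.cat([warp_to_ground(f, H, (120, 360)) for f, H in zip(feats, Hs)], dim=1)
head = torch.nn.Conv2d(7 * 64, 1, kernel_size=9, padding=4)  # large kernel on the ground plane
score_map = head(ground)                                     # detection peaks -> locations
```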

2. Transposer: Universal Texture Synthesis Using Feature Maps as Transposed Convolution Filter [PDF] Back to contents
  Guilin Liu, Rohan Taori, Ting-Chun Wang, Zhiding Yu, Shiqiu Liu, Fitsum A. Reda, Karan Sapra, Andrew Tao, Bryan Catanzaro
Abstract: Conventional CNNs for texture synthesis consist of a sequence of (de)-convolution and up/down-sampling layers, where each layer operates locally and lacks the ability to capture the long-term structural dependency required by texture synthesis. Thus, they often simply enlarge the input texture, rather than perform reasonable synthesis. As a compromise, many recent methods sacrifice generalizability by training and testing on the same single (or fixed set of) texture image(s), resulting in huge re-training time costs for unseen images. In this work, based on the discovery that the assembling/stitching operation in traditional texture synthesis is analogous to a transposed convolution operation, we propose a novel way of using the transposed convolution operation. Specifically, we directly treat the whole encoded feature map of the input texture as transposed convolution filters and the features' self-similarity map, which captures the auto-correlation information, as input to the transposed convolution. Such a design allows our framework, once trained, to be generalizable to perform synthesis of unseen textures with a single forward pass in nearly real-time. Our method achieves state-of-the-art texture synthesis quality based on various metrics. While self-similarity helps preserve the input textures' regular structural patterns, our framework can also take random noise maps for irregular input textures instead of self-similarity maps as transposed convolution inputs. It also allows us to obtain more diverse results and to generate arbitrarily large texture outputs by directly sampling large noise maps in a single pass.
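
The core mechanism can be sketched in a few lines: the encoded feature map of the input texture is used directly as the weight tensor of a transposed convolution, whose input is the features' self-similarity map. The shapes, stride, and random placeholder tensors below are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

feat = torch.randn(64, 1, 32, 32)     # encoded texture features, reshaped as the
                                      # conv_transpose2d weight: (in=64, out=1, kH, kW)
self_sim = torch.randn(1, 64, 8, 8)   # stand-in for the features' self-similarity map
out = F.conv_transpose2d(self_sim, weight=feat, stride=16)
print(out.shape)                      # (1, 1, 144, 144): overlapping, weighted copies of
                                      # the texture feature, i.e., an assembled canvas
```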

3. Modeling Artistic Workflows for Image Generation and Editing [PDF] Back to contents
  Hung-Yu Tseng, Matthew Fisher, Jingwan Lu, Yijun Li, Vladimir Kim, Ming-Hsuan Yang
Abstract: People often create art by following an artistic workflow involving multiple stages that inform the overall design. If an artist wishes to modify an earlier decision, significant work may be required to propagate this new decision forward to the final artwork. Motivated by the above observations, we propose a generative model that follows a given artistic workflow, enabling both multi-stage image generation as well as multi-stage image editing of an existing piece of art. Furthermore, for the editing scenario, we introduce an optimization process along with learning-based regularization to ensure the edited image produced by the model closely aligns with the originally provided image. Qualitative and quantitative results on three different artistic datasets demonstrate the effectiveness of the proposed framework on both image generation and editing tasks.

4. Multitask Learning Strengthens Adversarial Robustness [PDF] Back to contents
  Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, Carl Vondrick
Abstract: Although deep networks achieve strong accuracy on a range of computer vision benchmarks, they remain vulnerable to adversarial attacks, where imperceptible input perturbations fool the network. We present both theoretical and empirical analyses that connect the adversarial robustness of a model to the number of tasks that it is trained on. Experiments on two datasets show that attack difficulty increases as the number of target tasks increase. Moreover, our results suggest that when models are trained on multiple tasks at once, they become more robust to adversarial attacks on individual tasks. While adversarial defense remains an open challenge, our results suggest that deep networks are vulnerable partly because they are trained on too few tasks.

5. MeTRAbs: Metric-Scale Truncation-Robust Heatmaps for Absolute 3D Human Pose Estimation [PDF] Back to contents
  István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe
Abstract: Heatmap representations have formed the basis of human pose estimation systems for many years, and their extension to 3D has been a fruitful line of recent research. This includes 2.5D volumetric heatmaps, whose X and Y axes correspond to image space and Z to metric depth around the subject. To obtain metric-scale predictions, 2.5D methods need a separate post-processing step to resolve scale ambiguity. Further, they cannot localize body joints outside the image boundaries, leading to incomplete estimates for truncated images. To address these limitations, we propose metric-scale truncation-robust (MeTRo) volumetric heatmaps, whose dimensions are all defined in metric 3D space, instead of being aligned with image space. This reinterpretation of heatmap dimensions allows us to directly estimate complete, metric-scale poses without test-time knowledge of distance or relying on anthropometric heuristics, such as bone lengths. To further demonstrate the utility of our representation, we present a differentiable combination of our 3D metric-scale heatmaps with 2D image-space ones to estimate absolute 3D pose (our MeTRAbs architecture). We find that supervision via absolute pose loss is crucial for accurate non-root-relative localization. Using a ResNet-50 backbone without further learned layers, we obtain state-of-the-art results on Human3.6M, MPI-INF-3DHP and MuPoTS-3D. Our code will be made publicly available to facilitate further research.
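
A minimal sketch of reading a metric-scale 3D joint position out of such a volumetric heatmap via soft-argmax, where all three heatmap axes are interpreted directly as metric offsets around the subject; the 2 m cube extent and the soft-argmax readout are illustrative assumptions, not the paper's exact head:

```python
import torch

def soft_argmax_3d(heatmap, extent_m=2.0):
    """heatmap: (J, D, H, W) logits for J joints. Returns (J, 3) positions in meters."""
    J, D, H, W = heatmap.shape
    probs = heatmap.reshape(J, -1).softmax(dim=-1).reshape(J, D, H, W)
    # Metric coordinate ticks along each axis, centered on the subject.
    zs = torch.linspace(-extent_m / 2, extent_m / 2, D)
    ys = torch.linspace(-extent_m / 2, extent_m / 2, H)
    xs = torch.linspace(-extent_m / 2, extent_m / 2, W)
    x = (probs.sum(dim=(1, 2)) * xs).sum(dim=-1)  # marginalize, then take the expectation
    y = (probs.sum(dim=(1, 3)) * ys).sum(dim=-1)
    z = (probs.sum(dim=(2, 3)) * zs).sum(dim=-1)
    return torch.stack([x, y, z], dim=-1)

joints = soft_argmax_3d(torch.randn(17, 16, 16, 16))  # (17, 3) in meters
```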

6. Collaborative Unsupervised Domain Adaptation for Medical Image Diagnosis [PDF] Back to contents
  Yifan Zhang, Ying Wei, Qingyao Wu, Peilin Zhao, Shuaicheng Niu, Junzhou Huang, Mingkui Tan
Abstract: Deep learning based medical image diagnosis has shown great potential in clinical medicine. However, it often suffers from two major difficulties in real-world applications: 1) only limited labels are available for model training, due to expensive annotation costs over medical images; 2) labeled images may contain considerable label noise (e.g., mislabeling labels) due to diagnostic difficulties of diseases. To address these, we seek to exploit rich labeled data from relevant domains to help the learning in the target task via Unsupervised Domain Adaptation (UDA). Unlike most UDA methods that rely on clean labeled data or assume samples are equally transferable, we innovatively propose a Collaborative Unsupervised Domain Adaptation algorithm, which conducts transferability-aware adaptation and conquers label noise in a collaborative way. We theoretically analyze the generalization performance of the proposed method, and also empirically evaluate it on both medical and general images. Promising experimental results demonstrate the superiority and generalization of the proposed method.

7. Alpha-Net: Architecture, Models, and Applications [PDF] Back to contents
  Jishan Shaikh, Adya Sharma, Ankit Chouhan, Avinash Mahawar
Abstract: Deep learning network training is usually computationally expensive and intuitively complex. We present a novel network architecture for custom training and weight evaluations. We reformulate the layers as ResNet-similar blocks with certain inputs and outputs of their own; the blocks (called Alpha blocks) form their own network through their connection configuration, which, combined with our novel loss function and normalization function, yields the complete Alpha-Net architecture. We provide an empirical mathematical formulation of the network loss function for a better understanding of accuracy estimation and further optimizations. We implemented Alpha-Net with 4 different layer configurations to express the architecture behavior comprehensively. On a custom dataset based on the ImageNet benchmark, we evaluate Alpha-Net v1, v2, v3, and v4 for image recognition to give the accuracy of 78.2%, 79.1%, 79.5%, and 78.3% respectively. Alpha-Net v3 improves accuracy by approx. 3% over the previous state-of-the-art network, ResNet-50, on the ImageNet benchmark. We also present an analysis of our dataset with 256, 512, and 1024 layers and different versions of the loss function. Input representation is also crucial for training, as initial preprocessing will take only a handful of features to make training less complex than it needs to be. We also compared network behavior with different layer structures, different loss functions, and different normalization functions for better quantitative modeling of Alpha-Net.

8. Learning Accurate and Human-Like Driving using Semantic Maps and Attention [PDF] Back to contents
  Simon Hecker, Dengxin Dai, Alexander Liniger, Luc Van Gool
Abstract: This paper investigates how end-to-end driving models can be improved to drive more accurately and human-like. To tackle the first issue we exploit semantic and visual maps from HERE Technologies and augment the existing Drive360 dataset with such. The maps are used in an attention mechanism that promotes segmentation confidence masks, thus focusing the network on semantic classes in the image that are important for the current driving situation. Human-like driving is achieved using adversarial learning, by not only minimizing the imitation loss with respect to the human driver but by further defining a discriminator, that forces the driving model to produce action sequences that are human-like. Our models are trained and evaluated on the Drive360 + HERE dataset, which features 60 hours and 3000 km of real-world driving data. Extensive experiments show that our driving models are more accurate and behave more human-like than previous methods.

9. CenterNet3D: An Anchor free Object Detector for Autonomous Driving [PDF] Back to contents
  Guojun Wang, Bin Tian, Yunfeng Ai, Tong Xu, Long Chen, Dongpu Cao
Abstract: Accurate and fast 3D object detection from point clouds is a key task in autonomous driving. Existing one-stage 3D object detection methods can achieve real-time performance, however, they are dominated by anchor-based detectors which are inefficient and require additional post-processing. In this paper, we eliminate anchors and model an object as a single point: the center point of its bounding box. Based on the center point, we propose an anchor-free CenterNet3D network that performs 3D object detection without anchors. Our CenterNet3D uses keypoint estimation to find center points and directly regresses 3D bounding boxes. However, because of the inherent sparsity of point clouds, 3D object center points are likely to be in empty space, which makes it difficult to estimate accurate boundaries. To solve this issue, we propose an auxiliary corner attention module to enforce the CNN backbone to pay more attention to object boundaries, which is effective for obtaining more accurate bounding boxes. Besides, our CenterNet3D is Non-Maximum Suppression free, which makes it more efficient and simpler. On the KITTI benchmark, our proposed CenterNet3D achieves competitive performance with other one-stage anchor-based methods, demonstrating the efficacy of our proposed center point representation.
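
The NMS-free peak extraction can be sketched with a max-pooling trick common to CenterNet-style detectors; the window size and score threshold below are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=3, thresh=0.3):
    """heatmap: (1, 1, H, W) of center scores in [0, 1]. Returns (n, 2) peak (x, y) coords."""
    pooled = F.max_pool2d(heatmap, kernel_size=k, stride=1, padding=k // 2)
    peaks = (heatmap == pooled) & (heatmap > thresh)   # local maxima above a score threshold
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    return torch.stack([xs, ys], dim=-1)

centers = extract_peaks(torch.rand(1, 1, 200, 176))    # placeholder BEV score map
# Box size, height, and orientation would be regressed at these center locations.
```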

10. Breaking the Curse of Space Explosion: Towards Efficient NAS with Curriculum Search [PDF] Back to contents
  Yong Guo, Yaofo Chen, Yin Zheng, Peilin Zhao, Jian Chen, Junzhou Huang, Mingkui Tan
Abstract: Neural architecture search (NAS) has become an important approach to automatically find effective architectures. To cover all possible good architectures, we need to search in an extremely large search space with billions of candidate architectures. More critically, given a large search space, we may face a very challenging issue of space explosion. However, due to the limitation of computational resources, we can only sample a very small proportion of the architectures, which provides insufficient information for the training. As a result, existing methods may often produce suboptimal architectures. To alleviate this issue, we propose a curriculum search method that starts from a small search space and gradually incorporates the learned knowledge to guide the search in a large space. With the proposed search strategy, our Curriculum Neural Architecture Search (CNAS) method significantly improves the search efficiency and finds better architectures than existing NAS methods. Extensive experiments on CIFAR-10 and ImageNet demonstrate the effectiveness of the proposed method.

11. Deep Heterogeneous Autoencoder for Subspace Clustering of Sequential Data [PDF] Back to contents
  Abubakar Siddique, Reza Jalil Mozhdehi, Henry Medeiros
Abstract: We propose an unsupervised learning approach using a convolutional and fully connected autoencoder, which we call deep heterogeneous autoencoder, to learn discriminative features from segmentation masks and detection bounding boxes. To learn the mask shape information and its corresponding location in an input image, we extract coarse masks from a pretrained semantic segmentation network as well as their corresponding bounding boxes. We train the autoencoders jointly using task-dependent uncertainty weights to generate common latent features. The feature vector is then fed to the k-means clustering algorithm to separate the data points in the latent space. Finally, we incorporate additional penalties in the form of a constraints graph based on prior knowledge of the sequential data to increase clustering robustness. We evaluate the performance of our method using both synthetic and real world multi-object video datasets to demonstrate the applicability of our proposed model. Our results show that the proposed technique outperforms several state-of-the-art methods on challenging video sequences.
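
The task-dependent uncertainty weighting of the two reconstruction branches can be sketched with the standard homoscedastic-uncertainty formulation (learned log-variances per task); whether the paper uses exactly this form is an assumption, and the scalar losses below are placeholders:

```python
import torch

# Learned log-variances, one per task; in a real model these would be nn.Parameters.
log_var_mask = torch.zeros(1, requires_grad=True)
log_var_box = torch.zeros(1, requires_grad=True)

def weighted_joint_loss(loss_mask, loss_box):
    # L = sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2) learned per task
    return (torch.exp(-log_var_mask) * loss_mask + log_var_mask
            + torch.exp(-log_var_box) * loss_box + log_var_box)

total = weighted_joint_loss(torch.tensor(0.7), torch.tensor(1.3))  # placeholder branch losses
```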

12. Wavelet-Based Dual-Branch Network for Image Demoireing [PDF] Back to contents
  Lin Liu, Jianzhuang Liu, Shanxin Yuan, Gregory Slabaugh, Ales Leonardis, Wengang Zhou, Qi Tian
Abstract: When smartphone cameras are used to take photos of digital screens, moire patterns usually result, severely degrading photo quality. In this paper, we design a wavelet-based dual-branch network (WDNet) with a spatial attention mechanism for image demoireing. Existing image restoration methods working in the RGB domain have difficulty in distinguishing moire patterns from true scene texture. Unlike these methods, our network removes moire patterns in the wavelet domain to separate the frequencies of moire patterns from the image content. The network combines dense convolution modules and dilated convolution modules supporting large receptive fields. Extensive experiments demonstrate the effectiveness of our method, and we further show that WDNet generalizes to removing moire artifacts on non-screen images. Although designed for image demoireing, WDNet has been applied to two other low-level vision tasks, outperforming state-of-the-art image deraining and derain-drop methods on the Rain100h and Raindrop800 data sets, respectively.

13. Towards Dense People Detection with Deep Learning and Depth images [PDF] Back to contents
  David Fuentes-Jimenez, Cristina Losada-Gutierrez, David Casillas-Perez, Javier Macias-Guarasa, Roberto Martin-Lopez, Daniel Pizarro, Carlos A.Luna
Abstract: This paper proposes a DNN-based system that detects multiple people from a single depth image. Our neural network processes a depth image and outputs a likelihood map in image coordinates, where each detection corresponds to a Gaussian-shaped local distribution, centered at the person's head. The likelihood map encodes both the number of detected people and their 2D image positions, and can be used to recover the 3D position of each person using the depth image and the camera calibration parameters. Our architecture is compact, using separated convolutions to increase performance, and runs in real-time with low budget GPUs. We use simulated data for initially training the network, followed by fine tuning with a relatively small amount of real data. We show this strategy to be effective, producing networks that generalize to work with scenes different from those used during training. We thoroughly compare our method against the existing state-of-the-art, including both classical and DNN-based solutions. Our method outperforms existing methods and can accurately detect people in scenes with significant occlusions.
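
Recovering the 3D position from a likelihood-map peak is a plain pinhole back-projection using the depth image and the calibration parameters; the intrinsics and the constant depth map below are placeholders:

```python
import numpy as np

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

depth_image = np.full((480, 640), 3.2)   # placeholder depth map in meters
u, v = 320, 150                           # peak of one Gaussian in the likelihood map
p3d = backproject(u, v, depth_image[v, u], fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```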

14. An Uncertainty-based Human-in-the-loop System for Industrial Tool Wear Analysis [PDF] Back to contents
  Alexander Treiss, Jannis Walk, Niklas Kühl
Abstract: Convolutional neural networks have been shown to achieve superior performance on image segmentation tasks. However, convolutional neural networks, operating as black-box systems, generally do not provide a reliable measure about the confidence of their decisions. This leads to various problems in industrial settings, amongst others, inadequate levels of trust from users in the model's outputs as well as a non-compliance with current policy guidelines (e.g., EU AI Strategy). To address these issues, we use uncertainty measures based on Monte-Carlo dropout in the context of a human-in-the-loop system to increase the system's transparency and performance. In particular, we demonstrate the benefits described above on a real-world multi-class image segmentation task of wear analysis in the machining industry. Following previous work, we show that the quality of a prediction correlates with the model's uncertainty. Additionally, we demonstrate that a multiple linear regression using the model's uncertainties as independent variables significantly explains the quality of a prediction ($R^2=0.718$). Within the uncertainty-based human-in-the-loop system, the multiple regression aims at identifying failed predictions on an image-level. The system utilizes a human expert to label these failed predictions manually. A simulation study demonstrates that the uncertainty-based human-in-the-loop system increases performance for different levels of human involvement in comparison to a random-based human-in-the-loop system. To ensure generalizability, we show that the presented approach achieves similar results on the publicly available Cityscapes dataset.
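
The Monte-Carlo dropout uncertainty at the heart of the system can be sketched as follows; the toy model, the number of samples, and the routing threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, image, n_samples=20):
    """Keep dropout active at test time and average the sampled softmax outputs."""
    model.train()  # enables dropout; in practice, batch-norm layers are frozen separately
    with torch.no_grad():
        probs = torch.stack([model(image).softmax(dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)                                   # (B, C, H, W)
    entropy = -(mean * mean.clamp(min=1e-8).log()).sum(dim=1)  # per-pixel uncertainty
    return mean, entropy

# Image-level score that a regression or threshold can map to "route to a human expert".
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Dropout2d(0.5), nn.Conv2d(8, 2, 1))
mean, entropy = mc_dropout_predict(model, torch.randn(1, 3, 64, 64))
needs_review = entropy.mean().item() > 0.5  # hypothetical threshold
```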

15. Re-ranking for Writer Identification and Writer Retrieval [PDF] Back to contents
  Simon Jordan, Mathias Seuret, Pavel Král, Ladislav Lenc, Jiří Martínek, Barbara Wiermann, Tobias Schwinger, Andreas Maier, Vincent Christlein
Abstract: Automatic writer identification is a common problem in document analysis. State-of-the-art methods typically focus on the feature extraction step with traditional or deep-learning-based techniques. In retrieval problems, re-ranking is a commonly used technique to improve the results. Re-ranking refines an initial ranking result by using the knowledge contained in the ranked result, e.g., by exploiting nearest neighbor relations. To the best of our knowledge, re-ranking has not been used for writer identification/retrieval. A possible reason might be that publicly available benchmark datasets contain only a few samples per writer, which makes a re-ranking less promising. We show that a re-ranking step based on k-reciprocal nearest neighbor relationships is advantageous for writer identification, even if only a few samples per writer are available. We use these reciprocal relationships in two ways: encode them into new vectors, as originally proposed, or integrate them in terms of query-expansion. We show that both techniques outperform the baseline results in terms of mAP on three writer identification datasets.
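
Of the two uses described, the query-expansion variant is easy to sketch: find the query's k-reciprocal neighbors and average their descriptors into an expanded query. The value of k and the cosine-distance setup are illustrative choices:

```python
import numpy as np

def k_reciprocal_neighbors(dist, q, k=10):
    """Indices i such that i is in q's top-k AND q is in i's top-k."""
    topk_q = np.argsort(dist[q])[:k]
    return np.array([i for i in topk_q if q in np.argsort(dist[i])[:k]])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 128))                   # placeholder writer descriptors
X /= np.linalg.norm(X, axis=1, keepdims=True)
dist = 1.0 - X @ X.T                              # cosine distance matrix

q = 0
recip = k_reciprocal_neighbors(dist, q, k=10)
expanded = X[recip].mean(axis=0)                  # query expansion over reciprocal neighbors
expanded /= np.linalg.norm(expanded)
new_ranking = np.argsort(1.0 - X @ expanded)      # re-ranked retrieval list
```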

16. Pasadena: Perceptually Aware and Stealthy Adversarial Denoise Attack [PDF] Back to contents
  Yupeng Cheng, Qing Guo, Felix Juefei-Xu, Xiaofei Xie, Shang-Wei Lin, Weisi Lin, Wei Feng, Yang Liu
Abstract: Image denoising techniques have been widely employed in multimedia devices as an image post-processing operation that can remove sensor noise and produce visually clean images for further AI tasks, e.g., image classification. In this paper, we investigate a new task, adversarial denoise attack, that stealthily embeds attacks inside the image denoising module. Thus it can simultaneously denoise input images while fooling the state-of-the-art deep models. We formulate this new task as a kernel prediction problem and propose the adversarial-denoising kernel prediction that can produce adversarial-noiseless kernels for effective denoising and adversarial attacking simultaneously. Furthermore, we implement an adaptive perceptual region localization to identify semantic-related vulnerability regions with which the attack can be more effective while not doing too much harm to the denoising. Thus, our proposed method is termed as Pasadena (Perceptually Aware and Stealthy Adversarial DENoise Attack). We validate our method on the NeurIPS'17 adversarial competition dataset and demonstrate that our method not only realizes denoising but has advantages of high success rate and transferability over the state-of-the-art attacks.

17. Unsupervised Multi-Target Domain Adaptation Through Knowledge Distillation [PDF] Back to contents
  Le Thanh Nguyen-Meidine, Madhu Kiran, Jose Dolz, Eric Granger, Atif Bela, Louis-Antoine Blais-Morin
Abstract: Unsupervised domain adaptation (UDA) seeks to alleviate the problem of domain shift between the distribution of unlabeled data from the target domain w.r.t. labeled data from the source domain. While the single-target domain scenario is well studied in the UDA literature, the Multi-Target Domain Adaptation (MTDA) setting remains largely unexplored despite its importance. For instance, in video surveillance, each camera can correspond to a different viewpoint (target domain). The MTDA problem can be addressed by adapting one specialized model per target domain, although this solution is too costly in many applications. It has also been addressed by blending target data for multi-domain adaptation to train a common model, yet this may lead to a reduction in performance. In this paper, we propose a new unsupervised MTDA approach to train a common CNN that can generalize across multiple target domains. Our approach, the Multi-Teacher MTDA (MT-MTDA), relies on multi-teacher knowledge distillation (KD) in order to distill target domain knowledge from multiple teachers to a common student. Inspired by a common education scenario, a different target domain is assigned to each teacher model for UDA, and these teachers alternately distill their knowledge to one common student model. The KD process is performed in a progressive manner, where the student is trained by each teacher on how to perform UDA, instead of directly learning domain-adapted features. Finally, instead of directly combining the knowledge from each teacher, MT-MTDA alternates between teachers that distill knowledge in order to preserve the specificity of each target (teacher) when learning to adapt the student. MT-MTDA is compared against state-of-the-art methods on the OfficeHome, Office31 and Digits-5 datasets, and empirical results show that our proposed model can provide a considerably higher level of accuracy across multiple target domains.

18. UDBNET: Unsupervised Document Binarization Network via Adversarial Game [PDF] Back to contents
  Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, Partha Pratim Roy, Umapada Pal
Abstract: Degraded document image binarization is one of the most challenging tasks in the domain of document image analysis. In this paper, we present a novel approach towards document image binarization by introducing a three-player min-max adversarial game. We train the network in an unsupervised setup by assuming that we do not have any paired training data. In our approach, an Adversarial Texture Augmentation Network (ATANet) first superimposes the texture of a degraded reference image over a clean image. Later, the clean image along with its generated degraded version constitutes the pseudo paired data which is used to train the Unsupervised Document Binarization Network (UDBNet). Following this approach, we have enlarged the document binarization datasets, as this generates multiple images having the same content features but different textual features. These generated noisy images are then fed into the UDBNet to get back the clean version. The joint discriminator, which is the third player of our three-player min-max adversarial game, tries to couple both the ATANet and UDBNet. The three-player min-max adversarial game stops when the distributions modelled by the ATANet and the UDBNet align to the same joint distribution over time. Thus, the joint discriminator enforces the UDBNet to perform better on real degraded images. The experimental results indicate the superior performance of the proposed model over the existing state-of-the-art algorithm on the widely used DIBCO datasets. The source code of the proposed system is publicly available at this https URL.

19. Towards Realistic 3D Embedding via View Alignment [PDF] Back to contents
  Fangneng Zhan, Shijian Lu, Changgong Zhang, Feiying Ma, Xuansong Xie
Abstract: Recent advances in generative adversarial networks (GANs) have achieved great success in automated image composition, which generates new images by embedding foreground objects of interest into background images automatically. On the other hand, most existing works deal with foreground objects in two-dimensional (2D) images, though foreground objects in three-dimensional (3D) models are more flexible with 360-degree view freedom. This paper presents an innovative View Alignment GAN (VA-GAN) that composes new images by embedding 3D models into 2D background images realistically and automatically. VA-GAN consists of a texture generator and a differential discriminator that are inter-connected and end-to-end trainable. The differential discriminator guides the learning of geometric transformations from background images so that the composed 3D models can be aligned with the background images with realistic poses and views. The texture generator adopts a novel view encoding mechanism for generating accurate object textures for the 3D models under the estimated views. Extensive experiments over two synthesis tasks (car synthesis with KITTI and pedestrian synthesis with Cityscapes) show that VA-GAN achieves high-fidelity composition qualitatively and quantitatively as compared with state-of-the-art generation methods.

20. Unsupervised Human 3D Pose Representation with Viewpoint and Pose Disentanglement [PDF] Back to contents
  Qiang Nie, Ziwei Liu, Yunhui Liu
Abstract: Learning a good 3D human pose representation is important for human pose related tasks, e.g. human 3D pose estimation and action recognition. Within all these problems, preserving the intrinsic pose information and adapting to view variations are two critical issues. In this work, we propose a novel Siamese denoising autoencoder to learn a 3D pose representation by disentangling the pose-dependent and view-dependent feature from the human skeleton data, in a fully unsupervised manner. These two disentangled features are utilized together as the representation of the 3D pose. To consider both the kinematic and geometric dependencies, a sequential bidirectional recursive network (SeBiReNet) is further proposed to model the human skeleton data. Extensive experiments demonstrate that the learned representation 1) preserves the intrinsic information of human pose, 2) shows good transferability across datasets and tasks. Notably, our approach achieves state-of-the-art performance on two inherently different tasks: pose denoising and unsupervised action recognition. Code and models are available at this https URL.

21. RGB-D Salient Object Detection with Cross-Modality Modulation and Selection [PDF] Back to contents
  Chongyi Li, Runmin Cong, Yongri Piao, Qianqian Xu, Chen Change Loy
Abstract: We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD). The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from the RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features. First, we propose a cross-modality feature modulation (cmFM) module to enhance feature representations by taking the depth features as a prior, which models the complementary relations of RGB-D data. Second, we propose an adaptive feature selection (AFS) module to select saliency-related features and suppress the inferior ones. The AFS module exploits multi-modality spatial feature fusion, with the self-modality and cross-modality interdependencies of channel features taken into account. Third, we employ a saliency-guided position-edge attention (sg-PEA) module to encourage our network to focus more on saliency-related regions. The above modules as a whole, called the cmMS block, facilitate the refinement of saliency features in a coarse-to-fine fashion. Coupled with a bottom-up inference, the refined saliency features enable accurate and edge-preserving SOD. Extensive experiments demonstrate that our network outperforms state-of-the-art saliency detectors on six popular RGB-D SOD benchmarks.

22. Video Object Segmentation with Episodic Graph Memory Networks [PDF] Back to contents
  Xinkai Lu, Wenguan Wang, Martin Danelljan, Tianfei Zhou, Jianbing Shen, Luc Van Gool
Abstract: How to make a segmentation model efficiently adapt to a specific video and to online target appearance variations are fundamentally crucial issues in the field of video object segmentation. In this work, a novel graph memory network is developed to address the novel idea of "learning to update the segmentation model". Specifically, we exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges. Further, learnable controllers are embedded to ease memory reading and writing, as well as maintain a fixed memory scale. The structured, external memory design enables our model to comprehensively mine and quickly store new knowledge, even with limited visual information, and the differentiable memory controllers slowly learn an abstract method for storing useful representations in the memory and how to later use these representations for prediction, via gradient descent. In addition, the proposed graph memory network yields a neat yet principled framework, which can generalize well to both one-shot and zero-shot video object segmentation tasks. Extensive experiments on four challenging benchmark datasets verify that our graph memory is able to facilitate the adaptation of the segmentation network for case-by-case video object segmentation.

23. Correlation filter tracking with adaptive proposal selection for accurate scale estimation [PDF] Back to contents
  Luo Xiong, Yanjie Liang, Yan Yan, Hanzi Wang
Abstract: Recently, some correlation filter based trackers with detection proposals have achieved state-of-the-art tracking results. However, a large number of redundant proposals given by the proposal generator may degrade the performance and speed of these trackers. In this paper, we propose an adaptive proposal selection algorithm which can generate a small number of high-quality proposals to handle the problem of scale variations for visual object tracking. Specifically, we firstly utilize the color histograms in the HSV color space to represent the instances (i.e., the initial target in the first frame and the predicted target in the previous frame) and proposals. Then, an adaptive strategy based on the color similarity is formulated to select high-quality proposals. We further integrate the proposed adaptive proposal selection algorithm with coarse-to-fine deep features to validate the generalization and efficiency of the proposed tracker. Experiments on two benchmark datasets demonstrate that the proposed algorithm performs favorably against several state-of-the-art trackers.
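
The color-similarity test can be sketched with HSV histograms and histogram intersection; the bin counts, the OpenCV-style HSV value ranges, and the quantile-based adaptive cut are illustrative assumptions:

```python
import numpy as np

def hsv_histogram(patch_hsv, bins=(8, 8, 4)):
    """Normalized 3D color histogram of an HSV patch, flattened to a vector."""
    h, _ = np.histogramdd(patch_hsv.reshape(-1, 3), bins=bins,
                          range=((0, 180), (0, 256), (0, 256)))
    return (h / h.sum()).ravel()

def select_proposals(template_hist, proposal_patches, quantile=0.8):
    sims = np.array([np.minimum(template_hist, hsv_histogram(p)).sum()  # histogram intersection
                     for p in proposal_patches])
    return np.where(sims >= np.quantile(sims, quantile))[0]            # adaptive cut

template = hsv_histogram(np.random.randint(0, 180, (40, 40, 3)).astype(float))
props = [np.random.randint(0, 180, (40, 40, 3)).astype(float) for _ in range(50)]
keep = select_proposals(template, props)   # indices of high-quality proposals
```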

24. Pose2RGBD. Generating Depth and RGB images from absolute positions [PDF] Back to contents
  Mihai Cristian Pîrvu
Abstract: We propose a method at the intersection of the Computer Vision and Computer Graphics fields, which automatically generates RGBD images using neural networks, based on previously seen and synchronized video, depth and pose signals. Since the models must be able to reconstruct both texture (RGB) and structure (Depth), they create an implicit representation of the scene, as opposed to explicit ones, such as meshes or point clouds. The process can be thought of as neural rendering, where we obtain a function f : Pose -> RGBD, which we can use to navigate through the generated scene, similarly to graphics simulations. We introduce two new datasets, one based on synthetic data with full ground truth information, while the other one is recorded from a drone flight on a university campus, using only video and GPS signals. Finally, we propose a fully unsupervised method of generating datasets from videos alone, in order to train the Pose2RGBD networks. Code and datasets are available at this https URL.

25. Improving Face Recognition by Clustering Unlabeled Faces in the Wild [PDF] Back to contents
  Aruni RoyChowdhury, Xiang Yu, Kihyuk Sohn, Erik Learned-Miller, Manmohan Chandraker
Abstract: While deep face recognition has benefited significantly from large-scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior work has mostly been in controlled settings, where the labeled and unlabeled data sets have no overlapping identities by construction. This is not realistic in large-scale face recognition, where one must contend with such overlaps, the frequency of which increases with the volume of data. Ignoring identity overlap leads to significant labeling noise, as data from the same identity is split into multiple clusters. To address this, we propose a novel identity separation method based on extreme value theory. It is formulated as an out-of-distribution detection algorithm, and greatly reduces the problems caused by overlapping-identity label noise. Considering cluster assignments as pseudo-labels, we must also overcome the labeling noise from clustering errors. We propose a modulation of the cosine loss, where the modulation weights correspond to an estimate of clustering uncertainty. Extensive experiments on both controlled and real settings demonstrate our method's consistent improvements over supervised baselines, e.g., 11.6% improvement on IJB-A verification.
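
The uncertainty-modulated cosine loss can be sketched as a per-sample reweighting; the exact mapping from clustering uncertainty to a weight is the paper's contribution, so the plain multiplicative weight below is only an illustrative stand-in:

```python
import torch
import torch.nn.functional as F

def modulated_cosine_loss(embeddings, class_centers, pseudo_labels, cluster_conf):
    """cluster_conf in [0, 1]: 1 = clean cluster assignment, 0 = very uncertain."""
    emb = F.normalize(embeddings, dim=1)
    ctr = F.normalize(class_centers, dim=1)
    cos = (emb * ctr[pseudo_labels]).sum(dim=1)   # cosine to the assigned pseudo-class center
    per_sample = 1.0 - cos                        # cosine loss per sample
    return (cluster_conf * per_sample).mean()     # uncertainty-modulated average

loss = modulated_cosine_loss(torch.randn(8, 256), torch.randn(100, 256),
                             torch.randint(0, 100, (8,)), torch.rand(8))
```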

26. P-KDGAN: Progressive Knowledge Distillation with GANs for One-class Novelty Detection [PDF] Back to contents
  Zhiwei Zhang, Shifeng Chen, Lei Sun
Abstract: One-class novelty detection is to identify anomalous instances that do not conform to the expected normal instances. In this paper, Generative Adversarial Networks (GANs) based on an encoder-decoder-encoder pipeline are used for detection and achieve state-of-the-art performance. However, deep neural networks are too over-parameterized to deploy on resource-limited devices. Therefore, Progressive Knowledge Distillation with GANs (P-KDGAN) is proposed to learn compact and fast novelty detection networks. The P-KDGAN is a novel attempt to connect two standard GANs by the designed distillation loss for transferring knowledge from the teacher to the student. The progressive learning of knowledge distillation is a two-step approach that continuously improves the performance of the student GAN and achieves better performance than single-step methods. In the first step, the student GAN learns basic knowledge entirely from the teacher, guided by the pretrained teacher GAN with fixed weights. In the second step, joint fine-training is adopted for the knowledgeable teacher and student GANs to further improve the performance and stability. The experimental results on CIFAR-10, MNIST, and FMNIST show that our method improves the performance of the student GAN by 2.44%, 1.77%, and 1.73% when compressing the computation at ratios of 24.45:1, 311.11:1, and 700:1, respectively.

27. Learning Semantics-enriched Representation via Self-discovery, Self-classification, and Self-restoration [PDF] Back to contents
  Fatemeh Haghighi, Mohammad Reza Hosseinzadeh Taher, Zongwei Zhou, Michael B. Gotway, Jianming Liang
Abstract: Medical images are naturally associated with rich semantics about the human anatomy, reflected in an abundance of recurring anatomical patterns, offering unique potential to foster deep semantic representation learning and yield semantically more powerful models for different medical applications. But how exactly such strong yet free semantics embedded in medical images can be harnessed for self-supervised learning remains largely unexplored. To this end, we train deep models to learn semantically enriched visual representation by self-discovery, self-classification, and self-restoration of the anatomy underneath medical images, resulting in a semantics-enriched, general-purpose, pre-trained 3D model, named Semantic Genesis. We examine our Semantic Genesis with all the publicly-available pre-trained models, by either self-supervision or full supervision, on the six distinct target tasks, covering both classification and segmentation in various medical modalities (i.e., CT, MRI, and X-ray). Our extensive experiments demonstrate that Semantic Genesis significantly exceeds all of its 3D counterparts as well as the de facto ImageNet-based transfer learning in 2D. This performance is attributed to our novel self-supervised learning framework, encouraging deep models to learn compelling semantic representation from abundant anatomical patterns resulting from consistent anatomies embedded in medical images. Code and pre-trained Semantic Genesis are available at this https URL.

28. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance [PDF] 返回目录
  Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk, Tim Fingscheidt
Abstract: Self-supervised monocular depth estimation presents a powerful method to obtain 3D scene information from single camera images, which is trainable on arbitrary image sequences without requiring depth labels, e.g., from a LiDAR sensor. In this work we present a new self-supervised semantically-guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumptions typically made during training of such models. Specifically, we propose (i) mutually beneficial cross-domain training of (supervised) semantic segmentation and self-supervised depth estimation with task-specific network heads, (ii) a semantic masking scheme providing guidance to prevent moving DC objects from contaminating the photometric loss, and (iii) a detection method for frames with non-moving DC objects, from which the depth of DC objects can be learned. We demonstrate the performance of our method on several benchmarks, in particular on the Eigen split, where we exceed all baselines without test-time refinement in all measures.
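The semantic masking scheme in (ii) reduces, in essence, to zeroing the photometric loss wherever the segmentation head predicts a dynamic class. A minimal sketch, assuming PyTorch tensors and hypothetical class ids for the dynamic classes (e.g., car, pedestrian):

```python
import torch

def masked_photometric_loss(photo_err, seg, dc_ids=(11, 12)):
    """photo_err: (B,1,H,W) per-pixel photometric error; seg: (B,H,W) class ids."""
    dc_mask = torch.zeros_like(seg, dtype=torch.bool)
    for c in dc_ids:                              # mark dynamic-class pixels
        dc_mask |= seg == c
    keep = (~dc_mask).unsqueeze(1).float()        # 1 where static, 0 where DC
    # Moving DC objects are excluded so they cannot contaminate the loss.
    return (photo_err * keep).sum() / keep.sum().clamp(min=1.0)

err = torch.rand(2, 1, 64, 64)
seg = torch.randint(0, 20, (2, 64, 64))
print(masked_photometric_loss(err, seg).item())
```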

29. REPrune: Filter Pruning via Representative Election [PDF] 返回目录
  Mincheol Park, Woojeong Kim, Suhyun Kim
Abstract: Even though norm-based filter pruning methods are widely accepted, it is questionable whether the "smaller-norm-less-important" criterion is optimal in determining which filters to prune. Especially when we can keep only a small fraction of the original filters, it is more crucial to choose the filters that best represent the whole set of filters, regardless of norm values. Our novel pruning method, entitled "REPrune", addresses this problem by selecting representative filters via clustering. By selecting one filter from a cluster of similar filters and avoiding selecting adjacent large filters, REPrune can achieve a better compression rate with similar accuracy. Our method also recovers accuracy more rapidly and requires a smaller shift of filters during fine-tuning. Empirically, REPrune reduces more than 49% of FLOPs, with a 0.53% accuracy gain, on ResNet-110 for CIFAR-10. Also, REPrune reduces more than 41.8% of FLOPs with a 1.67% Top-1 validation loss on ResNet-18 for ImageNet.
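Electing a representative per cluster of similar filters can be sketched in a few lines. The sketch below assumes PyTorch for the layer and scikit-learn KMeans for clustering, and takes "representative" to mean the member filter closest to its cluster centroid, which is one plausible reading, not necessarily the paper's exact election rule:

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.cluster import KMeans

conv = nn.Conv2d(64, 128, 3)                          # layer to prune
W = conv.weight.detach().reshape(128, -1).numpy()     # one row per filter

k = 32                                                # number of filters to keep
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(W)

keep = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    # Elect the member filter nearest to the cluster centroid.
    d = np.linalg.norm(W[members] - km.cluster_centers_[c], axis=1)
    keep.append(members[d.argmin()])
print("representative filters:", sorted(keep)[:8], "...")
```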

30. Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations [PDF] 返回目录
  Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, Chao Yang
Abstract: Deep encoder-decoder based CNNs have advanced image inpainting methods for hole filling. While existing methods recover structures and textures step-by-step in the hole regions, they typically use two encoder-decoders for separate recovery. The CNN features of each encoder are learned to capture either missing structures or textures, without considering them as a whole. The insufficient utilization of these encoder features limits the performance of recovering both structures and textures. In this paper, we propose a mutual encoder-decoder CNN for joint recovery of both. We use CNN features from the deep and shallow layers of the encoder to represent the structures and textures of an input image, respectively. The deep-layer features are sent to a structure branch and the shallow-layer features are sent to a texture branch. In each branch, we fill holes in multiple scales of the CNN features. The filled CNN features from both branches are concatenated and then equalized. During feature equalization, we first reweight channel attentions and then propose a bilateral propagation activation function to enable spatial equalization. To this end, the filled CNN features of structure and texture mutually benefit each other to represent image content at all feature levels. We use the equalized features to supplement decoder features for output image generation through skip connections. Experiments on the benchmark datasets show the proposed method is effective at recovering structures and textures and performs favorably against state-of-the-art approaches.

31. A Graph-based Interactive Reasoning for Human-Object Interaction Detection [PDF] 返回目录
  Dongming Yang, Yuexian Zou
Abstract: Human-Object Interaction (HOI) detection is devoted to learning how humans interact with surrounding objects by inferring triplets of <human, verb, object>. However, recent HOI detection methods mostly rely on additional annotations (e.g., human pose) and neglect powerful interactive reasoning beyond convolutions. In this paper, we present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs, in which the interactive semantics implied among visual targets are efficiently exploited. The proposed model consists of a project function that maps related targets from convolution space to a graph-based semantic space, a message passing process that propagates semantics among all nodes, and an update function that transforms the reasoned nodes back to convolution space. Furthermore, we construct a new framework that assembles in-Graph models for detecting HOIs, namely in-GraphNet. Beyond inferring HOIs from instance features alone, the framework dynamically parses pairwise interactive semantics among visual targets by integrating two levels of in-Graphs, i.e., scene-wide and instance-wide in-Graphs. Our framework is end-to-end trainable and free from costly annotations such as human pose. Extensive experiments show that our proposed framework outperforms existing HOI detection methods on both the V-COCO and HICO-DET benchmarks and improves the baseline by about 9.4% and 15% relative, validating its efficacy in detecting HOIs.

32. AQD: Towards Accurate Quantized Object Detection [PDF] 返回目录
  Jing Liu, Bohan Zhuang, Peng Chen, Mingkui Tan, Chunhua Shen
Abstract: Network quantization aims to lower the bitwidth of weights and activations and hence reduce the model size and accelerate the inference of deep networks. Even though existing quantization methods have achieved promising performance on image classification, applying aggressively low bitwidth quantization on object detection while preserving the performance is still a challenge. In this paper, we demonstrate that the poor performance of the quantized network on object detection comes from the inaccurate batch statistics of batch normalization. To solve this, we propose an accurate quantized object detection (AQD) method. Specifically, we propose to employ multi-level batch normalization (multi-level BN) to estimate the batch statistics of each detection head separately. We further propose a learned interval quantization method to improve how the quantizer itself is configured. To evaluate the performance of the proposed methods, we apply AQD to two one-stage detectors (i.e., RetinaNet and FCOS). Experimental results on COCO show that our methods achieve near-lossless performance compared with the full-precision model by using extremely low bitwidth regimes such as 3-bit. In particular, we even outperform the full-precision counterpart by a large margin with a 4-bit detector, which is of great practical value.
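The multi-level BN idea amounts to sharing convolution weights across pyramid levels while keeping separate BatchNorm statistics per level. A hedged PyTorch sketch with assumed module names and sizes (the paper's head structure is more elaborate):

```python
import torch
import torch.nn as nn

class MultiLevelBNHead(nn.Module):
    """Shared conv weights, but one BatchNorm per feature-pyramid level."""
    def __init__(self, channels, num_levels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)   # shared weights
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_levels))

    def forward(self, feats):   # feats: list of per-level feature maps
        return [torch.relu(self.bns[i](self.conv(f))) for i, f in enumerate(feats)]

head = MultiLevelBNHead(32, num_levels=3)
feats = [torch.randn(2, 32, s, s) for s in (64, 32, 16)]
print([o.shape for o in head(feats)])
```

Separating the statistics this way keeps each level's activations properly normalized, which is what the abstract identifies as the failure mode of naively quantized detectors.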

33. 360$^\circ$ Depth Estimation from Multiple Fisheye Images with Origami Crown Representation of Icosahedron [PDF] 返回目录
  Ren Komatsu, Hiromitsu Fujii, Yusuke Tamura, Atsushi Yamashita, Hajime Asama
Abstract: In this study, we present a method for all-around depth estimation from multiple omnidirectional images for indoor environments. In particular, we focus on plane-sweeping stereo as the method for depth estimation from the images. We propose a new icosahedron-based representation and ConvNets for omnidirectional images, which we name "CrownConv" because the representation resembles a crown made of origami. CrownConv can be applied to both fisheye images and equirectangular images to extract features. Furthermore, we propose icosahedron-based spherical sweeping for generating the cost volume on an icosahedron from the extracted features. The cost volume is regularized using the three-dimensional CrownConv, and the final depth is obtained by depth regression from the cost volume. Our proposed method is robust to camera alignments by using the extrinsic camera parameters; therefore, it can achieve precise depth estimation even when the camera alignment differs from that in the training dataset. We evaluate the proposed model on synthetic datasets and demonstrate its effectiveness. As our proposed method is computationally efficient, the depth is estimated from four fisheye images in less than a second using a laptop with a GPU. Therefore, it is suitable for real-world robotics applications. Our source code is available at this https URL.

34. Joint Layout Analysis, Character Detection and Recognition for Historical Document Digitization [PDF] 返回目录
  Weihong Ma, Hesuo Zhang, Lianwen Jin, Sihang Wu, Jiapeng Wang, Yongpan Wang
Abstract: In this paper, we propose an end-to-end trainable framework for restoring historical document content that follows the correct reading order. In this framework, two branches, named the character branch and the layout branch, are added after the feature extraction network. The character branch localizes individual characters in a document image and recognizes them simultaneously. Then we adopt a post-processing method to group them into text lines. The layout branch, based on a fully convolutional network, outputs a binary mask. We then use the Hough transform for line detection on the binary mask and combine the character results with the layout information to restore the document content. These two branches can be trained in parallel and are easy to train. Furthermore, we propose a re-score mechanism to minimize recognition error. Experimental results on the extended Chinese historical document MTHv2 dataset demonstrate the effectiveness of the proposed framework.

35. Knowledge Distillation for Multi-task Learning [PDF] 返回目录
  Wei-Hong Li, Hakan Bilen
Abstract: Multi-task learning (MTL) learns a single model that performs multiple tasks, achieving good performance on all tasks at a lower computational cost. Learning such a model requires jointly optimizing the losses of a set of tasks with different difficulty levels, magnitudes, and characteristics (e.g., cross-entropy, Euclidean loss), leading to the imbalance problem in multi-task learning. To address the imbalance problem, we propose a knowledge distillation based method in this work. We first learn a task-specific model for each task. We then train the multi-task model to minimize the task-specific losses and to produce the same features as the task-specific models. As each task-specific network encodes different features, we introduce small task-specific adaptors to project the multi-task features to the task-specific features. In this way, the adaptors align the task-specific feature and the multi-task feature, which enables balanced parameter sharing across tasks. Extensive experimental results demonstrate that our method can optimize a multi-task learning model in a more balanced way and achieve better overall performance.
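A sketch of the task-specific adaptor idea: a small projection maps the shared multi-task feature onto each (frozen) single-task teacher's feature space, and an L2 loss aligns them. All module sizes and names below are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())   # multi-task trunk
adaptors = nn.ModuleList(nn.Conv2d(32, 32, 1) for _ in range(2))    # one per task
teachers = [nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU()).eval()
            for _ in range(2)]                                      # pre-trained, frozen

x = torch.randn(4, 3, 64, 64)
f = shared(x)
distill = torch.zeros(())
for adapt, teacher in zip(adaptors, teachers):
    with torch.no_grad():
        f_task = teacher(x)                  # task-specific feature to imitate
    distill = distill + F.mse_loss(adapt(f), f_task)
# total loss = sum of task losses + distill (the task losses are omitted here)
print(distill.item())
```

Because each adaptor absorbs the per-task differences, the shared trunk can be updated with a distillation term of the same form and scale for every task, which is what makes the sharing balanced.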

36. JSENet: Joint Semantic Segmentation and Edge Detection Network for 3D Point Clouds [PDF] 返回目录
  Zeyu Hu, Mingmin Zhen, Xuyang Bai, Hongbo Fu, Chiew-lan Tai
Abstract: Semantic segmentation and semantic edge detection can be seen as two dual problems with close relationships in computer vision. Despite the fast evolution of learning-based 3D semantic segmentation methods, little attention has been drawn to the learning of 3D semantic edge detectors, even less to a joint learning method for the two tasks. In this paper, we tackle the 3D semantic edge detection task for the first time and present a new two-stream fully-convolutional network that jointly performs the two tasks. In particular, we design a joint refinement module that explicitly wires region information and edge information to improve the performances of both tasks. Further, we propose a novel loss function that encourages the network to produce semantic segmentation results with better boundaries. Extensive evaluations on S3DIS and ScanNet datasets show that our method achieves on par or better performance than the state-of-the-art methods for semantic segmentation and outperforms the baseline methods for semantic edge detection. Code release: this https URL

37. Visual Tracking by TridentAlign and Context Embedding [PDF] 返回目录
  Janghoon Choi, Junseok Kwon, Kyoung Mu Lee
Abstract: Recent advances in Siamese network-based visual tracking methods have enabled high performance on numerous tracking benchmarks. However, extensive scale variations of the target object and distractor objects with similar categories have consistently posed challenges in visual tracking. To address these persisting issues, we propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods. The TridentAlign module facilitates adaptability to extensive scale variations and large deformations of the target, where it pools the feature representation of the target object into multiple spatial dimensions to form a feature pyramid, which is then utilized in the region proposal stage. Meanwhile, context embedding module aims to discriminate the target from distractor objects by accounting for the global context information among objects. The context embedding module extracts and embeds the global context information of a given frame into a local feature representation such that the information can be utilized in the final classification stage. Experimental results obtained on multiple benchmark datasets show that the performance of the proposed tracker is comparable to that of state-of-the-art trackers, while the proposed tracker runs at real-time speed.

38. Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets [PDF] 返回目录
  Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
Abstract: A wide range of image captioning models has been developed, achieving significant improvement based on popular metrics, such as BLEU, CIDEr, and SPICE. However, although the generated captions can accurately describe the image, they are generic for similar images and lack distinctiveness, i.e., cannot properly describe the uniqueness of each image. In this paper, we aim to improve the distinctiveness of image captions through training with sets of similar images. First, we propose a distinctiveness metric -- between-set CIDEr (CIDErBtw) to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric shows that the human annotations of each image are not equivalent based on distinctiveness. Thus we propose several new training strategies to encourage the distinctiveness of the generated caption for each image, which are based on using CIDErBtw in a weighted loss function or as a reinforcement learning reward. Finally, extensive experiments are conducted, showing that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.
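One of the proposed training strategies, using CIDErBtw in a weighted loss, can be sketched as re-weighting each ground-truth caption's cross-entropy by its distinctiveness: a lower CIDErBtw score means the caption is more distinctive and should count more. The exponential weighting and the `ciderbtw` stand-in below are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def weighted_caption_loss(logits, targets, ciderbtw, alpha=1.0):
    """logits: (B, T, V) token scores; targets: (B, T); ciderbtw: (B,)."""
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none").mean(1)
    weights = torch.exp(-alpha * ciderbtw)   # generic (high-CIDErBtw) captions count less
    return (weights * ce).mean()

logits = torch.randn(4, 12, 1000)
targets = torch.randint(0, 1000, (4, 12))
ciderbtw = torch.rand(4)                     # per-caption between-set CIDEr scores
print(weighted_caption_loss(logits, targets, ciderbtw).item())
```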

39. Alleviating Over-segmentation Errors by Detecting Action Boundaries [PDF] 返回目录
  Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, Hirokatsu Kataoka
Abstract: We propose an effective framework for the temporal action segmentation task, namely an Action Segment Refinement Framework (ASRF). Our model architecture consists of a long-term feature extractor and two branches: the Action Segmentation Branch (ASB) and the Boundary Regression Branch (BRB). The long-term feature extractor provides shared features for the two branches with a wide temporal receptive field. The ASB classifies video frames with action classes, while the BRB regresses the action boundary probabilities. The action boundaries predicted by the BRB refine the output from the ASB, which results in a significant performance improvement. Our contributions are three-fold: (i) We propose a framework for temporal action segmentation, the ASRF, which divides temporal action segmentation into frame-wise action classification and action boundary regression. Our framework refines frame-level hypotheses of action classes using predicted action boundaries. (ii) We propose a loss function for smoothing the transition of action probabilities, and analyze combinations of various loss functions for temporal action segmentation. (iii) Our framework outperforms state-of-the-art methods on three challenging datasets, offering an improvement of up to 13.7% in terms of segmental edit distance and up to 16.1% in terms of segmental F1 score. Our code will be publicly available soon.
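The refinement step can be sketched simply: split the timeline wherever the boundary branch fires, then re-label each resulting segment by the majority class from the segmentation branch. A minimal NumPy sketch with an assumed threshold (the paper's exact relabeling strategy may differ in detail):

```python
import numpy as np

def refine_with_boundaries(frame_probs, boundary_probs, thresh=0.5):
    """frame_probs: (T, C) class probabilities; boundary_probs: (T,)."""
    labels = frame_probs.argmax(axis=1)
    cuts = [0] + list(np.where(boundary_probs > thresh)[0]) + [len(labels)]
    refined = labels.copy()
    for s, e in zip(cuts[:-1], cuts[1:]):
        if e > s:
            # Majority vote inside each boundary-delimited segment removes
            # the short spurious runs that cause over-segmentation errors.
            refined[s:e] = np.bincount(labels[s:e]).argmax()
    return refined

rng = np.random.default_rng(0)
print(refine_with_boundaries(rng.random((100, 5)), rng.random(100))[:20])
```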

40. BUNET: Blind Medical Image Segmentation Based on Secure UNET [PDF] 返回目录
  Song Bian, Xiaowei Xu, Weiwen Jiang, Yiyu Shi, Takashi Sato
Abstract: The strict security requirements placed on medical records by various privacy regulations become major obstacles in the age of big data. To ensure efficient machine learning as a service schemes while protecting data confidentiality, in this work, we propose blind UNET (BUNET), a secure protocol that implements privacy-preserving medical image segmentation based on the UNET architecture. In BUNET, we efficiently utilize cryptographic primitives such as homomorphic encryption and garbled circuits (GC) to design a complete secure protocol for the UNET neural architecture. In addition, we perform extensive architectural search in reducing the computational bottleneck of GC-based secure activation protocols with high-dimensional input data. In the experiment, we thoroughly examine the parameter space of our protocol, and show that we can achieve up to 14x inference time reduction compared to the-state-of-the-art secure inference technique on a baseline architecture with negligible accuracy degradation.

41. Topology-Change-Aware Volumetric Fusion for Dynamic Scene Reconstruction [PDF] 返回目录
  Chao Li, Xiaohu Guo
Abstract: Topology change is a challenging problem for 4D reconstruction of dynamic scenes. In the classic volumetric fusion-based framework, a mesh is usually extracted from the TSDF volume as the canonical surface representation to help estimating deformation field. However, the surface and Embedded Deformation Graph (EDG) representations bring conflicts under topology changes since the surface mesh has fixed-connectivity but the deformation field can be discontinuous. In this paper, the classic framework is re-designed to enable 4D reconstruction of dynamic scene under topology changes, by introducing a novel structure of Non-manifold Volumetric Grid to the re-design of both TSDF and EDG, which allows connectivity updates by cell splitting and replication. Experiments show convincing reconstruction results for dynamic scenes of topology changes, as compared to the state-of-the-art methods.

42. Socially and Contextually Aware Human Motion and Pose Forecasting [PDF] 返回目录
  Vida Adeli, Ehsan Adeli, Ian Reid, Juan Carlos Niebles, Hamid Rezatofighi
Abstract: Smooth and seamless robot navigation while interacting with humans depends on predicting human movements. Forecasting such human dynamics often involves modeling human trajectories (global motion) or detailed body joint movements (local motion). Prior work typically tackled local and global human movements separately. In this paper, we propose a novel framework to tackle both tasks of human motion (or trajectory) and body skeleton pose forecasting in a unified end-to-end pipeline. To deal with this real-world problem, we consider incorporating both scene and social contexts, as critical clues for this prediction task, into our proposed framework. To this end, we first couple these two tasks by i) encoding their history using a shared Gated Recurrent Unit (GRU) encoder and ii) applying a metric as loss, which measures the source of errors in each task jointly as a single distance. Then, we incorporate the scene context by encoding a spatio-temporal representation of the video data. We also include social clues by generating a joint feature representation from motion and pose of all individuals from the scene using a social pooling layer. Finally, we use a GRU based decoder to forecast both motion and skeleton pose. We demonstrate that our proposed framework achieves a superior performance compared to several baselines on two social datasets.

43. Face to Purchase: Predicting Consumer Choices with Structured Facial and Behavioral Traits Embedding [PDF] 返回目录
  Zhe Liu, Xianzhi Wang, Lina Yao, Jake An, Lei Bai, Ee-Peng Lim
Abstract: Predicting consumers' purchasing behaviors is critical for targeted advertisement and sales promotion in e-commerce. Human faces are an invaluable source of information for gaining insights into consumer personality and behavioral traits. However, consumer's faces are largely unexplored in previous research, and the existing face-related studies focus on high-level features such as personality traits while neglecting the business significance of learning from facial data. We propose to predict consumers' purchases based on their facial features and purchasing histories. We design a semi-supervised model based on a hierarchical embedding network to extract high-level features of consumers and to predict the top-$N$ purchase destinations of a consumer. Our experimental results on a real-world dataset demonstrate the positive effect of incorporating facial information in predicting consumers' purchasing behaviors.

44. Top-Related Meta-Learning Method for Few-Shot Detection [PDF] 返回目录
  Qian Li, Nan Guo, Duo Wang, Xiaochun Ye
Abstract: Many meta-learning methods for few-shot detection depend on large amounts of data and many parameters, and therefore come at a high cost. Moreover, because of category imbalance and scarce features, previous methods suffer from obvious problems: strong bias and poor classification in few-shot detection. Therefore, for meta-learning-based few-shot detection, we propose TCL, which exploits the true-label example and the example with the most similar semantics, together with a category-based grouping mechanism that groups categories by appearance and environment to enhance the semantic features among similar categories. Training consists of a base-classes phase and a fine-tuning phase. During training, the meta-features related to a category are regarded as the weights of the prediction layer of the detection model, and meta-features with a shared distribution among the categories within a group are exploited to improve detection performance. We split category-related meta-features into groups such that the distribution difference between groups is large while that within each group is small. Experimental results on the Pascal VOC dataset demonstrate that our method, which combines TCL with category-based grouping, significantly outperforms previous state-of-the-art methods for 1- and 2-shot detection, and obtains a detection AP of almost 30% for 3-shot detection. For 1-shot detection in particular, experiments show that our method achieves a detection AP of 20%, clearly outperforming most previous methods.

45. A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection [PDF] 返回目录
  Xiaoqi Zhao, Lihe Zhang, Youwei Pang, Huchuan Lu, Lei Zhang
Abstract: Existing RGB-D salient object detection (SOD) approaches concentrate on the cross-modal fusion between the RGB stream and the depth stream. They do not deeply explore the effect of the depth map itself. In this work, we design a single stream network to directly use the depth map to guide early fusion and middle fusion between RGB and depth, which saves the feature encoder of the depth stream and achieves a lightweight and real-time model. We tactfully utilize depth information from two perspectives: (1) Overcoming the incompatibility problem caused by the great difference between modalities, we build a single stream encoder to achieve the early fusion, which can take full advantage of ImageNet pre-trained backbone model to extract rich and discriminative features. (2) We design a novel depth-enhanced dual attention module (DEDA) to efficiently provide the fore-/back-ground branches with the spatially filtered features, which enables the decoder to optimally perform the middle fusion. Besides, we put forward a pyramidally attended feature extraction module (PAFE) to accurately localize the objects of different scales. Extensive experiments demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics. Furthermore, this model is 55.5\% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.

46. DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection [PDF] 返回目录
  Ehsan Asali, Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, Prasanth Sengadu Suresh, Hamid R. Arabnia
Abstract: For recognizing speakers in video streams, significant research has been devoted to obtaining rich machine learning models by extracting high-level speaker features such as facial expression, emotion, and gender. However, generating such a model is not feasible using only single-modality feature extractors that exploit either audio signals or image frames extracted from video streams. In this paper, we address this problem from a different perspective and propose an unprecedented multimodality data fusion framework called DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection. We execute DeepMSRF by feeding it features of the two modalities, namely speakers' audios and face images. DeepMSRF uses a two-stream VGGNET trained on both modalities to reach a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF on a subset of the VoxCeleb2 dataset with its metadata merged with the VGGFace2 dataset. The goal of DeepMSRF is to identify the gender of the speaker first, and then to recognize his or her name for any given video stream. The experimental results illustrate that DeepMSRF outperforms single-modality speaker recognition methods by at least 3 percent in accuracy.

47. TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning [PDF] 返回目录
  Xinwei Sun, Yilun Xu, Peng Cao, Yuqing Kong, Lingjing Hu, Shanghang Zhang, Yizhou Wang
Abstract: Fusing data from multiple modalities provides more information for training machine learning systems. However, it is prohibitively expensive and time-consuming to label each modality with a large amount of data, which leads to the crucial problem of semi-supervised multi-modal learning. Existing methods suffer from either ineffective fusion across modalities or a lack of theoretical guarantees under proper assumptions. In this paper, we propose a novel information-theoretic approach, namely \textbf{T}otal \textbf{C}orrelation \textbf{G}ain \textbf{M}aximization (TCGM), for semi-supervised multi-modal learning, which is endowed with promising properties: (i) it can effectively utilize information across different modalities of unlabeled data points to facilitate training classifiers for each modality; (ii) it has a theoretical guarantee of identifying Bayesian classifiers, i.e., the ground-truth posteriors of all modalities. Specifically, by maximizing the TC-induced loss (namely the TC gain) over the classifiers of all modalities, these classifiers can cooperatively discover the equivalence class of ground-truth classifiers, and identify the unique ones by leveraging a limited percentage of labeled data. We apply our method to various tasks and achieve state-of-the-art results, including news classification, emotion recognition and disease prediction.

48. Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner [PDF] 返回目录
  Eugene Lee, Evan Chen, Chen-Yi Lee
Abstract: Remote heart rate estimation is the measurement of heart rate without any physical contact with the subject and is accomplished using remote photoplethysmography (rPPG) in this work. rPPG signals are usually collected using a video camera with a limitation of being sensitive to multiple contributing factors, e.g. variation in skin tone, lighting condition and facial structure. End-to-end supervised learning approach performs well when training data is abundant, covering a distribution that doesn't deviate too much from the distribution of testing data or during deployment. To cope with the unforeseeable distributional changes during deployment, we propose a transductive meta-learner that takes unlabeled samples during testing (deployment) for a self-supervised weight adjustment (also known as transductive inference), providing fast adaptation to the distributional changes. Using this approach, we achieve state-of-the-art performance on MAHNOB-HCI and UBFC-rPPG.

49. Vehicle Trajectory Prediction by Transfer Learning of Semi-Supervised Models [PDF] 返回目录
  Nick Lamm, Shashank Jaiprakash, Malavika Srikanth, Iddo Drori
Abstract: In this work we show that semi-supervised models for vehicle trajectory prediction significantly improve performance over supervised models on state-of-the-art real-world benchmarks. Moving from supervised to semi-supervised models allows scaling-up by using unlabeled data, increasing the number of images in pre-training from Millions to a Billion. We perform ablation studies comparing transfer learning of semi-supervised and supervised models while keeping all other factors equal. Within semi-supervised models we compare contrastive learning with teacher-student methods as well as networks predicting a small number of trajectories with networks predicting probabilities over a large trajectory set. Our results using both low-level and mid-level representations of the driving environment demonstrate the applicability of semi-supervised methods for real-world vehicle trajectory prediction.

50. Semi-supervised Learning with a Teacher-student Network for Generalized Attribute Prediction [PDF] 返回目录
  Minchul Shin
Abstract: This paper presents a study on semi-supervised learning to solve the visual attribute prediction problem. In many applications of vision algorithms, the precise recognition of visual attributes of objects is important but still challenging. This is because defining a class hierarchy of attributes is ambiguous, so training data inevitably suffer from class imbalance and label sparsity, leading to a lack of effective annotations. An intuitive solution is to find a method to effectively learn image representations by utilizing unlabeled images. With that in mind, we propose a multi-teacher-single-student (MTSS) approach inspired by the multi-task learning and the distillation of semi-supervised learning. Our MTSS learns task-specific domain experts called teacher networks using the label embedding technique and learns a unified model called a student network by forcing a model to mimic the distributions learned by domain experts. Our experiments demonstrate that our method not only achieves competitive performance on various benchmarks for fashion attribute prediction, but also improves robustness and cross-domain adaptability for unseen domains.

51. Patch-wise Attack for Fooling Deep Neural Network [PDF] 返回目录
  Lianli Gao, Qilong Zhang, Jingkuan Song, Xianglong Liu, Heng Tao Shen
Abstract: By adding human-imperceptible noise to clean images, the resultant adversarial examples can fool other unknown models. Features of a pixel extracted by deep neural networks (DNNs) are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. Motivated by this, we propose a patch-wise iterative algorithm -- a black-box attack towards mainstream normally trained and defense models, which differs from the existing attack methods manipulating pixel-wise noise. In this way, without sacrificing the performance of white-box attack, our adversarial examples can have strong transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. Our method can be generally integrated to any gradient-based attack methods. Compared with the current state-of-the-art attacks, we significantly improve the success rate by 9.2\% for defense models and 3.7\% for normally trained models on average. Our code is available at \url{this https URL}
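The amplification-and-projection idea can be sketched as an iterative sign attack whose per-step budget exceeds $\epsilon/T$, with the gradient mass that would overflow the $\epsilon$-ball redistributed to neighboring pixels by a small kernel. This is a simplified reading of the method, not the released code; the uniform kernel and the constants below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchwise_attack(model, x, y, eps=16/255, steps=10, beta=10.0, ksize=3):
    alpha = beta * eps / steps                      # amplified step size
    kernel = torch.ones(3, 1, ksize, ksize) / (ksize * ksize)
    adv = x.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(adv), y), adv)[0]
        delta = adv.detach() + alpha * grad.sign() - x
        over = delta - delta.clamp(-eps, eps)       # noise overflowing the eps-ball
        spread = F.conv2d(over, kernel, padding=ksize // 2, groups=3)
        delta = (delta - over + spread).clamp(-eps, eps)  # reassign overflow nearby
        adv = (x + delta).clamp(0, 1).detach()
    return adv

# Toy usage with a hypothetical classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(2, 3, 32, 32), torch.randint(0, 10, (2,))
adv = patchwise_attack(model, x, y)
print((adv - x).abs().max().item())   # stays within the eps budget
```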

52. Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting [PDF] 返回目录
  Bindita Chaudhuri, Noranart Vesdapunt, Linda Shapiro, Baoyuan Wang
Abstract: Traditional methods for image-based 3D face reconstruction and facial motion retargeting fit a 3D morphable model (3DMM) to the face, which has limited modeling capacity and fail to generalize well to in-the-wild data. Use of deformation transfer or multilinear tensor as a personalized 3DMM for blendshape interpolation does not address the fact that facial expressions result in different local and global skin deformations in different persons. Moreover, existing methods learn a single albedo per user which is not enough to capture the expression-specific skin reflectance variations. We propose an end-to-end framework that jointly learns a personalized face model per user and per-frame facial motion parameters from a large corpus of in-the-wild videos of user expressions. Specifically, we learn user-specific expression blendshapes and dynamic (expression-specific) albedo maps by predicting personalized corrections on top of a 3DMM prior. We introduce novel constraints to ensure that the corrected blendshapes retain their semantic meanings and the reconstructed geometry is disentangled from the albedo. Experimental results show that our personalization accurately captures fine-grained facial dynamics in a wide range of conditions and efficiently decouples the learned face model from facial motion, resulting in more accurate face reconstruction and facial motion retargeting compared to state-of-the-art methods.

53. JNR: Joint-based Neural Rig Representation for Compact 3D Face Modeling [PDF] 返回目录
  Noranart Vesdapunt, Mitch Rundle, HsiangTao Wu, Baoyuan Wang
Abstract: In this paper, we introduce a novel approach to learn a 3D face model using a joint-based face rig and a neural skinning network. Thanks to the joint-based representation, our model enjoys some significant advantages over prior blendshape-based models. First, it is very compact such that we are orders of magnitude smaller while still keeping strong modeling capacity. Second, because each joint has its semantic meaning, interactive facial geometry editing is made easier and more intuitive. Third, through skinning, our model supports adding mouth interior and eyes, as well as accessories (hair, eye glasses, etc.) in a simpler, more accurate and principled way. We argue that because the human face is highly structured and topologically consistent, it does not need to be learned entirely from data. Instead we can leverage prior knowledge in the form of a human-designed 3D face rig to reduce the data dependency, and learn a compact yet strong face model from only a small dataset (less than one hundred 3D scans). To further improve the modeling capacity, we train a skinning weight generator through adversarial learning. Experiments on fitting high-quality 3D scans (both neutral and expressive), noisy depth images, and RGB images demonstrate that its modeling capacity is on-par with state-of-the-art face models, such as FLAME and Facewarehouse, even though the model is 10 to 20 times smaller. This suggests broad value in both graphics and vision applications on mobile and edge devices.

54. Water level prediction from social media images with a multi-task ranking approach [PDF] 返回目录
  P. Chaudhary, S. D'Aronco, J.P. Leitao, K. Schindler, J.D. Wegner
Abstract: Floods are among the most frequent and catastrophic natural disasters and affect millions of people worldwide. It is important to create accurate flood maps to plan (offline) and conduct (real-time) flood mitigation and flood rescue operations. Arguably, images collected from social media can provide useful information for that task, which would otherwise be unavailable. We introduce a computer vision system that estimates water depth from social media images taken during flooding events, in order to build flood maps in (near) real-time. We propose a multi-task (deep) learning approach, where a model is trained using both a regression and a pairwise ranking loss. Our approach is motivated by the observation that the main bottleneck for image-based flood level estimation is training data: it is difficult and requires a lot of effort to annotate uncontrolled images with the correct water depth. We demonstrate how to efficiently learn a predictor from a small set of annotated water levels and a larger set of weaker annotations that only indicate in which of two images the water level is higher, and which are much easier to obtain. Moreover, we provide a new dataset, named DeepFlood, with 8145 annotated ground-level images, and show that the proposed multi-task approach can predict the water level from a single, crowd-sourced image with ~11 cm root mean square error.
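The joint objective is easy to sketch: the same regression head is trained with MSE on the few annotated depths and with a margin ranking loss on the many weakly labeled image pairs. Names, the toy backbone, and the margin below are illustrative assumptions:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))   # toy depth regressor
mse, rank = nn.MSELoss(), nn.MarginRankingLoss(margin=0.1)

x_reg = torch.rand(8, 3, 64, 64); depth = torch.rand(8, 1)          # annotated depths
x_hi, x_lo = torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64)     # pairs: hi has more water

pred = backbone(x_reg)
p_hi, p_lo = backbone(x_hi), backbone(x_lo)
target = torch.ones_like(p_hi)                 # ranking loss pushes p_hi above p_lo
loss = mse(pred, depth) + rank(p_hi, p_lo, target)
loss.backward()
print(loss.item())
```

The ranking term only constrains the ordering of predictions, which is exactly the information the weak pairwise annotations carry.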

55. DETCID: Detection of Elongated Touching Cells with Inhomogeneous Illumination using a Deep Adversarial Network [PDF] 返回目录
  Ali Memariani, Ioannis A. Kakadiaris
Abstract: Clostridioides difficile infection (C. diff) is the most common cause of death due to secondary infection in hospital patients in the United States. Detection of C. diff cells in scanning electron microscopy (SEM) images is an important task for quantifying the efficacy of under-development treatments. However, detecting C. diff cells in SEM images is a challenging problem due to the presence of inhomogeneous illumination and occlusion. An illumination-normalization pre-processing step destroys texture and adds noise to the image. Furthermore, cells are often clustered together, resulting in touching cells and occlusion. In this paper, DETCID, a deep cell detection method using adversarial training, specifically robust to inhomogeneous illumination and occlusion, is proposed. An adversarial network is developed to provide region proposals and pass the proposals to a feature extraction network. Furthermore, a modified IoU metric is developed to allow the detection of touching cells in various orientations. The results indicate that DETCID outperforms the state-of-the-art in detection of touching cells in SEM images by at least a 20 percent improvement in mean average precision.

56. Embedded Encoder-Decoder in Convolutional Networks Towards Explainable AI [PDF] 返回目录
  Amirhossein Tavanaei
Abstract: Understanding intermediate layers of a deep learning model and discovering the driving features of stimuli have attracted much interest, recently. Explainable artificial intelligence (XAI) provides a new way to open an AI black box and makes a transparent and interpretable decision. This paper proposes a new explainable convolutional neural network (XCNN) which represents important and driving visual features of stimuli in an end-to-end model architecture. This network employs encoder-decoder neural networks in a CNN architecture to represent regions of interest in an image based on its category. The proposed model is trained without localization labels and generates a heat-map as part of the network architecture without extra post-processing steps. The experimental results on the CIFAR-10, Tiny ImageNet, and MNIST datasets showed the success of our algorithm (XCNN) to make CNNs explainable. Based on visual assessment, the proposed model outperforms the current algorithms in class-specific feature representation and interpretable heatmap generation while providing a simple and flexible network architecture. The initial success of this approach warrants further study to enhance weakly supervised localization and semantic segmentation in explainable frameworks.

57. A Bayesian Evaluation Framework for Ground Truth-Free Visual Recognition Tasks [PDF] 返回目录
  Derek S. Prijatelj, Mel McCurrie, Walter J. Scheirer
Abstract: An interesting development in automatic visual recognition has been the emergence of tasks where it is not possible to assign ground truth labels to images, yet still feasible to collect annotations that reflect human judgements about them. Such tasks include subjective visual attribute assignment and the labeling of ambiguous scenes. Machine learning-based predictors for these tasks rely on supervised training that models the behavior of the annotators, e.g., what would the average person's judgement be for an image? A key open question for this type of work, especially for applications where inconsistency with human behavior can lead to ethical lapses, is how to evaluate the uncertainty of trained predictors. Given that the real answer is unknowable, we are left with often noisy judgements from human annotators to work with. In order to account for the uncertainty that is present, we propose a relative Bayesian framework for evaluating predictors trained on such data. The framework specifies how to estimate a predictor's uncertainty due to the human labels by approximating a conditional distribution and producing a credible interval for the predictions and their measures of performance. The framework is successfully applied to four image classification tasks that use subjective human judgements: facial beauty assessment using the SCUT-FBP5500 dataset, social attribute assignment using data from this http URL, apparent age estimation using data from the ChaLearn series of challenges, and ambiguous scene labeling using the LabelMe dataset.

58. Measuring Performance of Generative Adversarial Networks on Devanagari Script [PDF] 返回目录
  Amogh G. Warkhandkar, Baasit Sharief, Omkar B. Bhambure
Abstract: Neural networks that follow the adversarial philosophy to create a generative model are a fascinating field. Multiple papers have already explored the architectural aspects and proposed systems with potentially good results; however, very few implement them on a real-world example. Traditionally, people use the famous MNIST dataset as a Hello, World! example for implementing Generative Adversarial Networks (GAN). Instead of going the standard route of using handwritten digits, this paper uses the Devanagari script, which has a more complex structure. As there is no conventional way of judging how well generative models perform, three additional classifiers were built to judge the output of the GAN model. The following paper is an explanation of what this implementation has achieved.

59. Deep Image Orientation Angle Detection [PDF] 返回目录
  Subhadip Maji, Smarajit Bose
Abstract: Estimating and rectifying the orientation angle of any image is a pretty challenging task. Early work used hand-engineered features for this purpose; after the advent of deep learning, convolution-based neural networks showed significant improvement on this problem. This paper shows that the combination of a CNN and a custom loss function specially designed for angles leads to state-of-the-art results. This includes estimating the orientation angle of any image or document at any degree (0 to 360 degrees).
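The abstract does not spell the loss out, but the usual difficulty with regressing angles is the 0/360 wrap-around: 359° and 1° are only 2° apart. A custom angle loss typically measures error along the shortest arc; the sketch below is one such formulation, an assumption rather than the paper's exact loss:

```python
import torch

def angular_loss(pred_deg, true_deg):
    diff = torch.remainder(pred_deg - true_deg, 360.0)
    wrapped = torch.minimum(diff, 360.0 - diff)   # shortest arc, in [0, 180]
    return (wrapped ** 2).mean()

pred = torch.tensor([359.0, 10.0])
true = torch.tensor([1.0, 350.0])
print(angular_loss(pred, true))   # small errors despite large raw differences
```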

60. CPL-SLAM: Efficient and Certifiably Correct Planar Graph-Based SLAM Using the Complex Number Representation [PDF] 返回目录
  Taosha Fan, Hanlin Wang, Michael Rubenstein, Todd Murphey
Abstract: In this paper, we consider the problem of planar graph-based simultaneous localization and mapping (SLAM) that involves both poses of the autonomous agent and positions of observed landmarks. We present CPL-SLAM, an efficient and certifiably correct algorithm to solve planar graph-based SLAM using the complex number representation. We formulate and simplify planar graph-based SLAM as the maximum likelihood estimation (MLE) on the product of unit complex numbers, and relax this nonconvex quadratic complex optimization problem to convex complex semidefinite programming (SDP). Furthermore, we simplify the corresponding complex semidefinite programming to Riemannian staircase optimization (RSO) on the complex oblique manifold that can be solved with the Riemannian trust region (RTR) method. In addition, we prove that the SDP relaxation and RSO simplification are tight as long as the noise magnitude is below a certain threshold. The efficacy of this work is validated through applications of CPL-SLAM and comparisons with existing state-of-the-art methods on planar graph-based SLAM, which indicates that our proposed algorithm is capable of solving planar graph-based SLAM certifiably, and is more efficient in numerical computation and more robust to measurement noise than existing state-of-the-art methods. The C++ code for CPL-SLAM is available at this https URL.
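The complex-number representation at the heart of CPL-SLAM is easy to illustrate: a planar rotation is a unit complex number $e^{i\theta}$, so SE(2) pose composition becomes ordinary complex arithmetic. A small sketch of that idea (illustrative only, unrelated to the released C++ code):

```python
import numpy as np

def compose(pose_a, pose_b):
    """pose = (t, r): t is a complex translation, r a unit-complex rotation."""
    ta, ra = pose_a
    tb, rb = pose_b
    return (ta + ra * tb, ra * rb)   # rotate-then-translate composition

pose_a = (1.0 + 2.0j, np.exp(1j * np.pi / 4))    # translate (1,2), rotate 45 deg
pose_b = (0.5 + 0.0j, np.exp(1j * np.pi / 8))
t, r = compose(pose_a, pose_b)
print(t, np.angle(r, deg=True))                  # composed pose, heading in degrees
```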

61. Domain Adaptation for Robust Workload Classification using fNIRS [PDF]
  Boyang Lyu, Thao Pham, Giles Blaney, Zachary Haga, Sergio Fantini, Shuchin Aeron
Abstract: Significance: We demonstrated the potential of using domain adaptation on functional Near-Infrared Spectroscopy (fNIRS) data to detect and discriminate different levels of n-back tasks that involve working memory across different experiment sessions and subjects. Aim: To address the domain shift in fNIRS data across sessions and subjects for task label alignment, we exploited two domain adaptation approaches: Gromov-Wasserstein (G-W) and Fused Gromov-Wasserstein (FG-W). Approach: We applied G-W for session-by-session alignment and FG-W for subject-by-subject alignment, with Hellinger distance as the underlying metric, to fNIRS data acquired during different n-back task levels. We also compared with a supervised method, a Convolutional Neural Network (CNN). Results: For session-by-session alignment, using G-W resulted in alignment accuracy of 70 $\pm$ 4 % (weighted mean $\pm$ standard error), whereas using CNN resulted in classification accuracy of 58 $\pm$ 5 % across five subjects. For subject-by-subject alignment, using FG-W resulted in alignment accuracy of 55 $\pm$ 3 %, whereas using CNN resulted in classification accuracy of 45 $\pm$ 1 %. In both cases, 25 % represents chance. We also showed that removal of motion artifacts from the fNIRS data plays an important role in improving alignment performance. Conclusions: Domain adaptation shows potential for session-by-session and subject-by-subject alignment using fNIRS data.
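
For readers who want to try the alignment recipe, the POT (Python Optimal Transport) library exposes Gromov-Wasserstein directly. The sketch below builds intra-session Hellinger distance matrices over random stand-in features and computes the G-W coupling; it illustrates the general recipe only, not the authors' pipeline, and assumes POT's `ot.gromov.gromov_wasserstein` interface:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def hellinger_matrix(P: np.ndarray) -> np.ndarray:
    """Pairwise Hellinger distances between rows of P (each row a
    normalized histogram), used as the intra-domain metric."""
    sq = np.sqrt(P)
    bc = sq @ sq.T                       # Bhattacharyya coefficients
    return np.sqrt(np.clip(1.0 - bc, 0.0, None))

rng = np.random.default_rng(0)
A = rng.random((20, 8)); A /= A.sum(1, keepdims=True)  # session 1 features
B = rng.random((25, 8)); B /= B.sum(1, keepdims=True)  # session 2 features

C1, C2 = hellinger_matrix(A), hellinger_matrix(B)
p = np.full(len(A), 1.0 / len(A))
q = np.full(len(B), 1.0 / len(B))

# Coupling T[i, j]: how strongly sample i of session 1 is matched to
# sample j of session 2 while preserving each session's internal geometry.
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, loss_fun='square_loss')
print(T.shape)  # (20, 25)
```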

62. Unsupervised object-centric video generation and decomposition in 3D [PDF]
  Paul Henderson, Christoph H. Lampert
Abstract: A natural approach to generative modeling of videos is to represent them as a composition of moving objects. Recent works model a set of 2D sprites over a slowly-varying background, but without considering the underlying 3D scene that gives rise to them. We instead propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background. Our model is trained from monocular videos without any supervision, yet learns to generate coherent 3D scenes containing several moving objects. We conduct detailed experiments on two datasets, going beyond the visual complexity supported by state-of-the-art generative approaches. We evaluate our method on depth-prediction and 3D object detection---tasks which cannot be addressed by those earlier works---and show that it outperforms them even on 2D instance segmentation and tracking.

63. UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models [PDF]
  Varun Ravi Kumar, Senthil Yogamani, Markus Bach, Christian Witt, Stefan Milz, Patrick Mader
Abstract: In classical computer vision, rectification is an integral part of multi-view depth estimation. It typically includes epipolar rectification and lens distortion correction. This process simplifies the depth estimation significantly, and thus it has been adopted in CNN approaches. However, rectification has several side effects, including a reduced field-of-view (FOV), resampling distortion, and sensitivity to calibration errors. The effects are particularly pronounced in case of significant distortion (e.g., wide-angle fisheye cameras). In this paper, we propose a generic scale-aware self-supervised pipeline for estimating depth, euclidean distance, and visual odometry from unrectified monocular videos. We demonstrate a similar level of precision on the unrectified KITTI dataset with barrel distortion comparable to the rectified KITTI dataset. The intuition is that the rectification step can be implicitly absorbed within the CNN model, which learns the distortion model without increasing complexity. Our approach does not suffer from a reduced field of view and avoids computational costs for rectification at inference time. To further illustrate the general applicability of the proposed framework, we apply it to wide-angle fisheye cameras with 190$^\circ$ horizontal field-of-view (FOV). The training framework UnRectDepthNet takes in the camera distortion model as an argument and adapts projection and unprojection functions accordingly. The proposed algorithm is evaluated further on the KITTI dataset, and we achieve state-of-the-art results that improve upon our previous work FisheyeDistanceNet. Qualitative results on a distorted test scene video sequence indicate excellent performance; see this https URL.

64. Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks [PDF]
  Junyan Wu, Hao Jiang, Xiaowei Ding, Anudeep Konda, Jin Han, Yang Zhang, Qian Li
Abstract: Skin conditions are reported to be the 4th leading cause of nonfatal disease burden worldwide. However, given the colossal spectrum of skin disorders defined clinically and the shortage of dermatology expertise, diagnosing skin conditions in a timely and accurate manner remains a challenging task. Using computer vision technologies, deep learning systems have proven effective in assisting clinicians with image diagnostics in radiology, ophthalmology, and more. In this paper, we propose a deep learning system (DLS) that may predict the differential diagnosis of skin conditions using clinical images. Our DLS formulates the differential diagnostics as a multi-label classification task over 80 conditions when only incomplete image labels are available. We tackle the label incompleteness problem by combining a classification network with a Graph Convolutional Network (GCN) that characterizes label co-occurrence and effectively regularizes it towards a sparse representation. Our approach is demonstrated on 136,462 clinical images and shows that the classification accuracy greatly benefits from the co-occurrence supervision. Our DLS achieves 93.6% top-5 accuracy on 12,378 test images and consistently outperforms the baseline classification network.
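
A minimal sketch of the co-occurrence idea: build a label graph from how often conditions are diagnosed together, normalize it as in the standard GCN recipe, and propagate label representations through a graph convolution. The graph construction and shapes below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def normalized_adjacency(cooc: np.ndarray) -> np.ndarray:
    """Symmetrically normalized adjacency D^{-1/2} (A + I) D^{-1/2}
    built from a label co-occurrence count matrix."""
    A = (cooc > 0).astype(float)
    np.fill_diagonal(A, 1.0)              # self-loops, standard GCN recipe
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
    return d_inv_sqrt @ A @ d_inv_sqrt

def gcn_layer(A_hat, H, W):
    """One graph convolution: mix each label's features with those of its
    co-occurring neighbors, then apply a learned projection and ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
n_labels, d_in, d_out = 80, 300, 128
cooc = rng.integers(0, 3, size=(n_labels, n_labels))
cooc = np.maximum(cooc, cooc.T)           # symmetric co-occurrence counts
H = rng.standard_normal((n_labels, d_in)) # initial label embeddings
W = rng.standard_normal((d_in, d_out)) * 0.01
print(gcn_layer(normalized_adjacency(cooc), H, W).shape)  # (80, 128)
```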

65. Improving Pixel Embedding Learning through Intermediate Distance Regression Supervision for Instance Segmentation [PDF]
  Yuli Wu, Long Chen, Dorit Merhof
Abstract: As a proposal-free approach, instance segmentation through pixel embedding learning and clustering is gaining more emphasis. Compared with bounding box refinement approaches, such as Mask R-CNN, it has potential advantages in handling complex shapes and dense objects. In this work, we propose a simple yet highly effective architecture for object-aware embedding learning. A distance regression module is incorporated into our architecture to generate seeds for fast clustering. At the same time, we show that the features learned by the distance regression module are able to promote the accuracy of learned object-aware embeddings significantly. By simply concatenating features of the distance regression module to the images as inputs of the embedding module, the mSBD scores on the CVPPP Leaf Segmentation Challenge can be further improved by more than 8% compared to the identical set-up without concatenation, yielding the best overall result on the CodaLab leaderboard.
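
The key wiring is simply concatenating the distance-regression features to the image along the channel dimension before the embedding module. A PyTorch-flavored sketch with invented module names and channel counts:

```python
import torch
import torch.nn as nn

class DistanceThenEmbedding(nn.Module):
    """Hypothetical wiring: a distance-regression branch whose feature maps
    are concatenated to the input image as extra channels for the embedding
    branch, as described in the abstract."""
    def __init__(self, in_ch=3, dist_feat_ch=16, embed_dim=8):
        super().__init__()
        self.distance_head = nn.Sequential(
            nn.Conv2d(in_ch, dist_feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.distance_out = nn.Conv2d(dist_feat_ch, 1, 1)   # distance map for seeds
        self.embedding_head = nn.Sequential(
            nn.Conv2d(in_ch + dist_feat_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, 1),                     # per-pixel embedding
        )

    def forward(self, x):
        f = self.distance_head(x)
        dist = self.distance_out(f)
        emb = self.embedding_head(torch.cat([x, f], dim=1))  # the concatenation
        return dist, emb

dist, emb = DistanceThenEmbedding()(torch.randn(1, 3, 64, 64))
print(dist.shape, emb.shape)  # [1, 1, 64, 64] and [1, 8, 64, 64]
```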

66. Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization [PDF]
  Kyle Min, Jason J. Corso
Abstract: Temporally localizing activities within untrimmed videos has been extensively studied in recent years. Despite recent advances, existing methods for weakly-supervised temporal activity localization struggle to recognize when an activity is not occurring. To address this issue, we propose a novel method named A2CL-PT. Two triplets of the feature space are considered in our approach: one triplet is used to learn discriminative features for each activity class, and the other one is used to distinguish the features where no activity occurs (i.e. background features) from activity-related features for each video. To further improve the performance, we build our network using two parallel branches which operate in an adversarial way: the first branch localizes the most salient activities of a video and the second one finds other supplementary activities from non-localized parts of the video. Extensive experiments performed on THUMOS14 and ActivityNet datasets demonstrate that our proposed method is effective. Specifically, the average mAP of IoU thresholds from 0.1 to 0.9 on the THUMOS14 dataset is significantly improved from 27.9% to 30.0%.

67. Storing Encoded Episodes as Concepts for Continual Learning [PDF]
  Ali Ayub, Alan R. Wagner
Abstract: The two main challenges faced by continual learning approaches are catastrophic forgetting and memory limitations on the storage of data. To cope with these challenges, we propose a novel, cognitively-inspired approach which trains autoencoders with Neural Style Transfer to encode and store images. Reconstructed images from encoded episodes are replayed when training the classifier model on a new task to avoid catastrophic forgetting. The loss function for the reconstructed images is weighted to reduce its effect during classifier training to cope with image degradation. When the system runs out of memory the encoded episodes are converted into centroids and covariance matrices, which are used to generate pseudo-images during classifier training, keeping classifier performance stable with less memory. Our approach increases classification accuracy by 13-17% over state-of-the-art methods on benchmark datasets, while requiring 78% less storage space.
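
The memory-compression step reduces each class's encoded episodes to a centroid and covariance matrix, from which pseudo-embeddings can be sampled during classifier training. A small numpy sketch of that generative step (the autoencoder and Neural Style Transfer parts are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoded episodes (latent vectors) of one class.
latents = rng.standard_normal((500, 32)) + 3.0

# When memory runs out, compress the episodes to sufficient statistics...
centroid = latents.mean(axis=0)
cov = np.cov(latents, rowvar=False)

# ...and later sample pseudo-latents for replay during classifier training.
pseudo = rng.multivariate_normal(centroid, cov, size=64)
print(pseudo.shape)  # (64, 32): decodable into pseudo-images for replay
```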

68. Deep Doubly Supervised Transfer Network for Diagnosis of Breast Cancer with Imbalanced Ultrasound Imaging Modalities [PDF]
  Han Xiangmin, Wang Jun, Zhou Weijun, Chang Cai, Ying Shihui, Shi Jun
Abstract: Elastography ultrasound (EUS) provides additional biomechanical information about lesions for B-mode ultrasound (BUS) in the diagnosis of breast cancers. However, joint utilization of both BUS and EUS is not popular due to the lack of EUS devices in rural hospitals, which raises a novel modality imbalance problem in computer-aided diagnosis (CAD) for breast cancers. Current transfer learning (TL) pays little attention to this special issue of clinical modality imbalance, that is, the source domain (EUS modality) has fewer labeled samples than the target domain (BUS modality). Moreover, these TL methods cannot fully use the label information to explore the intrinsic relation between the two modalities and then guide the promoted knowledge transfer. To this end, we propose a novel doubly supervised TL network (DDSTN) that integrates the Learning Using Privileged Information (LUPI) paradigm and the Maximum Mean Discrepancy (MMD) criterion into a unified deep TL framework. The proposed algorithm can not only make full use of the shared labels to effectively guide knowledge transfer by the LUPI paradigm, but also perform additional supervised transfer between unpaired data. We further introduce the MMD criterion to enhance the knowledge transfer. The experimental results on the breast ultrasound dataset indicate that the proposed DDSTN outperforms all the compared state-of-the-art algorithms for BUS-based CAD.

69. Path Signatures on Lie Groups [PDF]
  Darrick Lee, Robert Ghrist
Abstract: Path signatures are powerful nonparametric tools for time series analysis, shown to form a universal and characteristic feature map for Euclidean valued time series data. We lift the theory of path signatures to the setting of Lie group valued time series, adapting these tools for time series with underlying geometric constraints. We prove that this generalized path signature is universal and characteristic. To demonstrate universality, we analyze the human action recognition problem in computer vision, using $SO(3)$ representations for the time series, providing comparable performance to other shallow learning approaches, while offering an easily interpretable feature set. We also provide a two-sample hypothesis test for Lie group-valued random walks to illustrate its characteristic property. Finally we provide algorithms and a Julia implementation of these methods.
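
For intuition, the depth-2 signature of a d-dimensional Euclidean path can be approximated from increments by iterated sums; the paper lifts exactly this object to Lie group valued paths. A pure-numpy sketch of the Euclidean case:

```python
import numpy as np

def signature_depth2(X: np.ndarray):
    """Discrete approximation of the depth-2 path signature of X (T, d):
    level 1 is the total increment; level 2 is the iterated integral
    S[i, j] = int (X_i(t) - X_i(0)) dX_j(t)."""
    dX = np.diff(X, axis=0)        # increments, shape (T-1, d)
    level1 = dX.sum(axis=0)
    displaced = X[:-1] - X[0]      # left endpoints relative to the start
    level2 = displaced.T @ dX      # (d, d) matrix of iterated sums
    return level1, level2

t = np.linspace(0.0, 1.0, 200)
circle = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
lvl1, lvl2 = signature_depth2(circle)
levy_area = (lvl2 - lvl2.T)[0, 1] / 2   # signed area enclosed by the loop
print(lvl1.round(6), round(float(levy_area), 4))  # ~[0, 0] and ~3.1416 (pi)
```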

70. Dense Crowds Detection and Counting with a Lightweight Architecture [PDF]
  Javier Antonio Gonzalez-Trejo, Diego Alberto Mercado-Ravell
Abstract: In the context of crowd counting, most works have focused on improving accuracy without regard to performance, leading to algorithms that are not suitable for embedded applications. In this paper, we propose a lightweight convolutional neural network architecture to perform crowd detection and counting using fewer computing resources without a significant loss in count accuracy. The architecture was trained using the Bayes loss function to further improve its accuracy and then pruned to further reduce the computational resources used. The proposed architecture was tested on USF-QNRF, achieving a competitive Mean Average Error of 154.07 and a superior Mean Square Error of 241.77 while maintaining a competitive parameter count of 0.067 million. The obtained results suggest that the Bayes loss can be used with other architectures to further improve them, and also that the last convolutional layer provides no significant information and even encourages over-fitting during training.

71. A new approach to descriptors generation for image retrieval by analyzing activations of deep neural network layers [PDF]
  Paweł Staszewski, Maciej Jaworski, Jinde Cao, Leszek Rutkowski
Abstract: In this paper, we consider the problem of descriptor construction for the task of content-based image retrieval using deep neural networks. The idea of neural codes, based on fully connected layer activations, is extended by incorporating the information contained in convolutional layers. It is known that the total number of neurons in the convolutional part of the network is large and that the majority of them have little influence on the final classification decision. Therefore, in the paper we propose a novel algorithm that allows us to extract the most significant neuron activations and utilize this information to construct effective descriptors. The descriptors, consisting of values taken from both the fully connected and convolutional layers, perfectly represent the whole image content. The images retrieved using these descriptors match the query image semantically very well, and they are also similar in other secondary image characteristics, like background, textures, or color distribution. These features of the proposed descriptors are verified experimentally on the IMAGENET1M dataset using the VGG16 neural network.
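
A rough sketch of the descriptor construction: run an image through VGG16, take the fully connected activations (the classical neural code) plus the strongest convolutional activations, and concatenate them. The layer choice and plain top-k selection below are illustrative stand-ins for the paper's significance criterion, assuming a recent torchvision:

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights=None).eval()   # weights=None keeps the sketch offline;
                                          # pretrained weights would be used in practice
conv_maps = []
hook = vgg.features[28].register_forward_hook(          # last conv layer of VGG16
    lambda module, inputs, output: conv_maps.append(output.detach()))

x = torch.randn(1, 3, 224, 224)           # stand-in for a batch of images
with torch.no_grad():
    f = vgg.features(x)                   # triggers the hook
    fc = vgg.classifier[:4](torch.flatten(vgg.avgpool(f), 1))  # 4096-d neural code

# Keep only the most significant conv activations (top-k by magnitude here)
# and concatenate them with the fully connected neural code.
conv_flat = conv_maps[0].flatten(1)
topk_vals, _ = conv_flat.abs().topk(k=512, dim=1)
descriptor = torch.cat([fc, topk_vals], dim=1)
hook.remove()
print(descriptor.shape)                   # torch.Size([1, 4608])
```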

72. AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation [PDF]
  Xiaofeng Liu, Tong Che, Yiqun Lu, Chao Yang, Site Li, Jane You
Abstract: This paper targets learning-based novel view synthesis from a single or a limited number of 2D images without pose supervision. In viewer-centered coordinates, we construct an end-to-end trainable conditional variational framework to disentangle the unsupervisedly learned relative-pose/rotation and an implicit global 3D representation (shape, texture, the origin of viewer-centered coordinates, etc.). The global appearance of the 3D object is given by several appearance-describing images taken from any number of viewpoints. Our spatial correlation module extracts a global 3D representation from the appearance-describing images in a permutation-invariant manner. Our system can achieve implicit 3D understanding without explicit 3D reconstruction. With an unsupervisedly learned viewer-centered relative-pose/rotation code, the decoder can hallucinate novel views continuously by sampling the relative pose from a prior distribution. In various applications, we demonstrate that our model can achieve comparable or even better results than pose/3D-model-supervised learning-based novel view synthesis (NVS) methods with any number of input views.

73. Closed-Form Factorization of Latent Semantics in GANs [PDF]
  Yujun Shen, Bolei Zhou
Abstract: A rich set of semantic attributes has been shown to emerge in the latent space of the Generative Adversarial Networks (GANs) trained for synthesizing images. In order to identify such latent semantics for image manipulation, previous methods annotate a collection of synthesized samples and then train supervised classifiers in the latent space. However, they require a clear definition of the target attribute as well as the corresponding manual annotations, severely limiting their applications in practice. In this work, we examine the internal representation learned by GANs to reveal the underlying variation factors in an unsupervised manner. By studying the essential role of the fully-connected layer that takes the latent code into the generator of GANs, we propose a general closed-form factorization method for latent semantic discovery. The properties of the identified semantics are further analyzed both theoretically and empirically. With its fast and efficient implementation, our approach is capable of not only finding latent semantics as accurately as the state-of-the-art supervised methods, but also resulting in far more versatile semantic classes across multiple GAN models trained on a wide range of datasets.
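
The abstract does not state the closed form, but a natural reading is that semantic directions fall out of a spectral decomposition of the first weight matrix mapping the latent code into the generator. A sketch of that reading on a random stand-in matrix (my interpretation, not guaranteed to match the paper's exact derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 512))   # stand-in for the generator's first
                                       # fully connected weight (out_dim, latent_dim)

# Latent directions the first layer amplifies the most are the top
# eigenvectors of A^T A (equivalently, the right singular vectors of A).
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
directions = eigvecs[:, ::-1]          # columns sorted by decreasing eigenvalue

z = rng.standard_normal(512)           # a latent code
alpha = 3.0
z_edited = z + alpha * directions[:, 0]  # move along the strongest direction
print(np.linalg.norm(A @ z_edited - A @ z))  # large response, by construction
```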

74. Towards causal benchmarking of bias in face analysis algorithms [PDF]
  Guha Balakrishnan, Yuanjun Xiong, Wei Xia, Pietro Perona
Abstract: Measuring algorithmic bias is crucial both to assess algorithmic fairness, and to guide the improvement of algorithms. Current methods to measure algorithmic bias in computer vision, which are based on observational datasets, are inadequate for this task because they conflate algorithmic bias with dataset bias. To address this problem we develop an experimental method for measuring algorithmic bias of face analysis algorithms, which manipulates directly the attributes of interest, e.g., gender and skin tone, in order to reveal causal links between attribute variation and performance change. Our proposed method is based on generating synthetic ``transects'' of matched sample images that are designed to differ along specific attributes while leaving other attributes constant. A crucial aspect of our approach is relying on the perception of human observers, both to guide manipulations, and to measure algorithmic bias. Besides allowing the measurement of algorithmic bias, synthetic transects have other advantages with respect to observational datasets: they sample attributes more evenly allowing for more straightforward bias analysis on minority and intersectional groups, they enable prediction of bias in new scenarios, they greatly reduce ethical and legal challenges, and they are economical and fast to obtain, helping make bias testing affordable and widely available. We validate our method by comparing it to a study that employs the traditional observational method for analyzing bias in gender classification algorithms. The two methods reach different conclusions. While the observational method reports gender and skin color biases, the experimental method reveals biases due to gender, hair length, age, and facial hair.

75. Cross-Domain Medical Image Translation by Shared Latent Gaussian Mixture Model [PDF]
  Yingying Zhu, Youbao Tang, Yuxing Tang, Daniel C. Elton, Sungwon Lee, Perry J. Pickhardt, Ronald M. Summers
Abstract: Current deep learning based segmentation models often generalize poorly between domains due to insufficient training data. In real-world clinical applications, cross-domain image analysis tools are in high demand since medical images from different domains are often needed to achieve a precise diagnosis. An important example in radiology is generalizing from non-contrast CT to contrast enhanced CTs. Contrast enhanced CT scans at different phases are used to enhance certain pathologies or organs. Many existing cross-domain image-to-image translation models have been shown to improve cross-domain segmentation of large organs. However, such models lack the ability to preserve fine structures during the translation process, which is significant for many clinical applications, such as segmenting small calcified plaques in the aorta and pelvic arteries. In order to preserve fine structures during medical image translation, we propose a patch-based model using shared latent variables from a Gaussian mixture model. We compare our image translation framework to several state-of-the-art methods on cross-domain image translation and show that our model does a better job preserving fine structures. The superior performance of our model is verified by performing two tasks with the translated images: detection and segmentation of aortic plaques, and pancreas segmentation. We expect the utility of our framework will extend to other problems beyond segmentation due to the improved quality of the generated images and the enhanced ability to preserve small structures.

76. Conditional Image Retrieval [PDF]
  Mark Hamilton, Stephanie Fu, William T. Freeman, Mindren Lu
Abstract: This work introduces Conditional Image Retrieval (CIR) systems: IR methods that can efficiently specialize to specific subsets of images on the fly. These systems broaden the class of queries IR systems support, and eliminate the need for expensive re-fitting to specific subsets of data. Specifically, we adapt tree-based K-Nearest Neighbor (KNN) data-structures to the conditional setting by introducing additional inverted-index data-structures. This speeds conditional queries and does not slow queries without conditioning. We present two new datasets for evaluating the performance of CIR systems and evaluate a variety of design choices. As a motivating application, we present an algorithm that can explore shared semantic content between works of art of vastly different media and cultural origin. Finally, we demonstrate that CIR data-structures can identify Generative Adversarial Network (GAN) ``blind spots'': areas where GANs fail to properly model the true data distribution.
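
A toy version of the inverted-index idea: map each conditioning value to the indices of matching images, then answer a conditional query with a nearest-neighbor search restricted to that subset, with no re-fitting of a global model. This ignores the paper's tree-based efficiency machinery and only shows the conditioning semantics; names are illustrative:

```python
import numpy as np
from collections import defaultdict
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 64))              # image embeddings
tags = rng.choice(["portrait", "landscape", "still_life"], size=1000)

# Inverted index: condition value -> row indices of matching images.
inverted = defaultdict(list)
for i, tag in enumerate(tags):
    inverted[tag].append(i)

def conditional_knn(query, condition, k=5):
    """KNN restricted on the fly to the images satisfying the condition."""
    idx = np.array(inverted[condition])
    nn = NearestNeighbors(n_neighbors=k).fit(features[idx])
    _, local = nn.kneighbors(query[None, :])
    return idx[local[0]]                                 # back to global indices

print(conditional_knn(rng.standard_normal(64), "portrait"))
```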

77. MFRNet: A New CNN Architecture for Post-Processing and In-loop Filtering [PDF]
  Di Ma, Fan Zhang, David R. Bull
Abstract: In this paper, we propose a novel convolutional neural network (CNN) architecture, MFRNet, for post-processing (PP) and in-loop filtering (ILF) in the context of video compression. This network consists of four Multi-level Feature review Residual dense Blocks (MFRBs), which are connected using a cascading structure. Each MFRB extracts features from multiple convolutional layers using dense connections and a multi-level residual learning structure. In order to further improve information flow between these blocks, each of them also reuses high dimensional features from the previous MFRB. This network has been integrated into PP and ILF coding modules for both HEVC (HM 16.20) and VVC (VTM 7.0), and fully evaluated under the JVET Common Test Conditions using the Random Access configuration. The experimental results show significant and consistent coding gains over both anchor codecs (HEVC HM and VVC VTM) and also over other existing CNN-based PP/ILF approaches based on Bjontegaard Delta measurements using both PSNR and VMAF for quality assessment. When MFRNet is integrated into HM 16.20, gains up to 16.0% (BD-rate VMAF) are demonstrated for ILF, and up to 21.0% (BD-rate VMAF) for PP. The respective gains for VTM 7.0 are up to 5.1% for ILF and up to 7.1% for PP.

78. Nodule2vec: a 3D Deep Learning System for Pulmonary Nodule Retrieval Using Semantic Representation [PDF]
  Ilia Kravets, Tal Heletz, Hayit Greenspan
Abstract: Content-based retrieval supports a radiologist's decision-making process by presenting the doctor with the most similar cases from a database containing both historical diagnoses and further disease development history. We present a deep learning system that transforms a 3D image of a pulmonary nodule from a CT scan into a low-dimensional embedding vector. We demonstrate that such a vector representation preserves semantic information about the nodule and offers a viable approach for content-based image retrieval (CBIR). We discuss the theoretical limitations of the available datasets and overcome them by applying transfer learning of the state-of-the-art lung nodule detection model. We evaluate the system using the LIDC-IDRI dataset of thoracic CT scans. We devise a similarity score and show that it can be utilized to measure similarity 1) between annotations of the same nodule by different radiologists and 2) between the query nodule and the top four CBIR results. A comparison between doctor and algorithm scores suggests that the benefit provided by the system to the radiologist end-user is comparable to obtaining a second radiologist's opinion.

79. A Weakly Supervised Region-Based Active Learning Method for COVID-19 Segmentation in CT Images [PDF]
  Issam Laradji, Pau Rodriguez, Frederic Branchaud-Charron, Keegan Lensink, Parmida Atighehchian, William Parker, David Vazquez, Derek Nowrouzezahrai
Abstract: One of the key challenges in the battle against the Coronavirus (COVID-19) pandemic is to detect and quantify the severity of the disease in a timely manner. Computed tomographies (CT) of the lungs are effective for assessing the state of the infection. Unfortunately, labeling CT scans can take a lot of time and effort, with up to 150 minutes per scan. We address this challenge by introducing a scalable, fast, and accurate active learning system that accelerates the labeling of CT scan images. Conventionally, active learning methods require the labelers to annotate whole images with full supervision, but that can lead to wasted effort, as many of the annotations could be redundant. Thus, our system presents the annotator with unlabeled regions that promise high information content and low annotation cost. Further, the system allows annotators to label regions using point-level supervision, which is much cheaper to acquire than per-pixel annotations. Our experiments on open-source COVID-19 datasets show that using an entropy-based method to rank unlabeled regions yields significantly better results than random labeling of these regions. Also, we show that labeling small regions of images is more efficient than labeling whole images. Finally, we show that only 7\% of the labeling effort required to label the whole training set gives us around 90\% of the performance obtained by training the model on the fully annotated training set. Code is available at: \url{this https URL}.
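
The entropy-based ranking reduces to a few lines: compute per-pixel predictive entropy from the model's softmax output, average it over candidate regions, and offer the highest-entropy regions to the annotator first. A sketch assuming a regular grid of candidate regions:

```python
import numpy as np

def region_entropy_ranking(probs: np.ndarray, region: int = 32):
    """probs: (C, H, W) softmax output. Returns region (row, col) offsets
    sorted from most to least uncertain, so annotators see the regions that
    promise the highest information content first."""
    eps = 1e-12
    pixel_ent = -(probs * np.log(probs + eps)).sum(axis=0)   # (H, W)
    H, W = pixel_ent.shape
    scores = []
    for r in range(0, H - region + 1, region):
        for c in range(0, W - region + 1, region):
            scores.append((pixel_ent[r:r + region, c:c + region].mean(), (r, c)))
    return [rc for _, rc in sorted(scores, reverse=True)]

rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 128, 128))                  # infection vs background
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
print(region_entropy_ranking(probs)[:3])                     # top-3 regions to label
```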

80. Automated Synthetic-to-Real Generalization [PDF]
  Wuyang Chen, Zhiding Yu, Zhangyang Wang, Anima Anandkumar
Abstract: Models trained on synthetic images often face degraded generalization to real data. As a convention, these models are often initialized with ImageNet pre-trained representation. Yet the role of ImageNet knowledge is seldom discussed despite common practices that leverage this knowledge to maintain the generalization ability. An example is the careful hand-tuning of early stopping and layer-wise learning rates, which is shown to improve synthetic-to-real generalization but is also laborious and heuristic. In this work, we explicitly encourage the synthetically trained model to maintain similar representations with the ImageNet pre-trained model, and propose a \textit{learning-to-optimize (L2O)} strategy to automate the selection of layer-wise learning rates. We demonstrate that the proposed framework can significantly improve the synthetic-to-real generalization performance without seeing and training on real data, while also benefiting downstream tasks such as domain adaptation. Code is available at: this https URL.
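
The hand-tuning being automated is ordinary layer-wise learning rates, which PyTorch expresses as optimizer parameter groups; the proposed L2O strategy learns to choose these values rather than fixing them by hand. A minimal sketch of the mechanism being controlled, with hand-picked constants standing in for the learned policy's outputs:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# One parameter group per layer; an L2O policy would output these lr values
# instead of the illustrative constants used here.
param_groups = [
    {"params": layer.parameters(), "lr": lr}
    for layer, lr in [(model[0], 1e-4), (model[2], 1e-3)]
]
optimizer = torch.optim.SGD(param_groups, momentum=0.9)

loss = model(torch.randn(8, 128)).pow(2).mean()
loss.backward()
optimizer.step()  # each layer is updated with its own learning rate
```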

81. Lifelong Learning using Eigentasks: Task Separation, Skill Acquisition, and Selective Transfer [PDF]
  Aswin Raghavan, Jesse Hostetler, Indranil Sur, Abrar Rahman, Ajay Divakaran
Abstract: We introduce the eigentask framework for lifelong learning. An eigentask is a skill that solves a set of related tasks, paired with a generative model that can sample from the skill's input space. The framework extends generative replay approaches, which have mainly been used to avoid catastrophic forgetting, to also address other lifelong learning goals such as forward knowledge transfer. We propose a wake-sleep cycle of alternating task learning and knowledge consolidation for learning in our framework, and instantiate it for lifelong supervised learning and lifelong RL. We achieve improved performance over the state-of-the-art in supervised continual learning, and show evidence of forward knowledge transfer in a lifelong RL application in the game Starcraft2.

82. Our Evaluation Metric Needs an Update to Encourage Generalization [PDF]
  Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral
Abstract: Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and `hack' datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance -- and thus overestimation in AI systems' capabilities -- we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.

83. From Symmetry to Geometry: Tractable Nonconvex Problems [PDF]
  Yuqian Zhang, Qing Qu, John Wright
Abstract: As science and engineering have become increasingly data-driven, the role of optimization has expanded to touch almost every stage of the data analysis pipeline, from the signal and data acquisition to modeling and prediction. The optimization problems encountered in practice are often nonconvex. While challenges vary from problem to problem, one common source of nonconvexity is nonlinearity in the data or measurement model. Nonlinear models often exhibit symmetries, creating complicated, nonconvex objective landscapes, with multiple equivalent solutions. Nevertheless, simple methods (e.g., gradient descent) often perform surprisingly well in practice. The goal of this survey is to highlight a class of tractable nonconvex problems, which can be understood through the lens of symmetries. These problems exhibit a characteristic geometric structure: local minimizers are symmetric copies of a single ``ground truth'' solution, while other critical points occur at balanced superpositions of symmetric copies of the ground truth, and exhibit negative curvature in directions that break the symmetry. This structure enables efficient methods to obtain global minimizers. We discuss examples of this phenomenon arising from a wide range of problems in imaging, signal processing, and data analysis. We highlight the key role of symmetry in shaping the objective landscape and discuss the different roles of rotational and discrete symmetries. This area is rich with observed phenomena and open problems; we close by highlighting directions for future research.

84. An Interpretable Baseline for Time Series Classification Without Intensive Learning [PDF]
  Robert J. Ravier, Mohammadreza Soltani, Miguel Antunes Dias Alfaiate, Denis Garagic, Vahid Tarokh
Abstract: Recent advances in time series classification have largely focused on methods that either employ deep learning or utilize other machine learning models for feature extraction. Though such methods have proven powerful, they can also require computationally expensive models that may lack interpretability of results, or may require larger datasets than are freely available. In this paper, we propose an interpretable baseline based on representing each time series as a collection of probability distributions of extracted geometric features. The features used are intuitive and require minimal parameter tuning. We perform an exhaustive evaluation of our baseline on a large number of real datasets, showing that simple classifiers trained on these features exhibit surprising performance relative to state-of-the-art methods requiring much more computational power. In particular, our methodology achieves good performance on a challenging dataset involving the classification of fishing vessels, remaining competitive with the state of the art despite only having access to approximately two percent of the dataset used to train and evaluate it.

85. Landslide Segmentation with U-Net: Evaluating Different Sampling Methods and Patch Sizes [PDF]
  Lucas P. Soares, Helen C. Dias, Carlos H. Grohmann
Abstract: Landslide inventory maps are crucial to validate predictive landslide models; however, since most mapping methods rely on visual interpretation or expert knowledge, detailed inventory maps are still lacking. This study used a fully convolutional deep learning model named U-net to automatically segment landslides in the city of Nova Friburgo, located in the mountainous range of Rio de Janeiro, southeastern Brazil. The objective was to evaluate the impact of patch sizes, sampling methods, and datasets on the overall accuracy of the models. The training data used the optical information from the RapidEye satellite, and a digital elevation model (DEM) derived from the L-band sensor of the ALOS satellite. The data was sampled using random and regular grid methods and patched in three sizes (32x32, 64x64, and 128x128 pixels). The models were evaluated on two areas with precision, recall, f1-score, and mean intersection over union (mIoU) metrics. The results show that the models trained with 32x32 tiles tend to have higher recall values due to higher true positive rates; however, they misclassify more background areas as landslides (false positives). Models trained with 128x128 tiles usually achieve higher precision values because they make less false positive errors. In both test areas, DEM and augmentation increased the accuracy of the models. Random sampling helped in model generalization. Models trained with 128x128 random tiles from the data that used the RapidEye image, DEM information, and augmentation achieved the highest f1-score, 0.55 in test area one, and 0.58 in test area two. The results achieved in this study are comparable to other fully convolutional models found in the literature, increasing the knowledge in the area.
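
The two sampling schemes compared are easy to reproduce: a regular grid tiles the raster at a fixed stride, while random sampling draws patch origins uniformly. A sketch for a single-band raster with the patch size as a parameter, as in the study:

```python
import numpy as np

def regular_grid_patches(img: np.ndarray, size: int):
    """Non-overlapping patches on a regular grid (e.g. size=32, 64, or 128)."""
    H, W = img.shape[:2]
    return np.array([img[r:r + size, c:c + size]
                     for r in range(0, H - size + 1, size)
                     for c in range(0, W - size + 1, size)])

def random_patches(img: np.ndarray, size: int, n: int, seed: int = 0):
    """n patches with uniformly random top-left corners."""
    rng = np.random.default_rng(seed)
    H, W = img.shape[:2]
    rows = rng.integers(0, H - size + 1, n)
    cols = rng.integers(0, W - size + 1, n)
    return np.array([img[r:r + size, c:c + size] for r, c in zip(rows, cols)])

scene = np.zeros((512, 512), dtype=np.float32)   # stand-in raster band
print(regular_grid_patches(scene, 128).shape)    # (16, 128, 128)
print(random_patches(scene, 128, n=16).shape)    # (16, 128, 128)
```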

86. T-Basis: a Compact Representation for Neural Networks [PDF]
  Anton Obukhov, Maxim Rakhuba, Stamatios Georgoulis, Menelaos Kanakis, Dengxin Dai, Luc Van Gool
Abstract: We introduce T-Basis, a novel concept for a compact representation of a set of tensors, each of an arbitrary shape, which is often seen in Neural Networks. Each of the tensors in the set is modeled using Tensor Rings, though the concept applies to other Tensor Networks. Owing its name to the T-shape of nodes in diagram notation of Tensor Rings, T-Basis is simply a list of equally shaped three-dimensional tensors, used to represent Tensor Ring nodes. Such representation allows us to parameterize the tensor set with a small number of parameters (coefficients of the T-Basis tensors), scaling logarithmically with each tensor's size in the set and linearly with the dimensionality of T-Basis. We evaluate the proposed approach on the task of neural network compression and demonstrate that it reaches high compression rates at acceptable performance drops. Finally, we analyze memory and operation requirements of the compressed networks and conclude that T-Basis networks are equally well suited for training and inference in resource-constrained environments and usage on the edge devices.
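
To make the construction concrete: each Tensor Ring core becomes a linear combination of a small shared list of equally shaped three-dimensional basis tensors, so only the combination coefficients are stored per core. A numpy sketch for a 3-mode tensor with illustrative shapes (equal mode sizes assumed for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n = 4, 3, 5              # basis size, TR-rank, mode size

# The shared T-Basis: m equally shaped (r, n, r) three-dimensional tensors.
basis = rng.standard_normal((m, r, n, r))

# Each Tensor Ring core is a linear combination of the basis, so per core
# only m coefficients are stored instead of r * n * r values.
coeffs = rng.standard_normal((3, m))               # three cores, 3-mode tensor
cores = np.einsum('km,mabc->kabc', coeffs, basis)  # (3, r, n, r)

# Tensor Ring contraction: trace around the ring of core matrices.
T = np.einsum('aib,bjc,cka->ijk', cores[0], cores[1], cores[2])
print(T.shape)  # (5, 5, 5): full tensor from 3*m coefficients + shared basis
```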

87. Inferring the 3D Standing Spine Posture from 2D Radiographs [PDF]
  Amirhossein Bayat, Anjany Sekuboyina, Johannes C. Paetzold, Christian Payer, Darko Stern, Martin Urschler, Jan S. Kirschke, Bjoern H. Menze
Abstract: The treatment of degenerative spinal disorders requires an understanding of the individual spinal anatomy and curvature in 3D. An upright spinal pose (i.e. standing) under natural weight bearing is crucial for such bio-mechanical analysis. 3D volumetric imaging modalities (e.g. CT and MRI) are performed in patients lying down. On the other hand, radiographs are captured in an upright pose, but result in 2D projections. This work aims to integrate the two realms, i.e. it combines the upright spinal curvature from radiographs with the 3D vertebral shape from CT imaging for synthesizing an upright 3D model of spine, loaded naturally. Specifically, we propose a novel neural network architecture working vertebra-wise, termed \emph{TransVert}, which takes orthogonal 2D radiographs and infers the spine's 3D posture. We validate our architecture on digitally reconstructed radiographs, achieving a 3D reconstruction Dice of $95.52\%$, indicating an almost perfect 2D-to-3D domain translation. Deploying our model on clinical radiographs, we successfully synthesise full-3D, upright, patient-specific spine models for the first time.

88. FocusLiteNN: High Efficiency Focus Quality Assessment for Digital Pathology [PDF]
  Zhongling Wang, Mahdi S. Hosseini, Adyn Miles, Konstantinos N. Plataniotis, Zhou Wang
Abstract: Out-of-focus microscopy lenses in digital pathology are a critical bottleneck in high-throughput Whole Slide Image (WSI) scanning platforms, for which pixel-level automated Focus Quality Assessment (FQA) methods are highly desirable to help significantly accelerate the clinical workflows. Existing FQA methods include both knowledge-driven and data-driven approaches. While data-driven approaches such as Convolutional Neural Network (CNN) based methods have shown great promise, they are difficult to use in practice due to their high computational complexity and lack of transferability. Here, we propose a highly efficient CNN-based model that maintains fast computation similar to the knowledge-driven methods without excessive hardware requirements such as GPUs. We create a training dataset using FocusPath, which encompasses diverse tissue slides across nine different stain colors, where the stain diversity greatly helps the model to learn diverse color spectra and tissue structures. In our attempt to reduce the CNN complexity, we find, to our surprise, that even trimming the CNN down to the minimal level still achieves highly competitive performance. We introduce a novel comprehensive evaluation dataset, the largest of its kind, annotated and compiled from the TCGA repository for model assessment and comparison, on which the proposed method exhibits a superior precision-speed trade-off when compared with existing knowledge-driven and data-driven FQA approaches.

89. Batch-level Experience Replay with Review for Continual Learning [PDF]
  Zheda Mai, Hyunwoo Kim, Jihwan Jeong, Scott Sanner
Abstract: Continual learning is a branch of deep learning that seeks to strike a balance between learning stability and plasticity. The CVPR 2020 CLVision Continual Learning for Computer Vision challenge is dedicated to evaluating and advancing the current state-of-the-art continual learning methods using the CORe50 dataset with three different continual learning scenarios. This paper presents our approach, called Batch-level Experience Replay with Review, to this challenge. Our team achieved first place in all three scenarios out of 79 participating teams. The codebase of our implementation is publicly available at this https URL.
