
[arXiv Papers] Computer Vision and Pattern Recognition 2020-03-04

Contents

1. BATS: Binary ArchitecTure Search [PDF] Abstract
2. Holistically-Attracted Wireframe Parsing [PDF] Abstract
3. Unsupervised Learning of Intrinsic Structural Representation Points [PDF] Abstract
4. Volumetric landmark detection with a multi-scale shift equivariant neural network [PDF] Abstract
5. Deep Multi-Modal Sets [PDF] Abstract
6. Image Matching across Wide Baselines: From Paper to Practice [PDF] Abstract
7. Distilled Hierarchical Neural Ensembles with Adaptive Inference Cost [PDF] Abstract
8. Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction [PDF] Abstract
9. Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [PDF] Abstract
10. Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications [PDF] Abstract
11. 3D dynamic hand gestures recognition using the Leap Motion sensor and convolutional neural networks [PDF] Abstract
12. UDD: An Underwater Open-sea Farm Object Detection Dataset for Underwater Robot Picking [PDF] Abstract
13. What's the relationship between CNNs and communication systems? [PDF] Abstract
14. DeepSperm: A robust and real-time bull sperm-cell detection in densely populated semen videos [PDF] Abstract
15. Fully Convolutional Networks for Automatically Generating Image Masks to Train Mask R-CNN [PDF] Abstract
16. multi-patch aggregation models for resampling detection [PDF] Abstract
17. DiPE: Deeper into Photometric Errors for Unsupervised Learning of Depth and Ego-motion from Monocular Videos [PDF] Abstract
18. Gastric histopathology image segmentation using a hierarchical conditional random field [PDF] Abstract
19. Data-Free Adversarial Perturbations for Practical Black-Box Attack [PDF] Abstract
20. Trained Model Fusion for Object Detection using Gating Network [PDF] Abstract
21. Towards Noise-resistant Object Detection with Noisy Annotations [PDF] Abstract
22. Disrupting DeepFakes: Adversarial Attacks Against Conditional Image Translation Networks and Facial Manipulation Systems [PDF] Abstract
23. Single-Shot Pose Estimation of Surgical Robot Instruments' Shafts from Monocular Endoscopic Images [PDF] Abstract
24. Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud [PDF] Abstract
25. MVC-Net: A Convolutional Neural Network Architecture for Manifold-Valued Images With Applications [PDF] Abstract
26. MRI Super-Resolution with GAN and 3D Multi-Level DenseNet: Smaller, Faster, and Better [PDF] Abstract
27. Energy-efficient and Robust Cumulative Training with Net2Net Transformation [PDF] Abstract
28. DEEVA: A Deep Learning and IoT Based Computer Vision System to Address Safety and Security of Production Sites in Energy Industry [PDF] Abstract
29. LiDARNet: A Boundary-Aware Domain Adaptation Model for Lidar Point Cloud Semantic Segmentation [PDF] Abstract
30. Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream [PDF] Abstract
31. Unsupervised Domain Adaptation for Mammogram Image Classification: A Promising Tool for Model Generalization [PDF] Abstract
32. Learning from Suspected Target: Bootstrapping Performance for Breast Cancer Detection in Mammography [PDF] Abstract
33. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [PDF] Abstract
34. Compact Surjective Encoding Autoencoder for Unsupervised Novelty Detection [PDF] Abstract
35. BUSU-Net: An Ensemble U-Net Framework for Medical Image Segmentation [PDF] Abstract
36. XGPT: Cross-modal Generative Pre-Training for Image Captioning [PDF] Abstract
37. Curriculum By Texture [PDF] Abstract
38. Shape analysis via inconsistent surface registration [PDF] Abstract
39. DDU-Nets: Distributed Dense Model for 3D MRI Brain Tumor Segmentation [PDF] Abstract
40. Visualizing intestines for diagnostic assistance of ileus based on intestinal region segmentation from 3D CT images [PDF] Abstract
41. A Deep learning Approach to Generate Contrast-Enhanced Computerised Tomography Angiography without the Use of Intravenous Contrast Agents [PDF] Abstract
42. RandomNet: Towards Fully Automatic Neural Architecture Design for Multimodal Learning [PDF] Abstract

Abstracts

1. BATS: Binary ArchitecTure Search [PDF] Back to contents
  Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Abstract: This paper proposes Binary ArchitecTure Search (BATS), a framework that drastically reduces the accuracy gap between binary neural networks and their real-valued counterparts by means of Neural Architecture Search (NAS). We show that directly applying NAS to the binary domain provides very poor results. To alleviate this, we describe, to our knowledge, for the first time, the 3 key ingredients for successfully applying NAS to the binary domain. Specifically, we (1) introduce and design a novel binary-oriented search space, (2) propose a new mechanism for controlling and stabilising the resulting searched topologies, (3) propose and validate a series of new search strategies for binary networks that lead to faster convergence and lower search times. Experimental results demonstrate the effectiveness of the proposed approach and the necessity of searching in the binary space directly. Moreover, (4) we set a new state-of-the-art for binary neural networks on CIFAR10, CIFAR100 and ImageNet datasets. Code will be made available this https URL

2. Holistically-Attracted Wireframe Parsing [PDF] Back to contents
  Nan Xue, Tianfu Wu, Song Bai, Fu-Dong Wang, Gui-Song Xia, Liangpei Zhang, Philip H.S. Torr
Abstract: This paper presents a fast and parsimonious parsing method to accurately and robustly detect a vectorized wireframe in an input image with a single forward pass. The proposed method is end-to-end trainable, consisting of three components: (i) line segment and junction proposal generation, (ii) line segment and junction matching, and (iii) line segment and junction verification. For computing line segment proposals, a novel exact dual representation is proposed which exploits a parsimonious geometric reparameterization for line segments and forms a holistic 4-dimensional attraction field map for an input image. Junctions can be treated as the "basins" in the attraction field. The proposed method is thus called Holistically-Attracted Wireframe Parser (HAWP). In experiments, the proposed method is tested on two benchmarks, the Wireframe dataset, and the YorkUrban dataset. On both benchmarks, it obtains state-of-the-art performance in terms of accuracy and efficiency. For example, on the Wireframe dataset, compared to the previous state-of-the-art method L-CNN, it improves the challenging mean structural average precision (msAP) by a large margin ($2.8\%$ absolute improvements) and achieves 29.5 FPS on single GPU ($89\%$ relative improvement). A systematic ablation study is performed to further justify the proposed method.

3. Unsupervised Learning of Intrinsic Structural Representation Points [PDF] Back to contents
  Nenglun Chen, Lingjie Liu, Zhiming Cui, Runnan Chen, Duygu Ceylan, Changhe Tu, Wenping Wang
Abstract: Learning structures of 3D shapes is a fundamental problem in the field of computer graphics and geometry processing. We present a simple yet interpretable unsupervised method for learning a new structural representation in the form of 3D structure points. The 3D structure points produced by our method encode the shape structure intrinsically and exhibit semantic consistency across all the shape instances with similar structures. This is a challenging goal that has not fully been achieved by other methods. Specifically, our method takes a 3D point cloud as input and encodes it as a set of local features. The local features are then passed through a novel point integration module to produce a set of 3D structure points. The chamfer distance is used as reconstruction loss to ensure the structure points lie close to the input point cloud. Extensive experiments have shown that our method outperforms the state-of-the-art on the semantic shape correspondence task and achieves comparable performance with state-of-the-art on the segmentation label transfer task. Moreover, the PCA based shape embedding built upon consistent structure points demonstrates good performance in preserving the shape structures. Code is available at this https URL
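The reconstruction loss named in the abstract above can be illustrated with a minimal sketch of a symmetric Chamfer distance. The abstract does not say which variant the authors use, so the squared distances and per-set averaging here are assumptions, and the function names are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets.

    Each direction averages, over the points of one set, the squared
    distance to the nearest point of the other set (assumed variant).
    """
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy data: a three-point "input cloud" and two candidate structure points.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
structure_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(structure_points, cloud))
```

The loss is zero when both sets coincide and grows as structure points drift away from the cloud, or as parts of the cloud are left without a nearby structure point.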

4. Volumetric landmark detection with a multi-scale shift equivariant neural network [PDF] Back to contents
  Tianyu Ma, Ajay Gupta, Mert R. Sabuncu
Abstract: Deep neural networks yield promising results in a wide range of computer vision applications, including landmark detection. A major challenge for accurate anatomical landmark detection in volumetric images such as clinical CT scans is that large-scale data often constrain the capacity of the employed neural network architecture due to GPU memory limitations, which in turn can limit the precision of the output. We propose a multi-scale, end-to-end deep learning method that achieves fast and memory-efficient landmark detection in 3D images. Our architecture consists of blocks of shift-equivariant networks, each of which performs landmark detection at a different spatial scale. These blocks are connected from coarse to fine-scale, with differentiable resampling layers, so that all levels can be trained together. We also present a noise injection strategy that increases the robustness of the model and allows us to quantify uncertainty at test time. We evaluate our method for carotid artery bifurcations detection on 263 CT volumes and achieve a better than state-of-the-art accuracy with mean Euclidean distance error of 2.81mm.

5. Deep Multi-Modal Sets [PDF] Back to contents
  Austin Reiter, Menglin Jia, Pu Yang, Ser-Nam Lim
Abstract: Many vision-related tasks benefit from reasoning over multiple modalities to leverage complementary views of data in an attempt to learn robust embedding spaces. Most deep learning-based methods rely on a late fusion technique whereby multiple feature types are encoded and concatenated and then a multi-layer perceptron (MLP) combines the fused embedding to make predictions. This has several limitations, such as an unnatural enforcement that all features be present at all times as well as constraining only a constant number of occurrences of a feature modality at any given time. Furthermore, as more modalities are added, the concatenated embedding grows. To mitigate this, we propose Deep Multi-Modal Sets: a technique that represents a collection of features as an unordered set rather than one long ever-growing fixed-size vector. The set is constructed so that we have invariance both to permutations of the feature modalities as well as to the cardinality of the set. We will also show that with particular choices in our model architecture, we can yield interpretable feature performance such that during inference time we can observe which modalities are most contributing to the prediction. With this in mind, we demonstrate a scalable, multi-modal framework that reasons over different modalities to learn various types of tasks. We demonstrate new state-of-the-art performance on two multi-modal datasets (Ads-Parallelity [34] and MM-IMDb [1]).
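The set-based idea in the abstract above can be sketched as follows. The abstract does not name the pooling operator, so element-wise max pooling is an assumption here, and `pool_feature_set` is a hypothetical helper rather than the authors' code. Max pooling gives an output whose size does not depend on how many modalities are present or in what order, and the per-dimension "winner" hints at which modality contributed.

```python
from typing import Dict, List, Tuple


def pool_feature_set(features: Dict[str, List[float]]) -> Tuple[List[float], List[str]]:
    """Element-wise max pooling over a set of equally sized modality embeddings.

    Returns the pooled vector and, per dimension, the modality that supplied
    the maximum (a crude view of which modality contributes where).
    """
    names = sorted(features)  # sorting makes tie-breaking independent of input order
    dim = len(features[names[0]])
    pooled: List[float] = []
    winners: List[str] = []
    for i in range(dim):
        best = max(names, key=lambda name: features[name][i])
        pooled.append(features[best][i])
        winners.append(best)
    return pooled, winners


two_modalities = {"image": [0.2, 0.9], "text": [0.7, 0.1]}
reordered = {"text": [0.7, 0.1], "image": [0.2, 0.9]}
three_modalities = {"image": [0.2, 0.9], "text": [0.7, 0.1], "audio": [0.0, 0.0]}

print(pool_feature_set(two_modalities))
print(pool_feature_set(three_modalities))
```

Unlike concatenation, the pooled vector keeps the same length when the third modality is added, and reordering the inputs leaves the result unchanged.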

6. Image Matching across Wide Baselines: From Paper to Practice [PDF] Back to contents
  Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, Eduard Trulls
Abstract: We introduce a comprehensive benchmark for local features and robust estimation algorithms, focusing on the downstream task -- the accuracy of the reconstructed camera pose -- as our primary metric. Our pipeline's modular structure allows us to easily integrate, configure, and combine methods and heuristics. We demonstrate this by embedding dozens of popular algorithms and evaluating them, from seminal works to the cutting edge of machine learning research. We show that with proper settings, classical solutions may still outperform the perceived state of the art. Besides establishing the actual state of the art, the experiments conducted in this paper reveal unexpected properties of SfM pipelines that can be exploited to help improve their performance, for both algorithmic and learned methods. Data and code are online this https URL, providing an easy-to-use and flexible framework for the benchmarking of local feature and robust estimation methods, both alongside and against top-performing methods. This work provides the basis for an open challenge on wide-baseline image matching this https URL .

7. Distilled Hierarchical Neural Ensembles with Adaptive Inference Cost [PDF] Back to contents
  Adria Ruiz, Jakob Verbeek
Abstract: Deep neural networks form the basis of state-of-the-art models across a variety of application domains. Moreover, networks that are able to dynamically adapt the computational cost of inference are important in scenarios where the amount of compute or input data varies over time. In this paper, we propose Hierarchical Neural Ensembles (HNE), a novel framework to embed an ensemble of multiple networks by sharing intermediate layers using a hierarchical structure. In HNE we control the inference cost by evaluating only a subset of models, which are organized in a nested manner. Our second contribution is a novel co-distillation method to boost the performance of ensemble predictions with low inference cost. This approach leverages the nested structure of our ensembles, to optimally allocate accuracy and diversity across the ensemble members. Comprehensive experiments over the CIFAR and ImageNet datasets confirm the effectiveness of HNE in building deep networks with adaptive inference cost for image classification.

8. Disentangling Physical Dynamics from Unknown Factors for Unsupervised Video Prediction [PDF] Back to contents
  Vincent Le Guen, Nicolas Thome
Abstract: Leveraging physical knowledge described by partial differential equations (PDEs) is an appealing way to improve unsupervised video prediction methods. Since physics is too restrictive for describing the full visual content of generic videos, we introduce PhyDNet, a two-branch deep architecture, which explicitly disentangles PDE dynamics from unknown complementary information. A second contribution is to propose a new recurrent physical cell (PhyCell), inspired from data assimilation techniques, for performing PDE-constrained prediction in latent space. Extensive experiments conducted on four various datasets show the ability of PhyDNet to outperform state-of-the-art methods. Ablation studies also highlight the important gain brought out by both disentanglement and PDE-constrained prediction. Finally, we show that PhyDNet presents interesting features for dealing with missing data and long-term forecasting.

9. Implicit Functions in Feature Space for 3D Shape Reconstruction and Completion [PDF] Back to contents
  Julian Chibane, Thiemo Alldieck, Gerard Pons-Moll
Abstract: While many works focus on 3D reconstruction from images, in this paper, we focus on 3D shape reconstruction and completion from a variety of 3D inputs, which are deficient in some respect: low and high resolution voxels, sparse and dense point clouds, complete or incomplete. Processing of such 3D inputs is an increasingly important problem as they are the output of 3D scanners, which are becoming more accessible, and are the intermediate output of 3D computer vision algorithms. Recently, learned implicit functions have shown great promise as they produce continuous reconstructions. However, we identified two limitations in reconstruction from 3D inputs: 1) details present in the input data are not retained, and 2) poor reconstruction of articulated humans. To solve this, we propose Implicit Feature Networks (IF-Nets), which deliver continuous outputs, can handle multiple topologies, and complete shapes for missing or sparse input data retaining the nice properties of recent learned implicit functions, but critically they can also retain detail when it is present in the input data, and can reconstruct articulated humans. Our work differs from prior work in two crucial aspects. First, instead of using a single vector to encode a 3D shape, we extract a learnable 3-dimensional multi-scale tensor of deep features, which is aligned with the original Euclidean space embedding the shape. Second, instead of classifying x-y-z point coordinates directly, we classify deep features extracted from the tensor at a continuous query point. We show that this forces our model to make decisions based on global and local shape structure, as opposed to point coordinates, which are arbitrary under Euclidean transformations. Experiments demonstrate that IF-Nets outperform prior work in 3D object reconstruction in ShapeNet, and obtain significantly more accurate 3D human reconstructions.

10. Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications [PDF] Back to contents
  Biagio Brattoli, Joe Tighe, Fedor Zhdanov, Pietro Perona, Krzysztof Chalupka
Abstract: Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at this http URL.

11. 3D dynamic hand gestures recognition using the Leap Motion sensor and convolutional neural networks [PDF] Back to contents
  Katia Lupinetti, Andrea Ranieri, Franca Giannini, Marina Monti
Abstract: Defining methods for the automatic understanding of gestures is of paramount importance in many application contexts and in Virtual Reality applications for creating more natural and easy-to-use human-computer interaction methods. In this paper, we present a method for the recognition of a set of non-static gestures acquired through the Leap Motion sensor. The acquired gesture information is converted into color images, where the variation of hand joint positions during the gesture is projected on a plane and temporal information is represented with the color intensity of the projected points. The classification of the gestures is performed using a deep Convolutional Neural Network (CNN). A modified version of the popular ResNet-50 architecture is adopted, obtained by removing the last fully connected layer and adding a new layer with as many neurons as the considered gesture classes. The method has been successfully applied to the existing reference dataset and preliminary tests have already been performed for the real-time recognition of dynamic gestures performed by users.
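The gesture-to-image conversion described in the abstract above might look roughly like the sketch below. The abstract does not give the projection plane, image size, or color mapping, so the x-y plane, the normalized coordinates in [0, 1], the grid resolution, and the single-channel intensity are all assumptions, and `gesture_to_image` is a hypothetical name.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]


def gesture_to_image(frames: List[List[Joint]], size: int = 32) -> List[List[float]]:
    """Rasterize a joint trajectory into a size x size intensity image.

    Joints are projected onto the x-y plane (z is dropped); later frames are
    drawn with higher intensity so that time is encoded in pixel brightness.
    Coordinates are assumed to be normalized to [0, 1].
    """
    image = [[0.0] * size for _ in range(size)]
    total = len(frames)
    for t, joints in enumerate(frames):
        intensity = (t + 1) / total  # 1/total for the first frame, 1.0 for the last
        for x, y, _z in joints:
            col = min(size - 1, max(0, int(x * size)))
            row = min(size - 1, max(0, int(y * size)))
            image[row][col] = max(image[row][col], intensity)
    return image


# One tracked joint moving diagonally over two frames.
frames = [[(0.0, 0.0, 0.5)], [(0.5, 0.5, 0.5)]]
image = gesture_to_image(frames, size=4)
for row in image:
    print(row)
```

An image built this way can be fed to a standard image classifier, which is the role the modified ResNet-50 plays in the abstract.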

12. UDD: An Underwater Open-sea Farm Object Detection Dataset for Underwater Robot Picking [PDF] Back to contents
  Zhihui Wang, Chongwei Liu, Shijie Wang, Tao Tang, Yulong Tao, Caifei Yang, Haojie Li, Xing Liu, Xin Fan
Abstract: To promote the development of underwater robot picking in sea farms, we propose an underwater open-sea farm object detection dataset called UDD. Concretely, UDD consists of 3 categories (sea cucumber, sea urchin, and scallop) with 2227 images. To the best of our knowledge, it is the first dataset collected in a real open-sea farm for underwater robot picking. We also propose a novel Poisson-blending-embedded Generative Adversarial Network (Poisson GAN) to overcome the class-imbalance and massive small objects issues in UDD. By utilizing Poisson GAN to change the number, position, and even size of objects in UDD, we construct a large-scale augmented dataset (AUDD) containing 18K images. Besides, in order to make the detector better adapted to the underwater picking environment, a dataset (Pre-trained dataset) for pre-training containing 590K images is also proposed. Finally, we design a lightweight network (UnderwaterNet) to address the problems of detecting small objects in cloudy underwater pictures and meeting the efficiency requirements of robots. Specifically, we design a depth-wise-convolution-based Multi-scale Contextual Features Fusion (MFF) block and a Multi-scale Blursampling (MBP) module to reduce the parameters of the network to 1.3M at 48FPS, without any loss in accuracy. Extensive experiments verify the effectiveness of the proposed UnderwaterNet, Poisson GAN, UDD, AUDD, and Pre-trained datasets.

13. What's the relationship between CNNs and communication systems? [PDF] Back to contents
  Hao Ge, Xiaoguang Tu, Yanxiang Gong, Mei Xie, Zheng Ma
Abstract: The interpretability of Convolutional Neural Networks (CNNs) is an important topic in the field of computer vision. In recent years, works in this field generally adopt a mature model to reveal the internal mechanism of CNNs, helping to understand CNNs thoroughly. In this paper, we argue the working mechanism of CNNs can be revealed through a totally different interpretation, by comparing the communication systems and CNNs. This paper successfully obtained the corresponding relationship between the modules of the two, and verified the rationality of the corresponding relationship with experiments. Finally, through the analysis of some cutting-edge research on neural networks, we find the inherent relation between these two tasks can be of help in explaining these researches reasonably, as well as helping us discover the correct research direction of neural networks.

14. DeepSperm: A robust and real-time bull sperm-cell detection in densely populated semen videos [PDF] Back to contents
  Priyanto Hidayatullah, Xueting Wang, Toshihiko Yamasaki, Tati L.E.R. Mengko, Rinaldi Munir, Anggraini Barlian, Eros Sukmawati, Supraptono Supraptono
Abstract: Background and Objective: Object detection is a primary research interest in computer vision. Sperm-cell detection in a densely populated bull semen microscopic observation video presents challenges such as partial occlusion, vast number of objects in a single video frame, tiny size of the object, artifacts, low contrast, and blurry objects because of the rapid movement of the sperm cells. This study proposes an architecture, called DeepSperm, that solves the aforementioned challenges and is more accurate and faster than state-of-the-art architectures. Methods: In the proposed architecture, we use only one detection layer, which is specific for small object detection. For handling overfitting and increasing accuracy, we set a higher network resolution, use a dropout layer, and perform data augmentation on hue, saturation, and exposure. Several hyper-parameters are tuned to achieve better performance. We compare our proposed method with those of a conventional image processing-based object-detection method, you only look once (YOLOv3), and mask region-based convolutional neural network (Mask R-CNN). Results: In our experiment, we achieve 86.91 mAP on the test dataset and a processing speed of 50.3 fps. In comparison with YOLOv3, we achieve an increase of 16.66 mAP point, 3.26 x faster on testing, and 1.4 x faster on training with a small training dataset, which contains 40 video frames. The weights file size was also reduced significantly, with 16.94 x smaller than that of YOLOv3. Moreover, it requires 1.3 x less graphical processing unit (GPU) memory than YOLOv3. Conclusions: This study proposes DeepSperm, which is a simple, effective, and efficient architecture with its hyper-parameters and configuration to detect bull sperm cells robustly in real time. In our experiment, we surpass the state of the art in terms of accuracy, speed, and resource needs.

15. Fully Convolutional Networks for Automatically Generating Image Masks to Train Mask R-CNN [PDF] Back to contents
  Hao Wu, Jan Paul Siebert
Abstract: This paper proposes a novel method for automatically generating image masks for the state-of-the-art Mask R-CNN deep learning method. Mask R-CNN achieves the best results in object detection to date; however, obtaining the object masks for training is very time-consuming and laborious. The proposed method uses a two-stage design to generate image masks automatically. The first stage implements a fully convolutional network (FCN) based segmentation network. The second stage is a Mask R-CNN based object detection network, which is trained on the object image masks output by the FCN, the original input image, and additional label information. In experiments, the proposed method obtains the image masks automatically to train Mask R-CNN, and it achieves very high classification accuracy, with a mean average precision (mAP) of over 90% for segmentation.

16. multi-patch aggregation models for resampling detection [PDF] Back to contents
  Mohit Lamba, Kaushik Mitra
Abstract: Images captured nowadays are of varying dimensions, with smartphones and DSLRs allowing users to choose from a list of available image resolutions. It is therefore imperative for forensic algorithms such as resampling detection to scale well for images of varying dimensions. However, in our experiments, we observed that many state-of-the-art forensic algorithms are sensitive to image size and their performance quickly degenerates when operated on images of diverse dimensions despite re-training them using multiple image sizes. To handle this issue, we propose a novel pooling strategy called ITERATIVE POOLING. This pooling strategy can dynamically adjust input tensors in a discrete without much loss of information as in ROI Max-pooling. This pooling strategy can be used with any of the existing deep models and, for demonstration purposes, we show its utility on Resnet-18 for the case of resampling detection, a fundamental operation for any image sought of image manipulation. Compared to existing strategies and Max-pooling, it gives up to 7-8% improvement on public datasets.

17. DiPE: Deeper into Photometric Errors for Unsupervised Learning of Depth and Ego-motion from Monocular Videos [PDF] Back to contents
  Hualie Jiang, Laiyan Ding, Rui Huang
Abstract: Unsupervised learning of depth and ego-motion from unlabelled monocular videos has recently drawn attention, as it has notable advantages over supervised approaches. It uses the photometric errors between the target view and the views synthesized from its adjacent source views as the loss. Although significant progress has been made, the learning still suffers from occlusion and scene dynamics. This paper shows that carefully manipulating photometric errors can tackle these difficulties better. The primary improvement is achieved by masking out the invisible or nonstationary pixels in the photometric error map using a statistical technique. With this outlier masking approach, the depth of objects that move in the opposite direction to the camera can be estimated more accurately. To the best of our knowledge, such objects have not been seriously considered in previous work, even though they pose a higher risk in applications like autonomous driving. We also propose an efficient weighted multi-scale scheme to reduce the artifacts in the predicted depth maps. Extensive experiments on the KITTI dataset show the effectiveness of the proposed approaches. The overall system achieves state-of-the-art performance on both depth and ego-motion estimation.
Summary: DiPE masks out invisible or nonstationary pixels in the photometric error map with a statistical technique and adds an efficient weighted multi-scale scheme, reaching state-of-the-art depth and ego-motion estimation on KITTI.
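
The outlier masking idea in the DiPE abstract can be illustrated with a minimal sketch. The abstract only says that "a statistical technique" is used, so the percentile-style cutoff below, the function name `masked_photometric_loss`, and the `keep_fraction` parameter are all assumptions made for illustration, not the authors' method.

```python
def masked_photometric_loss(errors, keep_fraction=0.9):
    """Average per-pixel photometric errors after masking the largest ones.

    errors: 2D list of non-negative per-pixel errors.
    keep_fraction: hypothetical share of pixels treated as inliers.
    Returns (mean error over kept pixels, binary mask with 1 = kept).
    """
    flat = sorted(e for row in errors for e in row)
    # The k-th smallest error acts as the inlier threshold.
    k = max(1, int(len(flat) * keep_fraction))
    threshold = flat[k - 1]
    kept = [e for e in flat if e <= threshold]
    mask = [[1 if e <= threshold else 0 for e in row] for row in errors]
    return sum(kept) / len(kept), mask


# One pixel has a very large error, e.g. because it is occluded or moving.
errors = [[0.1, 0.2], [0.1, 5.0]]
loss, mask = masked_photometric_loss(errors, keep_fraction=0.75)
```

Here the pixel with error 5.0 is excluded from the loss, so it no longer dominates the training signal.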

18. Gastric histopathology image segmentation using a hierarchical conditional random field [PDF] Back to contents
  Changhao Sun, Chen Li, Xiaoyan Li
Abstract: In this paper, a Hierarchical Conditional Random Field (HCRF) model based Gastric Histopathology Image Segmentation (GHIS) method is proposed, which can localize abnormal (cancer) regions in gastric histopathology images obtained by optical microscope to assist histopathologists in medical work. First, to obtain pixel-level segmentation information, we retrain a Convolutional Neural Network (CNN) to build up our pixel-level potentials. Then, in order to obtain abundant spatial segmentation information in patch-level, we fine-tune another three CNNs to build up our patch-level potentials. Thirdly, based on the pixel and patch-level potentials, our HCRF model is structured. Finally, graph-based post-processing is applied to further improve our segmentation performance. In the experiment, a segmentation accuracy of 78.91% is achieved on a Hematoxylin and Eosin (H&E) stained gastric histopathological dataset with 560 images, showing the effectiveness and future potential of the proposed GHIS method.
Summary: A hierarchical conditional random field built on pixel-level and patch-level CNN potentials, followed by graph-based post-processing, localizes abnormal (cancer) regions in gastric histopathology images and reaches 78.91% segmentation accuracy on an H&E stained dataset of 560 images.

19. Data-Free Adversarial Perturbations for Practical Black-Box Attack [PDF] Back to contents
  ZhaoXin Huan, Yulong Wang, Xiaolu Zhang, Lin Shang, Chilin Fu, Jun Zhou
Abstract: Neural networks are vulnerable to adversarial examples, which are malicious inputs crafted to fool pre-trained models. Adversarial examples often exhibit black-box attack transferability, which allows adversarial examples crafted for one model to fool another model. However, existing black-box attack methods require samples from the training data distribution to improve the transferability of adversarial examples across different models. Because of this data dependence, the fooling ability of adversarial perturbations is only applicable when training data are accessible. In this paper, we present a data-free method for crafting adversarial perturbations that can fool a target model without any knowledge about the training data distribution. In the practical setting of a black-box attack scenario where attackers do not have access to target models and training data, our method achieves high fooling rates on target models and outperforms other universal adversarial perturbation methods. Our method empirically shows that current deep learning models are still at risk even when the attackers do not have access to training data.
Summary: A data-free method crafts adversarial perturbations that fool a target model without any knowledge of the training data distribution, achieving high fooling rates in black-box settings and outperforming other universal adversarial perturbation methods.

20. Trained Model Fusion for Object Detection using Gating Network [PDF] Back to contents
  Tetsuo Inoshita, Yuichi Nakatani, Katsuhiko Takahashi, Asuka Ishii, Gaku Nakano
Abstract: The major approaches of transfer learning in computer vision have tried to adapt the source domain to the target domain one-to-one. However, this scenario is difficult to apply to real applications such as video surveillance systems. As those systems have many cameras installed at each location regarded as source domains, it is difficult to identify the proper source domain. In this paper, we introduce a new transfer learning scenario that has various source domains and one target domain, assuming video surveillance system integration. Also, we propose a novel method for automatically producing a high accuracy model by fusing models trained at various source domains. In particular, we show how to apply a gating network to fuse source domains for object detection tasks, which is a new approach. We demonstrate the effectiveness of our method through experiments on traffic surveillance datasets.
Summary: A transfer learning scenario with many source domains and one target domain is introduced for video surveillance system integration; a gating network fuses models trained on the source domains for object detection, evaluated on traffic surveillance datasets.

21. Towards Noise-resistant Object Detection with Noisy Annotations [PDF] Back to contents
  Junnan Li, Caiming Xiong, Richard Socher, Steven Hoi
Abstract: Training deep object detectors requires a significant amount of human-annotated images with accurate object labels and bounding box coordinates, which are extremely expensive to acquire. Noisy annotations are much more easily accessible, but they could be detrimental for learning. We address the challenging problem of training object detectors with noisy annotations, where the noise contains a mixture of label noise and bounding box noise. We propose a learning framework which jointly optimizes object labels, bounding box coordinates, and model parameters by performing alternating noise correction and model training. To disentangle label noise and bounding box noise, we propose a two-step noise correction method. The first step performs class-agnostic bounding box correction by minimizing classifier discrepancy and maximizing region objectness. The second step distils knowledge from dual detection heads for soft label correction and class-specific bounding box refinement. We conduct experiments on the PASCAL VOC and MS-COCO datasets with both synthetic noise and machine-generated noise. Our method achieves state-of-the-art performance by effectively cleaning both label noise and bounding box noise. Code to reproduce all results will be released.
Summary: A framework alternates noise correction and model training to jointly optimize object labels, bounding boxes, and model parameters. A two-step correction first fixes boxes class-agnostically, then distils knowledge from dual detection heads for soft labels and class-specific box refinement, evaluated on PASCAL VOC and MS-COCO.

22. Disrupting DeepFakes: Adversarial Attacks Against Conditional Image Translation Networks and Facial Manipulation Systems [PDF] Back to contents
  Nataniel Ruiz, Stan Sclaroff
Abstract: Face modification systems using deep learning have become increasingly powerful and accessible. Given images of a person's face, such systems can generate new images of that same person under different expressions and poses. Some systems can also modify targeted attributes such as hair color or age. Manipulated images and videos of this type have been dubbed DeepFakes. In order to prevent a malicious user from generating modified images of a person without their consent, we tackle the new problem of generating adversarial attacks against image translation systems, which disrupt the resulting output image. We call this problem disrupting deepfakes. We adapt traditional adversarial attacks to our scenario. Most image translation architectures are generative models conditioned on an attribute (e.g. put a smile on this person's face). We present class transferable adversarial attacks that generalize to different classes, which means that the attacker does not need to have knowledge about the conditioning vector. In gray-box scenarios, blurring can mount a successful defense against disruption. We present a spread-spectrum adversarial attack, which evades blurring defenses.
Summary: Adversarial attacks on image translation systems disrupt their outputs to prevent non-consensual face manipulation. The work presents class-transferable attacks that need no knowledge of the conditioning vector and a spread-spectrum attack that evades blurring defenses.

23. Single-Shot Pose Estimation of Surgical Robot Instruments' Shafts from Monocular Endoscopic Images [PDF] Back to contents
  Masakazu Yoshimura, Murilo M. Marinho, Kanako Harada, Mamoru Mitsuishi
Abstract: Surgical robots are used to perform minimally invasive surgery and alleviate much of the burden imposed on surgeons. Our group has developed a surgical robot to aid in the removal of tumors at the base of the skull via access through the nostrils. To avoid injuring the patients, a collision-avoidance algorithm that depends on having an accurate model for the poses of the instruments' shafts is used. Given that the model's parameters can change over time owing to interactions between instruments and other disturbances, the online estimation of the poses of the instrument's shaft is essential. In this work, we propose a new method to estimate the pose of the surgical instruments' shafts using a monocular endoscope. Our method is based on the use of an automatically annotated training dataset and an improved pose-estimation deep-learning architecture. In preliminary experiments, we show that our method can surpass state of the art vision-based marker-less pose estimation techniques (providing an error decrease of 55% in position estimation, 64% in pitch, and 69% in yaw) by using artificial images.
Summary: A monocular-endoscope method estimates the pose of surgical instrument shafts using an automatically annotated training dataset and an improved pose-estimation architecture, reducing error by 55% in position, 64% in pitch, and 69% in yaw in preliminary experiments with artificial images.

24. Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud [PDF] Back to contents
  Weijing Shi, Ragunathan Rajkumar
Abstract: In this paper, we propose a graph neural network to detect objects from a LiDAR point cloud. Towards this end, we encode the point cloud efficiently in a fixed radius near-neighbors graph. We design a graph neural network, named Point-GNN, to predict the category and shape of the object that each vertex in the graph belongs to. In Point-GNN, we propose an auto-registration mechanism to reduce translation variance, and also design a box merging and scoring operation to combine detections from multiple vertices accurately. Our experiments on the KITTI benchmark show that the proposed approach achieves leading accuracy using the point cloud alone and can even surpass fusion-based algorithms. Our results demonstrate the potential of using the graph neural network as a new approach for 3D object detection. The code is available at this https URL.
Summary: Point-GNN encodes a LiDAR point cloud as a fixed radius near-neighbors graph and predicts the category and shape of the object each vertex belongs to, using auto-registration to reduce translation variance and box merging and scoring to combine detections; it achieves leading accuracy on KITTI.
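
To make the "fixed radius near-neighbors graph" in the Point-GNN abstract concrete, here is a minimal brute-force sketch. The function name `radius_graph` and the quadratic-time search are illustrative assumptions; the abstract does not describe how the authors build the graph, and a practical system would likely use a spatial index.

```python
import math


def radius_graph(points, radius):
    """Return directed edges (i, j) between distinct points within `radius`.

    points: list of (x, y, z) tuples. Brute force, O(n^2), for illustration.
    """
    edges = []
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= radius:
                edges.append((i, j))
    return edges


# Two nearby points and one far away: only the nearby pair is connected.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
edges = radius_graph(points, radius=1.5)
```

Each vertex of the resulting graph would then carry features that a graph neural network aggregates along these edges.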

25. MVC-Net: A Convolutional Neural Network Architecture for Manifold-Valued Images With Applications [PDF] Back to contents
  Jose J. Bouza, Chun-Hao Yang, Baba C. Vemuri
Abstract: Geometric deep learning has attracted significant attention in recent years, in part due to the availability of exotic data types for which traditional neural network architectures are not well suited. Our goal in this paper is to generalize convolutional neural networks (CNN) to the manifold-valued image case, which arises commonly in medical imaging and computer vision applications. Explicitly, the input data to the network is an image where each pixel value is a sample from a Riemannian manifold. To achieve this goal, we must generalize the basic building block of traditional CNN architectures, namely, the weighted combinations operation. To this end, we develop a tangent space combination operation, which is used to define a convolution operation on manifold-valued images that we call the Manifold-Valued Convolution (MVC). We prove theoretical properties of the MVC operation, including equivariance to the action of the isometry group admitted by the manifold, and characterize when compositions of MVC layers collapse to a single layer. We present a detailed description of how to use MVC layers to build full, multi-layer neural networks that operate on manifold-valued images, which we call the MVC-net. Further, we empirically demonstrate superior performance of the MVC-nets in medical imaging and computer vision tasks.
Summary: MVC-net generalizes CNNs to images whose pixel values lie on a Riemannian manifold by defining convolution through a tangent space combination operation, with proven equivariance to the manifold's isometry group and empirical gains in medical imaging and computer vision tasks.

26. MRI Super-Resolution with GAN and 3D Multi-Level DenseNet: Smaller, Faster, and Better [PDF] Back to contents
  Yuhua Chen, Anthony G. Christodoulou, Zhengwei Zhou, Feng Shi, Yibin Xie, Debiao Li
Abstract: High-resolution (HR) magnetic resonance imaging (MRI) provides detailed anatomical information that is critical for diagnosis in clinical applications. However, HR MRI typically comes at the cost of long scan time, small spatial coverage, and low signal-to-noise ratio (SNR). Recent studies showed that with a deep convolutional neural network (CNN), HR generic images could be recovered from low-resolution (LR) inputs via single image super-resolution (SISR) approaches. Additionally, previous works have shown that a deep 3D CNN can generate high-quality SR MRIs by using learned image priors. However, 3D CNNs with deep structures have a large number of parameters and are computationally expensive. In this paper, we propose a novel 3D CNN architecture, namely a multi-level densely connected super-resolution network (mDCSRN), which is light-weight, fast, and accurate. We also show that with generative adversarial network (GAN)-guided training, the mDCSRN-GAN provides appealing sharp SR images with rich texture details that are highly comparable with the referenced HR images. Our results from experiments on a large public dataset with 1,113 subjects showed that this new architecture outperformed other popular deep learning methods in recovering 4x resolution-downgraded images in both quality and speed.
Summary: mDCSRN is a light-weight, fast, and accurate 3D densely connected network for MRI super-resolution; with GAN-guided training it produces sharp images with rich texture, and it outperforms other popular deep learning methods at recovering 4x resolution-downgraded images on a public dataset with 1,113 subjects.

27. Energy-efficient and Robust Cumulative Training with Net2Net Transformation [PDF] Back to contents
  Aosong Feng, Priyadarshini Panda
Abstract: Deep learning has achieved state-of-the-art accuracies on several computer vision tasks. However, the computational and energy requirements associated with training such deep neural networks can be quite high. In this paper, we propose a cumulative training strategy with Net2Net transformation that achieves training computational efficiency without incurring a large accuracy loss in comparison to a model trained from scratch. We achieve this by first training a small network (with fewer parameters) on a small subset of the original dataset, and then gradually expanding the network using Net2Net transformation to train incrementally on larger subsets of the dataset. This incremental training strategy with Net2Net utilizes function-preserving transformations that transfer knowledge from each previous small network to the next larger network, thereby reducing the overall training complexity. Our experiments demonstrate that, compared with training from scratch, cumulative training yields a ~2x reduction in computational complexity for training TinyImageNet using VGG19 at iso-accuracy. Besides training efficiency, a key advantage of our cumulative training strategy is that we can perform pruning during Net2Net expansion to obtain a final network with optimal configuration (~0.4x lower inference compute complexity) compared to conventional training from scratch. We also demonstrate that the final network obtained from cumulative training yields better generalization performance and noise robustness. Further, we show that mutual inference from all the networks created with cumulative Net2Net expansion enables improved adversarial input detection.
Summary: Cumulative training starts with a small network on a small data subset and grows it with function-preserving Net2Net transformations on larger subsets, giving about a 2x reduction in training computation for VGG19 on TinyImageNet at iso-accuracy, plus pruning during expansion, better generalization, noise robustness, and improved adversarial input detection.
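
The function-preserving property mentioned in the abstract can be checked on a toy example. The sketch below widens a two-layer ReLU network in the style of Net2WiderNet: one hidden unit is duplicated and its outgoing weights are halved, so the output is unchanged. The helper names (`widen`, `forward`) and the tiny weights are illustrative assumptions; how the paper schedules such expansions is not specified in the abstract.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(W1, W2, x):
    """Two-layer network without biases: y = W2 * relu(W1 * x)."""
    return matvec(W2, relu(matvec(W1, x)))


def widen(W1, W2, unit):
    """Duplicate hidden unit `unit` while preserving the network function."""
    # The new hidden unit copies the incoming weights of the chosen unit.
    new_W1 = [row[:] for row in W1] + [W1[unit][:]]
    new_W2 = []
    for row in W2:
        half = row[unit] / 2.0  # split the outgoing weight between the copies
        new_row = row[:]
        new_row[unit] = half
        new_row.append(half)
        new_W2.append(new_row)
    return new_W1, new_W2


W1 = [[1.0, -2.0], [0.5, 0.5]]
W2 = [[2.0, 3.0]]
x = [1.0, 0.25]
wide_W1, wide_W2 = widen(W1, W2, unit=0)
y_small = forward(W1, W2, x)
y_wide = forward(wide_W1, wide_W2, x)
```

Because `y_small` and `y_wide` agree, training can continue from the wider network without losing what the smaller one learned.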

28. DEEVA: A Deep Learning and IoT Based Computer Vision System to Address Safety and Security of Production Sites in Energy Industry [PDF] Back to contents
  Nimish M. Awalgaonkar, Haining Zheng, Christopher S. Gurciullo
Abstract: When it comes to addressing safety- and security-related needs at production and construction sites, accurate detection of the presence of workers, vehicles, and equipment is important and forms an integral part of computer vision-based surveillance systems (CVSS). Traditional CVSS focus on computer vision and pattern recognition algorithms that rely heavily on manual feature extraction and small datasets, which limits their usage because of low accuracy, the need for expert knowledge, and high computational costs. The main objective of this paper is to provide decision makers at sites with a practical yet comprehensive deep learning and IoT based solution to tackle various computer vision problems such as scene classification, object detection in scenes, semantic segmentation, and scene captioning. Our overarching goal is to answer, in an automated fashion, the central question of what is happening at a site and where it is happening, minimizing the need for human resources dedicated to surveillance. We developed the Deep ExxonMobil Eye for Video Analysis (DEEVA) package to handle scene classification, object detection, semantic segmentation, and captioning of scenes in a hierarchical approach. The results reveal that transfer learning with the RetinaNet object detector can detect the presence of workers, different types of vehicles and construction equipment, and safety-related objects at a high level of accuracy (above 90%). With deep learning to extract features automatically and IoT technology to automatically capture, transfer, and process vast amounts of real-time images, this framework is an important step towards intelligent surveillance systems aimed at addressing myriad open-ended problems in security and safety monitoring, productivity assessment, and future decision making.
Summary: DEEVA is a deep learning and IoT based package for automated site surveillance that handles scene classification, object detection, semantic segmentation, and captioning hierarchically; transfer learning with RetinaNet detects workers, vehicles, equipment, and safety-related objects with above 90% accuracy.

29. LiDARNet: A Boundary-Aware Domain Adaptation Model for Lidar Point Cloud Semantic Segmentation [PDF] Back to contents
  Peng Jiang, Srikanth Saripalli
Abstract: We present a boundary-aware domain adaptation model for Lidar point cloud semantic segmentation. Our model is designed to extract both the domain private features and the domain shared features using shared weights. We embed Gated-SCNN into the shared feature extractors to help it learn boundary information while learning other shared features. In addition, a CycleGAN mechanism is imposed for further adaptation. We conducted experiments on real-world datasets. The source domain data is from the Semantic KITTI dataset, and the target domain data is collected from our own platform (a warthog) in off-road as well as urban scenarios. The two datasets differ in channel distributions, reflectivity distributions, and sensor setup. Using our approach, we are able to obtain a single model that works on both domains. The model achieves state-of-the-art performance on the source domain (Semantic KITTI dataset) and 44.0% mIoU on the target domain dataset.
Summary: LiDARNet is a boundary-aware domain adaptation model for Lidar semantic segmentation that extracts domain private and domain shared features, embeds Gated-SCNN to learn boundaries, and applies a CycleGAN mechanism; a single model works on Semantic KITTI and reaches 44.0% mIoU on the target domain.
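
The mIoU figure quoted in the abstract is a mean of per-class intersection-over-union scores. A minimal sketch of the metric over flat label lists follows; skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here, as is the function name `mean_iou`. The abstract does not say which convention the authors use.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for per-point labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0 has IoU 1/2 and class 1 has IoU 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```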

30. Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream [PDF] Back to contents
  Chen Jiang, Masood Dehghan, Martin Jagersand
Abstract: Manipulation tasks in daily life, such as pouring water, unfold intentionally under specialized manipulation contexts. Being able to process contextual knowledge in these Activities of Daily Living (ADLs) over time can help us understand manipulation intentions, which are essential for an intelligent robot to transition smoothly between various manipulation actions. In this paper, to model the intended concepts of manipulation, we present a vision dataset under a strictly constrained knowledge domain for both robot and human manipulations, where manipulation concepts and relations are stored by an ontology system in a taxonomic manner. Furthermore, we propose a scheme to generate a combination of visual attentions and an evolving knowledge graph filled with commonsense knowledge. Our scheme works with real-world camera streams and fuses an attention-based Vision-Language model with the ontology system. The experimental results demonstrate that the proposed scheme can successfully represent the evolution of an intended object manipulation procedure for both robots and humans. The proposed scheme allows the robot to mimic human-like intentional behaviors by watching real-time videos. We aim to develop this scheme further for real-world robot intelligence in Human-Robot Interaction.
Summary: A vision dataset and an ontology system model manipulation concepts for robots and humans; the proposed scheme fuses an attention-based Vision-Language model with the ontology to build an evolving knowledge graph from real-world camera streams, letting a robot mimic human-like intentional behaviors from real-time video.

31. Unsupervised Domain Adaptation for Mammogram Image Classification: A Promising Tool for Model Generalization [PDF] Back to contents
  Yu Zhang, Gongbo Liang, Nathan Jacobs, Xiaoqin Wang
Abstract: Generalization is one of the key challenges in the clinical validation and application of deep learning models to medical images. Studies have shown that such models trained on publicly available datasets often do not work well on real-world clinical data due to the differences in patient population and image device configurations. Also, manually annotating clinical images is expensive. In this work, we propose an unsupervised domain adaptation (UDA) method using Cycle-GAN to improve the generalization ability of the model without using any additional manual annotations.
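Summary: Models trained on public datasets often generalize poorly to real-world clinical data, and manual annotation is expensive. The work proposes an unsupervised domain adaptation (UDA) method using Cycle-GAN to improve generalization without additional manual annotations.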

32. Learning from Suspected Target: Bootstrapping Performance for Breast Cancer Detection in Mammography [PDF] Back to contents
  Li Xiao, Cheng Zhu, Junjun Liu, Chunlong Luo, Peifang Liu, Yi Zhao
Abstract: Deep learning object detection algorithms have been widely used in medical image analysis. Currently, all object detection tasks are based on data annotated with object classes and their bounding boxes. On the other hand, medical images such as mammograms usually contain normal regions or objects that are similar to the lesion region and may be misclassified in the testing stage if they are not handled. In this paper, we address this problem by introducing a novel top likelihood loss together with a new sampling procedure to select and train on the suspected target regions, as well as proposing a similarity loss to further distinguish suspected targets from targets. Mean average precision (mAP) over the predicted targets, and specificity, sensitivity, accuracy, and AUC values over the classification of patients, are adopted for performance comparisons. We first test our proposed method on a private dense mammogram dataset. Results show that our method greatly reduces the false positive rate and that specificity increases by 0.25 for detecting mass type cancer. It is worth mentioning that dense breasts typically carry a higher risk of developing breast cancer and are also harder for cancer detection in diagnosis, and that our method outperforms a reported result from the performance of radiologists. Our method is also validated on the public Digital Database for Screening Mammography (DDSM) dataset, where it brings significant improvement in mass type cancer detection and outperforms most state-of-the-art work.
Summary: A top likelihood loss, a new sampling procedure, and a similarity loss help separate suspected targets from true lesions in mammograms, greatly reducing false positives and raising specificity by 0.25 for mass type cancer on a private dense mammogram dataset, with further gains on DDSM.
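
The patient-level metrics named in the abstract (sensitivity, specificity, accuracy) all derive from the same confusion counts. The sketch below shows the standard definitions on binary labels; the function name `patient_metrics` and the toy labels are illustrative assumptions and do not reproduce the paper's data.

```python
def patient_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # Share of true cancer cases that were detected.
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Share of healthy cases correctly left unflagged.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


y_true = [1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1]
metrics = patient_metrics(y_true, y_pred)
```

Fewer false positives raise specificity directly, which is why the reported drop in false positive rate and the specificity gain go together.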

33. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks [PDF] Back to contents
  Francesco Croce, Matthias Hein
Abstract: The field of defense strategies against adversarial attacks has grown significantly in recent years, but progress is hampered because the evaluation of adversarial defenses is often insufficient and thus gives a wrong impression of robustness. Many promising defenses have later been broken, making it difficult to identify the state of the art. Frequent pitfalls in the evaluation are improper tuning of the attacks' hyperparameters and gradient obfuscation or masking. In this paper, we first propose two extensions of the PGD attack that overcome failures due to suboptimal step size and problems of the objective function. We then combine our novel attacks with two complementary existing ones to form a parameter-free, computationally affordable, and user-independent ensemble of attacks to test adversarial robustness. We apply our ensemble to over 40 models from papers published at recent top machine learning and computer vision venues. In all except one of the cases, we achieve lower robust test accuracy than reported in these papers, often by more than 10%, identifying several broken defenses.
Summary: Two extensions of the PGD attack are combined with two complementary existing attacks into a parameter-free, affordable, user-independent ensemble for testing adversarial robustness; on over 40 models it lowers the reported robust accuracy in all but one case, often by more than 10%.
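
For readers unfamiliar with the baseline being extended, here is a minimal sketch of plain PGD under an L-infinity constraint, applied to a toy linear classifier. It uses a fixed step size, which is exactly the kind of hand-tuned parameter the abstract identifies as a pitfall; it does not implement the paper's extensions. The names `pgd_linf`, `margin`, and `loss_grad`, and all numeric values, are illustrative assumptions.

```python
def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(x0, grad_fn, eps, step, n_steps):
    """Projected gradient ascent on a loss, staying within eps of x0 (L-inf)."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad_fn(x)
        # Signed gradient step that increases the loss.
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around the clean input.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x


# Toy linear classifier: predicted label is the sign of w . x.
w = [2.0, -1.0]
y = 1  # true label in {-1, +1}


def margin(x):
    return y * sum(wi * xi for wi, xi in zip(w, x))


def loss_grad(x):
    # Gradient of the loss -margin(x) with respect to x.
    return [-y * wi for wi in w]


x0 = [0.5, 0.5]
x_adv = pgd_linf(x0, loss_grad, eps=0.25, step=0.1, n_steps=5)
```

The clean input has a positive margin, while the perturbed input has a negative one, so the toy classifier is fooled inside the allowed perturbation budget.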

34. Compact Surjective Encoding Autoencoder for Unsupervised Novelty Detection [PDF] Back to contents
  Jaewoo Park, Yoon Gyo Jung, Andrew Beng Jin Teoh
Abstract: In unsupervised novelty detection, a model is trained solely on in-class data and, at inference, singles out out-class data. Autoencoder (AE) variants aim to compactly model the in-class data so as to reconstruct it exclusively, differentiating it from out-class data by the reconstruction error. However, this remains challenging for high-dimensional image data in the fully unsupervised setting. For this, we propose Compact Surjective Encoding AE (CSE-AE). In this framework, the encoding of any input is constrained into a compact manifold by exploiting the deep neural net's ignorance of the unknown. Concurrently, the in-class data is surjectively encoded to the compact manifold via AE. The mechanism is realized by both a GAN and its ensembled discriminative layers, and results in reconstructing the in-class data exclusively. At inference, the reconstruction error of a query is measured using high-level semantics captured by the discriminator. Extensive experiments on image data show that the proposed model gives state-of-the-art performance.
Summary: CSE-AE constrains the encoding of any input to a compact manifold and surjectively encodes in-class data onto it, using a GAN and its ensembled discriminative layers, so that only in-class data is reconstructed; the reconstruction error measured with discriminator semantics gives state-of-the-art novelty detection.

35. BUSU-Net: An Ensemble U-Net Framework for Medical Image Segmentation [PDF] Back to contents
  Wei Hao Khoong
Abstract: In recent years, convolutional neural networks (CNNs) have revolutionized medical image analysis. One of the most well-known CNN architectures in semantic segmentation is the U-net, which has achieved much success in several medical image segmentation applications. More recently, with the rise of autoML and advancements in neural architecture search (NAS), methods like NAS-Unet have been proposed for NAS in medical image segmentation. In this paper, with inspiration from LadderNet, U-Net, autoML, and NAS, we propose an ensemble deep neural network with an underlying U-Net framework consisting of bi-directional convolutional LSTMs and dense connections, where the first (from left) U-Net-like network is deeper than the second (from left). We show that this ensemble network outperforms recent state-of-the-art networks in several evaluation metrics. We also evaluate a lightweight version of this ensemble network, which likewise outperforms recent state-of-the-art networks in some evaluation metrics.
Summary: BUSU-Net is an ensemble of two U-Net-like networks with bi-directional convolutional LSTMs and dense connections, the first deeper than the second; it outperforms recent state-of-the-art networks on several metrics, and a lightweight version does so on some metrics.

36. XGPT: Cross-modal Generative Pre-Training for Image Captioning [PDF] 返回目录
  Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Ming Zhou
Abstract: While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.
摘要:虽然许多基于BERT,跨模式预先训练模型产生像图像,文本检索和VQA下游理解任务优异的成绩,他们不能直接应用于生成任务。在本文中,我们提出XGPT,跨模态剖成前的训练中的图像字幕的新方法,旨在通过三种新的生成任务,包括图像的空调蒙面语言模型预列车文字到图片标题发生器( IMLM),图像空调去噪Autoencoding(IDA),和文本条件的图像特征生成(TIFG)。其结果是,预先训练XGPT可以进行微调而没有任何特定任务的架构修改以用于图像字幕创建状态的最先进的模型。实验表明,XGPT获得国家的最先进的新的基准数据集,包括COCO字幕和字幕Flickr30k结果。我们还使用XGPT产生新的图片说明为图像检索任务数据增强,实现对所有召回的指标显著改善。

37. Curriculum By Texture [PDF] 返回目录
  Samarth Sinha, Animesh Garg, Hugo Larochelle
Abstract: Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification and segmentation. One factor in the success of CNNs is that they have an inductive bias that assumes a certain type of spatial structure is present in the data. Recent work by Geirhos et al. (2018) shows how learning in CNNs causes the learned CNN models to be biased towards high-frequency textural information, compared to low-frequency shape information in images. Many tasks generally require both shape and textural information. Hence, we propose a simple curriculum-based scheme which makes CNNs less biased towards textural information while still allowing them to represent both shape and textural information. We propose to augment the training of CNNs by controlling the amount of textural information that is available to the CNNs during the training process, by convolving the output of a CNN layer with a low-pass filter, or simply a Gaussian kernel. By reducing the standard deviation of the Gaussian kernel, we are able to gradually increase the amount of textural information available as training progresses, and hence reduce the texture bias. Such an augmented training scheme significantly improves the performance of CNNs on various image classification tasks, while adding no additional trainable parameters or auxiliary regularization objectives. We also observe significant improvements when using the trained CNNs to perform transfer learning on a different dataset and when transferring to a different task, which shows that CNNs trained with the proposed method act as better feature extractors.
摘要:卷积神经网络(细胞神经网络)显示在计算机视觉任务的骄人业绩,如图像分类和分割。对于细胞神经网络的成功的一个因素是,它们具有呈现某种类型的空间结构的存在于所述数据的感应偏压。通过Geirhos等人最近的工作。 (2018)示出了如何在细胞神经网络学习使了解到CNN模型朝高频纹理信息被偏压,而在图像的低频形状信息。许多任务通常需要形状和纹理信息。因此,我们提出一种改善细胞神经网络的朝向纹理信息被较少偏置的能力,并且在同一时间,能够表示二者的形状和纹理信息的简单课程基于方案。我们提出通过控制在训练过程中是可用的细胞神经网络的纹理信息的量,通过用低通滤波器,或简称为高斯核进行卷积CNN的层的输出,以增加细胞神经网络的训练。通过降低高斯核的标准偏差,我们能够逐步提高可作为培训进展纹理信息的数量,因此降低了质地偏差。这样的增强的训练方案显著提高细胞神经网络的各种图像分类任务的性能,同时加入没有额外的可训练的参数或辅助正规化目标。用训练细胞神经网络时,在不同的数据集进行迁移学习,并转移到一个不同的任务,显示了使用该方法的行为更好的特征提取的是如何学会细胞神经网络我们也观察到显著的改善。
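
The curriculum described in this abstract can be illustrated with a minimal sketch: a 2-D array standing in for one CNN feature map is blurred with a Gaussian kernel whose standard deviation shrinks over training, so more high-frequency (textural) detail passes through in later epochs. The function names, the exponential decay schedule, and its constants (`sigma0`, `decay`) are assumptions made for illustration, not the authors' settings, and plain Python lists are used only to keep the example self-contained.

```python
import math


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1-D Gaussian kernel; radius defaults to ceil(3 * sigma)."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([
            sum(kernel[k + radius] * row[min(max(j + k, 0), n - 1)]
                for k in range(-radius, radius + 1))
            for j in range(n)
        ])
    return out


def transpose(image):
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable Gaussian low-pass filter: blur rows, then columns."""
    kernel = gaussian_kernel_1d(sigma)
    blurred = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(blurred), kernel))


def sigma_schedule(epoch, sigma0=2.0, decay=0.8):
    """Assumed schedule: the standard deviation decays exponentially per epoch."""
    return sigma0 * (decay ** epoch)


def variance(image):
    """Population variance of all values, used here as a crude detail measure."""
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

Applying `gaussian_blur(feature_map, sigma_schedule(epoch))` at each epoch smooths away fine detail early on and leaves the map almost unchanged once the standard deviation has become small. No trainable parameters are involved, which matches the abstract's remark that the scheme adds none.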

38. Shape analysis via inconsistent surface registration [PDF] 返回目录
  Gary P. T. Choi, Di Qiu, Lok Ming Lui
Abstract: In this work, we develop a framework for shape analysis using inconsistent surface mapping. Traditional landmark-based geometric morphometrics methods suffer from the limited degrees of freedom, while most of the more advanced non-rigid surface mapping methods rely on a strong assumption of the global consistency of two surfaces. From a practical point of view, given two anatomical surfaces with prominent feature landmarks, it is more desirable to have a method that automatically detects the most relevant parts of the two surfaces and finds the optimal landmark-matching alignment between those parts, without assuming any global 1-1 correspondence between the two surfaces. Our method is capable of solving this problem using inconsistent surface registration based on quasi-conformal theory. It further enables us to quantify the dissimilarity of two shapes using quasi-conformal distortion and differences in mean and Gaussian curvatures, thereby providing a natural way for shape classification. Experiments on Platyrrhine molars demonstrate the effectiveness of our method and shed light on the interplay between function and shape in nature.
摘要:在这项工作中,我们开发使用不一致的表面映射形状分析的框架。传统的基于地标的几何形态测量方法从有限的自由度受到影响,而大多数的更先进的非刚性表面映射方法依赖于两个表面的全局一致性的一个很强的假设。从给定的与突出的特点的地标两个解剖表面的实际角度来看,更希望有一种方法,该方法自动检测两个表面的最相关的部分,并发现这些部分之间的最佳的地标匹配对准,而不承担任何两个表面之间的全局1-1对应。我们的方法是能够利用基于准共形理论不一致表面登记解决这一问题的。它进一步使我们使用均值和高斯曲率准共形失真和差异,从而提供形状分类以自然的方式来量化两个形状的不同之处。上Platyrrhine臼齿实验表明我们的方法的有效性和性质的功能和形状之间的相互作用线索。

39. DDU-Nets: Distributed Dense Model for 3D MRI Brain Tumor Segmentation [PDF] 返回目录
  Hanxiao Zhang, Jingxiong Li, Mali Shen, Yaqi Wang, Guang-Zhong Yang
Abstract: Segmentation of brain tumors and their subregions remains a challenging task due to their weak features and deformable shapes. In this paper, three patterns (cross-skip, skip-1 and skip-2) of distributed dense connections (DDCs) are proposed to enhance feature reuse and propagation of CNNs by constructing tunnels between key layers of the network. For better detecting and segmenting brain tumors from multi-modal 3D MR images, CNN-based models embedded with DDCs (DDU-Nets) are trained efficiently from pixel to pixel with a limited number of parameters. Postprocessing is then applied to refine the segmentation results by reducing the false-positive samples. The proposed method is evaluated on the BraTS 2019 dataset with results demonstrating the effectiveness of the DDU-Nets while requiring less computational cost.
摘要:脑肿瘤和他们分区域的分割仍然是一个具有挑战性的任务,由于其薄弱的特点和可变形的。在本文中,三种模式(横跳,则跳过-1和跳过-2)提出了由网络的关键层之间的隧道构造,以增强细胞神经网络的特征重用和分布式传播密连接(DDC的)的。对于从多模态的3D MR图像更好检测和分割脑肿瘤,CNN的嵌入用的DDC(DDU-网)从像素有效地训练像素与参数的数量有限的模式。然后后处理被施加通过减少假阳性的样品以细化分割结果。所提出的方法在评价臭小子2019数据集的结果表明所述DDU-篮网的效力,同时要求较少的计算成本。

40. Visualizing intestines for diagnostic assistance of ileus based on intestinal region segmentation from 3D CT images [PDF] 返回目录
  Hirohisa Oda, Kohei Nishio, Takayuki Kitasaka, Hizuru Amano, Aitaro Takimoto, Hiroo Uchida, Kojiro Suzuki, Hayato Itoh, Masahiro Oda, Kensaku Mori
Abstract: This paper presents a method for visualizing intestine (small and large intestine) regions and their stenosed parts caused by ileus from CT volumes. Since it is difficult for non-expert clinicians to find stenosed parts, the intestine and its stenosed parts should be visualized intuitively. Furthermore, the intestine regions of ileus cases are quite hard to segment. The proposed method segments intestine regions with a 3D FCN (3D U-Net). Intestine regions are quite difficult to segment in ileus cases since the inside of the intestine is filled with fluids. These fluids have intensities similar to those of the intestinal wall on 3D CT volumes. We segment the intestine regions by using a 3D U-Net trained with a weak annotation approach. Weak annotation makes it possible to train the 3D U-Net with small manually-traced label images of the intestine. This spares us from preparing many annotation labels for the intestine, which has a long and winding shape. Each intestine segment is volume-rendered and colored based on the distance from its endpoint. Stenosed parts (disjoint points of an intestine segment) can be easily identified in such a visualization. In the experiments, we showed that stenosed parts were intuitively visualized as endpoints of segmented regions, which are colored red or blue.
摘要:本文呈现肠的可视化方法(小肠和大肠)的区域和它们的狭窄部分从CT体积引起肠梗阻。由于难以对非临床专家发现狭窄部位,小肠及其狭窄部位应直观可视化。此外,肠梗阻病例肠道地区是相当难以分割。所提出的方法的段肠通过3D FCN(3D U形净)区域。肠道地区是相当困难的情况下,肠梗阻被分割,因为肠道内充满液体。这些流体具有与三维CT体积肠壁类似的强度。我们段通过使用3D掌中肠道地区的培训由弱注解方法。弱注释使得可能与肠的小手动追踪标签图像训练3D掌中。这样就避免了我们准备已漫长而曲折的形状肠道的许多注释标签。各肠段是体积渲染并且基于来自其端点在体绘制的距离着色。狭窄部分(肠段的不相交的点)可以容易地识别这样的可视化。在实验中,我们发现,狭窄的部分被直观地显现为分割区域,其中用红色或蓝色的彩色的端点。

41. A Deep learning Approach to Generate Contrast-Enhanced Computerised Tomography Angiography without the Use of Intravenous Contrast Agents [PDF] 返回目录
  Anirudh Chandrashekar, Ashok Handa, Natesh Shivakumar, Pierfrancesco Lapolla, Vicente Grau, Regent Lee
Abstract: Contrast-enhanced computed tomography angiograms (CTAs) are widely used in cardiovascular imaging to obtain a non-invasive view of arterial structures. However, contrast agents are associated with complications at the injection site as well as renal toxicity leading to contrast-induced nephropathy (CIN) and renal failure. We hypothesised that the raw data acquired from a non-contrast CT contains sufficient information to differentiate blood and other soft tissue components. We utilised deep learning methods to define the subtleties between soft tissue components in order to simulate contrast enhanced CTAs without contrast agents. Twenty-six patients with paired non-contrast and CTA images were randomly selected from an approved clinical study. Non-contrast axial slices within the AAA from 10 patients (n = 100) were sampled for the underlying Hounsfield unit (HU) distribution at the lumen, intra-luminal thrombus and interface locations. Sampling of HUs in these regions revealed significant differences between all regions (p<0.001 for all comparisons), confirming the intrinsic differences in radiomic signatures between these regions. To generate a large training dataset, paired axial slices from the training set (n = 13) were augmented to produce a total of 23,551 2-D images. We trained a 2-D Cycle Generative Adversarial Network (cycleGAN) for this non-contrast to contrast (NC2C) transformation task. The accuracy of the cycleGAN output was assessed by comparison to the contrast image. This pipeline is able to differentiate between visually incoherent soft tissue regions in non-contrast CT images. The CTAs generated from the non-contrast images bear strong resemblance to the ground truth. Here we describe a novel application of Generative Adversarial Network to CT image processing. This is poised to disrupt clinical pathways requiring contrast enhanced CT imaging.
摘要:对比增强计算机断层造影(的CTA)被广泛用于心血管成像以获得动脉结构的非侵入性的图。然而,造影剂与注射部位的并发症,以及肾毒性导致造影剂肾病(CIN)和肾功能衰竭有关。我们假设从非造影CT采集的原始数据包含足够的信息来区分血液和其它软组织的部件。我们利用深学习方法,以便限定软组织组分之间的细微之处,以模拟造影剂增强的CTA没有造影剂。 26例有配对的非对比度和CTA图像,随机从批准的临床研究选择。非造影从10名患者(n = 100)中的AAA内轴向切片取样用于在该腔,腔内血栓和接口位置底层亨氏单位(HU)的分布。 HUS在这些区域采样揭示所有区域(P <0.001所有的比较)之间显著差异,确认在这些区域之间的radiomic签名的固有差异。产生大的训练数据集,从所述训练集(n = 13)配对的轴向切片扩充以产生总的23551 2-d的图像。我们培养了2-d周期剖成对抗性网络(cyclegan)此非对比对比度(nc2c)改造任务。所述cyclegan输出的精确度是通过比较所述对比图像进行评估。这条管线能够以非造影ct图像在视觉上不连贯的软组织区域之间进行区分。从非对比度的图像生成cta所承受非常相似的地面实况。这里,我们描述剖成对抗性网络在ct图像处理的新的应用。这是准备以破坏需要造影增强的ct成像临床路径。

42. RandomNet: Towards Fully Automatic Neural Architecture Design for Multimodal Learning [PDF] 返回目录
  Stefano Alletto, Shenyang Huang, Vincent Francois-Lavet, Yohei Nakata, Guillaume Rabusseau
Abstract: Almost all neural architecture search methods are evaluated in terms of the performance (i.e. test accuracy) of the model structures that they find. Should this be the only metric for a good autoML approach? To examine aspects beyond performance, we propose a set of criteria aimed at evaluating the core of the autoML problem: the amount of human intervention required to deploy these methods in real-world scenarios. Based on our proposed evaluation checklist, we study the effectiveness of a random search strategy for fully automated multimodal neural architecture search. Compared to traditional methods that rely on manually crafted feature extractors, our method selects each modality from a large search space with minimal human supervision. We show that our proposed random search strategy performs close to the state of the art on the AV-MNIST dataset while meeting the desirable characteristics for a fully automated design process.
摘要:几乎所有的神经结构的搜索方法的模型结构,性能方面(即测试精度)进行评估,它发现。它应该是一个很好的办法autoML的唯一指标?为了考察超越性能方面,我们提出了一系列旨在评估autoML问题的核心标准:人工干预的量所需部署这些方法融入现实世界的场景。根据我们提出的评估清单中,我们研究了完全自动化的多模态神经结构搜索随机搜索策略的有效性。相比于依靠手工制作的特征提取,我们的方法选择以最少的人力监督较大的空间搜索每一种模式的传统方法。我们证明了我们所提出的随机搜索策略进行接近的AV-MNIST数据集中的艺术,同时满足一个完全自动化的设计过程中所需特性的状态。
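
As a rough illustration of what a random search strategy looks like in general, the sketch below samples configurations uniformly from a search space and keeps the best-scoring one. Everything here is hypothetical: the search space entries, the `toy_score` stand-in for validation accuracy, and the function names are assumptions for illustration and are not taken from RandomNet. In practice each sampled configuration would be trained and validated.

```python
import random

# Hypothetical search space for a two-modality (image + audio) model.
SEARCH_SPACE = {
    "image_layers": [1, 2, 3],
    "audio_layers": [1, 2, 3],
    "fusion": ["concat", "sum"],
    "hidden_units": [32, 64, 128],
}


def sample_config(rng):
    """Draw one option uniformly at random for every design choice."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def toy_score(config):
    """Placeholder for validation accuracy; NOT a real model evaluation."""
    balance_penalty = 0.05 * abs(config["image_layers"] - config["audio_layers"])
    return config["hidden_units"] / 128 - balance_penalty


def random_search(evaluate, trials, seed=0):
    """Evaluate `trials` random configurations and return the best one found."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Calling `random_search(toy_score, trials=50)` returns the best sampled configuration and its score. The loop needs no hand-tuned search heuristics, which is the kind of low human intervention the abstract's criteria are concerned with.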

注:中文为机器翻译结果!