Contents
8. Interpreting Medical Image Classifiers by Optimization Based Counterfactual Impact Analysis [PDF] Abstract
9. HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation from a Single Depth Map [PDF] Abstract
10. DFNet: Discriminative feature extraction and integration network for salient object detection [PDF] Abstract
13. Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges [PDF] Abstract
16. Cell Segmentation and Tracking using Distance Transform Predictions and Movement Estimation with Graph-Based Matching [PDF] Abstract
22. LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention [PDF] Abstract
32. Generative PointNet: Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification [PDF] Abstract
33. Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera [PDF] Abstract
43. Retinopathy of Prematurity Stage Diagnosis Using Object Segmentation and Convolutional Neural Networks [PDF] Abstract
44. Crossover-Net: Leveraging the Vertical-Horizontal Crossover Relation for Robust Segmentation [PDF] Abstract
45. Characterization of Multiple 3D LiDARs for Localization and Mapping using Normal Distributions Transform [PDF] Abstract
47. Extraction and Assessment of Naturalistic Human Driving Trajectories from Infrastructure Camera and Radar Sensors [PDF] Abstract
49. Introducing Anisotropic Minkowski Functionals and Quantitative Anisotropy Measures for Local Structure Analysis in Biomedical Imaging [PDF] Abstract
50. Detection of Coronavirus (COVID-19) Associated Pneumonia based on Generative Adversarial Networks and a Fine-Tuned Deep Transfer Learning Model using Chest X-ray Dataset [PDF] Abstract
Abstracts
1. Near-chip Dynamic Vision Filtering for Low-Bandwidth Pedestrian Detection [PDF] Back to Contents
Anthony Bisulco, Fernando Cladera Ojeda, Volkan Isler, Daniel D. Lee
Abstract: This paper presents a novel end-to-end system for pedestrian detection using Dynamic Vision Sensors (DVSs). We target applications where multiple sensors transmit data to a local processing unit, which executes a detection algorithm. Our system is composed of (i) a near-chip event filter that compresses and denoises the event stream from the DVS, and (ii) a Binary Neural Network (BNN) detection module that runs on a low-computation edge computing device (in our case an STM32F4 microcontroller). We present the system architecture and provide an end-to-end implementation for pedestrian detection in an office environment. Our implementation reduces transmission size by up to 99.6% compared to transmitting the raw event stream. The average packet size in our system is only 1397 bits, while 307.2 kb are required to send an uncompressed DVS time window. Our detector is able to perform a detection every 450 ms, with an overall testing F1 score of 83%. The low bandwidth and energy properties of our system make it ideal for IoT applications.
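The reported compression figures are easy to check with a quick back-of-the-envelope calculation using only the numbers quoted above (plain Python, no assumptions beyond the stated packet and window sizes):

```python
# Sanity check of the reported bandwidth reduction: an uncompressed DVS time
# window is 307.2 kb (307200 bits), the average filtered packet is 1397 bits.
raw_bits = 307.2e3          # uncompressed DVS time window, in bits
packet_bits = 1397          # average packet size after near-chip filtering

reduction = 1.0 - packet_bits / raw_bits
print(f"average transmission size reduced by {reduction:.1%}")
# -> 99.5% on average, consistent with the "up to 99.6%" figure above
```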
2. S2DNet: Learning Accurate Correspondences for Sparse-to-Dense Feature Matching [PDF] Back to Contents
Hugo Germain, Guillaume Bourmaud, Vincent Lepetit
Abstract: Establishing robust and accurate correspondences is a fundamental backbone to many computer vision algorithms. While recent learning-based feature matching methods have shown promising results in providing robust correspondences under challenging conditions, they are often limited in terms of precision. In this paper, we introduce S2DNet, a novel feature matching pipeline, designed and trained to efficiently establish both robust and accurate correspondences. By leveraging a sparse-to-dense matching paradigm, we cast the correspondence learning problem as a supervised classification task to learn to output highly peaked correspondence maps. We show that S2DNet achieves state-of-the-art results on the HPatches benchmark, as well as on several long-term visual localization datasets.
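As a rough illustration of the sparse-to-dense paradigm, the toy sketch below (an assumption-laden simplification, not the S2DNet architecture) correlates one sparse keypoint descriptor from image A against a dense feature map of image B and applies a softmax, producing the kind of correspondence map that a classification loss can push to be highly peaked:

```python
import numpy as np

def correspondence_map(query_desc, dense_feats):
    """Correlate one sparse descriptor from image A (shape (C,)) against the
    dense feature map of image B (shape (C, H, W)); the softmax yields a
    correspondence map that supervised classification can sharpen."""
    C, H, W = dense_feats.shape
    scores = dense_feats.reshape(C, -1).T @ query_desc   # (H*W,) similarities
    scores -= scores.max()                               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs.reshape(H, W)

rng = np.random.default_rng(0)
heat = correspondence_map(rng.normal(size=128), rng.normal(size=(128, 40, 40)))
print(heat.shape, heat.sum())   # (40, 40), 1.0
```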
3. Intrinsic Point Cloud Interpolation via Dual Latent Space Navigation [PDF] Back to Contents
Marie-Julie Rakotosaona, Maks Ovsjanikov
Abstract: We present a learning-based method for interpolating and manipulating 3D shapes represented as point clouds, that is explicitly designed to preserve intrinsic shape properties. Our approach is based on constructing a dual encoding space that enables shape synthesis and, at the same time, provides links to the intrinsic shape information, which is typically not available on point cloud data. Our method works in a single pass and avoids expensive optimization, employed by existing techniques. Furthermore, the strong regularization provided by our dual latent space approach also helps to improve shape recovery in challenging settings from noisy point clouds across different datasets. Extensive experiments show that our method results in more realistic and smoother interpolations compared to baselines.
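A minimal sketch of the single-pass, optimization-free interpolation flavour described above; the `encoder` and `decoder` stand-ins are hypothetical toys, and the real dual-latent-space model is considerably richer:

```python
import numpy as np

def interpolate_shapes(encoder, decoder, cloud_a, cloud_b, t):
    """Single-pass interpolation: encode both point clouds, blend the latent
    codes, and decode; no per-pair optimization is needed at test time."""
    z = (1.0 - t) * encoder(cloud_a) + t * encoder(cloud_b)
    return decoder(z)

# Toy stand-ins for the trained network halves (hypothetical, illustration only).
encoder = lambda pts: pts.mean(axis=0)
decoder = lambda z: z + np.zeros((1024, 3))
mid = interpolate_shapes(encoder, decoder,
                         np.random.rand(1024, 3), np.random.rand(1024, 3), t=0.5)
print(mid.shape)  # (1024, 3)
```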
4. PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation [PDF] Back to Contents
Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, Jiaya Jia
Abstract: Instance segmentation is an important task for scene understanding. Compared to the fully-developed 2D domain, 3D instance segmentation for point clouds has much room to improve. In this paper, we present PointGroup, a new end-to-end bottom-up architecture, specifically focused on better grouping the points by exploring the void space between objects. We design a two-branch network to extract point features and predict semantic labels and offsets, for shifting each point towards its respective instance centroid. A clustering component follows, utilizing both the original and offset-shifted point coordinate sets and taking advantage of their complementary strength. Further, we formulate ScoreNet to evaluate the candidate instances, followed by Non-Maximum Suppression (NMS) to remove duplicates. We conduct extensive experiments on two challenging datasets, ScanNet v2 and S3DIS, on which our method achieves the highest performance, 63.6% and 64.0%, compared to the 54.9% and 54.4% achieved by the former best solutions, in terms of mAP with IoU threshold 0.5.
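The dual-set grouping idea can be sketched as follows. DBSCAN is used here purely as a stand-in for PointGroup's own clustering procedure, and the per-class loop, `eps`, and `min_samples` values are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dual_set_proposals(coords, offsets, labels, eps=0.05):
    """Sketch of dual-set grouping: cluster points once on their original
    coordinates and once on the offset-shifted coordinates that point towards
    the predicted instance centroids, keeping both sets of proposals (the
    paper then scores proposals with ScoreNet and deduplicates with NMS)."""
    proposals = []
    for pts in (coords, coords + offsets):        # the two coordinate sets
        for cls in np.unique(labels):             # group within one semantic class
            mask = labels == cls
            ids = DBSCAN(eps=eps, min_samples=10).fit_predict(pts[mask])
            for k in set(ids) - {-1}:             # -1 marks noise points
                proposals.append(np.flatnonzero(mask)[ids == k])
    return proposals
```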
5. Deep Learning based detection of Acute Aortic Syndrome in contrast CT images [PDF] Back to Contents
Manikanta Srikar Yellapragada, Yiting Xie, Benedikt Graf, David Richmond, Arun Krishnan, Arkadiusz Sitek
Abstract: Acute aortic syndrome (AAS) is a group of life-threatening conditions of the aorta. We have developed an end-to-end automatic approach to detect AAS in computed tomography (CT) images. Our approach consists of two steps. First, we extract N cross sections along the segmented aorta centerline for each CT scan. These cross sections are stacked together to form a new volume, which is then classified using two different classifiers: a 3D convolutional neural network (3D CNN) and multiple instance learning (MIL). We trained, validated, and compared the two models on 2291 contrast CT volumes. We tested on a set-aside cohort of 230 normal and 50 positive CT volumes. Our models detected AAS with an area under the receiver operating characteristic curve (AUC) of 0.965 and 0.985 using the 3D CNN and MIL, respectively.
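One simple instantiation of the MIL view of the stacked cross sections, assuming per-slice suspicion scores and max pooling (the paper's exact MIL formulation may differ):

```python
import numpy as np

def mil_volume_score(slice_scores):
    """Multiple-instance-learning view: each of the N aortic cross sections is
    an instance, and the whole volume is scored by its most suspicious slice."""
    return float(np.max(slice_scores))

print(mil_volume_score(np.array([0.02, 0.10, 0.93, 0.40])))  # -> 0.93
```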
6. Quantifying Data Augmentation for LiDAR based 3D Object Detection [PDF] Back to Contents
Martin Hahner, Dengxin Dai, Alexander Liniger, Luc Van Gool
Abstract: In this work, we shed light on different data augmentation techniques commonly used in Light Detection and Ranging (LiDAR) based 3D Object Detection. We, therefore, utilize a state of the art voxel-based 3D Object Detection pipeline called PointPillars and carry out our experiments on the well established KITTI dataset. We investigate a variety of global and local augmentation techniques, where global augmentation techniques are applied to the entire point cloud of a scene and local augmentation techniques are only applied to points belonging to individual objects in the scene. Our findings show that both types of data augmentation can lead to performance increases, but it also turns out, that some augmentation techniques, such as individual object translation, for example, can be counterproductive and can hurt overall performance. We show that when we apply our findings to the data augmentation policy of PointPillars we can easily increase its performance by up to 2%. In order to provide reproducibility, our code will be publicly available at this http URL.
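The global/local distinction is easy to make concrete. The sketch below uses two assumed transformations (not the exact policy from the paper): a global rotation applied to the whole scene versus a translation applied only to one object's points, the kind of per-object operation the authors find can hurt performance:

```python
import numpy as np

def global_rotation(points, angle):
    """Global augmentation: rotate the entire scene point cloud (N, 3) about z."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T

def local_translation(points, box_mask, shift):
    """Local augmentation: translate only the points belonging to one object."""
    out = points.copy()
    out[box_mask] += shift
    return out
```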
7. Deep Learning for Image Search and Retrieval in Large Remote Sensing Archives [PDF] Back to Contents
Gencer Sumbul, Jian Kang, Begüm Demir
Abstract: This chapter presents recent advances in content based image search and retrieval (CBIR) systems in remote sensing (RS) for fast and accurate information discovery from massive data archives. Initially, we analyze the limitations of the traditional CBIR systems that rely on the hand-crafted RS image descriptors applied to exhaustive search and retrieval problems. Then, we focus our attention on the advances in RS CBIR systems for which the deep learning (DL) models are at the forefront. In particular, we present the theoretical properties of the most recent DL based CBIR systems for the characterization of the complex semantic content of RS images. After discussing their strengths and limitations, we present the deep hashing based CBIR systems that have high time-efficient search capability within huge data archives. Finally, the most promising research directions in RS CBIR are discussed.
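Why deep hashing gives time-efficient search within huge archives is worth making concrete: once images are encoded as short binary codes, retrieval reduces to a Hamming-distance ranking. A minimal numpy sketch on synthetic codes (not a trained hash network):

```python
import numpy as np

def hamming_search(query_code, db_codes, top_k=5):
    """Rank database binary codes by Hamming distance to the query code
    (optimized systems use XOR + popcount; plain numpy shown here)."""
    dists = (db_codes != query_code).sum(axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(100000, 64), dtype=np.uint8)
print(hamming_search(db[42], db))   # index 42 ranks first (distance 0)
```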
8. Interpreting Medical Image Classifiers by Optimization Based Counterfactual Impact Analysis [PDF] Back to Contents
David Major, Dimitrios Lenis, Maria Wimmer, Gert Sluiter, Astrid Berg, Katja Bühler
Abstract: Clinical applicability of automated decision support systems depends on a robust, well-understood classification interpretation. Artificial neural networks while achieving class-leading scores fall short in this regard. Therefore, numerous approaches have been proposed that map a salient region of an image to a diagnostic classification. Utilizing heuristic methodology, like blurring and noise, they tend to produce diffuse, sometimes misleading results, hindering their general adoption. In this work we overcome these issues by presenting a model agnostic saliency mapping framework tailored to medical imaging. We replace heuristic techniques with a strong neighborhood conditioned inpainting approach, which avoids anatomically implausible artefacts. We formulate saliency attribution as a map-quality optimization task, enforcing constrained and focused attributions. Experiments on public mammography data show quantitatively and qualitatively more precise localization and clearer conveying results than existing state-of-the-art methods.
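A hedged sketch of the optimization-based idea, assuming a differentiable `classifier` that returns a scalar score and a precomputed anatomically plausible `inpainted` image; the loss terms and hyperparameters below are illustrative assumptions, not the paper's exact objective:

```python
import torch

def saliency_by_impact(classifier, image, inpainted, steps=200, lam=0.05):
    """Optimize a mask so that replacing the masked region with a plausible
    inpainting maximally lowers the classifier score, while an L1 penalty
    keeps the attribution constrained and focused."""
    m = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([m], lr=0.1)
    for _ in range(steps):
        mask = torch.sigmoid(m)
        counterfactual = (1 - mask) * image + mask * inpainted
        loss = classifier(counterfactual) + lam * mask.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(m).detach()
```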
9. HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation from a Single Depth Map [PDF] Back to Contents
Jameel Malik, Ibrahim Abdelaziz, Ahmed Elhayek, Soshi Shimada, Sk Aziz Ali, Vladislav Golyanik, Christian Theobalt, Didier Stricker
Abstract: 3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. The state-of-the-art methods directly regress 3D hand meshes from 2D depth images via 2D convolutional neural networks, which leads to artefacts in the estimations due to perspective distortions in the images. In contrast, we propose a novel architecture with 3D convolutions trained in a weakly-supervised manner. The input to our method is a 3D voxelized depth map, and we rely on two hand shape representations. The first one is the 3D voxelized grid of the shape which is accurate but does not preserve the mesh topology and the number of mesh vertices. The second representation is the 3D hand surface which is less accurate but does not suffer from the limitations of the first representation. We combine the advantages of these two representations by registering the hand surface to the voxelized hand shape. In the extensive experiments, the proposed approach improves over the state of the art by 47.8% on the SynHand5M dataset. Moreover, our augmentation policy for voxelized depth maps further enhances the accuracy of 3D hand pose estimation on real data. Our method produces visually more reasonable and realistic hand shapes on NYU and BigHand2.2M datasets compared to the existing approaches.
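Turning a depth map into the voxelized input can be sketched as below, assuming the depth map has already been back-projected into a 3D point set; the grid size of 64 is an assumption:

```python
import numpy as np

def voxelize(points, grid=64):
    """Convert back-projected depth points (N, 3) into the binary voxel
    occupancy grid that the 3D convolutions consume."""
    pts = points - points.min(axis=0)            # shift into positive octant
    pts = pts / (pts.max() + 1e-8) * (grid - 1)  # scale into the grid
    vox = np.zeros((grid, grid, grid), dtype=np.float32)
    i, j, k = np.round(pts).astype(int).T
    vox[i, j, k] = 1.0
    return vox

print(voxelize(np.random.rand(2000, 3)).sum())   # number of occupied voxels
```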
10. DFNet: Discriminative feature extraction and integration network for salient object detection [PDF] Back to Contents
Mehrdad Noori, Sina Mohammadi, Sina Ghofrani Majelan, Ali Bahri, Mohammad Havaei
Abstract: Despite the powerful feature extraction capability of Convolutional Neural Networks, there are still some challenges in saliency detection. In this paper, we focus on two aspects of challenges: i) Since salient objects appear in various sizes, using single-scale convolution would not capture the right size. Moreover, using multi-scale convolutions without considering their importance may confuse the model. ii) Employing multi-level features helps the model use both local and global context. However, treating all features equally results in information redundancy. Therefore, there needs to be a mechanism to intelligently select which features in different levels are useful. To address the first challenge, we propose a Multi-scale Attention Guided Module. This module not only extracts multi-scale features effectively but also gives more attention to more discriminative feature maps corresponding to the scale of the salient object. To address the second challenge, we propose an Attention-based Multi-level Integrator Module to give the model the ability to assign different weights to multi-level feature maps. Furthermore, our Sharpening Loss function guides our network to output saliency maps with higher certainty and less blurry salient objects, and it has far better performance than the Cross-entropy loss. For the first time, we adopt four different backbones to show the generalization of our method. Experiments on five challenging datasets prove that our method achieves the state-of-the-art performance. Our approach is fast as well and can run at a real-time speed.
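The attention-based integration of multi-level features might be sketched, in a much-reduced form, as a learned weighting over levels; the paper's module is richer than this one-scalar-per-level toy:

```python
import torch
import torch.nn as nn

class MultiLevelIntegrator(nn.Module):
    """Learn one weight per feature level so the network can emphasise useful
    levels instead of treating them all equally (a deliberate simplification
    of the Attention-based Multi-level Integrator Module)."""
    def __init__(self, num_levels):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, feats):                  # list of (B, C, H, W), same shape
        w = torch.softmax(self.logits, dim=0)
        return sum(wi * f for wi, f in zip(w, feats))

fused = MultiLevelIntegrator(3)([torch.rand(1, 8, 16, 16) for _ in range(3)])
print(fused.shape)   # torch.Size([1, 8, 16, 16])
```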
11. Sparse Concept Coded Tetrolet Transform for Unconstrained Odia Character Recognition [PDF] Back to Contents
Kalyan S Dash, N B Puhan, G Panda
Abstract: Feature representation in the form of spatio-spectral decomposition is one of the robust techniques adopted in automatic handwritten character recognition systems. In this regard, we propose a new image representation approach for unconstrained handwritten alphanumeric characters using sparse concept coded Tetrolets. Tetrolets, which do not use fixed dyadic square blocks for spectral decomposition like conventional wavelets, preserve the localized variations in handwriting by adopting tetrominoes that capture the shape geometry. The sparse concept coding of the low-entropy Tetrolet representation is found to extract the important hidden information (concept) for superior pattern discrimination. Large-scale experimentation using ten databases in six different scripts (Bangla, Devanagari, Odia, English, Arabic and Telugu) has been performed. The proposed feature representation, along with standard classifiers such as random forest, support vector machine (SVM), nearest neighbor and the modified quadratic discriminant function (MQDF), is found to achieve state-of-the-art recognition performance on all the databases, viz. 99.40% (MNIST); 98.72% and 93.24% (IITBBS); 99.38% and 99.22% (ISI Kolkata). The proposed OCR system is shown to perform better than other sparsity-based techniques such as PCA, SparsePCA and SparseLDA, as well as better than existing transforms (Wavelet, Slantlet and Stockwell).
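For context, the sparse baselines the paper outperforms can be exercised with sklearn. The snippet below uses sklearn's small digits set as a stand-in for the Odia/MNIST data, so it illustrates the comparison pipeline only, not the reported numbers:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, SparsePCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Reduced features -> SVM classifier, for two of the baselines named above.
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for reducer in (PCA(n_components=30), SparsePCA(n_components=30, random_state=0)):
    clf = SVC().fit(reducer.fit_transform(X_tr), y_tr)
    print(type(reducer).__name__, clf.score(reducer.transform(X_te), y_te))
```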
12. Context Prior for Scene Segmentation [PDF] Back to Contents
Changqian Yu, Jingbo Wang, Changxin Gao, Gang Yu, Chunhua Shen, Nong Sang
Abstract: Recent works have widely explored the contextual dependencies to achieve more accurate segmentation results. However, most approaches rarely distinguish different types of contextual dependencies, which may pollute the scene understanding. In this work, we directly supervise the feature aggregation to distinguish the intra-class and inter-class context clearly. Specifically, we develop a Context Prior with the supervision of the Affinity Loss. Given an input image and corresponding ground truth, Affinity Loss constructs an ideal affinity map to supervise the learning of Context Prior. The learned Context Prior extracts the pixels belonging to the same category, while the reversed prior focuses on the pixels of different classes. Embedded into a conventional deep CNN, the proposed Context Prior Layer can selectively capture the intra-class and inter-class contextual dependencies, leading to robust feature representation. To validate the effectiveness, we design an effective Context Prior Network (CPNet). Extensive quantitative and qualitative evaluations demonstrate that the proposed model performs favorably against state-of-the-art semantic segmentation approaches. More specifically, our algorithm achieves 46.3% mIoU on ADE20K, 53.9% mIoU on PASCAL-Context, and 81.3% mIoU on Cityscapes. Code is available at this https URL.
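The ideal affinity map that the Affinity Loss supervises against can be built directly from the ground-truth labels: pixel pairs of the same class are related (intra-class context), all other pairs are not. A minimal sketch:

```python
import torch

def ideal_affinity_map(labels):
    """Entry (i, j) is 1 when pixels i and j share a ground-truth class
    (intra-class context) and 0 otherwise (inter-class context)."""
    flat = labels.flatten()
    return (flat[:, None] == flat[None, :]).float()

print(ideal_affinity_map(torch.tensor([[0, 0], [1, 2]])))   # 4x4 affinity matrix
```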
13. Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges [PDF] Back to Contents
Ylva Jansson, Tony Lindeberg
Abstract: The ability to handle large scale variations is crucial for many real-world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale-channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. We therefore present a theoretical analysis of the invariance and covariance properties of scale-channel networks and perform an experimental evaluation of the ability of different types of scale-channel networks to generalise to previously unseen scales. We identify limitations of previous approaches and propose a new type of foveated scale-channel architecture, where the scale channels process increasingly larger parts of the image with decreasing resolution. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, even when trained on single-scale training data, and give improvements in the small-sample regime.
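A minimal sketch of a scale-channel forward pass with weight sharing and max pooling over channels, as described above; the toy `net` and the scale set are assumptions:

```python
import torch
import torch.nn.functional as F

def scale_channel_forward(net, image, scales=(0.5, 1.0, 2.0)):
    """Apply the *same* network (weight sharing) to rescaled copies of the
    input and max-pool the channel outputs, which in principle gives scale
    invariance."""
    outs = [net(F.interpolate(image, scale_factor=s, mode='bilinear',
                              align_corners=False)) for s in scales]
    return torch.stack(outs).max(dim=0).values

# Toy stand-in: global-average-pooled "logits" so all channel outputs align.
net = lambda x: x.mean(dim=(2, 3))
print(scale_channel_forward(net, torch.rand(1, 3, 64, 64)).shape)   # (1, 3)
```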
14. RANSAC-Flow: generic two-stage image alignment [PDF] Back to Contents
Xi Shen, François Darmon, Alexei A. Efros, Mathieu Aubry
Abstract: This paper considers the generic problem of dense alignment between two images, whether they be two frames of a video, two widely different views of a scene, two paintings depicting similar content, etc. Whereas each such task is typically addressed with a domain-specific solution, we show that a simple unsupervised approach performs surprisingly well across a range of tasks. Our main insight is that parametric and non-parametric alignment methods have complementary strengths. We propose a two-stage process: first, a feature-based parametric coarse alignment using one or more homographies, followed by non-parametric fine pixel-wise alignment. Coarse alignment is performed using RANSAC on off-the-shelf deep features. Fine alignment is learned in an unsupervised way by a deep network which optimizes a standard structural similarity metric (SSIM) between the two images, plus cycle-consistency. Despite its simplicity, our method shows competitive results on a range of tasks and datasets, including unsupervised optical flow on KITTI, dense correspondences on HPatches, two-view geometry estimation on YFCC100M, localization on Aachen Day-Night, and, for the first time, fine alignment of artworks on the Brueghel dataset. Our code and data are available at this http URL.
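Stage one of the two-stage pipeline can be approximated with standard OpenCV tools; ORB features below stand in for the off-the-shelf deep features the paper uses, so this is a sketch of the idea rather than RANSAC-Flow itself:

```python
import cv2
import numpy as np

def coarse_align(img_a, img_b):
    """Feature-based parametric coarse alignment via a RANSAC-fitted
    homography; the learned fine, non-parametric pixel-wise alignment
    would then refine this result."""
    orb = cv2.ORB_create()
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```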
15. Two-Stream AMTnet for Action Detection [PDF] Back to Contents
Suman Saha, Gurkirt Singh, Fabio Cuzzolin
Abstract: In this paper, we propose Two-Stream AMTnet, which leverages recent advances in video-based action representation [1] and incremental action tube generation [2]. The majority of present action detectors follow a frame-based representation and a late-fusion scheme, followed by an offline action tube building step. These are sub-optimal: frame-based features barely encode temporal relations; late fusion restricts the network from learning robust spatiotemporal features; and, finally, offline action tube generation is not suitable for many real-world problems such as autonomous driving and human-robot interaction, to name a few. The key contributions of this work are: (1) combining AMTnet's 3D proposal architecture with an online action tube generation technique, which allows the model to learn the stronger temporal features needed for accurate action detection and facilitates running inference online; (2) an efficient fusion technique allowing the deep network to learn strong spatiotemporal action representations. This is achieved by augmenting the previous Action Micro-Tube (AMTnet) action detection framework in three distinct ways: (1) by adding a parallel motion stream to the original appearance one in AMTnet; (2) in opposition to state-of-the-art action detectors, which train appearance and motion streams separately and use a test-time late fusion scheme to fuse RGB and flow cues, by jointly training both streams in an end-to-end fashion and merging RGB and optical flow features at training time; (3) by introducing an online action tube generation algorithm which works at video level and in real time (when exploiting only appearance features). Two-Stream AMTnet exhibits superior action detection performance over state-of-the-art approaches on standard action detection benchmarks.
16. Cell Segmentation and Tracking using Distance Transform Predictions and Movement Estimation with Graph-Based Matching [PDF] Back to Contents
Tim Scherr, Katharina Löffler, Moritz Böhland, Ralf Mikut
Abstract: In this paper, we present the approach used for our IEEE ISBI 2020 Cell Tracking Challenge contribution (team KIT-Sch-GE). Our method consists of a segmentation step and a tracking step that includes the correction of segmentation errors (tracking by detection). For the segmentation, deep learning-based predictions of cell distance maps and novel neighbor distance maps are used as input for a watershed post-processing. Since most of the provided Cell Tracking Challenge ground truth data are 2D, a 2D convolutional neural network is trained to predict the distance maps. The tracking is based on movement estimation in combination with a matching formulated as a maximum-flow minimum-cost problem.
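The watershed post-processing on the two predicted distance maps might look roughly like this; the seed rule and both thresholds are assumptions, not the team's exact parameters:

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def distance_watershed(cell_dist, neighbor_dist, seed_thr=0.7, mask_thr=0.1):
    """Seeds come from high values of the predicted cell distance map,
    suppressed near cell borders by the neighbor distance map; the watershed
    then grows each seed within the foreground mask."""
    seeds = (cell_dist - neighbor_dist) > seed_thr
    markers, _ = ndimage.label(seeds)
    return watershed(-cell_dist, markers, mask=cell_dist > mask_thr)
```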
17. Gradient Centralization: A New Optimization Technique for Deep Neural Networks [PDF] Back to Contents
Hongwei Yong, Jianqiang Huang, Xiansheng Hua, Lei Zhang
Abstract: Optimization techniques are of great importance to effectively and efficiently train a deep neural network (DNN). It has been shown that using the first and second order statistics (e.g., mean and variance) to perform Z-score standardization on network activations or weight vectors, such as batch normalization (BN) and weight standardization (WS), can improve the training performance. Different from these existing methods that mostly operate on activations or weights, we present a new optimization technique, namely gradient centralization (GC), which operates directly on gradients by centralizing the gradient vectors to have zero mean. GC can be viewed as a projected gradient descent method with a constrained loss function. We show that GC can regularize both the weight space and output feature space so that it can boost the generalization performance of DNNs. Moreover, GC improves the Lipschitzness of the loss function and its gradient so that the training process becomes more efficient and stable. GC is very simple to implement and can be easily embedded into existing gradient based DNN optimizers with only one line of code. It can also be directly used to fine-tune the pre-trained DNNs. Our experiments on various applications, including general image classification, fine-grained image classification, detection and segmentation, demonstrate that GC can consistently improve the performance of DNN learning. The code of GC can be found at this https URL.
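The "one line of code" claim can be made concrete: GC removes the mean from each multi-dimensional gradient tensor before the optimizer update, so every gradient vector has zero mean. A sketch of that operation (where exactly it hooks into a particular optimizer is left out here):

```python
import torch

def centralize_gradient(grad):
    """Subtract from each weight-matrix/conv-kernel gradient its mean over all
    dimensions except the first, giving zero-mean gradient vectors."""
    if grad.dim() > 1:
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    return grad

g = torch.rand(8, 3, 3, 3)
print(centralize_gradient(g).mean(dim=(1, 2, 3)))   # ~zeros per output channel
```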
18. Self-Paced Deep Regression Forests with Consideration on Underrepresented Samples [PDF] 返回目录
Lili Pan, Shijie Ai, Yazhou Ren, Zenglin Xu
Abstract: Deep discriminative models (e.g. deep regression forests, deep Gaussian processes) have been extensively studied recently to solve problems such as facial age estimation and head pose estimation. Most existing methods seek robust and unbiased solutions through either learning more discriminative features or weighting samples. We argue that it is more desirable to gradually learn to discriminate, as human beings do, and hence we resort to self-paced learning (SPL). A natural question then arises: can a self-paced regime guide deep discriminative models to obtain more robust and less biased solutions? To this end, this paper proposes a new deep discriminative model: self-paced deep regression forests considering sample uncertainty (SPUDRFs). It builds up a new self-paced learning paradigm: easy and underrepresented samples first. This paradigm could be extended to combine with a variety of deep discriminative models. Extensive experiments on two computer vision tasks, i.e., facial age estimation and head pose estimation, demonstrate the efficacy of SPUDRFs, where state-of-the-art performances are achieved.
摘要:深判别模型(如深回归森林,深高斯过程)已被广泛最近研究解决的问题,如人脸年龄估计和头部姿势估计。大多数现有的方法追求过任何学习更有辨别力的特征,或加权的样本,以实现强大的和公正的解决方案。我们认为更重要的是希望是慢慢学会像我们人类歧视,因此我们采取自学(SPL)。那么,一个自然的问题出现了:可以自定进度的制度引导深判别模型,以获得更强大和更小的解决方案偏见?为此,本文提出了一种新的深判别模型 - 自定进度的深林的回归考虑样品的不确定性(SPUDRFs)。它建立了一个新的自我学习模式:容易和缺额样品第一。这种模式可以扩展到与各种深判别模型的结合。两个计算机视觉任务,广泛的实验,即人脸年龄估计和头部姿态估计,证明SPUDRFs,在国家的最先进的性能得以实现的功效。
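For readers unfamiliar with SPL, a minimal sketch of the classic hard self-paced weighting (generic SPL only; the paper's objective additionally accounts for sample uncertainty and underrepresentation) looks like this:

```python
import numpy as np

def self_paced_weights(losses, age):
    # Hard self-paced regularizer: include a sample (weight 1) only if its
    # current loss is below the "model age" threshold; raising the threshold
    # over training lets harder samples enter later.
    return (losses < age).astype(float)

losses = np.array([0.2, 1.5, 0.7, 3.0])
for age in (0.5, 1.0, 2.0, 4.0):
    print(age, self_paced_weights(losses, age))
```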
19. Disassembling Object Representations without Labels [PDF] 返回目录
Zunlei Feng, Xinchao Wang, Yongming He, Yike Yuan, Xin Gao, Mingli Song
Abstract: In this paper, we study a new representation-learning task, which we term disassembling object representations. Given an image featuring multiple objects, the goal of disassembling is to acquire a latent representation of which each part corresponds to one category of objects. Disassembling thus finds application in a wide range of domains such as image editing and few- or zero-shot learning, as it enables category-specific modularity in the learned representations. To this end, we propose an unsupervised approach to achieving disassembling, named Unsupervised Disassembling Object Representation (UDOR). UDOR follows a double auto-encoder architecture, in which a fuzzy classification and an object-removing operation are imposed. The fuzzy classification constrains each part of the latent representation to encode features of at most one object category, while the object-removing operation, combined with a generative adversarial network, enforces the modularity of the representations and the integrity of the reconstructed image. Furthermore, we devise two metrics to respectively measure the modularity of disassembled representations and the visual integrity of reconstructed images. Experimental results demonstrate that the proposed UDOR, despite being unsupervised, achieves truly encouraging results on par with those of supervised methods.
摘要:在本文中,我们研究了一种新的表示学习任务,这是我们称为拆解对象表示。给定的图像设有多个对象,拆卸的目标是获得一个潜表示,其中的每个部分对应于对象中的一个类别。拆卸因此认为其在广泛领域的应用,例如图像编辑和few-或零射门的学习,因为它能够在学习表示类别的特定模块。为此,我们提出了一种无监督的办法来实现拆卸,名为无监督拆解对象表示(UDOR)。 UDOR如下双自动编码器体系结构,其中,模糊分类和去除对象的操作中施加的。模糊分类约束到一个对象类的潜表示来编码的各部分的功能,同时该物体去除,结合一个生成对抗性网络,强制执行表示和重建图像的完整性的模块化。此外,我们设计两个指标分别测量分解表示的模块化和重建图像的视觉完整性。实验结果表明,该UDOR,despited无监督,真正做到鼓励看齐结果与受监督的方法。
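To illustrate what a disassembled code enables, the object-removing operation can be sketched as zeroing the latent part assigned to one category (the slicing here is hypothetical; UDOR learns the assignment via its fuzzy classifier and adversarial training):

```python
import numpy as np

def remove_category(latent, part_slices, category):
    # Zero the latent part tied to one object category; decoding the edited
    # code should reconstruct the image without objects of that category.
    edited = latent.copy()
    edited[..., part_slices[category]] = 0.0
    return edited

z = np.random.randn(1, 12)                                # toy 12-dim code, 3 parts
parts = {0: slice(0, 4), 1: slice(4, 8), 2: slice(8, 12)}
z_edited = remove_category(z, parts, category=1)
```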
20. Demographic Bias: A Challenge for Fingervein Recognition Systems? [PDF] 返回目录
P. Drozdowski, B. Prommegger, G. Wimmer, R. Schraml, C. Rathgeb, A. Uhl, C. Busch
Abstract: Recently, concerns regarding potential biases in the underlying algorithms of many automated systems (including biometrics) have been raised. In this context, a biased algorithm produces statistically different outcomes for different groups of individuals based on certain (often protected by anti-discrimination legislation) attributes such as sex and age. While several preliminary studies investigating this matter for facial recognition algorithms do exist, said topic has not yet been addressed for vascular biometric characteristics. Accordingly, in this paper, several popular types of recognition algorithms are benchmarked to ascertain the matter for fingervein recognition. The experimental evaluation suggests lack of bias for the tested algorithms, although future works with larger datasets are needed to validate and confirm those preliminary results.
摘要:近日,关于许多自动化系统(包括生物识别技术)的底层算法可能存在的偏差有关人士提出。在此背景下,有偏算法产生基于某些个人的不同群体(通常由反歧视立法的保护)统计结果的不同属性,如性别和年龄。尽管一些初步的研究调查这件事情的面部识别算法确实存在,说的话题还没有得到解决的血管生物特征。因此,在本文中,几种常见类型的识别算法基准确定为fingervein承认此事。实验评估表明缺乏偏见的测试算法,但未来更大的数据集的作品,需要验证和证实这些初步的结果。
21. TEA: Temporal Excitation and Aggregation for Action Recognition [PDF] 返回目录
Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, Limin Wang
Abstract: Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short- and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of sub-convolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
摘要:时空建模是在影片动作识别键。它通常认为这两个短距离运动和长程聚集。在本文中,我们提出了一个颞激发和聚合(TEA)块,包括一个运动激励(ME)模块和多个时间聚合(MTA)模块,专门用于捕获短期和长程时间演变。特别地,对于短距离运动建模,所述ME模块计算从时空特征的特征级别的时间差。然后它利用的差异来激发的特征的运动敏感通道。在以前的作品中远距离时间聚合通常由堆放着大量的本地时间回旋的实现。每一圈一次处理本地时间窗口。与此相反,MTA模块建议本地卷积变形到一组子卷积,形成了分层的残余结构。而不引入额外的参数,所述特征将与一系列子的卷积处理,并且每个帧可以与街区完成多个时间聚合。时间维度的最终等效感受域相应地扩大,其能够在建模遥远帧远程时间关系的。的TEA块的两个分量是在时间建模的互补性。最后,我们的方法实现了在低拖上几个动作识别基准,如动力学,东西出头,HMDB51和UCF101,这证实了其有效性和效率令人印象深刻的结果。
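A rough sketch of the motion-excitation idea, using feature-level temporal differences to gate motion-sensitive channels (the actual ME module also applies channel reduction and a spatial convolution):

```python
import numpy as np

def motion_excitation(x):
    # x: (N, T, C, H, W) spatiotemporal features.
    diff = np.zeros_like(x)
    diff[:, :-1] = x[:, 1:] - x[:, :-1]          # feature-level temporal differences
    # Pool the differences to a per-channel descriptor and gate with a sigmoid.
    attn = 1.0 / (1.0 + np.exp(-diff.mean(axis=(3, 4), keepdims=True)))
    return x + x * attn                          # residually excite motion channels
```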
22. LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention [PDF] 返回目录
Junbo Yin, Jianbing Shen, Chenye Guan, Dingfu Zhou, Ruigang Yang
Abstract: Existing LiDAR-based 3D object detectors usually focus on the single-frame detection, while ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which can emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.
摘要:现有的基于激光雷达-3D对象检测器通常集中在单帧检测,而忽略在连续的点云的帧的时空信息。在本文中,我们提出了一个终端到终端的在线3D上点云序列操作视频对象检测器。该模型包括空间特征编码分量和时空特征聚集组件。在前者的部件,传送网络的新的支柱消息(PMPNet)建议每个离散点云帧进行编码。它适应性收集信息从邻国通过迭代消息传递,从而有效地扩大支柱特征的感受野的支柱节点。在后者的部件,提出了一种细心时空变压器GRU(AST-GRU)聚集所述时空信息,这增强了常规ConvGRU与细心存储器选通机构。 AST-GRU包含空间变换器注意(STA)模块和一个时间转换注意(TTA)模块,该模块可以强调前景对象和分别对齐动态对象。实验结果表明,所提出的3D视频对象检测器实现对大型nuScenes基准状态的最先进的性能。
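The pillar message-passing step can be sketched as iterated neighborhood aggregation over a pillar graph (a simplification; PMPNet uses learned, gated update functions):

```python
import numpy as np

def pillar_message_passing(feats, neighbors, steps=2):
    # feats: (num_pillars, dim); neighbors[i]: indices of pillar i's neighbors.
    # Each iteration enlarges a pillar's receptive field by one graph hop.
    h = feats.copy()
    for _ in range(steps):
        h = np.stack([
            np.maximum(h[i], h[nbrs].max(axis=0)) if len(nbrs) else h[i]
            for i, nbrs in enumerate(neighbors)
        ])
    return h
```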
23. Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking [PDF] 返回目录
Seyed Mojtaba Marvasti-Zadeh, Hossein Ghanei-Yakhdan, Shohreh Kasaei, Kamal Nasrollahi, Thomas B. Moeslund
Abstract: Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of twelve state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.
摘要:视觉对象跟踪仍然是计算机视觉活跃的研究领域,由于与真实世界场景的各种问题的具体因素持续存在的挑战。许多现有的基础上辨别相关滤波器的跟踪方法(的DCF)聘请特征提取网络(沼泽),以在学习过程中目标外观模型。然而,使用深特征映射基于不同的残余神经网络(ResNets)先前没有被研究沼泽萃取。本文的目的是评估的国家的最先进的12 RESNET基础,沼泽的性能在基于贴现现金流的框架,以确定最佳的视觉跟踪的目的。首先,他们的排名最好的功能地图和探索推广采用基于RESNET最佳FEN转换成另一种基于贴现现金流法。然后,该方法提取物完全卷积FEN与RESNET为主的最大的特点保险丝它深层语义信息映射到加强的连续卷积滤波器学习过程中目标表示。最后,它介绍了一种新型高效的语义加权方法(使用每个视频帧上语义分割特征图),以减少漂移问题。在公知的OTB-2013,OTB-2015,TC-128和VOT-2018视觉跟踪数据集广泛的实验结果表明,所提出的方法有效地性能优于国家的最先进的方法中的视觉跟踪的精度和鲁棒性方面。
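The semantic weighting step can be reduced, for illustration, to modulating the correlation response map with a same-size segmentation foreground probability map (an illustrative simplification of the proposed scheme):

```python
import numpy as np

def semantic_weighting(response, fg_prob):
    # response: (H, W) correlation-filter response; fg_prob: (H, W) semantic
    # foreground probability. Down-weighting background peaks reduces drift.
    weighted = response * fg_prob
    peak = np.unravel_index(np.argmax(weighted), weighted.shape)
    return weighted, peak
```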
24. Sequential Learning for Domain Generalization [PDF] 返回目录
Da Li, Yongxin Yang, Yi-Zhe Song, Timothy Hospedales
Abstract: In this paper we propose a sequential learning framework for Domain Generalization (DG), the problem of training a model that is robust to domain shift by design. Various DG approaches have been proposed with different motivating intuitions, but they typically optimize for a single step of domain generalization: training on one set of domains and generalizing to another. Our sequential learning is inspired by the idea of lifelong learning, where accumulated experience means that learning the $n^{th}$ thing becomes easier than the $1^{st}$ thing. In DG this means encountering a sequence of domains and, at each step, training to maximise performance on the next domain. The performance at domain $n$ then depends on the previous $n-1$ learning problems. Thus backpropagating through the sequence means optimizing performance not just for the next domain, but for all following domains. Training on all such domain sequences provides dramatically more 'practice' for a base DG learner compared to existing approaches, thus improving performance on a true testing domain. This strategy can be instantiated for different base DG algorithms, but we focus on its application to the recently proposed Meta-Learning Domain Generalization (MLDG). We show that for MLDG it leads to a simple-to-implement and fast algorithm that provides consistent performance improvement on a variety of DG benchmarks.
摘要:在本文中,我们提出了域泛化(DG)的训练模式,具有较强的抗设计领域转变的问题顺序学习框架。各种DG方法已经提出了具有不同的激励的直觉,但它们通常用于优化域概括的一个步骤 - 训练在一组域和推广到一个其他。我们的学习顺序由理念终身学习,在积累经验的手段,学习的$ N ^ {个} $的事情变得比$ 1→{ST} $的东西更容易启发。在此DG装置遇到在每一步训练域和一个序列,以最大化对下一个域的性能。在域$ n $的性能则取决于前$ N-1 $学习问题。因此,通过序列手段优化性能不只是为下一个域backpropagating,但所有以下域。上结构域的所有这样的序列的训练提供相比现有的方法,从而提高了一个真实的测试域性能碱DG学习者显着更`实践”。这种策略可以实例化不同的基础DG的算法,但是我们注重其应用到最近提出的元学习域泛化(MLDG)。我们表明,MLDG它导致实现一个简单而快速的算法,对各种DG基准提供一致的性能改进。
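One epoch of the sequential idea can be sketched as follows, assuming user-supplied adapt and eval_loss callables (the full method also backpropagates through the entire domain sequence):

```python
import random

def sequential_dg_epoch(model, domains, adapt, eval_loss):
    # Visit domains in a random order; adapt on the current domain and
    # accumulate the generalization loss measured on the next one.
    order = random.sample(domains, len(domains))
    meta_loss = 0.0
    for cur, nxt in zip(order[:-1], order[1:]):
        model = adapt(model, cur)          # train step on the current domain
        meta_loss += eval_loss(model, nxt) # generalization to the next domain
    return model, meta_loss
```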
25. FairALM: Augmented Lagrangian Method for Training Fair Models with Little Regret [PDF] 返回目录
Vishnu Suresh Lokhande, Aditya Kumar Akash, Sathya N. Ravi, Vikas Singh
Abstract: Algorithmic decision making based on computer vision and machine learning technologies continues to permeate our lives. But issues related to the biases of these models, and the extent to which they treat certain segments of the population unfairly, have led to concern among the general public. It is now accepted that, because of biases in the datasets we present to the models, fairness-oblivious training will lead to unfair models. An interesting topic is the study of mechanisms via which the de novo design or training of the model can be informed by fairness measures. Here, we study mechanisms that impose fairness concurrently while training the model. While existing fairness-based approaches in vision have largely relied on training adversarial modules together with the primary classification/regression task, in an effort to remove the influence of the protected attribute or variable, we show how ideas based on well-known optimization concepts can provide a simpler alternative. In our proposed scheme, imposing fairness just requires specifying the protected attribute and utilizing our optimization routine. We provide a detailed technical analysis and present experiments demonstrating that various fairness measures from the literature can be reliably imposed on a number of training tasks in vision in a manner that is interpretable.
摘要:基于计算机视觉和机器学习技术算法决策不断渗透到我们的生活。但是,与这些模型和他们对待某些群体不公平程度的偏差问题,导致了广大市民的关注。人们现在认识到,由于在数据集中的偏见,我们提出的模型,一个公平,无视培训将导致不公平的机型。一个有趣的话题是,通过该模型的从头设计或训练可以通过公平的措施通知机制研究。在这里,我们研究认为并处公平而训练模型的机制。虽然在视觉现有的公平性为基础的方法在很大程度上就与主要分类/回归任务训练对抗性的模块组合在一起,以努力消除受保护的属性或变量的影响依赖,我们会根据众所周知的优化概念,想法如何能提供一个简单的选择。在我们提出的方案,实行公平只是需要指定受保护的属性,并利用我们的优化程序。我们提供详细的技术分析和目前的实验证明从文献中各种公平措施,能够可靠地对若干的,其方式是可解释的训练任务中的视力。
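The core augmented-Lagrangian update for an equality fairness constraint c(theta) = 0 can be sketched as generic ALM (not the paper's exact formulation):

```python
def fair_alm_step(theta, lam, rho, task_grad, c, c_grad, lr=0.01):
    # Augmented Lagrangian: L(theta, lam) = f(theta) + lam*c(theta) + (rho/2)*c(theta)**2
    # Primal: gradient step on theta; dual: lam <- lam + rho * c(theta).
    cv = c(theta)
    grad = task_grad(theta) + (lam + rho * cv) * c_grad(theta)
    theta = theta - lr * grad
    lam = lam + rho * cv
    return theta, lam
```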
26. Deep White-Balance Editing [PDF] 返回目录
Mahmoud Afifi, Michael S. Brown
Abstract: We introduce a deep learning approach to realistically edit an sRGB image's white balance. Cameras capture sensor images that are rendered by their integrated signal processor (ISP) to a standard RGB (sRGB) color space encoding. The ISP rendering begins with a white-balance procedure that is used to remove the color cast of the scene's illumination. The ISP then applies a series of nonlinear color manipulations to enhance the visual quality of the final sRGB image. Recent work by [3] showed that sRGB images that were rendered with the incorrect white balance cannot be easily corrected due to the ISP's nonlinear rendering. The work in [3] proposed a k-nearest neighbor (KNN) solution based on tens of thousands of image pairs. We propose to solve this problem with a deep neural network (DNN) architecture trained in an end-to-end manner to learn the correct white balance. Our DNN maps an input image to two additional white-balance settings corresponding to indoor and outdoor illuminations. Our solution not only is more accurate than the KNN approach in terms of correcting a wrong white-balance setting but also provides the user the freedom to edit the white balance in the sRGB image to other illumination settings.
摘要:介绍了深刻的学习方法,切实编辑的sRGB的图像的白平衡。由它们的积分信号处理器(ISP)呈现给标准RGB(sRGB)可色彩空间编码摄像机采集传感器的图像。所述ISP呈现开始于用于去除场景的照明的偏色白平衡过程。该ISP然后应用一系列非线性颜色的处理,以提高最终的sRGB图像的视觉质量。通过最近的工作[3]表明,这是呈现与不正确的白平衡的sRGB图像不能轻易因ISP的非线性渲染纠正。在[3]的工作提出了一种基于几万图像对的k最近邻(KNN)溶液。我们建议在一个终端到终端的方式来学习正确的白平衡培养出深层神经网络(DNN)架构来解决这个问题。我们的DNN映射对应于室内和室外照明的输入图像到两个额外的白平衡设定。我们的解决方案不仅比纠正一个错误的白平衡设置方面的KNN方法更精确,但也提供了用户自由编辑的sRGB图像中的其他照明设置白平衡。
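Since the network outputs renditions under two additional white-balance settings, the user-facing editing can be sketched as blending between two rendered outputs (purely illustrative):

```python
import numpy as np

def blend_white_balance(img_indoor, img_outdoor, alpha):
    # Interpolate between the two auxiliary white-balance renditions;
    # alpha in [0, 1] slides from the indoor to the outdoor setting.
    return np.clip((1.0 - alpha) * img_indoor + alpha * img_outdoor, 0.0, 1.0)
```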
27. Context-Aware Multi-Task Learning for Traffic Scene Recognition in Autonomous Vehicles [PDF] 返回目录
Younkwan Lee, Jihyo Jeon, Jongmin Yu, Moongu Jeon
Abstract: Traffic scene recognition, which requires various visual classification tasks, is a critical ingredient in autonomous vehicles. However, most existing approaches treat each relevant task independently of one another, never considering the entire system as a whole. Because of this, they are limited to a task-specific set of features for all possible inference-time tasks, which forgoes the ability to leverage common task-invariant contextual knowledge for the task at hand. To address this problem, we propose an algorithm that jointly learns task-specific and shared representations by adopting a multi-task learning network. Specifically, we present a lower bound for the mutual-information constraint between the shared feature embedding and the input, designed to extract common contextual information across tasks while jointly preserving the essential information of each task. The learned representations capture richer contextual information without an additional task-specific network. Extensive experiments on the large-scale dataset HSD demonstrate the effectiveness and superiority of our network over state-of-the-art methods.
摘要:交通场景识别,这就需要各种视觉分类任务,是自主车的关键因素。然而,大多数现有的方法单独地处理每一个相关的任务彼此,从未考虑整个系统作为一个整体。正因为如此,它们仅限于利用特定任务的功能集的推断时间,而忽略了能力,利用共同的任务不变的背景知识,为手头的任务所有可能的任务。为了解决这个问题,我们提出了一种算法,采用多任务学习网络共同学习任务的具体和共享的表示。具体来说,我们提出了一个下界共享功能嵌入和被认为是能够跨任务提取共同的背景信息,同时保留联合每项任务的基本信息输入之间的互信息约束。该学会表示拍摄而无需额外的任务专用的网络更丰富的上下文信息。在大型数据集HSD大量的实验证明我们的网络在国家的最先进方法的有效性和优越性。
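The shared-plus-task-specific structure amounts to hard parameter sharing, sketched below (the paper's mutual-information constraint on the shared embedding is omitted; all names are hypothetical):

```python
import numpy as np

def multitask_forward(x, shared_w, task_heads):
    # One shared representation feeds several task-specific linear heads.
    h = np.maximum(x @ shared_w, 0.0)                 # shared ReLU feature
    return {task: h @ w for task, w in task_heads.items()}

x = np.random.randn(2, 64)
shared_w = np.random.randn(64, 128)
heads = {"weather": np.random.randn(128, 4), "road": np.random.randn(128, 6)}
logits = multitask_forward(x, shared_w, heads)
```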
28. Learning Pose-invariant 3D Object Reconstruction from Single-view Images [PDF] 返回目录
Bo Peng, Wei Wang, Jing Dong, Tieniu Tan
Abstract: Learning to reconstruct 3D shapes using 2D images is an active research topic, with the benefit of not requiring expensive 3D data. However, most work in this direction requires multi-view images of each object instance as training supervision, which oftentimes does not apply in practice. In this paper, we relax the common multi-view assumption and explore a more challenging yet more realistic setup: learning 3D shape from only single-view images. The major difficulty lies in the insufficient constraints that single-view images can provide, which leads to the problem of pose entanglement in the learned shape space. As a result, reconstructed shapes vary with the input pose and have poor accuracy. We address this problem by taking a novel domain adaptation perspective, and propose an effective adversarial domain confusion method to learn a pose-disentangled compact shape space. Experiments on single-view reconstruction show effectiveness in resolving pose entanglement, and the proposed method achieves state-of-the-art reconstruction accuracy with high efficiency.
摘要:学习重建利用二维图像是一个活跃的研究话题,不需要昂贵的3D数据的好处的3D形状。然而,在这个方向上大部分的工作需要为每个对象实例作为训练的监督,这常常不会在实际中应用多视角图像。在本文中,我们放松了常见的多视角的假设,并探索从唯一的单一视图图像学习三维形状的更具挑战但更现实的设置。主要的困难在于约束不足,可以通过单一视图图像来提供,这导致了解到形状空间姿势纠缠的问题。其结果是,重建的形状沿输入姿势各不相同,具有精度差。我们通过采取新的领域适应性的角度来处理这个问题,并提出有效的对抗域混乱的方法来学习的姿态,解开小巧的造型空间。在单视图重建显示在解决姿势缠结有效性,并且所提出的方法实现了实验以高效率状态的最先进的重建精度。
29. A Fast Fully Octave Convolutional Neural Network for Document Image Segmentation [PDF] 返回目录
Ricardo Batista das Neves Junior, Luiz Felipe Verçosa, David Macêdo, Byron Leite Dantas Bezerra, Cleber Zanchettin
Abstract: Know Your Customer (KYC) and Anti-Money Laundering (AML) are worldwide practices for online customer identification based on personal identification documents, similarity and liveness checking, and proof of address. To answer the basic regulatory question (are you who you say you are?), the customer needs to upload valid identification documents (ID). This task imposes some computational challenges, since these documents are diverse and may present varied and complex backgrounds, occlusion, partial rotation, poor quality, or damage. Advanced text and document segmentation algorithms were used to process the ID images. In this context, we investigated a method based on U-Net to detect the document edges and text regions in ID images. Despite the promising results on image segmentation, the U-Net based approach is computationally expensive for a real application, since the image segmentation runs on the customer's device. We propose a model optimization based on Octave Convolutions to adapt the method to situations where storage, processing, and time resources are limited, such as mobile and robotic applications. We conducted evaluation experiments on two new datasets, CDPhotoDataset and DTDDataset, which are composed of real ID images of Brazilian documents. Our results showed that the proposed models are efficient for document segmentation tasks and are portable.
摘要:了解你的客户(KYC)和反洗钱(AML)是全球惯例网上客户识别基础上的个人身份证明文件,相似性和活动检查,及地址证明。要回答的基本问题,调节:你是你说你是谁?客户需要上传有效身份证件(ID)。这项任务强加一些计算的挑战,因为这些文件是不同的,会呈现出不同的复杂背景,有些闭塞,局部旋转,质量差或损坏。高级文本和文件分割算法被用于处理该ID的图像。在这方面,我们调查基于U形网,以检测ID的图像文件的边缘和文字区域的方法。除了对图像分割的有希望的结果,基于掌中宽带的做法是一个真正的应用计算昂贵的,因为图像分割是客户设备的任务。我们提出了一种基于八度盘旋资格的方法的情况下存储,处理和时间资源是有限的,比如在移动和机器人应用的模型优化。我们进行了两个新的数据集CDPhotoDataset和DTDDataset评价实验,这是由巴西的文件真实ID图像。我们的研究结果表明,该模型是有效的文件分割任务,便于携带。
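The source of the speedup can be sketched as the octave feature split, storing a fraction of the channels at half resolution (simplified; a full octave convolution also has high-to-low and low-to-high cross paths):

```python
import numpy as np
from scipy.ndimage import zoom

def octave_split(x, alpha=0.5):
    # x: (C, H, W). A fraction alpha of the channels becomes a half-resolution
    # low-frequency group, cutting memory and FLOPs for those channels.
    c_low = int(alpha * x.shape[0])
    x_high = x[c_low:]                                 # full resolution
    x_low = zoom(x[:c_low], (1.0, 0.5, 0.5), order=1)  # half resolution
    return x_high, x_low
```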
30. From Paris to Berlin: Discovering Fashion Style Influences Around the World [PDF] 返回目录
Ziad Al-Halah, Kristen Grauman
Abstract: The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from everyday images of people wearing clothes. We introduce an approach that detects which cities influence which other cities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a forecasting model that predicts the popularity of any given style at any given city into the future. Demonstrating our idea with GeoStyle---a large-scale dataset of 7.7M images covering 44 major world cities, we present the discovered influence relationships, revealing how cities exert and receive fashion influence for an array of 50 observed visual styles. Furthermore, the proposed forecasting model achieves state-of-the-art results for a challenging style forecasting task, showing the advantage of grounding visual style evolution both spatially and temporally.
摘要:服装款式和他们在世界各地移民进化是耐人寻味的,但难以定量描述。我们建议发现并从人们穿衣服的日常图像量化方式的影响。我们介绍的方法,其检测城市中传播自己的风格方面影响其其他城市。然后,我们利用已发现的影响模式,以告知预测任何给定的风格的流行在任何给定城市的未来预测模型。证明了我们与GeoStyle --- 7.7M图像覆盖44个主要世界城市的大规模数据集的想法,我们目前所发现的影响关系,揭示城市如何发挥和接受时尚影响力的一个数组的50观察到的视觉风格。此外,所提出的预测模型实现了国家的最先进成果的挑战式的预测任务,显示在空间和时间接地视觉风格演变的优势。
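An influence-aware forecast can be sketched as blending a city's own trend with lagged signals from its discovered influencers (purely illustrative; the paper's forecasting model is learned from data):

```python
def influence_forecast(own, influencers, weights, beta=0.7):
    # own: popularity history of a style in one city; influencers: histories
    # of cities found to influence it; weights: their influence coefficients.
    trend = own[-1] + (own[-1] - own[-2])                 # naive local trend
    external = sum(w * h[-1] for w, h in zip(weights, influencers))
    return beta * trend + (1.0 - beta) * external
```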
31. Towards Better Generalization: Joint Depth-Pose Learning without PoseNet [PDF] 返回目录
Wang Zhao, Shaohui Liu, Yezhi Shu, Yong-Jin Liu
Abstract: In this work, we tackle the essential problem of scale inconsistency for self-supervised joint depth-pose learning. Most existing methods assume that a consistent scale of depth and pose can be learned across all input samples, which makes the learning problem harder, resulting in degraded performance and limited generalization in indoor environments and long-sequence visual odometry application. To address this issue, we propose a novel system that explicitly disentangles scale from the network estimation. Instead of relying on PoseNet architecture, our method recovers relative pose by directly solving fundamental matrix from dense optical flow correspondence and makes use of a two-view triangulation module to recover an up-to-scale 3D structure. Then, we align the scale of the depth prediction with the triangulated point cloud and use the transformed depth map for depth error computation and dense reprojection check. Our whole system can be jointly trained end-to-end. Extensive experiments show that our system not only reaches state-of-the-art performance on KITTI depth and flow estimation, but also significantly improves the generalization ability of existing self-supervised depth-pose learning methods under a variety of challenging scenarios, and achieves state-of-the-art results among self-supervised learning-based methods on KITTI Odometry and NYUv2 dataset. Furthermore, we present some interesting findings on the limitation of PoseNet-based relative pose estimation methods in terms of generalization ability. Code is available at this https URL.
摘要:在这项工作中,我们处理规模不一致的自我监督联合深度姿势学习的基本问题。大多数现有的方法假定的深度和姿态的一致规模可以在所有输入样本,这使得学习困难的问题,导致室内环境和长序列视觉里程应用的性能下降和有限的推广学习。为了解决这个问题,我们提出了明确从网络估计理顺了那些纷繁规模的新颖系统。代替通过从密集光流对应,使直接求解基本矩阵依靠PoseNet架构,我们的方法中恢复相对姿势的使用双视图三角测量模块的恢复向上按比例的3D结构。然后,我们对准深度预测的规模与三角点云和使用深度误差计算和密集的重投影检查变换的深度图。我们的整个系统进行联合训练的端至端。大量的实验表明,该系统不仅达到上KITTI深度国家的最先进的性能和流量估算,也显著提高现有的自我监督的深度姿势学习在各种具有挑战性的场景的方法的推广能力,并实现国家的最先进成果的KITTI里程计和NYUv2数据集自我监督的基于学习的方法之一。此外,我们提出的基于PoseNet相对姿态估计方法的局限性一些有趣的发现在泛化能力方面。代码可在此HTTPS URL。
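The scale-alignment step can be sketched as rescaling the up-to-scale depth prediction by the median ratio against triangulated depths (a common alignment choice; the paper folds this into its differentiable pipeline):

```python
import numpy as np

def align_depth_scale(pred_depth, tri_depth, valid):
    # pred_depth: up-to-scale single-view depth; tri_depth: two-view
    # triangulated depth; valid: boolean mask of reliable correspondences.
    scale = np.median(tri_depth[valid] / pred_depth[valid])
    return scale * pred_depth
```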
32. Generative PointNet: Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification [PDF] 返回目录
Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu
Abstract: We propose a generative model of unordered point sets, such as point clouds, in the form of an energy-based model, where the energy function is parameterized by an input-permutation-invariant bottom-up neural network. The energy function learns a coordinate encoding of each point and then aggregates all individual point features into an energy for the whole point cloud. We show that our model can be derived from the discriminative PointNet. The model can be trained by MCMC-based maximum likelihood learning (as well as its variants), without the help of any assisting networks like those in GANs and VAEs. Unlike most point cloud generators, which rely on hand-crafted distance metrics, our model does not require a hand-crafted distance metric for point cloud generation, because it synthesizes point clouds by matching observed examples in terms of the statistical properties defined by the energy function. Furthermore, we can learn a short-run MCMC toward the energy-based model as a flow-like generator for point cloud reconstruction and interpretation. The learned point cloud representation can also be useful for point cloud classification. Experiments demonstrate the advantages of the proposed generative model of point clouds.
摘要:本文提出的无序点集,如点云生成模型,在基于能量的模型,其中能量函数是由输入排列不变自下而上的神经网络参数的形式。能量函数学习的坐标的各点的编码,然后聚合所有个别点特征为能量为整个点云。我们表明,我们的模型可以从区别PointNet导出。该模型可以通过基于MCMC-最大似然学习(以及它的变体)进行训练,没有任何协助网络的帮助,就像那些在甘斯和VAES。与大多数点云发生器上手工编写距离度量relys,我们的模型不依赖于手工编写距离度量点云的产生,因为它通过由能量函数定义的统计特性方面匹配观察到的例子综合点云。此外,我们可以学到短期MCMC向基于能量的模型作为流动状发生器点云重建和解释。博学的点云表示可以是对点云的分类也非常有用。实验证明点云的提议生成模型的优势。
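Sampling from such an energy-based model is typically done with short-run Langevin dynamics; a sketch with a user-supplied energy-gradient callable:

```python
import numpy as np

def short_run_langevin(energy_grad, x, steps=64, step_size=0.01):
    # x: (n_points, 3) point set; update rule for p(x) proportional to exp(-E(x)):
    #   x <- x - (s^2 / 2) * dE/dx + s * noise
    for _ in range(steps):
        x = x - 0.5 * step_size ** 2 * energy_grad(x) \
              + step_size * np.random.randn(*x.shape)
    return x
```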
33. Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera [PDF] 返回目录
Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, Jan Kautz
Abstract: This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show the outstanding performance over existing methods.
摘要:本文提出了一种新的方法来从给定的动态场景的图像集合任意观点和次合成图像。用于新的视图合成中的关键挑战来自动态场景重建,其中极几何不适用于动态内容的局部运动。为了应对这一挑战,我们建议从单一视图(DSV)的深度和多视点立体(DMV),深度在那里DSV完成,即,深度被分配到每一个像素,但鉴于变在结合其规模的同时,DMV是视图不变尚未完成。我们的见解是,尽管它的规模和质量是与其它意见不一致的,从单个视图深度估计可以用于推理的动态内容的全局统一的几何形状。我们投这个问题作为学习纠正DSV的规模,并完善各深度之间的意见一致的局部运动,形成一个连贯的深度估计。我们这些任务整合到一个自我监督的方式深度融合网络。由于融合的深度地图,我们综合在一个特定的地点和时间与我们的混合深厚的网络,完成现场并呈现虚拟视图中的逼真虚拟视图。我们评估我们的深度估计和视图合成的不同真实世界的动态场景的方法,并显示了现有方法的出色表现。
34. BosphorusSign22k Sign Language Recognition Dataset [PDF] 返回目录
Oğulcan Özdemir, Ahmet Alp Kındıroğlu, Necati Cihan Camgöz, Lale Akarun
Abstract: Sign Language Recognition is a challenging research domain. It has recently seen several advancements with the increased availability of data. In this paper, we introduce the BosphorusSign22k, a publicly available large scale sign language dataset aimed at computer vision, video recognition and deep learning research communities. The primary objective of this dataset is to serve as a new benchmark in Turkish Sign Language Recognition for its vast lexicon, the high number of repetitions by native signers, high recording quality, and the unique syntactic properties of the signs it encompasses. We also provide state-of-the-art human pose estimates to encourage other tasks such as Sign Language Production. We survey other publicly available datasets and expand on how BosphorusSign22k can contribute to future research that is being made possible through the widespread availability of similar Sign Language resources. We have conducted extensive experiments and present baseline results to underpin future research on our dataset.
35. Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [PDF] 返回目录
Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine Toisoul, Victor Escorcia, Tao Xiang
Abstract: Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is challenging for two reasons. First, an effective attention module needs to learn what (objects and their local motion patterns), where (spatially), and when (temporally) to focus on. Second, a video attention module must be efficient because existing action recognition models already suffer from high computational cost. To address both challenges, a novel What-Where-When (W3) video attention module is proposed. Departing from existing alternatives, our W3 module models all three facets of video attention jointly. Crucially, it is extremely efficient by factorizing the high-dimensional video feature data into low-dimensional meaningful spaces (1D channel vector for `what' and 2D spatial tensors for `where'), followed by lightweight temporal attention reasoning. Extensive experiments show that our attention model brings significant improvements to existing action recognition models, achieving new state-of-the-art performance on a number of benchmarks.
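As a rough illustration of the what/where/when factorization, the toy PyTorch module below applies a 1D channel attention vector, a 2D spatial attention map, and a softmax temporal weighting to a video feature tensor. It mimics the factorization idea only; the layer choices and sizes are assumptions, not the authors' W3 architecture.

```python
import torch
import torch.nn as nn

class ToyW3Attention(nn.Module):
    """Factorized video attention: 'what' (1D channel vector),
    'where' (2D spatial map), 'when' (temporal weights)."""
    def __init__(self, channels):
        super().__init__()
        self.what = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.where = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())
        self.when = nn.Linear(channels, 1)

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = x.reshape(b * t, c, h, w)
        f = f * self.what(f)                   # channel ('what') attention
        f = f * self.where(f)                  # spatial ('where') attention
        f = f.reshape(b, t, c, h, w)
        desc = f.mean(dim=(3, 4))              # per-frame descriptors (B, T, C)
        a = torch.softmax(self.when(desc), 1)  # temporal ('when') weights (B, T, 1)
        return f * a.unsqueeze(-1).unsqueeze(-1)

x = torch.randn(2, 8, 16, 14, 14)              # batch of 8-frame clips
print(ToyW3Attention(16)(x).shape)             # torch.Size([2, 8, 16, 14, 14])
```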
36. Guided Variational Autoencoder for Disentanglement Learning [PDF] 返回目录
Zheng Ding, Yifan Xu, Weijian Xu, Gaurav Parmar, Yang Yang, Max Welling, Zhuowen Tu
Abstract: We propose an algorithm, guided variational autoencoder (Guided-VAE), that is able to learn a controllable generative model by performing latent representation disentanglement learning. The learning objective is achieved by providing signals to the latent encoding/embedding in VAE without changing its main backbone architecture, hence retaining the desirable properties of the VAE. We design an unsupervised strategy and a supervised strategy in Guided-VAE and observe enhanced modeling and controlling capability over the vanilla VAE. In the unsupervised strategy, we guide the VAE learning by introducing a lightweight decoder that learns latent geometric transformation and principal components; in the supervised strategy, we use an adversarial excitation and inhibition mechanism to encourage the disentanglement of the latent variables. Guided-VAE enjoys its transparency and simplicity for the general representation learning task, as well as disentanglement learning. On a number of experiments for representation learning, improved synthesis/sampling, better disentanglement for classification, and reduced classification errors in meta-learning have been observed.
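A minimal sketch of the unsupervised variant's structure: a standard VAE backbone plus a lightweight secondary decoder that reads only the first few latent dimensions, so those dimensions are pushed to carry designated factors. All dimensions, layers, and the loss weighting are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGuidedVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=8, guided=2):
        super().__init__()
        self.guided = guided
        self.enc = nn.Linear(x_dim, 2 * z_dim)      # outputs (mu, logvar)
        self.dec = nn.Linear(z_dim, x_dim)          # main backbone decoder
        self.light_dec = nn.Linear(guided, x_dim)   # lightweight guided decoder

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(z)                                   # full reconstruction
        guided_recon = self.light_dec(z[:, :self.guided])     # from guided dims only
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return F.mse_loss(recon, x) + F.mse_loss(guided_recon, x) + 1e-3 * kl

loss = TinyGuidedVAE()(torch.rand(16, 784))
print(float(loss))  # scalar training loss
```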
37. Semantic Segmentation of Underwater Imagery: Dataset and Benchmark [PDF] 返回目录
Md Jahidul Islam, Chelsey Edge, Yuyang Xiao, Peigen Luo, Muntaqim Mehtaz, Christopher Morse, Sadman Sakib Enan, Junaed Sattar
Abstract: In this paper, we present the first large-scale dataset for semantic Segmentation of Underwater IMagery (SUIM). It contains over 1500 images with pixel annotations for eight object categories: fish (vertebrates), reefs (invertebrates), aquatic plants, wrecks/ruins, human divers, robots, and sea-floor. The images are rigorously collected during oceanic explorations and human-robot collaborative experiments, and annotated by human participants. We also present a comprehensive benchmark evaluation of several state-of-the-art semantic segmentation approaches based on standard performance metrics. Additionally, we present SUIM-Net, a fully-convolutional deep residual model that balances the trade-off between performance and computational efficiency. It offers competitive performance while ensuring fast end-to-end inference, which is essential for its use in the autonomy pipeline by visually-guided underwater robots. In particular, we demonstrate its usability benefits for visual servoing, saliency prediction, and detailed scene understanding. With a variety of use cases, the proposed model and benchmark dataset open up promising opportunities for future research on underwater robot vision.
38. Deformation-Aware 3D Model Embedding and Retrieval [PDF] 返回目录
Mikaela Angelina Uy, Jingwei Huang, Minhyuk Sung, Tolga Birdal, Leonidas Guibas
Abstract: We introduce a new problem of $\textit{retrieving}$ 3D models that are not just similar but are deformable to a given query shape. We then present a novel deep $\textit{deformation-aware}$ embedding to solve this retrieval task. 3D model retrieval is a fundamental operation for recovering a clean and complete 3D model from a noisy and partial 3D scan. However, given a finite collection of 3D shapes, even the closest model to a query may not be a satisfactory reconstruction. This motivates us to apply 3D model deformation techniques to adapt the retrieved model so as to better fit the query. Yet, certain restrictions are enforced in most 3D deformation techniques to preserve important features of the original model that prevent a perfect fitting of the deformed model to the query. This gap between the deformed model and the query induces $\textit{asymmetric}$ relationships among the models, which cannot be handled by typical metric learning techniques. Thus, to retrieve the best models for fitting, we propose a novel deep embedding approach that learns the asymmetric relationships by leveraging location-dependent egocentric distance fields. We also propose two strategies for training the embedding network. We demonstrate that both of these approaches outperform other baselines in both synthetic evaluations and real 3D object reconstruction.
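One way to picture an asymmetric, deformation-aware cost is to give each database model a learned "reach" in embedding space: a query within a model's reach costs nothing to fit, so cost(query, model) need not equal cost(model, query). The sketch below is loose intuition only; the encoder, the reach head, and the hinge form are invented for illustration and are not the paper's egocentric distance fields.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricEmbedding(nn.Module):
    def __init__(self, in_dim=128, emb_dim=32):
        super().__init__()
        self.center = nn.Linear(in_dim, emb_dim)                         # shared embedding
        self.reach = nn.Sequential(nn.Linear(in_dim, 1), nn.Softplus())  # positive reach

    def fitting_cost(self, query_desc, model_desc):
        q = self.center(query_desc)
        m = self.center(model_desc)
        r = self.reach(model_desc).squeeze(-1)   # how far this model can deform
        return F.relu((q - m).norm(dim=-1) - r)  # zero if the query is 'in reach'

emb = AsymmetricEmbedding()
q, m = torch.randn(4, 128), torch.randn(4, 128)
# Asymmetric by construction: swapping arguments changes whose reach is used.
print(emb.fitting_cost(q, m), emb.fitting_cost(m, q))
```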
39. Temporal Accumulative Features for Sign Language Recognition [PDF] 返回目录
Ahmet Alp Kındıroğlu, Oğulcan Özdemir, Lale Akarun
Abstract: In this paper, we propose a set of features called temporal accumulative features (TAF) for representing and recognizing isolated sign language gestures. By incorporating sign language specific constructs to better represent the unique linguistic characteristic of sign language videos, we have devised an efficient and fast SLR method for recognizing isolated sign language gestures. The proposed method is an HSV based accumulative video representation where keyframes based on the linguistic movement-hold model are represented by different colors. We also incorporate hand shape information and using a small scale convolutional neural network, demonstrate that sequential modeling of accumulative features for linguistic subunits improves upon baseline classification results.
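The accumulation idea can be sketched in a few lines: fold a gesture's frames into a single HSV image whose hue encodes when a pixel was last active, a crude color-coding of temporal phases. This is a simplification for intuition; the foreground mask and the hue assignment are assumptions, not the paper's movement-hold keyframe scheme.

```python
import numpy as np

def temporal_accumulative_image(frames):
    """Fold T grayscale frames into one HSV image: hue ~ time of last
    activity, saturation marks active pixels, value accumulates intensity."""
    t = len(frames)
    h, w = frames[0].shape
    hsv = np.zeros((h, w, 3), dtype=np.float32)
    for i, f in enumerate(frames):
        mask = f > f.mean()                  # crude per-frame activity mask
        hsv[mask, 0] = i / max(t - 1, 1)     # hue encodes temporal position
        hsv[mask, 1] = 1.0                   # full saturation where active
        hsv[..., 2] += f / t                 # value averages intensity over time
    return hsv

frames = [np.random.rand(32, 32).astype(np.float32) for _ in range(10)]
print(temporal_accumulative_image(frames).shape)  # (32, 32, 3)
```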
40. Deep Transfer Learning for Texture Classification in Colorectal Cancer Histology [PDF] 返回目录
Srinath Jayachandran, Ashlin Ghosh
Abstract: Microscopic examination of tissues, or histopathology, is one of the diagnostic procedures for detecting colorectal cancer. The pathologist involved in such an examination usually identifies tissue type based on texture analysis, especially focusing on the tumour-stroma ratio. In this work, we automate the task of tissue classification within colorectal cancer histology samples using deep transfer learning. We use discriminative fine-tuning with a one-cycle policy and apply structure-preserving colour normalization to boost our results. We also provide visual explanations of the deep neural network's decisions on texture classification. Achieving a state-of-the-art test accuracy of 96.2%, we also adopt the deployment-friendly SqueezeNet architecture for memory-limited hardware.
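Discriminative fine-tuning with a one-cycle policy maps directly onto standard PyTorch machinery: give the pretrained backbone a lower learning rate than the new head, and drive both with OneCycleLR. The sketch below uses torchvision's SqueezeNet (mentioned in the abstract) with 8 output classes and made-up hyperparameters; it is not the authors' exact training recipe.

```python
import torch
import torchvision

model = torchvision.models.squeezenet1_1(weights=None)        # stand-in backbone
model.classifier[1] = torch.nn.Conv2d(512, 8, kernel_size=1)  # e.g. 8 tissue classes

optimizer = torch.optim.SGD(
    [{"params": model.features.parameters(), "lr": 1e-4},     # slow backbone
     {"params": model.classifier.parameters(), "lr": 1e-2}],  # fast new head
    momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=[1e-3, 1e-1], total_steps=100)          # one-cycle policy

for step in range(100):                    # dummy loop standing in for real batches
    x = torch.randn(4, 3, 224, 224)
    loss = model(x).mean()                 # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```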
41. Cell Segmentation by Combining Marker-Controlled Watershed and Deep Learning [PDF] 返回目录
Filip Lux, Petr Matula
Abstract: We propose a cell segmentation method for analyzing images of densely clustered cells. The method combines the strengths of marker-controlled watershed transformation and a convolutional neural network (CNN). We demonstrate the method universality and high performance on three Cell Tracking Challenge (CTC) datasets of clustered cells captured by different acquisition techniques. For all tested datasets, our method reached the top performance in both cell detection and segmentation. Based on a series of experiments, we observed: (1) Predicting both watershed marker function and segmentation function significantly improves the accuracy of the segmentation. (2) Both functions can be learned independently. (3) Training data augmentation by scaling and rigid geometric transformations is superior to augmentation that involves elastic transformations. Our method is simple to use, and it generalizes well for various data with state-of-the-art performance.
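The post-processing step is classical once the two predicted functions exist: label connected components of the thresholded marker map as seeds, then flood the (inverted) cell map with a marker-controlled watershed. A sketch with synthetic blobs standing in for the CNN outputs, and illustrative thresholds:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def cells_from_predictions(marker_prob, cell_prob, m_thr=0.5, c_thr=0.5):
    """Instances from a predicted marker function and segmentation function."""
    markers, _ = ndi.label(marker_prob > m_thr)   # one connected seed per cell
    mask = cell_prob > c_thr                      # foreground support
    return watershed(-cell_prob, markers=markers, mask=mask)

# Two Gaussian blobs stand in for the network's outputs.
yy, xx = np.mgrid[0:64, 0:64]
def blob(cy, cx):
    return np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / 60.0)

cell_prob = np.maximum(blob(20, 20), blob(40, 44))
marker_prob = np.maximum(blob(20, 20) ** 4, blob(40, 44) ** 4)  # sharper seeds
labels = cells_from_predictions(marker_prob, cell_prob)
print(labels.max())  # 2 separate instances
```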
42. Detection of Perineural Invasion in Prostate Needle Biopsies with Deep Neural Networks [PDF] 返回目录
Peter Ström, Kimmo Kartasalo, Pekka Ruusuvuori, Henrik Grönberg, Hemamali Samaratunga, Brett Delahunt, Toyonori Tsuzuki, Lars Egevad, Martin Eklund
Abstract: Background: The detection of perineural invasion (PNI) by carcinoma in prostate biopsies has been shown to be associated with poor prognosis. The assessment and quantification of PNI is, however, labor intensive. In this study we aimed to develop an algorithm based on deep neural networks to aid pathologists in this task. Methods: We collected, digitized and pixel-wise annotated the PNI findings in each of the approximately 80,000 biopsy cores from the 7,406 men who underwent biopsy in the prospective and diagnostic STHLM3 trial between 2012 and 2014. In total, 485 biopsy cores showed PNI. We also digitized more than 10% (n=8,318) of the PNI negative biopsy cores. Digitized biopsies from a random selection of 80% of the men were used to build deep neural networks, and the remaining 20% were used to evaluate the performance of the algorithm. Results: For the detection of PNI in prostate biopsy cores, the network had an estimated area under the receiver operating characteristic curve of 0.98 (95% CI 0.97-0.99) based on 106 PNI positive cores and 1,652 PNI negative cores in the independent test set. For the pre-specified operating point this translates to a sensitivity of 0.87 and a specificity of 0.97. The corresponding positive and negative predictive values were 0.67 and 0.99, respectively. For localizing the regions of PNI within a slide we estimated an average intersection over union of 0.50 (CI: 0.46-0.55). Conclusion: We have developed an algorithm based on deep neural networks for detecting PNI in prostate biopsies with apparently acceptable diagnostic properties. These algorithms have the potential to aid pathologists in day-to-day work by drastically reducing the number of biopsy cores that need to be assessed for PNI and by highlighting regions of diagnostic interest.
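For readers reproducing the evaluation protocol, the reported quantities at a fixed operating point reduce to confusion-matrix arithmetic plus an AUC. The snippet below computes them on synthetic scores; the study's data are not reproduced here, so all inputs are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def operating_point_metrics(y_true, scores, threshold):
    """AUC plus sensitivity/specificity/PPV/NPV at a pre-specified threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    tp = np.sum(pred & (y_true == 1)); fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0)); tn = np.sum(~pred & (y_true == 0))
    return dict(auc=roc_auc_score(y_true, scores),
                sensitivity=tp / (tp + fn), specificity=tn / (tn + fp),
                ppv=tp / (tp + fp), npv=tn / (tn + fn))

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)                               # synthetic labels
s = np.clip(y * 0.6 + rng.normal(0.2, 0.2, 500), 0, 1)    # informative scores
print(operating_point_metrics(y, s, threshold=0.5))
```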
43. Retinopathy of Prematurity Stage Diagnosis Using Object Segmentation and Convolutional Neural Networks [PDF] 返回目录
Alexander Ding, Qilei Chen, Yu Cao, Benyuan Liu
Abstract: Retinopathy of Prematurity (ROP) is an eye disorder primarily affecting premature infants with lower weights. It causes proliferation of vessels in the retina and could result in vision loss and, eventually, retinal detachment, leading to blindness. While human experts can easily identify severe stages of ROP, the diagnosis of earlier stages, which are the most relevant to determining treatment choice, is much more affected by variability in the subjective interpretations of human experts. In recent years, there has been a significant effort to automate the diagnosis using deep learning. This paper builds upon the success of previous models and develops a novel architecture, which combines object segmentation and convolutional neural networks (CNN) to construct an effective classifier of ROP stages 1-3 based on neonatal retinal images. Motivated by the fact that the formation and shape of a demarcation line in the retina is the distinguishing feature between earlier ROP stages, our proposed system first trains an object segmentation model to identify the demarcation line at a pixel level and adds the resulting mask as an additional "color" channel in the original image. Then, the system trains a CNN classifier based on the processed images to leverage information from both the original image and the mask, which helps direct the model's attention to the demarcation line. In a number of careful experiments comparing its performance to previous object segmentation systems and CNN-only systems trained on our dataset, our novel architecture significantly outperforms previous systems in accuracy, demonstrating the effectiveness of our proposed pipeline.
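The pipeline's key data-flow trick, stacking the predicted demarcation-line mask onto the image as an extra channel before classification, is a one-liner. Both networks below are tiny stand-ins, not the paper's models; only the channel concatenation reflects the described design.

```python
import torch
import torch.nn as nn

seg_model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 1, 1), nn.Sigmoid())    # toy line segmenter
classifier = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(16, 3))                    # ROP stages 1-3

image = torch.rand(2, 3, 128, 128)              # batch of retinal images
with torch.no_grad():
    mask = seg_model(image)                     # (2, 1, 128, 128) predicted mask
x = torch.cat([image, mask], dim=1)             # mask becomes the 4th 'color' channel
print(classifier(x).shape)                      # torch.Size([2, 3])
```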
44. Crossover-Net: Leveraging the Vertical-Horizontal Crossover Relation for Robust Segmentation [PDF] 返回目录
Qian Yu, Yinghuan Shi, Yefeng Zheng, Yang Gao, Jianbing Zhu, Yakang Dai
Abstract: Robust segmentation of non-elongated tissues in medical images is hard to achieve due to the large variation in the shape, size, and appearance of these tissues across patients. In this paper, we present an end-to-end trainable deep segmentation model termed Crossover-Net for robust segmentation in medical images. Our proposed model is inspired by an insightful observation: during segmentation, the representations from the horizontal and vertical directions can provide different local appearance and orthogonality context information, which helps enhance the discrimination between different tissues by simultaneously learning from these two directions. Specifically, converting the segmentation task into a pixel/voxel-wise prediction problem, we first propose a cross-shaped patch, namely the crossover-patch, which consists of a pair of (orthogonal and overlapped) vertical and horizontal patches, to capture the orthogonal vertical and horizontal relation. Then, we develop the Crossover-Net to learn the vertical-horizontal crossover relation captured by our crossover-patches. To achieve this goal, for learning the representation on a typical crossover-patch, we design a novel loss function to (1) impose consistency on the overlap region of the vertical and horizontal patches and (2) preserve diversity on their non-overlap regions. We have extensively evaluated our method on CT kidney tumor, MR cardiac, and X-ray breast mass segmentation tasks. Promising results are achieved according to our extensive evaluation and comparison with state-of-the-art segmentation models.
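The loss structure is easy to sketch for intuition: pull the two branches together on the central overlap of the cross, push them apart elsewhere. Treating both patch features as same-sized grids and using a cosine hinge for the diversity term are simplifying assumptions; the real crossover-patch geometry and loss are more involved.

```python
import torch
import torch.nn.functional as F

def crossover_loss(feat_v, feat_h, k, alpha=1.0, beta=0.1):
    """feat_v, feat_h: (B, C, S, S) features of the vertical/horizontal
    patches; k: side of the central overlap. alpha/beta are illustrative."""
    s = feat_v.shape[-1]
    lo, hi = (s - k) // 2, (s + k) // 2
    consistency = F.mse_loss(feat_v[:, :, lo:hi, lo:hi],
                             feat_h[:, :, lo:hi, lo:hi])   # agree on the overlap
    mask = torch.ones_like(feat_v)
    mask[:, :, lo:hi, lo:hi] = 0                           # keep non-overlap only
    sim = F.cosine_similarity((feat_v * mask).flatten(1),
                              (feat_h * mask).flatten(1), dim=1)
    diversity = sim.clamp(min=0).mean()                    # differ off the overlap
    return alpha * consistency + beta * diversity

v, h = torch.randn(4, 8, 16, 16), torch.randn(4, 8, 16, 16)
print(float(crossover_loss(v, h, k=4)))
```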
45. Characterization of Multiple 3D LiDARs for Localization and Mapping using Normal Distributions Transform [PDF] 返回目录
Alexander Carballo, Abraham Monrroy, David Wong, Patiphon Narksri, Jacob Lambert, Yuki Kitsukawa, Eijiro Takeuchi, Shinpei Kato, Kazuya Takeda
Abstract: In this work, we present a detailed comparison of ten different 3D LiDAR sensors, covering a range of manufacturers, models, and laser configurations, for the tasks of mapping and vehicle localization, using as common reference the Normal Distributions Transform (NDT) algorithm implemented in the self-driving open source platform Autoware. LiDAR data used in this study is a subset of our LiDAR Benchmarking and Reference (LIBRE) dataset, captured independently from each sensor, from a vehicle driven on public urban roads multiple times, at different times of the day. In this study, we analyze the performance and characteristics of each LiDAR for the tasks of (1) 3D mapping including an assessment map quality based on mean map entropy, and (2) 6-DOF localization using a ground truth reference map.
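Mean map entropy, the map-quality proxy mentioned above, averages the differential entropy of a Gaussian fitted to each point's local neighborhood; crisper maps score lower. A sketch with SciPy (the neighborhood size k and the regularizer are illustrative choices):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_map_entropy(points, k=10):
    """Average entropy 0.5*ln(det(2*pi*e*Cov)) of local 3D Gaussians."""
    _, idx = cKDTree(points).query(points, k=k)
    ents = []
    for nb in idx:
        cov = np.cov(points[nb].T) + 1e-9 * np.eye(3)   # regularized local covariance
        ents.append(0.5 * np.log(np.linalg.det(2 * np.pi * np.e * cov)))
    return float(np.mean(ents))

rng = np.random.default_rng(0)
crisp = rng.normal(size=(2000, 3)) * [5, 5, 0.01]       # thin, sharp 'wall'
blurry = crisp + rng.normal(scale=0.3, size=crisp.shape)
print(mean_map_entropy(crisp) < mean_map_entropy(blurry))  # True: crisp map wins
```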
46. STAN-CT: Standardizing CT Image using Generative Adversarial Network [PDF] 返回目录
Md Selim, Jie Zhang, Baowei Fei, Guo-Qiang Zhang, Jin Chen
Abstract: Computed tomography (CT) plays an important role in lung malignancy diagnostics and therapy assessment and facilitates precision medicine delivery. However, the use of personalized imaging protocols poses a challenge in large-scale cross-center CT image radiomic studies. We present an end-to-end solution called STAN-CT for CT image standardization and normalization, which effectively reduces discrepancies in image features caused by using different imaging protocols or different CT scanners with the same imaging protocol. STAN-CT consists of two components: 1) a novel Generative Adversarial Network (GAN) model that is capable of effectively learning the data distribution of a standard imaging protocol with only a few rounds of generator training, and 2) an automatic DICOM reconstruction pipeline with systematic image quality control that ensures the generation of high-quality standard DICOM images. Experimental results indicate that the training efficiency and model performance of STAN-CT are significantly improved compared to state-of-the-art CT image standardization and normalization algorithms.
47. Extraction and Assessment of Naturalistic Human Driving Trajectories from Infrastructure Camera and Radar Sensors [PDF] 返回目录
Dominik Notz, Felix Becker, Thomas Kühbeck, Daniel Watzenig
Abstract: Collecting realistic driving trajectories is crucial for training machine learning models that imitate human driving behavior. Most of today's autonomous driving datasets contain only a few trajectories per location and are recorded with test vehicles that are cautiously driven by trained drivers. In particular in interactive scenarios such as highway merges, the test driver's behavior significantly influences other vehicles. This influence prevents recording the whole traffic space of human driving behavior. In this work, we present a novel methodology to extract trajectories of traffic objects using infrastructure sensors. Infrastructure sensors allow us to record a lot of data for one location and take the test drivers out of the loop. We develop both a hardware setup consisting of a camera and a traffic surveillance radar and a trajectory extraction algorithm. Our vision pipeline accurately detects objects, fuses camera and radar detections and tracks them over time. We improve a state-of-the-art object tracker by combining the tracking in image coordinates with a Kalman filter in road coordinates. We show that our sensor fusion approach successfully combines the advantages of camera and radar detections and outperforms either single sensor. Finally, we also evaluate the accuracy of our trajectory extraction pipeline. For that, we equip our test vehicle with a differential GPS sensor and use it to collect ground truth trajectories. With this data we compute the measurement errors. While we use the mean error to de-bias the trajectories, the error standard deviation is in the magnitude of the ground truth data inaccuracy. Hence, the extracted trajectories are not only naturalistic but also highly accurate and prove the potential of using infrastructure sensors to extract real-world trajectories.
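The road-coordinate tracking stage can be pictured with a textbook constant-velocity Kalman filter: the fused camera/radar detection provides a noisy position measurement, and the filter recovers smooth position and velocity. The state layout and noise levels below are assumptions for illustration, not the paper's tuned tracker.

```python
import numpy as np

dt = 0.1
F = np.array([[1., 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]])
H = np.array([[1., 0, 0, 0], [0, 1, 0, 0]])   # observe position only: [s, d]
Q = 0.05 * np.eye(4)                          # process noise
R = 0.5 * np.eye(2)                           # fused detection noise

x, P = np.zeros(4), np.eye(4)                 # state [s, d, v_s, v_d]
rng = np.random.default_rng(0)
for step in range(50):
    true_pos = np.array([2.0 * dt * step, 0.0])       # target moves at 2 m/s
    z = true_pos + rng.normal(scale=0.5, size=2)      # noisy fused detection
    x, P = F @ x, F @ P @ F.T + Q                     # predict
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                    # Kalman gain
    x, P = x + K @ (z - H @ x), (np.eye(4) - K @ H) @ P   # update
print(np.round(x[2], 2))  # estimated longitudinal speed, close to 2 m/s
```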
48. Quantification of Tomographic Patterns associated with COVID-19 from Chest CT [PDF] 返回目录
Shikha Chaganti, Abishek Balachandran, Guillaume Chabin, Stuart Cohen, Thomas Flohr, Bogdan Georgescu, Philippe Grenier, Sasa Grbic, Siqi Liu, François Mellot, Nicolas Murray, Savvas Nicolaou, William Parker, Thomas Re, Pina Sanelli, Alexander W. Sauter, Zhoubing Xu, Youngjin Yoo, Valentin Ziebandt, Dorin Comaniciu
Abstract: Purpose: To present a method that automatically detects and quantifies abnormal tomographic patterns commonly present in COVID-19, namely Ground Glass Opacities (GGO) and consolidations. Given that high opacity abnormalities (i.e., consolidations) were shown to correlate with severe disease, the paper introduces two combined severity measures (Percentage of Opacity, Percentage of High Opacity) and (Lung Severity Score, Lung High Opacity Score). They quantify the extent of overall COVID-19 abnormalities and the presence of high opacity abnormalities, global and lobe-wise, respectively, being computed based on 3D segmentations of lesions, lungs, and lobes. Materials and Methods: The proposed method takes as input a non-contrasted Chest CT and segments the lesions, lungs, and lobes in 3D. It outputs two combined measures of the severity of lung/lobe involvement, quantifying both the extent of COVID-19 abnormalities and presence of high opacities, based on deep learning and deep reinforcement learning. The first measure (POO, POHO) is global, while the second (LSS, LHOS) is lobe-wise. Evaluation is reported on CTs of 100 subjects (50 COVID-19 confirmed and 50 controls) from institutions from Canada, Europe and US. Ground truth is established by manual annotations of lesions, lungs, and lobes. Results: Pearson Correlation Coefficient between method prediction and ground truth is 0.97 (POO), 0.98 (POHO), 0.96 (LSS), 0.96 (LHOS). Automated processing time to compute the severity scores is 10 seconds/case vs 30 mins needed for manual annotations. Conclusion: A new method identifies regions of abnormalities seen in COVID-19 non-contrasted Chest CT and computes (POO, POHO) and (LSS, LHOS) severity scores.
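The global severity measure reduces to mask arithmetic once the 3D segmentations exist: Percentage of Opacity is lesion volume over lung volume, computed globally and per lobe. A sketch on toy volumes (the shapes, lobe labels, and fake data are illustrative, and voxel volumes are assumed uniform):

```python
import numpy as np

def opacity_scores(lesion, lung, lobes):
    """POO (%) globally and per lobe from boolean lesion/lung masks and
    an integer lobe-label volume (labels 1..5, 0 = background)."""
    poo = 100.0 * lesion[lung].sum() / max(lung.sum(), 1)
    per_lobe = {l: 100.0 * lesion[lobes == l].sum() / max((lobes == l).sum(), 1)
                for l in range(1, 6)}
    return poo, per_lobe

lung = np.zeros((8, 8, 8), bool); lung[1:7, 1:7, 1:7] = True   # toy lung mask
lobes = np.zeros((8, 8, 8), int)
lobes[lung] = 1 + (np.arange(lung.sum()) % 5)                  # 5 fake lobe labels
lesion = np.zeros_like(lung); lesion[2:4, 2:4, 2:4] = True     # 8-voxel lesion
poo, per_lobe = opacity_scores(lesion, lung, lobes)
print(round(poo, 2))   # ~3.7 (% of lung volume that is opacity)
```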
49. Introducing Anisotropic Minkowski Functionals and Quantitative Anisotropy Measures for Local Structure Analysis in Biomedical Imaging [PDF] 返回目录
Axel Wismueller, Titas De, Eva Lochmueller, Felix Eckstein, Mahesh B. Nagarajan
Abstract: The ability of Minkowski Functionals to characterize local structure in different biological tissue types has been demonstrated in a variety of medical image processing tasks. We introduce anisotropic Minkowski Functionals (AMFs) as a novel variant that captures the inherent anisotropy of the underlying gray-level structures. To quantify the anisotropy characterized by our approach, we further introduce a method to compute a quantitative measure motivated by a technique utilized in MR diffusion tensor imaging, namely fractional anisotropy. We showcase the applicability of our method in the research context of characterizing the local structure properties of trabecular bone micro-architecture in the proximal femur as visualized on multi-detector CT. To this end, AMFs were computed locally for each pixel of ROIs extracted from the head, neck and trochanter regions. Fractional anisotropy was then used to quantify the local anisotropy of the trabecular structures found in these ROIs and to compare its distribution in different anatomical regions. Our results suggest a significantly greater concentration of anisotropic trabecular structures in the head and neck regions when compared to the trochanter region (p < 10^-4). We also evaluated the ability of such AMFs to predict bone strength in the femoral head of proximal femur specimens obtained from 50 donors. Our results suggest that such AMFs, when used in conjunction with multi-regression models, can outperform more conventional features such as BMD in predicting failure load. We conclude that such anisotropic Minkowski Functionals can capture valuable information regarding directional attributes of local structure, which may be useful in a wide scope of biomedical imaging applications.
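Fractional anisotropy, the DTI-derived measure referenced above, has a closed form in the three eigenvalues of a local structure tensor: FA = sqrt(3/2) * ||lambda - mean(lambda)|| / ||lambda||, ranging from 0 (isotropic) toward 1 (strongly oriented). A direct implementation follows; how the local tensor itself is estimated from AMFs is the paper's contribution and is not reproduced here.

```python
import numpy as np

def fractional_anisotropy(evals):
    """FA from the three eigenvalues of a local structure tensor."""
    lam = np.asarray(evals, dtype=float)
    num = np.sqrt(((lam - lam.mean()) ** 2).sum())
    den = np.sqrt((lam ** 2).sum())
    return np.sqrt(1.5) * num / den if den > 0 else 0.0

print(round(fractional_anisotropy([1.0, 1.0, 1.0]), 3))  # 0.0   (isotropic)
print(round(fractional_anisotropy([1.0, 0.1, 0.1]), 3))  # 0.891 (rod-like)
```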
50. Detection of Coronavirus (COVID-19) Associated Pneumonia based on Generative Adversarial Networks and a Fine-Tuned Deep Transfer Learning Model using Chest X-ray Dataset [PDF] 返回目录
Nour Eldeen M. Khalifa, Mohamed Hamed N. Taha, Aboul Ella Hassanien, Sally Elghamrawy
Abstract: According to the World Health Organization, the COVID-19 coronavirus is one of the most devastating viruses. This novel virus leads to pneumonia, an infection that inflames the air sacs of the lungs. One method of detecting this inflammation is chest X-ray imaging. In this paper, we present a pneumonia chest X-ray detection method based on generative adversarial networks (GANs) combined with deep transfer learning fine-tuned on a limited dataset. The use of the GAN improves the robustness of the proposed model, makes it immune to the overfitting problem, and helps generate more images from the dataset. The dataset used in this research consists of 5863 X-ray images in two categories: Normal and Pneumonia. This research uses only 10% of the dataset as training data and generates 90% of the images using the GAN, to demonstrate the efficiency of the proposed model. AlexNet, GoogLeNet, SqueezeNet, and ResNet18 are selected as deep transfer learning models for detecting pneumonia from chest X-rays. These models are chosen for the small number of layers in their architectures, which reduces model complexity as well as memory and time consumption. The combination of a GAN and deep transfer models proved efficient in terms of testing accuracy. The research concludes that ResNet18 is the most appropriate deep transfer model: with the GAN as an image augmenter, it achieved 99% in testing accuracy as well as in other performance metrics such as precision, recall, and F1 score. Finally, the results are compared with related work that used the same dataset, except that this research used only 10% of the original dataset; the presented work achieves superior testing accuracy.
摘要:根据世界卫生组织,COVID-19冠状病毒是最具破坏性的病毒之一。这种新型病毒会导致肺炎,即肺部气囊发炎的一种感染。检测这种炎症的方法之一是胸部X射线检查。本文提出一种基于生成对抗网络(GAN)并结合在有限数据集上微调的深度迁移学习的肺炎胸部X射线检测方法。GAN的使用提高了所提模型的鲁棒性,使其免受过拟合问题的影响,并有助于从数据集中生成更多图像。本研究使用的数据集包含5863张X射线图像,分为正常和肺炎两类。本研究仅使用数据集的10%作为训练数据,并利用GAN生成90%的图像,以证明所提模型的效率。本文选择AlexNet、GoogLeNet、SqueezeNet和ResNet18作为深度迁移学习模型,用于从胸部X射线中检测肺炎。选择这些模型是因为其网络层数较少,从而降低了模型的复杂度以及内存和时间消耗。根据测试准确率的衡量,GAN与深度迁移模型相结合被证明是有效的。研究结论是,ResNet18是最合适的深度迁移模型:在以GAN作为图像增强器时,其在测试准确率以及精确率、召回率和F1分数等其他性能指标上均达到99%。最后,本研究与使用同一数据集的相关工作进行了比较(不同之处在于本研究仅使用了原始数据集的10%),在测试准确率方面取得了优于相关工作的结果。
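The data-efficiency claim rests on a simple recipe: keep only a small split of real training images, pad batches with GAN-synthesized X-rays, and fine-tune a shallow pretrained classifier. The PyTorch sketch below illustrates that recipe under stated assumptions; it is not the authors' code, and the generator object, its 100-dimensional noise input, and the label-preserving assumption are all hypothetical.

```python
import torch
import torch.nn as nn
from torchvision import models

# Shallow pretrained backbone, re-headed for two classes
# (Normal vs. Pneumonia); ResNet18 mirrors the paper's preference
# for architectures with few layers.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(real_images, labels, generator=None):
    # One optimization step over a small batch of real X-rays,
    # optionally padded with GAN-synthesized images. `generator`
    # is a hypothetical stand-in assumed to map 100-dim noise to
    # class-consistent images of the same shape as real_images.
    if generator is not None:
        with torch.no_grad():
            fake = generator(torch.randn(real_images.size(0), 100))
        real_images = torch.cat([real_images, fake])
        labels = torch.cat([labels, labels])  # assumes label-preserving synthesis
    optimizer.zero_grad()
    loss = criterion(model(real_images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

When the real split is this small, freezing the early backbone layers or lowering the backbone learning rate would be a common variation on the same fine-tuning idea.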
Note: The Chinese 摘要 texts are machine translations!