Contents
2. Physically-Constrained Transfer Learning through Shared Abundance Space for Hyperspectral Image Classification [PDF] Abstract
5. Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification [PDF] Abstract
11. MineNav: An Expandable Synthetic Dataset Based on Minecraft for Aircraft Visual Navigation [PDF] Abstract
12. SegCodeNet: Color-Coded Segmentation Masks for Activity Detection from Wearable Cameras [PDF] Abstract
13. Deep Neural Networks for automatic extraction of features in time series satellite images [PDF] Abstract
16. Gradually Applying Weakly Supervised and Active Learning for Mass Detection in Breast Ultrasound Images [PDF] Abstract
19. Virtual Adversarial Training in Feature Space to Improve Unsupervised Video Domain Adaptation [PDF] Abstract
24. FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration [PDF] Abstract
32. Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos [PDF] Abstract
35. Weakly Supervised Learning with Region and Box-level Annotations for Salient Instance Segmentation [PDF] Abstract
38. DeepHandMesh: A Weakly-supervised Deep Encoder-Decoder Framework for High-fidelity Hand Mesh Modeling [PDF] Abstract
39. PC-U Net: Learning to Jointly Reconstruct and Segment the Cardiac Walls in 3D from CT Data [PDF] Abstract
47. DeepLiDARFlow: A Deep Learning Architecture For Scene Flow Estimation Using Monocular Camera and Sparse LiDAR [PDF] Abstract
49. Slide-free MUSE Microscopy to H&E Histology Modality Conversion via Unpaired Image-to-Image Translation GAN Models [PDF] Abstract
51. "Name that manufacturer". Relating image acquisition bias with task complexity when training deep learning models: experiments on head CT [PDF] Abstract
52. Correcting Data Imbalance for Semi-Supervised Covid-19 Detection Using X-ray Chest Images [PDF] Abstract
53. Unsupervised Cross-domain Image Classification by Distance Metric Guided Feature Alignment [PDF] Abstract
55. Addressing Neural Network Robustness with Mixup and Targeted Labeling Adversarial Training [PDF] Abstract
57. Spatio-temporal relationships between rainfall and convective clouds during Indian Monsoon through a discrete lens [PDF] Abstract
Abstracts
1. Every Pixel Matters: Center-aware Feature Alignment for Domain Adaptive Object Detector [PDF] Back to contents
Cheng-Chun Hsu, Yi-Hsuan Tsai, Yen-Yu Lin, Ming-Hsuan Yang
Abstract: A domain adaptive object detector aims to adapt itself to unseen domains that may contain variations of object appearance, viewpoints or backgrounds. Most existing methods adopt feature alignment either on the image level or instance level. However, image-level alignment on global features may tangle foreground/background pixels at the same time, while instance-level alignment using proposals may suffer from the background noise. Different from existing solutions, we propose a domain adaptation framework that accounts for each pixel via predicting pixel-wise objectness and centerness. Specifically, the proposed method carries out center-aware alignment by paying more attention to foreground pixels, hence achieving better adaptation across domains. We demonstrate our method on numerous adaptation settings with extensive experimental results and show favorable performance against existing state-of-the-art algorithms.
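The abstract does not define centerness. As a rough illustration only, the sketch below assumes the FCOS-style definition, in which a pixel's centerness is computed from its distances to the four sides of its box, and multiplies it by an objectness score to form a per-pixel alignment weight. The function names and the product weighting are assumptions made for this example, not the authors' implementation.

```python
import math


def centerness(left, top, right, bottom):
    """FCOS-style centerness for a pixel inside a box.

    Arguments are the pixel's distances to the four box sides (all > 0).
    Returns 1.0 at the box center and approaches 0 near the border.
    """
    lr = min(left, right) / max(left, right)
    tb = min(top, bottom) / max(top, bottom)
    return math.sqrt(lr * tb)


def alignment_weight(objectness, left, top, right, bottom):
    """Hypothetical per-pixel weight: objectness score times centerness."""
    return objectness * centerness(left, top, right, bottom)


# A pixel at the box center versus one close to the left border.
center = alignment_weight(0.9, 5, 5, 5, 5)
near_edge = alignment_weight(0.9, 1, 5, 9, 5)
print(center, near_edge)
```

Under this assumed weighting, pixels near an object's center contribute more to the alignment objective than pixels near its border or in the background.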
2. Physically-Constrained Transfer Learning through Shared Abundance Space for Hyperspectral Image Classification [PDF] Back to contents
Ying Qu, Razieh Kaviani Baghbaderani, Wei Li, Lianru Gao, Hairong Qi
Abstract: Hyperspectral image (HSI) classification is one of the most active research topics and has achieved promising results boosted by the recent development of deep learning. However, most state-of-the-art approaches tend to perform poorly when the training and testing images are on different domains, e.g., source domain and target domain, respectively, due to the spectral variability caused by different acquisition conditions. Transfer learning-based methods address this problem by pre-training in the source domain and fine-tuning on the target domain. Nonetheless, a considerable amount of data on the target domain has to be labeled and non-negligible computational resources are required to retrain the whole network. In this paper, we propose a new transfer learning scheme to bridge the gap between the source and target domains by projecting the HSI data from the source and target domains into a shared abundance space based on their own physical characteristics. In this way, the domain discrepancy would be largely reduced such that the model trained on the source domain could be applied on the target domain without extra efforts for data labeling or network retraining. The proposed method is referred to as physically-constrained transfer learning through shared abundance space (PCTL-SAS). Extensive experimental results demonstrate the superiority of the proposed method as compared to the state-of-the-art. The success of this endeavor would largely facilitate the deployment of HSI classification for real-world sensing scenarios.
3. No-reference Quality Assessment with Unsupervised Domain Adaptation [PDF] Back to contents
Baoliang Chen, Haoliang Li, Hongfei Fan, Shiqi Wang
Abstract: Quality assessment driven by machine learning generally relies on the strong assumption that the training and testing data share very close scene statistics, and so lacks the capacity to adapt to content in other domains. In this paper, we explore the capability of transferring the quality of natural scenes to images that are not acquired by optical cameras (e.g., screen content images, SCIs), rooted in the widely accepted view that the human visual system has adapted and evolved through the perception of the natural environment. Here we develop the first unsupervised domain adaptation-based no-reference quality assessment method, under the reasonable assumption that there are abundant subjective ratings of natural images (NIs). In general, it is a non-trivial task to directly transfer the quality prediction model from NIs to a new type of content (i.e., SCIs) that holds dramatically different statistical characteristics. Inspired by the transferability of pair-wise relationships, the proposed quality measure operates based on the philosophy of learning to rank. To reduce the domain gap, we introduce two complementary losses which explicitly regularize the feature space of ranking in a progressive manner. For feature discrepancy minimization, the maximum mean discrepancy (MMD) is imposed on the extracted ranking features of NIs and SCIs. For feature discriminatory capability enhancement, we propose a center-based loss to rectify the classifier and improve its prediction capability not only for the source domain (NI) but also for the target domain (SCI). Experiments show that our method can achieve higher performance on different source-target settings based on a light-weight convolutional neural network. The proposed method also sheds light on learning quality assessment measures for unseen application-specific content without cumbersome and costly subjective evaluations.
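For readers unfamiliar with MMD, the following is a minimal sketch of the standard biased estimate of squared MMD between two feature sets. It assumes an RBF kernel with a fixed bandwidth and uses made-up toy feature vectors; the abstract does not say which kernel or estimator the authors use, so none of these choices should be read as theirs.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two lists of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy ranking features (illustrative values only).
ni_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
sci_features = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

print(mmd_squared(ni_features, ni_features))   # identical sets
print(mmd_squared(ni_features, sci_features))  # well-separated sets
```

The estimate is zero when the two sets coincide and grows as the feature distributions move apart, which is why minimizing it as a loss pulls the two domains together in feature space.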
4. STAR: Sparse Trained Articulated Human Body Regressor [PDF] Back to contents
Ahmed A. A. Osman, Timo Bolkart, Michael J. Black
Abstract: The SMPL body model is widely used for the estimation, synthesis, and analysis of 3D human pose and shape. While popular, we show that SMPL has several limitations and introduce STAR, which is quantitatively and qualitatively superior to SMPL. First, SMPL has a huge number of parameters resulting from its use of global blend shapes. These dense pose-corrective offsets relate every vertex on the mesh to all the joints in the kinematic tree, capturing spurious long-range correlations. To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. When trained on the same data as SMPL, STAR generalizes better despite having many fewer parameters. Second, SMPL factors pose-dependent deformations from body shape while, in reality, people with different shapes deform differently. Consequently, we learn shape-dependent pose-corrective blend shapes that depend on both body pose and BMI. Third, we show that the shape space of SMPL is not rich enough to capture the variation in the human population. We address this by training STAR with an additional 10,000 scans of male and female subjects, and show that this results in better model generalization. STAR is compact, generalizes better to new bodies and is a drop-in replacement for SMPL. STAR is publicly available for research purposes at this http URL.
5. Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification [PDF] Back to contents
Boqiang Xu, Lingxiao He, Xingyu Liao, Wu Liu, Zhenan Sun, Tao Mei
Abstract: Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have achieved great success, most of them extract features in terms of the attributes of clothing (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems in low light illumination, in which cases the attributes of the clothing are severely missing. We call this problem the Black Re-ID problem. To solve this problem, rather than relying on the clothing information, we propose to exploit head-shoulder features to assist person Re-ID. The head-shoulder adaptive attention network (HAA) is proposed to learn the head-shoulder feature and an innovative ensemble method is designed to enhance the generalization of our model. Given the input person image, the ensemble method would focus on the head-shoulder feature by assigning a larger weight if the individual inside the image is in black clothing. Due to the lack of a suitable benchmark dataset for studying the Black Re-ID problem, we also contribute the first Black-reID dataset, which contains 1274 identities in the training set. Extensive evaluations on the Black-reID, Market1501 and DukeMTMC-reID datasets show that our model achieves the best result compared with the state-of-the-art Re-ID methods on both Black and conventional Re-ID problems. Furthermore, our method is also proved to be effective in dealing with person Re-ID in similar clothing. Our code and dataset are available on this https URL.
6. Scene Text Detection with Selected Anchor [PDF] Back to contents
Anna Zhu, Hang Du, Shengwu Xiong
Abstract: Object proposal techniques with dense anchoring schemes have frequently been applied to scene text detection to achieve high recall. This significantly improves accuracy but wastes computation on searching, regression and classification. In this paper, we propose an anchor selection-based region proposal network (AS-RPN) that uses effectively selected anchors instead of dense anchors to extract text proposals. The centers, scales, aspect ratios and orientations of anchors are learnable instead of fixed, which leads to high recall and a greatly reduced number of anchors. By replacing the anchor-based RPN in Faster RCNN, the AS-RPN-based Faster RCNN can achieve performance comparable to previous state-of-the-art text detection approaches on standard benchmarks, including COCO-Text, ICDAR2013, ICDAR2015 and MSRA-TD500, when using only single-scale, single-model (ResNet50) testing.
7. Learning Trailer Moments in Full-Length Movies [PDF] Back to contents
Lezi Wang, Dong Liu, Rohit Puri, Dimitris N. Metaxas
Abstract: A movie's key moments stand out of the screenplay to grab an audience's attention and make movie browsing efficient. But a lack of annotations makes the existing approaches not applicable to movie key moment detection. To get rid of human annotations, we leverage the officially-released trailers as the weak supervision to learn a model that can detect the key moments from full-length movies. We introduce a novel ranking network that utilizes the Co-Attention between movies and trailers as guidance to generate the training pairs, where the moments highly correlated with trailers are expected to be scored higher than the uncorrelated moments. Additionally, we propose a Contrastive Attention module to enhance the feature representations such that the comparative contrast between features of the key and non-key moments is maximized. We construct the first movie-trailer dataset, and the proposed Co-Attention assisted ranking network shows superior performance even over the supervised approach. The effectiveness of our Contrastive Attention module is also demonstrated by the performance improvement over the state-of-the-art on the public benchmarks.
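The requirement that trailer-correlated moments score higher than uncorrelated ones is commonly expressed as a pairwise margin (hinge) ranking loss. The sketch below shows that generic loss, assuming a margin of 1.0 and made-up scores; the abstract does not state the exact loss the authors use, so this is an illustration of the idea rather than their formulation.

```python
def pairwise_ranking_loss(score_pos, score_neg, margin=1.0):
    """Hinge ranking loss for one training pair.

    score_pos: model score of the moment expected to rank higher.
    score_neg: model score of the moment expected to rank lower.
    The loss is zero once score_pos exceeds score_neg by at least `margin`.
    """
    return max(0.0, margin - (score_pos - score_neg))


# Correctly ordered pair with a comfortable gap: no penalty.
well_ranked = pairwise_ranking_loss(2.0, 0.5)
# Wrongly ordered pair: penalized by the margin plus the score gap.
mis_ranked = pairwise_ranking_loss(0.3, 0.6)
print(well_ranked, mis_ranked)
```

In this setting, the training pairs would come from the Co-Attention guidance described in the abstract, with the more trailer-like moment of each pair taking the role of `score_pos`.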
8. Human Body Model Fitting by Learned Gradient Descent [PDF] Back to contents
Jie Song, Xu Chen, Otmar Hilliges
Abstract: We propose a novel algorithm for the fitting of 3D human shape to images. Combining the accuracy and refinement capabilities of iterative gradient-based optimization techniques with the robustness of deep neural networks, we propose a gradient descent algorithm that leverages a neural network to predict the parameter update rule for each iteration. This per-parameter and state-aware update guides the optimizer towards a good solution, typically converging in very few steps. During training our approach only requires MoCap data of human poses, parametrized via SMPL. From this data the network learns a subspace of valid poses and shapes in which optimization is performed much more efficiently. The approach does not require any hard-to-acquire image-to-3D correspondences. At test time we only optimize the 2D joint re-projection error without the need for any further priors or regularization terms. We show empirically that this algorithm is fast (avg. 120ms convergence), robust to initialization and dataset, and achieves state-of-the-art results on public evaluation datasets including the challenging 3DPW in-the-wild benchmark (a 45% improvement over SMPLify, and also an improvement over approaches that use image-to-3D correspondences).
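The loop structure implied by the abstract can be sketched as follows: at each iteration the gradient and the current state are passed to an update function, and the returned update is added to the parameters. This toy uses a two-parameter quadratic objective and a hand-written stand-in for the trained network, purely to show where a learned update would plug in. All names (`fit`, `stand_in_learned_update`, `TARGET`) are hypothetical and nothing here reproduces the authors' model.

```python
def fit(theta0, grad_fn, update_fn, steps):
    """Iteratively refine parameters using a pluggable update rule."""
    theta = list(theta0)
    for _ in range(steps):
        grad = grad_fn(theta)
        # The update rule sees both the gradient and the current state.
        delta = update_fn(grad, theta)
        theta = [t + d for t, d in zip(theta, delta)]
    return theta


TARGET = [1.0, -2.0]  # toy optimum, illustrative only


def grad_fn(theta):
    """Gradient of the toy objective 0.5 * sum((theta - TARGET) ** 2)."""
    return [t - g for t, g in zip(theta, TARGET)]


def plain_update(grad, theta, lr=0.1):
    """Standard gradient descent step with a fixed learning rate."""
    return [-lr * g for g in grad]


def stand_in_learned_update(grad, theta):
    """Hand-written stand-in for a trained network's predicted update.

    For this quadratic it simply returns the exact step to the optimum.
    """
    return [-g for g in grad]


slow = fit([0.0, 0.0], grad_fn, plain_update, steps=3)
fast = fit([0.0, 0.0], grad_fn, stand_in_learned_update, steps=3)
print(slow, fast)
```

With the same step budget, the fixed-rate rule is still far from the optimum while the better-informed update has already reached it, which is the kind of gap a learned update rule is meant to close.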
9. Cross-Domain Identification for Thermal-to-Visible Face Recognition [PDF] Back to contents
Cedric Nimpa Fondje, Shuowen Hu, Nathaniel J. Short, Benjamin S. Riggan
Abstract: Recent advances in domain adaptation, especially those applied to heterogeneous facial recognition, typically rely upon restrictive Euclidean loss functions (e.g., $L_2$ norm) which perform best when images from two different domains (e.g., visible and thermal) are co-registered and temporally synchronized. This paper proposes a novel domain adaptation framework that combines a new feature mapping sub-network with existing deep feature models, which are based on modified network architectures (e.g., VGG16 or Resnet50). This framework is optimized by introducing new cross-domain identity and domain invariance loss functions for thermal-to-visible face recognition, which alleviates the requirement for precisely co-registered and synchronized imagery. We provide extensive analysis of both features and loss functions used, and compare the proposed domain adaptation framework with state-of-the-art feature based domain adaptation models on a difficult dataset containing facial imagery collected at varying ranges, poses, and expressions. Moreover, we analyze the viability of the proposed framework for more challenging tasks, such as non-frontal thermal-to-visible face recognition.
10. CosyPose: Consistent multi-view multi-object 6D pose estimation
Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic
Abstract: We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage this https URL.
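The reprojection error that the CosyPose abstract says is minimized during scene refinement can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain pinhole camera with hypothetical intrinsics `K`, a single object pose `(R, t)`, and known 2D observations, and it ignores object symmetries and multiple views.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project Nx3 object-frame points to Nx2 pixels with a pinhole camera."""
    cam = points_3d @ R.T + t          # object frame -> camera frame
    uv = cam @ K.T                     # apply camera intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def reprojection_error(points_3d, observed_2d, R, t, K):
    """Mean Euclidean pixel distance between projected and observed points."""
    diff = project(points_3d, R, t, K) - observed_2d
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Hypothetical intrinsics and pose, chosen only for illustration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])

obs = project(pts, R, t, K)                       # observations from the true pose
err_true = reprojection_error(pts, obs, R, t, K)  # zero at the true pose
err_shifted = reprojection_error(pts, obs, R, t + np.array([0.05, 0.0, 0.0]), K)
```

A bundle-adjustment style refinement would search over the camera and object poses to drive this residual down across all views; here the error is only evaluated at two candidate poses.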
11. MineNav: An Expandable Synthetic Dataset Based on Minecraft for Aircraft Visual Navigation
Dali Wang
Abstract: We propose a simple method to generate high-quality synthetic datasets based on the open-source game Minecraft, including rendered images, depth maps, surface normal maps, and 6-DoF camera trajectories. The dataset has perfect ground truth generated by a plug-in program. Thanks to the game's large community, an extremely large number of 3D open-world environments is available: users can find suitable scenes for shooting and build datasets from them, and they can also build scenes in-game. As such, we do not need to worry about overfitting caused by datasets that are too small. Moreover, there is also a shader community whose work we can use to minimize the data bias between rendered and real images. Finally, we provide three tools to generate data for depth prediction, surface normal prediction, and visual odometry; users can also develop plug-in modules for other vision tasks such as segmentation or optical flow prediction.
12. SegCodeNet: Color-Coded Segmentation Masks for Activity Detection from Wearable Cameras
Asif Shahriyar Sushmit, Partho Ghosh, Md.Abrar Istiak, Nayeeb Rashid, Ahsan Habib Akash, Taufiq Hasan
Abstract: Activity detection from first-person videos (FPV) captured using a wearable camera is an active research field with potential applications in many sectors, including healthcare, law enforcement, and rehabilitation. State-of-the-art methods use optical flow-based hybrid techniques that rely on features derived from the motion of objects across consecutive frames. In this work, we developed a two-stream network, SegCodeNet, that uses a network branch containing video streams with color-coded semantic segmentation masks of relevant objects in addition to the original RGB video stream. We also include a stream-wise attention gate that prioritizes between the two streams and a frame-wise attention module that prioritizes the video frames that contain relevant features. Experiments are conducted on an FPV dataset containing $18$ activity classes in office environments. In comparison to a single-stream network, the proposed two-stream method achieves absolute improvements of $14.366\%$ and $10.324\%$ in F1 score and accuracy, respectively, when results are averaged over three frame sizes: $224\times224$, $112\times112$, and $64\times64$. The proposed method provides significant performance gains for lower-resolution images, with absolute improvements of $17\%$ and $26\%$ in F1 score for input dimensions of $112\times112$ and $64\times64$, respectively. The best performance is achieved for a frame size of $224\times224$, yielding an F1 score and accuracy of $90.176\%$ and $90.799\%$, which outperform the state-of-the-art Inflated 3D ConvNet (I3D) method (Carreira and Zisserman, 2017) by absolute margins of $4.529\%$ and $2.419\%$, respectively.
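The color-coded segmentation masks mentioned in the SegCodeNet abstract can be pictured as a lookup from class ids to RGB colours. The sketch below is only an illustration under assumptions: the palette, the number of classes, and the function name `color_code` are hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical palette: row i is the RGB colour for class id i (0 = background).
PALETTE = np.array([
    [0, 0, 0],
    [255, 0, 0],
    [0, 255, 0],
    [0, 0, 255],
], dtype=np.uint8)

def color_code(mask):
    """Turn an HxW array of integer class ids into an HxWx3 RGB image."""
    return PALETTE[mask]   # fancy indexing looks up one colour per pixel

# A tiny 2x2 segmentation mask with one pixel per class.
mask = np.array([[0, 1],
                 [2, 3]])
coded = color_code(mask)
```

Frames coded this way could then be fed to a second network branch alongside the original RGB frames, which is the two-stream arrangement the abstract describes.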
13. Deep Neural Networks for automatic extraction of features in time series satellite images
Gael Kamdem De Teyou, Yuliya Tarabalka, Isabelle Manighetti, Rafael Almar, Sebastien Tripod
Abstract: Many Earth observation programs such as Landsat, Sentinel, SPOT, and Pleiades produce huge volumes of medium- to high-resolution multispectral images every day, which can be organized in time series. In this work, we exploit both the temporal and the spatial information provided by these images to generate land cover maps. For this purpose, we combine a fully convolutional neural network with a convolutional long short-term memory. Implementation details of the proposed spatio-temporal neural network architecture are provided. Experimental results show that the temporal information provided by time series images increases the accuracy of land cover classification, thus producing up-to-date maps that can help in identifying changes on Earth.
14. AutoSimulate: (Quickly) Learning Synthetic Data Generation
Harkirat Singh Behl, Atılım Güneş Baydin, Ran Gal, Philip H.S. Torr, Vibhav Vineet
Abstract: Simulation is increasingly being used for generating large labelled datasets in many machine learning problems. Recent methods have focused on adjusting simulator parameters with the goal of maximising accuracy on a validation task, usually relying on REINFORCE-like gradient estimators. However, these approaches are very expensive, as they treat the entire data generation, model training, and validation pipeline as a black box and require multiple costly objective evaluations at each iteration. We propose an efficient alternative for optimal synthetic data generation, based on a novel differentiable approximation of the objective. This allows us to optimize the simulator, which may be non-differentiable, requiring only one objective evaluation at each iteration with little overhead. We demonstrate on a state-of-the-art photorealistic renderer that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
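To see why the REINFORCE-like estimators mentioned in the AutoSimulate abstract need many objective evaluations, a toy score-function estimator may help. This sketches the baseline family the abstract refers to, not the differentiable approximation the paper proposes. The Gaussian "simulator parameter" `mu`, the quadratic objective, and the function name are assumptions made for illustration.

```python
import numpy as np

def score_function_gradient(objective, mu, sigma, num_samples, rng):
    """Estimate d/dmu E[objective(x)] for x ~ N(mu, sigma^2).

    Uses the identity grad = E[objective(x) * d/dmu log p(x)], where
    d/dmu log p(x) = (x - mu) / sigma^2. The objective is treated as a
    black box and is evaluated once per sample.
    """
    x = rng.normal(mu, sigma, size=num_samples)
    score = (x - mu) / sigma**2
    return float(np.mean(objective(x) * score))

rng = np.random.default_rng(0)
# Toy objective f(x) = x^2, so E[f] = mu^2 + sigma^2 and the true gradient is 2 * mu.
grad = score_function_gradient(lambda x: x**2, mu=1.0, sigma=1.0,
                               num_samples=200_000, rng=rng)
```

Even for this one-dimensional toy problem, a large number of samples is needed for a stable estimate. When each sample means generating data, training a model, and validating it, that cost is what motivates requiring only one objective evaluation per iteration.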
15. Anchor-free Small-scale Multispectral Pedestrian Detection
Alexander Wolpert, Michael Teutsch, M. Saquib Sarfraz, Rainer Stiefelhagen
Abstract: Multispectral images consisting of aligned visual-optical (VIS) and thermal infrared (IR) image pairs are well-suited for practical applications like autonomous driving or visual surveillance. Such data can be used to increase the performance of pedestrian detection, especially for weakly illuminated, small-scale, or partially occluded instances. The current state of the art is based on variants of Faster R-CNN and thus passes through two stages: a proposal generator network with handcrafted anchor boxes for object localization and a classification network for verifying the object category. In this paper, we propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture. We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions. In this way, we can both simplify the network architecture and achieve higher detection performance, especially for pedestrians under occlusion or at low object resolution. In addition, we provide a study on well-suited multispectral data augmentation techniques that improve on the commonly used augmentations. The results show our method's effectiveness in detecting small-scale pedestrians. We achieve a 5.68% log-average miss rate, compared with 7.49% for the best current state-of-the-art method (a 25% improvement), on the challenging KAIST Multispectral Pedestrian Detection Benchmark. Code: this https URL
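The log-average miss rate quoted in the pedestrian detection abstract is commonly computed as the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) values spaced evenly in log space between 0.01 and 1. The sketch below follows that common convention as an assumption; the benchmark's official evaluation code may differ in details, and the function name is hypothetical. It also checks the relative improvement implied by the two reported numbers.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    Assumes `fppi` is sorted in ascending order and `miss_rate` is aligned with it.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        idx = np.where(fppi <= ref)[0]
        # If the curve never reaches this FPPI, count it as a miss rate of 1.
        sampled.append(miss_rate[idx[-1]] if idx.size else 1.0)
    sampled = np.maximum(sampled, 1e-10)   # guard against log(0)
    return float(np.exp(np.mean(np.log(sampled))))

# A flat curve with miss rate 0.2 everywhere has log-average miss rate 0.2.
flat = log_average_miss_rate([0.001, 0.1, 1.0], [0.2, 0.2, 0.2])

# Relative reduction from the reported 7.49% to 5.68% miss rate.
relative_improvement = (7.49 - 5.68) / 7.49   # about 0.24, reported as 25% in the abstract
```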
16. Graudally Applying Weakly Supervised and Active Learning for Mass Detection in Breast Ultrasound Images
JooYeol Yun, JungWoo Oh, IlDong Yun
Abstract: We propose a method for effectively utilizing weakly annotated image data in an object detection task on breast ultrasound images. Given a problem setting where a small, strongly annotated dataset and a large, weakly annotated dataset with no bounding box information are available, training an object detection model becomes a non-trivial problem. We suggest a controlled weight for handling the effect of weakly annotated images in a two-stage object detection model. We also present a subsequent active learning scheme for safely assigning weakly annotated images a strong annotation using the trained model. Experimental results showed a 24 percentage point increase in the correct localization (CorLoc) measure, which is the ratio of correctly localized and classified images, obtained by assigning the properly controlled weight. Performing active learning after a model is trained showed an additional increase in CorLoc. We tested the proposed method on the Stanford Dog dataset to verify that it can be applied to general cases where strong annotations alone are insufficient to obtain comparable results. The presented method showed that higher performance is achievable with less annotation effort.
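The CorLoc measure is described in the abstract as the ratio of correctly localized and classified images. A minimal sketch of one way to compute it follows. It assumes one predicted box and one ground-truth box per image and an IoU threshold of 0.5, which is a common choice but is not stated in the abstract; the class names in the example are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of images whose top prediction has the right class and enough overlap."""
    hits = 0
    for (pred_box, pred_label), (gt_box, gt_label) in zip(predictions, ground_truths):
        if pred_label == gt_label and iou(pred_box, gt_box) >= iou_threshold:
            hits += 1
    return hits / len(ground_truths)

# Three images: one correct, one badly localized, one misclassified.
preds = [((10, 10, 50, 50), "malignant"),
         ((0, 0, 20, 20), "benign"),
         ((30, 30, 60, 60), "benign")]
gts = [((12, 12, 50, 50), "malignant"),
       ((40, 40, 80, 80), "benign"),
       ((30, 30, 60, 60), "malignant")]
score = corloc(preds, gts)
```

In this toy example only the first image counts as a hit, so the score is one third.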
17. Instance-Aware Graph Convolutional Network for Multi-Label Classification
Yun Wang, Tong Zhang, Zhen Cui, Chunyan Xu, Jian Yang
Abstract: Graph convolutional neural networks (GCNs) have effectively boosted the multi-label image recognition task by introducing label dependencies based on the statistical label co-occurrence of data. However, in previous methods, label correlation is computed based on statistical information of data and is therefore the same for all samples, which makes graph inference on labels insufficient to handle the huge variations among numerous image instances. In this paper, we propose an instance-aware graph convolutional neural network (IA-GCN) framework for multi-label classification. As a whole, two fused branches of sub-networks are involved in the framework: a global branch modeling the whole image and a region-based branch exploring dependencies among regions of interest (ROIs). For label diffusion of instance-awareness in graph convolution, rather than using the statistical label correlation alone, an image-dependent label correlation matrix (LCM), fusing both the statistical LCM and an individual one for each image instance, is constructed for graph inference on labels to inject adaptive information of label-awareness into the learned features of the model. Specifically, the individual LCM of each image is obtained by mining the label dependencies based on the label scores of the detected ROIs. In this process, considering the contribution differences of ROIs to multi-label classification, variational inference is introduced to learn adaptive scaling factors for those ROIs by considering their complex distribution. Finally, extensive experiments on the MS-COCO and VOC datasets show that our proposed approach outperforms existing state-of-the-art methods.
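The statistical label correlation matrix (LCM) mentioned in the IA-GCN abstract is typically built from label co-occurrence counts over the training set. The sketch below shows one common construction, the conditional probability of label j given label i, together with a purely hypothetical fusion with a per-image matrix. The convex combination in `fuse_lcm` and its `alpha` weight are assumptions for illustration and are not the paper's fusion rule.

```python
import numpy as np

def statistical_lcm(labels):
    """labels: N x C binary matrix. Entry (i, j) of the result is P(label j | label i)."""
    labels = np.asarray(labels, dtype=float)
    cooc = labels.T @ labels          # cooc[i, j] = images containing both i and j
    counts = np.diag(cooc).copy()     # counts[i] = images containing label i
    counts[counts == 0] = 1.0         # avoid division by zero for unseen labels
    return cooc / counts[:, None]

def fuse_lcm(stat_lcm, image_lcm, alpha=0.5):
    """Hypothetical fusion: convex combination of dataset-level and per-image matrices."""
    return alpha * stat_lcm + (1.0 - alpha) * image_lcm

# Four images annotated with three labels.
labels = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [0, 1, 1],
                   [1, 1, 0]])
stat = statistical_lcm(labels)
fused = fuse_lcm(stat, np.eye(3), alpha=0.5)   # identity stands in for a per-image LCM
```

The statistical matrix is identical for every image, which is the limitation the abstract points out; the per-image term is what would make the fused matrix instance-aware.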
18. Robust RGB-based 6-DoF Pose Estimation without Real Pose Annotations
Zhigang Li, Yinlin Hu, Mathieu Salzmann, Xiangyang Ji
Abstract: While much progress has been made in 6-DoF object pose estimation from a single RGB image, the current leading approaches rely heavily on real-annotation data. As such, they remain sensitive to severe occlusions, because covering all possible occlusions with annotated data is intractable. In this paper, we introduce an approach to robustly and accurately estimate the 6-DoF pose in challenging conditions and without using any real pose annotations. To this end, we leverage the intuition that the poses predicted by a network from an image and from its counterpart synthetically altered to mimic occlusion should be consistent, and translate this into a self-supervised loss function. Our experiments on the LINEMOD, Occluded-LINEMOD, YCB, and new Randomization LINEMOD datasets demonstrate the robustness of our approach. We achieve state-of-the-art performance on LINEMOD and Occluded-LINEMOD in the setting without real pose annotations, even outperforming methods that rely on real annotations during training on Occluded-LINEMOD.
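The consistency idea in the abstract above can be illustrated with a toy sketch: predict a pose from an image and from a synthetically occluded copy, then penalize the difference. This is only an illustration under stated assumptions, not the authors' implementation; `occlude`, `consistency_loss`, and `toy_pose_net` are hypothetical names, and the "pose" here is a plain vector rather than a real 6-DoF parameterization.

```python
def occlude(image, top, left, height, width, fill=0.0):
    """Return a copy of `image` (a list of rows) with a rectangle set to `fill`."""
    out = [row[:] for row in image]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = fill
    return out


def consistency_loss(pose_a, pose_b):
    """Mean absolute difference between two pose vectors of equal length."""
    if len(pose_a) != len(pose_b):
        raise ValueError("pose vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(pose_a, pose_b)) / len(pose_a)


def toy_pose_net(image):
    """Hypothetical stand-in for a pose network: maps an image to a 2-D vector."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return [total / count, image[0][0]]


# Self-supervised signal: no pose label is needed, only agreement between views.
image = [[1.0] * 4 for _ in range(4)]
occluded = occlude(image, top=1, left=1, height=2, width=2)
loss = consistency_loss(toy_pose_net(image), toy_pose_net(occluded))
```

In a real system the loss would be back-propagated through the network for both views; the toy network above has no parameters and only shows how the two predictions are compared.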
19. Virtual Adversarial Training in Feature Space to Improve Unsupervised Video Domain Adaptation
Artjoms Gorpincenko, Geoffrey French, Michal Mackiewicz
Abstract: Virtual Adversarial Training has recently seen a lot of success in semi-supervised learning, as well as unsupervised Domain Adaptation. However, so far it has been used on input samples in the pixel space, whereas we propose to apply it directly to feature vectors. We also discuss the unstable behaviour of entropy minimization and Decision-Boundary Iterative Refinement Training With a Teacher in Domain Adaptation, and suggest substitutes that achieve similar behaviour. By adding the aforementioned techniques to the state-of-the-art model TA$^3$N, we either maintain competitive results or outperform prior art in multiple unsupervised video Domain Adaptation tasks.
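As a rough illustration of applying Virtual Adversarial Training to feature vectors rather than pixels, the sketch below computes the VAT perturbation for a linear softmax classifier acting on a feature vector. It is a minimal sketch, not the paper's code: the linear classifier, the helper names, and the values of `eps` and `xi` are assumptions, and the gradient is written out analytically only because the classifier is linear.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, f):
    """Linear classifier: one row of W per class."""
    return [sum(w * x for w, x in zip(row, f)) for row in W]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else v


def vat_perturbation(W, f, eps=1.0, xi=1e-2, iters=1, rng=None):
    """Approximate the feature-space direction that changes the prediction most."""
    rng = rng or random.Random(0)
    p = softmax(logits(W, f))
    d = normalize([rng.gauss(0, 1) for _ in f])
    for _ in range(iters):
        probe = [x + xi * di for x, di in zip(f, d)]
        q = softmax(logits(W, probe))
        # For a linear classifier, the gradient of KL(p || q) with respect to
        # the perturbation is W^T (q - p).
        grad = [sum(W[k][j] * (q[k] - p[k]) for k in range(len(W)))
                for j in range(len(f))]
        d = normalize(grad)
    return [eps * di for di in d]


def vat_loss(W, f, **kwargs):
    """KL divergence between predictions on f and on the perturbed feature."""
    r = vat_perturbation(W, f, **kwargs)
    p = softmax(logits(W, f))
    q = softmax(logits(W, [x + ri for x, ri in zip(f, r)]))
    return kl(p, q)
```

With a deep network the gradient would come from automatic differentiation instead of the closed form used here; the power-iteration structure stays the same.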
20. Query Twice: Dual Mixture Attention Meta Learning for Video Summarization
Junyan Wang, Yang Bai, Yang Long, Bingzhang Hu, Zhenhua Chai, Yu Guan, Xiaolin Wei
Abstract: Video summarization aims to select representative frames that retain high-level information, and is usually solved by predicting segment-wise importance scores via a softmax function. However, the softmax function struggles to retain high-rank representations for complex visual or sequential information, which is known as the Softmax Bottleneck problem. In this paper, we propose a novel framework named Dual Mixture Attention (DMASum) model with Meta Learning for video summarization that tackles the softmax bottleneck problem. The Mixture of Attention layer (MoA) effectively increases the model capacity by applying self-query attention twice, which can capture second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is then introduced to achieve better generalization to small datasets with limited training sources. Furthermore, DMASum exploits both visual and sequential attention, connecting local key-frame and global attention in an accumulative way. We adopt the new evaluation protocol on two public datasets, SumMe and TVSum. Both qualitative and quantitative experiments show significant improvements over state-of-the-art methods.
21. Deep Volumetric Ambient Occlusion
Dominik Engel, Timo Ropinski
Abstract: We present a novel deep-learning-based technique for volumetric ambient occlusion in the context of direct volume rendering. Our proposed Deep Volumetric Ambient Occlusion (DVAO) approach can predict per-voxel ambient occlusion in volumetric data sets, while considering global information provided through the transfer function. The proposed neural network only needs to be executed upon change of this global information, and thus supports real-time volume interaction. Accordingly, we demonstrate DVAO's ability to predict volumetric ambient occlusion, such that it can be applied interactively within direct volume rendering. To achieve the best possible results, we propose and analyze a variety of transfer function representations and injection strategies for deep neural networks. Based on the obtained results, we also give recommendations applicable in similar volume learning scenarios. Lastly, we show that DVAO generalizes to a variety of modalities, despite being trained on computed tomography data only.
22. Towards Class-incremental Object Detection with Nearest Mean of Exemplars
Sheng Ren, Yan He, Neal N. Xiong, Kehua Guo
Abstract: Object detection has been widely used in Internet applications, and deep learning plays a very important role in it. However, existing object detection methods need to be trained in a static setting, which requires obtaining all the data at one time and does not support class-incremental training. In this paper, an object detection framework named class-incremental object detection (CIOD) is proposed. CIOD divides object detection into two stages. First, the traditional OpenCV cascade classifier is improved in the object candidate box generation stage to meet the needs of class-incremental learning. Second, we use the concept of prototype vectors on top of deep learning to train a class-incremental classifier that identifies the generated object candidate boxes, so as to extract the real object boxes. A large number of experiments verify that CIOD can detect objects in a class-incremental way while keeping training time and memory usage under control.
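The prototype-vector idea named in the title (nearest mean of exemplars) can be sketched as follows: each class is represented by the mean of a few exemplar feature vectors, and a candidate box is assigned to the class with the closest mean. This is a generic sketch of the technique, not CIOD itself; the class name and the two-dimensional features are made up for illustration, and a real system would use features from a deep network.

```python
import math


class NearestMeanClassifier:
    """Classifies a feature vector by its nearest class prototype (mean of exemplars)."""

    def __init__(self):
        self.prototypes = {}  # label -> mean feature vector

    def add_class(self, label, exemplars):
        """Register a new class without touching the prototypes of existing classes."""
        dim = len(exemplars[0])
        self.prototypes[label] = [
            sum(e[i] for e in exemplars) / len(exemplars) for i in range(dim)
        ]

    def predict(self, feature):
        return min(
            self.prototypes,
            key=lambda label: math.dist(feature, self.prototypes[label]),
        )


clf = NearestMeanClassifier()
clf.add_class("cat", [[0.0, 0.0], [0.0, 2.0]])
clf.add_class("dog", [[4.0, 4.0], [6.0, 4.0]])
first = clf.predict([1.0, 1.0])

# A class that arrives later only adds one prototype; nothing is retrained.
clf.add_class("bird", [[10.0, 0.0]])
second = clf.predict([9.0, 1.0])
```

Because adding a class only stores one extra mean vector, training time and memory grow with the number of classes rather than with the full dataset, which is the property the abstract emphasizes.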
23. CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization
Yuxi Li, Weiyao Lin, John See, Ning Xu, Shugong Xu, Ke Yan, Cong Yang
Abstract: Most current pipelines for spatio-temporal action localization connect frame-wise or clip-wise detection results to generate action proposals, where only local information is exploited and the efficiency is hindered by dense per-frame localization. In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatio-temporal action localization. CFAD introduces a new paradigm that first estimates coarse spatio-temporal action tubes from video streams, and then refines the tubes' location based on key timestamps. This concept is implemented by two key components, the Coarse and Refine Modules in our framework. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate initial tube estimation, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Against other methods, the proposed CFAD achieves competitive results on the action detection benchmarks UCF101-24, UCFSports and JHMDB-21, with inference speed that is 3.3x faster than the nearest competitors.
24. FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration
Yu Rong, Takaaki Shiratori, Hanbyul Joo
Abstract: Although the essential nuance of human motion is often conveyed as a combination of body movements and hand gestures, existing monocular motion capture approaches mostly focus on either body motion capture that ignores the hands or hand motion capture that does not consider body motion. In this paper, we present FrankMocap, a motion capture system that can estimate both 3D hand and body motion from in-the-wild monocular inputs with faster speed (9.5 fps) and better accuracy than previous work. Our method works in near real-time (9.5 fps) and produces 3D body and hand motion capture outputs as a unified parametric model structure. Our method aims to capture 3D body and hand motion simultaneously from challenging in-the-wild monocular videos. To construct FrankMocap, we build the state-of-the-art monocular 3D "hand" motion capture method by taking the hand part of the whole body parametric model (SMPL-X). Our 3D hand motion capture output can be efficiently integrated with monocular body motion capture output, producing whole body motion results in a unified parametric model structure. We demonstrate the state-of-the-art performance of our hand motion capture system on public benchmarks, and show the high quality of our whole body motion capture results in various challenging real-world scenes, including a live demo scenario.
25. Towards Lightweight Lane Detection by Optimizing Spatial Embedding
Seokwoo Jung, Sungha Choi, Mohammad Azam Khan, Jaegul Choo
Abstract: A number of lane detection methods depend on proposal-free instance segmentation because of its adaptability to flexible object shapes, occlusion, and real-time application. This paper addresses the problem that pixel embedding in proposal-free instance segmentation based lane detection is difficult to optimize. The translation invariance of convolution, which is one of its supposed strengths, causes challenges in optimizing pixel embedding. In this work, we propose a lane detection method based on proposal-free instance segmentation, directly optimizing the spatial embedding of pixels using image coordinates. Our proposed method allows the post-processing step for center localization and optimizes clustering in an end-to-end manner. The proposed method enables real-time lane detection through the simplicity of post-processing and the adoption of a lightweight backbone. Our proposed method demonstrates competitive performance on public lane detection datasets.
26. Deep Relighting Networks for Image Light Source Manipulation
Li-Wen Wang, Wan-Chi Siu, Zhi-Song Liu, Chu-Tak Li, Daniel P.K. Lun
Abstract: Manipulating the light source of given images is an interesting task and useful in various applications, including photography and cinematography. Existing methods usually require additional information like the geometric structure of the scene, which may not be available for most images. In this paper, we formulate the single image relighting task and propose a novel Deep Relighting Network (DRN) with three parts: 1) scene reconversion, which aims to reveal the primary scene structure through a deep auto-encoder network, 2) shadow prior estimation, to predict light effect from the new light direction through adversarial learning, and 3) re-renderer, to combine the primary structure with the reconstructed shadow view to form the required estimation under the target light source. Experimental results show that the proposed method outperforms other possible methods, both qualitatively and quantitatively. Specifically, the proposed DRN has achieved the best PSNR in the "AIM2020 - Any to one relighting challenge" of the 2020 ECCV conference.
27. TNT: Target-driveN Trajectory Prediction
Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Congcong Li, Dragomir Anguelov
Abstract: Predicting the future behavior of moving agents is essential for real world applications. It is challenging as the intent of the agent and the corresponding behavior are unknown and intrinsically multimodal. Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states. This leads to our target-driven trajectory prediction (TNT) framework. TNT has three stages which are trained end-to-end. It first predicts an agent's potential target states $T$ steps into the future, by encoding its interactions with the environment and the other agents. TNT then generates trajectory state sequences conditioned on targets. A final stage estimates trajectory likelihoods, and a final compact set of trajectory predictions is selected. This is in contrast to previous work which models agent intents as latent variables, and relies on test-time sampling to generate diverse trajectories. We benchmark TNT on trajectory prediction of vehicles and pedestrians, where we outperform state-of-the-art on Argoverse Forecasting, INTERACTION, Stanford Drone and an in-house Pedestrian-at-Intersection dataset.
28. Attribute Prototype Network for Zero-Shot Learning
Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
Abstract: From the beginning of zero-shot learning research, visual attributes have been shown to play an important role. In order to better transfer attribute-based knowledge from known to unknown classes, we argue that an image representation with integrated attribute localization ability would be beneficial for zero-shot learning. To this end, we propose a novel zero-shot representation learning framework that jointly learns discriminative global and local features using only class-level attributes. While a visual-semantic embedding layer learns global features, local features are learned through an attribute prototype network that simultaneously regresses and decorrelates attributes from intermediate features. We show that our locality augmented image representations achieve a new state-of-the-art on three zero-shot learning benchmarks. As an additional benefit, our model points to the visual evidence of the attributes in an image, e.g. for the CUB dataset, confirming the improved attribute localization ability of our image representation. The code will be publicly available at this https URL.
29. Channel-wise Hessian Aware trace-Weighted Quantization of Neural Networks
Xu Qian, Victor Li, Crews Darren
Abstract: Second-order information has proven to be very effective in determining the redundancy of neural network weights and activations. A recent paper proposes to use Hessian traces of weights and activations for mixed-precision quantization and achieves state-of-the-art results. However, prior works only focus on selecting bits for each layer, while the redundancy of different channels within a layer also differs a lot. This is mainly because the complexity of determining bits for each channel is too high for the original methods. Here, we introduce Channel-wise Hessian Aware trace-Weighted Quantization (CW-HAWQ). CW-HAWQ uses the Hessian trace to determine the relative sensitivity order of different channels of activations and weights. In addition, CW-HAWQ proposes to use a deep reinforcement learning (DRL) agent based on Deep Deterministic Policy Gradient (DDPG) to find the optimal ratios of different quantization bits and assign bits to channels according to the Hessian trace order. The number of states in CW-HAWQ is much smaller than in traditional AutoML-based mixed-precision methods, since we only need to search ratios for the quantization bits. Comparing CW-HAWQ with state-of-the-art methods shows that we can achieve better results for multiple networks.
30. CCA: Exploring the Possibility of Contextual Camouflage Attack on Object Detection [PDF] Back to contents
Shengnan Hu, Yang Zhang, Sumit Laha, Ankit Sharma, Hassan Foroosh
Abstract: Deep neural network based object detection has become the cornerstone of many real-world applications. Along with this success come concerns about its vulnerability to malicious attacks. To gain more insight into this issue, we propose a contextual camouflage attack (CCA for short) algorithm to influence the performance of object detectors. In this paper, we use an evolutionary search strategy and adversarial machine learning in interactions with a photo-realistic simulated environment to find camouflage patterns that are effective over a huge variety of object locations, camera poses, and lighting conditions. The proposed camouflages are validated as effective against most of the state-of-the-art object detectors.
31. Learning Connectivity of Neural Networks from a Topological Perspective [PDF] Back to contents
Kun Yuan, Quanquan Li, Jing Shao, Junjie Yan
Abstract: Seeking effective neural networks is a critical and practical field in deep learning. Besides designing the depth, type of convolution, normalization, and nonlinearities, the topological connectivity of neural networks is also important. Previous principles of rule-based modular design reduce the difficulty of building an effective architecture, but constrain the possible topologies to limited spaces. In this paper, we attempt to optimize the connectivity in neural networks. We propose a topological perspective that represents a network as a complete graph for analysis, where nodes carry out aggregation and transformation of features, and edges determine the flow of information. By assigning learnable parameters to the edges, which reflect the magnitude of connections, the learning process can be performed in a differentiable manner. We further attach an auxiliary sparsity constraint to the distribution of connectedness, which encourages the learned topology to focus on critical connections. This learning process is compatible with existing networks and adapts to larger search spaces and different tasks. Quantitative experimental results show that the learned connectivity is superior to traditional rule-based ones, such as random, residual, and complete. In addition, it obtains significant improvements in image classification and object detection without introducing excessive computational burden.
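The idea of learnable edge parameters with a sparsity constraint can be pictured with a very small sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the sigmoid gate on each edge parameter, the names `aggregate` and `sparsity_penalty`, and the L1 form of the penalty are hypothetical stand-ins, not the authors' method.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def aggregate(inputs: List[List[float]], edge_params: List[float]) -> List[float]:
    """Weighted sum of predecessor features, one learnable scalar per incoming edge.

    Assumption: each raw edge parameter is squashed by a sigmoid into (0, 1),
    so the weight reads as the strength of that connection.
    """
    if len(inputs) != len(edge_params):
        raise ValueError("need one edge parameter per input")
    weights = [sigmoid(a) for a in edge_params]
    dim = len(inputs[0])
    return [sum(w * feat[k] for w, feat in zip(weights, inputs)) for k in range(dim)]


def sparsity_penalty(edge_params: List[float]) -> float:
    """L1 penalty on the gated edge weights (an assumed form of the sparsity constraint)."""
    return sum(sigmoid(a) for a in edge_params)
```

Because the gate is differentiable, the edge parameters could be trained by gradient descent together with the network weights, and adding the penalty to the task loss would push unimportant edges toward zero.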
32. Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos [PDF] Back to contents
Zhu Zhang, Zhijie Lin, Zhou Zhao, Jieming Zhu, Xiuqiang He
Abstract: Video moment retrieval aims to localize the target moment in a video according to the given sentence. The weakly-supervised setting only provides video-level sentence annotations during training. Most existing weakly-supervised methods apply a MIL-based framework to develop inter-sample confrontment, but ignore the intra-sample confrontment between moments with semantically similar contents. Thus, these methods fail to distinguish the target moment from plausible negative moments. In this paper, we propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments. Concretely, we first devise a language-aware filter to generate an enhanced video stream and a suppressed video stream. We then design a sharable two-branch proposal module to generate positive proposals from the enhanced stream and plausible negative proposals from the suppressed one for sufficient confrontment. Further, we apply proposal regularization to stabilize the training process and improve model performance. Extensive experiments show the effectiveness of our method. Our code is released here.
33. A Color Elastica Model for Vector-Valued Image Regularization [PDF] Back to contents
Hao Liu, Xue-Cheng Tai, Ron Kimmel, Roland Glowinski
Abstract: Models related to Euler's Elastica energy have proven to be very useful for many applications, including image processing and high energy physics. Extending the Elastica models to color images and multi-channel data is challenging, as numerical solvers for these geometric models are difficult to find. In the past, the Polyakov action from high energy physics has been successfully applied to color image processing. Like the single-channel Euler's Elastica model and the total variation (TV) models, measures that require high-order derivatives could help when considering image formation models that minimize elastic properties, in one way or another. Here, we introduce an addition to the Polyakov action for color images that minimizes the color manifold curvature, which is computed by applying the Laplace-Beltrami operator to the color image channels. When applied to gray-scale images, with an appropriate scaling between space and color, the proposed model reduces to minimizing Euler's Elastica operating on the image level sets. Finding a minimizer for the proposed nonlinear geometric model is the challenge we address in this paper. Specifically, we present an operator-splitting method to minimize the proposed functional. The nonlinearity is decoupled by introducing three vector-valued and matrix-valued variables. The problem is then converted into solving for the steady state of an associated initial-value problem. The initial-value problem is time-split into three fractional steps, such that each sub-problem has a closed-form solution or can be solved by fast algorithms. The efficiency and robustness of the proposed method are demonstrated by systematic numerical experiments.
34. Face Anti-Spoofing Via Disentangled Representation Learning [PDF] Back to contents
Ke-Yue Zhang, Taiping Yao, Jian Zhang, Ying Tai, Shouhong Ding, Jilin Li, Feiyue Huang, Haichuan Song, Lizhuang Ma
Abstract: Face anti-spoofing is crucial to the security of face recognition systems. Previous approaches focus on developing discriminative models based on the features extracted from images, which may still be entangled between spoof patterns and real persons. In this paper, motivated by disentangled representation learning, we propose a novel perspective on face anti-spoofing that disentangles the liveness features and content features from images, and the liveness features are then used for classification. We also put forward a Convolutional Neural Network (CNN) architecture with the process of disentanglement and a combination of low-level and high-level supervision to improve the generalization capabilities. We evaluate our method on public benchmark datasets, and extensive experimental results demonstrate the effectiveness of our method against the state-of-the-art competitors. Finally, we further visualize some results to help understand the effect and advantage of disentanglement.
35. Weakly Supervised Learning with Region and Box-level Annotations for Salient Instance Segmentation [PDF] Back to contents
Jialun Pei, He Tang, Chuanbo Chen
Abstract: Salient instance segmentation is a new and challenging task that has received widespread attention in the saliency detection area. Due to the limited scale of the existing dataset and the high cost of mask annotation, it is difficult to train a salient instance neural network completely. In this paper, we aim to train a salient instance segmentation framework from a weakly supervised source without resorting to laborious labeling. We present a cyclic global context salient instance segmentation network (CGCNet), which is supervised by the combination of the binary salient regions and bounding boxes from the existing saliency detection datasets. For precise pixel-level localization, a global feature refining layer is introduced that dilates the context features of each salient instance to the global context in the image. Meanwhile, a label updating scheme is embedded in the proposed framework to update the weak annotations online for the next iteration. Experimental results demonstrate that the proposed end-to-end network trained with weakly supervised annotations can be competitive with the existing fully supervised salient instance segmentation methods. Without bells and whistles, our proposed method achieves a mask AP of 57.13%, which outperforms the best fully supervised methods and establishes a new state of the art for weakly supervised salient instance segmentation.
36. Open Source Iris Recognition Hardware and Software with Presentation Attack Detection [PDF] Back to contents
Zhaoyuan Fang, Adam Czajka
Abstract: This paper proposes what is, to our knowledge, the first open-source hardware and software iris recognition system with presentation attack detection (PAD), which can be easily assembled for about 75 USD using a Raspberry Pi board and a few peripherals. The primary goal of this work is to offer a low-cost baseline for spoof-resistant iris recognition, which may (a) stimulate research in iris PAD and allow for easy prototyping of secure iris recognition systems, (b) offer a low-cost secure iris recognition alternative to more sophisticated systems, and (c) serve as an educational platform. We propose a lightweight image complexity-guided convolutional network for fast and accurate iris segmentation, domain-specific human-inspired Binarized Statistical Image Features (BSIF) to build an iris template, and a combination of 2D (iris texture) and 3D (photometric stereo-based) features for PAD. The proposed iris recognition runs in about 3.2 seconds and the proposed PAD runs in about 4.5 seconds on a Raspberry Pi 3B+. The hardware specifications and all source code of the entire pipeline are made available along with this paper.
37. Stereo Plane SLAM Based on Intersecting Lines [PDF] Back to contents
Xiaoyu Zhang, Wei Wang, Xianyu Qi, Ziwei Liao
Abstract: Plane features are stable landmarks that reduce drift error in SLAM systems. It is easy and fast to extract planes from a dense point cloud, which is commonly acquired from an RGB-D camera or lidar. For a stereo camera, however, it is hard to compute a dense point cloud accurately and efficiently. In this paper, we propose a novel method to compute plane parameters from intersecting lines extracted from stereo images. Plane features commonly exist on the surfaces of man-made objects and structures, which have regular shapes and straight edge lines. In 3D space, two intersecting lines determine such a plane. Thus we extract line segments from both the left and right stereo images. By stereo matching, we compute the endpoints and line directions in 3D space, and then the planes can be computed. Adding such computed plane features to a stereo SLAM system reduces the drift error and improves performance. We test our proposed system on public datasets and demonstrate its robust and accurate estimation results, compared with state-of-the-art SLAM systems.
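The geometric fact the abstract relies on, that two intersecting 3D lines determine a plane, can be written out in a few lines. The sketch below is only the textbook construction (the normal is the cross product of the two line directions), not the paper's pipeline: the function name `plane_from_lines` and the plane convention n·x + d = 0 are assumptions chosen for illustration, and line extraction and stereo matching are not shown.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def plane_from_lines(point: Vec3, dir1: Vec3, dir2: Vec3,
                     eps: float = 1e-9) -> Tuple[Vec3, float]:
    """Return (unit normal n, offset d) of the plane n.x + d = 0.

    point: a 3D point on the plane, e.g. the intersection of the two lines.
    dir1, dir2: direction vectors of the two intersecting lines.
    """
    n = cross(dir1, dir2)
    norm = math.sqrt(dot(n, n))
    if norm < eps:
        # Parallel directions span no unique plane.
        raise ValueError("line directions are parallel")
    unit = (n[0] / norm, n[1] / norm, n[2] / norm)
    return unit, -dot(unit, point)
```

For instance, lines along the x and y axes through the point (0, 0, 1) give the normal (0, 0, 1) and offset -1, i.e. the plane z = 1. In practice the 3D directions would be noisy, so a real system would likely add a check that the two lines actually meet.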
38. DeepHandMesh: A Weakly-supervised Deep Encoder-Decoder Framework for High-fidelity Hand Mesh Modeling [PDF] Back to contents
Gyeongsik Moon, Takaaki Shiratori, Kyoung Mu Lee
Abstract: Human hands play a central role in interacting with other people and objects. For realistic replication of such hand motions, high-fidelity hand meshes have to be reconstructed. In this study, we first propose DeepHandMesh, a weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. We design our system to be trained in an end-to-end and weakly-supervised manner; therefore, it does not require groundtruth meshes. Instead, it relies on weaker supervision such as 3D joint coordinates and multi-view depth maps, which are easier to obtain than groundtruth meshes and do not depend on the mesh topology. Although the proposed DeepHandMesh is trained in a weakly-supervised way, it provides significantly more realistic hand meshes than previous fully-supervised hand models. Our newly introduced penetration avoidance loss further improves results by replicating physical interaction between hand parts. Finally, we demonstrate that our system can also be applied successfully to 3D hand mesh estimation from general images. Our hand model, dataset, and code are publicly available at this https URL.
39. PC-U Net: Learning to Jointly Reconstruct and Segment the Cardiac Walls in 3D from CT Data [PDF] Back to contents
Meng Ye, Qiaoying Huang, Dong Yang, Pengxiang Wu, Jingru Yi, Leon Axel, Dimitris Metaxas
Abstract: The 3D volumetric shape of the heart's left ventricle (LV) myocardium (MYO) wall provides important information for diagnosis of cardiac disease and invasive procedure navigation. Many cardiac image segmentation methods have relied on detection of region-of-interest as a pre-requisite for shape segmentation and modeling. With segmentation results, a 3D surface mesh and a corresponding point cloud of the segmented cardiac volume can be reconstructed for further analyses. Although state-of-the-art methods (e.g., U-Net) have achieved decent performance on cardiac image segmentation in terms of accuracy, these segmentation results can still suffer from imaging artifacts and noise, which will lead to inaccurate shape modeling results. In this paper, we propose a PC-U net that jointly reconstructs the point cloud of the LV MYO wall directly from volumes of 2D CT slices and generates its segmentation masks from the predicted 3D point cloud. Extensive experimental results show that by incorporating a shape prior from the point cloud, the segmentation masks are more accurate than the state-of-the-art U-Net results in terms of Dice's coefficient and Hausdorff distance. The proposed joint learning framework of our PC-U net is beneficial for automatic cardiac image analysis tasks because it can obtain simultaneously the 3D shape and segmentation of the LV MYO walls.
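The two evaluation metrics named in the abstract have standard definitions, which the following sketch spells out. It is a brute-force illustration and not the paper's evaluation code: masks are assumed to be given as sets of foreground voxel indices, surfaces as lists of points, and the function names are chosen for this example.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, ...]


def dice_coefficient(pred: Set[Tuple[int, ...]], target: Set[Tuple[int, ...]]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for masks given as sets of foreground indices."""
    if not pred and not target:
        return 1.0  # convention used here: two empty masks agree perfectly
    return 2.0 * len(pred & target) / (len(pred) + len(target))


def hausdorff_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets, O(|a||b|)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def directed(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Largest distance from any source point to its nearest destination point.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))
```

A higher Dice coefficient means better overlap, while a lower Hausdorff distance means the worst-case boundary error is smaller, which is why the two are often reported together. For large volumes, a spatial index or a library routine would be preferable to this quadratic loop.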
40. Learning Tuple Compatibility for Conditional Outfit Recommendation [PDF] Back to contents
Xuewen Yang, Dongliang Xie, Xin Wang, Jiangbo Yuan, Wanying Ding, Pengyun Yan
Abstract: Outfit recommendation requires answers to challenging outfit compatibility questions such as 'Which pair of boots and school bag go well with my jeans and sweater?'. It is more complicated than conventional similarity search, and needs to consider not only visual aesthetics but also the intrinsic fine-grained and multi-category nature of fashion items. Some existing approaches solve the problem through sequential models or by learning pair-wise distances between items. However, most of them only consider coarse category information in defining fashion compatibility while neglecting the fine-grained category information often desired in practical applications. To better define fashion compatibility and more flexibly meet different needs, we propose a novel problem of learning compatibility among multiple tuples (each consisting of an item and category pair), and recommending fashion items following the category choices from customers. Our contributions include: 1) Designing a Mixed Category Attention Net (MCAN) which integrates both fine-grained and coarse category information into recommendation and learns the compatibility among fashion tuples. MCAN can explicitly and effectively generate diverse and controllable recommendations based on need. 2) Contributing a new dataset IQON, which follows eastern culture and can be used to test the generalization of recommendation systems. Our extensive experiments on a reference dataset Polyvore and our dataset IQON demonstrate that our method significantly outperforms state-of-the-art recommendation methods.
41. Discovering Multi-Hardware Mobile Models via Architecture Search [PDF] Back to contents
Grace Chu, Okan Arikan, Gabriel Bender, Weijun Wang, Achille Brighton, Pieter-Jan Kindermans, Hanxiao Liu, Berkin Akin, Suyog Gupta, Andrew Howard
Abstract: Developing efficient models for mobile phones or other on-device deployments has been a popular topic in both industry and academia. In such scenarios, it is often convenient to deploy the same model on a diverse set of hardware devices owned by different end users to minimize the costs of development, deployment and maintenance. Despite the importance, designing a single neural network that can perform well on multiple devices is difficult as each device has its own specialty and restrictions: A model optimized for one device may not perform well on another. While most existing work proposes different models optimized for each single hardware, this paper is the first to explore the problem of finding a single model that performs well on multiple hardware. Specifically, we leverage architecture search to help us find the best model, where given a set of diverse hardware to optimize for, we first introduce a multi-hardware search space that is compatible with all examined hardware. Then, to measure the performance of a neural network over multiple hardware, we propose metrics that can characterize the overall latency performance in an average-case and worst-case scenario. With the multi-hardware search space and new metrics applied to Pixel4 CPU, GPU, DSP and EdgeTPU, we found models that perform on par with or better than state-of-the-art (SOTA) models on each of our target accelerators and generalize well on many un-targeted hardware. Compared with single-hardware searches, multi-hardware search gives a better trade-off between computation cost and model performance.
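The abstract does not give the exact form of its average-case and worst-case latency metrics. The following is a minimal sketch of one plausible formulation, assuming each device's latency is first normalized by a per-device reference latency and the normalized values are then averaged or maximized. The function name, device keys, and numbers are all hypothetical and are not taken from the paper.

```python
def latency_summary(latencies_ms, reference_ms):
    """Summarize one model's latency across devices.

    latencies_ms: measured latency of the candidate model per device.
    reference_ms: latency of a reference model on the same devices,
                  used to put devices of different speed on one scale.
    Returns (average_case, worst_case) of the normalized latencies.
    """
    ratios = {dev: latencies_ms[dev] / reference_ms[dev] for dev in latencies_ms}
    average_case = sum(ratios.values()) / len(ratios)
    worst_case = max(ratios.values())
    return average_case, worst_case

# Hypothetical measurements in milliseconds, for illustration only.
measured = {"cpu": 20.0, "gpu": 10.0, "dsp": 6.0}
reference = {"cpu": 10.0, "gpu": 10.0, "dsp": 12.0}
avg_case, worst_case = latency_summary(measured, reference)
```

Normalizing before aggregating keeps a slow device from dominating the average purely because its absolute latencies are larger; whether the paper normalizes in this way is not stated in the abstract.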
42. Uncertainty-aware Self-supervised 3D Data Association [PDF] Back to contents
Jianren Wang, Siddharth Ancha, Yi-Ting Chen, David Held
Abstract: 3D object trackers usually require training on large amounts of annotated data that is expensive and time-consuming to collect. Instead, we propose leveraging vast unlabeled datasets by self-supervised metric learning of 3D object trackers, with a focus on data association. Large scale annotations for unlabeled data are cheaply obtained by automatic object detection and association across frames. We show how these self-supervised annotations can be used in a principled manner to learn point-cloud embeddings that are effective for 3D tracking. We estimate and incorporate uncertainty in self-supervised tracking to learn more robust embeddings, without needing any labeled data. We design embeddings to differentiate objects across frames, and learn them using uncertainty-aware self-supervised training. Finally, we demonstrate their ability to perform accurate data association across frames, towards effective and accurate 3D tracking. Project videos and code are at this https URL.
43. Learning to Generate Diverse Dance Motions with Transformer [PDF] Back to contents
Jiaman Li, Yihang Yin, Hang Chu, Yi Zhou, Tingwu Wang, Sanja Fidler, Hao Li
Abstract: With the ongoing pandemic, virtual concerts and live events using digitized performances of musicians are getting traction on massive multiplayer online worlds. However, well choreographed dance movements are extremely complex to animate and would involve an expensive and tedious production process. In addition to the use of complex motion capture systems, it typically requires a collaborative effort between animators, dancers, and choreographers. We introduce a complete system for dance motion synthesis, which can generate complex and highly diverse dance sequences given an input music sequence. As motion capture data is limited for the range of dance motions and styles, we introduce a massive dance motion data set that is created from YouTube videos. We also present a novel two-stream motion transformer generative model, which can generate motion sequences with high flexibility. We also introduce new evaluation metrics for the quality of synthesized dance motions, and demonstrate that our system can outperform state-of-the-art methods. Our system provides high-quality animations suitable for large crowds for virtual concerts and can also be used as reference for professional animation pipelines. Most importantly, we show that vast online videos can be effective in training dance motion models.
44. Robust Handwriting Recognition with Limited and Noisy Data [PDF] Back to contents
Hai Pham, Amrith Setlur, Saket Dingliwal, Tzu-Hsiang Lin, Barnabas Poczos, Kang Huang, Zhuo Li, Jae Lim, Collin McCormack, Tam Vu
Abstract: Despite the advent of deep learning in computer vision, the general handwriting recognition problem is far from solved. Most existing approaches focus on handwriting datasets that have clearly written text and carefully segmented labels. In this paper, we instead focus on learning handwritten characters from maintenance logs, a constrained setting where data is very limited and noisy. We break the problem into two consecutive stages, word segmentation and word recognition, and utilize data augmentation techniques to train both stages. Extensive comparisons with popular baselines for scene-text detection and word recognition show that our system achieves a lower error rate and is better suited to handling noisy and difficult documents.
45. Category Level Object Pose Estimation via Neural Analysis-by-Synthesis [PDF] Back to contents
Xu Chen, Zijian Dong, Jie Song, Andreas Geiger, Otmar Hilliges
Abstract: Many object pose estimation algorithms rely on the analysis-by-synthesis framework, which requires explicit representations of individual object instances. In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering the need for explicit CAD models per object instance unnecessary. The image synthesis network is designed to efficiently span the pose configuration space so that model capacity can be used to capture the shape and local appearance (i.e., texture) variations jointly. At inference time the synthesized images are compared to the target via an appearance-based loss and the error signal is backpropagated through the network to the input parameters. Keeping the network parameters fixed, this allows for iterative optimization of the object pose, shape and appearance in a joint manner, and we experimentally show that the method can recover the orientation of objects with high accuracy from 2D images alone. When provided with depth measurements to overcome scale ambiguities, the method can accurately recover the full 6DOF pose.
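The inference loop described in this abstract (compare a synthesized image to the target, propagate the error back to the input parameters, keep the network weights fixed) can be sketched on a toy problem. Everything below is an assumption made for illustration: the "renderer" is a fixed linear map applied to a single rotation angle rather than a neural image synthesis network, and the gradient is derived by hand instead of by automatic differentiation.

```python
import numpy as np

# Frozen "network" weights: a fixed linear map standing in for the synthesis module.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])

def render(theta):
    """Toy synthesis step: maps a pose angle to a 4-pixel 'image'."""
    return W @ np.array([np.cos(theta), np.sin(theta)])

def fit_pose(target, theta_init=0.0, lr=0.05, steps=200):
    """Gradient descent on the pose only; W is never updated."""
    theta = theta_init
    for _ in range(steps):
        residual = render(theta) - target            # appearance error
        d_render = W @ np.array([-np.sin(theta), np.cos(theta)])  # d render / d theta
        grad = 2.0 * residual @ d_render             # d ||residual||^2 / d theta
        theta -= lr * grad
    return theta

true_theta = 1.0                      # hypothetical ground-truth pose
target_image = render(true_theta)     # the observation to be explained
estimated_theta = fit_pose(target_image)
```

In this toy setting the loss has a single basin around the true angle for the chosen starting point, so plain gradient descent converges; a real image synthesis network would generally need a good initialization or multiple restarts.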
46. How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language [PDF] Back to contents
Amanda Duarte, Shruti Palaskar, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto
Abstract: Sign Language is the primary means of communication for the majority of the Deaf community. One of the factors that has hindered the progress in the areas of automatic sign language recognition, generation, and translation is the absence of large annotated datasets, especially continuous sign language datasets, i.e. datasets that are annotated and segmented at the sentence or utterance level. Towards this end, in this work we introduce How2Sign, a work-in-progress dataset collection. How2Sign consists of a parallel corpus of 80 hours of sign language videos (collected with multi-view RGB and depth sensor data) with corresponding speech transcriptions and gloss annotations. In addition, a three-hour subset was further recorded in a geodesic dome setup using hundreds of cameras and sensors, which enables detailed 3D reconstruction and pose estimation and paves the way for vision systems to understand the 3D geometry of sign language.
47. DeepLiDARFlow: A Deep Learning Architecture For Scene Flow Estimation Using Monocular Camera and Sparse LiDAR [PDF] Back to contents
Rishav, Ramy Battrawy, René Schuster, Oliver Wasenmüller, Didier Stricker
Abstract: Scene flow is the dense 3D reconstruction of motion and geometry of a scene. Most state-of-the-art methods use a pair of stereo images as input for full scene reconstruction. These methods depend a lot on the quality of the RGB images and perform poorly in regions with reflective objects, shadows, ill-conditioned light environment and so on. LiDAR measurements are much less sensitive to the aforementioned conditions but LiDAR features are in general unsuitable for matching tasks due to their sparse nature. Hence, using both LiDAR and RGB can potentially overcome the individual disadvantages of each sensor by mutual improvement and yield robust features which can improve the matching process. In this paper, we present DeepLiDARFlow, a novel deep learning architecture which fuses high level RGB and LiDAR features at multiple scales in a monocular setup to predict dense scene flow. Its performance is much better in the critical regions where image-only and LiDAR-only methods are inaccurate. We verify our DeepLiDARFlow using the established data sets KITTI and FlyingThings3D and we show strong robustness compared to several state-of-the-art methods which used other input modalities. The code of our paper is available at this https URL.
48. TIDE: A General Toolbox for Identifying Object Detection Errors [PDF] Back to contents
Daniel Bolya, Sean Foley, James Hays, Judy Hoffman
Abstract: We introduce TIDE, a framework and associated toolbox for analyzing the sources of error in object detection and instance segmentation algorithms. Importantly, our framework is applicable across datasets and can be applied directly to output prediction files without required knowledge of the underlying prediction system. Thus, our framework can be used as a drop-in replacement for the standard mAP computation while providing a comprehensive analysis of each model's strengths and weaknesses. We segment errors into six types and, crucially, are the first to introduce a technique for measuring the contribution of each error in a way that isolates its effect on overall performance. We show that such a representation is critical for drawing accurate, comprehensive conclusions through in-depth analysis across 4 datasets and 7 recognition models. Available at this https URL
49. Slide-free MUSE Microscopy to H&E Histology Modality Conversion via Unpaired Image-to-Image Translation GAN Models [PDF]
Tanishq Abraham, Andrew Shaw, Daniel O'Connor, Austin Todd, Richard Levenson
Abstract: MUSE is a novel slide-free imaging technique for histological examination of tissues that can serve as an alternative to traditional histology. In order to bridge the gap between MUSE and traditional histology, we aim to convert MUSE images to resemble authentic hematoxylin- and eosin-stained (H&E) images. We evaluated four models: a non-machine-learning-based color-mapping unmixing-based tool, CycleGAN, DualGAN, and GANILLA. CycleGAN and GANILLA provided visually compelling results that appropriately transferred H&E style and preserved MUSE content. Based on training an automated critic on real and generated H&E images, we determined that CycleGAN demonstrated the best performance. We have also found that MUSE color inversion may be a necessary step for accurate modality conversion to H&E. We believe that our MUSE-to-H&E model can help improve adoption of novel slide-free methods by bridging a perceptual gap between MUSE imaging and traditional histology.
50. Blur-Attention: A boosting mechanism for non-uniform blurred image restoration [PDF]
Xiaoguang Li, Feifan Yang, Kin Man Lam, Li Zhuo, Jiafeng Li
Abstract: Dynamic scene deblurring is a challenging problem in computer vision. It is difficult to accurately estimate the spatially varying blur kernel by traditional methods. Data-driven-based methods usually employ kernel-free end-to-end mapping schemes, which are apt to overlook the kernel estimation. To address this issue, we propose a blur-attention module to dynamically capture the spatially varying features of non-uniform blurred images. The module consists of a DenseBlock unit and a spatial attention unit with multi-pooling feature fusion, which can effectively extract complex spatially varying blur features. We design a multi-level residual connection structure to connect multiple blur-attention modules to form a blur-attention network. By introducing the blur-attention network into a conditional generation adversarial framework, we propose an end-to-end blind motion deblurring method, namely Blur-Attention-GAN (BAG), for a single image. Our method can adaptively select the weights of the extracted features according to the spatially varying blur features, and dynamically restore the images. Experimental results show that the deblurring capability of our method achieved outstanding objective performance in terms of PSNR, SSIM, and subjective visual quality. Furthermore, by visualizing the features extracted by the blur-attention module, comprehensive discussions are provided on its effectiveness.
51. "Name that manufacturer". Relating image acquisition bias with task complexity when training deep learning models: experiments on head CT [PDF]
Giorgio Pietro Biondetti, Romane Gauriau, Christopher P. Bridge, Charles Lu, Katherine P. Andriole
Abstract: As interest in applying machine learning techniques for medical images continues to grow at a rapid pace, models are starting to be developed and deployed for clinical applications. In the clinical AI model development lifecycle (described by Lu et al. [1]), a crucial phase for machine learning scientists and clinicians is the proper design and collection of the data cohort. The ability to recognize various forms of biases and distribution shifts in the dataset is critical at this step. While it remains difficult to account for all potential sources of bias, techniques can be developed to identify specific types of bias in order to mitigate their impact. In this work we analyze how the distribution of scanner manufacturers in a dataset can contribute to the overall bias of deep learning models. We evaluate convolutional neural networks (CNN) for both classification and segmentation tasks, specifically two state-of-the-art models: ResNet [2] for classification and U-Net [3] for segmentation. We demonstrate that CNNs can learn to distinguish the imaging scanner manufacturer and that this bias can substantially impact model performance for both classification and segmentation tasks. By creating an original synthesis dataset of brain data mimicking the presence of more or less subtle lesions we also show that this bias is related to the difficulty of the task. Recognition of such bias is critical to develop robust, generalizable models that will be crucial for clinical applications in real-world data distributions.
52. Correcting Data Imbalance for Semi-Supervised Covid-19 Detection Using X-ray Chest Images [PDF]
Saul Calderon-Ramirez, Shengxiang-Yang, Armaghan Moemeni, David Elizondo, Simon Colreavy-Donnelly, Luis Fernando Chavarria-Estrada, Miguel A. Molina-Cabello
Abstract: The Corona Virus (COVID-19) is an international pandemic that has quickly propagated throughout the world. The application of deep learning for image classification of chest X-ray images of Covid-19 patients could become a novel pre-diagnostic detection methodology. However, deep learning architectures require large labelled datasets. This is often a limitation when the subject of research is relatively new, as in the case of the virus outbreak, where dealing with small labelled datasets is a challenge. Moreover, in the context of a new highly infectious disease, the datasets are also highly imbalanced, with few observations from positive cases of the new disease. In this work we evaluate the performance of the semi-supervised deep learning architecture known as MixMatch using a very limited number of labelled observations and a highly imbalanced labelled dataset. We propose a simple approach for correcting data imbalance: re-weighting each observation in the loss function, giving a higher weight to the observations corresponding to the under-represented class. For unlabelled observations, we propose the usage of the pseudo and augmented labels calculated by MixMatch to choose the appropriate weight. The MixMatch method combined with the proposed pseudo-label based balance correction improved classification accuracy by up to 10%, with respect to the non-balanced MixMatch algorithm, with statistical significance. We tested our proposed approach with several available datasets using 10, 15 and 20 labelled observations. Additionally, a new dataset is included among the tested datasets, composed of chest X-ray images of Costa Rican adult patients.
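The re-weighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the inverse-frequency weighting, the use of cross-entropy for every observation, and the function names `class_weights` and `weighted_cross_entropy` are assumptions made for illustration. The weight of an observation is picked from the arg-max class of its label, which also works when the label is a soft pseudo-label produced for an unlabelled observation.

```python
import math

def class_weights(labels, num_classes):
    """Inverse-frequency weights, normalised so they sum to num_classes."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    # Classes with no observations get weight 0 (an illustrative choice).
    inv = [1.0 / c if c > 0 else 0.0 for c in counts]
    total = sum(inv)
    return [num_classes * v / total for v in inv]

def weighted_cross_entropy(probs, target, weights):
    """Cross-entropy for one observation, scaled by the weight of its class.

    `target` may be a one-hot label or a soft pseudo-label; the weight is
    selected from its arg-max class.
    """
    cls = max(range(len(target)), key=lambda k: target[k])
    ce = -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)
    return weights[cls] * ce

labels = [0, 0, 0, 0, 1]           # class 1 is under-represented
w = class_weights(labels, 2)       # the minority class gets the larger weight
loss_labelled = weighted_cross_entropy([0.5, 0.5], [0.0, 1.0], w)
loss_pseudo = weighted_cross_entropy([0.5, 0.5], [0.3, 0.7], w)
```

With four observations of class 0 and one of class 1, an error on the minority class costs four times as much as the same error on the majority class, which is the effect the abstract describes.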
53. Unsupervised Cross-domain Image Classification by Distance Metric Guided Feature Alignment [PDF]
Qingjie Meng, Daniel Rueckert, Bernhard Kainz
Abstract: Learning deep neural networks that are generalizable across different domains remains a challenge due to the problem of domain shift. Unsupervised domain adaptation is a promising avenue which transfers knowledge from a source domain to a target domain without using any labels in the target domain. Contemporary techniques focus on extracting domain-invariant features using domain adversarial training. However, these techniques neglect to learn discriminative class boundaries in the latent representation space on a target domain and yield limited adaptation performance. To address this problem, we propose distance metric guided feature alignment (MetFA) to extract discriminative as well as domain-invariant features on both source and target domains. The proposed MetFA method explicitly and directly learns the latent representation without using domain adversarial training. Our model integrates class distribution alignment to transfer semantic knowledge from a source domain to a target domain. We evaluate the proposed method on fetal ultrasound datasets for cross-device image classification. Experimental results demonstrate that the proposed method outperforms the state-of-the-art and enables model generalization.
54. Improving Blind Spot Denoising for Microscopy [PDF]
Anna S. Goncharova, Alf Honigmann, Florian Jug, Alexander Krull
Abstract: Many microscopy applications are limited by the total amount of usable light and are consequently challenged by the resulting levels of noise in the acquired images. This problem is often addressed via (supervised) deep learning based denoising. Recently, by making assumptions about the noise statistics, self-supervised methods have emerged. Such methods are trained directly on the images that are to be denoised and do not require additional paired training data. While achieving remarkable results, self-supervised methods can produce high-frequency artifacts and achieve inferior results compared to supervised approaches. Here we present a novel way to improve the quality of self-supervised denoising. Considering that light microscopy images are usually diffraction-limited, we propose to include this knowledge in the denoising process. We assume the clean image to be the result of a convolution with a point spread function (PSF) and explicitly include this operation at the end of our neural network. As a consequence, we are able to eliminate high-frequency artifacts and achieve self-supervised results that are very close to the ones achieved with traditional supervised methods.
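The effect of appending a PSF convolution to a network's output can be sketched in one dimension. This is only an illustration under stated assumptions: the Gaussian kernel stands in for a real point spread function, `raw_output` is a made-up signal with a pixel-level checkerboard artifact, and the helper names are hypothetical. It shows that the convolution suppresses high-frequency content; it does not reproduce the paper's network or training.

```python
import math

def gaussian_psf(radius, sigma):
    """Discrete 1D Gaussian kernel normalised to sum to 1 (a stand-in PSF)."""
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma))
         for i in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve_same(signal, kernel):
    """Same-size 1D convolution with edge replication (kernel is symmetric)."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, kv in enumerate(kernel):
            idx = min(max(i + j - r, 0), n - 1)  # clamp at the borders
            acc += kv * signal[idx]
        out.append(acc)
    return out

def roughness(x):
    """Sum of squared neighbour differences, a crude high-frequency measure."""
    return sum((b - a) ** 2 for a, b in zip(x, x[1:]))

# Hypothetical raw network output with an alternating artifact around 1.0.
raw_output = [1.0 + (0.5 if i % 2 else -0.5) for i in range(16)]
psf = gaussian_psf(radius=3, sigma=1.0)
prediction = convolve_same(raw_output, psf)  # what the loss would be applied to
```

Because the final prediction is always a blurred version of the raw output, it cannot contain structure finer than the PSF allows, which is one way to read the abstract's claim about eliminating high-frequency artifacts.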
55. Addressing Neural Network Robustness with Mixup and Targeted Labeling Adversarial Training [PDF]
Alfred Laugros, Alice Caplier, Matthieu Ospici
Abstract: Despite their performance, Artificial Neural Networks are not reliable enough for most industrial applications. They are sensitive to noise, rotations, blurs and adversarial examples. There is a need to build defenses that protect against a wide range of perturbations, covering the most traditional common corruptions and adversarial examples. We propose a new data augmentation strategy called M-TLAT, designed to address robustness in a broad sense. Our approach combines the Mixup augmentation and a new adversarial training algorithm called Targeted Labeling Adversarial Training (TLAT). The idea of TLAT is to interpolate the target labels of adversarial examples with the ground-truth labels. We show that M-TLAT can increase the robustness of image classifiers towards nineteen common corruptions and five adversarial attacks, without reducing the accuracy on clean samples.
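The two label manipulations named in this abstract can be sketched as plain vector interpolation. This is a hedged illustration, not the authors' code: the interpolation coefficient `alpha`, the order in which Mixup and the targeted label are combined, and all function names are assumptions, and the targeted attack that would actually perturb the input is omitted.

```python
def one_hot(index, num_classes):
    """One-hot label vector as a list of floats."""
    return [1.0 if k == index else 0.0 for k in range(num_classes)]

def interpolate(a, b, lam):
    """Element-wise lam * a + (1 - lam) * b."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

def mixup(x_i, y_i, x_j, y_j, lam):
    """Mixup: blend two inputs and their labels with the same coefficient."""
    return interpolate(x_i, x_j, lam), interpolate(y_i, y_j, lam)

def tlat_label(y_true, target_class, alpha):
    """Label for an adversarial example crafted towards `target_class`.

    Keeps weight (1 - alpha) on the ground truth and puts alpha on the
    attack target; alpha is a hypothetical hyperparameter here.
    """
    return interpolate(one_hot(target_class, len(y_true)), y_true, alpha)

# Mix a class-0 sample with a class-2 sample, then label the adversarial
# version of the mixed sample, assumed to be attacked towards class 1.
x_mix, y_mix = mixup([0.0, 1.0], one_hot(0, 3), [1.0, 0.0], one_hot(2, 3), lam=0.75)
y_adv = tlat_label(y_mix, target_class=1, alpha=0.1)
```

Both operations keep the label a valid probability vector, so the usual cross-entropy loss can be applied to it unchanged.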
56. DONet: Dual Objective Networks for Skin Lesion Segmentation [PDF]
Yaxiong Wang, Yunchao Wei, Xueming Qian, Li Zhu, Yi Yang
Abstract: Skin lesion segmentation is a crucial step in the computer-aided diagnosis of dermoscopic images. In the last few years, deep learning based semantic segmentation methods have significantly advanced the skin lesion segmentation results. However, the current performance is still unsatisfactory due to some challenging factors such as the large variety of lesion scales and the ambiguous difference between lesion region and background. In this paper, we propose a simple yet effective framework, named Dual Objective Networks (DONet), to improve skin lesion segmentation. Our DONet adopts two symmetric decoders to produce different predictions for approaching different objectives. Concretely, the two objectives are defined by different loss functions. In this way, the two decoders are encouraged to produce differentiated probability maps to match different optimization targets, resulting in complementary predictions. The complementary information learned by these two objectives is further aggregated to make the final prediction, by which the uncertainty existing in segmentation maps can be significantly alleviated. Besides, to address the challenge of the large variety of lesion scales and shapes in dermoscopic images, we additionally propose a recurrent context encoding module (RCEM) to model the complex correlation among skin lesions, where the features with different scale contexts are efficiently integrated to form a more robust representation. Extensive experiments on two popular benchmarks demonstrate the effectiveness of the proposed DONet. In particular, our DONet achieves Dice scores of 0.881 and 0.931 on ISIC 2018 and $\text{PH}^2$, respectively. Code will be made publicly available.
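The Dice score reported here, and the idea of aggregating two decoders' probability maps, can be sketched on flat binary masks. The Dice formula 2|A∩B| / (|A| + |B|) is standard; the aggregation by simple averaging, the 0.5 threshold, and the toy probability values are assumptions for illustration, since the abstract does not specify how the two predictions are combined.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flat binary masks."""
    inter = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * inter / total

def binarize(probs, threshold=0.5):
    """Turn a probability map into a binary mask."""
    return [1 if p >= threshold else 0 for p in probs]

def fuse(prob_a, prob_b, threshold=0.5):
    """Average two decoders' probability maps, then threshold (assumed rule)."""
    return binarize([(a + b) / 2.0 for a, b in zip(prob_a, prob_b)], threshold)

truth = [1, 1, 1, 0, 0, 0]
decoder_a = [0.9, 0.8, 0.4, 0.1, 0.2, 0.6]   # misses pixel 2, false alarm at 5
decoder_b = [0.7, 0.9, 0.7, 0.2, 0.1, 0.2]
fused = fuse(decoder_a, decoder_b)

dice_a = dice_score(binarize(decoder_a), truth)
dice_fused = dice_score(fused, truth)
```

In this toy case the second decoder corrects both mistakes of the first, so the fused mask scores higher than decoder A alone; real complementary predictions would only partly overlap in this way.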
57. Spatio-temporal relationships between rainfall and convective clouds during Indian Monsoon through a discrete lens [PDF]
Arjun Sharma, Adway Mitra, Vishal Vasan, Rama Govindarajan
Abstract: The Indian monsoon, a multi-variable process causing heavy rains during June-September every year, is very heterogeneous in space and time. We study the relationship between rainfall and Outgoing Longwave Radiation (OLR, convective cloud cover) for monsoon between 2004-2010. To identify, classify and visualize spatial patterns of rainfall and OLR we use a discrete and spatio-temporally coherent representation of the data, created using a statistical model based on Markov Random Field. Our approach clusters the days with similar spatial distributions of rainfall and OLR into a small number of spatial patterns. We find that eight daily spatial patterns each in rainfall and OLR, and seven joint patterns of rainfall and OLR, describe over 90% of all days. Through these patterns, we find that OLR generally has a strong negative correlation with precipitation, but with significant spatial variations. In particular, peninsular India (except west coast) is under significant convective cloud cover over a majority of days but remains rainless. We also find that much of the monsoon rainfall co-occurs with low OLR, but some amount of rainfall in Eastern and North-western India in June occurs on OLR days, presumably from shallow clouds. To study day-to-day variations of both quantities, we identify spatial patterns in the temporal gradients computed from the observations. We find that changes in convective cloud activity across India most commonly occur due to the establishment of a north-south OLR gradient which persists for 1-2 days and shifts the convective cloud cover from light to deep or vice versa. Such changes are also accompanied by changes in the spatial distribution of precipitation. The present work thus provides a highly reduced description of the complex spatial patterns and their day-to-day variations, and could form a useful tool for future simplified descriptions of this process.
58. Enhanced MRI Reconstruction Network using Neural Architecture Search
Qiaoying Huang, Dong Yang, Yikun Xian, Pengxiang Wu, Jingru Yi, Hui Qu, Dimitris Metaxas
Abstract: The accurate reconstruction of under-sampled magnetic resonance imaging (MRI) data using modern deep learning technology requires significant effort to design the necessary complex neural network architectures. The cascaded network architecture for MRI reconstruction has been widely used, but it suffers from the "vanishing gradient" problem when the network becomes deep. In addition, a homogeneous architecture degrades the representation capacity of the network. In this work, we present an enhanced MRI reconstruction network using a residual-in-residual basic block. For each cell in the basic block, we use the differentiable neural architecture search (NAS) technique to automatically choose the optimal operation among eight variants of the dense block. This new heterogeneous network is evaluated on two publicly available datasets and outperforms all current state-of-the-art methods, which demonstrates the effectiveness of our proposed method.
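The abstract above says a differentiable NAS technique chooses one operation per cell from a set of candidates. A minimal sketch of the usual continuous relaxation behind such searches follows: each cell outputs a softmax-weighted mixture of all candidates during search, and the highest-weighted candidate is kept afterwards. The function names and the three toy scalar operations are hypothetical stand-ins for the eight dense-block variants, which the abstract does not specify; this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of architecture logits."""
    peak = max(logits)
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, candidate_ops, alphas):
    """Search-time cell output: softmax-weighted sum of every candidate op."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, candidate_ops))


def discretize(candidate_ops, alphas):
    """After search, keep only the candidate with the largest logit."""
    best = max(range(len(alphas)), key=lambda k: alphas[k])
    return best, candidate_ops[best]


# Toy candidates (assumption): scalar functions instead of dense-block variants.
ops = [lambda x: x, lambda x: 2.0 * x, lambda x: max(0.0, x)]

uniform_output = mixed_op(2.0, ops, [0.0, 0.0, 0.0])   # equal weights at initialization
chosen_index, chosen_op = discretize(ops, [0.1, 2.0, -1.0])
```

In a real search the logits would be trained by gradient descent alongside the network weights; here they are fixed numbers so the mixing and selection steps can be read in isolation.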
59. LIRA: Lifelong Image Restoration from Unknown Blended Distortions
Jianzhao Liu, Jianxin Lin, Xin Li, Wei Zhou, Sen Liu, Zhibo Chen
Abstract: Most existing image restoration networks are designed in a disposable way and catastrophically forget previously learned distortions when trained on a new distortion removal task. To alleviate this problem, we raise the novel lifelong image restoration problem for blended distortions. We first design a base fork-join model in which multiple pre-trained expert models, each specializing in an individual distortion removal task, work cooperatively and adaptively to handle blended distortions. When the input is degraded by a new distortion, inspired by adult neurogenesis in the human memory system, we develop a neural growing strategy where the previously trained model can incorporate a new expert branch and continually accumulate new knowledge without interfering with learned knowledge. Experimental results show that the proposed approach can not only achieve state-of-the-art performance on blended-distortion removal tasks in both PSNR and SSIM, but also maintain old expertise while learning new restoration tasks.
60. Prevalence of Neural Collapse during the terminal phase of deep learning training
Vardan Papyan, X.Y. Han, David L. Donoho
Abstract: Modern practice for training classification deepnets involves a Terminal Phase of Training (TPT), which begins at the epoch where training error first vanishes; during TPT, the training error stays effectively zero while training loss is pushed towards zero. Direct measurements of TPT, for three prototypical deepnet architectures and across seven canonical classification datasets, expose a pervasive inductive bias we call Neural Collapse, involving four deeply interconnected phenomena: (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class-means; (NC2) The class-means collapse to the vertices of a Simplex Equiangular Tight Frame (ETF); (NC3) Up to rescaling, the last-layer classifiers collapse to the class-means, or in other words to the Simplex ETF, i.e. to a self-dual configuration; (NC4) For a given activation, the classifier's decision collapses to simply choosing whichever class has the closest train class-mean, i.e. the Nearest Class Center (NCC) decision rule. The symmetric and very simple geometry induced by the TPT confers important benefits, including better generalization performance, better robustness, and better interpretability.
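The geometry named in NC1, NC2 and NC4 can be made concrete with a small sketch. The code below builds the standard simplex ETF for C classes, sqrt(C/(C-1)) * (I - (1/C) 11^T), whose vertices have unit norm and pairwise cosine -1/(C-1), then measures within-class variability and applies the Nearest Class Center rule on synthetic features. All function names and the toy data are assumptions for illustration; nothing here reproduces the paper's measurements.

```python
import math


def simplex_etf(num_classes):
    """Rows are simplex ETF vertices: sqrt(C/(C-1)) * (I - (1/C) * ones)."""
    c = num_classes
    scale = math.sqrt(c / (c - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / c) for j in range(c)]
            for i in range(c)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def class_means(features, labels, num_classes):
    """Average the feature vectors of each class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, y in zip(features, labels):
        counts[y] += 1
        for k in range(dim):
            sums[y][k] += x[k]
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def within_class_variability(features, labels, means):
    """NC1-style quantity: mean squared distance of features to their class mean."""
    total = sum(sum((a - b) ** 2 for a, b in zip(x, means[y]))
                for x, y in zip(features, labels))
    return total / len(features)


def ncc_predict(x, means):
    """NC4-style rule: pick the class whose mean is nearest to x."""
    return min(range(len(means)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))


# Synthetic "nearly collapsed" features: two samples per class, each a small
# shift away from that class's ETF vertex.
etf = simplex_etf(4)
features, labels = [], []
for c, vertex in enumerate(etf):
    for shift in (-0.01, 0.01):
        features.append([v + shift for v in vertex])
        labels.append(c)
means = class_means(features, labels, 4)
variability = within_class_variability(features, labels, means)
```

As the shifts shrink toward zero, `variability` goes to zero (NC1) while the class means stay on the ETF vertices (NC2), and `ncc_predict` recovers every training label (NC4).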
61. Accelerated Zeroth-Order Momentum Methods from Mini to Minimax Optimization
Feihu Huang, Shangqian Gao, Jian Pei, Heng Huang
Abstract: In this paper, we propose a new accelerated zeroth-order momentum (Acc-ZOM) method to solve the non-convex stochastic mini-optimization problems. We prove that the Acc-ZOM method achieves a lower query complexity of $O(d^{3/4}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the best known result by a factor of $O(d^{1/4})$, where $d$ denotes the parameter dimension. Acc-ZOM does not require any batches, in contrast to the large batches required by existing zeroth-order stochastic algorithms. Further, we extend the Acc-ZOM method to solve the non-convex stochastic minimax-optimization problems and propose an accelerated zeroth-order momentum descent ascent (Acc-ZOMDA) method. We prove that the Acc-ZOMDA method reaches the best known query complexity of $\tilde{O}(\kappa_y^3(d_1+d_2)^{3/2}\epsilon^{-3})$ for finding an $\epsilon$-stationary point, where $d_1$ and $d_2$ denote the dimensions of the mini and max optimization parameters respectively and $\kappa_y$ is the condition number. In particular, our theoretical result does not rely on the large batches required by existing methods. Moreover, we propose a momentum-based accelerated framework for the minimax-optimization problems. At the same time, we present an accelerated momentum descent ascent (Acc-MDA) method for solving the white-box minimax problems, and prove that it achieves the best known gradient complexity of $\tilde{O}(\kappa_y^3\epsilon^{-3})$ without large batches. Extensive experimental results on black-box adversarial attacks on deep neural networks (DNNs) and on poisoning attacks demonstrate the efficiency of our algorithms.
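For readers unfamiliar with zeroth-order optimization, the sketch below shows the general idea the abstract builds on: estimate a gradient from function values alone, then feed the estimate into a momentum update. It is explicitly not Acc-ZOM, whose update rule the abstract does not give. It uses a generic two-point Gaussian-direction estimator with an exponential-moving-average momentum, and every name, step size and the toy quadratic objective are assumptions chosen for illustration.

```python
import random


def zo_gradient(f, x, mu, rng):
    """Two-point estimate of grad f(x) using only function queries.

    Samples a Gaussian direction u and scales it by the central-difference
    slope of f along u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [slope * ui for ui in u]


def zo_momentum_descent(f, x0, steps=500, lr=0.02, beta=0.9, mu=1e-3, seed=0):
    """Descent with an EMA momentum over zeroth-order gradient estimates.

    Each iteration issues exactly two function queries (no mini-batch).
    """
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = zo_gradient(f, x, mu, rng)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        x = [xi - lr * mi for xi, mi in zip(x, m)]
    return x


def quadratic(x):
    """Toy black-box objective with its minimum at (1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


x_start = [0.0] * 5
x_final = zo_momentum_descent(quadratic, x_start)
```

The optimizer never sees a derivative of `quadratic`; it only calls the function, which is the setting of black-box attacks mentioned in the abstract.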
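Note: the Chinese abstracts in the original digest were machine-translated. The cover image is a word cloud of the paper titles.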