Table of Contents
5. Beyond Disentangled Representations: An Attentive Angular Distillation Approach to Large-scale Lightweight Age-Invariant Face Recognition [PDF] Abstract
21. ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection [PDF] Abstract
35. Socioeconomic correlations of urban patterns inferred from aerial images: interpreting activation maps of Convolutional Neural Networks [PDF] Abstract
37. Latent regularization for feature selection using kernel methods in tumor classification [PDF] Abstract
Abstracts
1. Learning to Explore using Active Neural SLAM [PDF] Back to Contents
Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, Ruslan Salakhutdinov
Abstract: This work presents a modular and hierarchical approach to learn policies for exploring 3D environments, called 'Active Neural SLAM'. Our approach leverages the strengths of both classical and learning-based methods, by using analytical path planners with a learned SLAM module, and global and local policies. The use of learning provides flexibility with respect to input modalities (in the SLAM module), leverages structural regularities of the world (in global policies), and provides robustness to errors in state estimation (in local policies). Such use of learning within each module retains its benefits, while at the same time, hierarchical decomposition and modular training allow us to sidestep the high sample complexities associated with training end-to-end policies. Our experiments in visually and physically realistic simulated 3D environments demonstrate the effectiveness of our approach over past learning and geometry-based approaches. The proposed model can also be easily transferred to the PointGoal task and was the winning entry of the CVPR 2019 Habitat PointGoal Navigation Challenge.
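The modular decomposition described in the abstract lends itself to a short control-loop sketch. Everything below is a hypothetical simplification invented for illustration: the map shape, the module signatures (`neural_slam`, `global_policy`, `plan_path`, `local_policy`), and the timescales are stand-ins, not the interfaces of the released Active Neural SLAM code.

```python
import numpy as np

# Hypothetical stand-ins for the three learned modules and the analytical
# planner; the real components are neural networks and a classical planner.
def neural_slam(obs, map_state, pose):
    """Learned SLAM: update the spatial map and the pose estimate."""
    map_state[np.random.randint(64), np.random.randint(64)] = 1.0
    return map_state, pose + 0.01 * np.random.randn(3)

def global_policy(map_state, pose):
    """Learned global policy: pick a long-term exploration goal on the map."""
    free_cells = np.argwhere(map_state < 0.5)
    return free_cells[np.random.randint(len(free_cells))]

def plan_path(map_state, pose, goal):
    """Analytical path planner (e.g. A*/FMM); returns a short-term waypoint."""
    return goal

def local_policy(obs, waypoint):
    """Learned local policy: map observation + waypoint to a low-level action."""
    return np.random.choice(["forward", "turn_left", "turn_right"])

map_state, pose, goal = np.zeros((64, 64)), np.zeros(3), None
for t in range(100):
    obs = np.random.rand(3, 128, 128)   # RGB observation stub
    map_state, pose = neural_slam(obs, map_state, pose)
    if t % 25 == 0:                     # global policy acts on a coarse timescale
        goal = global_policy(map_state, pose)
    waypoint = plan_path(map_state, pose, goal)
    action = local_policy(obs, waypoint)  # executed in the environment
```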
2. Model Uncertainty Quantification for Reliable Deep Vision Structural Health Monitoring [PDF] Back to Contents
Seyed Omid Sajedi, Xiao Liang
Abstract: Computer vision leveraging deep learning has achieved significant success in the last decade. Despite the promising performance of the existing deep models in the recent literature, the extent of models' reliability remains unknown. Structural health monitoring (SHM) is a crucial task for the safety and sustainability of structures, and thus prediction mistakes can have fatal outcomes. This paper proposes Bayesian inference for deep vision SHM models where uncertainty can be quantified using the Monte Carlo dropout sampling. Three independent case studies for cracks, local damage identification, and bridge component detection are investigated using Bayesian inference. Aside from better prediction results, mean class softmax variance and entropy, the two uncertainty metrics, are shown to have good correlations with misclassifications. While the uncertainty metrics can be used to trigger human intervention and potentially improve prediction results, interpretation of uncertainty masks can be challenging. Therefore, surrogate models are introduced to take the uncertainty as input such that the performance can be further boosted. The proposed methodology in this paper can be applied to future deep vision SHM frameworks to incorporate model uncertainty in the inspection processes.
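A minimal sketch of the uncertainty quantification the abstract describes, assuming a standard PyTorch classifier that contains dropout layers: run several stochastic forward passes with dropout kept active and compute the two metrics named above, predictive entropy and mean class softmax variance.

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, n_samples=30):
    # Keep only dropout layers stochastic at test time (batch norm etc. stay in eval).
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)  # (batch, num_classes)
    # The two uncertainty metrics named in the abstract:
    entropy = -(mean_probs * (mean_probs + 1e-12).log()).sum(dim=-1)
    mean_class_softmax_variance = probs.var(dim=0).mean(dim=-1)
    return mean_probs, entropy, mean_class_softmax_variance
```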
3. Deep transfer learning for improving single-EEG arousal detection [PDF] Back to Contents
Alexander Neergaard Olesen, Poul Jennum, Emmanuel Mignot, Helge Bjarup Dissing Sorensen
Abstract: Datasets in sleep science present challenges for machine learning algorithms due to differences in recording setups across clinics. We investigate two deep transfer learning strategies for overcoming the channel mismatch problem for cases where two datasets do not contain exactly the same setup leading to degraded performance in single-EEG models. Specifically, we train a baseline model on multivariate polysomnography data and subsequently replace the first two layers to prepare the architecture for single-channel electroencephalography data. Using a fine-tuning strategy, our model yields similar performance to the baseline model (F1=0.682 and F1=0.694, respectively), and was significantly better than a comparable single-channel model. Our results are promising for researchers working with small databases who wish to use deep learning models pre-trained on larger databases.
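The layer-replacement strategy can be illustrated with a toy PyTorch model. The architecture below is a hypothetical stand-in (the paper's actual arousal detector is much larger); only the mechanics of swapping the first two layers for a single-channel input are the point.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in architecture for illustration only.
baseline = nn.Sequential(
    nn.Conv1d(12, 32, kernel_size=7, padding=3),  # layer 0: 12-channel PSG input
    nn.ReLU(),                                    # layer 1
    nn.Conv1d(32, 64, kernel_size=7, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(64, 2),                             # arousal / no arousal
)
# ... pretrain `baseline` on multivariate polysomnography data here ...

# Replace the first two layers so the network accepts single-channel EEG;
# all remaining layers keep their pretrained weights and are fine-tuned.
baseline[0] = nn.Conv1d(1, 32, kernel_size=7, padding=3)
baseline[1] = nn.ReLU()
optimizer = torch.optim.Adam(baseline.parameters(), lr=1e-4)  # fine-tuning
```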
4. MA^3: Model Agnostic Adversarial Augmentation for Few Shot learning [PDF] Back to Contents
Rohit Jena, Shirsendu Sukanta Halder, Katia Sycara
Abstract: Despite the recent developments in vision-related problems using deep neural networks, there still remains a wide scope in the improvement of generalizing these models to unseen examples. In this paper, we explore the domain of few-shot learning with a novel augmentation technique. In contrast to other generative augmentation techniques, where the distribution over input images are learnt, we propose to learn the probability distribution over the image transformation parameters which are easier and quicker to learn. Our technique is fully differentiable which enables its extension to versatile data-sets and base models. We evaluate our proposed method on multiple base-networks and 2 data-sets to establish the robustness and efficiency of this method. We obtain an improvement of nearly 4% by adding our augmentation module without making any change in network architectures. We also make the code readily available for usage by the community.
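A minimal sketch of learning a distribution over transformation parameters rather than over images, assuming a single Gaussian-distributed rotation angle trained with the reparameterization trick. The choice of transformation and the training objective here are assumptions; the paper's actual transformation set and adversarial objective may differ.

```python
import torch
import torch.nn.functional as F

# Learnable Gaussian over one transformation parameter (a rotation angle).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)

def sample_rotated(images):
    """Reparameterized angle sample applied as a differentiable rotation."""
    angle = mu + log_sigma.exp() * torch.randn(1)   # reparameterization trick
    cos, sin = torch.cos(angle), torch.sin(angle)
    theta = torch.stack([
        torch.cat([cos, -sin, torch.zeros(1)]),
        torch.cat([sin,  cos, torch.zeros(1)]),
    ]).unsqueeze(0).expand(images.size(0), -1, -1)  # (B, 2, 3) affine matrices
    grid = F.affine_grid(theta, images.size(), align_corners=False)
    return F.grid_sample(images, grid, align_corners=False)

# Gradients from any downstream loss flow back into mu and log_sigma.
augmented = sample_rotated(torch.randn(8, 3, 32, 32))
```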
5. Beyond Disentangled Representations: An Attentive Angular Distillation Approach to Large-scale Lightweight Age-Invariant Face Recognition [PDF] Back to Contents
Thanh-Dat Truong, Chi Nhan Duong, Kha Gia Quach, Dung Nguyen, Ngan Le, Khoa Luu, Tien D. Bui
Abstract: Disentangled representations have been commonly adopted to Age-invariant Face Recognition (AiFR) tasks. However, these methods have reached some limitations with (1) the requirement of large-scale face recognition (FR) training data with age labels, which is limited in practice; (2) heavy deep network architecture for high performance; and (3) their evaluations are usually taken place on age-related face databases while neglecting the standard large-scale FR databases to guarantee its robustness. This work presents a novel Attentive Angular Distillation (AAD) approach to Large-scale Lightweight AiFR that overcomes these limitations. Given two high-performance heavy networks as teachers with different specialized knowledge, AAD introduces a learning paradigm to efficiently distill the age-invariant attentive and angular knowledge from those teachers to a lightweight student network making it more powerful with higher FR accuracy and robust against age factor. Consequently, AAD approach is able to take the advantages of both FR datasets with and without age labels to train an AiFR model. Far apart from prior distillation methods mainly focusing on accuracy and compression ratios in closed-set problems, our AAD aims to solve the open-set problem, i.e. large-scale face recognition. Evaluations on LFW, IJB-B and IJB-C Janus, AgeDB and MegaFace-FGNet with one million distractors have demonstrated the efficiency of the proposed approach. This work also presents a new longitudinal face aging (LogiFace) database for further studies in age-related facial problems in future.
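A hedged sketch of the angular component of such a distillation objective: align the direction of the student's embedding with a frozen teacher's by maximizing cosine similarity. The full AAD method also involves an attentive component and two specialized teachers, which this single term does not capture.

```python
import torch
import torch.nn.functional as F

def angular_distillation_loss(student_feat, teacher_feat):
    """Generic angular distillation: penalize the angle between student and
    teacher embeddings via cosine similarity of L2-normalized features."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat.detach(), dim=-1)  # teacher is frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()
```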
6. ASL Recognition with Metric-Learning based Lightweight Network [PDF] Back to Contents
Evgeny Izutov
Abstract: In the past decades the set of human tasks that are solved by machines was extended dramatically. From simple image classification problems researchers now move towards solving more sophisticated and vital problems, like, autonomous driving and language translation. The case of language translation includes a challenging area of sign language translation that incorporates both image and language processing. We make a step in that direction by proposing a lightweight network for ASL gesture recognition with a performance sufficient for practical applications. The proposed solution demonstrates impressive robustness on MS-ASL dataset and in live mode for continuous sign gesture recognition scenario. Additionally, we describe how to combine action recognition model training with metric-learning to train the network on the database of limited size. The training code is available as part of Intel OpenVINO Training Extensions.
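One common way to combine classification training with metric learning is a cosine-softmax head over L2-normalized embeddings, sketched below. Whether this matches the exact loss used for the ASL model is an assumption, since the abstract does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Classify by scaled cosine similarity between L2-normalized embeddings
    and class weight vectors, so training shapes a metric embedding space."""
    def __init__(self, embed_dim, num_classes, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.scale = scale

    def forward(self, embeddings):
        logits = F.linear(F.normalize(embeddings, dim=-1),
                          F.normalize(self.weight, dim=-1))
        return self.scale * logits  # feed to cross-entropy as usual
```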
7. Hyperspectral Image Clustering with Spatially-Regularized Ultrametrics [PDF] Back to Contents
Shukun Zhang, James M. Murphy
Abstract: We propose a method for the unsupervised clustering of hyperspectral images based on spatially regularized spectral clustering with ultrametric path distances. The proposed method efficiently combines data density and geometry to distinguish between material classes in the data, without the need for training labels. The proposed method is efficient, with quasilinear scaling in the number of data points, and enjoys robust theoretical performance guarantees. Extensive experiments on synthetic and real HSI data demonstrate its strong performance compared to benchmark and state-of-the-art methods. In particular, the proposed method achieves not only excellent labeling accuracy, but also efficiently estimates the number of clusters.
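Ultrametric path distances can be built from minimax path distances, where the cost of a path is its largest edge weight. A dense NumPy sketch (O(n^3), for small graphs) is shown below; the paper additionally adds spatial regularization and feeds the distances into spectral clustering.

```python
import numpy as np

def minimax_path_distances(W):
    """d(i, j) = min over paths from i to j of the maximum edge weight along
    the path. Computed by a Floyd-Warshall-style min-max relaxation on a
    dense weight matrix W (np.inf for missing edges)."""
    D = W.copy()
    n = D.shape[0]
    for k in range(n):
        D = np.minimum(D, np.maximum(D[:, k:k + 1], D[k:k + 1, :]))
    return D

W = np.array([[0., 1., 9.],
              [1., 0., 2.],
              [9., 2., 0.]])
print(minimax_path_distances(W))  # d(0, 2) becomes 2 via the path 0-1-2
```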
8. Parsing-based View-aware Embedding Network for Vehicle Re-Identification [PDF] Back to Contents
Dechao Meng, Liang Li, Xuejing Liu, Yadong Li, Shijie Yang, Zhengjun Zha, Xingyu Gao, Shuhui Wang, Qingming Huang
Abstract: Vehicle Re-Identification is to find images of the same vehicle from various views in the cross-camera scenario. The main challenges of this task are the large intra-instance distance caused by different views and the subtle inter-instance discrepancy caused by similar vehicles. In this paper, we propose a parsing-based view-aware embedding network (PVEN) to achieve the view-aware feature alignment and enhancement for vehicle ReID. First, we introduce a parsing network to parse a vehicle into four different views, and then align the features by mask average pooling. Such alignment provides a fine-grained representation of the vehicle. Second, in order to enhance the view-aware features, we design a common-visible attention to focus on the common visible views, which not only shortens the distance among intra-instances, but also enlarges the discrepancy of inter-instances. The PVEN helps capture the stable discriminative information of vehicle under different views. The experiments conducted on three datasets show that our model outperforms state-of-the-art methods by a large margin.
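A minimal sketch of the mask average pooling step, assuming the parsing network outputs one soft mask per vehicle view; the shapes are illustrative.

```python
import torch

def mask_average_pooling(features, masks, eps=1e-6):
    """Pool a feature map into one vector per view by averaging only inside
    each view's parsing mask.
    features: (B, C, H, W); masks: (B, V, H, W) with V parsed views."""
    masks = masks.unsqueeze(2)                      # (B, V, 1, H, W)
    feats = features.unsqueeze(1)                   # (B, 1, C, H, W)
    summed = (feats * masks).sum(dim=(-2, -1))      # (B, V, C)
    area = masks.sum(dim=(-2, -1)).clamp(min=eps)   # (B, V, 1), avoid /0
    return summed / area                            # per-view feature vectors
```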
9. ModuleNet: Knowledge-inherited Neural Architecture Search [PDF] Back to Contents
Yaran Chen, Ruiyuan Gao, Fenggang Liu, Dongbin Zhao
Abstract: Although Neural Architecture Search (NAS) can bring improvement to deep models, they always neglect precious knowledge of existing models. The computation and time costing property in NAS also means that we should not start from scratch to search, but make every attempt to reuse the existing knowledge. In this paper, we discuss what kind of knowledge in a model can and should be used for new architecture design. Then, we propose a new NAS algorithm, namely ModuleNet, which can fully inherit knowledge from existing convolutional neural networks. To make full use of existing models, we decompose existing models into different modules which also keep their weights, consisting of a knowledge base. Then we sample and search for new architecture according to the knowledge base. Unlike previous search algorithms, and benefiting from inherited knowledge, our method is able to directly search for architectures in the macro space by NSGA-II algorithm without tuning parameters in these modules. Experiments show that our strategy can efficiently evaluate the performance of new architecture even without tuning weights in convolutional layers. With the help of knowledge we inherited, our search results can always achieve better performance on various datasets (CIFAR10, CIFAR100) over original architectures.
10. Robust Line Segments Matching via Graph Convolution Networks [PDF] Back to Contents
QuanMeng Ma, Guang Jiang, DianZhi Lai
Abstract: Line matching plays an essential role in structure from motion (SFM) and simultaneous localization and mapping (SLAM), especially in low-textured and repetitive scenes. In this paper, we present a new method of using a graph convolution network to match line segments in a pair of images, and we design a graph-based strategy of matching line segments with relaxing to an optimal transport problem. In contrast to hand-crafted line matching algorithms, our approach learns local line segment descriptor and the matching simultaneously through end-to-end training. The results show our method outperforms the state-of-the-art techniques, and especially, the recall is improved from 45.28% to 70.47% under a similar precision. The code of our work is available at this https URL GraphLineMatching
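Relaxing a matching problem to entropy-regularized optimal transport is typically solved with Sinkhorn iterations, sketched below with uniform marginals; the paper's exact formulation (for example, how unmatched segments are handled) may differ.

```python
import torch

def sinkhorn(scores, n_iters=50, eps=0.1):
    """Entropy-regularized optimal transport via Sinkhorn iterations.
    scores: (M, N) similarity matrix between line-segment descriptors from
    the two images; returns a soft (M, N) transport/assignment plan."""
    m, n = scores.shape
    K = (scores / eps).exp()           # Gibbs kernel
    r = torch.full((m,), 1.0 / m)      # uniform row marginals
    c = torch.full((n,), 1.0 / n)      # uniform column marginals
    v = torch.ones(n)
    for _ in range(n_iters):
        u = r / (K @ v)                # enforce row marginals
        v = c / (K.t() @ u)            # enforce column marginals
    return u.unsqueeze(1) * K * v.unsqueeze(0)
```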
11. Improved Residual Networks for Image and Video Recognition [PDF] Back to Contents
Ionut Cosmin Duta, Li Liu, Fan Zhu, Ling Shao
Abstract: Residual networks (ResNets) represent a powerful type of convolutional neural network (CNN) architecture, widely adopted and used in various tasks. In this work we propose an improved version of ResNets. Our proposed improvements address all three main components of a ResNet: the flow of information through the network layers, the residual building block, and the projection shortcut. We are able to show consistent improvements in accuracy and learning convergence over the baseline. For instance, on ImageNet dataset, using the ResNet with 50 layers, for top-1 accuracy we can report a 1.19% improvement over the baseline in one setting and around 2% boost in another. Importantly, these improvements are obtained without increasing the model complexity. Our proposed approach allows us to train extremely deep networks, while the baseline shows severe optimization issues. We report results on three tasks over six datasets: image classification (ImageNet, CIFAR-10 and CIFAR-100), object detection (COCO) and video action recognition (Kinetics-400 and Something-Something-v2). In the deep learning era, we establish a new milestone for the depth of a CNN. We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths. Code is available at: this https URL
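For reference, a standard residual block with a projection shortcut is sketched below. These are the baseline components (information flow, residual block, projection shortcut) that the paper modifies; since the abstract does not spell out the specific improvements, this is baseline code only, not the paper's method.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard residual block. The projection shortcut (a strided 1x1 conv)
    matches shapes when the spatial size or channel count changes."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```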
12. Spatiotemporal Fusion in 3D CNNs: A Probabilistic View [PDF] Back to Contents
Yizhou Zhou, Xiaoyan Sun, Chong Luo, Zheng-Jun Zha, Wenjun Zeng
Abstract: Despite the success in still image recognition, deep neural networks for spatiotemporal signal tasks (such as human action recognition in videos) still suffers from low efficacy and inefficiency over the past years. Recently, human experts have put more efforts into analyzing the importance of different components in 3D convolutional neural networks (3D CNNs) to design more powerful spatiotemporal learning backbones. Among many others, spatiotemporal fusion is one of the essentials. It controls how spatial and temporal signals are extracted at each layer during inference. Previous attempts usually start by ad-hoc designs that empirically combine certain convolutions and then draw conclusions based on the performance obtained by training the corresponding networks. These methods only support network-level analysis on limited number of fusion strategies. In this paper, we propose to convert the spatiotemporal fusion strategies into a probability space, which allows us to perform network-level evaluations of various fusion strategies without having to train them separately. Besides, we can also obtain fine-grained numerical information such as layer-level preference on spatiotemporal fusion within the probability space. Our approach greatly boosts the efficiency of analyzing spatiotemporal fusion. Based on the probability space, we further generate new fusion strategies which achieve the state-of-the-art performance on four well-known action recognition datasets.
13. Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [PDF] Back to Contents
Jiawei Liu, Xierong Zhu, Zheng-Jun Zha, Na Jiang
Abstract: Person re-identification aims at identifying a certain pedestrian across non-overlapping camera networks. Video-based re-identification approaches have gained significant attention recently, expanding image-based approaches by learning features from multiple frames. In this work, we propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos. It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions, towards learning discriminative pedestrian representation. Specifically, multiple co-saliency learning modules within CSTNet are designed to utilize the correlated information across video frames to extract the salient features from the task-relevant regions and suppress background interference. Moreover, multiple spatialtemporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies on such features and spatial-temporal information correlation, to enhance feature representation. Extensive experiments on two benchmarks have demonstrated the effectiveness of the proposed method.
14. SESAME: Semantic Editing of Scenes by Adding, Manipulating or Erasing Objects [PDF] Back to Contents
Evangelos Ntavelis, Andrés Romero, Iason Kastanis, Luc Van Gool, Radu Timofte
Abstract: Recent advances in image generation gave rise to powerful tools for semantic image editing. However, existing approaches can either operate on a single image or require an abundance of additional information. They are not capable of handling the complete set of editing operations, that is addition, manipulation or removal of semantic concepts. To address these limitations, we propose SESAME, a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects. In our setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels. In contrast to previous methods that employ a discriminator that trivially concatenates semantics and image as an input, the SESAME discriminator is composed of two input streams that independently process the image and its semantics, using the latter to manipulate the results of the former. We evaluate our model on a diverse set of datasets and report state-of-the-art performance on two tasks: (a) image manipulation and (b) image generation conditioned on semantic labels.
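A hedged sketch of the two-stream discriminator idea: the semantics stream modulates the image stream's features rather than being concatenated at the input. The layer sizes and the affine (SPADE-like) modulation form are assumptions; the abstract specifies only the two-stream design.

```python
import torch
import torch.nn as nn

class TwoStreamDiscriminator(nn.Module):
    """Separate image and semantics streams; semantic features manipulate
    the image features through a learned per-pixel affine map."""
    def __init__(self, num_classes, ch=64):
        super().__init__()
        self.img_stream = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.sem_stream = nn.Sequential(
            nn.Conv2d(num_classes, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.to_gamma = nn.Conv2d(ch, ch, 3, padding=1)
        self.to_beta = nn.Conv2d(ch, ch, 3, padding=1)
        self.head = nn.Conv2d(ch, 1, 3, padding=1)  # patch-level real/fake logits

    def forward(self, image, semantics_onehot):
        f_img = self.img_stream(image)
        f_sem = self.sem_stream(semantics_onehot)
        f = f_img * (1 + self.to_gamma(f_sem)) + self.to_beta(f_sem)
        return self.head(f)
```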
15. Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs? [PDF] Back to Contents
Hirokatsu Kataoka, Tenga Wakamiya, Kensho Hara, Yutaka Satoh
Abstract: How can we collect and use a video dataset to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? In order to positively answer this open question in video recognition, we have conducted an exploration study using a couple of large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs have been better than 3D CNNs in the context of video recognition. Recent studies revealed that 3D CNNs can outperform 2D CNNs trained on a large-scale video dataset. However, we heavily rely on architecture exploration instead of dataset consideration. Therefore, in the present paper, we conduct exploration study in order to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g., Kinetics-700) effectively pre-trains a video representation for a video classification task. (ii) We confirm the relationships between #category/#instance and video classification accuracy. The results show that #category should initially be fixed, and then #instance is increased on a video dataset in case of dataset construction. (iii) In order to practically extend a video dataset, we simply concatenate publicly available datasets, such as Kinetics-700 and Moments in Time (MiT) datasets. Compared with Kinetics-700 pre-training, we further enhance spatiotemporal 3D CNNs with the merged dataset, e.g., +0.9, +3.4, and +1.1 on UCF-101, HMDB-51, and ActivityNet datasets, respectively, in terms of fine-tuning. (iv) In terms of recognition architecture, the Kinetics-700 and merged dataset pre-trained models increase the recognition performance to 200 layers with the Residual Network (ResNet), while the Kinetics-400 pre-trained model cannot successfully optimize the 200-layer architecture.
16. Rephrasing visual questions by specifying the entropy of the answer distribution [PDF] Back to Contents
Kento Terao, Toru Tamaki, Bisser Raytchev, Kazufumi Kaneda, Shun'ichi Satoh
Abstract: Visual question answering (VQA) is a task of answering a visual question that is a pair of question and image. Some visual questions are ambiguous and some are clear, and it may be appropriate to change the ambiguity of questions from situation to situation. However, this issue has not been addressed by any prior work. We propose a novel task, rephrasing the questions by controlling the ambiguity of the questions. The ambiguity of a visual question is defined by the use of the entropy of the answer distribution predicted by a VQA model. The proposed model rephrases a source question given with an image so that the rephrased question has the ambiguity (or entropy) specified by users. We propose two learning strategies to train the proposed model with the VQA v2 dataset, which has no ambiguity information. We demonstrate the advantage of our approach that can control the ambiguity of the rephrased questions, and an interesting observation that it is harder to increase than to reduce ambiguity.
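The ambiguity measure is straightforward to compute from a VQA model's output logits; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def answer_entropy(logits):
    """Ambiguity of a visual question, measured as the Shannon entropy of the
    answer distribution predicted by a VQA model (higher = more ambiguous).
    logits: (batch, num_answers)."""
    p = F.softmax(logits, dim=-1)
    return -(p * (p + 1e-12).log()).sum(dim=-1)
```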
17. 3D IoU-Net: IoU Guided 3D Object Detector for Point Clouds [PDF] Back to Contents
Jiale Li, Shujie Luo, Ziqi Zhu, Hang Dai, Andrey S. Krylov, Yong Ding, Ling Shao
Abstract: Most existing point cloud based 3D object detectors focus on the tasks of classification and box regression. However, another bottleneck in this area is achieving an accurate detection confidence for the Non-Maximum Suppression (NMS) post-processing. In this paper, we add a 3D IoU prediction branch to the regular classification and regression branches. The predicted IoU is used as the detection confidence for NMS. In order to obtain a more accurate IoU prediction, we propose a 3D IoU-Net with IoU sensitive feature learning and an IoU alignment operation. To obtain a perspective-invariant prediction head, we propose an Attentive Corner Aggregation (ACA) module by aggregating a local point cloud feature from each perspective of eight corners and adaptively weighting the contribution of each perspective with different attentions. We propose a Corner Geometry Encoding (CGE) module for geometry information embedding. To the best of our knowledge, this is the first time geometric embedding information has been introduced in proposal feature learning. These two feature parts are then adaptively fused by a multi-layer perceptron (MLP) network as our IoU sensitive feature. The IoU alignment operation is introduced to resolve the mismatching between the bounding box regression head and IoU prediction, thereby further enhancing the accuracy of IoU prediction. The experimental results on the KITTI car detection benchmark show that 3D IoU-Net with IoU perception achieves state-of-the-art performance.
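A minimal sketch of using a predicted IoU as the NMS ranking confidence, shown on axis-aligned 2D boxes for brevity (the paper operates on 3D boxes and adds IoU-sensitive features and an IoU alignment step):

```python
import numpy as np

def iou_guided_nms(boxes, pred_ious, iou_thresh=0.5):
    """NMS that ranks candidates by a learned IoU prediction instead of the
    classification score. boxes: (N, 4) as (x1, y1, x2, y2)."""
    order = np.argsort(-pred_ious)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Pairwise IoU between box i and the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-12)
        order = rest[iou <= iou_thresh]  # suppress overlapping candidates
    return keep
```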
18. Boosting Semantic Human Matting with Coarse Annotations [PDF] 返回目录
Jinlin Liu, Yuan Yao, Wendi Hou, Miaomiao Cui, Xuansong Xie, Changshui Zhang, Xian-sheng Hua
Abstract: Semantic human matting aims to estimate the per-pixel opacity of the foreground human regions. It is quite challenging and usually requires user interactive trimaps and plenty of high quality annotated data. Annotating such data is labor intensive and requires skills beyond those of normal users, especially considering the very detailed hair regions of humans. In contrast, coarse annotated human data are much easier to acquire and collect from public datasets. In this paper, we propose to use coarse annotated data coupled with fine annotated data to boost end-to-end semantic human matting without trimaps as extra input. Specifically, we train a mask prediction network to estimate the coarse semantic mask using the hybrid data, and then propose a quality unification network to unify the quality of the previous coarse mask outputs. A matting refinement network takes the unified mask and the input image to predict the final alpha matte. The collected coarse annotated dataset enriches our data significantly and allows generating high quality alpha mattes for real images. Experimental results show that the proposed method performs comparably to state-of-the-art methods. Moreover, the proposed method can be used for refining coarse annotated public datasets, as well as the results of semantic segmentation methods, which greatly reduces the cost of annotating high quality human data.
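The following is a minimal sketch of how the three described stages might chain together: coarse mask, quality unification, then matting refinement. The tiny convolutional stand-ins and module names are placeholders for the paper's actual networks.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU())

class CoarseMattingPipeline(nn.Module):
    # Placeholder stand-ins for the three stages described in the
    # abstract; the real networks are far deeper.
    def __init__(self):
        super().__init__()
        self.mask_net = nn.Sequential(conv_block(3, 16), nn.Conv2d(16, 1, 1))
        self.unify_net = nn.Sequential(conv_block(1, 16), nn.Conv2d(16, 1, 1))
        self.refine_net = nn.Sequential(conv_block(4, 16), nn.Conv2d(16, 1, 1))

    def forward(self, image):
        coarse = torch.sigmoid(self.mask_net(image))      # coarse semantic mask
        unified = torch.sigmoid(self.unify_net(coarse))   # quality-unified mask
        alpha = torch.sigmoid(
            self.refine_net(torch.cat([image, unified], dim=1)))
        return alpha                                      # final alpha matte

alpha = CoarseMattingPipeline()(torch.rand(1, 3, 64, 64))
```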
19. Learning to Visually Navigate in Photorealistic Environments Without any Supervision [PDF] 返回目录
Lina Mezghani, Sainbayar Sukhbaatar, Arthur Szlam, Armand Joulin, Piotr Bojanowski
Abstract: Learning to navigate in a realistic setting where an agent must rely solely on visual inputs is a challenging task, in part because the lack of position information makes it difficult to provide supervision during training. In this paper, we introduce a novel approach for learning to navigate from image inputs without external supervision or reward. Our approach consists of three stages: learning a good representation of first-person views, then learning to explore using memory, and finally learning to navigate by setting its own goals. The model is trained with intrinsic rewards only so that it can be applied to any environment with image observations. We show the benefits of our approach by training an agent to navigate challenging photo-realistic environments from the Gibson dataset with RGB inputs only.
20. State-Relabeling Adversarial Active Learning [PDF] 返回目录
Beichen Zhang, Liang Li, Shijie Yang, Shuhui Wang, Zheng-Jun Zha, Qingming Huang
Abstract: Active learning aims to design label-efficient algorithms by sampling the most representative samples to be labeled by an oracle. In this paper, we propose a state relabeling adversarial active learning model (SRAAL) that leverages both the annotation and the labeled/unlabeled state information for deriving the most informative unlabeled samples. SRAAL consists of a representation generator and a state discriminator. The generator uses complementary annotation information together with traditional reconstruction information to generate a unified representation of samples, which embeds semantics into the whole data representation. Then, we design an online uncertainty indicator in the discriminator, which endows unlabeled samples with different importance. As a result, we can select the most informative samples based on the discriminator's predicted state. We also design an algorithm to initialize the labeled pool, which makes subsequent sampling more efficient. Experiments conducted on various datasets show that our model outperforms previous state-of-the-art active learning methods, and our initial sampling algorithm achieves better performance.
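A minimal sketch of the selection step this abstract describes, under the assumption that the state discriminator outputs a "labeled-ness" probability per unlabeled sample; lower scores mark the candidates that look least like the labeled pool and are therefore selected. All names are illustrative.

```python
import numpy as np

def select_most_informative(state_scores, budget):
    # state_scores: discriminator's predicted probability that each
    # unlabeled sample is "labeled"; low scores indicate samples that
    # look least like the labeled pool, i.e. the most informative ones.
    return np.argsort(state_scores)[:budget]

scores = np.random.rand(1000)              # stand-in discriminator outputs
to_annotate = select_most_informative(scores, budget=50)
```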
21. ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection [PDF] 返回目录
Yuxin Wang, Hongtao Xie, Zhengjun Zha, Mengting Xing, Zilong Fu, Yongdong Zhang
Abstract: Scene text detection has witnessed rapid development in recent years. However, there still exist two main challenges: 1) many methods suffer from false positives in their text representations; 2) the large scale variance of scene texts makes it hard for the network to learn samples. In this paper, we propose ContourNet, which effectively handles these two problems, taking a further step toward accurate arbitrary-shaped text detection. First, a scale-insensitive Adaptive Region Proposal Network (Adaptive-RPN) is proposed to generate text proposals by focusing only on the Intersection over Union (IoU) values between predicted and ground-truth bounding boxes. Then a novel Local Orthogonal Texture-aware Module (LOTM) models the local texture information of proposal features in two orthogonal directions and represents a text region with a set of contour points. Considering that strong unidirectional or weakly orthogonal activation is usually caused by the monotonous texture characteristic of false-positive patterns (e.g. streaks), our method effectively suppresses these false positives by only outputting predictions with high response values in both orthogonal directions. This gives a more accurate description of text regions. Extensive experiments on three challenging datasets (Total-Text, CTW1500 and ICDAR2015) verify that our method achieves state-of-the-art performance. Code is available at this https URL.
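The false-positive suppression step lends itself to a one-function sketch: a prediction survives only where both orthogonal response maps are high. The maps and threshold below are illustrative stand-ins for the LOTM outputs, not the paper's exact implementation.

```python
import numpy as np

def orthogonal_suppression(resp_h, resp_v, threshold=0.5):
    # A prediction survives only where both orthogonal response maps
    # are high; streak-like false positives activate mostly a single
    # direction and are removed.
    return (resp_h > threshold) & (resp_v > threshold)

resp_h = np.random.rand(128, 128)   # stand-ins for the two LOTM outputs
resp_v = np.random.rand(128, 128)
text_region_mask = orthogonal_suppression(resp_h, resp_v)
```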
22. Real-world Person Re-Identification via Degradation Invariance Learning [PDF] 返回目录
Yukun Huang, Zheng-Jun Zha, Xueyang Fu, Richang Hong, Liang Li
Abstract: Person re-identification (Re-ID) in real-world scenarios usually suffers from various degradation factors, e.g., low resolution, weak illumination, blurring and adverse weather. On the one hand, these degradations lead to severe loss of discriminative information, which significantly obstructs identity representation learning; on the other hand, the feature mismatch problem caused by low-level visual variations greatly reduces retrieval performance. An intuitive solution to this problem is to utilize low-level image restoration methods to improve image quality. However, existing restoration methods cannot be directly applied to real-world Re-ID due to various limitations, e.g., the requirement of reference samples, the domain gap between synthesis and reality, and the incompatibility between low-level and high-level methods. In this paper, to solve the above problem, we propose a degradation invariance learning framework for real-world person Re-ID. By introducing a self-supervised disentangled representation learning strategy, our method is able to simultaneously extract identity-related robust features and remove real-world degradations without extra supervision. We use low-resolution images as the main demonstration, and experiments show that our approach is able to achieve state-of-the-art performance on several Re-ID benchmarks. In addition, our framework can be easily extended to other real-world degradation factors, such as weak illumination, with only a few modifications.
23. Phase Consistent Ecological Domain Adaptation [PDF] 返回目录
Yanchao Yang, Dong Lao, Ganesh Sundaramoorthi, Stefano Soatto
Abstract: We introduce two criteria to regularize the optimization involved in learning a classifier in a domain where no annotated data are available, leveraging annotated data in a different domain, a problem known as unsupervised domain adaptation. We focus on the task of semantic segmentation, where annotated synthetic data are aplenty, but annotating real data is laborious. The first criterion, inspired by visual psychophysics, is that the map between the two image domains be phase-preserving. This restricts the set of possible learned maps, while enabling enough flexibility to transfer semantic information. The second criterion aims to leverage ecological statistics, or regularities in the scene which are manifest in any image of it, regardless of the characteristics of the illuminant or the imaging sensor. It is implemented using a deep neural network that scores the likelihood of each possible segmentation given a single un-annotated image. Incorporating these two priors in a standard domain adaptation framework improves performance across the board in the most common unsupervised domain adaptation benchmarks for semantic segmentation.
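A minimal sketch of what a phase-preserving constraint could look like in practice: penalizing the difference between the Fourier phases of two images. This is our own illustrative formulation, not the paper's exact criterion, and the shifted view below is only demo input.

```python
import torch

def phase_consistency_loss(x, y):
    # Penalize differences between the Fourier phases of image x and
    # its counterpart y; 1 - cos handles the 2*pi phase wrap-around.
    phase_x = torch.angle(torch.fft.fft2(x))
    phase_y = torch.angle(torch.fft.fft2(y))
    return (1.0 - torch.cos(phase_x - phase_y)).mean()

x = torch.rand(1, 3, 64, 64)
y = torch.roll(x, shifts=3, dims=-1)   # stand-in for a transformed view
loss = phase_consistency_loss(x, y)
```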
24. Deep Residual Correction Network for Partial Domain Adaptation [PDF] 返回目录
Shuang Li, Chi Harold Liu, Qiuxia Lin, Qi Wen, Limin Su, Gao Huang, Zhengming Ding
Abstract: Deep domain adaptation methods have achieved appealing performance by learning transferable representations from a well-labeled source domain to a different but related unlabeled target domain. Most existing works assume that source and target data share an identical label space, which is often difficult to satisfy in many real-world applications. With the emergence of big data, there is a more practical scenario called partial domain adaptation, where we always have access to a larger-scale source domain while working on a relatively small-scale target domain. In this case, the conventional domain adaptation assumption should be relaxed, and the target label space tends to be a subset of the source label space. Intuitively, reinforcing the positive effects of the most relevant source subclasses and reducing the negative impacts of irrelevant source subclasses are of vital importance to address the partial domain adaptation challenge. This paper proposes an efficiently-implemented Deep Residual Correction Network (DRCN) that plugs one residual block into the source network alongside the task-specific feature layer, which effectively enhances the adaptation from source to target and explicitly weakens the influence of the irrelevant source classes. Specifically, the plugged residual block, which consists of several fully-connected layers, deepens the base network and correspondingly boosts its feature representation capability. Moreover, we design a weighted class-wise domain alignment loss to couple the two domains by matching the feature distributions of shared classes between source and target. Comprehensive experiments on partial, traditional and fine-grained cross-domain visual recognition demonstrate that DRCN is superior to competitive deep domain adaptation approaches.
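A minimal sketch of the residual correction idea: a small fully-connected stack added residually on top of the task-specific feature layer, so features are corrected rather than replaced. Dimensions and depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualCorrectionBlock(nn.Module):
    # A few fully-connected layers whose output is added back to the
    # task-specific features, leaving the source network intact.
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.correct = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, features):
        return features + self.correct(features)

feats = torch.rand(32, 512)          # stand-in task-specific features
corrected = ResidualCorrectionBlock(512)(feats)
```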
25. Person Re-Identification via Active Hard Sample Mining [PDF] 返回目录
Xin Xu, Lei Liu, Weifeng Liu, Meng Wang, Ruimin Hu
Abstract: Annotating a large-scale image dataset is very tedious, yet necessary for training person re-identification models. To alleviate this problem, we present an active hard sample mining framework that trains an effective re-ID model with the least labeling effort. Considering that hard samples can provide informative patterns, we first formulate an uncertainty estimation to actively select hard samples to iteratively train a re-ID model from scratch. Then, intra-diversity estimation is designed to reduce redundant hard samples by maximizing their diversity. Moreover, we propose a computer-assisted identity recommendation module embedded in the active hard sample mining framework to help human annotators rapidly and accurately label the selected samples. Extensive experiments were carried out to demonstrate the effectiveness of our method on several public datasets. Experimental results indicate that our method can reduce annotation effort by 57%, 63%, and 49% on Market1501, MSMT17, and CUHK03, respectively, while maximizing the performance of the re-ID model.
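A minimal sketch combining the two selection ingredients described above, with predictive entropy standing in for the paper's uncertainty estimation and a greedy cosine-similarity filter standing in for intra-diversity estimation. Thresholds and names are assumptions.

```python
import numpy as np

def entropy(probs):
    # Predictive entropy as a simple uncertainty estimate.
    return -(probs * np.log(probs + 1e-9)).sum(axis=1)

def mine_hard_samples(probs, features, budget, sim_threshold=0.95):
    # Pick the most uncertain samples, skipping near-duplicates of
    # already selected ones to keep intra-diversity high.
    order = np.argsort(-entropy(probs))
    norm = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected = []
    for i in order:
        if len(selected) == budget:
            break
        if all(norm[i] @ norm[j] < sim_threshold for j in selected):
            selected.append(i)
    return selected

probs = np.random.dirichlet(np.ones(10), size=500)  # stand-in softmax outputs
feats = np.random.rand(500, 128)                    # stand-in embeddings
picked = mine_hard_samples(probs, feats, budget=20)
```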
26. Analyze and Development System with Multiple Biometric Identification [PDF] 返回目录
Sher Dadakhanov
Abstract: Owing to rapid technological development, identity theft and consumer fraud are increasing, and the threat to personal data grows every day. Methods developed earlier to protect personal information from theft were neither effective nor safe. Biometrics were introduced when technology was needed for more efficient security of personal information. Old-fashioned traditional approaches such as a Personal Identification Number (PIN), passwords, keys, and login IDs can be forgotten, stolen or lost. In a biometric authentication system, the user need not remember any passwords or carry any keys. Just as people recognize each other by physical appearance and behavioral characteristics, biometric systems use physical characteristics, such as fingerprints, facial recognition, and voice recognition, to distinguish between the actual user and a scammer. To increase safety, in 2005 biometric identification methods were developed for the government and business sectors, but today they have reached almost all private sectors, such as banking, finance, home security and protection, healthcare, and business security. Since the samples and templates of a biometric system based on a single biometric trait can be replaced and duplicated, the idea of merging multiple biometric identification technologies has led to so-called multimodal biometric recognition systems, which use two or more biometric data characteristics of an individual to decide whether that individual is a genuine user or not.
27. Spatial Priming for Detecting Human-Object Interactions [PDF] 返回目录
Ankan Bansal, Sai Saketh Rambhatla, Abhinav Shrivastava, Rama Chellappa
Abstract: The relative spatial layout of a human and an object is an important cue for determining how they interact. However, until now, spatial layout has been used only as side-information for detecting human-object interactions (HOIs). In this paper, we present a method for exploiting this spatial layout information for detecting HOIs in images. The proposed method consists of a layout module which primes a visual module to predict the type of interaction between a human and an object. The visual and layout modules share information through lateral connections at several stages. The model uses predictions from the layout module as a prior for the visual module, and the prediction from the visual module is given as the final output. It also incorporates semantic information about the object using word2vec vectors. The proposed model reaches an mAP of 24.79% on the HICO-Det dataset, which is about 2.8 absolute points higher than the current state-of-the-art.
28. Analysis on DeepLabV3+ Performance for Automatic Steel Defects Detection [PDF] 返回目录
Zheng Nie, Jiachen Xu, Shengchang Zhang
Abstract: Our work experiments with DeepLabV3+ using different backbones on a large volume of steel images, aiming to automatically detect different types of steel defects. Our method applies random weighted augmentation to balance the different defect types in the training set. We then applied the DeepLabV3+ model with three different backbones, ResNet, DenseNet and EfficientNet, to segment defect regions in the steel images. Based on the experiments, we found that using ResNet101 or EfficientNet as the backbone reaches the best IoU score on the test set, around 0.57, compared with 0.325 when using DenseNet. Also, the DeepLabV3+ model with ResNet101 as the backbone has the shortest training time.
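One common way to realize the class-balancing described here is inverse-frequency weighted sampling; the sketch below uses PyTorch's WeightedRandomSampler on stand-in data. This is an assumption about the mechanism, not the authors' exact augmentation scheme.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Stand-in dataset: images with one dominant defect-type label each.
images = torch.rand(1000, 3, 64, 64)
labels = torch.randint(0, 4, (1000,))          # 4 defect types
dataset = TensorDataset(images, labels)

# Inverse-frequency weights so rare defect types are drawn as often
# as common ones during training.
class_counts = torch.bincount(labels).float()
sample_weights = (1.0 / class_counts)[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)
```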
29. 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference [PDF] 返回目录
Mai Bui, Tolga Birdal, Haowen Deng, Shadi Albarqouni, Leonidas Guibas, Slobodan Ilic, Nassir Navab
Abstract: We present a multimodal camera relocalization framework that captures ambiguities and uncertainties with continuous mixture models defined on the manifold of camera poses. In highly ambiguous environments, which can easily arise due to symmetries and repetitive structures in the scene, computing one plausible solution (what most state-of-the-art methods currently regress) may not be sufficient. Instead we predict multiple camera pose hypotheses as well as the respective uncertainty for each prediction. Towards this aim, we use Bingham distributions to model the orientation of the camera pose, and a multivariate Gaussian to model the position, with an end-to-end deep neural network. By incorporating a Winner-Takes-All training scheme, we finally obtain a mixture model that is well suited for explaining ambiguities in the scene, yet does not suffer from mode collapse, a common problem with mixture density networks. We introduce a new dataset specifically designed to foster camera localization research in ambiguous environments and exhaustively evaluate our method on synthetic as well as real data, on both ambiguous scenes and non-ambiguous benchmark datasets. We plan to release our code and dataset under $\href{this https URL}{this http URL}$.
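The Winner-Takes-All scheme reduces to a small loss function: of K pose hypotheses, only the one closest to the ground truth receives a gradient, letting the others specialize on different plausible modes. The sketch below uses a plain squared error for brevity, whereas the paper works with Bingham and Gaussian likelihoods.

```python
import torch

def winner_takes_all_loss(pred_poses, gt_pose):
    # pred_poses: (K, D) hypotheses; only the hypothesis closest to
    # the ground truth is penalized, so the rest stay free to cover
    # other plausible modes.
    errors = ((pred_poses - gt_pose) ** 2).sum(dim=1)
    return errors.min()

hypotheses = torch.rand(5, 7, requires_grad=True)  # 5 hypotheses (xyz + quaternion)
gt = torch.rand(7)
loss = winner_takes_all_loss(hypotheses, gt)
loss.backward()
```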
30. D-SRGAN: DEM Super-Resolution with Generative Adversarial Network [PDF] 返回目录
Bekir Z Demiray, Muhammed Sit, Ibrahim Demir
Abstract: LIDAR (light detection and ranging) is an optical remote-sensing technique that measures the distance between sensor and object, and the energy reflected from the object. Over the years, LIDAR data has been used as the primary source of Digital Elevation Models (DEMs). DEMs have been used in a variety of applications like road extraction, hydrological modeling, flood mapping, and surface analysis. A number of studies on flooding suggest that using high-resolution DEMs as inputs improves the overall reliability and accuracy of these applications. Despite the importance of high-resolution DEMs, many areas in the United States and around the world do not have access to them due to technological limitations or the cost of data collection. With recent developments in Graphics Processing Units (GPUs) and novel algorithms, deep learning techniques have become attractive to researchers for their performance in learning features from high-resolution datasets. Numerous new methods have been proposed, such as Generative Adversarial Networks (GANs), to create intelligent models that correct and augment large-scale datasets. In this paper, a GAN-based model is developed and evaluated, inspired by single image super-resolution methods, to increase the spatial resolution of a given DEM dataset up to 4 times without additional information related to the data.
31. Early Disease Diagnosis for Rice Crop [PDF] 返回目录
M. Hammad Masood, Habiba Saim, Murtaza Taj, Mian M. Awais
Abstract: Many existing techniques provide automatic estimation of crop damage due to various diseases. However, early detection can prevent or reduce the extent of the damage itself. The limited performance of existing techniques in early detection is due to a lack of localized information. We instead propose a dataset with annotations for each diseased segment in each image. Unlike existing approaches, instead of classifying images as either healthy or diseased, we propose to provide localized classification for each segment of an image. Our method is based on Mask RCNN and provides the location as well as the extent of infected regions on the plant. Thus the extent of damage to the crop can be estimated. Our method obtained an overall accuracy of 87.6% on the proposed dataset, compared to 58.4% obtained without incorporating localized information.
32. CNN Encoder to Reduce the Dimensionality of Data Image for Motion Planning [PDF] 返回目录
Janderson Ferreira, Agostinho A. F. Júnior, Yves M. Galvão, Bruno J. T. Fernandes, Pablo Barros
Abstract: Many real-world applications need path planning algorithms to solve tasks in different areas, such as social applications, autonomous cars, tracking activities, and, most importantly, motion planning. Although the use of path planning is sufficient in most motion planning scenarios, it represents a potential bottleneck in large environments with dynamic changes. To tackle this problem, the number of possible routes can be reduced to make it easier for path planning algorithms to find the shortest path with less effort. A traditional algorithm for path planning is A*, which uses a heuristic to work faster than other solutions. In this work, we propose a CNN encoder capable of eliminating useless routes for motion planning problems; we then combine the proposed neural network output with A*. To measure the efficiency of our solution, we propose a database with different scenarios of motion planning problems. The evaluated metric is the number of iterations needed to find the shortest path. Plain A* was compared with the proposed CNN encoder combined with A*. In all evaluated scenarios, our solution reduced the number of iterations by more than 60%.
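A minimal sketch of the combination: plain grid A* whose expandable cells are pruned by a mask standing in for the CNN encoder's output, with the iteration count as the evaluation metric. Grid, mask, and heuristic are all illustrative.

```python
import heapq
import numpy as np

def astar(grid_free, start, goal):
    # Plain A* on a 4-connected grid; grid_free[r, c] is True for
    # cells the planner may expand. Returns (path cost, iterations).
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_set = [(h(start), 0, start)]
    best_g = {start: 0}
    iterations = 0
    while open_set:
        _, g, node = heapq.heappop(open_set)
        iterations += 1
        if node == goal:
            return g, iterations
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if (0 <= nxt[0] < grid_free.shape[0]
                    and 0 <= nxt[1] < grid_free.shape[1]
                    and grid_free[nxt]
                    and g + 1 < best_g.get(nxt, np.inf)):
                best_g[nxt] = g + 1
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, nxt))
    return None, iterations

free = np.ones((32, 32), dtype=bool)     # obstacle map (all free here)
encoder_mask = np.ones_like(free)        # stand-in for the CNN encoder output
encoder_mask[:, 20:] = False             # pretend the encoder ruled out a region
cost, iters = astar(free & encoder_mask, (0, 0), (31, 10))
```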
33. Weakly supervised multiple instance learning histopathological tumor segmentation [PDF] 返回目录
Marvin Lerousseau, Maria Vakalopoulou, Marion Classe, Julien Adam, Enzo Battistella, Alexandre Carré, Théo Estienne, Théophraste Henry, Eric Deutsch, Nikos Paragios
Abstract: Histopathological image segmentation is a challenging and important topic in medical imaging, with tremendous potential impact on clinical practice. State-of-the-art methods rely on hand-crafted annotations, which reduce the scope of the solutions, since digital histology suffers from a lack of standardization and samples differ significantly between cancer phenotypes. To this end, in this paper, we propose a weakly supervised framework relying on weak, standard clinical practice annotations available in most medical centers. In particular, we exploit a multiple instance learning scheme that provides a label for each instance, establishing a detailed segmentation of whole slide images. The potential of the framework is assessed with multi-centric data experiments using The Cancer Genome Atlas repository and the publicly available PatchCamelyon dataset. Promising results when compared with experts' annotations demonstrate the potential of our approach.
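A minimal sketch of one common multiple instance learning formulation consistent with this description: per-patch logits from a slide are pooled (here, the mean of the top-k) into a bag prediction supervised by the weak slide-level label. The pooling choice and k are assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def mil_bag_loss(instance_logits, bag_label, k=16):
    # instance_logits: per-patch scores from one whole-slide image.
    # The slide-level (weak) label supervises the average of the
    # top-k instance predictions, a common MIL pooling choice.
    topk = instance_logits.topk(min(k, instance_logits.numel())).values
    bag_logit = topk.mean()
    return F.binary_cross_entropy_with_logits(bag_logit, bag_label)

patch_logits = torch.randn(500, requires_grad=True)  # 500 patches of one slide
label = torch.tensor(1.0)                            # slide contains tumor
loss = mil_bag_loss(patch_logits, label)
loss.backward()
```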
34. Multimodal Categorization of Crisis Events in Social Media [PDF] 返回目录
Mahdi Abavisani, Liwei Wu, Shengli Hu, Joel Tetreault, Alejandro Jaimes
Abstract: Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around the world in real-time. Emergency response is one such area that stands to gain from these advances. By processing billions of texts and images a minute, events can be automatically detected to enable emergency response workers to better assess rapidly evolving situations and deploy resources accordingly. To date, most event detection techniques in this area have focused on image-only or text-only approaches, limiting detection performance and impacting the quality of information delivered to crisis response teams. In this paper, we present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis. In addition, we employ a multimodal graph-based approach to stochastically transition between embeddings of different multimodal pairs during training, to better regularize the learning process as well as to deal with limited training data by constructing new matched pairs from different samples. We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.
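A minimal sketch of cross-modal attention used as a filter: one modality queries the other, and poorly aligned tokens of a weak modality receive low attention weight before fusion. The module below is a generic stand-in, not the paper's exact architecture; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossAttentionFilter(nn.Module):
    # Text tokens query image tokens; image content that fails to
    # align with the text receives low attention and is down-weighted
    # in the fused output.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tokens, image_tokens):
        filtered, weights = self.attn(query=text_tokens,
                                      key=image_tokens,
                                      value=image_tokens)
        return filtered, weights   # weights expose per-sample filtering

text = torch.rand(8, 12, 256)      # 8 samples, 12 text tokens each
image = torch.rand(8, 49, 256)     # 8 samples, 7x7 image patches each
fused, attn = CrossAttentionFilter()(text, image)
```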
35. Socioeconomic correlations of urban patterns inferred from aerial images: interpreting activation maps of Convolutional Neural Networks [PDF] 返回目录
Jacob Levy Abitbol, Márton Karsai
Abstract: Urbanisation is a great challenge for modern societies, promising better access to economic opportunities while widening socioeconomic inequalities. Accurately tracking how this process unfolds has been challenging for traditional data collection methods, while remote sensing information offers an alternative to gather a more complete view on these societal changes. By feeding a neural network with satellite images one may recover the socioeconomic information associated with that area; however, these models cannot explain how the visual features contained in a sample trigger a given prediction. Here we close this gap by predicting socioeconomic status across France from aerial images and interpreting class activation mappings in terms of urban topology. We show that the model disregards the spatial correlations existing between urban class and socioeconomic status to derive its predictions. These results pave the way to build interpretable models, which may help to better track and understand urbanisation and its consequences.
摘要:城市化是现代社会面临的一大挑战:它有望带来更好的经济机会,同时也加剧了社会经济不平等。对传统数据采集方法而言,准确追踪这一进程的演变颇具难度,而遥感信息提供了一种替代途径,可以更全面地观察这些社会变化。将卫星图像输入神经网络可以还原对应区域的社会经济信息,但这类模型无法解释样本中包含的哪些视觉特征触发了给定的预测。本文通过基于航拍图像预测法国各地的社会经济地位,并从城市拓扑的角度解读类激活映射(class activation mapping),弥补了这一空白。我们发现,该模型在得出预测时并未利用城市类别与社会经济地位之间既有的空间相关性。这些结果为构建可解释模型铺平了道路,有助于更好地追踪和理解城市化进程及其后果。
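For readers unfamiliar with the interpretation step, here is a minimal class activation mapping (CAM) sketch: the final-layer class weights are projected onto the last convolutional feature maps to highlight which regions of an aerial image drive a prediction. The toy backbone and class count are hypothetical stand-ins, not the paper's model.

```python
# Minimal CAM computation over a toy CNN (illustrative stand-in).
import torch
import torch.nn as nn

backbone = nn.Sequential(                     # toy CNN: (B,3,H,W) -> (B,64,H,W)
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
classifier = nn.Linear(64, 5)                 # e.g. 5 socioeconomic classes

x = torch.randn(1, 3, 224, 224)               # stand-in aerial image
feats = backbone(x)                           # (1, 64, 224, 224)
logits = classifier(feats.mean(dim=(2, 3)))   # global-average-pooling head
cls = logits.argmax(dim=1).item()

# CAM: weighted sum of feature maps using the predicted class's weights.
w = classifier.weight[cls]                    # (64,)
cam = torch.einsum("c,bchw->bhw", w, feats)   # (1, 224, 224)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0,1]
```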
36. MRQy: An Open-Source Tool for Quality Control of MR Imaging Data [PDF] 返回目录
Amir Reza Sadri, Andrew Janowczyk, Ren Zhou, Ruchika Verma, Jacob Antunes, Anant Madabhushi, Pallavi Tiwari, Satish E. Viswanath
Abstract: Even as public data repositories such as The Cancer Imaging Archive (TCIA) have enabled development of new radiomics and machine learning schemes, a key concern remains the generalizability of these methods to unseen datasets. For MRI datasets, model performance could be impacted by (a) site- or scanner-specific variations in image resolution, field-of-view, or image contrast, or (b) presence of imaging artifacts such as noise, motion, inhomogeneity, ringing, or aliasing; which can adversely affect relative image quality between data cohorts. This indicates a need for a quantitative tool to quickly determine relative differences in MRI volumes both within and between large data cohorts. We present MRQy, a new open-source quality control tool to (a) interrogate MRI cohorts for site- or equipment-based differences, and (b) quantify the impact of MRI artifacts on relative image quality; to help determine how to correct for these variations prior to model development. MRQy extracts a series of quality measures (e.g. noise ratios, variation metrics, entropy and energy criteria) and MR image metadata (e.g. voxel resolution, image dimensions) for subsequent interrogation via a specialized HTML5 based front-end designed for real-time filtering and trend visualization. MRQy is designed to be a standalone, unsupervised tool that can be efficiently run on a standard desktop computer. It has been made freely accessible at this http URL for wider community use and feedback. MRQy was used to evaluate (a) n=133 brain MRIs from TCIA (7 sites), and (b) n=104 rectal MRIs (3 local sites). MRQy measures revealed significant site-specific variations in both cohorts, indicating potential batch effects. Marked differences in specific MRQy measures were also able to identify MRI datasets that needed to be corrected for common MR imaging artifacts.
摘要:尽管癌症影像档案(The Cancer Imaging Archive, TCIA)等公共数据仓库推动了新的影像组学(radiomics)和机器学习方案的发展,但这些方法对未见数据集的泛化能力仍是一个关键问题。对于 MRI 数据集,模型性能可能受到以下因素影响:(a)图像分辨率、视野或图像对比度上与站点或扫描仪相关的差异;(b)噪声、运动、不均匀性、振铃或混叠等成像伪影。这些因素都会损害数据队列之间的相对图像质量,因此需要一种定量工具,以快速确定大型数据队列内部及队列之间 MRI 体数据的相对差异。我们提出 MRQy,一个新的开源质量控制工具,用于(a)检查 MRI 队列中与站点或设备相关的差异,以及(b)量化 MRI 伪影对相对图像质量的影响,从而帮助确定在模型开发之前如何校正这些差异。MRQy 提取一系列质量度量(如噪声比、变异度量、熵和能量准则)以及 MR 图像元数据(如体素分辨率、图像尺寸),并通过一个为实时筛选和趋势可视化而设计的基于 HTML5 的专用前端进行后续分析。MRQy 是一个独立的无监督工具,可在标准台式计算机上高效运行,并已通过文中给出的 http 链接免费开放,供更广泛的社区使用和反馈。我们使用 MRQy 评估了(a)来自 TCIA 的 133 例脑部 MRI(7 个站点)和(b)104 例直肠 MRI(3 个本地站点)。MRQy 度量在两个队列中均显示出显著的站点特异性差异,表明存在潜在的批次效应。特定 MRQy 度量上的明显差异还能够识别出需要针对常见 MR 成像伪影进行校正的 MRI 数据集。
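As a toy illustration of the kind of per-volume quality measures such a QC tool computes, the sketch below derives a rough signal-to-noise ratio, intensity entropy, and coefficient of variation from a 3D array standing in for an MRI volume. This is not MRQy's code; see the paper and repository for the actual measure definitions.

```python
# Toy per-volume QC measures (illustrative only, not MRQy's implementation).
import numpy as np

def qc_measures(vol: np.ndarray) -> dict:
    """Compute a few simple quality measures for a 3D volume."""
    fg = vol[vol > vol.mean()]            # crude foreground mask
    bg = vol[vol <= vol.mean()]           # crude background mask
    snr = fg.mean() / (bg.std() + 1e-8)   # rough signal-to-noise ratio
    hist, _ = np.histogram(vol, bins=256)
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-(p * np.log2(p)).sum())  # intensity entropy
    return {"snr": float(snr),
            "entropy": entropy,
            "cv": float(fg.std() / (fg.mean() + 1e-8))}  # coeff. of variation

volume = np.random.rand(64, 64, 32)       # stand-in MRI volume
print(qc_measures(volume))
```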
37. Latent regularization for feature selection using kernel methods in tumor classification [PDF] 返回目录
Martin Palazzo, Patricio Yankilevich, Pierre Beauseroy
Abstract: The transcriptomes of cancer tumors are characterized by tens of thousands of gene expression features. Patient prognosis or tumor stage can be assessed by machine learning techniques such as supervised classification, given a gene expression profile. Feature selection is a useful approach to select the key genes which help to classify tumors. In this work we propose a feature selection method based on Multiple Kernel Learning that results in a reduced subset of genes and a custom kernel that improves the classification performance when used in support vector classification. During the feature selection process this method performs a novel latent regularisation, relaxing the supervised target problem by introducing unsupervised structure obtained from the latent space learned by a non-linear dimensionality reduction model. An improvement of the generalization capacity is obtained and assessed by the tumor classification performance on new unseen test samples when the classifier is trained with the features selected by the proposed method, in comparison with other supervised feature selection approaches.
摘要:癌症肿瘤的转录组以数以万计的基因表达特征为表征。在给定基因表达谱的情况下,可以通过有监督分类等机器学习技术评估患者预后或肿瘤分期。特征选择是筛选有助于肿瘤分类的关键基因的有效手段。本文提出一种基于多核学习(Multiple Kernel Learning)的特征选择方法,它能得到一个精简的基因子集以及一个定制核,该核在用于支持向量分类时可提升分类性能。在特征选择过程中,该方法通过引入由非线性降维模型学得的潜在空间中的无监督结构来放松有监督目标问题,从而实现一种新颖的潜在正则化。当分类器使用该方法所选特征进行训练时,在全新未见测试样本上的肿瘤分类性能相比其他有监督特征选择方法有所提升,由此验证了泛化能力的改善。
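A simplified stand-in for the per-feature-kernel idea above: score each gene's one-feature RBF kernel by its centered alignment with the label kernel, keep the top genes, and feed the summed kernel to an SVM. The paper's method additionally learns kernel weights and applies the latent-space regularizer; this sketch substitutes plain kernel-target alignment and unweighted summation, so it only illustrates the general shape of the approach.

```python
# MKL-flavored feature selection sketch via kernel-target alignment
# (simplified stand-in, not the paper's optimization).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))            # 100 tumors x 500 genes (toy data)
y = rng.integers(0, 2, size=100) * 2 - 1   # labels in {-1, +1}

def rbf_1d(col, gamma=1.0):
    d = col[:, None] - col[None, :]
    return np.exp(-gamma * d ** 2)         # kernel from a single gene

def alignment(K, y):
    Ky = np.outer(y, y)
    K = K - K.mean(0, keepdims=True) - K.mean(1, keepdims=True) + K.mean()
    return (K * Ky).sum() / (np.linalg.norm(K) * np.linalg.norm(Ky) + 1e-12)

scores = np.array([alignment(rbf_1d(X[:, j]), y) for j in range(X.shape[1])])
selected = np.argsort(scores)[-20:]        # keep the 20 best-aligned genes
K = sum(rbf_1d(X[:, j]) for j in selected) # combined (unweighted) kernel
clf = SVC(kernel="precomputed").fit(K, y)
print("train accuracy:", clf.score(K, y))
```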
38. Exemplar VAEs for Exemplar based Generation and Data Augmentation [PDF] 返回目录
Sajad Norouzi, David J. Fleet, Mohammad Norouzi
Abstract: This paper presents a framework for exemplar based generative modeling, featuring Exemplar VAEs. To generate a sample from the Exemplar VAE, one first draws a random exemplar from a training dataset, and then stochastically transforms that exemplar into a latent code, which is then used to generate a new observation. We show that the Exemplar VAE can be interpreted as a VAE with a mixture of Gaussians prior in the latent space, with Gaussian means defined by the latent encoding of the exemplars. To enable optimization and avoid overfitting, Exemplar VAE's parameters are learned using leave-one-out and exemplar subsampling, where, for the generation of each data point, we build a prior based on a random subset of the remaining data points. To accelerate learning, which requires finding the exemplars that exert the greatest influence on the generation of each data point, we use approximate nearest neighbor search in the latent space, yielding a lower bound on the log marginal likelihood. Experiments demonstrate the effectiveness of Exemplar VAEs in density estimation, representation learning, and generative data augmentation for supervised learning.
摘要:本文提出一个基于范例(exemplar)的生成建模框架,其核心是 Exemplar VAE。要从 Exemplar VAE 生成样本,首先从训练集中随机抽取一个范例,然后将该范例随机变换为潜在编码,再用该编码生成新的观测。我们证明,Exemplar VAE 可以解释为潜在空间先验为高斯混合的 VAE,其中各高斯分量的均值由范例的潜在编码定义。为了便于优化并避免过拟合,Exemplar VAE 的参数通过留一法和范例二次抽样进行学习:在生成每个数据点时,先验基于其余数据点的一个随机子集构建。为了加速学习(这需要找出对每个数据点的生成影响最大的范例),我们在潜在空间中使用近似最近邻搜索,得到对数边际似然的一个下界。实验证明了 Exemplar VAE 在密度估计、表示学习以及用于监督学习的生成式数据增强方面的有效性。
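The three-step sampling procedure the abstract describes (draw an exemplar, encode it stochastically, decode the latent code) maps directly onto a few lines of code. The sketch below follows those steps with hypothetical linear encoder/decoder stand-ins, not the paper's architecture.

```python
# Exemplar VAE sampling procedure sketch (toy encoder/decoder).
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 32)   # -> (mu, log_var), latent dim 32 (hypothetical)
dec = nn.Linear(32, 784)

def sample_from_exemplar_vae(train_x: torch.Tensor) -> torch.Tensor:
    x = train_x[torch.randint(len(train_x), (1,))]   # 1) draw random exemplar
    mu, log_var = enc(x).chunk(2, dim=-1)            # 2) stochastic encoding
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
    return torch.sigmoid(dec(z))                     # 3) decode new observation

train_x = torch.rand(1000, 784)   # stand-in training set
new_x = sample_from_exemplar_vae(train_x)
print(new_x.shape)  # torch.Size([1, 784])
```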
39. Capsules for Biomedical Image Segmentation [PDF] 返回目录
Rodney LaLonde, Ziyue Xu, Sanjay Jain, Ulas Bagci
Abstract: Our work expands the use of capsule networks to the task of object segmentation for the first time in the literature. This is made possible via the introduction of locally-constrained routing and transformation matrix sharing, which reduces the parameter/memory burden and allows for the segmentation of objects at large resolutions. To compensate for the loss of global information in constraining the routing, we propose the concept of "deconvolutional" capsules to create a deep encoder-decoder style network, called SegCaps. We extend the masked reconstruction regularization to the task of segmentation and perform thorough ablation experiments on each component of our method. The proposed convolutional-deconvolutional capsule network, SegCaps, shows state-of-the-art results while using a fraction of the parameters of popular segmentation networks. To validate our proposed method, we perform the largest-scale study in pathological lung segmentation in the literature, where we conduct experiments across five extremely challenging datasets, containing both clinical and pre-clinical subjects, and nearly 2000 computed-tomography scans. Our newly developed segmentation platform outperforms other methods across all datasets while utilizing 95% fewer parameters than the popular U-Net for biomedical image segmentation. We also provide proof-of-concept results on thin, tree-like structures in retinal imagery as well as demonstrate capsules' handling of rotations/reflections on natural images.
摘要:我们的工作在文献中首次将胶囊网络的应用扩展到目标分割任务。这一扩展得益于局部受限路由和变换矩阵共享的引入,二者降低了参数量与内存负担,使得在大分辨率下分割目标成为可能。为了弥补约束路由所造成的全局信息损失,我们提出"反卷积"胶囊的概念,构建了一个名为 SegCaps 的深度编码器-解码器网络。我们将掩码重建正则化推广到分割任务,并对方法的各个组件进行了细致的消融实验。所提出的卷积-反卷积胶囊网络 SegCaps 在仅使用主流分割网络一小部分参数的情况下,取得了最先进的结果。为了验证所提方法,我们开展了文献中规模最大的病理肺分割研究,在五个极具挑战性的数据集上进行实验,涵盖临床和临床前受试对象以及近 2000 次 CT 扫描。我们新开发的分割平台在所有数据集上均优于其他方法,同时比生物医学图像分割中常用的 U-Net 少用 95% 的参数。我们还给出了在视网膜图像中细小树状结构上的概念验证结果,并展示了胶囊对自然图像旋转/翻转的处理能力。
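To make the "routing" in this abstract concrete, here is a minimal routing-by-agreement sketch in the style of the original capsule-network paper (Sabour et al.). SegCaps additionally constrains routing to a local window and shares transformation matrices across positions; this global toy version only shows the agreement iterations themselves.

```python
# Dynamic routing-by-agreement sketch (global toy version, not SegCaps).
import numpy as np

def squash(s, axis=-1):
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1 + n2)) * s / np.sqrt(n2 + 1e-8)

def route(u_hat, iters=3):
    # u_hat: (num_child, num_parent, dim) predictions from child capsules
    b = np.zeros(u_hat.shape[:2])                        # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(1, keepdims=True)  # softmax over parents
        v = squash((c[..., None] * u_hat).sum(0))        # (num_parent, dim)
        b = b + (u_hat * v[None]).sum(-1)                # agreement update
    return v

v = route(np.random.randn(32, 10, 16))
print(v.shape)  # (10, 16)
```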
注:中文为机器翻译结果!