Table of Contents
5. Semi-Supervised Semantic Segmentation in Earth Observation: The MiniFrance Suite, Dataset Analysis and Multi-task Network Study [PDF] Abstract
6. Interpretation of Swedish Sign Language using Convolutional Neural Networks and Transfer Learning [PDF] Abstract
13. Unsupervised Learning of Depth and Ego-Motion from Cylindrical Panoramic Video with Applications for Virtual Reality [PDF] Abstract
14. Integrating Coarse Granularity Part-level Features with Supervised Global-level Features for Person Re-identification [PDF] Abstract
18. THIN: THrowable Information Networks and Application for Facial Expression Recognition In The Wild [PDF] Abstract
23. FOSS: Multi-Person Age Estimation with Focusing on Objects and Still Seeing Surroundings [PDF] Abstract
25. Unsupervised Video Anomaly Detection via Flow-based Generative Modeling on Appearance and Motion Latent Features [PDF] Abstract
28. Unsupervised Self-training Algorithm Based on Deep Learning for Optical Aerial Images Change Detection [PDF] Abstract
39. A Patch-based Image Denoising Method Using Eigenvectors of the Geodesics' Gramian Matrix [PDF] Abstract
40. LiteDepthwiseNet: An Extreme Lightweight Network for Hyperspectral Image Classification [PDF] Abstract
41. Linking average- and worst-case perturbation robustness via class selectivity and dimensionality [PDF] Abstract
42. Demonstration of a Cloud-based Software Framework for Video Analytics Application using Low-Cost IoT Devices [PDF] Abstract
44. Encoder-decoder semantic segmentation models for electroluminescence images of thin-film photovoltaic modules [PDF] Abstract
45. Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs [PDF] Abstract
46. RetiNerveNet: Using Recursive Deep Learning to Estimate Pointwise 24-2 Visual Field Data based on Retinal Structure [PDF] Abstract
Abstracts
1. LTN: Long-Term Network for Long-Term Motion Prediction [PDF] Back to Contents
YingQiao Wang
Abstract: Making accurate motion predictions of surrounding agents such as pedestrians and vehicles is a critical task when robots attempt autonomous navigation. Recent research on multi-modal trajectory prediction, including regression and classification approaches, performs very well at short-term prediction. However, when it comes to long-term prediction, most Long Short-Term Memory (LSTM) based models tend to diverge far from the ground truth. Therefore, in this work, we present a two-stage framework for long-term trajectory prediction, named the Long-Term Network (LTN). Our Long-Term Network integrates both the regression and classification approaches. We first generate a set of proposed trajectories from our proposed distribution using a Conditional Variational Autoencoder (CVAE), then classify them with binary labels and output the trajectories with the highest scores. We demonstrate our Long-Term Network's performance with experiments on two real-world pedestrian datasets, ETH/UCY and the Stanford Drone Dataset (SDD), and one challenging real-world driving forecasting dataset, nuScenes. The results show that our method outperforms multiple state-of-the-art approaches in long-term trajectory prediction accuracy.
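For intuition, here is a minimal PyTorch sketch of the two-stage generate-then-rank inference described above: sample trajectory proposals from a CVAE decoder, score them with a binary classifier, and keep the highest-scoring one per sample. All module sizes and names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of two-stage inference: sample proposals from a CVAE,
# score them with a binary classifier, keep the best one per sample.
import torch
import torch.nn as nn

class ProposalCVAE(nn.Module):
    def __init__(self, hist_dim=32, z_dim=16, horizon=12):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(hist_dim + z_dim, 64), nn.ReLU(),
            nn.Linear(64, horizon * 2))              # (x, y) per future step
        self.z_dim, self.horizon = z_dim, horizon

    def sample(self, hist_enc, k):
        z = torch.randn(k, hist_enc.size(0), self.z_dim)
        h = hist_enc.unsqueeze(0).expand(k, -1, -1)
        return self.decoder(torch.cat([h, z], -1)).view(k, -1, self.horizon, 2)

scorer = nn.Sequential(nn.Linear(12 * 2, 64), nn.ReLU(), nn.Linear(64, 1))

def predict(cvae, hist_enc, k=20):
    props = cvae.sample(hist_enc, k)                  # (k, B, T, 2) proposals
    logits = scorer(props.flatten(2)).squeeze(-1)     # (k, B) proposal scores
    best = logits.argmax(0)                           # best proposal per sample
    return props[best, torch.arange(props.size(1))]

hist = torch.randn(4, 32)                             # encoded observed tracks
print(predict(ProposalCVAE(), hist).shape)            # torch.Size([4, 12, 2])
```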
2. Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation [PDF] Back to Contents
Hao Li, Chenxin Tao, Xizhou Zhu, Xiaogang Wang, Gao Huang, Jifeng Dai
Abstract: We propose a general framework for searching surrogate losses for mainstream semantic segmentation metrics. This is in contrast to existing loss functions manually designed for individual metrics. The searched surrogate losses can generalize well to other datasets and networks. Extensive experiments on PASCAL VOC and Cityscapes demonstrate the effectiveness of our approach. Code shall be released.
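For context, a standard hand-designed differentiable surrogate for the IoU metric — the kind of per-metric loss that Auto Seg-Loss aims to replace with searched surrogates — looks like the sketch below. This is not the paper's searched loss, only an example of the design space.

```python
# A hand-designed soft-IoU surrogate: relax the discrete intersection/union
# counts to probabilities so the metric becomes differentiable.
import torch

def soft_iou_loss(probs, target_onehot, eps=1e-6):
    # probs, target_onehot: (B, C, H, W); probs in [0, 1]
    inter = (probs * target_onehot).sum(dim=(2, 3))
    union = (probs + target_onehot - probs * target_onehot).sum(dim=(2, 3))
    return 1 - ((inter + eps) / (union + eps)).mean()

probs = torch.softmax(torch.randn(2, 21, 16, 16), dim=1)   # e.g. PASCAL VOC classes
target = torch.nn.functional.one_hot(
    torch.randint(0, 21, (2, 16, 16)), 21).permute(0, 3, 1, 2).float()
print(soft_iou_loss(probs, target))
```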
3. An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes [PDF] Back to Contents
Mehdi Miah, Justine Pepin, Nicolas Saunier, Guillaume-Alexandre Bilodeau
Abstract: This paper addresses the problem of selecting appearance features for multiple object tracking (MOT) in urban scenes. Over the years, a large number of features have been used for MOT. However, it is not clear whether some of them are better than others. Commonly used features are color histograms, histograms of oriented gradients, deep features from convolutional neural networks, and re-identification (ReID) features. In this study, we assess how good these features are at discriminating objects enclosed by a bounding box in urban scene tracking scenarios. Several affinity measures, namely the $\mathrm{L}_1$, $\mathrm{L}_2$ and the Bhattacharyya distances, Rank-1 counts and the cosine similarity, are also assessed for their impact on the discriminative power of the features. Results on several datasets show that features from ReID networks are the best at discriminating instances from one another regardless of the quality of the detector. If a ReID model is not available, color histograms may be selected if the detector has a good recall and there are few occlusions; otherwise, deep features are more robust to detectors with lower recall. The project page is this http URL.
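A minimal NumPy sketch of the affinity measures compared in the paper (Rank-1 counts omitted), applied to two appearance descriptors such as color histograms or ReID embeddings:

```python
# Affinity measures between two appearance descriptors.
import numpy as np

def l1(a, b):            return np.abs(a - b).sum()
def l2(a, b):            return np.linalg.norm(a - b)
def cosine_sim(a, b):    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def bhattacharyya(p, q):
    # distance between two normalized histograms
    p, q = p / p.sum(), q / q.sum()
    bc = np.sqrt(p * q).sum()              # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

h1, h2 = np.random.rand(64), np.random.rand(64)   # stand-in descriptors
print(l1(h1, h2), l2(h1, h2), cosine_sim(h1, h2), bhattacharyya(h1, h2))
```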
4. A Hamiltonian Monte Carlo Method for Probabilistic Adversarial Attack and Learning [PDF] Back to Contents
Hongjun Wang, Guanbin Li, Xiaobai Liu, Liang Lin
Abstract: Although deep convolutional neural networks (CNNs) have demonstrated remarkable performance on multiple computer vision tasks, research on adversarial learning has shown that deep models are vulnerable to adversarial examples, which are crafted by adding visually imperceptible perturbations to the input images. Most existing adversarial attack methods only create a single adversarial example for the input, which gives just a glimpse of the underlying data manifold of adversarial examples. An attractive solution is to explore the solution space of the adversarial examples and generate a diverse bunch of them, which could potentially improve the robustness of real-world systems and help prevent severe security threats and vulnerabilities. In this paper, we present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples. To improve the efficiency of HMC, we propose a new regime to automatically control the length of trajectories, which allows the algorithm to move with adaptive step sizes along the search direction at different positions. Moreover, we revisit the reason for the high computational cost of adversarial training from the viewpoint of MCMC and design a new generative method called Contrastive Adversarial Training (CAT), which approaches the equilibrium distribution of adversarial examples with only a few iterations by building on small modifications of the standard Contrastive Divergence (CD), achieving a trade-off between efficiency and accuracy. Both quantitative and qualitative analysis on several natural image datasets and practical systems have confirmed the superiority of the proposed algorithm.
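The following is a generic, heavily simplified HMC-style loop that draws a sequence of adversarial examples by treating the negative attack objective as a potential energy. It is a sketch of the general idea only, not the paper's HMCAM algorithm: the accumulated-momentum rule, proper leapfrog half-steps, and the Metropolis accept/reject step are all omitted.

```python
# Simplified HMC-style sampling of a sequence of adversarial examples.
import torch
import torch.nn.functional as F

def hmc_adversarial(model, x, y, eps=8/255, step=0.01, n_leapfrog=5, n_samples=10):
    samples, x_adv = [], x.clone()
    for _ in range(n_samples):
        v = torch.randn_like(x_adv)                    # fresh momentum each round
        q = x_adv.clone().requires_grad_(True)
        for _ in range(n_leapfrog):
            potential = -F.cross_entropy(model(q), y)  # low potential = misclassified
            grad, = torch.autograd.grad(potential, q)
            v = v - step * grad                        # simplified full-step kick
            q = (q + step * v).detach()
            q = torch.max(torch.min(q, x + eps), x - eps).clamp(0, 1)  # epsilon ball
            q = q.requires_grad_(True)
        x_adv = q.detach()
        samples.append(x_adv)                          # one adversarial example per round
    return samples

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
advs = hmc_adversarial(model, torch.rand(2, 3, 8, 8), torch.tensor([1, 7]))
print(len(advs), advs[0].shape)
```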
5. Semi-Supervised Semantic Segmentation in Earth Observation: The MiniFrance Suite, Dataset Analysis and Multi-task Network Study [PDF] Back to Contents
Javiera Castillo-Navarro, Bertrand Le Saux, Alexandre Boulch, Nicolas Audebert, Sébastien Lefèvre
Abstract: The development of semi-supervised learning techniques is essential to enhance the generalization capacities of machine learning algorithms. Indeed, raw image data are abundant while labels are scarce, so it is crucial to leverage unlabeled inputs to build better models. The availability of large databases has been key for the development of learning algorithms with high-level performance. Despite the major role of machine learning in Earth Observation to derive products such as land cover maps, datasets in the field are still limited, either because of modest surface coverage, lack of variety of scenes, or restricted classes to identify. We introduce a novel large-scale dataset for semi-supervised semantic segmentation in Earth Observation, the MiniFrance suite. MiniFrance has several unprecedented properties: it is large-scale, containing over 2000 very high resolution aerial images, accounting for more than 200 billion samples (pixels); it is varied, covering 16 conurbations in France, with various climates, different landscapes, and urban as well as countryside scenes; and it is challenging, considering land use classes with high-level semantics. Nevertheless, the most distinctive quality of MiniFrance is that it is the only dataset in the field especially designed for semi-supervised learning: it contains labeled and unlabeled images in its training partition, which reproduces a life-like scenario. Along with this dataset, we present tools for data representativeness analysis in terms of appearance similarity and a thorough study of MiniFrance data, demonstrating that it is suitable for learning and generalizes well in a semi-supervised setting. Finally, we present semi-supervised deep architectures based on multi-task learning and the first experiments on MiniFrance.
6. Interpretation of Swedish Sign Language using Convolutional Neural Networks and Transfer Learning [PDF] Back to Contents
Gustaf Halvardsson, Johanna Peterson, César Soto-Valero, Benoit Baudry
Abstract: The automatic interpretation of sign languages is a challenging task, as it requires the usage of high-level vision and high-level motion processing systems for providing accurate image perception. In this paper, we use Convolutional Neural Networks (CNNs) and transfer learning to make computers able to interpret signs of the Swedish Sign Language (SSL) hand alphabet. Our model consists of the implementation of a pre-trained InceptionV3 network and the usage of the mini-batch gradient descent optimization algorithm. We rely on transfer learning during the pre-training of the model and its data. The final accuracy of the model, based on 8 study subjects and 9,400 images, is 85%. Our results indicate that the usage of CNNs is a promising approach to interpret sign languages, and transfer learning can be used to achieve high testing accuracy despite using a small training dataset. Furthermore, we describe the implementation details of our model to interpret signs as a user-friendly web application.
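A minimal Keras setup in the spirit of the described pipeline: a frozen ImageNet-pretrained InceptionV3 backbone with a new classification head, trained with mini-batch SGD. The class count and hyperparameters below are assumptions, not values from the paper.

```python
# Transfer learning with a frozen InceptionV3 backbone and a new head.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = False                      # keep pretrained features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(26, activation="softmax"),  # assumed alphabet size
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),  # mini-batch SGD
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
```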
7. Boosting Image-based Mutual Gaze Detection using Pseudo 3D Gaze [PDF] Back to Contents
Bardia Doosti, Ching-Hui Chen, Raviteja Vemulapalli, Xuhui Jia, Yukun Zhu, Bradley Green
Abstract: Mutual gaze detection, i.e., predicting whether or not two people are looking at each other, plays an important role in understanding human interactions. In this work, we focus on the task of image-based mutual gaze detection, and propose a simple and effective approach to boost the performance by using an auxiliary 3D gaze estimation task during training. We achieve the performance boost without additional labeling cost by training the 3D gaze estimation branch using pseudo 3D gaze labels deduced from mutual gaze labels. By sharing the head image encoder between the 3D gaze estimation and the mutual gaze detection branches, we achieve better head features than those learned by training the mutual gaze detection branch alone. Experimental results on three image datasets show that the proposed approach improves the detection performance significantly without additional annotations. This work also introduces a new image dataset that consists of 33.1K pairs of humans annotated with mutual gaze labels in 29.2K images.
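A hedged sketch of the multi-task idea: one head-image encoder shared by a mutual-gaze classifier and an auxiliary 3D gaze regressor trained on pseudo gaze labels. The toy encoder and all sizes are illustrative assumptions.

```python
# Shared encoder with a mutual-gaze head and an auxiliary 3D gaze head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim), nn.ReLU())
        self.gaze3d = nn.Linear(feat_dim, 3)        # auxiliary: 3D gaze vector
        self.mutual = nn.Linear(2 * feat_dim, 1)    # main: mutual gaze logit

    def forward(self, head_a, head_b):
        fa, fb = self.encoder(head_a), self.encoder(head_b)
        return (self.mutual(torch.cat([fa, fb], -1)).squeeze(-1),
                self.gaze3d(fa), self.gaze3d(fb))

net = GazeNet()
a, b = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
logit, ga, gb = net(a, b)
y = torch.randint(0, 2, (4,)).float()               # mutual gaze labels
pseudo_a = F.normalize(torch.randn(4, 3), dim=-1)   # pseudo 3D labels (stand-in)
loss = F.binary_cross_entropy_with_logits(logit, y) + F.mse_loss(ga, pseudo_a)
```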
8. Does Data Augmentation Benefit from Split BatchNorms [PDF] Back to Contents
Amil Merchant, Barret Zoph, Ekin Dogus Cubuk
Abstract: Data augmentation has emerged as a powerful technique for improving the performance of deep neural networks and led to state-of-the-art results in computer vision. However, state-of-the-art data augmentation strongly distorts training images, leading to a disparity between examples seen during training and inference. In this work, we explore a recently proposed training paradigm in order to correct for this disparity: using an auxiliary BatchNorm for the potentially out-of-distribution, strongly augmented images. Our experiments then focus on how to define the BatchNorm parameters that are used at evaluation. To eliminate the train-test disparity, we experiment with using the batch statistics defined by clean training images only, yet surprisingly find that this does not yield improvements in model performance. Instead, we investigate using BatchNorm parameters defined by weak augmentations and find that this method significantly improves the performance of common image classification benchmarks such as CIFAR-10, CIFAR-100, and ImageNet. We then explore a fundamental trade-off between accuracy and robustness coming from using different BatchNorm parameters, providing greater insight into the benefits of data augmentation on model performance.
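A sketch of what an auxiliary ("split") BatchNorm can look like: strongly augmented images update a separate set of normalization statistics from clean or weakly augmented ones; which statistics to use at evaluation is exactly what the paper studies. The module and flag names below are assumptions.

```python
# Route clean/weak vs. strong augmentations through separate BatchNorms.
import torch
import torch.nn as nn

class SplitBatchNorm2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn_main = nn.BatchNorm2d(channels)   # clean / weakly augmented
        self.bn_aux = nn.BatchNorm2d(channels)    # strongly augmented

    def forward(self, x, aux=False):
        return self.bn_aux(x) if aux else self.bn_main(x)

bn = SplitBatchNorm2d(16)
clean, strong = torch.randn(8, 16, 4, 4), torch.randn(8, 16, 4, 4)
_ = bn(clean)             # updates main statistics
_ = bn(strong, aux=True)  # updates auxiliary statistics only
bn.eval()
out = bn(clean)           # evaluation here uses the main-branch statistics
```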
9. CIMON: Towards High-quality Hash Codes [PDF] Back to Contents
Xiao Luo, Daqing Wu, Zeyu Ma, Chong Chen, Huasong Zhong, Minghua Deng, Jianqiang Huang, Xian-sheng Hua
Abstract: Recently, hashing has been widely used in approximate nearest neighbor search for its storage and computational efficiency. Due to the lack of labeled data in practice, many studies focus on unsupervised hashing. Most of the unsupervised hashing methods learn to map images into semantic similarity-preserving hash codes by constructing a local semantic similarity structure from a pre-trained model as guiding information, i.e., treating each point pair as similar if their distance is small in feature space. However, due to the inefficient representation ability of the pre-trained model, many false positives and negatives in local semantic similarity will be introduced and lead to error propagation during hash code learning. Moreover, most hashing methods ignore basic characteristics of hash codes such as collisions, which makes the hash codes unstable under disturbance. In this paper, we propose a new method named Comprehensive sImilarity Mining and cOnsistency learNing (CIMON). First, we use global constraint learning and a similarity statistical distribution to obtain reliable and smooth guidance. Second, image augmentation and consistency learning are introduced to explore both semantic and contrastive consistency to derive robust hash codes with fewer collisions. Extensive experiments on several benchmark datasets show that the proposed method consistently outperforms a wide range of state-of-the-art methods in both retrieval performance and robustness.
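A hedged sketch of the consistency-learning ingredient: two augmented views of an image should map to nearly the same relaxed binary code. The network and augmentations below are placeholders, and CIMON's similarity mining is omitted.

```python
# Consistency loss between relaxed hash codes of two views of the same image.
import torch
import torch.nn as nn
import torch.nn.functional as F

hash_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))  # 64-bit codes

def consistency_loss(view1, view2):
    c1 = torch.tanh(hash_net(view1))   # relaxed binary codes in (-1, 1)
    c2 = torch.tanh(hash_net(view2))
    return F.mse_loss(c1, c2)

x = torch.rand(8, 3, 32, 32)
v1 = x + 0.05 * torch.randn_like(x)    # stand-ins for two image augmentations
v2 = x + 0.05 * torch.randn_like(x)
print(consistency_loss(v1, v2))
```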
10. Generalizing Universal Adversarial Attacks Beyond Additive Perturbations [PDF] Back to Contents
Yanghao Zhang, Wenjie Ruan, Fu Wang, Xiaowei Huang
Abstract: Previous work has shown that universal adversarial attacks can fool deep neural networks over a large set of input images with a single human-invisible perturbation. However, current methods for universal adversarial attacks are based on additive perturbation, which causes misclassification when the perturbation is directly added to the input images. In this paper, for the first time, we show that a universal adversarial attack can also be achieved via non-additive perturbation (e.g., spatial transformation). More importantly, to unify both additive and non-additive perturbations, we propose a novel unified yet flexible framework for universal adversarial attacks, called GUAP, which is able to initiate attacks by additive perturbation, non-additive perturbation, or the combination of both. Extensive experiments are conducted on the CIFAR-10 and ImageNet datasets with six deep neural network models, including GoogLeNet, VGG16/19, ResNet101/152, and DenseNet121. The empirical experiments demonstrate that GUAP can obtain up to 90.9% and 99.24% successful attack rates on CIFAR-10 and ImageNet respectively, improvements of over 15% and 19% over current state-of-the-art universal adversarial attacks. The code for reproducing the experiments in this paper is available at this https URL.
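An illustrative sketch of a combined universal perturbation in the spirit of GUAP: one shared spatial warp (a small flow field applied with grid_sample) followed by one shared additive perturbation, applied to every input. Magnitudes and details are assumptions, not the paper's optimized perturbations.

```python
# Universal (image-independent) spatial warp + additive perturbation.
import torch
import torch.nn.functional as F

def make_base_grid(h, w):
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([xs, ys], -1)          # (H, W, 2) sampling grid in [-1, 1]

h = w = 32
flow = 0.02 * torch.randn(h, w, 2)            # shared warp field (non-additive part)
delta = 0.03 * torch.randn(1, 3, h, w)        # shared additive perturbation

def guap_like(x):
    grid = (make_base_grid(h, w) + flow).unsqueeze(0).expand(x.size(0), -1, -1, -1)
    warped = F.grid_sample(x, grid, align_corners=True)
    return (warped + delta).clamp(0, 1)

x = torch.rand(4, 3, h, w)
print(guap_like(x).shape)                     # torch.Size([4, 3, 32, 32])
```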
11. Improved Multi-Source Domain Adaptation by Preservation of Factors [PDF] Back to Contents
Sebastian Schrom, Stephan Hasler, Jürgen Adamy
Abstract: Domain Adaptation (DA) is a highly relevant research topic when it comes to image classification with deep neural networks. Combining multiple source domains in a sophisticated way to optimize a classification model can improve the generalization to a target domain. Here, the difference in the data distributions of the source and target image datasets plays a major role. In this paper, we describe, based on a theory of visual factors, how real-world scenes appear in images in general and how recent DA datasets are composed of such factors. We show that different domains can be described by a set of so-called domain factors, whose values are consistent within a domain but can change across domains. Many DA approaches try to remove all domain factors from the feature representation to be domain invariant. In this paper we show that this can lead to negative transfer, since task-informative factors can get lost as well. To address this, we propose Factor-Preserving DA (FP-DA), a method to train a deep adversarial unsupervised DA model that is able to preserve specific task-relevant factors in a multi-domain scenario. We demonstrate on CORe50, a dataset with many domains, how such factors can be identified by standard one-to-one transfer experiments between single domains combined with PCA. By applying FP-DA, we show that the highest average and minimum performance can be achieved.
12. Self-training for Few-shot Transfer Across Extreme Task Differences [PDF] Back to Contents
Cheng Perng Phoo, Bharath Hariharan
Abstract: All few-shot learning techniques must be pre-trained on a large, labeled "base dataset". In problem domains where such large labeled datasets are not available for pre-training (e.g., X-ray images), one must resort to pre-training in a different "source" problem domain (e.g., ImageNet), which can be very different from the desired target task. Traditional few-shot and transfer learning techniques fail in the presence of such extreme differences between the source and target tasks. In this paper, we present a simple and effective solution to tackle this extreme domain gap: self-training a source domain representation on unlabeled data from the target domain. We show that this improves one-shot performance on the target domain by 2.9 points on average on a challenging benchmark with multiple domains.
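A minimal sketch of the self-training idea: pseudo-label unlabeled target-domain images with the current source-pretrained model and retrain on the confident ones. The confidence threshold and the single-model setup are assumptions rather than the paper's exact recipe.

```python
# One self-training step: pseudo-label, filter by confidence, retrain.
import torch
import torch.nn.functional as F

def self_training_step(model, optimizer, unlabeled_batch, threshold=0.8):
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_batch), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf > threshold                 # only trust confident predictions
    if keep.any():
        loss = F.cross_entropy(model(unlabeled_batch[keep]), pseudo[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 16, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
self_training_step(model, opt, torch.rand(32, 3, 16, 16))
```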
13. Unsupervised Learning of Depth and Ego-Motion from Cylindrical Panoramic Video with Applications for Virtual Reality [PDF] Back to Contents
Alisha Sharma, Ryan Nett, Jonathan Ventura
Abstract: We introduce a convolutional neural network model for unsupervised learning of depth and ego-motion from cylindrical panoramic video. Panoramic depth estimation is an important technology for applications such as virtual reality, 3D modeling, and autonomous robotic navigation. In contrast to previous approaches for applying convolutional neural networks to panoramic imagery, we use the cylindrical panoramic projection which allows for the use of the traditional CNN layers such as convolutional filters and max pooling without modification. Our evaluation of synthetic and real data shows that unsupervised learning of depth and ego-motion on cylindrical panoramic images can produce high-quality depth maps and that an increased field-of-view improves ego-motion estimation accuracy. We create two new datasets to evaluate our approach: a synthetic dataset created using the CARLA simulator, and Headcam, a novel dataset of panoramic video collected from a helmet-mounted camera while biking in an urban setting. We also apply our network to the problem of converting monocular panoramas to stereo panoramas.
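One common way to make standard CNN layers respect the cylindrical topology mentioned above is wrap-around (circular) padding in the horizontal direction, so convolutions are seamless across the panorama's left/right edge. The sketch below assumes that padding scheme; it illustrates the idea rather than the paper's exact layers.

```python
# Conv layer with horizontal wrap-around padding for cylindrical panoramas.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CylindricalConv2d(nn.Module):
    def __init__(self, cin, cout, k=3):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, k)     # no built-in padding
        self.p = k // 2

    def forward(self, x):
        x = F.pad(x, (self.p, self.p, 0, 0), mode="circular")  # wrap left/right
        x = F.pad(x, (0, 0, self.p, self.p))                   # zero-pad top/bottom
        return self.conv(x)

pano = torch.rand(1, 3, 64, 256)                # cylindrical panorama (H, W)
print(CylindricalConv2d(3, 8)(pano).shape)      # torch.Size([1, 8, 64, 256])
```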
14. Integrating Coarse Granularity Part-level Features with Supervised Global-level Features for Person Re-identification [PDF] Back to Contents
Xiaofei Mao, Jiahao Cao, Dongfang Li, Xia Jia, Qingfang Zheng
Abstract: Holistic person re-identification (Re-ID) and partial person re-identification have each achieved great progress in recent years. However, scenarios in reality often include both holistic and partial pedestrian images, which makes single holistic or partial person Re-ID models hard to apply. In this paper, we propose a robust coarse granularity part-level person Re-ID network (CGPN), which not only extracts robust regional-level body features, but also integrates supervised global features for both holistic and partial person images. CGPN gains a two-fold benefit toward higher accuracy for person Re-ID. On one hand, CGPN learns to extract effective body part features for both holistic and partial person images. On the other hand, compared with extracting global features directly by the backbone network, CGPN learns to extract more accurate global features with a supervision strategy. The single model trained on three Re-ID datasets, including Market-1501, DukeMTMC-reID and CUHK03, achieves state-of-the-art performance and outperforms existing approaches. Especially on CUHK03, the most challenging dataset for person Re-ID, in single query mode we obtain a top result of Rank-1/mAP = 87.1%/83.6% with this method without re-ranking, outperforming the current best method by +7.0%/+6.7%.
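A sketch of coarse-granularity part features: pool the backbone feature map into a few horizontal stripes (body parts) alongside one global vector. The stripe count and pooling are illustrative, not the exact CGPN design.

```python
# Coarse horizontal part pooling plus a global feature vector.
import torch
import torch.nn.functional as F

feat = torch.randn(4, 2048, 24, 8)                       # backbone output (B, C, H, W)
global_feat = F.adaptive_avg_pool2d(feat, 1).flatten(1)  # (B, 2048) global vector
parts = F.adaptive_avg_pool2d(feat, (3, 1)).squeeze(-1)  # 3 coarse body stripes
part_feats = [parts[:, :, i] for i in range(3)]          # each (B, 2048)
print(global_feat.shape, part_feats[0].shape)
```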
15. A Deep Drift-Diffusion Model for Image Aesthetic Score Distribution Prediction [PDF] Back to Contents
Xin Jin, Xiqiao Li, Heng Huang, Xiaodong Li, Xinghui Zhou
Abstract: The task of aesthetic quality assessment is complicated due to its subjectivity. In recent years, the target representation of image aesthetic quality has changed from a one-dimensional binary classification label or numerical score to a multi-dimensional score distribution. According to current methods, the ground truth score distributions are straightforwardly regressed. However, the subjectivity of aesthetics is not taken into account, that is to say, the psychological processes of human beings are not taken into consideration, which limits the performance of the task. In this paper, we propose a Deep Drift-Diffusion (DDD) model inspired by psychologists to predict aesthetic score distribution from images. The DDD model can describe the psychological process of aesthetic perception instead of traditional modeling of the results of assessment. We use deep convolution neural networks to regress the parameters of the drift-diffusion model. The experimental results in large scale aesthetic image datasets reveal that our novel DDD model is simple but efficient, which outperforms the state-of-the-art methods in aesthetic score distribution prediction. Besides, different psychological processes can also be predicted by our model.
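An illustrative sketch of the drift-diffusion idea: given a predicted drift and diffusion coefficient for an image, simulating the process yields a distribution that can be read as a score distribution. The Euler-Maruyama simulation and the binning into ten score levels below are assumptions, not the paper's exact mapping.

```python
# Simulate a drift-diffusion process and histogram the outcomes into score bins.
import torch

def simulate_ddm(drift, diffusion, n_paths=1000, n_steps=100, dt=0.01):
    # Euler-Maruyama simulation of dX = drift*dt + diffusion*dW
    x = torch.zeros(n_paths)
    for _ in range(n_steps):
        x = x + drift * dt + diffusion * (dt ** 0.5) * torch.randn(n_paths)
    # bucket the accumulated evidence into 10 bins -> an assumed score distribution
    bins = torch.clamp(((x - x.min()) / (x.max() - x.min() + 1e-8) * 10).long(), 0, 9)
    return torch.bincount(bins, minlength=10).float() / n_paths

print(simulate_ddm(drift=0.5, diffusion=1.0))  # sums to 1 over 10 score levels
```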
16. Empty Cities: a Dynamic-Object-Invariant Space for Visual SLAM [PDF] Back to Contents
Berta Bescos, Cesar Cadena, Jose Neira
Abstract: In this paper we present a data-driven approach to obtain the static image of a scene, eliminating dynamic objects that might have been present at the time of traversing the scene with a camera. The general objective is to improve vision-based localization and mapping tasks in dynamic environments, where the presence (or absence) of different dynamic objects in different moments makes these tasks less robust. We introduce an end-to-end deep learning framework to turn images of an urban environment that include dynamic content, such as vehicles or pedestrians, into realistic static frames suitable for localization and mapping. This objective faces two main challenges: detecting the dynamic objects, and inpainting the static occluded background. The first challenge is addressed by the use of a convolutional network that learns a multi-class semantic segmentation of the image. The second challenge is approached with a generative adversarial model that, taking as input the original dynamic image and the computed dynamic/static binary mask, is capable of generating the final static image. This framework makes use of two new losses, one based on image steganalysis techniques, useful to improve the inpainting quality, and another one based on ORB features, designed to enhance feature matching between real and hallucinated image regions. To validate our approach, we perform an extensive evaluation on different tasks that are affected by dynamic entities, i.e., visual odometry, place recognition and multi-view stereo, with the hallucinated images. Code has been made available on this https URL.
17. HS-ResNet: Hierarchical-Split Block on Convolutional Neural Network [PDF] 返回目录
Pengcheng Yuan, Shufei Lin, Cheng Cui, Yuning Du, Ruoyu Guo, Dongliang He, Errui Ding, Shumin Han
Abstract: This paper presents a representational block named Hierarchical-Split Block, which can be used as a plug-and-play block to upgrade existing convolutional neural networks and significantly improves model performance. A Hierarchical-Split Block contains many hierarchical split-and-concatenate connections within one single residual block. We find that multi-scale features are of great importance for numerous vision tasks. Moreover, the Hierarchical-Split block is very flexible and efficient, providing a large space of potential network architectures for different applications. In this work, we present a common backbone based on the Hierarchical-Split block for several tasks: image classification, object detection, instance segmentation and semantic image segmentation/parsing. Our approach shows significant improvements over the baseline on all these core tasks. As shown in Figure 1, for image classification, our 50-layer network (HS-ResNet50) achieves 81.28% top-1 accuracy with competitive latency on the ImageNet-1k dataset. It also outperforms most state-of-the-art models. The source code and models will be available on: this https URL
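A simplified sketch of such a block (our reading of the abstract; the exact split/merge pattern in HS-ResNet differs in detail): channels are split into groups, each group is convolved, part of the result is handed to the next group, and everything is concatenated back with a residual connection.

    import torch
    import torch.nn as nn

    class HSBlock(nn.Module):
        def __init__(self, channels=64, splits=4):
            super().__init__()
            w = channels // splits          # assumes channels divisible by splits
            self.splits, self.w = splits, w
            # each conv sees its own split plus the features handed over
            self.convs = nn.ModuleList(
                nn.Conv2d(w if i == 0 else 2 * w, w, 3, padding=1)
                for i in range(splits))

        def forward(self, x):
            parts = torch.split(x, self.w, dim=1)   # split channels into groups
            outs, prev = [], None
            for i, conv in enumerate(self.convs):
                inp = parts[i] if prev is None else torch.cat([parts[i], prev], 1)
                prev = torch.relu(conv(inp))        # hand features to next group
                outs.append(prev)
            return x + torch.cat(outs, dim=1)       # concatenate + residual

    y = HSBlock()(torch.randn(2, 64, 32, 32))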
18. THIN: THrowable Information Networks and Application for Facial Expression Recognition In The Wild [PDF] 返回目录
Estephe Arnaud, Arnaud Dapogny, Kevin Bailly
Abstract: For a number of tasks solved using deep learning techniques, an exogenous variable can be identified such that (a) it heavily influences the appearance of the different classes, and (b) an ideal classifier should be invariant to this variable. An example of such an exogenous variable is identity when facial expression recognition (FER) is considered. In this paper, we propose a dual exogenous/endogenous representation. The former captures the exogenous variable whereas the latter models the task at hand (e.g. facial expression). We design a prediction layer that uses a deep ensemble conditioned by the exogenous representation. It employs a differential tree gate that learns an adaptive weak predictor weighting, therefore modeling a partition of the exogenous representation space, upon which the weak predictors specialize. This layer explicitly models the dependency between the exogenous variable and the predicted task (a). We also propose an exogenous dispelling loss to remove the exogenous information from the endogenous representation, enforcing (b). Thus, the exogenous information is used two times in a throwable fashion, first as a conditioning variable for the target task, and second to create invariance within the endogenous representation. We call this method THIN, standing for THrowable Information Networks. We experimentally validate THIN in several contexts where exogenous information can be identified, such as digit recognition under large rotations and shape recognition at multiple scales. We also apply it to FER with identity as the exogenous variable. In particular, we demonstrate that THIN significantly outperforms state-of-the-art approaches on several challenging datasets.
19. Unsupervised Contrastive Person Re-identification [PDF] 返回目录
Bo Pang, Deming Zhai, Junjun Jiang, Xianming Liu
Abstract: Person re-identification (ReID) aims at finding the same person among images captured by various cameras. Unsupervised person ReID has attracted a lot of attention recently because it works without intensive manual annotation and thus shows great potential for adapting to new conditions. Representation learning plays a critical role in unsupervised person ReID. In this work, we propose a novel selective contrastive learning framework for unsupervised feature learning. Specifically, different from traditional contrastive learning strategies, we propose to use multiple positives and adaptively sampled negatives for defining the contrastive loss, making it possible to learn a feature embedding model with a stronger identity-discriminative representation. Moreover, we propose to jointly leverage global and local features to construct three dynamic dictionaries, among which the global and local memory banks are used for pairwise similarity computation and the mixture memory bank is used for the contrastive loss definition. Experimental results demonstrate the superiority of our method in unsupervised person ReID compared with the state of the art.
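As a concrete illustration (our sketch, not the paper's loss; the temperature and negative sampling are simplified), a contrastive loss with multiple positives can treat every feature sharing the anchor's (pseudo) identity as a positive and the remaining batch entries as negatives:

    import torch
    import torch.nn.functional as F

    def multi_positive_contrastive(feats, labels, tau=0.07):
        feats = F.normalize(feats, dim=1)
        sim = feats @ feats.t() / tau                   # pairwise similarities
        logits = sim - torch.eye(len(feats)) * 1e9      # mask self-similarity
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        pos = (labels[:, None] == labels[None, :]).float()
        pos.fill_diagonal_(0)                           # anchor is not its own positive
        return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

    loss = multi_positive_contrastive(torch.randn(8, 128),
                                      torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))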
20. Object Tracking Using Spatio-Temporal Future Prediction [PDF] 返回目录
Yuan Liu, Ruoteng Li, Robby T. Tan, Yu Cheng, Xiubao Sui
Abstract: Occlusion is a long-standing problem that causes many modern tracking methods to be erroneous. In this paper, we address the occlusion problem by exploiting the current and future possible locations of the target object inferred from its past trajectory. To achieve this, we introduce a learning-based tracking method that takes into account background motion modeling and trajectory prediction. Our trajectory prediction module predicts the target object's locations in the current and future frames based on the object's past trajectory. Since, in the input video, the target object's trajectory is affected not only by the object motion but also by the camera motion, our background motion module estimates the camera motion so that the object's trajectory can be made independent of it. To dynamically switch between the appearance-based tracker and the trajectory prediction, we employ a network that can assess how good a tracking prediction is, and we use the assessment scores to choose between the appearance-based tracker's prediction and the trajectory-based prediction. Comprehensive evaluations show that the proposed method sets a new state-of-the-art performance on commonly used tracking benchmarks.
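The switching logic reads naturally as a small decision step; the sketch below is our paraphrase of the abstract (all names and the fixed threshold are hypothetical; in the paper the choice is driven by learned assessment scores):

    def track_step(frame, history, appearance_tracker, trajectory_predictor,
                   assessor, threshold=0.5):
        box_app = appearance_tracker(frame)        # appearance-based estimate
        box_traj = trajectory_predictor(history)   # trajectory-based estimate
        score = assessor(frame, box_app)           # how trustworthy is box_app?
        return box_app if score >= threshold else box_traj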
21. Interactive Latent Interpolation on MNIST Dataset [PDF] 返回目录
Mazeyar Moeini Feizabadi, Ali Mohammed Shujjat, Sarah Shahid, Zainab Hasnain
Abstract: This paper will discuss the potential of dimensionality reduction with a web-based use of GANs. Through a variety of experiments, we show synthesizing visually-appealing samples, interpolating meaningfully between samples, and performing linear arithmetic with latent vectors. GANs have proved to be a remarkable technique for producing computer-generated images very similar to an original image. This is primarily useful when coupled with dimensionality reduction as an effective application of our algorithm. We proposed a new architecture for GANs, which ended up not working for mathematical reasons explained later. We then proposed a new web-based GAN that still takes advantage of dimensionality reduction to speed up generation in the browser to 0.2 milliseconds. Lastly, we made a modern UI with linear interpolation to present the work. With this speedy generation, we can generate so fast that we can create an animation-type effect, not seen before, that works on both web and mobile.
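The latent interpolation itself is one line of arithmetic; a minimal NumPy sketch (the generator call in the final comment is hypothetical) is:

    import numpy as np

    def interpolate(z1, z2, steps=10):
        # z(t) = (1 - t) * z1 + t * z2 for t evenly spaced in [0, 1]
        return [(1.0 - t) * z1 + t * z2 for t in np.linspace(0.0, 1.0, steps)]

    frames = interpolate(np.random.randn(100), np.random.randn(100))
    # each latent in `frames` would then be decoded, e.g. generator.predict(z)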
22. MOTChallenge: A Benchmark for Single-camera Multiple Target Tracking [PDF] 返回目录
Patrick Dendorfer, Aljoša Ošep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, Laura Leal-Taixé
Abstract: Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data, and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i) MOT15, along with numerous state-of-the-art results that were submitted in the last years, (ii) MOT16, which contains new challenging videos, and (iii) MOT17, which extends the MOT16 sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third releases not only offer a significant increase in the number of labeled boxes but also provide labels for multiple object classes besides pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light on potential future research directions.
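For readers new to these benchmarks: the headline CLEAR-MOT accuracy score, MOTA, aggregates the three error types over all frames (standard definition, not specific to this paper; the numbers below are made up):

    def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
        # MOTA = 1 - (FN + FP + IDSW) / GT, summed over all frames
        return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_boxes

    print(mota(false_negatives=100, false_positives=50, id_switches=10,
               num_gt_boxes=2000))   # -> 0.92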
23. FOSS: Multi-Person Age Estimation with Focusing on Objects and Still Seeing Surroundings [PDF] 返回目录
Masakazu Yoshimura, Satoshi Ogata
Abstract: Age estimation from images can be used in many practical scenes. Most previous works targeted estimation from images in which only one face exists, and most of the open datasets for age estimation contain such images. However, in some situations, age estimation in the wild and for multiple people is needed. Usually, such situations were handled by two separate models: a face detector model which crops facial regions, and an age estimation model which estimates from the cropped images. In this work, we propose a method that can detect and estimate the ages of multiple people with a single model, which estimates age by focusing on faces while still seeing surroundings. Also, we propose a training method which enables the model to estimate multiple people well despite being trained with images in which only one face is photographed. In the experiments, we evaluated our proposed method against the traditional approach using two separate models. As a result, the accuracy could be enhanced with our proposed method. We also adapted our proposed model to commonly used single-person age estimation datasets, and it is shown that our method is also effective on those images and outperforms state-of-the-art accuracy.
24. Self-Supervised Domain Adaptation with Consistency Training [PDF] 返回目录
L. Xiao, J. Xu, D. Zhao, Z. Wang, L. Wang, Y. Nie, B. Dai
Abstract: We consider the problem of unsupervised domain adaptation for image classification. To learn target-domain-aware features from the unlabeled data, we create a self-supervised pretext task by augmenting the unlabeled data with a certain type of transformation (specifically, image rotation) and ask the learner to predict the properties of the transformation. However, the obtained feature representation may contain a large amount of irrelevant information with respect to the main task. To provide further guidance, we force the feature representation of the augmented data to be consistent with that of the original data. Intuitively, the consistency introduces additional constraints to representation learning; therefore, the learned representation is more likely to focus on the right information about the main task. Our experimental results validate the proposed method and demonstrate state-of-the-art performance on classical domain adaptation benchmarks. Code is available at this https URL.
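A hedged PyTorch sketch of the two ingredients (ours; `encoder` and `rot_head` are assumed callables, and the exact consistency target in the paper may differ): a rotation-prediction pretext loss on the augmented image plus a consistency term tying its features to those of the original.

    import torch
    import torch.nn.functional as F

    def self_supervised_losses(encoder, rot_head, x):
        k = torch.randint(0, 4, (1,)).item()           # rotation class: k * 90 degrees
        x_rot = torch.rot90(x, k, dims=(2, 3))
        f, f_rot = encoder(x), encoder(x_rot)          # assumed to return (B, D) features
        pretext = F.cross_entropy(rot_head(f_rot),
                                  torch.full((x.size(0),), k, dtype=torch.long))
        consistency = F.mse_loss(f_rot, f.detach())    # keep representations close
        return pretext + consistency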
25. Unsupervised Video Anomaly Detection via Flow-based Generative Modeling on Appearance and Motion Latent Features [PDF] 返回目录
MyeongAh Cho, Taeoh Kim, Sangyoun Lee
Abstract: Surveillance video anomaly detection searches for anomalous events such as crimes or accidents among normal scenes. Since anomalous events occur rarely, there is a class imbalance problem between normal and abnormal data, and it is impossible to collect all potential anomalous events, which makes the task challenging. Therefore, performing anomaly detection requires learning the patterns of normal scenes to detect unseen and undefined anomalies. Since abnormal scenes are distinguished from normal scenes by appearance or motion, lots of previous approaches have used an explicit pre-trained model, such as optical flow for motion information, which makes the network complex and dependent on the pre-training. We propose an implicit two-path AutoEncoder (ITAE) that exploits the structure of a SlowFast network and focuses on spatial and temporal information through appearance (slow) and motion (fast) encoders, respectively. The two encoders and a single decoder learn normal appearance and behavior by reconstructing normal videos of the training set. Furthermore, with features from the two encoders, we suggest density estimation through flow-based generative models to learn the tractable likelihoods of appearance and motion features. Finally, we show the effectiveness of the appearance and motion encoders and their distribution modeling through experiments on three benchmarks, with results that outperform state-of-the-art methods.
26. A Human Eye-based Text Color Scheme Generation Method for Image Synthesis [PDF] 返回目录
Shao Wei Wang, Guan Jie Huang, Xiang Yu Luo
Abstract: Synthetic data used for scene text detection and recognition tasks have proven effective. However, there are still two problems. First, the color schemes used for text coloring in the existing methods are relatively fixed color key-value pairs learned from real datasets; the dirty data in real datasets may cause the problem that the colors of text and background are too similar to be distinguished from each other. Second, the generated texts are uniformly limited to the same depth of a picture, while there are special cases in the real world where text may appear across depths. To address these problems, in this paper we design a novel method to generate color schemes that are consistent with how human eyes observe things. The advantages of our method are as follows: (1) it overcomes the color confusion between text and background caused by dirty data; (2) the generated texts are allowed to appear in most locations of any image, even across depths; (3) it avoids analyzing the depth of the background, such that the performance of our method exceeds that of state-of-the-art methods; (4) image generation is fast, with nearly one picture generated every three milliseconds. The effectiveness of our method is verified on several public datasets.
27. NeRF++: Analyzing and Improving Neural Radiance Fields [PDF] 返回目录
Kai Zhang, Gernot Riegler, Noah Snavely, Vladlen Koltun
Abstract: Neural Radiance Fields (NeRF) achieve impressive view synthesis results for a variety of capture settings, including 360 capture of bounded scenes and forward-facing capture of bounded and unbounded scenes. NeRF fits multi-layer perceptrons (MLPs) representing view-invariant opacity and view-dependent color volumes to a set of training images, and samples novel views based on volume rendering techniques. In this technical report, we first remark on radiance fields and their potential ambiguities, namely the shape-radiance ambiguity, and analyze NeRF's success in avoiding such ambiguities. Second, we address a parametrization issue involved in applying NeRF to 360 captures of objects within large-scale, unbounded 3D scenes. Our method improves view synthesis fidelity in this challenging scenario. Code is available at this https URL.
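The parametrization issue mentioned above is handled in NeRF++ with an inverted-sphere scheme: a point outside the unit sphere is represented by its unit direction plus inverse distance, so the unbounded far field maps to a bounded volume. A small NumPy sketch of that idea (ours, simplified):

    import numpy as np

    def invert_outside_point(p):
        r = np.linalg.norm(p)
        assert r > 1.0, "only points outside the unit sphere are re-parameterized"
        return np.append(p / r, 1.0 / r)   # (x/r, y/r, z/r, 1/r), all bounded

    print(invert_outside_point(np.array([10.0, 0.0, 0.0])))  # [1. 0. 0. 0.1]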
28. Unsupervised Self-training Algorithm Based on Deep Learning for Optical Aerial Images Change Detection [PDF] 返回目录
Yuan Zhou, Xiangrui Li
Abstract: Optical aerial image change detection is an important task in earth observation and has been extensively investigated in the past few decades. Generally, supervised change detection methods with superior performance require a large amount of labeled training data, which is obtained by manual annotation at high cost. In this paper, we present a novel unsupervised self-training algorithm (USTA) for optical aerial image change detection. A traditional method such as change vector analysis is used to generate the pseudo labels. We use these pseudo labels to train a well-designed convolutional neural network. The network is then used as a teacher to classify the original multitemporal images and generate another set of pseudo labels. The two sets of pseudo labels are then used to jointly train a student network with the same structure as the teacher. The final change detection result can be obtained from the trained student network. Besides, we design an image filter to control the usage of change information from the pseudo labels in the training process of the network. The whole process of the algorithm is unsupervised, without manually marked labels. Experimental results on real datasets demonstrate the competitive performance of our proposed method.
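An outline of that loop in hypothetical Python (every name here is a placeholder for a component the abstract describes, not the authors' actual code):

    def usta(img_t1, img_t2, cva, make_network, train, image_filter):
        pseudo_1 = cva(img_t1, img_t2)           # pseudo labels from change vector analysis
        teacher = make_network()
        train(teacher, (img_t1, img_t2), [image_filter(pseudo_1)])
        pseudo_2 = teacher((img_t1, img_t2))     # relabel with the teacher
        student = make_network()                 # same architecture as the teacher
        train(student, (img_t1, img_t2),
              [image_filter(pseudo_1), image_filter(pseudo_2)])
        return student                           # final change detector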
29. Deep Learning Models for Predicting Wildfires from Historical Remote-Sensing Data [PDF] 返回目录
Fantine Huot, R. Lily Hu, Matthias Ihme, Qing Wang, John Burge, Tianjian Lu, Jason Hickey, Yi-Fan Chen, John Anderson
Abstract: Identifying regions that have high likelihood for wildfires is a key component of land and forestry management and disaster preparedness. We create a data set by aggregating nearly a decade of remote-sensing data and historical fire records to predict wildfires. This prediction problem is framed as three machine learning tasks. Results are compared and analyzed for four different deep learning models to estimate wildfire likelihood. The results demonstrate that deep learning models can successfully identify areas of high fire likelihood using aggregated data about vegetation, weather, and topography with an AUC of 83%.
30. AI-based BMI Inference from Facial Images: An Application to Weight Monitoring [PDF] 返回目录
Hera Siddiqui, Ajita Rattani, Dakshina Ranjan Kisku, Tanner Dean
Abstract: Self-diagnostic image-based methods for healthy weight monitoring are gaining increased interest following the alarming trend of obesity. Only a handful of academic studies exist that investigate AI-based methods for Body Mass Index (BMI) inference from facial images as a solution to healthy weight monitoring and management. To promote further research and development in this area, we evaluate and compare the performance of five different deep-learning-based Convolutional Neural Network (CNN) architectures, i.e., VGG19, ResNet50, DenseNet, MobileNet, and lightCNN, for BMI inference from facial images. Experimental results on three publicly available BMI-annotated facial image datasets assembled from social media, namely the VisualBMI, VIP-Attributes, and Bollywood datasets, suggest the efficacy of deep learning methods in BMI inference from face images, with a minimum Mean Absolute Error (MAE) of $1.04$ obtained using ResNet50.
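As a concrete example of one such model (our sketch; the paper's training details are not shown), a ResNet50 can be turned into a BMI regressor by replacing its classifier with a single output and training with an L1 objective, which directly optimizes the reported MAE:

    import torch
    import torch.nn as nn
    from torchvision import models

    backbone = models.resnet50(weights=None)           # face images in, BMI out
    backbone.fc = nn.Linear(backbone.fc.in_features, 1)
    criterion = nn.L1Loss()                            # L1 loss = mean absolute error

    faces = torch.randn(4, 3, 224, 224)                # dummy batch of face crops
    bmi = torch.tensor([[21.5], [27.0], [24.3], [30.1]])
    loss = criterion(backbone(faces), bmi)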
31. Auto-calibration Method Using Stop Signs for Urban Autonomous Driving Applications [PDF] 返回目录
Yunhai Han, Yuhan Liu, David Paz, Henrik Christensen
Abstract: For the use of cameras on an intelligent vehicle, driving over a major bump could challenge the calibration. It is then of interest to do dynamic calibration. What structures can be used for calibration? How about using traffic signs that you recognize? In this paper, an approach is presented for dynamic camera calibration based on recognition of stop signs. The detection is performed with convolutional neural networks (CNNs). A recognized sign is modeled as a polygon and matched to a model. Parameters are tracked over time. Experimental results show clear convergence and improved performance of the calibration.
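To make the geometric core concrete (our illustration, not the paper's pipeline): a stop sign has known metric geometry, so detected octagon corners give 2D-3D correspondences from which OpenCV's solvePnP can recover the sign's pose relative to the camera; the sign width below is an assumed value.

    import cv2
    import numpy as np

    def octagon_3d(width=0.75):                      # regular octagon, meters across flats
        ang = np.deg2rad(22.5 + 45.0 * np.arange(8))
        r = width / (2 * np.cos(np.deg2rad(22.5)))   # circumradius from width
        return np.stack([r * np.cos(ang), r * np.sin(ang), np.zeros(8)], axis=1)

    def sign_pose(corners_2d, K):
        # corners_2d: (8, 2) detected corners; K: (3, 3) camera intrinsics
        ok, rvec, tvec = cv2.solvePnP(octagon_3d().astype(np.float32),
                                      corners_2d.astype(np.float32), K, None)
        return rvec, tvec                            # sign pose in the camera frame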
32. Skeleton-bridged Point Completion: From Global Inference to Local Adjustment [PDF] 返回目录
Yinyu Nie, Yiqun Lin, Xiaoguang Han, Shihui Guo, Jian Chang, Shuguang Cui, Jian Jun Zhang
Abstract: Point completion refers to completing the missing geometry of objects from partial point clouds. Existing works usually estimate the missing shape by decoding a latent feature encoded from the input points. However, real-world objects usually have diverse topologies and surface details, which a latent feature may fail to represent well enough to recover a clean and complete surface. To this end, we propose a skeleton-bridged point completion network (SK-PCN) for shape completion. Given a partial scan, our method first predicts its 3D skeleton to obtain the global structure, and completes the surface by learning displacements from skeletal points. We decouple shape completion into structure estimation and surface reconstruction, which eases the learning difficulty and helps our method obtain on-surface details. Besides, considering the features missed during the encoding of input points, SK-PCN adopts a local adjustment strategy that merges the input point cloud into our predictions for surface refinement. Compared with previous methods, our skeleton-bridged manner better supports point normal estimation, yielding a full surface mesh beyond point clouds. Qualitative and quantitative experiments on both point cloud and mesh completion show that our approach outperforms existing methods on various object categories.
33. Pose Refinement Graph Convolutional Network for Skeleton-based Action Recognition [PDF] 返回目录
Shijie Li, Jinhui Yi, Yazan Abu Farha, Juergen Gall
Abstract: With the advances in capturing 2D or 3D skeleton data, skeleton-based action recognition has received increasing interest over the last years. As skeleton data is commonly represented by graphs, graph convolutional networks have been proposed for this task. While current graph convolutional networks accurately recognize actions, they are too expensive for robotics applications where only limited computational resources are available. In this paper, we therefore propose a highly efficient graph convolutional network that addresses the limitations of previous works. This is achieved by a parallel structure that gradually fuses motion and spatial information and by reducing the temporal resolution as early as possible. Furthermore, we explicitly address the issue that human poses can contain errors. To this end, the network first refines the poses before they are further processed to recognize the action. We therefore call the network Pose Refinement Graph Convolutional Network. Compared to other graph convolutional networks, our network requires 86%-93% fewer parameters and reduces the floating point operations by 89%-96% while achieving comparable accuracy. It therefore provides a much better trade-off between accuracy, memory footprint and processing time, which makes it suitable for robotics applications.
34. Photovoltaic module segmentation and thermal analysis tool from thermal images [PDF] 返回目录
L. E. Montañez, L. M. Valentín-Coronado, D. Moctezuma, G. Flores
Abstract: The growing interest in the use of clean energy has led to the construction of increasingly large photovoltaic systems. Consequently, monitoring the proper functioning of these systems has become a highly relevant task. In this paper, automatic detection and analysis of photovoltaic modules are proposed. To perform the analysis, a module identification step based on a digital image processing algorithm is first carried out. This algorithm consists of image enhancement (contrast enhancement, noise reduction, etc.), followed by segmentation of the photovoltaic module. Subsequently, a statistical analysis based on the temperature values of the segmented module is performed. Besides, a graphical user interface has been designed as a potential tool that provides relevant information about the photovoltaic modules.
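A sketch of the statistical-analysis step (our reading of the abstract; the segmentation is assumed to have already produced a binary mask over a radiometric thermal frame):

    import numpy as np

    def module_stats(thermal, mask):
        temps = thermal[mask > 0]                   # temperatures inside the module
        return {"mean": float(temps.mean()),
                "max": float(temps.max()),
                "min": float(temps.min()),
                "std": float(temps.std())}

    thermal = 25.0 + 10.0 * np.random.rand(240, 320)  # fake radiometric frame, deg C
    mask = np.zeros((240, 320), dtype=np.uint8)
    mask[60:180, 80:240] = 1                          # fake segmented module region
    print(module_stats(thermal, mask))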
35. Do End-to-end Stereo Algorithms Under-utilize Information? [PDF] 返回目录
Changjiang Cai, Philippos Mordohai
Abstract: Deep networks for stereo matching typically leverage 2D or 3D convolutional encoder-decoder architectures to aggregate cost and regularize the cost volume for accurate disparity estimation. Due to content-insensitive convolutions and down-sampling and up-sampling operations, these cost aggregation mechanisms do not take full advantage of the information available in the images. Disparity maps suffer from over-smoothing near occlusion boundaries, and erroneous predictions in thin structures. In this paper, we show how deep adaptive filtering and differentiable semi-global aggregation can be integrated in existing 2D and 3D convolutional networks for end-to-end stereo matching, leading to improved accuracy. The improvements are due to utilizing RGB information from the images as a signal to dynamically guide the matching process, in addition to being the signal we attempt to match across the images. We show extensive experimental results on the KITTI 2015 and Virtual KITTI 2 datasets comparing four stereo networks (DispNetC, GCNet, PSMNet and GANet) after integrating four adaptive filters (segmentation-aware bilateral filtering, dynamic filtering networks, pixel adaptive convolution and semi-global aggregation) into their architectures. Our code is available at this https URL.
36. Matching-space Stereo Networks for Cross-domain Generalization [PDF] 返回目录
Changjiang Cai, Matteo Poggi, Stefano Mattoccia, Philippos Mordohai
Abstract: End-to-end deep networks represent the state of the art for stereo matching. While excelling on images framing environments similar to the training set, major drops in accuracy occur in unseen domains (e.g., when moving from synthetic to real scenes). In this paper we introduce a novel family of architectures, namely Matching-Space Networks (MS-Nets), with improved generalization properties. By replacing learning-based feature extraction from image RGB values with matching functions and confidence measures from conventional wisdom, we move the learning process from the color space to the Matching Space, avoiding over-specialization to domain specific features. Extensive experimental results on four real datasets highlight that our proposal leads to superior generalization to unseen environments over conventional deep architectures, keeping accuracy on the source domain almost unaltered. Our code is available at this https URL.
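As an illustration of a matching-space input, the sketch below computes an approximate zero-mean NCC cost volume from grayscale stereo pairs; this is one plausible hand-crafted matching function, not necessarily the set used by MS-Nets, and border handling is simplified with wrap-around shifts.

```python
import torch
import torch.nn.functional as F

def zncc_volume(left, right, max_disp, win=9):
    """Illustrative matching-space feature: an approximate zero-mean NCC cost
    volume, used in place of learned RGB features."""
    pad = win // 2
    def zn(x):  # zero-mean, unit-variance patches via box filtering
        mu = F.avg_pool2d(x, win, 1, pad)
        xc = x - mu
        sd = F.avg_pool2d(xc * xc, win, 1, pad).clamp_min(1e-6).sqrt()
        return xc / sd
    L, R = zn(left), zn(right)
    vols = []
    for d in range(max_disp):
        Rs = torch.roll(R, shifts=d, dims=-1)      # shift right image by d
        vols.append(F.avg_pool2d(L * Rs, win, 1, pad))
    return torch.stack(vols, dim=1).squeeze(2)     # (N, max_disp, H, W)

left, right = torch.rand(1, 1, 64, 128), torch.rand(1, 1, 64, 128)
print(zncc_volume(left, right, 32).shape)          # torch.Size([1, 32, 64, 128])
```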
37. Representation Learning via Invariant Causal Mechanisms [PDF] 返回目录
Jovana Mitrovic, Brian McWilliams, Jacob Walker, Lars Buesing, Charles Blundell
Abstract: Self-supervised learning has emerged as a strategy to reduce the reliance on costly supervised signal by pretraining representations only using unlabeled data. These methods combine heuristic proxy classification tasks with data augmentations and have achieved significant success, but our theoretical understanding of this success remains limited. In this paper we analyze self-supervised representation learning using a causal framework. We show how data augmentations can be more effectively utilized through explicit invariance constraints on the proxy classifiers employed during pretraining. Based on this, we propose a novel self-supervised objective, Representation Learning via Invariant Causal Mechanisms (ReLIC), that enforces invariant prediction of proxy targets across augmentations through an invariance regularizer which yields improved generalization guarantees. Further, using causality we generalize contrastive learning, a particular kind of self-supervised method, and provide an alternative theoretical explanation for the success of these methods. Empirically, ReLIC significantly outperforms competing methods in terms of robustness and out-of-distribution generalization on ImageNet, while also significantly outperforming these methods on Atari achieving above human-level performance on $51$ out of $57$ games.
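A compact sketch of a ReLIC-style objective: a standard contrastive term over two augmented views plus a KL regularizer that enforces invariant posteriors over proxy targets across the augmentations. The pairing of the two terms follows the abstract; the exact form and hyperparameters here are illustrative.

```python
import torch
import torch.nn.functional as F

def relic_loss(z1, z2, temperature=0.5, alpha=1.0):
    """z1, z2: embeddings of the same batch under two augmentations."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits12 = z1 @ z2.t() / temperature        # similarities to proxy targets
    logits21 = z2 @ z1.t() / temperature
    targets = torch.arange(z1.size(0))
    contrastive = (F.cross_entropy(logits12, targets)
                   + F.cross_entropy(logits21, targets))
    # invariance regularizer: KL between posteriors under the two augmentations
    p1 = F.log_softmax(logits12, dim=1)
    p2 = F.log_softmax(logits21, dim=1)
    invariance = F.kl_div(p1, p2, log_target=True, reduction="batchmean")
    return contrastive + alpha * invariance

loss = relic_loss(torch.randn(8, 128), torch.randn(8, 128))
```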
38. DynaSLAM II: Tightly-Coupled Multi-Object Tracking and SLAM [PDF] 返回目录
Berta Bescos, Carlos Campos, Juan D. Tardós, José Neira
Abstract: The assumption of scene rigidity is common in visual SLAM algorithms. However, it limits their applicability in populated real-world environments. Furthermore, most scenarios including autonomous driving, multi-robot collaboration and augmented/virtual reality, require explicit motion information of the surroundings to help with decision making and scene understanding. We present in this paper DynaSLAM II, a visual SLAM system for stereo and RGB-D configurations that tightly integrates the multi-object tracking capability. DynaSLAM II makes use of instance semantic segmentation and of ORB features to track dynamic objects. The structure of the static scene and of the dynamic objects is optimized jointly with the trajectories of both the camera and the moving agents within a novel bundle adjustment proposal. The 3D bounding boxes of the objects are also estimated and loosely optimized within a fixed temporal window. We demonstrate that tracking dynamic objects does not only provide rich clues for scene understanding but is also beneficial for camera tracking. The project code will be released upon acceptance.
39. A Patch-based Image Denoising Method Using Eigenvectors of the Geodesics' Gramian Matrix [PDF] 返回目录
Kelum Gajamannage, Randy Paffenroth, Anura P. Jayasumana
Abstract: With the sophisticated modern technology in the camera industry, the demand for accurate and visually pleasing images is increasing. However, the quality of images captured by cameras is inevitably degraded by noise. Thus, some processing is required to filter out the noise without losing vital image features such as edges and corners. Even though the current literature offers a variety of denoising methods, the fidelity and efficiency of their denoising are sometimes uncertain. Thus, here we propose a novel and computationally efficient image denoising method that is capable of producing an accurate output. This method operates on patches partitioned from the image, rather than on individual pixels, as patches are well known to preserve image smoothness. It then performs denoising on the manifold underlying the patch space, rather than in the image domain, to better preserve features across the whole image. We validate the performance of this method against benchmark image processing methods.
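The steps named in the title and abstract can be sketched as follows (NumPy/SciPy/scikit-learn); the neighborhood size, eigenvector count, and Isomap-style double centering used to form the Gramian are assumptions, and the dense geodesic-distance matrix limits this toy version to small images.

```python
import numpy as np
from sklearn.feature_extraction.image import (extract_patches_2d,
                                              reconstruct_from_patches_2d)
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def denoise(img, patch=7, k=10, n_eig=12):
    """Sketch: k-NN graph on patches -> geodesic distances -> Gramian via
    double centering -> project patches onto the leading eigenvectors.
    Assumes the k-NN graph is connected (no infinite geodesics)."""
    P = extract_patches_2d(img, (patch, patch)).reshape(-1, patch * patch)
    G = kneighbors_graph(P, k, mode="distance")
    D = shortest_path(G, directed=False)         # geodesic distances
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ (D ** 2) @ J                  # geodesics' Gramian matrix
    w, V = np.linalg.eigh(K)
    V = V[:, np.argsort(w)[::-1][:n_eig]]        # leading eigenvectors
    P_hat = V @ (V.T @ P)                        # low-rank projection of patches
    return reconstruct_from_patches_2d(P_hat.reshape(-1, patch, patch), img.shape)

print(denoise(np.random.rand(20, 20), patch=5).shape)   # (20, 20)
```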
40. LiteDepthwiseNet: An Extreme Lightweight Network for Hyperspectral Image Classification [PDF] 返回目录
Benlei Cui, XueMei Dong, Qiaoqiao Zhan, Jiangtao Peng, Weiwei Sun
Abstract: Deep learning methods have shown considerable potential for hyperspectral image (HSI) classification and can achieve high accuracy compared with traditional methods. However, they often need a large number of training samples and have many parameters and a high computational overhead. To solve these problems, this paper proposes a new network architecture, LiteDepthwiseNet, for HSI classification. Based on 3D depthwise convolution, LiteDepthwiseNet decomposes standard convolution into depthwise convolution and pointwise convolution, which achieves high classification performance with minimal parameters. Moreover, we remove the ReLU layer and Batch Normalization layer in the original 3D depthwise convolution, which significantly alleviates overfitting on small datasets. In addition, focal loss is used as the loss function to focus the model's attention on difficult samples and unbalanced data; its training performance is significantly better than that of cross-entropy loss or balanced cross-entropy loss. Experimental results on three benchmark hyperspectral datasets show that LiteDepthwiseNet achieves state-of-the-art performance with a very small number of parameters and low computational cost.
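Two of the described building blocks, a 3D depthwise-separable convolution without ReLU/BatchNorm and the focal loss, are easy to sketch in PyTorch; channel counts and the focusing parameter gamma are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparable3d(nn.Module):
    """3-D depthwise + pointwise convolution; following the paper's ablation,
    no ReLU/BatchNorm inside the block (layer sizes here are illustrative)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv3d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv3d(c_in, c_out, 1)

    def forward(self, x):               # x: (N, C, bands, H, W)
        return self.pointwise(self.depthwise(x))

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)                 # p_t, probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

x = torch.randn(2, 1, 30, 9, 9)         # HSI patch: 30 spectral bands
feats = DepthwiseSeparable3d(1, 8)(x)    # (2, 8, 30, 9, 9)
```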
41. Linking average- and worst-case perturbation robustness via class selectivity and dimensionality [PDF] 返回目录
Matthew L. Leavitt, Ari Morcos
Abstract: Representational sparsity is known to affect robustness to input perturbations in deep neural networks (DNNs), but less is known about how the semantic content of representations affects robustness. Class selectivity (the variability of a unit's responses across data classes or dimensions) is one way of quantifying the sparsity of semantic representations. Given recent evidence that class selectivity may not be necessary for, and can even impair, generalization, we investigated whether it also confers robustness (or vulnerability) to perturbations of input data. We found that class selectivity leads to increased vulnerability to average-case (naturalistic) perturbations in ResNet18 and ResNet20, as measured using Tiny ImageNetC and CIFAR10C, respectively. Networks regularized to have lower levels of class selectivity are more robust to average-case perturbations, while networks with higher class selectivity are more vulnerable. In contrast, we found that class selectivity increases robustness to worst-case (i.e. white box adversarial) perturbations, suggesting that while decreasing class selectivity is helpful for average-case robustness, it is harmful for worst-case robustness. To explain this difference, we studied the dimensionality of the networks' representations: we found that the dimensionality of early-layer representations is inversely proportional to a network's class selectivity, and that adversarial samples cause a larger increase in early-layer dimensionality than corrupted samples. We also found that the input-unit gradient was more variable across samples and units in high-selectivity networks compared to low-selectivity networks. These results lead to the conclusion that units participate more consistently in low-selectivity regimes compared to high-selectivity regimes, effectively creating a larger attack surface and hence vulnerability to worst-case perturbations.
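For concreteness, the class selectivity index commonly used in this line of work can be computed per unit as follows (a sketch; it assumes non-negative, e.g. post-ReLU, activations).

```python
import numpy as np

def class_selectivity(acts, labels):
    """Selectivity index per unit: (u_max - u_-max) / (u_max + u_-max), where
    u_max is the largest class-conditional mean activation and u_-max is the
    mean activation over all remaining classes."""
    classes = np.unique(labels)
    mu = np.stack([acts[labels == c].mean(axis=0) for c in classes])  # (C, units)
    top = mu.max(axis=0)
    rest = (mu.sum(axis=0) - top) / (len(classes) - 1)
    return (top - rest) / (top + rest + 1e-7)    # one value per unit, in [0, 1]

acts = np.abs(np.random.randn(1000, 64))         # stand-in post-ReLU activations
labels = np.random.randint(0, 10, size=1000)
print(class_selectivity(acts, labels).shape)     # (64,)
```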
42. Demonstration of a Cloud-based Software Framework for Video Analytics Application using Low-Cost IoT Devices [PDF] 返回目录
Bhavin Joshi, Tapan Pathak, Vatsal Patel, Sarth Kanani, Pankesh Patel, Muhammad Intizar Ali, John Breslin
Abstract: The design of products and services such as a Smart doorbell, demonstrating video analytics software/algorithm functionality, is expected to address a new kind of requirement: designing a scalable solution while considering the trade-off between cost and accuracy; a flexible architecture to deploy new AI-based models or update existing models as user requirements evolve; and seamless integration of different kinds of user interfaces and devices. To address these challenges, we propose a smart doorbell that orchestrates video analytics across Edge and Cloud resources. The proposal uses AWS as the base platform for implementation and leverages Commercially Available Off-The-Shelf (COTS) affordable devices, such as the Raspberry Pi, as Edge devices.
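A hypothetical edge-side loop hints at the cost/accuracy trade-off the abstract mentions: a Pi-class device streams only every Nth frame to a cloud endpoint. The endpoint URL and payload format are invented for illustration; the paper's actual AWS interface is not described here.

```python
import cv2
import requests

ENDPOINT = "https://example.com/analyze"   # placeholder, not from the paper

def edge_loop(every_n=30):
    """Grab frames from the doorbell camera and upload every Nth frame."""
    cap = cv2.VideoCapture(0)
    i = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:               # fewer uploads -> lower cloud cost
            _, jpg = cv2.imencode(".jpg", frame)
            requests.post(ENDPOINT, data=jpg.tobytes(),
                          headers={"Content-Type": "image/jpeg"})
        i += 1
    cap.release()
```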
43. Respecting Domain Relations: Hypothesis Invariance for Domain Generalization [PDF] 返回目录
Ziqi Wang, Marco Loog, Jan van Gemert
Abstract: In domain generalization, multiple labeled non-independent and non-identically distributed source domains are available during training while neither the data nor the labels of target domains are. Currently, learning so-called domain invariant representations (DIRs) is the prevalent approach to domain generalization. In this work, we define DIRs employed by existing works in probabilistic terms and show that by learning DIRs, overly strict requirements are imposed concerning the invariance. Particularly, DIRs aim to perfectly align representations of different domains, i.e. their input distributions. This is, however, not necessary for good generalization to a target domain and may even dispose of valuable classification information. We propose to learn so-called hypothesis invariant representations (HIRs), which relax the invariance assumptions by merely aligning posteriors, instead of aligning representations. We report experimental results on public domain generalization datasets to show that learning HIRs is more effective than learning DIRs. In fact, our approach can even compete with approaches using prior knowledge about domains.
44. Encoder-decoder semantic segmentation models for electroluminescence images of thin-film photovoltaic modules [PDF] 返回目录
Evgenii Sovetkin, Elbert Jan Achterberg, Thomas Weber, Bart E. Pieters
Abstract: We consider a series of image segmentation methods based on deep neural networks in order to perform semantic segmentation of electroluminescence (EL) images of thin-film modules. We utilize the encoder-decoder deep neural network architecture. The framework is general such that it can easily be extended to other types of images (e.g. thermography) or solar cell technologies (e.g. crystalline silicon modules). The networks are trained and tested on a sample of images from a database with 6000 EL images of Copper Indium Gallium Diselenide (CIGS) thin film modules. We selected two types of features to extract: shunts and so-called "droplets". The latter feature is often observed in the set of images. Several models are tested using various combinations of encoder-decoder layers, and a procedure is proposed to select the best model. We show exemplary results with the best selected model. Furthermore, we applied the best model to the full set of 6000 images and demonstrate that the automated segmentation of EL images can reveal many subtle features which cannot be inferred from studying a small sample of images. We believe these features can contribute to process optimization and quality control.
45. Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs [PDF] 返回目录
Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras, Noah A. Smith, Yejin Choi
Abstract: Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
46. RetiNerveNet: Using Recursive Deep Learning to Estimate Pointwise 24-2 Visual Field Data based on Retinal Structure [PDF] 返回目录
Shounak Datta, Eduardo B. Mariottoni, David Dov, Alessandro A. Jammal, Lawrence Carin, Felipe A. Medeiros
Abstract: Glaucoma is the leading cause of irreversible blindness in the world, affecting over 70 million people. The cumbersome Standard Automated Perimetry (SAP) test is most frequently used to detect visual loss due to glaucoma. Due to the SAP test's innate difficulty and its high test-retest variability, we propose the RetiNerveNet, a deep convolutional recursive neural network for obtaining estimates of the SAP visual field. RetiNerveNet uses information from the more objective Spectral-Domain Optical Coherence Tomography (SDOCT). RetiNerveNet attempts to trace back the arcuate convergence of the retinal nerve fibers, starting from the Retinal Nerve Fiber Layer (RNFL) thickness around the optic disc, to estimate individual age-corrected 24-2 SAP values. Recursive passes through the proposed network sequentially yield estimates of visual locations progressively farther from the optic disc. The proposed network is able to obtain more accurate estimates of the individual visual field values, compared to a number of baselines, implying its utility as a proxy for SAP. We further augment RetiNerveNet to also predict the SAP Mean Deviation values and create an ensemble of RetiNerveNets that further improves performance by increasingly up-weighting underrepresented parts of the training data.
47. CS2-Net: Deep Learning Segmentation of Curvilinear Structures in Medical Imaging [PDF] 返回目录
Lei Mou, Yitian Zhao, Huazhu Fu, Yonghuai Liu, Jun Cheng, Yalin Zheng, Pan Su, Jianlong Yang, Li Chen, Alejandro F Frang, Masahiro Akiba, Jiang Liu
Abstract: Automated detection of curvilinear structures, e.g., blood vessels or nerve fibres, from medical and biomedical images is a crucial early step in automatic image interpretation associated with the management of many diseases. Precise measurement of the morphological changes of these curvilinear organ structures informs clinicians in understanding the mechanism, diagnosis, and treatment of, e.g., cardiovascular, kidney, eye, lung, and neurological conditions. In this work, we propose a generic and unified convolutional neural network for the segmentation of curvilinear structures and illustrate it in several 2D/3D medical imaging modalities. We introduce a new curvilinear structure segmentation network (CS2-Net), which includes a self-attention mechanism in the encoder and decoder to learn rich hierarchical representations of curvilinear structures. Two types of attention modules, spatial attention and channel attention, are utilized to enhance inter-class discrimination and intra-class responsiveness, and to adaptively integrate local features with their global dependencies and normalization. Furthermore, to facilitate the segmentation of curvilinear structures in medical images, we employ a 1x3 and a 3x1 convolutional kernel to capture boundary features. ...
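The 1x3/3x1 boundary kernels and the two attention gates can be sketched as a single PyTorch module; the attention blocks below are deliberately simple squeeze-style gates, not the paper's exact self-attention design.

```python
import torch
import torch.nn as nn

class BoundaryConv(nn.Module):
    """1x3 and 3x1 kernels for thin, curvilinear boundary features, plus very
    simple spatial/channel attention gates (a sketch, not CS2-Net itself)."""
    def __init__(self, c):
        super().__init__()
        self.h = nn.Conv2d(c, c, (1, 3), padding=(0, 1))
        self.v = nn.Conv2d(c, c, (3, 1), padding=(1, 0))
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        f = self.h(x) + self.v(x)            # orientation-sensitive responses
        f = f * self.channel(f)              # channel gate (inter-class)
        return f * self.spatial(f)           # spatial gate (intra-class)

y = BoundaryConv(16)(torch.randn(1, 16, 64, 64))   # (1, 16, 64, 64)
```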
48. Spherical Knowledge Distillation [PDF] 返回目录
Jia Guo, Minghao Chen, Yao Hu, Chen Zhu, Xiaofei He, Deng Cai
Abstract: Knowledge distillation aims at obtaining a small but effective deep model by transferring knowledge from a much larger one. Previous approaches try to reach this goal by simple "logit-supervised" information transfer between the teacher and student, which can subsequently be decomposed into the transfer of normalized logits and the $l^2$ norm. We argue that the norm of the logits is actually interference, which damages the efficiency of the transfer process. To address this problem, we propose Spherical Knowledge Distillation (SKD). Specifically, we project the teacher's and the student's logits onto a unit sphere, and then we can efficiently perform knowledge distillation on the sphere. We verify our argument via theoretical analysis and an ablation study. Extensive experiments have demonstrated the superiority and scalability of our method over the SOTAs.
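A minimal sketch of the spherical projection: both logit vectors are l2-normalized (removing the interfering norm) before the usual temperature-scaled KD term; the temperature value and the omission of any rescaling are assumptions.

```python
import torch
import torch.nn.functional as F

def skd_loss(student_logits, teacher_logits, T=4.0):
    """Sketch of Spherical KD: KD on unit-sphere logits."""
    s = F.normalize(student_logits, p=2, dim=1)   # student logits on the sphere
    t = F.normalize(teacher_logits, p=2, dim=1)   # teacher logits on the sphere
    return F.kl_div(F.log_softmax(s / T, dim=1),
                    F.log_softmax(t / T, dim=1),
                    log_target=True, reduction="batchmean") * T * T

loss = skd_loss(torch.randn(8, 100), torch.randn(8, 100))
```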
49. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients [PDF] 返回目录
Juntang Zhuang, Tommy Tang, Sekhar Tatikonda, Nicha Dvornek, Yifan Ding, Xenophon Papademetris, James S. Duncan
Abstract: Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks (CNNs), adaptive methods typically converge faster but generalize worse compared to SGD; for complex settings such as generative adversarial networks (GANs), adaptive methods are typically the default because of their stability. We propose AdaBelief to simultaneously achieve three goals: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. The intuition for AdaBelief is to adapt the stepsize according to the "belief" in the current gradient direction. Viewing the exponential moving average (EMA) of the noisy gradient as the prediction of the gradient at the next time step, if the observed gradient greatly deviates from the prediction, we distrust the current observation and take a small step; if the observed gradient is close to the prediction, we trust it and take a large step. We validate AdaBelief in extensive experiments, showing that it outperforms other methods with fast convergence and high accuracy on image classification and language modeling. Specifically, on ImageNet, AdaBelief achieves accuracy comparable to SGD. Furthermore, in the training of a GAN on Cifar10, AdaBelief demonstrates high stability and improves the quality of generated samples compared to a well-tuned Adam optimizer. Code is available at this https URL
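The update rule described by the intuition above can be written in a few lines; this is a sketch of one AdaBelief step, not the authors' released implementation.

```python
import torch

def adabelief_step(p, g, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaBelief update (sketch). Unlike Adam, s tracks the EMA of
    (g - m)^2, the deviation of the gradient from its prediction m: a small
    deviation means the 'belief' in the gradient is high, so a larger step."""
    m.mul_(b1).add_(g, alpha=1 - b1)                  # EMA of gradients
    s.mul_(b2).addcmul_(g - m, g - m, value=1 - b2)   # EMA of (g - m)^2
    m_hat = m / (1 - b1 ** t)                          # bias correction
    s_hat = (s + eps) / (1 - b2 ** t)
    p.addcdiv_(m_hat, s_hat.sqrt() + eps, value=-lr)

p, m, s = torch.zeros(3), torch.zeros(3), torch.zeros(3)
adabelief_step(p, torch.ones(3), m, s, t=1)
```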
50. Viewmaker Networks: Learning Views for Unsupervised Representation Learning [PDF] 返回目录
Alex Tamkin, Mike Wu, Noah Goodman
Abstract: Many recent methods for unsupervised representation learning involve training models to be invariant to different "views," or transformed versions of an input. However, designing these views requires considerable human expertise and experimentation, hindering widespread adoption of unsupervised representation learning methods across domains and modalities. To address this, we propose viewmaker networks: generative models that learn to produce input-dependent views for contrastive learning. We train this network jointly with an encoder network to produce adversarial $\ell_p$ perturbations for an input, which yields challenging yet useful views without extensive human tuning. Our learned views, when applied to CIFAR-10, enable comparable transfer accuracy to the well-studied augmentations used for the SimCLR model. Our views significantly outperform baseline augmentations in the speech (+9% absolute) and wearable sensor (+17% absolute) domains. We also show how viewmaker views can be combined with handcrafted views to improve robustness to common image corruptions. Our method demonstrates that learned views are a promising way to reduce the amount of expertise and effort needed for unsupervised learning, potentially extending its benefits to a much wider set of domains.
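One way to realize the bounded, input-dependent views is to project the viewmaker's raw perturbation onto an $\ell_p$ ball before adding it to the input; the budget semantics and the tiny convolutional viewmaker below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_view(x, viewmaker, budget=0.05, p=1):
    """Sketch: generate a perturbation, project it onto an lp ball (here l1,
    with a budget scaled by the input size), and add it to the input."""
    delta = viewmaker(x)                                  # raw perturbation
    norms = delta.flatten(1).norm(p=p, dim=1)             # per-example lp norm
    allowed = budget * x[0].numel()                       # illustrative l1 budget
    scale = (allowed / (norms + 1e-8)).clamp(max=1.0)
    return x + delta * scale.view(-1, 1, 1, 1)

viewmaker = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
views = make_view(torch.rand(4, 3, 32, 32), viewmaker)
```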
51. Harnessing Uncertainty in Domain Adaptation for MRI Prostate Lesion Segmentation [PDF] 返回目录
Eleni Chiou, Francesco Giganti, Shonit Punwani, Iasonas Kokkinos, Eleftheria Panagiotaki
Abstract: The need for training data can impede the adoption of novel imaging modalities for learning-based medical image analysis. Domain adaptation methods partially mitigate this problem by translating training data from a related source domain to a novel target domain, but typically assume that a one-to-one translation is possible. Our work addresses the challenge of adapting to a more informative target domain where multiple target samples can emerge from a single source sample. In particular we consider translating from mp-MRI to VERDICT, a richer MRI modality involving an optimized acquisition protocol for cancer characterization. We explicitly account for the inherent uncertainty of this mapping and exploit it to generate multiple outputs conditioned on a single input. Our results show that this allows us to extract systematically better image representations for the target domain, when used in tandem with both simple, CycleGAN-based baselines, as well as more powerful approaches that integrate discriminative segmentation losses and/or residual adapters. When compared to its deterministic counterparts, our approach yields substantial improvements across a broad range of dataset sizes, increasingly strong baselines, and evaluation measures.
摘要:需要训练数据会阻碍学习基础医学图像分析,采用新颖的成像方式。域适配方法由来自相关源域到一个新的目标域转换训练数据部分缓解这个问题,但通常假设一个对一翻译是可能的。我们的工作解决了适应更翔实的目标域的多个目标样本可以从单一来源样品出现的挑战。特别是,我们考虑从翻译MP-MRI来VERDICT,更丰富的MRI模态涉及用于癌症表征的优化获取协议。我们明确地解释这种映射的固有的不确定性,并利用它来产生多路输出调节在一个单一的输入。我们的研究结果显示,这样我们就可以提取目标域,与两个简单的,基于CycleGAN-基线串联使用时,系统更好的形象表述,以及集成了歧视性的分割损失和/或残留的适配器,更强大的方法。当相比,其确定性的同行,我们的做法得到在广泛的数据集规模的显着改善,日益强大的基线和评估措施。
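The one-to-many mapping at the heart of this abstract is easy to illustrate: if the translator is conditioned on a noise code as well as the source image, drawing several codes from a single input yields several plausible target-domain outputs. The toy PyTorch model below sketches that mechanism only; the architecture, names, and noise-injection scheme are assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class StochasticTranslator(nn.Module):
    """Toy one-to-many translator: the source image is concatenated with a
    spatially broadcast noise code, so different codes give different outputs."""
    def __init__(self, in_ch=1, out_ch=1, z_dim=8):
        super().__init__()
        self.z_dim = z_dim
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + z_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, x, z=None):
        if z is None:                                # sample a fresh code per call
            z = torch.randn(x.size(0), self.z_dim, device=x.device)
        zmap = z[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
        return self.net(torch.cat([x, zmap], dim=1))

# One source slice, several plausible target-domain translations.
g = StochasticTranslator()
x = torch.randn(1, 1, 64, 64)                        # stand-in mp-MRI slice
samples = torch.stack([g(x) for _ in range(4)])      # 4 distinct outputs
print(samples.shape)                                 # torch.Size([4, 1, 1, 64, 64])
```

In the paper this stochasticity is trained so that the sampled outputs cover the genuine ambiguity of the mp-MRI-to-VERDICT mapping, on top of CycleGAN-style or segmentation-aware objectives.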
52. Towards Accurate Quantization and Pruning via Data-free Knowledge Transfer [PDF] 返回目录
Chen Zhu, Zheng Xu, Ali Shafahi, Manli Shu, Amin Ghiasi, Tom Goldstein
Abstract: When large scale training data is available, one can obtain compact and accurate networks to be deployed in resource-constrained environments effectively through quantization and pruning. However, training data are often protected due to privacy concerns and it is challenging to obtain compact networks without data. We study data-free quantization and pruning by transferring knowledge from trained large networks to compact networks. Auxiliary generators are simultaneously and adversarially trained with the targeted compact networks to generate synthetic inputs that maximize the discrepancy between the given large network and its quantized or pruned version. We show theoretically that the alternating optimization for the underlying minimax problem converges under mild conditions for pruning and quantization. Our data-free compact networks achieve competitive accuracy to networks trained and fine-tuned with training data. Our quantized and pruned networks achieve good performance while being more compact and lightweight. Further, we demonstrate that the compact structure and corresponding initialization from the Lottery Ticket Hypothesis can also help in data-free training.
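The alternating minimax described above can be sketched in a few lines: a generator ascends the teacher/student discrepancy to synthesize hard inputs, and the compact student descends it. The toy PyTorch loop below is a hedged illustration; the architectures, the KL discrepancy, and the schedule are assumptions rather than the paper's exact procedure (which applies the same idea to quantized and pruned students).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: a trained "large" teacher, a compact student (e.g. quantized or
# pruned), and a generator mapping latent codes to synthetic inputs.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))    # frozen
student = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))    # compact network
generator = nn.Sequential(nn.Linear(64, 784), nn.Tanh())

for p in teacher.parameters():
    p.requires_grad_(False)

opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

def discrepancy(x):
    """KL divergence between teacher and student predictions on inputs x."""
    return F.kl_div(F.log_softmax(student(x), dim=1),
                    F.softmax(teacher(x), dim=1), reduction='batchmean')

for step in range(100):
    # Generator step: synthesize inputs that MAXIMIZE the discrepancy.
    opt_g.zero_grad()
    (-discrepancy(generator(torch.randn(32, 64)))).backward()
    opt_g.step()
    # Student step: MINIMIZE the discrepancy on fresh synthetic inputs.
    opt_s.zero_grad()
    discrepancy(generator(torch.randn(32, 64)).detach()).backward()
    opt_s.step()
```

No real training data appears anywhere in the loop, which is the point: the generator replaces the dataset, and the paper's convergence analysis concerns exactly this alternating optimization.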
53. XPDNet for MRI Reconstruction: an Application to the fastMRI 2020 Brain Challenge [PDF] 返回目录
Zaccharie Ramzi, Philippe Ciuciu, Jean-Luc Starck
Abstract: We present a modular cross-domain neural network, the XPDNet, and its application to the MRI reconstruction task. The approach consists in unrolling the PDHG algorithm as well as learning the acceleration scheme between steps. We also adopt state-of-the-art deep-learning techniques specific to MRI reconstruction. At the time of writing, this approach is the best performer in PSNR on the fastMRI leaderboards for both knee and brain at acceleration factor 4.
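Unrolling a primal-dual scheme means fixing a small number of iterations and replacing the proximal operators with learned networks that alternate between the k-space (dual) and image (primal) domains. The single-coil PyTorch sketch below illustrates that structure only; it is not the XPDNet, which additionally handles multi-coil data, uses stronger image-domain networks, and learns an acceleration scheme between steps.

```python
import torch
import torch.nn as nn

class UnrolledPDNet(nn.Module):
    """Toy unrolled primal-dual network for single-coil MRI reconstruction:
    small CNNs replace the proximal operators, alternating between the
    k-space (dual) and image (primal) domains via the FFT."""
    def __init__(self, n_iter=5):
        super().__init__()
        self.n_iter = n_iter
        # 2 real channels encode the real/imaginary parts of each complex map.
        self.dual = nn.ModuleList(
            [nn.Conv2d(6, 2, 3, padding=1) for _ in range(n_iter)])
        self.primal = nn.ModuleList(
            [nn.Conv2d(4, 2, 3, padding=1) for _ in range(n_iter)])

    @staticmethod
    def c2r(x):   # complex (B, H, W) -> real (B, 2, H, W)
        return torch.stack([x.real, x.imag], dim=1)

    @staticmethod
    def r2c(x):   # real (B, 2, H, W) -> complex (B, H, W)
        return torch.complex(x[:, 0], x[:, 1])

    def forward(self, kspace, mask):
        image = torch.fft.ifft2(kspace)              # zero-filled initialization
        dual = torch.zeros_like(kspace)
        for i in range(self.n_iter):
            # Dual update: compare the current estimate with the measurements.
            fwd = torch.fft.fft2(image) * mask
            d_in = torch.cat([self.c2r(dual), self.c2r(fwd), self.c2r(kspace)], 1)
            dual = self.r2c(self.dual[i](d_in))
            # Primal update: map the dual correction back to image space.
            bwd = torch.fft.ifft2(dual * mask)
            p_in = torch.cat([self.c2r(image), self.c2r(bwd)], 1)
            image = self.r2c(self.primal[i](p_in))
        return image.abs()

net = UnrolledPDNet()
kspace = torch.randn(1, 64, 64, dtype=torch.complex64)
mask = (torch.rand(1, 64, 64) < 0.25).to(torch.complex64)   # ~4x undersampling
recon = net(kspace * mask, mask)
print(recon.shape)                                          # torch.Size([1, 64, 64])
```

The 25% sampling mask mirrors the acceleration factor 4 quoted in the abstract; in the real challenge setting the mask follows the fastMRI sampling pattern and the data are multi-coil.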
Note: the Chinese abstracts in this digest are machine translations; the cover image is a word cloud of the paper titles.