Contents
3. Multiple-Vehicle Tracking in the Highway Using Appearance Model and Visual Object Tracking [PDF] Abstract
4. Pitfalls of the Gram Loss for Neural Texture Synthesis in Light of Deep Feature Histograms [PDF] Abstract
13. Unmasking the Inductive Biases of Unsupervised Object Representations for Video Sequences [PDF] Abstract
15. Background Modeling via Uncertainty Estimation for Weakly-supervised Action Localization [PDF] Abstract
20. Does Unsupervised Architecture Representation Learning Help Neural Architecture Search? [PDF] Abstract
33. Residual Force Control for Agile Human Behavior Imitation and Extended Motion Synthesis [PDF] Abstract
38. Move-to-Data: A new Continual Learning approach with Deep CNNs, Application for image-class recognition [PDF] Abstract
40. Early Detection of Retinopathy of Prematurity (ROP) in Retinal Fundus Images Via Convolutional Neural Networks [PDF] Abstract
43. Online Sequential Extreme Learning Machines: Features Combined From Hundreds of Midlayers [PDF] Abstract
44. Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks [PDF] Abstract
45. Combining the band-limited parameterization and Semi-Lagrangian Runge--Kutta integration for efficient PDE-constrained LDDMM [PDF] Abstract
46. Automated Identification of Thoracic Pathology from Chest Radiographs with Enhanced Training Pipeline [PDF] Abstract
49. Data Driven Prediction Architecture for Autonomous Driving and its Application on Apollo Platform [PDF] Abstract
Abstracts
1. Robust Baggage Detection and Classification Based on Local Tri-directional Pattern [PDF] Back to contents
Shahbano, Muhammad Abdullah, Kashif Inayat
Abstract: In recent decades, automatic video surveillance systems have gained significant importance in the computer vision community. The crucial objective of surveillance is monitoring and security in public places. In the traditional Local Binary Pattern, the feature description is somewhat inaccurate and the feature size is large. To overcome these shortcomings, our research proposes an algorithm for detecting a person with or without baggage. The Local Tri-directional Pattern (LtriDP) descriptor is used to extract features of different human body parts, including the head, trunk, and limbs. A support vector machine is then trained and evaluated on the extracted features. Experimental results on the INRIA and MSMT17 V1 datasets show that LtriDP outperforms several state-of-the-art feature descriptors, validating its effectiveness.
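For context, the traditional Local Binary Pattern that the abstract contrasts against can be sketched as below. This is the basic 3x3 LBP only, not the authors' LtriDP descriptor; the function name `lbp_code`, the clockwise neighbour order, and the toy patch are assumptions made for illustration.

```python
def lbp_code(patch):
    """Return the 8-bit basic LBP code of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbours visited clockwise from the top-left pixel (an assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour contributes a 1-bit when it is at least as bright as the centre.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


# Hypothetical grey-level patch used only to exercise the function.
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code(patch))
```

With eight neighbours each pixel yields one of 256 codes, so a plain LBP histogram has 256 bins per image region, which is one way to read the abstract's remark about feature size.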
2. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning [PDF] Back to contents
Xinshuo Weng, Yongxin Wang, Yunze Man, Kris Kitani
Abstract: 3D multi-object tracking (MOT) is crucial to autonomous systems. Recent work uses a standard tracking-by-detection pipeline, where feature extraction is first performed independently for each object in order to compute an affinity matrix. The affinity matrix is then passed to the Hungarian algorithm for data association. A key process of this standard pipeline is to learn discriminative features for different objects in order to reduce confusion during data association. In this work, we propose two techniques to improve discriminative feature learning for MOT: (1) instead of obtaining features for each object independently, we propose a novel feature interaction mechanism by introducing a Graph Neural Network. As a result, the feature of one object is informed by the features of other objects, so that the object feature can lean towards objects with similar features (i.e., objects that probably have the same ID) and deviate from objects with dissimilar features (i.e., objects that probably have different IDs), leading to a more discriminative feature for each object; (2) instead of obtaining the feature from either 2D or 3D space as in prior work, we propose a novel joint feature extractor to learn appearance and motion features from 2D and 3D space simultaneously. As features from different modalities often have complementary information, the joint feature can be more discriminative than the feature from each individual modality. To ensure that the joint feature extractor does not rely heavily on one modality, we also propose an ensemble training paradigm. Through extensive evaluation, our proposed method achieves state-of-the-art performance on the KITTI and nuScenes 3D MOT benchmarks. Our code will be made available at this https URL
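The data-association step of the standard pipeline described above can be illustrated with a small sketch. It assumes a hypothetical affinity matrix whose rows are existing tracks and whose columns are new detections, and it swaps in an exhaustive search over permutations for the Hungarian algorithm: both return the assignment with maximum total affinity, but the exhaustive version is only practical for a handful of objects. The function name `associate` and the numbers are illustrative, not taken from the paper.

```python
from itertools import permutations


def associate(affinity):
    """Match each track (row) to a distinct detection (column), maximising total affinity.

    Exhaustive search stands in for the Hungarian algorithm here; the optimum is
    the same, the running time is not.
    """
    n_tracks, n_dets = len(affinity), len(affinity[0])
    assert n_tracks <= n_dets, "sketch assumes at least as many detections as tracks"
    best_score, best_match = float("-inf"), None
    for cols in permutations(range(n_dets), n_tracks):
        score = sum(affinity[r][c] for r, c in enumerate(cols))
        if score > best_score:
            best_score, best_match = score, list(enumerate(cols))
    return best_match, best_score


# Hypothetical affinities between 3 tracks and 3 detections.
affinity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.1, 0.3, 0.7],
]
matches, total = associate(affinity)
print(matches)
```

In practice a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` (called with `maximize=True`) would replace the exhaustive loop. The more discriminative the features behind the affinity matrix, the less ambiguous this assignment becomes, which is the motivation the abstract gives for feature interaction.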
3. Multiple-Vehicle Tracking in the Highway Using Appearance Model and Visual Object Tracking [PDF] Back to contents
Fateme Bafghi, Bijan Shoushtarian
Abstract: In recent decades, due to groundbreaking improvements in machine vision, many daily tasks have come to be performed by computers. One of these tasks is multiple-vehicle tracking, which is widely used in areas such as video surveillance and traffic monitoring. This paper focuses on introducing an efficient novel approach with acceptable accuracy. This is achieved through an efficient appearance and motion model based on the features extracted from each object. For this purpose, two different approaches have been used to extract features: features extracted from a deep neural network, and traditional features. The results from these two approaches are then compared with state-of-the-art trackers. The results are obtained by running the methods on the UA-DETRAC benchmark. The first method reached 58.9% accuracy, while the second reached up to 15.9%. The proposed methods can still be improved by extracting more distinguishable features.
4. Pitfalls of the Gram Loss for Neural Texture Synthesis in Light of Deep Feature Histograms [PDF] Back to contents
Eric Heitz, Kenneth Vanhoey, Thomas Chambon, Laurent Belcour
Abstract: Neural texture synthesis and style transfer are both powered by the Gram matrix as a means to measure deep feature statistics. Despite its ubiquity, this second-order feature descriptor has several shortcomings resulting in visual artifacts, ill-defined interpolation, or inability to capture spatial constraints. Many previous works acknowledge these shortcomings but do not really explain why they occur. Fixing them is thus usually approached by adding new losses, which require parameter tuning and make the problem even more ill-defined, or by designing complex and/or adversarial networks. In this paper, we propose a comprehensive study of these problems in the light of the multi-dimensional histograms of deep features. With the insights gained from our analysis, we show how to compute a well-defined and efficient textural loss based on histogram transformations. Our textural loss outperforms the Gram matrix in terms of quality, robustness, spatial control, and interpolation. It does not require additional learning or parameter tuning, and can be implemented in a few lines of code.
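To make "second-order feature descriptor" concrete, here is a minimal sketch of a Gram matrix over a feature map stored as C channels of N activations each. The normalisation by N and the toy inputs are assumptions chosen here, not taken from the paper. The example shows two single-channel feature maps whose value histograms clearly differ but whose Gram matrices coincide, which is one kind of ambiguity a second-order statistic cannot resolve.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels, each a list of N activations."""
    n = len(features[0])
    # Entry (i, j) is the mean product of channel i and channel j (assumed normalisation).
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


# Two toy single-channel feature maps with different value distributions.
flat = [[1.0, 1.0, 1.0, 1.0]]
spiky = [[2.0, 0.0, 0.0, 0.0]]

print(gram_matrix(flat))   # [[1.0]]
print(gram_matrix(spiky))  # [[1.0]]
```

A loss that compares only Gram matrices would treat `flat` and `spiky` as a perfect match, whereas a histogram-based comparison would not.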
5. Local-Area-Learning Network: Meaningful Local Areas for Efficient Point Cloud Analysis [PDF] Back to contents
Qendrim Bytyqi, Nicola Wolpert, Elmar Schömer
Abstract: Research in point cloud analysis with deep neural networks has made rapid progress in recent years. The pioneering work PointNet offered a direct analysis of point clouds. However, due to its architecture, PointNet is not able to capture local structures. To overcome this drawback, the same authors developed PointNet++ by applying PointNet to local areas. The local areas are defined by center points and their neighbors. In PointNet++ and its further developments, the center points are determined with a Farthest Point Sampling (FPS) algorithm. This has the disadvantage that the center points in general do not have meaningful local areas. In this paper, we introduce the neural Local-Area-Learning Network (LocAL-Net), which places emphasis on the selection and characterization of the local areas. Our approach learns critical points that we use as center points. In order to strengthen the recognition of local structures, the points are given additional metric properties depending on the local areas. Finally, we derive and combine two global feature vectors, one from the whole point cloud and one from all local areas. Experiments on the datasets ModelNet10/40 and ShapeNet show that LocAL-Net is competitive for part segmentation. For classification, LocAL-Net outperforms the state of the art.
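The Farthest Point Sampling step mentioned above can be sketched as follows. This is a generic FPS, not code from PointNet++ or LocAL-Net; the function name, the fixed starting index (implementations often pick it at random), and the toy point cloud are assumptions, and the sketch expects `k` to be no larger than the number of points.

```python
def farthest_point_sampling(points, k, start=0):
    """Pick k indices so that each new point is the one farthest from those already chosen."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    chosen = [start]
    # min_dist[i] is the squared distance from point i to its nearest chosen point.
    min_dist = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen


# Hypothetical point cloud: a tight cluster near the origin plus two distant points.
points = [(0, 0, 0), (1, 0, 0), (0.1, 0, 0), (10, 0, 0), (5, 5, 0)]
print(farthest_point_sampling(points, 3))
```

The selection depends only on geometric spread, not on what is informative for the task, which is the limitation the abstract raises when it argues for learned center points.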
6. Branch-Cooperative OSNet for Person Re-Identification [PDF] Back to contents
Lei Zhang, Xiaofu Wu, Suofei Zhang, Zirui Yin
Abstract: Multi-branch architectures are extensively studied for learning rich feature representations for person re-identification (Re-ID). In this paper, we propose a branch-cooperative architecture over OSNet, termed BC-OSNet, for person Re-ID. By stacking four cooperative branches, namely a global branch, a local branch, a relational branch, and a contrastive branch, we obtain a powerful feature representation for person Re-ID. Extensive experiments show that the proposed BC-OSNet achieves state-of-the-art performance on three popular datasets: Market-1501, DukeMTMC-reID, and CUHK03. In particular, it achieves an mAP of 84.0% and a rank-1 accuracy of 87.1% on CUHK03_labeled.
7. Video Understanding as Machine Translation [PDF] Back to contents
Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani
Abstract: With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
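To show why contrastive objectives depend on negative samples, here is a minimal sketch of an InfoNCE-style loss, one common form of contrastive loss. It is an assumption that this form resembles what the prior work uses, and it is not the generative translation approach the paper proposes. The function name, the temperature value, and the toy embeddings are all illustrative.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among the positive and all negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# Hypothetical unit-length embeddings for a video clip and candidate transcripts.
video = [1.0, 0.0]
matching_text = [1.0, 0.0]
easy_negative = [0.0, 1.0]
hard_negative = [1.0, 0.0]

print(info_nce(video, matching_text, [easy_negative]))
print(info_nce(video, matching_text, [hard_negative]))
```

With an easy negative the loss is already close to zero and provides little training signal, while a hard negative yields a large loss. That sensitivity is why such methods need careful sample selection and large batches of negatives, the requirement the abstract aims to remove.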
8. ESAD: Endoscopic Surgeon Action Detection Dataset [PDF] Back to contents
Vivek Singh Bawa, Gurkirt Singh, Francis KapingA, Inna Skarga-Bandurova, Alice Leporini, Carmela Landolfo, Armando Stabile, Francesco Setti, Riccardo Muradore, Elettra Oleari, Fabio Cuzzolin
Abstract: In this work, we aim to increase the effectiveness of surgical assistant robots. We intend to make assistant robots safer by making them aware of the surgeon's actions, so that they can take appropriate assisting actions. In other words, we aim to solve the problem of surgeon action detection in endoscopic videos. To this end, we introduce a challenging dataset for surgeon action detection in real-world endoscopic videos. Action classes are picked based on feedback from surgeons and annotated by medical professionals. Given a video frame, we draw a bounding box around the surgical tool that is performing an action and label it with the action label. Finally, we present a frame-level action detection baseline model based on recent advances in object detection. Results on our new dataset show that it provides enough interesting challenges for future methods and that it can serve as a strong benchmark for corresponding research on surgeon action detection in endoscopic videos.
9. Are we done with ImageNet? [PDF] Back to contents
Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, Aäron van den Oord
Abstract: Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition.
10. Attribute analysis with synthetic dataset for person re-identification [PDF] Back to contents
Suncheng Xiang, Yuzhuo Fu, Guanjie You, Ting Liu
Abstract: Person re-identification (re-ID) plays an important role in applications such as public security and video surveillance. Recently, learning from synthetic data, which benefits from the popularity of synthetic data engines, has achieved remarkable performance. However, existing synthetic datasets are small and lack diversity, which hinders the development of person re-ID in real-world scenarios. To address this problem, we first develop a large-scale synthetic data engine whose salient characteristic is that it is controllable. Based on it, we build a large-scale synthetic dataset that is diversified and customized across different attributes, such as illumination and viewpoint. Second, we quantitatively analyze the influence of dataset attributes on the re-ID system. To the best of our knowledge, this is the first attempt to explicitly dissect person re-ID from the aspect of attributes on a synthetic dataset. Comprehensive experiments give us a deeper understanding of the fundamental problems in person re-ID. Our research also provides useful insights for dataset building and future practical usage.
11. Knowledge Distillation Meets Self-Supervision
Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy
Abstract: Knowledge distillation, which involves extracting the "dark knowledge" from a teacher network to guide the learning of a student network, has emerged as an important technique for model compression and transfer learning. Unlike previous works that exploit architecture-specific cues such as activation and attention for distillation, here we wish to explore a more general and model-agnostic approach for extracting "richer dark knowledge" from the pre-trained teacher model. We show that the seemingly different self-supervision task can serve as a simple yet powerful solution. For example, when performing contrastive learning between transformed entities, the noisy predictions of the teacher network reflect its intrinsic composition of semantic and pose information. By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student. In this paper, we discuss practical ways to exploit those noisy self-supervision signals with selective transfer for distillation. We further show that self-supervision signals improve conventional distillation with substantial gains under few-shot and noisy-label scenarios. Given the richer knowledge mined from self-supervision, our knowledge distillation approach achieves state-of-the-art performance on standard benchmarks, i.e., CIFAR100 and ImageNet, under both similar-architecture and cross-architecture settings. The advantage is even more pronounced under the cross-architecture setting, where our method outperforms the state of the art CRD by an average of 2.3% in accuracy rate on CIFAR100 across six different teacher-student pairs.
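For readers unfamiliar with the "conventional distillation" that the abstract builds on, the following is a minimal sketch of the standard temperature-softened distillation loss. It is not the self-supervised method proposed in the paper; the function names and the default temperature of 4.0 are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature value is an assumed hyperparameter, not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)  # softened teacher distribution
    q = softmax(student_logits, temperature)  # softened student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and grows as the two softened distributions diverge.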
12. A Face Preprocessing Approach for Improved DeepFake Detection
Polychronis Charitidis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Ioannis Kompatsiaris
Abstract: Recent advancements in content generation technologies (also widely known as DeepFakes) along with the online proliferation of manipulated media and disinformation campaigns render the detection of such manipulations a task of increasing importance. There are numerous works related to DeepFake detection but there is little focus on the impact of dataset preprocessing on the detection accuracy of the models. In this paper, we focus on this aspect of the DeepFake detection task and propose a preprocessing step to improve the quality of training datasets for the problem. We also examine its effects on the DeepFake detection performance. Experimental results demonstrate that the proposed preprocessing approach leads to measurable improvements in the performance of detection models.
13. Unmasking the Inductive Biases of Unsupervised Object Representations for Video Sequences
Marissa A. Weis, Kashyap Chitta, Yash Sharma, Wieland Brendel, Matthias Bethge, Andreas Geiger, Alexander S. Ecker
Abstract: Perceiving the world in terms of objects is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models have been evaluated with respect to different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of individual objects. In this paper, we argue that the established evaluation protocol of multi-object tracking tests precisely these perceptual qualities and we propose a new benchmark dataset based on procedurally generated video sequences. Using this benchmark, we compare the perceptual abilities of three state-of-the-art unsupervised object-centric learning approaches. Towards this goal, we propose a video-extension of MONet, a seminal object-centric model for static scenes, and compare it to two recent video models: OP3, which exploits clustering via spatial mixture models, and TBA, which uses an explicit factorization via spatial transformers. Our results indicate that architectures which employ unconstrained latent representations based on per-object variational autoencoders and full-image object masks are able to learn more powerful representations in terms of object detection, segmentation and tracking than the explicitly parameterized spatial transformer based architecture. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios, suggesting that our synthetic video benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
14. Rethinking Sampling in 3D Point Cloud Generative Adversarial Networks
He Wang, Zetian Jiang, Li Yi, Kaichun Mo, Hao Su, Leonidas J. Guibas
Abstract: In this paper, we examine the long-neglected yet important effects of point sampling patterns in point cloud GANs. Through extensive experiments, we show that sampling-insensitive discriminators (e.g., PointNet-Max) produce shape point clouds with point clustering artifacts, while sampling-oversensitive discriminators (e.g., PointNet++, DGCNN) fail to guide valid shape generation. We propose the concept of sampling spectrum to depict the different sampling sensitivities of discriminators. We further study how different evaluation metrics weigh the sampling pattern against the geometry and propose several perceptual metrics forming a sampling spectrum of metrics. Guided by the proposed sampling spectrum, we discover a middle-point sampling-aware baseline discriminator, PointNet-Mix, which improves all existing point cloud generators by a large margin on sampling-related metrics. We point out that, though recent research has been focused on the generator design, the main bottleneck of point cloud GAN actually lies in the discriminator design. Our work provides both suggestions and tools for building future discriminators. We will release the code to facilitate future research.
15. Background Modeling via Uncertainty Estimation for Weakly-supervised Action Localization
Pilhyeon Lee, Jinglu Wang, Yan Lu, Hyeran Byun
Abstract: Weakly-supervised temporal action localization aims to detect intervals of action instances with only video-level action labels for training. A crucial challenge is to separate frames of action classes from the remaining frames, denoted as background frames (i.e., frames not belonging to any action class). Previous methods attempt background modeling by either synthesizing pseudo background videos with static frames or introducing an auxiliary class for background. However, they overlook the essential fact that background frames could be dynamic and inconsistent. Accordingly, we cast the problem of identifying background frames as out-of-distribution detection and isolate it from conventional action classification. Beyond our base action localization network, we propose a module to estimate the probability of being background (i.e., uncertainty [20]), which allows us to learn uncertainty given only video-level labels via multiple instance learning. A background entropy loss is further designed to reject background frames by forcing them to have a uniform probability distribution over action classes. Extensive experiments verify the effectiveness of our background modeling and show that our method significantly outperforms state-of-the-art methods on the standard benchmarks, THUMOS'14 and ActivityNet (1.2 and 1.3). Our code and the trained model are available at this https URL.
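The idea of pushing background frames toward a uniform distribution over action classes can be illustrated with a small sketch. The form below, a cross-entropy between the uniform distribution and the predicted action distribution, is one plausible reading and an assumption on our part; the paper's exact loss may differ, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def background_entropy_loss(action_logits):
    """Cross-entropy between a uniform target and the predicted action
    distribution (assumed form). It reaches its minimum, log(C) for C
    classes, exactly when the prediction is uniform."""
    probs = softmax(action_logits)
    num_classes = len(probs)
    return -sum(math.log(p) for p in probs) / num_classes
```

A frame with a confidently peaked action prediction receives a larger loss than one with a flat prediction, so minimizing it on background frames discourages assigning them to any single action class.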
16. Quantum Robust Fitting
Tat-Jun Chin, David Suter, James Quach, Shin Fang Chng
Abstract: Many computer vision applications need to recover structure from imperfect measurements of the real world. The task is often solved by robustly fitting a geometric model onto noisy and outlier-contaminated data. However, recent theoretical analyses indicate that many commonly used formulations of robust fitting in computer vision are not amenable to tractable solution and approximation. In this paper, we explore the usage of quantum computers for robust fitting. To do so, we examine and establish the practical usefulness of a robust fitting formulation inspired by Fourier analysis of Boolean functions. We then investigate a quantum algorithm to solve the formulation and analyse the computational speed-up possible over the classical algorithm. Our work thus proposes one of the first quantum treatments of robust fitting for computer vision.
17. Towards Robust Pattern Recognition: A Review
Xu-Yao Zhang, Cheng-Lin Liu, Ching Y. Suen
Abstract: The accuracies for many pattern recognition tasks have increased rapidly year by year, achieving or even outperforming human performance. From the perspective of accuracy, pattern recognition seems to be a nearly-solved problem. However, once launched in real applications, the high-accuracy pattern recognition systems may become unstable and unreliable, due to the lack of robustness in open and changing environments. In this paper, we present a comprehensive review of research towards robust pattern recognition from the perspective of breaking three basic and implicit assumptions: closed-world assumption, independent and identically distributed assumption, and clean and big data assumption, which form the foundation of most pattern recognition models. Actually, our brain is robust at learning concepts continually and incrementally, in complex, open and changing environments, with different contexts, modalities and tasks, by showing only a few examples, under weak or noisy supervision. These are the major differences between human intelligence and machine intelligence, which are closely related to the above three assumptions. After witnessing the significant progress in accuracy improvement nowadays, this review paper will enable us to analyze the shortcomings and limitations of current methods and identify future research directions for robust pattern recognition.
18. Multi Layer Neural Networks as Replacement for Pooling Operations
Wolfgang Fuhl, Enkelejda Kasneci
Abstract: Pooling operations are layers found in almost every modern neural network; they can be calculated at low cost and serve as linear or nonlinear transfer functions for data reduction. Many modern approaches have already dealt with replacing the common maximum-value selection and mean-value operations with others, or even with providing a function that includes different functions which can be selected by changing parameters. Additional neural networks are used to estimate the parameters of these pooling functions. Therefore, these pooling layers need many additional parameters and increase the complexity of the whole model. In this work, we show that even a single perceptron can be used very effectively as a pooling operation without increasing the complexity of the model. This kind of pooling makes it possible to integrate multi-layer neural networks directly into a model as a pooling operation by restructuring the data, and thus to learn complex pooling operations. We compare our approach to tensor convolution with strides as a pooling operation and show that our approach is effective and reduces complexity. Restructuring the data in combination with multiple perceptrons also makes it possible to use our approach for upscaling, which is used for transposed convolutions in semantic segmentation.
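To make the idea of "one perceptron as a pooling operation" concrete, here is a minimal sketch under stated assumptions: each non-overlapping window is restructured into a flat vector and passed through a single perceptron whose weights are shared across windows. The ReLU activation, the 2x2 window, and the function name are illustrative choices, not details confirmed by the abstract.

```python
def perceptron_pool(image, weights, bias=0.0, window=2):
    """Pool a 2D list with one shared perceptron per non-overlapping window."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        out_row = []
        for c in range(0, cols - window + 1, window):
            # Restructure the window into a flat input vector.
            patch = [image[r + i][c + j]
                     for i in range(window) for j in range(window)]
            activation = sum(w * x for w, x in zip(weights, patch)) + bias
            out_row.append(max(0.0, activation))  # ReLU, an assumed choice
        pooled.append(out_row)
    return pooled

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
# With weights fixed to 1/4 and zero bias, the perceptron reduces to
# average pooling on non-negative inputs.
print(perceptron_pool(image, [0.25] * 4))  # [[3.5, 5.5], [11.5, 13.5]]
```

In a trained network the weights and bias would be learned rather than fixed, which is what lets a single unit represent pooling functions other than the mean.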
19. The eyes know it: FakeET -- An Eye-tracking Database to Understand Deepfake Perception
Parul Gupta, Komal Chugh, Abhinav Dhall, Ramanathan Subramanian
Abstract: We present FakeET, an eye-tracking database to understand human visual perception of deepfake videos. Given that the principal purpose of deepfakes is to deceive human observers, FakeET is designed to understand and evaluate the ease with which viewers can detect synthetic video artifacts. FakeET contains viewing patterns compiled from 40 users via the Tobii desktop eye-tracker for 811 videos from the Google Deepfake dataset, with a minimum of two viewings per video. Additionally, EEG responses acquired via the Emotiv sensor are also available. The compiled data confirms (a) distinct eye movement characteristics for real vs. fake videos; (b) utility of the eye-track saliency maps for spatial forgery localization and detection; and (c) Error Related Negativity (ERN) triggers in the EEG responses, and the ability of the raw EEG signal to distinguish between real and fake videos.
20. Does Unsupervised Architecture Representation Learning Help Neural Architecture Search?
Shen Yan, Yu Zheng, Wei Ao, Xiao Zeng, Mi Zhang
Abstract: Existing Neural Architecture Search (NAS) methods either encode neural architectures using discrete encodings that do not scale well, or adopt supervised learning-based methods to jointly learn architecture representations and optimize architecture search on such representations, which incurs search bias. Despite their widespread use, architecture representations learned in NAS are still poorly understood. We observe that the structural properties of neural architectures are hard to preserve in the latent space if architecture representation learning and search are coupled, resulting in less effective search performance. In this work, we find empirically that pre-training architecture representations using only neural architectures, without their accuracies as labels, considerably improves the downstream architecture search efficiency. To explain these observations, we visualize how unsupervised architecture representation learning better encourages neural architectures with similar connections and operators to cluster together. This helps to map neural architectures with similar performance to the same regions in the latent space and makes the transition of architectures in the latent space relatively smooth, which considerably benefits diverse downstream search strategies.
21. Iterate & Cluster: Iterative Semi-Supervised Action Recognition [PDF]
Jingyuan Li, Eli Shlizerman
Abstract: We propose a novel system for active semi-supervised feature-based action recognition. Given time sequences of features tracked during movements, our system clusters the sequences into actions. Our system is based on encoder-decoder unsupervised methods shown to perform clustering by self-organization of their latent representation through the auto-regression task. These methods were tested on human action recognition benchmarks, where they outperformed non-feature-based unsupervised methods and achieved accuracy comparable to skeleton-based supervised methods. However, such methods rely on K-Nearest Neighbours (KNN) associating sequences to actions, and general features with no annotated data would correspond to approximate clusters which could be further enhanced. Our system proposes an iterative semi-supervised method to address this challenge and to actively learn the association of clusters and actions. The method utilizes latent space embedding and clustering of the unsupervised encoder-decoder to guide the selection of sequences to be annotated in each iteration. In each iteration, the selection aims to enhance action recognition accuracy while choosing a small number of sequences for annotation. We test the approach on human skeleton-based action recognition benchmarks, assuming that only annotations chosen by our method are available, and on mouse movement videos recorded in lab experiments. We show that our system can boost recognition performance with only a small percentage of annotations. The system can be used as an interactive annotation tool to guide labeling efforts for 'in the wild' videos of various objects and actions to reach robust recognition.
22. SemifreddoNets: Partially Frozen Neural Networks for Efficient Computer Vision Systems [PDF]
Leo F Isikdogan, Bhavin V Nayak, Chyuan-Tyng Wu, Joao Peralta Moreira, Sushma Rao, Gilad Michael
Abstract: We propose a system comprised of fixed-topology neural networks having partially frozen weights, named SemifreddoNets. SemifreddoNets work as fully-pipelined hardware blocks that are optimized to have an efficient hardware implementation. Those blocks freeze a certain portion of the parameters at every layer and replace the corresponding multipliers with fixed scalers. Fixing the weights reduces the silicon area, logic delay, and memory requirements, leading to significant savings in cost and power consumption. Unlike traditional layer-wise freezing approaches, SemifreddoNets make a profitable trade between the cost and flexibility by having some of the weights configurable at different scales and levels of abstraction in the model. Although fixing the topology and some of the weights somewhat limits the flexibility, we argue that the efficiency benefits of this strategy outweigh the advantages of a fully configurable model for many use cases. Furthermore, our system uses repeatable blocks, therefore it has the flexibility to adjust model complexity without requiring any hardware change. The hardware implementation of SemifreddoNets provides up to an order of magnitude reduction in silicon area and power consumption as compared to their equivalent implementation on a general-purpose accelerator.
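The idea of freezing a portion of the parameters in every layer, as described in the SemifreddoNets abstract above, can be illustrated in software with a masked update step. This is only a minimal sketch under stated assumptions: the function name `sgd_step`, the boolean `frozen` mask, and the 50% freezing pattern are hypothetical, and the paper itself concerns a hardware implementation, not this training loop.

```python
def sgd_step(weights, grads, frozen, lr=0.1):
    """Apply one SGD update, leaving entries flagged as frozen untouched."""
    return [w if f else w - lr * g for w, g, f in zip(weights, grads, frozen)]


# Hypothetical layer with four weights; half of them are frozen.
weights = [1.0, -2.0, 0.5, 3.0]
grads = [0.5, 0.5, -1.0, 2.0]
frozen = [True, False, True, False]  # assumed freezing pattern, not from the paper

updated = sgd_step(weights, grads, frozen)
print(updated)
```

Only the configurable weights move; the frozen ones keep their initial values, which is what would let fixed scalers replace the corresponding multipliers in hardware.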
23. Rethinking Pre-training and Self-training [PDF]
Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, Quoc V. Le
Abstract: Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training helps when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4 AP across all dataset sizes. In other words, self-training works well on exactly the same setup where pre-training does not work (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is a much smaller dataset than COCO, though pre-training does help significantly, self-training improves upon the pre-trained model. On COCO object detection, we achieve 54.3 AP, an improvement of +1.5 AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5% mIOU over the previous state-of-the-art result by DeepLabv3+.
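The general self-training recipe behind the abstract above (a teacher trained on labeled data assigns pseudo-labels to unlabeled data, and a student is then trained on both) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the 1-D nearest-centroid classifier, the `min_margin` confidence filter, and the function names are hypothetical and stand in for the detection and segmentation models and augmentations used in the paper.

```python
from statistics import mean


def fit_centroids(points, labels):
    """Return {label: centroid} for 1-D points."""
    by_label = {}
    for x, y in zip(points, labels):
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(centroids, x):
    """Return (label, margin): nearest centroid and its gap to the runner-up."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0):
    """Train a teacher, pseudo-label confident unlabeled points, train a student."""
    teacher = fit_centroids(labeled_x, labeled_y)
    pseudo = [(x, *predict(teacher, x)) for x in unlabeled_x]
    kept = [(x, y) for x, y, m in pseudo if m >= min_margin]  # drop ambiguous points
    student_x = list(labeled_x) + [x for x, _ in kept]
    student_y = list(labeled_y) + [y for _, y in kept]
    return teacher, fit_centroids(student_x, student_y)


teacher, student = self_train([0.0, 10.0], ["a", "b"], [1.0, 2.0, 5.2, 8.0, 9.0])
print(teacher, student)
```

In this toy run the point 5.2 lies almost midway between the two teacher centroids and is filtered out, while the remaining pseudo-labeled points shift the student's centroids toward the unlabeled data.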
24. Feudal Steering: Hierarchical Learning for Steering Angle Prediction [PDF]
Faith Johnson, Kristin Dana
Abstract: We consider the challenge of automated steering angle prediction for self-driving cars using egocentric road images. In this work, we explore the use of feudal networks, used in hierarchical reinforcement learning (HRL), to devise a vehicle agent to predict steering angles from first-person dash-cam images of the Udacity driving dataset. Our method, Feudal Steering, is inspired by recent work in HRL consisting of a manager network and a worker network that operate on different temporal scales and have different goals. The manager works at a temporal scale that is relatively coarse compared to the worker and has a higher-level, task-oriented goal space. Using feudal learning to divide the task into manager and worker sub-networks provides more accurate and robust prediction. Temporal abstraction in driving allows more complex primitives than the steering angle at a single time instance. Composite actions comprise a subroutine or skill that can be re-used throughout the driving sequence. The associated subroutine id is the manager network's goal, so that the manager seeks to succeed at the high-level task (e.g. a sharp right turn, a slight right turn, moving straight in traffic, or moving straight unencumbered by traffic). The steering angle at a particular time instance is the worker network output, which is regulated by the manager's high-level task. We demonstrate state-of-the-art steering angle prediction results on the Udacity dataset.
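The manager/worker split described in the Feudal Steering abstract above, with the manager acting on a coarser temporal scale than the worker, can be sketched as a simple control loop. This is a hedged illustration only: `toy_manager`, `toy_worker`, the `GAIN` table, the scalar "curvature" observations, and the `horizon` parameter are hypothetical stand-ins for the learned networks and image inputs of the paper.

```python
def run_hierarchy(observations, manager, worker, horizon=3):
    """Manager refreshes the goal every `horizon` steps; worker acts at every step."""
    goal, outputs = None, []
    for t, obs in enumerate(observations):
        if t % horizon == 0:          # coarse temporal scale for the manager
            goal = manager(obs)
        outputs.append((goal, worker(obs, goal)))
    return outputs


def toy_manager(curvature):
    """Hypothetical rule that picks a subroutine id from road curvature."""
    if curvature > 0.5:
        return "sharp_right"
    if curvature > 0.1:
        return "slight_right"
    return "straight"


# Hypothetical per-goal gains; the worker's angle is regulated by the goal.
GAIN = {"sharp_right": 30.0, "slight_right": 10.0, "straight": 0.0}


def toy_worker(curvature, goal):
    """Hypothetical worker: steering angle scaled by the manager's current goal."""
    return GAIN[goal] * curvature


trace = run_hierarchy([0.0, 0.2, 0.3, 0.6, 0.6, 0.0], toy_manager, toy_worker)
for goal, angle in trace:
    print(goal, angle)
```

Note that the goal stays fixed between manager updates even when the observation changes, which is the temporal abstraction the abstract refers to.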
25. SegNBDT: Visual Decision Rules for Segmentation [PDF]
Alvin Wan, Daniel Ho, Younjin Song, Henk Tillman, Sarah Adel Bargal, Joseph E. Gonzalez
Abstract: The black-box nature of neural networks limits model decision interpretability, in particular for high-dimensional inputs in computer vision and for dense pixel prediction tasks like segmentation. To address this, prior work combines neural networks with decision trees. However, such models (1) perform poorly when compared to state-of-the-art segmentation models or (2) fail to produce decision rules with spatially-grounded semantic meaning. In this work, we build a hybrid neural-network and decision-tree model for segmentation that (1) attains neural network segmentation accuracy and (2) provides semi-automatically constructed visual decision rules such as "Is there a window?". We obtain semantic visual meaning by extending saliency methods to segmentation and attain accuracy by leveraging insights from neural-backed decision trees, a deep learning analog of decision trees for image classification. Our model SegNBDT attains accuracy within ~2-4% of the state-of-the-art HRNetV2 segmentation model while also retaining explainability; we achieve state-of-the-art performance for explainable models on three benchmark datasets -- Pascal-Context (49.12%), Cityscapes (79.01%), and Look Into Person (51.64%). Furthermore, user studies suggest visual decision rules are more interpretable, particularly for incorrect predictions. Code and pretrained models can be found at this https URL.
26. On Improving the Generalization of Face Recognition in the Presence of Occlusions [PDF]
Xiang Xu, Nikolaos Sarafianos, Ioannis A. Kakadiaris
Abstract: In this paper, we address a key limitation of existing 2D face recognition methods: robustness to occlusions. To accomplish this task, we systematically analyzed the impact of facial attributes on the performance of a state-of-the-art face recognition method and, through extensive experimentation, quantitatively analyzed the performance degradation under different types of occlusion. Our proposed Occlusion-aware face REcOgnition (OREO) approach learned discriminative facial templates despite the presence of such occlusions. First, an attention mechanism was proposed that extracted local identity-related regions. The local features were then aggregated with the global representations to form a single template. Second, a simple yet effective training strategy was introduced to balance the non-occluded and occluded facial images. Extensive experiments demonstrated that OREO improved the generalization ability of face recognition under occlusions by 10.17% in a single-image-based setting and outperformed the baseline by approximately 2% in terms of rank-1 accuracy in an image-set-based scenario.
27. On Improving Temporal Consistency for Online Face Liveness Detection [PDF]
Xiang Xu, Yuanjun Xiong, Wei Xia
Abstract: In this paper, we focus on improving the online face liveness detection system to enhance the security of the downstream face recognition system. Most existing frame-based methods suffer from prediction inconsistency across time. To address the issue, a simple yet effective solution based on temporal consistency is proposed. Specifically, in the training stage, to integrate the temporal consistency constraint, a temporal self-supervision loss and a class consistency loss are proposed in addition to the softmax cross-entropy loss. In the deployment stage, a training-free non-parametric uncertainty estimation module is developed to smooth the predictions adaptively. Beyond the common evaluation approach, a video segment-based evaluation is proposed to accommodate more practical scenarios. Extensive experiments demonstrated that our solution is more robust against several presentation attacks in various scenarios, and significantly outperformed the state-of-the-art on multiple public datasets by at least 40% in terms of ACER. Besides, with much less computational complexity (33% fewer FLOPs), it provides great potential for low-latency online applications.
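The abstract above mentions smoothing per-frame liveness predictions adaptively at deployment time using an uncertainty estimate. The paper's actual module is not specified here, so the following is only a stand-in technique: an exponential smoother whose step size is scaled by one minus the binary entropy of each frame's score. The function names and the entropy-based weighting are assumptions chosen for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) score: 0 when certain, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def smooth_scores(frame_scores):
    """Exponential smoothing whose step size shrinks for uncertain frames."""
    smoothed, state = [], None
    for p in frame_scores:
        if state is None:
            state = p  # first frame initializes the running estimate
        else:
            weight = 1.0 - binary_entropy(p)  # confident frame -> weight near 1
            state = (1.0 - weight) * state + weight * p
        smoothed.append(state)
    return smoothed


print(smooth_scores([0.9, 0.5, 0.95, 0.1]))
```

A maximally uncertain frame (score 0.5) leaves the running estimate unchanged, while a fully confident frame replaces it, so isolated ambiguous frames do not flip the decision.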
28. PRGFlow: Benchmarking SWAP-Aware Unified Deep Visual Inertial Odometry [PDF]
Nitin J. Sanket, Chahat Deep Singh, Cornelia Fermüller, Yiannis Aloimonos
Abstract: Odometry on aerial robots has to be of low latency and high robustness whilst also respecting the Size, Weight, Area and Power (SWAP) constraints as demanded by the size of the robot. A combination of visual sensors coupled with Inertial Measurement Units (IMUs) has proven to be the best combination to obtain robust and low latency odometry on resource-constrained aerial robots. Recently, deep learning approaches for Visual Inertial fusion have gained momentum due to their high accuracy and robustness. However, the remarkable advantages of these techniques are their inherent scalability (adaptation to different sized aerial robots) and unification (same method works on different sized aerial robots) by utilizing compression methods and hardware acceleration, which have been lacking from previous approaches. To this end, we present a deep learning approach for visual translation estimation and loosely fuse it with an Inertial sensor for full 6DoF odometry estimation. We also present a detailed benchmark comparing different architectures, loss functions and compression methods to enable scalability. We evaluate our network on the MSCOCO dataset and evaluate the VI fusion on multiple real-flight trajectories.
29. An Unsupervised Information-Theoretic Perceptual Quality Metric [PDF] Back to contents
Sangnie Bhardwaj, Ian Fischer, Johannes Ballé, Troy Chinen
Abstract: Tractable models of human perception have proved to be challenging to build. Hand-designed models such as MS-SSIM remain popular predictors of human image quality judgements due to their simplicity and speed. Recent modern deep learning approaches can perform better, but they rely on supervised data which can be costly to gather: large sets of class labels such as ImageNet, image quality ratings, or both. We combine recent advances in information-theoretic objective functions with a computational architecture informed by the physiology of the human visual system and unsupervised training on pairs of video frames, yielding our Perceptual Information Metric (PIM). We show that PIM is competitive with supervised metrics on the recent and challenging BAPPS image quality assessment dataset. We also perform qualitative experiments using the ImageNet-C dataset, and establish that our approach is robust with respect to architectural details.
30. Deep Convolutional Likelihood Particle Filter for Visual Tracking [PDF] Back to contents
Reza Jalil Mozhdehi, Henry Medeiros
Abstract: We propose a novel particle filter for convolutional-correlation visual trackers. Our method uses correlation response maps to estimate likelihood distributions and employs these likelihoods as proposal densities to sample particles. Likelihood distributions are more reliable than proposal densities based on target transition distributions because correlation response maps provide additional information regarding the target's location. Additionally, our particle filter searches for multiple modes in the likelihood distribution, which improves performance in target occlusion scenarios while decreasing computational costs by more efficiently sampling particles. In other challenging scenarios such as those involving motion blur, where only one mode is present but a larger search area may be necessary, our particle filter allows for the variance of the likelihood distribution to increase. We tested our algorithm on the Visual Tracker Benchmark v1.1 (OTB100) and our experimental results demonstrate that our framework outperforms state-of-the-art methods.
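The idea of using a correlation response map as a proposal density can be sketched as follows. This is a deliberately simplified illustration, not the authors' method: it assumes the response map is a small 2D grid of non-negative scores and samples particle positions directly from the normalized grid, whereas the abstract describes estimating likelihood distributions and searching for multiple modes. The function name `sample_particles` and the toy map are hypothetical.

```python
import random


def sample_particles(response_map, num_particles, seed=0):
    """Sample (row, col) particle positions in proportion to the response values.

    Assumption: the response map is treated as an unnormalized discrete
    likelihood over grid cells; negative responses are clipped to zero.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(response_map) for c in range(len(row))]
    weights = [max(value, 0.0) for row in response_map for value in row]
    if sum(weights) <= 0.0:
        raise ValueError("response map has no positive mass to sample from")
    # random.choices draws with replacement according to the given weights.
    return rng.choices(cells, weights=weights, k=num_particles)


# Toy response map with two peaks, as might occur under partial occlusion.
toy_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6],
]
particles = sample_particles(toy_map, num_particles=100)
```

With this toy map, every sampled particle falls on one of the two cells with positive response, so the particles concentrate around both candidate target locations rather than around a single motion-model prediction.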
31. Gaze estimation problem tackled through synthetic images [PDF] Back to contents
Gonzalo Garde, Andoni Larumbe-Bergera, Benoît Bossavit, Rafael Cabeza, Sonia Porta, Arantxa Villanueva
Abstract: In this paper, we evaluate a synthetic framework to be used in the field of gaze estimation employing deep learning techniques. The lack of sufficient annotated data could be overcome by the utilization of a synthetic evaluation framework as long as it resembles the behavior of a real scenario. In this work, we use the U2Eyes synthetic environment, employing the I2Head dataset as a real benchmark for comparison based on alternative training and testing strategies. The results obtained show comparable average behavior between both frameworks, although significantly more robust and stable performance is obtained with the synthetic images. Additionally, the potential of synthetically pretrained models for use in user-specific calibration strategies is shown, with outstanding performance.
32. Training Generative Adversarial Networks with Limited Data [PDF] Back to contents
Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, Timo Aila
Abstract: Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.67.
33. Residual Force Control for Agile Human Behavior Imitation and Extended Motion Synthesis [PDF] Back to contents
Ye Yuan, Kris Kitani
Abstract: Reinforcement learning has shown great promise for synthesizing realistic human behaviors by learning humanoid control policies from motion capture data. However, it is still very challenging to reproduce sophisticated human skills like ballet dance, or to stably imitate long-term human behaviors with complex transitions. The main difficulty lies in the dynamics mismatch between the humanoid model and real humans. That is, motions of real humans may not be physically possible for the humanoid model. To overcome the dynamics mismatch, we propose a novel approach, residual force control (RFC), that augments a humanoid control policy by adding external residual forces into the action space. During training, the RFC-based policy learns to apply residual forces to the humanoid to compensate for the dynamics mismatch and better imitate the reference motion. Experiments on a wide range of dynamic motions demonstrate that our approach outperforms state-of-the-art methods in terms of convergence speed and the quality of learned motions. For the first time, we show a physics-based virtual character performing highly agile ballet dance moves such as pirouette, arabesque and jeté. Furthermore, we propose a dual-policy control framework, where a kinematic policy and an RFC-based policy work in tandem to synthesize multi-modal infinite-horizon human motions without any task guidance or user input. Our approach is the first humanoid control method that successfully learns from a large-scale human motion dataset (Human3.6M) and generates diverse long-term motions.
34. CPR: Classifier-Projection Regularization for Continual Learning [PDF] Back to contents
Sungmin Cha, Hsiang Hsu, Flavio P. Calmon, Taesup Moon
Abstract: We propose classifier-projection regularization (CPR), a general yet simple patch that can be applied to existing regularization-based continual learning methods. Inspired by both recent results on neural networks with wide local minima and information theory, CPR adds an additional regularization term that maximizes the entropy of a classifier's output probability. We demonstrate that this additional term can be interpreted as a projection of the conditional probability given by a classifier's output to the uniform distribution. By applying the Pythagorean theorem for KL divergence, we then prove that this projection may (in theory) improve the performance of continual learning methods. In our extensive experimental results, we apply CPR to several state-of-the-art regularization-based continual learning methods and benchmark performance on popular image recognition datasets. Our results demonstrate that CPR indeed promotes wide local minima and significantly improves both accuracy and plasticity while simultaneously mitigating the catastrophic forgetting of baseline continual learning methods.
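The link between maximizing output entropy and moving toward the uniform distribution rests on a standard identity: for a distribution p over K classes, KL(p || uniform) = log K - H(p), so increasing the entropy H(p) decreases the divergence to uniform. The sketch below checks that identity numerically. The way the entropy term is combined with a base loss (`regularized_loss` and its `beta` weight) is an assumption made for illustration, not the paper's exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(p) in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_to_uniform(probs):
    """KL(p || u), where u is the uniform distribution over len(probs) classes."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def regularized_loss(base_loss, logits, beta):
    """Hypothetical combination: subtracting beta * H(p) rewards high entropy."""
    return base_loss - beta * entropy(softmax(logits))


probs = softmax([2.0, 0.5, -1.0])
gap = kl_to_uniform(probs) - (math.log(len(probs)) - entropy(probs))
```

Here `gap` is zero up to floating-point error, which is the sense in which an entropy-maximizing term pulls the classifier output toward the uniform distribution.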
35. FedGAN: Federated Generative Adversarial Networks for Distributed Data [PDF] Back to contents
Mohammad Rasouli, Tao Sun, Ram Rajagopal
Abstract: We propose Federated Generative Adversarial Network (FedGAN) for training a GAN across distributed sources of non-independent-and-identically-distributed data subject to communication and privacy constraints. Our algorithm uses local generators and discriminators which are periodically synced via an intermediary that averages and broadcasts the generator and discriminator parameters. We theoretically prove the convergence of FedGAN with both equal and two time-scale updates of generator and discriminator, under standard assumptions, using stochastic approximations and communication efficient stochastic gradient descents. We experiment with FedGAN on toy examples (2D system, mixed Gaussian, and Swiss roll), image datasets (MNIST, CIFAR-10, and CelebA), and time series datasets (household electricity consumption and electric vehicle charging sessions). We show that FedGAN converges and has similar performance to general distributed GAN, while reducing communication complexity. We also show its robustness to reduced communications.
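The synchronization step described in the abstract (an intermediary that averages and broadcasts generator and discriminator parameters) can be sketched in a few lines. This is a minimal illustration under stated assumptions: parameters are flat lists of floats, all agents are weighted equally unless weights are supplied, and the names `average_parameters` and `sync_agents` are hypothetical. Local training between synchronizations is omitted.

```python
def average_parameters(param_sets, weights=None):
    """Weighted element-wise average of equally sized flat parameter lists."""
    count = len(param_sets)
    if weights is None:
        # Assumption: uniform weighting across agents.
        weights = [1.0 / count] * count
    size = len(param_sets[0])
    return [
        sum(w * params[i] for w, params in zip(weights, param_sets))
        for i in range(size)
    ]


def sync_agents(agents):
    """Average generator and discriminator parameters, then broadcast to all agents.

    Each agent is a dict with flat 'generator' and 'discriminator' lists.
    """
    generator = average_parameters([a["generator"] for a in agents])
    discriminator = average_parameters([a["discriminator"] for a in agents])
    for agent in agents:
        agent["generator"] = list(generator)
        agent["discriminator"] = list(discriminator)
    return generator, discriminator


agents = [
    {"generator": [1.0, 2.0], "discriminator": [0.0, 4.0]},
    {"generator": [3.0, 4.0], "discriminator": [2.0, 0.0]},
]
shared_generator, shared_discriminator = sync_agents(agents)
```

After the call, both agents hold the same averaged parameters. Only parameters cross the network, never the local training data, which is how a scheme of this kind respects the privacy constraint; syncing periodically rather than at every step is what limits communication.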
36. Sparse and Continuous Attention Mechanisms [PDF] Back to contents
André F. T. Martins, Marcos Treviso, António Farinhas, Vlad Niculae, Mário A. T. Figueiredo, Pedro M. Q. Aguiar
Abstract: Exponential families are widely used in machine learning; they include many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, there has been recent work on sparse alternatives to softmax (e.g. sparsemax and alpha-entmax), which have varying support, being able to assign zero probability to irrelevant categories. This paper expands that work in two directions: first, we extend alpha-entmax to continuous domains, revealing a link with Tsallis statistics and deformed exponential families. Second, we introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for alpha in {1,2}. Experiments on attention-based text classification, machine translation, and visual question answering illustrate the use of continuous attention in 1D and 2D, showing that it allows attending to time intervals and compact regions.
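To make the contrast between softmax and its sparse alternatives concrete, the sketch below implements the standard finite-domain sparsemax (the Euclidean projection of the scores onto the probability simplex) next to softmax. It covers only the discrete prior work that the abstract refers to, not the continuous-domain extension the paper introduces.

```python
import math


def softmax(scores):
    """Softmax always assigns strictly positive probability to every category."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Project scores onto the probability simplex; low scores get exactly zero."""
    ordered = sorted(scores, reverse=True)
    running_sum = 0.0
    threshold = 0.0
    for k, value in enumerate(ordered, start=1):
        running_sum += value
        # The support consists of the top-k scores for the largest k that
        # satisfies this condition; the threshold is updated for each such k.
        if 1.0 + k * value > running_sum:
            threshold = (running_sum - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


scores = [1.0, 0.8, 0.1]
dense = softmax(scores)      # every entry is positive
sparse = sparsemax(scores)   # approximately [0.6, 0.4, 0.0]
```

For these scores the third category receives exactly zero probability under sparsemax, while softmax still assigns it a positive weight. That is the varying support the abstract contrasts with the fixed support of exponential families.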
37. HMIC: Hierarchical Medical Image Classification, A Deep Learning Approach [PDF] Back to contents
Kamran Kowsari, Rasoul Sali, Lubaina Ehsan, William Adorno, Asad Ali, Sean Moore, Beatrice Amadi, Paul Kelly, Sana Syed, Donald Brown
Abstract: Image classification is central to the big data revolution in medicine. Improved information processing methods for diagnosis and classification of digital medical images have been shown to be successful via deep learning approaches. As this field is explored, there are limitations to the performance of traditional supervised classifiers. This paper outlines an approach that is different from the current medical image classification tasks that view the issue as multi-class classification. We performed a hierarchical classification using our Hierarchical Medical Image classification (HMIC) approach. HMIC uses stacks of deep learning models to give particular comprehension at each level of the clinical picture hierarchy. For testing our performance, we use small bowel biopsy images that contain three categories in the parent level (Celiac Disease, Environmental Enteropathy, and histologically normal controls). For the child level, Celiac Disease Severity is classified into 4 classes (I, IIIa, IIIb, and IIIc).
38. Move-to-Data: A new Continual Learning approach with Deep CNNs, Application for image-class recognition [PDF] Back to contents
Miltiadis Poursanidis, Jenny Benois-Pineau, Akka Zemmari, Boris Mansenca, Aymar de Rugy
Abstract: In many real-life applications of supervised learning approaches, not all the training data are available at the same time. Examples are lifelong image classification or recognition of environmental objects during interaction of instrumented persons with their environment, and enrichment of an online database with more images. It is necessary to pre-train the model at a "training recording phase" and then adjust it to the newly arriving data. This is the task of incremental/continual learning approaches. Amongst the different problems to be solved by these approaches, such as introducing new categories in the model, refining existing categories into sub-categories and extending trained classifiers over them, we focus on the problem of adjusting a pre-trained model with new additional training data for existing categories. We propose a fast continual learning layer at the end of the neuronal network. Obtained results are illustrated on the open-source CIFAR benchmark dataset. The proposed scheme yields performance similar to retraining but with drastically lower computational cost.
39. Non-Negative Bregman Divergence Minimization for Deep Direct Density Ratio Estimation [PDF] Back to contents
Masahiro Kato, Takeshi Teshima
Abstract: The estimation of the ratio of two probability densities has garnered attention as the density ratio is useful in various machine learning tasks, such as anomaly detection and domain adaptation. To estimate the density ratio, methods collectively known as direct density ratio estimation (DRE) have been explored. These methods are based on the minimization of the Bregman (BR) divergence between a density ratio model and the true density ratio. However, existing direct DRE suffers from serious overfitting when using flexible models such as neural networks. In this paper, we introduce a non-negative correction for empirical risk using only the prior knowledge of the upper bound of the density ratio. This correction makes a DRE method more robust against overfitting and enables the use of flexible models. In the theoretical analysis, we discuss the consistency of the empirical risk. In our experiments, the proposed estimators show favorable performance in inlier-based outlier detection and covariate shift adaptation.
40. Early Detection of Retinopathy of Prematurity (ROP) in Retinal Fundus Images Via Convolutional Neural Networks
Xin Guo, Yusuke Kikuchi, Guan Wang, Jinglin Yi, Qiong Zou, Rui Zhou
Abstract: Retinopathy of prematurity (ROP) is an abnormal blood vessel development in the retina of a prematurely born infant or an infant with low birth weight. ROP is one of the leading causes of infant blindness globally. Early detection of ROP is critical to slow down and avert the progression to vision impairment caused by ROP. Yet there is limited awareness of ROP even among medical professionals. Consequently, datasets for ROP are scarce, if available at all, and are in general extremely imbalanced in terms of the ratio between negative and positive images. In this study, we formulate the problem of detecting ROP in retinal fundus images in an optimization framework, and apply state-of-the-art convolutional neural network techniques to solve it. Our models achieve 100 percent sensitivity, 96 percent specificity, 98 percent accuracy, and 96 percent precision. In addition, our study shows that as the network gets deeper, more significant features can be extracted for better understanding of ROP.
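As a reminder of how the four reported metrics relate to a confusion matrix, here is a minimal sketch. The counts below are hypothetical and were picked only so that the rounded values line up with the percentages quoted in the abstract; the paper's actual confusion matrix is not given here.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),             # recall on positive cases
        "specificity": tn / (tn + fp),             # recall on negative cases
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),               # share of positive calls that are correct
    }


# Hypothetical counts, not taken from the paper.
m = binary_metrics(tp=48, fp=2, tn=48, fn=0)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

A sensitivity of 100 percent together with a precision below 100 percent means no positive case was missed, but some negatives were flagged as positive.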
41. LSSL: Longitudinal Self-Supervised Learning
Qingyu Zhao, Zixuan Liu, Ehsan Adeli, Kilian M. Pohl
Abstract: Longitudinal neuroimaging or biomedical studies often acquire multiple observations from each individual over time, which entails repeated measures with highly interdependent variables. In this paper, we discuss the implication of repeated measures design on unsupervised learning by showing its tight conceptual connection to self-supervised learning and factor disentanglement. Leveraging the ability for 'self-comparison' through repeated measures, we explicitly separate the definition of the factor space and the representation space enabling an exact disentanglement of time-related factors from the representations of the images. By formulating deterministic multivariate mapping functions between the two spaces, our model, named Longitudinal Self-Supervised Learning (LSSL), uses a standard autoencoding structure with a cosine loss to estimate the direction linked to the disentangled factor. We apply LSSL to two longitudinal neuroimaging studies to show its unique advantage in extracting the 'brain-age' information from the data and in revealing informative characteristics associated with neurodegenerative and neuropsychological disorders. For a downstream task of supervised diagnosis classification, the representations learned by LSSL permit faster convergence and higher (or similar) prediction accuracy compared to several other representation learning techniques.
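The cosine loss mentioned in this abstract can be pictured as follows. This is a sketch under assumptions: it takes the latent codes of two visits of the same subject and a candidate direction vector, and penalizes misalignment between the latent change and that direction. The function name, the `1 - cos` form, and the omission of the autoencoder's reconstruction term are choices made here for illustration, not details confirmed by the abstract.

```python
import math


def cosine_direction_loss(z_early, z_late, direction):
    """1 - cos(angle) between the latent change (z_late - z_early) and `direction`.

    Returns 0 when the change over time points exactly along `direction`,
    1 when it is orthogonal, and 2 when it points the opposite way.
    Assumes both vectors are non-zero.
    """
    delta = [b - a for a, b in zip(z_early, z_late)]
    dot = sum(d * t for d, t in zip(delta, direction))
    norm = math.sqrt(sum(d * d for d in delta)) * math.sqrt(sum(t * t for t in direction))
    return 1.0 - dot / norm
```

Minimizing such a term over many visit pairs pushes one direction of the representation space to track elapsed time, which is one way to read the "direction linked to the disentangled factor".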
42. Potential Field Guided Actor-Critic Reinforcement Learning
Weiya Ren
Abstract: In this paper, we consider the problem of actor-critic reinforcement learning. Firstly, we extend the actor-critic architecture to an actor-critic-N architecture by introducing more critics beyond rewards. Secondly, we combine the reward-based critic with a potential-field-based critic to formulate the proposed potential field guided actor-critic reinforcement learning approach (actor-critic-2). This can be seen as a combination of model-based gradients and model-free gradients in policy improvement. A state with a large potential field often contains strong prior information, such as pointing to the target from a long distance or avoiding collision beside an obstacle. In this situation, we should place more trust in the potential-field-based critic for policy evaluation to accelerate policy improvement, so that the action policy tends to be guided. For example, in practical applications, learning to avoid obstacles should be guided rather than learned by trial and error. A state with a small potential field often lacks information, for example at a local minimum or around the moving target. In that case, we should place more trust in the reward-based critic for policy evaluation to evaluate the long-term return, and the action policy tends to explore. In addition, potential field evaluation can be combined with planning to estimate a better state value function. In this way, reward design can focus more on the final-stage reward, rather than on reward shaping or phased rewards. Furthermore, potential field evaluation can make up for the lack of communication in multi-agent cooperation problems, i.e., each agent has a reward-based critic and a relatively unified potential-field-based critic with prior information. Thirdly, simplified experiments on a predator-prey game demonstrate the effectiveness of the proposed approach.
43. Online Sequential Extreme Learning Machines: Features Combined From Hundreds of Midlayers
Chandra Swarathesh Addanki
Abstract: In this paper, we develop an algorithm called the hierarchical online sequential learning algorithm (H-OS-ELM) for a single feed feedforward network with features combined from hundreds of midlayers. The algorithm can learn chunk by chunk with fixed or varying block size. We believe that the diverse selectivity of neurons in the top layers, which consist of encoded distributed information produced by the other neurons, offers a better computational advantage over inference accuracy. Thus, this paper proposes a hierarchical model framework combined with an online sequential learning algorithm. Firstly, the model consists of a subspace feature extractor composed of subnetwork neurons; using the sub-features that result from the feature extractor in the first layer of the hierarchy, we discard irrelevant factors that are of no use for learning, and we iterate this process so as to recast the sub-features into the hierarchical model, to be processed into more acceptable cognition. Secondly, by using OS-ELM, we adopt a non-iterative style of learning and implement a network that is wider and shallower, which plays an important role in generalizing the overall performance and in turn boosts the learning speed.
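For readers unfamiliar with the non-iterative, chunk-by-chunk learning this abstract refers to, here is a minimal sketch of a plain (non-hierarchical) online sequential ELM. It assumes a fixed random sigmoid hidden layer, a ridge-style initialization, and chunks of a single sample; the hierarchical feature extractor of H-OS-ELM is not modelled, and all names are hypothetical.

```python
import math
import random


def hidden_features(x, weights, biases):
    """Fixed random hidden layer of an ELM: sigmoid(w . x + b) for each neuron."""
    return [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            for w, b in zip(weights, biases)]


def rls_update(P, beta, h, t):
    """Recursive least-squares update of output weights for one sample (h, t).

    P is the running inverse correlation matrix and must be symmetric.
    No gradient iterations are involved.
    """
    n = len(h)
    Ph = [sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(h[i] * Ph[i] for i in range(n))
    P_new = [[P[i][j] - Ph[i] * Ph[j] / denom for j in range(n)] for i in range(n)]
    error = t - sum(h[i] * beta[i] for i in range(n))
    gain = [sum(P_new[i][j] * h[j] for j in range(n)) for i in range(n)]
    beta_new = [beta[i] + gain[i] * error for i in range(n)]
    return P_new, beta_new


def train_os_elm(stream, n_inputs, n_hidden, ridge=1e-3, seed=0):
    """Consume (x, t) pairs one at a time and return the trained network parts."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    P = [[(1.0 / ridge if i == j else 0.0) for j in range(n_hidden)]
         for i in range(n_hidden)]
    beta = [0.0] * n_hidden
    for x, t in stream:
        h = hidden_features(x, weights, biases)
        P, beta = rls_update(P, beta, h, t)
    return weights, biases, beta
```

Larger or varying chunk sizes use the same recursion in matrix form; the single-sample case is shown because it avoids a matrix inverse.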
44. Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks
Viktor Yanush, Alexander Shekhovtsov, Dmitry Molchanov, Dmitry Vetrov
Abstract: Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights. Many successful experimental results have been recently achieved using the empirical straight-through estimation approach. This approach has generated a variety of ad-hoc rules for propagating gradients through non-differentiable activations and updating discrete weights. We put such methods on a solid basis by obtaining them as viable approximations in the stochastic binary network (SBN) model with Bernoulli weights. In this model gradients are well-defined and the weight probabilities can be optimized by continuous techniques. By choosing the activation noises in SBN appropriately and choosing mirror descent (MD) for optimization, we obtain methods that closely resemble several existing straight-through variants, but unlike them, all work reliably and produce equally good results. We further show that variational inference for Bayesian learning of Binary weights can be implemented using MD updates with the same simplicity.
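To make the straight-through idea concrete, consider a single stochastic binary unit x in {-1, +1} with P(x = +1) = sigmoid(a). The sketch below compares the exact gradient of E[f(x)] with respect to a against one common straight-through variant, which backpropagates f'(x) through the derivative of the mean E[x] = 2 * sigmoid(a) - 1. This is an illustration of the general technique under these assumptions, not the paper's derivation, and the function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def exact_grad(f, a):
    """d/da of E[f(x)] = p * f(+1) + (1 - p) * f(-1), with p = sigmoid(a)."""
    p = sigmoid(a)
    return p * (1.0 - p) * (f(1.0) - f(-1.0))


def straight_through_grad(df, a, x):
    """Straight-through estimate from one sampled state x.

    Forward pass uses the discrete sample x; backward pass treats x as if it
    were its mean 2 * sigmoid(a) - 1, whose derivative is 2 * p * (1 - p).
    """
    p = sigmoid(a)
    return df(x) * 2.0 * p * (1.0 - p)
```

For a linear f the two agree exactly, since f(+1) - f(-1) = 2 f'(x); for nonlinear f the straight-through value is a biased approximation.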
45. Combining the band-limited parameterization and Semi-Lagrangian Runge--Kutta integration for efficient PDE-constrained LDDMM
Monica Hernandez
Abstract: The family of PDE-constrained LDDMM methods is emerging as a particularly interesting approach for physically meaningful diffeomorphic transformations. The original combination of Gauss–Newton–Krylov optimization and Runge–Kutta integration shows excellent numerical accuracy and a fast convergence rate. However, its most significant limitation is its huge computational complexity, which hinders its extensive use in applied Computational Anatomy studies. This limitation has been addressed independently by formulating the problem in the space of band-limited vector fields and by Semi-Lagrangian integration. The purpose of this work is to combine both in three variants of band-limited PDE-constrained LDDMM to further increase their computational efficiency. The accuracy of the resulting methods is evaluated extensively. For all the variants, the proposed combined approach shows a significant increase in computational efficiency. In addition, the variant based on the deformation state equation consistently ranks as the best-performing method across all the evaluation frameworks in terms of accuracy and efficiency.
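The semi-Lagrangian idea can be shown on a toy problem. The sketch below advances the 1D transport equation u_t + v u_x = 0 on a periodic grid by tracing each grid point back along its characteristic and interpolating the old field there. A constant velocity and linear interpolation are assumptions made for brevity; the registration setting in the abstract involves spatially varying velocity fields and Runge-Kutta tracing, which are not shown.

```python
import math


def semi_lagrangian_step(u, velocity, dt, dx):
    """One semi-Lagrangian step for u_t + velocity * u_x = 0 on a periodic grid."""
    n = len(u)
    new_u = []
    for i in range(n):
        # Departure point of the characteristic that arrives at node i, in grid units.
        s = i - velocity * dt / dx
        j = math.floor(s)
        w = s - j  # fractional position between nodes j and j + 1
        new_u.append((1.0 - w) * u[j % n] + w * u[(j + 1) % n])
    return new_u
```

Because values are fetched from wherever the characteristic originates, the step remains stable even when `velocity * dt` exceeds `dx`, which is what allows the larger time steps behind the efficiency gains such schemes aim for.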
46. Automated Identification of Thoracic Pathology from Chest Radiographs with Enhanced Training Pipeline
Adora M. DSouza, Anas Z. Abidin, Axel Wismüller
Abstract: Chest x-rays are the most common radiology studies for diagnosing lung and heart disease. Hence, a system for automated pre-reporting of pathologic findings on chest x-rays would greatly enhance radiologists' productivity. To this end, we investigate a deep-learning framework with novel training schemes for classification of different thoracic pathology labels from chest x-rays. We use the currently largest publicly available annotated dataset, ChestX-ray14, of 112,120 chest radiographs of 30,805 patients. Each image was annotated with either a 'NoFinding' class, or one or more of 14 thoracic pathology labels. Subjects can have multiple pathologies, resulting in a multi-class, multi-label problem. We encoded labels as binary vectors using k-hot encoding. We study the ResNet34 architecture, pre-trained on ImageNet, where two key modifications were incorporated into the training framework: (1) stochastic gradient descent with momentum and with restarts using cosine annealing, (2) variable image sizes for fine-tuning to prevent overfitting. Additionally, we use a heuristic algorithm to select a good learning rate. Learning with restarts was used to avoid local minima. The area under the receiver operating characteristic curve (AUC) was used to quantitatively evaluate diagnostic quality. Our results are comparable to, or outperform, the best results of current state-of-the-art methods, with AUCs as follows: Atelectasis: 0.81, Cardiomegaly: 0.91, Consolidation: 0.81, Edema: 0.92, Effusion: 0.89, Emphysema: 0.92, Fibrosis: 0.81, Hernia: 0.84, Infiltration: 0.73, Mass: 0.85, Nodule: 0.76, Pleural Thickening: 0.81, Pneumonia: 0.77, Pneumothorax: 0.89, and NoFinding: 0.79. Our results suggest that, in addition to using sophisticated network architectures, a good learning rate, scheduler, and a robust optimizer can boost performance.
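Two of the training ingredients named in this abstract are easy to sketch: k-hot label encoding and a cosine-annealed learning rate with restarts. The version below assumes a fixed cycle length and a minimum rate of zero by default; the abstract does not state the schedule's actual settings, so those parameters are placeholders.

```python
import math


def k_hot(labels, classes):
    """Binary vector with a 1 for every class present in `labels`."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


def cosine_restart_lr(step, cycle_length, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate that jumps back to lr_max every cycle."""
    t = step % cycle_length  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / cycle_length))


classes = ["Atelectasis", "Edema", "Mass"]   # a subset of the 14 labels, for brevity
print(k_hot(["Edema", "Mass"], classes))     # an image with two findings
print([round(cosine_restart_lr(s, 4, 0.1), 3) for s in range(8)])
```

Unlike one-hot encoding, a k-hot vector can carry several ones, which is what makes the task multi-label.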
47. Multigrid-in-Channels Architectures for Wide Convolutional Neural Networks
Jonathan Ephrath, Lars Ruthotto, Eran Treister
Abstract: We present a multigrid approach that combats the quadratic growth of the number of parameters with respect to the number of channels in standard convolutional neural networks (CNNs). It has been shown that there is a redundancy in standard CNNs, as networks with much sparser convolution operators can yield similar performance to full networks. The sparsity patterns that lead to such behavior, however, are typically random, hampering hardware efficiency. In this work, we present a multigrid-in-channels approach for building CNN architectures that achieves full coupling of the channels, and whose number of parameters is linearly proportional to the width of the network. To this end, we replace each convolution layer in a generic CNN with a multilevel layer consisting of structured (i.e., grouped) convolutions. Our examples from supervised image classification show that applying this strategy to residual networks and MobileNetV2 considerably reduces the number of parameters without negatively affecting accuracy. Therefore, we can widen networks without dramatically increasing the number of parameters or operations.
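The quadratic-versus-linear growth described in this abstract follows from simple counting. The sketch below counts the weights of a square-kernel 2D convolution with and without grouping; the fixed group size of 16 channels is an arbitrary value chosen for illustration.

```python
def conv_params(c_in, c_out, kernel=3, groups=1):
    """Number of weights in a 2D convolution layer (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return kernel * kernel * (c_in // groups) * c_out


for width in (64, 128, 256):
    full = conv_params(width, width)
    # Keep 16 channels per group, so the number of groups grows with the width.
    grouped = conv_params(width, width, groups=width // 16)
    print(width, full, grouped)
```

Doubling the width quadruples the full count but only doubles the grouped one. Grouped convolutions on their own do not couple channels across groups; per the abstract, the multilevel structure is what restores full coupling.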
48. One Ring to Rule Them All: Certifiably Robust Geometric Perception with Outliers
Heng Yang, Luca Carlone
Abstract: We propose a general and practical framework to design certifiable algorithms for robust geometric perception in the presence of a large number of outliers. We investigate the use of a truncated least squares (TLS) cost function, which is known to be robust to outliers, but leads to hard, nonconvex, and nonsmooth optimization problems. Our first contribution is to show that, for a broad class of geometric perception problems, TLS estimation can be reformulated as an optimization over the ring of polynomials, and that Lasserre's hierarchy of convex moment relaxations is empirically tight at the minimum relaxation order (i.e., it certifiably obtains the global minimum of the nonconvex TLS problem). Our second contribution is to exploit the structural sparsity of the objective and constraint polynomials and leverage basis reduction to significantly reduce the size of the semidefinite program (SDP) resulting from the moment relaxation, without compromising its tightness. Our third contribution is to develop scalable dual optimality certifiers through the lens of sums-of-squares (SOS) relaxation that can compute the suboptimality gap and possibly certify global optimality of any candidate solution (e.g., returned by fast heuristics such as RANSAC or graduated non-convexity). Our dual certifiers leverage Douglas-Rachford Splitting to solve a convex feasibility SDP. Numerical experiments across different perception problems, including high-integrity satellite pose estimation, demonstrate the tightness of our relaxations, the correctness of the certification, and the scalability of the proposed dual certifiers to large problems, beyond the reach of current SDP solvers.
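The truncated least squares cost at the center of this abstract is simple to write down: each squared residual is capped at a threshold, so a gross outlier contributes a bounded amount. The sketch below contrasts it with ordinary least squares; the residual values and threshold are made up for illustration.

```python
def ls_cost(residuals):
    """Ordinary least squares: sum of squared residuals."""
    return sum(r * r for r in residuals)


def tls_cost(residuals, threshold):
    """Truncated least squares: each squared residual is capped at threshold**2."""
    cap = threshold * threshold
    return sum(min(r * r, cap) for r in residuals)


residuals = [0.1, -0.2, 50.0]  # two inliers and one gross outlier (made-up values)
print(ls_cost(residuals))
print(tls_cost(residuals, threshold=1.0))
```

The `min` is what makes the cost robust, and also what makes it nonconvex and nonsmooth, which is the difficulty the relaxations in the abstract are designed to handle.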
Heng Yang, Luca Carlone
Abstract: We propose a general and practical framework to design certifiable algorithms for robust geometric perception in the presence of a large amount of outliers. We investigate the use of a truncated least squares (TLS) cost function, which is known to be robust to outliers, but leads to hard, nonconvex, and nonsmooth optimization problems. Our first contribution is to show that -for a broad class of geometric perception problems- TLS estimation can be reformulated as an optimization over the ring of polynomials and Lasserre's hierarchy of convex moment relaxations is empirically tight at the minimum relaxation order (i.e., certifiably obtains the global minimum of the nonconvex TLS problem). Our second contribution is to exploit the structural sparsity of the objective and constraint polynomials and leverage basis reduction to significantly reduce the size of the semidefinite program (SDP) resulting from the moment relaxation, without compromising its tightness. Our third contribution is to develop scalable dual optimality certifiers from the lens of sums-of-squares (SOS) relaxation, that can compute the suboptimality gap and possibly certify global optimality of any candidate solution (e.g., returned by fast heuristics such as RANSAC or graduated non-convexity). Our dual certifiers leverage Douglas-Rachford Splitting to solve a convex feasibility SDP. Numerical experiments across different perception problems, including high-integrity satellite pose estimation, demonstrate the tightness of our relaxations, the correctness of the certification, and the scalability of the proposed dual certifiers to large problems, beyond the reach of current SDP solvers.
49. Data Driven Prediction Architecture for Autonomous Driving and its Application on Apollo Platform [PDF] Back to contents
Kecheng Xu, Xiangquan Xiao, Jinghao Miao, Qi Luo
Abstract: Autonomous driving vehicles (ADVs) are on the road at large scale. For safe and efficient operation, ADVs must be able to predict future states and interact with road entities in complex, real-world driving scenarios. Migrating a well-trained prediction model from one geo-fenced area to another is essential for scaling ADV operation, and it is difficult most of the time because terrain, traffic rules, entity distributions, and driving/walking patterns can differ substantially between geo-fenced operation areas. In this paper, we introduce a highly automated learning-based prediction model pipeline, which has been deployed on the Baidu Apollo self-driving platform, to support data annotation, feature extraction, model training/tuning, and deployment for different prediction learning sub-modules. This pipeline is completely automatic, requires no human intervention, and shows an efficiency increase of up to 400% in parameter tuning when deployed at scale in different scenarios across nations.
Note: the Chinese text is machine-translated.