Contents
2. Transformation Consistency Regularization- A Semi-Supervised Paradigm for Image-to-Image Translation [PDF] Abstract
4. VidCEP: Complex Event Processing Framework to Detect Spatiotemporal Patterns in Video Streams [PDF] Abstract
5. Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble [PDF] Abstract
7. Vision-Based Fall Event Detection in Complex Background Using Attention Guided Bi-directional LSTM [PDF] Abstract
13. Lunar Terrain Relative Navigation Using a Convolutional Neural Network for Visual Crater Detection [PDF] Abstract
14. P$^{2}$Net: Patch-match and Plane-regularization for Unsupervised Indoor Depth Estimation [PDF] Abstract
18. Learning Multiplicative Interactions with Bayesian Neural Networks for Visual-Inertial Odometry [PDF] Abstract
34. Reorganizing local image features with chaotic maps: an application to texture recognition [PDF] Abstract
38. COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder [PDF] Abstract
39. Comparing to Learn: Surpassing ImageNet Pretraining on Radiographs By Comparing Image Representations [PDF] Abstract
41. Automatic extraction of road intersection points from USGS historical map series using deep convolutional neural networks [PDF] Abstract
53. Monocular Retinal Depth Estimation and Joint Optic Disc and Cup Segmentation using Adversarial Networks [PDF] Abstract
Abstracts
1. AdaptiveReID: Adaptive L2 Regularization in Person Re-Identification [PDF] Back to contents
Xingyang Ni, Liang Fang, Heikki Huttunen
Abstract: We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at this https URL.
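The mechanism described in this abstract, a trainable scalar passed through a scaled hard sigmoid to produce the regularization factor, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the hard sigmoid form clip(0.2 * x + 0.5, 0, 1) is one common convention, and the names `theta` and `max_factor`, as well as the default upper bound, are hypothetical.

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid (assumed form): clip(0.2 * x + 0.5, 0, 1)."""
    return min(1.0, max(0.0, 0.2 * x + 0.5))


def regularization_factor(theta, max_factor=1e-3):
    """Map a trainable scalar theta to a factor in [0, max_factor]."""
    return max_factor * hard_sigmoid(theta)


def l2_penalty(weights, theta, max_factor=1e-3):
    """Adaptive L2 penalty: factor(theta) times the sum of squared weights."""
    return regularization_factor(theta, max_factor) * sum(w * w for w in weights)


def penalty_grad_theta(weights, theta, max_factor=1e-3):
    """Gradient of the penalty with respect to theta.

    It is nonzero only on the linear segment of the hard sigmoid, which is
    what lets backpropagation move the factor during training.
    """
    on_linear_segment = 0.0 < 0.2 * theta + 0.5 < 1.0
    slope = 0.2 if on_linear_segment else 0.0
    return max_factor * slope * sum(w * w for w in weights)
```

The scaling keeps the factor bounded however far `theta` drifts, while the gradient function shows that `theta` receives a training signal only while the hard sigmoid is unsaturated.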
2. Transformation Consistency Regularization- A Semi-Supervised Paradigm for Image-to-Image Translation [PDF] Back to contents
Aamir Mustafa, Rafal K. Mantiuk
Abstract: Scarcity of labeled data has motivated the development of semi-supervised learning methods, which learn from large portions of unlabeled data alongside a few labeled samples. Consistency regularization between a model's predictions under different input perturbations, in particular, has been shown to provide state-of-the-art results in a semi-supervised framework. However, most of these methods have been limited to classification and segmentation applications. We propose Transformation Consistency Regularization, which delves into the more challenging setting of image-to-image translation, which remains unexplored by semi-supervised algorithms. The method introduces a diverse set of geometric transformations and enforces the model's predictions for unlabeled data to be invariant to those transformations. We evaluate the efficacy of our algorithm on three different applications: image colorization, denoising and super-resolution. Our method is significantly data efficient, requiring only around 10-20% of labeled samples to achieve image reconstructions similar to those of its fully-supervised counterpart. Furthermore, we show the effectiveness of our method in video processing applications, where knowledge from a few frames can be leveraged to enhance the quality of the rest of the movie.
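The consistency term for unlabeled data can be sketched as follows. This is one possible reading and an assumption: for an image-to-image model, consistency under a geometric transformation T is taken to mean that transforming the input and then predicting matches predicting and then transforming the output. The horizontal flip and the two toy models (`brighten`, `shift_right`) are hypothetical stand-ins, not taken from the paper.

```python
def hflip(img):
    """Geometric transformation T: mirror each row of a 2D image."""
    return [row[::-1] for row in img]


def consistency_loss(model, img, transform=hflip):
    """Mean squared difference between model(T(x)) and T(model(x))."""
    predicted_then_compared = model(transform(img))
    transformed_prediction = transform(model(img))
    diffs = [
        (p - q) ** 2
        for row_a, row_b in zip(predicted_then_compared, transformed_prediction)
        for p, q in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def brighten(img):
    """Pixel-wise toy model; it commutes with any geometric transformation."""
    return [[2 * p for p in row] for row in img]


def shift_right(img):
    """Toy model that does not commute with a horizontal flip."""
    return [[0] + row[:-1] for row in img]
```

A pixel-wise model incurs zero loss, while the shifting model is penalized; no labels are needed to compute either value.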
3. Few-shot Scene-adaptive Anomaly Detection [PDF] Back to contents
Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, Yang Wang
Abstract: We address the problem of anomaly detection in videos. The goal is to identify unusual behaviours automatically by learning exclusively from normal videos. Most existing approaches are usually data-hungry and have limited generalization abilities. They usually need to be trained on a large number of videos from a target scene to achieve good results in that scene. In this paper, we propose a novel few-shot scene-adaptive anomaly detection problem to address the limitations of previous approaches. Our goal is to learn to detect anomalies in a previously unseen scene with only a few frames. A reliable solution for this new problem will have huge potential in real-world applications since it is expensive to collect a massive amount of data for each target scene. We propose a meta-learning based approach for solving this new problem; extensive experimental results demonstrate the effectiveness of our proposed method.
4. VidCEP: Complex Event Processing Framework to Detect Spatiotemporal Patterns in Video Streams [PDF] Back to contents
Piyush Yadav, Edward Curry
Abstract: Video data is highly expressive and has traditionally been very difficult for a machine to interpret. Querying event patterns from video streams is challenging due to its unstructured representation. Middleware systems such as Complex Event Processing (CEP) mine patterns from data streams and send notifications to users in a timely fashion. Current CEP systems have inherent limitations to query video streams due to their unstructured data model and lack of expressive query language. In this work, we focus on a CEP framework where users can define high-level expressive queries over videos to detect a range of spatiotemporal event patterns. In this context, we propose: i) VidCEP, an in-memory, on the fly, near real-time complex event matching framework for video streams. The system uses a graph-based event representation for video streams which enables the detection of high-level semantic concepts from video using cascades of Deep Neural Network models, ii) a Video Event Query language (VEQL) to express high-level user queries for video streams in CEP, iii) a complex event matcher to detect spatiotemporal video event patterns by matching expressive user queries over video data. The proposed approach detects spatiotemporal video event patterns with an F-score ranging from 0.66 to 0.89. VidCEP maintains near real-time performance with an average throughput of 70 frames per second for 5 parallel videos with sub-second matching latency.
5. Data-Efficient Deep Learning Method for Image Classification Using Data Augmentation, Focal Cosine Loss, and Ensemble [PDF] Back to contents
Byeongjo Kim, Chanran Kim, Jaehoon Lee, Jein Song, Gyoungsoo Park
Abstract: In general, sufficient data is essential for good performance and generalization of deep-learning models. However, the many limitations of data collection (cost, resources, etc.) lead to a lack of data in most areas. In addition, the differing domains and licenses of individual data sources make it difficult to collect sufficient data. This situation makes it hard to utilize not only pre-trained models but also external knowledge. Therefore, it is important to leverage small datasets effectively to achieve better performance. We applied techniques in three aspects (data, loss function, and prediction) to enable training from scratch with less data. With these methods, we obtain high accuracy using ImageNet data consisting of only 50 images per class. Furthermore, our model is ranked 4th in the Visual Inductive Priors for Data-Efficient Computer Vision Challenge.
6. CANet: Context Aware Network for 3D Brain Tumor Segmentation [PDF] Back to contents
Zhihua Liu, Lei Tong, Long Chen, Feixiang Zhou, Zheheng Jiang, Qianni Zhang, Yinhai Wang, Caifeng Shan, Ling Li, Huiyu Zhou
Abstract: Automated segmentation of brain tumors in 3D magnetic resonance imaging plays an active role in tumor diagnosis, progression monitoring and surgery planning. Based on convolutional neural networks, especially fully convolutional networks, previous studies have shown some promising technologies for brain tumor segmentation. However, these approaches lack suitable strategies to incorporate contextual information to deal with local ambiguities, leading to unsatisfactory segmentation outcomes in challenging circumstances. In this work, we propose a novel Context-Aware Network (CANet) with a Hybrid Context Aware Feature Extractor (HCA-FE) and a Context Guided Attentive Conditional Random Field (CG-ACRF) for feature fusion. HCA-FE captures high dimensional and discriminative features with the contexts from both the convolutional space and feature interaction graphs. We adopt the powerful inference ability of probabilistic graphical models to learn hidden feature maps, and then use CG-ACRF to fuse the features of different contexts. We evaluate our proposed method on publicly accessible brain tumor segmentation datasets BRATS2017 and BRATS2018 against several state-of-the-art approaches using different segmentation metrics. The experimental results show that the proposed algorithm has better or competitive performance, compared to the standard approaches.
7. Vision-Based Fall Event Detection in Complex Background Using Attention Guided Bi-directional LSTM [PDF] Back to contents
Yong Chen, Lu Wang, Jiajia Hu, Mingbin Ye
Abstract: Falls are one of the greatest risks to the elderly, and fall event detection in solitary scenes has been a hot research topic in recent years. Nevertheless, there is little research on fall event detection against complex backgrounds. Unlike most conventional background subtraction methods, which depend on background modeling, the deep-learning-based Mask R-CNN method can clearly extract the moving object from a noisy background. We further propose an attention-guided bi-directional LSTM model for the final fall event detection. To demonstrate its efficiency, the proposed method is verified on a public dataset and a self-built dataset. Evaluation of the algorithm's performance in comparison with other state-of-the-art methods indicates that the proposed design is accurate and robust, which means it is suitable for fall event detection in complex situations.
8. Self-Supervised Representation Learning for Detection of ACL Tear Injury in Knee MRI [PDF] Back to contents
Siladittya Manna, Saumik Bhattacharya, Umapada Pal
Abstract: The success and efficiency of Deep Learning based models for computer vision applications require large scale human annotated data which are often expensive to generate. Self-supervised learning, a subset of unsupervised learning, handles this problem by learning meaningful features from unlabeled image or video data. In this paper, we propose a self-supervised learning approach to learn transferable features from MRI clips by enforcing the model to learn anatomical features. The pretext task models are designed to predict the correct ordering of the jumbled image patches that the MRI frames are divided into. To the best of our knowledge, none of the supervised learning models performing injury classification task from MRI frames, provide any explanations for the decisions made by the models, making our work the first of its kind on MRI data. Experiments on the pretext task show that this proposed approach enables the model to learn spatial context invariant features which helps in reliable and explainable performance in downstream tasks like classification of ACL tear injury from knee MRI. The efficiency of the novel Convolutional Neural Network proposed in this paper is reflected in the experimental results obtained in the downstream task.
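The pretext task described in this abstract, predicting the correct ordering of jumbled image patches, can be sketched as a data-preparation step. This is an illustrative sketch only: the 2x2 grid, the use of every permutation as a class, and the function names are assumptions rather than details from the paper.

```python
import itertools
import random


def split_into_patches(frame, grid=2):
    """Cut a square 2D frame into grid*grid equal patches, in row-major order."""
    size = len(frame) // grid
    return [
        [row[c * size:(c + 1) * size] for row in frame[r * size:(r + 1) * size]]
        for r in range(grid)
        for c in range(grid)
    ]


def make_pretext_sample(frame, grid=2, rng=random):
    """Shuffle the patches; the label is the index of the permutation used."""
    patches = split_into_patches(frame, grid)
    permutations = list(itertools.permutations(range(len(patches))))
    label = rng.randrange(len(permutations))
    shuffled = [patches[i] for i in permutations[label]]
    return shuffled, label
```

A network trained to predict `label` from `shuffled` needs no human annotation, since the label comes from the shuffling itself.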
9. Enhancing Generalized Zero-Shot Learning via Adversarial Visual-Semantic Interaction [PDF] Back to contents
Shivam Chandhok, Vineeth N Balasubramanian
Abstract: The performance of generative zero-shot methods mainly depends on the quality of generated features and how well the model facilitates knowledge transfer between visual and semantic domains. The quality of generated features is a direct consequence of the ability of the model to capture the several modes of the underlying data distribution. To address these issues, we propose a new two-level joint maximization idea to augment the generative network with an inference network during training, which helps our model capture the several modes of the data and generate features that better represent the underlying data distribution. This provides strong cross-modal interaction for effective transfer of knowledge between visual and semantic domains. Furthermore, existing methods train the zero-shot classifier either on generated synthetic image features or on latent embeddings produced by leveraging representation learning. In this work, we unify these paradigms into a single model which, in addition to synthesizing image features, also utilizes the representation learning capabilities of the inference network to provide discriminative features for the final zero-shot recognition task. We evaluate our approach on four benchmark datasets, namely CUB, FLO, AWA1 and AWA2, against several state-of-the-art methods, and show its performance. We also perform ablation studies to analyze and understand our method more carefully for the Generalized Zero-shot Learning task.
10. Finding Non-Uniform Quantization Schemes using Multi-Task Gaussian Processes [PDF] Back to contents
Marcelo Gennari do Nascimento, Theo W. Costain, Victor Adrian Prisacariu
Abstract: We propose a novel method for neural network quantization that casts the neural architecture search problem as one of hyperparameter search to find non-uniform bit distributions throughout the layers of a CNN. We perform the search assuming a Multi-Task Gaussian Processes prior, which splits the problem to multiple tasks, each corresponding to different number of training epochs, and explore the space by sampling those configurations that yield maximum information. We then show that with significantly lower precision in the last layers we achieve a minimal loss of accuracy with appreciable memory savings. We test our findings on the CIFAR10 and ImageNet datasets using the VGG, ResNet and GoogLeNet architectures.
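To make the idea of a non-uniform bit distribution across layers concrete, the sketch below quantizes weights at a chosen bit width and totals the storage cost of a per-layer allocation. It does not reproduce the Gaussian-process search from the paper; the min-max uniform quantizer, the function names, and the layer sizes in the usage note are all assumptions chosen for illustration.

```python
def quantize(weights, bits):
    """Uniformly quantize weights to 2**bits levels over their own range."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def memory_bits(layer_sizes, bit_widths):
    """Total weight storage, in bits, for a per-layer bit allocation."""
    return sum(n * b for n, b in zip(layer_sizes, bit_widths))
```

For three hypothetical layers with 100, 200 and 50 weights, a uniform 8-bit allocation costs `memory_bits([100, 200, 50], [8, 8, 8])` bits, while lowering precision in the later layers, for example `[8, 4, 2]`, reduces that total; a search procedure then has to weigh such savings against accuracy.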
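As a rough illustration of what a non-uniform bit distribution means for memory, the sketch below quantizes weights uniformly within a layer but lets the bit width differ from layer to layer. The layer sizes, bit widths, and the helper names `quantize` and `model_bits` are hypothetical and not taken from the paper, and the Gaussian-process search that would pick the bit widths is not shown.

```python
def quantize(weights, bits):
    """Uniformly quantize floats to 2**bits levels spanning their own range."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # a constant layer needs no quantization
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def model_bits(layer_sizes, bit_widths):
    """Total storage in bits when layer i holds layer_sizes[i] weights at bit_widths[i] bits."""
    return sum(n * b for n, b in zip(layer_sizes, bit_widths))


# Hypothetical three-layer network: parameter counts per layer.
layer_sizes = [1000, 4000, 16000]
uniform = [8, 8, 8]
non_uniform = [8, 6, 2]  # assumed allocation: lower precision towards the last layers

saving = 1 - model_bits(layer_sizes, non_uniform) / model_bits(layer_sizes, uniform)
print(f"memory saving: {saving:.1%}")
```

Because the last layers hold most of the parameters in this made-up network, lowering their precision dominates the saving; whether accuracy survives such an allocation is exactly what a search like the one described above would have to establish.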
11. Attention as Activation [PDF] Back to contents
Yimian Dai, Stefan Oehmcke, Yiquan Wu, Kobus Barnard
Abstract: Activation functions and attention mechanisms are typically treated as having different purposes and have evolved differently. However, both concepts can be formulated as a non-linear gating function. Inspired by their similarity, we propose a novel type of activation unit called the attentional activation (ATAC) unit as a unification of activation functions and attention mechanisms. In particular, we propose a local channel attention module for simultaneous non-linear activation and element-wise feature refinement, which locally aggregates point-wise cross-channel feature contexts. By replacing the well-known rectified linear units with such ATAC units in convolutional networks, we can construct fully attentional networks that perform significantly better with a modest number of additional parameters. We conducted detailed ablation studies on the ATAC units using several host networks with varying network depths to empirically verify the effectiveness and efficiency of the units. Furthermore, we compared the performance of the ATAC units against existing activation functions as well as other attention mechanisms on the CIFAR-10, CIFAR-100, and ImageNet datasets. Our experimental results show that networks constructed with the proposed ATAC units generally yield performance gains over their competitors given a comparable number of parameters.
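The claim that activation and attention are both gating functions can be made concrete with a small sketch. ReLU is written as the input times a hard 0/1 gate, and a channel attention gate is written as the input times a sigmoid of a point-wise bottleneck over channels. The function names, the bottleneck shape, and the weights are assumptions for illustration only; the actual ATAC module in the paper may differ (for example in normalization or layer layout).

```python
import math


def relu_gate(x):
    """ReLU expressed as x * g(x), where g is a hard 0/1 gate."""
    return x * (1.0 if x > 0 else 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention_gate(channels, w_reduce, w_expand):
    """Gate the C channel values at one spatial position.

    channels: list of C floats at a single pixel.
    w_reduce: r x C matrix, w_expand: C x r matrix (hypothetical weights).
    The gate for each channel is a sigmoid of a point-wise two-layer bottleneck.
    """
    hidden = [max(0.0, sum(w * c for w, c in zip(row, channels))) for row in w_reduce]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    return [c * sigmoid(z) for c, z in zip(channels, logits)]


channels = [1.0, -2.0, 0.5]
w_reduce = [[0.5, 0.0, 0.5]]        # C = 3 channels reduced to r = 1
w_expand = [[1.0], [-1.0], [0.0]]   # expanded back to 3 gate logits
gated = channel_attention_gate(channels, w_reduce, w_expand)
```

Unlike the hard gate in `relu_gate`, the soft gate depends on the other channels at the same position and scales each value by a factor strictly between 0 and 1 rather than zeroing it.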
12. PVSNet: Pixelwise Visibility-Aware Multi-View Stereo Network [PDF] Back to contents
Qingshan Xu, Wenbing Tao
Abstract: Recently, learning-based multi-view stereo methods have achieved promising results. However, they all overlook the visibility difference among different views, which leads to an indiscriminate multi-view similarity definition and greatly limits their performance on datasets with strong viewpoint variations. In this paper, a Pixelwise Visibility-aware multi-view Stereo Network (PVSNet) is proposed for robust dense 3D reconstruction. We present a pixelwise visibility network to learn the visibility information for different neighboring images before computing the multi-view similarity, and then construct an adaptive weighted cost volume with the visibility information. Moreover, we present an anti-noise training strategy that introduces disturbing views during model training to make the pixelwise visibility network better able to distinguish unrelated views, which differs from existing learning methods that use only the two best neighboring views for training. To the best of our knowledge, PVSNet is the first deep learning framework that is able to capture the visibility information of different neighboring views. In this way, our method generalizes well to different types of datasets, especially the ETH3D high-res benchmark with strong viewpoint variations. Extensive experiments show that PVSNet achieves state-of-the-art performance on different datasets.
13. Lunar Terrain Relative Navigation Using a Convolutional Neural Network for Visual Crater Detection [PDF] Back to contents
Lena M. Downes, Ted J. Steiner, Jonathan P. How
Abstract: Terrain relative navigation can improve the precision of a spacecraft's position estimate by detecting global features that act as supplementary measurements to correct for drift in the inertial navigation system. This paper presents a system that uses a convolutional neural network (CNN) and image processing methods to track the location of a simulated spacecraft with an extended Kalman filter (EKF). The CNN, called LunaNet, visually detects craters in the simulated camera frame, and those detections are matched to known lunar craters in the region of the current estimated spacecraft position. These matched craters are treated as features that are tracked using the EKF. LunaNet enables more reliable position tracking over a simulated trajectory due to its greater robustness to changes in image brightness and more repeatable crater detections from frame to frame throughout a trajectory. When tested on trajectories using images of standard brightness, LunaNet combined with an EKF produces a decrease of 60% in the average final position estimation error and a decrease of 25% in the average final velocity estimation error compared to an EKF using an image processing-based crater detection method.
14. P$^{2}$Net: Patch-match and Plane-regularization for Unsupervised Indoor Depth Estimation [PDF] Back to contents
Zehao Yu, Lei Jin, Shenghua Gao
Abstract: This paper tackles the unsupervised depth estimation task in indoor environments. The task is extremely challenging because of the vast areas of non-texture regions in these scenes. These areas could overwhelm the optimization process in the commonly used unsupervised depth estimation framework proposed for outdoor environments. However, even when those regions are masked out, the performance is still unsatisfactory. In this paper, we argue that the poor performance stems from non-discriminative point-based matching. To this end, we propose P$^{2}$Net. We first extract points with large local gradients and adopt patches centered at each point as its representation. A multiview consistency loss is then defined over patches. This operation significantly improves the robustness of the network training. Furthermore, because the textureless regions in indoor scenes (e.g., wall, floor, roof, etc.) usually correspond to planar regions, we propose to leverage superpixels as a plane prior. We constrain the predicted depth to be well fitted by a plane within each superpixel. Extensive experiments on NYUv2 and ScanNet show that our P$^{2}$Net outperforms existing approaches by a large margin. Code is available at this https URL.
15. Proof of Concept: Automatic Type Recognition [PDF] Back to contents
Vincent Christlein, Nikolaus Weichselbaumer, Saskia Limbach, Mathias Seuret
Abstract: The type used to print an early modern book can give scholars valuable information about the time and place of its production as well as the printer responsible. Currently, type recognition is done manually, using the shapes of 'M' or 'Qu' and the size of a type to look it up in a large reference work. This is reliable, but slow and requires specialized skills. We investigate the performance of type classification and type retrieval using a newly created dataset consisting of easy and difficult types used in early printed books. For type classification, we rely on a deep Convolutional Neural Network (CNN) originally used for font-group classification, while we use a common writer identification method for the retrieval case. We show that in both scenarios, easy types can be classified/retrieved with high accuracy while difficult cases are indeed difficult.
16. End-to-end training of a two-stage neural network for defect detection [PDF] Back to contents
Jakob Božič, Domen Tabernik, Danijel Skočaj
Abstract: A segmentation-based, two-stage neural network has shown excellent results in surface defect detection, enabling the network to learn from a relatively small number of samples. In this work, we introduce end-to-end training of the two-stage network together with several extensions to the training process, which reduce the amount of training time and improve the results on surface defect detection tasks. To enable end-to-end training, we carefully balance the contributions of both the segmentation and the classification loss throughout the learning. We adjust the gradient flow from the classification into the segmentation network in order to prevent unstable features from corrupting the learning. As an additional extension to the learning, we propose a frequency-of-use sampling scheme for negative samples to address the issue of over- and under-sampling of images during training, while we employ the distance transform algorithm on the region-based segmentation masks as weights for positive pixels, giving greater importance to areas with a higher probability of the presence of a defect without requiring a detailed annotation. We demonstrate the performance of the end-to-end training scheme and the proposed extensions on three defect detection datasets - DAGM, KolektorSDD and the Severstal Steel defect dataset - where we show state-of-the-art results. On DAGM and KolektorSDD we demonstrate a 100% detection rate, therefore completely solving the datasets. An additional ablation study performed on all three datasets quantitatively demonstrates the contribution of each of the proposed extensions to the overall improvement in results.
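The idea of using a distance transform of a coarse mask as per-pixel weights can be sketched as follows. This minimal version assumes a city-block distance to the nearest background pixel and a simple normalization by the largest distance; the function name `distance_weights`, the choice of metric, and the normalization are illustrative assumptions, not the authors' exact formulation.

```python
from collections import deque


def distance_weights(mask):
    """Return weights in [0, 1] for a binary mask given as a list of rows of 0/1.

    Each positive pixel gets its city-block distance to the nearest background
    pixel, divided by the largest such distance, so pixels deep inside a
    region weigh more than pixels near its border. Background pixels get 0.
    """
    h, w = len(mask), len(mask[0])
    dist = [[0 if mask[y][x] == 0 else None for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if mask[y][x] == 0)
    if not queue:
        # Assumed convention: with no background at all, weigh every pixel equally.
        return [[1.0] * w for _ in range(h)]
    # Multi-source breadth-first search outward from all background pixels.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    peak = max(d for row in dist for d in row)
    if peak == 0:
        return [[0.0] * w for _ in range(h)]  # mask has no positive pixels
    return [[d / peak for d in row] for row in dist]


# A coarse 3x3 region annotation inside a 5x5 image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
weights = distance_weights(mask)
```

With a coarse annotation, pixels near the border of the marked region are the ones most likely to be mislabeled, so down-weighting them is one plausible reading of why such weights help without detailed annotation.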
17. Learning to Learn with Variational Information Bottleneck for Domain Generalization [PDF] Back to contents
Yingjun Du, Jun Xu, Huan Xiong, Qiang Qiu, Xiantong Zhen, Cees G. M. Snoek, Ling Shao
Abstract: Domain generalization models learn to generalize to previously unseen domains, but suffer from prediction uncertainty and domain shift. In this paper, we address both problems. We introduce a probabilistic meta-learning model for domain generalization, in which classifier parameters shared across domains are modeled as distributions. This enables better handling of prediction uncertainty on unseen domains. To deal with domain shift, we learn domain-invariant representations using the proposed principle of a meta variational information bottleneck, which we call MetaVIB. MetaVIB is derived from novel variational bounds of mutual information, by leveraging the meta-learning setting of domain generalization. Through episodic training, MetaVIB learns to gradually narrow domain gaps to establish domain-invariant representations, while simultaneously maximizing prediction accuracy. We conduct experiments on three benchmarks for cross-domain visual recognition. Comprehensive ablation studies validate the benefits of MetaVIB for domain generalization. The comparison results demonstrate that our method consistently outperforms previous approaches.
18. Learning Multiplicative Interactions with Bayesian Neural Networks for Visual-Inertial Odometry [PDF] Back to contents
Kashmira Shinde, Jongseok Lee, Matthias Humt, Aydin Sezgin, Rudolph Triebel
Abstract: This paper presents an end-to-end multi-modal learning approach for monocular Visual-Inertial Odometry (VIO), which is specifically designed to exploit sensor complementarity in the light of sensor degradation scenarios. The proposed network makes use of a multi-head self-attention mechanism that learns multiplicative interactions between multiple streams of information. Another design feature of our approach is the incorporation of model uncertainty using a scalable Laplace approximation. We evaluate the performance of the proposed approach by comparing it against end-to-end state-of-the-art methods on the KITTI dataset and show that it achieves superior performance. Importantly, our work thereby provides empirical evidence that learning multiplicative interactions can result in a powerful inductive bias for increased robustness to sensor failures.
19. Visualizing Transfer Learning [PDF] Back to contents
Róbert Szabó, Dániel Katona, Márton Csillag, Adrián Csiszárik, Dániel Varga
Abstract: We provide visualizations of individual neurons of a deep image recognition network during the temporal process of transfer learning. These visualizations qualitatively demonstrate various novel properties of the transfer learning process regarding the speed and characteristics of adaptation, neuron reuse, the spatial scale of the represented image features, and the behavior of transfer learning on small data. We publish the large-scale dataset that we have created for the purposes of this analysis.
20. Fast and Robust Iterative Closest Point [PDF] Back to contents
Juyong Zhang, Yuxin Yao, Bailin Deng
Abstract: The Iterative Closest Point (ICP) algorithm and its variants are a fundamental technique for rigid registration between two point sets, with wide applications in different areas from robotics to 3D reconstruction. The main drawbacks of ICP are its slow convergence as well as its sensitivity to outliers, missing data, and partial overlaps. Recent work such as Sparse ICP achieves robustness via sparsity optimization at the cost of computational speed. In this paper, we propose a new method for robust registration with fast convergence. First, we show that the classical point-to-point ICP can be treated as a majorization-minimization (MM) algorithm, and propose an Anderson acceleration approach to improve its convergence. In addition, we introduce a robust error metric based on Welsch's function, which is minimized efficiently using the MM algorithm with Anderson acceleration. On challenging datasets with noise and partial overlaps, we achieve similar or better accuracy than Sparse ICP while being at least an order of magnitude faster. Finally, we extend the robust formulation to point-to-plane ICP, and solve the resulting problem using a similar Anderson-accelerated MM strategy. Our robust ICP methods improve the registration accuracy on benchmark datasets while being competitive in computational time.
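To show how a Welsch-type error metric can be minimized by majorization-minimization, the sketch below estimates a robust 1-D location instead of a rigid registration. It assumes the common form ψ(x) = 1 - exp(-x² / (2ν²)) for Welsch's function, whose quadratic surrogate yields the weights exp(-x² / (2ν²)), so each MM step is a weighted mean. The names `welsch_weight` and `robust_mean` are hypothetical, and Anderson acceleration is deliberately left out.

```python
import math


def welsch_weight(residual, nu):
    """Weight of a residual in the quadratic surrogate of Welsch's function."""
    return math.exp(-residual ** 2 / (2.0 * nu ** 2))


def robust_mean(xs, nu, iters=50):
    """Minimize sum_i psi(x_i - mu) over mu by majorization-minimization.

    Each iteration fixes the weights at the current estimate and solves the
    resulting weighted least-squares problem, which is a weighted mean.
    Starting at a data point keeps at least one weight equal to 1 initially.
    """
    mu = sorted(xs)[len(xs) // 2]
    for _ in range(iters):
        ws = [welsch_weight(x - mu, nu) for x in xs]
        mu = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return mu


# Five inliers around 1.0 and one gross outlier.
xs = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]
plain = sum(xs) / len(xs)
robust = robust_mean(xs, nu=1.0)
```

The outlier's weight is effectively zero, so the robust estimate stays near the inliers while the plain mean is dragged far away. In a registration setting the same reweighting would apply to point-pair residuals, with the weighted mean replaced by a weighted rigid alignment step.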
21. Temporal Distinct Representation Learning for Action Recognition [PDF] Back to contents
Junwu Weng, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Xudong Jiang, Junsong Yuan
Abstract: Motivated by the previous success of Two-Dimensional Convolutional Neural Network (2D CNN) on image recognition, researchers endeavor to leverage it to characterize videos. However, one limitation of applying 2D CNN to analyze videos is that different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization, especially in the spatial semantics extraction process, hence neglecting the critical variations among frames. In this paper, we attempt to tackle this issue through two ways. 1) Design a sequential channel filtering mechanism, i.e., Progressive Enhancement Module (PEM), to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction. 2) Create a Temporal Diversity Loss (TD Loss) to force the kernels to concentrate on and capture the variations among frames rather than the image regions with similar appearance. Our method is evaluated on benchmark temporal reasoning datasets Something-Something V1 and V2, and it achieves visible improvements over the best competitor by 2.4% and 1.3%, respectively. Besides, performance improvements over the 2D-CNN-based state-of-the-arts on the large-scale dataset Kinetics are also witnessed.
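Summary: Because a 2D CNN applies the same kernels to every frame of a video, it tends to extract repeated, redundant spatial information and to neglect the critical variations among frames. The paper addresses this in two ways: a Progressive Enhancement Module (PEM), a sequential channel filtering mechanism that excites the discriminative channels of features from different frames step by step, and a Temporal Diversity Loss (TD Loss) that forces the kernels to capture variations among frames rather than image regions with similar appearance. The method improves on the best competitor by 2.4% and 1.3% on Something-Something V1 and V2, respectively, and also improves over 2D-CNN-based methods on Kinetics.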
22. Augmented Bi-path Network for Few-shot Learning [PDF] Back to contents
Baoming Yan, Chen Zhou, Bo Zhao, Kan Guo, Jiang Yang, Xiaobo Li, Ming Zhang, Yizhou Wang
Abstract: Few-shot Learning (FSL) which aims to learn from few labeled training data is becoming a popular research topic, due to the expensive labeling cost in many real-world applications. One kind of successful FSL method learns to compare the testing (query) image and training (support) image by simply concatenating the features of two images and feeding it into the neural network. However, with few labeled data in each class, the neural network has difficulty in learning or comparing the local features of two images. Such simple image-level comparison may cause serious mis-classification. To solve this problem, we propose Augmented Bi-path Network (ABNet) for learning to compare both global and local features on multi-scales. Specifically, the salient patches are extracted and embedded as the local features for every image. Then, the model learns to augment the features for better robustness. Finally, the model learns to compare global and local features separately, i.e., in two paths, before merging the similarities. Extensive experiments show that the proposed ABNet outperforms the state-of-the-art methods. Both quantitative and visual ablation studies are provided to verify that the proposed modules lead to more precise comparison results.
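Summary: Few-shot learning (FSL) aims to learn from few labeled training samples. One successful family of FSL methods learns to compare a query (testing) image with a support (training) image by concatenating their features and feeding them to a neural network, but with so little labeled data per class the network has difficulty learning or comparing local features, and such image-level comparison may cause serious misclassification. The Augmented Bi-path Network (ABNet) extracts and embeds salient patches as the local features of each image, learns to augment the features for better robustness, and compares global and local features separately in two paths before merging the similarities. Experiments show that ABNet outperforms state-of-the-art methods, and quantitative and visual ablation studies verify that the proposed modules lead to more precise comparisons.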
23. CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions [PDF] Back to contents
Zhongdao Wang, Jingwei Zhang, Liang Zheng, Yixuan Liu, Yifan Sun, Yali Li, Shengjin Wang
Abstract: This paper proposes a self-supervised learning method for the person re-identification (re-ID) problem, where existing unsupervised methods usually rely on pseudo labels, such as those from video tracklets or clustering. A potential drawback of using pseudo labels is that errors may accumulate and it is challenging to estimate the number of pseudo IDs. We introduce a different unsupervised method that allows us to learn pedestrian embeddings from raw videos, without resorting to pseudo labels. The goal is to construct a self-supervised pretext task that matches the person re-ID objective. Inspired by the data association concept in multi-object tracking, we propose the Cycle Association (CycAs) task: after performing data association between a pair of video frames forward and then backward, a pedestrian instance is supposed to be associated to itself. To fulfill this goal, the model must learn a meaningful representation that can well describe correspondences between instances in frame pairs. We adapt the discrete association process to a differentiable form, such that end-to-end training becomes feasible. Experiments are conducted in two aspects: We first compare our method with existing unsupervised re-ID methods on seven benchmarks and demonstrate CycAs' superiority. Then, to further validate the practical value of CycAs in real-world applications, we perform training on self-collected videos and report promising performance on standard test sets.
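Summary: Existing unsupervised person re-identification (re-ID) methods usually rely on pseudo labels from video tracklets or clustering, where errors may accumulate and the number of pseudo IDs is hard to estimate. CycAs instead learns pedestrian embeddings from raw videos through a self-supervised pretext task inspired by data association in multi-object tracking: after associating instances between a pair of frames forward and then backward, each pedestrian should be associated with itself. The discrete association process is adapted to a differentiable form so that training is end-to-end. The method is compared with existing unsupervised re-ID methods on seven benchmarks, and training on self-collected videos gives promising performance on standard test sets.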
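The cycle idea can be illustrated with a small sketch. The abstract does not give the exact differentiable form, so the code below assumes one common stand-in: a row-wise softmax over dot-product similarities with a temperature, applied forward and backward, with the product of the two soft assignment matrices expected to be close to the identity. The names `soft_assignment`, `cycle_matrix`, and `cycle_loss`, the temperature value, and the loss definition are all hypothetical.

```python
import math


def softmax(row, temperature):
    """Numerically stable softmax of a list of scores."""
    top = max(row)
    exps = [math.exp((value - top) / temperature) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_assignment(a, b, temperature=0.1):
    """Row-stochastic matrix softly matching each embedding in a to those in b."""
    return [
        softmax([sum(x * y for x, y in zip(u, v)) for v in b], temperature)
        for u in a
    ]


def cycle_matrix(frame1, frame2, temperature=0.1):
    """Forward association followed by backward association (frame1 -> frame2 -> frame1)."""
    forward = soft_assignment(frame1, frame2, temperature)
    backward = soft_assignment(frame2, frame1, temperature)
    n, m = len(frame1), len(frame2)
    return [
        [sum(forward[i][k] * backward[k][j] for k in range(m)) for j in range(n)]
        for i in range(n)
    ]


def cycle_loss(frame1, frame2, temperature=0.1):
    """Mean shortfall of the cycle matrix diagonal from 1 (an illustrative loss)."""
    cycle = cycle_matrix(frame1, frame2, temperature)
    return sum(1.0 - cycle[i][i] for i in range(len(frame1))) / len(frame1)
```

When the embeddings separate the instances well, each instance returns to itself and the loss approaches zero; when the embeddings cannot tell instances apart, the cycle spreads mass evenly and the loss stays high. Every operation here is smooth, which is what makes gradient-based training possible in a framework with automatic differentiation.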
24. P2D: a self-supervised method for depth estimation from polarimetry [PDF] Back to contents
Marc Blanchon, Désiré Sidibé, Olivier Morel, Ralph Seulin, Daniel Braun, Fabrice Meriaudeau
Abstract: Monocular depth estimation is a recurring subject in the field of computer vision. Its ability to describe scenes via a depth map while reducing the constraints related to the formulation of perspective geometry tends to favor its use. However, despite the constant improvement of algorithms, most methods exploit only colorimetric information. Consequently, robustness to events to which the modality is not sensitive to, like specularity or transparency, is neglected. In response to this phenomenon, we propose using polarimetry as an input for a self-supervised monodepth network. Therefore, we propose exploiting polarization cues to encourage accurate reconstruction of scenes. Furthermore, we include a term of polarimetric regularization to state-of-the-art method to take specific advantage of the data. Our method is evaluated both qualitatively and quantitatively demonstrating that the contribution of this new information as well as an enhanced loss function improves depth estimation results, especially for specular areas.
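Summary: Most monocular depth estimation methods exploit only colorimetric information, so they are not robust to phenomena that this modality cannot sense, such as specularity or transparency. P2D uses polarimetry as the input to a self-supervised monodepth network, exploits polarization cues to encourage accurate scene reconstruction, and adds a polarimetric regularization term to a state-of-the-art method. Qualitative and quantitative evaluation shows that the new information and the enhanced loss function improve depth estimation, especially in specular areas.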
25. Learning Part Boundaries from 3D Point Clouds [PDF] Back to contents
Marios Loizou, Melinos Averkiou, Evangelos Kalogerakis
Abstract: We present a method that detects boundaries of parts in 3D shapes represented as point clouds. Our method is based on a graph convolutional network architecture that outputs a probability for a point to lie in an area that separates two or more parts in a 3D shape. Our boundary detector is quite generic: it can be trained to localize boundaries of semantic parts or geometric primitives commonly used in 3D modeling. Our experiments demonstrate that our method can extract more accurate boundaries that are closer to ground-truth ones compared to alternatives. We also demonstrate an application of our network to fine-grained semantic shape segmentation, where we also show improvements in terms of part labeling performance.
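Summary: The paper presents a method for detecting part boundaries in 3D shapes represented as point clouds. A graph convolutional network outputs, for each point, the probability that it lies in an area separating two or more parts of the shape. The detector is generic: it can be trained to localize boundaries of semantic parts or of geometric primitives commonly used in 3D modeling. Experiments show boundaries that are closer to the ground truth than those of alternatives, and an application to fine-grained semantic shape segmentation improves part labeling performance.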
26. Evaluation of Neural Network Classification Systems on Document Stream [PDF] Back to contents
Joris Voerman, Aurelie Joseph, Mickael Coustaty, Vincent Poulain d'Andecy, Jean-Marc Ogier
Abstract: One major drawback of state of the art Neural Networks (NN)-based approaches for document classification purposes is the large number of training samples required to obtain an efficient classification. The minimum required number is around one thousand annotated documents for each class. In many cases it is very difficult, if not impossible, to gather this number of samples in real industrial processes. In this paper, we analyse the efficiency of NN-based document classification systems in a sub-optimal training case, based on the situation of a company document stream. We evaluated three different approaches, one based on image content and two on textual content. The evaluation was divided into four parts: a reference case, to assess the performance of the system in the lab; two cases that each simulate a specific difficulty linked to document stream processing; and a realistic case that combined all of these difficulties. The realistic case highlighted the fact that there is a significant drop in the efficiency of NN-Based document classification systems. Although they remain efficient for well represented classes (with an over-fitting of the system for those classes), it is impossible for them to handle appropriately less well represented classes. NN-Based document classification systems need to be adapted to resolve these two problems before they can be considered for use in a company document stream.
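Summary: Neural network (NN) based document classifiers need large training sets, around one thousand annotated documents per class, which is often very difficult to gather in real industrial processes. The paper analyses the efficiency of such systems under sub-optimal training conditions modeled on a company document stream. Three approaches are evaluated, one based on image content and two on textual content, in four cases: a lab reference case, two cases that each simulate a specific difficulty of document stream processing, and a realistic case combining all of them. In the realistic case efficiency drops significantly: the systems stay efficient on well represented classes, with over-fitting to those classes, but cannot appropriately handle less well represented ones. Both problems need to be resolved before such systems can be considered for a company document stream.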
27. RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition [PDF] Back to contents
Xiaoyu Yue, Zhanghui Kuang, Chenhao Lin, Hongbin Sun, Wayne Zhang
Abstract: The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. The existing approaches heavily relying on contextual information causes the problem of attention drift. To suppress the side-effect of the attention drift, we propose one novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to make the encoder output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed RobustScanner, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both context and contextless application scenarios.
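Summary: Attention-based encoder-decoder models achieve impressive results in scene text recognition but perform poorly on contextless text such as random character sequences. The authors find empirically that a representative character-level sequence decoder uses both context and positional information, and that heavy reliance on context causes attention drift. RobustScanner adds a position enhancement branch, consisting of a position-aware module and an attention module that estimates glimpses from the positional clue (the current decoding time step) only, and dynamically fuses its output with that of the decoder attention module through an element-wise gate. It reports new state-of-the-art results on popular regular and irregular text recognition benchmarks without much performance drop on contextless benchmarks.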
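An element-wise gate can be sketched in a few lines. The abstract does not specify the gate's exact form, so this is a generic version under stated assumptions rather than the RobustScanner layer itself: a sigmoid gate computed from the concatenated context and positional feature vectors, used to blend the two vectors per dimension. The function `gated_fusion` and its parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(context_feat, position_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a learned per-dimension gate.

    gate_weights has one row per output dimension; each row spans the
    concatenation of context_feat and position_feat.
    """
    joint = list(context_feat) + list(position_feat)
    fused = []
    for i, (ctx, pos) in enumerate(zip(context_feat, position_feat)):
        score = sum(w * v for w, v in zip(gate_weights[i], joint)) + gate_bias[i]
        gate = sigmoid(score)
        # gate near 1 trusts the context feature, gate near 0 the positional one.
        fused.append(gate * ctx + (1.0 - gate) * pos)
    return fused
```

Because the gate is computed per dimension and per decoding step, the mix between context and positional clues can change from character to character, which matches the "dynamic ratio" described in the abstract.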
28. Learning to Parse Wireframes in Images of Man-Made Environments [PDF] Back to contents
Kun Huang, Yifan Wang, Zihan Zhou, Tianjiao Ding, Shenghua Gao, Yi Ma
Abstract: In this paper, we propose a learning-based approach to the task of automatically extracting a "wireframe" representation for images of cluttered man-made environments. The wireframe (see Fig. 1) contains all salient straight lines and their junctions of the scene that encode efficiently and accurately large-scale geometry and object shapes. To this end, we have built a very large new dataset of over 5,000 images with wireframes thoroughly labelled by humans. We have proposed two convolutional neural networks that are suitable for extracting junctions and lines with large spatial support, respectively. The networks trained on our dataset have achieved significantly better performance than state-of-the-art methods for junction detection and line segment detection, respectively. We have conducted extensive experiments to evaluate quantitatively and qualitatively the wireframes obtained by our method, and have convincingly shown that effectively and efficiently parsing wireframes for images of man-made environments is a feasible goal within reach. Such wireframes could benefit many important visual tasks such as feature correspondence, 3D reconstruction, vision-based mapping, localization, and navigation. The data and source code are available at this https URL.
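Summary: The paper proposes a learning-based approach to automatically extracting a "wireframe" representation, the salient straight lines of a scene and their junctions, from images of cluttered man-made environments. The authors built a new dataset of over 5,000 images with wireframes thoroughly labelled by humans, and propose two convolutional neural networks suited to extracting junctions and lines with large spatial support, respectively. Networks trained on this dataset significantly outperform state-of-the-art methods for junction detection and line segment detection. Such wireframes could benefit tasks such as feature correspondence, 3D reconstruction, vision-based mapping, localization, and navigation. The data and source code are available at this https URL.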
29. Learning with Privileged Information for Efficient Image Super-Resolution [PDF] Back to contents
Wonkyung Lee, Junghyup Lee, Dohyung Kim, Bumsub Ham
Abstract: Convolutional neural networks (CNNs) have allowed remarkable advances in single image super-resolution (SISR) over the last decade. Most SR methods based on CNNs have focused on achieving performance gains in terms of quality metrics, such as PSNR and SSIM, over classical approaches. They typically require a large amount of memory and computational units. FSRCNN, consisting of few numbers of convolutional layers, has shown promising results, while using an extremely small number of network parameters. We introduce in this paper a novel distillation framework, consisting of teacher and student networks, that allows to boost the performance of FSRCNN drastically. To this end, we propose to use ground-truth high-resolution (HR) images as privileged information. The encoder in the teacher learns the degradation process, subsampling of HR images, using an imitation loss. The student and the decoder in the teacher, having the same network architecture as FSRCNN, try to reconstruct HR images. Intermediate features in the decoder, affordable for the student to learn, are transferred to the student through feature distillation. Experimental results on standard benchmarks demonstrate the effectiveness and the generalization ability of our framework, which significantly boosts the performance of FSRCNN as well as other SR methods. Our code and model are available online: this https URL.
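Summary: CNN-based single image super-resolution (SISR) methods have focused on gains in quality metrics such as PSNR and SSIM and typically require a large amount of memory and computation, whereas FSRCNN shows promising results with few convolutional layers and an extremely small number of parameters. The paper introduces a teacher-student distillation framework that drastically boosts FSRCNN by using ground-truth high-resolution (HR) images as privileged information. The teacher's encoder learns the degradation process (subsampling of HR images) with an imitation loss; the student and the teacher's decoder, which share the FSRCNN architecture, reconstruct HR images; and intermediate decoder features are transferred to the student through feature distillation. Results on standard benchmarks show that the framework significantly improves FSRCNN as well as other SR methods. Code and model are available online: this https URL.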
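Since the entry above evaluates super-resolution quality with PSNR, a short reference sketch of that metric may help. It uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), on flat lists of pixel values. The function name `psnr` and the default peak value of 255 (8-bit images) are assumptions made for illustration; papers often compute the metric on a specific color channel or after cropping borders, which this sketch does not do.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        # Identical images: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the reconstruction is closer to the ground-truth HR image. Because the scale is logarithmic, a gain of a few tenths of a dB is usually reported as a meaningful improvement in this literature.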
30. Learning Visual Context by Comparison [PDF]
Minchul Kim, Jongchan Park, Seil Na, Chang Min Park, Donggeun Yoo
Abstract: Finding diseases from an X-ray image is an important yet highly challenging task. Current methods for solving this task exploit various characteristics of the chest X-ray image, but one of the most important characteristics is still missing: the necessity of comparison between related regions in an image. In this paper, we present the Attend-and-Compare Module (ACM) for capturing the difference between an object of interest and its corresponding context. We show that explicit difference modeling can be very helpful in tasks that require direct comparison between distant locations. This module can be plugged into existing deep learning models. For evaluation, we apply our module to three chest X-ray recognition tasks and to COCO object detection and segmentation tasks, and observe consistent improvements across tasks. The code is available at this https URL.
31. Decoding CNN based Object Classifier Using Visualization [PDF]
Abhishek Mukhopadhyay, Imon Mukherjee, Pradipta Biswas
Abstract: This paper investigates how the workings of a Convolutional Neural Network (CNN) can be explained through visualization in the context of machine perception for autonomous vehicles. We visualize what types of features are extracted in different convolution layers of a CNN, which helps to understand how the CNN gradually increases spatial information in every layer. Thus, it concentrates on regions of interest in every transformation. Visualizing the heat map of activations helps us understand how a CNN classifies and localizes different objects in an image. This study also helps us reason about the low accuracy of a model, which helps to increase trust in the object detection module.
32. Explaining Deep Neural Networks using Unsupervised Clustering [PDF]
Sercan O. Arik, Yu-han Liu
Abstract: We propose a novel method to explain trained deep neural networks (DNNs) by distilling them into surrogate models using unsupervised clustering. Our method can be applied flexibly to any subset of layers of a DNN architecture and can incorporate low-level and high-level information. On image datasets, given pre-trained DNNs, we demonstrate the strength of our method in finding similar training samples and in shedding light on the concepts the DNNs base their decisions on. Via user studies, we show that our model can improve user trust in the model's predictions.
33. A cellular automata approach to local patterns for texture recognition [PDF]
Joao Florindo, Konradin Metze
Abstract: Texture recognition is one of the most important tasks in computer vision and, despite the recent success of learning-based approaches, there is still a need for model-based solutions. This is especially the case when the amount of data available for training is not sufficiently large, a common situation in several applied areas, or when computational resources are limited. In this context, we propose a method for texture descriptors that combines the power of cellular automata to represent complex objects with the known effectiveness of local descriptors in texture analysis. The method formulates a new transition function for the automaton inspired by local binary descriptors. It counterbalances the new state of each cell with the previous state, thereby introducing an idea of "controlled deterministic chaos". The descriptors are obtained from the distribution of cell states. The proposed descriptors are applied to the classification of texture images both on benchmark data sets and on a real-world problem, namely identifying plant species based on the texture of their leaf surfaces. Our proposal outperforms other classical and state-of-the-art approaches, especially in the real-world problem, thus revealing its potential to be applied in numerous practical tasks involving texture recognition at some stage.
34. Reorganizing local image features with chaotic maps: an application to texture recognition [PDF]
Joao Florindo
Abstract: Despite the recent success of convolutional neural networks in texture recognition, model-based descriptors are still competitive, especially when we do not have access to large amounts of annotated data for training and the interpretation of the model is an important issue. Among the model-based approaches, fractal geometry has been one of the most popular, especially in biological applications. Nevertheless, fractals are part of a much broader family of models, the non-linear operators studied in chaos theory. In this context, we propose here a chaos-based local descriptor for texture recognition. More specifically, we map the image into three-dimensional Euclidean space, iterate a chaotic map over this three-dimensional structure and convert it back to the original image. From the chaos-transformed image at each iteration we collect local descriptors (here we use local binary patterns), and those descriptors compose the feature representation of the texture. The performance of our method was verified on the classification of benchmark databases and in the identification of Brazilian plant species based on the texture of the leaf surface. The achieved results confirmed our expectation of a competitive performance, even when compared with some learning-based modern approaches in the literature.
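The abstract names local binary patterns (LBP) as the local descriptor collected at each iteration. As a rough illustration only, the sketch below computes the classic 3x3 LBP code and a normalised code histogram in plain Python. It is an assumption-laden sketch, not the authors' pipeline: the chaotic map and the 3D embedding are not modelled, and the names `OFFSETS`, `lbp_code` and `lbp_histogram` are hypothetical.

```python
from collections import Counter

# Clockwise neighbour offsets starting at the top-left pixel.
# The starting point and direction are an arbitrary but fixed choice (assumption).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of interior pixel (r, c): one bit per neighbour >= centre."""
    centre = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Normalised histogram of LBP codes over all interior pixels."""
    h, w = len(img), len(img[0])
    counts = Counter(
        lbp_code(img, r, c) for r in range(1, h - 1) for c in range(1, w - 1)
    )
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


img = [[5, 9, 1],
       [4, 6, 7],
       [2, 3, 8]]
# Neighbours 9, 7 and 8 are >= the centre 6, setting bits 1, 3 and 4.
print(lbp_code(img, 1, 1))  # 26
```

In a descriptor of the kind the abstract describes, one such histogram would presumably be collected per transformed image and the histograms concatenated; that aggregation step is a guess and is left out here.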
35. Graph-Based Social Relation Reasoning [PDF]
Wanhua Li, Yueqi Duan, Jiwen Lu, Jianjiang Feng, Jie Zhou
Abstract: Human beings are fundamentally sociable: we generally organize our social lives in terms of relations with other people. Understanding social relations from an image has great potential for intelligent systems such as social chatbots and personal assistants. In this paper, we propose a simpler, faster, and more accurate method named graph relational reasoning network (GR2N) for social relation recognition. Unlike existing methods, which process all social relations in an image independently, our method jointly infers the relations by constructing a social relation graph. Furthermore, the proposed GR2N constructs several virtual relation graphs to explicitly capture the strong logical constraints among different types of social relations. Experimental results illustrate that our method generates a reasonable and consistent social relation graph and improves both accuracy and efficiency.
36. RGB-IR Cross-modality Person ReID based on Teacher-Student GAN Model [PDF]
Ziyue Zhang, Shuai Jiang, Congzhentao Huang, Yang Li, Richard Yi Da Xu
Abstract: RGB-Infrared (RGB-IR) person re-identification (ReID) is a technology where the system can automatically identify the same person appearing at different parts of a video when light is unavailable. The critical challenge of this task is the cross-modality gap of features under different modalities. To solve this challenge, we proposed a Teacher-Student GAN model (TS-GAN) to adopt different domains and guide the ReID backbone to learn better ReID information. (1) In order to get corresponding RGB-IR image pairs, the RGB-IR Generative Adversarial Network (GAN) was used to generate IR images. (2) To kick-start the training of identities, a ReID Teacher module was trained under IR modality person images, which is then used to guide its Student counterpart in training. (3) Likewise, to better adapt different domain features and enhance model ReID performance, three Teacher-Student loss functions were used. Unlike other GAN based models, the proposed model only needs the backbone module at the test stage, making it more efficient and resource-saving. To showcase our model's capability, we did extensive experiments on the newly-released SYSU-MM01 RGB-IR Re-ID benchmark and achieved superior performance to the state-of-the-art with 49.8% Rank-1 and 47.4% mAP.
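The abstract reports results as Rank-1 and mAP. For readers unfamiliar with these retrieval metrics, the sketch below shows one common way to compute them from ranked gallery lists. It is a simplified illustration under stated assumptions: the function names are hypothetical, and the SYSU-MM01 evaluation protocol (gallery construction, camera filtering, repeated trials) is not modelled.

```python
def average_precision(ranked_ids, query_id):
    """AP for one query: mean of precision@k over the ranks k where a true match appears."""
    hits = 0
    precisions = []
    for k, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / k)
    # Assumption: a query with no match in the gallery contributes an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def rank1_and_map(results):
    """results: list of (query_id, ranked gallery ids with the best match first)."""
    n = len(results)
    rank1 = sum(1 for q, ranked in results if ranked and ranked[0] == q) / n
    mean_ap = sum(average_precision(ranked, q) for q, ranked in results) / n
    return rank1, mean_ap


results = [
    ("a", ["a", "b", "a"]),  # matches at ranks 1 and 3: AP = (1/1 + 2/3) / 2
    ("b", ["a", "b", "c"]),  # match at rank 2: AP = 1/2
]
rank1, mean_ap = rank1_and_map(results)
print(rank1, mean_ap)
```

Rank-1 only asks whether the top-ranked gallery item is correct, while mAP rewards placing all true matches early, which is why the two numbers can differ noticeably for the same model.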
37. ContourRend: A Segmentation Method for Improving Contours by Rendering [PDF]
Junwen Chen, Yi Lu, Yaran Chen, Dongbin Zhao, Zhonghua Pang
Abstract: A good object segmentation should contain clear contours and complete regions. However, mask-based segmentation cannot handle contour features well on a coarse prediction grid, which causes blurry edges. Contour-based segmentation, in contrast, provides contours directly but misses contour details. In order to obtain fine contours, we propose a segmentation method named ContourRend, which adopts a contour renderer to refine segmentation contours. We implement our method on a segmentation model based on a graph convolutional network (GCN). For the single-object segmentation task on the Cityscapes dataset, the GCN-based segmentation contour is used to generate a contour of a single object; our contour renderer then focuses on the pixels around the contour and predicts the category at high resolution. By rendering the contour result, our method reaches 72.41% mean intersection over union (IoU) and surpasses the baseline Polygon-GCN by 1.22%.
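The reported metric, mean intersection over union (IoU), can be illustrated with a small sketch. The code below computes IoU for binary masks stored as nested lists and averages it over mask pairs. The function names are hypothetical, and the way the paper averages (per class or per instance) is not stated in the abstract, so a plain average over pairs is assumed here.

```python
def iou(pred, target):
    """Intersection over union of two same-sized binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(pairs):
    """Plain average of IoU over a list of (pred, target) mask pairs."""
    scores = [iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


pred = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
# 2 pixels overlap out of 6 covered by either mask, so IoU = 1/3.
print(iou(pred, target))
```

Because IoU divides by the union, small errors along an object boundary cost more for small objects than for large ones, which is one reason contour refinement can move this metric.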
38. COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder [PDF]
Kuniaki Saito, Kate Saenko, Ming-Yu Liu
Abstract: Unsupervised image-to-image translation intends to learn a mapping of an image in a given domain to an analogous image in a different domain, without explicit supervision of the mapping. Few-shot unsupervised image-to-image translation further attempts to generalize the model to an unseen domain by leveraging example images of the unseen domain provided at inference time. While remarkably successful, existing few-shot image-to-image translation models find it difficult to preserve the structure of the input image while emulating the appearance of the unseen domain, which we refer to as the content loss problem. This is particularly severe when the poses of the objects in the input and example images are very different. To address the issue, we propose a new few-shot image translation model, COCO-FUNIT, which computes the style embedding of the example images conditioned on the input image and a new module called the constant style bias. Through extensive experimental validations with comparison to the state-of-the-art, our model shows effectiveness in addressing the content loss problem. Code and pretrained models are available at this https URL.
39. Comparing to Learn: Surpassing ImageNet Pretraining on Radiographs By Comparing Image Representations [PDF]
Hong-Yu Zhou, Shuang Yu, Cheng Bian, Yifan Hu, Kai Ma, Yefeng Zheng
Abstract: In the deep learning era, pretrained models play an important role in medical image analysis, where ImageNet pretraining has been widely adopted as the best approach. However, there is an obvious domain gap between natural images and medical images. To bridge this gap, we propose a new pretraining method which learns from 700k radiographs without manual annotations. We call our method Comparing to Learn (C2L) because it learns robust features by comparing different image representations. To verify the effectiveness of C2L, we conduct comprehensive ablation studies and evaluate it on different tasks and datasets. The experimental results on radiographs show that C2L can significantly outperform ImageNet pretraining and previous state-of-the-art approaches. Code and models are available.
40. Automatic Image Labelling at Pixel Level [PDF]
Xiang Zhang, Wei Zhang, Jinye Peng, Janping Fan
Abstract: The performance of deep networks for semantic image segmentation largely depends on the availability of large-scale training images which are labelled at the pixel level. Typically, such pixel-level image labellings are obtained manually by a labour-intensive process. To alleviate the burden of manual image labelling, we propose an interesting learning approach to generate pixel-level image labellings automatically. A Guided Filter Network (GFN) is first developed to learn the segmentation knowledge from a source domain, and such GFN then transfers such segmentation knowledge to generate coarse object masks in the target domain. Such coarse object masks are treated as pseudo labels and they are further integrated to optimize/refine the GFN iteratively in the target domain. Our experiments on six image sets have demonstrated that our proposed approach can generate fine-grained object masks (i.e., pixel-level object labellings), whose quality is very comparable to the manually-labelled ones. Our proposed approach can also achieve better performance on semantic image segmentation than most existing weakly-supervised approaches.
41. Automatic extraction of road intersection points from USGS historical map series using deep convolutional neural networks [PDF]
Mahmoud Saeedimoghaddam, T. F. Stepinski
Abstract: Road intersection data have been used across different geospatial applications and analyses. The road network datasets dating from pre-GIS years are only available in the form of historical printed maps. Before they can be analyzed by GIS software, they need to be scanned and transformed into a usable vector-based format. Due to the great bulk of scanned historical maps, automated methods of transforming them into digital datasets need to be employed. Frequently, this process is based on computer vision algorithms. However, low conversion accuracy for low-quality and visually complex maps and the setting of optimal parameters are the two challenges of using those algorithms. In this paper, we employed a standard deep convolutional neural network paradigm for object detection, the region-based CNN (RCNN), to automatically identify road intersections in scanned historical USGS maps of several U.S. cities. We found that the algorithm showed higher conversion accuracy for double-line cartographic representations of the road maps than for single-line ones. Also, compared to the majority of traditional computer vision algorithms, RCNN provides more accurate extraction. Finally, the results show that the number of errors in the detection outputs is sensitive to the complexity and blurriness of the maps as well as the number of distinct RGB combinations within them.
摘要:道路交叉口数据已经在不同的地理空间应用程序和分析使用。道路网络数据集从GIS前几年约会只能在历史印制地图的形式提供。之前,他们可以通过GIS软件进行分析,需要对它们进行扫描并转化为可用的基于矢量的格式。由于大批量扫描的历史地图,需要采用它们转化成数字数据集的自动方法。通常情况下,这个过程是基于计算机视觉算法。然而,对于低质量和在视觉上复杂的地图和设置最佳参数低转换精度是使用这些算法的两个挑战。在本文中,我们采用了使用深卷积神经网络命名为基于区域的CNN用于自动识别扫描的历史USGS道路交叉口物体检测任务的标准范式几个美国城市的地图。我们已经发现,该算法具有较高的转换精度的路线图比单行者的双线制图表达。此外,相比于传统的大多数计算机视觉算法RCNN提供更准确的提取。最后,结果显示,在检测输出的误差的量,以复杂性和模糊性敏感的映射以及在其中不同的RGB组合的数目。
42. Real-Time Drone Detection and Tracking With Visible, Thermal and Acoustic Sensors [PDF] Back to contents
Fredrik Svanstrom, Cristofer Englund, Fernando Alonso-Fernandez
Abstract: This paper explores the process of designing an automatic multi-sensor drone detection system. Besides the common video and audio sensors, the system also includes a thermal infrared camera, which is shown to be a feasible solution to the drone detection task. Even with slightly lower resolution, the performance is just as good as a camera in visible range. The detector performance as a function of the sensor-to-target distance is also investigated. In addition, using sensor fusion, the system is made more robust than the individual sensors, helping to reduce false detections. To counteract the lack of public datasets, a novel video dataset containing 650 annotated infrared and visible videos of drones, birds, airplanes and helicopters is also presented. The database is complemented with an audio dataset of the classes drones, helicopters and background noise.
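Summary: The paper describes the design of an automatic multi-sensor drone detection system that combines video, audio, and thermal infrared sensing. The thermal camera performs as well as a visible-range camera despite its slightly lower resolution, and sensor fusion makes the system more robust than any individual sensor, reducing false detections. A new dataset of 650 annotated infrared and visible videos (drones, birds, airplanes, helicopters) is presented, together with an audio dataset (drones, helicopters, background noise).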
43. Tackling the Problem of Limited Data and Annotations in Semantic Segmentation [PDF] Back to contents
Ahmadreza Jeddi
Abstract: This work studies semantic segmentation on a small image dataset (simulated by 1000 randomly selected images from PASCAL VOC 2012) where only weak supervision signals (scribbles from user interaction) are available. In particular, to tackle the problem of limited data annotations in image segmentation, transfer of different pre-trained models and CRF-based methods are applied to enhance segmentation performance. To this end, RotNet, DeeperCluster, and Semi&Weakly Supervised Learning (SWSL) pre-trained models are transferred and fine-tuned in a DeepLab-v2 baseline, and dense CRF is applied both as a post-processing and as a loss regularization technique. The results of my study show that, on this small dataset, using a pre-trained ResNet50 SWSL model gives results that are 7.4% better than applying an ImageNet pre-trained model; moreover, when training on the full PASCAL VOC 2012 training data, this pre-training approach increases the mIoU by almost 4%. Dense CRF is also shown to be very effective, enhancing the results both as a loss regularization technique in weakly supervised training and as a post-processing tool.
Summary: The study examines semantic segmentation on a small dataset (1000 randomly selected PASCAL VOC 2012 images) with only scribble supervision. RotNet, DeeperCluster, and SWSL pre-trained models are fine-tuned in a DeepLab-v2 baseline, with dense CRF used for post-processing and loss regularization. The ResNet50 SWSL model gives results 7.4% better than ImageNet pre-training on the small dataset and raises mIoU by almost 4% on the full training data; dense CRF helps in both roles.
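The abstract reports its gains in mIoU (mean intersection-over-union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the flat label lists are illustrative assumptions, not taken from the paper, and PASCAL VOC evaluation details such as ignore labels are omitted.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat, equal-length sequences of integer class labels
    (hypothetical input format chosen for this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is one half.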
44. TinyVIRAT: Low-resolution Video Action Recognition [PDF] Back to contents
Ugur Demir, Yogesh S Rawat, Mubarak Shah
Abstract: The existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions. Most activities occur at a distance with a small resolution, and recognizing such activities is a challenging problem. In this work, we focus on recognizing tiny actions in videos. We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities. The actions in TinyVIRAT videos have multiple labels and are extracted from surveillance videos, which makes them realistic and more challenging. We propose a novel method for recognizing tiny actions in videos which utilizes a progressive generative approach to improve the quality of low-resolution actions. The proposed method also includes a weakly trained attention mechanism which helps in focusing on the activity regions in the video. We perform extensive experiments to benchmark the proposed TinyVIRAT dataset and observe that the proposed method significantly improves the action recognition performance over baselines. We also evaluate the proposed approach on synthetically resized action recognition datasets and achieve state-of-the-art results when compared with existing methods. The dataset and code are publicly available at this https URL.
Summary: TinyVIRAT is a benchmark dataset of naturally low-resolution, multi-label actions extracted from surveillance videos. The proposed recognition method uses a progressive generative approach to improve the quality of low-resolution actions and a weakly trained attention mechanism to focus on activity regions. It significantly improves over baselines on TinyVIRAT and achieves state-of-the-art results on synthetically resized action recognition datasets.
45. A Generalization of Otsu's Method and Minimum Error Thresholding [PDF] Back to contents
Jonathan T. Barron
Abstract: We present Generalized Histogram Thresholding (GHT), a simple, fast, and effective technique for histogram-based image thresholding. GHT works by performing approximate maximum a posteriori estimation of a mixture of Gaussians with appropriate priors. We demonstrate that GHT subsumes three classic thresholding techniques as special cases: Otsu's method, Minimum Error Thresholding (MET), and weighted percentile thresholding. GHT thereby enables the continuous interpolation between those three algorithms, which allows thresholding accuracy to be improved significantly. GHT also provides a clarifying interpretation of the common practice of coarsening a histogram's bin width during thresholding. We show that GHT outperforms or matches the performance of all algorithms on a recent challenge for handwritten document image binarization (including deep neural networks trained to produce per-pixel binarizations), and can be implemented in a dozen lines of code or as a trivial modification to Otsu's method or MET.
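Summary: Generalized Histogram Thresholding (GHT) is a simple, fast technique for histogram-based image thresholding that performs approximate maximum a posteriori estimation of a mixture of Gaussians with appropriate priors. It subsumes Otsu's method, Minimum Error Thresholding (MET), and weighted percentile thresholding as special cases and allows continuous interpolation between them. GHT matches or outperforms all algorithms on a recent handwritten document binarization challenge, including deep neural networks, and can be implemented in about a dozen lines of code.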
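For readers unfamiliar with the baseline that GHT generalizes, the sketch below implements classic Otsu thresholding on a histogram in plain Python. It is not GHT itself: the abstract does not give GHT's formulas, so only the well-known special case is shown. The function name `otsu_threshold` and the toy histogram are illustrative assumptions.

```python
def otsu_threshold(hist):
    """Return the bin index t that maximizes the between-class variance.

    Bins 0..t form one class and bins t+1.. form the other.
    hist is a sequence of non-negative bin counts.
    """
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0 = 0        # number of samples in the lower class so far
    sum0 = 0.0    # sum of bin indices (weighted by count) in the lower class
    best_t, best_var = 0, -1.0
    for t, h in enumerate(hist):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the split is not meaningful
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy bimodal histogram: two dark bins and two bright bins.
hist = [5, 5, 0, 0, 0, 0, 5, 5]
t = otsu_threshold(hist)
```

Ties are resolved in favour of the lowest bin index, which is a choice of this sketch rather than part of Otsu's definition.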
46. COBE: Contextualized Object Embeddings from Narrated Instructional Video [PDF] Back to contents
Gedas Bertasius, Lorenzo Torresani
Abstract: Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often strongly indicative of how the object appears in the scene. Recognizing such contextual cues is useful not only to improve the accuracy of object detection or to determine the state of the object, but also to understand its functional properties and to infer ongoing or upcoming human-object interactions. A fully-supervised approach to recognizing object states and their contexts in the real-world is unfortunately marred by the long-tailed, open-ended distribution of the data, which would effectively require massive amounts of annotations to capture the appearance of objects in all their different forms. Instead of relying on manually-labeled data for this task, we propose a new framework for learning Contextualized OBject Embeddings (COBE) from automatically-transcribed narrations of instructional videos. We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration. This enables the learning of an object representation where concepts relate according to a semantic language metric. Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
Summary: Objects vary dramatically in appearance across states (a tomato may be sliced, fried, or liquid), and contextual cues often indicate how an object appears. Because fully supervised learning of object states would require massive annotation, the authors propose COBE, which trains a visual detector to predict a contextualized word embedding of the object and its narration, using automatically transcribed narrations of instructional videos. The detector learns rich contextual object information and is highly effective in few-shot and zero-shot settings.
47. Privacy Preserving Text Recognition with Gradient-Boosting for Federated Learning [PDF] Back to contents
Hanchi Ren, Jingjing Deng, Xianghua Xie
Abstract: Typical machine learning approaches require centralized data for model training, which may not be possible where restrictions on data sharing are in place due to, for instance, privacy protection. The recently proposed Federated Learning (FL) framework allows learning a shared model collaboratively without data being centralized or data sharing among data owners. However, we show in this paper that the generalization ability of the joint model is poor on Non-Independent and Non-Identically Distributed (Non-IID) data, particularly when the Federated Averaging (FedAvg) strategy is used in this collaborative learning framework, owing to the weight divergence phenomenon. We propose a novel boosting algorithm for FL to address this generalisation issue, as well as achieving much faster convergence in gradient based optimization. We demonstrate our Federated Boosting (FedBoost) method on privacy-preserved text recognition, which shows significant improvements in both performance and efficiency. The text images are based on publicly available datasets for fair comparison and we intend to make our implementation public to ensure reproducibility.
Summary: Federated Learning (FL) trains a shared model without centralizing data, but the joint model generalizes poorly on Non-IID data, particularly under Federated Averaging (FedAvg), because of weight divergence. The authors propose a boosting algorithm for FL, Federated Boosting (FedBoost), which addresses this issue and converges much faster. On privacy-preserving text recognition built from public datasets it shows significant improvements in performance and efficiency.
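The FedAvg strategy that the abstract uses as its point of comparison aggregates client models by a sample-count-weighted average. A minimal sketch of that aggregation step follows, in plain Python with parameters as flat lists. The name `fed_avg` and the toy numbers are illustrative assumptions, and this is the baseline aggregation only, not the FedBoost algorithm, whose details the abstract does not give.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local sample count.

    client_weights: list of equal-length parameter lists, one per client.
    client_sizes:   number of training samples held by each client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]


# Two hypothetical clients; the second holds three times as much data.
global_weights = fed_avg([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

With Non-IID data the client vectors can drift far apart between rounds, which is the weight divergence the abstract refers to; a plain average of divergent models is what the proposed boosting scheme aims to improve on.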
48. Explore and Explain: Self-supervised Navigation and Recounting [PDF] Back to contents
Roberto Bigazzi, Federico Landi, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara
Abstract: Embodied AI has been recently gaining attention as it aims to foster the development of autonomous and intelligent agents. In this paper, we devise a novel embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees during the path. In this context, the agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes. Our model integrates a novel self-supervised exploration module with penalty, and a fully-attentive captioning model for explanation. Also, we investigate different policies for selecting proper moments for explanation, driven by information coming from both the environment and the navigation. Experiments are conducted on photorealistic environments from the Matterport3D dataset and investigate the navigation and explanation capabilities of the agent as well as the role of their interactions.
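Summary: The paper introduces an embodied AI setting in which an agent explores a previously unknown environment while describing what it sees along the way. The model combines a self-supervised exploration module with penalty and a fully-attentive captioning model, and several policies for choosing when to produce a description are compared. Experiments on photorealistic Matterport3D environments evaluate navigation, explanation, and their interaction.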
49. Non-greedy Gradient-based Hyperparameter Optimization Over Long Horizons [PDF] Back to contents
Paul Micaelli, Amos Storkey
Abstract: Gradient-based hyperparameter optimization is an attractive way to perform meta-learning across a distribution of tasks, or improve the performance of an optimizer on a single task. However, this approach has been unpopular for tasks requiring long horizons (many gradient steps), due to memory scaling and gradient degradation issues. A common workaround is to learn hyperparameters online or split the horizon into smaller chunks. However, this introduces greediness which comes with a large performance drop, since the best local hyperparameters can make for poor global solutions. In this work, we enable non-greediness over long horizons with a two-fold solution. First, we share hyperparameters that are contiguous in time, and show that this drastically mitigates gradient degradation issues. Then, we derive a forward-mode differentiation algorithm for the popular momentum-based SGD optimizer, which allows for a memory cost that is constant with horizon size. When put together, these solutions allow us to learn hyperparameters without any prior knowledge. Compared to the baseline of hand-tuned off-the-shelf hyperparameters, our method compares favorably on simple datasets like SVHN. On CIFAR-10 we match the baseline performance, and demonstrate for the first time that learning rate, momentum and weight decay schedules can be learned with gradients on a dataset of this size. Code is available at this https URL
Summary: Gradient-based hyperparameter optimization is rarely used over long horizons because of memory scaling and gradient degradation, and the usual workarounds (online learning or splitting the horizon) introduce greediness that hurts performance. The authors share hyperparameters that are contiguous in time and derive a forward-mode differentiation algorithm for momentum-based SGD whose memory cost is constant in the horizon length. The method compares favorably with hand-tuned baselines on SVHN and matches them on CIFAR-10, learning the learning rate, momentum, and weight decay schedules with gradients.
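To illustrate why forward-mode differentiation keeps memory constant in the horizon, here is a toy sketch for a single scalar parameter and a single learning rate on a quadratic loss. Everything in it is an assumption made for illustration (the loss, the function name `train`, the constants); it is not the authors' algorithm, which handles full networks and shares hyperparameters across contiguous time steps.

```python
def train(alpha, mu=0.9, a=2.0, theta0=1.0, steps=20):
    """Run SGD with momentum on L(theta) = 0.5 * a * theta**2.

    Alongside theta and the velocity v, carry their derivatives with respect
    to the learning rate alpha: z = d(theta)/d(alpha), c = d(v)/d(alpha).
    Only these four scalars are stored, regardless of the number of steps.
    """
    theta, v = theta0, 0.0
    z, c = 0.0, 0.0
    for _ in range(steps):
        g = a * theta               # gradient of the quadratic loss
        v = mu * v + g              # momentum update
        c = mu * c + a * z          # derivative of the momentum update
        z = z - v - alpha * c       # derivative of theta - alpha * v
        theta = theta - alpha * v   # parameter update
    loss = 0.5 * a * theta ** 2
    hypergrad = a * theta * z       # chain rule: dL/d(alpha)
    return loss, hypergrad


alpha = 0.05
loss, hypergrad = train(alpha)

# Sanity check against a central finite difference of the final loss.
eps = 1e-6
fd = (train(alpha + eps)[0] - train(alpha - eps)[0]) / (2 * eps)
```

Reverse-mode differentiation through the same loop would need every intermediate `theta` and `v`, so its memory grows with `steps`; the forward-mode recursion above does not.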
50. Focus-and-Expand: Training Guidance Through Gradual Manipulation of Input Features [PDF] Back to contents
Moab Arar, Noa Fish, Dani Daniel, Evgeny Tenetov, Ariel Shamir, Amit Bermano
Abstract: We present a simple and intuitive Focus-and-eXpand (FAX) method to guide the training process of a neural network towards a specific solution. Optimizing a neural network is a highly non-convex problem. Typically, the space of solutions is large, with numerous possible local minima, where reaching a specific minimum depends on many factors. In many cases, however, a solution which considers specific aspects, or features, of the input is desired. For example, in the presence of bias, a solution that disregards the biased feature is a more robust and accurate one. Drawing inspiration from Parameter Continuation methods, we propose steering the training process to consider specific features in the input more than others, through gradual shifts in the input domain. FAX extracts a subset of features from each input data-point, and exposes the learner to these features first, Focusing the solution on them. Then, by using a blending/mixing parameter $\alpha$ it gradually eXpands the learning process to include all features of the input. This process encourages the consideration of the desired features more than others. Though not restricted to this field, we quantitatively evaluate the effectiveness of our approach on various Computer Vision tasks, and achieve state-of-the-art bias removal, improvements to an established augmentation method, and two examples of improvements to image classification tasks. Through these few examples we demonstrate the impact this approach potentially carries for a wide variety of problems, which stand to gain from understanding the solution landscape.
Summary: Focus-and-eXpand (FAX) steers neural network training toward solutions that rely on chosen input features. It first exposes the learner to a subset of features extracted from each input, then uses a blending parameter $\alpha$ to expand gradually to the full input. On computer vision tasks it achieves state-of-the-art bias removal, improves an established augmentation method, and improves two image classification tasks.
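The abstract describes the mechanism only at a high level: train on a focused view of each input first, then blend toward the full input with a parameter $\alpha$. One plausible reading is a linear interpolation with a schedule that moves $\alpha$ from 0 to 1, sketched below in plain Python. The direction of the blend, the linear schedule, and the names `blend` and `alpha_schedule` are all assumptions of this sketch, not details confirmed by the source.

```python
def blend(x_full, x_focus, alpha):
    """Linear blend of the full input and its focused view.

    alpha = 0 gives the focused view only; alpha = 1 gives the full input.
    (The direction of the blend is an assumption of this sketch.)
    """
    return [alpha * f + (1.0 - alpha) * s for f, s in zip(x_full, x_focus)]


def alpha_schedule(step, total_steps):
    """Hypothetical linear ramp from 0 to 1 over total_steps, then constant."""
    return min(1.0, step / total_steps)


x_full = [2.0, 4.0]    # all features of a toy input
x_focus = [0.0, 4.0]   # same input with the undesired feature suppressed
halfway = blend(x_full, x_focus, alpha_schedule(5, 10))
```

At the start of training the network sees only `x_focus`; by the end of the ramp it sees `x_full` unchanged.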
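The focus-then-expand step described in this abstract can be illustrated with a minimal sketch. The abstract only says that a blending/mixing parameter $\alpha$ gradually expands training from a feature subset to the full input; the linear blend, the linear schedule, and the helper names `focus`, `blend`, and `alpha_schedule` below are assumptions made for illustration, not the authors' implementation.

```python
def focus(x, keep):
    """Hypothetical feature extractor: zero out every feature whose index is not in `keep`."""
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def blend(x, keep, alpha):
    """Mix the focused view with the full input.

    alpha = 0 exposes only the focused features; alpha = 1 exposes the full input.
    The linear mix is an assumption; the abstract does not give the formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    focused = focus(x, keep)
    return [(1.0 - alpha) * f + alpha * v for f, v in zip(focused, x)]


def alpha_schedule(step, total_steps):
    """Assumed linear ramp of alpha from 0 to 1 over training."""
    return min(1.0, max(0.0, step / total_steps))


if __name__ == "__main__":
    x = [0.2, 0.8, 0.5, 0.1]   # toy input with four features
    keep = {0, 2}              # indices of the features to focus on first
    for step in (0, 5, 10):
        a = alpha_schedule(step, 10)
        print(step, a, blend(x, keep, a))
```

In a training loop, `blend` would be applied to each input before the forward pass, with `alpha` taken from the schedule at the current step.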
51. Relative Pose Estimation of Calibrated Cameras with Known $\mathrm{SE}(3)$ Invariants [PDF]
Bo Li, Evgeniy Martyushev, Gim Hee Lee
Abstract: The $\mathrm{SE}(3)$ invariants of a pose include its rotation angle and screw translation. In this paper, we present a comprehensive study of the relative pose estimation problem for a calibrated camera constrained by known $\mathrm{SE}(3)$ invariants, which involves 5 minimal problems in total. These problems reduce the minimal number of point pairs for relative pose estimation and improve the estimation efficiency and robustness. The $\mathrm{SE}(3)$ invariant constraints can come from extra sensor measurements or motion assumptions. Unlike conventional relative pose estimation with extra constraints, no extrinsic calibration is required to transform the constraints to the camera frame. This advantage comes from the invariance of $\mathrm{SE}(3)$ invariants across different coordinate systems on a rigid body and makes the solvers more convenient and flexible in practical applications. Besides proposing the concept of relative pose estimation constrained by $\mathrm{SE}(3)$ invariants, we present a comprehensive study of existing polynomial formulations for relative pose estimation and discover their relationship. Different formulations are carefully chosen for each proposed problem to achieve the best efficiency. Experiments on synthetic and real data show performance improvement compared to conventional relative pose estimation methods.
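The two invariants named in this abstract, and why no extrinsic calibration is needed, can be checked numerically. The sketch below is not the authors' solver: it only computes the rotation angle and the translation along the rotation axis (the usual screw-theory reading of "screw translation", assumed here) for a rigid motion $x \mapsto Rx + t$, and confirms that both values are unchanged when the motion is expressed in another frame on the same rigid body. All function names are hypothetical, and the axis formula assumes a rotation angle strictly between 0 and $\pi$.

```python
import math


def matmul(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(a, v):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def se3_invariants(R, t):
    """Return (rotation angle, translation along the rotation axis) of x -> R x + t.

    Assumes 0 < angle < pi so that the axis is well defined.
    """
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    s = 2.0 * math.sin(theta)
    # Rotation axis from the skew-symmetric part of R.
    axis = [(R[2][1] - R[1][2]) / s, (R[0][2] - R[2][0]) / s, (R[1][0] - R[0][1]) / s]
    screw_translation = sum(a * ti for a, ti in zip(axis, t))
    return theta, screw_translation


def change_frame(R, t, Q, s):
    """Express the motion T = (R, t) in a frame related by X = (Q, s), i.e. X T X^{-1}."""
    R2 = matmul(matmul(Q, R), transpose(Q))
    Qt = matvec(Q, t)
    R2s = matvec(R2, s)
    t2 = [Qt[i] + s[i] - R2s[i] for i in range(3)]
    return R2, t2


if __name__ == "__main__":
    R, t = rot_z(0.7), [1.0, 2.0, 3.0]                     # motion in the sensor frame
    Q, s = rot_x(0.4), [0.5, -1.0, 2.0]                    # arbitrary extrinsic transform
    print(se3_invariants(R, t))
    print(se3_invariants(*change_frame(R, t, Q, s)))
```

Both calls print the same pair up to rounding, which is the property that lets a constraint measured by another sensor on the same rigid body be used directly in the camera frame.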
52. SpaceNet: Make Free Space For Continual Learning [PDF]
Ghada Sokar, Decebal Constantin Mocanu, Mykola Pechenizkiy
Abstract: The continual learning (CL) paradigm aims to enable neural networks to learn tasks continually in a sequential fashion. The fundamental challenge in this learning paradigm is catastrophic forgetting of previously learned tasks when the model is optimized for a new task, especially when their data is not accessible. Current architectural-based methods aim at alleviating the catastrophic forgetting problem, but at the expense of expanding the capacity of the model. Regularization-based methods maintain a fixed model capacity; however, previous studies showed a huge performance degradation of these methods when the task identity is not available during inference (e.g. the class incremental learning scenario). In this work, we propose a novel architectural-based method, referred to as SpaceNet, for the class incremental learning scenario, where we utilize the available fixed capacity of the model intelligently. SpaceNet trains sparse deep neural networks from scratch in an adaptive way that compresses the sparse connections of each task into a compact number of neurons. The adaptive training of the sparse connections results in sparse representations that reduce the interference between the tasks. Experimental results show the robustness of our proposed method against catastrophic forgetting of old tasks and the efficiency of SpaceNet in utilizing the available capacity of the model, leaving space for more tasks to be learned. In particular, when SpaceNet is tested on the well-known benchmarks for CL (split MNIST, split Fashion-MNIST, and CIFAR-10/100), it outperforms regularization-based methods by a large margin. Moreover, it achieves better performance than architectural-based methods without model expansion and achieves results comparable to rehearsal-based methods, while offering a huge memory reduction.
53. Monocular Retinal Depth Estimation and Joint Optic Disc and Cup Segmentation using Adversarial Networks [PDF]
Sharath M Shankaranarayana, Keerthi Ram, Kaushik Mitra, Mohanasankar Sivaprakasam
Abstract: One of the important parameters for the assessment of glaucoma is optic nerve head (ONH) evaluation, which usually involves depth estimation and subsequent optic disc and cup boundary extraction. Depth is usually obtained explicitly from imaging modalities like optical coherence tomography (OCT), and it is very challenging to estimate depth from a single RGB image. To this end, we propose a novel method using an adversarial network to predict the depth map from a single image. The proposed depth estimation technique is trained and evaluated using individual retinal images from the INSPIRE-stereo dataset. We obtain a very high average correlation coefficient of 0.92 upon five-fold cross-validation, outperforming the state of the art. We then use the depth estimation process as a proxy task for joint optic disc and cup segmentation.
54. AdvFlow: Inconspicuous Black-box Adversarial Attacks using Normalizing Flows [PDF]
Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie
Abstract: Deep learning classifiers are susceptible to well-crafted, imperceptible variations of their inputs, known as adversarial attacks. In this regard, the study of powerful attack models sheds light on the sources of vulnerability in these classifiers, hopefully leading to more robust ones. In this paper, we introduce AdvFlow: a novel black-box adversarial attack method on image classifiers that exploits the power of normalizing flows to model the density of adversarial examples around a given target image. We see that the proposed method generates adversaries that closely follow the clean data distribution, a property which makes their detection less likely. Also, our experimental results show competitive performance of the proposed approach with some of the existing attack methods on defended classifiers, outperforming them in both the number of queries and attack success rate. The code is available at this https URL.
55. Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics [PDF]
Vinay V. Ramasesh, Ethan Dyer, Maithra Raghu
Abstract: A central challenge in developing versatile machine learning systems is catastrophic forgetting: a model trained on tasks in sequence will suffer significant performance drops on earlier tasks. Despite the ubiquity of catastrophic forgetting, there is limited understanding of the underlying process and its causes. In this paper, we address this important knowledge gap, investigating how forgetting affects representations in neural network models. Through representational analysis techniques, we find that deeper layers are disproportionately the source of forgetting. Supporting this, a study of methods to mitigate forgetting illustrates that they act to stabilize deeper layers. These insights enable the development of an analytic argument and empirical picture relating the degree of forgetting to representational similarity between tasks. Consistent with this picture, we observe that maximal forgetting occurs for task sequences with intermediate similarity. We perform empirical studies on the standard split CIFAR-10 setup and also introduce a novel CIFAR-100-based task approximating realistic input distribution shift.
56. Concept Learners for Generalizable Few-Shot Learning [PDF]
Kaidi Cao, Maria Brbic, Jure Leskovec
Abstract: Developing algorithms that are able to generalize to a novel task given only a few labeled examples represents a fundamental challenge in closing the gap between machine- and human-level performance. The core of human cognition lies in the structured, reusable concepts that help us to rapidly adapt to new tasks and provide reasoning behind our decisions. However, existing meta-learning methods learn complex representations across prior labeled tasks without imposing any structure on the learned representations. Here we propose COMET, a meta-learning method that improves generalization ability by learning to learn along human-interpretable concept dimensions. Instead of learning a joint unstructured metric space, COMET learns mappings of high-level concepts into semi-structured metric spaces, and effectively combines the outputs of independent concept learners. We evaluate our model on few-shot tasks from diverse domains, including a benchmark image classification dataset and a novel single-cell dataset from a biological domain developed in our work. COMET significantly outperforms strong meta-learning baselines, achieving $9$-$12\%$ average improvement on the most challenging $1$-shot learning tasks while, unlike existing methods, also providing interpretations behind the model's predictions.
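The idea of combining independent concept learners in a few-shot classifier can be sketched roughly as follows. This is an illustration under stated assumptions, not COMET itself: each "concept" is taken to be a fixed subset of feature indices, the learned per-concept embedding is replaced by the raw feature subset, each concept keeps its own class prototypes (support-set means), and a query is assigned to the class with the smallest sum of per-concept squared distances. The names `concept_view`, `mean_vector`, and `classify` are hypothetical.

```python
def concept_view(x, indices):
    """Restrict a feature vector to the dimensions belonging to one concept."""
    return [x[i] for i in indices]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the class prototype)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def classify(query, support, concepts):
    """Assign `query` to the class with the smallest summed per-concept distance.

    support:  {label: [feature vectors]} (the few labeled examples)
    concepts: list of index lists, one per concept (assumed to be given)
    Returns (predicted label, {label: total distance}).
    """
    scores = {}
    for label, examples in support.items():
        total = 0.0
        for indices in concepts:
            prototype = mean_vector([concept_view(e, indices) for e in examples])
            total += squared_distance(concept_view(query, indices), prototype)
        scores[label] = total
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    support = {
        "a": [[0.0, 0.0, 0.0, 0.0], [0.2, 0.0, 0.0, 0.2]],
        "b": [[1.0, 1.0, 1.0, 1.0], [0.8, 1.0, 1.0, 0.8]],
    }
    concepts = [[0, 1], [2, 3]]  # toy concepts: two disjoint feature groups
    print(classify([0.1, 0.2, 0.0, 0.1], support, concepts))
```

Because the total is a sum over concepts, the per-concept terms show which concept pulled a query toward a class, which is one simple way such a model can expose an interpretation of its prediction.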
57. Relaxed-Responsibility Hierarchical Discrete VAEs [PDF]
Matthew Willetts, Xenia Miscouridou, Stephen Roberts, Chris Holmes
Abstract: Successfully training Variational Autoencoders (VAEs) with a hierarchy of discrete latent variables remains an area of active research. Leveraging insights from classical methods of inference, we introduce $\textit{Relaxed-Responsibility Vector-Quantisation}$, a novel way to parameterise discrete latent variables and a refinement of relaxed Vector-Quantisation. This enables a novel approach to hierarchical discrete variational autoencoders with numerous layers of latent variables that we train end-to-end. Unlike discrete VAEs with a single layer of latent variables, we can produce realistic-looking samples by ancestral sampling: it is not essential to train a second generative model over the learnt latent representations to then sample from and then decode. Further, we observe that different layers of our model become associated with different aspects of the data.
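Note: the Chinese summaries on the original page were machine-translated. The cover image is a word cloud of the paper titles.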