Contents
11. Enhanced Self-Perception in Mixed Reality: Egocentric Arm Segmentation and Database with Automatic Labelling [PDF] Abstract
13. Tackling Two Challenges of 6D Object Pose Estimation: Lack of Real Annotated RGB Images and Scalability to Number of Objects [PDF] Abstract
17. Generalizable Semantic Segmentation via Model-agnostic Learning and Target-specific Normalization [PDF] Abstract
23. Towards Discriminability and Diversity: Batch Nuclear-norm Maximization under Label Insufficient Situations [PDF] Abstract
25. Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [PDF] Abstract
29. An improved 3D region detection network: automated detection of the 12th thoracic vertebra in image guided radiation therapy [PDF] Abstract
34. Augmenting Colonoscopy using Extended and Directional CycleGAN for Lossy Image Translation [PDF] Abstract
35. A Computer-Aided Diagnosis System Using Artificial Intelligence for Proximal Femoral Fractures Enables Residents to Achieve a Diagnostic Rate Equivalent to Orthopedic Surgeons -- multi-institutional joint development research [PDF] Abstract
38. A Comprehensive Review for Breast Histopathology Image Analysis Using Classical and Deep Neural Networks [PDF] Abstract
Abstracts
1. Probabilistic Regression for Visual Tracking [PDF] Back to contents
Martin Danelljan, Luc Van Gool, Radu Timofte
Abstract: Visual tracking is fundamentally the problem of regressing the state of the target in each video frame. While significant progress has been achieved, trackers are still prone to failures and inaccuracies. It is therefore crucial to represent the uncertainty in the target estimation. Although current prominent paradigms rely on estimating a state-dependent confidence score, this value lacks a clear probabilistic interpretation, complicating its use. In this work, we therefore propose a probabilistic regression formulation and apply it to tracking. Our network predicts the conditional probability density of the target state given an input image. Crucially, our formulation is capable of modeling label noise stemming from inaccurate annotations and ambiguities in the task. The regression network is trained by minimizing the Kullback-Leibler divergence. When applied for tracking, our formulation not only allows a probabilistic representation of the output, but also substantially improves the performance. Our tracker sets a new state-of-the-art on six datasets, achieving 59.8% AUC on LaSOT and 75.8% Success on TrackingNet. The code and models are available at this https URL.
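To make the objective concrete, here is a toy 1-D sketch of the training loss the abstract describes: the network scores a grid of candidate states, the annotation is softened into a label distribution to model label noise, and the two are compared with the Kullback-Leibler divergence. This is an illustrative reading of the abstract, not the authors' code; the discretized grid, the Gaussian label model, and `sigma` are assumptions.

```python
import numpy as np

def kl_regression_loss(scores, y_annot, grid, sigma=0.05):
    # Predicted density: softmax of the network scores over candidate states.
    log_p = scores - np.log(np.exp(scores).sum())
    # Label density: a Gaussian around the annotation, modeling label noise.
    label = np.exp(-0.5 * ((grid - y_annot) / sigma) ** 2)
    label /= label.sum()
    # KL(label || predicted), the quantity minimized during training.
    return np.sum(label * (np.log(label + 1e-12) - log_p))

grid = np.linspace(0.0, 1.0, 101)       # candidate target states
scores = -np.abs(grid - 0.4) * 10.0     # stand-in for network outputs
print(kl_regression_loss(scores, y_annot=0.42, grid=grid))
```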
2. DA-NAS: Data Adapted Pruning for Efficient Neural Architecture Search [PDF] Back to contents
Xiyang Dai, Dongdong Chen, Mengchen Liu, Yinpeng Chen, Lu Yuan
Abstract: Efficient search is a core issue in Neural Architecture Search (NAS). It is difficult for conventional NAS algorithms to directly search the architectures on large-scale tasks like ImageNet. In general, the cost of GPU hours for NAS grows with regard to training dataset size and candidate set size. One common way is searching on a smaller proxy dataset (e.g., CIFAR-10) and then transferring to the target task (e.g., ImageNet). These architectures optimized on proxy data are not guaranteed to be optimal on the target task. Another common way is learning with a smaller candidate set, which may require expert knowledge and indeed betrays the essence of NAS. In this paper, we present DA-NAS that can directly search the architecture for large-scale target tasks while allowing a large candidate set in a more efficient manner. Our method is based on an interesting observation that the learning speed for blocks in deep neural networks is related to the difficulty of recognizing distinct categories. We carefully design a progressive data adapted pruning strategy for efficient architecture search. It will quickly trim low performed blocks on a subset of target dataset (e.g., easy classes), and then gradually find the best blocks on the whole target dataset. At this time, the original candidate set becomes as compact as possible, providing a faster search in the target task. Experiments on ImageNet verify the effectiveness of our approach. It is 2x faster than previous methods while the accuracy is currently state-of-the-art, at 76.2% under small FLOPs constraint. It supports an augmented search space (i.e., more candidate blocks) to efficiently search the best-performing architecture.
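The pruning idea in the abstract can be sketched as a two-stage loop: score candidate blocks by how quickly they learn an easy subset of the target classes, keep only the fastest learners, and hand the compact candidate set to the regular search. Everything below is a hypothetical illustration (the callables `train_step` and `evaluate` and the learning-speed proxy are assumptions, not the paper's algorithm):

```python
def prune_candidates(candidate_blocks, easy_batches, train_step, evaluate,
                     keep_ratio=0.5):
    """Stage 1 of a progressive data-adapted search: rank blocks by a
    learning-speed proxy on easy classes, then trim the slow ones."""
    scores = {}
    for block in candidate_blocks:
        accs = [evaluate(train_step(block, batch)) for batch in easy_batches]
        scores[block] = accs[-1] - accs[0]  # accuracy gained during training
    ranked = sorted(candidate_blocks, key=scores.get, reverse=True)
    # Stage 2 (not shown): run the architecture search on the full target
    # dataset using only the surviving, compact candidate set.
    return ranked[: max(1, int(len(ranked) * keep_ratio))]
```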
3. Assessing Image Quality Issues for Real-World Problems [PDF] Back to contents
Tai-Yin Chiu, Yinan Zhao, Danna Gurari
Abstract: We introduce a new large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering. First, we identify for 39,181 images taken by people who are blind whether each is sufficient quality to recognize the content as well as what quality flaws are observed from six options. These labels serve as a critical foundation for us to make the following contributions: (1) a new problem and algorithms for deciding whether an image is insufficient quality to recognize the content and so not captionable, (2) a new problem and algorithms for deciding which of six quality flaws an image contains, (3) a new problem and algorithms for deciding whether a visual question is unanswerable due to unrecognizable content versus the content of interest being missing from the field of view, and (4) a novel application of more efficiently creating a large-scale image captioning dataset by automatically deciding whether an image is insufficient quality and so should not be captioned. We publicly-share our datasets and code to facilitate future extensions of this work: this https URL.
4. Hybrid Models for Open Set Recognition [PDF] Back to contents
Hongjie Zhang, Ang Li, Jie Guo, Yanwen Guo
Abstract: Open set recognition requires a classifier to detect samples not belonging to any of the classes in its training set. Existing methods fit a probability distribution to the training samples on their embedding space and detect outliers according to this distribution. The embedding space is often obtained from a discriminative classifier. However, such discriminative representation focuses only on known classes, which may not be critical for distinguishing the unknown classes. We argue that the representation space should be jointly learned from the inlier classifier and the density estimator (served as an outlier detector). We propose the OpenHybrid framework, which is composed of an encoder to encode the input data into a joint embedding space, a classifier to classify samples to inlier classes, and a flow-based density estimator to detect whether a sample belongs to the unknown category. A typical problem of existing flow-based models is that they may assign a higher likelihood to outliers. However, we empirically observe that such an issue does not occur in our experiments when learning a joint representation for discriminative and generative components. Experiments on standard open set benchmarks also reveal that an end-to-end trained OpenHybrid model significantly outperforms state-of-the-art methods and flow-based baselines.
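The decision rule implied by the abstract is easy to state: embed the input, score it with the flow-based density estimator, and only let the classifier answer when the sample looks like an inlier. A minimal sketch under assumed interfaces (`encoder`, `classifier`, and a `flow` exposing `log_prob`, as in common normalizing-flow libraries):

```python
import torch

@torch.no_grad()
def open_set_predict(x, encoder, classifier, flow, log_p_threshold):
    z = encoder(x)                       # joint embedding space
    log_px = flow.log_prob(z)            # flow-based density estimate
    pred = classifier(z).argmax(dim=-1)  # inlier class prediction
    pred[log_px < log_p_threshold] = -1  # -1 marks the unknown category
    return pred
```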
5. End-to-end Autonomous Driving Perception with Sequential Latent Representation Learning [PDF] Back to contents
Jianyu Chen, Zhuo Xu, Masayoshi Tomizuka
Abstract: Current autonomous driving systems are composed of a perception system and a decision system. Both of them are divided into multiple subsystems built up with lots of human heuristics. An end-to-end approach might clean up the system and avoid huge efforts of human engineering, as well as obtain better performance with increasing data and computation resources. Compared to the decision system, the perception system is more suitable to be designed in an end-to-end framework, since it does not require online driving exploration. In this paper, we propose a novel end-to-end approach for autonomous driving perception. A latent space is introduced to capture all relevant features useful for perception, which is learned through sequential latent representation learning. The learned end-to-end perception model is able to solve the detection, tracking, localization and mapping problems altogether with only minimum human engineering efforts and without storing any maps online. The proposed method is evaluated in a realistic urban driving simulator, with both camera image and lidar point cloud as sensor inputs. The codes and videos of this work are available at our github repo and project website.
6. TextCaps: a Dataset for Image Captioning with Reading Comprehension [PDF] Back to contents
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
Abstract: Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.
7. Weakly-Supervised Action Localization by Generative Attention Modeling [PDF] Back to contents
Baifeng Shi, Qi Dai, Yadong Mu, Jingdong Wang
Abstract: Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available. The general framework largely relies on the classification activation, which employs an attention model to identify the action-related frames and then categorizes them into different classes. Such method results in the action-context confusion issue: context frames near action clips tend to be recognized as action frames themselves, since they are closely related to the specific classes. To solve the problem, in this paper we propose to model the class-agnostic frame-wise probability conditioned on the frame attention using conditional Variational Auto-Encoder (VAE). With the observation that the context exhibits notable difference from the action at representation level, a probabilistic model, i.e., conditional VAE, is learned to model the likelihood of each frame given the attention. By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated. Experiments on THUMOS14 and ActivityNet1.2 demonstrate advantage of our method and effectiveness in handling action-context confusion problem. Code is now available on GitHub.
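The core component is a conditional VAE that models the likelihood of a frame feature given its attention value; maximizing this conditional likelihood with respect to the attention is what separates action from context frames. Below is a compact, generic conditional-VAE loss in that spirit (layer sizes and the Gaussian reconstruction term are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ConditionalVAE(nn.Module):
    def __init__(self, feat_dim=1024, z_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim + 1, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, z_dim), nn.Linear(512, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + 1, 512), nn.ReLU(),
                                 nn.Linear(512, feat_dim))

    def neg_elbo(self, x, att):
        """Negative ELBO of p(x | att); x: (B, feat_dim) frame features,
        att: (B, 1) per-frame attention values."""
        h = self.enc(torch.cat([x, att], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(torch.cat([z, att], dim=-1))
        rec = ((recon - x) ** 2).sum(-1)                      # Gaussian recon
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)
        return (rec + kl).mean()
```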
8. Learning Implicit Surface Light Fields [PDF] Back to contents
Michael Oechsle, Michael Niemeyer, Lars Mescheder, Thilo Strauss, Andreas Geiger
Abstract: Implicit representations of 3D objects have recently achieved impressive results on learning-based 3D reconstruction tasks. While existing works use simple texture models to represent object appearance, photo-realistic image synthesis requires reasoning about the complex interplay of light, geometry and surface properties. In this work, we propose a novel implicit representation for capturing the visual appearance of an object in terms of its surface light field. In contrast to existing representations, our implicit model represents surface light fields in a continuous fashion and independent of the geometry. Moreover, we condition the surface light field with respect to the location and color of a small light source. Compared to traditional surface light field models, this allows us to manipulate the light source and relight the object using environment maps. We further demonstrate the capabilities of our model to predict the visual appearance of an unseen object from a single real RGB image and corresponding 3D shape information. As evidenced by our experiments, our model is able to infer rich visual appearance including shadows and specular reflections. Finally, we show that the proposed representation can be embedded into a variational auto-encoder for generating novel appearances that conform to the specified illumination conditions.
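In spirit, the representation is a network that maps a surface point (plus a shape/appearance code) together with the light's position and color to an outgoing RGB value, which is what makes relighting a matter of changing the light inputs. A deliberately simplified MLP sketch (layer sizes and the conditioning scheme are assumptions, not the paper's model):

```python
import torch
import torch.nn as nn

class SurfaceLightField(nn.Module):
    def __init__(self, code_dim=256, hidden=256):
        super().__init__()
        # input: 3-D surface point + shape/appearance code
        #        + light position + light color
        self.net = nn.Sequential(
            nn.Linear(3 + code_dim + 3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())  # outgoing RGB in [0, 1]

    def forward(self, point, code, light_pos, light_color):
        return self.net(torch.cat([point, code, light_pos, light_color], dim=-1))
```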
9. Modeling 3D Shapes by Reinforcement Learning [PDF] Back to contents
Cheng Lin, Tingxiang Fan, Wenping Wang, Matthias Nießner
Abstract: We explore how to enable machines to model 3D shapes like human modelers using reinforcement learning (RL). In 3D modeling software like Maya, a modeler usually creates a mesh model in two steps: (1) approximating the shape using a set of primitives; (2) editing the meshes of the primitives to create detailed geometry. Inspired by such artist-based modeling, we propose a two-step neural framework based on RL to learn 3D modeling policies. By taking actions and collecting rewards in an interactive environment, the agents first learn to parse a target shape into primitives and then to edit the geometry. To effectively train the modeling agents, we introduce a novel training algorithm that combines heuristic policy, imitation learning and reinforcement learning. Our experiments show that the agents can learn good policies to produce regular and structure-aware mesh models, which demonstrates the feasibility and effectiveness of the proposed RL framework.
10. PyMatting: A Python Library for Alpha Matting [PDF] Back to contents
Thomas Germer, Tobias Uelwer, Stefan Conrad, Stefan Harmeling
Abstract: An important step of many image editing tasks is to extract specific objects from an image in order to place them in a scene of a movie or compose them onto another background. Alpha matting describes the problem of separating the objects in the foreground from the background of an image given only a rough sketch. We introduce the PyMatting package for Python which implements various approaches to solve the alpha matting problem. Our toolbox is also able to extract the foreground of an image given the alpha matte. The implementation aims to be computationally efficient and easy to use. The source code of PyMatting is available under an open-source license at this https URL.
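A typical trimap-based workflow with the library looks like the following; this follows the usage shown in the project's documentation (closed-form alpha estimation plus multi-level foreground estimation), but function names and signatures should be checked against the current release:

```python
from pymatting import (load_image, save_image, stack_images,
                       estimate_alpha_cf, estimate_foreground_ml)

image = load_image("image.png", "RGB")     # float64 in [0, 1]
trimap = load_image("trimap.png", "GRAY")  # 0 = bg, 1 = fg, gray = unknown

alpha = estimate_alpha_cf(image, trimap)           # closed-form matting
foreground = estimate_foreground_ml(image, alpha)  # multi-level foreground

save_image("cutout.png", stack_images(foreground, alpha))  # RGBA cutout
```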
11. Enhanced Self-Perception in Mixed Reality: Egocentric Arm Segmentation and Database with Automatic Labelling [PDF] Back to contents
Ester Gonzalez-Sosa, Pablo Perez, Ruben Tolosana, Redouane Kachach, Alvaro Villegas
Abstract: In this study, we focus on the egocentric segmentation of arms to improve self-perception in Augmented Virtuality (AV). The main contributions of this work are: i) a comprehensive survey of segmentation algorithms for AV; ii) an Egocentric Arm Segmentation Dataset, composed of more than 10,000 images, comprising variations of skin color, and gender, among others. We provide all details required for the automated generation of groundtruth and semi-synthetic images; iii) the use of deep learning for the first time for segmenting arms in AV; iv) to showcase the usefulness of this database, we report results on different real egocentric hand datasets, including GTEA Gaze+, EDSH, EgoHands, Ego Youtube Hands, THU-Read, TEgO, FPAB, and Ego Gesture, which allow for direct comparisons with existing approaches utilizing color or depth. Results confirm the suitability of the EgoArm dataset for this task, achieving improvement up to 40% with respect to the original network, depending on the particular dataset. Results also suggest that, while approaches based on color or depth can work in controlled conditions (lack of occlusion, uniform lighting, only objects of interest in the near range, controlled background, etc.), egocentric segmentation based on deep learning is more robust in real AV applications.
12. Convolutional Spiking Neural Networks for Spatio-Temporal Feature Extraction [PDF] Back to contents
Ali Samadzadeh, Fatemeh Sadat Tabatabaei Far, Ali Javadi, Ahmad Nickabadi, Morteza Haghir Chehreghani
Abstract: Spiking neural networks (SNNs) can be used in low-power and embedded systems (such as emerging neuromorphic chips) due to their event-based nature. Also, they have the advantage of low computation cost in contrast to conventional artificial neural networks (ANNs), while preserving ANN's properties. However, temporal coding in layers of convolutional spiking neural networks and other types of SNNs has yet to be studied. In this paper, we provide insight into spatio-temporal feature extraction of convolutional SNNs in experiments designed to exploit this property. Our proposed shallow convolutional SNN outperforms state-of-the-art spatio-temporal feature extractor methods such as C3D, ConvLstm, and similar networks. Furthermore, we present a new deep spiking architecture to tackle real-world problems (in particular classification tasks), and the model achieved superior performance compared to other SNN methods on CIFAR10-DVS. It is also worth noting that the training process is implemented based on spatio-temporal backpropagation, and ANN to SNN conversion methods will serve no use.
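For readers new to SNNs, the event-based nature mentioned above comes from neuron dynamics like the leaky integrate-and-fire model below: a membrane potential integrates input current, emits a binary spike when it crosses a threshold, and resets. This is standard background, not this paper's architecture; the constants are illustrative.

```python
import numpy as np

def lif_spike_train(input_current, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    v, spikes = v_reset, []
    for i in input_current:                     # one current value per timestep
        v += (dt / tau) * (-(v - v_reset) + i)  # leaky integration
        if v >= v_th:                           # threshold crossing -> event
            spikes.append(1.0)
            v = v_reset                         # hard reset after the spike
        else:
            spikes.append(0.0)
    return np.array(spikes)

print(lif_spike_train(np.full(100, 1.5)).sum(), "spikes in 100 steps")
```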
13. Tackling Two Challenges of 6D Object Pose Estimation: Lack of Real Annotated RGB Images and Scalability to Number of Objects [PDF] Back to contents
Juil Sock, Pedro Castro, Anil Armagan, Guillermo Garcia-Hernando, Tae-Kyun Kim
Abstract: State-of-the-art methods for 6D object pose estimation typically train a Deep Neural Network per object, and its training data first comes from a 3D object mesh. Models trained with synthetic data alone do not generalise well, and training a model for multiple objects sharply drops its accuracy. In this work, we address these two main challenges for 6D object pose estimation and investigate viable methods in experiments. For lack of real RGB data with pose annotations, we propose a novel self-supervision method via pose consistency. For scalability to multiple objects, we apply additional parameterisation to a backbone network and distill knowledge from teachers to a student network for model compression. We further evaluate the combination of the two methods for settings where we are given only synthetic data and a single network for multiple objects. In experiments using LINEMOD, LINEMOD OCCLUSION and T-LESS datasets, the methods significantly boost baseline accuracies and are comparable with the upper bounds, i.e., object specific networks trained on real data with pose labels.
14. An Investigation into the Stochasticity of Batch Whitening [PDF] Back to contents
Lei Huang, Lei Zhao, Yi Zhou, Fan Zhu, Li Liu, Ling Shao
Abstract: Batch Normalization (BN) is extensively employed in various network architectures by performing standardization within mini-batches. A full understanding of the process has been a central target in the deep learning communities. Unlike existing works, which usually only analyze the standardization operation, this paper investigates the more general Batch Whitening (BW). Our work originates from the observation that while various whitening transformations equivalently improve the conditioning, they show significantly different behaviors in discriminative scenarios and training Generative Adversarial Networks (GANs). We attribute this phenomenon to the stochasticity that BW introduces. We quantitatively investigate the stochasticity of different whitening transformations and show that it correlates well with the optimization behaviors during training. We also investigate how stochasticity relates to the estimation of population statistics during inference. Based on our analysis, we provide a framework for designing and comparing BW algorithms in different scenarios. Our proposed BW algorithm improves the residual networks by a significant margin on ImageNet classification. Besides, we show that the stochasticity of BW can improve the GAN's performance with, however, the sacrifice of the training stability.
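As a reference point for what "whitening" means here: unlike BN, which only standardizes each dimension, batch whitening also decorrelates them. ZCA whitening, one member of the transformation family the paper compares, looks like this (a minimal NumPy sketch, not the paper's implementation):

```python
import numpy as np

def zca_whiten(x, eps=1e-5):
    """Whiten a mini-batch x of shape (m, d): center it, then multiply by
    the inverse square root of its covariance so the result has identity
    covariance. Different matrix square roots give different BW variants."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / xc.shape[0] + eps * np.eye(x.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(vals ** -0.5) @ vecs.T  # ZCA: cov^(-1/2)
    return xc @ w
```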
15. Lightweight Photometric Stereo for Facial Details Recovery [PDF] Back to contents
Xueying Wang, Yudong Guo, Bailin Deng, Juyong Zhang
Abstract: Recently, 3D face reconstruction from a single image has achieved great success with the help of deep learning and shape prior knowledge, but they often fail to produce accurate geometry details. On the other hand, photometric stereo methods can recover reliable geometry details, but require dense inputs and need to solve a complex optimization problem. In this paper, we present a lightweight strategy that only requires sparse inputs or even a single image to recover high-fidelity face shapes with images captured under near-field lights. To this end, we construct a dataset containing 84 different subjects with 29 expressions under 3 different lights. Data augmentation is applied to enrich the data in terms of diversity in identity, lighting, expression, etc. With this constructed dataset, we propose a novel neural network specially designed for photometric stereo based 3D face reconstruction. Extensive experiments and comparisons demonstrate that our method can generate high-quality reconstruction results with one to three facial images captured under near-field lights. Our full framework is available at this https URL.
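For context, the classical photometric stereo problem the paper lightens is a per-pixel least squares under a Lambertian assumption: intensities I from k known light directions satisfy I = L (albedo * n). A textbook NumPy solution (background only; the paper replaces this dense-input optimization with a network trained on their near-field dataset):

```python
import numpy as np

def lambertian_photometric_stereo(images, light_dirs):
    """images: (k, h, w) grayscale shots, light_dirs: (k, 3) unit vectors.
    Solves light_dirs @ g = I per pixel, where g = albedo * normal."""
    k, h, w = images.shape
    intensities = images.reshape(k, -1)                 # (k, h*w)
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-8)              # unit surface normals
    return normals.reshape(3, h, w), albedo.reshape(h, w)
```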
16. CurlingNet: Compositional Learning between Images and Text for Fashion IQ Data [PDF] Back to contents
Youngjae Yu, Seunghwan Lee, Yuncheol Choi, Gunhee Kim
Abstract: We present an approach named CurlingNet that can measure the semantic distance of composition of image-text embedding. In order to learn an effective image-text composition for the data in the fashion domain, our model proposes two key components as follows. First, the Delivery makes the transition of a source image in an embedding space. Second, the Sweeping emphasizes query-related components of fashion images in the embedding space. We utilize a channel-wise gating mechanism to make it possible. Our single model outperforms previous state-of-the-art image-text composition models including TIRG and FiLM. We participate in the first fashion-IQ challenge in ICCV 2019, for which ensemble of our model achieves one of the best performances.
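The channel-wise gating mentioned at the end is a common pattern and can be sketched in a few lines: the text (query) embedding produces a per-channel sigmoid gate that reweights the image feature. The module below illustrates that mechanism under assumed dimensions; it is not the authors' code.

```python
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, text_dim=512, img_channels=1024):
        super().__init__()
        self.fc = nn.Linear(text_dim, img_channels)

    def forward(self, img_feat, text_emb):
        """img_feat: (B, C) globally pooled image feature,
        text_emb: (B, text_dim) query embedding."""
        gate = torch.sigmoid(self.fc(text_emb))  # per-channel weights in (0, 1)
        return img_feat * gate                   # emphasize query-related channels
```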
17. Generalizable Semantic Segmentation via Model-agnostic Learning and Target-specific Normalization [PDF] Back to contents
Jian Zhang, Lei Qi, Yinghuan Shi, Yang Gao
Abstract: Semantic segmentation methods in the supervised scenario have achieved significant improvement in recent years. However, when directly deploying the trained model to segment the images of unseen (or new coming) domains, its performance usually drops dramatically due to the data-distribution discrepancy between seen and unseen domains. To overcome this limitation, we propose a novel domain generalization framework for the generalizable semantic segmentation task, which enhances the generalization ability of the model from two different views, including the training paradigm and the data-distribution discrepancy. Concretely, we exploit the model-agnostic learning method to simulate the domain shift problem, which deals with the domain generalization from the training scheme perspective. Besides, considering the data-distribution discrepancy between source domains and unseen target domains, we develop the target-specific normalization scheme to further boost the generalization ability in unseen target domains. Extensive experiments highlight that the proposed method produces state-of-the-art performance for the domain generalization of semantic segmentation on multiple benchmark segmentation datasets (i.e., Cityscapes, Mapillary). Furthermore, we gain an interesting observation that the target-specific normalization can benefit from the model-agnostic learning scheme.
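One common way to realize a target-specific normalization scheme, offered here as a hedged guess at the flavor of the idea rather than the paper's exact recipe, is to re-estimate the normalization statistics from unlabeled target-domain images before evaluation:

```python
import torch

@torch.no_grad()
def adapt_bn_stats(model, target_loader, device="cpu"):
    """Re-estimate BatchNorm running statistics on the target domain."""
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative moving average
    model.train()              # BN updates running stats in train mode
    for images, *_ in target_loader:
        model(images.to(device))
    model.eval()
```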
18. Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [PDF] Back to contents
Deli Yu, Xuan Li, Chengquan Zhang, Junyu Han, Jingtuo Liu, Errui Ding
Abstract: Scene text image contains two levels of contents: visual texture and semantic information. Although the previous scene text recognition methods have made great progress over the past few years, the research on mining semantic information to assist text recognition attracts less attention, only RNN-like structures are explored to implicitly model semantic information. However, we observe that RNN based methods have some obvious shortcomings, such as time-dependent decoding manner and one-way serial transmission of semantic context, which greatly limit the help of semantic information and the computation efficiency. To mitigate these limitations, we propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition, where a global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission. The state-of-the-art results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method. In addition, the speed of SRN has significant advantages over the RNN based methods, demonstrating its value in practical use.
19. Controllable Person Image Synthesis with Attribute-Decomposed GAN [PDF] 返回目录
Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, Zhouhui Lian
Abstract: This paper introduces the Attribute-Decomposed GAN, a novel generative model for controllable person image synthesis, which can produce realistic person images with desired human attributes (e.g., pose, head, upper clothes and pants) provided in various source inputs. The core idea of the proposed model is to embed human attributes into the latent space as independent codes and thus achieve flexible and continuous control of attributes via mixing and interpolation operations in explicit style representations. Specifically, a new architecture consisting of two encoding pathways with style block connections is proposed to decompose the original hard mapping into multiple more accessible subtasks. In the source pathway, we further extract component layouts with an off-the-shelf human parser and feed them into a shared global texture encoder to obtain decomposed latent codes. This strategy allows for the synthesis of more realistic output images and the automatic separation of un-annotated attributes. Experimental results demonstrate the proposed method's superiority over the state of the art in pose transfer and its effectiveness in the brand-new task of component attribute transfer.
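The attribute-code mixing described above reduces to simple per-attribute interpolation in latent space. The sketch below assumes hypothetical per-attribute codes produced by the encoding pathways; `decoder` is likewise a placeholder.

```python
import torch

ATTRS = ["pose", "head", "upper_clothes", "pants"]

def mix_codes(codes_a: dict, codes_b: dict, alpha: dict) -> dict:
    """Interpolate each attribute code independently: alpha[k]=0 keeps A, 1 takes B."""
    return {k: (1 - alpha.get(k, 0.0)) * codes_a[k]
               + alpha.get(k, 0.0) * codes_b[k] for k in ATTRS}

# e.g. keep person A but gradually adopt B's upper clothes only:
# mixed = mix_codes(codes_a, codes_b, {"upper_clothes": 0.5})
# image = decoder(mixed)   # hypothetical decoder of the generative model
```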
20. Weakly Supervised Dataset Collection for Robust Person Detection [PDF] 返回目录
Munetaka Minoguchi, Ken Okayama, Yutaka Satoh, Hirokatsu Kataoka
Abstract: To construct an algorithm that can provide robust person detection, we present a dataset with over 8 million images that was produced in a weakly supervised manner. Through labor-intensive human annotation, the person detection research community has produced relatively small datasets containing on the order of 100,000 images, such as the EuroCity Persons dataset, which includes 240,000 bounding boxes. Therefore, we have collected 8.7 million images of persons based on a two-step collection process, namely person detection with an existing detector and data refinement for false positive suppression. According to the experimental results, the Weakly Supervised Person Dataset (WSPD) is simple yet effective for person detection pre-training. In the context of pre-trained person detection algorithms, our WSPD pre-trained model has 13.38% and 6.38% better accuracy than the same model trained on the fully supervised ImageNet and EuroCity Persons datasets, respectively, when evaluated on the Caltech Pedestrian benchmark.
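The two-step collection process can be sketched as a small pipeline. The `detector` and `verifier` callables and both thresholds below are hypothetical stand-ins for the components the abstract names.

```python
def collect_weakly_supervised(images, detector, verifier,
                              det_thresh=0.8, verify_thresh=0.5):
    """Step 1: run an existing person detector; step 2: suppress false positives."""
    dataset = []
    for img in images:
        for box, score in detector(img):          # existing, possibly imperfect detector
            if score < det_thresh:
                continue
            crop = img.crop(box)                  # e.g. a PIL image crop
            if verifier(crop) >= verify_thresh:   # refinement model filters false positives
                dataset.append((crop, "person"))  # weak label: never human-verified
    return dataset
```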
21. One-Shot GAN Generated Fake Face Detection [PDF] 返回目录
Hadi Mansourifar, Weidong Shi
Abstract: Fake face detection is a significant challenge for intelligent systems as generative models become more powerful every single day. As the quality of fake faces increases, trained models become increasingly ineffective at detecting novel fake faces, since the corresponding training data is considered outdated. In this case, robust One-Shot learning methods are more compatible with the requirements of changeable training data. In this paper, we propose a universal One-Shot GAN-generated fake face detection method which can be used in significantly different areas of anomaly detection. The proposed method is based on extracting out-of-context objects from faces via scene understanding models. To do so, we use state-of-the-art scene understanding and object detection methods as a pre-processing tool to detect anomalous objects in the face. Second, we create a bag of words given all the detected out-of-context objects across all training data. This way, we transform each image into a sparse vector where each feature represents the confidence score of the corresponding detected object in the image. Our experiments show that we can discriminate fake faces from real ones in terms of out-of-context features: different sets of objects are detected in fake faces compared to real ones when we analyze them with scene understanding and object detection models. We show that the proposed method can outperform previous methods in our experiments on StyleGAN-generated fake faces.
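The bag-of-words construction maps directly to a sparse confidence vector per image. A minimal sketch, assuming a fixed vocabulary of object classes and per-detection confidence scores; the detector itself is a placeholder.

```python
import numpy as np

def detections_to_vector(detections, vocab):
    """detections: list of (class_name, confidence); vocab: ordered class list."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    index = {c: i for i, c in enumerate(vocab)}
    for cls, conf in detections:
        if cls in index:
            vec[index[cls]] = max(vec[index[cls]], conf)  # keep the strongest evidence per class
    return vec  # sparse: most entries stay zero for a normal face

# A simple classifier (e.g. logistic regression) over these vectors can then
# separate real faces from GAN-generated ones.
```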
22. Dynamic Region-Aware Convolution [PDF] 返回目录
Jin Chen, Xijun Wang, Zichao Guo, Xiangyu Zhang, Jian Sun
Abstract: We propose a new convolution called Dynamic Region-Aware Convolution (DRConv), which can automatically assign multiple filters to the corresponding spatial regions where features have similar representation. In this way, DRConv outperforms standard convolution in modeling semantic variations. Standard convolution can increase the number of channels to extract more visual elements, but this results in high computational cost. More gracefully, our DRConv transfers the increasing number of channel-wise filters to the spatial dimension with a learnable instructor, which significantly improves the representation ability of convolution and maintains translation invariance like standard convolution. DRConv is an effective and elegant method for handling complex and variable spatial information distributions. It can substitute for standard convolution in any existing network thanks to its plug-and-play property. We evaluate DRConv on a wide range of models (MobileNet series, ShuffleNetV2, etc.) and tasks (classification, face recognition, detection and segmentation). On ImageNet classification, DRConv-based ShuffleNetV2-0.5x achieves state-of-the-art performance of 67.1% at the 46M multiply-adds level, a 6.3% relative improvement.
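A compact way to see the mechanism is a soft variant of the filter-assignment idea: a small guide branch scores, per pixel, which of m candidate 1x1 filters to apply. The real DRConv uses hard assignments and larger filters; this sketch trades that for a differentiable softmax blend, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDRConv(nn.Module):
    def __init__(self, cin, cout, m=4):
        super().__init__()
        self.m = m
        self.guide = nn.Conv2d(cin, m, kernel_size=3, padding=1)   # learnable "instructor"
        self.filters = nn.Conv2d(cin, cout * m, kernel_size=1)     # m candidate filters

    def forward(self, x):
        B, _, H, W = x.shape
        assign = F.softmax(self.guide(x), dim=1)                   # (B, m, H, W) soft region assignment
        resp = self.filters(x).view(B, self.m, -1, H, W)           # (B, m, cout, H, W) all filter responses
        return (assign.unsqueeze(2) * resp).sum(dim=1)             # blend the filter chosen per location
```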
23. Towards Discriminability and Diversity: Batch Nuclear-norm Maximization under Label Insufficient Situations [PDF] 返回目录
Shuhao Cui, Shuhui Wang, Junbao Zhuo, Liang Li, Qingming Huang, Qi Tian
Abstract: The learning of deep networks largely relies on data with human-annotated labels. In some label-insufficient situations, performance degrades on the decision boundary with high data density. A common solution is to directly minimize the Shannon entropy, but the side effect caused by entropy minimization, i.e., the reduction of prediction diversity, is mostly ignored. To address this issue, we reinvestigate the structure of the classification output matrix of a randomly selected data batch. We find by theoretical analysis that the prediction discriminability and diversity can be separately measured by the Frobenius norm and the rank of the batch output matrix. Besides, the nuclear norm is an upper bound of the Frobenius norm and a convex approximation of the matrix rank. Accordingly, to improve both discriminability and diversity, we propose Batch Nuclear-norm Maximization (BNM) on the output matrix. BNM can boost learning under typical label-insufficient scenarios, such as semi-supervised learning, domain adaptation and open domain recognition. On these tasks, extensive experimental results show that BNM outperforms competitors and works well with existing well-known methods. The code is available at this https URL.
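The loss itself is one line in PyTorch. A minimal sketch, assuming `logits` come from unlabeled data and that maximizing the nuclear norm is implemented as minimizing its negative; the weighting scheme in the comment is illustrative.

```python
import torch

def bnm_loss(logits: torch.Tensor) -> torch.Tensor:
    """Batch Nuclear-norm Maximization on unlabeled logits of shape (batch, classes)."""
    probs = torch.softmax(logits, dim=1)                 # classification output matrix A
    return -torch.norm(probs, p="nuc") / probs.size(0)   # maximize ||A||_* (per sample)

# total_loss = supervised_loss + lam * bnm_loss(unlabeled_logits)  # lam: trade-off weight
```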
24. Learning to Optimize Non-Rigid Tracking [PDF] 返回目录
Yang Li, Aljaž Božič, Tianwei Zhang, Yanli Ji, Tatsuya Harada, Matthias Nießner
Abstract: One widespread solution for non-rigid tracking has a nested-loop structure: Gauss-Newton minimizes a tracking objective in the outer loop, and Preconditioned Conjugate Gradient (PCG) solves a sparse linear system in the inner loop. In this paper, we employ learnable optimizations to improve tracking robustness and speed up solver convergence. First, we upgrade the tracking objective by integrating an alignment data term on deep features which are learned end-to-end through a CNN. The new tracking objective can capture the global deformation, which helps Gauss-Newton jump over local minima, leading to robust tracking of large non-rigid motions. Second, we bridge the gap between the preconditioning technique and the learning method by introducing a ConditionNet which is trained to generate a preconditioner such that PCG can converge within a small number of steps. Experimental results indicate that the proposed learning method converges faster than the original PCG by a large margin.
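The inner-loop solver is standard PCG with a pluggable preconditioner, which is exactly where a trained ConditionNet would slot in. A minimal sketch; the Jacobi baseline in the comment is an illustrative substitute for the learned preconditioner.

```python
import torch

def pcg(A, b, M_inv, x=None, iters=50, tol=1e-6):
    """Preconditioned conjugate gradient for Ax = b; M_inv applies the preconditioner."""
    x = torch.zeros_like(b) if x is None else x
    r = b - A @ x
    z = M_inv(r)
    p = z.clone()
    rz = r @ z
    for _ in range(iters):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if torch.linalg.norm(r) < tol:
            break
        z = M_inv(r)                  # a trained network can replace this callable
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Classical baseline: jacobi = lambda r: r / torch.diag(A)
# x = pcg(A, b, jacobi)
```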
25. Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [PDF] 返回目录
Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen
Abstract: Video-based person re-identification (reID) aims at matching the same person across video clips. It is a challenging task due to the existence of redundancy among frames, newly revealed appearance, occlusion, and motion blur. In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-aided Attentive Feature Aggregation (MG-RAFA), to delicately aggregate spatio-temporal features into a discriminative video-level feature representation. In order to determine the contribution/importance of a spatial-temporal feature node, we propose to learn the attention from a global view with convolutional operations. Specifically, for each node we stack its relations, i.e., its pairwise correlations with respect to a representative set of reference feature nodes (S-RFNs) that represent global video information, together with the feature itself to infer the attention. Moreover, to exploit the semantics of different levels, we propose to learn multi-granularity attentions based on the relations captured at different granularities. Extensive ablation studies demonstrate the effectiveness of our attentive feature aggregation module MG-RAFA. Our framework achieves state-of-the-art performance on three benchmark datasets.
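The relation-based attention can be sketched compactly: each spatio-temporal node's correlations with a set of reference nodes, concatenated with the node feature, drive its attention weight. Here the reference nodes are a learnable parameter and the scoring MLP is illustrative; this is not the MG-RAFA architecture itself.

```python
import torch
import torch.nn as nn

class RelationAttention(nn.Module):
    def __init__(self, dim, num_refs=32):
        super().__init__()
        self.refs = nn.Parameter(torch.randn(num_refs, dim))   # stand-in reference feature nodes
        self.score = nn.Sequential(nn.Linear(dim + num_refs, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, feats):                # feats: (B, N, dim), N = T*H*W nodes
        rel = feats @ self.refs.t()          # (B, N, num_refs) pairwise correlations
        attn = self.score(torch.cat([feats, rel], dim=-1))   # relations + feature -> attention
        attn = torch.softmax(attn, dim=1)    # normalize over all nodes
        return (attn * feats).sum(dim=1)     # aggregated clip-level feature
```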
26. HERS: Homomorphically Encrypted Representation Search [PDF] 返回目录
Joshua J. Engelsma, Anil K. Jain, Vishnu Naresh Boddeti
Abstract: We present a method to search for a probe (or query) image representation against a large gallery in the encrypted domain. We require that the probe and gallery images be represented in terms of a fixed-length representation, which is typical for representations obtained from learned networks. Our encryption scheme is agnostic to how the fixed-length representation is obtained and can, therefore, be applied to any fixed-length representation in any application domain. Our method, dubbed HERS (Homomorphically Encrypted Representation Search), operates by (i) compressing the representation towards its estimated intrinsic dimensionality, (ii) encrypting the compressed representation using the proposed fully homomorphic encryption scheme, and (iii) searching against a gallery of encrypted representations directly in the encrypted domain, without decrypting them, and with minimal loss of accuracy. Numerical results on large galleries of face, fingerprint, and object datasets such as ImageNet show that, for the first time, accurate and fast image search within the encrypted domain is feasible at scale (296 seconds; 46x speedup over state-of-the-art for face search against a background of 1 million).
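The pipeline's shape, compression toward an estimated intrinsic dimensionality followed by inner-product scoring, can be sketched in plaintext; the homomorphic step evaluates the same inner products under encryption. The `encrypt`/`enc_dot` names in the comment are placeholders, not a real HE API.

```python
import numpy as np

def compress(gallery, probe, k=64):
    """PCA-style projection of (G, d) gallery and (d,) probe toward k dimensions."""
    mu = gallery.mean(0)
    _, _, vt = np.linalg.svd(gallery - mu, full_matrices=False)
    P = vt[:k].T                         # top-k principal directions
    return (gallery - mu) @ P, (probe - mu) @ P

def search(gallery_c, probe_c, top=5):
    """Plaintext equivalent of the encrypted search: rank by inner product."""
    scores = gallery_c @ probe_c
    return np.argsort(-scores)[:top]

# Under encryption: scores_i = enc_dot(encrypt(probe_c), gallery_c[i]), decrypted
# only for the final ranking; the compression above keeps that step tractable.
```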
27. Action Localization through Continual Predictive Learning [PDF] 返回目录
Sathyanarayanan N. Aakur, Sudeep Sarkar
Abstract: The problem of action recognition involves locating the action in the video, both over time and spatially in the image. The dominant current approaches use supervised learning to solve this problem, and require large amounts of annotated training data in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for future frames. The prediction errors are used to continuously learn the parameters of the models. This self-supervised framework is not as complicated as other approaches but is very effective in learning robust visual representations for both labeling and localization. It should be noted that the approach operates in a streaming fashion, requiring only a single pass through the video, making it amenable to real-time processing. We demonstrate this on three datasets - UCF Sports, JHMDB, and THUMOS'13 - and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and obtain state-of-the-art results in unsupervised gaze prediction.
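The self-supervision signal is the feature-prediction error itself. A minimal sketch, assuming per-frame CNN features are already extracted; the single-layer LSTM and dimensions are illustrative, not the paper's exact model.

```python
import torch
import torch.nn as nn

class FeaturePredictor(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, feats):                        # feats: (B, T, dim) from a CNN encoder
        h, _ = self.lstm(feats[:, :-1])              # predict frame t+1 from frames 1..t
        pred = self.head(h)
        err = ((pred - feats[:, 1:]) ** 2).mean(-1)  # (B, T-1) per-step "surprise"
        return err.mean(), err                       # training loss and localization signal
```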
28. ParSeNet: A Parametric Surface Fitting Network for 3D Point Clouds [PDF] 返回目录
Gopal Sharma, Difan Liu, Evangelos Kalogerakis, Subhransu Maji, Siddhartha Chaudhuri, Radomir Měch
Abstract: We propose a novel, end-to-end trainable, deep network called ParSeNet that decomposes a 3D point cloud into parametric surface patches, including B-spline patches as well as basic geometric primitives. ParSeNet is trained on a large-scale dataset of man-made 3D shapes and captures high-level semantic priors for shape decomposition. It handles a much richer class of primitives than prior work, and allows us to represent surfaces with higher fidelity. It also produces repeatable and robust parametrizations of a surface compared to purely geometric approaches. We present extensive experiments to validate our approach against analytical and learning-based alternatives.
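The patch-fitting step can be illustrated with the simplest primitive, a least-squares plane via SVD; ParSeNet itself predicts the decomposition end-to-end and fits richer primitives such as B-spline patches.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane for one point patch; points: (N, 3) array."""
    c = points.mean(0)
    _, _, vt = np.linalg.svd(points - c, full_matrices=False)
    normal = vt[-1]                                   # direction of least variance
    residual = np.abs((points - c) @ normal).mean()   # mean point-to-plane distance
    return c, normal, residual
```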
29. An improved 3D region detection network: automated detection of the 12th thoracic vertebra in image guided radiation therapy [PDF] 返回目录
Yunhe Xie, Gregory Sharp, David P. Gierga, Theodore S. Hong, Thomas Bortfeld, Kongbin Kang
Abstract: Image guidance has been widely used in radiation therapy. Correctly identifying anatomical landmarks, like the 12th thoracic vertebra (T12), is the key to success. Until recently, the detection of those landmarks still required tedious manual inspections and annotations, and superior-inferior misalignment to the wrong vertebral body is still relatively common in image guided radiation therapy. It is necessary to develop an automated approach to detect those landmarks from images. There are three major challenges to identifying the T12 vertebra automatically: 1) subtle differences between structures with high similarity, 2) limited annotated training data, and 3) high memory usage of 3D networks. In this study, we propose a novel 3D full convolutional network (FCN) that is trained to detect anatomical structures from 3D volumetric data, requiring only a small amount of training data. Compared with existing approaches, the network architecture, target generation and loss functions were significantly improved to address the challenges specific to medical images. In our experiments, the proposed network, which was trained on a small amount of annotated images, demonstrated the capability of accurately detecting structures with high similarity. Furthermore, the trained network showed the capability of cross-modality learning. This is meaningful in situations where image annotations in one modality are easier to obtain than in others. The cross-modality learning ability also indicated that the learned features were robust to noise in different image modalities. In summary, our approach has great potential to be integrated into the clinical workflow to improve the safety of image guided radiation therapy.
30. Cycle Text-To-Image GAN with BERT [PDF] 返回目录
Trevor Tsue, Samir Sen, Jason Li
Abstract: We explore novel approaches to the task of image generation from their respective captions, building on state-of-the-art GAN architectures. Particularly, we baseline our models with the Attention-based GANs that learn attention mappings from words to image features. To better capture the features of the descriptions, we then build a novel cyclic design that learns an inverse function to map the image back to the original caption. Additionally, we incorporate recently developed BERT pretrained word embeddings as our initial text featurizer and observe a noticeable improvement in qualitative and quantitative performance compared to the Attention GAN baseline.
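Swapping BERT in as the text featurizer is straightforward with the Hugging Face transformers package; the sketch below extracts a [CLS] sentence embedding for conditioning a generator (the GAN itself is omitted, and the sample caption is illustrative).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def caption_features(caption: str) -> torch.Tensor:
    toks = tokenizer(caption, return_tensors="pt")
    out = bert(**toks)
    return out.last_hidden_state[:, 0]   # [CLS] vector as the sentence embedding

# cond = caption_features("a small bird with a red head")  # conditioning input to the generator
```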
31. SaccadeNet: A Fast and Accurate Object Detector [PDF] 返回目录
Shiyi Lan, Zhou Ren, Yi Wu, Larry S. Davis, Gang Hua
Abstract: Object detection is an essential step towards holistic scene understanding. Most existing object detection algorithms attend to certain object areas once and then predict the object locations. However, neuroscientists have revealed that humans do not look at the scene with fixed steadiness. Instead, human eyes move around, locating informative parts to understand the object location. This active perceiving movement process is called saccade. Inspired by such a mechanism, we propose a fast and accurate object detector called SaccadeNet. It contains four main modules, the Center Attentive Module, the Corner Attentive Module, the Attention Transitive Module, and the Aggregation Attentive Module, which allow it to attend to different informative object keypoints and predict object locations from coarse to fine. The Corner Attentive Module is used only during training to extract more informative corner features, which brings a free-lunch performance boost. On the MS COCO dataset, we achieve 40.4% mAP at 28 FPS and 30.5% mAP at 118 FPS. Among all real-time object detectors, our SaccadeNet achieves the best detection performance, which demonstrates the effectiveness of the proposed detection mechanism.
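Center-keypoint decoding, the kind of informative-keypoint step a center-based detector builds on, can be sketched with the standard max-pool non-maximum-suppression trick. This is a generic decoder, not SaccadeNet's own code.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, k: int = 100):
    """heatmap: (B, C, H, W) center heatmap after sigmoid; returns top-k centers."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)            # keep local maxima only
    B, C, H, W = peaks.shape
    scores, idx = peaks.view(B, -1).topk(k)
    cls = torch.div(idx, H * W, rounding_mode="floor")
    ys = torch.div(idx % (H * W), W, rounding_mode="floor")
    xs = idx % W
    return scores, cls, ys, xs                       # score, class and position per center
```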
32. Real-time information retrieval from Identity cards [PDF] 返回目录
Niloofar Tavakolian, Azadeh Nazemi, Donal Fitzpatrick
Abstract: Information is frequently retrieved from valid personal ID cards by authorised organisations for different purposes. Successful information retrieval (IR) depends on the accuracy and timing of the process. A process that takes a long time to respond is frustrating for both parties in the exchange of data. This paper proposes a series of state-of-the-art methods for the journey of an identification card (ID) from the scanning or capture phase to the point before Optical Character Recognition (OCR). The key factors for this proposal are the accuracy and speed of the process throughout this journey. The experimental results of this research show that utilising methods based on deep learning, such as the Efficient and Accurate Scene Text (EAST) detector and a Deep Neural Network (DNN) for face detection, instead of traditional methods increases the efficiency considerably.
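Running the EAST detector through OpenCV's dnn module is a common way to realize this pipeline stage. The frozen-graph filename below is an assumption (the model distributed with OpenCV's text-detection tutorial); the two output layer names are the ones that model exposes, and box decoding plus NMS would follow.

```python
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")   # assumed local model file
layers = ["feature_fusion/Conv_7/Sigmoid",               # text/no-text scores
          "feature_fusion/concat_3"]                     # rotated-box geometry

def detect_text(image, size=(320, 320)):                 # width/height must be multiples of 32
    blob = cv2.dnn.blobFromImage(image, 1.0, size,
                                 (123.68, 116.78, 103.94), swapRB=True, crop=False)
    net.setInput(blob)
    scores, geometry = net.forward(layers)
    return scores, geometry                              # decode boxes + NMS downstream
```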
33. Local Facial Makeup Transfer via Disentangled Representation [PDF] 返回目录
Zhaoyang Sun, Wenxuan Liu, Feng Liu, Ryan Wen Liu, Shengwu Xiong
Abstract: Facial makeup transfer aims to render a non-makeup face image with an arbitrary given makeup while preserving face identity. The most advanced methods separate makeup style information from face images to realize makeup transfer. However, makeup style includes several semantically clear local styles which are still entangled together. In this paper, we propose a novel unified adversarial disentangling network to further decompose face images into four independent components, i.e., personal identity, lips makeup style, eyes makeup style and face makeup style. Owing to this further disentangling of makeup style, our method can not only control the degree of the global makeup style, but also flexibly regulate the degree of local makeup styles, which other approaches cannot do. For makeup removal, unlike other methods which regard makeup removal as the reverse process of makeup, we integrate makeup transfer and makeup removal into one uniform framework and obtain multiple makeup removal results. Extensive experiments have demonstrated that our approach can produce more realistic and accurate makeup transfer results compared to state-of-the-art methods.
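Degree control falls out of the disentangled codes: keep identity fixed and interpolate each makeup component with its own strength. The dictionary layout and names below are illustrative, not the paper's implementation.

```python
def apply_makeup(src: dict, ref: dict, strength: dict) -> dict:
    """src/ref: codes with keys identity, lips, eyes, face; strength: per-part 0..1."""
    out = {"identity": src["identity"]}          # identity is never mixed
    for part in ("lips", "eyes", "face"):
        a = strength.get(part, 0.0)
        out[part] = (1 - a) * src[part] + a * ref[part]
    return out                                   # feed the mixed codes to the decoder

# Full makeup removal is the special case where ref holds no-makeup codes
# and strength is 1.0 for every component.
```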
34. Augmenting Colonoscopy using Extended and Directional CycleGAN for Lossy Image Translation [PDF] 返回目录
Shawn Mathew, Saad Nadeem, Sruti Kumari, Arie Kaufman
Abstract: Colorectal cancer screening modalities, such as optical colonoscopy (OC) and virtual colonoscopy (VC), are critical for diagnosing and ultimately removing polyps (precursors of colon cancer). The non-invasive VC is normally used to inspect a 3D reconstructed colon (from CT scans) for polyps and if found, the OC procedure is performed to physically traverse the colon via endoscope and remove these polyps. In this paper, we present a deep learning framework, Extended and Directional CycleGAN, for lossy unpaired image-to-image translation between OC and VC to augment OC video sequences with scale-consistent depth information from VC, and augment VC with patient-specific textures, color and specular highlights from OC (e.g., for realistic polyp synthesis). Both OC and VC contain structural information, but it is obscured in OC by additional patient-specific texture and specular highlights, hence making the translation from OC to VC lossy. The existing CycleGAN approaches do not handle lossy transformations. To address this shortcoming, we introduce an extended cycle consistency loss, which compares the geometric structures from OC in the VC domain. This loss removes the need for the CycleGAN to embed OC information in the VC domain. To handle a stronger removal of the textures and lighting, a Directional Discriminator is introduced to differentiate the direction of translation (by creating paired information for the discriminator), as opposed to the standard CycleGAN which is direction-agnostic. Combining the extended cycle consistency loss and the Directional Discriminator, we show state-of-the-art results on scale-consistent depth inference for phantom, textured VC and for real polyp and normal colon video sequences. We also present results for realistic pedunculated and flat polyp synthesis from bumps introduced in 3D VC models.
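One plausible reading of the extended cycle consistency loss is sketched below: since the OC-to-VC direction is lossy, consistency is enforced on geometric structure in the VC domain rather than on exact pixel reconstruction. The generators and the structure extractor are placeholders, and this is an interpretation of the abstract, not the paper's exact formulation.

```python
import torch.nn.functional as F

def extended_cycle_loss(oc, G_oc2vc, G_vc2oc, structure):
    """oc: batch of OC frames; G_* are the two generators; structure extracts geometry."""
    vc_hat = G_oc2vc(oc)         # texture and specular detail are dropped here
    oc_hat = G_vc2oc(vc_hat)     # the lost detail cannot be recovered exactly...
    # ...so compare geometric structure in the VC domain instead of raw OC pixels:
    return F.l1_loss(structure(G_oc2vc(oc_hat)), structure(vc_hat))
```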
35. A Computer-Aided Diagnosis System Using Artificial Intelligence for Proximal Femoral Fractures Enables Residents to Achieve a Diagnostic Rate Equivalent to Orthopedic Surgeons -- multi-institutional joint development research [PDF] 返回目录
Yoichi Sato, Takamune Asamoto, Yutaro Ono, Ryosuke Goto, Asahi Kitamura, Seiwa Honda
Abstract: [Objective] To develop a CAD system for proximal femoral fractures on plain frontal hip radiographs, using a CNN trained on a large dataset collected at multiple institutions, and to examine whether residents' diagnostic rate for proximal femoral fractures improves when this CAD system is used as a diagnostic aid. [Materials and methods] In total, 4851 proximal femoral fracture patients who visited each institution between 2009 and 2019 were included. 5242 plain pelvic radiographs were extracted from a DICOM server, and a total of 10484 images (5242 with fracture and 5242 without fracture) were used for machine learning. A CNN approach was used: the EfficientNet-B4 framework with Pytorch 1.3 and this http URL 1.0. In the final evaluation, accuracy, sensitivity, specificity, F-value, and AUC were evaluated. Grad-CAM was used to visualize the basis of the CAD system's diagnoses. For 31 residents and 4 orthopedic surgeons, an image diagnosis test was carried out on 600 images of proximal femoral fractures randomly extracted from the test image data set, and diagnostic rates with and without the CAD system's support were evaluated. [Results] The diagnostic accuracy of the learning model was 96.1%, sensitivity 95.2%, specificity 96.9%, F-value 0.961, and AUC 0.99. Grad-CAM was used to visualize the basis of the most accurate diagnoses. In the image diagnosis test, residents achieved diagnostic ability equivalent to that of orthopedic surgeons when using the CAD system as a diagnostic aid. [Conclusions] The AI-based CAD system we developed for proximal femoral fractures can indicate the reasoning behind a diagnosis and serves as an image diagnosis tool with high diagnostic accuracy. It may contribute to improving diagnostic rates in actual clinical settings such as the emergency room.
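The reported metrics are straightforward to compute from binary fracture predictions. A minimal sketch with scikit-learn, assuming numpy arrays of ground-truth labels and predicted probabilities; the threshold is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def fracture_metrics(y_true: np.ndarray, y_prob: np.ndarray, thresh: float = 0.5):
    """y_true: 0/1 fracture labels; y_prob: predicted fracture probabilities."""
    y_pred = (y_prob >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),        # recall on fracture cases
        "specificity": tn / (tn + fp),
        "f_value":     f1_score(y_true, y_pred),
        "auc":         roc_auc_score(y_true, y_prob),
    }
```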
36. End-to-End Entity Classification on Multimodal Knowledge Graphs [PDF] Back to contents
W.X. Wilcke, P. Bloem, V. de Boer, R.H. van t Veer, F.A.H. van Harmelen
Abstract: End-to-end multimodal learning on knowledge graphs has been left largely unaddressed. Instead, most end-to-end models such as message passing networks learn solely from the relational information encoded in graphs' structure: raw values, or literals, are either omitted completely or are stripped from their values and treated as regular nodes. In either case we lose potentially relevant information which could have otherwise been exploited by our learning methods. To avoid this, we must treat literals and non-literals as separate cases. We must also address each modality separately and accordingly: numbers, texts, images, geometries, et cetera. We propose a multimodal message passing network which not only learns end-to-end from the structure of graphs, but also from their possibly diverse set of multimodal node features. Our model uses dedicated (neural) encoders to naturally learn embeddings for node features belonging to five different types of modalities, including images and geometries, which are projected into a joint representation space together with their relational information. We demonstrate our model on a node classification task, and evaluate the effect that each modality has on the overall performance. Our result supports our hypothesis that including information from multiple modalities can help our models obtain a better overall performance.
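The dedicated per-modality encoders can be sketched as follows. This is a hedged PyTorch sketch, with layer shapes and names chosen for illustration (the paper covers five modalities; three are shown here), not the authors' architecture:

```python
import torch.nn as nn

class MultimodalNodeEncoder(nn.Module):
    # per-modality encoders projecting node features into one joint
    # embedding space, where message passing over the graph can mix
    # them with the relational (structural) embeddings
    def __init__(self, dim=64):
        super().__init__()
        self.num_enc = nn.Linear(1, dim)      # numeric literals
        self.txt_enc = nn.Linear(300, dim)    # pre-vectorised text literals
        self.img_enc = nn.Sequential(         # small CNN for image literals
            nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, dim))

    def forward(self, numbers, texts, images):
        # each modality lands in the same dim-sized joint space
        return self.num_enc(numbers), self.txt_enc(texts), self.img_enc(images)
```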
37. COVID-19 Screening on Chest X-ray Images Using Deep Learning based Anomaly Detection [PDF] Back to contents
Jianpeng Zhang, Yutong Xie, Yi Li, Chunhua Shen, Yong Xia
Abstract: Coronaviruses are important human and animal pathogens. To date, the novel COVID-19 coronavirus is spreading rapidly worldwide, threatening the health of billions of people. Clinical studies have shown that most COVID-19 patients suffer from lung infection. Although chest CT has been shown to be an effective imaging technique for diagnosing lung-related disease, chest X-ray is more widely available due to its faster imaging time and considerably lower cost than CT. Deep learning, one of the most successful AI techniques, is an effective means of assisting radiologists in analyzing the vast number of chest X-ray images, which can be critical for efficient and reliable COVID-19 screening. In this work, we aim to develop a new deep anomaly detection model for fast, reliable screening. To evaluate model performance, we collected 100 chest X-ray images of 70 patients confirmed with COVID-19 from the GitHub repository. To facilitate deep learning, more data are needed; thus, we also collected 1431 additional chest X-ray images, confirmed as other pneumonia, of 1008 patients from the public ChestX-ray14 dataset. Our initial experimental results show that the model developed here can reliably detect 96.00% of COVID-19 cases (sensitivity 96.00%) and 70.65% of non-COVID-19 cases (specificity 70.65%) when evaluated on 1531 X-ray images over two splits of the dataset.
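At test time, screening with an anomaly detector reduces to thresholding an anomaly score, with the threshold trading sensitivity against specificity. A minimal sketch (the scores and threshold below are illustrative, not the paper's):

```python
import numpy as np

def screen(scores, threshold=0.5):
    # flag an image as COVID-19-suspect when its anomaly score exceeds
    # the threshold; moving the threshold trades sensitivity for
    # specificity, which is how an operating point like the reported
    # 96.00% / 70.65% pair arises
    return np.asarray(scores) > threshold

print(screen([0.12, 0.87, 0.45]))  # -> [False  True False]
```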
38. A Comprehensive Review for Breast Histopathology Image Analysis Using Classical and Deep Neural Networks [PDF] Back to contents
Xiaomin Zhou, Chen Li, Md Mamunur Rahaman, Yudong Yao, Shiliang Ai, Changhao Sun, Xiaoyan Li, Qian Wang, Tao Jiang
Abstract: Breast cancer is one of the most common and deadliest cancers among women. Since histopathological images contain sufficient phenotypic information, they play an indispensable role in the diagnosis and treatment of breast cancers. To improve the accuracy and objectivity of Breast Histopathological Image Analysis (BHIA), Artificial Neural Network (ANN) approaches are widely used in the segmentation and classification tasks of breast histopathological images. In this review, we present a comprehensive overview of the BHIA techniques based on ANNs. First of all, we categorize the BHIA systems into classical and deep neural networks for in-depth investigation. Then, the relevant studies based on BHIA systems are presented. After that, we analyze the existing models to discover the most suitable algorithms. Finally, publicly accessible datasets, along with their download links, are provided for the convenience of future researchers.
39. MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation [PDF] Back to contents
Chaoyang He, Haishan Ye, Li Shen, Tong Zhang
Abstract: Many recently proposed methods for Neural Architecture Search (NAS) can be formulated as bilevel optimization. For efficient implementation, its solution requires approximations of second-order methods. In this paper, we demonstrate that gradient errors caused by such approximations lead to suboptimality, in the sense that the optimization procedure fails to converge to a (locally) optimal solution. To remedy this, this paper proposes MiLeNAS, a mixed-level reformulation for NAS that can be optimized efficiently and reliably. It is shown that even when using a simple first-order method on the mixed-level formulation, MiLeNAS can achieve a lower validation error for NAS problems. Consequently, architectures obtained by our method achieve consistently higher accuracies than those obtained from bilevel optimization. Moreover, MiLeNAS proposes a framework beyond DARTS. It is upgraded via model size-based search and early stopping strategies to complete the search process in around 5 hours. Extensive experiments within the convolutional architecture search space validate the effectiveness of our approach.
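The contrast with bilevel optimization can be sketched as a first-order update rule. This is a hedged sketch: the function names, the lambda weighting, and the update order are assumptions based on the abstract, not the paper's exact algorithm:

```python
def mixed_level_step(w, alpha, grad_tr, grad_val, lr_w, lr_a, lam):
    # Bilevel NAS treats w*(alpha) as an inner problem and needs
    # (approximate) second-order terms for d L_val / d alpha; the
    # approximation error is what causes suboptimality. A mixed-level
    # reformulation instead descends a weighted combination of training
    # and validation gradients with plain first-order steps.
    w = w - lr_w * grad_tr(w, alpha, wrt="w")
    alpha = alpha - lr_a * (grad_tr(w, alpha, wrt="alpha")
                            + lam * grad_val(w, alpha, wrt="alpha"))
    return w, alpha
```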
40. Going in circles is the way forward: the role of recurrence in visual inference [PDF] Back to contents
Ruben S. van Bergen, Nikolaus Kriegeskorte
Abstract: Biological visual systems exhibit abundant recurrent connectivity. State-of-the-art neural network models for visual recognition, by contrast, rely heavily or exclusively on feedforward computation. Any finite-time recurrent neural network (RNN) can be unrolled along time to yield an equivalent feedforward neural network (FNN). This important insight suggests that computational neuroscientists may not need to engage recurrent computation, and that computer-vision engineers may be limiting themselves to a special case of FNN if they build recurrent models. Here we argue, to the contrary, that FNNs are a special case of RNNs and that computational neuroscientists and engineers should engage recurrence to understand how brains and machines can (1) achieve greater and more flexible computational depth, (2) compress complex computations into limited hardware, (3) integrate priors and priorities into visual inference through expectation and attention, (4) exploit sequential dependencies in their data for better inference and prediction, and (5) leverage the power of iterative computation.
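The unrolling argument can be checked numerically in a few lines. A toy sketch (the sizes and the tanh nonlinearity are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W_in, W_rec = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
x, T = rng.normal(size=3), 3

# recurrent computation: h_t = tanh(W_in x + W_rec h_{t-1})
h = np.zeros(4)
for _ in range(T):
    h = np.tanh(W_in @ x + W_rec @ h)

# the same computation as a T-layer feedforward net whose layers are
# tied copies of the recurrent weights -- the unrolled FNN
layers = [(W_in.copy(), W_rec.copy()) for _ in range(T)]
h_ff = np.zeros(4)
for Wi, Wr in layers:
    h_ff = np.tanh(Wi @ x + Wr @ h_ff)

assert np.allclose(h, h_ff)  # finite-time RNN == weight-tied FNN
```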
41. Pedestrian Detection with Wearable Cameras for the Blind: A Two-way Perspective [PDF] Back to contents
Kyungjun Lee, Daisuke Sato, Saki Asakawa, Hernisa Kacorri, Chieko Asakawa
Abstract: Blind people have limited access to information about their surroundings, which is important for ensuring one's safety, managing social interactions, and identifying approaching pedestrians. With advances in computer vision, wearable cameras can provide equitable access to such information. However, the always-on nature of these assistive technologies poses privacy concerns for parties that may get recorded. We explore this tension from both perspectives, those of sighted passersby and blind users, taking into account camera visibility, in-person versus remote experience, and extracted visual information. We conduct two studies: an online survey with MTurkers (N=206) and an in-person experience study between pairs of blind (N=10) and sighted (N=40) participants, where blind participants wear a working prototype for pedestrian detection and pass by sighted participants. Our results suggest that both of the perspectives of users and bystanders and the several factors mentioned above need to be carefully considered to mitigate potential social tensions.