Contents
2. MRZ code extraction from visa and passport documents using convolutional neural networks [PDF] Abstract
5. Semi-Supervised Active Learning for COVID-19 Lung Ultrasound Multi-symptom Classification [PDF] Abstract
10. Critical analysis on the reproducibility of visual quality assessment using deep features [PDF] Abstract
15. Enabling Image Recognition on Constrained Devices Using Neural Network Pruning and a CycleGAN [PDF] Abstract
16. Unsupervised Partial Point Set Registration via Joint Shape Completion and Registration [PDF] Abstract
18. Fairness Matters -- A Data-Driven Framework Towards Fair and High Performing Facial Recognition Systems [PDF] Abstract
21. PiaNet: A pyramid input augmented convolutional neural network for GGO detection in 3D lung CT scans [PDF] Abstract
22. Novel and Effective CNN-Based Binarization for Historically Degraded As-built Drawing Maps [PDF] Abstract
26. An unsupervised deep learning framework via integrated optimization of representation learning and GMM-based modeling [PDF] Abstract
30. Variance Loss: A Confidence-Based Reweighting Strategy for Coarse Semantic Segmentation [PDF] Abstract
35. SWP-Leaf NET: a novel multistage approach for plant leaf identification based on deep learning [PDF] Abstract
42. Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection [PDF] Abstract
48. COVIDNet-CT: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest CT Images [PDF] Abstract
49. Disentangling Neural Architectures and Weights: A Case Study in Supervised Classification [PDF] Abstract
53. COVID CT-Net: Predicting Covid-19 From Chest CT Images Using Attentional Convolutional Network [PDF] Abstract
Abstracts
1. TP-LSD: Tri-Points Based Line Segment Detector [PDF] Back to contents
Siyu Huang, Fangbo Qin, Pengfei Xiong, Ning Ding, Yijia He, Xiao Liu
Abstract: This paper proposes a novel deep convolutional model, Tri-Points Based Line Segment Detector (TP-LSD), to detect line segments in an image at real-time speed. Previous related methods typically use a two-step strategy, relying on either a heuristic post-process or an extra classifier. To realize one-step detection with a faster and more compact model, we introduce the tri-points representation, converting line segment detection into the end-to-end prediction of a root-point and two endpoints for each line segment. TP-LSD has two branches: a tri-points extraction branch and a line segmentation branch. The former predicts the heat map of root-points and the two displacement maps of endpoints. The latter segments the pixels on straight lines from the background. Moreover, the line segmentation map is reused in the first branch as a structural prior. We propose an additional novel evaluation metric and evaluate our method on the Wireframe and YorkUrban datasets, demonstrating not only competitive accuracy compared to the most recent methods, but also a real-time run speed of up to 78 FPS with $320\times 320$ input.
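The tri-points representation described above can be illustrated with a small decoding sketch. This is not the authors' implementation: the function name `decode_tri_points`, the nested-list layout of the maps, and the fixed score threshold are all assumptions made for illustration, and a real detector would typically also keep only local maxima of the heat map.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]


def decode_tri_points(
    heat: Sequence[Sequence[float]],
    disp_a: Sequence[Sequence[Point]],
    disp_b: Sequence[Sequence[Point]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Segment]:
    """Turn a root-point heat map and two displacement maps into segments.

    heat[y][x] is the root-point score at pixel (x, y); disp_a[y][x] and
    disp_b[y][x] hold the (dx, dy) offsets from that root point to each
    of the two endpoints.
    """
    segments: List[Segment] = []
    for y, row in enumerate(heat):
        for x, score in enumerate(row):
            if score < threshold:
                continue  # not a confident root point
            ax, ay = disp_a[y][x]
            bx, by = disp_b[y][x]
            # Each endpoint is the root point plus its predicted displacement.
            segments.append(((x + ax, y + ay), (x + bx, y + by)))
    return segments
```

With a single confident root point at (1, 1) and displacements of (-1, 0) and (1, 0), the sketch yields one horizontal segment from (0, 1) to (2, 1).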
2. MRZ code extraction from visa and passport documents using convolutional neural networks [PDF] Back to contents
Yichuan Liu, Hailey James, Otkrist Gupta, Dan Raviv
Abstract: Detecting and extracting information from Machine-Readable Zone (MRZ) on passports and visas is becoming increasingly important for verifying document authenticity. However, computer vision methods for performing similar tasks, such as optical character recognition (OCR), fail to extract the MRZ given digital images of passports with reasonable accuracy. We present a specially designed model based on convolutional neural networks that is able to successfully extract MRZ information from digital images of passports of arbitrary orientation and size. Our model achieved 100% MRZ detection rate and 98.36% character recognition macro-f1 score on a passport and visa dataset.
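For readers unfamiliar with the macro-F1 score quoted above, the following sketch shows how it is conventionally computed for per-character predictions: F1 is calculated separately for each character class and then averaged with equal weight. The function name `macro_f1` and the example labels are illustrative assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, rare characters influence the score as much as frequent ones such as the `<` filler.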
3. Heterogeneous Domain Generalization via Domain Mixup [PDF] Back to contents
Yufei Wang, Haoliang Li, Alex C. Kot
Abstract: One of the main drawbacks of deep Convolutional Neural Networks (DCNN) is that they lack generalization capability. In this work, we focus on the problem of heterogeneous domain generalization, which aims to improve generalization capability across different tasks: that is, how to learn a DCNN model from multiple-domain data such that the trained feature extractor can be generalized to support recognition of novel categories in a novel target domain. To solve this problem, we propose a novel heterogeneous domain generalization method that mixes up samples across multiple source domains with two different sampling strategies. Our experimental results on the Visual Decathlon benchmark demonstrate the effectiveness of the proposed method. The code is released at this https URL.
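As a rough illustration of mixing samples across source domains, the sketch below applies standard mixup (a convex combination of two inputs and their labels, with the weight drawn from a Beta distribution) to one sample from each of two domains. It does not reproduce the paper's two sampling strategies; the function name `mixup`, the `alpha` default, and the assumption that labels are one-hot vectors over a shared label space are all illustrative.

```python
import random
from typing import List, Sequence, Tuple


def mixup(
    x_a: Sequence[float],
    y_a: Sequence[float],
    x_b: Sequence[float],
    y_b: Sequence[float],
    alpha: float = 0.2,  # hypothetical Beta parameter
    rng=random,
) -> Tuple[List[float], List[float], float]:
    """Blend a sample from domain A with a sample from domain B.

    Returns the mixed input, the mixed (soft) label, and the mixing weight.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible; the mixed label remains a valid distribution because the two weights sum to one.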
4. ODIN: Automated Drift Detection and Recovery in Video Analytics [PDF] Back to contents
Abhijit Suprem, Joy Arulraj, Calton Pu, Joao Ferreira
Abstract: Recent advances in computer vision have led to a resurgence of interest in visual data analytics. Researchers are developing systems for effectively and efficiently analyzing visual data at scale. A significant challenge that these systems encounter lies in the drift in real-world visual data. For instance, a model for self-driving vehicles that is not trained on images containing snow does not work well when it encounters them in practice. This drift phenomenon limits the accuracy of models employed for visual data analytics. In this paper, we present a visual data analytics system, called ODIN, that automatically detects and recovers from drift. ODIN uses adversarial autoencoders to learn the distribution of high-dimensional images. We present an unsupervised algorithm for detecting drift by comparing the distributions of the given data against that of previously seen data. When ODIN detects drift, it invokes a drift recovery algorithm to deploy specialized models tailored towards the novel data points. These specialized models outperform their non-specialized counterpart on accuracy, performance, and memory footprint. Lastly, we present a model selection algorithm for picking an ensemble of best-fit specialized models to process a given input. We evaluate the efficacy and efficiency of ODIN on high-resolution dashboard camera videos captured under diverse environments from the Berkeley DeepDrive dataset. We demonstrate that ODIN's models deliver 6x higher throughput, 2x higher accuracy, and 6x smaller memory footprint compared to a baseline system without automated drift detection and recovery.
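The idea of detecting drift by comparing the distribution of incoming data with that of previously seen data can be sketched with a deliberately simple stand-in. This is not ODIN's algorithm: it assumes each image has already been mapped to a latent vector by some encoder, and it flags drift when a batch lies much farther from the reference centroid than the reference data itself. The names `detect_drift` and `tolerance` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def mean_distance(vectors: Sequence[Vector], center: Vector) -> float:
    """Average Euclidean distance from each vector to the given center."""
    return sum(math.dist(v, center) for v in vectors) / len(vectors)


def detect_drift(
    reference: Sequence[Vector],
    batch: Sequence[Vector],
    tolerance: float = 2.0,  # hypothetical multiplier on the reference spread
) -> bool:
    """Return True when the batch sits unusually far from the reference data."""
    center = centroid(reference)
    baseline = mean_distance(reference, center)
    return mean_distance(batch, center) > tolerance * baseline
```

A production system would compare full distributions rather than a single spread statistic, but the control flow is the same: summarize the seen data, score the new batch against it, and trigger recovery when the score crosses a threshold.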
5. Semi-Supervised Active Learning for COVID-19 Lung Ultrasound Multi-symptom Classification [PDF] Back to contents
Lei Liu, Wentao Lei, Yongfang Luo, Cheng Feng, Xiang Wan, Li Liu
Abstract: Ultrasound (US) is a non-invasive yet effective medical diagnostic imaging technique for the COVID-19 global pandemic. However, due to complex feature behaviors and expensive annotations of US images, it is difficult to apply Artificial Intelligence (AI)-assisted approaches to multi-symptom (multi-label) classification of the lung. To overcome these difficulties, we propose a novel semi-supervised Two-Stream Active Learning (TSAL) method to model complicated features and reduce labeling costs in an iterative procedure. The core component of TSAL is the multi-label learning mechanism, in which label correlation information is used to design a multi-label margin (MLM) strategy and confidence validation for automatically selecting informative samples and confident labels. On this basis, a multi-symptom multi-label (MSML) classification network is proposed to learn discriminative features of lung symptoms, and human-machine interaction is exploited to confirm the final annotations that are used to fine-tune MSML with progressively labeled data. Moreover, a novel lung US dataset named COVID19-LUSMS is built, currently containing 71 clinical patients with 6,836 images sampled from 678 videos. Experimental evaluations show that TSAL, using only 20% of the data, can achieve superior performance to the baseline and the state of the art. Qualitatively, visualization of both the attention map and the sample distribution confirms good consistency with clinical knowledge.
6. Evaluation of the Robustness of Visual SLAM Methods in Different Environments [PDF] Back to contents
Joonas Lomps, Artjom Lind, Amnir Hadachi
Abstract: Determining the position and orientation of a sensor vis-a-vis its surroundings while simultaneously mapping the environment around that sensor, known as simultaneous localization and mapping (SLAM), is quickly becoming an important advancement in embedded vision with a large number of possible applications. This paper presents a comprehensive comparison of the latest open-source SLAM algorithms, with the main focus being their performance in different environmental surroundings. The chosen algorithms are evaluated on common publicly available datasets, and the results are reasoned with respect to the datasets' environments. This is the first stage of our main target of testing the methods in off-road scenarios.
7. ZooBuilder: 2D and 3D Pose Estimation for Quadrupeds Using Synthetic Data [PDF] Back to contents
Abassin Sourou Fangbemi, Yi Fei Lu, Mao Yuan Xu, Xiao Wu Luo, Alexis Rolland, Chedy Raissi
Abstract: This work introduces a novel strategy for generating synthetic training data for 2D and 3D pose estimation of animals using keyframe animations. With the objective to automate the process of creating animations for wildlife, we train several 2D and 3D pose estimation models with synthetic data, and put in place an end-to-end pipeline called ZooBuilder. The pipeline takes as input a video of an animal in the wild, and generates the corresponding 2D and 3D coordinates for each joint of the animal's skeleton. With this approach, we produce motion capture data that can be used to create animations for wildlife.
8. Automatic cinematography for 360 video [PDF] Back to contents
Hannes Fassold
Abstract: We describe our method for automatic generation of a visually interesting camera path (automatic cinematography) from a 360 video. Based on the information from the scene objects, multiple shot hypotheses for different shot types are constructed and the best one is rendered.
9. Hybrid Space Learning for Language-based Video Retrieval [PDF] Back to contents
Jianfeng Dong, Xirong Li, Chaoxi Xu, Gang Yang, Xun Wang
Abstract: This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method.
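Once videos and queries have been projected into a common space, retrieval reduces to ranking videos by their similarity to the query vector. The sketch below shows that final step only. It assumes cosine similarity as the measure (the abstract does not state which similarity is used), and the names `cosine` and `rank_videos` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_videos(
    query_vec: Sequence[float],
    video_vecs: Dict[str, Sequence[float]],
) -> List[str]:
    """Return video ids ordered from most to least similar to the query."""
    return sorted(
        video_vecs,
        key=lambda vid: cosine(query_vec, video_vecs[vid]),
        reverse=True,
    )
```

In the dual-encoding setting the two encoders would produce `query_vec` and the entries of `video_vecs`; here they are simply supplied as plain lists.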
10. Critical analysis on the reproducibility of visual quality assessment using deep features [PDF] Back to contents
Franz Götz-Hahn, Vlad Hosu, Dietmar Saupe
Abstract: Data used to train supervised machine learning models are commonly split into independent training, validation, and test sets. In this paper we illustrate that intricate cases of data leakage have occurred in the no-reference video and image quality assessment literature. We show that the performance results of several recently published journal papers, which are well above the best performances in related works, cannot be reached. Our analysis shows that information from the test set was inappropriately used in the training process in different ways. When correcting for the data leakage, the performances of the approaches drop below the state of the art by a large margin. Additionally, we investigate end-to-end variations to the discussed approaches, which do not improve upon the original.
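The abstract does not spell out how the leakage arose. One common source, used here purely as an assumed example, is splitting at the level of individual items (such as frames or patches) so that pieces of the same source video or image end up on both sides of the split. The sketch below shows a group-aware split that keeps every group entirely in one set; `group_split` and its parameters are hypothetical names.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def group_split(
    items: Sequence[T],
    group_of: Callable[[T], Hashable],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Split items so that no group appears in both train and test."""
    groups = sorted({group_of(item) for item in items}, key=repr)
    random.Random(seed).shuffle(groups)  # seeded for reproducibility
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [item for item in items if group_of(item) not in test_groups]
    test = [item for item in items if group_of(item) in test_groups]
    return train, test
```

Any preprocessing that is fitted to data, such as feature normalization, would likewise need to be fitted on the training portion only.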
11. Meta Learning for Few-Shot One-class Classification [PDF] Back to contents
Gabriel Dahia, Maurício Pamplona Segundo
Abstract: We propose a method that can perform one-class classification given only a small number of examples from the target class and none from the others. We formulate the learning of meaningful features for one-class classification as a meta-learning problem in which the meta-training stage repeatedly simulates one-class classification, using the classification loss of the chosen algorithm to learn a feature representation. To learn these representations, we require only multiclass data from similar tasks. We show how the Support Vector Data Description method can be used with our method, and also propose a simpler variant based on Prototypical Networks that obtains comparable performance, indicating that learning feature representations directly from data may be more important than which one-class algorithm we choose. We validate our approach by adapting few-shot classification datasets to the few-shot one-class classification scenario, obtaining results similar to the state of the art in traditional one-class classification and improving upon the one-class classification baselines employed in the few-shot setting.
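In the spirit of Prototypical Networks, a few-shot one-class decision can be sketched as follows: average the embeddings of the few target-class examples into a prototype, then score a query by its distance to that prototype. This is an assumed simplification rather than the paper's exact variant; the embeddings are taken as given, and `prototype` and `one_class_score` are hypothetical names.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def prototype(support: Sequence[Vector]) -> List[float]:
    """Mean embedding of the few available target-class examples."""
    n = len(support)
    return [sum(v[i] for v in support) / n for i in range(len(support[0]))]


def one_class_score(query: Vector, support: Sequence[Vector]) -> float:
    """Euclidean distance to the prototype; lower means more target-like."""
    return math.dist(query, prototype(support))
```

Turning the score into an accept or reject decision requires a threshold, which would have to be chosen on held-out tasks since no negative examples are available at test time.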
12. The PREVENTION Challenge: How Good Are Humans Predicting Lane Changes? [PDF] Back to contents
A. Quintanar, R. Izquierdo, I. Parra, D. Fernández-Llorca, M. A. Sotelo
Abstract: While driving on highways, every driver tries to be aware of the behavior of surrounding vehicles, including possible emergency braking, evasive maneuvers to avoid obstacles, unexpected lane changes, or other emergencies that could lead to an accident. In this paper, humans' ability to predict lane changes in highway scenarios is analyzed through the use of video sequences extracted from the PREVENTION dataset, a database focused on the development of research on vehicle intention and trajectory prediction. Users had to indicate the moment at which they considered that a lane change maneuver was taking place in a target vehicle, subsequently indicating its direction: left or right. The retrieved results have been carefully analyzed and compared to ground truth labels, evaluating statistical models to understand whether humans can actually predict. The study has revealed that most participants are unable to anticipate lane-change maneuvers, detecting them only after they have started. These results might serve as a baseline for evaluating AI's prediction ability, grading whether those systems can outperform human skills by analyzing hidden cues that seem unnoticed, improving the detection time, and even anticipating maneuvers in some cases.
13. SoFAr: Shortcut-based Fractal Architectures for Binary Convolutional Neural Networks [PDF] Back to contents
Zhu Baozhou, Peter Hofstee, Jinho Lee, Zaid Al-Ars
Abstract: Binary Convolutional Neural Networks (BCNNs) can significantly improve the efficiency of Deep Convolutional Neural Networks (DCNNs) for their deployment on resource-constrained platforms, such as mobile and embedded systems. However, the accuracy degradation of BCNNs is still considerable compared with their full-precision counterparts, impeding their practical deployment. Because of the inevitable binarization error in the forward propagation and the gradient mismatch problem in the backward propagation, it is nontrivial to train BCNNs to achieve satisfactory accuracy. To ease the difficulty of training, shortcut-based BCNNs, such as the residual connection-based Bi-real ResNet and the dense connection-based BinaryDenseNet, introduce additional shortcuts beyond those already present in their full-precision counterparts. Furthermore, fractal architectures have also been used to improve the training process of full-precision DCNNs, since the fractal structure triggers effects akin to deep supervision and lateral student-teacher information flow. Inspired by the shortcuts and fractal architectures, we propose two Shortcut-based Fractal Architectures (SoFAr) specifically designed for BCNNs: 1. residual connection-based fractal architectures for binary ResNet, and 2. dense connection-based fractal architectures for binary DenseNet. Our proposed SoFAr combines the adoption of shortcuts and the fractal architectures in one unified model, which is helpful in the training of BCNNs. Results show that our proposed SoFAr achieves better accuracy compared with shortcut-based BCNNs. Specifically, the Top-1 accuracy of our proposed RF-c4d8 ResNet37(41) and DRF-c2d2 DenseNet51(53) on ImageNet outperforms Bi-real ResNet18(64) and BinaryDenseNet51(32) by 3.29% and 1.41%, respectively, with the same computational complexity overhead.
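The two training difficulties named in the abstract can be illustrated with a small sketch. It assumes the sign function as the binarizer and a clipped straight-through estimator for the backward pass, a common pairing in the binary-network literature; the abstract does not say which binarizer SoFAr uses, so the function names and the clipping range here are illustrative assumptions.

```python
def binarize(x):
    """Forward pass: map a real value to -1 or +1 with the sign function."""
    return 1.0 if x >= 0 else -1.0


def ste_grad(x, upstream):
    """Backward pass (assumed straight-through estimator).

    The true derivative of sign is zero almost everywhere, so a surrogate
    is used: the upstream gradient passes through unchanged where |x| <= 1
    and is blocked elsewhere. The gap between this surrogate and the true
    derivative is the gradient mismatch.
    """
    return upstream if abs(x) <= 1.0 else 0.0


def binarization_error(values):
    """Mean squared gap between real values and their binarized versions."""
    return sum((v - binarize(v)) ** 2 for v in values) / len(values)
```

Values that already sit at -1 or +1 incur no binarization error, while values near zero lose the most information. Extra shortcuts are one way to carry real-valued information around the binarized layers.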
14. A Density-Aware PointRCNN for 3D Objection Detection in Point Clouds [PDF] Back to contents
Jie Li, Yu Hu
Abstract: We present an improved version of PointRCNN for 3D object detection, in which a multi-branch backbone network is adopted to handle the non-uniform density of point clouds. An uncertainty-based sampling policy is proposed to deal with the distribution differences between point clouds. The new model achieves an AP about 0.8 higher than the baseline PointRCNN on the KITTI val set. In addition, a simplified model using a single-scale grouping for each set-abstraction layer can achieve competitive performance at a lower computational cost.
15. Enabling Image Recognition on Constrained Devices Using Neural Network Pruning and a CycleGAN [PDF] Back to contents
August Lidfelt, Daniel Isaksson, Ludwig Hedlund, Simon Åberg, Markus Borg, Erik Larsson
Abstract: Smart cameras are increasingly used in surveillance solutions in public spaces. Contemporary computer vision applications can be used to recognize events that require intervention by emergency services. Smart cameras can be mounted in locations where citizens feel particularly unsafe, e.g., pathways and underpasses with a history of incidents. One promising approach for smart cameras is edge AI, i.e., deploying AI technology on IoT devices. However, implementing resource-demanding technology such as image recognition using deep neural networks (DNN) on constrained devices is a substantial challenge. In this paper, we explore two approaches to reduce the need for compute in contemporary image recognition in an underpass. First, we showcase successful neural network pruning, i.e., we retain comparable classification accuracy with only 1.1% of the neurons remaining from the state-of-the-art DNN architecture. Second, we demonstrate how a CycleGAN can be used to transform out-of-distribution images to the operational design domain. We posit that both pruning and CycleGANs are promising enablers for efficient edge AI in smart cameras.
16. Unsupervised Partial Point Set Registration via Joint Shape Completion and Registration [PDF] Back to contents
Xiang Li, Lingjing Wang, Yi Fang
Abstract: We propose a self-supervised method for partial point set registration. While recently proposed learning-based methods have achieved impressive registration performance on full shape observations, these methods mostly suffer from performance degradation when dealing with partial shapes. To bridge the performance gap between partial and full point set registration, we propose to incorporate a shape completion network to benefit the registration process. To achieve this, we design a latent code for each pair of shapes, which can be regarded as a geometric encoding of the target shape. By doing so, our model does not need an explicit feature embedding network to learn the feature encodings. More importantly, both our shape completion network and the point set registration network take the shared latent codes as input, which are optimized along with the parameters of the two decoder networks in the training process. The point set registration process can thus benefit from the joint optimization of the latent codes, which are forced to represent the information of the full shape instead of the partial one. In the inference stage, we fix the network parameters and optimize the latent codes to obtain the optimal shape completion and registration results. Our proposed method is purely unsupervised and does not need any ground truth supervision. Experiments on the ModelNet40 dataset demonstrate the effectiveness of our model for partial point set registration.
17. Attribute-conditioned Layout GAN for Automatic Graphic Design [PDF] Back to contents
Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu, Christina Wang, Tingfa Xu
Abstract: Modeling layout is an important first step for graphic design. Recently, methods for generating graphic layouts have progressed, particularly with Generative Adversarial Networks (GANs). However, the problem of specifying the locations and sizes of design elements usually involves constraints with respect to element attributes, such as area, aspect ratio and reading-order. Automating attribute conditional graphic layouts remains a complex and unsolved problem. In this paper, we introduce Attribute-conditioned Layout GAN to incorporate the attributes of design elements for graphic layout generation by forcing both the generator and the discriminator to meet attribute conditions. Due to the complexity of graphic designs, we further propose an element dropout method to make the discriminator look at partial lists of elements and learn their local patterns. In addition, we introduce various loss designs following different design principles for layout optimization. We demonstrate that the proposed method can synthesize graphic layouts conditioned on different element attributes. It can also adjust well-designed layouts to new sizes while retaining elements' original reading-orders. The effectiveness of our method is validated through a user study.
18. Fairness Matters -- A Data-Driven Framework Towards Fair and High Performing Facial Recognition Systems [PDF] Back to contents
Yushi Cao, David Berend, Palina Tolmach, Moshe Levy, Guy Amit, Asaf Shabtai, Yuval Elovici, Yang Liu
Abstract: Facial recognition technologies are widely used in governmental and industrial applications. Together with the advancements in deep learning (DL), human-centric tasks such as accurate age prediction based on face images become feasible. However, the issue of fairness when predicting age across ethnicities and genders remains an open problem. Policing systems use age to estimate the likelihood that someone will commit a crime, where younger suspects tend to be more likely to be involved. Unfair age prediction may lead to unfair treatment of humans not only in crime prevention but also in marketing, identity acquisition and authentication. This work therefore consists of two parts. First, an empirical study is conducted evaluating the performance and fairness of state-of-the-art systems for age prediction, including baseline and most recent works of academia and the main industrial service providers (Amazon AWS and Microsoft Azure). Building on the findings, we present a novel approach to mitigate unfairness and enhance performance, using distribution-aware dataset curation and augmentation. Distribution-awareness is based on out-of-distribution detection, which is utilized to validate equal and diverse DL system behavior towards, e.g., ethnicity and gender. In total we train 24 DNN models and utilize one million data points to assess the performance and fairness of state-of-the-art face recognition algorithms. We demonstrate an improvement in mean absolute age prediction error from 7.70 to 3.39 years and a 4-fold increase in fairness towards ethnicity when compared to related work. Utilizing the presented methodology, we are able to outperform leading industry players such as Amazon AWS or Microsoft Azure in both fairness and age prediction accuracy and provide the necessary guidelines to assess quality and enhance face recognition systems based on DL techniques.
19. AFP-SRC: Identification of Antifreeze Proteins Using Sparse Representation Classifier [PDF] Back to contents
Shujaat Khan, Muhammad Usman
Abstract: Species living in extremely cold environments fight against the harsh conditions by virtue of antifreeze proteins (AFPs), which manipulate the freezing mechanism of water in more than one way. This amazing nature of AFPs turns out to be extremely useful in a number of industrial and medical applications. The lack of similarity in their structure and sequence makes their prediction an arduous task, and identifying them experimentally in the wet lab is time-consuming and expensive. In this research, we propose a computational framework for the prediction of AFPs which is essentially based on a sample-specific classification method using sparse reconstruction. A linear model and an over-complete dictionary matrix of known AFPs are used to predict a sparse class-label vector which provides a sample-association score. The delta rule is applied for the reconstruction of two pseudo-samples using the lower and upper parts of the sample-association vector, and class labels are assigned based on the minimum recovery score. We compare our approach with contemporary methods on a standard dataset, and the proposed method is found to outperform them in terms of Matthews correlation coefficient and Youden's index. The MATLAB implementation of the proposed method is available at the author's GitHub page this https URL.
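For readers unfamiliar with the two evaluation metrics named above, the following sketch computes them from binary confusion-matrix counts using their standard definitions. The function names and the example counts are made up for illustration and are not results from the paper.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def youden_index(tp, fp, tn, fn):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical counts: 45 AFPs (40 found), 55 non-AFPs (45 rejected).
example = dict(tp=40, fp=10, tn=45, fn=5)
```

Both metrics range up to 1 for a perfect classifier and sit at 0 for chance-level predictions, which makes them more informative than plain accuracy when the positive class is rare.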
20. Image Conditioned Keyframe-Based Video Summarization Using Object Detection [PDF] Back to contents
Neeraj Baghel, Suresh C. Raikwar, Charul Bhatnagar
Abstract: Video summarization plays an important role in selecting keyframes for understanding a video. Traditionally, it aims to find the most representative and diverse contents (or frames) in a video for short summaries. Recently, query-conditioned video summarization has been introduced, which considers user queries to learn more user-oriented summaries that reflect user preference. However, text queries face obstacles from user subjectivity and from finding the similarity between the user query and the input frames. In this work, (i) an image is introduced as a query for user preference, (ii) a mathematical model is proposed to minimize redundancy based on the loss function and summary variance, and (iii) the similarity score between the query image and the input video is used to obtain the summarized video. Furthermore, the Object-based Query Image (OQI) dataset has been introduced, which contains the query images. The proposed method has been validated using the UT Egocentric (UTE) dataset. The proposed model successfully resolves the issues of (i) user preference and (ii) recognizing important frames and selecting keyframes in daily life videos with different illumination conditions. The proposed method achieves a 57.06% average F1-Score on the UTE dataset and outperforms the existing state of the art by 11.01%. The processing time is 7.81 times faster than the actual duration of the video. Experiments on the recently proposed UTE dataset show the efficiency of the proposed method.
21. PiaNet: A pyramid input augmented convolutional neural network for GGO detection in 3D lung CT scans [PDF] Back to contents
Weihua Liu, Xiabi Liu, Xiongbiao Luo, Murong Wang, Guangyuan Zheng, Guanghui Han
Abstract: This paper proposes a new convolutional neural network with multiscale processing, referred to as PiaNet, for detecting ground-glass opacity (GGO) nodules in 3D computed tomography (CT) images. PiaNet consists of a feature-extraction module and a prediction module. The former is constructed by introducing pyramid multiscale source connections into a contracting-expanding structure. The latter includes a bounding-box regressor and a classifier that are employed to simultaneously recognize GGO nodules and estimate bounding boxes at multiple scales. To train PiaNet, a two-stage transfer learning strategy is developed. In the first stage, the feature-extraction module is embedded into a classifier network that is trained on a large data set of GGO and non-GGO patches, which are generated by performing data augmentation on a small number of annotated CT scans. In the second stage, the pretrained feature-extraction module is loaded into PiaNet, and PiaNet is then fine-tuned using the annotated CT scans. We evaluate PiaNet on the LIDC-IDRI data set. The experimental results demonstrate that our method outperforms state-of-the-art counterparts, including the Subsolid CAD and Aidence systems and the S4ND and GA-SSD methods. PiaNet achieves a sensitivity of 91.75% with only one false positive per scan.
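The closing figure, sensitivity at one false positive per scan, is an operating point on a FROC-style curve. A minimal sketch of how such a number could be computed from pooled detections is given here; the function name, the input layout, and the assumption that each true-positive detection matches a distinct nodule are illustrative and not taken from the paper.

```python
def sensitivity_at_fp_rate(detections, num_nodules, num_scans, max_fp_per_scan=1.0):
    """Best sensitivity reachable while keeping false positives per scan within a limit.

    detections: list of (score, is_true_positive) pooled over all scans.
    Assumes every true-positive detection corresponds to a different nodule
    and ignores tied scores for simplicity.
    """
    best = 0.0
    true_pos = 0
    false_pos = 0
    # Lower the score threshold one detection at a time, highest score first.
    for _score, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / num_scans > max_fp_per_scan:
            break
        best = true_pos / num_nodules
    return best
```

For example, with four annotated nodules in two scans, a detector whose second true positive is ranked above its third false positive would be reported at a sensitivity of 0.5 at one false positive per scan.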
22. Novel and Effective CNN-Based Binarization for Historically Degraded As-built Drawing Maps [PDF] Back to contents
Kuo-Liang Chung, De-Wei Hsieh
Abstract: Binarizing historically degraded as-built drawing (HDAD) maps is a new and challenging task, especially in terms of removing three artifacts, namely noise, yellowing areas, and folded lines, while preserving the foreground components well. In this paper, we first propose a semi-automatic labeling method to create the HDAD-pair dataset, in which each HDAD-pair consists of one HDAD map and its binarized HDAD map. Using the created training HDAD-pair dataset, we propose a convolutional neural network-based (CNN-based) binarization method to produce high-quality binarized HDAD maps. On the testing HDAD maps, thorough experiments demonstrate that, in terms of accuracy, PSNR (peak signal-to-noise ratio), and the perceptual effect of the binarized HDAD maps, our method substantially outperforms nine existing binarization methods. In addition, at similar accuracy, the experimental results demonstrate that our method significantly reduces execution time relative to the retrained versions of state-of-the-art CNN-based binarization methods.
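Accuracy and PSNR are the two quantitative measures named in the abstract. As a small sketch, assuming a binarized map and its ground truth are given as equally sized 2-D lists of 0/1 pixels (an input format chosen for this example), they can be computed with the standard definitions, PSNR = 10 log10(MAX^2 / MSE):

```python
import math

def _flatten(image):
    """Flatten a 2-D list of pixel values into a 1-D list."""
    return [value for row in image for value in row]

def pixel_accuracy(predicted, reference):
    """Fraction of pixels on which the two binary maps agree."""
    p, r = _flatten(predicted), _flatten(reference)
    if len(p) != len(r):
        raise ValueError("images must have the same number of pixels")
    return sum(1 for a, b in zip(p, r) if a == b) / len(p)

def psnr(predicted, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the maps are identical."""
    p, r = _flatten(predicted), _flatten(reference)
    if len(p) != len(r):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(p, r)) / len(p)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For 0/1 maps the mean squared error equals the fraction of mismatched pixels, so PSNR and accuracy move together; the paper's exact evaluation protocol is not given in the abstract.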
23. Devil's in the Detail: Graph-based Key-point Alignment and Embedding for Person Re-ID [PDF] Back to contents
Xinyang Jiang, Fufu Yu, Yifei Gong, Shizhen Zhao, Xiaowei Guo, Feiyue Huang, Wei-Shi Zheng, Xing Sun
Abstract: Although person re-identification has made impressive progress, difficult cases such as occlusion, changes of viewpoint, and similar clothing still pose great challenges. Besides overall visual features, matching and comparing detailed local information is also essential for tackling these challenges. This paper proposes two key recognition patterns to better utilize the local information of pedestrian images. From the spatial perspective, the model should be able to select and align key-points from the image pairs for comparison (i.e., key-point alignment). From the perspective of feature channels, the feature of a query image should be dynamically adjusted based on the gallery image it needs to match (i.e., conditional feature embedding). Most existing methods are unable to satisfy both key-point alignment and conditional feature embedding. By introducing novel techniques, including a correspondence attention module and a discrepancy-based GCN, we propose an end-to-end ReID method that integrates both patterns into a unified framework, called Siamese-GCN. The experiments show that Siamese-GCN achieves state-of-the-art performance on three public datasets.
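The abstract names a correspondence attention module but does not define it. One common way to realise soft key-point alignment, shown here purely as an assumption, is dot-product attention: each query key-point feature attends over the gallery key-point features and receives their attention-weighted average as its aligned counterpart. All names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def align_keypoints(query_feats, gallery_feats):
    """For each query key-point, return a soft-aligned gallery feature.

    query_feats, gallery_feats: lists of equal-length feature vectors.
    """
    dim = len(gallery_feats[0])
    aligned = []
    for q in query_feats:
        # Dot-product similarity between this query key-point and every gallery key-point.
        scores = [sum(a * b for a, b in zip(q, g)) for g in gallery_feats]
        attention = softmax(scores)
        aligned.append([
            sum(attention[j] * gallery_feats[j][d] for j in range(len(gallery_feats)))
            for d in range(dim)
        ])
    return aligned
```

The aligned features could then be compared with the query features point by point; how Siamese-GCN actually combines them is not described in the abstract.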
24. Optimizing Convolutional Neural Network Architecture via Information Field [PDF] Back to contents
Yuke Wang, Boyuan Feng, Xueqiao Peng, Yufei Ding
Abstract: CNN architecture design has attracted tremendous attention for improving model accuracy or reducing model complexity. However, existing works either introduce repeated training overhead in the search process or lack an interpretable metric to guide the design. To clear these hurdles, we propose Information Field (IF), an explainable and easy-to-compute metric, to estimate the quality of a CNN architecture and guide the search process of designs. To validate the effectiveness of IF, we build a static optimizer to improve CNN architectures at both the stage level and the kernel level. Our optimizer not only provides a clear and reproducible procedure but also reduces unnecessary training effort in the architecture search process. Experiments show that the models generated by our optimizer can achieve up to a 5.47% accuracy improvement and up to a 65.38% parameter reduction, compared with state-of-the-art CNN structures such as MobileNet and ResNet.
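The IF metric itself is not defined in the abstract, so it is not sketched here. What can be illustrated is the arithmetic behind a "parameter reduction" figure: the standard parameter count of a 2-D convolution layer and the relative saving when a kernel-level change shrinks it. The layer sizes used in the usage note are arbitrary examples, not values from the paper.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a standard 2-D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

def reduction_percent(before, after):
    """Relative parameter saving, in percent."""
    return 100.0 * (before - after) / before
```

For instance, replacing a 5x5 kernel with a 3x3 kernel in a 64-to-64-channel layer goes from `conv_params(64, 64, 5)` = 102464 parameters to `conv_params(64, 64, 3)` = 36928, and `reduction_percent` gives the corresponding saving for that single layer.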
25. Spectral Analysis Network for Deep Representation Learning and Image Clustering [PDF] Back to contents
Jinghua Wang, Adrian Hilton, Jianmin Jiang
Abstract: Deep representation learning is a crucial procedure in multimedia analysis and attracts increasing attention. Most of the popular techniques rely on convolutional neural networks and require a large amount of labeled data in the training procedure. However, obtaining label information is time-consuming or even impossible in some tasks due to cost limitations. Thus, it is necessary to develop unsupervised deep representation learning techniques. This paper proposes a new network structure for unsupervised deep representation learning based on spectral analysis, which is a popular technique with solid theoretical foundations. Compared with existing spectral analysis methods, the proposed network structure has at least three advantages. Firstly, it can identify local similarities among images at the patch level and is thus more robust against occlusion. Secondly, through multiple consecutive spectral analysis procedures, the proposed network can learn more clustering-friendly representations and is capable of revealing the deep correlations among data samples. Thirdly, it can elegantly integrate different spectral analysis procedures, so that each spectral analysis procedure can contribute its individual strengths in dealing with different data sample distributions. Extensive experimental results show the effectiveness of the proposed methods on various image clustering tasks.
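For readers unfamiliar with the building block, here is a sketch of a single classical spectral analysis step, not of the proposed network: build a Gaussian affinity graph, form the graph Laplacian L = D - W, and split the samples by the sign of its second-smallest eigenvector (the Fiedler vector). To stay dependency-free the eigenvector is found by shifted power iteration; the 1-D inputs, the `sigma` parameter, and the fixed starting vector are simplifications chosen for this example.

```python
import math

def fiedler_vector(points, sigma=1.0, iterations=200):
    """Approximate the Fiedler vector of a Gaussian affinity graph over 1-D points."""
    n = len(points)
    # Gaussian affinity between every pair of points, with no self-loops.
    w = [[0.0 if i == j else
          math.exp(-((points[i] - points[j]) ** 2) / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    shift = 2.0 * max(degree)  # upper bound on the largest Laplacian eigenvalue
    v = [float(i) for i in range(n)]  # fixed start; must not be orthogonal to the answer
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        # Apply (shift * I - L), whose dominant remaining eigenvector is the Fiedler vector.
        lv = [degree[i] * v[i] - sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [shift * v[i] - lv[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return v
        v = [x / norm for x in v]
    return v

def two_way_split(points, sigma=1.0):
    """Assign each point to one of two clusters by the sign of the Fiedler vector."""
    return [1 if x > 0 else 0 for x in fiedler_vector(points, sigma)]
```

In practice a library eigensolver and k-means on several eigenvectors would replace the power iteration; the patch-level and multi-layer aspects described in the abstract are beyond this sketch.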
26. An unsupervised deep learning framework via integrated optimization of representation learning and GMM-based modeling [PDF] Back to contents
Jinghua Wang, Jianmin Jiang
Abstract: While supervised deep learning has achieved great success in a range of applications, relatively little work has studied the discovery of knowledge from unlabeled data. In this paper, we propose an unsupervised deep learning framework as a potential solution to the problem that existing deep learning techniques require large labeled data sets to complete the training process. Our proposed framework introduces a new principle of joint learning of both deep representations and GMM (Gaussian Mixture Model)-based deep modeling, and an integrated objective function is proposed to realize this principle. Compared with existing work in similar areas, our objective function has two learning targets, which are jointly optimized to achieve the best possible unsupervised learning and knowledge discovery from unlabeled data sets. Maximizing the first target enables the GMM to achieve the best possible modeling of the data representations, with each Gaussian component corresponding to a compact cluster, while maximizing the second target enhances the separability of the Gaussian components and hence the inter-cluster distances. As a result, the compactness of clusters is significantly enhanced by reducing the intra-cluster distances, and the separability is improved by increasing the inter-cluster distances. Extensive experimental results show that the proposed method can improve clustering performance compared with benchmark methods.
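The two-target objective can be pictured with a toy 1-D example. The first function is the standard average log-likelihood of data under a GMM. The second is only a stand-in for the separability target, since the abstract gives no formula for it: it uses the smallest gap between component means. The weighting `lam` and all function names are assumptions made for this sketch.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def gmm_log_likelihood(data, weights, means, variances):
    """Average log-likelihood of the data under a 1-D Gaussian mixture."""
    total = 0.0
    for x in data:
        density = sum(w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total / len(data)

def min_mean_separation(means):
    """Stand-in separability score: smallest gap between two component means."""
    return min(abs(a - b) for i, a in enumerate(means) for b in means[i + 1:])

def joint_objective(data, weights, means, variances, lam=0.1):
    """Hypothetical combined objective: fit term plus weighted separability term."""
    return gmm_log_likelihood(data, weights, means, variances) + lam * min_mean_separation(means)
```

In the paper both targets are optimized jointly with the representation network; here the data points are fixed and only the scoring is shown. `min_mean_separation` needs at least two components.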
27. Conditional Coupled Generative Adversarial Networks for Zero-Shot Domain Adaptation [PDF] Back to contents
Jinghua Wang, Jianmin Jiang
Abstract: Machine learning models trained in one domain perform poorly in other domains due to domain shift. Domain adaptation techniques solve this problem by training transferable models from the label-rich source domain to the label-scarce target domain. Unfortunately, a majority of the existing domain adaptation techniques rely on the availability of target-domain data, which limits their applications to a small community and a few computer vision problems. In this paper, we tackle the challenging zero-shot domain adaptation (ZSDA) problem, where target-domain data is unavailable in the training stage. For this purpose, we propose conditional coupled generative adversarial networks (CoCoGAN) by extending the coupled generative adversarial networks (CoGAN) into a conditioning model. Compared with the existing state of the art, our proposed CoCoGAN is able to capture the joint distribution of dual-domain samples in two different tasks, i.e., the relevant task (RT) and an irrelevant task (IRT). We train CoCoGAN with both source-domain samples in RT and dual-domain samples in IRT to complete the domain adaptation. While the former provide high-level concepts of the unavailable target-domain data, the latter carry the correlation shared between the two domains in RT and IRT. To train CoCoGAN in the absence of target-domain data for RT, we propose a new supervisory signal, i.e., the alignment between representations across tasks. Extensive experiments demonstrate that our proposed CoCoGAN outperforms the existing state of the art in image classification.
28. HAA500: Human-Centric Atomic Action Dataset with Curated Videos [PDF] Back to contents
Jihoon Chung, Cheng-hsin Wuu, Hsuan-ru Yang, Yu-Wing Tai, Chi-Keung Tang
Abstract: We contribute HAA500, a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames. Unlike existing atomic action datasets, where coarse-grained atomic actions are labeled with action verbs, e.g., "Throw", HAA500 contains fine-grained atomic actions where only consistent actions fall under the same label, e.g., "Baseball Pitching" vs. "Free Throw in Basketball", to minimize ambiguities in action classification. HAA500 has been carefully curated to capture the movement of human figures with less spatio-temporal label noise to greatly enhance the training of deep neural networks. The advantages of HAA500 include: 1) human-centric actions with a high average of 69.7% detectable joints for the relevant human poses; 2) each video captures the essential elements of an atomic action without irrelevant frames; 3) fine-grained atomic action classes. Our extensive experiments validate the benefits of the human-centric and atomic characteristics of HAA, which enable the trained model to improve prediction by attending to atomic human poses. We detail the HAA500 dataset statistics and collection methodology, and compare quantitatively with existing action recognition datasets.
29. Adversarial Learning for Zero-shot Domain Adaptation [PDF] Back to contents
Jinghua Wang, Jianmin Jiang
Abstract: Zero-shot domain adaptation (ZSDA) is a category of domain adaptation problems where neither data samples nor labels are available for parameter learning in the target domain. Under the hypothesis that the shift between a given pair of domains is shared across tasks, we propose a new method for ZSDA that transfers the domain shift from an irrelevant task (IrT) to the task of interest (ToI). Specifically, we first identify an IrT, where dual-domain samples are available, and capture the domain shift with a coupled generative adversarial network (CoGAN) in this task. Then, we train a CoGAN for the ToI and restrict it to carry the same domain shift as the CoGAN for the IrT does. In addition, we introduce a pair of co-training classifiers to regularize the training procedure of the CoGAN in the ToI. The proposed method not only derives machine learning models for the unavailable target-domain data, but also synthesizes the data themselves. We evaluate the proposed method on benchmark datasets and achieve state-of-the-art performance.
30. Variance Loss: A Confidence-Based Reweighting Strategy for Coarse Semantic Segmentation [PDF] Back to contents
Jingchao Liu, Ye Du, Qingjie Liu, Yunhong Wang
Abstract: Coarsely-labeled semantic segmentation annotations are easy to obtain, but they bear the risk of losing edge details and introducing background noise. Though they are usually used as a supplement to finely-labeled ones, in this paper we attempt to train a model using only these coarse annotations, and improve model performance with a noise-robust reweighting strategy. Specifically, the proposed confidence indicator makes it possible to design a reweighting strategy that simultaneously mines hard samples and alleviates noisy labels in the coarse annotations. Besides, the optimal reweighting strategy can be automatically derived by our Adversarial Weight Assigning Module (AWAM) with only 53 learnable parameters. Moreover, a rigorous proof of the convergence of AWAM is given. Experiments on standard datasets show that our proposed reweighting strategy brings consistent performance improvements for both coarse annotations and fine annotations. In particular, built on top of DeeplabV3+, we improve the mIoU on the Cityscapes Coarse dataset (coarsely-labeled) and ADE20K (finely-labeled) by 2.21 and 0.91, respectively.
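Two ingredients of this abstract lend themselves to a short sketch: a per-pixel reweighted cross-entropy, which is the general form a reweighting strategy takes, and the mIoU metric in which the gains are reported. The weights are treated here as given inputs; how the confidence indicator or AWAM produces them is not described in the abstract, and the flat per-pixel input layout is an assumption of this example.

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Cross-entropy where each pixel's loss is scaled by its own weight.

    probs:   per-pixel lists of class probabilities
    labels:  per-pixel class indices
    weights: per-pixel non-negative weights (assumed to come from some confidence measure)
    """
    total = sum(-w * math.log(p[y]) for p, y, w in zip(probs, labels, weights))
    return total / sum(weights)

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

Setting a pixel's weight to zero removes its (possibly noisy) label from the loss, while a weight above one emphasises a hard sample; the paper's contribution lies in choosing those weights automatically.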
Summary: The paper trains a segmentation model on coarse annotations alone, using a confidence indicator to drive a reweighting strategy that mines hard samples while down-weighting noisy labels. The weights are produced by the Adversarial Weight Assigning Module (AWAM), which has only 53 learnable parameters and a convergence proof; on top of DeeplabV3+, mIoU improves by 2.21 on Cityscapes Coarse and 0.91 on ADE20K.
31. OCR Graph Features for Manipulation Detection in Documents [PDF] Back to contents
Hailey James, Otkrist Gupta, Dan Raviv
Abstract: Detecting manipulations in digital documents is becoming increasingly important for information verification purposes. Due to the proliferation of image editing software, altering key information in documents has become widely accessible. Nearly all approaches in this domain are procedural, using carefully generated features and a hand-tuned scoring system, rather than data-driven and generalizable. We frame this issue as a graph comparison problem using the character bounding boxes, and propose a model that leverages graph features using OCR (Optical Character Recognition). Our model relies on a data-driven approach to detect alterations by training a random forest classifier on the graph-based OCR features. We evaluate our algorithm's forgery detection performance on a dataset constructed from real business documents with slight forgery imperfections. Our proposed model dramatically outperforms the most closely related document manipulation detection model on this task.
Summary: Document manipulation detection is framed as a graph comparison problem over OCR character bounding boxes. A random forest classifier trained on the resulting graph features detects alterations in a dataset built from real business documents with slight forgery imperfections, and outperforms the most closely related detection model.
32. A new heuristic algorithm for fast k-segmentation [PDF] Back to contents
Sabarish Vadarevu, Vijay Karamcheti
Abstract: The $k$-segmentation of a video stream is used to partition it into $k$ piecewise-linear segments, so that each linear segment has a meaningful interpretation. Such segmentation may be used to summarize large videos using a small set of images, to identify anomalies within segments and change points between segments, and to select critical subsets for training machine learning models. Exact and approximate segmentation methods for $k$-segmentation exist in the literature. Each of these algorithms occupies a different spot in the trade-off between computational complexity and accuracy. A novel heuristic algorithm is proposed in this paper to improve upon existing methods. It is empirically found to provide accuracies competitive with exact methods at a fraction of the computational expense. The new algorithm is inspired by Lloyd's algorithm for K-Means and Lloyd-Max algorithm for scalar quantization, and is called the LM algorithm for convenience. It works by iteratively minimizing a cost function from any given initialisation; the commonly used $L_2$ cost is chosen in this paper. While the greedy minimization makes the algorithm sensitive to initialisation, the ability to converge from any initial guess to a local optimum allows the algorithm to be integrated into other existing algorithms. Three variants of the algorithm are tested over a large number of synthetic datasets, one being a standalone LM implementation, and two others that combine with existing algorithms. One of the latter two -- LM-enhanced-Bottom-Up segmentation -- is found to have the best accuracy and the lowest computational complexity among all algorithms. This variant of LM can provide $k$-segmentations over data sets with up to a million image frames within several seconds.
Summary: The LM algorithm is a heuristic for $k$-segmentation inspired by Lloyd's algorithm for K-Means and the Lloyd-Max algorithm for scalar quantization. It iteratively minimizes an $L_2$ cost from any initialisation, converges to a local optimum, and can therefore be combined with existing methods; the LM-enhanced Bottom-Up variant is reported as the most accurate and cheapest, handling up to a million frames within several seconds.
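The abstract gives no algorithmic detail beyond "iteratively minimizing a cost function from any given initialisation", so the following is only a rough sketch of that idea and not the authors' LM algorithm. It assumes a 1-D signal, replaces the paper's piecewise-linear fit with a piecewise-constant one (segment mean) for brevity, and uses a hypothetical update rule that shifts one interior boundary by a single position whenever this lowers the total $L_2$ cost. All function names are illustrative.

```python
def segment_cost(xs, lo, hi):
    """L2 cost of fitting xs[lo:hi] with its mean (piecewise-constant stand-in)."""
    seg = xs[lo:hi]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)


def total_cost(xs, bounds):
    """Sum of segment costs; bounds = [0, b1, ..., n], strictly increasing."""
    return sum(segment_cost(xs, bounds[i], bounds[i + 1])
               for i in range(len(bounds) - 1))


def refine_boundaries(xs, bounds, max_iters=1000):
    """Greedy local search: shift one interior boundary by +/-1 while cost drops."""
    bounds = list(bounds)
    for _ in range(max_iters):
        improved = False
        for i in range(1, len(bounds) - 1):
            best = total_cost(xs, bounds)
            for step in (-1, 1):
                cand = bounds[i] + step
                # Keep every segment non-empty.
                if not (bounds[i - 1] < cand < bounds[i + 1]):
                    continue
                trial = bounds[:i] + [cand] + bounds[i + 1:]
                cost = total_cost(xs, trial)
                if cost < best - 1e-12:
                    bounds, best, improved = trial, cost, True
        if not improved:
            break  # no single shift helps: a local optimum
    return bounds


# Three flat levels of five samples each, and a deliberately poor guess (k = 3).
signal = [0.0] * 5 + [4.0] * 5 + [9.0] * 5
initial = [0, 3, 12, 15]
final = refine_boundaries(signal, initial)
```

Because the search only accepts cost-reducing moves, it stops at a local optimum that depends on `initial`, which mirrors the sensitivity to initialisation the abstract mentions and is why such a refinement step is usually paired with a better starting segmentation.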
33. Practical Cross-modal Manifold Alignment for Grounded Language [PDF] Back to contents
Andre T. Nguyen, Luke E. Richards, Gaoussou Youssouf Kebe, Edward Raff, Kasra Darvish, Frank Ferraro, Cynthia Matuszek
Abstract: We propose a cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items. Our approach learns these embeddings by sampling triples of anchor, positive, and negative data points from RGB-depth images and their natural language descriptions. We show that our approach can benefit from, but does not require, post-processing steps such as Procrustes analysis, in contrast to some of our baselines which require it for reasonable performance. We demonstrate the effectiveness of our approach on two datasets commonly used to develop robotic-based grounded language learning systems, where our approach outperforms four baselines, including a state-of-the-art approach, across five evaluation metrics.
Summary: A triplet-loss-based manifold alignment procedure learns consistent multi-modal embeddings from RGB-depth images and their natural language descriptions. It can benefit from, but does not require, post-processing such as Procrustes analysis, and outperforms four baselines across five evaluation metrics on two grounded language datasets.
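To make the triplet objective concrete, here is a minimal sketch of the standard hinge-style triplet loss on one (anchor, positive, negative) triple. The abstract does not state the distance function or the margin, so Euclidean distance and `margin=2.0` are assumptions, and the 2-D embeddings are invented purely for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: a language anchor and two image embeddings.
language_apple = [0.0, 0.0]
image_apple = [0.0, 3.0]   # same concept, still far from the anchor
image_mug = [4.0, 0.0]     # different concept

# Positive is not yet closer than the negative by the margin: loss is positive.
loss_violated = triplet_loss(language_apple, image_apple, image_mug, margin=2.0)
# Positive moved to distance 1, so the margin is satisfied: loss is zero.
loss_satisfied = triplet_loss(language_apple, [0.0, 1.0], image_mug, margin=2.0)
```

The loss vanishes once the positive is closer to the anchor than the negative by at least the margin, so only triples that still violate the margin contribute gradient during training.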
34. Auto-encoders for Track Reconstruction in Drift Chambers for CLAS12 [PDF] Back to contents
Gagik Gavalian
Abstract: In this article we describe the development of machine learning models to assist the CLAS12 tracking algorithm by identifying tracks through inferring missing segments in the drift chambers. Auto-encoders are used to reconstruct missing segments from the track trajectory. The implemented neural network was able to reliably reconstruct missing segment positions with accuracy of $\approx 0.35$ wires, and led to recovery of missing tracks with accuracy of $>99.8\%$.
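Summary: Auto-encoders reconstruct missing drift-chamber segments from the track trajectory to assist the CLAS12 tracking algorithm. The network reconstructs missing segment positions to within $\approx 0.35$ wires and recovers missing tracks with accuracy of $>99.8\%$.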
35. SWP-Leaf NET: a novel multistage approach for plant leaf identification based on deep learning [PDF] Back to contents
Ali Beikmohammadi, Karim Faez, Ali Motallebi
Abstract: Modern scientific and technological advances are allowing botanists to use computer vision-based approaches for plant identification tasks. These approaches have their own challenges. Leaf classification is a computer-vision task performed for the automated identification of plant species, a serious challenge due to variations in leaf morphology, including its size, texture, shape, and venation. Researchers have recently become more inclined toward deep learning-based methods rather than conventional feature-based methods due to the popularity and successful implementation of deep learning methods in image analysis, object recognition, and speech recognition. In this paper, a botanist's behavior was modeled in leaf identification by proposing a highly efficient method of maximum behavioral resemblance developed through three deep learning-based models. Different layers of the three models were visualized to ensure that the botanist's behavior was modeled accurately. The first and second models were designed from scratch. Regarding the third model, the pre-trained architecture MobileNetV2 was employed along with the transfer-learning technique. The proposed method was evaluated on two well-known datasets: Flavia and MalayaKew. According to a comparative analysis, the suggested approach was more accurate than hand-crafted feature extraction methods and other deep learning techniques, with accuracies of 99.67% and 99.81%. Unlike conventional techniques that have their own specific complexities and depend on datasets, the proposed method required no hand-crafted feature extraction, and also increased accuracy and distributability as compared with other deep learning techniques. It was also considerably faster than other methods because it used shallower networks with fewer parameters and did not use all three models recurrently.
Summary: SWP-Leaf NET models a botanist's behavior in leaf identification with three deep learning models: two designed from scratch and one based on a pre-trained MobileNetV2 with transfer learning. Evaluated on Flavia and MalayaKew, it reaches 99.67% and 99.81% accuracy, requires no hand-crafted features, and runs faster because it uses shallower networks and does not always invoke all three models.
36. 1st Place Solution to Google Landmark Retrieval 2020 [PDF] Back to contents
SeungKee Jeon
Abstract: This paper presents the 1st place solution to the Google Landmark Retrieval 2020 Competition on Kaggle. The solution is based on metric learning to classify numerous landmark classes, and uses transfer learning with two training datasets, fine-tuning on bigger images, adjusting loss weight for cleaner samples, and ensembling to enhance the model's performance further. Finally, it scored 0.38677 mAP@100 on the private leaderboard.
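Summary: The 1st place solution to the Google Landmark Retrieval 2020 competition on Kaggle combines metric learning with transfer learning on two training datasets, fine-tuning on larger images, loss weighting for cleaner samples, and ensembling. It scored 0.38677 mAP@100 on the private leaderboard.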
37. Dynamic Future Net: Diversified Human Motion Generation [PDF] Back to contents
Wenheng Chen, He Wang, Yi Yuan, Tianjia Shao, Kun Zhou
Abstract: Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is a challenge due to the intrinsic stochasticity of human motion dynamics, manifested in the short and long terms. In the short term, there is strong randomness within a couple of frames, e.g. one frame followed by multiple possible frames leading to different motion styles; while in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model where we explicitly focus on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity in temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration, and visually convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with the state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method in robustness, versatility and quality.
Summary: Dynamic Future Net is a generative model that explicitly targets the short-term and long-term stochasticity of human motion. From limited data it generates many high-quality motions of arbitrary duration with convincing variation in space and time, and it compares favorably with state-of-the-art methods both qualitatively and quantitatively.
38. Bayesian Geodesic Regression on Riemannian Manifolds [PDF] Back to contents
Youshan Zhang
Abstract: Geodesic regression has been proposed for fitting the geodesic curve. However, it cannot automatically choose the dimensionality of data. In this paper, we develop a Bayesian geodesic regression model on Riemannian manifolds (BGRM). To avoid the overfitting problem, we add a regularization term to control the effectiveness of the model. To automatically select the dimensionality, we develop a prior for the geodesic regression model, which can automatically select the number of relevant dimensions by driving unnecessary tangent vectors to zero. To validate our model, we first apply it to 3D synthetic sphere and 2D pentagon data. We then demonstrate the effectiveness of our model in reducing the dimensionality and analyzing shape variations of human corpus callosum and mandible data.
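Summary: BGRM is a Bayesian geodesic regression model on Riemannian manifolds. A regularization term guards against overfitting, and a prior drives unnecessary tangent vectors to zero so that the number of relevant dimensions is selected automatically. The model is validated on 3D synthetic sphere and 2D pentagon data, then applied to human corpus callosum and mandible shape data.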
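As background for the sphere experiment, geodesic regression is commonly written as $y = \mathrm{Exp}(\mathrm{Exp}(p, xv), \epsilon)$, with a base point $p$, a tangent vector $v$ and a scalar covariate $x$. The sketch below evaluates only the noise-free part $\mathrm{Exp}(p, xv)$ on the unit sphere, where the exponential map has the closed form $\cos(\lVert v\rVert)\,p + \sin(\lVert v\rVert)\,v/\lVert v\rVert$. It does not implement the Bayesian prior or the regularization described in the abstract; the function names and the sample values are illustrative assumptions.

```python
import math


def sphere_exp(p, v):
    """Exponential map on the unit sphere: start at p, walk along tangent vector v."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-12:
        return list(p)  # zero tangent vector: stay at p
    return [math.cos(norm) * pc + math.sin(norm) * vc / norm
            for pc, vc in zip(p, v)]


def geodesic_point(p, v, x):
    """Noise-free geodesic regression prediction Exp(p, x * v) for covariate x."""
    return sphere_exp(p, [x * c for c in v])


base = [0.0, 0.0, 1.0]            # north pole, playing the role of the intercept
slope = [math.pi / 2, 0.0, 0.0]   # tangent vector at base (orthogonal to base)

start = geodesic_point(base, slope, 0.0)  # x = 0 stays at the base point
end = geodesic_point(base, slope, 1.0)    # x = 1 travels a quarter great circle
```

In this picture, a prior that drives a tangent vector to zero simply collapses the corresponding geodesic to the base point, which is how an unnecessary direction drops out of the model.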
39. Efficiently Constructing Adversarial Examples by Feature Watermarking [PDF] Back to contents
Yuexin Xiang, Wei Ren, Tiantian Li, Xianghan Zheng, Tianqing Zhu, Kim-Kwang Raymond Choo
Abstract: As deep learning models attract increasing attention, attacks on such models are also emerging. For example, an attacker may carefully construct images in specific ways (also referred to as adversarial examples) aiming to mislead the deep learning models into outputting incorrect classification results. Similarly, many efforts have been proposed to detect and mitigate adversarial examples, usually for certain dedicated attacks. In this paper, we propose a novel digital watermark based method to generate adversarial examples for deep learning models. Specifically, partial main features of the watermark image are embedded into the host image invisibly, aiming to tamper with and damage the recognition capabilities of the deep learning models. We devise an efficient mechanism to select host images and watermark images, and utilize the improved discrete wavelet transform (DWT) based Patchwork watermarking algorithm and the modified discrete cosine transform (DCT) based Patchwork watermarking algorithm. The experimental results showed that our scheme is able to generate a large number of adversarial examples efficiently. In addition, we find that using the extracted features of the image as the watermark images can increase the success rate of an attack under certain conditions with minimal changes to the host image. To ensure repeatability, reproducibility, and code sharing, the source code is available on GitHub
Summary: The method generates adversarial examples by invisibly embedding part of the main features of a watermark image into a host image. It selects host and watermark images with a dedicated mechanism and embeds with improved DWT-based and modified DCT-based Patchwork watermarking algorithms; using extracted image features as the watermark can raise the attack success rate under certain conditions with minimal changes to the host image.
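The abstract names Patchwork watermarking without describing the improved DWT and DCT variants, so the sketch below shows only the classic Patchwork idea those variants build on: a secret key selects two disjoint coefficient subsets, one is raised by `delta` and the other lowered by `delta`, and the difference of subset means reveals the mark. It works on a flat list standing in for transform coefficients; the transform itself, the feature selection and the adversarial objective are all omitted, and every name and value here is a hypothetical choice.

```python
import random


def keyed_subsets(length, key, n_pairs):
    """Pick two disjoint index subsets of size n_pairs, reproducibly from the key."""
    rng = random.Random(key)
    idx = rng.sample(range(length), 2 * n_pairs)
    return idx[:n_pairs], idx[n_pairs:]


def patchwork_embed(coeffs, key, n_pairs, delta):
    """Raise subset A by delta and lower subset B by delta; returns a new list."""
    a_idx, b_idx = keyed_subsets(len(coeffs), key, n_pairs)
    marked = list(coeffs)
    for i in a_idx:
        marked[i] += delta
    for i in b_idx:
        marked[i] -= delta
    return marked


def patchwork_statistic(coeffs, key, n_pairs):
    """Mean of subset A minus mean of subset B; about 2 * delta when marked."""
    a_idx, b_idx = keyed_subsets(len(coeffs), key, n_pairs)
    return (sum(coeffs[i] for i in a_idx) - sum(coeffs[i] for i in b_idx)) / n_pairs


host = [10.0] * 64  # stand-in for one band of transform coefficients
marked = patchwork_embed(host, key=7, n_pairs=16, delta=1.5)

stat_host = patchwork_statistic(host, key=7, n_pairs=16)
stat_marked = patchwork_statistic(marked, key=7, n_pairs=16)
```

A small `delta` keeps each coefficient change minor, which matches the abstract's emphasis on embedding invisibly with minimal changes to the host image.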
40. What am I allowed to do here?: Online Learning of Context-Specific Norms by Pepper [PDF] Back to contents
Ali Ayub, Alan R. Wagner
Abstract: Social norms support coordination and cooperation in society. With social robots becoming increasingly involved in our society, they also need to follow the social norms of the society. This paper presents a computational framework for learning contexts and the social norms present in a context in an online manner on a robot. The paper utilizes a recent state-of-the-art approach for incremental learning and adapts it for online learning of scenes (contexts). The paper further utilizes Dempster-Shafer theory to model context-specific norms. After learning the scenes (contexts), we use active learning to learn related norms. We test our approach on the Pepper robot by taking it through different scene locations. Our results show that Pepper can learn different scenes and related norms simply by communicating with a human partner in an online manner.
Summary: A computational framework lets a robot learn scenes (contexts) and their social norms online. An incremental learning approach is adapted to learn scenes, Dempster-Shafer theory models context-specific norms, and active learning acquires the norms; tests on the Pepper robot show it can learn scenes and norms by communicating with a human partner.
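The abstract does not say how norms are encoded as belief masses, so the following only illustrates the core operation of Dempster-Shafer theory, Dempster's rule of combination, on an assumed two-element frame {allowed, forbidden}. The mass values and the reading of them as two answers from a human partner are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for mass functions keyed by frozenset."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


ALLOWED, FORBIDDEN = "allowed", "forbidden"
FRAME = frozenset({ALLOWED, FORBIDDEN})  # mass on FRAME expresses ignorance

# Two hypothetical pieces of evidence about one action in one scene.
first = {frozenset({ALLOWED}): 0.6, FRAME: 0.4}
second = {frozenset({ALLOWED}): 0.5, frozenset({FORBIDDEN}): 0.2, FRAME: 0.3}

fused = combine(first, second)
```

Unlike a single probability, the fused result keeps a separate mass on the whole frame, so "not yet known" stays distinct from "forbidden", which is useful when norms are learned incrementally from sparse feedback.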
41. Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space [PDF] Back to contents
Sicheng Zhao, Yaxian Li, Xingxu Yao, Weizhi Nie, Pengfei Xu, Jufeng Yang, Kurt Keutzer
Abstract: Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger. Existing emotion-based image and music matching methods either employ limited categorical emotion states which cannot well reflect the complexity and subtlety of emotions, or train the matching model using an impractical multi-stage pipeline. In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space which preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and task regression in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. The extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching as compared to the state-of-the-art approaches.
42. Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection [PDF] Back to contents
Xuehao Wang, Shuai Li, Chenglizhao Chen, Yuming Fang, Aimin Hao, Hong Qin
Abstract: Existing RGB-D salient object detection methods treat depth information as an independent component to complement the RGB part, and widely follow the bi-stream parallel network architecture. To selectively fuse the CNN features extracted from both RGB and depth into a final result, the state-of-the-art (SOTA) bi-stream networks usually consist of two independent subbranches; i.e., one subbranch is used for RGB saliency and the other for depth saliency. However, the depth saliency is persistently inferior to the RGB saliency because the RGB component is intrinsically more informative than the depth component. The bi-stream architecture easily biases its subsequent fusion procedure toward the RGB subbranch, leading to a performance bottleneck. In this paper, we propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction, where we cyclically convert the original 4-dimensional RGB-D into DGB, RDB and RGD. Then, a newly designed lightweight triple-stream network is applied over these newly formulated data to achieve an optimal channel-wise complementary fusion status between RGB and D, achieving a new SOTA performance.
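The data-level recombination described in the abstract can be illustrated with a small sketch. It assumes that DGB, RDB and RGD each denote the RGB image with one colour channel replaced by the depth value, which is one reading of the abstract rather than the authors' confirmed procedure. The helper name `recombine_rgbd` and the nested-list image layout are hypothetical; real pipelines would normally use an array library instead.

```python
def recombine_rgbd(rgb, depth):
    """Build the DGB, RDB and RGD variants of an RGB-D image.

    `rgb` is a list of rows of (r, g, b) pixels and `depth` a list of rows
    of depth values with the same height and width. Depth is assumed to be
    already scaled to the same range as the colour channels.
    """
    variants = []
    for channel in range(3):  # 0 replaces R, 1 replaces G, 2 replaces B
        variant = [
            [
                tuple(pixel[:channel]) + (d,) + tuple(pixel[channel + 1:])
                for pixel, d in zip(rgb_row, depth_row)
            ]
            for rgb_row, depth_row in zip(rgb, depth)
        ]
        variants.append(variant)
    dgb, rdb, rgd = variants
    return dgb, rdb, rgd
```

Each of the three outputs is still a 3-channel image, so under this reading each stream of a triple-stream network could reuse a standard RGB backbone.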
43. Vision at A Glance: Interplay between Fine and Coarse Information Processing Pathways [PDF] Back to contents
Zilong Ji, Xiaolong Zou, Tiejun Huang, Si Wu
Abstract: Object recognition is often viewed as a feedforward, bottom-up process in machine learning, but in real neural systems, object recognition is a complicated process which involves the interplay between two signal pathways. One is the parvocellular pathway (P-pathway), which is slow and extracts fine features of objects; the other is the magnocellular pathway (M-pathway), which is fast and extracts coarse features of objects. It has been suggested that the interplay between the two pathways endows the neural system with the capacity to process visual information rapidly, adaptively, and robustly. However, the underlying computational mechanisms remain largely unknown. In this study, we build a computational model to elucidate the computational advantages associated with the interactions between the two pathways. Our model consists of two convolutional neural networks: one mimics the P-pathway, referred to as FineNet, which is deep, has small-size kernels, and receives detailed visual inputs; the other mimics the M-pathway, referred to as CoarseNet, which is shallow, has large-size kernels, and receives low-pass filtered or binarized visual inputs. The two pathways interact with each other via a Restricted Boltzmann Machine. We find that: 1) FineNet can teach CoarseNet through imitation and improve its performance considerably; 2) CoarseNet can improve the noise robustness of FineNet through association; 3) the output of CoarseNet can serve as a cognitive bias to improve the performance of FineNet. We hope that this study will provide insight into understanding visual information processing and inspire the development of new object recognition architectures.
44. Adversarial score matching and improved sampling for image generation [PDF] Back to contents
Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, Ioannis Mitliagkas
Abstract: Denoising score matching with Annealed Langevin Sampling (DSM-ALS) is a recent approach to generative modeling. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Fréchet Inception Distance, a popular metric for generative models. We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both denoising score matching and adversarial objectives. By combining both of these techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10.
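To make the sampling procedure in the abstract concrete, here is a minimal one-dimensional sketch of annealed Langevin sampling followed by a final denoising step. It is an illustration under stated assumptions, not the authors' implementation: the analytic `toy_score` stands in for a trained score network, the step-size schedule `eps * sigma**2 / sigma_min**2` is a commonly used choice assumed here, and the final update `x + sigma_min**2 * score(x, sigma_min)` is assumed to be the denoising step the abstract refers to. All function and parameter names are hypothetical.

```python
import math
import random


def toy_score(x, sigma, mu=2.0, s0=0.5):
    """Exact score of N(mu, s0^2) data blurred with N(0, sigma^2) noise.

    Stands in for a learned score network s(x, sigma).
    """
    return -(x - mu) / (s0 ** 2 + sigma ** 2)


def annealed_langevin_sample(score, sigmas, n_steps, eps, rng, denoise=True):
    """Draw one scalar sample; `sigmas` must run from largest to smallest."""
    sigma_min = sigmas[-1]
    x = rng.gauss(0.0, sigmas[0])  # broad initialisation (an assumption)
    for sigma in sigmas:
        # Step size shrinks together with the noise level.
        alpha = eps * sigma ** 2 / sigma_min ** 2
        for _ in range(n_steps):
            noise = rng.gauss(0.0, 1.0)
            x += 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * noise
    if denoise:
        # Final denoising: move the sample along the score at the lowest
        # noise level to remove the residual noise.
        x += sigma_min ** 2 * score(x, sigma_min)
    return x
```

A call such as `annealed_langevin_sample(toy_score, sigmas, 50, 2e-5, random.Random(0))`, with `sigmas` a decreasing geometric sequence, draws samples that concentrate around the toy data mean. In this toy setting the denoising step changes little because the smallest noise level is tiny compared with the data spread.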
45. Object Recognition for Economic Development from Daytime Satellite Imagery [PDF] Back to contents
Klaus Ackermann, Alexey Chernikov, Nandini Anantharama, Miethy Zaman, Paul A Raschky
Abstract: Reliable data about the stock of physical capital and infrastructure in developing countries is typically very scarce. This is particularly a problem for data at the subnational level, where existing data is often outdated, inconsistently measured, or incomplete in coverage. Traditional data collection methods are time- and labor-intensive and costly, which often prohibits developing countries from collecting this type of data. This paper proposes a novel method to extract infrastructure features from high-resolution satellite images. We collected high-resolution satellite images for 5 million 1km $\times$ 1km grid cells covering 21 African countries. We contribute to the growing body of literature in this area by training our machine learning algorithm on ground-truth data. We show that our approach strongly improves the predictive accuracy. Our methodology can build the foundation to then predict subnational indicators of economic development for areas where this data is either missing or unreliable.
46. Embodied Visual Navigation with Automatic Curriculum Learning in Real Environments [PDF] Back to contents
Steven D. Morad, Roberto Mecca, Rudra P.K. Poudel, Stephan Liwicki, Roberto Cipolla
Abstract: We present NavACL, a method of automatic curriculum learning tailored to the navigation task. NavACL is simple to train and efficiently selects relevant tasks using geometric features. In our experiments, deep reinforcement learning agents trained using NavACL in collision-free environments significantly outperform state-of-the-art agents trained with uniform sampling, the current standard. Furthermore, our agents are able to navigate through unknown cluttered indoor environments to semantically specified targets using only RGB images. Collision avoidance policies and frozen feature networks support transfer to unseen real-world environments, without any modification or retraining requirements. We evaluate our policies in simulation, and in the real world on a ground robot and a quadrotor drone. Videos of real-world results are available in the supplementary material.
47. Phase Sampling Profilometry [PDF] Back to contents
Zhenzhou Wang
Abstract: Structured light 3D surface imaging is a school of techniques in which structured light patterns are used for measuring the depth map of the object. Among all the designed structured light patterns, the phase pattern has become the most popular because of its high resolution and high accuracy. Accordingly, phase measuring profilometry (PMP) has become the mainstream of structured light technology. In this letter, we introduce the concept of phase sampling profilometry (PSP), which calculates the phase unambiguously in the spatial-frequency domain with only one pattern image. Therefore, PSP is capable of robustly measuring the 3D shapes of moving objects with a single shot.
48. COVIDNet-CT: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest CT Images [PDF] Back to contents
Hayden Gunraj, Linda Wang, Alexander Wong
Abstract: The coronavirus disease 2019 (COVID-19) pandemic continues to have a tremendous impact on patients and healthcare systems around the world. In the fight against this novel disease, there is a pressing need for rapid and effective screening tools to identify patients infected with COVID-19, and to this end CT imaging has been proposed as one of the key screening methods which may be used as a complement to RT-PCR testing, particularly in situations where patients undergo routine CT scans for non-COVID-19 related reasons, patients with worsening respiratory status or developing complications that require expedited care, and patients suspected to be COVID-19-positive but have negative RT-PCR test results. Motivated by this, in this study we introduce COVIDNet-CT, a deep convolutional neural network architecture that is tailored for detection of COVID-19 cases from chest CT images via a machine-driven design exploration approach. Additionally, we introduce COVIDx-CT, a benchmark CT image dataset derived from CT imaging data collected by the China National Center for Bioinformation comprising 104,009 images across 1,489 patient cases. Furthermore, in the interest of reliability and transparency, we leverage an explainability-driven performance validation strategy to investigate the decision-making behaviour of COVIDNet-CT, and in doing so ensure that COVIDNet-CT makes predictions based on relevant indicators in CT images. Both COVIDNet-CT and the COVIDx-CT dataset are available to the general public in an open-source and open access manner as part of the COVID-Net initiative. While COVIDNet-CT is not yet a production-ready screening solution, we hope that releasing the model and dataset will encourage researchers, clinicians, and citizen data scientists alike to leverage and build upon them.
49. Disentangling Neural Architectures and Weights: A Case Study in Supervised Classification [PDF] Back to contents
Nicolo Colombo, Yang Gao
Abstract: The history of deep learning has shown that human-designed problem-specific networks can greatly improve the classification performance of general neural models. In most practical cases, however, choosing the optimal architecture for a given task remains a challenging problem. Recent architecture-search methods are able to automatically build neural models with strong performance but fail to fully appreciate the interaction between neural architecture and weights. This work investigates the problem of disentangling the role of the neural structure and its edge weights, by showing that well-trained architectures may not need any link-specific fine-tuning of the weights. We compare the performance of such weight-free networks (in our case these are binary networks with {0, 1}-valued weights) with random, weight-agnostic, pruned and standard fully connected networks. To find the optimal weight-agnostic network, we use a novel and computationally efficient method that translates the hard architecture-search problem into a feasible optimization problem. More specifically, we look at the optimal task-specific architectures as the optimal configuration of binary networks with {0, 1}-valued weights, which can be found through an approximate gradient descent strategy. Theoretical convergence guarantees of the proposed algorithm are obtained by bounding the error in the gradient approximation, and its practical performance is evaluated on two real-world data sets. For measuring the structural similarities between different architectures, we use a novel spectral approach that allows us to underline the intrinsic differences between real-valued networks and weight-free architectures.
50. Visually Analyzing and Steering Zero Shot Learning [PDF] Back to contents
Saroj Sahoo, Matthew Berger
Abstract: We propose a visual analytics system to help a user analyze and steer zero-shot learning models. Zero-shot learning has emerged as a viable scenario for categorizing data that consists of no labeled examples, and thus a promising approach to minimize data annotation from humans. However, it is challenging to understand where zero-shot learning fails, the cause of such failures, and how a user can modify the model to prevent such failures. Our visualization system is designed to help users diagnose and understand mispredictions in such models, so that they may gain insight on the behavior of a model when applied to data associated with categories not seen during training. Through usage scenarios, we highlight how our system can help a user improve performance in zero-shot learning.
51. Defending Against Multiple and Unforeseen Adversarial Videos [PDF] Back to contents
Shao-Yuan Lo, Vishal M. Patel
Abstract: Adversarial examples of deep neural networks have been actively investigated on image-based classification, segmentation and detection tasks. However, adversarial robustness of video models still lacks exploration. While several studies have proposed how to generate adversarial videos, only a handful of approaches pertaining to the defense strategies have been published in the literature. Furthermore, these defense methods are limited to a single perturbation type and often fail to provide robustness to Lp-bounded attacks and physically realizable attacks simultaneously. In this paper, we propose one of the first defense solutions against multiple adversarial video types for video classification. The proposed approach performs adversarial training with multiple types of video adversaries using independent batch normalizations (BNs), and recognizes different adversaries by an adversarial video detector. During inference, a switch module sends an input to a proper batch normalization branch according to the detected attack type. Compared to conventional adversarial training, our method exhibits stronger robustness to multiple and even unforeseen adversarial videos and provides higher classification accuracy.
Summary: One of the first defenses against multiple adversarial video types for video classification. It combines adversarial training over independent batch normalization (BN) branches, an adversarial video detector, and a switch module that routes each input to the matching BN branch at inference.
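The routing idea in this abstract (a detector picks the attack type, and a switch sends the input to the matching normalization branch) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: `BatchNormBranch`, `toy_detector`, `switch`, the branch names, and the magnitude threshold are all hypothetical, and each branch only normalizes a 1-D feature list with its own fixed statistics.

```python
from statistics import fmean, pstdev


class BatchNormBranch:
    """Hypothetical normalization branch that keeps its own mean and std."""

    def __init__(self, samples, eps=1e-5):
        # Statistics are estimated once from samples of one input type.
        self.mean = fmean(samples)
        self.std = pstdev(samples)
        self.eps = eps

    def __call__(self, x):
        # Normalize every feature with this branch's own statistics.
        return [(v - self.mean) / (self.std + self.eps) for v in x]


def toy_detector(x):
    """Assumed stand-in for the adversarial video detector.

    A real detector would be a learned model; here a large magnitude
    is simply treated as a sign of perturbation.
    """
    return "lp_bounded" if max(abs(v) for v in x) > 3.0 else "clean"


def switch(x, detector, branches):
    """Route the input to the branch that matches the detected type."""
    attack_type = detector(x)
    return attack_type, branches[attack_type](x)


# One branch per input type, each with independent statistics.
branches = {
    "clean": BatchNormBranch([0.0, 1.0, 2.0]),
    "lp_bounded": BatchNormBranch([2.0, 4.0, 6.0]),
}

kind, normalized = switch([2.0, 4.0, 6.0], toy_detector, branches)
print(kind, normalized)
```

Keeping separate statistics per branch is the point of the design: clean and perturbed inputs follow different feature distributions, so a single shared normalization would fit neither well.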
52. Weakly Supervised Content Selection for Improved Image Captioning [PDF] Back to contents
Khyathi Raghavi Chandu, Piyush Sharma, Soravit Changpinyo, Ashish Thapliyal, Radu Soricut
Abstract: Image captioning involves identifying semantic concepts in the scene and describing them in fluent natural language. Recent approaches do not explicitly model the semantic concepts and train the model only for the end goal of caption generation. Such models lack interpretability and controllability, primarily due to sub-optimal content selection. We address this problem by breaking down the captioning task into two simpler, manageable and more controllable tasks -- skeleton prediction and skeleton-based caption generation. We approach the former as a weakly supervised task, using a simple off-the-shelf language syntax parser and avoiding the need for additional human annotations; the latter uses a supervised-learning approach. We investigate three methods of conditioning the caption on the skeleton: in the encoder, in the decoder, and in both. Our compositional model generates significantly better quality captions on out-of-domain test images, as judged by human annotators. Additionally, we demonstrate the cross-language effectiveness of the English skeleton for other languages, including French, Italian, German, Spanish and Hindi. This compositional nature of captioning exhibits the potential of unpaired image captioning, thereby reducing the dependence on expensive image-caption pairs. Furthermore, we investigate the use of skeletons as a knob to control certain properties of the generated image caption, such as length, content, and gender expression.
Summary: Image captioning is split into weakly supervised skeleton prediction and supervised skeleton-based caption generation. This improves out-of-domain caption quality as judged by human annotators, transfers to other languages, and lets the skeleton control caption length, content, and gender expression.
53. COVID CT-Net: Predicting Covid-19 From Chest CT Images Using Attentional Convolutional Network [PDF] Back to contents
Shakib Yazdani, Shervin Minaee, Rahele Kafieh, Narges Saeedizadeh, Milan Sonka
Abstract: The novel coronavirus disease (COVID-19) pandemic has caused a major outbreak in more than 200 countries around the world, leading to a severe impact on the health and life of many people globally. As of August 25, 2020, more than 20 million people had been infected and more than 800,000 deaths had been reported. Computed Tomography (CT) images can be used as an alternative to the time-consuming "reverse transcription polymerase chain reaction (RT-PCR)" test to detect COVID-19. In this work we developed a deep learning framework to predict COVID-19 from CT images. We propose to use an attentional convolution network, which can focus on the infected areas of the chest, enabling it to perform a more accurate prediction. We trained our model on a dataset of more than 2000 CT images, and report its performance in terms of various popular metrics, such as sensitivity, specificity, area under the curve, and also precision-recall curve, and achieve very promising results. We also provide a visualization of the attention maps of the model for several test images, and show that our model is attending to the infected regions as intended. In addition to developing a machine learning modeling framework, we also provide the manual annotation of the potentially infected regions of the chest, with the help of a board-certified radiologist, and make that publicly available for other researchers.
Summary: An attentional convolutional network predicts COVID-19 from chest CT images. It is trained on more than 2000 CT images, evaluated by sensitivity, specificity, area under the curve, and precision-recall curve, and released with radiologist-assisted manual annotations of potentially infected regions.
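Two of the metrics named in this abstract, sensitivity and specificity, follow directly from the confusion counts of a binary classifier. The sketch below shows the standard definitions only; the function name and the toy labels are illustrative assumptions and do not come from the paper's data or results.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms

    # Sensitivity: fraction of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of actual negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels (made up for illustration): 1 = COVID-19, 0 = non-COVID.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(sensitivity_specificity(y_true, y_pred))  # (0.75, 0.75)
```

In a screening setting the two numbers trade off against each other as the decision threshold moves, which is why the abstract also reports threshold-independent summaries such as area under the curve.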
Note: The Chinese text in the original is machine-translated. The cover image is a word cloud of the paper titles.