Contents
1. Simultaneously Learning Corrections and Error Models for Geometry-based Visual Odometry Methods [PDF] Abstract
2. Learning Video Representations from Textual Web Supervision [PDF] Abstract
3. Dynamic Character Graph via Online Face Clustering for Movie Analysis [PDF] Abstract
4. Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation [PDF] Abstract
5. MessyTable: Instance Association in Multiple Camera Views [PDF] Abstract
6. Spatio-temporal Consistency to Detect Potential Aedes aegypti Breeding Grounds in Aerial Video Sequences [PDF] Abstract
7. Difficulty-aware Glaucoma Classification with Multi-Rater Consensus Modeling [PDF] Abstract
8. What My Motion tells me about Your Pose: Self-Supervised Fine-Tuning of Observed Vehicle Orientation Angle [PDF] Abstract
9. Face2Face: Real-time Face Capture and Reenactment of RGB Videos [PDF] Abstract
10. SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation [PDF] Abstract
11. Dynamic GCN: Context-enriched Topology Learning for Skeleton-based Action Recognition [PDF] Abstract
12. Enriching Video Captions With Contextual Text [PDF] Abstract
13. Stylized Adversarial Defense [PDF] Abstract
14. Meta-Learning with Context-Agnostic Initialisations [PDF] Abstract
15. Solving the Blind Perspective-n-Point Problem End-To-End With Robust Differentiable Geometric Optimization [PDF] Abstract
16. Deep Multi-Scale Resemblance Network for the Sub-class Differentiation of Adrenal Masses on Computed Tomography Images [PDF] Abstract
17. Translate the Facial Regions You Like Using Region-Wise Normalization [PDF] Abstract
18. A SLAM Map Restoration Algorithm Based on Submaps and an Undirected Connected Graph [PDF] Abstract
19. Pooling Regularized Graph Neural Network for fMRI Biomarker Analysis [PDF] Abstract
Abstracts
1. Simultaneously Learning Corrections and Error Models for Geometry-based Visual Odometry Methods [PDF] Back to Contents
Andrea De Maio, Simon Lacroix
Abstract: This paper fosters the idea that deep learning methods can be used to complement classical visual odometry pipelines to improve their accuracy and to associate uncertainty models with their estimates. We show that the biases inherent to the visual odometry process can be faithfully learned and compensated for, and that a learning architecture associated with a probabilistic loss function can jointly estimate a full covariance matrix of the residual errors, defining an error model capturing the heteroscedasticity of the process. Experiments on autonomous driving image sequences assess the possibility of concurrently improving visual odometry and estimating an error associated with its outputs.
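Such a probabilistic loss, in which the network jointly predicts the residuals and a full covariance matrix, is typically realized as a multivariate Gaussian negative log-likelihood with the covariance parameterized through its Cholesky factor so that it stays positive definite. The sketch below illustrates that generic construction (it is not the authors' code; names and shapes are assumptions):

```python
import torch

def gaussian_nll(residual, chol_params):
    """Negative log-likelihood of residuals under N(0, Sigma) with Sigma = L @ L.T.

    residual:    (B, D) pose residuals (estimate minus ground truth)
    chol_params: (B, D*(D+1)//2) raw network outputs parameterizing L
    """
    B, D = residual.shape
    tril = torch.tril_indices(D, D)
    L = torch.zeros(B, D, D, dtype=residual.dtype)
    L[:, tril[0], tril[1]] = chol_params
    diag = torch.arange(D)
    L[:, diag, diag] = L[:, diag, diag].exp()   # positive diagonal keeps Sigma PD

    logdet = 2.0 * torch.log(L[:, diag, diag]).sum(dim=1)          # log|Sigma|
    z = torch.linalg.solve_triangular(L, residual.unsqueeze(-1), upper=False)
    maha = z.squeeze(-1).pow(2).sum(dim=1)                         # r^T Sigma^{-1} r
    return 0.5 * (maha + logdet).mean()
```

Because the diagonal of L is exponentiated, the predicted covariance can shrink or grow per sample, which is what lets such a model capture heteroscedasticity.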
2. Learning Video Representations from Textual Web Supervision [PDF] Back to Contents
Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid
Abstract: Videos found on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use such text as a method for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We fine-tune the model on several down-stream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pretraining video representations. Specifically, it leads to improvements over from-scratch training on all benchmarks, outperforms many methods for self-supervised and webly-supervised video representation learning, and achieves an improvement of 2.2% accuracy on HMDB-51.
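Pairing each video with its associated text is commonly cast as a symmetric contrastive objective over a batch of embeddings, with matched pairs on the diagonal of the similarity matrix. The sketch below shows that generic setup; it is an assumption about the form of the objective, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def video_text_pairing_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched (video, text) pairs sit on the diagonal."""
    v = F.normalize(video_emb, dim=-1)   # (B, D) video embeddings
    t = F.normalize(text_emb, dim=-1)    # (B, D) text embeddings
    logits = v @ t.T / temperature       # (B, B) cosine-similarity matrix
    targets = torch.arange(v.size(0))    # index i matches video i with text i
    # Average the video-to-text and text-to-video retrieval losses.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```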
3. Dynamic Character Graph via Online Face Clustering for Movie Analysis [PDF] Back to Contents
Prakhar Kulshreshtha, Tanaya Guha
Abstract: An effective approach to automated movie content analysis involves building a network (graph) of its characters. Existing work usually builds a static character graph to summarize the content using metadata, scripts or manual annotations. We propose an unsupervised approach to building a dynamic character graph that captures the temporal evolution of character interaction. We refer to this as the character interaction graph (CIG). Our approach has two components: (i) an online face clustering algorithm that discovers the characters in the video stream as they appear, and (ii) simultaneous creation of a CIG using the temporal dynamics of the resulting clusters. We demonstrate the usefulness of the CIG for two movie analysis tasks: narrative structure (acts) segmentation, and major character retrieval. Our evaluation on full-length movies containing more than 5000 face tracks shows that the proposed approach achieves superior performance for both tasks.
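One way to realize the CIG construction sketched in the abstract is to treat each discovered face cluster as a character node and to strengthen an edge whenever two characters' faces co-occur within the same time window. The toy sketch below assumes each detection carries a frame index and a cluster id, which is a simplification of the online setting:

```python
from collections import defaultdict
from itertools import combinations

def build_cig(detections, window=30):
    """detections: list of (frame_index, character_id) face observations.
    Returns {(char_a, char_b): weight}, the interaction-graph edge weights."""
    buckets = defaultdict(set)
    for frame, char in detections:
        buckets[frame // window].add(char)   # group detections into time windows
    edges = defaultdict(int)
    for chars in buckets.values():
        for a, b in combinations(sorted(chars), 2):
            edges[(a, b)] += 1               # co-occurrence strengthens the edge
    return dict(edges)

print(build_cig([(0, "A"), (5, "B"), (40, "A"), (42, "C")]))
# -> {('A', 'B'): 1, ('A', 'C'): 1}
```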
4. Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation [PDF] Back to Contents
Rui Li, Chenxi Duan, Shunyi Zheng
Abstract: In this paper, to remedy the heavy memory and computational cost of dot-product attention, we propose a Linear Attention Mechanism that approximates dot-product attention with much less memory and computational cost. The efficient design makes the combination of attention mechanisms and neural networks more flexible and versatile. Experiments conducted on semantic segmentation demonstrate the effectiveness of the linear attention mechanism. Code is available at this https URL.
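The essence of linear attention is to replace the softmax similarity with a kernel feature map applied to queries and keys, then reorder the computation so keys and values are aggregated once, turning the O(N^2) cost of dot-product attention into O(N). A minimal sketch of one common variant, using 1 + elu as the feature map (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k: (B, N, D); v: (B, N, Dv). Cost is O(N * D * Dv) instead of O(N^2)."""
    q = F.elu(q) + 1.0                           # positive feature map phi(.)
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)      # sum_n phi(k_n) v_n^T, computed once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```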
5. MessyTable: Instance Association in Multiple Camera Views [PDF] Back to Contents
Zhongang Cai, Junzhe Zhang, Daxuan Ren, Cunjun Yu, Haiyu Zhao, Shuai Yi, Chai Kiat Yeo, Chen Change Loy
Abstract: We present an interesting and challenging dataset that features a large number of scenes with messy tables captured from multiple camera views. Each scene in this dataset is highly complex, containing multiple object instances that could be identical, stacked and occluded by other instances. The key challenge is to associate all instances given the RGB image of all views. The seemingly simple task surprisingly fails many popular methods or heuristics that we assume good performance in object association. The dataset challenges existing methods in mining subtle appearance differences, reasoning based on contexts, and fusing appearance with geometric cues for establishing an association. We report interesting findings with some popular baselines, and discuss how this dataset could help inspire new problems and catalyse more robust formulations to tackle real-world instance association problems. Project page: $\href{this https URL}{\text{MessyTable}}$
6. Spatio-temporal Consistency to Detect Potential Aedes aegypti Breeding Grounds in Aerial Video Sequences [PDF] Back to Contents
Wesley L. Passos, Eduardo A. B. da Silva, Sergio L. Netto, Gabriel M. Araujo, Amaro A. de Lima
Abstract: Every year, the \textit{Aedes aegypti} mosquito infects thousands of people with diseases such as dengue, zika, chikungunya, and urban yellow fever. The main form to combat these diseases is to avoid the transmitter reproduction by searching and eliminating the potential mosquito breeding grounds. In this work, we introduce a comprehensive database of aerial videos recorded with a drone, where all objects of interest are identified by their respective bounding boxes, and describe an object detection system based on deep neural networks. We track the objects by employing phase correlation to obtain the spatial alignment between them along the video frames. By doing so, we are capable of registering the detected objects, minimizing false positives and correcting most false negatives. Using the ResNet-101-FPN as a backbone, it is possible to obtain 0.78 in terms of \textit{F1-score} on the proposed dataset.
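Phase correlation, used above to align consecutive frames, recovers a translational shift from the peak of the inverse FFT of the normalized cross-power spectrum. A compact NumPy sketch of the standard technique:

```python
import numpy as np

def phase_correlation(a, b, eps=1e-8):
    """Estimate the (dy, dx) translation between two equally sized grayscale frames."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    corr = np.fft.ifft2(cross / (np.abs(cross) + eps))  # normalized cross-power spectrum
    peak = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    shift = np.array(peak, dtype=float)
    dims = np.array(a.shape, dtype=float)
    shift[shift > dims / 2] -= dims[shift > dims / 2]   # wrap to signed offsets
    return shift
```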
7. Difficulty-aware Glaucoma Classification with Multi-Rater Consensus Modeling [PDF] Back to Contents
Shuang Yu, Hong-Yu Zhou, Kai Ma, Cheng Bian, Chunyan Chu, Hanruo Liu, Yefeng Zheng
Abstract: Medical images are generally labeled by multiple experts before the final ground-truth labels are determined. Consensus or disagreement among experts regarding individual images reflects the gradeability and difficulty levels of the image. However, when used for model training, only the final ground-truth label is utilized, while the critical information contained in the raw multi-rater gradings, regarding whether the image is an easy or hard case, is discarded. In this paper, we aim to take advantage of the raw multi-rater gradings to improve the deep learning model performance for the glaucoma classification task. Specifically, a multi-branch model structure is proposed to predict the most sensitive, the most specific, and a balanced fused result for the input images. In order to encourage the sensitivity branch and the specificity branch to generate consistent results for consensus labels and opposite results for disagreement labels, a consensus loss is proposed to constrain the outputs of the two branches. Meanwhile, the consistency or inconsistency between the prediction results of the two branches indicates whether the image is an easy or hard case, which is further utilized to encourage the balanced fusion branch to concentrate more on the hard cases. Compared with models trained only with the final ground-truth labels, the proposed method using multi-rater consensus information achieves superior performance, and it is also able to estimate the difficulty level of individual input images when making the prediction.
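A plausible instantiation of the consensus loss described above (a reading of the abstract, not the released code) penalizes the gap between the two branches' outputs on consensus samples and enforces a margin between them on disagreement samples:

```python
import torch

def consensus_loss(p_sens, p_spec, is_consensus, margin=0.5):
    """p_sens, p_spec: (B,) sigmoid outputs of the sensitivity and specificity branches.
    is_consensus: (B,) float mask, 1 where the raters agreed on the label."""
    gap = (p_sens - p_spec).abs()
    agree = is_consensus * gap                                        # pull together
    disagree = (1 - is_consensus) * torch.clamp(margin - gap, min=0)  # push apart
    return (agree + disagree).mean()
```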
8. What My Motion tells me about Your Pose: Self-Supervised Fine-Tuning of Observed Vehicle Orientation Angle [PDF] Back to Contents
Cédric Picron, Punarjay Chakravarty, Tom Roussel, Tinne Tuytelaars
Abstract: The determination of the relative 6 Degree of Freedom (DoF) pose of vehicles around the ego-vehicle from monocular cameras is an important aspect of the perception problem for Autonomous Vehicles (AVs) and Driver Assist Technology (DAT). Current deep learning techniques used for tackling this problem are data hungry, driving the need for unsupervised or self-supervised methods. In this paper, we consider the domain adaptation task of fine-tuning a vehicle orientation estimator on a new domain without labels. By leveraging the ego-motion consistencies obtained from a monocular SLAM method, we show that our self-supervised fine-tuning scheme consistently improves the accuracy of the resulting network. More specifically, when transitioning from Virtual Kitti to nuScenes, up to 70% of the performance is recovered compared to the 100% of a supervised method. Our self-supervised method hence allows us to safely transfer vehicle orientation estimators to new domains without requiring expensive new labels.
9. Face2Face: Real-time Face Capture and Reenactment of RGB Videos [PDF] Back to Contents
Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, Matthias Nießner
Abstract: We present Face2Face, a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.
10. SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation [PDF] Back to Contents
Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao
Abstract: Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but are still lagging behind in accuracy, compared to two-stage methods. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating mask prediction of an instance to different sub-regions of a detected bounding-box. Our main contribution is a novel light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding-box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, our SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP), while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate our SipMask for real-time video instance segmentation, achieving promising results on YouTube-VIS dataset. The source code is available at this https URL.
11. Dynamic GCN: Context-enriched Topology Learning for Skeleton-based Action Recognition [PDF] Back to Contents
Fanfan Ye, Shiliang Pu, Qiaoyong Zhong, Chao Li, Di Xie, Huiming Tang
Abstract: Graph Convolutional Networks (GCNs) have attracted increasing interests for the task of skeleton-based action recognition. The key lies in the design of the graph structure, which encodes skeleton topology information. In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Contextencoding Network (CeN) is introduced to learn skeleton topology automatically. In particular, when learning the dependency between two joints, contextual features from the rest joints are incorporated in a global manner. CeN is extremely lightweight yet effective, and can be embedded into a graph convolutional layer. By stacking multiple CeN-enabled graph convolutional layers, we build Dynamic GCN. Notably, as a merit of CeN, dynamic graph topologies are constructed for different input samples as well as graph convolutional layers of various depths. Besides, three alternative context modeling architectures are well explored, which may serve as a guideline for future research on graph topology learning. CeN brings only ~7% extra FLOPs for the baseline model, and Dynamic GCN achieves better performance with $2\times$~$4\times$ fewer FLOPs than existing methods. By further combining static physical body connections and motion modalities, we achieve state-of-the-art performance on three large-scale benchmarks, namely NTU-RGB+D, NTU-RGB+D 120 and Skeleton-Kinetics.
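The core idea of CeN, inferring the skeleton topology from the input itself, can be sketched as a small network that maps per-joint features to a row-normalized adjacency matrix, which then drives an ordinary graph convolution. Names and shapes below are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DynamicGraphConv(nn.Module):
    """Graph convolution whose adjacency is predicted from the input features."""
    def __init__(self, in_ch, out_ch, n_joints):
        super().__init__()
        self.edge_net = nn.Linear(in_ch, n_joints)  # per-joint scores over all joints
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):                              # x: (B, V, C), V joints
        adj = torch.softmax(self.edge_net(x), dim=-1)  # (B, V, V), rows sum to 1
        return torch.relu(self.proj(adj @ x))          # aggregate, then transform

layer = DynamicGraphConv(in_ch=64, out_ch=128, n_joints=25)
out = layer(torch.randn(8, 25, 64))                    # -> (8, 25, 128)
```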
12. Enriching Video Captions With Contextual Text [PDF] Back to Contents
Philipp Rimle, Pelin Dogan, Markus Gross
Abstract: Understanding video content and generating captions with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning by infusing extracted information from relevant text data. We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input, and mines relevant knowledge such as names and locations from contextual text. In contrast to previous approaches, we do not preprocess the text further, and let the model directly learn to attend over it. Guided by the visual input, the model is able to copy words from the contextual text via a pointer-generator network, allowing it to produce more specific video captions. We show competitive performance on the News Video Dataset and, through ablation studies, validate the efficacy of contextual video captioning as well as individual design choices in our model architecture.
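The pointer-generator mechanism referred to above mixes a generation distribution over the vocabulary with a copy distribution over the contextual text, gated by a learned scalar. A toy sketch of that mixing step (shapes and names are assumptions):

```python
import torch

def pointer_generator_dist(p_vocab, attn, src_token_ids, p_gen):
    """p_vocab: (B, V) generation distribution over the vocabulary;
    attn: (B, S) attention over the contextual text tokens;
    src_token_ids: (B, S) vocabulary ids of those tokens; p_gen: (B, 1) gate."""
    mixed = p_gen * p_vocab
    copy = (1.0 - p_gen) * attn
    # Scatter copy probabilities onto the vocabulary slots of the source tokens.
    return mixed.scatter_add(1, src_token_ids, copy)
```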
13. Stylized Adversarial Defense [PDF] Back to Contents
Muzammal Naseer, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Fatih Porikli
Abstract: Deep Convolution Neural Networks (CNNs) can easily be fooled by subtle, imperceptible changes to the input images. To address this vulnerability, adversarial training creates perturbation patterns and includes them in the training set to robustify the model. In contrast to existing adversarial training methods that only use class-boundary information (e.g., using a cross entropy loss), we propose to exploit additional information from the feature space to craft stronger adversaries that are in turn used to learn a robust model. Specifically, we use the style and content information of the target sample from another class, alongside its class boundary information to create adversarial perturbations. We apply our proposed multi-task objective in a deeply supervised manner, extracting multi-scale feature knowledge to create maximally separating adversaries. Subsequently, we propose a max-margin adversarial training approach that minimizes the distance between source image and its adversary and maximizes the distance between the adversary and the target image. Our adversarial training approach demonstrates strong robustness compared to state of the art defenses, generalizes well to naturally occurring corruptions and data distributional shifts, and retains the model accuracy on clean examples.
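The max-margin objective described, keeping an adversary close to its source while pushing it away from the target image, has the shape of a triplet loss in feature space. A generic sketch under that reading (not the authors' exact formulation):

```python
import torch
import torch.nn.functional as F

def max_margin_adv_loss(f_src, f_adv, f_tgt, margin=1.0):
    """f_src, f_adv, f_tgt: (B, D) features of the source, adversary and target."""
    d_pos = F.pairwise_distance(f_adv, f_src)   # keep the adversary near its source
    d_neg = F.pairwise_distance(f_adv, f_tgt)   # ...and far from the target image
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```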
14. Meta-Learning with Context-Agnostic Initialisations [PDF] Back to Contents
Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen
Abstract: Meta-learning approaches have addressed few-shot problems by finding initialisations suited for fine-tuning to target tasks. Often there are additional properties within training data (which we refer to as context), not relevant to the target task, which act as a distractor to meta-learning, particularly when the target task contains examples from a novel context not seen during training. We address this oversight by incorporating a context-adversarial component into the meta-learning process. This produces an initialisation for fine-tuning to target which is both context-agnostic and task-generalised. We evaluate our approach on three commonly used meta-learning algorithms and two problems. We demonstrate our context-agnostic meta-learning improves results in each case. First, we report on Omniglot few-shot character classification, using alphabets as context. An average improvement of 4.3% is observed across methods and tasks when classifying characters from an unseen alphabet. Second, we evaluate on a dataset for personalised energy expenditure predictions from video, using participant knowledge as context. We demonstrate that context-agnostic meta-learning decreases the average mean square error by 30%.
15. Solving the Blind Perspective-n-Point Problem End-To-End With Robust Differentiable Geometric Optimization [PDF] Back to Contents
Dylan Campbell, Liu Liu, Stephen Gould
Abstract: Blind Perspective-n-Point (PnP) is the problem of estimating the position and orientation of a camera relative to a scene, given 2D image points and 3D scene points, without prior knowledge of the 2D-3D correspondences. Solving for pose and correspondences simultaneously is extremely challenging since the search space is very large. Fortunately it is a coupled problem: the pose can be found easily given the correspondences and vice versa. Existing approaches assume that noisy correspondences are provided, that a good pose prior is available, or that the problem size is small. We instead propose the first fully end-to-end trainable network for solving the blind PnP problem efficiently and globally, that is, without the need for pose priors. We make use of recent results in differentiating optimization problems to incorporate geometric model fitting into an end-to-end learning framework, including Sinkhorn, RANSAC and PnP algorithms. Our proposed approach significantly outperforms other methods on synthetic and real data.
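Sinkhorn normalization, one of the differentiable components listed, turns a matrix of 2D-3D matching scores into a (near) doubly-stochastic soft assignment by alternating row and column normalizations in log space. A standard sketch:

```python
import torch

def sinkhorn(scores, n_iters=20, tau=0.1):
    """scores: (M, N) matching scores -> soft correspondence matrix."""
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # normalize rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # normalize columns
    return log_p.exp()
```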
16. Deep Multi-Scale Resemblance Network for the Sub-class Differentiation of Adrenal Masses on Computed Tomography Images [PDF] Back to Contents
Lei Bi, Jinman Kim, Tingwei Su, Michael Fulham, David Dagan Feng, Guang Ning
Abstract: Objective: The accurate classification of mass lesions in the adrenal glands ('adrenal masses'), detected with computed tomography (CT), is important for diagnosis and patient management. Adrenal masses can be benign or malignant and the benign masses have varying prevalence. Classification methods based on convolutional neural networks (CNN) are the state-of-the-art in maximizing inter-class differences in large medical imaging training datasets. The application of CNNs, to adrenal masses is challenging due to large intra-class variations, large inter-class similarities and imbalanced training data due to the size of masses. Methods: We developed a deep multi-scale resemblance network (DMRN) to overcome these limitations and leveraged paired CNNs to evaluate the intra-class similarities. We used multi-scale feature embedding to improve the inter-class separability by iteratively combining complementary information produced at different scales of the input to create structured feature descriptors. We also augmented the training data with randomly sampled paired adrenal masses to reduce the influence of imbalanced training data. We used 229 CT scans of patients with adrenal masses. Results: Our method had the best results compared to state-of-the-art methods. Conclusion: Our DMRN sub-classified adrenal masses on CT and was superior to state-of-the-art approaches.
17. Translate the Facial Regions You Like Using Region-Wise Normalization [PDF] Back to Contents
Wenshuang Liu, Wenting Chen, Linlin Shen
Abstract: Though GAN (Generative Adversarial Network) based techniques have greatly advanced the performance of image synthesis and face translation, only a few works in the literature provide region-based style encoding and translation. We propose in this paper a region-wise normalization framework for region-level face translation. While per-region style is encoded using an available approach, we build a so-called RIN (region-wise normalization) block to individually inject the styles into per-region feature maps and then fuse them for the following convolution and upsampling. Both the shape and texture of different regions can thus be translated to various target styles. A region matching loss has also been proposed to significantly reduce the interference between regions during the translation process. Extensive experiments on three publicly available datasets, i.e. Morph, RaFD and CelebAMask-HQ, suggest that our approach demonstrates a large improvement over state-of-the-art methods like StarGAN, SEAN and FUNIT. Our approach has the further advantage of precise control over the regions to be translated. As a result, region-level expression changes and step-by-step make-up can be achieved. The video demo is available at this https URL.
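Region-wise style injection of the kind described can be sketched as AdaIN-style modulation under per-region masks: normalize the feature map, then scale and shift it with region-specific style parameters blended through the masks. Shapes and names below are assumptions:

```python
import torch

def region_wise_norm(feat, masks, gamma, beta, eps=1e-5):
    """feat: (B, C, H, W) features; masks: (B, R, H, W) soft region masks summing
    to 1 over R; gamma, beta: (B, R, C) per-region style parameters."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True)
    normed = (feat - mu) / (sigma + eps)
    # Broadcast each region's (gamma, beta) over its spatial support and blend.
    g = torch.einsum("brhw,brc->bchw", masks, gamma)
    b = torch.einsum("brhw,brc->bchw", masks, beta)
    return normed * (1 + g) + b
```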
18. A SLAM Map Restoration Algorithm Based on Submaps and an Undirected Connected Graph [PDF] Back to Contents
Zongqian Zhan, Wenjie Jian, Yihui Li, Xin Wang, Yang Yue
Abstract: Many visual simultaneous localization and mapping (SLAM) systems have been shown to be accurate and robust, and have real-time performance capabilities on both indoor and ground datasets. However, these methods can be problematic when dealing with aerial frames captured by a camera mounted on an unmanned aerial vehicle (UAV), because the flight height of the UAV can be difficult to control and is easily affected by the wind. To cope with the case of lost tracking, many visual SLAM systems employ a relocalization strategy. This involves the tracking thread continuing the online working by inspecting the connections between the subsequent new frames and the map generated before the tracking was lost. To solve the missing map problem, which is an issue in many applications, after the tracking is lost, based on monocular visual SLAM, we present a method of reconstructing a complete global map of UAV datasets by sequentially merging the submaps via the corresponding undirected connected graph. Specifically, submaps are repeatedly generated, from the initialization process to the place where the tracking is lost, and a corresponding undirected connected graph is built by considering these submaps as nodes and the common map points within two submaps as edges. The common map points are then determined by the bag-of-words (BoW) method, and the submaps are merged if they are found to be connected with the online map in the undirected connected graph. To demonstrate the performance of the proposed method, we first investigated the performance on a UAV dataset, and the experimental results showed that, in the case of several tracking failures, the integrity of the mapping was significantly better than that of the current mainstream SLAM method.
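A minimal sketch of the submap-merging idea, under the assumption that BoW matching has already reduced each submap to a set of shared map-point IDs: submaps become nodes, an edge is added when two submaps share enough common map points, and the global map is the union over a connected component. The `min_shared` threshold and the set-based matching are illustrative stand-ins for the paper's BoW pipeline.

    from collections import defaultdict

    def build_submap_graph(submaps, min_shared=10):
        """submaps: dict submap_id -> set of map-point ids (stand-in for BoW matches)."""
        edges = defaultdict(set)
        ids = list(submaps)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if len(submaps[a] & submaps[b]) >= min_shared:
                    edges[a].add(b); edges[b].add(a)
        return edges

    def merge_connected(submaps, edges, start):
        """Return the union of map points over the component containing `start`."""
        seen, stack, merged = {start}, [start], set()
        while stack:
            u = stack.pop()
            merged |= submaps[u]
            for v in edges[u]:
                if v not in seen:
                    seen.add(v); stack.append(v)
        return merged

    submaps = {0: set(range(0, 50)), 1: set(range(40, 90)), 2: set(range(200, 250))}
    edges = build_submap_graph(submaps)
    print(sorted(edges))                            # [0, 1]: submap 2 stays disconnected
    print(len(merge_connected(submaps, edges, 0)))  # 90 points in the merged map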
19. Pooling Regularized Graph Neural Network for fMRI Biomarker Analysis [PDF] 返回目录
Xiaoxiao Li, Yuan Zhou, Nicha C. Dvornek, Muhan Zhang, Juntang Zhuang, Pamela Ventola, James S Duncan
Abstract: Understanding how certain brain regions relate to a specific neurological disorder has been an important area of neuroimaging research. A promising approach to identifying the salient regions is using Graph Neural Networks (GNNs), which can be used to analyze graph-structured data, e.g., brain networks constructed by functional magnetic resonance imaging (fMRI). We propose an interpretable GNN framework with a novel salient-region selection mechanism to determine neurological brain biomarkers associated with disorders. Specifically, we design novel regularized pooling layers that highlight salient regions of interest (ROIs) so that we can infer which ROIs are important for identifying a certain disease, based on the node pooling scores calculated by the pooling layers. Our proposed framework, Pooling Regularized-GNN (PR-GNN), encourages reasonable ROI selection and provides flexibility to preserve either individual- or group-level patterns. We apply the PR-GNN framework on a Biopoint Autism Spectrum Disorder (ASD) fMRI dataset. We investigate different choices of the hyperparameters and show that PR-GNN outperforms baseline methods in terms of classification accuracy. The salient ROI detection results show high correspondence with previous neuroimaging-derived biomarkers for ASD.
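The node-pooling idea can be illustrated with a toy top-k pooling layer: a learned vector scores each ROI, the k most salient ROIs are kept and their features gated by the scores, and a regularizer pushes kept and dropped scores apart. The sigmoid scoring, the top-k rule and the particular regularizer below are simplified assumptions, not the paper's exact losses.

    import numpy as np

    def regularized_topk_pool(x, w, k):
        """Hypothetical pooling step: score ROIs, keep top-k, gate features.

        x: (N, F) node features, one row per ROI
        w: (F,) learnable scoring vector
        """
        s = 1.0 / (1.0 + np.exp(-(x @ w)))      # sigmoid pooling scores in (0, 1)
        keep = np.argsort(-s)[:k]               # indices of the k most salient ROIs
        pooled = x[keep] * s[keep, None]        # gate kept features by their scores
        # toy score regularizer: push kept scores to 1 and dropped scores to 0,
        # standing in for the paper's pooling regularizers
        drop = np.setdiff1d(np.arange(len(s)), keep)
        reg = np.mean((1 - s[keep]) ** 2) + np.mean(s[drop] ** 2)
        return pooled, keep, reg

    x = np.random.randn(90, 16)                  # e.g. 90 ROIs of a brain atlas
    w = np.random.randn(16)
    pooled, keep, reg = regularized_topk_pool(x, w, k=20)
    print(pooled.shape, keep[:5], round(reg, 3))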
20. Composer Style Classification of Piano Sheet Music Images Using Language Model Pretraining [PDF] 返回目录
TJ Tsai, Kevin Ji
Abstract: This paper studies composer style classification of piano sheet music images. Previous approaches to the composer classification task have been limited by a scarcity of data. We address this issue in two ways: (1) we recast the problem to be based on raw sheet music images rather than a symbolic music format, and (2) we propose an approach that can be trained on unlabeled data. Our approach first converts the sheet music image into a sequence of musical "words" based on the bootleg feature representation, and then feeds the sequence into a text classifier. We show that it is possible to significantly improve classifier performance by first training a language model on a set of unlabeled data, initializing the classifier with the pretrained language model weights, and then finetuning the classifier on a small amount of labeled data. We train AWD-LSTM, GPT-2, and RoBERTa language models on all piano sheet music images in IMSLP. We find that transformer-based architectures outperform CNN and LSTM models, and pretraining boosts classification accuracy for the GPT-2 model from 46\% to 70\% on a 9-way classification task. The trained model can also be used as a feature extractor that projects piano sheet music into a feature space that characterizes compositional style.
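The pretrain-then-finetune recipe can be sketched end to end with stand-ins: sheet-music images become sequences of discrete "words" (hashed binary bootleg-feature columns), an unsupervised step learns word embeddings from unlabeled sequences (a co-occurrence factorization standing in for AWD-LSTM/GPT-2/RoBERTa pretraining), and a downstream classifier consumes the resulting document features. The 62-row binary score and the 997-word vocabulary are assumptions made for illustration.

    import numpy as np

    def bootleg_to_words(bootleg):
        """Each binary bootleg-score column becomes one discrete 'word';
        hashing to a small vocabulary is an illustrative simplification."""
        return [int.from_bytes(col.tobytes(), "little") % 997 for col in bootleg.T]

    def pretrain_embeddings(corpora, vocab=997, dim=16):
        """Stand-in for language-model pretraining: factorize a word
        co-occurrence matrix gathered from unlabeled sequences."""
        cooc = np.zeros((vocab, vocab))
        for words in corpora:
            for a, b in zip(words, words[1:]):
                cooc[a, b] += 1
        u, s, _ = np.linalg.svd(np.log1p(cooc), full_matrices=False)
        return u[:, :dim] * s[:dim]               # (vocab, dim) word vectors

    def doc_feature(words, emb):
        return emb[words].mean(axis=0)            # input for a style classifier

    rng = np.random.default_rng(0)
    scores = [rng.integers(0, 2, size=(62, 200), dtype=np.uint8) for _ in range(20)]
    corpora = [bootleg_to_words(b) for b in scores]       # unlabeled 'texts'
    emb = pretrain_embeddings(corpora)
    print(doc_feature(corpora[0], emb).shape)             # (16,)

The final step, not shown, would fit any classifier (e.g. logistic regression) on labeled document features, mirroring the paper's finetuning stage.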
21. Camera-Based Piano Sheet Music Identification [PDF] 返回目录
Daniel Yang, TJ Tsai
Abstract: This paper presents a method for large-scale retrieval of piano sheet music images. Our work differs from previous studies on sheet music retrieval in two ways. First, we investigate the problem at a much larger scale than previous studies, using all solo piano sheet music images in the entire IMSLP dataset as a searchable database. Second, we use cell phone images of sheet music as our input queries, which lends itself to a practical, user-facing application. We show that a previously proposed fingerprinting method for sheet music retrieval is far too slow for a real-time application, and we diagnose its shortcomings. We propose a novel hashing scheme called dynamic n-gram fingerprinting that significantly reduces runtime while simultaneously boosting retrieval accuracy. In experiments on IMSLP data, our proposed method achieves a mean reciprocal rank of 0.85 and an average runtime of 0.98 seconds per query.
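One plausible reading of "dynamic n-gram fingerprinting" is that each n-gram is grown until it becomes rare enough in the database to be discriminative; the sketch below implements that reading and is an assumption about the scheme, not the paper's exact algorithm.

    def dynamic_ngrams(words, counts, max_df=5, max_n=4):
        """Illustrative 'dynamic n-gram' fingerprints: grow each n-gram until it
        occurs at most `max_df` times in the database (so it is discriminative),
        up to length `max_n`. `counts` maps n-gram tuples to database frequency;
        grams absent from `counts` are treated as unseen, hence rare."""
        prints = []
        for i in range(len(words)):
            for n in range(1, max_n + 1):
                gram = tuple(words[i:i + n])
                if len(gram) < n:
                    break
                if counts.get(gram, 0) <= max_df:   # rare enough -> good fingerprint
                    prints.append(gram)
                    break
        return prints

    counts = {(7,): 120, (7, 3): 30, (7, 3, 9): 2}   # toy database statistics
    print(dynamic_ngrams([7, 3, 9, 7], counts))      # [(7, 3, 9), (3,), (9,)]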
22. Realistic Video Summarization through VISIOCITY: A New Benchmark and Evaluation Framework [PDF] 返回目录
Vishal Kaushal, Suraj Kothawade, Rishabh Iyer, Ganesh Ramakrishnan
Abstract: Automatic video summarization is still an unsolved problem due to several challenges. We take steps towards making automatic video summarization more realistic by addressing them. Firstly, the currently available datasets either have very short videos or have few long videos of only a particular type. We introduce a new benchmarking dataset, VISIOCITY, which comprises longer videos across six different categories with dense concept annotations capable of supporting different flavors of video summarization; it can also be used for other vision problems. Secondly, for long videos, human reference summaries are difficult to obtain. We present a novel recipe based on Pareto optimality to automatically generate multiple reference summaries from the indirect ground truth present in VISIOCITY, and we show that these summaries are on par with human summaries. Thirdly, we demonstrate that in the presence of multiple ground truth summaries (due to the highly subjective nature of the task), learning from a single combined ground truth summary using a single loss function is not a good idea. We propose a simple recipe, VISIOCITY-SUM, to enhance an existing model using a combination of losses, and demonstrate that it beats the current state-of-the-art techniques when tested on VISIOCITY. We also show that a single measure for evaluating a summary, as is the current typical practice, falls short; we instead propose a framework for quantitative assessment of summary quality that is closer to human judgment than a single measure such as F1. Finally, we report the performance of a few representative video summarization techniques on VISIOCITY under various measures, bring out the limitations of these techniques and/or of the assessment mechanism in modeling human judgment, and demonstrate the effectiveness of our evaluation framework in doing so.
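The Pareto-optimality idea for building multiple reference summaries can be illustrated directly: given candidate summaries scored on several indirect ground-truth measures, keep exactly those candidates that no other candidate dominates on all measures. The three toy measures below are assumptions.

    def pareto_front(candidates):
        """Keep candidates not dominated on every criterion.

        candidates: list of (name, scores) where scores is a tuple of measures
        (e.g. importance, diversity, continuity), higher is better."""
        def dominates(a, b):
            return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
        front = []
        for name, s in candidates:
            if not any(dominates(t, s) for _, t in candidates):
                front.append(name)
        return front

    cands = [("A", (0.9, 0.2, 0.5)), ("B", (0.6, 0.8, 0.7)),
             ("C", (0.5, 0.7, 0.6)), ("D", (0.9, 0.2, 0.4))]
    print(pareto_front(cands))   # ['A', 'B'] -- C is dominated by B, D by A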
23. BiTraP: Bi-directional Pedestrian Trajectory Prediction with Multi-modal Goal Estimation [PDF] 返回目录
Yu Yao, Ella Atkins, Matthew Johnson-Roberson, Ram Vasudevan, Xiaoxiao Du
Abstract: Pedestrian trajectory prediction is an essential task in robotic applications such as autonomous driving and robot navigation. State-of-the-art trajectory predictors use a conditional variational autoencoder (CVAE) with recurrent neural networks (RNNs) to encode observed trajectories and decode multi-modal future trajectories. This process can suffer from accumulated errors over long prediction horizons (>=2 seconds). This paper presents BiTraP, a goal-conditioned bi-directional multi-modal trajectory prediction method based on the CVAE. BiTraP estimates the goal (end-point) of trajectories and introduces a novel bi-directional decoder to improve longer-term trajectory prediction accuracy. Extensive experiments show that BiTraP generalizes to both first-person view (FPV) and bird's-eye view (BEV) scenarios and outperforms state-of-the-art results by ~10-50%. We also show that different choices of non-parametric versus parametric target models in the CVAE directly influence the predicted multi-modal trajectory distributions. These results provide guidance on trajectory predictor design for robotic applications such as collision avoidance and navigation systems.
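A toy version of goal-conditioned bi-directional decoding: one pass rolls forward from the last observed position, a second pass rolls backward from the estimated goal, and a time-dependent blend anchors the late steps to the goal. The linear rollouts and the blending weights stand in for the paper's RNN decoder and CVAE sampling.

    import numpy as np

    def bidirectional_decode(p0, goal, T=12):
        """Toy goal-conditioned bi-directional decoding: a forward pass rolls out
        from the current position p0, a backward pass rolls out from the predicted
        goal, and the two are blended so late steps are anchored to the goal."""
        fwd = np.array([p0 + (goal - p0) * t / T + 0.05 * np.random.randn(2)
                        for t in range(1, T + 1)])          # noisy forward stand-in
        bwd = np.array([goal - (goal - p0) * (T - t) / T    # backward from goal
                        for t in range(1, T + 1)])
        w = np.linspace(0, 1, T)[:, None]                   # trust the goal more later
        return (1 - w) * fwd + w * bwd

    traj = bidirectional_decode(np.zeros(2), np.array([4.0, 2.0]))
    print(traj[0], traj[-1])   # starts near the origin, ends exactly at the goal

Multi-modality would come from decoding once per sampled goal, which is what the CVAE's latent variable provides in the actual method.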
24. Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking [PDF] 返回目录
Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yanwei Fu
Abstract: Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm to conduct object detection, feature extraction and data association separately, or have two of the three subtasks integrated to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all three subtasks into an end-to-end solution (the first, as far as we know). It chains paired bounding-box regression results estimated from overlapping nodes, where each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module). The two major novelties, the chained structure and the paired attentive regression, make CTracker simple, fast and effective, setting new MOTA records on the MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively) without relying on any extra training data. The source code of CTracker can be found at this http URL.
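The chaining step can be sketched without the detection network: each node outputs box pairs for two adjacent frames, and pairs from consecutive nodes are linked when their boxes in the shared frame overlap. The IoU threshold below is an illustrative choice.

    def iou(a, b):
        """a, b: boxes as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / float(area(a) + area(b) - inter + 1e-9)

    def chain_pairs(pairs_t, pairs_t1, thr=0.5):
        """Each node outputs box pairs (box_t, box_t1) for two adjacent frames.
        Chaining: match pair i from node (t, t+1) with pair j from node
        (t+1, t+2) when their boxes in the shared frame t+1 overlap enough."""
        links = []
        for i, (_, b1) in enumerate(pairs_t):
            for j, (b2, _) in enumerate(pairs_t1):
                if iou(b1, b2) >= thr:
                    links.append((i, j))
        return links

    pairs_t  = [((0, 0, 10, 10), (2, 0, 12, 10))]     # one target, moving right
    pairs_t1 = [((2, 1, 12, 11), (4, 1, 14, 11))]
    print(chain_pairs(pairs_t, pairs_t1))              # [(0, 0)] -> same identity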
25. Compare and Select: Video Summarization with Multi-Agent Reinforcement Learning [PDF] 返回目录
Tianyu Liu
Abstract: Video summarization aims at generating concise video summaries from lengthy videos, to achieve a better user viewing experience. Due to the subjectivity of the task, purely supervised methods for video summarization may propagate the inherent errors of the annotations. To address this subjectivity problem, we study the general user summarization process: general users usually watch the whole video, compare interesting clips and select some clips to form a final summary. Inspired by this behaviour, we formulate the summarization process as multiple sequential decision-making processes, and propose the Comparison-Selection Network (CoSNet), based on multi-agent reinforcement learning. Each agent focuses on a video clip and constantly changes its focus during the iterations, and the final focus clips of all agents form the summary. The comparison network provides each agent with the visual features of clips and the chronological features from the past round, while the agent's selection network makes decisions on changing its focus clip. A specially designed unsupervised reward and a supervised reward together contribute to the policy advancement, each comprising local and global parts. Extensive experiments on two benchmark datasets show that CoSNet outperforms state-of-the-art unsupervised methods with the unsupervised reward and surpasses most supervised methods with the complete reward.
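A toy rendering of the comparison-and-selection loop: each agent holds one focus clip, repeatedly compares it against proposed alternatives (fixed clip scores stand in for the learned comparison network), and keeps the better unoccupied clip; the final focus clips form the summary.

    import numpy as np

    def cosnet_style_selection(scores, n_agents=3, iters=20, seed=0):
        """Toy comparison-and-selection loop: each agent holds one focus clip and
        repeatedly moves to a higher-scoring clip that no other agent occupies;
        the final focus clips form the summary. `scores` stands in for the
        learned comparison network's per-clip values."""
        rng = np.random.default_rng(seed)
        n = len(scores)
        focus = list(rng.choice(n, size=n_agents, replace=False))
        for _ in range(iters):
            for a in range(n_agents):
                cand = int(rng.integers(n))          # propose a new focus clip
                if cand not in focus and scores[cand] > scores[focus[a]]:
                    focus[a] = cand                  # selection: keep the better clip
        return sorted(int(f) for f in focus)

    scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.05, 0.4])
    print(cosnet_style_selection(scores))   # tends toward the high-scoring clips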
26. Online Visual Place Recognition via Saliency Re-identification [PDF] 返回目录
Han Wang, Chen Wang, Lihua Xie
Abstract: As an essential component of visual simultaneous localization and mapping (SLAM), place recognition is crucial for robot navigation and autonomous driving. Existing methods often formulate visual place recognition as feature matching, which is computationally expensive for many robotic applications with limited computing power, e.g., autonomous driving and cleaning robots. Inspired by the fact that human beings always recognize a place by remembering salient regions or landmarks that are more attractive or interesting than others, we formulate visual place recognition as saliency re-identification. Meanwhile, we propose to perform both saliency detection and re-identification in the frequency domain, in which all operations become element-wise. The experiments show that our proposed method achieves competitive accuracy and much higher speed than state-of-the-art feature-based methods. The proposed method is open-sourced and available at this https URL.
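The frequency-domain flavor of the method can be illustrated with a spectral-residual-style saliency map, a classic formulation assumed here for illustration: the spectrum's log-amplitude is compared against its smoothed version and the residual is transformed back to image space; the paper's exact operators may differ.

    import numpy as np

    def spectral_saliency(img):
        """Frequency-domain saliency in the spirit of spectral-residual methods."""
        f = np.fft.fft2(img)
        log_amp = np.log1p(np.abs(f))
        phase = np.angle(f)
        # spectral residual: log-amplitude minus its local average (3x3 box blur)
        k = np.ones((3, 3)) / 9.0
        pad = np.pad(log_amp, 1, mode="edge")
        smooth = sum(pad[i:i + log_amp.shape[0], j:j + log_amp.shape[1]] * k[i, j]
                     for i in range(3) for j in range(3))
        residual = log_amp - smooth
        sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
        return sal / sal.max()

    img = np.zeros((64, 64)); img[28:36, 28:36] = 1.0   # a salient square
    print(spectral_saliency(img).shape)

Place re-identification would then compare such saliency maps between the query frame and stored keyframes, e.g., by correlation.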
27. A Deep Learning Framework for Generation and Analysis of Driving Scenario Trajectories [PDF] 返回目录
Andreas Demetriou, Henrik Alfsvåg, Sadegh Rahrovani, Morteza Haghir Chehreghani
Abstract: We propose a unified deep learning framework for the generation and analysis of driving scenario trajectories, and validate its effectiveness in a principled way. In order to model and generate scenarios of trajectories with different lengths, we develop two approaches. First, we adapt the Recurrent Conditional Generative Adversarial Network (RC-GAN) by conditioning on the length of the trajectories. This gives us the flexibility to generate variable-length driving trajectories, a desirable feature for scenario test-case generation in the verification of self-driving cars. Second, we develop an architecture based on a recurrent autoencoder with GANs in order to obviate the variable-length issue, wherein we train a GAN to learn/generate the latent representations of original trajectories. In this approach, we train an integrated feed-forward neural network to estimate the length of the trajectories so as to be able to bring them back from the latent-space representation. In addition to trajectory generation, we employ the trained autoencoder as a feature extractor for the purposes of clustering and anomaly detection, in order to obtain further insights into the collected scenario dataset. We experimentally investigate the performance of the proposed framework on real-world scenario trajectories obtained from in-field data collection.
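Length conditioning can be sketched by concatenating the normalized target length to every generation step and masking outputs beyond that length; the random "RNN outputs" below are placeholders for the recurrent generator, and all names are hypothetical.

    import numpy as np

    def toy_length_conditioned_generator(z, length, max_len=20):
        """Stand-in for an RC-GAN generator conditioned on trajectory length:
        the condition is fed to every step, and outputs beyond `length` are
        masked out, yielding variable-length trajectories."""
        rng = np.random.default_rng(int(abs(z.sum()) * 1e6) % 2**32)
        cond = np.full((max_len, 1), length / max_len)          # broadcast condition
        steps = rng.standard_normal((max_len, 2)) * 0.1 + cond  # fake RNN outputs
        traj = np.cumsum(steps, axis=0)                         # positions from deltas
        mask = (np.arange(max_len) < length)[:, None]           # valid-step mask
        return traj * mask, mask

    traj, mask = toy_length_conditioned_generator(np.random.randn(8), length=12)
    print(traj.shape, int(mask.sum()))   # (20, 2) padded, 12 valid steps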
28. $S^3$Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data [PDF] 返回目录
Bin Cheng, Inderjot Singh Saggu, Raunak Shah, Gaurav Bansal, Dinesh Bharadia
Abstract: Solving depth estimation with monocular cameras enables the possibility of widespread use of cameras as low-cost depth estimation sensors in applications such as autonomous driving and robotics. However, learning such a scalable depth estimation model would require a lot of labeled data which is expensive to collect. There are two popular existing approaches which do not require annotated depth maps: (i) using labeled synthetic and unlabeled real data in an adversarial framework to predict more accurate depth, and (ii) unsupervised models which exploit geometric structure across space and time in monocular video frames. Ideally, we would like to leverage features provided by both approaches as they complement each other; however, existing methods do not adequately exploit these additive benefits. We present $S^3$Net, a self-supervised framework which combines these complementary features: we use synthetic and real-world images for training while exploiting geometric, temporal, as well as semantic constraints. Our novel consolidated architecture provides a new state-of-the-art in self-supervised depth estimation using monocular videos. We present a unique way to train this self-supervised framework, and achieve (i) more than $15\%$ improvement over previous synthetic supervised approaches that use domain adaptation and (ii) more than $10\%$ improvement over previous self-supervised approaches which exploit geometric constraints from the real data.
29. Families In Wild Multimedia (FIW-MM): A Multi-Modal Database for Recognizing Kinship [PDF] 返回目录
Joseph P. Robinson, Zaid Khan, Yu Yin, Ming Shao, Yun Fu
Abstract: Recognizing kinship - a soft biometric with vast applications - in photos has piqued the interest of many machine vision researchers. The large-scale Families In the Wild (FIW) database promoted the problem by supporting annual kinship-based vision challenges that saw consistent performance improvements. We have now begun to approach performance levels for image-based systems acceptable for practical use - something unforeseeable a decade ago. However, biometric systems can benefit from multi-modal perspectives, as information contained in multimedia can add to and complement that of still images. Thus, we aim to narrow the gap from research-to-reality by extending FIW with multimedia data (i.e., video, audio, and contextual transcripts). Specifically, we introduce the first large-scale dataset for recognizing kinship in multimedia, the FIW in Multimedia (FIW-MM) database. We utilize automated machinery to collect, annotate, and prepare the data with minimal human input and no financial cost. This large-scale, multimedia corpus allows problem formulations to follow more realistic template-based protocols. We show significant improvements in benchmarks for multiple kin-based tasks when additional media-types are added. Experiments provide insights by highlighting edge cases to inspire future research and areas of improvement. Emphasis is put on short and long-term research directions, with the overarching intent to increase the potential of systems built to automatically detect kinship in multimedia. Furthermore, we expect a broader range of researchers with recognition tasks, generative modeling, speech understanding, and nature-based narratives.
30. Unsupervised Learning of Particle Image Velocimetry [PDF] 返回目录
Mingrui Zhang, Matthew D. Piggott
Abstract: Particle Image Velocimetry (PIV) is a classical flow estimation problem which is widely considered and utilised, especially as a diagnostic tool in experimental fluid dynamics and the remote sensing of environmental flows. Recently, the development of deep learning based methods has inspired new approaches to tackle the PIV problem. These supervised learning based methods are driven by large volumes of data with ground truth training information. However, it is difficult to collect reliable ground truth data in large-scale, real-world scenarios. Although synthetic datasets can be used as alternatives, the gap between the training set-ups and real-world scenarios limits applicability. We present here what we believe to be the first work which takes an unsupervised learning based approach to tackle PIV problems. The proposed approach is inspired by classic optical flow methods. Instead of using ground truth data, we make use of photometric loss between two consecutive image frames, consistency loss in bidirectional flow estimates and spatial smoothness loss to construct the total unsupervised loss function. The approach shows significant potential and advantages for fluid flow estimation. Results presented here demonstrate that our method outputs competitive results compared with classical PIV methods as well as supervised learning based methods for a broad PIV dataset, and even outperforms these existing approaches in some difficult flow cases. Codes and trained models are available at this https URL.
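The total unsupervised loss described above combines three terms that can all be written down compactly: a photometric term comparing the first frame with the warped second frame, a forward-backward consistency term, and a spatial smoothness term. The nearest-neighbour warp and the weights a, b, c below are simplifying assumptions.

    import numpy as np

    def warp(img, flow):
        """Nearest-neighbour backward warp of a 2D array by flow (H, W, 2)."""
        h, w = img.shape
        ys, xs = np.mgrid[0:h, 0:w]
        xs2 = np.clip((xs + flow[..., 0]).round().astype(int), 0, w - 1)
        ys2 = np.clip((ys + flow[..., 1]).round().astype(int), 0, h - 1)
        return img[ys2, xs2]

    def warp_flow(flow, by):
        return np.stack([warp(flow[..., k], by) for k in range(2)], axis=-1)

    def unsupervised_piv_loss(im1, im2, f12, f21, a=1.0, b=0.1, c=0.1):
        photo = np.abs(warp(im2, f12) - im1).mean()        # photometric loss
        consist = np.abs(f12 + warp_flow(f21, f12)).mean() # fwd-bwd consistency
        smooth = (np.abs(np.diff(f12, axis=0)).mean()      # spatial smoothness
                  + np.abs(np.diff(f12, axis=1)).mean())
        return a * photo + b * consist + c * smooth

    im1 = np.random.rand(32, 32)
    im2 = np.roll(im1, 2, axis=1)                   # frame 2: shift by +2 in x
    f12 = np.zeros((32, 32, 2)); f12[..., 0] = 2.0  # the true forward flow
    f21 = -f12
    print(round(unsupervised_piv_loss(im1, im2, f12, f21), 4))  # small: only
    # image-boundary pixels contribute to the photometric term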
31. Color-complexity enabled exhaustive color-dots identification and spatial patterns testing in images [PDF] 返回目录
Shuting Liao, Li-Yu Liu, Ting-An Chen, Kuang-Yu Chen, Fushing Hsieh
Abstract: Targeted color-dots with varying shapes and sizes in images are first exhaustively identified, and then their multiscale 2D geometric patterns are extracted for testing spatial uniformness in a progressive fashion. Based on color theory in physics, we develop a new color-identification algorithm relying on highly associative relations among the three color coordinates: RGB or HSV. Such high associations critically imply low color-complexity of a color image, and make exhaustive identification of targeted color-dots of all shapes and sizes possible. Across heterogeneous shaded regions and lighting conditions, our algorithm is shown to be robust, practical and efficient compared with the popular Contour and OpenCV approaches. From all identified color pixels, we form color-dots as individually connected networks with shapes and sizes. We construct minimum spanning trees (MSTs) as spatial geometries of dot-collectives at various size-scales. Given a size-scale, the distribution of distances between immediate neighbors in the observed MST is extracted, as are those of many MSTs simulated under the spatial-uniformness assumption. We devise a new algorithm for testing 2D spatial uniformness based on a hierarchical clustering tree over all the MSTs involved. Our developments are illustrated on images obtained by mimicking chemical spraying via drone in precision agriculture.
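The MST-based uniformness test can be sketched as a Monte-Carlo test: compute the MST edge-length distribution of the observed dots, compare its summary statistic against those of uniformly re-scattered point sets, and report how extreme the observation is. Using the mean edge length is a simplification of the paper's hierarchical-clustering construction.

    import numpy as np

    def mst_edge_lengths(pts):
        """Prim's algorithm; returns the N-1 MST edge lengths for points (N, 2)."""
        n = len(pts)
        d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        in_tree = np.zeros(n, bool); in_tree[0] = True
        best = d[0].copy()
        lengths = []
        for _ in range(n - 1):
            best[in_tree] = np.inf
            j = int(np.argmin(best))       # nearest point outside the tree
            lengths.append(best[j])
            in_tree[j] = True
            best = np.minimum(best, d[j])  # update distances to the tree
        return np.array(lengths)

    def uniformness_pvalue(pts, sims=200, seed=0):
        """Monte-Carlo test: compare the observed mean MST edge length with the
        distribution obtained from uniformly re-scattered points."""
        rng = np.random.default_rng(seed)
        obs = mst_edge_lengths(pts).mean()
        null = np.array([mst_edge_lengths(rng.random(pts.shape)).mean()
                         for _ in range(sims)])
        extreme = np.abs(null - null.mean()) >= abs(obs - null.mean())
        return (extreme.sum() + 1) / (sims + 1)

    clustered = np.random.default_rng(1).normal(0.5, 0.05, size=(60, 2))
    print(uniformness_pvalue(clustered))   # small p-value: not spatially uniform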
32. Automated Intracranial Artery Labeling using a Graph Neural Network and Hierarchical Refinement [PDF] 返回目录
Li Chen, Thomas Hatsukami, Jenq-Neng Hwang, Chun Yuan
Abstract: Automatically labeling intracranial arteries (ICA) with their anatomical names is beneficial for feature extraction and detailed analysis of intracranial vascular structures. There are significant variations in the ICA due to natural and pathological causes, making it challenging for automated labeling. However, the existing public dataset for evaluation of anatomical labeling is limited. We construct a comprehensive dataset with 729 Magnetic Resonance Angiography scans and propose a Graph Neural Network (GNN) method to label arteries by classifying types of nodes and edges in an attributed relational graph. In addition, a hierarchical refinement framework is developed for further improving the GNN outputs to incorporate structural and relational knowledge about the ICA. Our method achieved a node labeling accuracy of 97.5%, and 63.8% of scans were correctly labeled for all Circle of Willis nodes, on a testing set of 105 scans with both healthy and diseased subjects. This is a significant improvement over available state-of-the-art methods. Automatic artery labeling is promising to minimize manual effort in characterizing the complicated ICA networks and provides valuable information for the identification of geometric risk factors of vascular disease. Our code and dataset are available at this https URL.
33. Towards 3D Visualization of Video from Frames [PDF] 返回目录
Slimane Larabi
Abstract: We explain theoretically how to reconstruct a 3D scene from successive frames in order to see the video in 3D. To do this, features associated with moving rigid objects in 3D are extracted in frames and matched. The vanishing point computed in a frame, corresponding to the direction of the moving object, is used for 3D positioning of the moving object's 3D structure. Initial experiments have been conducted, and the obtained results are shown and publicly available. They demonstrate the feasibility of our method. We conclude the paper with future work aimed at improving this method by taking into account non-rigid objects and the case of a moving camera.
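The vanishing-point step can be sketched with plain geometry: each matched feature pair defines a motion line, and the vanishing point of a translating object's motion is recovered as the least-squares intersection of those lines. The matching itself and the noise-free construction below are simplifying assumptions.

    import numpy as np

    def vanishing_point(p_prev, p_curr):
        """Least-squares intersection of the motion lines through matched
        features: each match (p_prev -> p_curr) defines a line whose direction
        is the image-plane displacement; for a translating rigid object these
        lines meet near the vanishing point of the motion direction."""
        d = p_curr - p_prev                        # per-feature motion direction
        n = np.stack([-d[:, 1], d[:, 0]], axis=1)  # normals to the motion lines
        n /= np.linalg.norm(n, axis=1, keepdims=True)
        dot = (n * p_curr).sum(axis=1)             # n_i . p_i for a point on line i
        return np.linalg.solve(n.T @ n, n.T @ dot) # min_v sum_i (n_i.(v-p_i))^2

    rng = np.random.default_rng(0)
    v_true = np.array([100.0, 50.0])
    p_prev = rng.uniform(0, 60, size=(30, 2))
    p_curr = p_prev + 0.1 * (v_true - p_prev)      # features flow toward v_true
    print(vanishing_point(p_prev, p_curr))         # ~ [100. 50.]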
34. Learning from Scale-Invariant Examples for Domain Adaptation in Semantic Segmentation [PDF] 返回目录
M.Naseer Subhani, Mohsen Ali
Abstract: Self-supervised learning approaches for unsupervised domain adaptation (UDA) of semantic segmentation models suffer from the challenge of predicting and selecting reasonably good-quality pseudo-labels. In this paper, we propose a novel approach that exploits the scale-invariance property of the semantic segmentation model for self-supervised domain adaptation. Our algorithm is based on the reasonable assumption that, in general, regardless of the size of the object and stuff (given context), the semantic labeling should be unchanged. We show that this constraint is violated over the images of the target domain, and hence can be used to transfer labels between differently scaled patches. Specifically, we show that the semantic segmentation model produces output with high entropy when presented with scaled-up patches of the target domain, in comparison to when presented with original-size images. These scale-invariant examples are extracted from the most confident images of the target domain. A dynamic class-specific entropy thresholding mechanism is presented to filter out unreliable pseudo-labels. Furthermore, we incorporate the focal loss to tackle the problem of class imbalance in self-supervised learning. Extensive experiments have been performed, and the results indicate that, by exploiting the scale-invariant labeling, we outperform existing self-supervised state-of-the-art domain adaptation methods. Specifically, we achieve leads of 1.3% and 3.8% for GTA5 to Cityscapes and SYNTHIA to Cityscapes, respectively, with a VGG16-FCN8 baseline network.
35. Cassandra: Detecting Trojaned Networks from Adversarial Perturbations [PDF] 返回目录
Xiaoyu Zhang, Ajmal Mian, Rohit Gupta, Nazanin Rahnavard, Mubarak Shah
Abstract: Deep neural networks are being widely deployed for many critical tasks due to their high classification accuracy. In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models. These malicious behaviors can be triggered at the adversary's will and hence pose a serious threat to the widespread deployment of deep models. We propose a method to verify if a pre-trained model is Trojaned or benign. Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients. Inserting backdoors into a network alters its decision boundaries, which are effectively encoded in its adversarial perturbations. We train a two-stream network for Trojan detection from its global ($L_\infty$ and $L_2$ bounded) perturbations and the localized region of high energy within each perturbation. The former encodes decision boundaries of the network and the latter encodes the unknown trigger shape. We also propose an anomaly detection method to identify the target class in a Trojaned network. Our methods are invariant to the trigger type, trigger size, training data and network architecture. We evaluate our methods on MNIST, NIST-Round0 and NIST-Round1 datasets, with up to 1,000 pre-trained models making this the largest study to date on Trojaned network detection, and achieve over 92\% detection accuracy to set the new state-of-the-art.
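To illustrate the fingerprint idea, the sketch below crafts an $L_\infty$-bounded perturbation from network gradients with PGD-style steps; the budget, step size, and iteration count are guesses, and the two-stream detector that consumes such perturbations is not shown.

```python
import torch

def linf_fingerprint(model, images, target, eps=8/255, steps=20, lr=0.01):
    """A crude stand-in for the adversarial-fingerprint extraction: a single
    universal perturbation pushing all inputs toward `target`."""
    delta = torch.zeros_like(images[:1], requires_grad=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(model(images + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()  # descend toward the target class
            delta.clamp_(-eps, eps)          # enforce the L_inf budget
            delta.grad.zero_()
    return delta.detach()
```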
36. A Convolutional Neural Network for gaze preference detection: A potential tool for diagnostics of autism spectrum disorder in children [PDF] 返回目录
Dennis Núñez Fernández, Franklin Barrientos Porras, Robert H. Gilman, Macarena Vittet Mondonedo, Patricia Sheen, Mirko Zimic
Abstract: Early diagnosis of autism spectrum disorder (ASD) is known to improve the quality of life of affected individuals. However, diagnosis is often delayed even in wealthier countries including the US, largely due to the fact that gold standard diagnostic tools such as the Autism Diagnostic Observation Schedule (ADOS) and the Autism Diagnostic Interview-Revised (ADI-R) are time-consuming and require expertise to administer. This trend is even more pronounced in lower-resource settings due to a lack of trained experts. As a result, alternative, less technical methods that leverage the unique ways in which children with ASD react to visual stimulation in a controlled environment have been developed to help facilitate early diagnosis. Previous studies have shown that, when exposed to a video that presents both social and abstract scenes side by side, a child with ASD will focus their attention towards the abstract images on the screen to a greater extent than a child without ASD. Such differential responses make it possible to implement an algorithm for the rapid diagnosis of ASD based on eye tracking against different visual stimuli. Here we propose a convolutional neural network (CNN) algorithm for gaze prediction using images extracted from a one-minute stimulus video. Our model achieved a high accuracy rate and robustness for prediction of gaze direction with independent persons and employing a different camera than the one used during testing. In addition to this, the proposed algorithm achieves a fast response time, providing a near real-time evaluation of ASD. Thus, by applying the proposed method, we could significantly reduce the diagnosis time and facilitate the diagnosis of ASD in low-resource regions.
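The per-frame classification step could look roughly like the sketch below; the 64x64 grayscale face crops, the channel counts, and the three-way left/right/centre labeling are assumptions for illustration only.

```python
import torch.nn as nn

gaze_cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),
    nn.Linear(128, 3),  # per-frame logits, aggregated over the 1-minute video
)
```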
37. AiR: Attention with Reasoning Capability [PDF] 返回目录
Shi Chen, Ming Jiang, Jinhui Yang, Qi Zhao
Abstract: While attention has been an increasingly popular component in deep neural networks to both interpret and boost performance of models, little work has examined how attention progresses to accomplish a task and whether it is reasonable. In this work, we propose an Attention with Reasoning capability (AiR) framework that uses attention to understand and improve the process leading to task outcomes. We first define an evaluation metric based on a sequence of atomic reasoning operations, enabling quantitative measurement of attention that considers the reasoning process. We then collect human eye-tracking and answer correctness data, and analyze various machine and human attentions on their reasoning capability and how they impact task performance. Furthermore, we propose a supervision method to jointly and progressively optimize attention, reasoning, and task performance so that models learn to look at regions of interests by following a reasoning process. We demonstrate the effectiveness of the proposed framework in analyzing and modeling attention with better reasoning capability and task performance. The code and data are available at this https URL
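One way to picture the joint supervision of attention, reasoning, and task performance is a combined objective like the sketch below; the KL alignment term, the per-step map shapes, and the weight lam are hypothetical stand-ins, not the paper's formulation.

```python
import torch.nn.functional as F

def air_style_loss(task_logits, answers, att_maps, human_maps, lam=0.1):
    """Task loss plus an alignment term between model attention and human
    fixation maps. att_maps / human_maps: (steps, H*W) distributions, one
    per atomic reasoning operation (shapes and lam are illustrative)."""
    task = F.cross_entropy(task_logits, answers)
    align = F.kl_div(att_maps.clamp(min=1e-8).log(), human_maps,
                     reduction="batchmean")
    return task + lam * align
```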
38. Neural Network-based Reconstruction in Compressed Sensing MRI Without Fully-sampled Training Data [PDF] 返回目录
Alan Q. Wang, Adrian V. Dalca, Mert R. Sabuncu
Abstract: Compressed Sensing MRI (CS-MRI) has shown promise in reconstructing under-sampled MR images, offering the potential to reduce scan times. Classical techniques minimize a regularized least-squares cost function using an expensive iterative optimization procedure. Recently, deep learning models have been developed that model the iterative nature of classical techniques by unrolling iterations in a neural network. While exhibiting superior performance, these methods require large quantities of ground-truth images and have been shown to be non-robust to unseen data. In this paper, we explore a novel strategy to train an unrolled reconstruction network in an unsupervised fashion by adopting a loss function widely-used in classical optimization schemes. We demonstrate that this strategy achieves lower loss and is computationally cheap compared to classical optimization solvers while also exhibiting superior robustness compared to supervised models. Code is available at this https URL.
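A minimal sketch of training with a classical objective instead of ground-truth images: the network output is scored by k-space data consistency plus a total-variation prior. The single-coil Cartesian setup and the weight lam are assumptions; the paper's exact loss may differ.

```python
import torch

def unsupervised_recon_loss(x, y, mask, lam=1e-3):
    """x: (N, H, W) reconstruction, y: (N, H, W) measured k-space,
    mask: sampling pattern. No fully-sampled image is needed."""
    k = torch.fft.fft2(x)
    data_consistency = ((mask * k - y).abs() ** 2).mean()
    tv = (x[:, 1:, :] - x[:, :-1, :]).abs().mean() + \
         (x[:, :, 1:] - x[:, :, :-1]).abs().mean()
    return data_consistency + lam * tv
```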
39. Advancing Visual Specification of Code Requirements for Graphs [PDF] 返回目录
Dewi Yokelson
Abstract: Researchers in the humanities are among the many who are now exploring the world of big data. They have begun to use programming languages like Python or R and their corresponding libraries to manipulate large data sets and discover brand new insights. One of the major hurdles that still exists is incorporating visualizations of this data into their projects. Visualization libraries can be difficult to learn how to use, even for those with formal training. Yet these visualizations are crucial for recognizing themes and communicating results to not only other researchers, but also the general public. This paper focuses on producing meaningful visualizations of data using machine learning. We allow the user to visually specify their code requirements in order to lower the barrier for humanities researchers to learn how to program visualizations. We use a hybrid model, combining a neural network and optical character recognition to generate the code to create the visualization.
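As a toy illustration of the final stage only: once the hybrid neural/OCR model has produced a small specification, emitting plotting code can be simple templating. The spec keys and the pandas/matplotlib target below are hypothetical, not the paper's system.

```python
# Assumed output of the recognition stage for a hand-drawn bar chart
spec = {"kind": "bar", "x": "year", "y": "count", "title": "Letters per year"}

template = (
    "import matplotlib.pyplot as plt\n"
    "df.plot(kind={kind!r}, x={x!r}, y={y!r}, title={title!r})\n"
    "plt.show()\n"
)
print(template.format(**spec))  # code the researcher can paste into a script
```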
40. Force myography benchmark data for hand gesture recognition and transfer learning [PDF] 返回目录
Thomas Buhl Andersen, Rógvi Eliasen, Mikkel Jarlund, Bin Yang
Abstract: Force myography has recently gained increasing attention for hand gesture recognition tasks. However, there is a lack of publicly available benchmark data, with most existing studies collecting their own data often with custom hardware and for varying sets of gestures. This limits the ability to compare various algorithms, as well as the possibility for research to be done without first needing to collect data oneself. We contribute to the advancement of this field by making accessible a benchmark dataset collected using a commercially available sensor setup from 20 persons covering 18 unique gestures, in the hope of allowing further comparison of results as well as easier entry into this field of research. We illustrate one use-case for such data, showing how we can improve gesture recognition accuracy by utilising transfer learning to incorporate data from multiple other persons. This also illustrates that the dataset can serve as a benchmark dataset to facilitate research on transfer learning algorithms.
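The cross-person transfer use-case can be sketched as pre-training on the other subjects and fine-tuning only the head on the new user. The 8-channel input width and the tiny MLP are assumptions; the 18 gestures come from the abstract.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 18))

def fit(model, x, y, epochs=10, lr=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

# Pre-train on the other subjects' windows, then adapt to the target subject
fit(model, torch.randn(1000, 8), torch.randint(0, 18, (1000,)))
for p in model[0].parameters():
    p.requires_grad = False          # freeze the shared feature layer
fit(model, torch.randn(50, 8), torch.randint(0, 18, (50,)))
```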
41. Reliable Tuberculosis Detection using Chest X-ray with Deep Learning, Segmentation and Visualization [PDF] 返回目录
Tawsifur Rahman, Amith Khandakar, Muhammad Abdul Kadir, Khandaker R. Islam, Khandaker F. Islam, Rashid Mazhar, Tahir Hamid, Mohammad T. Islam, Zaid B. Mahbub, Mohamed Arselene Ayari, Muhammad E. H. Chowdhury
Abstract: Tuberculosis (TB) is a chronic lung disease that occurs due to bacterial infection and is one of the top 10 leading causes of death. Accurate and early detection of TB is very important, otherwise, it could be life-threatening. In this work, we have detected TB reliably from chest X-ray images using image pre-processing, data augmentation, image segmentation, and deep-learning classification techniques. Several public databases were used to create a database of 700 TB-infected and 3500 normal chest X-ray images for this study. Nine different deep CNNs (ResNet18, ResNet50, ResNet101, ChexNet, InceptionV3, Vgg19, DenseNet201, SqueezeNet, and MobileNet) were used for transfer learning from their pre-trained initial weights; they were trained, validated, and tested for classifying TB and normal (non-TB) cases. Three different experiments were carried out in this work: segmentation of X-ray images using two different U-net models, classification using whole X-ray images, and classification using segmented lung images. The accuracy, precision, sensitivity, F1-score, and specificity of tuberculosis detection using whole X-ray images were 97.07%, 97.34%, 97.07%, 97.14%, and 97.36%, respectively. However, classification using segmented lungs outperformed classification based on whole X-ray images: accuracy, precision, sensitivity, F1-score, and specificity were 99.9%, 99.91%, 99.9%, 99.9%, and 99.52%, respectively. The paper also used a visualization technique to confirm that the CNN learns predominantly from the segmented lung regions, which results in higher detection accuracy. The proposed method, with state-of-the-art performance, can be useful for faster computer-aided diagnosis of tuberculosis.
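A minimal sketch of the transfer-learning recipe for one of the nine backbones: torchvision's ImageNet-pretrained ResNet18 re-headed for the two-class TB/normal task (the training loop itself is the usual cross-entropy fine-tuning and is omitted).

```python
import torch.nn as nn
from torchvision import models

net = models.resnet18(pretrained=True)          # ImageNet initial weights
net.fc = nn.Linear(net.fc.in_features, 2)       # TB vs. normal head
# Fine-tune on full X-rays or on U-Net-segmented lung crops, as in the paper
```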
42. Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision [PDF] 返回目录
Milagros Miceli, Martin Schuessler, Tianling Yang
Abstract: The interpretation of data is fundamental to machine learning. This paper investigates practices of image data annotation as performed in industrial contexts. We define data annotation as a sense-making practice, where annotators assign meaning to data through the use of labels. Previous human-centered investigations have largely focused on annotators' subjectivity as a major cause for biased labels. We propose a wider view on this issue: guided by constructivist grounded theory, we conducted several weeks of fieldwork at two annotation companies. We analyzed which structures, power relations, and naturalized impositions shape the interpretation of data. Our results show that the work of annotators is profoundly informed by the interests, values, and priorities of other actors above their station. Arbitrary classifications are vertically imposed on annotators, and through them, on data. This imposition is largely naturalized. Assigning meaning to data is often presented as a technical matter. This paper shows it is, in fact, an exercise of power with multiple implications for individuals and society.
43. TR-GAN: Topology Ranking GAN with Triplet Loss for Retinal Artery/Vein Classification [PDF] 返回目录
Wenting Chen, Shuang Yu, Junde Wu, Kai Ma, Cheng Bian, Chunyan Chu, Linlin Shen, Yefeng Zheng
Abstract: Retinal artery/vein (A/V) classification lays the foundation for the quantitative analysis of retinal vessels, which is associated with potential risks of various cardiovascular and cerebral diseases. The topological connection relationship, which has been proved effective in improving the A/V classification performance for the conventional graph based method, has not been exploited by the deep learning based method. In this paper, we propose a Topology Ranking Generative Adversarial Network (TR-GAN) to improve the topology connectivity of the segmented arteries and veins, and further to boost the A/V classification performance. A topology ranking discriminator based on ordinal regression is proposed to rank the topological connectivity level of the ground-truth, the generated A/V mask and the intentionally shuffled mask. The ranking loss is further back-propagated to the generator to generate better connected A/V masks. In addition, a topology preserving module with triplet loss is also proposed to extract the high-level topological features and further to narrow the feature distance between the predicted A/V mask and the ground-truth. The proposed framework effectively increases the topological connectivity of the predicted A/V masks and achieves state-of-the-art A/V classification performance on the publicly available AV-DRIVE dataset.
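The topology-preserving triplet term can be pictured as below, with the ground-truth mask's high-level features as anchor, the predicted A/V mask's features as positive, and a corrupted (e.g., shuffled) mask's features as negative; the margin is illustrative.

```python
import torch.nn.functional as F

def topology_triplet_loss(gt_feat, pred_feat, shuffled_feat, margin=1.0):
    """Pull predicted-mask topology features toward the ground truth and away
    from an intentionally corrupted mask. All inputs: (N, D) feature vectors."""
    return F.triplet_margin_loss(gt_feat, pred_feat, shuffled_feat, margin=margin)
```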
44. An Uncertainty-aware Transfer Learning-based Framework for Covid-19 Diagnosis [PDF] 返回目录
Afshar Shamsi Jokandan, Hamzeh Asgharnezhad, Shirin Shamsi Jokandan, Abbas Khosravi, Parham M. Kebria, Darius Nahavandi, Saeid Nahavandi, Dipti Srinivasan
Abstract: The early and reliable detection of COVID-19 infected patients is essential to prevent and limit its outbreak. The PCR tests for COVID-19 detection are not available in many countries and also there are genuine concerns about their reliability and performance. Motivated by these shortcomings, this paper proposes a deep uncertainty-aware transfer learning framework for COVID-19 detection using medical images. Four popular convolutional neural networks (CNNs) including VGG16, ResNet50, DenseNet121, and InceptionResNetV2 are first applied to extract deep features from chest X-ray and computed tomography (CT) images. Extracted features are then processed by different machine learning and statistical modelling techniques to identify COVID-19 cases. We also calculate and report the epistemic uncertainty of classification results to identify regions where the trained models are not confident about their decisions (out of distribution problem). Comprehensive simulation results for X-ray and CT image datasets indicate that linear support vector machine and neural network models achieve the best results as measured by accuracy, sensitivity, specificity, and AUC. Also it is found that predictive uncertainty estimates are much higher for CT images compared to X-ray images.
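The abstract does not spell out the epistemic-uncertainty estimator, so the sketch below uses Monte-Carlo dropout, one common choice: several stochastic forward passes, with the predictive entropy of the averaged softmax as the uncertainty score.

```python
import torch

def mc_dropout_uncertainty(model, x, passes=20):
    """Returns the mean class probabilities and their predictive entropy."""
    model.train()  # keep dropout layers active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(passes)])
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp(min=1e-8).log()).sum(dim=1)
    return mean, entropy
```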
45. Efficient OCT Image Segmentation Using Neural Architecture Search [PDF] 返回目录
Saba Heidari Gheshlaghi, Omid Dehzangi, Ali Dabouei, Annahita Amireskandari, Ali Rezai, Nasser M Nasrabadi
Abstract: In this work, we propose a Neural Architecture Search (NAS) for retinal layer segmentation in Optical Coherence Tomography (OCT) scans. We incorporate the Unet architecture in the NAS framework as its backbone for the segmentation of the retinal layers in our collected and pre-processed OCT image dataset. At the pre-processing stage, we apply super-resolution and image-processing techniques to the raw OCT scans to improve the quality of the raw images. For our search strategy, different primitive operations are suggested to find the down- & up-sampling cell blocks, and the binary gate method is applied to make the search strategy practical for the task at hand. We empirically evaluated our method on our in-house OCT dataset. The experimental results demonstrate that the self-adapting NAS-Unet architecture substantially outperformed the competitive human-designed architecture by achieving 95.4% in the mean Intersection-over-Union metric and 78.7% in the Dice similarity coefficient.
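The binary-gate idea can be sketched as sampling exactly one candidate operation per forward pass, so only that op's activations are kept in memory (a ProxylessNAS-style reading; the paper's exact formulation and primitive set may differ, and training the architecture parameters needs a straight-through estimator that is omitted here).

```python
import torch
import torch.nn as nn

class BinaryGateMixedOp(nn.Module):
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture params

    def forward(self, x):
        # Sample one op from the softmax over architecture parameters
        idx = torch.multinomial(torch.softmax(self.alpha, dim=0), 1).item()
        return self.ops[idx](x)  # only the sampled op is executed

mixed = BinaryGateMixedOp([nn.Conv2d(8, 8, 3, padding=1),
                           nn.Conv2d(8, 8, 5, padding=2),
                           nn.Identity()])
out = mixed(torch.randn(1, 8, 32, 32))
```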
46. PDCOVIDNet: A Parallel-Dilated Convolutional Neural Network Architecture for Detecting COVID-19 from Chest X-Ray Images [PDF] 返回目录
Nihad Karim Chowdhury, Md. Muhtadir Rahman, Muhammad Ashad Kabir
Abstract: The COVID-19 pandemic continues to severely undermine the prosperity of the global health system. To combat this pandemic, effective screening techniques for infected patients are indispensable. There is no doubt that the use of chest X-ray images for radiological assessment is one of the essential screening techniques. Some of the early studies revealed that the patient's chest X-ray images showed abnormalities, which is natural for patients infected with COVID-19. In this paper, we propose a parallel-dilated convolutional neural network (CNN) based COVID-19 detection system for chest X-ray images, named Parallel-Dilated COVIDNet (PDCOVIDNet). First, the publicly available chest X-ray collection was fully preloaded and enhanced, and then classified by the proposed method. Differing convolution dilation rates in parallel form demonstrate the proof of principle for using PDCOVIDNet to extract radiological features for COVID-19 detection. Accordingly, we have assisted our method with two visualization methods, which are specifically designed to increase understanding of the key components associated with COVID-19 infection. Both visualization methods compute gradients for a given image category related to feature maps of the last convolutional layer to create a class-discriminative region. In our experiment, we used a total of 2,905 chest X-ray images, comprising three classes (COVID-19, normal, and viral pneumonia), and empirical evaluations revealed that the proposed method expeditiously extracted more significant features related to the suspected disease. The experimental results demonstrate that our proposed method significantly improves performance metrics: accuracy, precision, recall, and F1 scores reach 96.58%, 96.58%, 96.59%, and 96.58%, respectively, which is comparable to or better than state-of-the-art methods.
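The parallel-dilated building block can be sketched as convolution branches that share the input but use different dilation rates, fused by concatenation; the channel sizes and the two rates below are illustrative, not PDCOVIDNet's actual configuration.

```python
import torch
import torch.nn as nn

class ParallelDilatedBlock(nn.Module):
    def __init__(self, in_ch=3, ch=32, dilations=(1, 2)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, ch, 3, padding=d, dilation=d) for d in dilations
        )

    def forward(self, x):
        # Same spatial size on every branch; fuse along the channel axis
        return torch.relu(torch.cat([b(x) for b in self.branches], dim=1))

feats = ParallelDilatedBlock()(torch.randn(1, 3, 224, 224))  # (1, 64, 224, 224)
```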
47. On the unreasonable effectiveness of CNNs [PDF] 返回目录
Andreas Hauptmann, Jonas Adler
Abstract: Deep learning methods using convolutional neural networks (CNN) have been successfully applied to virtually all imaging problems, and particularly in image reconstruction tasks with ill-posed and complicated imaging models. In an attempt to put upper bounds on the capability of baseline CNNs for solving image-to-image problems we applied a widely used standard off-the-shelf network architecture (U-Net) to the "inverse problem" of XOR decryption from noisy data and show acceptable results.
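Setting up the XOR "inverse problem" as supervised data is straightforward, as in the sketch below; the image size, noise level, and a single key fixed across the dataset are assumptions (with a per-sample key the mapping would be unlearnable).

```python
import numpy as np

rng = np.random.default_rng(0)
KEY = (rng.random((64, 64)) > 0.5).astype(np.float32)  # one fixed key

def make_xor_sample(noise=0.1):
    """One (noisy ciphertext, plaintext) training pair."""
    plain = (rng.random((64, 64)) > 0.5).astype(np.float32)
    cipher = np.logical_xor(plain, KEY).astype(np.float32)
    return cipher + noise * rng.standard_normal((64, 64)), plain

x, y = make_xor_sample()  # feed batches of these to an off-the-shelf U-Net
```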
48. CommuNety: A Deep Learning System for the Prediction of Cohesive Social Communities [PDF] 返回目录
Syed Afaq Ali Shah, Weifeng Deng, Jianxin Li, Muhammad Aamir Cheema, Abdul Bais
Abstract: Effective mining of social media, which consists of a large number of users, is a challenging task. Traditional approaches rely on the analysis of text data related to users to accomplish this task. However, text data lacks significant information about the social users and their associated groups. In this paper, we propose CommuNety, a deep learning system for the prediction of cohesive social networks using images. The proposed deep learning model consists of a hierarchical CNN architecture to learn descriptive features related to each cohesive network. The paper also proposes a novel Face Co-occurrence Frequency algorithm to quantify the presence of people in images, and a novel photo ranking method to analyze the strength of relationship between different individuals in a predicted social network. We extensively evaluate the proposed technique on the PIPA dataset and compare with state-of-the-art methods. Our experimental results demonstrate the superior performance of the proposed technique for the prediction of relationships between different individuals and the cohesiveness of communities.
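The Face Co-occurrence Frequency algorithm itself is not specified in the abstract, but its core signal can be pictured as pair counts over per-photo face sets, as in this toy sketch.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(photos):
    """photos: list of sets of person IDs detected in each image. Returns how
    often each pair appears together, a simple proxy for tie strength."""
    counts = Counter()
    for faces in photos:
        counts.update(frozenset(p) for p in combinations(sorted(faces), 2))
    return counts

print(cooccurrence_counts([{"ann", "bob"}, {"ann", "bob", "eve"}, {"bob", "eve"}]))
```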
49. Multimodal Spatial Attention Module for Targeting Multimodal PET-CT Lung Tumor Segmentation [PDF] 返回目录
Xiaohang Fu, Lei Bi, Ashnil Kumar, Michael Fulham, Jinman Kim
Abstract: Multimodal positron emission tomography-computed tomography (PET-CT) is used routinely in the assessment of cancer. PET-CT combines the high sensitivity for tumor detection with PET and anatomical information from CT. Tumor segmentation is a critical element of PET-CT but at present there is not an accurate automated segmentation method. Segmentation tends to be done manually by different imaging experts and it is labor-intensive and prone to errors and inconsistency. Previous automated segmentation methods largely focused on fusing information that is extracted separately from the PET and CT modalities, with the underlying assumption that each modality contains complementary information. However, these methods do not fully exploit the high PET tumor sensitivity that can guide segmentation. In this study, we introduce a multimodal spatial attention module (MSAM) that automatically learns to emphasize regions (spatial areas) related to tumors and suppress normal regions with physiologic high-uptake. The spatial attention maps are subsequently employed to target a convolutional neural network (CNN) for segmentation of areas with higher tumor likelihood. Our MSAM can be applied to common backbone architectures and trained end-to-end. Our experimental results on two clinical PET-CT datasets of non-small cell lung cancer (NSCLC) and soft tissue sarcoma (STS) validate the effectiveness of the MSAM in these different cancer types. We show that our MSAM, with a conventional U-Net backbone, surpasses the state-of-the-art lung tumor segmentation approach by a margin of 7.6% Dice similarity coefficient (DSC).
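The module's core idea, deriving a spatial map from the PET stream and using it to re-weight features, can be sketched as below; the 7x7 attention convolution and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttentionGate(nn.Module):
    def __init__(self, pet_ch=32):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(pet_ch, 1, 7, padding=3),
                                 nn.Sigmoid())

    def forward(self, pet_feat, fused_feat):
        # Emphasise regions of high tracer uptake seen by the PET stream
        return fused_feat * self.att(pet_feat)

gate = SpatialAttentionGate()
out = gate(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 64, 64))
```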
50. Video compression with low complexity CNN-based spatial resolution adaptation [PDF] 返回目录
Di Ma, Fan Zhang, David R. Bull
Abstract: It has recently been demonstrated that spatial resolution adaptation can be integrated within video compression to improve overall coding performance by spatially down-sampling before encoding and super-resolving at the decoder. Significant improvements have been reported when convolutional neural networks (CNNs) were used to perform the resolution up-sampling. However, this approach suffers from high complexity at the decoder due to the employment of CNN-based super-resolution. In this paper, a novel framework is proposed which supports the flexible allocation of complexity between the encoder and decoder. This approach employs a CNN model for video down-sampling at the encoder and uses a Lanczos3 filter to reconstruct full resolution at the decoder. The proposed method was integrated into the HEVC HM 16.20 software and evaluated on JVET UHD test sequences using the All Intra configuration. The experimental results demonstrate the potential of the proposed approach, with significant bitrate savings (more than 10%) over the original HEVC HM, coupled with reduced computational complexity at both encoder (29%) and decoder (10%).
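The decoder-side half of the scheme is deliberately cheap; Lanczos3 up-sampling, for instance, is available off the shelf (PIL's LANCZOS filter uses three lobes). The resolutions below are illustrative.

```python
import numpy as np
from PIL import Image

def lanczos3_upsample(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    img = Image.fromarray(frame)
    w, h = img.size
    return np.asarray(img.resize((w * factor, h * factor), Image.LANCZOS))

low_res = (np.random.rand(540, 960, 3) * 255).astype(np.uint8)
full = lanczos3_upsample(low_res)  # (1080, 1920, 3) decoded-and-upsampled frame
```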
51. Sample Efficient Interactive End-to-End Deep Learning for Self-Driving Cars with Selective Multi-Class Safe Dataset Aggregation [PDF] 返回目录
Yunus Bicer, Ali Alizadeh, Nazim Kemal Ure, Ahmetcan Erdogan, Orkun Kizilirmak
Abstract: The objective of this paper is to develop a sample-efficient end-to-end deep learning method for self-driving cars, where we attempt to increase the value of the information extracted from samples through careful analysis of each call to the expert driver's policy. End-to-end imitation learning is a popular method for computing self-driving car policies. The standard approach relies on collecting pairs of inputs (camera images) and outputs (steering angle, etc.) from an expert policy and fitting a deep neural network to this data to learn the driving policy. Although this approach has had some successful demonstrations in the past, learning a good policy might require many samples from the expert driver, which can be resource-consuming. In this work, we develop a novel framework based on the Safe Dataset Aggregation (Safe DAgger) approach, where the current learned policy is automatically segmented into different trajectory classes, and the algorithm identifies trajectory segments or classes with weak performance at each step. Once the trajectory segments with weak performance are identified, the sampling algorithm focuses on calling the expert policy only on these segments, which improves the convergence rate. The presented simulation results show that the proposed approach can yield significantly better performance than the standard Safe DAgger algorithm while using the same amount of samples from the expert.
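The selective aggregation scheme can be summarized by the following Python sketch, in which all helper callables (rollout, segmentation, scoring, fitting) and the performance threshold are hypothetical placeholders rather than the authors' code:

    THRESHOLD = 0.5  # hypothetical performance cutoff

    def selective_safe_dagger(policy, expert_fn, rollout_fn,
                              segment_fn, score_fn, fit_fn, n_iters=10):
        """Query the expert only on trajectory segments where the learned
        policy performs poorly, reducing the number of expert calls."""
        dataset = []
        for _ in range(n_iters):
            trajectory = rollout_fn(policy)         # drive with current policy
            for segment in segment_fn(trajectory):  # split into trajectory classes
                if score_fn(segment) < THRESHOLD:   # weak segment: ask the expert
                    dataset += [(s, expert_fn(s)) for s in segment]
            policy = fit_fn(policy, dataset)        # supervised update on aggregate
        return policy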
52. COVID-19 CT Image Synthesis with a Conditional Generative Adversarial Network [PDF] 返回目录
Yifan Jiang, Han Chen, Murray Loew, Hanseok Ko
Abstract: Coronavirus disease 2019 (COVID-19) is an ongoing global pandemic that has spread rapidly since December 2019. Real-time reverse transcription polymerase chain reaction (rRT-PCR) and chest computed tomography (CT) imaging both play an important role in COVID-19 diagnosis. Chest CT imaging offers the benefits of quick reporting, low cost, and high sensitivity for the detection of pulmonary infection. Recently, deep-learning-based computer vision methods have demonstrated great promise for use in medical imaging applications, including X-rays, magnetic resonance imaging, and CT imaging. However, training a deep-learning model requires large volumes of data, and medical staff faces a high risk when collecting COVID-19 CT data due to the high infectivity of the disease. Another issue is the lack of experts available for data labeling. In order to meet the data requirements for COVID-19 CT imaging, we propose a CT image synthesis approach based on a conditional generative adversarial network that can effectively generate high-quality and realistic COVID-19 CT images for use in deep-learning-based medical imaging tasks. Experimental results show that the proposed method outperforms other state-of-the-art image synthesis methods on the generated COVID-19 CT images and shows promise for various machine learning applications, including semantic segmentation and classification.
53. Object-and-Action Aware Model for Visual Language Navigation [PDF] 返回目录
Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, Qi Wu
Abstract: Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions, on the basis of the visible environment. This requires extracting value from two very different types of natural-language information. The first is object description (e.g., 'table', 'door'), each serving as a tip for the agent to determine the next action by finding the item visible in the environment, and the second is action specification (e.g., 'go straight', 'turn left'), which allows the robot to directly predict the next movements without relying on visual perceptions. However, most existing methods pay little attention to distinguishing these kinds of information from each other during instruction encoding, and mix together the matching between textual object/action encoding and the visual perception/orientation features of candidate viewpoints. In this paper, we propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language instruction separately. This enables each process to flexibly match object-centered/action-centered instructions to its own counterpart visual perception/action orientation. However, one side issue caused by the above solution is that an object mentioned in instructions may be observed in the direction of two or more candidate viewpoints, so the OAAM may not predict the viewpoint on the shortest path as the next action. To handle this problem, we design a simple but effective path loss to penalize trajectories deviating from the ground-truth path. Experimental results demonstrate the effectiveness of the proposed model and path loss, and the superiority of their combination, with a 50% SPL score on the R2R dataset and a 40% CLS score on the R4R dataset in unseen environments, outperforming the previous state of the art.
54. Solving Phase Retrieval with a Learned Reference [PDF] 返回目录
Rakib Hyder, Zikui Cai, M. Salman Asif
Abstract: Fourier phase retrieval is a classical problem that deals with the recovery of an image from the amplitude measurements of its Fourier coefficients. Conventional methods solve this problem via iterative (alternating) minimization by leveraging some prior knowledge about the structure of the unknown image. The inherent ambiguities about shift and flip in the Fourier measurements make this problem especially difficult; and most of the existing methods use several random restarts with different permutations. In this paper, we assume that a known (learned) reference is added to the signal before capturing the Fourier amplitude measurements. Our method is inspired by the principle of adding a reference signal in holography. To recover the signal, we implement an iterative phase retrieval method as an unrolled network. Then we use back propagation to learn the reference that provides us the best reconstruction for a fixed number of phase retrieval iterations. We performed a number of simulations on a variety of datasets under different conditions and found that our proposed method for phase retrieval via unrolled network and learned reference provides near-perfect recovery at fixed (small) computational cost. We compared our method with standard Fourier phase retrieval methods and observed significant performance enhancement using the learned reference.
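As a rough illustration of the idea, the sketch below runs a fixed number of gradient steps on the Fourier amplitude residual of x + reference; because the unroll is kept differentiable, an outer training loop could backpropagate through it to learn the reference. Step size, iteration count, and initialization are illustrative assumptions, not the paper's architecture:

    import torch

    def unrolled_phase_retrieval(y, ref, n_iters=50, step=0.1):
        """Recover x from Fourier magnitudes y of (x + ref) by a fixed
        number of unrolled gradient steps on the amplitude residual."""
        x = torch.zeros_like(ref, requires_grad=True)
        for _ in range(n_iters):
            F = torch.fft.fft2(x + ref)
            loss = ((F.abs() - y) ** 2).mean()
            # create_graph keeps the unroll differentiable w.r.t. ref.
            (g,) = torch.autograd.grad(loss, x, create_graph=True)
            x = x - step * g  # one unrolled iteration
        return x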
55. Clarinet: A One-step Approach Towards Budget-friendly Unsupervised Domain Adaptation [PDF] 返回目录
Yiyang Zhang, Feng Liu, Zhen Fang, Bo Yuan, Guangquan Zhang, Jie Lu
Abstract: In unsupervised domain adaptation (UDA), classifiers for the target domain are trained with massive true-label data from the source domain and unlabeled data from the target domain. However, it may be difficult to collect fully-true-label data in a source domain given a limited budget. To mitigate this problem, we consider a novel problem setting where the classifier for the target domain has to be trained with complementary-label data from the source domain and unlabeled data from the target domain named budget-friendly UDA (BFUDA). The key benefit is that it is much less costly to collect complementary-label source data (required by BFUDA) than collecting the true-label source data (required by ordinary UDA). To this end, the complementary label adversarial network (CLARINET) is proposed to solve the BFUDA problem. CLARINET maintains two deep networks simultaneously, where one focuses on classifying complementary-label source data and the other takes care of the source-to-target distributional adaptation. Experiments show that CLARINET significantly outperforms a series of competent baselines.
56. 3D Fusion of Infrared Images with Dense RGB Reconstruction from Multiple Views -- with Application to Fire-fighting Robots [PDF] 返回目录
Yuncong Chen, Will Warren
Abstract: This project integrates infrared and RGB imagery to produce dense 3D environment models reconstructed from multiple views. The resulting 3D map contains both thermal and RGB information which can be used in robotic fire-fighting applications to identify victims and active fire areas.
57. A regularized deep matrix factorized model of matrix completion for image restoration [PDF] 返回目录
Zhemin Li, Zhi-Qin John Xu, Tao Luo, Hongxia Wang
Abstract: Matrix completion has been an important approach to image restoration. Most previous works on matrix completion focus on the low-rank property by imposing explicit constraints on the recovered matrix, such as a nuclear-norm constraint or a limit on the dimension of the matrix factorization component. Recently, theoretical works suggest that deep linear neural networks have an implicit bias towards low rank in matrix completion. However, low rank is not adequate to reflect the intrinsic characteristics of a natural image. Thus, algorithms with only a low-rank constraint are insufficient to perform image restoration well. In this work, we propose a Regularized Deep Matrix Factorized (RDMF) model for image restoration, which utilizes the implicit low-rank bias of deep neural networks and the explicit bias of total variation. We demonstrate the effectiveness of our RDMF model with extensive experiments, in which our method surpasses state-of-the-art models in common examples, especially for restoration from very few observations. Our work sheds light on a more general framework for solving other inverse problems by combining the implicit bias of deep learning with explicit regularization.
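A minimal sketch of such an objective: the completed matrix is a product of several factors (the source of the implicit low-rank bias) and anisotropic total variation is added as the explicit prior. The factorization depth, weighting, and names are illustrative assumptions, not the paper's exact model:

    import torch

    def rdmf_loss(factors, X, mask, lam=0.01):
        """Data fit on observed entries of X plus a total-variation prior;
        factors is a list of matrices whose product is the reconstruction."""
        M = factors[0]
        for W in factors[1:]:
            M = M @ W                              # deep linear factorization
        data_term = ((M - X)[mask] ** 2).mean()    # fit observed entries only
        tv = (M[1:, :] - M[:-1, :]).abs().mean() + \
             (M[:, 1:] - M[:, :-1]).abs().mean()   # anisotropic total variation
        return data_term + lam * tv

The factors would then be optimized directly, e.g. with Adam, until the loss converges.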
58. Accurate 2D soft segmentation of medical image via SoftGAN network [PDF] 返回目录
Changwei Wang, Rongtao Xu, Shibiao Xu, Weiliang Meng, Jun Xiao, Qimin Peng, Xiaopeng Zhang
Abstract: Accurate 2D lung nodule segmentation from medical Computed Tomography (CT) images is crucial in medical applications. Most current approaches cannot achieve precise segmentation results that preserve both rich edge detail and smooth transition representations between image regions, owing to the tininess, complexity, and irregularity of lung nodule shapes. To address this issue, we propose a novel Cascaded Generative Adversarial Network (CasGAN) to cope with CT image super-resolution and segmentation tasks, in which, to our knowledge, a semantic soft segmentation form for precise lesion representation is introduced for the first time; lesion edges can be retained accurately after our segmentation, which can promote rapid acquisition of high-quality large-scale annotation data based on RECIST weak-supervision information. Extensive experiments validate that our CasGAN greatly outperforms the state-of-the-art methods in segmentation quality and remains robust when applied to medical images beyond lung nodules. Besides, we provide a challenging lung-nodule soft segmentation dataset of medical CT images for further studies.
59. Group Knowledge Transfer: Collaborative Training of Large CNNs on the Edge [PDF] 返回目录
Chaoyang He, Salman Avestimehr, Murali Annavaram
Abstract: Scaling up the convolutional neural network (CNN) size (e.g., width, depth, etc.) is known to effectively improve model accuracy. However, the large model size impedes training on resource-constrained edge devices. For instance, federated learning (FL) on edge devices cannot tackle large CNN training demands, even though there is a strong practical need for FL due to its privacy and confidentiality properties. To address the resource-constrained reality, we reformulate FL as a group knowledge transfer (GKT) training algorithm. GKT designs a variant of the alternating minimization approach to train small CNNs on edge nodes and periodically transfer their knowledge by knowledge distillation to a large server-side CNN. GKT consolidates several advantages in a single framework: reduced demand for edge computation, lower communication cost for large CNNs, and asynchronous training, all while maintaining model accuracy comparable to FL. To simplify the edge training, we also develop a distributed training system based on our GKT. We train CNNs designed based on ResNet-56 and ResNet-110 using three distinct datasets (CIFAR-10, CIFAR-100, and CINIC-10) and their non-IID variants. Our results show that GKT can obtain comparable or even slightly higher accuracy. More importantly, GKT makes edge training affordable. Compared to the edge training using FedAvg, GKT demands 9 to 17 times less computational power (FLOPs) on edge devices and requires 54 to 105 times fewer parameters in the edge CNN.
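The knowledge-distillation step that transfers knowledge between the edge-side and server-side networks is typically a KL divergence between temperature-softened predictions. A standard (Hinton-style) sketch of that loss term, with a hypothetical temperature value:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=3.0):
        """KL divergence between temperature-softened class distributions;
        the T*T factor keeps gradient magnitudes comparable across T."""
        p_teacher = F.softmax(teacher_logits / T, dim=1)
        log_p_student = F.log_softmax(student_logits / T, dim=1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T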
60. Decompose X-ray Images for Bone and Soft Tissue [PDF] 返回目录
Yuanhao Gong
Abstract: Bones are always wrapped by soft tissues. As a result, bones in X-ray images are obscured and appear unclear. In this paper, we tackle this problem and propose a novel task: virtually decomposing the soft tissue and bone by image processing algorithms. This task is fundamentally different from segmentation because the decomposed images share the same imaging domain. Our decomposition task is also fundamentally different from conventional image enhancement. We propose a new mathematical model for such decomposition. Our model is ill-posed and thus requires some priors. With proper assumptions, our model can be solved by solving a standard Laplace equation. The resulting bone image is theoretically guaranteed to have better contrast than the original input image. Therefore, the details of bones are enhanced and become clearer. Several numerical experiments confirm the effectiveness and efficiency of our method. Our approach is important for clinical diagnosis, surgery planning, recognition, deep learning, etc.
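The abstract reduces the decomposition to a standard Laplace equation. For reference, such an equation can be solved numerically with a simple Jacobi iteration; the sketch below (NumPy, simplified border handling) illustrates the kind of solver involved, not the paper's exact formulation:

    import numpy as np

    def solve_laplace(boundary, mask, n_iters=500):
        """Solve the Laplace equation on the region where mask is True,
        holding the remaining (boundary) values fixed, by Jacobi iteration."""
        u = boundary.astype(np.float64)
        for _ in range(n_iters):
            avg = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                          np.roll(u, 1, 1) + np.roll(u, -1, 1))
            u[mask] = avg[mask]  # relax interior points only
        return u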
61. Learning to predict metal deformations in hot-rolling processes [PDF] 返回目录
R. Omar Chavez-Garcia, Emian Furger, Samuele Kronauer, Christian Brianza, Marco Scarfò, Luca Diviani, Alessandro Giusti
Abstract: Hot-rolling is a metal forming process that produces a workpiece with a desired target cross-section from an input workpiece through a sequence of plastic deformations; each deformation is generated by a stand composed of opposing rolls with a specific geometry. In current practice, the rolling sequence (i.e., the sequence of stands and the geometry of their rolls) needed to achieve a given final cross-section is designed by experts based on previous experience, and iteratively refined in a costly trial-and-error process. Finite Element Method simulations are increasingly adopted to make this process more efficient and to test potential rolling sequences, achieving good accuracy at the cost of long simulation times, limiting the practical use of the approach. We propose a supervised learning approach to predict the deformation of a given workpiece by a set of rolls with a given geometry; the model is trained on a large dataset of procedurally-generated FEM simulations, which we publish as supplementary material. The resulting predictor is four orders of magnitude faster than simulations, and yields an average Jaccard Similarity Index of 0.972 (against ground truth from simulations) and 0.925 (against real-world measured deformations); we additionally report preliminary results on using the predictor for automatic planning of rolling sequences.
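The reported metric, the Jaccard Similarity Index, is simply intersection over union between predicted and ground-truth cross-section masks; a short reference implementation:

    import numpy as np

    def jaccard_index(pred, target):
        """Intersection over union of two binary masks."""
        pred, target = pred.astype(bool), target.astype(bool)
        union = np.logical_or(pred, target).sum()
        if union == 0:
            return 1.0  # both masks empty: treat as a perfect match
        return np.logical_and(pred, target).sum() / union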
62. Enhancement of Retinal Fundus Images via Pixel Color Amplification [PDF] 返回目录
Alex Gaudio, Asim Smailagic, Aurélio Campilho
Abstract: We propose a pixel color amplification theory and family of enhancement methods to facilitate segmentation tasks on retinal images. Our novel re-interpretation of the image distortion model underlying dehazing theory shows how three existing priors commonly used by the dehazing community and a novel fourth prior are related. We utilize the theory to develop a family of enhancement methods for retinal images, including novel methods for whole image brightening and darkening. We show a novel derivation of the Unsharp Masking algorithm. We evaluate the enhancement methods as a pre-processing step to a challenging multi-task segmentation problem and show large increases in performance on all tasks, with Dice score increases over a no-enhancement baseline by as much as 0.491. We provide evidence that our enhancement preprocessing is useful for unbalanced and difficult data. We show that the enhancements can perform class balancing by composing them together.
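For orientation, the textbook form of the Unsharp Masking algorithm that the abstract re-derives boosts the difference between an image and a blurred copy of itself. A minimal sketch, assuming an image scaled to [0, 1] (sigma and amount are illustrative):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def unsharp_mask(img, sigma=2.0, amount=1.0):
        """Sharpen by adding back the high-frequency residual."""
        blurred = gaussian_filter(img.astype(np.float64), sigma=sigma)
        return np.clip(img + amount * (img - blurred), 0.0, 1.0)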
63. Extending LOUPE for K-space Under-sampling Pattern Optimization in Multi-coil MRI [PDF] 返回目录
Jinwei Zhang, Hang Zhang, Alan Wang, Qihao Zhang, Mert Sabuncu, Pascal Spincemaille, Thanh D. Nguyen, Yi Wang
Abstract: The previously established LOUPE (Learning-based Optimization of the Under-sampling Pattern) framework for optimizing the k-space sampling pattern in MRI was extended in three ways: firstly, fully sampled multi-coil k-space data from the scanner, rather than simulated k-space data from magnitude MR images as in LOUPE, was retrospectively under-sampled to optimize the under-sampling pattern of in-vivo k-space data; secondly, binary stochastic k-space sampling, rather than LOUPE's approximate stochastic k-space sampling during training, was applied together with a straight-through (ST) estimator to estimate the gradient of the threshold operation in a neural network; thirdly, a modified unrolled optimization network, rather than the modified U-Net in LOUPE, was used as the reconstruction network in order to reconstruct multi-coil data properly and reduce the dependency on training data. Experimental results show that, when dealing with in-vivo k-space data, the unrolled optimization network with the binary under-sampling block and ST estimator had better reconstruction performance than the variants with either a U-Net reconstruction network or an approximate sampling-pattern optimization network, and, once trained, the learned optimal sampling pattern worked better than the hand-crafted variable-density sampling pattern when deployed with other conventional reconstruction methods.
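The second extension, binary sampling trained through a straight-through (ST) estimator, follows a common pattern: threshold the sampling probabilities in the forward pass, but let gradients bypass the threshold in the backward pass. A minimal PyTorch sketch of that trick (not the paper's full sampling layer):

    import torch

    def binary_sample_st(p):
        """Forward: hard binary mask sampled from probabilities p.
        Backward: gradients flow to p as if the threshold were identity."""
        hard = (p > torch.rand_like(p)).float()  # non-differentiable sample
        return p + (hard - p).detach()           # straight-through estimator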
Note: the cover image is a word cloud of the paper titles.