Table of Contents
7. Improving Road Signs Detection performance by Combining the Features of Hough Transform and Texture [PDF] Abstract
9. RMDL: Recalibrated multi-instance deep learning for whole slide gastric image classification [PDF] Abstract
11. Electroencephalography signal processing based on textural features for monitoring the driver's state by a Brain-Computer Interface [PDF] Abstract
16. Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration [PDF] Abstract
24. DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [PDF] Abstract
29. Few-shot Action Recognition with Implicit Temporal Alignment and Pair Similarity Optimization [PDF] Abstract
30. DoFE: Domain-oriented Feature Embedding for Generalizable Fundus Image Segmentation on Unseen Datasets [PDF] Abstract
33. Exploring Efficient Volumetric Medical Image Segmentation Using 2.5D Method: An Empirical Study [PDF] Abstract
46. Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think! [PDF] Abstract
48. RANDGAN: Randomized Generative Adversarial Network for Detection of COVID-19 in Chest X-ray [PDF] Abstract
50. A Possible Method of Carbon Deposit Mapping on Plasma Facing Components Using Infrared Thermography [PDF] Abstract
Abstracts
1. Kartta Labs: Collaborative Time Travel [PDF] Back to Contents
Sasan Tavakkol, Feng Han, Brandon Mayer, Mark Phillips, Cyrus Shahabi, Yao-Yi Chiang, Raimondas Kiveris
Abstract: We introduce the modular and scalable design of Kartta Labs, an open source, open data, and scalable system for virtually reconstructing cities from historical maps and photos. Kartta Labs relies on crowdsourcing and artificial intelligence consisting of two major modules: Maps and 3D models. Each module, in turn, consists of sub-modules that enable the system to reconstruct a city from historical maps and photos. The result is a spatiotemporal reference that can be used to integrate various collected data (curated, sensed, or crowdsourced) for research, education, and entertainment purposes. The system empowers the users to experience collaborative time travel such that they work together to reconstruct the past and experience it on an open source and open data platform.
2. Deep Learning for Recognizing Mobile Targets in Satellite Imagery [PDF] Back to Contents
Mark Pritt
Abstract: There is an increasing demand for software that automatically detects and classifies mobile targets such as airplanes, cars, and ships in satellite imagery. Applications of such automated target recognition (ATR) software include economic forecasting, traffic planning, maritime law enforcement, and disaster response. This paper describes the extension of a convolutional neural network (CNN) for classification to a sliding window algorithm for detection. It is evaluated on mobile targets of the xView dataset, on which it achieves detection and classification accuracies higher than 95%.
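The step from classification to detection described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, window size, stride, and threshold are hypothetical, and `mean_brightness` is only a stand-in for a trained CNN classifier.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]
Detection = Tuple[int, int, int, float]  # (row, col, window size, score)


def sliding_window_detect(
    image: Image,
    classify: Callable[[List[List[float]]], float],
    window: int,
    stride: int,
    threshold: float,
) -> List[Detection]:
    """Score every window with `classify` and keep those at or above `threshold`."""
    rows, cols = len(image), len(image[0])
    detections: List[Detection] = []
    for r in range(0, rows - window + 1, stride):
        for c in range(0, cols - window + 1, stride):
            # Cut out the square patch whose top-left corner is (r, c).
            patch = [list(row[c:c + window]) for row in image[r:r + window]]
            score = classify(patch)
            if score >= threshold:
                detections.append((r, c, window, score))
    return detections


def mean_brightness(patch: List[List[float]]) -> float:
    """Hypothetical stand-in for a CNN: the mean pixel value of the patch."""
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))


# A 4x4 image with one bright 2x2 block in the lower-right corner.
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
detections = sliding_window_detect(image, mean_brightness, window=2, stride=2, threshold=0.5)
```

In practice overlapping windows would usually be merged (for example with non-maximum suppression), which this sketch leaves out.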
3. LASSR: Effective Super-Resolution Method for Plant Disease Diagnosis [PDF] Back to Contents
Quan Huu Cap, Hiroki Tani, Hiroyuki Uga, Satoshi Kagiwada, Hitoshi Iyatomi
Abstract: The collection of high-resolution training data is crucial in building robust plant disease diagnosis systems, since such data have a significant impact on diagnostic performance. However, they are very difficult to obtain and are not always available in practice. Deep learning-based techniques, and particularly generative adversarial networks (GANs), can be applied to generate high-quality super-resolution images, but these methods often produce unexpected artifacts that can lower the diagnostic performance. In this paper, we propose a novel artifact-suppression super-resolution method that is specifically designed for diagnosing leaf disease, called Leaf Artifact-Suppression Super Resolution (LASSR). Thanks to its own artifact removal module that detects and suppresses artifacts to a considerable extent, LASSR can generate much more pleasing, high-quality images compared to the state-of-the-art ESRGAN model. Experiments based on a five-class cucumber disease (including healthy) discrimination model show that training with data generated by LASSR significantly boosts the performance on an unseen test dataset by nearly 22% compared with the baseline, and that our approach is more than 2% better than a model trained with images generated by ESRGAN.
4. Satellite Image Classification with Deep Learning [PDF] Back to Contents
Mark Pritt, Gary Chern
Abstract: Satellite imagery is important for many applications including disaster response, law enforcement, and environmental monitoring. These applications require the manual identification of objects and facilities in the imagery. Because the geographic expanses to be covered are great and the analysts available to conduct the searches are few, automation is required. Yet traditional object detection and classification algorithms are too inaccurate and unreliable to solve the problem. Deep learning is a family of machine learning algorithms that have shown promise for the automation of such tasks. It has achieved success in image understanding by means of convolutional neural networks. In this paper we apply them to the problem of object and facility recognition in high-resolution, multi-spectral satellite imagery. We describe a deep learning system for classifying objects and facilities from the IARPA Functional Map of the World (fMoW) dataset into 63 different classes. The system consists of an ensemble of convolutional neural networks and additional neural networks that integrate satellite metadata with image features. It is implemented in Python using the Keras and TensorFlow deep learning libraries and runs on a Linux server with an NVIDIA Titan X graphics card. At the time of writing the system is in 2nd place in the fMoW TopCoder competition. Its total accuracy is 83%, the F1 score is 0.797, and it classifies 15 of the classes with accuracies of 95% or better.
5. Making Every Label Count: Handling Semantic Imprecision by Integrating Domain Knowledge [PDF] Back to Contents
Clemens-Alexander Brust, Björn Barz, Joachim Denzler
Abstract: Noisy data, crawled from the web or supplied by volunteers such as Mechanical Turkers or citizen scientists, is considered an alternative to professionally labeled data. There has been research focused on mitigating the effects of label noise. It is typically modeled as inaccuracy, where the correct label is replaced by an incorrect label from the same set. We consider an additional dimension of label noise: imprecision. For example, a non-breeding snow bunting is labeled as a bird. This label is correct, but not as precise as the task requires. Standard softmax classifiers cannot learn from such a weak label because they consider all classes mutually exclusive, which non-breeding snow bunting and bird are not. We propose CHILLAX (Class Hierarchies for Imprecise Label Learning and Annotation eXtrapolation), a method based on hierarchical classification, to fully utilize labels of any precision. Experiments on noisy variants of NABirds and ILSVRC2012 show that our method outperforms strong baselines by as much as 16.4 percentage points, and the current state of the art by up to 3.9 percentage points.
6. The DongNiao International Birds 10000 Dataset [PDF] Back to Contents
Jian Mei, Hao Dong
Abstract: DongNiao International Birds 10000 (DIB-10K) is a challenging image dataset which has more than 10 thousand different types of birds. It was created to enable the study of machine learning and also ornithology research. DIB-10K does not own the copyright of these images. It only provides thumbnails of images, in a way similar to ImageNet.
7. Improving Road Signs Detection performance by Combining the Features of Hough Transform and Texture [PDF] Back to Contents
Tarik Ayaou, Mourad Boussaid, Karim Afdel, Abdellah Amghar
Abstract: With the widespread use of intelligent systems in different domains, and in order to increase the safety of drivers and pedestrians, road and traffic sign recognition has been a challenging issue and an important task for many years. However, studies on the detection and recognition of traffic signs in images that address the Arab context are still insufficient. Detection of the road signs present in the scene is one of the main stages of traffic sign detection and recognition. In this paper, we present an efficient solution that enhances road sign detection performance, including in the Arabic context, based on color segmentation, the Randomized Hough Transform, and the combination of Zernike moments and Haralick features. The segmentation stage is used to determine the Region of Interest (ROI) in the image. The Randomized Hough Transform (RHT) is used to detect circular and octagonal shapes. This stage is improved by the extraction of Haralick features and Zernike moments. Furthermore, we use these features as input to an SVM-based classifier. Experimental results show that the proposed approach allows us to improve measurement precision.
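The circular-shape step of a Randomized Hough Transform can be sketched as follows. This is a generic illustration of the technique, not the authors' code: the function names, the number of trials, the integer rounding of the accumulator cells, and the sample points are all assumptions, and octagonal shapes are not handled.

```python
import math
import random
from collections import Counter


def circle_through(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-9:
        return None
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def rht_circle(points, trials=200, seed=0):
    """Vote for circles defined by random point triples; return the most voted one."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(trials):
        circle = circle_through(*rng.sample(points, 3))
        if circle is not None:
            # Quantize parameters to integer cells so similar circles share a vote.
            votes[tuple(round(v) for v in circle)] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical edge points lying on a circle of radius 5 centered at (10, 10).
edge_points = [
    (15, 10), (10, 15), (5, 10), (10, 5),
    (13, 14), (14, 13), (7, 14), (6, 13),
    (13, 6), (14, 7), (7, 6), (6, 7),
]
best = rht_circle(edge_points, trials=50)
```

Unlike the classical Hough transform, which lets every edge point vote over the whole parameter space, each random triple here votes for a single candidate circle.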
8. A review of 3D human pose estimation algorithms for markerless motion capture [PDF] Back to Contents
Yann Desmarais, Denis Mottet, Pierre Slangen, Philippe Montesinos
Abstract: Human pose estimation (HPE) in 3D is an active research field that has many applications in entertainment, health and sport science, and robotics. In the last five years, markerless motion capture techniques have seen their average error decrease from more than 10 cm to less than 2 cm today. This evolution is mainly driven by improvements in the 2D pose estimation task, which benefited from the use of convolutional networks. However, with the multiplication of different approaches, it can be difficult to identify which is better suited to the specifics of a given application. We suggest classifying existing methods with a taxonomy based on the performance criteria of accuracy, speed, and robustness. We review more than twenty methods from the last three years. Additionally, we analyze the metrics, benchmarks, and structure of the different pose estimation systems and propose several directions for future research. We hope to offer a good introduction to 3D markerless pose estimation as well as a discussion of the leading contemporary algorithms.
9. RMDL: Recalibrated multi-instance deep learning for whole slide gastric image classification [PDF] Back to Contents
Shujun Wang, Yaxi Zhu, Lequan Yu, Hao Chen, Huangjing Lin, Xiangbo Wan, Xinjuan Fan, Pheng-Ann Hen
Abstract: The whole slide histopathology images (WSIs) play a critical role in gastric cancer diagnosis. However, due to the large scale of WSIs and various sizes of the abnormal area, how to select informative regions and analyze them are quite challenging during the automatic diagnosis process. The multi-instance learning based on the most discriminative instances can be of great benefit for whole slide gastric image diagnosis. In this paper, we design a recalibrated multi-instance deep learning method (RMDL) to address this challenging problem. We first select the discriminative instances, and then utilize these instances to diagnose diseases based on the proposed RMDL approach. The designed RMDL network is capable of capturing instance-wise dependencies and recalibrating instance features according to the importance coefficient learned from the fused features. Furthermore, we build a large whole-slide gastric histopathology image dataset with detailed pixel-level annotations. Experimental results on the constructed gastric dataset demonstrate the significant improvement on the accuracy of our proposed framework compared with other state-of-the-art multi-instance learning methods. Moreover, our method is general and can be extended to other diagnosis tasks of different cancer types based on WSIs.
10. Face Mask Assistant: Detection of Face Mask Service Stage Based on Mobile Phone [PDF] Back to Contents
Yuzhen Chen, Menghan Hu, Chunjun Hua, Guangtao Zhai, Jian Zhang, Qingli Li, Simon X. Yang
Abstract: Coronavirus Disease 2019 (COVID-19) has spread all over the world since its massive outbreak in December 2019, causing great losses worldwide. Both confirmed cases and deaths have reached relatively frightening numbers. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of COVID-19, can be transmitted by small respiratory droplets. To curb its spread at the source, wearing masks is a convenient and effective measure. In most cases, people use face masks frequently but for short periods. To address the problem of not knowing which service stage a mask is in, we propose a detection system based on the mobile phone. We first extract four features from the gray-level co-occurrence matrices (GLCMs) of the face mask's micro-photos. Next, a three-result detection system is built using the KNN algorithm. The results of validation experiments show that our system can reach a precision of 82.87% (standard deviation=8.5%) on the testing dataset. In future work, we plan to extend detection to more mask types. This work demonstrates that the proposed mobile microscope system can be used as an assistant for face masks in use, which may play a positive role in fighting against COVID-19.
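To make the GLCM step concrete, here is a minimal sketch of computing a gray-level co-occurrence matrix and four texture statistics from it. The abstract does not say which four features are used; contrast, energy, homogeneity, and entropy are common choices and are an assumption here, as are the function names, the pixel offset, and the tiny example images.

```python
import math
from typing import Dict, List, Sequence


def glcm(image: Sequence[Sequence[int]], levels: int, dr: int = 0, dc: int = 1) -> List[List[float]]:
    """Normalized co-occurrence matrix for pixel pairs separated by offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    if total == 0:
        raise ValueError("image too small for the requested offset")
    return [[n / total for n in row] for row in counts]


def glcm_features(p: List[List[float]]) -> Dict[str, float]:
    """Four texture statistics of a normalized GLCM (an assumed feature set)."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
    }


# Two-level toy images: horizontal stripes versus a checkerboard.
stripes = glcm_features(glcm([[0, 0], [1, 1]], levels=2))
checker = glcm_features(glcm([[0, 1], [1, 0]], levels=2))
```

With a horizontal offset, the striped image has zero contrast while the checkerboard has maximal contrast for two gray levels; feature vectors like these could then be passed to a KNN classifier.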
11. Electroencephalography signal processing based on textural features for monitoring the driver's state by a Brain-Computer Interface [PDF] Back to Contents
Giulia Orrù, Marco Micheletto, Fabio Terranova, Gian Luca Marcialis
Abstract: In this study we investigate a textural processing method for electroencephalography (EEG) signals as an indicator to estimate the driver's vigilance in a hypothetical Brain-Computer Interface (BCI) system. The novelty of the proposed solution lies in employing the one-dimensional Local Binary Pattern (1D-LBP) algorithm for feature extraction from pre-processed EEG data. From the resulting feature vector, classification is done according to three vigilance classes: awake, tired, and drowsy. The claim is that the class transitions can be detected by describing the variations of the micro-patterns' occurrences along the EEG signal. The 1D-LBP is able to describe them by encoding mutual variations of temporally "close" signal samples as a short bit-code. Our analysis allows us to conclude that adopting the 1D-LBP has led to significant performance improvement. Moreover, capturing the class transitions from the EEG signal is effective, although the overall performance is not yet good enough to develop a BCI for assessing the driver's vigilance in real environments.
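A one-dimensional Local Binary Pattern can be sketched as follows. This is a generic version of the technique rather than the authors' pipeline: the neighborhood size, the `>=` comparison, the bit ordering, and the function names are assumptions.

```python
from typing import List, Sequence


def lbp_1d(signal: Sequence[float], half_window: int = 4) -> List[int]:
    """Encode each sample by comparing its neighbors on both sides to it.

    A neighbor that is >= the center sample sets one bit of the code, so each
    position yields an integer in [0, 2 ** (2 * half_window)).
    """
    samples = list(signal)
    codes = []
    for i in range(half_window, len(samples) - half_window):
        center = samples[i]
        neighbors = samples[i - half_window:i] + samples[i + 1:i + 1 + half_window]
        code = 0
        for bit, value in enumerate(neighbors):
            if value >= center:
                code |= 1 << bit
        codes.append(code)
    return codes


def lbp_histogram(codes: Sequence[int], half_window: int = 4) -> List[int]:
    """Count how often each code occurs; this histogram is the feature vector."""
    hist = [0] * (2 ** (2 * half_window))
    for code in codes:
        hist[code] += 1
    return hist


# Toy signal with one neighbor on each side, so codes range over 0..3.
codes = lbp_1d([1, 3, 2, 5, 4], half_window=1)
features = lbp_histogram(codes, half_window=1)
```

Local peaks map to code 0 and local valleys to the all-ones code, so the histogram summarizes how often each micro-pattern occurs along the signal.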
12. Detecting Anomalies from Video-Sequences: a Novel Descriptor [PDF] 返回目录
Giulia Orrù, Davide Ghiani, Maura Pintor, Gian Luca Marcialis, Fabio Roli
Abstract: We present a novel descriptor for crowd behavior analysis and anomaly detection. The goal is to measure, through appropriate patterns, the speed of formation and disintegration of groups in the crowd. This descriptor is inspired by the concept of one-dimensional local binary patterns: in our case, such patterns depend on the number of groups observed in a time window. An appropriate measurement unit, named "trit" (trinary digit), represents three possible dynamic states of groups in a given frame. Our hypothesis is that abrupt variations in the number of groups may be due to an anomalous event, which can be detected by translating these variations into temporal trit-based sequences of strings that differ significantly from those describing the "no-anomaly" case. Because the rationale behind this work relies on the number of groups, three different methods for extracting groups of people are compared. Experiments are carried out on the Motion-Emotion benchmark data set. The reported results point out in which cases the trit-based measurement of group dynamics allows us to detect the anomaly. Besides the promising performance of our approach, we show how it is correlated with the anomaly typology and the camera's perspective on the crowd's flow (frontal, lateral).
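The trit idea can be illustrated with a small sketch. The abstract does not give the exact encoding, so the mapping below (0 = fewer groups than in the previous frame, 1 = unchanged, 2 = more groups), the `tolerance` parameter, and both function names are assumptions made only for illustration.

```python
def trit_sequence(group_counts, tolerance=0):
    """Map frame-to-frame changes in the number of groups to trits.

    Assumed encoding: 0 = decrease, 1 = stable, 2 = increase.
    """
    trits = []
    for prev, curr in zip(group_counts, group_counts[1:]):
        delta = curr - prev
        if delta > tolerance:
            trits.append(2)
        elif delta < -tolerance:
            trits.append(0)
        else:
            trits.append(1)
    return trits


def trit_strings(trits, window=3):
    """Slide a window over the trits and emit one string per position."""
    return [
        "".join(str(t) for t in trits[i:i + window])
        for i in range(len(trits) - window + 1)
    ]
```

For example, the group counts `[3, 3, 5, 2, 2]` give the trits `[1, 2, 0, 1]` and, with `window=3`, the strings `"120"` and `"201"`. An anomaly detector could then compare the distribution of such strings against one collected from "no-anomaly" footage.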
13. Coarse and fine-grained automatic cropping deep convolutional neural network [PDF]
Jingfei Chang
Abstract: Existing convolutional neural network pruning algorithms can be divided into two categories: coarse-grained pruning and fine-grained pruning. This paper proposes a coarse and fine-grained automatic pruning algorithm, which can achieve more efficient and accurate compression and acceleration for convolutional neural networks. First, the intermediate feature maps of the convolutional neural network are clustered to obtain the network structure after coarse-grained pruning; then the particle swarm optimization algorithm is used to iteratively search and optimize the structure. Finally, the optimal pruned substructure is obtained.
14. Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition [PDF]
Jianrong Wang, Tong Wu, Shanyu Wang, Mei Yu, Qiang Fang, Ju Zhang, Li Liu
Abstract: Lip motion reflects the behavioral characteristics of speakers, and thus can be used as a new kind of biometric in speaker recognition. In the literature, many works have used two-dimensional (2D) lip images to recognize speakers in a text-dependent context. However, 2D lip images are easily affected by varying face orientations. To this end, in this work, we present a novel end-to-end 3D lip motion Network (3LMNet) that utilizes the sentence-level 3D lip motion (S3DLM) to recognize speakers in both text-independent and text-dependent contexts. A new regional feedback module (RFM) is proposed to obtain attention over different lip regions. Besides, prior knowledge of lip motion is investigated to complement the RFM, where landmark-level and frame-level features are merged to form a better feature representation. Moreover, we present two methods, i.e., coordinate transformation and face posture correction, to pre-process the LSD-AV dataset, which contains 68 speakers and 146 sentences per speaker. The evaluation results on this dataset demonstrate that our proposed 3LMNet is superior to the baseline models, i.e., LSTM, VGG-16, and ResNet-34, and outperforms the state of the art using 2D lip images as well as the 3D face. The code of this work is released at this https URL Motion-Network-for-Text-Independent-Speaker-Recognition.
15. A Generalized Zero-Shot Framework for Emotion Recognition from Body Gestures [PDF]
Jinting Wu, Yujia Zhang, Xiaoguang Zhao
Abstract: Although automatic emotion recognition from facial expressions and speech has made remarkable progress, emotion recognition from body gestures has not been thoroughly explored. People often use a variety of body language to express emotions, and it is difficult to enumerate all emotional body gestures and collect enough samples for each category. Therefore, recognizing new emotional body gestures is critical for better understanding human emotions. However, existing methods fail to accurately determine which emotional state a new body gesture belongs to. To solve this problem, we introduce a Generalized Zero-Shot Learning (GZSL) framework, which consists of three branches, to infer the emotional state of new body gestures from their semantic descriptions alone. The first branch is a Prototype-Based Detector (PBD), which is used to determine whether a sample belongs to a seen body gesture category and to obtain the prediction results for samples from the seen categories. The second branch is a Stacked AutoEncoder (StAE) with manifold regularization, which utilizes semantic representations to predict samples from unseen categories. Note that both of the above branches are for body gesture recognition. We further add an emotion classifier with a softmax layer as the third branch in order to better learn the feature representations for this emotion classification task. The input features for these three branches are learned by a shared feature extraction network, i.e., a Bidirectional Long Short-Term Memory network (BLSTM) with a self-attention module. We treat these three branches as subtasks and use multi-task learning strategies for joint training. The performance of our framework on an emotion recognition dataset is significantly superior to the traditional method of emotion classification and state-of-the-art zero-shot learning methods.
16. Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration [PDF]
Zongxin Yang, Yunchao Wei, Yi Yang
Abstract: This paper investigates the principles of embedding learning to tackle the challenging task of semi-supervised video object segmentation. Unlike previous practices that focus on exploring the embedding learning of foreground object(s), we argue that the background should be treated equally. Thus, we propose a Collaborative video object segmentation by Foreground-Background Integration (CFBI) approach. CFBI separates the feature embedding into the foreground object region and its corresponding background region, implicitly promoting them to be more contrastive and improving the segmentation results accordingly. Moreover, CFBI performs both pixel-level matching processes and instance-level attention mechanisms between the reference and the predicted sequence, making CFBI robust to various object scales. Based on CFBI, we introduce a multi-scale matching structure and propose an Atrous Matching strategy, resulting in a more robust and efficient framework, CFBI+. We conduct extensive experiments on two popular benchmarks, i.e., DAVIS and YouTube-VOS. Without applying any simulated data for pre-training, our CFBI+ achieves a performance (J&F) of 82.9% and 82.8%, outperforming all other state-of-the-art methods. Code: this https URL.
17. LM-Reloc: Levenberg-Marquardt Based Direct Visual Relocalization [PDF]
Lukas von Stumberg, Patrick Wenzel, Nan Yang, Daniel Cremers
Abstract: We present LM-Reloc, a novel approach for visual relocalization based on direct image alignment. In contrast to prior works that tackle the problem with a feature-based formulation, the proposed method does not rely on feature matching and RANSAC. Hence, the method can utilize not only corners but any region of the image with gradients. In particular, we propose a loss formulation inspired by the classical Levenberg-Marquardt algorithm to train LM-Net. The learned features significantly improve the robustness of direct image alignment, especially for relocalization across different conditions. To further improve the robustness of LM-Net against large image baselines, we propose a pose estimation network, CorrPoseNet, which regresses the relative pose to bootstrap the direct image alignment. Evaluations on the CARLA and Oxford RobotCar relocalization tracking benchmarks show that our approach delivers more accurate results than previous state-of-the-art methods while being comparable in terms of robustness.
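For readers unfamiliar with the classical Levenberg-Marquardt algorithm that the abstract refers to, its standard damped update is recalled below. This is the textbook optimizer step, not the paper's training loss, which the abstract does not spell out. The symbols are assumptions of this note: r is the residual vector, J its Jacobian with respect to the parameters x, and \lambda \ge 0 the damping factor.

```latex
\[
\left(J^\top J + \lambda I\right)\,\delta = -J^\top r,
\qquad
x \leftarrow x + \delta
\]
```

As \lambda approaches 0 the step becomes a Gauss-Newton step; for large \lambda it approaches \delta \approx -\tfrac{1}{\lambda} J^\top r, a short gradient-descent step. In direct image alignment the residuals would be differences between image (or feature-map) values at corresponding points.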
18. Audio-Visual Self-Supervised Terrain Type Discovery for Mobile Platforms [PDF]
Akiyoshi Kurobe, Yoshikatsu Nakajima, Hideo Saito, Kris Kitani
Abstract: The ability to both recognize and discover terrain characteristics is an important function required for many autonomous ground robots such as social robots, assistive robots, autonomous vehicles, and ground exploration robots. Recognizing and discovering terrain characteristics is challenging because similar terrains may have very different appearances (e.g., carpet comes in many colors), while terrains with very similar appearance may have very different physical properties (e.g., mulch versus dirt). To address the inherent ambiguity in vision-based terrain recognition and discovery, we propose a multi-modal self-supervised learning technique that switches between audio features extracted from a microphone attached to the underside of a mobile platform and image features extracted by a camera on the platform to cluster terrain types. The terrain cluster labels are then used to train an image-based convolutional neural network to predict changes in terrain types. Through experiments, we demonstrate that the proposed self-supervised terrain type discovery method achieves over 80% accuracy, which greatly outperforms several baselines and suggests strong potential for assistive applications.
19. How important are faces for person re-identification? [PDF]
Julia Dietlmeier, Joseph Antony, Kevin McGuinness, Noel E. O'Connor
Abstract: This paper investigates the dependence of existing state-of-the-art person re-identification models on the presence and visibility of human faces. We apply a face detection and blurring algorithm to create anonymized versions of several popular person re-identification datasets including Market1501, DukeMTMC-reID, CUHK03, Viper, and Airport. Using a cross-section of existing state-of-the-art models that range in accuracy and computational efficiency, we evaluate the effect of this anonymization on re-identification performance using standard metrics. Perhaps surprisingly, the effect on mAP is very small, and accuracy is recovered by simply training on the anonymized versions of the data rather than the original data. These findings are consistent across multiple models and datasets. These results indicate that datasets can be safely anonymized by blurring faces without significantly impacting the performance of person re-identification systems, and may allow for the release of new, richer re-identification datasets where previously there were privacy or data protection concerns.
20. MixCo: Mix-up Contrastive Learning for Visual Representation [PDF]
Sungnyun Kim, Gihun Lee, Sangmin Bae, Se-Young Yun
Abstract: Contrastive learning has shown remarkable results in recent self-supervised approaches for visual representation. By learning to contrast the representations of positive pairs against those of the corresponding negative pairs, one can train good visual representations without human annotations. This paper proposes Mix-up Contrast (MixCo), which extends the contrastive learning concept to semi-positives encoded from the mix-up of positive and negative images. MixCo aims to learn the relative similarity of representations, reflecting how much of the original positives the mixed images contain. We validate the efficacy of MixCo when applied to recent self-supervised learning algorithms under the standard linear evaluation protocol on TinyImageNet, CIFAR10, and CIFAR100. In the experiments, MixCo consistently improves test accuracy. Remarkably, the improvement is more significant when the learning capacity (e.g., model size) is limited, suggesting that MixCo might be more useful in real-world scenarios.
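One way to read the "semi-positive" idea is as a contrastive loss with soft targets: an input mixed with ratio lam from two images should be similar to the first image's positive with weight lam and to the second's with weight 1 - lam. The sketch below shows that reading on plain Python lists. The function names, the temperature value, and the exact loss form are assumptions for illustration, not the paper's implementation.

```python
import math


def mix(a, b, lam):
    """Mix-up of two equally sized inputs with ratio `lam` in [0, 1]."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def soft_contrastive_loss(similarities, soft_targets, temperature=0.1):
    """Cross-entropy between soft targets and a softmax over similarities.

    `similarities[k]` is the similarity between the mixed query and key k;
    `soft_targets[k]` is the assumed share of key k's source image in the mix.
    """
    logits = [s / temperature for s in similarities]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(soft_targets, logits))
```

For a query mixed with `lam = 0.7`, the soft targets for the two source keys would be `[0.7, 0.3]`, with zeros for all other negatives. The loss is smaller when the similarities rank the keys in the same order as the targets.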
21. Impact of Thermal Throttling on Long-Term Visual Inference in a CPU-based Edge Device [PDF]
Théo Benoit-Cattin, Delia Velasco-Montero, Jorge Fernández-Berni
Abstract: Many application scenarios of edge visual inference, e.g., robotics or environmental monitoring, eventually require long periods of continuous operation. In such periods, the processor temperature plays a critical role in maintaining a prescribed frame rate. In particular, the heavy computational load of convolutional neural networks (CNNs) may lead to thermal throttling and hence performance degradation within a few seconds. In this paper, we report and analyze the long-term performance of 80 different cases resulting from running 5 CNN models on 4 software frameworks and 2 operating systems, with and without active cooling. This comprehensive study was conducted on a low-cost edge platform, namely Raspberry Pi 4B (RPi4B), under stable indoor conditions. The results show that hysteresis-based active cooling prevented thermal throttling in all cases, thereby improving the throughput by up to approximately 90% versus no cooling. Interestingly, the range of fan usage during active cooling varied from 33% to 65%. Given the impact of the fan on the power consumption of the system as a whole, these results stress the importance of a suitable selection of CNN model and software components. To assess the performance in outdoor applications, we integrated an external temperature sensor with the RPi4B and conducted a set of experiments with no active cooling over a wide interval of ambient temperature, ranging from 22 °C to 36 °C. Variations of up to 27.7% were measured with respect to the maximum throughput achieved in that interval. This demonstrates that ambient temperature is a critical parameter when active cooling cannot be applied.
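Hysteresis-based cooling can be sketched as a two-threshold controller: the fan switches on above an upper temperature and only switches off again once the temperature falls below a lower one. The threshold values and function names below are hypothetical; the abstract does not state the ones used in the study.

```python
def hysteresis_fan(temperatures, on_threshold=70.0, off_threshold=60.0):
    """Return the fan state (True = on) for each temperature reading.

    The thresholds (in degrees Celsius) are illustrative assumptions.
    """
    fan_on = False
    states = []
    for t in temperatures:
        if t >= on_threshold:
            fan_on = True
        elif t <= off_threshold:
            fan_on = False
        # Between the two thresholds the previous state is kept.
        states.append(fan_on)
    return states


def fan_usage(states):
    """Fraction of readings during which the fan was running."""
    return sum(states) / len(states) if states else 0.0
```

The gap between the two thresholds keeps the fan from toggling rapidly around a single set point, and `fan_usage` gives the kind of duty-cycle figure the abstract reports as a percentage.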
22. Land Cover Semantic Segmentation Using ResUNet [PDF]
Vasilis Pollatos, Loukas Kouvaras, Eleni Charou
Abstract: In this paper we present our work on developing an automated system for land cover classification. This system takes a multiband satellite image of an area as input and outputs the land cover map of the area at the same resolution as the input. For this purpose, convolutional machine learning models were trained on the task of predicting the land cover semantic segmentation of satellite images. This is a case of supervised learning. The land cover label data were taken from the CORINE Land Cover (CLC) inventory and the satellite images were taken from the Copernicus hub. As for the model, U-Net architecture variations were applied. Our area of interest is the Ionian Islands (Greece). We created a dataset from scratch covering this particular area. In addition, transfer learning from the BigEarthNet dataset [1] was performed. In [1], simple classification of satellite images into the CLC classes is performed, but not segmentation as we do. However, their models were trained on a dataset much bigger than ours, so we applied transfer learning using their pretrained models as the first part of our network, utilizing the ability these networks have developed to extract useful features from the satellite images (we transferred a pretrained ResNet50 into a U-Res-Net). Apart from transfer learning, other techniques were applied in order to overcome the limitations set by the small size of our area of interest. We used data augmentation (cutting images into overlapping patches, applying random transformations such as rotations and flips) and cross validation. The results are tested on the 3 CLC class hierarchy levels and a comparative study is made on the results of different approaches.
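The augmentation described in the abstract (overlapping patches plus random rotations and flips) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: images are plain nested lists, and the patch size, stride, and function names are hypothetical. For segmentation, the same random transform has to be applied to the image patch and to its label patch so the two stay aligned.

```python
import random


def overlapping_patches(image, size, stride):
    """Yield square patches of side `size`; a stride below `size` makes them overlap."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            yield [row[c:c + size] for row in image[r:r + size]]


def rot90(patch):
    """Rotate a patch 90 degrees clockwise."""
    return [list(col) for col in zip(*patch[::-1])]


def hflip(patch):
    """Mirror a patch left to right."""
    return [row[::-1] for row in patch]


def random_augment(image_patch, label_patch, rng):
    """Apply one random rotation/flip jointly to an image patch and its labels."""
    k = rng.randrange(4)          # number of 90-degree rotations
    flip = rng.random() < 0.5     # whether to mirror afterwards
    out = []
    for p in (image_patch, label_patch):
        for _ in range(k):
            p = rot90(p)
        if flip:
            p = hflip(p)
        out.append(p)
    return out[0], out[1]


# Toy 4x4 single-band "image"; real inputs would be multiband arrays.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = list(overlapping_patches(image, size=2, stride=1))
```

With a 4x4 input, 2x2 patches and a stride of 1, neighbouring patches share one row or column, which is how a small area of interest yields more training samples.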
23. A Scale and Rotational Invariant Key-point Detector based on Sparse Coding [PDF]
Thanh Hong-Phuoc, Ling Guan
Abstract: Most popular hand-crafted key-point detectors such as Harris corner, SIFT, and SURF aim to detect corners, blobs, junctions or other human-defined structures in images. Though they are robust to some geometric transformations, unintended scenarios or non-uniform lighting variations could significantly degrade their performance. Hence, a new detector that is flexible under context change and simultaneously robust to both geometric and non-uniform illumination variations is very desirable. In this paper, we propose a solution to this challenging problem by incorporating Scale and Rotation Invariant design (named SRI-SCK) into a recently developed Sparse Coding based Key-point detector (SCK). The SCK detector is flexible in different scenarios and fully invariant to affine intensity change, yet it is not designed to handle images with drastic scale and rotation changes. In SRI-SCK, the scale invariance is implemented with an image pyramid technique while the rotation invariance is realized by combining multiple rotated versions of the dictionary used in the sparse coding step of SCK. Techniques for calculation of key-points' characteristic scales and their sub-pixel accuracy positions are also proposed. Experimental results on three public datasets demonstrate that high repeatability and matching scores are achieved.
24. DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [PDF]
Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Basura Fernando, Hongdong Li, Stephen Gould
Abstract: This paper studies the task of temporal moment localization in a long untrimmed video using natural language query. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm suitable for temporal moment localization which captures the relationships between humans, objects and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected objects and human features conditioned in the language query. Moreover, a temporal sub-graph captures the activities within the video through time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show our method outperforms state-of-the-art methods on these datasets, confirming the effectiveness of our approach.
25. Correlation Filter for UAV-Based Aerial Tracking: A Review and Experimental Evaluation [PDF]
Changhong Fu, Bowen Li, Fangqiang Ding, Fuling Lin, Geng Lu
Abstract: Aerial tracking, which has exhibited its omnipresent dedication and splendid performance, is one of the most active applications in the remote sensing field. In particular, unmanned aerial vehicle (UAV)-based remote sensing systems, equipped with a visual tracking approach, have been widely used in aviation, navigation, agriculture, transportation, public security, etc. As mentioned above, the UAV-based aerial tracking platform has gradually developed from the research stage to the practical application stage, becoming one of the main aerial remote sensing technologies of the future. However, due to challenging real-world situations, the vibration of the UAV's mechanical structure (especially under strong wind conditions), and limited computation resources, accuracy, robustness, and high efficiency are all crucial for onboard tracking methods. Recently, discriminative correlation filter (DCF)-based trackers have stood out for their high computational efficiency and appealing robustness on a single CPU, and have flourished in the UAV visual tracking community. In this work, the basic framework of DCF-based trackers is first generalized, based on which 20 state-of-the-art DCF-based trackers are summarized in order according to their innovations for solving various issues. Besides, exhaustive and quantitative experiments have been conducted on various prevailing UAV tracking benchmarks, i.e., UAV123, UAV123_10fps, UAV20L, UAVDT, DTB70, and VisDrone2019-SOT, which contain 371,625 frames in total. The experiments show the performance, verify the feasibility, and demonstrate the current challenges of DCF-based trackers for onboard UAV tracking. Finally, comprehensive conclusions on open challenges and directions for future research are presented.
26. Robust Two-Stream Multi-Feature Network for Driver Drowsiness Detection [PDF]
Qi Shen, Shengjie Zhao, Rongqing Zhang, Bin Zhang
Abstract: Drowsy driving is a major cause of traffic accidents, and thus numerous previous studies have focused on driver drowsiness detection. Many driving-relevant factors have been taken into consideration for fatigue detection and can lead to high precision, but there are still several serious constraints; for example, most existing models are susceptible to environmental conditions. In this paper, fatigue detection is considered as a temporal action detection problem instead of image classification. The proposed detection system can be divided into four parts: (1) Localize the key patches of the detected driver picture which are critical for fatigue detection and calculate the corresponding optical flow. (2) Contrast Limited Adaptive Histogram Equalization (CLAHE) is used in our system to reduce the impact of different light conditions. (3) Three individual two-stream networks combined with an attention mechanism are designed for each feature to extract temporal information. (4) The outputs of the three sub-networks are concatenated and sent to the fully-connected network, which judges the status of the driver. The drowsiness detection system is trained and evaluated on the well-known National Tsing Hua University Driver Drowsiness Detection (NTHU-DDD) dataset and we obtain an accuracy of 94.46%, which outperforms most existing fatigue detection models.
27. Two-Stream Compare and Contrast Network for Vertebral Compression Fracture Diagnosis [PDF]
Shixiang Feng, Beibei Liu, Ya Zhang, Xiaoyun Zhang, Yuehua Li
Abstract: Differentiating Vertebral Compression Fractures (VCFs) associated with trauma and osteoporosis (benign VCFs) from those caused by metastatic cancer (malignant VCFs) is critically important for treatment decisions. So far, automatic VCFs diagnosis is solved in a two-step manner, i.e., first identify VCFs and then classify them as benign or malignant. In this paper, we explore modeling VCFs diagnosis as a three-class classification problem, i.e., normal vertebrae, benign VCFs, and malignant VCFs. However, VCFs recognition and classification require very different features, and both tasks are characterized by high intra-class variation and high inter-class similarity. Moreover, the dataset is extremely class-imbalanced. To address the above challenges, we propose a novel Two-Stream Compare and Contrast Network (TSCCN) for VCFs diagnosis. This network consists of two streams: a recognition stream, which learns to identify VCFs through comparing and contrasting between adjacent vertebrae, and a classification stream, which compares and contrasts between intra-class and inter-class to learn features for fine-grained classification. The two streams are integrated via a learnable weight control module which adaptively sets their contribution. The TSCCN is evaluated on a dataset consisting of 239 VCFs patients and achieves an average sensitivity and specificity of 92.56% and 96.29%, respectively.
28. Self-Supervised Multi-View Synchronization Learning for 3D Pose Estimation [PDF]
Simon Jenni, Paolo Favaro
Abstract: Current state-of-the-art methods cast monocular 3D human pose estimation as a learning problem by training neural networks on large data sets of images and corresponding skeleton poses. In contrast, we propose an approach that can exploit small annotated data sets by fine-tuning networks pre-trained via self-supervised learning on (large) unlabeled data sets. To drive such networks towards supporting 3D pose estimation during the pre-training step, we introduce a novel self-supervised feature learning task designed to focus on the 3D structure in an image. We exploit images extracted from videos captured with a multi-view camera system. The task is to classify whether two images depict two views of the same scene up to a rigid transformation. In a multi-view data set, where objects deform in a non-rigid manner, a rigid transformation occurs only between two views taken at the exact same time, i.e., when they are synchronized. We demonstrate the effectiveness of the synchronization task on the Human3.6M data set and achieve state-of-the-art results in 3D human pose estimation.
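The synchronization pretext task in the abstract amounts to labeling pairs of views by whether they were captured at the same instant. A minimal sketch of how such training pairs could be sampled is given below. The abstract does not describe the sampling scheme, so the 50/50 balance (`p_sync`), the indexing of frames by `(camera, time)`, and the function name are all assumptions made for illustration.

```python
import random


def sample_sync_pair(num_cameras, num_frames, rng, p_sync=0.5):
    """Sample two views from different cameras and a synchronization label.

    Returns ((cam_a, t_a), (cam_b, t_b), label), where label is 1 when both
    views share the same time index and 0 otherwise. Requires at least two
    cameras and two frames.
    """
    cam_a, cam_b = rng.sample(range(num_cameras), 2)   # two distinct cameras
    t_a = rng.randrange(num_frames)
    if rng.random() < p_sync:
        t_b = t_a                                      # synchronized pair
    else:
        t_b = rng.choice([t for t in range(num_frames) if t != t_a])
    return (cam_a, t_a), (cam_b, t_b), int(t_a == t_b)


rng = random.Random(0)
pairs = [sample_sync_pair(num_cameras=4, num_frames=100, rng=rng) for _ in range(200)]
```

A classifier trained on such pairs needs no pose annotations, only the capture timestamps that a multi-view camera rig provides.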
29. Few-shot Action Recognition with Implicit Temporal Alignment and Pair Similarity Optimization [PDF]
Congqi Cao, Yajuan Li, Qinyi Lv, Peng Wang, Yanning Zhang
Abstract: Few-shot learning aims to recognize instances from novel classes with few labeled samples, which has great value in research and application. Although there has been a lot of work in this area recently, most of the existing work is based on image classification tasks. Video-based few-shot action recognition has not been explored well and remains challenging: 1) the differences of implementation details among different papers make a fair comparison difficult; 2) the wide variations and misalignment of temporal sequences make the video-level similarity comparison difficult; 3) the scarcity of labeled data makes the optimization difficult. To solve these problems, this paper presents 1) a specific setting to evaluate the performance of few-shot action recognition algorithms; 2) an implicit sequence-alignment algorithm for better video-level similarity comparison; 3) an advanced loss for few-shot learning to optimize pair similarity with limited data. Specifically, we propose a novel few-shot action recognition framework that uses long short-term memory following 3D convolutional layers for sequence modeling and alignment. Circle loss is introduced to maximize the within-class similarity and minimize the between-class similarity flexibly towards a more definite convergence target. Instead of using random or ambiguous experimental settings, we set a concrete criterion analogous to the standard image-based few-shot learning setting for few-shot action recognition evaluation. Extensive experiments on two datasets demonstrate the effectiveness of our proposed method.
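For readers unfamiliar with circle loss, here is a minimal sketch of its commonly published form, which re-weights each within-class similarity `s_p` and between-class similarity `s_n` by its distance from an optimum. How the paper integrates the loss into its few-shot framework is not stated in the abstract, and the margin `m` and scale `gamma` below are illustrative defaults rather than the authors' settings.

```python
import math


def _logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a non-empty list."""
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def circle_loss(sp, sn, m=0.25, gamma=64.0):
    """Circle loss for lists of positive (sp) and negative (sn) similarities.

    Optima: O_p = 1 + m, O_n = -m. Margins: Delta_p = 1 - m, Delta_n = m.
    Both lists must be non-empty.
    """
    op, on = 1.0 + m, -m
    dp, dn = 1.0 - m, m
    # Each similarity gets a weight proportional to its distance from the optimum.
    logit_p = [-gamma * max(op - s, 0.0) * (s - dp) for s in sp]
    logit_n = [gamma * max(s - on, 0.0) * (s - dn) for s in sn]
    # log(1 + sum_n exp(.) * sum_p exp(.)) written as a softplus of two log-sum-exps.
    return _softplus(_logsumexp(logit_n) + _logsumexp(logit_p))


good = circle_loss(sp=[0.9], sn=[0.1])   # well-separated pair similarities
bad = circle_loss(sp=[0.3], sn=[0.7])    # positives less similar than negatives
```

The loss is small when positives are near 1 and negatives near 0, and grows quickly when the ordering is violated, which matches the abstract's goal of maximizing within-class and minimizing between-class similarity.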
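The few-shot action recognition abstract above names Circle loss as its pair-similarity objective. A minimal sketch of that loss for a single anchor is given here in plain Python, following the commonly published formulation. The function names and the values of `margin` and `gamma` are illustrative assumptions, not the settings used by the authors.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def circle_loss(sp, sn, margin=0.25, gamma=64.0):
    """Circle loss for one anchor (hypothetical helper, illustrative defaults).

    sp: similarities between the anchor and same-class samples.
    sn: similarities between the anchor and other-class samples.
    """
    # Optima and decision margins of the standard Circle loss formulation.
    o_p, o_n = 1.0 + margin, -margin
    delta_p, delta_n = 1.0 - margin, margin

    # Self-paced weights: a similarity far from its optimum is weighted more.
    logits_p = [-gamma * max(o_p - s, 0.0) * (s - delta_p) for s in sp]
    logits_n = [gamma * max(s - o_n, 0.0) * (s - delta_n) for s in sn]

    # loss = log(1 + sum_n exp(logit_n) * sum_p exp(logit_p)), computed as softplus(z).
    z = _logsumexp(logits_n) + _logsumexp(logits_p)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Well-separated pairs should give a much smaller loss than confused pairs.
easy = circle_loss(sp=[0.95, 0.90], sn=[0.05, 0.10])
hard = circle_loss(sp=[0.05, 0.10], sn=[0.95, 0.90])
print(easy, hard)
```

The weights grow with the distance of each similarity from its optimum, which is what lets within-class and between-class similarities be optimized at different paces.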
30. DoFE: Domain-oriented Feature Embedding for Generalizable Fundus Image Segmentation on Unseen Datasets
Shujun Wang, Lequan Yu, Kang Li, Xin Yang, Chi-Wing Fu, Pheng-Ann Heng
Abstract: Deep convolutional neural networks have significantly boosted the performance of fundus image segmentation when test datasets have the same distribution as the training datasets. However, in clinical practice, medical images often exhibit variations in appearance for various reasons, e.g., different scanner vendors and image quality. These distribution discrepancies could lead the deep networks to over-fit on the training datasets and lack generalization ability on the unseen test datasets. To alleviate this issue, we present a novel Domain-oriented Feature Embedding (DoFE) framework to improve the generalization ability of CNNs on unseen target domains by exploring the knowledge from multiple source domains. Our DoFE framework dynamically enriches the image features with additional domain prior knowledge learned from multi-source domains to make the semantic features more discriminative. Specifically, we introduce a Domain Knowledge Pool to learn and memorize the prior information extracted from multi-source domains. Then the original image features are augmented with domain-oriented aggregated features, which are induced from the knowledge pool based on the similarity between the input image and multi-source domain images. We further design a novel domain code prediction branch to infer this similarity and employ an attention-guided mechanism to dynamically combine the aggregated features with the semantic features. We comprehensively evaluate our DoFE framework on two fundus image segmentation tasks, including the optic cup and disc segmentation and vessel segmentation. Our DoFE framework generates satisfying segmentation results on unseen datasets and surpasses other domain generalization and network regularization methods.
31. When Wireless Communications Meet Computer Vision in Beyond 5G
Takayuki Nishio, Yusuke Koda, Jihong Park, Mehdi Bennis, Klaus Doppler
Abstract: This article articulates the emerging paradigm, sitting at the confluence of computer vision and wireless communication, to enable beyond-5G/6G mission-critical applications (autonomous/remote-controlled vehicles, visuo-haptic VR, and other cyber-physical applications). First, drawing on recent advances in machine learning and the availability of non-RF data, vision-aided wireless networks are shown to significantly enhance the reliability of wireless communication without sacrificing spectral efficiency. In particular, we demonstrate how computer vision enables look-ahead prediction in a millimeter-wave channel blockage scenario, before the blockage actually happens. From a computer vision perspective, we highlight how radio frequency (RF) based sensing and imaging are instrumental in robustifying computer vision applications against occlusion and failure. This is corroborated via an RF-based image reconstruction use case, showcasing a receiver-side image failure correction resulting in reduced retransmission and latency. Taken together, this article sheds light on the much-needed convergence of RF and non-RF modalities to enable ultra-reliable communication and truly intelligent 6G networks.
32. ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding
Yibo Yang, Hongyang Li, Shan You, Fei Wang, Chen Qian, Zhouchen Lin
Abstract: Neural architecture search (NAS) aims to produce the optimal sparse solution from a high-dimensional space spanned by all candidate connections. Current gradient-based NAS methods commonly ignore the sparsity constraint in the search phase and instead project the optimized solution onto a sparse one by post-processing. As a result, the dense super-net used for search is inefficient to train and has a gap with the projected architecture used for evaluation. In this paper, we formulate neural architecture search as a sparse coding problem. We perform the differentiable search on a compressed lower-dimensional space that has the same validation loss as the original sparse solution space, and recover an architecture by solving the sparse coding problem. The differentiable search and architecture recovery are optimized in an alternating manner. By doing so, our network for search satisfies the sparsity constraint at each update and is efficient to train. In order to also eliminate the depth and width gap between the network in search and the target-net in evaluation, we further propose a method to search and evaluate in one stage under the target-net settings. When training finishes, architecture variables are absorbed into network weights. Thus we get the searched architecture and optimized parameters in a single run. In experiments, our two-stage method on CIFAR-10 requires only 0.05 GPU-days for search. Our one-stage method produces state-of-the-art performance on both CIFAR-10 and ImageNet at the cost of only evaluation time.
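The ISTA-NAS abstract above casts architecture recovery as a sparse coding problem, and the method's name points to the iterative shrinkage-thresholding algorithm (ISTA). As a rough illustration of that building block only, here is a sketch of generic ISTA for L1-regularized least squares in NumPy. It is not the authors' search pipeline; the matrix `A`, the penalty `lam`, and the problem sizes are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, b, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||A z - b||^2 + lam * ||z||_1 with plain ISTA."""
    # Lipschitz constant of the gradient of the quadratic term.
    lipschitz = np.linalg.norm(A, 2) ** 2
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ z - b)                                # gradient step
        z = soft_threshold(z - grad / lipschitz, lam / lipschitz)  # shrinkage step
    return z


def objective(A, b, z, lam):
    """Value of the L1-regularized least-squares objective."""
    return 0.5 * np.sum((A @ z - b) ** 2) + lam * np.sum(np.abs(z))


# Illustrative problem: recover a sparse code from fewer measurements than unknowns.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
z_true = np.zeros(50)
z_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
b = A @ z_true
z_hat = ista(A, b, lam=0.05, n_iter=2000)
print(np.count_nonzero(np.abs(z_hat) > 1e-3), "entries above 1e-3")
```

With step size 1/L the objective never increases from one iteration to the next, which makes this variant a convenient baseline before trying accelerated versions.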
33. Exploring Efficient Volumetric Medical Image Segmentation Using 2.5D Method: An Empirical Study
Yichi Zhang, Qingcheng Liao, Jicong Zhang
Abstract: With the unprecedented developments in deep learning, many methods have been proposed and have achieved great success in medical image segmentation. However, unlike natural images, most medical images such as MRI and CT are volumetric data. In order to make full use of volumetric information, 3D CNNs are widely used. However, 3D CNNs suffer from higher inference time and computation cost, which hinders their further clinical application. Additionally, with the increased number of parameters, the risk of overfitting is higher, especially for medical images, where data and annotations are expensive to acquire. To address this problem, many 2.5D segmentation methods have been proposed to make use of volumetric spatial information at a lower computation cost. Although these works lead to improvements on a variety of segmentation tasks, to the best of our knowledge there has not previously been a large-scale empirical comparison of these methods. In this paper, we aim to present a review of the latest developments in 2.5D methods for volumetric medical image segmentation. Additionally, to compare the performance and effectiveness of these methods, we provide an empirical study on three representative segmentation tasks involving different modalities and targets. Our experimental results highlight that 3D CNNs may not always be the best choice. Moreover, although all these 2.5D methods can bring performance gains over the 2D baseline, not all of them retain those benefits across different datasets. We hope the results and conclusions of our study will prove useful for the community in exploring and developing efficient volumetric medical image segmentation methods.
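The 2.5D study above surveys methods that feed volumetric context to 2D networks. One common variant stacks each slice with its neighbors along the channel axis. The sketch below shows only that input construction in NumPy; the function name, the `context` parameter, and edge padding at the volume boundary are assumptions for illustration, and the paper reviews other 2.5D strategies as well.

```python
import numpy as np


def make_25d_inputs(volume, context=1):
    """Turn a (D, H, W) volume into (D, 2*context+1, H, W) slice stacks.

    Each output item holds one slice plus `context` neighbors on either side,
    arranged as channels so a 2D CNN can consume it.
    """
    # Repeat the first and last slices so boundary slices also get full context.
    padded = np.pad(volume, ((context, context), (0, 0), (0, 0)), mode="edge")
    depth = volume.shape[0]
    window = 2 * context + 1
    return np.stack([padded[i:i + window] for i in range(depth)], axis=0)


# Illustrative volume: 4 slices of size 2 x 3.
volume = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
stacks = make_25d_inputs(volume, context=1)
print(stacks.shape)  # one 3-channel input per original slice
```

The center channel of each stack is the slice being segmented, so the target mask for a 2D network stays the ordinary per-slice mask.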
34. Towards Understanding Pixel Vulnerability under Adversarial Attacks for Images
He Zhao, Trung Le, Paul Montague, Olivier De Vel, Tamas Abraham, Dinh Phung
Abstract: Deep neural network image classifiers are reported to be susceptible to adversarial evasion attacks, which use carefully crafted images to mislead a classifier. Recently, various kinds of adversarial attack methods have been proposed, most of which focus on adding small perturbations to all of the pixels of a real image. We find that a considerable amount of the perturbation that some widely used attacks add to an image may contribute little to attacking a classifier. However, it usually results in an adversarial image that is more easily detected by both humans and adversarial attack detection algorithms. Therefore, it is important to impose the perturbations on the most vulnerable pixels of an image, which can change the predictions of classifiers more readily. Given an existing attack, knowing the pixel vulnerability lets us make its adversarial images more realistic and less detectable with fewer perturbations while keeping its attack performance the same. Moreover, the discovered vulnerability helps in gaining a better understanding of the weaknesses of deep classifiers. Working from an information-theoretic perspective, we propose a probabilistic approach for automatically finding the pixel vulnerability of an image, which is compatible with and improves over many existing adversarial attacks.
35. Map-Based Temporally Consistent Geolocalization through Learning Motion Trajectories
Bing Zha, Alper Yilmaz
Abstract: In this paper, we propose a novel trajectory learning method that exploits motion trajectories on a topological map, using a recurrent neural network for temporally consistent geolocalization of an object. Inspired by the human ability to be aware of both the distance and the direction of self-motion during navigation, our trajectory learning method learns a pattern representation of trajectories, encoded as a sequence of distances and turning angles, to assist self-localization. We pose the learning process as a conditional sequence prediction problem in which each output locates the object on a traversable path in a map. Since the predicted sequence ought to be topologically connected in the graph-structured map, we adopt two different hypothesis generation and elimination strategies to eliminate disconnected sequence predictions. We demonstrate our approach on the KITTI stereo visual odometry dataset, which covers a city-scale environment and can generate trajectories with metric information. The key benefits of our approach to geolocalization are that we 1) take advantage of the powerful sequence modeling ability of recurrent neural networks and their robustness to noisy input, 2) only require a map in the form of a graph and an affordable sensor that generates motion trajectories, and 3) do not need an initial position. The experiments show that the motion trajectories can be learned by training a recurrent neural network, and that temporally consistent geolocation can be predicted with both of the proposed strategies.
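The geolocalization abstract above describes trajectories encoded as a sequence of distances and turning angles. A small sketch of one way to compute such an encoding from 2D waypoints follows. The function name, the use of planar coordinates, and the convention that the first step has a turning angle of zero are assumptions for illustration rather than details taken from the paper.

```python
import math


def encode_trajectory(points):
    """Convert (x, y) waypoints into a list of (distance, turning_angle) steps.

    The turning angle is the change in heading from the previous segment,
    in radians, wrapped to [-pi, pi]. The first step has no previous heading,
    so its angle is set to 0.0 (an assumed convention).
    """
    steps = []
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        heading = math.atan2(dy, dx)
        if prev_heading is None:
            turn = 0.0
        else:
            diff = heading - prev_heading
            # Wrap the heading difference so left and right turns stay comparable.
            turn = math.atan2(math.sin(diff), math.cos(diff))
        steps.append((math.hypot(dx, dy), turn))
        prev_heading = heading
    return steps


# Illustrative path: east 1 unit, north 2 units, west 1 unit (two left turns).
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0)]
for distance, turn in encode_trajectory(path):
    print(round(distance, 3), round(math.degrees(turn), 1))
```

Because the encoding depends only on relative motion, it needs no absolute starting position, which fits the abstract's claim that no initial position is required.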
36. Universal Model for 3D Medical Image Analysis
Xiaoman Zhang, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang
Abstract: Deep learning-based methods have recently achieved remarkable progress in medical image analysis, but rely heavily on massive amounts of labeled training data. Transfer learning from pre-trained models has been proposed as a standard pipeline for medical image analysis to address this bottleneck. Despite their success, the existing pre-trained models are mostly not tuned for multi-modal, multi-task generalization in medical domains. Specifically, their training data are either from non-medical domains or from a single modality, failing to address the problem of performance degradation under cross-modal transfer. Furthermore, there has been no effort to explicitly extract the multi-level features required by a variety of downstream tasks. To overcome these limitations, we propose Universal Model, a transferable and generalizable pre-trained model for 3D medical image analysis. A unified self-supervised learning scheme is leveraged to learn representations from multiple unlabeled source datasets with different modalities and distinctive scan regions. A modality-invariant adversarial learning module is further introduced to improve cross-modal generalization. To fit a wide range of tasks, a simple yet effective scale classifier is incorporated to capture multi-level visual representations. To validate the effectiveness of the Universal Model, we perform extensive experimental analysis on five target tasks, covering multiple imaging modalities, distinctive scan regions, and different analysis tasks. Compared with both public 3D pre-trained models and newly investigated 3D self-supervised learning methods, Universal Model demonstrates superior generalizability, manifested by its higher performance, stronger robustness, and faster convergence. The pre-trained Universal Model is available at: this https URL.
37. Infant Pose Learning with Small Data
Xiaofei Huang, Nihang Fu, Sarah Ostadabbas
Abstract: With the increasing maturity of the human pose estimation domain, its applications have become broader and broader. Yet the performance of state-of-the-art pose estimation models degrades significantly in applications that include novel subjects or poses, such as infants with their unique movements. Infant motion analysis is a topic of critical importance in child health and developmental studies. However, models trained on large-scale adult pose datasets are barely successful in estimating infant poses, due to significant differences in body ratio and the versatility of the poses infants can take compared to adults. Moreover, privacy and security considerations hinder the availability of enough infant images to train a robust pose estimation model from scratch. Here, we propose a fine-tuned domain-adapted infant pose (FiDIP) estimation model that transfers the knowledge of adult poses to infant pose estimation, under the supervision of a domain adaptation technique, on a mixed real and synthetic infant pose dataset. In developing FiDIP, we also built a synthetic and real infant pose (SyRIP) dataset with diverse, fully annotated real infant images and generated synthetic infant images. We demonstrated that our FiDIP model outperforms other state-of-the-art human pose estimation models on infant pose estimation, with a mean average precision (AP) as high as 92.2.
38. Attn-HybridNet: Improving Discriminability of Hybrid Features with Attention Fusion
Sunny Verma, Chen Wang, Liming Zhu, Wei Liu
Abstract: The principal component analysis network (PCANet) is an unsupervised parsimonious deep network, utilizing principal components as filters in its convolution layers. Albeit powerful, the PCANet consists of basic operations such as principal components and spatial pooling, and it suffers from two fundamental problems. First, the principal components obtain information by transforming it to column vectors (which we call the amalgamated view), which incurs the loss of the spatial information in the data. Second, the generalized spatial pooling utilized in the PCANet induces feature redundancy and also fails to accommodate spatial statistics of natural images. In this research, we first propose a tensor-factorization based deep network called the Tensor Factorization Network (TFNet). The TFNet extracts features from the spatial structure of the data (which we call the minutiae view). We then show that the information obtained by the PCANet and the TFNet is distinctive and non-trivial but individually insufficient. This phenomenon necessitates the development of the proposed HybridNet, which integrates the information discovery with the two views of the data. To enhance the discriminability of hybrid features, we propose Attn-HybridNet, which alleviates the feature redundancy by performing attention-based feature fusion. The significance of our proposed Attn-HybridNet is demonstrated on multiple real-world datasets where the features obtained with Attn-HybridNet achieve better classification performance over other popular baseline methods, demonstrating the effectiveness of the proposed technique.
39. Contrast and Classify: Alternate Training for Robust VQA
Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal
Abstract: Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by minimizing the standard cross-entropy loss. To more effectively leverage the augmented data, we build on the recent success in contrastive learning. We propose a novel training paradigm (ConCAT) that alternately optimizes cross-entropy and contrastive losses. The contrastive loss encourages representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of the representations for answer classification. We find that alternately optimizing both losses is key to effective training. VQA models trained with ConCAT achieve higher consensus scores on the VQA-Rephrasings dataset as well as higher VQA accuracy on the VQA 2.0 dataset compared to existing approaches across a variety of data augmentation strategies.
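The alternating optimization described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `alternating_schedule` and `info_nce`, the step counts, the temperature, and the choice of an InfoNCE-style contrastive loss are all assumptions, since the abstract does not specify the exact loss or schedule.

```python
import math


def alternating_schedule(num_steps, ce_steps=1, contrastive_steps=1):
    """Yield which loss to optimize at each step (hypothetical schedule)."""
    period = ce_steps + contrastive_steps
    for step in range(num_steps):
        yield "cross_entropy" if step % period < ce_steps else "contrastive"


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the contrastive loss).

    The anchor could be the embedding of a question, the positive the
    embedding of its paraphrase, and the negatives unrelated questions.
    """
    a = _normalize(anchor)
    sims = [_dot(a, _normalize(positive))]
    sims += [_dot(a, _normalize(n)) for n in negatives]
    logits = [s / temperature for s in sims]
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_z - logits[0]


if __name__ == "__main__":
    print(list(alternating_schedule(4)))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

In a real training loop the schedule would select which loss is backpropagated at each step; the contrastive term is low when a question and its paraphrase map to nearby representations.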
40. Spectral Synthesis for Satellite-to-Satellite Translation
Thomas Vandal, Daniel McDuff, Weile Wang, Andrew Michaelis, Ramakrishna Nemani
Abstract: Earth observing satellites carrying multi-spectral sensors are widely used to monitor the physical and biological states of the atmosphere, land, and oceans. These satellites have different vantage points above the earth and different spectral imaging bands resulting in inconsistent imagery from one to another. This presents challenges in building downstream applications. What if we could generate synthetic bands for existing satellites from the union of all domains? We tackle the problem of generating synthetic spectral imagery for multispectral sensors as an unsupervised image-to-image translation problem with partial labels and introduce a novel shared spectral reconstruction loss. Simulated experiments performed by dropping one or more spectral bands show that cross-domain reconstruction outperforms measurements obtained from a second vantage point. On a downstream cloud detection task, we show that generating synthetic bands with our model improves segmentation performance beyond our baseline. Our proposed approach enables synchronization of multispectral data and provides a basis for more homogeneous remote sensing datasets.
41. A translational pathway of deep learning methods in GastroIntestinal Endoscopy
Sharib Ali, Mariia Dmitrieva, Noha Ghatwary, Sophia Bano, Gorkem Polat, Alptekin Temizel, Adrian Krenzer, Amar Hekalo, Yun Bo Guo, Bogdan Matuszewski, Mourad Gridach, Irina Voiculescu, Vishnusai Yoganand, Arnav Chavan, Aryan Raj, Nhan T. Nguyen, Dat Q. Tran, Le Duy Huynh, Nicolas Boutry, Shahadate Rezvy, Haijian Chen, Yoon Ho Choi, Anand Subramanian, Velmurugan Balasubramanian, Xiaohong W. Gao, Hongyu Hu, Yusheng Liao, Danail Stoyanov, Christian Daul, Stefano Realdon, Renato Cannizzaro, Dominique Lamarque, Terry Tran-Nguyen, Adam Bailey, Barbara Braden, James East, Jens Rittscher
Abstract: The Endoscopy Computer Vision Challenge (EndoCV) is a crowd-sourcing initiative to address eminent problems in developing reliable computer aided detection and diagnosis endoscopy systems and suggest a pathway for clinical translation of technologies. Whilst endoscopy is a widely used diagnostic and treatment tool for hollow-organs, there are several core challenges often faced by endoscopists, mainly: 1) presence of multi-class artefacts that hinder their visual interpretation, and 2) difficulty in identifying subtle precancerous precursors and cancer abnormalities. Artefacts often affect the robustness of deep learning methods applied to the gastrointestinal tract organs as they can be confused with tissue of interest. EndoCV2020 challenges are designed to address research questions in these remits. In this paper, we present a summary of methods developed by the top 17 teams and provide an objective comparison of state-of-the-art methods and methods designed by the participants for two sub-challenges: i) artefact detection and segmentation (EAD2020), and ii) disease detection and segmentation (EDD2020). Multi-center, multi-organ, multi-class, and multi-modal clinical endoscopy datasets were compiled for both EAD2020 and EDD2020 sub-challenges. An out-of-sample generalisation ability of detection algorithms was also evaluated. Whilst most teams focused on accuracy improvements, only a few methods hold credibility for clinical usability. The best performing teams provided solutions to tackle class imbalance, and variabilities in size, origin, modality and occurrences by exploring data augmentation, data fusion, and optimal class thresholding techniques.
42. MedICaT: A Dataset of Medical Images, Captions, and Textual References
Sanjay Subramanian, Lucy Lu Wang, Sachin Mehta, Ben Bogin, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi
Abstract: Understanding the relationship between figures and text is key to scientific document understanding. Medical figures in particular are quite complex, often consisting of several subfigures (75% of figures in our dataset), with detailed text describing their content. Previous work studying figures in scientific papers focused on classifying figure content rather than understanding how images relate to the text. To address challenges in figure retrieval and figure-to-text alignment, we introduce MedICaT, a dataset of medical images in context. MedICaT consists of 217K images from 131K open access biomedical papers, and includes captions, inline references for 74% of figures, and manually annotated subfigures and subcaptions for a subset of figures. Using MedICaT, we introduce the task of subfigure to subcaption alignment in compound figures and demonstrate the utility of inline references in image-text matching. Our data and code can be accessed at this https URL.
43. Shape-Texture Debiased Neural Network Training
Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, Cihang Xie
Abstract: Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degrades model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending to a single cue in representation learning, we augment training data with images with conflicting shape and texture information (e.g., an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervisions from shape and texture simultaneously. Experiments show that our method successfully improves model performance on several image recognition benchmarks and adversarial robustness. For example, by training on ImageNet, it helps ResNet-152 achieve substantial improvements on ImageNet (+1.2%), ImageNet-A (+5.2%), ImageNet-C (+8.3%) and Stylized-ImageNet (+11.1%), and on defending against FGSM adversarial attacker on ImageNet (+14.4%). Our method also claims to be compatible with other advanced data augmentation strategies, e.g., Mixup and CutMix. The code is available here: this https URL.
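One way to read "supervision from shape and texture simultaneously" is as a soft training target that puts weight on both the shape class and the texture class of a cue-conflict image. The sketch below illustrates that reading only; the names `debiased_target` and `soft_cross_entropy` and the `shape_weight` parameter are hypothetical, and the abstract does not state how the two labels are weighted.

```python
import math


def debiased_target(shape_label, texture_label, num_classes, shape_weight=0.5):
    """Soft label mixing the shape class and the texture class.

    shape_weight is an assumed hyperparameter: 1.0 trusts only the shape
    label, 0.0 only the texture label.
    """
    target = [0.0] * num_classes
    target[shape_label] += shape_weight
    target[texture_label] += 1.0 - shape_weight
    return target


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


if __name__ == "__main__":
    # Hypothetical class indices: 0 = chimpanzee (shape), 2 = lemon (texture).
    target = debiased_target(shape_label=0, texture_label=2, num_classes=3)
    print(target)
    print(soft_cross_entropy([2.0, 0.1, 1.5], target))
```

With such a target, a model is penalized for ignoring either cue, which matches the stated goal of preventing it from attending to a single cue.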
44. Towards human performance on automatic motion tracking of infant spontaneous movements
Daniel Groos, Lars Adde, Ragnhild Støen, Heri Ramampiaro, Espen A. F. Ihlen
Abstract: Assessment of spontaneous movements can predict the long-term developmental outcomes in high-risk infants. In order to develop algorithms for automated prediction of later function based on early motor repertoire, high-precision tracking of segments and joints is required. Four types of convolutional neural networks were investigated on a novel infant pose dataset, covering the large variation in 1 424 videos from a clinical international community. The precision level of the networks was evaluated as the deviation between the estimated keypoint positions and human expert annotations. The computational efficiency was also assessed to determine the feasibility of the neural networks in clinical practice. The study shows that the precision of the best performing infant motion tracker is similar to the inter-rater error of human experts, while still operating efficiently. In conclusion, the proposed tracking of infant movements can pave the way for early detection of motor disorders in children with perinatal brain injuries by quantifying infant movements from video recordings with human precision.
45. Scenic: A Language for Scenario Specification and Data Generation
Daniel J. Fremont, Edward Kim, Tommaso Dreossi, Shromona Ghosh, Xiangyu Yue, Alberto L. Sangiovanni-Vincentelli, Sanjit A. Seshia
Abstract: We propose a new probabilistic programming language for the design and analysis of cyber-physical systems, especially those based on machine learning. Specifically, we consider the problems of training a system to be robust to rare events, testing its performance under different conditions, and debugging failures. We show how a probabilistic programming language can help address these problems by specifying distributions encoding interesting types of inputs, then sampling these to generate specialized training and test data. More generally, such languages can be used to write environment models, an essential prerequisite to any formal analysis. In this paper, we focus on systems like autonomous cars and robots, whose environment at any point in time is a 'scene', a configuration of physical objects and agents. We design a domain-specific language, Scenic, for describing scenarios that are distributions over scenes and the behaviors of their agents over time. As a probabilistic programming language, Scenic allows assigning distributions to features of the scene, as well as declaratively imposing hard and soft constraints over the scene. We develop specialized techniques for sampling from the resulting distribution, taking advantage of the structure provided by Scenic's domain-specific syntax. Finally, we apply Scenic in a case study on a convolutional neural network designed to detect cars in road images, improving its performance beyond that achieved by state-of-the-art synthetic data generation methods.
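The idea of assigning distributions to scene features and then imposing a hard constraint can be illustrated with naive rejection sampling. The sketch below is plain Python, not Scenic syntax, and it does not reproduce the specialized sampling techniques the paper develops; the road dimensions, the object names, and the gap constraint are all hypothetical.

```python
import random


def sample_scene(rng, max_tries=1000):
    """Draw scenes from a prior and keep the first one meeting the constraint."""
    for _ in range(max_tries):
        # Prior: positions along a hypothetical 100 m road, 7 m wide.
        ego = (rng.uniform(0.0, 100.0), rng.uniform(-3.5, 3.5))
        car = (rng.uniform(0.0, 100.0), rng.uniform(-3.5, 3.5))
        gap = car[0] - ego[0]
        # Hard constraint (assumed): the other car is 5 to 30 m ahead of ego.
        if 5.0 <= gap <= 30.0:
            return {"ego": ego, "car": car}
    raise RuntimeError("no scene satisfied the constraint")


if __name__ == "__main__":
    scene = sample_scene(random.Random(0))
    print(scene)
```

Each accepted scene could then be rendered in a simulator to produce a training or test image. Rejection sampling becomes wasteful when constraints are rarely satisfied, which is one motivation for samplers that exploit the structure of the scenario description.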
46. Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
Jack Hessel, Lillian Lee
Abstract: Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise outperform less expressive models; thus, performance improvements, even when present, often cannot be attributed to consideration of cross-modal feature interactions. We hence recommend that researchers in multimodal machine learning report the performance not only of unimodal baselines, but also the EMAP of their best-performing model.
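A projection that removes cross-modal interactions can be sketched as follows. The abstract does not spell out the formula, so this is an assumption: each prediction is replaced by a text-only average plus an image-only average minus the overall average, computed over all text-image pairings in an evaluation sample. The function name `emap` and the scalar-valued model `f` are hypothetical simplifications.

```python
def emap(f, texts, images):
    """Additive projection of f over paired samples (texts[i], images[i]).

    f(text, image) returns a scalar score. The result has no text-image
    interaction term: it is a sum of a text-only and an image-only part.
    """
    n = len(texts)
    assert n == len(images) and n > 0
    # scores[i][j] = f(texts[i], images[j]) for every pairing.
    scores = [[f(t, v) for v in images] for t in texts]
    grand_mean = sum(sum(row) for row in scores) / (n * n)
    projected = []
    for i in range(n):
        text_term = sum(scores[i][j] for j in range(n)) / n   # average over images
        image_term = sum(scores[j][i] for j in range(n)) / n  # average over texts
        projected.append(text_term + image_term - grand_mean)
    return projected


if __name__ == "__main__":
    additive = lambda t, v: 2 * t + 3 * v   # no interaction
    interactive = lambda t, v: t * v        # pure interaction
    print(emap(additive, [1, 2, 3], [4, 5, 6]))
    print(emap(interactive, [1, 2, 3], [1, 2, 3]))
```

If a model's accuracy barely changes when its predictions are replaced by the projected ones, its performance cannot be attributed to cross-modal interactions on that sample. Note that the projection needs n squared model evaluations for n examples.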
Summary: The paper introduces empirical multimodally-additive function projection (EMAP), a diagnostic that removes cross-modal interactions from a model's predictions and keeps only the additive, unimodal structure. On seven image+text classification tasks, removing the interactions often causes little to no performance loss, even for expressive models. The authors recommend reporting the EMAP of the best-performing model alongside unimodal baselines.
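The projection can be sketched in a few lines, assuming it is computed by averaging the model's outputs over all text-image pairings of the evaluation set. The function name `emap` and the square score-matrix layout are illustrative assumptions, not the authors' released code; for a multi-class model the same computation would be applied to each output dimension.

```python
def emap(scores):
    """Additive projection of model outputs.

    scores[i][j] is the model output for text i paired with image j (an N x N table).
    Returns the projected output for each original pair (i, i).
    """
    n = len(scores)
    row_mean = [sum(scores[i]) / n for i in range(n)]                         # text-only part
    col_mean = [sum(scores[i][j] for i in range(n)) / n for j in range(n)]    # image-only part
    total_mean = sum(row_mean) / n
    return [row_mean[i] + col_mean[i] - total_mean for i in range(n)]
```

If the model is already additive (its output is a text term plus an image term), the projection leaves its predictions unchanged; a pure interaction term is averaged away.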
47. Piece-wise Matching Layer in Representation Learning for ECG Classification [PDF]
Behzad Ghazanfari, Fatemeh Afghah, Sixian Zhang
Abstract: This paper proposes a piece-wise matching layer as a novel layer in representation learning methods for electrocardiogram (ECG) classification. Despite the remarkable performance of representation learning methods in the analysis of time series, several challenges remain, including the complex structure of the methods, the lack of generality of solutions, the need for expert knowledge, and the need for large-scale training datasets. We introduce the piece-wise matching layer, which works on two levels, to address some of these challenges. At the first level, a set of morphological, statistical, and frequency features and their comparative forms are computed from each periodic part and its neighbors. At the second level, these features are modified by predefined transformation functions based on a receptive field scenario. Several scenarios (offline processing, incremental processing, a fixed sliding receptive field, and an event-triggered receptive field) can be implemented depending on the length of the receptive field and the mechanism that indicates it. We propose dynamic time warping as a mechanism that indicates a receptive field based on event-triggering tactics. To evaluate the method in time series analysis, we applied the proposed layer to two publicly available datasets from the PhysioNet competitions in 2015 and 2017, where the input data is an ECG signal. We compared our method against a variety of known, tuned methods based on expert knowledge, machine learning, deep learning, and combinations of them. The proposed approach improves the state of the art in the 2015 and 2017 competitions by around 4% and 7%, respectively, while not relying on advance knowledge of the classes or of the possible locations of arrhythmia.
Summary: The paper proposes a piece-wise matching layer for ECG classification. The layer first computes morphological, statistical, and frequency features for each periodic part and its neighbors, then transforms them according to a receptive field that can be fixed, incremental, or event-triggered via dynamic time warping. On the PhysioNet 2015 and 2017 competition datasets it improves the state of the art by around 4% and 7%, without prior knowledge of the classes or of where arrhythmia occurs.
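For readers unfamiliar with dynamic time warping, the mechanism the abstract names for indicating a receptive field, the following is a minimal sketch of the classic DTW distance between two one-dimensional sequences. It shows only the textbook recurrence; how the authors use DTW inside their layer is not specified here, and the function name `dtw_distance` is illustrative.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = cost of the best alignment of a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because DTW allows one sample to align with several, two heartbeats of the same shape but different duration can have zero distance, which is what makes it attractive for comparing periodic segments.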
48. RANDGAN: Randomized Generative Adversarial Network for Detection of COVID-19 in Chest X-ray [PDF]
Saman Motamed, Patrik Rogalla, Farzad Khalvati
Abstract: The spread of COVID-19 across the globe at an immense rate has left healthcare systems unable to diagnose and test patients at the needed rate. Studies have shown promising results for detection of COVID-19 from viral bacterial pneumonia in chest X-rays. Automation of COVID-19 testing using medical images can speed up the testing process of patients where health care systems lack sufficient numbers of reverse-transcription polymerase chain reaction (RT-PCR) tests. Supervised deep learning models such as convolutional neural networks (CNNs) need enough labeled data for all classes to correctly learn the task of detection. Gathering labeled data is a cumbersome task and requires time and resources, which could further strain health care systems and radiologists at the early stages of a pandemic such as COVID-19. In this study, we propose a randomized generative adversarial network (RANDGAN) that detects images of an unknown class (COVID-19) from known and labelled classes (Normal and Viral Pneumonia) without the need for labels and training data from the unknown class of images (COVID-19). We used the largest publicly available COVID-19 chest X-ray dataset, COVIDx, which comprises Normal, Pneumonia, and COVID-19 images from multiple public databases. In this work, we use transfer learning to segment the lungs in the COVIDx dataset. Next, we show why segmentation of the region of interest (lungs) is vital to correctly learn the task of classification, specifically in datasets that contain images from different resources, as is the case for the COVIDx dataset. Finally, we show improved results in detection of COVID-19 cases using our generative model (RANDGAN) compared to conventional generative adversarial networks (GANs) for anomaly detection in medical images, improving the area under the ROC curve from 0.71 to 0.77.
Summary: The paper proposes RANDGAN, a randomized generative adversarial network that detects an unknown class (COVID-19) in chest X-rays using only labelled images of known classes (Normal and Viral Pneumonia). Working on the COVIDx dataset, the authors segment the lungs with transfer learning and argue that this step is vital when images come from different sources. RANDGAN raises the area under the ROC curve from 0.71 to 0.77 compared with conventional GAN-based anomaly detection.
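The reported metric, area under the ROC curve, can be computed directly from anomaly scores as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The sketch below is a generic pairwise implementation with a hypothetical function name `roc_auc`; it is not taken from the authors' evaluation code.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, an AUC of 0.77 means that about 77% of COVID-19/non-COVID-19 image pairs are ordered correctly by the anomaly score.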
49. Which Model to Transfer? Finding the Needle in the Growing Haystack [PDF]
Cedric Renggli, André Susano Pinto, Luka Rimanic, Joan Puigcerver, Carlos Riquelme, Ce Zhang, Mario Lucic
Abstract: Transfer learning has been recently popularized as a data-efficient alternative to training models from scratch, in particular in vision and NLP where it provides a remarkably solid baseline. The emergence of rich model repositories, such as TensorFlow Hub, enables the practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. We provide a formalization of this problem through a familiar notion of regret and introduce the predominant strategies, namely task-agnostic (e.g. picking the highest scoring ImageNet model) and task-aware search strategies (such as linear or kNN evaluation). We conduct a large-scale empirical study and show that both task-agnostic and task-aware methods can yield high regret. We then propose a simple and computationally efficient hybrid search strategy which outperforms the existing approaches. We highlight the practical benefits of the proposed solution on a set of 19 diverse vision tasks.
Summary: As model repositories such as TensorFlow Hub grow, choosing a good pretrained model for a downstream task becomes critical. The paper formalizes model selection through regret and compares task-agnostic strategies (for example, picking the highest-scoring ImageNet model) with task-aware ones (linear or kNN evaluation), showing that both can yield high regret. A simple, computationally efficient hybrid strategy outperforms both on 19 diverse vision tasks.
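The notion of regret used to compare search strategies can be illustrated as follows, assuming regret is the gap between the best fine-tuned accuracy in the whole pool and the best accuracy among the models a strategy selects. The exact definition in the paper may differ (for instance, it may be normalized); the model names and accuracies below are hypothetical.

```python
def regret(finetuned_accuracy, selected):
    """Gap between the best model in the pool and the best model the strategy picked."""
    if not selected:
        raise ValueError("the strategy must select at least one model")
    best_overall = max(finetuned_accuracy.values())
    best_selected = max(finetuned_accuracy[m] for m in selected)
    return best_overall - best_selected

# Hypothetical downstream accuracies after fine-tuning each candidate model.
accuracy = {"model_a": 0.81, "model_b": 0.88, "model_c": 0.92}
```

A strategy that happens to include the best model has zero regret regardless of how many other models it also picks, so regret is usually reported for a fixed selection budget.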
50. A Possible Method of Carbon Deposit Mapping on Plasma Facing Components Using Infrared Thermography [PDF]
R Mitteau, J Spruytte, S Vallet, J Travère, D Guilhem, C Brosset
Abstract: Material eroded from the surface of plasma facing components is partly redeposited close to high heat flux areas. At these locations, the deposit is heated by the plasma and the deposition pattern evolves depending on the operation parameters. Mapping the deposit is still a matter of intense scientific activity, especially during the course of experimental campaigns. A method is proposed that compares surface temperature maps obtained in situ by infrared cameras with maps obtained by theoretical modelling. The difference between the two is attributed to the thermal resistance added by the deposited material and is expressed as a deposit thickness. The method benefits from elaborate imaging techniques such as possibility theory and fuzzy logic. The results are consistent with deposit maps obtained by visual inspection during shutdowns.
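Summary: Material eroded from plasma facing components is partly redeposited near high heat flux areas. The paper proposes mapping this deposit by comparing surface temperature maps measured in situ by infrared cameras with maps from theoretical modelling, attributing the difference to the thermal resistance of the deposit and expressing it as a thickness. The resulting maps agree with visual inspection during shutdowns.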
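The step from a temperature difference to a deposit thickness can be sketched under a strong simplifying assumption that the abstract does not state: one-dimensional steady-state conduction through a deposit of known thermal conductivity, so that the added thermal resistance is R = ΔT / q and the thickness is d = k · R. The function name and the example values are hypothetical.

```python
def deposit_thickness(t_measured, t_modelled, heat_flux, conductivity):
    """Deposit thickness in metres from the excess surface temperature.

    Assumes 1D steady-state conduction (an illustrative simplification).
    t_measured, t_modelled: surface temperatures in K (or degrees C)
    heat_flux: W / m^2, conductivity: W / (m K)
    """
    if heat_flux <= 0:
        raise ValueError("heat flux must be positive")
    thermal_resistance = (t_measured - t_modelled) / heat_flux  # K m^2 / W
    return conductivity * thermal_resistance                     # m
```

Applying this per pixel to the two temperature maps would give a thickness map; the uncertainty in conductivity and heat flux is presumably where the possibility-theory and fuzzy-logic treatment comes in.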
51. Automation of Hemocompatibility Analysis Using Image Segmentation and a Random Forest [PDF]
Johanna C. Clauser, Judith Maas, Jutta Arens, Thomas Schmitz-Rode, Ulrich Steinseifer, Benjamin Berkels
Abstract: The hemocompatibility of blood-contacting medical devices remains one of the major challenges in biomedical engineering and makes research in the field of new and improved materials inevitable. However, current in-vitro test and analysis methods still lack standardization and comparability, which impedes advances in material design. For example, the optical platelet analysis of material in-vitro hemocompatibility tests is carried out manually or semi-manually by each research group individually. As a step towards standardization, this paper proposes an automation approach for the optical platelet count and analysis. To this end, fluorescence images are segmented using Zach's convexification of the multiphase piecewise constant Mumford-Shah model. The resulting connected components of the non-background segments then need to be classified as platelet or no platelet. Therefore, a supervised random forest is applied to feature vectors derived from the components using features like area, perimeter and circularity. With an overall high accuracy and low error rates, the random forest achieves reliable results. This is supported by high areas under the receiver operating characteristic and precision-recall curves. We developed a new method for a fast, user-independent and reproducible analysis of material hemocompatibility tests, which is therefore a unique and powerful tool for advances in biomaterial research.
Summary: In-vitro hemocompatibility tests lack standardization because platelet analysis is done manually or semi-manually by each research group. The paper automates it: fluorescence images are segmented with Zach's convexification of the multiphase piecewise constant Mumford-Shah model, and a random forest classifies each connected component as platelet or no platelet from features such as area, perimeter, and circularity. The classifier reaches high accuracy and high areas under the evaluation curves, giving a fast, user-independent, and reproducible analysis.
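The per-component feature vector can be sketched as below. The abstract names area, perimeter, and circularity as example features; the formula used here, circularity = 4πA / P², is the common definition and is an assumption about what the authors compute. The function name `shape_features` is illustrative.

```python
import math

def shape_features(area, perimeter):
    """Feature vector [area, perimeter, circularity] for one connected component."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    # Circularity is 1 for a perfect disc and smaller for elongated or ragged shapes.
    circularity = 4.0 * math.pi * area / perimeter ** 2
    return [area, perimeter, circularity]
```

Vectors like these, one per segmented component, are what a random forest classifier would take as input rows.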
52. Experimental Quantum Generative Adversarial Networks for Image Generation [PDF]
He-Liang Huang, Yuxuan Du, Ming Gong, Youwei Zhao, Yulin Wu, Chaoyue Wang, Shaowei Li, Futian Liang, Jin Lin, Yu Xu, Rui Yang, Tongliang Liu, Min-Hsiu Hsieh, Hui Deng, Hao Rong, Cheng-Zhi Peng, Chao-Yang Lu, Yu-Ao Chen, Dacheng Tao, Xiaobo Zhu, Jian-Wei Pan
Abstract: Quantum machine learning is expected to be one of the first practical applications of near-term quantum devices. Pioneering theoretical works suggest that quantum generative adversarial networks (GANs) may exhibit a potential exponential advantage over classical GANs, thus attracting widespread attention. However, it remains elusive whether quantum GANs implemented on near-term quantum devices can actually solve real-world learning tasks. Here, we devise a flexible quantum GAN scheme to narrow this knowledge gap, which could accomplish image generation with arbitrarily high-dimensional features, and could also take advantage of quantum superposition to train multiple examples in parallel. For the first time, we experimentally achieve the learning and generation of real-world hand-written digit images on a superconducting quantum processor. Moreover, we utilize a gray-scale bar dataset to exhibit the competitive performance between quantum GANs and the classical GANs based on multilayer perceptron and convolutional neural network architectures, respectively, benchmarked by the Fréchet Distance score. Our work provides guidance for developing advanced quantum generative models on near-term quantum devices and opens up an avenue for exploring quantum advantages in various GAN-related learning tasks.
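Summary: Whether quantum GANs on near-term devices can solve real-world learning tasks is still unclear. The paper devises a flexible quantum GAN scheme that can generate images with arbitrarily high-dimensional features and train multiple examples in parallel through quantum superposition, and demonstrates the learning and generation of hand-written digit images on a superconducting quantum processor. On a gray-scale bar dataset, the quantum GAN is competitive with classical GANs based on multilayer perceptrons and convolutional networks, as measured by the Fréchet Distance score.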
53. COVID-19 Imaging Data Privacy by Federated Learning Design: A Theoretical Framework [PDF]
Anwaar Ulhaq, Oliver Burmeister
Abstract: To address COVID-19 healthcare challenges, we need frequent sharing of health data, knowledge and resources at a global scale. However, in this digital age, data privacy is a big concern that requires the secure embedding of privacy assurance into the design of all technological solutions that use health data. In this paper, we introduce a differential privacy by design (dPbD) framework and discuss its embedding into a federated machine learning system. To limit the scope of our paper, we focus on the problem scenario of COVID-19 imaging data privacy for disease diagnosis by computer vision and deep learning approaches. We discuss the evaluation of the proposed design of federated machine learning systems and discuss how the dPbD framework can enhance data privacy in federated learning systems with scalability and robustness. We argue that scalable differentially private federated learning design is a promising solution for building the secure, private and collaborative machine learning models required to combat the COVID-19 challenge.
Summary: Tackling COVID-19 requires sharing health data globally, which makes privacy assurance a design requirement for any system that uses such data. The paper introduces a differential privacy by design (dPbD) framework, discusses how to embed it in federated machine learning systems for COVID-19 imaging diagnosis, and argues that scalable differentially private federated learning is a promising basis for secure, private, and collaborative models.
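One common way to combine differential privacy with federated learning is to clip each client's model update and add Gaussian noise to the average before it leaves the aggregation step. The sketch below illustrates that general pattern only; the paper is a theoretical framework and does not prescribe this code. The function names, the clipping bound, and the noise scale are hypothetical, and the noise is not calibrated to any particular privacy budget.

```python
import math
import random

def clip(update, max_norm):
    """Scale an update vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def private_average(client_updates, max_norm, noise_std, rng):
    """Average clipped client updates and add Gaussian noise to each coordinate."""
    clipped = [clip(u, max_norm) for u in client_updates]
    n = len(clipped)
    dim = len(clipped[0])
    return [sum(u[k] for u in clipped) / n + rng.gauss(0.0, noise_std) for k in range(dim)]
```

Clipping bounds how much any single hospital's data can move the shared model, and the noise masks what remains; raw images never leave the client in this setup.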
54. Similarity Based Stratified Splitting: an approach to train better classifiers [PDF]
Felipe Farias, Teresa Ludermir, Carmelo Bastos-Filho
Abstract: We propose a Similarity-Based Stratified Splitting (SBSS) technique, which uses both the output and input space information to split the data. The splits are generated using similarity functions among samples to place similar samples in different splits. This approach allows for a better representation of the data in the training phase. This strategy leads to a more realistic performance estimation when used in real-world applications. We evaluate our proposal on twenty-two benchmark datasets with classifiers such as Multi-Layer Perceptron, Support Vector Machine, Random Forest and K-Nearest Neighbors, and five similarity functions: Cityblock, Chebyshev, Cosine, Correlation, and Euclidean. According to the Wilcoxon signed-rank test, our approach consistently outperformed ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios.
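Summary: The paper proposes Similarity-Based Stratified Splitting (SBSS), which splits data using both label and input-space information so that similar samples land in different splits. This gives the training phase a better representation of the data and yields more realistic performance estimates. Across twenty-two benchmark datasets, four classifier families, and five similarity functions, SBSS outperformed ordinary stratified 10-fold cross-validation in 75% of the assessed scenarios according to the Wilcoxon signed-rank test.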
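One plausible reading of "place similar samples in different splits" is sketched below: within each class, repeatedly take a sample and its nearest neighbours as a group of size k and give each group member a different fold. This is an illustrative interpretation, not the authors' algorithm; the function names and the greedy grouping rule are assumptions.

```python
def cityblock(a, b):
    """Cityblock (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def similarity_stratified_folds(X, y, k, dist=cityblock):
    """Return one fold index per sample; close same-class samples get different folds."""
    folds = [None] * len(X)
    offset = 0  # rotates the starting fold so leftover groups do not pile up in fold 0
    for label in sorted(set(y)):
        remaining = [i for i in range(len(X)) if y[i] == label]
        while remaining:
            pivot = remaining[0]
            # The pivot and its k-1 nearest same-class neighbours form one group.
            remaining.sort(key=lambda i: dist(X[pivot], X[i]))
            group, remaining = remaining[:k], remaining[k:]
            for pos, i in enumerate(group):
                folds[i] = (offset + pos) % k
            offset += len(group)
    return folds
```

Grouping per class keeps the split stratified on the labels, while spreading each neighbourhood across folds uses the input-space information.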
55. Gradient Descent Ascent for Min-Max Problems on Riemannian Manifold [PDF]
Feihu Huang, Shangqian Gao, Heng Huang
Abstract: In this paper, we study a class of useful non-convex minimax optimization problems on Riemannian manifolds and propose a class of Riemannian gradient descent ascent algorithms to solve these minimax problems. Specifically, we propose a new Riemannian gradient descent ascent (RGDA) algorithm for deterministic minimax optimization. Moreover, we prove that RGDA has a sample complexity of $O(\kappa^2\epsilon^{-2})$ for finding an $\epsilon$-stationary point of nonconvex strongly-concave minimax problems, where $\kappa$ denotes the condition number. At the same time, we introduce a Riemannian stochastic gradient descent ascent (RSGDA) algorithm for stochastic minimax optimization. In the theoretical analysis, we prove that RSGDA can achieve a sample complexity of $O(\kappa^4\epsilon^{-4})$. To further reduce the sample complexity, we propose a novel momentum variance-reduced Riemannian stochastic gradient descent ascent (MVR-RSGDA) algorithm based on the momentum variance-reduction technique of STORM. We prove that the MVR-RSGDA algorithm achieves a lower sample complexity of $\tilde{O}(\kappa^{4}\epsilon^{-3})$ without large batches, which comes close to the best known sample complexity of its Euclidean counterparts. This is the first study of minimax optimization over Riemannian manifolds. Extensive experimental results on robust deep neural network training over the Stiefel manifold demonstrate the efficiency of our proposed algorithms.
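To make the descent-ascent idea concrete, here is a minimal sketch on a toy problem of our own choosing, not one from the paper: minimize over the unit circle (the simplest Stiefel manifold, with a single column) and maximize over the plane the function f(x, y) = y^T B x - 0.5 ||y||^2, which is strongly concave in y. The matrix `B`, the step sizes, and the use of normalization as the retraction are all assumptions made for illustration.

```python
import math

# Toy problem data (an assumption for illustration, not from the paper).
B = [[2.0, 0.0], [0.0, 1.0]]


def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def rgda(x, y, eta_x=0.05, eta_y=0.5, steps=2000):
    """Riemannian gradient descent in x (unit sphere), plain ascent in y."""
    Bt = transpose(B)
    for _ in range(steps):
        grad_x = matvec(Bt, y)  # Euclidean gradient of f with respect to x
        inner = sum(g * xi for g, xi in zip(grad_x, x))
        # Project onto the tangent space of the sphere at x.
        rgrad_x = [g - inner * xi for g, xi in zip(grad_x, x)]
        # Gradient of f with respect to y is B x - y.
        grad_y = [bx - yi for bx, yi in zip(matvec(B, x), y)]
        # Descent step followed by retraction (renormalization) back to the sphere.
        x = normalize([xi - eta_x * g for xi, g in zip(x, rgrad_x)])
        # Ascent step in the Euclidean variable.
        y = [yi + eta_y * g for yi, g in zip(y, grad_y)]
    return x, y


x_star, y_star = rgda(normalize([1.0, 1.0]), [0.0, 0.0])
print(x_star, y_star)
```

For this toy problem the inner maximum equals 0.5 ||B x||^2, so the minimizer on the circle is the direction of the smallest singular value of `B`, here (0, 1) up to sign. The larger step size for y reflects the usual two-time-scale choice for nonconvex strongly-concave problems.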
56. A catalog of broad morphology of Pan-STARRS galaxies based on deep learning [PDF] Back to contents
Hunter Goddard, Lior Shamir
Abstract: Autonomous digital sky surveys such as Pan-STARRS can image a very large number of galactic and extra-galactic objects, and the large and complex nature of the image data reinforces the use of automation. Here we describe the design and implementation of a data analysis process for automatic broad morphology annotation of galaxies, and apply it to the data of Pan-STARRS DR1. The process is based on filters followed by a two-step convolutional neural network (CNN) classification. Training samples are generated by using an augmented and balanced set of manually classified galaxies. Accuracy is evaluated by comparing the results to the annotations of Pan-STARRS galaxies included in a previous broad morphology catalog of SDSS galaxies. Our analysis shows that a CNN combined with several filters is an effective approach for annotating the galaxies and removing unclean images. The catalog contains morphology labels for 1,662,190 galaxies with ~95% accuracy. The accuracy can be further improved by selecting labels above certain confidence thresholds. The catalog is publicly available.
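The last step mentioned in the abstract, keeping only labels above a confidence threshold, can be sketched as below. The record layout (object id, label, confidence), the label names, and the threshold value are hypothetical; the abstract does not describe the catalog's actual schema.

```python
def select_confident(predictions, threshold):
    """Keep (object_id, label) pairs whose confidence exceeds the threshold.

    `predictions` is a list of (object_id, label, confidence) tuples;
    this layout is an assumption made for illustration.
    """
    return [(oid, label) for oid, label, conf in predictions if conf > threshold]


# Hypothetical classifier outputs.
predictions = [
    ("obj-1", "spiral", 0.98),
    ("obj-2", "elliptical", 0.61),
    ("obj-3", "spiral", 0.55),
    ("obj-4", "elliptical", 0.93),
]
confident = select_confident(predictions, threshold=0.9)
print(confident)
```

Raising the threshold trades catalog size for purity: fewer galaxies keep a label, but the retained labels are the ones the classifier was most certain about.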
57. Assessing Lesion Segmentation Bias of Neural Networks on Motion Corrupted Brain MRI [PDF] Back to contents
Tejas Sudharshan Mathai, Yi Wang, Nathan Cross
Abstract: Patient motion during the magnetic resonance imaging (MRI) acquisition process results in motion artifacts, which limit the ability of radiologists to provide a quantitative assessment of the condition visualized. Oftentimes, radiologists either "see through" the artifacts with reduced diagnostic confidence, or the MR scans are rejected and patients are recalled and re-scanned. Presently, there are many published approaches that focus on MRI artifact detection and correction. However, the key question of the bias exhibited by these algorithms on motion-corrupted MRI images is still unanswered. In this paper, we seek to quantify the bias in terms of the impact that different levels of motion artifacts have on the performance of neural networks engaged in a lesion segmentation task. Additionally, we explore the effect of a different learning strategy, curriculum learning, on the segmentation performance. Our results suggest that a network trained using curriculum learning is effective at compensating for different levels of motion artifacts, and improves the segmentation performance by ~9%-15% (p < 0.05) when compared against a conventional shuffled learning strategy on the same motion data. Within each motion category, it either improved or maintained the Dice score. To the best of our knowledge, we are the first to quantitatively assess the segmentation bias on various levels of motion artifacts present in a brain MRI image.
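Two ingredients named in the abstract can be illustrated briefly: the Dice score used to rate a segmentation, and the difference between a curriculum ordering and a shuffled one. This is a sketch under stated assumptions: masks are flat lists of 0/1 values, the field name `motion_level` is hypothetical, and presenting the least corrupted scans first is our assumed curriculum direction, which the abstract does not specify.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def curriculum_order(samples):
    """Order training samples from least to most motion corruption (assumed)."""
    return sorted(samples, key=lambda s: s["motion_level"])


# A predicted mask with 2 foreground voxels against a target with 1, sharing 1.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
print(score)

# Hypothetical training samples tagged with a motion severity level.
samples = [
    {"id": "scan-a", "motion_level": 2},
    {"id": "scan-b", "motion_level": 0},
    {"id": "scan-c", "motion_level": 1},
]
ordered_ids = [s["id"] for s in curriculum_order(samples)]
print(ordered_ids)
```

A conventional shuffled strategy would instead present the same samples in random order in every epoch.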
58. Monitoring War Destruction from Space: A Machine Learning Approach [PDF] Back to contents
Hannes Mueller, Andre Groger, Jonathan Hersh, Andrea Matranga, Joan Serrat
Abstract: Existing data on building destruction in conflict zones rely on eyewitness reports or manual detection, which makes them generally scarce, incomplete, and potentially biased. This lack of reliable data imposes severe limitations on media reporting, humanitarian relief efforts, human rights monitoring, reconstruction initiatives, and academic studies of violent conflict. This article introduces an automated method of measuring destruction in high-resolution satellite images using deep learning techniques combined with data augmentation to expand training samples. We apply this method to the Syrian civil war and reconstruct the evolution of damage in major cities across the country. The approach allows generating destruction data with unprecedented scope, resolution, and frequency, limited only by the available satellite imagery, which can alleviate data limitations decisively.
59. Robots State Estimation and Observability Analysis Based on Statistical Motion Models [PDF] Back to contents
Wei Xu, Dongjiao He, Yixi Cai, Fu Zhang
Abstract: This paper presents a generic motion model to capture mobile robots' dynamic behaviors (translation and rotation). The model is based on statistical models driven by white random processes and is formulated into a full state estimation algorithm based on the error-state extended Kalman filtering framework (ESEKF). Major benefits of this method are its versatility, being applicable to different robotic systems without accurately modeling the robots' specific dynamics, and its ability to estimate the robot's (angular) acceleration, jerk, or higher-order dynamic states with low delay. Mathematical analysis and numerical simulations are presented to show the properties of the statistical model-based estimation framework and to reveal its connection to existing low-pass filters. Furthermore, a new paradigm is developed for robot observability analysis by developing Lie derivatives and associated partial differentiation directly on manifolds. It is shown that this new paradigm is much simpler and more natural than existing methods based on quaternion parameterizations. It is also scalable to high-dimensional systems. A novel "thin set" concept is introduced to characterize the unobservable subset of the system states, providing the theoretical foundation for observability analysis of robotic systems operating on manifolds and in high dimensions. Finally, extensive experiments including full state estimation and extrinsic calibration (both POS-IMU and IMU-IMU) on a quadrotor UAV, a handheld platform, and a ground vehicle are conducted. Comparisons with existing methods show that the proposed method can effectively estimate all extrinsic parameters, the robot's translational/angular acceleration, and other state variables (e.g., position, velocity, attitude) with high accuracy and low delay.
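The core idea of a motion model driven by a white random process can be shown with a much simpler stand-in than the paper's method: a one-dimensional linear Kalman filter whose velocity is driven by white-noise acceleration, estimating velocity from position measurements alone. This is not the ESEKF on manifolds described in the abstract; the function name, noise parameters, and test trajectory are assumptions made for illustration.

```python
def track_constant_velocity(measurements, dt, q=1.0, r=0.01):
    """Linear Kalman filter with state (position p, velocity v).

    Model: p' = p + dt * v, v' = v + white noise of intensity q.
    Only position is measured, with noise variance r.
    The covariance is stored as a = P[0][0], b = P[0][1], c = P[1][1].
    """
    p, v = 0.0, 0.0
    a, b, c = 10.0, 0.0, 10.0  # deliberately uncertain initial state
    for z in measurements:
        # Predict: propagate the state and covariance through the model.
        p = p + dt * v
        a, b, c = (
            a + 2.0 * dt * b + dt * dt * c + q * dt ** 3 / 3.0,
            b + dt * c + q * dt * dt / 2.0,
            c + q * dt,
        )
        # Update: correct with the position measurement z.
        s = a + r                     # innovation variance
        k0, k1 = a / s, b / s         # Kalman gains for p and v
        innovation = z - p
        p = p + k0 * innovation
        v = v + k1 * innovation
        a, b, c = (1.0 - k0) * a, (1.0 - k0) * b, c - k1 * b
    return p, v


# Noise-free positions of an object moving at 2 units per second.
dt = 0.1
measurements = [2.0 * k * dt for k in range(1, 201)]
p_est, v_est = track_constant_velocity(measurements, dt)
print(p_est, v_est)
```

The filter is never told the velocity, yet the estimate settles near 2 because the white-noise term lets the velocity state adapt to whatever the position measurements imply. Adding acceleration or jerk as further states driven by white noise extends the same construction to the higher-order quantities the abstract mentions.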
Note: The Chinese text is machine-translated! The cover image is a word cloud of the paper titles!