1. Novel Object Viewpoint Estimation through Reconstruction Alignment [PDF]
Mohamed El Banani, Jason J. Corso, David F. Fouhey
Abstract: The goal of this paper is to estimate the viewpoint for a novel object. Standard viewpoint estimation approaches generally fail on this task due to their reliance on a 3D model for alignment or on large amounts of class-specific training data and its corresponding canonical pose. We overcome those limitations by learning a reconstruct-and-align approach. Our key insight is that although we do not have an explicit 3D model or a predefined canonical pose, we can still learn to estimate the object's shape in the viewer's frame and then use an image to provide our reference model or canonical pose. In particular, we propose learning two networks: the first maps images to a 3D geometry-aware feature bottleneck and is trained via an image-to-image translation loss; the second learns whether two instances of features are aligned. At test time, our model finds the relative transformation that best aligns the bottleneck features of our test image to a reference image. We evaluate our method on novel object viewpoint estimation by generalizing across different datasets, analyzing the impact of our different modules, and providing a qualitative analysis of the learned features to identify what representations are being learnt for alignment.
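To make the reconstruct-and-align idea concrete, below is a minimal PyTorch sketch, not the authors' code: `FeatureEncoder`, `AlignmentScorer`, and the discretised rotation search are hypothetical stand-ins for the two proposed networks and the test-time alignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEncoder(nn.Module):
    """Maps an RGB image to a coarse 3D geometry-aware feature volume."""
    def __init__(self, feat_dim=8, depth=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, feat_dim * depth, 3, stride=2, padding=1))
        self.feat_dim, self.depth = feat_dim, depth

    def forward(self, img):                       # (B, 3, H, W)
        x = self.net(img)                         # (B, C*D, h, w)
        b, _, h, w = x.shape
        return x.view(b, self.feat_dim, self.depth, h, w)

class AlignmentScorer(nn.Module):
    """Predicts a logit for whether two feature volumes are aligned."""
    def __init__(self, feat_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2 * feat_dim, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, f_ref, f_test):
        return self.net(torch.cat([f_ref, f_test], dim=1))

def rotate_volume(vol, R):
    """Resample a (B, C, D, H, W) volume under a 3x3 rotation matrix R."""
    theta = torch.cat([R, torch.zeros(3, 1)], dim=1).expand(vol.size(0), 3, 4)
    grid = F.affine_grid(theta, vol.shape, align_corners=False)
    return F.grid_sample(vol, grid, align_corners=False)

# Test time: pick the candidate rotation that best aligns test to reference.
enc, scorer = FeatureEncoder(), AlignmentScorer()
ref, test = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
candidates = [torch.eye(3)]                       # e.g. a discretised SO(3) grid
best = max(candidates,
           key=lambda R: scorer(enc(ref), rotate_volume(enc(test), R)).item())
```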
2. A Meta-Bayesian Model of Intentional Visual Search [PDF]
Maell Cullen, Jonathan Monney, M. Berk Mirza, Rosalyn Moran
Abstract: We propose a computational model of visual search that incorporates Bayesian interpretations of the neural mechanisms that underlie categorical perception and saccade planning. To enable meaningful comparisons between simulated and human behaviours, we employ a gaze-contingent paradigm that requires participants to classify occluded MNIST digits through a window that follows their gaze. The conditional independencies imposed by a separation of time scales in this task are embodied by constraints on the hierarchical structure of our model; planning and decision making are cast as a partially observable Markov Decision Process while proprioceptive and exteroceptive signals are integrated by a dynamic model that facilitates approximate inference on visual information and its latent causes. Our model is able to recapitulate human behavioural metrics such as classification accuracy while retaining a high degree of interpretability, which we demonstrate by recovering subject-specific parameters from observed human behaviour.
3. VALUE: Large Scale Voting-based Automatic Labelling for Urban Environments [PDF]
Giacomo Dabisias, Emanuele Ruffaldi, Hugo Grimmett, Peter Ondruska
Abstract: This paper presents a simple and robust method for the automatic localisation of static 3D objects in large-scale urban environments. By exploiting the potential to merge a large volume of noisy but accurately localised 2D image data, we achieve superior performance in terms of both robustness and accuracy of the recovered 3D information. The method is based on a simple voting schema that can be fully distributed and parallelised to scale to large-scale scenarios. To evaluate the method we collected city-scale datasets from New York City and San Francisco consisting of almost 400k images spanning an area of 40 km$^2$ and used them to accurately recover the 3D positions of traffic lights. We demonstrate robust performance and also show that the solution improves in quality over time as the amount of data increases.
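As a rough illustration of the voting schema, here is a hypothetical single-machine sketch in numpy that simplifies the problem to a 2D ground-plane grid (the paper votes in 3D and distributes the computation across images, which is what makes it parallelisable): each geolocated 2D detection casts a ray, and grid cells accumulating many rays mark likely object positions.

```python
import numpy as np

def vote_positions(rays, grid_size=100, cell=1.0, steps=200):
    """Accumulate votes from detection rays into a ground-plane grid.
    rays: iterable of (origin, direction) pairs in world coordinates (metres),
    one ray per 2D detection in an accurately localised image."""
    grid = np.zeros((grid_size, grid_size))
    for origin, direction in rays:
        d = direction / np.linalg.norm(direction)
        for t in np.linspace(0.0, grid_size * cell, steps):
            i, j = ((origin + t * d) / cell).astype(int)[:2]
            if 0 <= i < grid_size and 0 <= j < grid_size:
                grid[i, j] += 1      # noisy rays cancel out; true objects peak
    return grid                      # local maxima ~ recovered positions
```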
4. Segmentation of Surgical Instruments for Minimally-Invasive Robot-Assisted Procedures Using Generative Deep Neural Networks [PDF]
Iñigo Azqueta-Gavaldon, Florian Fröhlich, Klaus Strobl, Rudolph Triebel
Abstract: This work shows that semantic segmentation of minimally invasive surgical instruments can be improved by using training data that has been augmented through domain adaptation. The benefit of this method is twofold. Firstly, it removes the need to manually label thousands of images by transforming synthetic data into realistic-looking data. To achieve this, a CycleGAN model is used, which transforms a source dataset to approximate the domain distribution of a target dataset. Secondly, this newly generated data with perfect labels is utilized to train a semantic segmentation neural network, U-Net. The method shows generalization capabilities on data with variability in rotation, position, and lighting conditions. Nevertheless, one caveat of this approach is that the model is unable to generalize well to other surgical instruments whose shape differs from the one used for training. This is driven by the lack of high variance in the geometric distribution of the training data. Future work will focus on making the model more scale-invariant and able to adapt to other types of surgical instruments previously unseen during training.
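The two-stage pipeline can be sketched as follows; this is a hedged outline with stand-in models (`cyclegan_G` is any trained CycleGAN generator, `unet` any U-Net producing per-pixel logits), not the authors' code. The key point is that the CycleGAN only restyles the synthetic frames, so the synthetic ground-truth masks remain valid labels.

```python
import torch

def translate_synthetic(cyclegan_G, synthetic_loader):
    """Stage 1: make synthetic frames look realistic; labels are unchanged."""
    translated = []
    with torch.no_grad():
        for img, mask in synthetic_loader:   # masks come for free with synthesis
            translated.append((cyclegan_G(img), mask))
    return translated

def train_unet(unet, data, epochs=10, lr=1e-3):
    """Stage 2: supervised binary segmentation on the translated data."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for img, mask in data:
            opt.zero_grad()
            loss = bce(unet(img), mask)      # unet outputs per-pixel logits
            loss.backward()
            opt.step()
    return unet
```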
5. Learning Neural Light Transport [PDF]
Paul Sanzenbacher, Lars Mescheder, Andreas Geiger
Abstract: In recent years, deep generative models have gained significance due to their ability to synthesize natural-looking images with applications ranging from virtual reality to data augmentation for training computer vision models. While existing models are able to faithfully learn the image distribution of the training set, they often lack controllability as they operate in 2D pixel space and do not model the physical image formation process. In this work, we investigate the importance of 3D reasoning for photorealistic rendering. We present an approach for learning light transport in static and dynamic 3D scenes using a neural network with the goal of predicting photorealistic images. In contrast to existing approaches that operate in the 2D image domain, our approach reasons in both 3D and 2D space, thus enabling global illumination effects and manipulation of 3D scene geometry. Experimentally, we find that our model is able to produce photorealistic renderings of static and dynamic scenes. Moreover, it compares favorably to baselines which combine path tracing and image denoising at the same computational budget.
6. Uncertainty-Aware CNNs for Depth Completion: Uncertainty from Beginning to End [PDF]
Abdelrahman Eldesokey, Michael Felsberg, Karl Holmquist, Mikael Persson
Abstract: The focus in deep learning research has been mostly on pushing the limits of prediction accuracy. However, this was often achieved at the cost of increased complexity, raising concerns about the interpretability and the reliability of deep networks. Recently, increasing attention has been given to untangling the complexity of deep networks and quantifying their uncertainty for different computer vision tasks. In contrast, the task of depth completion has not received enough attention despite the inherently noisy nature of depth sensors. In this work, we thus focus on modeling the uncertainty of depth data in depth completion, starting from the sparse noisy input all the way to the final prediction. We propose a novel approach to identify disturbed measurements in the input by learning an input confidence estimator in a self-supervised manner based on normalized convolutional neural networks (NCNNs). Further, we propose a probabilistic version of NCNNs that produces a statistically meaningful uncertainty measure for the final prediction. When we evaluate our approach on the KITTI dataset for depth completion, we outperform all the existing Bayesian deep learning approaches in terms of prediction accuracy, quality of the uncertainty measure, and computational efficiency. Moreover, our small network with 670k parameters performs on par with conventional approaches with millions of parameters. These results give strong evidence that separating the network into parallel uncertainty and prediction streams leads to state-of-the-art performance with accurate uncertainty estimates.
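The building block behind the input confidence estimation is the normalized convolution, which weights each measurement by its confidence and propagates a confidence map alongside the data. Below is a minimal sketch of my reading of that operation, simplified to a single channel and a fixed non-negative kernel; the paper's NCNN layers learn the kernels and feed the propagated confidence into the next layer.

```python
import torch
import torch.nn.functional as F

def normalized_conv(x, conf, weight, eps=1e-8):
    """x, conf: (B, 1, H, W) data and per-pixel confidence in [0, 1];
    weight: (1, 1, k, k) non-negative kernel, k odd."""
    pad = weight.shape[-1] // 2
    num = F.conv2d(x * conf, weight, padding=pad)   # confidence-weighted sum
    den = F.conv2d(conf, weight, padding=pad)       # total applied confidence
    out = num / (den + eps)                         # confidence-weighted mean
    new_conf = den / weight.sum()                   # propagated confidence map
    return out, new_conf

# Example: densify sparse depth where conf marks the valid measurements.
depth = torch.rand(1, 1, 8, 8)
conf = (torch.rand(1, 1, 8, 8) > 0.9).float()       # ~10% valid pixels
kernel = torch.ones(1, 1, 3, 3)
dense, conf_out = normalized_conv(depth, conf, kernel)
```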
7. Explaining Autonomous Driving by Learning End-to-End Visual Attention [PDF]
Luca Cultrera, Lorenzo Seidenari, Federico Becattini, Pietro Pala, Alberto Del Bimbo
Abstract: Current deep learning based autonomous driving approaches yield impressive results, also leading to in-production deployment in certain controlled scenarios. One of the most popular and fascinating approaches relies on learning vehicle controls directly from data perceived by sensors. This end-to-end learning paradigm can be applied both in classical supervised settings and using reinforcement learning. Nonetheless, as in other learning problems, the main drawback of this approach is the lack of explainability. Indeed, a deep network will act as a black box, outputting predictions depending on previously seen driving patterns without giving any feedback on why such decisions were taken. While explainable outputs from a learned agent are not critical for obtaining optimal performance, in such a safety-critical field it is of paramount importance to understand how the network behaves. This is particularly relevant for interpreting failures of such systems. In this work we propose to train an imitation-learning-based agent equipped with an attention model. The attention model allows us to understand what part of the image has been deemed most important. Interestingly, the use of attention also leads to superior performance in a standard benchmark using the CARLA driving simulator.
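A minimal sketch of the attention idea (my own simplified stand-in, not the paper's architecture): the network predicts a softmax map over the spatial cells of its visual features, and the controls are regressed from the attention-weighted features, so the map can be inspected to see which regions drove a decision.

```python
import torch
import torch.nn as nn

class AttentionDriver(nn.Module):
    """Hypothetical end-to-end driver with inspectable spatial attention."""
    def __init__(self, c=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, c, 5, 4, 2), nn.ReLU())
        self.att = nn.Conv2d(c, 1, 1)            # one attention logit per cell
        self.head = nn.Linear(c, 3)              # e.g. steer, throttle, brake

    def forward(self, img):
        f = self.backbone(img)                   # (B, C, H, W)
        a = torch.softmax(self.att(f).flatten(2), dim=-1)   # (B, 1, H*W)
        ctx = (f.flatten(2) * a).sum(-1)         # (B, C) attended features
        return self.head(ctx), a                 # controls + attention map
```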
8. MANTRA: Memory Augmented Networks for Multiple Trajectory Prediction [PDF]
Francesco Marchetti, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo
Abstract: Autonomous vehicles are expected to drive in complex scenarios with several independent, non-cooperating agents. Path planning for safely navigating such environments cannot rely solely on perceiving the present location and motion of other agents. Instead, it requires predicting such variables sufficiently far into the future. In this paper we address the problem of multimodal trajectory prediction by exploiting a Memory Augmented Neural Network. Our method learns past and future trajectory embeddings using recurrent neural networks and exploits an associative external memory to store and retrieve such embeddings. Trajectory prediction is then performed by decoding in-memory future encodings conditioned on the observed past. We incorporate scene knowledge in the decoding state by learning a CNN on top of semantic scene maps. Memory growth is limited by learning a writing controller based on the predictive capability of the existing embeddings. We show that our method is able to natively perform multi-modal trajectory prediction, obtaining state-of-the-art results on three datasets. Moreover, thanks to the non-parametric nature of the memory module, we show how, once trained, our system can continuously improve by ingesting novel patterns.
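A hedged sketch of the associative memory at the core of this scheme (names and the error threshold are hypothetical): reading retrieves the futures of the k most similar stored pasts, each retrieved future yielding one predicted mode when decoded with the observed past, while the writing controller stores only patterns the current memory fails to predict well.

```python
import torch
import torch.nn.functional as F

class TrajectoryMemory:
    """Associative store of (past encoding, future encoding) pairs."""
    def __init__(self):
        self.keys, self.values = [], []

    def read(self, past_enc, k=5):
        """Return up to k candidate future encodings; each one, decoded
        together with the observed past, gives one trajectory mode."""
        if not self.keys:
            return []
        sim = F.cosine_similarity(torch.stack(self.keys), past_enc.unsqueeze(0))
        top = sim.topk(min(k, len(self.keys))).indices
        return [self.values[i] for i in top]

    def write(self, past_enc, future_enc, pred_error, tau=0.5):
        """Writing controller: grow memory only for poorly predicted samples."""
        if pred_error > tau:
            self.keys.append(past_enc)
            self.values.append(future_enc)
```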
9. Multi-modal Feature Fusion with Feature Attention for VATEX Captioning Challenge 2020 [PDF]
Ke Lin, Zhuoxin Gan, Liwei Wang
Abstract: This report describes our model for the VATEX Captioning Challenge 2020. First, to gather information from multiple domains, we extract motion, appearance, semantic, and audio features. Then we design a feature attention module to attend to different features when decoding. We apply two types of decoders, top-down and X-LAN, and ensemble these models to get the final result. The proposed method outperforms the official baseline by a significant margin. We achieve 76.0 CIDEr and 50.0 CIDEr on the English and Chinese private test sets, and rank 2nd on both the English and Chinese private test leaderboards.
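The feature attention module can be sketched roughly as follows; this is a hypothetical minimal version, and the dimensions and four-feature layout are assumptions. At each decoding step the current decoder state scores the four modality features, and their softmax-weighted sum is fed back to the decoder as the fused context.

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Weight motion/appearance/semantic/audio features by decoder state."""
    def __init__(self, d=512, n_feats=4):
        super().__init__()
        self.score = nn.Linear(d, n_feats)

    def forward(self, h, feats):                 # h: (B, D); feats: (B, 4, D)
        w = torch.softmax(self.score(h), dim=-1) # (B, 4) per-feature weights
        return (feats * w.unsqueeze(-1)).sum(1)  # fused (B, D) context
```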
10. Biometric Quality: Review and Application to Face Recognition with FaceQnet [PDF]
Javier Hernandez-Ortega, Javier Galbally, Julian Fierrez, Laurent Beslay
Abstract: "The output of a computerised system can only be as accurate as the information entered into it." This rather trivial statement is the basis behind one of the driving concepts in biometric recognition: biometric quality. Quality is nowadays widely regarded as the number one factor responsible for the good or bad performance of automated biometric systems. It refers to the ability of a biometric sample to be used for recognition purposes and produce consistent, accurate, and reliable results. Such a subjective term is objectively estimated by the so-called biometric quality metrics. These algorithms play nowadays a pivotal role in the correct functioning of systems, providing feedback to the users and working as invaluable audit tools. In spite of their unanimously accepted relevance, some of the most used and deployed biometric characteristics are lacking behind in the development of these methods. This is the case of face recognition. After a gentle introduction to the general topic of biometric quality and a review of past efforts in face quality metrics, in the present work, we address the need for better face quality metrics by developing FaceQnet. FaceQnet is a novel opensource face quality assessment tool, inspired and powered by deep learning technology, which assigns a scalar quality measure to facial images, as prediction of their recognition accuracy. Two versions of FaceQnet have been thoroughly evaluated both in this work and also independently by NIST, showing the soundness of the approach and its competitiveness with respect to current state-of-the-art metrics. Even though our work is presented here particularly in the framework of face biometrics, the proposed methodology for building a fully automated quality metric can be very useful and easily adapted to other artificial intelligence tasks.
11. Real-time Human Activity Recognition Using Conditionally Parametrized Convolutions on Mobile and Wearable Devices [PDF]
Xin Cheng, Lei Zhang, Yin Tang, Yue Liu, Hao Wu, Jun He
Abstract: Recently, deep learning has represented an important research trend in human activity recognition (HAR). In particular, deep convolutional neural networks (CNNs) have achieved state-of-the-art performance on various HAR datasets. For deep learning, improvements in performance have to rely heavily on increasing model size or capacity to scale to larger and larger datasets, which inevitably leads to an increase in operations. A high number of operations in deep learning increases computational cost and is not suitable for real-time HAR using mobile and wearable sensors. Though shallow learning techniques are often lightweight, they cannot achieve good performance. Therefore, deep learning methods that can balance the trade-off between accuracy and computation cost are highly needed, which to our knowledge has seldom been researched. In this paper, we for the first time propose a computation-efficient CNN using conditionally parametrized convolutions for real-time HAR on mobile and wearable devices. We evaluate the proposed method on four public benchmark HAR datasets, consisting of the WISDM, PAMAP2, UNIMIB-SHAR, and OPPORTUNITY datasets, achieving state-of-the-art accuracy without compromising computation cost. Various ablation experiments are performed to show how such a network with large capacity is clearly preferable to the baseline while requiring a similar number of operations. The method can be used as a drop-in replacement for existing deep HAR architectures and easily deployed onto mobile and wearable devices for real-time HAR applications.
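Conditionally parametrized convolutions replace a single static kernel with a per-example mixture of expert kernels, so capacity grows with the number of experts while each input still pays roughly one convolution's worth of compute. A minimal PyTorch sketch of the idea follows; the layer sizes and routing details are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Per-example mixture of `n_experts` kernels, routed from pooled input."""
    def __init__(self, cin, cout, k=3, n_experts=4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(n_experts, cout, cin, k, k) * 0.01)
        self.route = nn.Linear(cin, n_experts)
        self.k = k

    def forward(self, x):                                   # x: (B, Cin, H, W)
        b, cin, h, wdt = x.shape
        r = torch.sigmoid(self.route(x.mean(dim=(2, 3))))   # (B, E) routing
        w = torch.einsum('be,eoihw->boihw', r, self.experts)  # mixed kernels
        w = w.reshape(-1, cin, self.k, self.k)              # (B*Cout, Cin, k, k)
        # Grouped-conv trick: one conv2d call applies each example's own
        # kernel, keeping the cost close to a standard convolution.
        out = F.conv2d(x.reshape(1, -1, h, wdt), w, padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, wdt)
```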
12. TCDesc: Learning Topology Consistent Descriptors [PDF]
Honghu Pan, Fanyang Meng, Zhenyu He, Yongsheng Liang, Wei Liu
Abstract: Triplet loss is widely used for learning local descriptors from image patches. However, triplet loss only minimizes the Euclidean distance between matching descriptors and maximizes that between non-matching descriptors, which neglects the topology similarity between two descriptor sets. In this paper, we propose a topology measure, in addition to Euclidean distance, to learn topology-consistent descriptors by considering the kNN descriptors of the positive sample. First we establish a novel topology vector for each descriptor via Locally Linear Embedding (LLE) to indicate the topological relation between the descriptor and its kNN descriptors. Then we define the topology distance between descriptors as the difference of their topology vectors. Finally, we employ a dynamic weighting strategy to fuse the Euclidean distance and topology distance of matching descriptors, and take the fusion result as the positive-sample distance in the triplet loss. Experimental results on several benchmarks show that our method performs better than state-of-the-art results and effectively improves the performance of triplet loss.
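A sketch of my reading of the topology machinery (simplified; the regularisation constant and the fusion weight `lam` are stand-ins for the paper's dynamic weighting): the topology vector holds the LLE weights that best reconstruct a descriptor from its k nearest neighbours, and the positive-sample distance fuses Euclidean and topology distances.

```python
import torch

def topology_vector(desc, anchors, eps=1e-6):
    """LLE-style weights reconstructing `desc` (D,) from its kNN `anchors` (k, D)."""
    Z = anchors - desc                      # centred neighbours, (k, D)
    G = Z @ Z.t()                           # local Gram matrix, (k, k)
    G = G + eps * G.trace() * torch.eye(len(anchors))   # regularise
    w = torch.linalg.solve(G, torch.ones(len(anchors)))
    return w / w.sum()                      # reconstruction weights sum to one

def fused_distance(d_euc, t1, t2, lam=0.5):
    """Fuse Euclidean and topology distances for the triplet loss."""
    return lam * d_euc + (1 - lam) * torch.norm(t1 - t2)
```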
13. FP-Stereo: Hardware-Efficient Stereo Vision for Embedded Applications [PDF]
Jieru Zhao, Tingyuan Liang, Liang Feng, Wenchao Ding, Sharad Sinha, Wei Zhang, Shaojie Shen
Abstract: Fast and accurate depth estimation, or stereo matching, is essential in embedded stereo vision systems, requiring substantial design effort to achieve an appropriate balance among accuracy, speed and hardware cost. To reduce the design effort and achieve the right balance, we propose FP-Stereo for building high-performance stereo matching pipelines on FPGAs automatically. FP-Stereo consists of an open-source hardware-efficient library, allowing designers to obtain the desired implementation instantly. Diverse methods are supported in our library for each stage of the stereo matching pipeline and a series of techniques are developed to exploit the parallelism and reduce the resource overhead. To improve the usability, FP-Stereo can generate synthesizable C code of the FPGA accelerator with our optimized HLS templates automatically. To guide users for the right design choice meeting specific application requirements, detailed comparisons are performed on various configurations of our library to investigate the accuracy/speed/cost trade-off. Experimental results also show that FP-Stereo outperforms the state-of-the-art FPGA design from all aspects, including 6.08% lower error, 2x faster speed, 30% less resource usage and 40% less energy consumption. Compared to GPU designs, FP-Stereo achieves the same accuracy at a competitive speed while consuming much less energy.
14. Content and Context Features for Scene Image Representation [PDF]
Chiranjibi Sitaula, Sunil Aryal, Yong Xiang, Anish Basnet, Xuequan Lu
Abstract: Existing research in scene image classification has focused on either content features (e.g., visual information) or context features (e.g., annotations). As they capture different information about images, which can be complementary and useful for discriminating images of different classes, we suppose that their fusion will improve classification results. In this paper, we propose new techniques to compute content features and context features, and then fuse them together. For content features, we design multi-scale deep features based on background and foreground information in images. For context features, we use annotations of similar images available on the web to design filter words (a codebook). Our experiments on three widely used benchmark scene datasets using a support vector machine classifier reveal that our proposed context and content features produce better results than existing context and content features, respectively. The fusion of the two proposed types of features significantly outperforms numerous state-of-the-art features.
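The fusion step itself is straightforward; here is a hedged sketch with stand-in arrays (the deep multi-scale content features and the codebook-histogram context features are random placeholders, and the exact normalisation is an assumption): normalise each view so neither dominates, concatenate, and feed an SVM.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import normalize

def fuse(content_feats, context_feats):
    """L2-normalise each view, then concatenate."""
    return np.hstack([normalize(content_feats), normalize(context_feats)])

# Placeholder arrays: deep multi-scale features and codebook histograms
# of filtered annotation words for the same 100 images.
X_content = np.random.rand(100, 512)
X_context = np.random.rand(100, 200)
y = np.random.randint(0, 10, size=100)
clf = SVC(kernel='linear').fit(fuse(X_content, X_context), y)
```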
15. Content-Aware Inter-Scale Cost Aggregation for Stereo Matching [PDF]
Chengtang Yao, Yunde Jia, Huijun Di, Yuwei Wu, Lidong Yu
Abstract: Cost aggregation is a key component of stereo matching for high-quality depth estimation. Most methods use multi-scale processing to downsample the cost volume for proper context information, but lose details when upsampling. In this paper, we present a content-aware inter-scale cost aggregation method that adaptively aggregates and upsamples the cost volume from coarse scale to fine scale by learning dynamic filter weights according to the content of the left and right views at the two scales. Our method achieves reliable detail recovery when upsampling through the aggregation of information across different scales. Furthermore, a novel decomposition strategy is proposed to efficiently construct the 3D filter weights and aggregate the 3D cost volume, which greatly reduces the computation cost. We first learn the 2D similarities via the feature maps at the two scales, and then build the 3D filter weights based on the 2D similarities from the left and right views. After that, we split the aggregation over the full 3D spatial-disparity space into an aggregation in 1D disparity space and one in 2D spatial space. Experiment results on the Scene Flow dataset, KITTI2015 and Middlebury demonstrate the effectiveness of our method.
16. Black-box Explanation of Object Detectors via Saliency Maps [PDF]
Vitali Petsiuk, Rajiv Jain, Varun Manjunatha, Vlad I. Morariu, Ashutosh Mehra, Vicente Ordonez, Kate Saenko
Abstract: We propose D-RISE, a method for generating visual explanations for the predictions of object detectors. D-RISE can be considered "black-box" in the software testing sense, as it only needs access to the inputs and outputs of an object detector. Compared to gradient-based methods, D-RISE is more general and agnostic to the particular type of object detector being tested, as it does not need to know about the inner workings of the model. We show that D-RISE can be easily applied to different object detectors, including one-stage detectors such as YOLOv3 and two-stage detectors such as Faster-RCNN. We present a detailed analysis of the generated visual explanations to highlight the utilization of context and possible biases learned by object detectors.
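Since D-RISE only needs detector inputs and outputs, the idea can be sketched compactly. Below is a simplified, hypothetical version (the actual method smooths and shifts the masks and uses a similarity that combines localization and classification terms; `detector` is any callable returning boxes with per-class probabilities): mask the image randomly, score how well the target detection survives, and average the masks weighted by that score.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union > 0 else 0.0

def d_rise(image, detector, target_box, target_cls, n_masks=1000, p=0.5, s=8):
    """image: (H, W, 3); detector(img) -> (boxes, class_probs)."""
    H, W = image.shape[:2]
    saliency = np.zeros((H, W))
    for _ in range(n_masks):
        # random low-resolution occlusion grid, upsampled to image size
        grid = (np.random.rand(s, s) < p).astype(float)
        mask = np.kron(grid, np.ones((H // s + 1, W // s + 1)))[:H, :W]
        boxes, probs = detector(image * mask[..., None])
        score = max((iou(target_box, b) * pr[target_cls]
                     for b, pr in zip(boxes, probs)), default=0.0)
        saliency += score * mask        # masks that preserve the detection count
    return saliency / n_masks
```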
17. Egocentric Object Manipulation Graphs [PDF]
Eadom Dessalene, Michael Maynord, Chinmaya Devaraj, Cornelia Fermuller, Yiannis Aloimonos
Abstract: We introduce Egocentric Object Manipulation Graphs (Ego-OMG) - a novel representation for activity modeling and anticipation of near-future actions integrating three components: 1) semantic temporal structure of activities, 2) short-term dynamics, and 3) representations for appearance. Semantic temporal structure is modeled through a graph, embedded through a Graph Convolutional Network, whose states model characteristics of, and relations between, hands and objects. These state representations derive from all three levels of abstraction, and span segments delimited by the making and breaking of hand-object contact. Short-term dynamics are modeled in two ways: A) through 3D convolutions, and B) through anticipating the spatiotemporal end points of hand trajectories, where hands come into contact with objects. Appearance is modeled through deep spatiotemporal features produced by existing methods. We note that in Ego-OMG it is simple to swap these appearance features, and thus Ego-OMG is complementary to most existing action anticipation methods. We evaluate Ego-OMG on the EPIC Kitchens Action Anticipation Challenge. The consistency of the egocentric perspective of EPIC Kitchens allows for the utilization of the hand-centric cues upon which Ego-OMG relies. We demonstrate state-of-the-art performance, outranking all previously published methods by large margins, ranking first on the unseen test set and second on the seen test set of the EPIC Kitchens Action Anticipation Challenge. We attribute the success of Ego-OMG to the modeling of semantic structure captured over long timespans. We evaluate the design choices made through several ablation studies. Code will be released upon acceptance.
18. Scene Image Representation by Foreground, Background and Hybrid Features [PDF] 返回目录
Chiranjibi Sitaula, Yong Xiang, Sunil Aryal, Xuequan Lu
Abstract: Previous methods for representing scene images based on deep learning primarily consider either the foreground or background information as the discriminating clues for the classification task. However, scene images also require additional information (hybrid) to cope with the inter-class similarity and intra-class variation problems. In this paper, we propose to use hybrid features in addition to foreground and background features to represent scene images. We suppose that these three types of information could jointly help to represent scene images more accurately. To this end, we adopt three VGG-16 architectures pre-trained on ImageNet, Places, and Hybrid (both ImageNet and Places) datasets for the corresponding extraction of foreground, background and hybrid information. All three types of deep features are further aggregated to form our final features for the representation of scene images. Extensive experiments on two large benchmark scene datasets (MIT-67 and SUN-397) show that our method produces state-of-the-art classification performance.
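A rough sketch of the three-stream feature extraction is given below. It assumes torchvision's ImageNet VGG-16 weights are available; the Places and Hybrid checkpoints are external resources and are left as placeholders here.

```python
# Sketch under assumptions: torchvision ships ImageNet VGG-16 weights, while
# Places- and Hybrid-trained VGG-16 weights would need external checkpoints.
import torch
import torchvision.models as models

def vgg16_features(weights=None):
    net = models.vgg16(weights=weights)
    net.classifier = net.classifier[:-1]  # keep the 4096-d fc7 output
    return net.eval()

fg_net = vgg16_features(models.VGG16_Weights.IMAGENET1K_V1)  # foreground (object-centric)
bg_net = vgg16_features()  # background: load Places weights from a checkpoint here
hy_net = vgg16_features()  # hybrid: load ImageNet+Places weights here

x = torch.randn(1, 3, 224, 224)  # a preprocessed scene image
with torch.no_grad():
    feats = torch.cat([fg_net(x), bg_net(x), hy_net(x)], dim=1)  # 3 x 4096 = 12288-d
```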
19. Pick-Object-Attack: Type-Specific Adversarial Attack for Object Detection [PDF] 返回目录
Omid Mohamad Nezami, Akshay Chaturvedi, Mark Dras, Utpal Garain
Abstract: Many recent studies have shown that deep neural models are vulnerable to adversarial samples: images with imperceptible perturbations, for example, can fool image classifiers. In this paper, we generate adversarial examples for object detection, which entails detecting bounding boxes around multiple objects present in the image and classifying them at the same time, making it a harder task than against image classification. We specifically aim to attack the widely used Faster R-CNN by changing the predicted label for a particular object in an image: where prior work has targeted one specific object (a stop sign), we generalise to arbitrary objects, with the key challenge being the need to change the labels of all bounding boxes for all instances of that object type. To do so, we propose a novel method, named Pick-Object-Attack. Pick-Object-Attack successfully adds perturbations only to bounding boxes for the targeted object, preserving the labels of other detected objects in the image. In terms of perceptibility, the perturbations induced by the method are very small. Furthermore, for the first time, we examine the effect of adversarial attacks on object detection in terms of a downstream task, image captioning; we show that where a method that can modify all object types leads to very obvious changes in captions, the changes from our constrained attack are much less apparent.
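Stripped to essentials, a box-restricted targeted attack might look like the following: gradient steps on a perturbation masked to the target object's box, pushed toward an adversarial label through Faster R-CNN's classification loss. This is a much-simplified sketch, not the paper's Pick-Object-Attack procedure; the class id, step size, and iteration budget are arbitrary.

```python
# Simplified sketch of a box-restricted targeted attack on Faster R-CNN.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.train()  # training mode exposes the loss dict

img = torch.rand(3, 480, 640)
box = torch.tensor([[100., 120., 300., 360.]])  # target object's box (x1, y1, x2, y2)
adv_label = torch.tensor([17])                  # desired (wrong) class id, arbitrary
targets = [{"boxes": box, "labels": adv_label}]

mask = torch.zeros_like(img)
mask[:, 120:360, 100:300] = 1.0                 # perturb only inside the box
delta = torch.zeros_like(img, requires_grad=True)

for _ in range(10):
    losses = model([(img + delta * mask).clamp(0, 1)], targets)
    losses["loss_classifier"].backward()
    with torch.no_grad():
        delta -= 0.01 * delta.grad.sign()       # descend: make adv_label more likely
        delta.clamp_(-0.05, 0.05)               # keep the perturbation small
        delta.grad.zero_()
```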
20. MSDU-net: A Multi-Scale Dilated U-net for Blur Detection [PDF] 返回目录
Fan Yang, Xiao Xiao
Abstract: Blur detection is the separation of blurred and clear regions of an image, which is an important and challenging task in computer vision. In this work, we regard blur detection as an image segmentation problem. Inspired by the success of the U-net architecture for image segmentation, we design a Multi-Scale Dilated convolutional neural network based on U-net, which we call MSDU-net. The MSDU-net uses a group of multi-scale feature extractors with dilated convolutions to extract texture information at different scales. The U-shape architecture of the MSDU-net fuses the different-scale texture features and generates a semantic feature which allows us to achieve better results on the blur detection task. We show that using the MSDU-net we are able to outperform other state of the art blur detection methods on two publicly available benchmarks.
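The multi-scale dilated idea reduces to parallel 3x3 convolutions with different dilation rates sharing one input and being concatenated, as in the sketch below. Channel counts and rates are illustrative, not MSDU-net's actual configuration.

```python
# Minimal sketch of a multi-scale dilated feature extractor.
import torch
import torch.nn as nn

class MultiScaleDilated(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])

    def forward(self, x):
        # Same spatial size on every branch; receptive field grows with the rate.
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

feats = MultiScaleDilated(3, 16)(torch.randn(1, 3, 128, 128))  # -> (1, 64, 128, 128)
```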
21. Unsupervised clustering of Roman pottery profiles from their SSAE representation [PDF] 返回目录
Simone Parisotto, Alessandro Launaro, Ninetta Leone, Carola-Bibiane Schönlieb
Abstract: In this paper we introduce the ROman COmmonware POTtery (ROCOPOT) database, which comprises more than 2000 black-and-white imaging profiles of pottery shapes extracted from 11 Roman catalogues and related to different excavation sites. The partiality and the handcrafted variance of the shape fragments within this new database make their unsupervised clustering a very challenging problem: profile similarities are thus explored via the hierarchical clustering of non-linear features learned in the latent representation space of a stacked sparse autoencoder (SSAE) network, unveiling new profile matches. Results are discussed from both a mathematical and an archaeological perspective so as to unlock new research directions in the respective communities.
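The clustering stage alone is straightforward to sketch: hierarchical (agglomerative) clustering on latent codes. A trained stacked sparse autoencoder is assumed; the latents below are random stand-ins for encoder outputs of pottery profiles, and the cluster count is a guess.

```python
# Sketch of the clustering stage only; the SSAE encoder itself is assumed.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

latents = np.random.rand(200, 64)   # 200 profiles, 64-d SSAE codes (placeholder)
clus = AgglomerativeClustering(n_clusters=12, linkage="ward")
labels = clus.fit_predict(latents)  # cluster id per pottery profile
```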
22. SIDU: Similarity Difference and Uniqueness Method for Explainable AI [PDF] 返回目录
Satya M. Muddamsetty, Mohammad N. S. Jahromi, Thomas B. Moeslund
Abstract: A new brand of technical artificial intelligence (Explainable AI) research has focused on trying to open up the 'black box' and provide some explainability. This paper presents a novel visual explanation method for deep learning networks in the form of a saliency map that can effectively localize entire object regions. In contrast to the current state-of-the-art methods, the proposed method shows quite promising visual explanations that can gain greater trust from human experts. Both quantitative and qualitative evaluations are carried out on both general and clinical data sets to confirm the effectiveness of the proposed method.
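For intuition about saliency maps in general, the sketch below uses plain occlusion: measure how the class score drops when each patch of the image is masked. This is a generic baseline, not SIDU's similarity-difference-and-uniqueness scoring, and it assumes a classifier returning logits and dimensions divisible into patches.

```python
# Generic occlusion-based saliency, for intuition only (not the SIDU method).
import torch

def occlusion_saliency(model, img, cls, patch=32):
    model.eval()
    base = model(img.unsqueeze(0)).softmax(1)[0, cls]
    _, H, W = img.shape
    sal = torch.zeros((H + patch - 1) // patch, (W + patch - 1) // patch)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            masked = img.clone()
            masked[:, i:i+patch, j:j+patch] = 0
            score = model(masked.unsqueeze(0)).softmax(1)[0, cls]
            sal[i // patch, j // patch] = base - score  # big drop = important region
    return sal
```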
23. Sponge Examples: Energy-Latency Attacks on Neural Networks [PDF] 返回目录
Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, Ross Anderson
Abstract: The high energy costs of neural network training and inference led to the use of acceleration hardware such as GPUs and TPUs. While this enabled us to train large-scale neural networks in datacenters and deploy them on edge devices, the focus so far is on average-case performance. In this work, we introduce a novel threat vector against neural networks whose energy consumption or decision latency are critical. We show how adversaries can exploit carefully crafted $\boldsymbol{sponge}~\boldsymbol{examples}$, which are inputs designed to maximise energy consumption and latency. We mount two variants of this attack on established vision and language models, increasing energy consumption by a factor of 10 to 200. Our attacks can also be used to delay decisions where a network has critical real-time performance, such as in perception for autonomous vehicles. We demonstrate the portability of our malicious inputs across CPUs and a variety of hardware accelerator chips including GPUs, and an ASIC simulator. We conclude by proposing a defense strategy which mitigates our attack by shifting the analysis of energy consumption in hardware from an average-case to a worst-case perspective.
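The white-box flavor of the idea can be sketched as gradient ascent on an input to maximize a proxy for compute cost. Here the proxy is the total magnitude of ReLU activations (denser activations defeat sparsity-based energy savings); this is an illustrative simplification, since the actual attacks also target latency and use genetic search against black-box models.

```python
# Sketch: optimize an input to maximize an activation-density energy proxy.
import torch
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT").eval()
acts = []
for m in model.modules():
    if isinstance(m, torch.nn.ReLU):
        m.register_forward_hook(lambda mod, inp, out: acts.append(out.abs().sum()))

x = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.01)
for _ in range(50):
    acts.clear()
    model(x)
    energy = torch.stack(acts).sum()  # proxy for energy/latency cost
    (-energy).backward()              # ascend on the proxy
    opt.step()
    opt.zero_grad()
    x.data.clamp_(0, 1)               # keep the input a valid image
```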
24. Artificial Intelligence-based Clinical Decision Support for COVID-19 -- Where Art Thou? [PDF] 返回目录
Mathias Unberath, Kimia Ghobadi, Scott Levin, Jeremiah Hinson, Gregory D Hager
Abstract: The COVID-19 crisis has brought about new clinical questions, new workflows, and accelerated distributed healthcare needs. While artificial intelligence (AI)-based clinical decision support seemed to have matured, the application of AI-based tools for COVID-19 has been limited to date. In this perspective piece, we identify opportunities and requirements for AI-based clinical decision support systems and highlight challenges that impact "AI readiness" for rapidly emergent healthcare challenges.
25. Acoustic Anomaly Detection for Machine Sounds based on Image Transfer Learning [PDF] 返回目录
Robert Müller, Fabian Ritz, Steffen Illium, Claudia Linnhoff-Popien
Abstract: In industrial applications, the early detection of malfunctioning factory machinery is crucial. In this paper, we consider acoustic malfunction detection via transfer learning. Contrary to the majority of current approaches which are based on deep autoencoders, we propose to extract features using neural networks that were pretrained on the task of image classification. We then use these features to train a variety of anomaly detection models and show that this improves results compared to convolutional autoencoders in recordings of four different factory machines in noisy environments. Moreover, we find that features extracted from ResNet based networks yield better results than those from AlexNet and Squeezenet. In our setting, Gaussian Mixture Models and One-Class Support Vector Machines achieve the best anomaly detection performance.
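Combining the two ingredients the abstract names, a minimal pipeline might look like the following: ImageNet-pretrained ResNet features extracted from spectrogram images, then a Gaussian Mixture Model fit on normal machine sounds, with low likelihood flagging an anomaly. The spectrogram tensors and component count are placeholders.

```python
# Sketch: pretrained image features + GMM for acoustic anomaly detection.
import torch
import torchvision
from sklearn.mixture import GaussianMixture

backbone = torchvision.models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()           # 512-d embeddings
backbone.eval()

normal_specs = torch.rand(64, 3, 224, 224)  # spectrograms of healthy machines (placeholder)
with torch.no_grad():
    emb = backbone(normal_specs).numpy()

gmm = GaussianMixture(n_components=4).fit(emb)
scores = gmm.score_samples(emb)             # anomaly if below a validation threshold
```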
26. Detection of prostate cancer in whole-slide images through end-to-end training with image-level labels [PDF] 返回目录
Hans Pinckaers, Wouter Bulten, Jeroen van der Laak, Geert Litjens
Abstract: Prostate cancer is the most prevalent cancer among men in Western countries, with 1.1 million new diagnoses every year. The gold standard for the diagnosis of prostate cancer is a pathologist's evaluation of prostate tissue. To potentially assist pathologists, deep-learning-based cancer detection systems have been developed. Many of the state-of-the-art models are patch-based convolutional neural networks, as the use of entire scanned slides is hampered by memory limitations on accelerator cards. Patch-based systems typically require detailed, pixel-level annotations for effective training. However, such annotations are seldom readily available, in contrast to the clinical reports of pathologists, which contain slide-level labels. As such, developing algorithms which do not require manual pixel-wise annotations but can learn using only the clinical report would be a significant advancement for the field. In this paper, we propose to use a streaming implementation of convolutional layers to train a modern CNN (ResNet-34) with 21 million parameters end-to-end on 4712 prostate biopsies. The method enables the use of entire biopsy images at high resolution directly, by reducing the GPU memory requirements by 2.4 TB. We show that modern CNNs, trained using our streaming approach, can extract meaningful features from high-resolution images without additional heuristics, reaching similar performance as state-of-the-art patch-based and multiple-instance learning methods. By circumventing the need for manual annotations, this approach can function as a blueprint for other tasks in histopathological diagnosis. The source code to reproduce the streaming models is available at this https URL.
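The authors' streaming implementation lives in their repository; a related, simpler trick for fitting larger inputs on one GPU is gradient checkpointing, which trades recomputation for activation memory. The sketch below shows that substitute technique only, with an arbitrary crop size.

```python
# Not the paper's streaming convolutions: a gradient-checkpointing sketch
# that recomputes activations on the backward pass to save GPU memory.
import torch
from torch.utils.checkpoint import checkpoint_sequential
import torchvision

model = torchvision.models.resnet34(num_classes=2)
trunk = torch.nn.Sequential(*list(model.children())[:-2])  # convolutional trunk

x = torch.randn(1, 3, 2048, 2048, requires_grad=True)      # a large biopsy crop
feats = checkpoint_sequential(trunk, 4, x)                 # 4 checkpointed segments
logits = model.fc(torch.flatten(model.avgpool(feats), 1))
```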
27. Structurally aware bidirectional unpaired image to image translation between CT and MR [PDF] 返回目录
Vismay Agrawal, Avinash Kori, Vikas Kumar Anand, Ganapathy Krishnamurthi
Abstract: Magnetic Resonance (MR) Imaging and Computed Tomography (CT) are the primary diagnostic imaging modalities quite frequently used for surgical planning and analysis. A general problem with medical imaging is that the acquisition process is quite expensive and time-consuming. Deep learning techniques like generative adversarial networks (GANs) can help us to leverage the possibility of image-to-image translation between multiple imaging modalities, which in turn helps in saving time and cost. These techniques will help to conduct surgical planning under CT with the feedback of MRI information. While previous studies have shown paired and unpaired image synthesis from MR to CT, image synthesis from CT to MR still remains a challenge, since it involves the addition of extra tissue information. In this manuscript, we have implemented two different variations of Generative Adversarial Networks exploiting the cycle consistency and structural similarity between CT and MR image modalities on a pelvis dataset, thus facilitating a bidirectional exchange of content and style between these image modalities. The proposed GANs translate the input medical images by different mechanisms, and hence the generated images not only appear realistic but also perform well across various comparison metrics; these images have also been cross-verified with a radiologist. The radiologist's verification has shown that the generated MR and CT images may vary slightly from their true counterparts, but they can still be used for medical purposes.
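The cycle-consistency idea fits in a few lines: translate CT to MR and back, penalize the round trip, and do the same in the other direction. In the sketch below, G_ct2mr and G_mr2ct stand for any generator networks; the adversarial and structural-similarity terms the paper also uses are omitted.

```python
# Cycle-consistency loss sketch; generators are passed in as parameters.
import torch.nn.functional as F

def cycle_loss(G_ct2mr, G_mr2ct, ct, mr):
    ct_rec = G_mr2ct(G_ct2mr(ct))  # CT -> fake MR -> reconstructed CT
    mr_rec = G_ct2mr(G_mr2ct(mr))  # MR -> fake CT -> reconstructed MR
    return F.l1_loss(ct_rec, ct) + F.l1_loss(mr_rec, mr)
```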
28. Learning to Rank Learning Curves [PDF] 返回目录
Martin Wistuba, Tejaswini Pedapati
Abstract: Many automated machine learning methods, such as those for hyperparameter and neural architecture optimization, are computationally expensive because they involve training many different model configurations. In this work, we present a new method that saves computational budget by terminating poor configurations early on in the training. In contrast to existing methods, we consider this task as a ranking and transfer learning problem. We qualitatively show that by optimizing a pairwise ranking loss and leveraging learning curves from other datasets, our model is able to effectively rank learning curves without having to observe many or very long learning curves. We further demonstrate that our method can be used to accelerate a neural architecture search by a factor of up to 100 without a significant performance degradation of the discovered architecture. In further experiments we analyze the quality of ranking, the influence of different model components as well as the predictive behavior of the model.
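The pairwise ranking formulation can be sketched as an encoder that scores partial learning curves, trained with a margin ranking loss that pushes the curve with the better final accuracy to score higher. The LSTM encoder and the synthetic curves below are illustrative choices, not the paper's model.

```python
# Sketch of a pairwise learning-curve ranker under illustrative assumptions.
import torch
import torch.nn as nn

class CurveScorer(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, curves):               # (B, T, 1) partial accuracy curves
        _, (h, _) = self.rnn(curves)
        return self.head(h[-1]).squeeze(-1)  # one score per configuration

scorer = CurveScorer()
better = torch.rand(8, 10, 1).sort(dim=1).values  # curves known to end higher
worse = better - 0.1
loss = nn.MarginRankingLoss(margin=0.1)(
    scorer(better), scorer(worse), torch.ones(8))  # target 1: first input ranks higher
```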
29. Convolutional Neural Networks for Global Human Settlements Mapping from Sentinel-2 Satellite Imagery [PDF] 返回目录
Christina Corbane, Vasileios Syrris, Filip Sabo, Panagiotis Politis, Michele Melchiorri, Martino Pesaresi, Pierre Soille, Thomas Kemper
Abstract: Spatially consistent and up-to-date maps of human settlements are crucial for addressing policies related to urbanization and sustainability, especially in the era of an increasingly urbanized world. The availability of open and free Sentinel-2 data of the Copernicus Earth Observation programme offers a new opportunity for wall-to-wall mapping of human settlements at a global scale. This paper presents a deep-learning-based framework for the fully automated extraction of built-up areas at a spatial resolution of 10 meters from a global composite of Sentinel-2 imagery. A multi-neuro modelling methodology is developed, building on a simple convolutional neural network architecture for pixel-wise image classification of built-up areas. The deployment of the model on the global Sentinel-2 image composite provides the most detailed and complete map reporting on built-up areas for reference year 2018. The validation of the results with an independent reference dataset of building footprints covering 277 sites across the world establishes the reliability of the built-up layer produced by the proposed framework and the robustness of the model. The results of this study contribute to cutting-edge research in the field of automated built-up area mapping from remote sensing data and establish a new reference layer for the analysis of the spatial distribution of human settlements across the rural-urban continuum.
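As a toy illustration of pixel-wise image classification for built-up mapping, a fully convolutional network can emit a per-pixel class map directly. The band count, depth, and widths below are guesses for illustration; the paper's architecture details differ.

```python
# Toy fully convolutional per-pixel classifier on Sentinel-2-like input.
import torch
import torch.nn as nn

pixel_cls = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),  # e.g. RGB + NIR bands at 10 m
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 1),                        # built-up vs. not, per pixel
)
tile = torch.randn(1, 4, 256, 256)              # one image tile
probs = pixel_cls(tile).softmax(dim=1)          # (1, 2, 256, 256) probabilities
```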
30. Discovering Parametric Activation Functions [PDF] 返回目录
Garrett Bingham, Risto Miikkulainen
Abstract: Recent studies have shown that the choice of activation function can significantly affect the performance of deep learning networks. However, the benefits of novel activation functions have been inconsistent and task-dependent, and therefore the rectified linear unit (ReLU) is still the most commonly used. This paper proposes a technique for customizing activation functions automatically, resulting in reliable improvements in performance. Evolutionary search is used to discover the general form of the function, and gradient descent to optimize its parameters for different parts of the network and over the learning process. Experiments with three different neural network architectures on the CIFAR-100 image classification dataset show that this approach is effective. It discovers different activation functions for different architectures, and consistently improves accuracy over ReLU and other recently proposed activation functions by significant margins. The approach can therefore be used as an automated optimization step in applying deep learning to new tasks.
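To illustrate what "parametric" means here, the sketch below shows an activation with a learnable parameter trained by backpropagation: Swish with a learnable beta, x * sigmoid(beta * x). The functions the paper actually discovers come from evolutionary search and look different; this is only the gradient-descent half of the recipe.

```python
# Illustrative parametric activation: Swish with a learnable beta.
import torch
import torch.nn as nn

class LearnableSwish(nn.Module):
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(1))  # optimized jointly with the weights

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

layer = nn.Sequential(nn.Linear(8, 8), LearnableSwish())
out = layer(torch.randn(4, 8))  # beta receives gradients via backprop
```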