Contents
1. A Committee of Convolutional Neural Networks for Image Classification in the Concurrent Presence of Feature and Label Noise [PDF] Abstract
2. DyNet: Dynamic Convolution for Accelerating Convolutional Neural Networks [PDF] Abstract
3. Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction [PDF] Abstract
4. Unpaired Photo-to-manga Translation Based on The Methodology of Manga Drawing [PDF] Abstract
5. Self-Supervised Representation Learning on Document Images [PDF] Abstract
6. Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions [PDF] Abstract
7. Real-time Simultaneous 3D Head Modeling and Facial Motion Capture with an RGB-D camera [PDF] Abstract
8. Multi-Domain Learning and Identity Mining for Vehicle Re-Identification [PDF] Abstract
9. TetraTSDF: 3D human reconstruction from a single image with a tetrahedral outer shell [PDF] Abstract
10. Distributed Learning and Inference with Compressed Images [PDF] Abstract
11. Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution [PDF] Abstract
12. Graph Convolutional Subspace Clustering: A Robust Subspace Clustering Framework for Hyperspectral Image [PDF] Abstract
13. Warwick Image Forensics Dataset for Device Fingerprinting In Multimedia Forensics [PDF] Abstract
14. DeepFake Detection by Analyzing Convolutional Traces [PDF] Abstract
15. Recursive Social Behavior Graph for Trajectory Prediction [PDF] Abstract
16. Image Processing Failure and Deep Learning Success in Lawn Measurement [PDF] Abstract
17. Graph-based Kinship Reasoning Network [PDF] Abstract
18. Automatic exposure selection and fusion for high-dynamic-range photography via smartphones [PDF] Abstract
19. Yoga-82: A New Dataset for Fine-grained Classification of Human Poses [PDF] Abstract
20. Textual Visual Semantic Dataset for Text Spotting [PDF] Abstract
Abstracts
1. A Committee of Convolutional Neural Networks for Image Classification in the Concurrent Presence of Feature and Label Noise [PDF] Back to Contents
Stanisław Kaźmierczak, Jacek Mańdziuk
Abstract: Image classification has become a ubiquitous task. Models trained on good-quality data achieve accuracy which in some application domains is already above human-level performance. Unfortunately, real-world data are quite often degraded by noise in features and/or labels. Quite a few papers handle the problem of either feature or label noise separately; however, to the best of our knowledge, this piece of research is the first attempt to address the concurrent occurrence of both types of noise. Based on the MNIST, CIFAR-10 and CIFAR-100 datasets, we experimentally show that the margin by which committees beat single models increases with the noise level, whether the disruption affects attributes or labels. This makes ensembles a legitimate choice for noisy images with noisy labels. The committees' advantage over single models is positively correlated with dataset difficulty as well. We propose three committee selection algorithms that outperform a strong baseline algorithm relying on an ensemble of the individual (non-associated) best models.
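The committee mechanics can be illustrated with a short sketch: plain majority voting plus a generic greedy forward-selection baseline. This is not one of the paper's three selection algorithms, whose details the abstract does not give; all names and shapes are illustrative.

```python
import numpy as np

def majority_vote(preds, n_classes):
    """preds: (n_models, n_samples) integer class predictions."""
    one_hot = np.eye(n_classes, dtype=int)[preds]   # (n_models, n_samples, n_classes)
    return one_hot.sum(axis=0).argmax(axis=1)

def greedy_committee(preds_val, y_val, n_classes, size):
    """Grow a committee greedily by validation accuracy of the joint vote."""
    chosen = []
    for _ in range(size):
        scored = [((majority_vote(preds_val[chosen + [m]], n_classes) == y_val).mean(), m)
                  for m in range(len(preds_val)) if m not in chosen]
        chosen.append(max(scored)[1])
    return chosen
```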
2. DyNet: Dynamic Convolution for Accelerating Convolutional Neural Networks [PDF] Back to Contents
Yikang Zhang, Jian Zhang, Qiang Wang, Zhao Zhong
Abstract: The convolution operator is the core of convolutional neural networks (CNNs) and accounts for most of their computation cost. To make CNNs more efficient, many methods have been proposed to either design lightweight networks or compress models. Although some efficient network structures have been proposed, such as MobileNet or ShuffleNet, we find that redundant information still exists between convolution kernels. To address this issue, we propose a novel dynamic convolution method that adaptively generates convolution kernels based on image content. To demonstrate its effectiveness, we apply dynamic convolution to multiple state-of-the-art CNNs. On one hand, we can reduce the computation cost remarkably while maintaining performance: for ShuffleNetV2/MobileNetV2/ResNet18/ResNet50, DyNet reduces 37.0/54.7/67.2/71.3% of FLOPs without loss of accuracy. On the other hand, performance can be largely boosted if the computation cost is held constant: based on the MobileNetV3-Small/Large architectures, DyNet achieves 70.3/77.1% Top-1 accuracy on ImageNet, an improvement of 2.9/1.9%. To verify scalability, we also apply DyNet to a segmentation task; the results show that DyNet can reduce 69.3% of FLOPs while maintaining mean IoU.
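The core idea, generating an input-dependent kernel as a learned mixture of a few template kernels, can be sketched in a few lines of PyTorch. This is a hedged illustration: the coefficient network, kernel count, and layer sizes are assumptions, not DyNet's published design.

```python
import torch
import torch.nn as nn

class DynamicConv2d(nn.Module):
    """Content-adaptive convolution: each image gets its own kernel, formed as a
    softmax-weighted mixture of `num_kernels` learned template kernels."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        self.coef = nn.Sequential(              # predicts per-image mixing coefficients
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_kernels), nn.Softmax(dim=1))
        self.pad = k // 2

    def forward(self, x):
        b = x.size(0)
        alpha = self.coef(x)                                     # (B, K)
        w = torch.einsum("bk,koihw->boihw", alpha, self.weight)  # per-sample kernels
        # grouped-conv trick: fold the batch into channels to apply per-sample kernels
        out = nn.functional.conv2d(
            x.reshape(1, -1, *x.shape[2:]),
            w.reshape(-1, *w.shape[2:]),
            padding=self.pad, groups=b)
        return out.reshape(b, -1, *out.shape[2:])

layer = DynamicConv2d(16, 32)
y = layer(torch.randn(2, 16, 8, 8))   # -> torch.Size([2, 32, 8, 8])
```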
3. Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction [PDF] Back to Contents
Lokender Tiwari, Pan Ji, Quoc-Huy Tran, Bingbing Zhuang, Saket Anand, Manmohan Chandraker
Abstract: Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that coupling these two by leveraging the strengths of each mitigates the other's shortcomings. Specifically, we propose a joint narrow- and wide-baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform pseudo RGB-D feature-based SLAM, leading to better accuracy and robustness than the monocular RGB SLAM baseline. On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide-baseline losses proposed for improving the depth prediction network, which then continues to contribute towards better pose and 3D structure estimation in the next iteration. We emphasize that our framework only requires unlabeled monocular videos in both training and inference stages, and yet is able to outperform state-of-the-art self-supervised monocular and stereo depth prediction networks (e.g., Monodepth2) and feature-based monocular SLAM systems (i.e., ORB-SLAM). Extensive experiments on the KITTI and TUM RGB-D datasets verify the superiority of our self-improving geometry-CNN framework.
4. Unpaired Photo-to-manga Translation Based on The Methodology of Manga Drawing [PDF] Back to Contents
Hao Su, Jianwei Niu, Xuefeng Liu, Qingfeng Li, Jiahe Cui, Ji Wan
Abstract: Manga is a world-popular comic form that originated in Japan and typically employs black-and-white stroke lines and geometric exaggeration to describe humans' appearances, poses, and actions. In this paper, we propose MangaGAN, the first method based on Generative Adversarial Networks (GANs) for unpaired photo-to-manga translation. Inspired by how experienced manga artists draw manga, MangaGAN generates the geometric features of a manga face with a designed GAN model and delicately translates each facial region into the manga domain with a tailored multi-GAN architecture. To train MangaGAN, we construct a new dataset collected from a popular manga work, containing manga facial features, landmarks, bodies, and so on. Moreover, to produce high-quality manga faces, we further propose a structural smoothing loss to smooth stroke lines and avoid noisy pixels, and a similarity-preserving module to improve the similarity between the photo and manga domains. Extensive experiments show that MangaGAN can produce high-quality manga faces which preserve both facial similarity and a popular manga style, and it outperforms other related state-of-the-art methods.
5. Self-Supervised Representation Learning on Document Images [PDF] Back to Contents
Adrian Cosma, Mihai Ghidoveanu, Michael Panaitescu-Liess, Marius Popescu
Abstract: This work analyses the impact of self-supervised pre-training on document images. While previous approaches explore the effect of self-supervision on natural images, we show that patch-based pre-training performs poorly on text document images because of their different structural properties and poor intra-sample semantic information. We propose two context-aware alternatives to improve performance. We also propose a novel method for self-supervision which makes use of the inherent multi-modality of documents (image and text) and performs better than other popular self-supervised methods, including supervised ImageNet pre-training.
6. Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions [PDF] Back to Contents
Ignacio Rocco, Relja Arandjelović, Josef Sivic
Abstract: In this work we target the problem of estimating accurately localised correspondences between a pair of images. We adopt the recent Neighbourhood Consensus Networks that have demonstrated promising performance for difficult correspondence problems and propose modifications to overcome their main limitations: large memory consumption, large inference time and poorly localised correspondences. Our proposed modifications can reduce the memory footprint and execution time by more than $10\times$, with equivalent results. This is achieved by sparsifying the correlation tensor containing tentative matches, and by its subsequent processing with a 4D CNN using submanifold sparse convolutions. Localisation accuracy is significantly improved by processing the input images at higher resolution, which is possible due to the reduced memory footprint, and by a novel two-stage correspondence relocalisation module. The proposed Sparse-NCNet method obtains state-of-the-art results on the HPatches Sequences and InLoc visual localisation benchmarks, and competitive results on the Aachen Day-Night benchmark.
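The sparsification step can be illustrated as follows: build the 4D tensor of match scores between all position pairs and keep only the strongest entries. This dense top-k sketch only mimics the effect; the paper's submanifold sparse convolutions then operate directly on the sparse tensor.

```python
import torch

def sparse_correlation(fa, fb, keep=0.01):
    """fa, fb: (C, H, W) feature maps. Returns the (H, W, H, W) correlation
    tensor with all but the top `keep` fraction of scores zeroed out."""
    c, h, w = fa.shape
    fa = torch.nn.functional.normalize(fa.reshape(c, -1), dim=0)
    fb = torch.nn.functional.normalize(fb.reshape(c, -1), dim=0)
    corr = (fa.t() @ fb).reshape(h, w, h, w)       # score for every position pair
    k = max(1, int(keep * corr.numel()))
    thresh = corr.reshape(-1).topk(k).values[-1]   # keep only the top-k matches
    return corr * (corr >= thresh)
```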
7. Real-time Simultaneous 3D Head Modeling and Facial Motion Capture with an RGB-D camera [PDF] Back to Contents
Diego Thomas
Abstract: We propose a method to build animated 3D head models in real time using a consumer-grade RGB-D camera. Our proposed method is the first one to provide simultaneously comprehensive facial motion tracking and a detailed 3D model of the user's head. Anyone's head can be instantly reconstructed and their facial motion captured without requiring any training or pre-scanning. The user starts facing the camera with a neutral expression in the first frame, but is free to move, talk and change facial expression at will thereafter. The facial motion is captured using a blendshape animation model, while geometric details are captured using a deviation image mapped over the template mesh. We contribute an efficient algorithm to grow and refine the deforming 3D model of the head on the fly and in real time. We demonstrate robust and high-fidelity simultaneous facial motion capture and 3D head modeling results on a wide range of subjects with various head poses and facial expressions.
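The blendshape model at the heart of the motion capture is simple to state: the animated face is the neutral geometry plus a weighted sum of expression offsets. A minimal sketch, with shapes and sizes purely illustrative:

```python
import numpy as np

def blendshape(neutral, deltas, weights):
    """neutral: (V, 3) vertices; deltas: (E, V, 3) expression offsets;
    weights: (E,) blend coefficients tracked per frame."""
    return neutral + np.tensordot(weights, deltas, axes=1)

verts = blendshape(np.zeros((100, 3)),
                   np.random.randn(5, 100, 3),
                   np.array([0.2, 0.0, 0.7, 0.0, 0.1]))
```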
8. Multi-Domain Learning and Identity Mining for Vehicle Re-Identification [PDF] Back to Contents
Shuting He, Hao Luo, Weihua Chen, Miao Zhang, Yuqi Zhang, Fan Wang, Hao Li, Wei Jiang
Abstract: This paper introduces our solution for Track2 of the AI City Challenge 2020 (AICITY20). Track2 is a vehicle re-identification (ReID) task with both real-world and synthetic data. Our solution is based on a strong baseline with a bag of tricks (BoT-BS) proposed for person ReID. First, we propose a multi-domain learning method that jointly uses the real-world and synthetic data to train the model. Then, we propose the Identity Mining method to automatically generate pseudo labels for a part of the testing data, which works better than k-means clustering. A tracklet-level re-ranking strategy with weighted features is also used to post-process the results. Finally, with a multiple-model ensemble, our method achieves an mAP score of 0.7322, which yields third place in the competition. The code is available at this https URL.
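One plausible reading of pseudo-label generation is sketched below: link test samples whose embeddings are mutually similar and treat each connected component as one pseudo identity. This is an assumption-laden stand-in for the paper's Identity Mining, shown only to contrast with the k-means alternative; the threshold is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def mine_identities(feats, thresh=0.6):
    """feats: (N, D) embeddings. Returns one pseudo label per sample."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                               # cosine similarity matrix
    adj = csr_matrix(sim > thresh)              # link strongly similar pairs
    _, labels = connected_components(adj, directed=False)
    return labels
```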
9. TetraTSDF: 3D human reconstruction from a single image with a tetrahedral outer shell [PDF] Back to Contents
Hayato Onizuka, Zehra Hayirci, Diego Thomas, Akihiro Sugimoto, Hideaki Uchiyama, Rin-ichiro Taniguchi
Abstract: Recovering the 3D shape of a person from its 2D appearance is ill-posed due to ambiguities. Nevertheless, with the help of convolutional neural networks (CNNs) and prior knowledge of the 3D human body, it is possible to overcome such ambiguities and recover detailed 3D shapes of human bodies from single images. Current solutions, however, fail to reconstruct all the details of a person wearing loose clothes, because of either (a) huge memory requirements that cannot be met even on modern GPUs, or (b) compact 3D representations that cannot encode all the details. In this paper, we propose the tetrahedral outer-shell volumetric truncated signed distance function (TetraTSDF) model for the human body, and its corresponding part connection network (PCN) for 3D human body shape regression. Our proposed model is compact, dense, accurate, and yet well suited for CNN-based regression tasks. Our proposed PCN allows us to learn the distribution of the TSDF in the tetrahedral volume from a single image in an end-to-end manner. Results show that our proposed method can reconstruct detailed shapes of humans wearing loose clothes from single RGB images.
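The TSDF part of the representation reduces to a clamp: the signed distance to the surface is truncated to a fixed band and normalised, and those values are stored at the shell's sample points. A one-line sketch (the truncation width is illustrative):

```python
import numpy as np

def tsdf(signed_dist, trunc=0.05):
    """Truncated signed distance: clamp distances (in metres) into [-1, 1]."""
    return np.clip(signed_dist / trunc, -1.0, 1.0)

print(tsdf(np.array([-0.2, -0.01, 0.0, 0.03, 0.4])))  # [-1. -0.2 0. 0.6 1.]
```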
10. Distributed Learning and Inference with Compressed Images [PDF] Back to Contents
Sudeep Katakol, Basem Elbarashy, Luis Herranz, Joost van de Weijer, Antonio M. Lopez
Abstract: Modern computer vision requires processing large amounts of data, both while training the model and/or during inference, once the model is deployed. Scenarios where images are captured and processed in physically separated locations are increasingly common (e.g. autonomous vehicles, cloud computing). In addition, many devices suffer from limited resources to store or transmit data (e.g. storage space, channel capacity). In these scenarios, lossy image compression plays a crucial role to effectively increase the number of images collected under such constraints. However, lossy compression entails some undesired degradation of the data that may harm the performance of the downstream analysis task at hand, since important semantic information may be lost in the process. Moreover, we may only have compressed images at training time but are able to use original images at inference time, or vice versa, and in such a case, the downstream model suffers from covariate shift. In this paper, we analyze this phenomenon, with a special focus on vision-based perception for autonomous driving as a paradigmatic scenario. We see that loss of semantic information and covariate shift do indeed exist, resulting in a drop in performance that depends on the compression rate. In order to address the problem, we propose dataset restoration, based on image restoration with generative adversarial networks (GANs). Our method is agnostic to both the particular image compression method and the downstream task; and has the advantage of not adding additional cost to the deployed models, which is particularly important in resource-limited devices. The presented experiments focus on semantic segmentation as a challenging use case, cover a broad range of compression rates and diverse datasets, and show how our method is able to significantly alleviate the negative effects of compression on the downstream visual task.
11. Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution [PDF] Back to Contents
Gary S. W. Goh, Sebastian Lapuschkin, Leander Weber, Wojciech Samek, Alexander Binder
Abstract: Integrated gradients, as an attribution method for deep neural network models, is simple to implement. However, it suffers from noisy explanations, which hurt interpretability. In this paper, we present Smooth Integrated Gradients, a statistically improved attribution method inspired by Taylor's theorem which does not require a fixed baseline to be chosen. We apply both methods to the image classification problem, using the ILSVRC2012 ImageNet object recognition dataset and a couple of pretrained image models, to generate attribution maps of their predictions. These attribution maps are visualized as saliency maps, which can be evaluated qualitatively. We also evaluate them empirically using quantitative metrics such as perturbation-based score drops and multi-scaled total variance. We further propose adaptive noising to optimize the noise-scale hyperparameter value in our proposed method. Our experiments show that the Smooth Integrated Gradients approach, together with adaptive noising, is able to generate better-quality saliency maps with less noise and higher sensitivity to the relevant points in the input space.
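For reference, classical integrated gradients, the method the paper starts from, averages gradients along a straight path from a baseline to the input and scales by the input-baseline difference. A minimal PyTorch sketch; the paper's contribution removes the fixed baseline, which this sketch does not show:

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    """model: maps a batch of inputs to class logits; x: single unbatched input;
    target: class index. Returns an attribution map with x's shape."""
    if baseline is None:
        baseline = torch.zeros_like(x)       # common choice; any baseline works
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)         # (steps, *x.shape)
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grad, = torch.autograd.grad(out, path)
    return (x - baseline) * grad.mean(dim=0)          # Riemann-sum approximation
```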
12. Graph Convolutional Subspace Clustering: A Robust Subspace Clustering Framework for Hyperspectral Image [PDF] Back to Contents
Yaoming Cai, Zijia Zhang, Zhihua Cai, Xiaobo Liu, Xinwei Jiang, Qin Yan
Abstract: Hyperspectral image (HSI) clustering is a challenging task due to the high complexity of HSI data. Subspace clustering has been proven to be powerful for exploiting the intrinsic relationship between data points. Despite impressive performance in HSI clustering, traditional subspace clustering methods often ignore the inherent structural information among data. In this paper, we revisit subspace clustering with graph convolution and present a novel subspace clustering framework called Graph Convolutional Subspace Clustering (GCSC) for robust HSI clustering. Specifically, the framework recasts the self-expressiveness property of the data into the non-Euclidean domain, which results in a more robust graph-embedding dictionary. We show that traditional subspace clustering models are special forms of our framework with Euclidean data. Building on the framework, we further propose two novel subspace clustering models using the Frobenius norm, namely Efficient GCSC (EGCSC) and Efficient Kernel GCSC (EKGCSC). Both models have a globally optimal closed-form solution, which makes them easier to implement, train, and apply in practice. Extensive experiments on three popular HSI datasets demonstrate that EGCSC and EKGCSC achieve state-of-the-art clustering performance and dramatically outperform many existing methods by significant margins.
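The closed form behind EGCSC can be reconstructed from the abstract: with a Frobenius-norm penalty, the graph-convolved self-expressive problem min_C ||Z - ZC||_F^2 + lam ||C||_F^2 with Z = XA has the direct solution C = (Z^T Z + lam I)^{-1} Z^T Z. The adjacency normalisation and the kernel variant (EKGCSC) are omitted; treat this as a hedged sketch, not the authors' code.

```python
import numpy as np

def egcsc_coefficients(X, A, lam=1.0):
    """X: (d, n) features with samples as columns; A: (n, n) pre-normalised
    adjacency. Returns the coefficient matrix C and a symmetric affinity W."""
    Z = X @ A                             # graph-convolved features
    G = Z.T @ Z
    C = np.linalg.solve(G + lam * np.eye(G.shape[0]), G)
    W = 0.5 * (np.abs(C) + np.abs(C).T)   # affinity for spectral clustering
    return C, W
```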
13. Warwick Image Forensics Dataset for Device Fingerprinting In Multimedia Forensics [PDF] Back to Contents
Yijun Quan, Chang-Tsun Li, Yujue Zhou, Li Li
Abstract: Device fingerprints like sensor pattern noise (SPN) are widely used for provenance analysis and image authentication. Over the past few years, the rapid advancement of digital photography has greatly reshaped the pipeline of the image capturing process on consumer-level mobile devices. The flexibility of camera parameter settings and the emergence of multi-frame photography algorithms, especially high-dynamic-range (HDR) imaging, bring new challenges to device fingerprinting. Subsequent study of these topics requires a new, purposefully built image dataset. In this paper, we present the Warwick Image Forensics Dataset, a dataset of more than 58,600 images captured using 14 digital cameras with various exposure settings. Special attention to the exposure settings allows the images to be adopted by different multi-frame computational photography algorithms and used for subsequent device fingerprinting. The dataset is released open-source, free to use for the digital forensics community.
14. DeepFake Detection by Analyzing Convolutional Traces [PDF] Back to Contents
Luca Guarnera, Oliver Giudice, Sebastiano Battiato
Abstract: The Deepfake phenomenon has become very popular nowadays thanks to the possibility of creating incredibly realistic images using deep learning tools, based mainly on ad-hoc Generative Adversarial Networks (GANs). In this work we focus on the analysis of Deepfakes of human faces, with the objective of creating a new detection method able to detect a forensic trace hidden in images: a sort of fingerprint left by the image generation process. The proposed technique extracts, by means of an Expectation-Maximization (EM) algorithm, a set of local features specifically addressed to model the underlying convolutional generative process. Ad-hoc validation has been carried out through experimental tests with naive classifiers on five different architectures (GDWCT, STARGAN, ATTGAN, STYLEGAN, STYLEGAN2) against the CELEBA dataset as ground truth for non-fakes. Results demonstrate the effectiveness of the technique in distinguishing the different architectures and the corresponding generation process.
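As a rough illustration of EM-based trace extraction, the sketch below fits a local linear-predictor kernel with an inlier/outlier mixture, in the spirit of classical resampling-detection EM. It is an assumption-heavy stand-in: the paper's feature set and generative model differ, and the outlier density and noise scale here are illustrative.

```python
import numpy as np
from scipy.ndimage import correlate

def em_traces(img, iters=10, sigma=0.1):
    """img: 2D float array in [0, 1]. Returns the fitted 3x3 kernel (the
    'trace' feature) and a per-pixel inlier probability map."""
    x = img.astype(np.float64)
    k = np.full((3, 3), 1.0 / 8.0); k[1, 1] = 0.0    # predict each pixel from neighbours
    p0 = 0.5                                          # flat outlier density (assumed)
    pad = np.pad(x, 1, mode="reflect")
    # design matrix: the 8 neighbours of every pixel, row-major order
    A = np.stack([pad[1 + dy:1 + dy + x.shape[0], 1 + dx:1 + dx + x.shape[1]].ravel()
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)],
                 axis=1)
    for _ in range(iters):
        # E-step: posterior that each pixel follows the linear model
        r = x - correlate(x, k, mode="reflect")
        p1 = np.exp(-r ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        w = (p1 / (p1 + p0)).ravel()
        # M-step: weighted least squares for the kernel, then the noise scale
        coef = np.linalg.solve(A.T @ (A * w[:, None]) + 1e-6 * np.eye(8),
                               A.T @ (w * x.ravel()))
        k = np.insert(coef, 4, 0.0).reshape(3, 3)
        sigma = np.sqrt(np.average(r.ravel() ** 2, weights=w) + 1e-12)
    return k, w.reshape(x.shape)

kernel, inliers = em_traces(np.random.rand(64, 64))
```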
15. Recursive Social Behavior Graph for Trajectory Prediction [PDF] Back to Contents
Jianhua Sun, Qinhong Jiang, Cewu Lu
Abstract: Social interaction is an important topic in human trajectory prediction for generating plausible paths. In this paper, we present a novel insight into group-based social interaction models to explore relationships among pedestrians. We recursively extract social representations supervised by group-based annotations and formulate them into a social behavior graph, called the Recursive Social Behavior Graph. Our recursive mechanism substantially expands the representational power. A Graph Convolutional Neural Network is then used to propagate social interaction information in such a graph. With the guidance of the Recursive Social Behavior Graph, we surpass the state-of-the-art method on the ETH and UCY datasets by 11.1% in ADE and 10.8% in FDE on average, and successfully predict complex social behaviors.
16. Image Processing Failure and Deep Learning Success in Lawn Measurement [PDF] Back to Contents
J. Wilkins, M. V. Nguyen, B. Rahmani
Abstract: Lawn area measurement is an application of image processing and deep learning. Researchers have used hierarchical networks, segmented images, and many other methods to measure lawn area, with varying effectiveness and accuracy. In this project, image processing and deep learning methods were compared to find the best way to measure lawn area. Three image processing methods using OpenCV were compared to a convolutional neural network, one of the most famous and effective deep learning methods. We used Keras and TensorFlow to estimate the lawn area. The convolutional neural network (CNN) shows very high accuracy (94-97%) for this purpose. Among the image processing methods, thresholding with 80-87% accuracy and edge detection are effective ways to measure lawn area, but contouring, at 26-31% accuracy, does not calculate the lawn area successfully. We may conclude that deep learning methods, especially CNNs, could be the best detection method compared to image processing and other deep learning techniques.
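The thresholding baseline from the comparison is easy to reproduce in OpenCV: segment green pixels in HSV space and report their fraction of the frame. The HSV bounds and file name below are illustrative and would need per-camera tuning.

```python
import cv2
import numpy as np

def lawn_fraction(bgr):
    """Fraction of the frame covered by 'green' pixels in HSV space."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array([35, 40, 40]), np.array([85, 255, 255]))
    return cv2.countNonZero(mask) / mask.size

img = cv2.imread("lawn.jpg")          # hypothetical input image
if img is not None:
    print(f"lawn covers {lawn_fraction(img):.1%} of the frame")
```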
17. Graph-based Kinship Reasoning Network [PDF] Back to Contents
Wanhua Li, Yingqiang Zhang, Kangchen Lv, Jiwen Lu, Jianjiang Feng, Jie Zhou
Abstract: In this paper, we propose a graph-based kinship reasoning (GKR) network for kinship verification, which aims to effectively perform relational reasoning on the extracted features of an image pair. Unlike most existing methods which mainly focus on how to learn discriminative features, our method considers how to compare and fuse the extracted feature pair to reason about the kin relations. The proposed GKR constructs a star graph called kinship relational graph where each peripheral node represents the information comparison in one feature dimension and the central node is used as a bridge for information communication among peripheral nodes. Then the GKR performs relational reasoning on this graph with recursive message passing. Extensive experimental results on the KinFaceW-I and KinFaceW-II datasets show that the proposed GKR outperforms the state-of-the-art methods.
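The star-graph reasoning can be sketched with a few tensor ops: one peripheral node per feature dimension holds the comparison of the two face embeddings, and a central node relays messages between them for several rounds. The layer shapes and update rules below are assumptions, not the paper's exact GKR.

```python
import torch
import torch.nn as nn

class StarGraphReasoning(nn.Module):
    def __init__(self, dim, hidden=32, rounds=3):
        super().__init__()
        self.embed = nn.Linear(2, hidden)        # (f1_d, f2_d) -> peripheral state
        self.to_center = nn.Linear(hidden, hidden)
        self.from_center = nn.Linear(2 * hidden, hidden)
        self.readout = nn.Linear(hidden, 1)
        self.rounds = rounds

    def forward(self, f1, f2):                   # f1, f2: (B, dim) face embeddings
        h = torch.relu(self.embed(torch.stack([f1, f2], dim=-1)))   # (B, dim, H)
        for _ in range(self.rounds):
            center = torch.relu(self.to_center(h.mean(dim=1)))      # aggregate to hub
            msg = center.unsqueeze(1).expand_as(h)                   # broadcast back
            h = torch.relu(self.from_center(torch.cat([h, msg], dim=-1)))
        return torch.sigmoid(self.readout(h.mean(dim=1)))            # kin probability

model = StarGraphReasoning(dim=64)
p = model(torch.randn(4, 64), torch.randn(4, 64))   # -> torch.Size([4, 1])
```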
18. Automatic exposure selection and fusion for high-dynamic-range photography via smartphones [PDF] Back to Contents
Reza Pourreza, Nasser Kehtarnavaz
Abstract: High-dynamic-range (HDR) photography involves fusing a bracket of images taken at different exposure settings in order to compensate for the low dynamic range of digital cameras such as the ones used in smartphones. In this paper, a method for automatically selecting the exposure settings of such images is introduced based on the camera characteristic function. In addition, a new fusion method is introduced based on an optimization formulation and weighted averaging. Both of these methods are implemented on a smartphone platform as an HDR app to demonstrate their practicality. Comparison results with several existing methods are presented, indicating the effectiveness as well as the computational efficiency of the introduced solution.
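The weighted-averaging side of the fusion can be illustrated with a Mertens-style "well-exposedness" weight: pixels near mid-grey in each exposure dominate the fused result. The paper's optimisation-based formulation is more elaborate; the sigma below is illustrative.

```python
import numpy as np

def fuse_exposures(stack, sigma=0.2):
    """stack: (N, H, W, 3) uint8 bracket of N exposures. Returns a fused
    float image in [0, 1]."""
    stack = stack.astype(np.float64) / 255.0
    grey = stack.mean(axis=3, keepdims=True)
    w = np.exp(-((grey - 0.5) ** 2) / (2 * sigma ** 2))   # favour mid-grey pixels
    w = w / (w.sum(axis=0, keepdims=True) + 1e-12)        # normalise across exposures
    return (w * stack).sum(axis=0)
```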
19. Yoga-82: A New Dataset for Fine-grained Classification of Human Poses [PDF] Back to Contents
Manisha Verma, Sudhakar Kumawat, Yuta Nakashima, Shanmuganathan Raman
Abstract: Human pose estimation is a well-known problem in computer vision that locates joint positions. Existing datasets for learning poses are not challenging enough in terms of pose diversity, object occlusion, and viewpoints. This makes the pose annotation process relatively simple and restricts the application of the models that have been trained on them. To handle more variety in human poses, we propose the concept of fine-grained hierarchical pose classification, in which we formulate pose estimation as a classification task, and propose a dataset, Yoga-82, for large-scale yoga pose recognition with 82 classes. Yoga-82 consists of complex poses where fine annotations may not be possible. To resolve this, we provide hierarchical labels for yoga poses based on the body configuration of the pose. The dataset contains a three-level hierarchy including body positions, variations in body positions, and the actual pose names. We present the classification accuracy of state-of-the-art convolutional neural network architectures on Yoga-82. We also present several hierarchical variants of DenseNet in order to utilize the hierarchical labels.
20. Textual Visual Semantic Dataset for Text Spotting [PDF] 返回目录
Ahmed Sabir, Francesc Moreno-Noguer, Lluís Padró
Abstract: Text Spotting in the wild consists of detecting and recognizing text appearing in images (e.g. signboards, traffic signals or brands in clothing or objects). This is a challenging problem due to the complexity of the context where texts appear (uneven backgrounds, shading, occlusions, perspective distortions, etc.). Only a few approaches try to exploit the relation between text and its surrounding environment to better recognize text in the scene. In this paper, we propose a visual context dataset for Text Spotting in the wild, where the publicly available dataset COCO-text [Veit et al. 2016] has been extended with information about the scene (such as objects and places appearing in the image) to enable researchers to include semantic relations between texts and scene in their Text Spotting systems, and to offer a common framework for such approaches. For each text in an image, we extract three kinds of context information: objects in the scene, image location label and a textual image description (caption). We use state-of-the-art out-of-the-box available tools to extract this additional information. Since this information has textual form, it can be used to leverage text similarity or semantic relation methods into Text Spotting systems, either as a post-processing or in an end-to-end training strategy. Our data is publicly available at this https URL.
21. The iWildCam 2020 Competition Dataset [PDF] 返回目录
Sara Beery, Elijah Cole, Arvi Gjoka
Abstract: Camera traps enable the automatic collection of large quantities of image data. Biologists all over the world use camera traps to monitor animal populations. We have recently been making strides towards automatic species classification in camera trap images. However, as we try to expand the geographic scope of these models we are faced with an interesting question: how do we train models that perform well on new (unseen during training) camera trap locations? Can we leverage data from other modalities, such as citizen science data and remote sensing data? In order to tackle this problem, we have prepared a challenge where the training data and test data are from different cameras spread across the globe. For each camera, we provide a series of remote sensing imagery that is tied to the location of the camera. We also provide citizen science imagery from the set of species seen in our data. The challenge is to correctly classify species in the test camera traps.
22. How to track your dragon: A Multi-Attentional Framework for real-time RGB-D 6-DOF Object Pose Tracking [PDF] 返回目录
Isidoros Marougkas, Petros Koutras, Nikos Kardaris, Georgios Retsinas, Georgia Chalvatzaki, Petros Maragos
Abstract: We present a novel multi-attentional convolutional architecture to tackle the problem of real-time RGB-D 6D object pose tracking of single, known objects. Such a problem poses multiple challenges originating both from the objects' nature and their interaction with their environment, which previous approaches have failed to fully address. The proposed framework encapsulates methods for background clutter and occlusion handling by integrating multiple parallel soft spatial attention modules into a multitask Convolutional Neural Network (CNN) architecture. Moreover, we consider the special geometrical properties of both the object's 3D model and the pose space, and we use a more sophisticated approach for data augmentation for training. The provided experimental results confirm the effectiveness of the proposed multi-attentional architecture, as it improves the State-of-the-Art (SoA) tracking performance by an average score of 40.5% for translation and 57.5% for rotation, when testing on the dataset presented in [1], the most complete dataset designed, up to date, for the problem of RGB-D object tracking.
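For readers unfamiliar with soft spatial attention, a generic block of this kind is sketched below: a learned score per location, normalized with a softmax over the spatial grid, reweights the feature map. The exact module design in the paper may differ.

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # one score per spatial location

    def forward(self, x):
        b, c, h, w = x.shape
        attn = torch.softmax(self.score(x).view(b, 1, h * w), dim=-1).view(b, 1, h, w)
        return x * attn * (h * w)   # softmax averages to 1/(h*w); rescale to keep magnitude

out = SoftSpatialAttention(64)(torch.randn(2, 64, 32, 32))
```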
23. Multi-view Self-Constructing Graph Convolutional Networks with Adaptive Class Weighting Loss for Semantic Segmentation [PDF] 返回目录
Qinghui Liu, Michael Kampffmeyer, Robert Jenssen, Arnt-Børre Salberg
Abstract: We propose a novel architecture called the Multi-view Self-Constructing Graph Convolutional Networks (MSCG-Net) for semantic segmentation. Building on the recently proposed Self-Constructing Graph (SCG) module, which makes use of learnable latent variables to self-construct the underlying graphs directly from the input features without relying on manually built prior knowledge graphs, we leverage multiple views in order to explicitly exploit the rotational invariance in airborne images. We further develop an adaptive class weighting loss to address the class imbalance. We demonstrate the effectiveness and flexibility of the proposed method on the Agriculture-Vision challenge dataset and our model achieves very competitive results (0.547 mIoU) with much fewer parameters and at a lower computational cost compared to related pure-CNN based work. Code will be available at: this http URL
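One plausible reading of an adaptive class weighting loss is sketched below: a cross-entropy whose per-class weights are recomputed from a running estimate of class frequency, so that rare classes are up-weighted. The momentum update and the log-based weighting rule are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

class AdaptiveClassWeighting:
    def __init__(self, num_classes, momentum=0.9):
        self.freq = torch.ones(num_classes) / num_classes   # running class-frequency estimate
        self.momentum = momentum

    def __call__(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) integer labels.
        batch = torch.bincount(target.flatten(), minlength=self.freq.numel()).float()
        batch /= batch.sum().clamp(min=1)
        self.freq = self.momentum * self.freq + (1 - self.momentum) * batch
        weights = 1.0 / torch.log(1.02 + self.freq)         # rarer class -> larger weight
        return F.cross_entropy(logits, target, weight=weights)

loss_fn = AdaptiveClassWeighting(num_classes=7)
loss = loss_fn(torch.randn(2, 7, 16, 16), torch.randint(0, 7, (2, 16, 16)))
```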
24. Combining Deep Learning Classifiers for 3D Action Recognition [PDF] 返回目录
Jan Sedmidubsky, Pavel Zezula
Abstract: The popular task of 3D human action recognition is almost exclusively solved by training deep-learning classifiers. To achieve a high recognition accuracy, the input 3D actions are often pre-processed by various normalization or augmentation techniques. However, it is not computationally feasible to train a classifier for each possible variant of training data in order to select the best-performing subset of pre-processing techniques for a given dataset. In this paper, we propose to train an independent classifier for each available pre-processing technique and fuse the classification results based on a strict majority vote rule. Together with a proposed evaluation procedure, we can very efficiently determine the best combination of normalization and augmentation techniques for a specific dataset. For the best-performing combination, we can retrospectively apply the normalized/augmented variants of input data to train only a single classifier. This also allows us to decide whether it is better to train a single model, or rather a set of independent classifiers.
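The fusion rule itself is simple enough to state in code. The sketch below emits a label only when a strict majority of classifiers agree; the fall-back to the highest mean confidence on ties is an assumption, since the abstract does not specify the tie-breaking behavior.

```python
import numpy as np

def strict_majority_vote(probs):
    """probs: (n_classifiers, n_samples, n_classes) predicted probabilities."""
    votes = probs.argmax(axis=-1)                        # (n_clf, n_samples)
    n_clf, n_samples = votes.shape
    fused = np.empty(n_samples, dtype=int)
    for i in range(n_samples):
        counts = np.bincount(votes[:, i], minlength=probs.shape[-1])
        top = counts.argmax()
        if counts[top] * 2 > n_clf:                      # strict majority reached
            fused[i] = top
        else:                                            # tie-break: highest mean confidence
            fused[i] = probs[:, i, :].mean(axis=0).argmax()
    return fused

labels = strict_majority_vote(np.random.dirichlet(np.ones(5), size=(7, 10)))
```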
25. Group Activity Detection from Trajectory and Video Data in Soccer [PDF] 返回目录
Ryan Sanford, Siavash Gorji, Luiz G. Hafemann, Bahareh Pourbabaee, Mehrsan Javan
Abstract: Group activity detection in soccer can be done by using either video data or player and ball trajectory data. In current soccer activity datasets, activities are labelled as atomic events without a duration. Given that the state-of-the-art activity detection methods are not well-defined for atomic actions, these methods cannot be used. In this work, we evaluated the effectiveness of activity recognition models for detecting such events, by using an intuitive non-maximum suppression process and evaluation metrics. We also considered the problem of explicitly modeling interactions between players and ball. For this, we propose self-attention models to learn and extract relevant information from a group of soccer players for activity detection from both trajectory and video data. We conducted an extensive study on the use of visual features and trajectory data for group activity detection in sports using a large scale soccer dataset provided by Sportlogiq. Our results show that most events can be detected using either vision or trajectory-based approaches with a temporal resolution of less than 0.5 seconds, and that each approach has unique challenges.
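Because the events are atomic (no duration), detection reduces to picking peaks in a per-frame score sequence. A simple 1-D non-maximum suppression of the kind the abstract calls "intuitive" might look as follows; the window length and threshold are illustrative assumptions.

```python
import numpy as np

def temporal_nms(scores, window=15, threshold=0.5):
    """scores: (T,) per-frame confidence for one event class.
    Returns the frame indices kept as detections."""
    suppressed = np.zeros(len(scores), dtype=bool)
    keep = []
    for t in np.argsort(scores)[::-1]:    # visit frames from highest score down
        if scores[t] < threshold:
            break
        if suppressed[t]:
            continue
        keep.append(int(t))
        lo, hi = max(0, t - window), min(len(scores), t + window + 1)
        suppressed[lo:hi] = True          # suppress the neighbourhood of the kept peak
    return sorted(keep)

detections = temporal_nms(np.random.rand(300))
```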
26. Panoptic-based Image Synthesis [PDF] 返回目录
Aysegul Dundar, Karan Sapra, Guilin Liu, Andrew Tao, Bryan Catanzaro
Abstract: Conditional image synthesis for generating photorealistic images serves various applications, from content editing to content generation. Previous conditional image synthesis algorithms mostly rely on semantic maps, and often fail in complex environments where multiple instances occlude each other. We propose a panoptic aware image synthesis network to generate high fidelity and photorealistic images conditioned on panoptic maps which unify semantic and instance information. To achieve this, we efficiently use panoptic maps in convolution and upsampling layers. We show that with the proposed changes to the generator, we can improve on the previous state-of-the-art methods by generating images of complex instance interaction environments with higher fidelity and rendering tiny objects in more detail. Furthermore, our proposed method also outperforms the previous state-of-the-art methods on the metrics of mean IoU (Intersection over Union) and detAP (Detection Average Precision).
27. Learning Multi-Modal Image Registration without Real Data [PDF] 返回目录
Malte Hoffmann, Benjamin Billot, Juan Eugenio Iglesias, Bruce Fischl, Adrian V. Dalca
Abstract: We introduce a learning-based strategy for multi-modal registration of images acquired with any modality, without requiring real data during training. While classical registration methods can accurately align multi-modal image pairs, they solve a costly optimization problem for every new pair of images. Learning-based techniques are fast at test time, but can only register images of the specific anatomy and modalities they were trained on. In contrast, our approach leverages a generative model to synthesize label maps and gray-scale images that expose a network to a wide range of anatomy and contrast during training. We demonstrate that this strategy enables robust registration of arbitrary modalities, without the need to retrain for a new modality. Critically, we show that input labels need not be of actual anatomy: training on randomly synthesized shapes, or supervoxels, results in competitive registration performance and makes the network agnostic to anatomy and contrast, all while eradicating the need for real data. We present extensive experiments demonstrating that this strategy enables registration of modalities not seen during training and surpasses the state of art in cross-contrast registration. Our code is integrated with the VoxelMorph library at: this http URL
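The synthesis idea is easy to illustrate: random label maps come from smoothed noise, and each label is rendered with a randomly sampled intensity so that every training pair shows a new, unseen "contrast". The smoothing choice and hyperparameters below are assumptions, not the authors' generative model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def random_label_map(shape=(96, 96), n_labels=5, smooth=8.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(size=(n_labels, *shape))
    # Smoothing then argmax turns noise into blobby, anatomy-free "shapes".
    return gaussian_filter(noise, sigma=(0, smooth, smooth)).argmax(axis=0)

def render(labels, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    means = rng.uniform(0, 1, labels.max() + 1)          # one random intensity per label
    return (means[labels] + rng.normal(scale=0.05, size=labels.shape)).astype(np.float32)

labels = random_label_map()
moving, fixed = render(labels), render(labels)   # same shapes, two different "contrasts"
# A registration network can now train on (moving, fixed) with `labels` as supervision.
```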
28. ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs [PDF] 返回目录
Shiyang Yan, Yang Hua, Neil Robertson
Abstract: Image description generation plays an important role in many real-world applications, such as image retrieval, automatic navigation, and support for disabled people. A well-developed task of image description generation is image captioning, which usually generates a short captioning sentence and thus neglects many fine-grained properties, e.g., the information of subtle objects and their relationships. In this paper, we study visual paragraph generation, which can describe the image with a long paragraph containing rich details. Previous research often generates the paragraph via a hierarchical Recurrent Neural Network (RNN)-like model, which has complex memorising, forgetting and coupling mechanisms. Instead, we propose a novel pure CNN model, ParaCNN, to generate visual paragraphs using a hierarchical CNN architecture with contextual information between sentences within one paragraph. The ParaCNN can generate a paragraph of arbitrary length, which is more applicable in many real-world applications. Furthermore, to enable the ParaCNN to model paragraphs comprehensively, we also propose an adversarial twin net training scheme. During training, we force the forwarding network's hidden features to be close to those of the backwards network by using adversarial training. During testing, we only use the forwarding network, which already includes the knowledge of the backwards network, to generate a paragraph. We conduct extensive experiments on the Stanford Visual Paragraph dataset and achieve state-of-the-art performance.
29. Partial Volume Segmentation of Brain MRI Scans of any Resolution and Contrast [PDF] 返回目录
Benjamin Billot, Eleanor D. Robinson, Adrian V. Dalca, Juan Eugenio Iglesias
Abstract: Partial voluming (PV) is arguably the last crucial unsolved problem in Bayesian segmentation of brain MRI with probabilistic atlases. PV occurs when voxels contain multiple tissue classes, giving rise to image intensities that may not be representative of any one of the underlying classes. PV is particularly problematic for segmentation when there is a large resolution gap between the atlas and the test scan, e.g., when segmenting clinical scans with thick slices, or when using a high-resolution atlas. In this work, we present PV-SynthSeg, a convolutional neural network (CNN) that tackles this problem by directly learning a mapping between (possibly multi-modal) low resolution (LR) scans and underlying high resolution (HR) segmentations. PV-SynthSeg simulates LR images from HR label maps with a generative model of PV, and can be trained to segment scans of any desired target contrast and resolution, even for previously unseen modalities where neither images nor segmentations are available at training. PV-SynthSeg does not require any preprocessing, and runs in seconds. We demonstrate the accuracy and flexibility of the method with extensive experiments on three datasets and 2,680 scans. The code is available at this https URL.
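The PV simulation step can be sketched directly: average-pooling a one-hot high-resolution label map yields per-voxel tissue fractions, and the low-resolution intensity is the fraction-weighted mix of per-class intensities sampled for that image. The downsampling factor and noise level below are assumptions.

```python
import numpy as np

def simulate_lr(hr_labels, n_classes, factor=4, rng=None):
    """hr_labels: (H, W) integer label map with H, W divisible by `factor`."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = hr_labels.shape
    onehot = np.eye(n_classes)[hr_labels]                         # (H, W, C)
    # Average-pooling each class map gives the PV fraction in every LR voxel.
    frac = onehot.reshape(h // factor, factor,
                          w // factor, factor, n_classes).mean(axis=(1, 3))
    mu = rng.uniform(0, 1, n_classes)                             # per-class mean intensity
    lr = frac @ mu + rng.normal(scale=0.03, size=frac.shape[:2])  # PV-mixed intensities
    return lr.astype(np.float32), frac

lr_image, fractions = simulate_lr(np.random.randint(0, 4, (64, 64)), n_classes=4)
```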
30. Red-GAN: Attacking class imbalance via conditioned generation. Yet another medical imaging perspective [PDF] 返回目录
Ahmad B Qasim, Ivan Ezhov, Suprosanna Shit, Oliver Schoppe, Johannes C Paetzold, Anjany Sekuboyina, Florian Kofler, Jana Lipkova, Hongwei Li, Bjoern Menze
Abstract: Exploiting learning algorithms under scarce data regimes is a limitation and a reality of the medical imaging field. In an attempt to mitigate the problem, we propose a data augmentation protocol based on generative adversarial networks. We condition the networks at a pixel level (segmentation mask) and at a global level (acquisition environment or lesion type). Such conditioning provides immediate access to the image-label pairs while controlling global class specific appearance of the synthesized images. To stimulate synthesis of the features relevant for the segmentation task, an additional passive player in the form of a segmentor is introduced into the adversarial game. We validate the approach on two medical datasets: BraTS, ISIC. By controlling the class distribution through injection of synthetic images into the training set we achieve control over the accuracy levels of the datasets' classes.
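A generic sketch of the two-level conditioning described above: the generator receives the segmentation mask as input channels (the pixel-level condition) and the global condition broadcast as extra constant channels. The architecture sizes and the broadcasting scheme are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ConditionedGenerator(nn.Module):
    def __init__(self, mask_ch=4, n_global=3, out_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(mask_ch + n_global, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, mask, global_onehot):
        # mask: (B, mask_ch, H, W); global_onehot: (B, n_global).
        b, _, h, w = mask.shape
        g = global_onehot.view(b, -1, 1, 1).expand(b, global_onehot.shape[1], h, w)
        return self.net(torch.cat([mask, g], dim=1))   # pixel- and global-level condition

fake = ConditionedGenerator()(torch.rand(2, 4, 64, 64), torch.eye(3)[torch.tensor([0, 2])])
```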
31. ktrain: A Low-Code Library for Augmented Machine Learning [PDF] 返回目录
Arun S. Maiya
Abstract: We present ktrain, a low-code Python library that makes machine learning more accessible and easier to apply. As a wrapper to TensorFlow and many other libraries (e.g., transformers, scikit-learn, stellargraph), it is designed to make sophisticated, state-of-the-art machine learning models simple to build, train, inspect, and deploy by both beginners and experienced practitioners. Featuring modules that support text data (e.g., text classification, sequence-tagging, open-domain question-answering), vision data (e.g., image classification), and graph data (e.g., node classification, link prediction), ktrain presents a simple unified interface enabling one to quickly solve a wide range of tasks in as little as three or four "commands" or lines of code.
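The "three or four commands" workflow follows ktrain's documented text-classification pattern; the data path and hyperparameters below are placeholders.

```python
import ktrain
from ktrain import text

# Load and preprocess a folder of labelled text documents (path is a placeholder).
trn, val, preproc = text.texts_from_folder('data/my_corpus',
                                           maxlen=500, preprocess_mode='bert')
model = text.text_classifier('bert', train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(2e-5, 1)   # one-cycle learning-rate policy, one epoch
```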
32. A review: Deep learning for medical image segmentation using multi-modality fusion [PDF] 返回目录
Tongxue Zhou, Su Ruan, Stéphane Canu
Abstract: Multi-modality is widely used in medical imaging, because it can provide multiple sources of information about a target (tumor, organ or tissue). Segmentation using multi-modality consists of fusing multiple sources of information to improve the segmentation. Recently, deep learning-based approaches have presented state-of-the-art performance in image classification, segmentation, object detection and tracking tasks. Due to their self-learning and generalization ability over large amounts of data, deep learning has recently also gained great interest in multi-modal medical image segmentation. In this paper, we give an overview of deep learning-based approaches for the multi-modal medical image segmentation task. Firstly, we introduce the general principles of deep learning and multi-modal medical image segmentation. Secondly, we present different deep learning network architectures, then analyze their fusion strategies and compare their results. Earlier fusion is commonly used, since it is simple and it focuses on the subsequent segmentation network architecture. However, later fusion gives more attention to the fusion strategy, to learn the complex relationship between different modalities. In general, compared to earlier fusion, later fusion can give more accurate results if the fusion method is effective enough. We also discuss some common problems in medical image segmentation. Finally, we summarize and provide some perspectives on future research.
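The two fusion levels the review contrasts can be reduced to a minimal pair of networks: early fusion concatenates the modalities at the input, while late fusion merges modality-specific feature streams before the prediction head. Layer sizes here are illustrative.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class EarlyFusion(nn.Module):                     # fuse at the input level
    def __init__(self, n_mod=2, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(conv_block(n_mod, 32), nn.Conv2d(32, n_classes, 1))
    def forward(self, mods):                      # mods: list of (B, 1, H, W) modalities
        return self.net(torch.cat(mods, dim=1))

class LateFusion(nn.Module):                      # fuse learned feature streams
    def __init__(self, n_mod=2, n_classes=3):
        super().__init__()
        self.streams = nn.ModuleList([conv_block(1, 32) for _ in range(n_mod)])
        self.head = nn.Conv2d(32 * n_mod, n_classes, 1)
    def forward(self, mods):
        return self.head(torch.cat([s(m) for s, m in zip(self.streams, mods)], dim=1))

mods = [torch.randn(1, 1, 64, 64) for _ in range(2)]
out_early, out_late = EarlyFusion()(mods), LateFusion()(mods)
```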
33. Automatic Detection of Coronavirus Disease (COVID-19) in X-ray and CT Images: A Machine Learning-Based Approach [PDF] 返回目录
Sara Hosseinzadeh Kassani, Peyman Hosseinzadeh Kassasni, Michal J. Wesolowski, Kevin A. Schneider, Ralph Deters
Abstract: The newly identified Coronavirus pneumonia, subsequently termed COVID-19, is highly transmittable and pathogenic with no clinically approved antiviral drug or vaccine available for treatment. The most common symptoms of COVID-19 are dry cough, sore throat, and fever. Symptoms can progress to a severe form of pneumonia with critical complications, including septic shock, pulmonary edema, acute respiratory distress syndrome and multi-organ failure. While medical imaging is not currently recommended in Canada for primary diagnosis of COVID-19, computer-aided diagnosis systems could assist in the early detection of COVID-19 abnormalities, help to monitor the progression of the disease, and potentially reduce mortality rates. In this study, we compare popular deep learning-based feature extraction frameworks for automatic COVID-19 classification. To obtain the most accurate features, which are an essential component of learning, MobileNet, DenseNet, Xception, ResNet, InceptionV3, InceptionResNetV2, VGGNet, and NASNet were chosen amongst a pool of deep convolutional neural networks. The extracted features were then fed into several machine learning classifiers to classify subjects as either a case of COVID-19 or a control. This approach avoided task-specific data pre-processing methods to support a better generalization ability for unseen data. The performance of the proposed method was validated on a publicly available COVID-19 dataset of chest X-ray and CT images. The DenseNet121 feature extractor with a Bagging tree classifier achieved the best performance with 99% classification accuracy. The second-best learner was a hybrid of a ResNet50 feature extractor trained by LightGBM with an accuracy of 98%.
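The best-performing pipeline reduces to a frozen feature extractor feeding a classical ensemble, roughly as sketched below; the input arrays, tree count and image preprocessing are placeholders rather than the paper's exact setup.

```python
import numpy as np
from tensorflow.keras.applications import DenseNet121
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Frozen ImageNet-pretrained DenseNet121, global-average-pooled to 1024-d features.
extractor = DenseNet121(weights='imagenet', include_top=False, pooling='avg')

images = np.random.rand(16, 224, 224, 3).astype(np.float32)  # placeholder image batch
labels = np.random.randint(0, 2, 16)                         # 0 = control, 1 = COVID-19

features = extractor.predict(images, verbose=0)              # (16, 1024)
# Bagged decision trees on top of the deep features
# (the keyword is `base_estimator` in scikit-learn versions before 1.2).
clf = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100)
clf.fit(features, labels)
preds = clf.predict(features)
```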
34. Up or Down? Adaptive Rounding for Post-Training Quantization [PDF] 返回目录
Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort
Abstract: When quantizing neural networks, assigning each floating-point weight to its nearest fixed-point value is the predominant approach. We find that, perhaps surprisingly, this is not the best we can do. In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantization that adapts to the data and the task loss. AdaRound is fast, does not require fine-tuning of the network, and only uses a small amount of unlabelled data. We start by theoretically analyzing the rounding problem for a pre-trained neural network. By approximating the task loss with a Taylor series expansion, the rounding task is posed as a quadratic unconstrained binary optimization problem. We simplify this to a layer-wise local loss and propose to optimize this loss with a soft relaxation. AdaRound not only outperforms rounding-to-nearest by a significant margin but also establishes a new state-of-the-art for post-training quantization on several networks and tasks. Without fine-tuning, we can quantize the weights of Resnet18 and Resnet50 to 4 bits while staying within an accuracy loss of 1%.
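The core soft relaxation can be sketched for a single layer: each weight learns, through a rectified sigmoid h(V), whether to round down or up, optimized against a local reconstruction loss plus a regularizer that anneals h(V) toward exactly 0 or 1. The constants follow the paper (zeta = 1.1, gamma = -0.1); the training loop below is otherwise simplified.

```python
import torch

def rectified_sigmoid(v, zeta=1.1, gamma=-0.1):
    return torch.clamp(torch.sigmoid(v) * (zeta - gamma) + gamma, 0, 1)

w = torch.randn(128, 64)            # pretrained weights of one layer
x = torch.randn(256, 64)            # a batch of inputs to that layer
s = w.abs().max() / 7               # toy symmetric 4-bit quantization scale
v = torch.zeros_like(w, requires_grad=True)
opt = torch.optim.Adam([v], lr=1e-2)

for step in range(1000):
    h = rectified_sigmoid(v)                                # soft rounding decision in [0, 1]
    w_q = s * torch.clamp(torch.floor(w / s) + h, -8, 7)    # softly quantized weights
    recon = ((x @ w.t() - x @ w_q.t()) ** 2).mean()         # local output reconstruction loss
    beta = 20 - 18 * step / 1000                            # anneal the regularizer's sharpness
    reg = (1 - (2 * h - 1).abs() ** beta).sum()             # pushes h to exactly 0 or 1
    loss = recon + 0.01 * reg
    opt.zero_grad()
    loss.backward()
    opt.step()

# Final hard rounding: round up wherever h(V) crossed 0.5.
w_final = s * torch.clamp(torch.floor(w / s) + (rectified_sigmoid(v) > 0.5).float(), -8, 7)
```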
35. Human and Machine Action Prediction Independent of Object Information [PDF] 返回目录
Fatemeh Ziaeetabar, Jennifer Pomp, Stefan Pfeiffer, Nadiya El-Sourani, Ricarda I. Schubotz, Minija Tamosiunaite, Florentin Wörgötter
Abstract: Predicting other people's action is key to successful social interactions, enabling us to adjust our own behavior to the consequence of the others' future actions. Studies on action recognition have focused on the importance of individual visual features of objects involved in an action and its context. Humans, however, recognize actions on unknown objects or even when objects are imagined (pantomime). Other cues must thus compensate the lack of recognizable visual object features. Here, we focus on the role of inter-object relations that change during an action. We designed a virtual reality setup and tested recognition speed for 10 different manipulation actions on 50 subjects. All objects were abstracted by emulated cubes so the actions could not be inferred using object information. Instead, subjects had to rely only on the information that comes from the changes in the spatial relations that occur between those cubes. In spite of these constraints, our results show the subjects were able to predict actions in, on average, less than 64% of the action's duration. We employed a computational model -an enriched Semantic Event Chain (eSEC) incorporating the information of spatial relations, specifically (a) objects' touching/untouching, (b) static spatial relations between objects and (c) dynamic spatial relations between objects. Trained on the same actions as those observed by subjects, the model successfully predicted actions even better than humans. Information theoretical analysis shows that eSECs optimally use individual cues, whereas humans presumably mostly rely on a mixed-cue strategy, which takes longer until recognition. Providing a better cognitive basis of action recognition may, on one hand improve our understanding of related human pathologies and, on the other hand, also help to build robots for conflict-free human-robot cooperation. Our results open new avenues here.
36. Deep Learning for Screening COVID-19 using Chest X-Ray Images [PDF] 返回目录
Sanhita Basu, Sushmita Mitra
Abstract: With the ever-increasing demand to screen millions of prospective "novel coronavirus" or COVID-19 cases, and due to the emergence of high false-negative rates in the commonly used PCR tests, probing an alternative, simple screening mechanism for COVID-19 using radiological images (such as chest X-rays) assumes importance. In this scenario, machine learning (ML) and deep learning (DL) offer fast, automated, and effective strategies to detect abnormalities and extract key features of the altered lung parenchyma, which may be related to specific signatures of the COVID-19 virus. However, the available COVID-19 datasets are inadequate for training deep neural networks. Therefore, we propose a new concept called domain extension transfer learning (DETL). We employ DETL, with a pre-trained deep convolutional neural network, on a related large chest X-ray dataset tuned for classifying among four classes, \textit{viz.} $normal$, $other\_disease$, $pneumonia$ and $Covid-19$. A 5-fold cross-validation is performed to estimate the feasibility of using chest X-rays to diagnose COVID-19. The initial results show promise, with the possibility of replication on bigger and more diverse data sets. The overall accuracy was measured as $95.3\% \pm 0.02$. To gain insight into the transparency of COVID-19 detection, we employed the concept of Gradient Class Activation Maps (Grad-CAM) to identify the regions to which the model paid more attention during classification. These regions were found to correlate strongly with clinical findings, as validated by experts.
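Since Grad-CAM is the transparency tool named here, a minimal PyTorch sketch follows; the `resnet18` backbone, the `layer4` hook point, and the random input are placeholders for the paper's DETL network and real chest X-rays.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=4).eval()  # four classes, as in the paper
feats, grads = {}, {}

layer = model.layer4  # last conv block; hook point is an assumption
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed chest X-ray
model(x)[0].max().backward()     # backprop the top-class logit

w = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP over the gradients
cam = F.relu((w * feats["a"]).sum(dim=1))      # weighted sum of feature maps
cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]
cam /= cam.max()                 # normalized attention map over the input
```

Upsampling the map back to the input size, as in the last lines, is what allows the heatmap to be overlaid on the X-ray for expert validation.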
37. Stabilizing Training of Generative Adversarial Nets via Langevin Stein Variational Gradient Descent [PDF] 返回目录
Dong Wang, Xiaoqian Qin, Fengyi Song, Li Cheng
Abstract: Generative adversarial networks (GANs), famous for their capability to learn complex underlying data distributions, are nevertheless known to be tricky to train, which can result in mode collapse or performance deterioration. Current approaches to GANs' issues mostly resort to practical training techniques for the purpose of regularization, which in turn undermine the convergence and theoretical soundness of GANs. In this paper, we propose to stabilize GAN training via a novel particle-based variational inference method -- Langevin Stein variational gradient descent (LSVGD) -- which not only inherits the flexibility and efficiency of the original SVGD but also addresses its instability issues by incorporating an extra disturbance into the update dynamics. We further demonstrate that, by properly adjusting the noise variance, LSVGD simulates a Langevin process whose stationary distribution is exactly the target distribution. We also show that the LSVGD dynamics has an implicit regularization that is able to enhance particles' spread and diversity. Finally, we present an efficient way of applying particle-based variational inference to a general GAN training procedure, regardless of the loss function adopted. Experimental results on one synthetic dataset and three popular benchmark datasets -- Cifar-10, Tiny-ImageNet and CelebA -- validate that LSVGD can remarkably improve the performance and stability of various GAN models.
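As a rough illustration of the idea, here is a toy one-dimensional SVGD update with a Langevin-style Gaussian perturbation added, targeting a standard normal; the RBF bandwidth, step size, and the form of the noise term are simplifications and may differ from the paper's exact LSVGD dynamics.

```python
import numpy as np

def rbf(x, h=0.5):
    diff = x[:, None] - x[None, :]
    k = np.exp(-diff ** 2 / (2 * h ** 2))   # RBF kernel k(x_j, x_i)
    gk = -diff / h ** 2 * k                 # gradient of k w.r.t. x_j
    return k, gk

def lsvgd_step(x, step=0.1, noise=0.05):
    grad_logp = -x                          # score of the N(0, 1) target
    k, gk = rbf(x)
    phi = (k @ grad_logp + gk.sum(axis=0)) / len(x)  # plain SVGD direction
    # extra Langevin-style disturbance injected into the update dynamics
    return x + step * phi + np.sqrt(2 * step) * noise * np.random.randn(len(x))

x = np.random.randn(200) * 3 + 5            # particles far from the target
for _ in range(500):
    x = lsvgd_step(x)
print(x.mean(), x.std())                    # should approach 0 and 1
```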
38. Learning an Adaptive Model for Extreme Low-light Raw Image Processing [PDF] 返回目录
Qingxu Fu, Xiaoguang Di, Yu Zhang
Abstract: Low-light images suffer from severe noise and low illumination. Current deep learning models trained with real-world images achieve excellent noise reduction, but a ratio parameter must be chosen manually to complete the enhancement pipeline. In this work, we propose an adaptive low-light raw image enhancement network to avoid parameter handcrafting and to improve image quality. The proposed method can be divided into two sub-models: Brightness Prediction (BP) and Exposure Shifting (ES). The former is designed to control the brightness of the resulting image by estimating a guideline exposure time $t_1$. The latter learns to approximate an exposure-shifting operator $ES$, converting a low-light image with real exposure time $t_0$ into a noise-free image with guideline exposure time $t_1$. Additionally, a structural similarity (SSIM) loss and an Image Enhancement Vector (IEV) are introduced to promote image quality, and a new Campus Image Dataset (CID) is proposed to overcome the limitations of existing datasets and to supervise the training of the proposed model. Using the proposed model, we can achieve high-quality low-light image enhancement from a single raw image. Quantitative tests show that the proposed method has the lowest Noise Level Estimation (NLE) score compared with state-of-the-art low-light algorithms, suggesting superior denoising performance. Furthermore, those tests illustrate that the proposed method can adaptively control the global image brightness according to the content of the image scene. Lastly, the potential application to video processing is briefly discussed.
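The two-stage data flow can be sketched as follows, with both sub-models reduced to toy placeholders; only the BP → ES wiring reflects the abstract, while the 4-channel packed-raw input, the layer choices, and the ratio-map conditioning are assumptions.

```python
import torch
import torch.nn as nn

class BrightnessPredictor(nn.Module):  # BP: raw input -> guideline time t1
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(8, 1), nn.Softplus())  # keep t1 positive

    def forward(self, raw):
        return self.net(raw)

class ExposureShifter(nn.Module):      # ES: (raw, t0, t1) -> enhanced image
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(5, 4, 3, padding=1)  # raw channels + ratio map

    def forward(self, raw, t0, t1):
        ratio = (t1 / t0).view(-1, 1, 1, 1).expand(-1, 1, *raw.shape[-2:])
        return self.net(torch.cat([raw, ratio], dim=1))

raw = torch.rand(1, 4, 64, 64)         # packed Bayer patch (assumption)
t0 = torch.tensor([0.01])              # real exposure time, 10 ms
t1 = BrightnessPredictor()(raw)        # estimated guideline exposure time
out = ExposureShifter()(raw, t0, t1)   # noise-free estimate at exposure t1
```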
39. Keep It Real: a Window to Real Reality in Virtual Reality [PDF] 返回目录
Baihan Lin
Abstract: This paper proposes a new interaction paradigm for virtual reality (VR) environments, which consists of a virtual mirror or window projected onto a virtual surface, representing the correct perspective geometry of a mirror or window reflecting the real world. This technique can be applied to various videos, live streaming apps, and augmented and virtual reality settings to provide an interactive and immersive user experience. To support such a perspective-accurate representation, we implemented computer vision algorithms for feature detection and correspondence matching. To constrain the solutions, we incorporated an automatically tuned scaling factor applied to the homography transform matrix, such that each image frame follows a smooth transition with the user in sight. The system is a real-time rendering framework in which users can engage their real-life presence with the virtual space.
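A plausible sketch of the feature-matching and smoothed-homography step is given below using OpenCV; ORB features, brute-force matching, RANSAC estimation, and the exponential smoothing of the scale factor are illustrative choices rather than the authors' exact pipeline.

```python
import cv2
import numpy as np

orb = cv2.ORB_create(1000)
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
smoothed_scale = 1.0  # running estimate of the homography's scale part

def mirror_homography(prev_gray, curr_gray, alpha=0.9):
    global smoothed_scale
    k1, d1 = orb.detectAndCompute(prev_gray, None)
    k2, d2 = orb.detectAndCompute(curr_gray, None)
    matches = sorted(bf.match(d1, d2), key=lambda m: m.distance)[:100]
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # auto-tuned scaling: exponentially smooth the per-frame scale so the
    # projected mirror transitions smoothly while the user stays in sight
    scale = np.sqrt(abs(np.linalg.det(H[:2, :2])))
    smoothed_scale = alpha * smoothed_scale + (1 - alpha) * scale
    H[:2, :2] *= smoothed_scale / scale
    return H
```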
40. M-LVC: Multiple Frames Prediction for Learned Video Compression [PDF] 返回目录
Jianping Lin, Dong Liu, Houqiang Li, Feng Wu
Abstract: We propose an end-to-end learned video compression scheme for low-latency scenarios. Previous methods are limited to using only the single previous frame as reference. Our method introduces the use of multiple previous frames as references. In our scheme, the motion vector (MV) field is calculated between the current frame and the previous one. With multiple reference frames and the associated multiple MV fields, our designed network can generate a more accurate prediction of the current frame, yielding less residual. Multiple reference frames also help generate an MV prediction, which reduces the coding cost of the MV field. We use two deep auto-encoders to compress the residual and the MV, respectively. To compensate for the compression error of the auto-encoders, we further design an MV refinement network and a residual refinement network, making use of the multiple reference frames as well. All the modules in our scheme are jointly optimized through a single rate-distortion loss function. We use a step-by-step training strategy to optimize the entire scheme. Experimental results show that the proposed method outperforms existing learned video compression methods in low-latency mode. Our method also performs better than H.265 in terms of both PSNR and MS-SSIM. Our code and models are publicly available.
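The single rate-distortion objective can be illustrated with a toy loss of the usual $R + \lambda D$ form; the `lam` weight and the latent likelihoods (one per compressed stream, e.g. MV and residual) are stand-ins for the paper's entropy models.

```python
import torch

def rd_loss(x, x_hat, latent_likelihoods, lam=2048.0):
    # D: reconstruction distortion against the original frame
    distortion = torch.mean((x - x_hat) ** 2)
    # R: estimated bits per pixel, -log2 p(latents) under the entropy
    # models, summed over all compressed streams (e.g. MV and residual)
    num_pixels = x.shape[0] * x.shape[-2] * x.shape[-1]
    rate = sum(-torch.log2(p).sum() for p in latent_likelihoods) / num_pixels
    return rate + lam * distortion  # one loss jointly optimizes all modules
```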
Note: the Chinese text in this digest is machine-translated!