
[arXiv Papers] Computer Vision and Pattern Recognition 2021-01-01

Contents

1. Real-time Webcam Heart-Rate and Variability Estimation with Clean Ground Truth for Evaluation [PDF] Abstract
2. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [PDF] Abstract
3. Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans [PDF] Abstract
4. A CNN Approach to Simultaneously Count Plants and Detect Plantation-Rows from UAV Imagery [PDF] Abstract
5. iGOS++: Integrated Gradient Optimized Saliency by Bilateral Perturbations [PDF] Abstract
6. Illumination Estimation Challenge: experience of past two years [PDF] Abstract
7. SelectScale: Mining More Patterns from Images via Selective and Soft Dropout [PDF] Abstract
8. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection [PDF] Abstract
9. NeuralMagicEye: Learning to See and Understand the Scene Behind an Autostereogram [PDF] Abstract
10. CNN-based Single Image Crowd Counting: Network Design, Loss Function and Supervisory Signal [PDF] Abstract
11. Unsupervised Monocular Depth Reconstruction of Non-Rigid Scenes [PDF] Abstract
12. CorrNet3D: Unsupervised End-to-end Learning of Dense Correspondence for 3D Point Clouds [PDF] Abstract
13. A Deep Retinal Image Quality Assessment Network with Salient Structure Priors [PDF] Abstract
14. Patch-wise++ Perturbation for Adversarial Targeted Attacks [PDF] Abstract
15. Incremental Embedding Learning via Zero-Shot Translation [PDF] Abstract
16. Audio-Visual Floorplan Reconstruction [PDF] Abstract
17. Learned Multi-Resolution Variable-Rate Image Compression with Octave-based Residual Blocks [PDF] Abstract
18. TransTrack: Multiple-Object Tracking with Transformer [PDF] Abstract
19. SID: Incremental Learning for Anchor-Free Object Detection via Selective and Inter-Related Distillation [PDF] Abstract
20. SharpGAN: Receptive Field Block Net for Dynamic Scene Deblurring [PDF] Abstract
21. Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions [PDF] Abstract
22. 3D Human motion anticipation and classification [PDF] Abstract
23. Provident Vehicle Detection at Night: The PVDN Dataset [PDF] Abstract
24. OSTeC: One-Shot Texture Completion [PDF] Abstract
25. Knowledge Distillation with Adaptive Asymmetric Label Sharpening for Semi-supervised Fracture Detection in Chest X-rays [PDF] Abstract
26. Active Annotation of Informative Overlapping Frames in Video Mosaicking Applications [PDF] Abstract
27. Temporally-Transferable Perturbations: Efficient, One-Shot Adversarial Attacks for Online Visual Object Trackers [PDF] Abstract
28. Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation [PDF] Abstract
29. MM-FSOD: Meta and metric integrated few-shot object detection [PDF] Abstract
30. DUT-LFSaliency: Versatile Dataset and Light Field-to-RGB Saliency Detection [PDF] Abstract
31. RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving [PDF] Abstract
32. Bidirectional Mapping Coupled GAN for Generalized Zero-Shot Learning [PDF] Abstract
33. SkiNet: A Deep Learning Solution for Skin Lesion Diagnosis with Uncertainty Estimation and Explainability [PDF] Abstract
34. Damaged Fingerprint Recognition by Convolutional Long Short-Term Memory Networks for Forensic Purposes [PDF] Abstract
35. NBNet: Noise Basis Learning for Image Denoising with Subspace Projection [PDF] Abstract
36. Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [PDF] Abstract
37. 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition [PDF] Abstract
38. SALA: Soft Assignment Local Aggregation for 3D Semantic Segmentation [PDF] Abstract
39. Detecting Hate Speech in Multi-modal Memes [PDF] Abstract
40. Learning a Dynamic Map of Visual Appearance [PDF] Abstract
41. Object sorting using faster R-CNN [PDF] Abstract
42. Visual-Thermal Camera Dataset Release and Multi-Modal Alignment without Calibration Information [PDF] Abstract
43. Graph-based non-linear least squares optimization for visual place recognition in changing environments [PDF] Abstract
44. Deep Hashing for Secure Multimodal Biometrics [PDF] Abstract
45. Chasing the Tail in Monocular 3D Human Reconstruction with Prototype Memory [PDF] Abstract
46. Advances in deep learning methods for pavement surface crack detection and identification with visible light visual images [PDF] Abstract
47. Image-to-Image Retrieval by Learning Similarity between Scene Graphs [PDF] Abstract
48. COIN: Contrastive Identifier Network for Breast Mass Diagnosis in Mammography [PDF] Abstract
49. Towards Reducing Severe Defocus Spread Effects for Multi-Focus Image Fusion via an Optimization Based Strategy [PDF] Abstract
50. Tips and Tricks for Webly-Supervised Fine-Grained Recognition: Learning from the WebFG 2020 Challenge [PDF] Abstract
51. TrustMAE: A Noise-Resilient Defect Classification Framework using Memory-Augmented Auto-Encoders with Trust Regions [PDF] Abstract
52. The VIP Gallery for Video Processing Education [PDF] Abstract
53. MS-GWNN: multi-scale graph wavelet neural network for breast cancer diagnosis [PDF] Abstract
54. FPCC-Net: Fast Point Cloud Clustering for Instance Segmentation [PDF] Abstract
55. Hierarchical Representation via Message Propagation for Robust Model Fitting [PDF] Abstract
56. AU-Expression Knowledge Constrained Representation Learning for Facial Expression Recognition [PDF] Abstract
57. MGML: Multi-Granularity Multi-Level Feature Ensemble Network for Remote Sensing Scene Classification [PDF] Abstract
58. Visual Probing and Correction of Object Recognition Models with Interactive user feedback [PDF] Abstract
59. Enhancing Handwritten Text Recognition with N-gram sequence decomposition and Multitask Learning [PDF] Abstract
60. Color Channel Perturbation Attacks for Fooling Convolutional Neural Networks and A Defense Against Such Attacks [PDF] Abstract
61. Deep Learning Towards Edge Computing: Neural Networks Straight from Compressed Data [PDF] Abstract
62. EC-GAN: Low-Sample Classification using Semi-Supervised Algorithms and GANs [PDF] Abstract
63. Language-Mediated, Object-Centric Representation Learning [PDF] Abstract
64. Estimating Uncertainty in Neural Networks for Cardiac MRI Segmentation: A Benchmark Study [PDF] Abstract
65. Overview of MediaEval 2020 Predicting Media Memorability Task: What Makes a Video Memorable? [PDF] Abstract
66. Investigating Memorability of Dynamic Media [PDF] Abstract
67. Leveraging Audio Gestalt to Predict Media Memorability [PDF] Abstract
68. Searching a Raw Video Database using Natural Language Queries [PDF] Abstract
69. Exploiting Shared Knowledge from Non-COVID Lesions for Annotation-Efficient COVID-19 CT Lung Infection Segmentation [PDF] Abstract
70. Colonoscopy Polyp Detection: Domain Adaptation From Medical Report Images to Real-time Videos [PDF] Abstract
71. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [PDF] Abstract
72. Survey of the Detection and Classification of Pulmonary Lesions via CT and X-Ray [PDF] Abstract
73. New Bag of Deep Visual Words based features to classify chest x-ray images for COVID-19 diagnosis [PDF] Abstract
74. FREA-Unet: Frequency-aware U-net for Modality Transfer [PDF] Abstract
75. Model-Based Visual Planning with Self-Supervised Functional Distances [PDF] Abstract
76. H2NF-Net for Brain Tumor Segmentation using Multimodal MR Imaging: 2nd Place Solution to BraTS Challenge 2020 Segmentation Task [PDF] Abstract
77. MRI brain tumor segmentation and uncertainty estimation using 3D-UNet architectures [PDF] Abstract
78. Some Algorithms on Exact, Approximate and Error-Tolerant Graph Matching [PDF] Abstract
79. Automatic Polyp Segmentation using U-Net-ResNet50 [PDF] Abstract
80. DDANet: Dual Decoder Attention Network for Automatic Polyp Segmentation [PDF] Abstract
81. Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation [PDF] Abstract
82. Exploring Large Context for Cerebral Aneurysm Segmentation [PDF] Abstract
83. Fast Hyperspectral Image Recovery via Non-iterative Fusion of Dual-Camera Compressive Hyperspectral Imaging [PDF] Abstract
84. Accurate Word Representations with Universal Visual Guidance [PDF] Abstract
85. Unpaired Image Enhancement with Quality-Attention Generative Adversarial Network [PDF] Abstract
86. A Review of Machine Learning Techniques for Applied Eye Fundus and Tongue Digital Image Processing with Diabetes Management System [PDF] Abstract
87. DeepSphere: a graph-based spherical CNN [PDF] Abstract
88. Semi-supervised Cardiac Image Segmentation via Label Propagation and Style Transfer [PDF] Abstract
89. Parzen Window Approximation on Riemannian Manifold [PDF] Abstract
90. AILearn: An Adaptive Incremental Learning Model for Spoof Fingerprint Detection [PDF] Abstract
91. Annotation-Efficient Learning for Medical Image Segmentation based on Noisy Pseudo Labels and Adversarial Learning [PDF] Abstract
92. Ensembled ResUnet for Anatomical Brain Barriers Segmentation [PDF] Abstract
93. Myocardial Segmentation of Cardiac MRI Sequences with Temporal Consistency for Coronary Artery Disease Diagnosis [PDF] Abstract
94. Cascaded Framework for Automatic Evaluation of Myocardial Infarction from Delayed-Enhancement Cardiac MRI [PDF] Abstract
95. Comparison of different CNNs for breast tumor classification from ultrasound images [PDF] Abstract
96. SASSI -- Super-Pixelated Adaptive Spatio-Spectral Imaging [PDF] Abstract
97. Classification of Pathological and Normal Gait: A Survey [PDF] Abstract
98. Evaluation and Comparison of Edge-Preserving Filters [PDF] Abstract

Abstracts

1. Real-time Webcam Heart-Rate and Variability Estimation with Clean Ground Truth for Evaluation [PDF] Back to Contents
  Amogh Gudi, Marian Bittner, Jan van Gemert
Abstract: Remote photo-plethysmography (rPPG) uses a camera to estimate a person's heart rate (HR). Similar to how heart rate can provide useful information about a person's vital signs, insights about the underlying physio/psychological conditions can be obtained from heart rate variability (HRV). HRV is a measure of the fine fluctuations in the intervals between heart beats. However, this measure requires temporally locating heart beats with a high degree of precision. We introduce a refined and efficient real-time rPPG pipeline with novel filtering and motion suppression that not only estimates heart rates, but also extracts the pulse waveform to time heart beats and measure heart rate variability. This unsupervised method requires no rPPG specific training and is able to operate in real-time. We also introduce a new multi-modal video dataset, VicarPPG 2, specifically designed to evaluate rPPG algorithms on HR and HRV estimation. We validate and study our method under various conditions on a comprehensive range of public and self-recorded datasets, showing state-of-the-art results and providing useful insights into some unique aspects. Lastly, we make available CleanerPPG, a collection of human-verified ground truth peak/heart-beat annotations for existing rPPG datasets. These verified annotations should make future evaluations and benchmarking of rPPG algorithms more accurate, standardized and fair.
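As a concrete illustration of the HRV quantity the abstract refers to, here is a minimal Python sketch computing two standard HRV statistics (SDNN and RMSSD) from detected beat timestamps. The function name and metric choice are illustrative; the paper's own pipeline and evaluation details differ.

```python
import numpy as np

def hrv_metrics(beat_times_s):
    """Standard HRV summary statistics from beat timestamps (in seconds).

    SDNN and RMSSD are conventional HRV measures computed from
    inter-beat intervals; this is generic, not the paper's exact code.
    """
    ibi = np.diff(beat_times_s) * 1000.0         # inter-beat intervals (ms)
    sdnn = np.std(ibi, ddof=1)                   # overall variability
    rmssd = np.sqrt(np.mean(np.diff(ibi) ** 2))  # beat-to-beat variability
    return sdnn, rmssd

# Beats detected at roughly 1 Hz with slight jitter:
print(hrv_metrics([0.0, 1.01, 1.98, 3.02, 3.99]))
```

Because both statistics are driven entirely by inter-beat intervals, small timing errors in peak detection propagate directly into HRV, which is why the abstract stresses precise temporal localization of beats.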

2. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [PDF] Back to Contents
  Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang
Abstract: Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (ie, without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first (44.42% mIoU) position in the highly competitive ADE20K test server leaderboard.
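The core idea, an encoder that is a pure transformer over a sequence of patch tokens, can be sketched in a few lines of PyTorch. Sizes and layer counts below are placeholders, not SETR's actual configuration:

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 3, 256, 256   # batch of images (illustrative sizes)
P, D = 16, 768                # patch size and embedding dimension

# Non-overlapping P x P patches -> one D-dim token each (no downsampling CNN)
patchify = nn.Conv2d(C, D, kernel_size=P, stride=P)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True),
    num_layers=2,
)

x = torch.randn(B, C, H, W)
tokens = patchify(x).flatten(2).transpose(1, 2)  # (B, (H/P)*(W/P), D)
feats = encoder(tokens)  # global self-attention at every layer
```

A segmentation decoder then reshapes the token sequence back into a 2D grid and upsamples it to per-pixel class scores, which is the "simple decoder" role the abstract describes.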

3. Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans [PDF] Back to Contents
  Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, Xiaowei Zhou
Abstract: This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Some recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality given dense input views. However, the representation learning will be ill-posed if the views are highly sparse. To solve this ill-posed problem, our key idea is to integrate observations over video frames. To this end, we propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh, so that the observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance for the network to learn 3D representations more efficiently. Experiments on a newly collected multi-view dataset show that our approach outperforms prior works by a large margin in terms of the view synthesis quality. We also demonstrate the capability of our approach to reconstruct a moving person from a monocular video on the People-Snapshot dataset. The code and dataset will be available at this https URL.

4. A CNN Approach to Simultaneously Count Plants and Detect Plantation-Rows from UAV Imagery [PDF] Back to Contents
  Lucas Prado Osco, Mauro dos Santos de Arruda, Diogo Nunes Gonçalves, Alexandre Dias, Juliana Batistoti, Mauricio de Souza, Felipe David Georges Gomes, Ana Paula Marques Ramos, Lúcio André de Castro Jorge, Veraldo Liesenberg, Jonathan Li, Lingfei Ma, José Marcato Junior, Wesley Nunes Gonçalves
Abstract: In this paper, we propose a novel deep learning method based on a Convolutional Neural Network (CNN) that simultaneously detects and geolocates plantation-rows while counting its plants considering highly-dense plantation configurations. The experimental setup was evaluated in a cornfield with different growth stages and in a Citrus orchard. Both datasets characterize different plant density scenarios, locations, types of crops, sensors, and dates. A two-branch architecture was implemented in our CNN method, where the information obtained within the plantation-row is updated into the plant detection branch and retro-feed to the row branch; which are then refined by a Multi-Stage Refinement method. In the corn plantation datasets (with both growth phases, young and mature), our approach returned a mean absolute error (MAE) of 6.224 plants per image patch, a mean relative error (MRE) of 0.1038, precision and recall values of 0.856 and 0.905, respectively, and an F-measure equal to 0.876. These results were superior to the results from other deep networks (HRNet, Faster R-CNN, and RetinaNet) evaluated with the same task and dataset. For the plantation-row detection, our approach returned precision, recall, and F-measure scores of 0.913, 0.941, and 0.925, respectively. To test the robustness of our model with a different type of agriculture, we performed the same task in the citrus orchard dataset. It returned an MAE equal to 1.409 citrus-trees per patch, MRE of 0.0615, precision of 0.922, recall of 0.911, and F-measure of 0.965. For citrus plantation-row detection, our approach resulted in precision, recall, and F-measure scores equal to 0.965, 0.970, and 0.964, respectively. The proposed method achieved state-of-the-art performance for counting and geolocating plants and plant-rows in UAV images from different types of crops.
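For reference, the MAE and MRE figures quoted above follow the standard definitions for per-patch counting. A minimal sketch (the paper's full evaluation protocol, including detection matching, is more involved):

```python
import numpy as np

def counting_errors(pred_counts, true_counts):
    """Mean absolute error and mean relative error over image patches."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = np.mean(np.abs(pred - true))
    mre = np.mean(np.abs(pred - true) / true)
    return mae, mre

print(counting_errors([98, 105, 40], [100, 100, 42]))
```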

5. iGOS++: Integrated Gradient Optimized Saliency by Bilateral Perturbations [PDF] Back to Contents
  Saeed Khorram, Tyler Lawson, Fuxin Li
Abstract: The black-box nature of the deep networks makes the explanation for "why" they make certain predictions extremely challenging. Saliency maps are one of the most widely-used local explanation tools to alleviate this problem. One of the primary approaches for generating saliency maps is by optimizing a mask over the input dimensions so that the output of the network is influenced the most by the masking. However, prior work only studies such influence by removing evidence from the input. In this paper, we present iGOS++, a framework to generate saliency maps that are optimized for altering the output of the black-box system by either removing or preserving only a small fraction of the input. Additionally, we propose to add a bilateral total variation term to the optimization that improves the continuity of the saliency map especially under high resolution and with thin object parts. The evaluation results from comparing iGOS++ against state-of-the-art saliency map methods show significant improvement in locating salient regions that are directly interpretable by humans. We utilized iGOS++ in the task of classifying COVID-19 cases from x-ray images and discovered that sometimes the CNN network is overfitted to the characters printed on the x-ray images when performing classification. Fixing this issue by data cleansing significantly improved the precision and recall of the classifier.
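The continuity term the abstract mentions builds on total variation. The sketch below shows a plain (non-bilateral) TV penalty on a saliency mask; iGOS++'s bilateral version additionally weights the neighbor differences by image content, which is omitted here:

```python
import torch

def tv_penalty(mask):
    """Plain total-variation penalty on a (B, 1, H, W) saliency mask.

    Penalizing differences between neighboring mask values encourages
    spatially smooth, contiguous saliency regions; the bilateral variant
    in iGOS++ makes this penalty edge-aware.
    """
    dh = (mask[..., 1:, :] - mask[..., :-1, :]).abs().mean()
    dw = (mask[..., :, 1:] - mask[..., :, :-1]).abs().mean()
    return dh + dw
```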

6. Illumination Estimation Challenge: experience of past two years [PDF] Back to Contents
  Egor Ershov, Alex Savchik, Ilya Semenkov, Nikola Banić, Karlo Koscević, Marko Subašić, Alexander Belokopytov, Zhihao Li, Arseniy Terekhin, Daria Senshina, Artem Nikonorov, Yanlin Qian, Marco Buzzelli, Riccardo Riva, Simone Bianco, Raimondo Schettini, Sven Lončarić, Dmitry Nikolaev
Abstract: Illumination estimation is the essential step of computational color constancy, one of the core parts of various image processing pipelines of modern digital cameras. Having an accurate and reliable illumination estimation is important for reducing the illumination influence on the image colors. To motivate the generation of new ideas and the development of new algorithms in this field, the 2nd Illumination Estimation Challenge (IEC#2) was conducted. The main advantage of testing a method on a challenge over testing it on some of the known datasets is the fact that the ground-truth illuminations for the challenge test images are unknown up until the results have been submitted, which prevents any potential hyperparameter tuning that may be biased. The challenge had several tracks: general, indoor, and two-illuminant, with each of them focusing on different parameters of the scenes. Other main features of it are a new large dataset of images (about 5000) taken with the same camera sensor model, a manual markup accompanying each image, diverse content with scenes taken in numerous countries under a huge variety of illuminations extracted by using the SpyderCube calibration object, and a contest-like markup for the images from the Cube+ dataset that was used in IEC#1. This paper focuses on the description of the past two challenges, algorithms which won in each track, and the conclusions that were drawn based on the results obtained during the 1st and 2nd challenge that can be useful for similar future developments.

7. SelectScale: Mining More Patterns from Images via Selective and Soft Dropout [PDF] Back to Contents
  Zhengsu Chen, Jianwei Niu, Xuefeng Liu, Shaojie Tang
Abstract: Convolutional neural networks (CNNs) have achieved remarkable success in image recognition. Although the internal patterns of the input images are effectively learned by the CNNs, these patterns only constitute a small proportion of useful patterns contained in the input images. This can be attributed to the fact that the CNNs will stop learning if the learned patterns are enough to make a correct classification. Network regularization methods like dropout and SpatialDropout can ease this problem. During training, they randomly drop the features. These dropout methods, in essence, change the patterns learned by the networks, and in turn, forces the networks to learn other patterns to make the correct classification. However, the above methods have an important drawback. Randomly dropping features is generally inefficient and can introduce unnecessary noise. To tackle this problem, we propose SelectScale. Instead of randomly dropping units, SelectScale selects the important features in networks and adjusts them during training. Using SelectScale, we improve the performance of CNNs on CIFAR and ImageNet.

8. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection [PDF] Back to Contents
  Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, Houqiang Li
Abstract: Recent advances on 3D object detection heavily rely on how the 3D data are represented, i.e., voxel-based or point-based representation. Many existing high performance 3D detectors are point-based because this structure can better retain precise point positions. Nevertheless, point-level features lead to high computation overheads due to unordered storage. In contrast, the voxel-based structure is better suited for feature extraction but often yields lower accuracy because the input data are divided into grids. In this paper, we take a slightly different viewpoint -- we find that precise positioning of raw points is not essential for high performance 3D object detection and that the coarse voxel granularity can also offer sufficient detection accuracy. Bearing this view in mind, we devise a simple but effective voxel-based framework, named Voxel R-CNN. By taking full advantage of voxel features in a two stage approach, our method achieves comparable detection accuracy with state-of-the-art point-based models, but at a fraction of the computation cost. Voxel R-CNN consists of a 3D backbone network, a 2D bird-eye-view (BEV) Region Proposal Network and a detect head. A voxel RoI pooling is devised to extract RoI features directly from voxel features for further refinement. Extensive experiments are conducted on the widely used KITTI Dataset and the more recent Waymo Open Dataset. Our results show that compared to existing voxel-based methods, Voxel R-CNN delivers a higher detection accuracy while maintaining a real-time frame processing rate, i.e., a speed of 25 FPS on an NVIDIA RTX 2080 Ti GPU. The code will be made available soon.
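The point-based vs. voxel-based distinction comes down to how raw LiDAR points are discretized. A hypothetical voxelization sketch follows; the grid bounds and voxel sizes are placeholders, not the paper's settings, and real detectors additionally aggregate per-voxel features:

```python
import numpy as np

def voxelize(points, voxel_size=(0.1, 0.1, 0.2),
             lower=(0.0, -40.0, -3.0), upper=(70.4, 40.0, 1.0)):
    """Map (x, y, z) points to integer voxel indices on a regular grid.

    Illustrative only: bounds/sizes are assumptions, and points per voxel
    are usually capped and pooled into a fixed-size feature.
    """
    pts = np.asarray(points, dtype=np.float32)
    lo, hi = np.asarray(lower), np.asarray(upper)
    keep = np.all((pts >= lo) & (pts < hi), axis=1)  # drop out-of-range points
    idx = ((pts[keep] - lo) / np.asarray(voxel_size)).astype(np.int64)
    return idx, pts[keep]
```

Grouping points by voxel index trades exact point positions for a regular grid on which fast (sparse) 3D convolution becomes possible, which is exactly the trade-off the abstract argues is acceptable.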

9. NeuralMagicEye: Learning to See and Understand the Scene Behind an Autostereogram [PDF] Back to Contents
  Zhengxia Zou, Tianyang Shi, Yi Yuan, Zhenwei Shi
Abstract: An autostereogram, a.k.a. magic eye image, is a single-image stereogram that can create visual illusions of 3D scenes from 2D textures. This paper studies an interesting question: whether a deep CNN can be trained to recover the depth behind an autostereogram and understand its content. The key to the autostereogram magic lies in the stereopsis - to solve such a problem, a model has to learn to discover and estimate disparity from the quasi-periodic textures. We show that deep CNNs embedded with disparity convolution, a novel convolutional layer proposed in this paper that simulates stereopsis and encodes disparity, can nicely solve such a problem after being sufficiently trained on a large 3D object dataset in a self-supervised fashion. We refer to our method as "NeuralMagicEye". Experiments show that our method can accurately recover the depth behind autostereograms with rich details and gradient smoothness. Experiments also show the completely different working mechanisms for autostereogram perception between neural networks and human eyes. We hope this research can help people with visual impairments and those who have trouble viewing autostereograms. Our code is available at this https URL.

10. CNN-based Single Image Crowd Counting: Network Design, Loss Function and Supervisory Signal [PDF] Back to Contents
  Haoyue Bai, S.-H. Gary Chan
Abstract: Single image crowd counting is a challenging computer vision problem with wide applications in public safety, city planning, traffic management, etc. This survey provides a comprehensive summary of recent advanced crowd counting techniques based on Convolutional Neural Network (CNN) via density map estimation. Our goals are to provide an up-to-date review of recent approaches, and to educate new researchers in this field on the design principles and trade-offs. After presenting publicly available datasets and evaluation metrics, we review the recent advances with detailed comparisons on three major design modules for crowd counting: deep neural network designs, loss functions, and supervisory signals. We conclude the survey with some future directions.
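Density map estimation, the supervisory signal this survey centers on, typically turns point annotations of head positions into a smooth map whose integral equals the crowd count. A minimal sketch with a fixed Gaussian kernel (many works instead use adaptive, geometry-dependent kernels):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(head_points, shape, sigma=4.0):
    """Ground-truth density map: one unit impulse per annotated head,
    blurred with a Gaussian so the map sums (approximately) to the count."""
    dm = np.zeros(shape, dtype=np.float32)
    for x, y in head_points:
        dm[int(y), int(x)] += 1.0
    return gaussian_filter(dm, sigma=sigma)

dm = make_density_map([(30, 40), (32, 41), (100, 12)], shape=(128, 128))
print(dm.sum())  # ~3.0, up to boundary effects
```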

11. Unsupervised Monocular Depth Reconstruction of Non-Rigid Scenes [PDF] Back to Contents
  Ayça Takmaz, Danda Pani Paudel, Thomas Probst, Ajad Chhatkuli, Martin R. Oswald, Luc Van Gool
Abstract: Monocular depth reconstruction of complex and dynamic scenes is a highly challenging problem. While for rigid scenes learning-based methods have been offering promising results even in unsupervised cases, there exists little to no literature addressing the same for dynamic and deformable scenes. In this work, we present an unsupervised monocular framework for dense depth estimation of dynamic scenes, which jointly reconstructs rigid and non-rigid parts without explicitly modelling the camera motion. Using dense correspondences, we derive a training objective that aims to opportunistically preserve pairwise distances between reconstructed 3D points. In this process, the dense depth map is learned implicitly using the as-rigid-as-possible hypothesis. Our method provides promising results, demonstrating its capability of reconstructing 3D from challenging videos of non-rigid scenes. Furthermore, the proposed method also provides unsupervised motion segmentation results as an auxiliary output.

12. CorrNet3D: Unsupervised End-to-end Learning of Dense Correspondence for 3D Point Clouds [PDF] Back to Contents
  Yiming Zeng, Yue Qian, Zhiyu Zhu, Junhui Hou, Hui Yuan, Ying He
Abstract: This paper addresses the problem of computing dense correspondence between 3D shapes in the form of point clouds, which is a challenging and fundamental problem in computer vision and digital geometry processing. Conventional approaches often solve the problem in a supervised manner, requiring massive annotated data, which is difficult and/or expensive to obtain. Motivated by the intuition that one can transform two aligned point clouds to each other more easily and meaningfully than a misaligned pair, we propose CorrNet3D -- the first unsupervised and end-to-end deep learning-based framework -- to drive the learning of dense correspondence by means of deformation-like reconstruction to overcome the need for annotated data. Specifically, CorrNet3D consists of a deep feature embedding module and two novel modules called correspondence indicator and symmetric deformation. Feeding a pair of raw point clouds, our model first learns the pointwise features and passes them into the indicator to generate a learnable correspondence matrix used to permute the input pair. The symmetric deformer, with an additional regularized loss, transforms the two permuted point clouds to each other to drive the unsupervised learning of the correspondence. The extensive experiments on both synthetic and real-world datasets of rigid and non-rigid 3D shapes show our CorrNet3D outperforms state-of-the-art methods to a large extent, including those taking meshes as input. CorrNet3D is a flexible framework in that it can be easily adapted to supervised learning if annotated data are available.
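The "learnable correspondence matrix used to permute the input pair" can be realized as a row-softmax over pointwise feature similarities. A generic sketch; CorrNet3D's correspondence indicator differs in detail, and the temperature tau here is an assumed hyperparameter:

```python
import torch

def soft_correspondence(feat_a, feat_b, tau=0.07):
    """Soft correspondence between two point clouds from pointwise features.

    feat_a, feat_b: (B, N, D) per-point embeddings. Returns a (B, N, N)
    row-stochastic matrix; a sharper tau pushes it toward a permutation.
    """
    sim = torch.bmm(feat_a, feat_b.transpose(1, 2))  # pairwise similarity
    return torch.softmax(sim / tau, dim=-1)
```

Applying such a matrix to one cloud approximately permutes it into the other's ordering, which is what lets a deformation-like reconstruction loss drive correspondence learning without annotated matches.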

13. A Deep Retinal Image Quality Assessment Network with Salient Structure Priors [PDF] Back to Contents
  Ziwen Xu, Beiji Zou, Qing Liu
Abstract: Retinal image quality assessment is an essential prerequisite for the diagnosis of retinal diseases. Its goal is to identify retinal images in which the anatomic structures and lesions that most attract ophthalmologists' attention are exhibited clearly and definitely, while rejecting poor quality fundus images. Motivated by this, we mimic the way that ophthalmologists assess the quality of retinal images and propose a method termed SalStructIQA. First, we propose two salient structures for automated retinal quality assessment. One is the large-size salient structures, including the optic disc region and large-size exudates. The other is the tiny-size salient structures, which mainly include vessels. Then we incorporate the proposed two salient structure priors into a deep convolutional neural network (CNN) to shift the focus of the CNN to salient structures. Accordingly, we develop two CNN architectures: Dual-branch SalStructIQA and Single-branch SalStructIQA. Dual-branch SalStructIQA contains two CNN branches, one guided by large-size salient structures and the other guided by tiny-size salient structures. Single-branch SalStructIQA contains one CNN branch, which is guided by the concatenation of salient structures in both large-size and tiny-size. Experimental results on the Eye-Quality dataset show that our proposed Dual-branch SalStructIQA outperforms the state-of-the-art methods for retinal image quality assessment, and Single-branch SalStructIQA is much more lightweight compared with state-of-the-art deep retinal image quality assessment methods while still achieving competitive performance.

14. Patch-wise++ Perturbation for Adversarial Targeted Attacks [PDF] Back to Contents
  Lianli Gao, Qilong Zhang, Jingkuan Song, Heng Tao Shen
Abstract: Although great progress has been made on adversarial attacks for deep neural networks (DNNs), their transferability is still unsatisfactory, especially for targeted attacks. There are two problems behind this that have long been overlooked: 1) the conventional setting of $T$ iterations with the step size of $\epsilon/T$ to comply with the $\epsilon$-constraint. In this case, most of the pixels are allowed to add very small noise, much less than $\epsilon$; and 2) usually manipulating pixel-wise noise. However, features of a pixel extracted by DNNs are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. To tackle these issues, we propose a patch-wise iterative method (PIM) aimed at crafting adversarial examples with high transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a project kernel. But targeted attacks aim to push the adversarial examples into the territory of a specific class, and the amplification factor may lead to underfitting. Thus, we introduce the temperature and propose a patch-wise++ iterative method (PIM++) to further improve transferability without significantly sacrificing the performance of the white-box attack. Our method can be generally integrated into any gradient-based attack method. Compared with the current state-of-the-art attack methods, we significantly improve the success rate by 35.9% for defense models and 32.7% for normally trained models on average.
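To make the step-size argument concrete, here is a sketch of a targeted iterative FGSM attack with an amplified step size beta * eps / T, clipping the accumulated perturbation back into the eps-ball each step. PIM/PIM++ additionally redistribute the clipped overflow to neighboring pixels via a project kernel (and PIM++ adds the temperature); both parts are omitted here:

```python
import torch
import torch.nn.functional as F

def amplified_targeted_ifgsm(model, x, y_target, eps=16/255, T=10, beta=5.0):
    """Targeted I-FGSM with an amplified step size (beta * eps / T).

    Sketch only: the project-kernel overflow reassignment and the
    temperature of PIM/PIM++ are not implemented.
    """
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)
        grad, = torch.autograd.grad(loss, x_adv)
        # Step *against* the gradient to move toward the target class.
        x_adv = x_adv.detach() - beta * (eps / T) * grad.sign()
        x_adv = x.detach() + (x_adv - x.detach()).clamp(-eps, eps)  # eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv
```

With beta = 1 this reduces to the conventional eps/T schedule the abstract criticizes; beta > 1 lets individual pixels move further, at the cost of the overflow that PIM's project kernel then reassigns.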

15. Incremental Embedding Learning via Zero-Shot Translation [PDF] Back to Contents
  Kun Wei, Cheng Deng, Xu Yang, Maosen Li
Abstract: Modern deep learning methods have achieved great success in machine learning and computer vision fields by learning a set of pre-defined datasets. However, these methods perform unsatisfactorily when applied to real-world situations. The reason for this phenomenon is that learning new tasks leads the trained model to quickly forget the knowledge of old tasks, which is referred to as catastrophic forgetting. Current state-of-the-art incremental learning methods tackle the catastrophic forgetting problem in traditional classification networks and ignore the problem existing in embedding networks, which are the basic networks for image retrieval, face recognition, zero-shot learning, etc. Different from traditional incremental classification networks, the semantic gap between the embedding spaces of two adjacent tasks is the main challenge for embedding networks under the incremental learning setting. Thus, we propose a novel class-incremental method for embedding networks, named the zero-shot translation class-incremental method (ZSTCI), which leverages zero-shot translation to estimate and compensate the semantic gap without any exemplars. Then, we try to learn a unified representation for two adjacent tasks in the sequential learning process, which captures the relationships of previous classes and current classes precisely. In addition, ZSTCI can easily be combined with existing regularization-based incremental learning methods to further improve the performance of embedding networks. We conduct extensive experiments on CUB-200-2011 and CIFAR100, and the experiment results prove the effectiveness of our method. The code of our method has been released.

16. Audio-Visual Floorplan Reconstruction [PDF] Back to Contents
  Senthil Purushwalkam, Sebastian Vicenc Amengual Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Gupta, Kristen Grauman
Abstract: Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense geometry outside the camera's field of view, but it also reveals the existence of distant freespace (e.g., a dog barking in another room) and suggests the presence of rooms not visible to the camera (e.g., a dishwasher humming in what must be the kitchen to the left). We introduce AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence. We train our model to predict both the interior structure of the environment and the associated rooms' semantic labels. Our results on 85 large real-world environments show the impact: with just a few glimpses spanning 26% of an area, we can estimate the whole area with 66% accuracy -- substantially better than the state of the art approach for extrapolating visual maps.

17. Learned Multi-Resolution Variable-Rate Image Compression with Octave-based Residual Blocks [PDF] Back to Contents
  Mohammad Akbari, Jie Liang, Jingning Han, Chengjie Tu
Abstract: Recently deep learning-based image compression has shown the potential to outperform traditional codecs. However, most existing methods train multiple networks for multiple bit rates, which increase the implementation complexity. In this paper, we propose a new variable-rate image compression framework, which employs generalized octave convolutions (GoConv) and generalized octave transposed-convolutions (GoTConv) with built-in generalized divisive normalization (GDN) and inverse GDN (IGDN) layers. Novel GoConv- and GoTConv-based residual blocks are also developed in the encoder and decoder networks. Our scheme also uses a stochastic rounding-based scalar quantization. To further improve the performance, we encode the residual between the input and the reconstructed image from the decoder network as an enhancement layer. To enable a single model to operate with different bit rates and to learn multi-rate image features, a new objective function is introduced. Experimental results show that the proposed framework trained with variable-rate objective function outperforms the standard codecs such as H.265/HEVC-based BPG and state-of-the-art learning-based variable-rate methods.

18. TransTrack: Multiple-Object Tracking with Transformer [PDF] Back to Contents
  Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, Ping Luo
Abstract: Multiple-object tracking (MOT) is mostly dominated by complex and multi-step tracking-by-detection algorithms, which perform object detection, feature extraction and temporal association separately. The query-key mechanism in single-object tracking (SOT), which tracks the object of the current frame by the object feature of the previous frame, has great potential to set up a simple joint-detection-and-tracking MOT paradigm. Nonetheless, the query-key method is seldom studied due to its inability to detect new-coming objects. In this work, we propose TransTrack, a baseline for MOT with Transformer. It takes advantage of the query-key mechanism and introduces a set of learned object queries into the pipeline to enable detecting new-coming objects. TransTrack has three main advantages: (1) It is an online joint-detection-and-tracking pipeline based on the query-key mechanism. Complex and multi-step components in the previous methods are simplified. (2) It is a brand new architecture based on Transformer. The learned object query detects objects in the current frame. The object feature query from the previous frame associates those current objects with the previous ones. (3) For the first time, we demonstrate that a much simpler and more effective method based on the query-key mechanism and the Transformer architecture can achieve a competitive 65.8% MOTA on the MOT17 challenge dataset. We hope TransTrack can provide a new perspective for multiple-object tracking. The code is available at: this https URL.

19. SID: Incremental Learning for Anchor-Free Object Detection via Selective and Inter-Related Distillation [PDF] Back to Contents
  Can Peng, Kun Zhao, Sam Maksoud, Meng Li, Brian C. Lovell
Abstract: Incremental learning requires a model to continually learn new tasks from streaming data. However, traditional fine-tuning of a well-trained deep neural network on a new task will dramatically degrade performance on the old task - a problem known as catastrophic forgetting. In this paper, we address this issue in the context of anchor-free object detection, which is a new trend in computer vision as it is simple, fast, and flexible. Simply adapting current incremental learning strategies fails on these anchor-free detectors due to lack of consideration of their specific model structures. To deal with the challenges of incremental learning on anchor-free object detectors, we propose a novel incremental learning paradigm called Selective and Inter-related Distillation (SID). In addition, a novel evaluation metric is proposed to better assess the performance of detectors under incremental learning conditions. By selective distilling at the proper locations and further transferring additional instance relation knowledge, our method demonstrates significant advantages on the benchmark datasets PASCAL VOC and COCO.

20. SharpGAN: Receptive Field Block Net for Dynamic Scene Deblurring [PDF] Back to Contents
  Hui Feng, Jundong Guo, Sam Shuzhi Ge
Abstract: When sailing at sea, a smart ship will inevitably produce swaying motion due to the action of wind, waves and currents, which makes the images collected by the visual sensor appear motion-blurred. This has an adverse effect on object detection algorithms based on the vision sensor, thereby affecting the navigation safety of the smart ship. In order to remove the motion blur in images captured during the navigation of the smart ship, we propose SharpGAN, a new image deblurring method based on the generative adversarial network. First of all, the Receptive Field Block Net (RFBNet) is introduced to the deblurring network to strengthen the network's ability to extract the features of blurred images. Secondly, we propose a feature loss that combines different levels of image features to guide the network to perform higher-quality deblurring and improve the feature similarity between the restored images and the sharp images. Finally, we propose to use the lightweight RFB-s module to improve the real-time performance of the deblurring network. Compared with existing deblurring methods on large-scale real sea image datasets and large-scale deblurring datasets, the proposed method not only has better deblurring performance in visual perception and quantitative criteria, but also has higher deblurring efficiency.

21. Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions [PDF] Back to Contents
  Yuhang Wu, Sunpreet S. Arora, Yanhong Wu, Hao Yang
Abstract: Adversarial examples are input examples that are specifically crafted to deceive machine learning classifiers. State-of-the-art adversarial example detection methods characterize an input example as adversarial either by quantifying the magnitude of feature variations under multiple perturbations or by measuring its distance from estimated benign example distribution. Instead of using such metrics, the proposed method is based on the observation that the directions of adversarial gradients when crafting (new) adversarial examples play a key role in characterizing the adversarial space. Compared to detection methods that use multiple perturbations, the proposed method is efficient as it only applies a single random perturbation on the input example. Experiments conducted on two different databases, CIFAR-10 and ImageNet, show that the proposed detection method achieves, respectively, 97.9% and 98.6% AUC-ROC (on average) on five different adversarial attacks, and outperforms multiple state-of-the-art detection methods. Results demonstrate the effectiveness of using adversarial gradient directions for adversarial example detection.
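One way to read the key idea: compare the direction of the loss gradient at the input with the direction at a single randomly perturbed copy, since these directions behave differently for adversarial inputs. The sketch below computes such a cosine-similarity statistic; it is a plausible simplification under that assumption, not the paper's exact detector:

```python
import torch
import torch.nn.functional as F

def grad_direction(model, x, labels):
    """Unit direction of the classification loss gradient w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    grad, = torch.autograd.grad(loss, x)
    return F.normalize(grad.flatten(1), dim=1)

def direction_agreement(model, x, eps=2/255):
    """Cosine similarity of gradient directions before/after one random
    perturbation; a detector would feed such statistics to a classifier."""
    labels = model(x).argmax(dim=1)
    x_pert = (x + eps * torch.randn_like(x).sign()).clamp(0.0, 1.0)
    return (grad_direction(model, x, labels) *
            grad_direction(model, x_pert, labels)).sum(dim=1)
```

Note the efficiency point from the abstract: only one random perturbation per input is needed, rather than a sweep over many perturbations.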

22. 3D Human motion anticipation and classification [PDF] Back to Contents
  Emad Barsoum, John Kender, Zicheng Liu
Abstract: Human motion prediction and understanding is a challenging problem, due to the complex dynamics of human motion and the non-deterministic aspect of future prediction. We propose a novel sequence-to-sequence model for human motion prediction and feature learning, trained with a modified version of a generative adversarial network, with a custom loss function that takes inspiration from human motion animation and can control the variation between multiple predicted motions from the same input poses. Our model learns to predict multiple future sequences of human poses from the same input sequence. We show that the discriminator learns a general representation of human motion by using the learned feature in an action recognition task. Furthermore, to quantify the quality of the non-deterministic predictions, we simultaneously train a motion-quality-assessment network that learns the probability that a given sequence of poses is a real human motion or not. We test our model on two of the largest human pose datasets: NTURGB-D and Human3.6M. We train on both single and multiple action types. Its predictive power for motion estimation is demonstrated by generating multiple plausible futures from the same input and showing the effect of each of the loss functions. Furthermore, we show that it takes less than half the number of epochs to train an activity recognition network by using the feature learned from the discriminator.

23. Provident Vehicle Detection at Night: The PVDN Dataset [PDF] Back to Contents
  Lars Ohnemus, Lukas Ewecker, Ebubekir Asan, Stefan Roos, Simon Isele, Jakob Ketterer, Leopold Müller, Sascha Saralajew
Abstract: For advanced driver assistance systems, it is crucial to have information about oncoming vehicles as early as possible. At night, this task is especially difficult due to poor lighting conditions. For that, during nighttime, every vehicle uses headlamps to improve sight and therefore ensure safe driving. As humans, we intuitively assume the presence of oncoming vehicles, before they are actually physically visible, by detecting the light reflections caused by their headlamps. In this paper, we present a novel dataset containing 54659 annotated grayscale images out of 349 different scenes in a rural environment at night. In these images, all oncoming vehicles, their corresponding light objects (e.g., headlamps), and their respective light reflections (e.g., light reflections on guardrails) are labeled. This is accompanied by an in-depth analysis of the dataset characteristics. With that, we are providing the first open-source dataset with comprehensive ground truth data to enable research into new methods of detecting oncoming vehicles based on the light reflections they cause, long before they are directly visible. We consider this an essential step to further close the performance gap between current advanced driver assistance systems and human behavior.

24. OSTeC: One-Shot Texture Completion [PDF] Back to Contents
  Baris Gecer, Jiankang Deng, Stefanos Zafeiriou
Abstract: The last few years have witnessed the great success of non-linear generative models in synthesizing high-quality photorealistic face images. Many recent 3D facial texture reconstruction and pose manipulation from a single image approaches still rely on large and clean face datasets to train image-to-image Generative Adversarial Networks (GANs). Yet the collection of such a large scale high-resolution 3D texture dataset is still very costly and difficult to maintain age/ethnicity balance. Moreover, regression-based approaches suffer from generalization to the in-the-wild conditions and are unable to fine-tune to a target-image. In this work, we propose an unsupervised approach for one-shot 3D facial texture completion that does not require large-scale texture datasets, but rather harnesses the knowledge stored in 2D face generators. The proposed approach rotates an input image in 3D and fill-in the unseen regions by reconstructing the rotated image in a 2D face generator, based on the visible parts. Finally, we stitch the most visible textures at different angles in the UV image-plane. Further, we frontalize the target image by projecting the completed texture into the generator. The qualitative and quantitative experiments demonstrate that the completed UV textures and frontalized images are of high quality, resembles the original identity, can be used to train a texture GAN model for 3DMM fitting and improve pose-invariant face recognition.

25. Knowledge Distillation with Adaptive Asymmetric Label Sharpening for Semi-supervised Fracture Detection in Chest X-rays [PDF] Back to Contents
  Yirui Wang, Kang Zheng, Chi-Tung Chang, Xiao-Yun Zhou, Zhilin Zheng, Lingyun Huang, Jing Xiao, Le Lu, Chien-Hung Liao, Shun Miao
Abstract: Exploiting available medical records to train high performance computer-aided diagnosis (CAD) models via the semi-supervised learning (SSL) setting is emerging to tackle the prohibitively high labor costs involved in large-scale medical image annotations. Despite the extensive attention received on SSL, previous methods failed to 1) account for the low disease prevalence in medical records and 2) utilize the image-level diagnosis indicated from the medical records. Both issues are unique to SSL for CAD models. In this work, we propose a new knowledge distillation method that effectively exploits large-scale image-level labels extracted from the medical records, augmented with limited expert annotated region-level labels, to train a rib and clavicle fracture CAD model for chest X-ray (CXR). Our method leverages the teacher-student model paradigm and features a novel adaptive asymmetric label sharpening (AALS) algorithm to address the label imbalance problem that specially exists in the medical domain. Our approach is extensively evaluated on all CXRs (N = 65,845) from the trauma registry of an anonymous hospital over a period of 9 years (2008-2016), on the most common rib and clavicle fractures. The experiment results demonstrate that our method achieves state-of-the-art fracture detection performance, i.e., an area under the receiver operating characteristic curve (AUROC) of 0.9318 and a free-response receiver operating characteristic (FROC) score of 0.8914 on rib fractures, significantly outperforming previous approaches by an AUROC gap of 1.63% and an FROC improvement of 3.74%. Consistent performance gains are also observed for clavicle fracture detection.
摘要:通过半监督学习(SSL)利用现有病历来训练高性能计算机辅助诊断(CAD)模型,正在成为解决大规模医学图像标注中过高人工成本的一种途径。尽管SSL受到了广泛关注,但以前的方法未能做到:1)考虑病历中疾病患病率低的问题;2)利用病历中指示的图像级诊断。这两个问题都是CAD模型的SSL所特有的。在这项工作中,我们提出了一种新的知识蒸馏方法,该方法有效利用从病历中提取的大规模图像级标签,并辅以有限的专家标注区域级标签,来训练胸部X光片(CXR)的肋骨和锁骨骨折CAD模型。我们的方法利用师生模型范式,并采用一种新颖的自适应非对称标签锐化(AALS)算法来解决医疗领域特有的标签不平衡问题。我们在某匿名医院创伤登记处9年间(2008-2016年)的全部CXR(N = 65,845)上,针对最常见的肋骨和锁骨骨折对该方法进行了广泛评估。实验结果表明,我们的方法实现了最先进的骨折检测性能,即在肋骨骨折上,受试者工作特征曲线下面积(AUROC)为0.9318,自由响应受试者工作特征(FROC)得分为0.8914,以1.63%的AUROC差距和3.74%的FROC提升明显优于以前的方法。锁骨骨折检测上也观察到一致的性能提升。
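
A minimal sketch of the asymmetric label sharpening idea described above. This is illustrative only: the paper's exact AALS formulation is not reproduced here, and the threshold `t_pos` and temperature are hypothetical parameters.

```python
import torch

def asymmetric_label_sharpen(p_teacher: torch.Tensor,
                             t_pos: float = 0.4,
                             temperature: float = 0.5) -> torch.Tensor:
    """Sharpen teacher probabilities asymmetrically (illustrative).

    Because disease prevalence is low, confident positive responses are
    pushed up towards 1 (temperature < 1 raises scores in (0, 1)), while
    sub-threshold responses are left untouched, so positives are not
    drowned out by the overwhelming majority of negatives.
    """
    sharpened = p_teacher.pow(temperature)
    return torch.where(p_teacher > t_pos, sharpened, p_teacher)

# soft targets for the student on unlabeled CXRs
p = torch.tensor([0.05, 0.45, 0.80])
print(asymmetric_label_sharpen(p))  # tensor([0.0500, 0.6708, 0.8944])
```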

26. Active Annotation of Informative Overlapping Frames in Video Mosaicking Applications [PDF] 返回目录
  Loic Peter, Marcel Tella-Amo, Dzhoshkun Ismail Shakir, Jan Deprest, Sebastien Ourselin, Juan Eugenio Iglesias, Tom Vercauteren
Abstract: Video mosaicking requires the registration of overlapping frames located at distant timepoints in the sequence to ensure global consistency of the reconstructed scene. However, fully automated registration of such long-range pairs is (i) challenging when the registration of images itself is difficult; and (ii) computationally expensive for long sequences due to the large number of candidate pairs for registration. In this paper, we introduce an efficient framework for the active annotation of long-range pairwise correspondences in a sequence. Our framework suggests pairs of images that are sought to be informative to an oracle agent (e.g., a human user, or a reliable matching algorithm) who provides visual correspondences on each suggested pair. Informative pairs are retrieved according to an iterative strategy based on a principled annotation reward coupled with two complementary and online adaptable models of frame overlap. In addition to the efficient construction of a mosaic, our framework provides, as a by-product, ground truth landmark correspondences which can be used for evaluation or learning purposes. We evaluate our approach in both automated and interactive scenarios via experiments on synthetic sequences, on a publicly available dataset for aerial imaging and on a clinical dataset for placenta mosaicking during fetal surgery.
摘要:视频拼接要求对序列中位于远处时间点的重叠帧进行配准,以确保重建场景的全局一致性。但是,这样的远程对的全自动配准是(i)当图像本身的配准困难时,具有挑战性; (ii)由于要注册的候选对数量众多,因此长序列在计算上昂贵。在本文中,我们为序列中的远程成对对应关系的有效注释引入了一种有效的框架。我们的框架提出了一些图像对,这些图像对Oracle代理(例如人类用户或可靠的匹配算法)提供了信息,这些代理在每个建议对上提供了视觉对应。信息对是根据迭代策略检索的,该策略基于有原则的注释奖励以及两个互补且在线自适应的帧重叠模型,基于迭代注释。除了有效地构建镶嵌图之外,我们的框架还提供了可用于评估或学习目的的地面真实地标对应物作为副产品。我们通过在合成序列上进行的实验,在航空影像学上可公开获得的数据集以及在胎儿手术期间用于胎盘镶嵌的临床数据集上的实验来评估我们在自动化和互动场景中的方法。

27. Temporally-Transferable Perturbations: Efficient, One-Shot Adversarial Attacks for Online Visual Object Trackers [PDF] 返回目录
  Krishna Kanth Nakka, Mathieu Salzmann
Abstract: In recent years, trackers based on Siamese networks have emerged as highly effective and efficient for visual object tracking (VOT). While these methods were shown to be vulnerable to adversarial attacks, like most deep networks for visual recognition tasks, the existing attacks on VOT trackers all require perturbing the search region of every input frame to be effective, which comes at a non-negligible cost, considering that VOT is a real-time task. In this paper, we propose a framework to generate a single temporally transferable adversarial perturbation from the object template image only. This perturbation can then be added to every search image, which comes at virtually no cost, and still successfully fools the tracker. Our experiments show that our approach outperforms the state-of-the-art attacks on the standard VOT benchmarks in the untargeted scenario. Furthermore, we show that our formalism naturally extends to targeted attacks that force the tracker to follow any given trajectory by precomputing diverse directional perturbations.
摘要:近年来,基于暹罗网络的跟踪器已经成为视觉对象跟踪(VOT)的高效工具。尽管这些方法容易受到对抗攻击,但作为大多数用于视觉识别任务的深层网络,现有的针对VOT跟踪器的攻击都需要扰动每个输入帧的搜索区域才能有效,而这是不可忽略的代价,考虑到VOT是实时任务。在本文中,我们提出了一个仅从对象模板图像生成单个可在时间上转移的对抗性扰动的框架。然后可以将此干扰添加到每个搜索图像中,这几乎是免费的,并且仍然成功地使跟踪器蒙骗。我们的实验证明,在无目标的情况下,我们的方法优于对标准VOT基准的最新攻击。此外,我们表明,形式主义自然可以扩展到针对性攻击,这些攻击通过预先计算各种方向性扰动来迫使跟踪器遵循任何给定的轨迹。
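
A schematic of how such a single temporally-transferable perturbation would be applied at tracking time, assuming it has already been precomputed from the object template before tracking starts; `delta` and `track_step` are placeholders, not names from the paper.

```python
import torch

def attack_tracking_sequence(frames, delta, track_step, eps=8 / 255):
    """Apply one precomputed perturbation to every search image.

    frames:     iterable of (C, H, W) search-region tensors in [0, 1]
    delta:      perturbation of the same shape, generated once from the
                object template image
    track_step: callable running one tracker update on a frame
    """
    delta = delta.clamp(-eps, eps)            # keep the attack imperceptible
    outputs = []
    for x in frames:
        x_adv = (x + delta).clamp(0.0, 1.0)   # one addition per frame: virtually free
        outputs.append(track_step(x_adv))
    return outputs
```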

28. Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation [PDF] 返回目录
  Zhengxiong Luo, Zhicheng Wang, Yan Huang, Tieniu Tan, Erjin Zhou
Abstract: Heatmap regression has become the most prevalent choice in today's human pose estimation methods. The ground-truth heatmaps are usually constructed by covering all skeletal keypoints with 2D Gaussian kernels, whose standard deviations are fixed. However, for bottom-up methods, which need to handle a large variance of human scales and labeling ambiguities, this practice seems unreasonable. To better cope with these problems, we propose the scale-adaptive heatmap regression (SAHR) method, which can adaptively adjust the standard deviation for each keypoint. In this way, SAHR is more tolerant of various human scales and labeling ambiguities. However, SAHR may aggravate the imbalance between foreground and background samples, which can limit its improvement. Thus, we further introduce weight-adaptive heatmap regression (WAHR) to help balance the foreground and background samples. Extensive experiments show that SAHR together with WAHR largely improves the accuracy of bottom-up human pose estimation. As a result, we outperform the previous state-of-the-art model by +1.5 AP and achieve 72.0 AP on COCO test-dev2017, which is comparable with the performance of most top-down methods.
摘要:热图回归已成为当今人体姿态估计方法中最普遍的选择。真值热图通常是通过用2D高斯核覆盖所有骨骼关键点来构造的,这些核的标准差是固定的。但是,对于需要处理人体尺度变化大和标注歧义的自下而上方法,当前的做法似乎并不合理。为了更好地解决这些问题,我们提出了尺度自适应热图回归(SAHR)方法,该方法可以自适应地调整每个关键点的标准差。这样,SAHR更能容忍各种人体尺度和标注歧义。但是,SAHR可能会加剧前景-背景样本之间的不平衡,从而可能限制SAHR的提升。因此,我们进一步引入了权重自适应热图回归(WAHR),以帮助平衡前景-背景样本。大量实验表明,SAHR与WAHR结合可以大大提高自下而上人体姿态估计的准确性。结果,我们最终比之前最优模型高出+1.5 AP,在COCO test-dev2017上达到72.0 AP,与大多数自上而下方法的性能相当。
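
For reference, this is the standard Gaussian ground-truth heatmap with a per-keypoint standard deviation, which is exactly the quantity SAHR makes adaptive instead of fixed. A sketch only: the paper's parameterization of the per-keypoint scale factor `s_k` is not shown here.

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, center, sigma: float) -> np.ndarray:
    """Ground-truth heatmap for one keypoint; sigma is what SAHR adapts."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# fixed practice:          sigma = 2.0 for every keypoint
# scale-adaptive practice: sigma = base_sigma * s_k, with s_k per keypoint
hm = gaussian_heatmap(128, 128, center=(64, 40), sigma=2.0 * 1.3)
```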

29. MM-FSOD: Meta and metric integrated few-shot object detection [PDF] 返回目录
  Yuewen Li, Wenquan Feng, Shuchang Lyu, Qi Zhao, Xuliang Li
Abstract: In the object detection task, convolutional neural network (CNN) models always need a large number of annotated examples during training. To reduce the dependency on expensive annotations, few-shot object detection has become an increasing research focus. In this paper, we present an effective object detection framework (MM-FSOD) that integrates metric learning and meta-learning to tackle the few-shot object detection task. Our model is a class-agnostic detection model that can accurately recognize new categories that do not appear in the training samples. Specifically, to quickly learn the features of new categories without a fine-tuning process, we propose a meta-representation module (MR module) to learn intra-class mean prototypes. The MR module is trained with a meta-learning method to obtain the ability to reconstruct high-level features. To further measure the similarity between a support prototype and query RoI features, we propose a Pearson metric module (PR module), which serves as a classifier. Compared to the previously common cosine distance metric, the PR module enables the model to align features into a discriminative embedding space. We conduct extensive experiments on the benchmark datasets FSOD, MS COCO, and PASCAL VOC to demonstrate the feasibility and efficiency of our model. Compared with previous methods, MM-FSOD achieves state-of-the-art (SOTA) results.
摘要:在目标检测任务中,卷积神经网络(CNN)模型在训练过程中始终需要大量带标注的样本。为了减少对昂贵标注的依赖,少样本目标检测已成为越来越受关注的研究重点。在本文中,我们提出了一个有效的目标检测框架(MM-FSOD),该框架将度量学习和元学习相结合,以解决少样本目标检测任务。我们的模型是与类别无关的检测模型,可以准确地识别训练样本中未出现的新类别。具体来说,为了无需微调即可快速学习新类别的特征,我们提出了一种元表示模块(MR模块)来学习类内均值原型。MR模块通过元学习方法进行训练,以获得重建高层特征的能力。为了进一步度量支持原型与查询RoI特征之间的相似性,我们提出了用作分类器的Pearson度量模块(PR模块)。与以前常用的余弦距离度量相比,PR模块使模型能够将特征对齐到判别性嵌入空间中。我们在基准数据集FSOD、MS COCO和PASCAL VOC上进行了广泛的实验,以证明该模型的可行性和效率。与以前的方法相比,MM-FSOD获得了最先进(SOTA)的结果。
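
The Pearson metric contrasted with cosine distance above reduces to cosine similarity between mean-centered vectors, which removes the per-dimension offset that plain cosine similarity keeps. A minimal sketch (tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def pearson_similarity(query: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    """Pearson correlation = cosine similarity after mean-centering each vector."""
    q = query - query.mean(dim=-1, keepdim=True)
    p = prototype - prototype.mean(dim=-1, keepdim=True)
    return F.cosine_similarity(q, p, dim=-1)

roi_feat = torch.randn(8, 256)   # query RoI features
proto = torch.randn(256)         # intra-class mean prototype from the MR module
scores = pearson_similarity(roi_feat, proto.expand_as(roi_feat))  # (8,)
```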

30. DUT-LFSaliency: Versatile Dataset and Light Field-to-RGB Saliency Detection [PDF] 返回目录
  Yongri Piao, Zhengkun Rong, Shuang Xu, Miao Zhang, Huchuan Lu
Abstract: Light field data exhibit favorable characteristics conducive to saliency detection. The success of learning-based light field saliency detection depends heavily on how a comprehensive dataset can be constructed for higher generalizability of models, how high-dimensional light field data can be effectively exploited, and how a flexible model can be designed to achieve versatility across desktop computers and mobile devices. To answer these questions, we first introduce a large-scale dataset to enable versatile applications for RGB, RGB-D, and light field saliency detection, containing 102 classes and 4204 samples. Second, we present an asymmetrical two-stream model consisting of a Focal stream and an RGB stream. The Focal stream is designed to achieve higher performance on desktop computers and to transfer focusness knowledge to the RGB stream, relying on two tailor-made modules. The RGB stream guarantees flexibility and memory/computation efficiency on mobile devices through three distillation schemes. Experiments demonstrate that our Focal stream achieves state-of-the-art performance. The RGB stream achieves the Top-2 F-measure on DUTLF-V2 while reducing the model size by 83% and increasing FPS fivefold compared with the best-performing method. Furthermore, our proposed distillation schemes are applicable to RGB saliency models, achieving impressive performance gains while ensuring flexibility.
摘要:光场数据表现出有利于显着性检测的特征。基于学习的光场显着性检测的成功很大程度上取决于如何构建全面的数据集以提高模型的通用性,如何有效利用高维光场数据以及如何设计灵活的模型来实现通用性台式计算机和移动设备。为了回答这些问题,我们首先引入一个大规模数据集,以支持RGB,RGB-D和光场显着性检测的通用应用程序,其中包含102个类别和4204个样本。其次,我们提出了由Focal流和RGB流组成的不对称两流模型。 Focal流被设计为在台式计算机上实现更高的性能,并依靠两个定制模块将焦点知识转移到RGB流。 RGB流通过三种蒸馏方案保证了移动设备的灵活性和存储/计算效率。实验表明,我们的Focal流可实现最先进的性能。 RGB流在DUTLF-V2上达到了Top-2 F度量,与性能最佳的方法相比,该模型极大地减少了83%的模型尺寸,并使FPS提升了5倍。此外,我们提出的蒸馏方案适用于RGB显着性模型,在确保灵活性的同时获得了令人印象深刻的性能提升。

31. RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving [PDF] 返回目录
  Peixuan Li, Shun Su, Huaici Zhao
Abstract: Although recent image-based 3D object detection methods using Pseudo-LiDAR representations have shown great capabilities, a notable gap in efficiency and accuracy still exists compared with LiDAR-based methods. Besides, over-reliance on a stand-alone depth estimator, which requires a large number of pixel-wise annotations in the training stage and more computation in the inference stage, limits real-world scalability. In this paper, we propose an efficient and accurate 3D object detection method from stereo images, named RTS3D. Different from the 3D occupancy space in Pseudo-LiDAR-like methods, we design a novel 4D feature-consistent embedding (FCE) space as the intermediate representation of the 3D scene, without depth supervision. The FCE space encodes the object's structural and semantic information by exploring the multi-scale feature consistency warped from the stereo pair. Furthermore, a semantic-guided RBF (Radial Basis Function) and a structure-aware attention module are devised to reduce the influence of FCE space noise without instance mask supervision. Experiments on the KITTI benchmark show that RTS3D is the first true real-time system (FPS > 24) for stereo image 3D detection, while achieving a 10% improvement in average precision compared with the previous state-of-the-art method. The code will be available at this https URL
摘要:尽管最近使用伪LiDAR表示的基于图像的3D对象检测方法已显示出强大的功能,但与基于LiDAR的方法相比,在效率和准确性上仍存在明显差距。此外,过度依赖独立的深度估计器,在训练阶段需要大量的像素级注释,并且在推理阶段需要更多的计算,这限制了实际应用中的缩放比例。在本文中,我们提出了一种有效且准确的立体图像3D目标检测方法,称为RTS3D。与伪LiDAR类似方法中的3D占用空间不同,我们设计了一种新颖的4D特征一致嵌入(FCE)空间作为3D场景的中间表示,而无需深度监视。 FCE空间通过探索立体对扭曲的多尺度特征一致性来编码对象的结构和语义信息。此外,设计了语义引导的RBF(径向基函数)和结构感知注意模块,以减少FCE空间噪声的影响,而无需实例掩码监督。在KITTI基准上进行的实验表明,RTS3D是用于立体图像3D检测的第一个真正的实时系统(FPS $> $ 24),与以前的最新方法相比,其平均精度提高了$ 10%。该代码将在此https URL上可用

32. Bidirectional Mapping Coupled GAN for Generalized Zero-Shot Learning [PDF] 返回目录
  Tasfia Shermin, Shyh Wei Teng, Ferdous Sohel, Manzur Murshed, Guojun Lu
Abstract: Bidirectional mapping-based generative models have achieved remarkable performance for the generalized zero-shot learning (GZSL) recognition by learning to construct visual features from class semantics and reconstruct class semantics back from generated visual features. The performance of these models relies on the quality of synthesized features. This depends on the ability of the model to capture the underlying seen data distribution by relating semantic-visual spaces, learning discriminative information, and re-purposing the learned distribution to recognize unseen data. This means learning the seen-unseen domains joint distribution is crucial for GZSL tasks. However, existing models only learn the underlying distribution of the seen domain as unseen data is inaccessible. In this work, we propose to utilize the available unseen class semantics along with seen class semantics and learn dual-domain joint distribution through a strong visual-semantic coupling. Therefore, we propose a bidirectional mapping coupled generative adversarial network (BMCoGAN) by extending the coupled generative adversarial network (CoGAN) into a dual-domain learning bidirectional mapping model. We further integrate a Wasserstein generative adversarial optimization to supervise the joint distribution learning. For retaining distinctive information in the synthesized visual space and reducing bias towards seen classes, we design an optimization, which pushes synthesized seen features towards real seen features and pulls synthesized unseen features away from real seen features. We evaluate BMCoGAN on several benchmark datasets against contemporary methods and show its superior performance. Also, we present ablative analysis to demonstrate the importance of different components in BMCoGAN.
摘要:基于双向映射的生成模型通过学习从类语义构造视觉特征并从生成的视觉特征重新构造类语义,从而在广义零镜头学习(GZSL)识别方面取得了卓越的性能。这些模型的性能取决于综合特征的质量。这取决于模型通过关联语义视觉空间,学习判别信息以及重新利用学习的分布以识别看不见的数据来捕获潜在的可见数据分布的能力。这意味着学习看不见的域联合分布对于GZSL任务至关重要。但是,由于无法访问看不见的数据,因此现有模型仅学习可见域的基础分布。在这项工作中,我们建议利用可用的看不见的类语义以及所见的类语义,并通过强大的视觉语义耦合来学习双域联合分布。因此,通过将耦合生成对抗网络(CoGAN)扩展到双域学习双向映射模型,我们提出了双向映射耦合生成对抗网络(BMCoGAN)。我们进一步集成了Wasserstein生成对抗性优化,以监督联合分布学习。为了在合成的视觉空间中保留独特的信息并减少对可见类的偏见,我们设计了一种优化方法,将合成的可见特征推向真实的可见特征,并将合成的看不见的特征从真实可见的特征中拉出。我们根据当代方法在几个基准数据集上评估了BMCoGAN,并显示了其卓越的性能。此外,我们提供了烧蚀分析,以证明BMCoGAN中不同组件的重要性。

33. SkiNet: A Deep Learning Solution for Skin Lesion Diagnosis with Uncertainty Estimation and Explainability [PDF] 返回目录
  Rajeev Kumar Singh, Rohan Gorantla, Sai Giridhar Allada, Narra Pratap
Abstract: Skin cancer is considered to be the most common human malignancy. Around 5 million new cases of skin cancer are recorded in the United States annually. Early identification and evaluation of skin lesions are of great clinical significance, but the disproportionate dermatologist-to-patient ratio poses a significant problem in most developing nations. We therefore propose a deep learning based architecture, known as SkiNet, with the objective of providing a faster screening solution and assistance to newly trained physicians in the clinical diagnosis process. The main motive behind SkiNet's design and development is to provide a white-box solution, addressing the critical problem of trust and interpretability, which is crucial for wider adoption of computer-aided diagnosis systems by medical practitioners. SkiNet is a two-stage pipeline wherein lesion segmentation is followed by lesion classification. In our SkiNet methodology, Monte Carlo dropout and test-time augmentation techniques are employed to estimate epistemic and aleatoric uncertainty, while saliency-based methods are explored to provide post-hoc explanations of the deep learning models. The publicly available dataset ISIC-2018 is used to perform the experiments and ablation studies. The results establish the robustness of the model on traditional benchmarks while addressing the black-box nature of such models, alleviating the skepticism of medical practitioners by incorporating transparency and confidence into the model's predictions.
摘要:皮肤癌被认为是人类最常见的恶性肿瘤。美国每年记录约500万皮肤癌新发病例。皮肤病灶的早期识别和评估具有重要的临床意义,但是在大多数发展中国家,皮肤科医生与病人比例失衡构成了重大问题。因此,我们提出了一种基于深度学习的架构,称为SkiNet,其目的是在临床诊断过程中提供更快的筛查方案并为新近受训的医师提供帮助。SkiNet设计和开发背后的主要动机是提供一个白盒解决方案,解决信任和可解释性这一关键问题,这对于医疗从业人员广泛采用计算机辅助诊断系统至关重要。SkiNet是一个两阶段的流水线,先进行病灶分割,再进行病灶分类。在我们的SkiNet方法中,采用蒙特卡洛Dropout和测试时增强技术来估计认知不确定性和偶然不确定性,同时探索了基于显著性的方法来为深度学习模型提供事后解释。公开可用的数据集ISIC-2018用于进行实验和消融研究。结果确立了模型在传统基准上的稳健性,同时通过为模型预测引入透明性和置信度来应对此类模型的黑盒性质,减轻医疗从业者的疑虑。
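
A common way to realize the Monte Carlo dropout uncertainty estimate mentioned above: keep dropout sampling active at inference and treat the spread of repeated forward passes as epistemic uncertainty. A sketch assuming a softmax classifier; `model` is a placeholder.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Epistemic uncertainty via Monte Carlo dropout.

    Note: model.train() here keeps nn.Dropout layers sampling; in practice
    only the dropout layers should be switched to train mode, so that
    BatchNorm statistics stay frozen.
    """
    model.train()
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)   # predictive distribution
    std = probs.std(dim=0)     # per-class epistemic uncertainty
    return mean, std
```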

34. Damaged Fingerprint Recognition by Convolutional Long Short-Term Memory Networks for Forensic Purposes [PDF] 返回目录
  Jaouhar Fattahi, Mohamed Mejri
Abstract: Fingerprint recognition is often a game-changing step in establishing evidence against criminals. However, we increasingly find that criminals deliberately alter their fingerprints in a variety of ways to make it difficult for technicians and automatic sensors to recognize them, making it tedious for investigators to establish strong evidence against them in a forensic procedure. In this sense, deep learning comes out as a prime candidate to assist in the recognition of damaged fingerprints, in particular convolutional architectures. In this paper, we focus on the recognition of damaged fingerprints by Convolutional Long Short-Term Memory networks. We present the architecture of our model and demonstrate its performance, which exceeds 95% accuracy and 99% precision, and approaches 95% recall and 99% AUC.
摘要:在建立针对犯罪分子的证据方面,指纹识别通常是起决定性作用的一步。但是,我们越来越多地发现,犯罪分子故意以各种方式更改其指纹,使技术人员和自动传感器难以识别,这使调查人员在法医程序中难以针对他们建立强有力的证据。从这个意义上讲,深度学习,尤其是卷积架构,是帮助识别受损指纹的首选方法。在本文中,我们重点研究用卷积长短期记忆网络识别受损指纹。我们介绍了模型的架构,并展示了其性能:准确率超过95%,精确率超过99%,召回率接近95%,AUC接近99%。

35. NBNet: Noise Basis Learning for Image Denoising with Subspace Projection [PDF] 返回目录
  Shen Cheng, Yuzhi Wang, Haibin Huang, Donghao Liu, Haoqiang Fan, Shuaicheng Liu
Abstract: In this paper, we introduce NBNet, a novel framework for image denoising. Unlike previous works, we propose to tackle this challenging problem from a new perspective: noise reduction by image-adaptive projection. Specifically, we propose to train a network that can separate signal and noise by learning a set of reconstruction bases in the feature space. Subsequently, image denoising can be achieved by selecting the corresponding basis of the signal subspace and projecting the input into that space. Our key insight is that projection can naturally maintain the local structure of the input signal, especially in areas with low light or weak textures. Towards this end, we propose SSA, a non-local subspace attention module designed explicitly to learn the basis generation as well as the subspace projection. We further incorporate SSA into NBNet, a UNet-structured network designed for end-to-end image denoising. We conduct evaluations on benchmarks, including SIDD and DND, and NBNet achieves state-of-the-art performance on PSNR and SSIM with significantly less computational cost.
摘要:在本文中,我们介绍了NBNet,这是一种用于图像去噪的新颖框架。与以前的作品不同,我们建议从新的角度解决这一难题:通过图像自适应投影来降低噪声。具体来说,我们建议通过学习特征空间中的一组重建基础来训练可以分离信号和噪声的网络。随后,可以通过选择信号子空间的相应基础并将输入投影到这种空间中来实现图像去噪。我们的主要见识在于,投影可以自然地保持输入信号的局部结构,尤其是在光线较弱或纹理较弱的区域。为此,我们提出了SSA,这是一个非本地子空间注意模块,专门设计用于学习基础生成以及子空间投影。我们进一步将SSA与NBNet结合在一起,NBNet是为端到端图像去噪设计的UNet结构化网络。我们对包括SIDD和DND在内的基准进行评估,并且NBNet在PSNR和SSIM上实现了最先进的性能,而计算成本却大大降低了。
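
The subspace projection at the core of this idea can be written as an orthogonal projection of input features onto the span of K predicted basis vectors. A minimal linear-algebra sketch; tensor shapes are assumptions, and the basis would come from the SSA module rather than at random.

```python
import torch

def project_to_subspace(x: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project x onto span(basis): x_hat = V (V^T V)^{-1} V^T x.

    x:     (N, D)  feature vectors (e.g., flattened spatial positions)
    basis: (D, K)  K predicted basis vectors of the signal subspace
    """
    V = basis
    gram = V.T @ V                               # (K, K)
    coeff = torch.linalg.solve(gram, V.T @ x.T)  # (K, N) least-squares coefficients
    return (V @ coeff).T                         # (N, D) projected (denoised) features

x = torch.randn(1024, 64)   # noisy features
V = torch.randn(64, 8)      # signal-subspace basis (here random, for shape only)
x_signal = project_to_subspace(x, V)
```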

36. Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [PDF] 返回目录
  Zhangkai Ni, Wenhan Yang, Shiqi Wang, Lin Ma, Sam Kwong
Abstract: Improving the aesthetic quality of images is challenging and eagerly sought by the public. To address this problem, most existing algorithms are based on supervised learning methods that learn an automatic photo enhancer from paired data, consisting of low-quality photos and corresponding expert-retouched versions. However, the style and characteristics of photos retouched by experts may not meet the needs or preferences of general users. In this paper, we present an unsupervised image enhancement generative adversarial network (UEGAN), which learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner, rather than learning from a large number of paired images. The proposed model is based on a single deep GAN that embeds modulation and attention mechanisms to capture richer global and local features. Based on the proposed model, we introduce two losses to handle unsupervised image enhancement: (1) a fidelity loss, defined as an L2 regularization in the feature domain of a pre-trained VGG network, to ensure that the content of the enhanced image matches that of the input image, and (2) a quality loss, formulated as a relativistic hinge adversarial loss, to endow the input image with the desired characteristics. Both quantitative and qualitative results show that the proposed model effectively improves the aesthetic quality of images. Our code is available at: this https URL.
摘要:提高图像的美学质量极具挑战性,也是公众所期待的。为了解决这个问题,大多数现有算法基于监督学习方法,从由低质量照片和相应的专家修饰版本组成的成对数据中学习自动照片增强器。但是,专家修饰的照片的风格和特征可能无法满足一般用户的需求或偏好。在本文中,我们提出了一种无监督的图像增强生成对抗网络(UEGAN),该网络以无监督的方式从一组具有所需特征的图像中学习相应的图像到图像映射,而不是从大量成对图像中学习。所提出的模型基于单个深度GAN,其嵌入了调制和注意机制以捕获更丰富的全局和局部特征。基于所提出的模型,我们引入了两种损失来处理无监督图像增强:(1)保真度损失,定义为预训练VGG网络特征域中的L2正则化,以确保增强图像与输入图像的内容一致;(2)质量损失,被定义为相对论铰链对抗损失,以赋予输入图像所需的特性。定量和定性结果均表明,该模型有效地提高了图像的美学质量。我们的代码可在以下网址获得:this https URL。
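
A sketch of the two losses described above, with the relativistic hinge written in its usual averaged form. The VGG feature extractor and the discriminator outputs are placeholders; which VGG layer to use is an assumption, not stated in the abstract.

```python
import torch
import torch.nn.functional as F

def fidelity_loss(vgg, enhanced: torch.Tensor, inp: torch.Tensor) -> torch.Tensor:
    """L2 regularization in a pre-trained VGG feature space keeps content fixed."""
    return F.mse_loss(vgg(enhanced), vgg(inp))

def rahinge_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Relativistic average hinge loss, discriminator side."""
    return (F.relu(1.0 - (d_real - d_fake.mean())).mean()
            + F.relu(1.0 + (d_fake - d_real.mean())).mean())

def rahinge_g_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Generator side: make fakes more realistic than the average real."""
    return (F.relu(1.0 + (d_real - d_fake.mean())).mean()
            + F.relu(1.0 - (d_fake - d_real.mean())).mean())
```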

37. 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition [PDF] 返回目录
  Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, Larry S. Davis
Abstract: 3D convolutional networks are prevalent for video recognition. While achieving excellent recognition performance on standard benchmarks, they operate on a sequence of frames with 3D convolutions and thus are computationally demanding. Exploiting large variations among different videos, we introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network. These policies are derived with a two-head lightweight selection network conditioned on each input video clip. Then, only frames and convolutions that are selected by the selection network are used in the 3D model to generate predictions. The selection network is optimized with policy gradient methods to maximize a reward that encourages making correct predictions with limited computation. We conduct experiments on three video recognition benchmarks and demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets. We also show that learned policies are transferable and Ada3D is compatible to different backbones and modern clip selection approaches. Our qualitative analysis indicates that our method allocates fewer 3D convolutions and frames for "static" inputs, yet uses more for motion-intensive clips.
摘要:3D卷积网络普遍用于视频识别。在标准基准上获得出色的识别性能的同时,它们在具有3D卷积的帧序列上运行,因此对计算的要求很高。利用不同视频之间的巨大差异,我们引入了Ada3D,这是一个条件计算框架,可学习特定于实例的3D使用策略,以确定要在3D网络中使用的帧和卷积层。这些策略是通过以每个输入视频剪辑为条件的两头轻量选择网络得出的。然后,在3D模型中仅使用由选择网络选择的帧和卷积来生成预测。选择网络使用策略梯度方法进行了优化,以最大化奖励,从而鼓励以有限的计算来做出正确的预测。我们在三个视频识别基准上进行了实验,并证明了我们的方法具有与最新3D模型相似的准确性,同时在不同数据集上所需的计算量减少了20%-50%。我们还显示,学习到的策略是可转让的,并且Ada3D与不同的主干和现代剪辑选择方法兼容。我们的定性分析表明,我们的方法为“静态”输入分配了较少的3D卷积和帧,但为运动密集型剪辑分配了更多的3D卷积和帧。

38. SALA: Soft Assignment Local Aggregation for 3D Semantic Segmentation [PDF] 返回目录
  Hani Itani, Silvio Giancola, Ali Thabet, Bernard Ghanem
Abstract: We introduce the idea of using learnable neighbor-to-grid soft assignment in grid-based aggregation functions for the task of 3D semantic segmentation. Previous methods in the literature operate on a predefined geometric grid such as local volume partitions or irregular kernel points. These methods use geometric functions to assign local neighbors to their corresponding grid. Such geometric heuristics are potentially sub-optimal for the end task of semantic segmentation. Furthermore, they are applied uniformly throughout the depth of the network. A more general alternative would allow the network to learn its own neighbor-to-grid assignment function that best suits the end task. Since it is learnable, this mapping has the flexibility to be different per layer. This paper leverages learned neighbor-to-grid soft assignment to define an aggregation function that balances efficiency and performance. We demonstrate the efficacy of our method by reaching state-of-the-art (SOTA) performance on S3DIS with almost 10$\times$ fewer parameters than the current best method. We also demonstrate competitive performance on ScanNet and PartNet as compared with much larger SOTA models.
摘要:我们介绍了在基于网格的聚合函数中使用可学习的邻居到网格软分配的思想,以实现3D语义分割的任务。文献中的先前方法在预定的几何网格上运行,例如局部体积分区或不规则的核点。这些方法使用几何函数将局部邻居分配给其相应的网格。对于语义分割的最终任务,这种几何启发式方法可能不是最佳的。此外,它们在网络的整个深度上被均匀地应用。更为通用的替代方法是允许网络学习最适合最终任务的自己的邻居到网格分配功能。由于它是可学习的,因此该映射具有灵活性,每层可以不同。本文利用学习到的从邻居到网格的软分配来定义一个聚合功能,该功能可以平衡效率和性能。通过在S3DIS上达到最先进的(SOTA)性能,其参数比当前的控制方法少近10倍,我们证明了我们方法的有效性。与更大的SOTA模型相比,我们还展示了ScanNet和PartNet上的竞争性能。

39. Detecting Hate Speech in Multi-modal Memes [PDF] 返回目录
  Abhishek Das, Japsimar Singh Wahi, Siyao Li
Abstract: In the past few years, there has been a surge of interest in multi-modal problems, from image captioning to visual question answering and beyond. In this paper, we focus on hate speech detection in multi-modal memes, wherein memes pose an interesting multi-modal fusion problem. We aim to solve the Facebook Meme Challenge \cite{kiela2020hateful}, which poses the binary classification problem of predicting whether a meme is hateful or not. A crucial characteristic of the challenge is that it includes "benign confounders" to counter the possibility of models exploiting unimodal priors. The challenge states that state-of-the-art models perform poorly compared to humans. During the analysis of the dataset, we realized that the majority of the data points which are originally hateful are turned benign just by describing the image of the meme. Also, the majority of the multi-modal baselines give more preference to the hate speech (language) modality. To tackle these problems, we explore the visual modality using object detection and image captioning models to fetch the "actual caption" and then combine it with the multi-modal representation to perform binary classification. This approach tackles the benign text confounders present in the dataset to improve the performance. Another approach we experiment with is to improve the prediction with sentiment analysis. Instead of only using multi-modal representations obtained from pre-trained neural networks, we also include the unimodal sentiment to enrich the features. We perform a detailed analysis of the above two approaches, providing compelling reasons in favor of the methodologies used.
摘要:在过去的几年中,人们对多模态问题的兴趣激增,从图像字幕到视觉问答等等。在本文中,我们专注于多模态模因中的仇恨言论检测,其中模因构成了一个有趣的多模态融合问题。我们旨在解决Facebook模因挑战\cite{kiela2020hateful},即预测模因是否具有仇恨性的二元分类问题。该挑战的一个关键特征是它包括"良性混杂因素",以对抗模型利用单模态先验的可能性。挑战指出,与人类相比,最先进模型的表现较差。在分析数据集的过程中,我们发现,大多数原本具有仇恨性的数据点,仅仅通过描述模因的图像就变成了良性的。而且,大多数多模态基线更偏重仇恨言论(语言模态)。为了解决这些问题,我们使用目标检测和图像字幕模型来探索视觉模态,以获取"实际字幕",然后将其与多模态表示相结合以执行二元分类。这种方法解决了数据集中存在的良性文本混杂因素,从而提高了性能。我们尝试的另一种方法是通过情感分析来改善预测。我们不仅使用从预训练神经网络获得的多模态表示,还引入单模态情感以丰富特征。我们对以上两种方法进行了详细的分析,为所使用的方法提供了令人信服的理由。

40. Learning a Dynamic Map of Visual Appearance [PDF] 返回目录
  Tawfiq Salem, Scott Workman, Nathan Jacobs
Abstract: The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.
摘要:世界的外观不仅在不同的地方有很大的不同,而且在每小时的时间和每月的月份中也有很大的不同。 每天都有数十亿张图像捕获这种复杂的关系,其中许多与精确的时间和位置元数据相关联。 我们建议使用这些图像来构建视觉外观属性的全球范围的动态地图。 这样的地图可以对任何地理位置和时间的预期外观进行细粒度的了解。 我们的方法将密集的开销图像与位置和时间元数据集成到一个能够映射各种视觉属性的通用框架中。 我们方法的主要特点是不需要手动数据注释。 我们演示了这种方法如何支持各种应用程序,包括图像驱动的映射,图像地理定位和元数据验证。

41. Object sorting using faster R-CNN [PDF] 返回目录
  Pengchang Chen, Vinayak Elangovan
Abstract: In a factory production line, different industrial parts need to be quickly differentiated and sorted for further processing. Parts can be of different colors and shapes. It is tedious for humans to differentiate and sort these objects into appropriate categories, and automating this process would save time and cost. In the automation process, choosing an appropriate model to detect and classify different objects based on specific features is challenging. In this paper, three different neural network models are compared for the object sorting system, namely CNN, Fast R-CNN, and Faster R-CNN. These models are tested, and their performance is analyzed. Moreover, for the object sorting system, an Arduino-controlled 5-DoF (degrees of freedom) robot arm is programmed to grab and drop symmetrical objects into the targeted zone. Objects are categorized into classes based on color and on whether they are defective or non-defective.
摘要:在工厂生产线中,需要对不同的工业零件进行快速区分和分拣以进行进一步处理。零件可以具有不同的颜色和形状。对于人类来说,将这些物体区分并分拣到适当的类别十分繁琐,将该过程自动化可以节省时间和成本。在自动化过程中,根据特定特征选择合适的模型来检测和分类不同的物体更具挑战性。在本文中,针对物体分拣系统比较了三种不同的神经网络模型,分别是CNN、Fast R-CNN和Faster R-CNN。我们对这些模型进行了测试,并分析了它们的性能。此外,对于物体分拣系统,对Arduino控制的5自由度(DoF)机械臂进行了编程,可以将对称物体抓取并放置到目标区域。物体根据颜色以及是否有缺陷被分为不同类别。

42. Visual-Thermal Camera Dataset Release and Multi-Modal Alignment without Calibration Information [PDF] 返回目录
  Frank Mascarich, Kostas Alexis
Abstract: This report accompanies a dataset release on visual and thermal camera data and details a procedure followed to align such multi-modal camera frames in order to provide pixel-level correspondence between the two without using intrinsic or extrinsic calibration information. To achieve this goal we benefit from progress in the domain of multi-modal image alignment and specifically employ the Mattes Mutual Information Metric to guide the registration process. In the released dataset we release both the raw visual and thermal camera data, as well as the aligned frames, alongside calibration parameters with the goal to better facilitate the investigation on common local/global features across such multi-modal image streams.
摘要:该报告随附有关视觉和热像仪数据的数据集发布,并详细介绍了对齐此类多模式照像机框架以提供两者之间的像素级对应而无需使用内部或外部校准信息的过程。 为了实现这一目标,我们受益于多模式图像对齐领域的进步,并特别采用了Mattes Mutual Information Metric来指导注册过程。 在发布的数据集中,我们发布了原始的视觉和热像仪数据,以及对齐的帧以及校准参数,目的是更好地促进跨此类多模态图像流研究共同的局部/全局特征。
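
A multi-modal registration of this kind can be sketched with SimpleITK, which exposes the Mattes mutual information metric directly. The parameter values, the rigid 2D transform, and the assumption of single-channel (grayscale) images are all illustrative, not the settings used for the released dataset.

```python
import SimpleITK as sitk

def align_thermal_to_visual(visual_path: str, thermal_path: str) -> sitk.Image:
    """Register a thermal frame onto a visual frame via Mattes mutual information."""
    fixed = sitk.ReadImage(visual_path, sitk.sitkFloat32)    # assumed grayscale
    moving = sitk.ReadImage(thermal_path, sitk.sitkFloat32)

    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetOptimizerAsRegularStepGradientDescent(
        learningRate=1.0, minStep=1e-4, numberOfIterations=200)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetInitialTransform(sitk.CenteredTransformInitializer(
        fixed, moving, sitk.Euler2DTransform(),
        sitk.CenteredTransformInitializerFilter.GEOMETRY))

    transform = reg.Execute(fixed, moving)
    # resample thermal into the visual frame for pixel-level correspondence
    return sitk.Resample(moving, fixed, transform, sitk.sitkLinear, 0.0)
```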

43. Graph-based non-linear least squares optimization for visual place recognition in changing environments [PDF] 返回目录
  Stefan Schubert, Peer Neubert, Peter Protzel
Abstract: Visual place recognition is an important subproblem of mobile robot localization. Since it is a special case of image retrieval, the basic source of information is the pairwise similarity of image descriptors. However, the embedding of the image retrieval problem in this robotic task provides additional structure that can be exploited, e.g. spatio-temporal consistency. Several algorithms exist to exploit this structure, e.g., sequence processing approaches or descriptor standardization approaches for changing environments. In this paper, we propose a graph-based framework to systematically exploit different types of additional structure and information. The graphical model is used to formulate a non-linear least squares problem that can be optimized with standard tools. Beyond sequences and standardization, we propose the usage of intra-set similarities within the database and/or the query image set as additional source of information. If available, our approach also allows to seamlessly integrate additional knowledge about poses of database images. We evaluate the system on a variety of standard place recognition datasets and demonstrate performance improvements for a large number of different configurations including different sources of information, different types of constraints, and online or offline place recognition setups.
摘要:视觉位置识别是移动机器人定位的重要子问题。由于这是图像检索的特例,因此基本信息源是图像描述符的成对相似性。然而,将图像检索问题嵌入该机器人任务中提供了可被利用的附加结构,例如。时空一致性。存在几种利用该结构的算法,例如用于变化环境的序列处理方法或描述符标准化方法。在本文中,我们提出了一个基于图的框架来系统地利用不同类型的附加结构和信息。图形模型用于制定可以用标准工具优化的非线性最小二乘问题。除了序列和标准化以外,我们建议使用数据库和/或查询图像集内的内部相似性作为附加信息源。如果可以的话,我们的方法还可以无缝集成有关数据库映像姿势的其他知识。我们在各种标准的位置识别数据集上评估该系统,并展示了针对大量不同配置(包括不同的信息源,不同类型的约束以及在线或离线位置识别设置)的性能改进。

44. Deep Hashing for Secure Multimodal Biometrics [PDF] 返回目录
  Veeru Talreja, Matthew Valenti, Nasser Nasrabadi
Abstract: When compared to unimodal systems, multimodal biometric systems have several advantages, including lower error rate, higher accuracy, and larger population coverage. However, multimodal systems have an increased demand for integrity and privacy because they must store multiple biometric traits associated with each user. In this paper, we present a deep learning framework for feature-level fusion that generates a secure multimodal template from each user's face and iris biometrics. We integrate a deep hashing (binarization) technique into the fusion architecture to generate a robust binary multimodal shared latent representation. Further, we employ a hybrid secure architecture by combining cancelable biometrics with secure sketch techniques and integrate it with a deep hashing framework, which makes it computationally prohibitive to forge a combination of multiple biometrics that pass the authentication. The efficacy of the proposed approach is shown using a multimodal database of face and iris and it is observed that the matching performance is improved due to the fusion of multiple biometrics. Furthermore, the proposed approach also provides cancelability and unlinkability of the templates along with improved privacy of the biometric data. Additionally, we also test the proposed hashing function for an image retrieval application using a benchmark dataset. The main goal of this paper is to develop a method for integrating multimodal fusion, deep hashing, and biometric security, with an emphasis on structural data from modalities like face and iris. The proposed approach is in no way a general biometric security framework that can be applied to all biometric modalities, as further research is needed to extend the proposed framework to other unconstrained biometric modalities.
摘要:与单峰系统相比,多峰生物识别系统具有多个优势,包括更低的错误率,更高的准确性和更大的人口覆盖范围。但是,由于多模式系统必须存储与每个用户相关的多个生物特征,因此对完整性和隐私性的需求增加。在本文中,我们提出了一种用于特征级融合的深度学习框架,该框架从每个用户的面部和虹膜生物特征生成安全的多峰模板。我们将深度哈希(二值化)技术集成到融合体系结构中,以生成健壮的二进制多峰共享潜在表示。此外,我们通过将可取消的生物特征与安全草图技术相结合并采用深度哈希框架来采用混合安全体系结构,这在计算上难以伪造通过身份验证的多个生物特征的组合。使用面部和虹膜的多峰数据库显示了该方法的有效性,并观察到由于多种生物识别技术的融合,匹配性能得到了改善。此外,所提出的方法还提供了模板的可取消性和不可链接性以及生物统计数据的改进的隐私性。此外,我们还使用基准数据集测试了针对图像检索应用程序提出的哈希函数。本文的主要目的是开发一种集成多模式融合,深度哈希和生物识别安全性的方法,重点是来自面部和虹膜等模式的结构数据。所提出的方法绝不是可应用于所有生物特征形式的通用生物特征安全框架,因为需要进一步的研究以将所提出的框架扩展到其他不受约束的生物特征形式。
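
The deep hashing (binarization) step mentioned above is commonly realized by relaxing the non-differentiable sign function with tanh during training and penalizing the quantization gap; a generic sketch of that pattern, not the paper's exact architecture:

```python
import torch

def to_hash_code(features: torch.Tensor, training: bool = True):
    """Map real-valued fused biometric features to binary codes."""
    relaxed = torch.tanh(features)  # differentiable surrogate in (-1, 1)
    if training:
        # penalize the distance between relaxed codes and their binarization
        quant_loss = (relaxed - relaxed.sign()).pow(2).mean()
        return relaxed, quant_loss  # add quant_loss to the training objective
    return (relaxed.sign() + 1) / 2  # hard {0, 1} code at enrollment time
```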

45. Chasing the Tail in Monocular 3D Human Reconstruction with Prototype Memory [PDF] 返回目录
  Yu Rong, Ziwei Liu, Chen Change Loy
Abstract: Deep neural networks have achieved great progress in single-image 3D human reconstruction. However, existing methods still fall short in predicting rare poses. The reason is that most of the current models perform regression based on a single human prototype, which is similar to common poses while far from the rare poses. In this work, we 1) identify and analyze this learning obstacle and 2) propose a prototype memory-augmented network, PM-Net, that effectively improves performances of predicting rare poses. The core of our framework is a memory module that learns and stores a set of 3D human prototypes capturing local distributions for either common poses or rare poses. With this formulation, the regression starts from a better initialization, which is relatively easier to converge. Extensive experiments on several widely employed datasets demonstrate the proposed framework's effectiveness compared to other state-of-the-art methods. Notably, our approach significantly improves the models' performances on rare poses while generating comparable results on other samples.
摘要:深度神经网络在单图像3D人体重建中取得了长足的进步。但是,现有方法仍不足以预测稀有姿势。原因是当前大多数模型都基于单个人体原型执行回归,这与常见姿势相似,而与稀有姿势相差甚远。在这项工作中,我们1)识别并分析该学习障碍,2)提出一个原型记忆增强网络PM-Net,该网络可以有效地提高预测稀有姿势的性能。我们框架的核心是一个内存模块,用于学习和存储一组3D人体原型,以捕获常见姿势或稀有姿势的局部分布。使用这种公式,回归从更好的初始化开始,这相对容易收敛。与其他最先进的方法相比,在几个广泛使用的数据集上的大量实验证明了所提出框架的有效性。值得注意的是,我们的方法大大改善了模型在罕见姿势下的性能,同时在其他样本上产生了可比的结果。

46. Advances in deep learning methods for pavement surface crack detection and identification with visible light visual images [PDF] 返回目录
  Kailiang Lu
Abstract: Compared to NDT and health monitoring methods for cracks in engineering structures, surface crack detection or identification based on visible-light images is non-contact and has the advantages of fast speed, low cost, and high precision. Firstly, typical public pavement (and concrete) crack datasets were collected, and the characteristics of the sample images as well as random variable factors, including environment, noise, and interference, were summarized. Subsequently, the advantages and disadvantages of the three main crack identification approaches (i.e., hand-crafted feature engineering, machine learning, and deep learning) were compared. Finally, from the aspects of model architecture, testing performance, and prediction effectiveness, the development and progress of typical deep learning models that can easily be deployed on embedded platforms, including self-built CNNs, transfer learning (TL), and encoder-decoder (ED) models, were reviewed. The benchmark test shows that: 1) real-time pixel-level crack identification is achievable on an embedded platform: the average crack-detection time per image sample is less than 100 ms, using either the ED method (i.e., FPCNet) or the TL method based on InceptionV3; it can be reduced to less than 10 ms with the TL method based on MobileNet (a lightweight backbone network). 2) In terms of accuracy, the models can reach over 99.8% on CCIC, whose samples are easily identified by human eyes. On SDNET2018, some samples of which are difficult to identify, FPCNet can reach 97.5%, while the TL method is close to 96.1%. To the best of our knowledge, this paper for the first time comprehensively summarizes the public pavement crack datasets, and the performance and effectiveness of surface crack detection and identification deep learning methods for embedded platforms are reviewed and evaluated.
摘要:与无损检测和工程结构裂缝健康监测方法相比,基于可见光图像的表面裂缝检测或识别是非接触式的,具有速度快,成本低,精度高的优点。首先,收集了典型的路面(也是混凝土)裂缝公共数据集,并总结了样本图像的特征以及环境,噪声和干扰等随机变量因素。随后,比较了三种主要的裂缝识别方法(即手工特征工程,机器学习,深度学习)的优缺点。最后,从模型架构,测试性能和预测有效性等方面来看,典型的深度学习模型(包括自建的CNN,传递学习(TL)和编码器-解码器(ED))的发展和进步可以轻松部署在嵌入式平台,进行了审查。基准测试表明:1)能够在嵌入式平台上实现实时的像素级裂缝识别:使用ED方法(即, FPCNet)或基于InceptionV3的TL方法。使用基于MobileNet(轻量级骨干基础网络)的TL方法,可以将其缩短到10ms以内。 2)就准确度而言,CCIC可以达到99.8%以上,人眼可以轻松识别。在SDNET2018上,其中一些样本难以识别,FPCNet可以达到97.5%,而TL方法接近96.1%。据我们所知,本文首次全面总结了路面裂缝公共数据集,并对嵌入式平台的表面裂缝检测与识别深度学习方法的性能和有效性进行了回顾和评估。

47. Image-to-Image Retrieval by Learning Similarity between Scene Graphs [PDF] 返回目录
  Sangwoong Yoon, Woo Young Kang, Sungwook Jeon, SeongEun Lee, Changjin Han, Jonghun Park, Eun-Sol Kim
Abstract: As a scene graph compactly summarizes the high-level content of an image in a structured and symbolic manner, the similarity between scene graphs of two images reflects the relevance of their contents. Based on this idea, we propose a novel approach for image-to-image retrieval using scene graph similarity measured by graph neural networks. In our approach, graph neural networks are trained to predict the proxy image relevance measure, computed from human-annotated captions using a pre-trained sentence similarity model. We collect and publish the dataset for image relevance measured by human annotators to evaluate retrieval algorithms. The collected dataset shows that our method agrees well with the human perception of image similarity than other competitive baselines.
摘要:由于场景图以结构化和象征性的方式紧凑地概括了图像的高级内容,因此两个图像的场景图之间的相似性反映了它们内容的相关性。 基于此思想,我们提出了一种使用图神经网络测量的场景图相似度进行图像到图像检索的新方法。 在我们的方法中,训练图神经网络来预测代理图像相关性度量,该度量是使用预先训练的句子相似性模型从人工注释字幕计算得出的。 我们收集并发布人类注释者测量的图像相关性数据集,以评估检索算法。 收集的数据集表明,与其他竞争基准相比,我们的方法与人类对图像相似性的感知非常吻合。

48. COIN: Contrastive Identifier Network for Breast Mass Diagnosis in Mammography [PDF] 返回目录
  Heyi Li, Dongdong Chen, William H. Nailon, Mike E. Davies, David Laurenson
Abstract: Computer-aided breast cancer diagnosis in mammography is a challenging problem, stemming from mammographic data scarcity and data entanglement. In particular, data scarcity is attributed to privacy concerns and expensive annotation, and data entanglement is due to the high similarity between benign and malignant masses, whose manifolds reside in a lower-dimensional space with a very small margin. To address these two challenges, we propose a deep learning framework, named Contrastive Identifier Network (\textsc{COIN}), which integrates adversarial augmentation and manifold-based contrastive learning. Firstly, we employ adversarial learning to create both on- and off-distribution mass-containing ROIs. After that, we propose a novel contrastive loss with a constructed signed graph. Finally, the neural network is optimized in a contrastive learning manner, with the purpose of improving the deep model's discriminativity on the extended dataset. In particular, by employing COIN, data samples from the same category are pulled close, whereas those with different labels are pushed further apart in the deep latent space. Moreover, COIN outperforms state-of-the-art related algorithms for the breast cancer diagnosis problem by a considerable margin, achieving 93.4\% accuracy and a 95.0\% AUC score. The code will be released on ***.
摘要:乳腺X线摄影中的计算机辅助乳腺癌诊断是一个具有挑战性的问题,源于乳腺X线数据的稀缺和数据纠缠。具体而言,数据稀缺归因于隐私问题和昂贵的标注,而数据纠缠则是由于良性和恶性肿块之间的高度相似性,其流形位于较低维空间中且类间间隔很小。为了解决这两个挑战,我们提出了一个名为对比识别网络(\textsc{COIN})的深度学习框架,该框架集成了对抗增强和基于流形的对比学习。首先,我们采用对抗学习来生成分布内和分布外的包含肿块的ROI。之后,我们基于构建的符号图提出了一种新颖的对比损失。最后,以对比学习的方式优化神经网络,目的是提高深度模型在扩展数据集上的判别力。特别是,通过使用COIN,同一类别的数据样本在深层潜在空间中被拉近,而带有不同标签的样本被推远。此外,COIN在解决乳腺癌诊断问题上以相当大的优势超过了相关的最新算法,达到93.4%的准确率和95.0%的AUC得分。代码将在***上发布。

49. Towards Reducing Severe Defocus Spread Effects for Multi-Focus Image Fusion via an Optimization Based Strategy [PDF] 返回目录
  Shuang Xu, Lizhen Ji, Zhe Wang, Pengfei Li, Kai Sun, Chunxia Zhang, Jiangshe Zhang
Abstract: Multi-focus image fusion (MFF) is a popular technique to generate an all-in-focus image, in which all objects in the scene are sharp. However, existing methods pay little attention to the defocus spread effects of real-world multi-focus images. Consequently, most methods perform badly in areas near focus-map boundaries. Following the idea that each local region in the fused image should be similar to the sharpest one among the source images, this paper presents an optimization-based approach to reduce defocus spread effects. Firstly, a new MFF assessment metric is presented by combining the principle of structure similarity and detected focus maps. Then, the MFF problem is cast as maximizing this metric. The optimization is solved by gradient ascent. Experiments conducted on the real-world dataset verify the superiority of the proposed model. The codes are available at this https URL.
摘要:多焦点图像融合(MFF)是一种流行的技术,用于生成场景中所有对象都清晰的全焦点图像。 但是,现有方法很少关注现实世界中多焦点图像的散焦散布效果。 因此,大多数方法在聚焦图边界附近的区域中效果不佳。 根据融合图像中每个局部区域应与源图像中最清晰的区域相似的想法,本文提出了一种基于优化的方法来减少散焦散布效果。 首先,结合结构相似性原理和检测到的焦点图,提出了一种新的MFF评估指标。 然后,将MFF问题转化为最大化该指标。 优化是通过梯度上升来解决的。 在真实数据集上进行的实验证明了该模型的优越性。 可以在此https URL上找到代码。
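
The optimization can be sketched as gradient ascent of a fusion-quality score over a soft decision map. Everything here is illustrative: a simple Laplacian-sharpness score stands in for the paper's structure-similarity-based metric, and the parameterization of the decision map is an assumption.

```python
import torch
import torch.nn.functional as F

LAP = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def sharpness(img: torch.Tensor) -> torch.Tensor:
    """Stand-in fusion metric: mean absolute Laplacian response."""
    return F.conv2d(img, LAP, padding=1).abs().mean()

def fuse(src_a: torch.Tensor, src_b: torch.Tensor, steps=200, lr=0.1) -> torch.Tensor:
    """src_a, src_b: (1, 1, H, W) grayscale source images in [0, 1]."""
    logits = torch.zeros_like(src_a, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.sigmoid(logits)            # soft focus (decision) map
        fused = w * src_a + (1 - w) * src_b
        loss = -sharpness(fused)             # gradient ascent on the metric
        opt.zero_grad(); loss.backward(); opt.step()
    w = torch.sigmoid(logits).detach()
    return w * src_a + (1 - w) * src_b
```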

50. Tips and Tricks for Webly-Supervised Fine-Grained Recognition: Learning from the WebFG 2020 Challenge [PDF] 返回目录
  Xiu-Shen Wei, Yu-Yan Xu, Yazhou Yao, Jia Wei, Si Xi, Wenyuan Xu, Weidong Zhang, Xiaoxin Lv, Dengpan Fu, Qing Li, Baoying Chen, Haojie Guo, Taolue Xue, Haipeng Jing, Zhiheng Wang, Tianming Zhang, Mingwen Zhang
Abstract: WebFG 2020 is an international challenge hosted by Nanjing University of Science and Technology, University of Edinburgh, Nanjing University, The University of Adelaide, Waseda University, etc. This challenge focuses on the webly-supervised fine-grained recognition problem. In the literature, existing deep learning methods rely heavily on large-scale, high-quality labeled training data, which limits their practicability and scalability in real-world applications. In particular, for fine-grained recognition, a visual task that requires professional knowledge for labeling, the cost of acquiring labeled training data is quite high, making it extremely difficult to obtain a large amount of high-quality training data. Therefore, utilizing free web data to train fine-grained recognition models has attracted increasing attention from researchers in the fine-grained community. This challenge expects participants to develop webly-supervised fine-grained recognition methods, which leverage web images in training fine-grained recognition models to ease the extreme dependence of deep learning methods on large-scale manually labeled datasets and to enhance their practicability and scalability. In this technical report, we have pulled together the top WebFG 2020 solutions from a total of 54 competing teams, and discuss which methods worked best across the set of winning teams, and what, surprisingly, did not help.
摘要:WebFG 2020是由南京科技大学,爱丁堡大学,南京大学,阿德莱德大学,早稻田大学等主办的一项国际挑战。这一挑战主要关注网络监督的细粒度识别问题。在文献中,现有的深度学习方法高度依赖于大规模和高质量的带标签训练数据,这对其在实际应用中的实用性和可扩展性构成了限制。特别是,对于细粒度的识别(一种视觉任务,需要专业知识进行标记),获取标记训练数据的成本非常高。要获得大量高质量的培训数据,将带来极大的困难。因此,利用免费的网络数据来训练细粒度的识别模型已经引起了细粒度社区研究人员的越来越多的关注。这项挑战预计参与者将开发网络监督的细粒度识别方法,该方法利用网络图像训练细粒度识别模型,以减轻深度学习方法对大规模手动标记数据集的极端依赖,并增强其实用性和可扩展性。在此技术报告中,我们汇总了总共54个参赛团队中的顶级WebFG 2020解决方案,并讨论了哪些方法在整个获胜团队中效果最好,而哪些方法却无济于事。

51. TrustMAE: A Noise-Resilient Defect Classification Framework using Memory-Augmented Auto-Encoders with Trust Regions [PDF] 返回目录
  Daniel Stanley Tan, Yi-Chun Chen, Trista Pei-Chun Chen, Wei-Chao Chen
Abstract: In this paper, we propose a framework called TrustMAE to address the problem of product defect classification. Instead of relying on defective images that are difficult to collect and laborious to label, our framework can accept datasets with unlabeled images. Moreover, unlike most anomaly detection methods, our approach is robust against noises, or defective images, in the training dataset. Our framework uses a memory-augmented auto-encoder with a sparse memory addressing scheme to avoid over-generalizing the auto-encoder, and a novel trust-region memory updating scheme to keep the noises away from the memory slots. The result is a framework that can reconstruct defect-free images and identify the defective regions using a perceptual distance network. When compared against various state-of-the-art baselines, our approach performs competitively under noise-free MVTec datasets. More importantly, it remains effective at a noise level up to 40% while significantly outperforming other baselines.
摘要:在本文中,我们提出了一个名为TrustMAE的框架来解决产品缺陷分类的问题。我们的框架可以接受带有未标记图像的数据集,而不必依赖难以收集且难以标记的缺陷图像。而且,与大多数异常检测方法不同,我们的方法对于训练数据集中的噪声或缺陷图像具有鲁棒性。我们的框架使用带有稀疏存储器寻址方案的内存增强型自动编码器,以避免过度概括自动编码器,并使用新颖的信任区域内存更新方案以使噪声远离内存插槽。结果是可以使用感知距离网络重建无缺陷图像并识别缺陷区域的框架。与各种最新基准进行比较时,我们的方法在无噪声的MVTec数据集下具有竞争优势。更重要的是,它在高达40%的噪声水平下仍然有效,同时明显优于其他基准。
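
The sparse memory addressing can be sketched in the style of memory-augmented auto-encoders: attention over memory slots followed by hard shrinkage, so that each feature is reconstructed from only a few slots. The threshold value is an assumption, and TrustMAE's trust-region memory update is not shown.

```python
import torch
import torch.nn.functional as F

def sparse_memory_read(z: torch.Tensor, memory: torch.Tensor,
                       thresh: float = 0.02) -> torch.Tensor:
    """z: (B, D) encoder features; memory: (M, D) learned defect-free prototypes."""
    sim = F.cosine_similarity(z.unsqueeze(1), memory.unsqueeze(0), dim=-1)  # (B, M)
    attn = F.softmax(sim, dim=1)                          # addressing weights
    attn = F.relu(attn - thresh)                          # hard shrinkage -> sparsity
    attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-8)  # renormalize
    return attn @ memory                                  # reconstruct from few slots
```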

52. The VIP Gallery for Video Processing Education [PDF] 返回目录
  Todd Goodall, Alan C. Bovik
Abstract: Digital video pervades daily life. Mobile video, digital TV, and digital cinema are now ubiquitous, and as such, the field of Digital Video Processing (DVP) has experienced tremendous growth. Digital video systems also permeate scientific and engineering disciplines including, but not limited to, astronomy, communications, surveillance, entertainment, video coding, computer vision, and vision research. As a consequence, educational tools for DVP must cater to a large and diverse base of students. Towards enhancing DVP education we have created a carefully constructed gallery of educational tools that is designed to complement a comprehensive corpus of online lectures by providing examples of DVP on real-world content, along with a user-friendly interface that organizes numerous key DVP topics ranging from analog video, to human visual processing, to modern video codecs, etc. This demonstration gallery is currently being used effectively in the graduate class "Digital Video" at the University of Texas at Austin. Students receive enhanced access to concepts through both learning theory from highly visual lectures and watching concrete examples from the gallery, which captures the beauty of the underlying principles of modern video processing. To better understand the educational value of these tools, we conducted a pair of questionnaire-based surveys to assess student background, expectations, and outcomes. The survey results support the teaching efficacy of this new didactic video toolset.
摘要:数字视频遍布日常生活。移动视频,数字电视和数字电影现在无处不在,因此,数字视频处理(DVP)领域经历了巨大的增长。数字视频系统也渗透到科学和工程学科中,包括但不限于天文学,通信,监视,娱乐,视频编码,计算机视觉和视觉研究。结果,DVP的教育工具必须迎合大量不同的学生群体。为了加强DVP教育,我们创建了精心构建的教育工具库,旨在通过提供有关真实世界内容的DVP示例以及可组织众多关键DVP主题的用户友好界面来补充全面的在线讲座集。从模拟视频到人的视觉处理,再到现代视频编解码器等。目前,该演示库已在德克萨斯大学奥斯汀分校的“数字视频”研究生课程中得到有效使用。通过在视觉效果极强的讲座中学习理论,以及在画廊中观看具体实例,学生可以更深入地了解概念,从而捕捉到现代视频处理的基本原理之美。为了更好地了解这些工具的教育价值,我们进行了两次基于问卷的调查,以评估学生的背景,期望和结果。调查结果支持了这种新的教学视频工具集的教学效果。

53. MS-GWNN:multi-scale graph wavelet neural network for breast cancer diagnosis [PDF] 返回目录
  Mo Zhang, Quanzheng Li
Abstract: Breast cancer is one of the most common cancers in women worldwide, and early detection can significantly reduce its mortality rate. It is crucial to take multi-scale information of tissue structure into account in the detection of breast cancer. Thus, the key is to design an accurate computer-aided detection (CAD) system that captures multi-scale contextual features in cancerous tissue. In this work, we present a novel graph convolutional neural network for histopathological image classification of breast cancer. The new method, named multi-scale graph wavelet neural network (MS-GWNN), leverages the localization property of spectral graph wavelets to perform multi-scale analysis. By aggregating features at different scales, MS-GWNN can encode the multi-scale contextual interactions in a whole pathological slide. Experimental results on two public datasets demonstrate the superiority of the proposed method. Moreover, through ablation studies, we find that multi-scale analysis has a significant impact on the accuracy of cancer diagnosis.
摘要:乳腺癌是全世界女性中最常见的癌症之一,早期发现可以显着降低乳腺癌的死亡率。在检测乳腺癌时,必须考虑组织结构的多尺度信息。因此,设计精确的计算机辅助检测(CAD)系统以捕获癌组织中的多尺度上下文特征是关键。在这项工作中,我们提出了一种用于乳腺癌的组织病理学图像分类的新型图卷积神经网络。该新方法名为多尺度图小波神经网络(MS-GWNN),它利用频谱图小波的定位特性来进行多尺度分析。通过聚合不同尺度的特征,MS-GWNN可以对整个病理幻灯片中的多尺度上下文交互进行编码。在两个公共数据集上的实验结果证明了该方法的优越性。此外,通过消融研究,我们发现多尺度分析对癌症诊断的准确性有重大影响。

54. FPCC-Net: Fast Point Cloud Clustering for Instance Segmentation [PDF] 返回目录
  Yajun Xu, Shogo Arai, Diyi Liu, Fangzhou Lin, Kazuhiro Kosuge
Abstract: Instance segmentation is an important pre-processing task in numerous real-world applications, such as robotics, autonomous vehicles, and human-computer interaction. However, there has been little research on 3D point cloud instance segmentation of bin-picking scenes, in which multiple objects of the same class are stacked together. Compared with the rapid development of deep learning for two-dimensional (2D) image tasks, deep learning-based 3D point cloud segmentation still has much room for development. In such a situation, distinguishing a large number of occluded objects of the same class is a highly challenging problem. In a typical bin-picking scene, the object model is known and the number of object types is one; thus, the semantic information can be ignored, and the focus is put instead on the segmentation of instances. Based on this task requirement, we propose a network (FPCC-Net) that infers the feature center of each instance and then clusters the remaining points to the closest feature center in a feature embedding space. FPCC-Net comprises two subnets, one for inferring the feature centers for clustering and the other for describing the features of each point. The proposed method is compared with existing 3D point cloud and 2D segmentation methods in some bin-picking scenes. It is shown that FPCC-Net outperforms SGPN by about 40\% in average precision (AP) and can process about 60,000 points in about 0.8 s.
摘要:实例分割是机器人、自动驾驶汽车和人机交互等许多现实应用中的重要预处理任务。但是,针对同类多个物体堆叠在一起的料箱拣选(bin-picking)场景的3D点云实例分割研究很少。与二维(2D)图像任务深度学习的快速发展相比,基于深度学习的3D点云分割仍有很大的发展空间。在这种情况下,区分大量同类别的被遮挡物体是一个极具挑战性的问题。在通常的料箱拣选场景中,物体模型是已知的,且物体类型只有一种。因此,语义信息可以忽略,重点转而放在实例的分割上。基于此任务要求,我们提出了一个网络(FPCC-Net),该网络推断每个实例的特征中心,然后将其余点聚类到特征嵌入空间中最接近的特征中心。FPCC-Net包含两个子网,一个用于推断用于聚类的特征中心,另一个用于描述每个点的特征。在一些料箱拣选场景中,将该方法与现有的3D点云和2D分割方法进行了比较。结果表明,FPCC-Net的平均精度(AP)比SGPN高出约40%,并且可以在约0.8秒内处理约60,000个点。
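
Once the per-instance feature centers have been inferred, the clustering step described above reduces to a nearest-feature-center assignment in embedding space. A sketch with hypothetical shapes:

```python
import torch

def cluster_points(embeddings: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Assign each point to the closest feature center.

    embeddings: (N, D) per-point features from the description subnet
    centers:    (K, D) inferred feature centers, one per instance
    returns:    (N,)   instance id for every point
    """
    dists = torch.cdist(embeddings, centers)  # (N, K) pairwise distances
    return dists.argmin(dim=1)

labels = cluster_points(torch.randn(60000, 32), torch.randn(12, 32))
```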

55. Hierarchical Representation via Message Propagation for Robust Model Fitting [PDF] 返回目录
  Shuyuan Lin, Xing Wang, Guobao Xiao, Yan Yan, Hanzi Wang
Abstract: In this paper, we propose a novel hierarchical representation via message propagation (HRMP) method for robust model fitting, which simultaneously takes advantage of both consensus analysis and preference analysis to estimate the parameters of multiple model instances from data corrupted by outliers. Instead of analyzing the information of each data point or each model hypothesis independently, we formulate the consensus information and the preference information as a hierarchical representation to alleviate the sensitivity to gross outliers. Specifically, we first construct a hierarchical representation, which consists of a model hypothesis layer and a data point layer. The model hypothesis layer is used to remove insignificant model hypotheses and the data point layer is used to remove gross outliers. Then, based on the hierarchical representation, we propose an effective hierarchical message propagation (HMP) algorithm and an improved affinity propagation (IAP) algorithm to prune insignificant vertices and cluster the remaining data points, respectively. The proposed HRMP can not only accurately estimate the number and parameters of multiple model instances, but also handle multi-structural data contaminated with a large number of outliers. Experimental results on both synthetic data and real images show that the proposed HRMP significantly outperforms several state-of-the-art model fitting methods in terms of fitting accuracy and speed.
摘要:在本文中,我们提出了一种新颖的基于消息传播的层次表示(HRMP)方法,用于鲁棒模型拟合,该方法同时利用共识分析和偏好分析的优势,从被离群点污染的数据中估计多个模型实例的参数。我们没有独立地分析每个数据点或每个模型假设的信息,而是将共识信息和偏好信息表述为层次表示,以减轻对严重离群点的敏感性。具体来说,我们首先构造一个由模型假设层和数据点层组成的层次表示。模型假设层用于去除无关紧要的模型假设,数据点层用于去除严重离群点。然后,基于层次表示,我们提出了一种有效的层次消息传播(HMP)算法和一种改进的亲和传播(IAP)算法,分别用于修剪无关紧要的顶点和聚类其余数据点。所提出的HRMP不仅可以准确地估计多个模型实例的数量和参数,而且还可以处理被大量离群点污染的多结构数据。在合成数据和真实图像上的实验结果表明,所提出的HRMP在拟合精度和速度方面明显优于几种最新的模型拟合方法。

56. AU-Expression Knowledge Constrained Representation Learning for Facial Expression Recognition [PDF] 返回目录
  Tao Pu, Tianshui Chen, Yuan Xie, Hefeng Wu, Liang Lin
Abstract: Recognizing human emotion/expressions automatically is quite an expected ability for intelligent robotics, as it can promote better communication and cooperation with humans. Current deep-learning-based algorithms may achieve impressive performance in some lab-controlled environments, but they always fail to recognize the expressions accurately for the uncontrolled in-the-wild situation. Fortunately, facial action units (AU) describe subtle facial behaviors, and they can help distinguish uncertain and ambiguous expressions. In this work, we explore the correlations among the action units and facial expressions, and devise an AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework to learn the AU representations without AU annotations and adaptively use representations to facilitate facial expression recognition. Specifically, it leverages AU-expression correlations to guide the learning of the AU classifiers, and thus it can obtain AU representations without incurring any AU annotations. Then, it introduces a knowledge-guided attention mechanism that mines useful AU representations under the constraint of AU-expression correlations. In this way, the framework can capture local discriminative and complementary features to enhance facial representation for facial expression recognition. We conduct experiments on the challenging uncontrolled datasets to demonstrate the superiority of the proposed framework over current state-of-the-art methods.

57. MGML: Multi-Granularity Multi-Level Feature Ensemble Network for Remote Sensing Scene Classification [PDF] 返回目录
  Qi Zhao, Shuchang Lyu, Yuewen Li, Yujing Ma, Lijiang Chen
Abstract: Remote sensing (RS) scene classification is a challenging task of predicting the scene categories of RS images. RS images have two main characteristics: large intra-class variance caused by large resolution variance, and confusing information from the large geographic area each image covers. To ease the negative influence of these two characteristics, we propose a Multi-granularity Multi-Level Feature Ensemble Network (MGML-FENet) to efficiently tackle the RS scene classification task in this paper. Specifically, we propose a Multi-granularity Multi-Level Feature Fusion Branch (MGML-FFB) to extract multi-granularity features at different levels of the network via a channel-separate feature generator (CS-FG). To avoid interference from confusing information, we propose a Multi-granularity Multi-Level Feature Ensemble Module (MGML-FEM) which can provide diverse predictions via a full-channel feature generator (FC-FG). Compared to previous methods, our proposed networks are able to use structural information and abundant fine-grained features. Furthermore, through an ensemble learning method, our proposed MGML-FENets can obtain more convincing final predictions. Extensive classification experiments on multiple RS datasets (AID, NWPU-RESISC45, UC-Merced and VGoogle) demonstrate that our proposed networks achieve better performance than previous state-of-the-art (SOTA) networks. The visualization analysis also shows the good interpretability of MGML-FENet.

58. Visual Probing and Correction of Object Recognition Models with Interactive user feedback [PDF] 返回目录
  Viny Saajan Victor, Pramod Vadiraja, Jan-Tobias Sohns, Heike Leitte
Abstract: With the advent of state-of-the-art machine learning and deep learning technologies, several industries are moving into the field. Applications of such technologies are highly diverse, ranging from natural language processing to computer vision. Object recognition is one such area in the computer vision domain. Although proven to perform with high accuracy, there are still areas where such models can be improved. This is in fact highly important in sensitive real-world use cases like autonomous driving or cancer detection, which expect such technologies to have almost no uncertainty. In this paper, we attempt to visualise the uncertainties in object recognition models and propose a correction process via user feedback. We further demonstrate our approach on the data provided by the VAST 2020 Mini-Challenge 2.

59. Enhancing Handwritten Text Recognition with N-gram sequence decomposition and Multitask Learning [PDF] 返回目录
  Vasiliki Tassopoulou, George Retsinas, Petros Maragos
Abstract: Current state-of-the-art approaches in the field of Handwritten Text Recognition are predominantly single-task with unigram, character-level target units. In our work, we utilize a Multi-task Learning scheme, training the model to perform decompositions of the target sequence with target units of different granularity, from fine to coarse. We consider this method as a way to utilize n-gram information implicitly in the training process, while the final recognition is performed using only the unigram output. The unigram decoding of such a multi-task approach highlights the capability of the learned internal representations, imposed by the different n-grams at the training step. We select n-grams as our target units and we experiment from unigrams to fourgrams, namely subword-level granularities. These multiple decompositions are learned by the network with task-specific CTC losses. Concerning network architectures, we propose two alternatives, namely the Hierarchical and the Block Multi-task. Overall, our proposed model, even though evaluated only on the unigram task, outperforms its single-task counterpart by an absolute 2.52% WER and 1.02% CER in greedy decoding, without any computational overhead during inference, hinting at a successfully imposed implicit language model.
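The multi-task idea can be sketched as a shared encoder with one CTC head per target granularity (unigram and bigram here), each trained with its own nn.CTCLoss; at inference only the unigram head is decoded. Vocabulary sizes, sequence lengths, and the architecture below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiGranularityCTC(nn.Module):
    def __init__(self, feat_dim=64, uni_vocab=80, bi_vocab=500):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, 128, num_layers=2,
                               bidirectional=True, batch_first=True)
        # One linear head per decomposition; +1 for the CTC blank symbol.
        self.uni_head = nn.Linear(256, uni_vocab + 1)
        self.bi_head = nn.Linear(256, bi_vocab + 1)

    def forward(self, x):
        h, _ = self.encoder(x)
        return self.uni_head(h), self.bi_head(h)

model = MultiGranularityCTC()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
x = torch.randn(4, 120, 64)                   # (batch, time, features)
uni_logits, bi_logits = model(x)

uni_targets = torch.randint(1, 81, (4, 30))   # dummy unigram labels
bi_targets = torch.randint(1, 501, (4, 15))   # dummy bigram labels
in_lens = torch.full((4,), 120, dtype=torch.long)
uni_lens = torch.full((4,), 30, dtype=torch.long)
bi_lens = torch.full((4,), 15, dtype=torch.long)

# nn.CTCLoss expects (time, batch, classes) log-probabilities.
loss = (ctc(uni_logits.log_softmax(-1).transpose(0, 1), uni_targets, in_lens, uni_lens)
        + ctc(bi_logits.log_softmax(-1).transpose(0, 1), bi_targets, in_lens, bi_lens))
loss.backward()
# At inference, only the unigram head would be decoded, as in the abstract.
```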

60. Color Channel Perturbation Attacks for Fooling Convolutional Neural Networks and A Defense Against Such Attacks [PDF] 返回目录
  Jayendra Kantipudi, Shiv Ram Dubey, Soumendu Chakraborty
Abstract: Convolutional Neural Networks (CNNs) have emerged as a very powerful data-dependent hierarchical feature extraction method. They are widely used in several computer vision problems. CNNs learn the important visual features from training samples automatically. It is observed that the network overfits the training samples very easily. Several regularization methods have been proposed to avoid this overfitting. In spite of this, the network remains sensitive to the color distribution within the images, which is ignored by existing approaches. In this paper, we expose the color robustness problem of CNNs by proposing a Color Channel Perturbation (CCP) attack to fool them. In the CCP attack, new images are generated with new channels created by combining the original channels with stochastic weights. Experiments were carried out over the widely used CIFAR10, Caltech256 and TinyImageNet datasets in the image classification framework. The VGG, ResNet and DenseNet models are used to test the impact of the proposed attack. It is observed that the performance of the CNNs degrades drastically under the proposed CCP attack. The results show the effect of the proposed simple CCP attack on the robustness of trained CNN models. The results are also compared with existing CNN fooling approaches to evaluate the accuracy drop. We also propose a primary defense mechanism against this problem by augmenting the training dataset with the proposed CCP attack. State-of-the-art performance in terms of CNN robustness under the CCP attack is observed in the experiments using the proposed solution. The code is made publicly available at \url{this https URL}.
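The channel-mixing step can be illustrated in a few lines: each new channel is a stochastic combination of the original R, G, B channels. The weight sampling and normalization below are assumptions for illustration; the paper's exact sampling scheme may differ.

```python
import numpy as np

def ccp_perturb(image, rng):
    """Color Channel Perturbation (sketch): each output channel is a
    stochastic mix of the input channels. image: HxWx3 float in [0, 1]."""
    W = rng.uniform(0.0, 1.0, size=(3, 3))
    W /= W.sum(axis=1, keepdims=True)   # row-normalize to keep intensities in range
    return np.clip(image @ W.T, 0.0, 1.0)

rng = np.random.default_rng(42)
img = rng.uniform(0, 1, size=(32, 32, 3))
attacked = ccp_perturb(img, rng)
print(attacked.shape)  # (32, 32, 3): same image content, shifted color distribution
```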

61. Deep Learning Towards Edge Computing: Neural Networks Straight from Compressed Data [PDF] 返回目录
  Samuel Felipe dos Santos, Jurandy Almeida
Abstract: Due to the popularization and growing computational power of mobile phones, as well as advances in artificial intelligence, many intelligent applications have been developed, meaningfully enriching people's lives. For this reason, there is a growing interest in the area of edge intelligence, which aims to push the computation of data to the edges of the network in order to make those applications more efficient and secure. Many intelligent applications rely on deep learning models, like convolutional neural networks (CNNs). Over the past decade, they have achieved state-of-the-art performance in many computer vision tasks. To increase the performance of these methods, the trend has been to use increasingly deep architectures with more parameters, leading to a high computational cost. Indeed, this is one of the main problems faced by deep architectures, limiting their applicability in domains with limited computational resources, like edge devices. To alleviate the computational complexity, we propose a deep neural network capable of learning straight from the relevant information pertaining to visual content readily available in the compressed representation used for image and video storage and transmission. The novelty of our approach is that it was designed to operate directly on frequency-domain data, learning with DCT coefficients rather than RGB pixels. This avoids the high computational load of fully decoding the data stream and therefore greatly speeds up processing time, which has become a big bottleneck of deep learning. We evaluated our network on two challenging tasks: (1) image classification on the ImageNet dataset and (2) video classification on the UCF-101 and HMDB-51 datasets. Our results demonstrate effectiveness comparable to the state-of-the-art methods in terms of accuracy, with the advantage of being more computationally efficient.
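A minimal sketch of the input side of such a pipeline: compute a blockwise 2-D DCT (8x8 blocks, as in JPEG) and feed the coefficients, rather than decoded pixels, to the network. The block size mirrors JPEG; using SciPy here and treating the result as CNN input are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.fftpack import dct

def blockwise_dct(gray, block=8):
    """2-D DCT per 8x8 block, the representation produced during JPEG decoding."""
    h, w = gray.shape
    out = np.zeros_like(gray, dtype=np.float64)
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = gray[i:i + block, j:j + block]
            out[i:i + block, j:j + block] = dct(
                dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
    return out

gray = np.random.rand(64, 64)
coeffs = blockwise_dct(gray)
# `coeffs` (not the pixels) would be the CNN input, skipping full decoding.
print(coeffs.shape)
```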

62. EC-GAN: Low-Sample Classification using Semi-Supervised Algorithms and GANs [PDF] 返回目录
  Ayaan Haque
Abstract: Semi-supervised learning has been gaining attention as it allows for performing image analysis tasks such as classification with limited labeled data. Some popular algorithms using Generative Adversarial Networks (GANs) for semi-supervised classification share a single architecture for classification and discrimination. However, this may require a model to converge to a separate data distribution for each task, which may reduce overall performance. While progress has been made in semi-supervised learning, less attention has been paid to small-scale, fully-supervised tasks where even unlabeled data is unavailable. We therefore propose a novel GAN model, namely External Classifier GAN (EC-GAN), that utilizes GANs and semi-supervised algorithms to improve classification in fully-supervised regimes. Our method leverages a GAN to generate artificial data used to supplement supervised classification. More specifically, we attach an external classifier, hence the name EC-GAN, to the GAN's generator, as opposed to sharing an architecture with the discriminator. Our experiments demonstrate that EC-GAN's performance is comparable to the shared architecture method, far superior to the standard data augmentation and regularization-based approach, and effective on a small, realistic dataset.
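The key move, an external classifier trained jointly with the GAN, amounts to a loss with two terms: standard cross-entropy on real labeled images, plus cross-entropy on generator samples against their own confident pseudo-labels. A hedged PyTorch sketch of that loss composition; the confidence threshold, weighting, and toy classifier are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ec_gan_classifier_loss(classifier, real_x, real_y, fake_x,
                           confidence=0.7, weight=0.1):
    """Supervised CE on real data plus pseudo-labeled CE on GAN samples."""
    sup_loss = F.cross_entropy(classifier(real_x), real_y)

    with torch.no_grad():
        probs = F.softmax(classifier(fake_x), dim=1)
        conf, pseudo_y = probs.max(dim=1)
        keep = conf > confidence            # trust only confident samples
    if keep.any():
        unsup_loss = F.cross_entropy(classifier(fake_x[keep]), pseudo_y[keep])
    else:
        unsup_loss = torch.zeros((), device=real_x.device)
    return sup_loss + weight * unsup_loss

classifier = torch.nn.Sequential(torch.nn.Flatten(),
                                 torch.nn.Linear(3 * 32 * 32, 10))
real_x, real_y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
fake_x = torch.randn(8, 3, 32, 32)          # stand-in for generator output
print(ec_gan_classifier_loss(classifier, real_x, real_y, fake_x))
```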

63. Language-Mediated, Object-Centric Representation Learning [PDF] 返回目录
  Ruocheng Wang, Jiayuan Mao, Samuel J. Gershman, Jiajun Wu
Abstract: We present Language-mediated, Object-centric Representation Learning (LORL), a paradigm for learning disentangled, object-centric scene representations from vision and language. LORL builds upon recent advances in unsupervised object segmentation, notably MONet and Slot Attention. While these algorithms learn an object-centric representation just by reconstructing the input image, LORL enables them to further learn to associate the learned representations with concepts, i.e., words for object categories, properties, and spatial relationships, from language input. These object-centric concepts derived from language facilitate the learning of object-centric representations. LORL can be integrated with various unsupervised segmentation algorithms that are language-agnostic. Experiments show that the integration of LORL consistently improves the object segmentation performance of MONet and Slot Attention on two datasets via the help of language. We also show that concepts learned by LORL, in conjunction with segmentation algorithms such as MONet, aid downstream tasks such as referring expression comprehension.

64. Estimating Uncertainty in Neural Networks for Cardiac MRI Segmentation: A Benchmark Study [PDF] 返回目录
  Matthew Ng, Fumin Guo, Labonny Biswas, Steffen E. Petersen, Stefan K. Piechnik, Stefan Neubauer, Graham Wright
Abstract: Convolutional neural networks (CNNs) have demonstrated promise in automated cardiac magnetic resonance imaging segmentation. However, when using CNNs in a large real world dataset, it is important to quantify segmentation uncertainty in order to know which segmentations could be problematic. In this work, we performed a systematic study of Bayesian and non-Bayesian methods for estimating uncertainty in segmentation neural networks. We evaluated Bayes by Backprop (BBB), Monte Carlo (MC) Dropout, and Deep Ensembles in terms of segmentation accuracy, probability calibration, uncertainty on out-of-distribution images, and segmentation quality control. We tested these algorithms on datasets with various distortions and observed that Deep Ensembles outperformed the other methods except for images with heavy noise distortions. For segmentation quality control, we showed that segmentation uncertainty is correlated with segmentation accuracy. With the incorporation of uncertainty estimates, we were able to reduce the percentage of poor segmentation to 5% by flagging 31% to 48% of the most uncertain images for manual review, substantially lower than random review of the results without using neural network uncertainty.
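Of the compared uncertainty estimators, MC Dropout is the simplest to reproduce: keep dropout active at test time, run several stochastic forward passes, and use the spread (here, predictive entropy) as a per-pixel uncertainty map. A minimal PyTorch sketch under those assumptions; the tiny network and 20 passes are illustrative, not the paper's setup.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                    nn.Dropout2d(0.5),
                    nn.Conv2d(8, 2, 1))      # 2-class segmentation head

def mc_dropout_predict(net, x, passes=20):
    net.train()          # keep dropout stochastic at inference time
    with torch.no_grad():
        probs = torch.stack([net(x).softmax(dim=1) for _ in range(passes)])
    mean = probs.mean(dim=0)                            # averaged prediction
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(dim=1)
    return mean, entropy                                # entropy = uncertainty map

x = torch.randn(1, 1, 64, 64)
mean, uncertainty = mc_dropout_predict(net, x)
print(mean.shape, uncertainty.shape)  # (1, 2, 64, 64), (1, 64, 64)
```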

65. Overview of MediaEval 2020 Predicting Media Memorability Task: What Makes a Video Memorable? [PDF] 返回目录
  Alba García Seco De Herrera, Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor, Bogdan Ionescu, Alan F. Smeaton
Abstract: This paper describes the MediaEval 2020 "Predicting Media Memorability" task. After first being proposed at MediaEval 2018, the Predicting Media Memorability task is in its 3rd edition this year, as the prediction of short-term and long-term video memorability (VM) remains a challenging task. In 2020, the format remained the same as in previous editions. This year the videos are a subset of the TRECVid 2019 Video-to-Text dataset, containing more action-rich video content compared with the 2019 task. In this paper a description of some aspects of this task is provided, including its main characteristics, a description of the collection, the ground truth dataset, evaluation metrics, and the requirements for participants' run submissions.

66. Investigating Memorability of Dynamic Media [PDF] 返回目录
  Phuc H. Le-Khac, Ayush K. Rai, Graham Healy, Alan F. Smeaton, Noel E. O'Connor
Abstract: The Predicting Media Memorability task in MediaEval'20 has some challenging aspects compared to previous years. In this paper we identify the highly dynamic content of the videos and the limited size of the dataset as the core challenges for the task; we propose directions to overcome some of these challenges and present our initial results in these directions.

67. Leveraging Audio Gestalt to Predict Media Memorability [PDF] 返回目录
  Lorin Sweeney, Graham Healy, Alan F. Smeaton
Abstract: Memorability determines what evanesces into emptiness, and what worms its way into the deepest furrows of our minds. It is the key to curating more meaningful media content as we wade through daily digital torrents. The Predicting Media Memorability task in MediaEval 2020 aims to address the question of media memorability by setting the task of automatically predicting video memorability. Our approach is a multimodal deep learning-based late fusion that combines visual, semantic, and auditory features. We used audio gestalt to estimate the influence of the audio modality on overall video memorability, and accordingly inform which combination of features would best predict a given video's memorability scores.

68. Searching a Raw Video Database using Natural Language Queries [PDF] 返回目录
  Sriram Krishna, Siddarth Vinay, Srinivas K S
Abstract: The number of videos being produced, and consequently stored in databases for video streaming platforms, has been increasing exponentially over time. This vast database should be easily indexable, so that the requisite clip or video can be found to match a given search specification, preferably in the form of a textual query. This work aims to provide an end-to-end pipeline to search a video database with a voice query from the end user. The pipeline makes use of Recurrent Neural Networks in combination with Convolutional Neural Networks to generate captions of the video clips present in the database.

69. Exploiting Shared Knowledge from Non-COVID Lesions for Annotation-Efficient COVID-19 CT Lung Infection Segmentation [PDF] 返回目录
  Yichi Zhang, Qingcheng Liao, Lin Yuan, He Zhu, Jiezhen Xing, Jicong Zhang
Abstract: The novel Coronavirus disease (COVID-19) is highly contagious and has spread all over the world, posing an extremely serious threat to all countries. Automatic lung infection segmentation from computed tomography (CT) plays an important role in the quantitative analysis of COVID-19. However, the major challenge lies in the inadequacy of annotated COVID-19 datasets. Currently, there are several public non-COVID lung lesion segmentation datasets, providing the potential for generalizing useful information to the related COVID-19 segmentation task. In this paper, we propose a novel relation-driven collaborative learning model for annotation-efficient COVID-19 CT lung infection segmentation. The network consists of encoders with the same architecture and a shared decoder. The general encoder is adopted to capture general lung lesion features from multiple non-COVID lesions, while the target encoder is adopted to focus on task-specific features of COVID-19 infections. Features extracted from the two parallel encoders are concatenated for the subsequent decoder part. To thoroughly exploit shared knowledge between COVID and non-COVID lesions, we develop a collaborative learning scheme to regularize the relation consistency between extracted features of given inputs. Unlike existing consistency-based methods that simply enforce the consistency of individual predictions, our method enforces the consistency of feature relations among samples, encouraging the model to explore semantic information from both COVID-19 and non-COVID cases. Extensive experiments on one public COVID-19 dataset and two public non-COVID datasets show that our method achieves superior segmentation performance compared with existing methods in the absence of sufficient high-quality COVID-19 annotations.
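The encoder layout described above, two parallel encoders whose features are concatenated before a shared decoder, is easy to sketch. The block below is a deliberately tiny stand-in; channel widths, depth, and single-scale decoding are illustrative assumptions, and the paper's network is much deeper.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class DualEncoderSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.general_encoder = conv_block(1, 16)   # trained on non-COVID lesions
        self.target_encoder = conv_block(1, 16)    # focuses on COVID-specific cues
        self.decoder = nn.Sequential(conv_block(32, 16), nn.Conv2d(16, 1, 1))

    def forward(self, x):
        feats = torch.cat([self.general_encoder(x), self.target_encoder(x)], dim=1)
        return self.decoder(feats)                 # per-pixel infection logits

net = DualEncoderSegNet()
print(net(torch.randn(2, 1, 128, 128)).shape)     # (2, 1, 128, 128)
```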

70. Colonoscopy Polyp Detection: Domain Adaptation From Medical Report Images to Real-time Videos [PDF] 返回目录
  Zhi-Qin Zhan, Huazhu Fu, Yan-Yao Yang, Jingjing Chen, Jie Liu, Yu-Gang Jiang
Abstract: Automatic colorectal polyp detection in colonoscopy video is a fundamental task which has received a lot of attention. Manually annotating polyp regions in a large-scale video dataset is time-consuming and expensive, which limits the development of deep learning techniques. A compromise is to train the target model using labeled images and infer on colonoscopy videos. However, there are several issues between image-based training and video-based inference, including domain differences, lack of positive samples, and temporal smoothness. To address these issues, we propose an Image-video-joint polyp detection network (Ivy-Net) to bridge the domain gap between colonoscopy images from historical medical reports and real-time videos. In our Ivy-Net, a modified mixup is utilized to generate training data by combining positive images and negative video frames at the pixel level, which helps learn domain-adaptive representations and augments the positive samples. Simultaneously, a temporal coherence regularization (TCR) is proposed to introduce a smoothness constraint on the features of adjacent frames and improve polyp detection using unlabeled colonoscopy videos. For evaluation, a new large colonoscopy polyp dataset is collected, which contains 3056 images from historical medical reports of 889 positive patients and 7.5 hours of video of 69 patients (28 positive). The experiments on the collected dataset demonstrate that our Ivy-Net achieves the state-of-the-art result on colonoscopy video.
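A plain pixel-level mixup between a labeled report image and a negative video frame illustrates the data-generation step; the paper's "modified" variant differs in details not reproduced here, and the Beta parameter below is a conventional choice, not the paper's.

```python
import numpy as np

def pixel_mixup(pos_img, neg_frame, alpha=0.4, rng=None):
    """Blend a labeled positive image with a negative video frame."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    mixed = lam * pos_img + (1.0 - lam) * neg_frame
    return mixed, lam      # lam weights the positive target during training

rng = np.random.default_rng(0)
pos = rng.uniform(0, 1, (256, 256, 3))   # polyp image from a medical report
neg = rng.uniform(0, 1, (256, 256, 3))   # polyp-free colonoscopy frame
mixed, lam = pixel_mixup(pos, neg, rng=rng)
print(mixed.shape, round(lam, 3))
```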

71. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [PDF] 返回目录
  Wei-Ning Hsu, David Harwath, Christopher Song, James Glass
Abstract: In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as drop-in replacements for text.

72. Survey of the Detection and Classification of Pulmonary Lesions via CT and X-Ray [PDF] 返回目录
  Yixuan Sun, Chengyao Li, Qian Zhang, Aimin Zhou, Guixu Zhang
Abstract: In recent years, the prevalence of several pulmonary diseases, especially the coronavirus disease 2019 (COVID-19) pandemic, has attracted worldwide attention. These diseases can be effectively diagnosed and treated with the help of lung imaging. With the development of deep learning technology and the emergence of many public medical image datasets, the diagnosis of lung diseases via medical imaging has been further improved. This article reviews pulmonary CT and X-ray image detection and classification in the last decade. It also provides an overview of the detection of lung nodules, pneumonia, and other common lung lesions based on the imaging characteristics of various lesions. Furthermore, this review introduces 26 commonly used public medical image datasets, summarizes the latest technology, and discusses current challenges and future research directions.

73. New Bag of Deep Visual Words based features to classify chest x-ray images for COVID-19 diagnosis [PDF] 返回目录
  Chiranjibi Sitaula, Sunil Aryal
Abstract: Because infection by Severe Acute Respiratory Syndrome Coronavirus 2 (COVID-19) causes a pneumonia-like effect in the lungs, the examination of chest x-rays can help to diagnose the disease. For automatic analysis of images, they are represented in machines by a set of semantic features. Deep Learning (DL) models are widely used to extract features from images. General deep features may not be appropriate to represent chest x-rays as they have few semantic regions. Though Bag of Visual Words (BoVW)-based features have been shown to be more appropriate for x-ray images, existing BoVW features may not capture enough information to differentiate COVID-19 infection from other pneumonia-related infections. In this paper, we propose a new BoVW method over deep features, called Bag of Deep Visual Words (BoDVW), by removing the feature-map normalization step and adding a deep-feature normalization step on the raw feature maps. This helps to preserve the semantics of each feature map, which may hold important clues to differentiate COVID-19 from pneumonia. We evaluate the effectiveness of our proposed BoDVW features in chest x-ray classification using a Support Vector Machine (SVM) to diagnose COVID-19. Our results on a publicly available COVID-19 x-ray dataset reveal that our features produce stable and prominent classification accuracy, particularly in differentiating COVID-19 infection from other pneumonia, in shorter computation time compared to the state-of-the-art methods. Thus, our method could be a very useful tool for quick diagnosis of COVID-19 patients on a large scale.
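The overall BoVW-over-deep-features pipeline (deep patch features, feature normalization, a k-means visual vocabulary, per-image histograms, and an SVM) can be sketched with scikit-learn. The random arrays stand in for deep feature maps, and the vocabulary size is an illustrative assumption rather than the paper's setting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for deep feature maps: n_images x n_patches x feat_dim.
features = rng.standard_normal((40, 50, 64))
labels = rng.integers(0, 2, 40)

# Deep-feature normalization on the raw vectors (the BoDVW modification).
norm = features / np.linalg.norm(features, axis=-1, keepdims=True)

vocab = KMeans(n_clusters=16, n_init=10, random_state=0)
vocab.fit(norm.reshape(-1, 64))

def bodvw_histogram(img_feats):
    """Assign each patch feature to a visual word and histogram the counts."""
    words = vocab.predict(img_feats)
    hist = np.bincount(words, minlength=16).astype(float)
    return hist / hist.sum()

X = np.stack([bodvw_histogram(f) for f in norm])
clf = SVC(kernel='linear').fit(X, labels)   # SVM on bag-of-deep-visual-words
print(clf.score(X, labels))
```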

74. FREA-Unet: Frequency-aware U-net for Modality Transfer [PDF] 返回目录
  Hajar Emami, Qiong Liu, Ming Dong
Abstract: While Positron emission tomography (PET) imaging has been widely used in the diagnosis of a number of diseases, it has a costly acquisition process that involves exposing patients to radiation. In contrast, magnetic resonance imaging (MRI) is a safer imaging modality that does not involve exposure to radiation. Therefore, a need exists for efficient and automated PET image generation from MRI data. In this paper, we propose a new frequency-aware attention U-net for generating synthetic PET images. Specifically, we incorporate an attention mechanism into the different U-net layers responsible for estimating the low/high frequency scales of the image. Our frequency-aware attention U-net computes attention scores for feature maps in the low/high frequency layers and uses them to help the model focus more on the most important regions, leading to more realistic output images. Experimental results on 30 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate good performance of the proposed model in PET image synthesis, achieving superior performance, both qualitative and quantitative, over the current state-of-the-art.

75. Model-Based Visual Planning with Self-Supervised Functional Distances [PDF] 返回目录
  Stephen Tian, Suraj Nair, Frederik Ebert, Sudeep Dasari, Benjamin Eysenbach, Chelsea Finn, Sergey Levine
Abstract: A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test-time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods. Videos and visualizations are available here: this http URL.

76. H2NF-Net for Brain Tumor Segmentation using Multimodal MR Imaging: 2nd Place Solution to BraTS Challenge 2020 Segmentation Task [PDF] 返回目录
  Haozhe Jia, Weidong Cai, Heng Huang, Yong Xia
Abstract: In this paper, we propose a Hybrid High-resolution and Non-local Feature Network (H2NF-Net) to segment brain tumors in multimodal MR images. Our H2NF-Net uses single and cascaded HNF-Nets to segment different brain tumor sub-regions and combines the predictions together as the final segmentation. We trained and evaluated our model on the Multimodal Brain Tumor Segmentation Challenge (BraTS) 2020 dataset. The results on the test set show that the combination of the single and cascaded models achieved average Dice scores of 0.78751, 0.91290, and 0.85461, as well as Hausdorff distances (95%) of 26.57525, 4.18426, and 4.97162 for the enhancing tumor, whole tumor, and tumor core, respectively. Our method won second place in the BraTS 2020 challenge segmentation task among nearly 80 participants.

77. MRI brain tumor segmentation and uncertainty estimation using 3D-UNet architectures [PDF] 返回目录
  Laura Mora Ballestar, Veronica Vilaplana
Abstract: Automation of brain tumor segmentation in 3D magnetic resonance images (MRIs) is key to assessing the diagnosis and treatment of the disease. In recent years, convolutional neural networks (CNNs) have shown improved results on the task. However, high memory consumption is still a problem in 3D-CNNs. Moreover, most methods do not include uncertainty information, which is especially critical in medical diagnosis. This work studies 3D encoder-decoder architectures trained with patch-based techniques to reduce memory consumption and decrease the effect of unbalanced data. The different trained models are then used to create an ensemble that leverages the properties of each model, thus increasing performance. We also introduce voxel-wise uncertainty information, both epistemic and aleatoric, using test-time dropout (TTD) and test-time data augmentation (TTA), respectively. In addition, a hybrid approach is proposed that helps increase the accuracy of the segmentation. The model and uncertainty estimation measurements proposed in this work have been used in the BraTS'20 Challenge for tasks 1 and 3 regarding tumor segmentation and uncertainty estimation.

78. Some Algorithms on Exact, Approximate and Error-Tolerant Graph Matching [PDF] 返回目录
  Shri Prakash Dwivedi
Abstract: The graph is one of the most widely used mathematical structures in engineering and science because of its representational power and inherent ability to demonstrate relationships between objects. The objective of this work is to introduce novel graph matching techniques using the representational power of the graph and apply them to structural pattern recognition applications. We present an extensive survey of various exact and inexact graph matching techniques. Graph matching using the concept of homeomorphism is presented. A category of graph matching algorithms is presented, which reduces the graph size by removing the less important nodes using some measure of relevance. We present an approach to error-tolerant graph matching using node contraction, where the given graph is transformed into another graph by contracting smaller-degree nodes. We use this scheme to extend the notion of graph edit distance, which can be used as a trade-off between execution time and accuracy. We describe an approach to graph matching that utilizes various node centrality information, reducing the graph size by removing a fraction of nodes from both graphs based on a given centrality measure. The graph matching problem is inherently linked to the geometry and topology of graphs. We introduce a novel approach to measure graph similarity using geometric graphs. We define the vertex distance between two geometric graphs using the positions of their vertices and show it to be a metric over the set of all graphs with vertices only. We define the edge distance between two graphs based on the angular orientation, length and position of the edges. Then we combine the notions of vertex distance and edge distance to define the graph distance between two geometric graphs and show it to be a metric. Finally, we use the proposed graph similarity framework to perform exact and error-tolerant graph matching.
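A vertex distance between geometric graphs can be illustrated directly: embed both vertex sets in the plane, match vertices under an optimal assignment, and charge a fixed padding cost for unmatched vertices when sizes differ. This is a simplified sketch of the idea, not the paper's exact construction; the padding cost is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def vertex_distance(pos_a, pos_b, pad_cost=1.0):
    """Positional distance between two geometric graphs' vertex sets.
    Unmatched vertices (size mismatch) each pay a fixed padding cost."""
    n, m = len(pos_a), len(pos_b)
    size = max(n, m)
    cost = np.full((size, size), pad_cost)
    cost[:n, :m] = np.linalg.norm(pos_a[:, None, :] - pos_b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # optimal vertex matching
    return cost[rows, cols].sum()

g1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
g2 = np.array([[0.1, 0.0], [1.0, 0.1], [0.5, 0.9], [2.0, 2.0]])
print(vertex_distance(g1, g2))  # small matched costs plus one padding cost
```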

79. Automatic Polyp Segmentation using U-Net-ResNet50 [PDF] 返回目录
  Saruar Alam, Nikhil Kumar Tomar, Aarati Thakur, Debesh Jha, Ashish Rauniyar
Abstract: Polyps are the precursors to colorectal cancer, which is considered one of the leading causes of cancer-related deaths worldwide. Colonoscopy is the standard procedure for the identification, localization, and removal of colorectal polyps. Due to variability in shape, size, and surrounding tissue similarity, colorectal polyps are often missed by clinicians during colonoscopy. With the use of an automatic, accurate, and fast polyp segmentation method during colonoscopy, many colorectal polyps can be easily detected and removed. The "Medico automatic polyp segmentation challenge" provides an opportunity to study polyp segmentation and build an efficient and accurate segmentation algorithm. We use a U-Net with pre-trained ResNet50 as the encoder for polyp segmentation. The model is trained on the Kvasir-SEG dataset provided for the challenge and tested on the organizer's dataset, achieving a Dice coefficient of 0.8154, Jaccard of 0.7396, recall of 0.8533, precision of 0.8532, accuracy of 0.9506, and F2 score of 0.8272, demonstrating the generalization ability of our model.

80. DDANet: Dual Decoder Attention Network for Automatic Polyp Segmentation [PDF] 返回目录
  Nikhil Kumar Tomar, Debesh Jha, Sharib Ali, Håvard D. Johansen, Dag Johansen, Michael A. Riegler, Pål Halvorsen
Abstract: Colonoscopy is the gold standard for examination and detection of colorectal polyps. Localization and delineation of polyps can play a vital role in treatment (e.g., surgical planning) and prognostic decision making. Polyp segmentation can provide detailed boundary information for clinical analysis. Convolutional neural networks have improved performance in colonoscopy. However, polyps usually present various challenges, such as intra- and inter-class variation and noise. While manual labeling for polyp assessment requires time from experts and is prone to human error (e.g., missed lesions), automated, accurate, and fast segmentation can improve the quality of delineated lesion boundaries and reduce the miss rate. The Endotect challenge provides an opportunity to benchmark computer vision methods by training on the publicly available Hyperkvasir dataset and testing on a separate, unseen dataset. In this paper, we propose a novel architecture called "DDANet" based on a dual decoder attention network. Our experiments demonstrate that the model, trained on the Kvasir-SEG dataset and tested on an unseen dataset, achieves a Dice coefficient of 0.7874, mIoU of 0.7010, recall of 0.7987, and precision of 0.8577, demonstrating the generalization ability of our model.

81. Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation [PDF] 返回目录
  Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard Johansen, Dag Johansen, Thomas de Lange, Michael A. Riegler, Pål Halvorsen
Abstract: Colorectal cancer is the third most common cause of cancer worldwide. According to Global Cancer Statistics 2018, the incidence of colorectal cancer is increasing in both developing and developed countries. Early detection of colon anomalies such as polyps is important for cancer prevention, and automatic polyp segmentation can play a crucial role in this. Regardless of recent advancements in early detection and treatment options, the estimated polyp miss rate is still around 20%. Support from an automated computer-aided diagnosis system could be one of the potential solutions for overlooked polyps. Such detection systems can enable low-cost solutions and save doctors time, which they could use, for example, to perform more patient examinations. In this paper, we introduce the 2020 Medico challenge, provide some information on related work and the dataset, describe the task and evaluation metrics, and discuss the necessity of organizing the Medico challenge.

82. Exploring Large Context for Cerebral Aneurysm Segmentation [PDF] 返回目录
  Jun Ma, Ziwei Nie
Abstract: Automated segmentation of aneurysms from 3D CT is important for the diagnosis, monitoring, and treatment planning of cerebral aneurysm disease. This short paper briefly presents the main technical details of the aneurysm segmentation method in the MICCAI 2020 CADA challenge. The main contribution is that we configure the 3D U-Net with a large patch size, which allows it to capture a large context. Our method ranked second on the MICCAI 2020 CADA testing dataset with an average Jaccard of 0.7593. Our code and trained models are publicly available at \url{this https URL}.

83. Fast Hyperspectral Image Recovery via Non-iterative Fusion of Dual-Camera Compressive Hyperspectral Imaging [PDF] 返回目录
  Wei He, Naoto Yokoya, Xin Yuan
Abstract: Coded aperture snapshot spectral imaging (CASSI) is a promising technique to capture a three-dimensional hyperspectral image (HSI) using a single coded two-dimensional (2D) measurement, with algorithms used to solve the inverse problem. Due to the ill-posed nature of the problem, various regularizers have been exploited to reconstruct the 3D data from the 2D measurement. Unfortunately, the accuracy and computational complexity remain unsatisfactory. One feasible solution is to utilize additional information, such as the RGB measurement in CASSI. Considering the combined CASSI and RGB measurements, in this paper we propose a new fusion model for HSI reconstruction. We investigate the spectral low-rank property of an HSI composed of a spectral basis and spatial coefficients. Specifically, the RGB measurement is utilized to estimate the coefficients, while the CASSI measurement is adopted to provide the orthogonal spectral basis. We further propose a patch processing strategy to enhance the spectral low-rank property of the HSI. The proposed model requires neither non-local processing nor iteration, nor the spectral sensing matrix of the RGB detector. Extensive experiments on both simulated and real HSI datasets demonstrate that our proposed method not only outperforms the previous state-of-the-art in quality but also speeds up the reconstruction by more than 5000 times.
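The spectral low-rank model underlying the fusion writes the HSI as a spectral basis times spatial coefficients, X ≈ E A. The toy sketch below recovers such a factorization with an SVD, standing in for the roles the paper assigns to the CASSI measurement (basis) and the RGB measurement (coefficients); estimating these from real coded measurements is of course far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
bands, pixels, rank = 31, 64 * 64, 4
E_true = rng.standard_normal((bands, rank))        # spectral basis
A_true = rng.standard_normal((rank, pixels))       # spatial coefficients
X = E_true @ A_true                                # low-rank hyperspectral cube

U, s, Vt = np.linalg.svd(X, full_matrices=False)
E = U[:, :rank]                    # basis: the role of the CASSI branch
A = s[:rank, None] * Vt[:rank]     # coefficients: the role of the RGB branch
print(np.linalg.norm(X - E @ A) / np.linalg.norm(X))  # ~0: exact recovery
```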

84. Accurate Word Representations with Universal Visual Guidance [PDF] 返回目录
  Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama
Abstract: Word representation is a fundamental component in neural language understanding models. Recently, pre-trained language models (PrLMs) have offered a new, performant method of contextualized word representations by leveraging sequence-level context for modeling. Although PrLMs generally give more accurate contextualized word representations than non-contextualized models do, they are still limited to textual context and lack the diverse hints for word representation that multimodality can provide. This paper thus proposes a visual representation method to explicitly enhance conventional word embeddings with multi-aspect senses from visual guidance. In detail, we build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images. The texts and paired images are encoded in parallel, followed by an attention layer to integrate the multimodal representations. We show that the method substantially improves the accuracy of disambiguation. Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.

85. Unpaired Image Enhancement with Quality-Attention Generative Adversarial Network [PDF] 返回目录
  Zhangkai Ni, Wenhan Yang, Shiqi Wang, Lin Ma, Sam Kwong
Abstract: In this work, we aim to learn an unpaired image enhancement model, which can enrich low-quality images with the characteristics of high-quality images provided by users. We propose a quality attention generative adversarial network (QAGAN) trained on unpaired data, based on a bidirectional Generative Adversarial Network (GAN) embedded with a quality attention module (QAM). The key novelty of the proposed QAGAN lies in the QAM injected into the generator, such that it learns domain-relevant quality attention directly from the two domains. More specifically, the proposed QAM allows the generator to effectively select semantics-related characteristics spatial-wise and to adaptively incorporate style-related attributes channel-wise. Therefore, in our proposed QAGAN, not only the discriminators but also the generator can directly access both domains, which significantly helps the generator learn the mapping function. Extensive experimental results show that, compared with state-of-the-art methods based on unpaired learning, our proposed method achieves better performance in both objective and subjective evaluations.

86. A Review of Machine Learning Techniques for Applied Eye Fundus and Tongue Digital Image Processing with Diabetes Management System [PDF] 返回目录
  Wei Xiang Lim, Zhiyuan Chen, Amr Ahmed, Tissa Chandesa, Iman Liao
Abstract: Diabetes is a global epidemic and it is increasing at an alarming rate. The International Diabetes Federation (IDF) projected that the total number of people with diabetes globally may increase by 48%, from 425 million (2017) to 629 million (2045). Moreover, diabetes has caused millions of deaths and the number is increasing drastically. Therefore, this paper addresses the background of diabetes and its complications. In addition, this paper investigates innovative applications and past research in the area of diabetes management systems using eye fundus and tongue digital images. We review the different types of existing eye fundus and tongue digital image processing applications with diabetes management systems on the market, as well as state-of-the-art machine learning techniques from previous literature. The aim of this paper is to provide an overview of diabetes research and of the new machine learning techniques that can be proposed to tackle this global epidemic.

87. DeepSphere: a graph-based spherical CNN [PDF] 返回目录
  Michaël Defferrard, Martino Milani, Frédérick Gusset, Nathanaël Perraudin
Abstract: Designing a convolution for a spherical neural network requires a delicate tradeoff between efficiency and rotation equivariance. DeepSphere, a method based on a graph representation of the sampled sphere, strikes a controllable balance between these two desiderata. This contribution is twofold. First, we study both theoretically and empirically how equivariance is affected by the underlying graph with respect to the number of vertices and neighbors. Second, we evaluate DeepSphere on relevant problems. Experiments show state-of-the-art performance and demonstrates the efficiency and flexibility of this formulation. Perhaps surprisingly, comparison with previous work suggests that anisotropic filters might be an unnecessary price to pay. Our code is available at this https URL

88. Semi-supervised Cardiac Image Segmentation via Label Propagation and Style Transfer [PDF] 返回目录
  Yao Zhang, Jiawei Yang, Feng Hou, Yang Liu, Yixin Wang, Jiang Tian, Cheng Zhong, Yang Zhang, Zhiqiang He
Abstract: Accurate segmentation of cardiac structures can assist doctors in diagnosing diseases and improve treatment planning, which is highly demanded in clinical practice. However, the shortage of annotations and the variance of the data among different vendors and medical centers restrict the performance of advanced deep learning methods. In this work, we present a fully automatic method to segment cardiac structures, including the left (LV) and right ventricle (RV) blood pools, as well as the left ventricular myocardium (MYO), in MRI volumes. Specifically, we design a semi-supervised learning method to leverage unlabelled MRI sequence timeframes by label propagation. Then we exploit style transfer to reduce the variance among different centers and vendors for more robust cardiac image segmentation. We evaluate our method in the M&Ms challenge, ranking 2nd among 14 competitive teams.

89. Parzen Window Approximation on Riemannian Manifold [PDF] 返回目录
  Abhishek, Shekhar Verma
Abstract: In graph-motivated learning, label propagation largely depends on data affinity, represented as edges between connected data points. The affinity assignment implicitly assumes an even distribution of data on the manifold. This assumption may not hold and may lead to inaccurate metric assignment due to drift towards high-density regions. The drift-affected, heat-kernel-based affinity with a globally fixed Parzen window either discards genuine neighbors or forces distant data points to become members of the neighborhood. This yields a biased affinity matrix. In this paper, the bias due to uneven data sampling on the Riemannian manifold is catered to by a variable Parzen window determined as a function of neighborhood size, ambient dimension, flatness range, etc. Additionally, an affinity adjustment is used which offsets the effect of the uneven sampling responsible for the bias. An affinity metric which takes the irregular sampling effect into consideration to yield accurate label propagation is proposed. Extensive experiments on synthetic and real-world data sets confirm that the proposed method increases the classification accuracy significantly and outperforms existing Parzen window estimators in graph Laplacian manifold regularization methods.
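A variable Parzen window of the kind described, wider in sparse regions and narrower in dense ones, can be sketched as a Gaussian kernel density estimate whose per-sample bandwidth is the distance to the k-th nearest neighbor. The bandwidth rule and the omitted normalization constant are illustrative simplifications, not the paper's formula.

```python
import numpy as np

def adaptive_parzen_density(X, query, k=5):
    """Gaussian KDE (up to a constant factor) whose bandwidth at each
    sample is its k-NN radius, so sparse regions get wider windows."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    h = np.sort(d, axis=1)[:, k]              # per-sample variable window
    dq = np.linalg.norm(X - query, axis=-1)
    dim = X.shape[1]
    kernels = np.exp(-0.5 * (dq / h) ** 2) / (h ** dim)
    return kernels.mean()

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 0.2, (80, 2)),   # dense cluster
                    rng.normal(3, 1.0, (20, 2))])  # sparse cluster
print(adaptive_parzen_density(X, np.zeros(2)),     # high density estimate
      adaptive_parzen_density(X, np.full(2, 3.0))) # lower, wider-window estimate
```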

90. AILearn: An Adaptive Incremental Learning Model for Spoof Fingerprint Detection [PDF] 返回目录
  Shivang Agarwal, Ajita Rattani, C. Ravindranath Chowdary
Abstract: Incremental learning enables the learner to accommodate new knowledge without retraining the existing model. It is a challenging task which requires learning from new data as well as preserving the knowledge extracted from previously accessed data. This challenge is known as the stability-plasticity dilemma. We propose AILearn, a generic model for incremental learning which overcomes the stability-plasticity dilemma by carefully integrating an ensemble of base classifiers trained on new data with the current ensemble, without retraining the model from scratch on the entire data. We demonstrate the efficacy of the proposed AILearn model on the spoof fingerprint detection application. One of the significant challenges associated with spoof fingerprint detection is the performance drop on spoofs generated using new fabrication materials. AILearn is an adaptive incremental learning model which adapts to the features of the "live" and "spoof" fingerprint images and efficiently recognizes new spoof fingerprints as well as known spoof fingerprints when new data is available. To the best of our knowledge, AILearn is the first attempt among incremental learning algorithms to adapt to the properties of the data to generate a diverse ensemble of base classifiers. In experiments conducted on the standard high-dimensional datasets LivDet 2011, LivDet 2013 and LivDet 2015, we show that the performance gain on new fake materials is significantly high. On average, we achieve a 49.57% improvement in accuracy between consecutive learning phases.

91. Annotation-Efficient Learning for Medical Image Segmentation based on Noisy Pseudo Labels and Adversarial Learning [PDF]
  Lu Wang, Dong Guo, Guotai Wang, Shaoting Zhang
Abstract: Although deep learning has achieved state-of-the-art performance for medical image segmentation, its success relies on a large set of manually annotated training images that are expensive to acquire. In this paper, we propose an annotation-efficient learning framework for segmentation tasks that avoids annotating the training images: we use an improved Cycle-Consistent Generative Adversarial Network (GAN) to learn from a set of unpaired medical images and auxiliary masks obtained either from a shape model or from public datasets. We first use the GAN to generate pseudo labels for our training images under the implicit high-level shape constraint represented by a Variational Auto-encoder (VAE)-based discriminator, with the help of the auxiliary masks, and build a Discriminator-guided Generator Channel Calibration (DGCC) module which employs the discriminator's feedback to calibrate the generator for better pseudo labels. To learn from the noisy pseudo labels, we further introduce a noise-robust iterative learning method using a noise-weighted Dice loss. We validated our framework in two situations: objects with a simple shape model, such as the optic disc in fundus images and the fetal head in ultrasound images, and complex structures, such as the lung in X-ray images and the liver in CT images. Experimental results demonstrate that 1) our VAE-based discriminator and DGCC module help to obtain high-quality pseudo labels; 2) our proposed noise-robust learning method can effectively overcome the effect of noisy pseudo labels; and 3) the segmentation performance of our method without annotations of the training images is close or even comparable to that of learning from human annotations.
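
The noise-weighted Dice loss at the heart of the iterative learning step is easy to state. Below is a NumPy sketch under the assumption that a per-pixel weight map (e.g., derived from discriminator confidence) is already available; how the paper actually builds the weights is not reproduced here, and the function name is illustrative.

```python
import numpy as np

def noise_weighted_dice_loss(pred, pseudo, weight, eps=1e-6):
    """Dice loss where each pixel's contribution is scaled by a noise
    weight in [0, 1]; small weights suppress pixels whose pseudo label
    is suspected to be wrong. pred/pseudo/weight share one shape."""
    inter = np.sum(weight * pred * pseudo)
    denom = np.sum(weight * pred) + np.sum(weight * pseudo)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)
```

With all weights equal to one, this reduces to the ordinary soft Dice loss.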

92. Ensembled ResUnet for Anatomical Brain Barriers Segmentation [PDF]
  Munan Ning, Cheng Bian, Chenglang Yuan, Yefeng Zheng
Abstract: Accurate segmentation of brain structures could be helpful for glioma and radiotherapy planning. However, due to the visual and anatomical differences between different modalities, accurate segmentation of brain structures is challenging. To address this problem, we first construct a residual-block-based U-shape network with a deep encoder and a shallow decoder, which trades off framework performance against efficiency. Then, we introduce the Tversky loss to address the class imbalance between the different foreground classes and the background class. Finally, a model-ensemble strategy is utilized to remove outliers and further boost performance.
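
For reference, the Tversky loss mentioned above generalizes the Dice loss with separate false-positive and false-negative penalties; a minimal sketch follows, with illustrative alpha/beta values rather than the challenge entry's settings.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-6):
    """Tversky loss: 1 - TP / (TP + alpha*FP + beta*FN).

    alpha = beta = 0.5 recovers the Dice loss; beta > alpha penalizes
    false negatives harder, which helps under-represented foreground
    classes. pred and target are soft masks in [0, 1]."""
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))
    fn = np.sum((1.0 - pred) * target)
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```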

93. Myocardial Segmentation of Cardiac MRI Sequences with Temporal Consistency for Coronary Artery Disease Diagnosis [PDF]
  Yutian Chen, Xiaowei Xu, Dewen Zeng, Yiyu Shi, Haiyun Yuan, Jian Zhuang, Yuhao Dong, Qianjun Jia, Meiping Huang
Abstract: Coronary artery disease (CAD) is the most common cause of death globally, and its diagnosis is usually based on manual myocardial segmentation of Magnetic Resonance Imaging (MRI) sequences. As manual segmentation is tedious, time-consuming and of low applicability, automatic myocardial segmentation using machine learning techniques has been widely explored recently. However, almost all existing methods treat the input MRI sequences independently and thus fail to capture the temporal information between them, e.g., the shape and location of the myocardium over time. In this paper, we propose a myocardial segmentation framework for sequences of cardiac MRI (CMR) scanning images covering the left ventricular cavity, the right ventricular cavity, and the myocardium. Specifically, we propose to combine convolutional networks with recurrent networks to incorporate temporal information between sequences and ensure temporal consistency. We evaluated our framework on the Automated Cardiac Diagnosis Challenge (ACDC) dataset. Experimental results demonstrate that our framework can improve segmentation accuracy by up to 2% in Dice coefficient.
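
The temporal-fusion idea can be caricatured in a few lines. The sketch below replaces the learned recurrent cell with plain exponential smoothing over per-frame backbone features, purely to show how a hidden state carries shape and location information along the sequence; it is not the paper's architecture, and the weight value is arbitrary.

```python
import numpy as np

def temporal_fusion(frame_feats, w=0.8):
    """Toy stand-in for the convolutional + recurrent combination:
    each frame's features are blended with a hidden state summarizing
    the preceding frames. A real model would use a learned cell
    (e.g., ConvLSTM/GRU) instead of the fixed weight w."""
    fused, h = [], np.zeros_like(frame_feats[0])
    for f in frame_feats:
        h = w * f + (1.0 - w) * h   # hidden state = smoothed history
        fused.append(h)
    return fused
```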

94. Cascaded Framework for Automatic Evaluation of Myocardial Infarction from Delayed-Enhancement Cardiac MRI [PDF]
  Jun Ma
Abstract: Automatic evaluation of the myocardium and its pathology plays an important role in the quantitative analysis of patients suffering from myocardial infarction. In this paper, we present a cascaded convolutional neural network framework for myocardial infarction segmentation and classification in delayed-enhancement cardiac MRI. Specifically, we first use a 2D U-Net to segment the whole heart, including the left ventricle and the myocardium. Then, we crop the whole heart as a region of interest (ROI). Finally, a second 2D U-Net is used to segment the infarction and no-reflow areas within the whole-heart ROI. The segmentation method can also be applied to the classification task, where segmentation results containing infarction or no-reflow areas are classified as pathological cases. Our method took second place in the MICCAI 2020 EMIDEC segmentation task, with Dice scores of 86.28%, 62.24%, and 77.76% for the myocardium, infarction, and no-reflow areas, respectively, and first place in the classification task, with an accuracy of 92%.
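
The cascade's glue logic is straightforward; here is a hedged NumPy sketch of the ROI crop between the two U-Nets and of the segmentation-derived classification rule. The margin and the minimum-pixel threshold are assumptions, not the paper's values.

```python
import numpy as np

def crop_heart_roi(image, heart_mask, margin=8):
    """Crop the whole-heart ROI found by the first U-Net (assumes a
    non-empty mask) before feeding the second U-Net."""
    ys, xs = np.nonzero(heart_mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, image.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, image.shape[1])
    return image[y0:y1, x0:x1]

def is_pathological(infarct_mask, no_reflow_mask, min_pixels=1):
    """Classification for free: a case is pathological iff any
    infarction or no-reflow tissue was segmented."""
    return (infarct_mask.sum() + no_reflow_mask.sum()) >= min_pixels
```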

95. Comparison of different CNNs for breast tumor classification from ultrasound images [PDF]
  Jorge F. Lazo, Sara Moccia, Emanuele Frontoni, Elena De Momi
Abstract: Breast cancer is one of the deadliest cancers worldwide, and timely detection could reduce mortality rates. In the clinical routine, classifying benign and malignant tumors from ultrasound (US) imaging is a crucial but challenging task, so an automated method that can deal with the variability of the data is needed. In this paper, we compared different Convolutional Neural Networks (CNNs) and transfer-learning methods for the task of automated breast tumor classification. The architectures investigated in this study were VGG-16 and Inception V3, with two different training strategies: using the pretrained models as feature extractors, and fine-tuning the pretrained models. A total of 947 images were used: 587 US images of benign tumors and 360 of malignant tumors. 678 images were used for training and validation, while 269 images were used for testing the models. Accuracy and area under the receiver operating characteristic curve (AUC) were used as performance metrics. The best performance was obtained by fine-tuning VGG-16, with an accuracy of 0.919 and an AUC of 0.934. The obtained results open the opportunity for further investigation with a view to improving cancer detection.
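
The two training strategies translate directly into code. A sketch with the torchvision API (v0.13+) is shown below for VGG-16; which layers to freeze and the two-class head are assumptions about a typical setup, not the paper's exact configuration.

```python
import torch.nn as nn
from torchvision import models

def make_vgg16(num_classes=2, fine_tune=False):
    """fine_tune=False: frozen backbone used as a feature extractor,
    only the new head is trained. fine_tune=True: the pretrained
    weights are updated as well."""
    net = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    if not fine_tune:
        for p in net.features.parameters():
            p.requires_grad = False  # feature-extractor mode
    # Replace the final classifier layer for benign/malignant output.
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
    return net
```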

96. SASSI -- Super-Pixelated Adaptive Spatio-Spectral Imaging [PDF]
  Vishwanath Saragadam, Michael DeZeeuw, Richard Baraniuk, Ashok Veeraraghavan, Aswin Sankaranarayanan
Abstract: We introduce a novel video-rate hyperspectral imager with high spatial and temporal resolution. Our key hypothesis is that the spectral profiles of pixels within a super-pixel of an oversegmented image tend to be very similar. Hence, a scene-adaptive spatial sampling of a hyperspectral scene, guided by its super-pixel segmented image, is capable of obtaining high-quality reconstructions. To achieve this, we acquire an RGB image of the scene, compute its super-pixels, and from these generate a spatial mask of locations where we measure the high-resolution spectrum. The hyperspectral image is subsequently estimated by fusing the RGB image and the spectral measurements using a learnable guided-filtering approach. Due to the low computational complexity of the super-pixel estimation step, our setup can capture hyperspectral images of scenes with little overhead over traditional snapshot hyperspectral cameras, but with significantly higher spatial and spectral resolution. We validate the proposed technique with extensive simulations as well as a lab prototype that measures hyperspectral video at a spatial resolution of 600 × 900 pixels and a spectral resolution of 10 nm over the visible wavebands, achieving a frame rate of 18 fps.
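
The sampling-mask step can be sketched with scikit-image's SLIC super-pixels: one measurement site per super-pixel, placed at its centroid. This is an assumption-laden simplification; the prototype's actual site selection and the learnable guided-filter fusion are not reproduced, and the segment count is illustrative.

```python
import numpy as np
from skimage.segmentation import slic

def spectral_sampling_mask(rgb, n_segments=2000):
    """Scene-adaptive mask of locations at which to measure the
    high-resolution spectrum: one site per super-pixel. Note the
    centroid of a highly non-convex super-pixel may fall outside it;
    a real implementation would snap to the nearest member pixel."""
    labels = slic(rgb, n_segments=n_segments, compactness=10.0)
    mask = np.zeros(labels.shape, dtype=bool)
    for lab in np.unique(labels):
        ys, xs = np.nonzero(labels == lab)
        mask[int(ys.mean()), int(xs.mean())] = True
    return mask
```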

97. Classification of Pathological and Normal Gait: A Survey [PDF]
  Ryan C. Saxe, Samantha Kappagoda, David K.A. Mordecai
Abstract: Gait recognition commonly refers to an identification problem within the computer science field: a variety of methods and models can identify an individual based on their pattern of ambulatory locomotion. By surveying the current literature on gait recognition, this paper seeks to identify appropriate metrics, devices, and algorithms for collecting and analyzing data on patterns and modes of ambulatory movement across individuals. Furthermore, this survey seeks to motivate interest in a broader scope of longitudinal analysis of perturbations in gait across states (i.e., physiological, emotive, and/or cognitive states). More broadly, inferences about normal versus pathological gait patterns can be drawn from both longitudinal and non-longitudinal forms of classification. This may indicate promising research directions and experimental designs, such as creating algorithmic metrics for the quantification of fatigue, or models for forecasting episodic disorders. Furthermore, in conjunction with other measurements of physiological and environmental conditions, pathological gait classification might be applicable to inference for syndromic surveillance of infectious disease states or cognitive impairment.

98. Evaluation and Comparison of Edge-Preserving Filters [PDF]
  Sarah Gingichashvili, Dani Lischinski
Abstract: Edge-preserving filters play an essential role in some of the most basic tasks of computational photography, such as abstraction, tone mapping, detail enhancement and texture removal. The abundance and diversity of smoothing operators, together with the lack of a methodology for evaluating output quality and/or performing an unbiased comparison between them, can lead to misunderstanding and potential misuse of such methods. This paper introduces a systematic methodology for evaluating and comparing such operators and demonstrates it on a diverse set of published edge-preserving filters. Additionally, we present a common baseline along which a comparison of different operators can be achieved, and use it to determine equivalent parameter mappings between methods. Finally, we suggest some guidelines for the objective comparison and evaluation of edge-preserving filters.

Note: the cover image is a word cloud of the paper titles.