[arXiv Papers] Computer Vision and Pattern Recognition 2020-12-15

Contents

1. Real-Time High-Resolution Background Matting [PDF] Abstract
2. img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation [PDF] Abstract
3. PePScenes: A Novel Dataset and Baseline for Pedestrian Action Prediction in 3D [PDF] Abstract
4. Digital rock reconstruction with user-defined properties using conditional generative adversarial networks [PDF] Abstract
5. Improving Panoptic Segmentation at All Scales [PDF] Abstract
6. High-resolution global irrigation prediction with Sentinel-2 30m data [PDF] Abstract
7. Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection [PDF] Abstract
8. Deep Neural Networks for COVID-19 Detection and Diagnosis using Images and Acoustic-based Techniques: A Recent Review [PDF] Abstract
9. ProLab: perceptually uniform projective colour coordinate system [PDF] Abstract
10. Decoupled Self Attention for Accurate One Stage Object Detection [PDF] Abstract
11. Understanding Image Retrieval Re-Ranking: A Graph Neural Network Perspective [PDF] Abstract
12. WDNet: Watermark-Decomposition Network for Visible Watermark Removal [PDF] Abstract
13. Agglomerative Clustering of Handwritten Numerals to Determine Similarity of Different Languages [PDF] Abstract
14. FlowMOT: 3D Multi-Object Tracking by Scene Flow Association [PDF] Abstract
15. Temporal Relational Modeling with Self-Supervision for Action Segmentation [PDF] Abstract
16. Improving Video Instance Segmentation by Light-weight Temporal Uncertainty Estimates [PDF] Abstract
17. Sign-Agnostic Implicit Learning of Surface Self-Similarities for Shape Modeling and Reconstruction from Raw Point Clouds [PDF] Abstract
18. Deep Learning for Material recognition: most recent advances and open challenges [PDF] Abstract
19. Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU [PDF] Abstract
20. Aggregative Self-Supervised Feature Learning [PDF] Abstract
21. Learned Video Codec with Enriched Reconstruction for CLIC P-frame Coding [PDF] Abstract
22. Pyramid-Focus-Augmentation: Medical Image Segmentation with Step-Wise Focus [PDF] Abstract
23. DSM Refinement with Deep Encoder-Decoder Networks [PDF] Abstract
24. One-Shot Learning with Triplet Loss for Vegetation Classification Tasks [PDF] Abstract
25. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation [PDF] Abstract
26. The Open Brands Dataset: Unified brand detection and recognition at scale [PDF] Abstract
27. Articulated Shape Matching Using Laplacian Eigenfunctions and Unsupervised Point Registration [PDF] Abstract
28. Intrinsic Image Captioning Evaluation [PDF] Abstract
29. Combining Similarity and Adversarial Learning to Generate Visual Explanation: Application to Medical Image Classification [PDF] Abstract
30. Morphology on categorical distributions [PDF] Abstract
31. Multi Modal Adaptive Normalization for Audio to Video Generation [PDF] Abstract
32. Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer [PDF] Abstract
33. Learning Category-level Shape Saliency via Deep Implicit Surface Networks [PDF] Abstract
34. Semantic Layout Manipulation with High-Resolution Sparse Attention [PDF] Abstract
35. Information-Theoretic Segmentation by Inpainting Error Maximization [PDF] Abstract
36. TDAF: Top-Down Attention Framework for Vision Tasks [PDF] Abstract
37. Deep Optimized Priors for 3D Shape Modeling and Reconstruction [PDF] Abstract
38. INSPIRE: Intensity and Spatial Information-Based Deformable Image Registration [PDF] Abstract
39. Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [PDF] Abstract
40. Meticulous Object Segmentation [PDF] Abstract
41. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation [PDF] Abstract
42. MSAF: Multimodal Split Attention Fusion [PDF] Abstract
43. FSOCO: The Formula Student Objects in Context Dataset [PDF] Abstract
44. Location-aware Single Image Reflection Removal [PDF] Abstract
45. Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Videos [PDF] Abstract
46. DFR: Deep Feature Reconstruction for Unsupervised Anomaly Segmentation [PDF] Abstract
47. Learning Heatmap-Style Jigsaw Puzzles Provides Good Pretraining for 2D Human Pose Estimation [PDF] Abstract
48. MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish [PDF] Abstract
49. EfficientPose: Efficient Human Pose Estimation with Neural Architecture Search [PDF] Abstract
50. Robust Real-Time Pedestrian Detection on Embedded Devices [PDF] Abstract
51. Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network [PDF] Abstract
52. PoNA: Pose-guided Non-local Attention for Human Pose Transfer [PDF] Abstract
53. One-Shot Object Localization in Medical Images based on Relative Position Regression [PDF] Abstract
54. Semi-supervised Segmentation via Uncertainty Rectified Pyramid Consistency and Its Application to Gross Target Volume of Nasopharyngeal Carcinoma [PDF] Abstract
55. Uncertainty Estimation in Deep Neural Networks for Point Cloud Segmentation in Factory Planning [PDF] Abstract
56. Efficient Human Pose Estimation by Learning Deeply Aggregated Representations [PDF] Abstract
57. Split then Refine: Stacked Attention-guided ResUNets for Blind Single Image Visible Watermark Removal [PDF] Abstract
58. Effective multi-view registration of point sets based on student's t mixture model [PDF] Abstract
59. Bi-Classifier Determinacy Maximization for Unsupervised Domain Adaptation [PDF] Abstract
60. Contrastive Learning for Label-Efficient Semantic Segmentation [PDF] Abstract
61. GeoNet++: Iterative Geometric Neural Network with Edge-Aware Refinement for Joint Depth and Surface Normal Estimation [PDF] Abstract
62. MVFNet: Multi-View Fusion Network for Efficient Video Recognition [PDF] Abstract
63. Spontaneous Emotion Recognition from Facial Thermal Images [PDF] Abstract
64. Fully-Automated Liver Tumor Localization and Characterization from Multi-Phase MR Volumes Using Key-Slice ROI Parsing: A Physician-Inspired Approach [PDF] Abstract
65. Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery [PDF] Abstract
66. MiniVLM: A Smaller and Faster Vision-Language Model [PDF] Abstract
67. Human Pose Transfer by Adaptive Hierarchical Deformation [PDF] Abstract
68. Assessing The Importance Of Colours For CNNs In Object Recognition [PDF] Abstract
69. PAIRS AutoGeo: an Automated Machine Learning Framework for Massive Geospatial Data [PDF] Abstract
70. AMINN: Autoencoder-based Multiple Instance Neural Network for Outcome Prediction of Multifocal Liver Metastases [PDF] Abstract
71. Spectral Unmixing With Multinomial Mixture Kernel and Wasserstein Generative Adversarial Loss [PDF] Abstract
72. LiveChess2FEN: a Framework for Classifying Chess Pieces based on CNNs [PDF] Abstract
73. Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification [PDF] Abstract
74. High Order Local Directional Pattern Based Pyramidal Multi-structure for Robust Face Recognition [PDF] Abstract
75. Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation [PDF] Abstract
76. Fine-grained Classification via Categorical Memory Networks [PDF] Abstract
77. DETR for Pedestrian Detection [PDF] Abstract
78. Uncalibrated Neural Inverse Rendering for Photometric Stereo of General Surfaces [PDF] Abstract
79. An Overview of Depth Cameras and Range Scanners Based on Time-of-Flight Technologies [PDF] Abstract
80. Fusion of Range and Stereo Data for High-Resolution Scene-Modeling [PDF] Abstract
81. Anomaly detection through latent space restoration using vector-quantized variational autoencoders [PDF] Abstract
82. Periocular in the Wild Embedding Learning with Cross-Modal Consistent Knowledge Distillation [PDF] Abstract
83. Computer Vision and Normalizing Flow Based Defect Detection [PDF] Abstract
84. Multimodal In-bed Pose and Shape Estimation under the Blankets [PDF] Abstract
85. PoP-Net: Pose over Parts Network for Multi-Person 3D Pose Estimation from a Depth Image [PDF] Abstract
86. Mask Guided Matting via Progressive Refinement Network [PDF] Abstract
87. Teacher-Student Asynchronous Learning with Multi-Source Consistency for Facial Landmark Detection [PDF] Abstract
88. D$^2$IM-Net: Learning Detail Disentangled Implicit Fields from Single Images [PDF] Abstract
89. Street-view Panoramic Video Synthesis from a Single Satellite Image [PDF] Abstract
90. Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes [PDF] Abstract
91. Improving Adversarial Robustness via Probabilistically Compact Loss with Logit Constraints [PDF] Abstract
92. Biomechanical modelling of brain atrophy through deep learning [PDF] Abstract
93. Movie Summarization via Sparse Graph Construction [PDF] Abstract
94. Phase Retrieval with Holography and Untrained Priors: Tackling the Challenges of Low-Photon Nanoscale Imaging [PDF] Abstract
95. Learning Hybrid Representations for Automatic 3D Vessel Centerline Extraction [PDF] Abstract
96. IPN-V2 and OCTA-500: Methodology and Dataset for Retinal Image Segmentation [PDF] Abstract
97. Accurate Cell Segmentation in Digital Pathology Images via Attention Enforced Networks [PDF] Abstract
98. Multi-Domain Multi-Task Rehearsal for Lifelong Learning [PDF] Abstract
99. D-LEMA: Deep Learning Ensembles from Multiple Annotations -- Application to Skin Lesion Segmentation [PDF] Abstract
100. Pseudo Shots: Few-Shot Learning with Auxiliary Data [PDF] Abstract
101. Learning Contextual Causality from Time-consecutive Images [PDF] Abstract
102. Robust Segmentation of Optic Disc and Cup from Fundus Images Using Deep Neural Networks [PDF] Abstract
103. Leaking Sensitive Financial Accounting Data in Plain Sight using Deep Autoencoder Neural Networks [PDF] Abstract
104. CHS-Net: A Deep learning approach for hierarchical segmentation of COVID-19 infected CT images [PDF] Abstract
105. LEARN++: Recurrent Dual-Domain Reconstruction Network for Compressed Sensing CT [PDF] Abstract
106. Learn-Prune-Share for Lifelong Learning [PDF] Abstract
107. Attentional Biased Stochastic Gradient for Imbalanced Classification [PDF] Abstract
108. The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models [PDF] Abstract
109. Interactive Radiotherapy Target Delineation with 3D-Fused Context Propagation [PDF] Abstract
110. Delay Differential Neural Networks [PDF] Abstract
111. Knowledge Capture and Replay for Continual Learning [PDF] Abstract
112. Generative Adversarial Networks for Automatic Polyp Segmentation [PDF] Abstract
113. HI-Net: Hyperdense Inception 3D UNet for Brain Tumor Segmentation [PDF] Abstract
114. Sampling Training Data for Continual Learning Between Robots and the Cloud [PDF] Abstract
115. Learning Consistent Deep Generative Models from Sparse Data via Prediction Constraints [PDF] Abstract

Abstracts

1. Real-Time High-Resolution Background Matting [PDF] Back to Contents
  Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian Curless, Steve Seitz, Ira Kemelmacher-Shlizerman
Abstract: We introduce a real-time, high-resolution background replacement technique which operates at 30fps in 4K resolution, and 60fps for HD on a modern GPU. Our technique is based on background matting, where an additional frame of the background is captured and used in recovering the alpha matte and the foreground layer. The main challenge is to compute a high-quality alpha matte, preserving strand-level hair details, while processing high-resolution images in real-time. To achieve this goal, we employ two neural networks: a base network computes a low-resolution result which is refined by a second network operating at high resolution on selective patches. We introduce two large-scale video and image matting datasets: VideoMatte240K and PhotoMatte13K/85. Our approach yields higher quality results compared to the previous state-of-the-art in background matting, while simultaneously yielding a dramatic boost in both speed and resolution.
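
The base-then-refine split can be sketched compactly: a coarse network predicts an alpha matte and an error map at low resolution, and only the highest-error patches are revisited by a second, high-resolution stage. A minimal PyTorch sketch, assuming toy layer sizes and a simple error-ranking rule (the authors' actual base network and refiner are more elaborate):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBase(nn.Module):
    """Coarse stage: predicts a low-resolution alpha matte and a per-pixel error map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, 3, padding=1)  # image + captured background: 3+3 channels

    def forward(self, x):
        out = self.net(x)
        return torch.sigmoid(out[:, :1]), torch.sigmoid(out[:, 1:])  # alpha, error

def matte(image, background, base, patch=8, k=16):
    x = torch.cat([image, background], dim=1)
    small = F.interpolate(x, scale_factor=0.25, mode="bilinear", align_corners=False)
    alpha_lr, err = base(small)
    alpha = F.interpolate(alpha_lr, size=image.shape[-2:], mode="bilinear", align_corners=False)
    # Rank low-resolution cells by predicted error; a refinement network would
    # re-predict only these patches at full resolution.
    cell_err = F.avg_pool2d(err, patch).flatten(1)
    top = cell_err.topk(min(k, cell_err.shape[1]), dim=1).indices
    return alpha, top

alpha, patches_to_refine = matte(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256), TinyBase())
```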

2. img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation [PDF] Back to Contents
  Vitor Albiero, Xingyu Chen, Xi Yin, Guan Pang, Tal Hassner
Abstract: We propose real-time, six degrees of freedom (6DoF), 3D face pose estimation without face detection or landmark localization. We observe that estimating the 6DoF rigid transformation of a face is a simpler problem than facial landmark detection, often used for 3D face alignment. In addition, 6DoF offers more information than face bounding box labels. We leverage these observations to make multiple contributions: (a) We describe an easily trained, efficient, Faster R-CNN-based model which regresses 6DoF pose for all faces in the photo, without preliminary face detection. (b) We explain how pose is converted and kept consistent between the input photo and arbitrary crops created while training and evaluating our model. (c) Finally, we show how face poses can replace detection bounding box training labels. Tests on AFLW2000-3D and BIWI show that our method runs in real time and outperforms state-of-the-art (SotA) face pose estimators. Remarkably, our method also surpasses SotA models of comparable complexity on the WIDER FACE detection benchmark, despite not being optimized on bounding box labels.

3. PePScenes: A Novel Dataset and Baseline for Pedestrian Action Prediction in 3D [PDF] Back to Contents
  Amir Rasouli, Tiffany Yau, Peter Lakner, Saber Malekmohammadi, Mohsen Rohani, Jun Luo
Abstract: Predicting the behavior of road users, particularly pedestrians, is vital for safe motion planning in the context of autonomous driving systems. Traditionally, pedestrian behavior prediction has been realized in terms of forecasting future trajectories. However, recent evidence suggests that predicting higher-level actions, such as crossing the road, can help improve trajectory forecasting and planning tasks accordingly. There are a number of existing datasets that cater to the development of pedestrian action prediction algorithms, however, they lack certain characteristics, such as bird's eye view semantic map information, 3D locations of objects in the scene, etc., which are crucial in the autonomous driving context. To this end, we propose a new pedestrian action prediction dataset created by adding per-frame 2D/3D bounding box and behavioral annotations to the popular autonomous driving dataset, nuScenes. In addition, we propose a hybrid neural network architecture that incorporates various data modalities for predicting pedestrian crossing action. By evaluating our model on the newly proposed dataset, the contribution of different data modalities to the prediction task is revealed. The dataset is available at this https URL.

4. Digital rock reconstruction with user-defined properties using conditional generative adversarial networks [PDF] Back to Contents
  Qiang Zheng, Dongxiao Zhang
Abstract: Uncertainty is ubiquitous with flow in subsurface rocks because of their inherent heterogeneity and lack of in-situ measurements. To complete uncertainty analysis in a multi-scale manner, it is a prerequisite to provide sufficient rock samples. Even though the advent of digital rock technology offers opportunities to reproduce rocks, it still cannot be utilized to provide massive samples due to its high cost, thus leading to the development of diversified mathematical methods. Among them, two-point statistics (TPS) and multi-point statistics (MPS) are commonly utilized, which feature incorporating low-order and high-order statistical information, respectively. Recently, generative adversarial networks (GANs) are becoming increasingly popular since they can reproduce training images with excellent visual and consequent geologic realism. However, standard GANs can only incorporate information from data, while leaving no interface for user-defined properties, and thus may limit the diversity of reconstructed samples. In this study, we propose conditional GANs for digital rock reconstruction, aiming to reproduce samples not only similar to the real training data, but also satisfying user-specified properties. In fact, the proposed framework can realize the targets of MPS and TPS simultaneously by incorporating high-order information directly from rock images with the GANs scheme, while preserving low-order counterparts through conditioning. We conduct three reconstruction experiments, and the results demonstrate that rock type, rock porosity, and correlation length can be successfully conditioned to affect the reconstructed rock images. Furthermore, in contrast to existing GANs, the proposed conditioning enables learning of multiple rock types simultaneously, and thus invisibly saves the computational cost.

5. Improving Panoptic Segmentation at All Scales [PDF] Back to Contents
  Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder
Abstract: Crop-based training strategies decouple training resolution from GPU memory consumption, allowing the use of large-capacity panoptic segmentation networks on multi-megapixel images. Using crops, however, can introduce a bias towards truncating or missing large objects. To address this, we propose a novel crop-aware bounding box regression loss (CABB loss), which promotes predictions to be consistent with the visible parts of the cropped objects, while not over-penalizing them for extending outside of the crop. We further introduce a novel data sampling and augmentation strategy which improves generalization across scales by counteracting the imbalanced distribution of object sizes. Combining these two contributions with a carefully designed, top-down panoptic segmentation architecture, we obtain new state-of-the-art results on the challenging Mapillary Vistas (MVD), Indian Driving and Cityscapes datasets, surpassing the previously best approach on MVD by +4.5% PQ and +5.2% mAP.

6. High-resolution global irrigation prediction with Sentinel-2 30m data [PDF] Back to Contents
  Weixin, Sonal Thakkar, Will Hawkins, Puya Vahabi, Alberto Todeschini
Abstract: An accurate and precise understanding of global irrigation usage is crucial for a variety of climate science efforts. Irrigation is highly energy-intensive, and as population growth continues at its current pace, increases in crop need and water usage will have an impact on climate change. Precise irrigation data can help with monitoring water usage and optimizing agricultural yield, particularly in developing countries. Irrigation data, in tandem with precipitation data, can be used to predict water budgets as well as climate and weather modeling. With our research, we produce an irrigation prediction model that combines unsupervised clustering of Normalized Difference Vegetation Index (NDVI) temporal signatures with a precipitation heuristic to label the months that irrigation peaks for each cropland cluster in a given year. We have developed a novel irrigation model and Python package ("Irrigation30") to generate 30m resolution irrigation predictions of cropland worldwide. With a small crowdsourced test set of cropland coordinates and irrigation labels, using a fraction of the resources used by the state-of-the-art NASA-funded GFSAD30 project with irrigation data limited to India and Australia, our model was able to achieve consistency scores in excess of 97% and an accuracy of 92% in a small geo-diverse randomly sampled test set.
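
A minimal sketch of the described pipeline, clustering NDVI temporal signatures and applying a precipitation heuristic; the synthetic data and the "NDVI peaks in a dry month" rule are illustrative assumptions, not the authors' exact heuristic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ndvi = rng.random((500, 12))    # 500 cropland pixels x 12 monthly NDVI values
precip = rng.random(12)         # monthly precipitation for the same region

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(ndvi)
for c in range(5):
    signature = ndvi[labels == c].mean(axis=0)   # mean temporal signature of the cluster
    peak = int(signature.argmax())
    # Heuristic: vegetation peaking while precipitation is low suggests irrigation.
    print(f"cluster {c}: NDVI peak in month {peak}, "
          f"irrigated={bool(precip[peak] < np.median(precip))}")
```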

7. Lips Don't Lie: A Generalisable and Robust Approach to Face Forgery Detection [PDF] Back to Contents
  Alexandros Haliassos, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Abstract: Although current deep learning-based face forgery detectors achieve impressive performance in constrained scenarios, they are vulnerable to samples created by unseen manipulation methods. Some recent works show improvements in generalisation but rely on cues that are easily corrupted by common post-processing operations such as compression. In this paper, we propose LipForensics, a detection approach capable of both generalising to novel manipulations and withstanding various distortions. LipForensics targets high-level semantic irregularities in mouth movements, which are common in many generated videos. It consists in first pretraining a spatio-temporal network to perform visual speech recognition (lipreading), thus learning rich internal representations related to natural mouth motion. A temporal network is subsequently finetuned on fixed mouth embeddings of real and forged data in order to detect fake videos based on mouth movements without overfitting to low-level, manipulation-specific artefacts. Extensive experiments show that this simple approach significantly surpasses the state-of-the-art in terms of generalisation to unseen manipulations and robustness to perturbations, as well as shed light on the factors responsible for its performance.

8. Deep Neural Networks for COVID-19 Detection and Diagnosis using Images and Acoustic-based Techniques: A Recent Review [PDF] Back to Contents
  Walid Hariri, Ali Narin
Abstract: The new coronavirus disease (COVID-19) has been declared a pandemic since March 2020 by the World Health Organization. It consists of an emerging viral infection with respiratory tropism that could develop atypical pneumonia. Experts emphasize the importance of early detection of those who have the COVID-19 virus. In this way, patients will be isolated from other people and the spread of the virus can be prevented. For this reason, it has become an area of interest to develop early diagnosis and detection methods to ensure a rapid treatment process and prevent the virus from spreading. Since the standard testing system is time-consuming and not available for everyone, alternative early-screening techniques have become an urgent need. In this study, the approaches used in the detection of COVID-19 based on deep learning (DL) algorithms, which have been popular in recent years, have been comprehensively discussed. The advantages and disadvantages of different approaches used in literature are examined in detail. The Computed Tomography of the chest and X-ray images give a rich representation of the patient's lung that is less time-consuming and allows an efficient viral pneumonia detection using the DL algorithms. The first step is the pre-processing of these images to remove noise. Next, deep features are extracted using multiple types of deep models (pre-trained models, generative models, generic neural networks, etc). Finally, the classification is performed using the obtained features to decide whether the patient is infected by coronavirus or it is another lung disease. In this study, we also give a brief review of the latest applications of cough analysis to early screen the COVID-19, and human mobility estimation to limit its spread.

9. ProLab: perceptually uniform projective colour coordinate system [PDF] Back to Contents
  Ivan A. Konovalenko, Anna A. Smagina, Dmitry P. Nikolaev, Petr P. Nikolaev
Abstract: In this work, we propose proLab: a new colour coordinate system derived as a 3D projective transformation of CIE XYZ. We show that proLab is far ahead of the widely used CIELAB coordinate system (though inferior to the modern CAM16-UCS) according to perceptual uniformity evaluated by the STRESS metric in reference to the CIEDE2000 colour difference formula. At the same time, angular errors of chromaticity estimation that are standard for linear colour spaces can also be used in proLab since projective transformations preserve the linearity of manifolds. Unlike in linear spaces, angular errors for different hues are normalized according to human colour discrimination thresholds within proLab. We also demonstrate that shot noise in proLab is more homoscedastic than in CAM16-UCS or other standard colour spaces. This makes proLab a convenient coordinate system in which to perform linear colour analysis.
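
The underlying operation is easy to state: a 3D projective map sends XYZ through a 4x4 homogeneous matrix and divides by the last coordinate. A minimal sketch with placeholder coefficients (not the published proLab matrix):

```python
import numpy as np

M = np.array([
    [1.0, 0.1, 0.0, 0.0],   # hypothetical coefficients; the first three rows
    [0.2, 1.0, 0.1, 0.0],   # form the numerator, the last row the shared
    [0.0, 0.1, 1.0, 0.0],   # denominator of the projective map
    [0.1, 0.2, 0.1, 1.0],
])

def projective_transform(xyz, M):
    h = np.append(xyz, 1.0)      # homogeneous coordinates
    out = M @ h
    return out[:3] / out[3]      # perspective division

print(projective_transform(np.array([0.4, 0.5, 0.3]), M))
```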

10. Decoupled Self Attention for Accurate One Stage Object Detection [PDF] Back to Contents
  Kehe Wu, Zuge Chen, Qi Ma, Xiaoliang Zhang, Wei Li
Abstract: As the scale of object detection dataset is smaller than that of image recognition dataset ImageNet, transfer learning has become a basic training method for deep learning object detection models, which will pretrain the backbone network of object detection model on ImageNet dataset to extract features for classification and localization subtasks. However, the classification task focuses on the salient region features of object, while the location task focuses on the edge features of object, so there is certain deviation between the features extracted by pretrained backbone network and the features used for localization task. In order to solve this problem, a decoupled self attention(DSA) module is proposed for one stage object detection models in this paper. DSA includes two decoupled self-attention branches, so it can extract appropriate features for different tasks. It is located between FPN and head networks of subtasks, so it is used to extract global features based on FPN fused features for different tasks independently. Although the network of DSA module is simple, but it can effectively improve the performance of object detection, also it can be easily embedded in many detection models. Our experiments are based on the representative one-stage detection model RetinaNet. In COCO dataset, when ResNet50 and ResNet101 are used as backbone networks, the detection performances can be increased by 0.4% AP and 0.5% AP respectively. When DSA module and object confidence task are applied in RetinaNet together, the detection performances based on ResNet50 and ResNet101 can be increased by 1.0% AP and 1.4% AP respectively. The experiment results show the effectiveness of DSA module.

11. Understanding Image Retrieval Re-Ranking: A Graph Neural Network Perspective [PDF] Back to Contents
  Xuanmeng Zhang, Minyue Jiang, Zhedong Zheng, Xiao Tan, Errui Ding, Yi Yang
Abstract: The re-ranking approach leverages high-confidence retrieved samples to refine retrieval results, which have been widely adopted as a post-processing tool for image retrieval tasks. However, we notice one main flaw of re-ranking, i.e., high computational complexity, which leads to an unaffordable time cost for real-world applications. In this paper, we revisit re-ranking and demonstrate that re-ranking can be reformulated as a high-parallelism Graph Neural Network (GNN) function. In particular, we divide the conventional re-ranking process into two phases, i.e., retrieving high-quality gallery samples and updating features. We argue that the first phase equals building the k-nearest neighbor graph, while the second phase can be viewed as spreading the message within the graph. In practice, GNN only needs to concern vertices with the connected edges. Since the graph is sparse, we can efficiently update the vertex features. On the Market-1501 dataset, we accelerate the re-ranking processing from 89.2s to 9.4ms with one K40m GPU, facilitating the real-time post-processing. Similarly, we observe that our method achieves comparable or even better retrieval results on the other four image retrieval benchmarks, i.e., VeRi-776, Oxford-5k, Paris-6k and University-1652, with limited time cost. Our code is publicly available.
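
The two phases map directly onto a few lines of linear algebra: build a k-NN graph from cosine similarity, then let each vertex aggregate its neighbors' features. A minimal numpy sketch, with uniform aggregation weights as a simplifying assumption:

```python
import numpy as np

def rerank_features(feats, k=5):
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                # pairwise cosine similarity
    nbrs = np.argsort(-sim, axis=1)[:, 1:k + 1]  # phase 1: k-nearest-neighbor graph
    updated = f + f[nbrs].mean(axis=1)           # phase 2: one message-passing round
    return updated / np.linalg.norm(updated, axis=1, keepdims=True)

refined = rerank_features(np.random.default_rng(0).random((100, 64)))
```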

12. WDNet: Watermark-Decomposition Network for Visible Watermark Removal [PDF] Back to Contents
  Yang Liu, Zhen Zhu, Xiang Bai
Abstract: Visible watermarks are widely-used in images to protect copyright ownership. Analyzing watermark removal helps to reinforce the anti-attack techniques in an adversarial way. Current removal methods normally leverage image-to-image translation techniques. Nevertheless, the uncertainty of the size, shape, color and transparency of the watermarks set a huge barrier for these methods. To combat this, we combine traditional watermarked image decomposition into a two-stage generator, called Watermark-Decomposition Network (WDNet), where the first stage predicts a rough decomposition from the whole watermarked image and the second stage specifically centers on the watermarked area to refine the removal results. The decomposition formulation enables WDNet to separate watermarks from the images rather than simply removing them. We further show that these separated watermarks can serve as extra nutrients for building a larger training dataset and further improving removal performance. Besides, we construct a large-scale dataset named CLWD, which mainly contains colored watermarks, to fill the vacuum of colored watermark removal dataset. Extensive experiments on the public gray-scale dataset LVW and CLWD consistently show that the proposed WDNet outperforms the state-of-the-art approaches both in accuracy and efficiency.
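
The decomposition formulation rests on the standard visible-watermark composition model: the observed image is an alpha blend of watermark and host, so predicting the watermark and its alpha lets the network invert the blend rather than inpaint. A minimal numpy sketch of that model (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
host = rng.random((64, 64, 3))
watermark = rng.random((64, 64, 3))
alpha = np.clip(rng.random((64, 64, 1)) * 0.5, 0.05, 0.95)   # partial transparency

blended = alpha * watermark + (1.0 - alpha) * host            # forward composition
recovered = (blended - alpha * watermark) / (1.0 - alpha)     # exact inversion given (W, alpha)
assert np.allclose(recovered, host)
```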

13. Agglomerative Clustering of Handwritten Numerals to Determine Similarity of Different Languages [PDF] Back to Contents
  Md. Rahat-uz-Zaman, Shadmaan Hye
Abstract: Handwritten numerals of different languages have various characteristics. Similarities and dissimilarities of the languages can be measured by analyzing the extracted features of the numerals. Handwritten numeral datasets are available and accessible for many renowned languages of different regions. In this paper, several handwritten numeral datasets of different languages are collected. Then they are used to find the similarity among those written languages through determining and comparing the similitude of each handwritten numerals. This will help to find which languages have the same or adjacent parent language. Firstly, a similarity measure of two numeral images is constructed with a Siamese network. Secondly, the similarity of the numeral datasets is determined with the help of the Siamese network and a new random sample with replacement similarity averaging technique. Finally, an agglomerative clustering is done based on the similarities of each dataset. This clustering technique shows some very interesting properties of the datasets. The property focused in this paper is the regional resemblance of the datasets. By analyzing the clusters, it becomes easy to identify which languages are originated from similar regions.

14. FlowMOT: 3D Multi-Object Tracking by Scene Flow Association [PDF] Back to Contents
  Guangyao Zhai, Xin Kong, Jinhao Cui, Yong Liu, Zhen Yang
Abstract: Most end-to-end Multi-Object Tracking (MOT) methods face the problems of low accuracy and poor generalization ability. Although traditional filter-based methods can achieve better results, they are difficult to be endowed with optimal hyperparameters and often fail in varying scenarios. To alleviate these drawbacks, we propose a LiDAR-based 3D MOT framework named FlowMOT, which integrates point-wise motion information into the traditional matching algorithm, enhancing the robustness of the data association. We firstly utilize a scene flow estimation network to obtain implicit motion information between two adjacent frames and calculate the predicted detection for each old tracklet in the previous frame. Then we use Hungarian algorithm to generate optimal matching relations with the ID propagation strategy to finish the tracking task. Experiments on KITTI MOT dataset show that our approach outperforms recent end-to-end methods and achieves competitive performance with the state-of-the-art filter-based method. In addition, ours can work steadily in the various-speed scenes where the filter-based methods may fail.
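
The association step can be illustrated with scipy's Hungarian solver: tracklet positions advanced by the estimated scene flow are matched to new detections by minimizing total distance. A minimal sketch, assuming toy 2D centroids and a distance gate:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

predicted = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 1.0]])  # tracklets moved by scene flow
detections = np.array([[0.2, -0.1], [8.8, 1.1]])

cost = np.linalg.norm(predicted[:, None] - detections[None], axis=2)
rows, cols = linear_sum_assignment(cost)                     # Hungarian algorithm
for r, c in zip(rows, cols):
    if cost[r, c] < 1.0:                                     # gate out implausible matches
        print(f"tracklet {r} -> detection {c} (cost {cost[r, c]:.2f})")
```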

15. Temporal Relational Modeling with Self-Supervision for Action Segmentation [PDF] Back to Contents
  Dong Wang, Di Hu, Xingjian Li, Dejing Dou
Abstract: Temporal relational modeling in video is essential for human action understanding, such as action recognition and action segmentation. Although Graph Convolution Networks (GCNs) have shown promising advantages in relation reasoning on many tasks, it is still a challenge to apply graph convolution networks on long video sequences effectively. The main reason is that large number of nodes (i.e., video frames) makes GCNs hard to capture and model temporal relations in videos. To tackle this problem, in this paper, we introduce an effective GCN module, Dilated Temporal Graph Reasoning Module (DTGRM), designed to model temporal relations and dependencies between video frames at various time spans. In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs where the nodes represent frames from different moments in video. Moreover, to enhance temporal reasoning ability of the proposed model, an auxiliary self-supervised task is proposed to encourage the dilated temporal graph reasoning module to find and correct wrong temporal relations in videos. Our DTGRM model outperforms state-of-the-art action segmentation models on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset. The code is available at this https URL.
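
The multi-level dilated temporal graph itself is simple to construct: at dilation d, frame i is linked to frames i-d and i+d, giving one adjacency matrix per level. A minimal sketch, with the dilation set {1, 2, 4} as an illustrative choice:

```python
import numpy as np

def dilated_temporal_graphs(num_frames, dilations=(1, 2, 4)):
    graphs = []
    for d in dilations:
        adj = np.zeros((num_frames, num_frames))
        idx = np.arange(num_frames - d)
        adj[idx, idx + d] = adj[idx + d, idx] = 1.0   # symmetric temporal links
        graphs.append(adj)
    return graphs

for d, adj in zip((1, 2, 4), dilated_temporal_graphs(8)):
    print(f"dilation {d}: {int(adj.sum() / 2)} edges")
```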

16. Improving Video Instance Segmentation by Light-weight Temporal Uncertainty Estimates [PDF] Back to Contents
  Kira Maag, Matthias Rottmann, Fabian Hüger, Peter Schlicht, Hanno Gottschalk
Abstract: Instance segmentation with neural networks is an essential task in environment perception. However, the networks can predict false positive instances with high confidence values and true positives with low ones. Hence, it is important to accurately model the uncertainties of neural networks to prevent safety issues and foster interpretability. In applications such as automated driving the detection of road users like vehicles and pedestrians is of highest interest. We present a temporal approach to detect false positives and investigate uncertainties of instance segmentation networks. Since image sequences are available for online applications, we track instances over multiple frames and create temporal instance-wise aggregated metrics of uncertainty. The prediction quality is estimated by predicting the intersection over union as performance measure. Furthermore, we show how to use uncertainty information to replace the traditional score value from object detection and improve the overall performance of instance segmentation networks.

17. Sign-Agnostic Implicit Learning of Surface Self-Similarities for Shape Modeling and Reconstruction from Raw Point Clouds [PDF] Back to Contents
  Wenbin Zhao, Jiabao Lei, Yuxin Wen, Jianguo Zhang, Kui Jia
Abstract: Shape modeling and reconstruction from raw point clouds of objects stand as a fundamental challenge in vision and graphics research. Classical methods consider analytic shape priors; however, their performance degraded when the scanned points deviate from the ideal conditions of cleanness and completeness. Important progress has been recently made by data-driven approaches, which learn global and/or local models of implicit surface representations from auxiliary sets of training shapes. Motivated from a universal phenomenon that self-similar shape patterns of local surface patches repeat across the entire surface of an object, we aim to push forward the data-driven strategies and propose to learn a local implicit surface network for a shared, adaptive modeling of the entire surface for a direct surface reconstruction from raw point cloud; we also enhance the leveraging of surface self-similarities by improving correlations among the optimized latent codes of individual surface patches. Given that orientations of raw points could be unavailable or noisy, we extend sign agnostic learning into our local implicit model, which enables our recovery of signed implicit fields of local surfaces from the unsigned inputs. We term our framework as Sign-Agnostic Implicit Learning of Surface Self-Similarities (SAIL-S3). With a global post-optimization of local sign flipping, SAIL-S3 is able to directly model raw, un-oriented point clouds and reconstruct high-quality object surfaces. Experiments show its superiority over existing methods.
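
The sign-agnostic ingredient can be shown in isolation: the network's signed output is regressed, in absolute value, against unsigned distances, so unoriented points can supervise a signed field whose global sign is fixed afterwards. A minimal PyTorch sketch in the spirit of sign-agnostic learning; the MLP and the sampling are stand-ins for the paper's local patch networks:

```python
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))  # signed implicit field

points = torch.rand(1024, 3)                             # raw, unoriented surface samples
queries = points + 0.05 * torch.randn_like(points)       # perturbed query locations
unsigned = (queries - points).norm(dim=1, keepdim=True)  # stand-in for unsigned distance

# Penalize |f(q)| against the unsigned distance; the sign of f is left free
# and resolved later (e.g., by a global sign-flip post-optimization).
loss = (field(queries).abs() - unsigned).abs().mean()
loss.backward()
```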

18. Deep Learning for Material recognition: most recent advances and open challenges [PDF] Back to Contents
  Alain Tremeau, Sixiang Xu, Damien Muselet
Abstract: Recognizing material from color images is still a challenging problem today. While deep neural networks provide very good results on object recognition and has been the topic of a huge amount of papers in the last decade, their adaptation to material images still requires some works to reach equivalent accuracies. Nevertheless, recent studies achieve very good results in material recognition with deep learning and we propose, in this paper, to review most of them by focusing on three aspects: material image datasets, influence of the context and ad hoc descriptors for material appearance. Every aspect is introduced by a systematic manner and results from representative works are cited. We also present our own studies in this area and point out some open challenges for future works.

19. Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU [PDF] Back to Contents
  Shipra Jain, Danda Pani Paudel, Martin Danelljan, Luc Van Gool
Abstract: The state-of-the-art object detection and image classification methods can perform impressively on more than 9k and 10k classes, respectively. In contrast, the number of classes in semantic segmentation datasets is relatively limited. This is not surprising when the restrictions caused by the lack of labeled data and high computation demand for segmentation are considered. In this paper, we propose a novel training methodology to train and scale the existing semantic segmentation models for a large number of semantic classes without increasing the memory overhead. In our embedding-based scalable segmentation approach, we reduce the space complexity of the segmentation model's output from O(C) to O(1), propose an approximation method for ground-truth class probability, and use it to compute cross-entropy loss. The proposed approach is general and can be adopted by any state-of-the-art segmentation model to gracefully scale it for any number of semantic classes with only one GPU. Our approach achieves similar, and in some cases, even better mIoU for Cityscapes, Pascal VOC, ADE20k, COCO-Stuff10k datasets when adopted to DeeplabV3+ model with different backbones. We demonstrate a clear benefit of our approach on a dataset with 1284 classes, bootstrapped from LVIS and COCO annotations, with three times better mIoU than the DeeplabV3+ model.
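
The O(C)-to-O(1) idea can be caricatured in a few lines: the decoder emits a fixed-dimension embedding per pixel, and class scores are formed against a class-embedding table in chunks, so no C-channel logit map is ever stored. This sketch uses an exact chunked cross-entropy rather than the paper's approximation, and all dimensions are illustrative:

```python
import torch

d, C, H, W = 32, 1284, 64, 64
pixel_emb = torch.randn(1, d, H, W, requires_grad=True)   # network output: O(d) channels
class_emb = torch.randn(C, d)                             # one learned vector per class
target = torch.randint(0, C, (1, H, W))

flat = pixel_emb.flatten(2).transpose(1, 2).reshape(-1, d)   # (H*W, d) pixel embeddings
lse = torch.cat([(chunk @ class_emb.T).logsumexp(dim=1)      # score classes chunk-wise,
                 for chunk in torch.split(flat, 1024)])      # never keeping full logits
loss = (lse - (flat * class_emb[target.flatten()]).sum(dim=1)).mean()
loss.backward()
```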

20. Aggregative Self-Supervised Feature Learning [PDF] Back to Contents
  Jiuwen Zhu, Yuexiang Li, S. Kevin Zhou
Abstract: Self-supervised learning (SSL) is an efficient approach that addresses the issue of annotation shortage. The key part in SSL is its proxy task that defines the supervisory signals and drives the learning toward effective feature representations. However, most SSL approaches usually focus on a single proxy task, which greatly limits the expressive power of the learned features and therefore deteriorates the network generalization capacity. In this regard, we hereby propose three strategies of aggregation in terms of complementarity of various forms to boost the robustness of self-supervised learned features. In spatial context aggregative SSL, we contribute a heuristic SSL method that integrates two ad-hoc proxy tasks with spatial context complementarity, modeling global and local contextual features, respectively. We then propose a principled framework of multi-task aggregative self-supervised learning to form a unified representation, with an intent of exploiting feature complementarity among different tasks. Finally, in self-aggregative SSL, we propose to self-complement an existing proxy task with an auxiliary loss function based on a linear centered kernel alignment metric, which explicitly promotes the exploring of where are uncovered by the features learned from a proxy task at hand to further boost the modeling capability. Our extensive experiments on 2D natural image and 3D medical image classification tasks under limited annotation scenarios confirm that the proposed aggregation strategies successfully boost the classification accuracy.

21. Learned Video Codec with Enriched Reconstruction for CLIC P-frame Coding [PDF] Back to Contents
  David Alexandre, Hsueh-Ming Hang
Abstract: This paper proposes a learning-based video codec, specifically used for the Challenge on Learned Image Compression (CLIC, CVPR Workshop) 2020 P-frame coding. More specifically, we designed a compressor network with Refine-Net for coding residual signals and motion vectors. Also, for motion estimation, we introduced a hierarchical, attention-based ME-Net. To verify our design, we conducted an extensive ablation study on our modules and different input formats. Our video codec demonstrates its performance by using the perfect reference frame at the decoder side specified by the CLIC P-frame Challenge. The experimental result shows that our proposed codec is very competitive with the Challenge top performers in terms of quality metrics.

22. Pyramid-Focus-Augmentation: Medical Image Segmentation with Step-Wise Focus [PDF] Back to Contents
  Vajira Thambawita, Steven Hicks, Pål Halvorsen, Michael A. Riegler
Abstract: Segmentation of findings in the gastrointestinal tract is a challenging but important task and an essential building block for adequate automatic decision support systems. In this work, we present our solution for the Medico 2020 task, which focused on the problem of colon polyp segmentation. We present our simple but efficient idea of using an augmentation method that uses grids in a pyramid-like manner (large to small) for segmentation. Our results show that the proposed methods work as intended and can also lead to comparable results when competing with other methods.

23. DSM Refinement with Deep Encoder-Decoder Networks [PDF] Back to Contents
  Nando Metzger
Abstract: 3D city models can be generated from aerial images. However, the calculated DSMs suffer from noise, artefacts, and data holes that have to be manually cleaned up in a time-consuming process. This work presents an approach that automatically refines such DSMs. The key idea is to teach a neural network the characteristics of urban area from reference data. In order to achieve this goal, a loss function consisting of an L1 norm and a feature loss is proposed. These features are constructed using a pre-trained image classification network. To learn to update the height maps, the network architecture is set up based on the concept of deep residual learning and an encoder-decoder structure. The results show that this combination is highly effective in preserving the relevant geometric structures while removing the undesired artefacts and noise.

24. One-Shot Learning with Triplet Loss for Vegetation Classification Tasks [PDF] Back to Contents
  Alexander Uzhinskiy, Gennady Ososkov, Pavel Goncharov, Andrey Nechaevskiy, Artem Smetanin
Abstract: Triplet loss function is one of the options that can significantly improve the accuracy of the One-shot Learning tasks. Starting from 2015, many projects use Siamese networks and this kind of loss for face recognition and object classification. In our research, we focused on two tasks related to vegetation. The first one is plant disease detection on 25 classes of five crops (grape, cotton, wheat, cucumbers, and corn). This task is motivated because harvest losses due to diseases is a serious problem for both large farming structures and rural families. The second task is the identification of moss species (5 classes). Mosses are natural bioaccumulators of pollutants; therefore, they are used in environmental monitoring programs. The identification of moss species is an important step in the sample preprocessing. In both tasks, we used self-collected image databases. We tried several deep learning architectures and approaches. Our Siamese network architecture with a triplet loss function and MobileNetV2 as a base network showed the most impressive results in both above-mentioned tasks. The average accuracy for plant disease detection amounted to over 97.8% and 97.6% for moss species classification.
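
The training objective is the classic triplet margin loss over a shared embedding network. A minimal PyTorch sketch, with a toy encoder standing in for the MobileNetV2 backbone used in the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # toy stand-in encoder

anchor = embed(torch.rand(8, 3, 64, 64))     # e.g. diseased-leaf crops
positive = embed(torch.rand(8, 3, 64, 64))   # same class as the anchor
negative = embed(torch.rand(8, 3, 64, 64))   # different class

# Pull anchor/positive together, push anchor/negative apart by the margin.
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
loss.backward()
```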

25. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation [PDF] Back to Contents
  Xiaoyang Lyu, Liang Liu, Mengmeng Wang, Xin Kong, Lina Liu, Yong Liu, Xinxin Chen, Yi Yuan
Abstract: Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-designing the skip-connection in DepthNet to get better high-resolution features and (2) proposing a feature fusion Squeeze-and-Excitation (fSE) module to fuse features more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as the encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high resolution with only 20% of the parameters. All codes and models will be available at this https URL.
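
The fSE module builds on the standard Squeeze-and-Excitation block, sketched below in PyTorch; the fusion-specific wiring of HR-Depth is not reproduced here, and the reduction ratio is an illustrative choice:

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze: global average pooling
        return x * w[:, :, None, None]      # excite: per-channel reweighting

print(SE(64)(torch.rand(1, 64, 32, 32)).shape)
```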

26. The Open Brands Dataset: Unified brand detection and recognition at scale [PDF] Back to Contents
  Xuan Jin, Wei Su, Rong Zhang, Yuan He, Hui Xue
Abstract: Intellectual property protection (IPP) has received more and more attention recently due to the development of global e-commerce platforms. Brand recognition plays a significant role in IPP. Recent studies for brand recognition and detection are based on small-scale datasets that are not comprehensive enough when exploring emerging deep learning techniques. Moreover, it is challenging to evaluate the true performance of brand detection methods in realistic and open scenes. In order to tackle these problems, we first define the special issues of brand detection and recognition compared with generic object detection. Second, a novel brands benchmark called "Open Brands" is established. The dataset contains 1,437,812 images which have brands and 50,000 images without any brand. The part with brands in Open Brands contains 3,113,828 instances annotated in 3 dimensions: 4 types, 559 brands and 1216 logos. To the best of our knowledge, it is the largest dataset for brand detection and recognition with rich annotations. We provide in-depth comprehensive statistics about the dataset, validate the quality of the annotations and study how the performance of many modern models evolves with an increasing amount of training data. Third, we design a network called "Brand Net" to handle brand recognition. Brand Net gets state-of-the-art mAP on Open Brands compared with existing detection methods.

27. Articulated Shape Matching Using Laplacian Eigenfunctions and Unsupervised Point Registration [PDF] Back to Contents
  Diana Mateus, Radu Horaud, David Knossow, Fabio Cuzzolin, Edmond Boyer
Abstract: Matching articulated shapes represented by voxel-sets reduces to maximal sub-graph isomorphism when each set is described by a weighted graph. Spectral graph theory can be used to map these graphs onto lower dimensional spaces and match shapes by aligning their embeddings in virtue of their invariance to change of pose. Classical graph isomorphism schemes relying on the ordering of the eigenvalues to align the eigenspaces fail when handling large data-sets or noisy data. We derive a new formulation that finds the best alignment between two congruent $K$-dimensional sets of points by selecting the best subset of eigenfunctions of the Laplacian matrix. The selection is done by matching eigenfunction signatures built with histograms, and the retained set provides a smart initialization for the alignment problem with a considerable impact on the overall performance. Dense shape matching casted into graph matching reduces then, to point registration of embeddings under orthogonal transformations; the registration is solved using the framework of unsupervised clustering and the EM algorithm. Maximal subset matching of non identical shapes is handled by defining an appropriate outlier class. Experimental results on challenging examples show how the algorithm naturally treats changes of topology, shape variations and different sampling densities.
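
The spectral embedding step is standard and compact: build a weighted graph over the points, take the smallest nontrivial Laplacian eigenvectors as an embedding, and summarize each eigenfunction with a histogram signature for matching. A minimal sketch; the graph construction and histogram size are illustrative assumptions:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

rng = np.random.default_rng(0)
pts = rng.random((200, 3))
d2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
W = np.exp(-d2 / 0.05) * (d2 < 0.1)         # local Gaussian-weighted graph
np.fill_diagonal(W, 0.0)

vals, vecs = eigh(laplacian(W, normed=True))
embedding = vecs[:, 1:6]                     # skip the constant eigenvector

signatures = [np.histogram(embedding[:, k], bins=16, range=(-0.5, 0.5))[0]
              for k in range(embedding.shape[1])]   # per-eigenfunction histograms
```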

28. Intrinsic Image Captioning Evaluation [PDF]
  Chao Zeng, Sam Kwong
Abstract: The image captioning task aims to generate suitable descriptions from images. This task poses several challenges such as accuracy, fluency and diversity, yet few metrics cover all these properties when evaluating the results of captioning models. In this paper we first conduct a comprehensive investigation of contemporary metrics. Motivated by the auto-encoder mechanism and research advances in word embeddings, we propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE). We select several state-of-the-art image captioning models and test their performance on the MS COCO dataset with respect to both contemporary metrics and the proposed I2CE. Experimental results show that our proposed method keeps robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less-aligned semantics. In this regard, the proposed metric could serve as a novel indicator of the intrinsic information shared between captions, which may be complementary to existing ones.

29. Combining Similarity and Adversarial Learning to Generate Visual Explanation: Application to Medical Image Classification [PDF]
  Martin Charachon, Céline Hudelot, Paul-Henry Cournède, Camille Ruppli, Roberto Ardon
Abstract: Explaining the decisions of black-box classifiers is paramount in sensitive domains such as medical imaging, since clinicians' confidence is necessary for adoption. Various explanation approaches have been proposed, among which perturbation-based approaches are very promising. Within this class of methods, we leverage a learning framework to produce our visual explanations. From a given classifier, we train two generators to produce, from an input image, so-called similar and adversarial images. The similar image shall be classified as the input image, whereas the adversarial one shall not. The visual explanation is built as the difference between these two generated images. Using metrics from the literature, our method outperforms state-of-the-art approaches. The proposed approach is model-agnostic and has a low computation burden at prediction time. Thus, it is suited to real-time systems. Finally, we show that random geometric augmentations applied to the original image play a regularization role that improves several previously proposed explanation methods. We validate our approach on a large chest X-ray database.

30. Morphology on categorical distributions [PDF]
  Silas Nyboe Ørting, Hans Jacob Teglbjærg Stephensen, Jon Sporring
Abstract: The categorical distribution is a natural representation of uncertainty in multi-class segmentations. In the two-class case the categorical distribution reduces to the Bernoulli distribution, for which grayscale morphology provides a range of useful operations. In the general case, applying morphological operations on uncertain multi-class segmentations is not straightforward as an image of categorical distributions is not a complete lattice. Although morphology on color images has received wide attention, this is not so for color-coded or categorical images and even less so for images of categorical distributions. In this work, we establish a set of requirements for morphology on categorical distributions by combining classic morphology with a probabilistic view. We then define operators respecting these requirements, introduce protected operations on categorical distributions and illustrate the utility of these operators on two example tasks: modeling annotator bias in brain tumor segmentations and segmenting vesicle instances from the predictions of a multi-class U-Net.

31. Multi Modal Adaptive Normalization for Audio to Video Generation [PDF]
  Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall
Abstract: Speech-driven facial video generation is a complex problem due to its multi-modal nature, spanning the audio and video domains. The audio carries many underlying attributes such as expression, pitch, loudness and prosody (speaking style), while facial video exhibits great variability in head movement, eye blinks, lip synchronization and the movements of various facial action units, along with the need for temporal smoothness. Synthesizing highly expressive facial videos from an audio input and a static image is still a challenging task for generative adversarial networks. In this paper, we propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length using as input an audio signal and a single image of a person. The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor and class activation map [58] based layers to learn the movements of expressive facial components and hence generates a highly expressive talking-head video of the given person. The multi-modal adaptive normalization uses various features of audio and video, such as the Mel spectrogram, pitch and energy from the audio signal, the predicted keypoint heatmap/optical flow and a single image, to learn the respective affine parameters needed to generate highly expressive video. Experimental evaluation demonstrates superior performance of the proposed method compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [53], Speech2Vid [10], and other approaches, on multiple quantitative metrics including SSIM (structural similarity index), PSNR (peak signal-to-noise ratio), CPBD (image sharpness), WER (word error rate), blinks/sec and LMD (landmark distance). Further, qualitative evaluation and online Turing tests demonstrate the efficacy of our approach.

32. Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer [PDF]
  Jian Liang, Dapeng Hu, Yunbo Wang, Ran He, Jiashi Feng
Abstract: Unsupervised domain adaptation (UDA) aims to transfer knowledge from a related but different well-labeled source domain to a new unlabeled target domain. Most existing UDA methods require access to the source data, and thus are not applicable when the data are confidential and not shareable due to privacy concerns. This paper aims to tackle a realistic setting in which only a classification model trained on the source data, rather than the data itself, is available. To effectively utilize the source model for adaptation, we propose a novel approach called Source HypOthesis Transfer (SHOT), which learns the feature extraction module for the target domain by fitting the target data features to the frozen source classification module (representing the classification hypothesis). Specifically, SHOT exploits both information maximization and self-supervised learning for feature extraction module learning, ensuring that the target features are implicitly aligned with the features of unseen source data via the same hypothesis. Furthermore, we propose a new labeling transfer strategy, which separates the target data into two splits based on the confidence of predictions (labeling information), and then employs semi-supervised learning to improve the accuracy of less-confident predictions in the target domain. We denote labeling transfer as SHOT++ if the predictions are obtained by SHOT. Extensive experiments on both digit classification and object recognition tasks show that SHOT and SHOT++ achieve results surpassing or comparable to the state-of-the-art, demonstrating the effectiveness of our approaches for various visual domain adaptation problems.
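The information-maximization part can be written compactly. Below is a hedged sketch, not the released SHOT code, of an objective that makes per-sample target predictions confident while keeping the batch-level label marginal diverse:

```python
# Hedged sketch of an information-maximization objective as described above;
# not the authors' released implementation.
import torch
import torch.nn.functional as F

def information_maximization_loss(target_logits, eps=1e-6):
    p = F.softmax(target_logits, dim=1)                  # (B, K) predictions
    # (1) conditional entropy: push each sample toward a confident label
    ent = -(p * torch.log(p + eps)).sum(dim=1).mean()
    # (2) diversity: maximize the entropy of the batch-level label marginal
    p_mean = p.mean(dim=0)
    neg_marginal_ent = (p_mean * torch.log(p_mean + eps)).sum()
    return ent + neg_marginal_ent                        # minimize both terms
```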

33. Learning Category-level Shape Saliency via Deep Implicit Surface Networks [PDF]
  Chaozheng Wu, Lin Sun, Xun Xu, Kui Jia
Abstract: This paper is motivated by a fundamental curiosity about what defines a category of object shapes. For example, we may have the common knowledge that a plane has wings, and a chair has legs. Given the large shape variations among different instances of the same category, we are formally interested in developing a quantity defined for individual points on a continuous object surface; the quantity specifies how individual surface points contribute to the formation of the shape as the category. We term such a quantity category-level shape saliency, or shape saliency for short. Technically, we propose to learn saliency maps for shape instances of the same category from a deep implicit surface network; sensible saliency scores for sampled points in the implicit surface field are predicted by constraining the capacity of the input latent code. We also enhance the saliency prediction with an additional contrastive training loss. We expect such learned surface maps of shape saliency to have the properties of smoothness, symmetry, and semantic representativeness. We verify these properties by comparing our method with alternative ways of saliency computation. Notably, we show that by leveraging the learned shape saliency, we are able to reconstruct either category-salient or instance-specific parts of object surfaces; the semantic representativeness of the learned saliency is also reflected in its efficacy in guiding the selection of surface points for better point cloud classification.

34. Semantic Layout Manipulation with High-Resolution Sparse Attention [PDF]
  Haitian Zheng, Zhe Lin, Jingwan Lu, Scott Cohen, Jianming Zhang, Ning Xu, Jiebo Luo
Abstract: We tackle the problem of semantic image layout manipulation, which aims to manipulate an input image by editing its semantic label map. A core problem of this task is how to transfer visual details from the input images to the new semantic layout while making the resulting image visually realistic. Recent work on learning cross-domain correspondence has shown promising results for global layout transfer with dense attention-based warping. However, this method tends to lose texture details due to the lack of smoothness and resolution in the correspondence and warped images. To adapt this paradigm for the layout manipulation task, we propose a high-resolution sparse attention module that effectively transfers visual details to new layouts at a resolution up to 512x512. To further improve visual quality, we introduce a novel generator architecture consisting of a semantic encoder and a two-stage decoder for coarse-to-fine synthesis. Experiments on the ADE20k and Places365 datasets demonstrate that our proposed approach achieves substantial improvements over the existing inpainting and layout manipulation methods.

35. Information-Theoretic Segmentation by Inpainting Error Maximization [PDF]
  Pedro Savarese, Sunnie S. Y. Kim, Michael Maire, Greg Shakhnarovich, David McAllester
Abstract: We study image segmentation from an information-theoretic perspective, proposing a novel adversarial method that performs unsupervised segmentation by partitioning images into maximally independent sets. More specifically, we group image pixels into foreground and background, with the goal of minimizing predictability of one set from the other. An easily computed loss drives a greedy search process to maximize inpainting error over these partitions. Our method does not involve training deep networks, is computationally cheap, class-agnostic, and even applicable in isolation to a single unlabeled image. Experiments demonstrate that it achieves a new state-of-the-art in unsupervised segmentation quality, while being substantially faster and more general than competing approaches.
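To make the objective concrete, here is a toy sketch of scoring a candidate foreground mask by how badly a simple inpainter reconstructs the hidden pixels; the normalized-convolution (Gaussian) inpainter and grayscale float inputs are assumptions for illustration, not the paper's actual inpainter:

```python
# Toy sketch of the inpainting-error objective (illustrative, not the authors'
# implementation); img and mask are float arrays of the same (H, W) shape,
# with mask in {0, 1} marking the hypothesized foreground.
import numpy as np
from scipy.ndimage import gaussian_filter

def inpaint_from_complement(img, mask, sigma=5.0):
    # Normalized convolution: blur only the visible (background) pixels.
    visible = img * (1.0 - mask)
    num = gaussian_filter(visible, sigma)
    den = gaussian_filter(1.0 - mask, sigma) + 1e-6
    return num / den

def inpainting_error(img, mask):
    # A greedy search would perturb `mask` to maximize this score.
    recon = inpaint_from_complement(img, mask)
    return float((mask * (img - recon) ** 2).sum() / (mask.sum() + 1e-6))
```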

36. TDAF: Top-Down Attention Framework for Vision Tasks [PDF]
  Bo Pang, Yizhuo Li, Jiefeng Li, Muchen Li, Hanwen Cao, Cewu Lu
Abstract: Human attention mechanisms often work in a top-down manner, yet this is not well explored in vision research. Here, we propose the Top-Down Attention Framework (TDAF) to capture top-down attention; it can be easily adopted in most existing models. The Recursive Dual-Directional Nested Structure designed within it forms two sets of orthogonal paths, recursive and structural ones, where bottom-up spatial features and top-down attention features are extracted respectively. Such spatial and attention features are nested deeply; therefore, the proposed framework works in a mixed top-down and bottom-up manner. Empirical evidence shows that our TDAF can capture effective stratified attention information and boost performance. ResNet with TDAF achieves a 2.0% improvement on ImageNet. For object detection, performance is improved by 2.7% AP over FCOS. For pose estimation, TDAF improves the baseline by 1.6%. And for action recognition, a 3D-ResNet adopting TDAF achieves a 1.7% accuracy improvement.

37. Deep Optimized Priors for 3D Shape Modeling and Reconstruction [PDF]
  Mingyue Yang, Yuxin Wen, Weikai Chen, Yongwei Chen, Kui Jia
Abstract: Many learning-based approaches have difficulty scaling to unseen data, as the generality of their learned priors is limited to the scale and variations of the training samples. This holds particularly true for 3D learning tasks, given the sparsity of available 3D datasets. We introduce a new learning framework for 3D modeling and reconstruction that greatly improves the generalization ability of a deep generator. Our approach strives to connect the good ends of both learning-based and optimization-based methods. In particular, unlike the common practice that fixes the pre-trained priors at test time, we propose to further optimize the learned prior and latent code according to the input physical measurements after training. We show that the proposed strategy effectively breaks the barriers imposed by the pre-trained priors and can lead to high-quality adaptation to unseen data. We realize our framework using the implicit surface representation and validate the efficacy of our approach on a variety of challenging tasks that take highly sparse or collapsed observations as input. Experimental results show that our approach compares favorably with the state-of-the-art methods in terms of both generality and accuracy.

38. INSPIRE: Intensity and Spatial Information-Based Deformable Image Registration [PDF]
  Johan Öfverstedt, Joakim Lindblad, Nataša Sladoje
Abstract: We present INSPIRE, a top-performing general-purpose method for deformable image registration. INSPIRE extends our existing symmetric registration framework, based on distances combining intensity and spatial information, to an elastic B-splines based transformation model. We also present several theoretical and algorithmic improvements which provide high computational efficiency and thereby make the framework applicable in a wide range of real scenarios. We show that the proposed method delivers highly accurate as well as stable and robust registration results. We evaluate the method on a synthetic dataset created from retinal images, consisting of thin networks of vessels, where INSPIRE exhibits excellent performance, substantially outperforming the reference methods. We also evaluate the method on four benchmark datasets of 3D images of brains, for a total of 2088 pairwise registrations; a comparison with 15 other state-of-the-art methods reveals that INSPIRE provides the best overall performance. Code is available at this http URL.

39. Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [PDF]
  Qingxing Cao, Bailin Li, Xiaodan Liang, Keze Wang, Liang Lin
Abstract: Though beneficial for encouraging Visual Question Answering (VQA) models to discover the underlying knowledge by exploiting the input-output correlation beyond image and text contexts, existing knowledge VQA datasets are mostly annotated in a crowdsourced way, e.g., collecting questions and external reasons from different users via the internet. In addition to the challenge of knowledge reasoning, how to deal with annotator bias also remains unsolved; it often leads to superficial, over-fitted correlations between questions and answers. To address this issue, we propose a novel dataset named Knowledge-Routed Visual Question Reasoning for VQA model evaluation. Considering that a desirable VQA model should correctly perceive the image context, understand the question, and incorporate its learned knowledge, our proposed dataset aims to cut off the shortcut learning exploited by current deep embedding models and push the research boundary of knowledge-based visual question reasoning. Specifically, we generate each question-answer pair based on both the Visual Genome scene graph and an external knowledge base, using controlled programs to disentangle the knowledge from other biases. The programs can select one or two triplets from the scene graph or knowledge base to require multi-step reasoning, avoid answer ambiguity, and balance the answer distribution. In contrast to existing VQA datasets, we further impose the following two major constraints on the programs to incorporate knowledge reasoning: i) multiple knowledge triplets can be related to the question, but only one knowledge item relates to the image object; this forces the VQA model to correctly perceive the image instead of guessing the knowledge from the given question alone; ii) all questions are based on different knowledge, but the candidate answers are the same for both the training and test sets.

40. Meticulous Object Segmentation [PDF]
  Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zhe Lin, Alan Yuille
Abstract: Compared with common image segmentation tasks targeted at low-resolution images, higher-resolution detailed image segmentation receives much less attention. In this paper, we propose and study a task named Meticulous Object Segmentation (MOS), which is focused on segmenting well-defined foreground objects with elaborate shapes in high-resolution images (e.g. 2k - 4k). To this end, we propose MeticulousNet, which leverages a dedicated decoder to capture the object boundary details. Specifically, we design a Hierarchical Point-wise Refining (HierPR) block to better delineate object boundaries, and reformulate the decoding process as a recursive coarse-to-fine refinement of the object mask. To evaluate segmentation quality near object boundaries, we propose the Meticulosity Quality (MQ) score, which considers both mask coverage and boundary precision. In addition, we collect a MOS benchmark dataset including 600 high-quality images with complex objects. We provide comprehensive empirical evidence showing that MeticulousNet can reveal pixel-accurate segmentation boundaries and is superior to state-of-the-art methods for high-resolution object segmentation tasks.

41. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation [PDF]
  Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph
Abstract: Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation ([13, 12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
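Since the mechanism is deliberately simple, it can be sketched directly. A minimal illustration of random Copy-Paste follows, assuming same-sized float images, boolean instance masks, and an assumed 0.5 keep probability (the paper additionally applies scale jittering and edge blurring):

```python
# Minimal sketch of random Copy-Paste; dst_img/src_img are (H, W, 3) float
# arrays of equal size, and each mask list holds boolean (H, W) instance masks.
import numpy as np

def copy_paste(dst_img, dst_masks, src_img, src_masks, rng=np.random):
    # Randomly choose a subset of source instances to paste.
    keep = [m for m in src_masks if rng.rand() < 0.5]
    out = dst_img.copy()
    pasted = np.zeros(dst_img.shape[:2], dtype=bool)
    for m in keep:
        out[m] = src_img[m]        # alpha = 1 paste; edges may be blurred too
        pasted |= m
    # Occlude destination instances where new objects landed on top of them.
    new_dst_masks = [m & ~pasted for m in dst_masks]
    return out, new_dst_masks + keep
```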

42. MSAF: Multimodal Split Attention Fusion [PDF]
  Lang Su, Chuqing Hu, Guofa Li, Dongpu Cao
Abstract: Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse features of any unimodal networks and utilize existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.

43. FSOCO: The Formula Student Objects in Context Dataset [PDF]
  David Dodel, Michael Schötz, Niclas Vödisch
Abstract: This paper presents the FSOCO dataset, a collaborative dataset for vision-based cone detection systems in Formula Student Driverless competitions. It contains human-annotated ground-truth labels for both bounding boxes and instance-wise segmentation masks. The data buy-in philosophy of FSOCO asks student teams to contribute to the database before being granted access, ensuring continuous growth. By providing clear labeling guidelines and tools for sophisticated raw image selection, new annotations are guaranteed to meet the desired quality. The effectiveness of the approach is shown by comparing the prediction results of a network trained on FSOCO with those of its unregulated predecessor. The FSOCO dataset can be found at this http URL.

44. Location-aware Single Image Reflection Removal [PDF]
  Zheng Dong, Ke Xu, Yin Yang, Hujun Bao, Weiwei Xu, Rynson W.H. Lau
Abstract: This paper proposes a novel location-aware deep learning-based single image reflection removal method. Our network has a reflection detection module to regress a probabilistic reflection confidence map, taking multi-scale Laplacian features as inputs. This probabilistic map tells whether a region is reflection-dominated or transmission-dominated. The novelty is that we use the reflection confidence map as the cues for the network to learn how to encode the reflection information adaptively and control the feature flow when predicting reflection and transmission layers. The integration of location information into the network significantly improves the quality of reflection removal results. Besides, a set of learnable Laplacian kernel parameters is introduced to facilitate the extraction of discriminative Laplacian features for reflection detection. We design our network as a recurrent network to progressively refine each iteration's reflection removal results. Extensive experiments verify the superior performance of the proposed method over state-of-the-art approaches.

45. Iterative Knowledge Exchange Between Deep Learning and Space-Time Spectral Clustering for Unsupervised Segmentation in Videos [PDF]
  Emanuela Haller, Adina Magda Florea, Marius Leordeanu
Abstract: We propose a dual system for unsupervised object segmentation in video, which brings together two modules with complementary properties: a space-time graph that discovers objects in videos and a deep network that learns powerful object features. The system uses an iterative knowledge exchange policy. A novel spectral space-time clustering process on the graph produces unsupervised segmentation masks passed to the network as pseudo-labels. The net learns to segment in single frames what the graph discovers in video, and passes back to the graph strong image-level features that improve its node-level features in the next iteration. Knowledge is exchanged for several cycles until convergence. The graph has one node for each video pixel, but the object discovery is fast. It uses a novel power iteration algorithm that computes the main space-time cluster as the principal eigenvector of a special Feature-Motion matrix, without actually computing the matrix. The thorough experimental analysis validates our theoretical claims and proves the effectiveness of the cyclical knowledge exchange. We also perform experiments on the supervised scenario, incorporating features pretrained with human supervision. We achieve state-of-the-art level on unsupervised and supervised scenarios on four challenging datasets: DAVIS, SegTrack, YouTube-Objects, and DAVSOD.
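The eigenvector computation above rests on the classic power iteration, which needs only matrix-vector products and therefore never has to materialize the matrix. A generic sketch (not the paper's specialized Feature-Motion variant):

```python
# Generic power iteration: find the principal eigenvector of a matrix that is
# only accessible through a matvec callback, so it is never built explicitly.
import numpy as np

def power_iteration(matvec, dim, iters=100, tol=1e-7, rng=np.random):
    v = rng.randn(dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v)                  # the only access to the matrix
        w_norm = np.linalg.norm(w)
        if w_norm == 0.0:
            return v
        w /= w_norm
        if np.linalg.norm(w - v) < tol:
            return w                   # converged to the principal eigenvector
        v = w
    return v
```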

46. DFR: Deep Feature Reconstruction for Unsupervised Anomaly Segmentation [PDF]
  Jie Yang, Yong Shi, Zhiquan Qi
Abstract: Automatically detecting anomalous regions in images of objects or textures without priors on the anomalies is challenging, especially when the anomalies occupy very small areas of the images and produce visual variations that are difficult to detect, such as defects on manufactured products. This paper proposes an effective unsupervised anomaly segmentation approach that can detect and segment anomalies in small and confined regions of images. Concretely, we develop a multi-scale regional feature generator that produces multiple spatial context-aware representations from pre-trained deep convolutional networks for every subregion of an image. The regional representations not only describe the local characteristics of the corresponding regions but also encode their multiple spatial contexts, making them discriminative and very beneficial for anomaly detection. Leveraging these descriptive regional features, we then design a deep yet efficient convolutional autoencoder and detect anomalous regions within images via fast feature reconstruction. Our method is simple yet effective and efficient. It advances the state-of-the-art performance on several benchmark datasets and shows great potential for real applications.
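At test time the scoring reduces to a per-location feature reconstruction error. A minimal sketch under assumed details (the feature tensor layout and the mean-squared error are illustrative):

```python
# Hedged sketch of feature-reconstruction anomaly scoring: an autoencoder
# trained only on defect-free regional features reconstructs test features,
# and the per-location error serves as the anomaly map.
import torch

def anomaly_map(features, autoencoder):
    # features: (B, C, H, W) pre-trained regional features, already fused
    # across scales to a common resolution (an assumption of this sketch).
    recon = autoencoder(features)
    return ((features - recon) ** 2).mean(dim=1)   # (B, H, W) anomaly scores
```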

47. Learning Heatmap-Style Jigsaw Puzzles Provides Good Pretraining for 2D Human Pose Estimation [PDF]
  Kun Zhang, Rui Wu, Ping Yao, Kai Deng, Ding Li, Renbiao Liu, Chuanguang Yang, Ge Chen, Min Du, Tianyao Zheng
Abstract: The target of 2D human pose estimation is to locate the keypoints of body parts in input 2D images. State-of-the-art methods for pose estimation usually construct pixel-wise heatmaps from keypoints as labels for learning convolutional neural networks, which are usually initialized randomly or use classification models on ImageNet as their backbones. We note that the 2D pose estimation task is highly dependent on the contextual relationship between image patches, thus we introduce a self-supervised method for pretraining 2D pose estimation networks. Specifically, we propose the Heatmap-Style Jigsaw Puzzles (HSJP) problem as our pretext task, whose target is to learn the location of each patch from an image composed of shuffled patches. During our pretraining process, we only use images of person instances in MS-COCO, rather than introducing an extra and much larger ImageNet dataset. A heatmap-style label for patch location is designed and our learning process is non-contrastive. The weights learned by the HSJP pretext task are utilised as backbones of the 2D human pose estimator, which are then finetuned on the MS-COCO human keypoints dataset. With two popular and strong 2D human pose estimators, HRNet and SimpleBaseline, we evaluate the mAP score on both the MS-COCO validation and test-dev datasets. Our experiments show that downstream pose estimators with our self-supervised pretraining obtain much better performance than those trained from scratch, and are comparable to those using ImageNet classification models as their initial backbones.

48. MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish [PDF]
  Begum Citamak, Ozan Caglayan, Menekse Kuyu, Erkut Erdem, Aykut Erdem, Pranava Madhyastha, Lucia Specia
Abstract: Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English-Turkish descriptions also enables the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphology rich and agglutinative languages.

49. EfficientPose: Efficient Human Pose Estimation with Neural Architecture Search [PDF]
  Wenqiang Zhang, Jiemin Fang, Xinggang Wang, Wenyu Liu
Abstract: Human pose estimation from images and video is a vital task in many multimedia applications. Previous methods achieve great performance but rarely take efficiency into consideration, which makes it difficult to deploy the networks on resource-constrained devices. Nowadays, real-time multimedia applications call for more efficient models for better interaction. Moreover, most deep neural networks for pose estimation directly reuse networks designed for image classification as the backbone, which are not yet optimized for the pose estimation task. In this paper, we propose an efficient framework for human pose estimation with two parts, an efficient backbone and an efficient head. By implementing a differentiable neural architecture search method, we customize the backbone network design for pose estimation and reduce the computation cost with negligible accuracy degradation. For the efficient head, we slim the transposed convolutions and propose a spatial information correction module to promote the performance of the final prediction. In experiments, we evaluate our networks on the MPII and COCO datasets. Our smallest model has only 0.65 GFLOPs with 88.1% PCKh@0.5 on MPII, and our large model has only 2 GFLOPs while its accuracy is competitive with the state-of-the-art large model, i.e., HRNet with 9.5 GFLOPs.

50. Robust Real-Time Pedestrian Detection on Embedded Devices [PDF]
  Mohamed Afifi, Yara Ali, Karim Amer, Mahmoud Shaker, Mohamed Elhelw
Abstract: Detection of pedestrians by embedded devices, such as those on board robots and drones, has many applications including road intersection monitoring, security, crowd monitoring and surveillance, to name a few. However, the problem can be challenging due to the continuously changing camera viewpoint and varying object appearances, as well as the need for lightweight algorithms suitable for embedded systems. This paper proposes a robust framework for pedestrian detection in diverse footage. The framework performs fine and coarse detections on different image regions and exploits temporal and spatial characteristics to attain enhanced accuracy and real-time performance on embedded boards. The framework uses Yolo-v3 object detection [1] as its backbone detector and runs on the Nvidia Jetson TX2 embedded board, though other detectors and/or boards can be used as well. The performance of the framework is demonstrated on two established datasets and by its second-place finish in the CVPR 2019 Embedded Real-Time Inference (ERTI) Challenge.

51. Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network [PDF]
  Jiayi Ji, Yunpeng Luo, Xiaoshuai Sun, Fuhai Chen, Gen Luo, Yongjian Wu, Yue Gao, Rongrong Ji
Abstract: Transformer-based architectures have shown great success in image captioning, where object regions are encoded and then attended into the vectorial representations to guide the caption decoding. However, such vectorial representations only contain region-level information without considering the global information reflecting the entire image, which fails to expand the capability of complex multi-modal reasoning in image captioning. In this paper, we introduce a Global Enhanced Transformer (termed GET) to enable the extraction of a more comprehensive global representation, and then adaptively guide the decoder to generate high-quality captions. In GET, a Global Enhanced Encoder is designed for the embedding of the global feature, and a Global Adaptive Decoder are designed for the guidance of the caption generation. The former models intra- and inter-layer global representation by taking advantage of the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller that can adaptively fuse the global information into the decoder to guide the caption generation. Extensive experiments on MS COCO dataset demonstrate the superiority of our GET over many state-of-the-arts.

52. PoNA: Pose-guided Non-local Attention for Human Pose Transfer [PDF]
  Kun Li, Jinsong Zhang, Yebin Liu, Yu-Kun Lai, Qionghai Dai
Abstract: Human pose transfer, which aims at transferring the appearance of a given person to a target pose, is very challenging and important in many applications. Previous work ignores the guidance of pose features or only uses a local attention mechanism, leading to implausible and blurry results. We propose a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks. In each block, we propose a pose-guided non-local attention (PoNA) mechanism with a long-range dependency scheme to select more important regions of image features to transfer. We also design pre-posed image-guided pose feature updates and post-posed pose-guided image feature updates to better utilize the pose and image features. Our network is simple, stable, and easy to train. Quantitative and qualitative results on the Market-1501 and DeepFashion datasets show the efficacy and efficiency of our model. Compared with state-of-the-art methods, our model generates sharper and more realistic images with rich details, while having fewer parameters and running faster. Furthermore, our generated images can help to alleviate data insufficiency for person re-identification.

53. One-Shot Object Localization in Medical Images based on Relative Position Regression [PDF]
  Wenhui Lei, Wei Xu, Ran Gu, Hao Fu, Shaoting Zhang, Guotai Wang
Abstract: Deep learning networks have shown promising performance for accurate object localization in medical images, but require large amounts of annotated data for supervised training, which is expensive to acquire and demands expert effort. To address this problem, we present a one-shot framework for organ and landmark localization in volumetric medical images, which does not need any annotation during the training stage and can be employed to locate any landmarks or organs in test images given a support (reference) image during the inference stage. Our main idea is that tissues and organs in different human bodies occupy similar relative positions and contexts. Therefore, we can predict the relative positions of their non-local patches and thus locate the target organ. Our framework is composed of three parts: (1) A projection network trained to predict the 3D offset between any two patches from the same volume, requiring no human annotations. In the inference stage, it takes one given landmark in a reference image as a support patch and predicts the offset from a random patch to the corresponding landmark in the test (query) volume. (2) A coarse-to-fine framework containing two projection networks, providing more accurate localization of the target. (3) Based on the coarse-to-fine model, we transfer the organ bounding-box (B-box) detection to locating six extreme points along the x, y and z directions in the query volume. Experiments on multi-organ localization in head-and-neck (HaN) CT volumes show that our method achieves competitive performance in real time, being more accurate and 10^5 times faster than template matching methods under the same setting. Code is available: this https URL.
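To illustrate the inference procedure in part (1), here is a hedged sketch; `offset_net` stands in for the trained projection network (assumed to map a support patch and a query patch to a 3-vector voxel offset), and the patch size, step count and initial guess are assumptions:

```python
# Hedged sketch of one-shot landmark localization by relative position
# regression; `offset_net` is the assumed trained network, and the query
# volume is assumed to be at least `size` voxels along each axis.
import numpy as np

def extract_patch(vol, center, size):
    # Clamp so the cubic patch lies fully inside the volume.
    lo = np.clip(np.asarray(center) - size // 2, 0, np.asarray(vol.shape) - size)
    return vol[lo[0]:lo[0]+size, lo[1]:lo[1]+size, lo[2]:lo[2]+size]

def locate_landmark(offset_net, query_vol, support_patch, size=64, steps=3):
    pos = np.asarray(query_vol.shape) // 2          # coarse initial guess
    for _ in range(steps):                          # jump by predicted offsets
        patch = extract_patch(query_vol, pos, size)
        pos = pos + np.round(offset_net(support_patch, patch)).astype(int)
        pos = np.clip(pos, 0, np.asarray(query_vol.shape) - 1)
    return pos
```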

54. Semi-supervised Segmentation via Uncertainty Rectified Pyramid Consistency and Its Application to Gross Target Volume of Nasopharyngeal Carcinoma [PDF]
  Xiangde Luo, Wenjun Liao, Jieneng Chen, Tao Song, Yinan Chen, Guotai Wang, Shaoting Zhang
Abstract: Gross Target Volume (GTV) segmentation plays an irreplaceable role in radiotherapy planning for Nasopharyngeal Carcinoma (NPC). Although convolutional neural networks (CNNs) have achieved good performance on this task, they rely on a large set of labeled images for training, which is expensive and time-consuming to acquire. Recently, semi-supervised methods that learn from a small set of labeled images together with a large set of unlabeled images have shown potential for dealing with this problem, but it is still challenging to train a high-performance model with a limited number of labeled data. In this paper, we propose a novel framework with Uncertainty Rectified Pyramid Consistency (URPC) regularization for semi-supervised NPC GTV segmentation. Concretely, we extend a backbone segmentation network to produce pyramid predictions at different scales; this pyramid prediction network (PPNet) is supervised by the ground truth of labeled images and by a multi-scale consistency loss for unlabeled images, motivated by the fact that predictions at different scales for the same input should be similar and consistent. However, due to the different resolutions of these predictions, directly encouraging them to be consistent at each pixel is not robust and may introduce much noise, leading to a performance drop. To deal with this dilemma, we further design a novel uncertainty rectifying module that enables the framework to gradually learn from meaningful and reliable consensual regions at different scales. Extensive experiments on our collected NPC dataset with 258 volumes show that our method can largely improve performance by incorporating the unlabeled data, and the framework achieves promising results compared with existing semi-supervised methods, reaching 81.22% mean DSC and 1.88 voxels mean ASD on the test set when only 20% of the training set is annotated.
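A hedged sketch of such an uncertainty-rectified multi-scale consistency term is given below; the KL-based uncertainty and exponential down-weighting follow the spirit of the description, though the paper's exact formulation may differ:

```python
# Hedged sketch of uncertainty-weighted multi-scale consistency; each entry of
# pyramid_logits is a (B, K, H, W) tensor already upsampled to a common size.
import torch
import torch.nn.functional as F

def urpc_consistency(pyramid_logits, eps=1e-6):
    probs = [F.softmax(p, dim=1) for p in pyramid_logits]
    mean_p = torch.stack(probs).mean(dim=0)        # consensus prediction
    loss = 0.0
    for p in probs:
        # Per-pixel KL to the consensus acts as a scale-wise uncertainty.
        kl = (p * ((p + eps) / (mean_p + eps)).log()).sum(dim=1, keepdim=True)
        w = torch.exp(-kl)                         # rectify: trust low-KL pixels
        loss = loss + (w * (p - mean_p) ** 2).mean()
    return loss / len(probs)
```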

55. Uncertainty Estimation in Deep Neural Networks for Point Cloud Segmentation in Factory Planning [PDF]
  Christina Petschnigg, Juergen Pilz
Abstract: The digital factory undoubtedly offers great potential for future production systems in terms of efficiency and effectiveness. A key aspect on the way to realizing the digital copy of a real factory is the understanding of complex indoor environments on the basis of 3D data. In order to generate an accurate factory model including the major components, i.e. building parts, product assets and process details, the 3D data collected during digitalization can be processed with advanced methods of deep learning. In this work, we propose a fully Bayesian and an approximate Bayesian neural network for point cloud segmentation. This allows us to analyze how different ways of estimating uncertainty in these networks improve segmentation results on raw 3D point clouds. Both the Bayesian and the approximate Bayesian model achieve superior performance compared to the frequentist one. This performance difference becomes even more striking when incorporating the networks' uncertainty into their predictions. For evaluation we use the scientific dataset S3DIS as well as a dataset collected by the authors at a German automotive production plant. The methods proposed in this work lead to more accurate segmentation results, and the incorporation of uncertainty information makes this approach especially applicable to safety-critical applications.
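One common way to approximate Bayesian inference in such networks, offered here as a hedged illustration rather than the authors' exact model, is Monte Carlo dropout: keep dropout stochastic at test time and take the entropy of the averaged per-point predictions as the uncertainty estimate:

```python
# Hedged sketch: Monte Carlo dropout as an approximate Bayesian posterior.
# `model` is any point cloud segmentation net that contains dropout layers.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_uncertainty(model, points, n_samples=20, eps=1e-6):
    model.train()   # keeps dropout active; in practice, freeze batch-norm too
    probs = torch.stack([F.softmax(model(points), dim=1)
                         for _ in range(n_samples)]).mean(dim=0)
    entropy = -(probs * (probs + eps).log()).sum(dim=1)  # per-point uncertainty
    return probs.argmax(dim=1), entropy
```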

56. Efficient Human Pose Estimation by Learning Deeply Aggregated Representations [PDF]
  Zhengxiong Luo, Zhicheng Wang, Yuanhao Cai, Guanan Wang, Yan Huang, Liang Wang, Erjin Zhou, Jian Sun
Abstract: In this paper, we propose an efficient human pose estimation network (DANet) by learning deeply aggregated representations. Most existing models explore multi-scale information mainly from features with different spatial sizes. Powerful multi-scale representations usually rely on the cascaded pyramid framework. This framework largely boosts performance but meanwhile makes networks very deep and complex. Instead, we focus on exploiting multi-scale information from layers with different receptive-field sizes and then making full use of this information by improving the fusion method. Specifically, we propose an orthogonal attention block (OAB) and a second-order fusion unit (SFU). The OAB learns multi-scale information from different layers and enhances it by encouraging diversity. The SFU adaptively selects and fuses diverse multi-scale information and suppresses redundant information. This maximizes the effective information in the final fused representations. With the help of the OAB and SFU, our single pyramid network is able to generate deeply aggregated representations that contain even richer multi-scale information and have a larger representing capacity than those of cascaded networks. Thus, our networks can achieve comparable or even better accuracy with much smaller model complexity. Specifically, our DANet-72 achieves 70.5 AP on the COCO test-dev set with only 1.0G FLOPs, and runs at 58 persons per second (PPS) on a CPU platform.

57. Split then Refine: Stacked Attention-guided ResUNets for Blind Single Image Visible Watermark Removal [PDF]
  Xiaodong Cun, Chi-Man Pun
Abstract: Digital watermarking is a commonly used technique to protect the copyright of media. At the same time, to stress-test watermark robustness, attacking techniques such as watermark removal have also drawn attention from the community. Previous watermark removal methods require the watermark location to be supplied by users, or train a multi-task network to recover the background indiscriminately. However, under joint learning, the network performs better on watermark detection than on recovering the texture. Inspired by this observation, and in order to erase visible watermarks blindly, we propose a novel two-stage framework with stacked attention-guided ResUNets that simulates the processes of detection, removal and refinement. In the first stage, we design a multi-task network called SplitNet. It learns the basis features for the three sub-tasks jointly, while the task-specific features separately use multi-channel attention. Then, with the predicted mask and the coarser restored image, we design RefineNet to smooth the watermarked region with mask-guided spatial attention. Besides the network structure, the proposed algorithm also combines multiple perceptual losses for better quality both visually and numerically. We extensively evaluate our algorithm on four different datasets under various settings, and the experiments show that our approach outperforms other state-of-the-art methods by a large margin. The code is available at this http URL.

58. Effective multi-view registration of point sets based on student's t mixture model [PDF]
  Yanlin Ma, Jihua Zhu, Zhongyu Li, Zhiqiang Tian, Yaochen Li
Abstract: Recently, the Expectation-Maximization (EM) algorithm has been introduced as an effective means to solve the multi-view registration problem. Most of the previous methods assume that each data point is drawn from a Gaussian Mixture Model (GMM), which has difficulty coping with heavy-tailed noise or outliers. Accordingly, this paper proposes an effective registration method based on the Student's t Mixture Model (StMM). More specifically, we assume that each data point is drawn from one unique StMM, where its nearest neighbors (NNs) in other point sets are regarded as the t-distribution centroids with equal covariances, membership probabilities, and fixed degrees of freedom. Based on this assumption, the multi-view registration problem is formulated as the maximization of the likelihood function including all rigid transformations. Subsequently, the EM algorithm is utilized to optimize the rigid transformations as well as the single t-distribution covariance for multi-view registration. Since only a few model parameters need to be optimized, the proposed method is more likely to obtain the desired registration results. Besides, since all t-distribution centroids can be obtained by NN search, multi-view registration is achieved very efficiently. Moreover, the t-distribution takes heavy-tailed noise into consideration, which makes the proposed method inherently robust to noise and outliers. Experimental results on benchmark datasets illustrate its superior robustness and accuracy over state-of-the-art methods.
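For reference, the textbook EM updates for a single Student's t component show where the robustness comes from; the paper's multi-view specifics (e.g. shared covariances across centroids) may differ:

```latex
% E-step: latent scale weight for point x_i (d = data dimension, \nu = dof),
% with Mahalanobis distance \delta_i^2 = (x_i - \mu)^\top \Sigma^{-1} (x_i - \mu):
u_i = \frac{\nu + d}{\nu + \delta_i^2}
% Points with large residuals receive small u_i, giving heavy-tail robustness.
% M-step, with responsibilities r_i held fixed:
\mu \leftarrow \frac{\sum_i r_i\, u_i\, x_i}{\sum_i r_i\, u_i},
\qquad
\Sigma \leftarrow \frac{\sum_i r_i\, u_i\,(x_i - \mu)(x_i - \mu)^\top}{\sum_i r_i}
```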

59. Bi-Classifier Determinacy Maximization for Unsupervised Domain Adaptation [PDF]
  Shuang Li, Fangrui Lv, Binhui Xie, Chi Harold Liu, Jian Liang, Chen Qin
Abstract: Unsupervised domain adaptation addresses the problem of transferring knowledge from a well-labelled source domain to an unlabelled target domain. Recently, adversarial learning with a bi-classifier has proven effective in pulling cross-domain distributions closer. Prior approaches typically leverage the disagreement between the two classifiers to learn transferable representations; however, they often neglect classifier determinacy in the target domain, which can result in a lack of feature discriminability. In this paper, we present a simple yet effective method, namely Bi-Classifier Determinacy Maximization (BCDM), to tackle this problem. Motivated by the observation that target samples cannot always be separated distinctly by the decision boundary, in the proposed BCDM we design a novel classifier determinacy disparity (CDD) metric, which formulates classifier discrepancy as the class relevance of distinct target predictions and implicitly introduces a constraint on target feature discriminability. To this end, BCDM can generate discriminative representations by encouraging target predictive outputs to be consistent and determined, while preserving the diversity of predictions in an adversarial manner. Furthermore, the properties of CDD as well as the theoretical guarantees on BCDM's generalization bound are elaborated. Extensive experiments show that BCDM compares favorably against existing state-of-the-art domain adaptation methods.

60. Contrastive Learning for Label-Efficient Semantic Segmentation [PDF]
  Xiangyun Zhao, Raviteja Vemulapalli, Philip Mansfield, Boqing Gong, Bradley Green, Lior Shapira, Ying Wu
Abstract: Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise class label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy in both fully-supervised and semi-supervised settings using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with label-based contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited.
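The abstract gives the recipe but no code; below is a minimal sketch of a pixel-wise, class-label-based supervised contrastive loss consistent with that description (the interface and names are ours, and in practice pixel embeddings are subsampled per batch), after which the network would be fine-tuned with cross-entropy:

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, D) sampled pixel features; labels: (N,) pixel class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                          # (N, N) similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=z.device)
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, -1e9),
                                     dim=1, keepdim=True)
    # mean log-probability of same-class pairs for each anchor pixel
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()
```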

61. GeoNet++: Iterative Geometric Neural Network with Edge-Aware Refinement for Joint Depth and Surface Normal Estimation [PDF]
  Xiaojuan Qi, Zhengzhe Liu, Renjie Liao, Philip H.S. Torr, Raquel Urtasun, Jiaya Jia
Abstract: In this paper, we propose a geometric neural network with edge-aware refinement (GeoNet++) to jointly predict both depth and surface normal maps from a single image. Building on top of two-stream CNNs, GeoNet++ captures the geometric relationships between depth and surface normals with the proposed depth-to-normal and normal-to-depth modules. In particular, the "depth-to-normal" module exploits the least-squares solution for estimating surface normals from depth to improve their quality, while the "normal-to-depth" module refines the depth map based on the constraints from surface normals through kernel regression. Boundary information is exploited via an edge-aware refinement module. GeoNet++ effectively predicts depth and surface normals with strong 3D consistency and sharp boundaries, resulting in better reconstructed 3D scenes. Note that GeoNet++ is generic and can be used in other depth/normal prediction frameworks to improve the quality of 3D reconstruction and the pixel-wise accuracy of depth and surface normals. Furthermore, we propose a new 3D geometric metric (3DGM) for evaluating depth prediction in 3D. In contrast to current metrics that focus on evaluating pixel-wise error/accuracy, 3DGM measures whether the predicted depth can reconstruct high-quality 3D surface normals. This is a more natural metric for many 3D application domains. Our experiments on the NYUD-V2 and KITTI datasets verify that GeoNet++ produces fine boundary details, and the predicted depth can be used to reconstruct high-quality 3D surfaces. Code has been made publicly available.
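For readers unfamiliar with the least-squares depth-to-normal step, here is the classic geometric computation it builds on, outside any network (a simplified sketch; the paper embeds a differentiable variant inside GeoNet++):

```python
import numpy as np

def normal_from_depth_patch(points):
    """points: (k, 3) back-projected 3D points from a local depth window."""
    centered = points - points.mean(axis=0)
    # The least-squares plane normal is the singular vector associated with
    # the smallest singular value of the centered point matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n = vt[-1]
    return -n if n[2] > 0 else n   # orient towards the camera (assumed along -z)
```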

62. MVFNet: Multi-View Fusion Network for Efficient Video Recognition [PDF]
  Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding
Abstract: Spatiotemporal modeling networks and their complexity are the two most intensively studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to acquire both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed on top of 2D CNN backbones, and model complexity is carefully kept in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework, and it can specialize into existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance with the complexity of a 2D CNN.
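The MVF module itself is not spelled out in the abstract; the following is a rough sketch of the three-plane idea under our own assumptions (depthwise-separable 3D convolutions over the H-W, H-T, and W-T views of an (N, C, T, H, W) clip, fused by summation) rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # depthwise (separable) convolutions over the three planes
        self.conv_hw = nn.Conv3d(channels, channels, (1, 3, 3),
                                 padding=(0, 1, 1), groups=channels)  # H-W view
        self.conv_ht = nn.Conv3d(channels, channels, (3, 3, 1),
                                 padding=(1, 1, 0), groups=channels)  # H-T view
        self.conv_wt = nn.Conv3d(channels, channels, (3, 1, 3),
                                 padding=(1, 0, 1), groups=channels)  # W-T view

    def forward(self, x):            # x: (N, C, T, H, W)
        return x + self.conv_hw(x) + self.conv_ht(x) + self.conv_wt(x)
```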

63. Spontaneous Emotion Recognition from Facial Thermal Images [PDF]
  Chirag Kyal
Abstract: One of the key research areas in computer vision, addressed by a vast number of publications, is the processing and understanding of images containing human faces. The most often addressed tasks include face detection, facial landmark localization, face recognition, and facial expression analysis. Other, more specialized tasks such as affective computing, the extraction of vital signs from videos, or the analysis of social interaction usually require one or several of the aforementioned tasks to be performed. In our work, we show that a large number of facial image processing tasks in thermal infrared images that are currently solved using specialized rule-based methods, or not solved at all, can be addressed with modern learning-based approaches. We have used the USTC-NVIE database to train a number of machine learning algorithms for facial landmark localization.

64. Fully-Automated Liver Tumor Localization and Characterization from Multi-Phase MR Volumes Using Key-Slice ROI Parsing: A Physician-Inspired Approach [PDF]
  Bolin Lai, Xiaoyu Bai, Yuhsuan Wu, Xiao-Yun Zhou, Jinzheng Cai, Yuankai Huo, Lingyun Huang, Peng Wang, Yong Xia, Le Lu, Adam Harrison, Heping Hu, Jing Xiao
Abstract: Using radiological scans to identify liver tumors is crucial for proper patient treatment. This is highly challenging, as top radiologists only achieve F1 scores of roughly 80% (hepatocellular carcinoma (HCC) vs. others) with only moderate inter-rater agreement, even when using multi-phase magnetic resonance (MR) imagery. Thus, there is great impetus for computer-aided diagnosis (CAD) solutions. A critical challenge is to reliably parse a 3D MR volume to localize diagnosable regions of interest (ROI). In this paper, we break down this problem using a key-slice parser (KSP), which emulates physician workflows by first identifying key slices and then localizing their corresponding key ROIs. Because performance demands are so extreme (no key ROI can be missed), our KSP integrates complementary modules: top-down classification-plus-detection (CPD) and bottom-up localization-by-over-segmentation (LBOS). The CPD uses curve parsing and detection confidence to re-weight classifier confidences. The LBOS uses over-segmentation to flag CPD failure cases and provides its own ROIs. For scalability, LBOS is only weakly trained on pseudo-masks using a new distance-aware Tversky loss. We evaluate our approach on the largest multi-phase MR liver lesion test dataset to date (430 biopsy-confirmed patients). Experiments demonstrate that our KSP can localize diagnosable ROIs with high reliability (85% of patients have an average overlap of >= 40% with the ground truth). Moreover, we achieve an HCC vs. others F1 score of 0.804, providing a fully-automated CAD solution comparable with top human physicians.
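The distance-aware variant is not detailed in the abstract; below is the standard Tversky loss it extends, with an optional per-pixel weight map (e.g., derived from distance to the mask boundary) standing in for the distance-aware part as our assumption:

```python
import torch

def tversky_loss(pred, target, alpha=0.5, beta=0.5, weight=None, eps=1e-6):
    """pred, target: tensors of per-pixel probabilities / binary masks in [0, 1]."""
    if weight is None:
        weight = torch.ones_like(pred)
    tp = (weight * pred * target).sum()          # true positives
    fp = (weight * pred * (1 - target)).sum()    # false positives, weighted by alpha
    fn = (weight * (1 - pred) * target).sum()    # false negatives, weighted by beta
    return 1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)
```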

65. Using Computer Vision to Automate Hand Detection and Tracking of Surgeon Movements in Videos of Open Surgery [PDF]
  Michael Zhang, Xiaotian Cheng, Daniel Copeland, Arjun Desai, Melody Y. Guan, Gabriel A. Brat, Serena Yeung
Abstract: Open, or non-laparoscopic, surgery represents the vast majority of all operating room procedures, but few tools exist to objectively evaluate these techniques at scale. Current efforts involve human expert-based visual assessment. We leverage advances in computer vision to introduce an automated approach to video analysis of surgical execution. A state-of-the-art convolutional neural network architecture for object detection was used to detect operating hands in open surgery videos. Automated assessment was expanded by combining model predictions with a fast object tracker to enable surgeon-specific hand tracking. To train our model, we used publicly available videos of open surgery from YouTube and annotated these with spatial bounding boxes of operating hands. Our model's spatial detections of operating hands significantly outperform the detections achieved using pre-existing hand-detection datasets, and allow for insights into intra-operative movement patterns and economy of motion.

66. MiniVLM: A Smaller and Faster Vision-Language Model [PDF]
  Jianfeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu
Abstract: Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be finetuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by 95%, compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding 7M Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. The large models are used offline without adding any overhead in fine-tuning and inference. With the above design choices, our MiniVLM reduces the model size by 73% and the inference time cost by 94% while being able to retain 94-97% of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the state-of-the-art VL research for on-the-edge applications.

67. Human Pose Transfer by Adaptive Hierarchical Deformation [PDF]
  Jinsong Zhang, Xingzi Liu, Kun Li
Abstract: Human pose transfer, as a misaligned image generation task, is very challenging. Existing methods cannot effectively utilize the input information and often fail to preserve the style and shape of hair and clothes. In this paper, we propose an adaptive human pose transfer network with two hierarchical deformation levels. The first level generates human semantic parsing aligned with the target pose, and the second level generates the final textured person image in the target pose with semantic guidance. To avoid the drawback of vanilla convolution, which treats all pixels as valid information, we use gated convolution at both levels to dynamically select the important features and adaptively deform the image layer by layer. Our model has very few parameters and is fast to converge. Experimental results demonstrate that our model achieves better performance, with more consistent hair, face, and clothes, using fewer parameters than state-of-the-art methods. Furthermore, our method can be applied to clothing texture transfer.
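Gated convolution has a standard formulation: a feature branch modulated by a learned per-pixel sigmoid gate, so the network decides which spatial features are valid instead of treating all pixels equally. A minimal sketch (presumably close to what is used here, though the paper's exact layer is not given in the abstract):

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=padding)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=padding)

    def forward(self, x):
        # per-pixel soft gates dynamically select the important features
        return torch.sigmoid(self.gate(x)) * torch.relu(self.feature(x))
```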

68. Assessing The Importance Of Colours For CNNs In Object Recognition [PDF]
  Aditya Singh, Alessandro Bay, Andrea Mirabile
Abstract: Humans rely heavily on shape as a primary cue for object recognition. As secondary cues, colours and textures are also beneficial in this regard. Convolutional neural networks (CNNs), an imitation of biological neural networks, have been shown to exhibit conflicting properties. Some studies indicate that CNNs are biased towards textures, whereas another set of studies suggests a shape bias for classification tasks. However, they do not discuss the role of colours, implying a possibly minor role in the task of object recognition. In this paper, we empirically investigate the importance of colours in object recognition for CNNs. We demonstrate that CNNs often rely heavily on colour information while making a prediction. Our results show that the degree of dependency on colours tends to vary from one dataset to another. Moreover, networks tend to rely more on colours if trained from scratch. Pre-training can allow the model to be less colour dependent. To facilitate these findings, we follow the framework often deployed in understanding the role of colours in object recognition for humans. We evaluate a model trained on congruent images (images in original colours, e.g. red strawberries) against congruent, greyscale, and incongruent images (images in unnatural colours, e.g. blue strawberries). We measure and analyse the network's predictive performance (top-1 accuracy) under these different stylisations. We use standard datasets of supervised image classification and fine-grained image classification in our experiments.

69. PAIRS AutoGeo: an Automated Machine Learning Framework for Massive Geospatial Data [PDF]
  Wang Zhou, Levente J. Klein, Siyuan Lu
Abstract: An automated machine learning framework for geospatial data named PAIRS AutoGeo is introduced on the IBM PAIRS Geoscope big data and analytics platform. The framework simplifies the development of industrial machine learning solutions leveraging geospatial data, to the extent that user input is reduced to merely a text file containing labeled GPS coordinates. PAIRS AutoGeo automatically gathers the required data at the location coordinates, assembles the training data, performs quality checks, and trains multiple machine learning models for subsequent deployment. The framework is validated using a realistic industrial use case of tree species classification. Open-source tree species data are used as input to train a random forest classifier and a modified ResNet model for 10-way tree species classification based on aerial imagery, leading to accuracies of 59.8% and 81.4%, respectively. This use case exemplifies how PAIRS AutoGeo enables users to leverage machine learning without extensive geospatial expertise.

70. AMINN: Autoencoder-based Multiple Instance Neural Network for Outcome Prediction of Multifocal Liver Metastases [PDF]
  Jianan Chen, Helen M. C. Cheung, Laurent Milot, Anne L. Martel
Abstract: Colorectal cancer is one of the most common and lethal cancers, and colorectal cancer liver metastases (CRLM) are the major cause of death in patients with colorectal cancer. Multifocality occurs frequently in CRLM but is relatively unexplored in CRLM outcome prediction. Most existing clinical and imaging biomarkers do not take the imaging features of all multifocal lesions into account. In this paper, we present an end-to-end autoencoder-based multiple instance neural network (AMINN) for the prediction of survival outcomes in multifocal CRLM patients using radiomic features extracted from contrast-enhanced MRIs. Specifically, we jointly train an autoencoder to reconstruct input features and a multiple instance network to make predictions by aggregating information from all tumour lesions of a patient. In addition, we incorporate a two-step normalization technique to improve the training of deep neural networks, built on the observation that the distributions of radiomic features are almost always severely skewed. Experimental results empirically validate our hypothesis that incorporating imaging features of all lesions improves outcome prediction for multifocal cancer. The proposed AMINN framework achieved an area under the ROC curve (AUC) of 0.70, which is 19.5% higher than baseline methods. We built a risk score based on the outputs of our network and compared it to other clinical and imaging biomarkers. Our risk score is the only one that achieved statistical significance in univariate and multivariate Cox proportional hazard modeling in our cohort of multifocal CRLM patients. The effectiveness of incorporating all lesions and applying two-step normalization is demonstrated by a series of ablation studies. Our code will be released after the peer-review process.
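The exact two-step normalization is not specified in the abstract; a common instantiation for heavily skewed, non-negative radiomic features (our assumption, not necessarily the paper's) is a log transform followed by z-score standardization:

```python
import numpy as np

def two_step_normalize(features):
    """features: (n_samples, n_features) non-negative radiomic features."""
    x = np.log1p(features)                     # step 1: compress heavy tails
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    return (x - mu) / (sigma + 1e-8)           # step 2: standardise per feature
```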

71. Spectral Unmixing With Multinomial Mixture Kernel and Wasserstein Generative Adversarial Loss [PDF]
  Savas Ozkan, Gozde Bozdagi Akar
Abstract: This study proposes a novel framework for spectral unmixing using 1D convolution kernels and spectral uncertainty. High-level representations are computed from data and further modeled with a Multinomial Mixture Model to estimate fractions under severe spectral uncertainty. Furthermore, a new trainable uncertainty term based on a nonlinear neural network model is introduced in the reconstruction step. All uncertainty models are optimized with a Wasserstein Generative Adversarial Network (WGAN) to improve stability and capture uncertainty. Experiments are performed on both real and synthetic datasets. The results validate that the proposed method obtains state-of-the-art performance, especially on the real datasets, compared to the baselines. Project page at: this https URL.

72. LiveChess2FEN: a Framework for Classifying Chess Pieces based on CNNs [PDF]
  David Mallasén Quintana, Alberto Antonio del Barrio García, Manuel Prieto Matías
Abstract: Automatic digitization of chess games using computer vision is a significant technological challenge. This problem is of much interest for tournament organizers and amateur or professional players to broadcast their over-the-board (OTB) games online or analyze them using chess engines. Previous work has shown promising results, but the recognition accuracy and the latency of state-of-the-art techniques still need further enhancements to allow their practical and affordable deployment. We have investigated how to implement them on an Nvidia Jetson Nano single-board computer effectively. Our first contribution has been accelerating the chessboard's detection algorithm. Subsequently, we have analyzed different Convolutional Neural Networks for chess piece classification and how to map them efficiently on our embedded platform. Notably, we have implemented a functional framework that automatically digitizes a chess position from an image in less than 1 second, with 92% accuracy when classifying the pieces and 95% when detecting the board.

73. Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification [PDF]
  Can Zhang, Hong Liu, Wei Guo, Mang Ye
Abstract: RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons across heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions. Facing great challenges from complex variations, including conventional single-modality discrepancies and additional inter-modality discrepancies, most existing RGB-IR Re-ID methods impose constraints at the image level, the feature level, or a hybrid of both. Despite the better performance of hybrid constraints, they are usually implemented with heavy network architectures. As a matter of fact, previous efforts contribute mainly as pioneering works in the new cross-modal Re-ID area, leaving large room for improvement. This can be mainly attributed to: (1) the lack of abundant person image pairs from different modalities for training, and (2) the scarcity of salient modality-invariant features, especially in coarse representations, for effective matching. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, resulting in a unified representation containing rich and enhanced semantic features. Furthermore, a marginal exponential centre (MeCen) loss is introduced to jointly eliminate mixed variances from intra- and inter-modal examples. Cross-modality correlations can thus be efficiently explored on salient features for distinctive modality-invariant feature learning. Extensive experiments demonstrate that the proposed method outperforms the state of the art by a large margin.

74. High Order Local Directional Pattern Based Pyramidal Multi-structure for Robust Face Recognition [PDF]
  Almabrok Essa, Vijayan Asari
Abstract: Derived from a general definition of texture in a local neighborhood, the local directional pattern (LDP) encodes directional information in the small local 3x3 neighborhood of a pixel, which may fail to extract detailed information, especially when the input image changes due to illumination variations. Therefore, in this paper we introduce a novel feature extraction technique that calculates nth-order direction variation patterns, named the high-order local directional pattern (HOLDP). The proposed HOLDP can capture more detailed discriminative information than the conventional LDP. Unlike the LDP operator, our proposed technique extracts nth-order local information by encoding various distinctive spatial relationships from each neighborhood layer of a pixel in a pyramidal multi-structure way. We then concatenate the feature vector of each neighborhood layer to form the final HOLDP feature vector. The performance of the proposed HOLDP algorithm is evaluated on several publicly available face databases and demonstrates the superiority of HOLDP under extreme illumination conditions.
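As background for HOLDP, the base LDP code works roughly as follows: convolve the neighborhood with the eight Kirsch directional masks and set a bit for each of the k strongest responses (k = 3 is typical). A compact sketch, with the masks generated as rotations of the border weights (function names are ours):

```python
import numpy as np
from scipy.ndimage import convolve

def kirsch_masks():
    """The eight Kirsch masks as rotations of a 3x3 window's border weights."""
    base = [5, 5, 5, -3, -3, -3, -3, -3]                 # clockwise border weights
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    masks = []
    for r in range(8):
        m = np.zeros((3, 3), int)
        for (i, j), w in zip(ring, np.roll(base, r)):
            m[i, j] = w
        masks.append(m)
    return masks

def ldp_code(image, k=3):
    """Per-pixel LDP code: one bit per each of the k strongest directional responses."""
    mag = np.abs(np.stack([convolve(image.astype(float), m) for m in kirsch_masks()]))
    kth = np.sort(mag, axis=0)[-k]                       # k-th largest response per pixel
    bits = (mag >= kth).astype(int) << np.arange(8)[:, None, None]
    return bits.sum(axis=0)                              # integer code map in [0, 255]
```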

75. Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation [PDF]
  Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, Xiaoyun Yang
Abstract: Visual object tracking aims to precisely estimate the bounding box of a given target, which is a challenging problem due to factors such as deformation and occlusion. Many recent trackers adopt a multiple-stage tracking strategy to improve the quality of bounding box estimation. These methods first coarsely locate the target and then refine the initial prediction in the following stages. However, existing approaches still suffer from limited precision, and the coupling of different stages severely restricts the method's transferability. This work proposes a novel, flexible, and accurate refinement module called Alpha-Refine, which can significantly improve the base trackers' prediction quality. By exploring a series of design options, we conclude that the key to successful refinement is extracting and maintaining detailed spatial information as much as possible. Following this principle, Alpha-Refine adopts a pixel-wise correlation, a corner prediction head, and an auxiliary mask head as its core components. We apply Alpha-Refine to six famous base trackers to verify our method's effectiveness: DiMPsuper, DiMP50, ATOM, SiamRPN++, RT-MDNet, and ECO. Comprehensive experiments on the TrackingNet, LaSOT, GOT-10K, and VOT2020 benchmarks show that our approach significantly improves the base trackers' performance with little extra latency. Code and pretrained models are available at this https URL.
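Pixel-wise correlation has a simple form: every spatial position of the template feature map acts as a 1x1 kernel correlated against the search-region features, which preserves detailed spatial information rather than pooling it away. A sketch (the function name is ours):

```python
import torch

def pixelwise_correlation(template, search):
    """template: (C, Ht, Wt); search: (C, Hs, Ws) feature maps."""
    C, Ht, Wt = template.shape
    kernels = template.reshape(C, Ht * Wt)           # each column: one template pixel
    feats = search.reshape(C, -1)                    # (C, Hs * Ws)
    corr = kernels.t() @ feats                       # (Ht*Wt, Hs*Ws) similarities
    return corr.reshape(Ht * Wt, *search.shape[1:])  # one response map per template pixel
```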

76. Fine-grained Classification via Categorical Memory Networks [PDF]
  Weijian Deng, Joshua Marsh, Stephen Gould, Liang Zheng
Abstract: Motivated by the desire to exploit patterns shared across classes, we present a simple yet effective class-specific memory module for fine-grained feature learning. The memory module stores the prototypical feature representation for each category as a moving average. We hypothesize that the combination of similarities with respect to each category is itself a useful discriminative cue. To detect these similarities, we use attention as a querying mechanism. The attention scores with respect to each class prototype are used as weights to combine prototypes via weighted sum, producing a uniquely tailored response feature representation for a given input. The original and response features are combined to produce an augmented feature for classification. We integrate our class-specific memory module into a standard convolutional neural network, yielding a Categorical Memory Network. Our memory module significantly improves accuracy over baseline CNNs, achieving competitive accuracy with state-of-the-art methods on four benchmarks, including CUB-200-2011, Stanford Cars, FGVC Aircraft, and NABirds.
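A rough sketch of the described memory module under our own naming: per-class prototypes maintained as moving averages, queried by attention to produce the tailored response feature that augments the input (a reading of the abstract, not the authors' release):

```python
import torch
import torch.nn.functional as F

class CategoricalMemory(torch.nn.Module):
    def __init__(self, num_classes, dim, momentum=0.9):
        super().__init__()
        self.register_buffer("protos", torch.zeros(num_classes, dim))
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        # keep each class prototype as a moving average of its features
        for c in labels.unique():
            mean_c = feats[labels == c].mean(0)
            self.protos[c] = (self.momentum * self.protos[c]
                              + (1 - self.momentum) * mean_c)

    def forward(self, feats):                          # feats: (N, D)
        attn = F.softmax(feats @ self.protos.t(), 1)   # similarity to each class
        response = attn @ self.protos                  # weighted sum of prototypes
        return torch.cat([feats, response], dim=1)     # augmented feature
```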

77. DETR for Pedestrian Detection [PDF]
  Matthieu Lin, Chuming Li, Xingyuan Bu, Ming Sun, Chen Lin, Junjie Yan, Wanli Ouyang, Zhidong Deng
Abstract: Pedestrian detection in crowded scenes poses a challenging problem due to the heuristically defined mapping from anchors to pedestrians and the conflict between NMS and highly overlapped pedestrians. The recently proposed end-to-end detectors (EDs), DETR and deformable DETR, replace hand-designed components such as NMS and anchors with the transformer architecture, which gets rid of duplicate predictions by computing all pairwise interactions between queries. Inspired by these works, we explore their performance on crowded pedestrian detection. Surprisingly, compared to Faster-RCNN with FPN, the results are opposite to those obtained on COCO. Furthermore, the bipartite matching of EDs harms training efficiency due to the large number of ground-truth instances in crowded scenes. In this work, we identify the underlying motives driving EDs' poor performance and propose a new decoder to address them. Moreover, we design a mechanism to leverage the less occluded visible parts of pedestrians specifically for EDs, and achieve further improvements. A faster bipartite matching algorithm is also introduced to make ED training on crowd datasets more practical. The proposed detector PED (Pedestrian End-to-end Detector) outperforms both previous EDs and the baseline Faster-RCNN on CityPersons and CrowdHuman. It also achieves performance comparable to state-of-the-art pedestrian detection methods. Code will be released soon.
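For context, the bipartite matching that end-to-end detectors rely on is the Hungarian algorithm over a query-to-ground-truth cost matrix; with many pedestrians per image the matrix grows large, which is the training bottleneck noted above. A minimal example with a random cost (the real cost combines classification and box terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost):
    """cost: (num_queries, num_gt), e.g. -class_prob plus box L1/GIoU terms."""
    q_idx, gt_idx = linear_sum_assignment(cost)
    return list(zip(q_idx, gt_idx))

cost = np.random.rand(100, 30)   # 100 queries vs. 30 annotated pedestrians
pairs = match(cost)              # every ground truth matched to a unique query
```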

78. Uncalibrated Neural Inverse Rendering for Photometric Stereo of General Surfaces [PDF]
  Berk Kaya, Suryansh Kumar, Carlos Oliveira, Vittorio Ferrari, Luc Van Gool
Abstract: This paper presents an uncalibrated deep neural network framework for the photometric stereo problem. To train models for this problem, existing neural network-based methods require exact light directions, ground-truth surface normals of the object, or both. However, in practice it is challenging to procure either of these precisely, which restricts the broader adoption of photometric stereo algorithms in vision applications. To bypass this difficulty, we propose an uncalibrated neural inverse rendering approach to the problem. Our method first estimates the light directions from the input images and then optimizes an image reconstruction loss to calculate the surface normals, the bidirectional reflectance distribution function values, and the depth. Additionally, our formulation explicitly models the concave and convex parts of a complex surface to account for the effects of interreflections in the image formation process. Extensive evaluation of the proposed method on challenging subjects generally shows comparable or better results than supervised and classical approaches.

79. An Overview of Depth Cameras and Range Scanners Based on Time-of-Flight Technologies [PDF]
  Radu Horaud, Miles Hansard, Georgios Evangelidis, Clement Menier
Abstract: Time-of-flight (TOF) cameras are sensors that can measure the depths of scene-points, by illuminating the scene with a controlled laser or LED source, and then analyzing the reflected light. In this paper, we will first describe the underlying measurement principles of time-of-flight cameras, including: (i) pulsed-light cameras, which measure directly the time taken for a light pulse to travel from the device to the object and back again, and (ii) continuous-wave modulated-light cameras, which measure the phase difference between the emitted and received signals, and hence obtain the travel time indirectly. We review the main existing designs, including prototypes as well as commercially available devices. We also review the relevant camera calibration principles, and how they are applied to TOF devices. Finally, we discuss the benefits and challenges of combined TOF and color camera systems.
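For the continuous-wave case, the depth follows directly from the measured phase difference, d = c * Δφ / (4π f_mod), with an unambiguous range of c / (2 f_mod). A small worked example:

```python
import math

C = 299_792_458.0  # speed of light, m/s

def cw_tof_depth(phase_shift_rad, mod_freq_hz):
    """Depth from the phase difference of a continuous-wave modulated signal."""
    return C * phase_shift_rad / (4 * math.pi * mod_freq_hz)

# A 20 MHz modulation gives an unambiguous range of c / (2 f) ~ 7.5 m;
# a measured phase shift of pi then corresponds to roughly half of that.
print(cw_tof_depth(math.pi, 20e6))  # ~3.75 m
```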

80. Fusion of Range and Stereo Data for High-Resolution Scene-Modeling [PDF]
  Georgios D. Evangelidis, Miles Hansard, Radu Horaud
Abstract: This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps. In particular, we combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation. Unlike existing schemes that build on MRF optimizers, we infer the disparity map from a series of local energy minimization problems that are solved hierarchically, by growing sparse initial disparities obtained from the depth data. The accuracy of the method is not compromised, owing to three properties of the data-term in the energy function. Firstly, it incorporates a new correlation function that is capable of providing refined correlations and disparities, via subpixel correction. Secondly, the correlation scores rely on an adaptive cost aggregation step, based on the depth data. Thirdly, the stereo and depth likelihoods are adaptively fused, based on the scene texture and camera geometry. These properties lead to a more selective growing process which, unlike previous seed-growing methods, avoids the tendency to propagate incorrect disparities. The proposed method gives rise to an intrinsically efficient algorithm, which runs at 3FPS on 2.0MP images on a standard desktop computer. The strong performance of the new method is established both by quantitative comparisons with state-of-the-art methods, and by qualitative comparisons using real depth-stereo data-sets.

81. Anomaly detection through latent space restoration using vector-quantized variational autoencoders [PDF]
  Sergio Naval Marimont, Giacomo Tarroni
Abstract: We propose an out-of-distribution detection method that combines density and restoration-based approaches using Vector-Quantized Variational Auto-Encoders (VQ-VAEs). The VQ-VAE model learns to encode images in a categorical latent space. The prior distribution of latent codes is then modelled using an Auto-Regressive (AR) model. We found that the prior probability estimated by the AR model can be useful for unsupervised anomaly detection and enables the estimation of both sample and pixel-wise anomaly scores. The sample-wise score is defined as the negative log-likelihood of the latent variables above a threshold selecting highly unlikely codes. Additionally, out-of-distribution images are restored into in-distribution images by replacing unlikely latent codes with samples from the prior model and decoding to pixel space. The average L1 distance between generated restorations and original image is used as pixel-wise anomaly score. We tested our approach on the MOOD challenge datasets, and report higher accuracies compared to a standard reconstruction-based approach with VAEs.
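A sketch of the two scores as we read them (the interfaces below are assumptions): a sample-wise score summing the AR prior's negative log-likelihood over latent codes above a threshold, and a pixel-wise score averaging L1 distances to restorations decoded after resampling the unlikely codes:

```python
import torch

def sample_score(code_log_probs, threshold):
    """code_log_probs: (H, W) log p(z) under the AR prior for each latent code."""
    nll = -code_log_probs
    unlikely = nll > threshold             # keep only highly unlikely codes
    return nll[unlikely].sum()

def pixel_score(image, restorations):
    """image: (C, H, W); restorations: (K, C, H, W) decoded in-distribution images."""
    return (restorations - image.unsqueeze(0)).abs().mean(dim=(0, 1))  # (H, W) map
```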

82. Periocular in the Wild Embedding Learning with Cross-Modal Consistent Knowledge Distillation [PDF]
  Yoon Gyo Jung, Jaewoo Park, Cheng Yaw Low, Leslie Ching Ow Tiong, Andrew Beng Jin Teoh
Abstract: Periocular biometrics, based on the peripheral area of the eye, are a collaborative alternative to the face, especially when the face is occluded or masked. In practice, the periocular region alone captures the least salient facial features, and thus suffers from poor intra-class compactness and inter-class dispersion, particularly in the wild. To address these problems, we transfer useful information from the face to support the periocular modality by means of knowledge distillation (KD) for embedding learning. However, applying typical KD techniques directly to heterogeneous modalities is suboptimal. In this paper, we put forward a deep face-to-periocular distillation network, coined cross-modal consistent knowledge distillation (CM-CKD). The three key ingredients of CM-CKD are (1) shared-weight networks, (2) consistent batch normalization, and (3) bidirectional consistency distillation for face and periocular through an effective CKD loss. To be more specific, we leverage the face modality for periocular embedding learning, but only periocular images are targeted for identification or verification tasks. Extensive experiments on six constrained and unconstrained periocular datasets show that CM-CKD-learned periocular embeddings improve identification and verification performance by 50% in terms of relative performance gain computed against face and periocular baselines. The experiments also reveal that the CM-CKD-learned periocular features enjoy better subject-wise cluster separation, thereby improving overall accuracy.

83. Computer Vision and Normalizing Flow Based Defect Detection [PDF]
  Zijian Kuang, Xinran Tie
Abstract: Surface defect detection is essential for controlling product quality during manufacturing. The challenges in this complex task include: 1) collecting defective samples and manually labeling them for training is time-consuming; 2) the characteristics of defects are difficult to define, as new types of defects can appear at any time; and 3) real-world product images contain substantial background noise. In this paper, we present a two-stage defect detection network based on the object detection model YOLO and the normalizing-flow-based defect detection model DifferNet. Our model shows high robustness and performance on defect detection using real-world video clips taken from a production line monitoring system. The normalizing-flow-based anomaly detection model only requires a small number of good samples for training and then performs defect detection on the product images detected by YOLO. Our model employs two novel strategies: 1) a two-stage network using YOLO and a normalizing-flow-based model to perform product defect detection, and 2) multi-scale image transformations to address the issue that product images cropped by YOLO contain considerable background noise. Besides, extensive experiments are conducted on a new dataset collected from a real-world factory production line. We demonstrate that our proposed model can learn from a small number of defect-free samples of single or multiple product types. The dataset will also be made public to encourage further studies and research in surface defect detection.

84. Multimodal In-bed Pose and Shape Estimation under the Blankets [PDF]
  Yu Yin, Joseph P. Robinson, Yun Fu
Abstract: Humans spend vast amounts of time in bed -- about one-third of a lifetime on average. Moreover, humans at rest are central to many healthcare applications. Typically, humans are covered by a blanket when resting, so we propose a multimodal approach to uncover the subjects such that their bodies at rest can be viewed without the occlusion of the blankets above. We propose a pyramid scheme to effectively fuse the different modalities in a way that best leverages the knowledge captured by the multimodal sensors. Specifically, the two most informative modalities (i.e., depth and infrared images) are first fused to generate a good initial pose and shape estimate. Then, the pressure map and RGB images are further fused, one by one, to refine the result by providing occlusion-invariant information for the covered part and accurate shape information for the uncovered part, respectively. However, even with multimodal data, the task of detecting human bodies at rest is still very challenging due to the extreme occlusion of bodies. To further reduce the negative effects of occlusion from blankets, we employ an attention-based reconstruction module to generate the uncovered modalities, which are further fused to update the current estimate in a cyclic fashion. Extensive experiments validate the superiority of the proposed model over others.

85. PoP-Net: Pose over Parts Network for Multi-Person 3D Pose Estimation from a Depth Image [PDF]
  Yuliang Guo, Zhong Li, Zekun Li, Xiangyu Du, Shuxue Quan, Yi Xu
Abstract: In this paper, a real-time method called PoP-Net is proposed to predict multi-person 3D poses from a depth image. PoP-Net learns to predict bottom-up part detection maps and top-down global poses in a single-shot framework. A simple and effective fusion process is applied to fuse the global poses and part detection. Specifically, a new part-level representation, called Truncated Part Displacement Field (TPDF), is introduced. It drags low-precision global poses towards more accurate part locations while maintaining the advantage of global poses in handling severe occlusion and truncation cases. A mode selection scheme is developed to automatically resolve the conflict between global poses and local detection. Finally, due to the lack of high-quality depth datasets for developing and evaluating multi-person 3D pose estimation methods, a comprehensive depth dataset with 3D pose labels is released. The dataset is designed to enable effective multi-person and background data augmentation such that the developed models are more generalizable towards uncontrolled real-world multi-person scenarios. We show that PoP-Net has significant advantages in efficiency for multi-person processing and achieves the state-of-the-art results both on the released challenging dataset and on the widely used ITOP dataset.

86. Mask Guided Matting via Progressive Refinement Network [PDF]
  Qihang Yu, Jianming Zhang, He Zhang, Yilin Wang, Zhe Lin, Ning Xu, Yutong Bai, Alan Yuille
Abstract: We propose Mask Guided (MG) Matting, a robust matting framework that takes a general coarse mask as guidance. MG Matting leverages a Progressive Refinement Network (PRN) design which encourages the matting model to provide self-guidance to progressively refine the uncertain regions through the decoding process. A series of guidance mask perturbation operations are also introduced in training to further enhance its robustness to external guidance. We show that PRN can generalize to unseen types of guidance masks such as trimaps and low-quality alpha mattes, making it suitable for various application pipelines. In addition, we revisit the foreground color prediction problem for matting and propose a surprisingly simple improvement to address the dataset issue. Evaluation on real and synthetic benchmarks shows that MG Matting achieves state-of-the-art performance using various types of guidance inputs. Code and models will be available at this https URL

87. Teacher-Student Asynchronous Learning with Multi-Source Consistency for Facial Landmark Detection [PDF]
  Rongye Meng, Sanping Zhou, Xingyu Wan, Mengliu Li, Jinjun Wang
Abstract: Due to the high annotation cost of large-scale facial landmark detection tasks in videos, researchers have proposed semi-supervised paradigms that use self-training to mine high-quality pseudo-labels for training. However, self-training based methods often train with a gradually increasing number of samples, and their performance varies a lot depending on the number of pseudo-labeled samples added. In this paper, we propose a teacher-student asynchronous learning (TSAL) framework based on a multi-source supervision signal consistency criterion, which implicitly mines pseudo-labels through consistency constraints. Specifically, the TSAL framework contains two models with exactly the same structure. The radical student uses multi-source supervision signals from the same task to update its parameters, while the calm teacher uses a single-source supervision signal. In order to reasonably absorb the student's suggestions, the teacher's parameters are updated through recursive average filtering. The experimental results prove that the asynchronous-learning framework can effectively filter noise in multi-source supervision signals, thereby mining pseudo-labels that are more significant for network parameter updating. Extensive experiments on the 300W, AFLW, and 300VW benchmarks show that the TSAL framework achieves state-of-the-art performance.
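"Recursive average filtering" of the teacher's parameters reads like an exponential moving average over the student's weights; a standard sketch under that assumption:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, keep=0.99):
    # teacher <- keep * teacher + (1 - keep) * student, parameter by parameter
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(keep).add_(s_param, alpha=1 - keep)
```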

88. D^2IM-Net: Learning Detail Disentangled Implicit Fields from Single Images [PDF]
  Manyi Li, Hao Zhang
Abstract: We present the first single-view 3D reconstruction network aimed at recovering geometric details from an input image which encompass both topological shape structures and surface features. Our key idea is to train the network to learn a detail disentangled reconstruction consisting of two functions, one implicit field representing the coarse 3D shape and the other capturing the details. Given an input image, our network, coined D^2IM-Net, encodes it into global and local features which are respectively fed into two decoders. The base decoder uses the global features to reconstruct a coarse implicit field, while the detail decoder reconstructs, from the local features, two displacement maps, defined over the front and back sides of the captured object. The final 3D reconstruction is a fusion between the base shape and the displacement maps, with three losses enforcing the recovery of coarse shape, overall structure, and surface details via a novel Laplacian term.

89. Street-view Panoramic Video Synthesis from a Single Satellite Image [PDF]
  Zuoyue Li, Zhaopeng Cui, Martin R. Oswald
Abstract: We present a novel method for synthesizing both temporally and geometrically consistent street-view panoramic video from a given single satellite image and camera trajectory. Existing cross-view synthesis approaches focus more on images, while video synthesis in this setting has not yet received enough attention. Single-image synthesis approaches are not well suited for video synthesis since they lack temporal consistency, which is a crucial property of videos. To this end, our approach explicitly creates a 3D point cloud representation of the scene and maintains dense 3D-2D correspondences across frames that reflect the geometric scene configuration inferred from the satellite view. We implement a cascaded network architecture with two hourglass modules for successive coarse and fine generation, colorizing the point cloud from the semantics and per-class latent vectors. By leveraging computed correspondences, the produced street-view video frames adhere to the 3D geometric scene structure and maintain temporal consistency. Qualitative and quantitative experiments demonstrate superior results compared to other state-of-the-art cross-view synthesis approaches that lack either temporal or geometric consistency. To the best of our knowledge, our work is the first to synthesize video from cross-view images.

90. Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes [PDF]
  Niklas Muennighoff
Abstract: This work presents Vilio, an implementation of state-of-the-art visio-linguistic models and their application to the Hateful Memes Dataset. The implemented models have been fitted into a uniform code-base and altered to yield better performance. The goal of Vilio is to provide a user-friendly starting point for any visio-linguistic problem. An ensemble of 5 different V+L models implemented in Vilio achieves 2nd place in the Hateful Memes Challenge out of 3,300 participants. The code is available at this https URL.

91. Improving Adversarial Robustness via Probabilistically Compact Loss with Logit Constraints [PDF]
  Xin Li, Xiangrui Li, Deng Pan, Dongxiao Zhu
Abstract: Convolutional neural networks (CNNs) have achieved state-of-the-art performance on various tasks in computer vision. However, recent studies demonstrate that these models are vulnerable to carefully crafted adversarial samples and suffer from a significant performance drop when predicting them. Many methods have been proposed to improve adversarial robustness (e.g., adversarial training and new loss functions to learn adversarially robust feature representations). Here we offer a unique insight into the predictive behavior of CNNs: they tend to misclassify adversarial samples into the most probable false classes. This inspires us to propose a new Probabilistically Compact (PC) loss with logit constraints, which can be used as a drop-in replacement for the cross-entropy (CE) loss to improve CNNs' adversarial robustness. Specifically, the PC loss enlarges the probability gaps between the true class and the false classes, while the logit constraints prevent the gaps from being melted by a small perturbation. We extensively compare our method with the state-of-the-art using large-scale datasets under both white-box and black-box attacks to demonstrate its effectiveness. The source code is available from the following url: this https URL.
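
Read literally, "enlarging the probability gaps" suggests a margin penalty on every false class whose probability approaches that of the true class. The sketch below follows that reading; the margin `xi` and the exact hinge form are assumptions for illustration, not the paper's verified formulation.

```python
import numpy as np

def pc_loss(probs, labels, xi=0.1):
    # probs: (n, C) softmax outputs; labels: (n,) integer class ids.
    n = probs.shape[0]
    true_p = probs[np.arange(n), labels][:, None]  # true-class probability
    gaps = probs + xi - true_p                     # margin violations
    gaps[np.arange(n), labels] = 0.0               # ignore the true class
    return np.maximum(gaps, 0.0).sum(axis=1).mean()
```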

92. Biomechanical modelling of brain atrophy through deep learning [PDF]
  Mariana da Silva, Kara Garcia, Carole H. Sudre, Cher Bass, M. Jorge Cardoso, Emma Robinson
Abstract: We present a proof-of-concept, deep learning (DL) based, differentiable biomechanical model of realistic brain deformations. Using prescribed maps of local atrophy and growth as input, the network learns to deform images according to a Neo-Hookean model of tissue deformation. The tool is validated using longitudinal brain atrophy data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and we demonstrate that the trained model is capable of rapidly simulating new brain deformations with minimal residuals. This method has the potential to be used in data augmentation or for the exploration of different causal hypotheses reflecting brain growth and atrophy.

93. Movie Summarization via Sparse Graph Construction [PDF]
  Pinelopi Papalampidi, Frank Keller, Mirella Lapata
Abstract: We summarize full-length movies by creating shorter videos containing their most informative scenes. We explore the hypothesis that a summary can be created by assembling scenes which are turning points (TPs), i.e., key events in a movie that describe its storyline. We propose a model that identifies TP scenes by building a sparse movie graph that represents relations between scenes and is constructed using multimodal information. According to human judges, the summaries created by our approach are more informative and complete, and receive higher ratings, than the outputs of sequence-based models and general-purpose summarization algorithms. The induced graphs are interpretable, displaying different topology for different movie genres.

94. Phase Retrieval with Holography and Untrained Priors: Tackling the Challenges of Low-Photon Nanoscale Imaging [PDF]
  Hannah Lawrence, David Bramherzig, Henry Li, Michael Eickenberg, Marylou Gabrié
Abstract: Phase retrieval is the inverse problem of recovering a signal from magnitude-only Fourier measurements, and underlies numerous imaging modalities, such as Coherent Diffraction Imaging (CDI). A variant of this setup, known as holography, includes a reference object that is placed adjacent to the specimen of interest before measurements are collected. The resulting inverse problem, known as holographic phase retrieval, is well-known to have improved problem conditioning relative to the original. This innovation, i.e. Holographic CDI, becomes crucial at the nanoscale, where imaging specimens such as viruses, proteins, and crystals require low-photon measurements. This data is highly corrupted by Poisson shot noise, and often lacks low-frequency content as well. In this work, we introduce a dataset-free deep learning framework for holographic phase retrieval adapted to these challenges. The key ingredients of our approach are the explicit and flexible incorporation of the physical forward model into an automatic differentiation procedure, the Poisson log-likelihood objective function, and an optional untrained deep image prior. We perform extensive evaluation under realistic conditions. Compared to competing classical methods, our method recovers signal from higher noise levels and is more resilient to suboptimal reference design, as well as to large missing regions of low frequencies in the observations. To the best of our knowledge, this is the first work to consider a dataset-free machine learning approach for holographic phase retrieval.
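
For photon counts $y_i$ and a Fourier-domain forward operator $A$, the Poisson log-likelihood objective mentioned above takes, up to constants, the standard form below; this is the generic Poisson negative log-likelihood, not a formula quoted from the paper.

```latex
\mathcal{L}(x) \;=\; \sum_i \Big[\, |(Ax)_i|^2 \;-\; y_i \log |(Ax)_i|^2 \,\Big]
```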

95. Learning Hybrid Representations for Automatic 3D Vessel Centerline Extraction [PDF]
  Jiafa He, Chengwei Pan, Can Yang, Ming Zhang, Yang Wang, Xiaowei Zhou, Yizhou Yu
Abstract: Automatic blood vessel extraction from 3D medical images is crucial for vascular disease diagnosis. Existing methods based on convolutional neural networks (CNNs) may suffer from discontinuities in the extracted vessels when segmenting such thin tubular structures from 3D images. We argue that preserving the continuity of extracted vessels requires taking into account the global geometry. However, 3D convolutions are computationally inefficient, which prevents 3D CNNs from having sufficiently large receptive fields to capture the global cues in the entire image. In this work, we propose a hybrid representation learning approach to address this challenge. The main idea is to use CNNs to learn local appearances of vessels in image crops while using another point-cloud network to learn the global geometry of vessels in the entire image. At inference time, the proposed approach extracts local segments of vessels using CNNs, classifies each segment based on global geometry using the point-cloud network, and finally connects all the segments that belong to the same vessel using a shortest-path algorithm. This combination results in an efficient, fully automatic and template-free approach to centerline extraction from 3D images. We validate the proposed approach on CTA datasets and demonstrate its superior performance compared to both traditional and CNN-based baselines.
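
The final connection step, linking segments of the same vessel via a shortest-path algorithm, is standard graph routing. A toy sketch with networkx follows; the graph construction, the `max_gap` threshold, and the distance-based edge weights are illustrative assumptions, not the paper's pipeline.

```python
import itertools
import networkx as nx

def segment_graph(endpoints, max_gap=5.0):
    # endpoints: {node_id: (x, y, z)} for candidate segment endpoints.
    # Nearby endpoints are joined with Euclidean-distance weights, so
    # nx.shortest_path(g, s, t, weight="weight") orders a centerline.
    g = nx.Graph()
    for (u, pu), (v, pv) in itertools.combinations(endpoints.items(), 2):
        d = sum((a - b) ** 2 for a, b in zip(pu, pv)) ** 0.5
        if d <= max_gap:
            g.add_edge(u, v, weight=d)
    return g
```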

96. IPN-V2 and OCTA-500: Methodology and Dataset for Retinal Image Segmentation [PDF]
  Mingchao Li, Yuhan Zhang, Zexuan Ji, Keren Xie, Songtao Yuan, Qinghuai Liu, Qiang Chen
Abstract: Optical coherence tomography angiography (OCTA) is a novel imaging modality that provides micron-level resolution for presenting the three-dimensional structure of the retinal vasculature. In our previous work, a 3D-to-2D image projection network (IPN) was proposed for retinal vessel (RV) and foveal avascular zone (FAZ) segmentation in OCTA images. One of its advantages is that the segmentation results are obtained directly from the original volumes without using any projection images or retinal layer segmentation. In this work, we propose image projection network V2 (IPN-V2), extending IPN by adding a plane perceptron to enhance the perceptron ability in the horizontal direction. We also propose IPN-V2+, as a supplement to IPN-V2, by introducing a global retraining process to overcome the "checkerboard effect". Besides, we propose a new multi-modality dataset, dubbed OCTA-500. It contains 500 subjects with two field-of-view (FOV) types, including OCT and OCTA volumes, six types of projections, four types of text labels and two types of pixel-level labels. The dataset contains more than 360K images with a size of about 80GB. To the best of our knowledge, it is currently the largest OCTA dataset and offers abundant information. Finally, we perform a thorough evaluation of the performance of IPN-V2 on the OCTA-500 dataset. The experimental results demonstrate that our proposed IPN-V2 performs better than IPN and other deep learning methods in RV segmentation and FAZ segmentation.

97. Accurate Cell Segmentation in Digital Pathology Images via Attention Enforced Networks [PDF]
  Muyi Sun, Zeyi Yao, Guanhong Zhang
Abstract: Automatic cell segmentation is an essential step in the pipeline of computer-aided diagnosis (CAD), such as the detection and grading of breast cancer. Accurate segmentation of cells can not only assist pathologists in making a more precise diagnosis, but also save much time and labor. However, this task suffers from stain variation, inhomogeneous cell intensities, background clutter and cells from different tissues. To address these issues, we propose an Attention Enforced Network (AENet), built on a spatial attention module and a channel attention module, to integrate local features with global dependencies and to weight effective channels adaptively. Besides, we introduce a feature fusion branch to bridge high-level and low-level features. Finally, the marker-controlled watershed algorithm is applied to post-process the predicted segmentation maps in order to reduce fragmented regions. In the test stage, we present an individual color normalization method to deal with the stain variation problem. We evaluate this model on the MoNuSeg dataset. The quantitative comparisons against several prior methods demonstrate the superiority of our approach.

98. Multi-Domain Multi-Task Rehearsal for Lifelong Learning [PDF]
  Fan Lyu, Shuai Wang, Wei Feng, Zihan Ye, Fuyuan Hu, Song Wang
Abstract: Rehearsal, which seeks to remind the model by storing old knowledge during lifelong learning, is one of the most effective ways to mitigate catastrophic forgetting, i.e., the biased forgetting of previous knowledge when moving to new tasks. However, in most previous rehearsal-based methods, the old tasks suffer from unpredictable domain shift when the new task is trained. This is because these methods always ignore two significant factors. First, the data imbalance between the new task and the old tasks makes the domain of the old tasks prone to shift. Second, the task isolation among all tasks drives the domain shift in unpredictable directions. To address the unpredictable domain shift, in this paper we propose Multi-Domain Multi-Task (MDMT) rehearsal, which trains the old tasks and the new task in parallel and on an equal footing to break the isolation among tasks. Specifically, a two-level angular margin loss is proposed to encourage intra-class/task compactness and inter-class/task discrepancy, which keeps the model from domain chaos. In addition, to further address the domain shift of the old tasks, we propose an optional episodic distillation loss on the memory to anchor the knowledge of each old task. Experiments on benchmark datasets validate that the proposed approach can effectively mitigate the unpredictable domain shift.
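
For reference, a single-level angular margin term of the widely used ArcFace form is shown below, where $\theta_j$ is the angle between a feature and the class-$j$ weight vector, $m$ is the margin, and $s$ is a scale; the paper stacks such margins at two levels (class and task), and its exact formulation may differ.

```latex
\mathcal{L}_{\mathrm{am}} \;=\; -\log
\frac{e^{\,s\cos(\theta_{y}+m)}}
     {e^{\,s\cos(\theta_{y}+m)} + \sum_{j\neq y} e^{\,s\cos\theta_{j}}}
```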

99. D-LEMA: Deep Learning Ensembles from Multiple Annotations -- Application to Skin Lesion Segmentation [PDF]
  Zahra Mirikharaji, Kumar Abhishek, Saeed Izadi, Ghassan Hamarneh
Abstract: Medical image segmentation annotations suffer from inter/intra-observer variations even among experts, due to intrinsic differences among human annotators and ambiguous boundaries. Leveraging a collection of annotators' opinions for an image is an interesting way of estimating a gold standard. Although training deep models in a supervised setting with a single annotation per image has been extensively studied, generalizing their training to work with data sets containing multiple annotations per image remains a fairly unexplored problem. In this paper, we propose an approach to handle annotators' disagreements when training a deep model. To this end, we propose an ensemble of Bayesian fully convolutional networks (FCNs) for the segmentation task, considering two major factors in the aggregation of multiple ground truth annotations: (1) handling contradictory annotations in the training data originating from inter-annotator disagreements and (2) improving confidence calibration through the fusion of base models' predictions. We demonstrate the superior performance of our approach on the ISIC Archive and explore the generalization performance of our proposed method by cross-data-set evaluation on the PH2 and DermoFit data sets.

100. Pseudo Shots: Few-Shot Learning with Auxiliary Data [PDF]
  Reza Esfandiarpoor, Mohsen Hajabdollahi, Stephen H. Bach
Abstract: In many practical few-shot learning problems, even though labeled examples are scarce, there are abundant auxiliary data sets that potentially contain useful information. We propose a framework to address the challenges of efficiently selecting and effectively using auxiliary data in image classification. Given an auxiliary dataset and a notion of semantic similarity among classes, we automatically select pseudo shots, which are labeled examples from other classes related to the target task. We show that naively assuming that these additional examples come from the same distribution as the target task examples does not significantly improve accuracy. Instead, we propose a masking module that adjusts the features of auxiliary data to be more similar to those of the target classes. We show that this masking module can improve accuracy by up to 18 accuracy points, particularly when the auxiliary data is semantically distant from the target task. We also show that incorporating pseudo shots improves over the current state-of-the-art few-shot image classification scores by an average of 4.81 percentage points of accuracy on 1-shot tasks and an average of 0.31 percentage points on 5-shot tasks.
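
As described, the masking module gates auxiliary-data features so they better resemble the target classes. One plausible gating design is sketched below; the single-layer sigmoid gate and the class name are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class FeatureMask(nn.Module):
    # Learns per-dimension gates in [0, 1] that rescale auxiliary
    # ("pseudo shot") features toward the target-task feature space.
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, aux_features):
        return aux_features * self.gate(aux_features)

# Usage: FeatureMask(dim=512)(aux_batch) for an (N, 512) feature batch.
```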

101. Learning Contextual Causality from Time-consecutive Images [PDF]
  Hongming Zhang, Yintong Huo, Xinran Zhao, Yangqiu Song, Dan Roth
Abstract: Causality knowledge is crucial for many artificial intelligence systems. Conventional text-based causality knowledge acquisition methods typically require laborious and expensive human annotations. As a result, their scale is often limited. Moreover, as no context is provided during the annotation, the resulting causality knowledge records (e.g., ConceptNet) typically do not take the context into consideration. To explore a more scalable way of acquiring causality knowledge, in this paper, we jump out of the textual domain and investigate the possibility of learning contextual causality from the visual signal. Compared with pure text-based approaches, learning causality from the visual signal has the following advantages: (1) Causality knowledge belongs to commonsense knowledge, which is rarely expressed in text but rich in videos; (2) Most events in the video are naturally time-ordered, which provides a rich resource for us to mine causality knowledge from; (3) All the objects in the video can be used as context to study the contextual property of causal relations. In detail, we first propose a high-quality dataset, Vis-Causal, and then conduct experiments to demonstrate that with good language and visual representation models as well as enough training signals, it is possible to automatically discover meaningful causal knowledge from the videos. Further analysis also shows that the contextual property of causal relations indeed exists, and taking it into consideration might be crucial if we want to use the causality knowledge in real applications; the visual signal could serve as a good resource for learning such contextual causality.

102. Robust Segmentation of Optic Disc and Cup from Fundus Images Using Deep Neural Networks [PDF]
  Aniketh Manjunath, Subramanya Jois, Chandra Sekhar Seelamantula
Abstract: The optic disc (OD) and optic cup (OC) are regions of prominent clinical interest in a retinal fundus image. They are the primary indicators of a glaucomatous condition. With the advent and success of deep learning for healthcare research, several approaches have been proposed for the segmentation of important features in retinal fundus images. We propose a novel approach for the simultaneous segmentation of the OD and OC using a residual encoder-decoder network (REDNet) based regional convolutional neural network (RED-RCNN). The RED-RCNN is motivated by the Mask RCNN (MRCNN). Performance comparisons with state-of-the-art techniques and extensive validation on standard publicly available fundus image datasets show that RED-RCNN has superior performance compared with MRCNN. RED-RCNN achieves Sensitivity, Specificity, Accuracy, Precision, Dice and Jaccard indices of 95.64%, 99.9%, 99.82%, 95.68%, 95.64%, 91.65%, respectively, for OD segmentation, and 91.44%, 99.87%, 99.83%, 85.67%, 87.48%, 78.09%, respectively, for OC segmentation. Further, we perform two-stage glaucoma severity grading using the cup-to-disc ratio (CDR) computed from the obtained OD/OC segmentation. The superior segmentation performance of RED-RCNN over MRCNN translates to higher accuracy in glaucoma severity grading.
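
The grading step hinges on the cup-to-disc ratio computed from the OD/OC masks. A small sketch of one common CDR definition (ratio of vertical extents of the binary masks) follows; the paper may use a diameter or area variant.

```python
def cup_to_disc_ratio(cup_mask, disc_mask):
    # Masks are 2D arrays (lists of rows) of 0/1 values.
    def vertical_extent(mask):
        rows = [r for r, row in enumerate(mask) if any(row)]
        return (max(rows) - min(rows) + 1) if rows else 0
    return vertical_extent(cup_mask) / max(vertical_extent(disc_mask), 1)
```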

103. Leaking Sensitive Financial Accounting Data in Plain Sight using Deep Autoencoder Neural Networks [PDF]
  Marco Schreyer, Christian Schulze, Damian Borth
Abstract: Nowadays, organizations collect vast quantities of sensitive information in `Enterprise Resource Planning' (ERP) systems, such as accounting-relevant transactions, customer master data, or strategic sales price information. The leakage of such information poses a severe threat for companies, as the number of incidents and the reputational damage to those experiencing them continue to increase. At the same time, discoveries in deep learning research revealed that machine learning models could be maliciously misused to create new attack vectors. Understanding the nature of such attacks becomes increasingly important for the (internal) audit and fraud examination practice. Such awareness is needed in particular for fraudulent data leakage using deep learning-based steganographic techniques that might remain undetected by state-of-the-art `Computer Assisted Audit Techniques' (CAATs). In this work, we introduce a real-world `threat model' designed to leak sensitive accounting data. In addition, we show that a deep steganographic process, constituted by three neural networks, can be trained to hide such data in unobtrusive `day-to-day' images. Finally, we provide qualitative and quantitative evaluations on two publicly available real-world payment datasets.

104. CHS-Net: A Deep learning approach for hierarchical segmentation of COVID-19 infected CT images [PDF]
  Narinder Singh Punn, Sonali Agarwal
Abstract: The pandemic of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), also known as COVID-19, has been spreading worldwide, causing rampant loss of life. Medical imaging such as computed tomography (CT), X-ray, etc., plays a significant role in diagnosing patients by presenting a visual representation of the functioning of the organs. However, for any radiologist, analyzing such scans is a tedious and time-consuming task. Emerging deep learning technologies have displayed their strength in analyzing such scans to aid in the faster diagnosis of diseases and viruses such as COVID-19. In the present article, an automated deep learning based model, the COVID-19 hierarchical segmentation network (CHS-Net), is proposed that functions as a semantic hierarchical segmenter to identify COVID-19 infected regions from the lung contour in CT medical imaging. The CHS-Net is developed with two cascaded residual attention inception U-Net (RAIU-Net) models, where the first generates lung contour maps and the second generates COVID-19 infected regions. RAIU-Net comprises a residual inception U-Net model with a spectral spatial and depth attention network (SSD), consisting of contraction and expansion phases of depthwise separable convolutions and hybrid pooling (max and spectral pooling) to efficiently encode and decode semantic and varying-resolution information. The CHS-Net is trained with a segmentation loss function that is the weighted average of binary cross entropy loss and dice loss, to penalize false negative and false positive predictions. The approach is compared with recently proposed research works on the basis of standard metrics; the proposed approach outperforms the recently proposed approaches and effectively segments the COVID-19 infected regions in the lungs.
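
The described segmentation loss, a weighted average of binary cross-entropy and dice loss, is easy to state concretely. A PyTorch sketch follows; the weight `w`, the epsilon smoothing, and the function name are our assumptions, since the abstract gives no values.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(pred, target, w=0.5, eps=1e-6):
    # pred: predicted probabilities in [0, 1]; target: binary mask.
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    return w * bce + (1.0 - w) * dice  # weighted average of the two terms
```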

105. LEARN++: Recurrent Dual-Domain Reconstruction Network for Compressed Sensing CT [PDF]
  Yi Zhang, Hu Chen, Wenjun Xia, Yang Chen, Baodong Liu, Yan Liu, Huaiqiang Sun, Jiliu Zhou
Abstract: Compressed sensing (CS) computed tomography has been proven to be important for several clinical applications, such as sparse-view computed tomography (CT), digital tomosynthesis and interior tomography. Traditional compressed sensing focuses on the design of handcrafted prior regularizers, which are usually image-dependent and time-consuming. Inspired by recently proposed deep learning-based CT reconstruction models, we extend the state-of-the-art LEARN model to a dual-domain version, dubbed LEARN++. Different from existing iteration unrolling methods, which only involve projection data in the data consistency layer, the proposed LEARN++ model integrates two parallel and interactive subnetworks to perform image restoration and sinogram inpainting operations on both the image and projection domains simultaneously, which can fully explore the latent relations between projection data and reconstructed images. The experimental results demonstrate that the proposed LEARN++ model achieves competitive qualitative and quantitative results compared to several state-of-the-art methods in terms of both artifact reduction and detail preservation.

106. Learn-Prune-Share for Lifelong Learning [PDF]
  Zifeng Wang, Tong Jian, Kaushik Chowdhury, Yanzhi Wang, Jennifer Dy, Stratis Ioannidis
Abstract: In lifelong learning, we wish to maintain and update a model (e.g., a neural network classifier) in the presence of new classification tasks that arrive sequentially. In this paper, we propose a learn-prune-share (LPS) algorithm which addresses the challenges of catastrophic forgetting, parsimony, and knowledge reuse simultaneously. LPS splits the network into task-specific partitions via an ADMM-based pruning strategy. This leads to no forgetting, while maintaining parsimony. Moreover, LPS integrates a novel selective knowledge sharing scheme into this ADMM optimization framework. This enables adaptive knowledge sharing in an end-to-end fashion. Comprehensive experimental results on two lifelong learning benchmark datasets and a challenging real-world radio frequency fingerprinting dataset are provided to demonstrate the effectiveness of our approach. Our experiments show that LPS consistently outperforms multiple state-of-the-art competitors.

107. Attentional Biased Stochastic Gradient for Imbalanced Classification [PDF]
  Qi Qi, Yi Xu, Rong Jin, Wotao Yin, Tianbao Yang
Abstract: In this paper~\footnote{The original title is "Momentum SGD with Robust Weighting For Imbalanced Classification"}, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD in which we leverage an attentional mechanism to assign an individual importance weight to each gradient in the mini-batch. Unlike existing individual weighting methods that learn the individual weights by meta-learning on a separate balanced validation set, our weighting scheme is self-adaptive and is grounded in distributionally robust optimization. The weight of a sampled data point is proportional to the exponential of a scaled loss value of the data, where the scaling factor is interpreted as the regularization parameter in the framework of information-regularized distributionally robust optimization. We employ a step damping strategy for the scaling factor to balance between the learning of feature extraction layers and the learning of the classifier layer. Compared with existing meta-learning methods that require three backward propagations for computing mini-batch stochastic gradients at three different points at each iteration, our method is more efficient, with only one backward propagation at each iteration as in standard deep learning methods. Compared with existing class-level weighting schemes, our method can be applied to online learning without any knowledge of class priors, while enjoying a further performance boost in offline learning when combined with existing class-level weighting schemes. Our empirical studies on several benchmark datasets also demonstrate the effectiveness of our proposed method.
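
Concretely, weights proportional to exp(loss / lambda), normalized within the mini-batch, realize the described attentional weighting, with lambda acting as the regularization (scaling) parameter. In the sketch below, the max-shift and the mean-one normalization are standard numerical choices, not necessarily the paper's.

```python
import numpy as np

def absgd_weights(losses, lam=1.0):
    # losses: (n,) array of per-sample losses for the mini-batch.
    w = np.exp((losses - losses.max()) / lam)  # shift by max for stability
    return w * (len(losses) / w.sum())         # normalize to mean weight 1
```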

108. The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models [PDF]
  Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang
Abstract: The computer vision world has been regaining enthusiasm for various pre-trained models, including both classical ImageNet supervised pre-training and recently emerged self-supervised pre-training such as simCLR and MoCo. Pre-trained weights often boost a wide range of downstream tasks including classification, detection, and segmentation. Latest studies suggest that pre-training benefits from gigantic model capacity. We are thus curious and ask: after pre-training, does a pre-trained model indeed have to stay large for its universal downstream transferability? In this paper, we examine supervised and self-supervised pre-trained models through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly sparse matching subnetworks that can be trained in isolation from (nearly) scratch to reach the full models' performance. We extend the scope of LTH to questioning whether matching subnetworks still exist in the pre-training models, that enjoy the same downstream transfer performance. Our extensive experiments convey an overall positive message: from all pre-trained weights obtained by ImageNet classification, simCLR and MoCo, we are consistently able to locate such matching subnetworks at 59.04% to 96.48% sparsity that transfer universally to multiple downstream tasks, whose performance sees no degradation compared to using the full pre-trained weights. Further analyses reveal that subnetworks found from different pre-training tend to yield diverse mask structures and perturbation sensitivities. We conclude that the core LTH observations remain generally relevant in the pre-training paradigm of computer vision, but more delicate discussions are needed in some cases. Code and pre-trained models will be made available at: this https URL.
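
Matching subnetworks at a given sparsity are typically exposed by magnitude pruning. The one-shot sketch below is a generic lottery-ticket-style step, not the paper's exact iterative schedule.

```python
import torch

def magnitude_masks(model, sparsity=0.59):
    # Zero out the `sparsity` fraction of smallest-magnitude weights;
    # returns one binary mask per parameter tensor.
    scores = torch.cat([p.abs().flatten() for p in model.parameters()])
    k = max(1, int(sparsity * scores.numel()))
    threshold = scores.kthvalue(k).values
    return [(p.abs() > threshold).float() for p in model.parameters()]
```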

109. Interactive Radiotherapy Target Delineation with 3D-Fused Context Propagation [PDF]
  Chun-Hung Chao, Hsien-Tzu Cheng, Tsung-Ying Ho, Le Lu, Min Sun
Abstract: Gross tumor volume (GTV) delineation on tomographic medical imaging is crucial for radiotherapy planning and cancer diagnosis. Convolutional neural networks (CNNs) have predominated in automatic 3D medical segmentation tasks, including contouring the radiotherapy target given a 3D CT volume. While CNNs may provide feasible outcomes, in clinical scenarios double-checking and prediction refinement by experts are still necessary because of CNNs' inconsistent performance on unexpected patient cases. To provide experts an efficient way to modify the CNN predictions without retraining the model, we propose 3D-fused context propagation, which propagates any edited slice to the whole 3D volume. By considering the high-level feature maps, radiation oncologists would only be required to edit a few slices to guide the correction and refine the whole prediction volume. Specifically, we leverage the backpropagation-for-activation technique to convey the user editing information backward to the latent space and generate new predictions based on the updated and original features. During the interaction, our proposed approach reuses the extant extracted features and does not alter the existing 3D CNN model architectures, avoiding perturbation of other predictions. The proposed method is evaluated on two published radiotherapy target contouring datasets of nasopharyngeal and esophageal cancer. The experimental results demonstrate that, given oncologists' interactive inputs, our proposed method can further effectively improve the existing segmentation predictions from different model architectures.
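
The propagation mechanism can be pictured as refining the latent features until the decoded volume matches the expert-edited slice, after which the whole 3D prediction is regenerated. The loop below is a schematic stand-in (plain SGD with a per-slice BCE loss are our assumptions); the paper's backpropagation-for-activation procedure differs in detail.

```python
import torch
import torch.nn.functional as F

def propagate_edit(decoder, features, edited_slice, slice_idx,
                   steps=20, lr=0.1):
    # Refine latent `features` so the decoded volume agrees with one
    # expert-edited slice; re-decoding then updates every other slice.
    feats = features.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([feats], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        volume = decoder(feats)  # e.g. (D, H, W) logits
        loss = F.binary_cross_entropy_with_logits(volume[slice_idx],
                                                  edited_slice)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return decoder(feats)
```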

110. Delay Differential Neural Networks [PDF]
  Srinivas Anumasa, P.K. Srijith
Abstract: Neural ordinary differential equations (NODEs) treat the computation of intermediate feature vectors as trajectories of an ordinary differential equation parameterized by a neural network. In this paper, we propose a novel model, delay differential neural networks (DDNN), inspired by delay differential equations (DDEs). The proposed model considers the derivative of the hidden feature vector as a function of the current feature vector and past feature vectors (the history). The function is modelled as a neural network, and consequently it leads to continuous-depth alternatives to many recent ResNet variants. We propose two different DDNN architectures, depending on the way current and past feature vectors are considered. For training DDNNs, we provide a memory-efficient adjoint method for computing gradients and back-propagating through the network. DDNN improves the data efficiency of NODE by further reducing the number of parameters without affecting the generalization performance. Experiments conducted on synthetic and real-world image classification datasets such as Cifar10 and Cifar100 show the effectiveness of the proposed models.
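
The dynamic described here is dh/dt = f(h(t), h(t - tau)). A forward-Euler toy integrator makes it concrete; holding the history at h0 for t < tau is a common DDE convention, assumed here rather than taken from the paper.

```python
def ddnn_trajectory(f, h0, tau, dt=0.1, steps=100):
    # Integrate dh/dt = f(h(t), h(t - tau)) with forward Euler.
    delay = max(1, int(tau / dt))  # delay measured in steps
    traj = [h0]
    for t in range(steps):
        past = traj[t - delay] if t >= delay else h0
        traj.append(traj[-1] + dt * f(traj[-1], past))
    return traj
```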

111. Knowledge Capture and Replay for Continual Learning [PDF]
  Saisubramaniam Gopalakrishnan, Pranshu Ranjan Singh, Haytham Fayek, Savitha Ramasamy, Arulmurugan Ambikapathi
Abstract: Deep neural networks have shown promise in several domains, and the learned task-specific information is implicitly stored in the network parameters. Utilizing representations from these networks for downstream tasks such as continual learning will be vital. In this paper, we introduce the notion of {\em flashcards}: visual representations that {\em capture} the encoded knowledge of a network as a function of random image patterns. We demonstrate the effectiveness of flashcards in capturing representations and show that they are an efficient replay method for the general, task-agnostic continual learning setting. Thus, while adapting to a new task, a limited number of constructed flashcards help to prevent catastrophic forgetting of the previously learned tasks. Most interestingly, such flashcards neither require external memory storage nor need to be accumulated over multiple tasks; they only need to be constructed just before learning the subsequent new task, irrespective of the number of tasks trained before, and are hence task agnostic. We first demonstrate the efficacy of flashcards in capturing knowledge representations from a trained network, and then empirically validate their efficacy on a variety of continual learning tasks: continual unsupervised reconstruction, continual denoising, and new-instance learning classification, using a number of heterogeneous benchmark datasets. These studies also indicate that continual learning algorithms with flashcards as the replay strategy perform better than other state-of-the-art replay methods, and exhibit on-par performance with the best possible baseline using coreset sampling, with the least additional computational complexity and storage.
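
One way to read the construction: feed random image patterns through the trained network and store the (pattern, response) pairs for replay when the next task arrives. The sketch below follows that reading; the uniform-random patterns and the pair format are assumptions consistent with the abstract, not the authors' exact recipe.

```python
import torch

def make_flashcards(model, num_cards=32, shape=(3, 32, 32)):
    # Capture the network's encoded knowledge as responses to
    # random image patterns, for later replay.
    patterns = torch.rand(num_cards, *shape)
    with torch.no_grad():
        responses = model(patterns)
    return patterns, responses
```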

112. Generative Adversarial Networks for Automatic Polyp Segmentation [PDF]
  Awadelrahman M. A. Ali Ahmed
Abstract: This paper aims to contribute to benchmarking the automatic polyp segmentation problem using the generative adversarial network framework. Perceiving the problem as an image-to-image translation task, conditional generative adversarial networks are utilized to generate masks conditioned on the images as inputs. Both the generator and the discriminator are convolutional neural networks. The model achieved a Jaccard index of 0.4382 and an F2 score of 0.611.

113. HI-Net: Hyperdense Inception 3D UNet for Brain Tumor Segmentation [PDF]
  Saqib Qamar, Parvez Ahmad, Linlin Shen
Abstract: The brain tumor segmentation task aims to classify tissue into the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) classes using multimodal MRI images. Quantitative analysis of brain tumors is critical for clinical decision making. While manual segmentation is tedious, time-consuming, and subjective, this task is at the same time very challenging for automatic segmentation methods. Thanks to their powerful learning ability, convolutional neural networks (CNNs), mainly fully convolutional networks, have shown promising brain tumor segmentation results. This paper further boosts the performance of brain tumor segmentation by proposing hyperdense inception 3D UNet (HI-Net), which captures multi-scale information by stacking factorizations of 3D weighted convolutional layers in the residual inception block. We use hyper-dense connections among factorized convolutional layers to extract more contextual information, with the help of feature reusability. We use a dice loss function to cope with class imbalances. We validate the proposed architecture on the multi-modal brain tumor segmentation challenge (BRATS) 2020 testing dataset. Preliminary results on the BRATS 2020 testing set show that our proposed approach achieves dice (DSC) scores of 0.79457, 0.87494, and 0.83712 for ET, WT, and TC, respectively.

114. Sampling Training Data for Continual Learning Between Robots and the Cloud [PDF]
  Sandeep Chinchali, Evgenya Pergament, Manabu Nakanoya, Eyal Cidon, Edward Zhang, Dinesh Bharadia, Marco Pavone, Sachin Katti
Abstract: Today's robotic fleets are increasingly measuring high-volume video and LIDAR sensory streams, which can be mined for valuable training data, such as rare scenes of road construction sites, to steadily improve robotic perception models. However, re-training perception models on growing volumes of rich sensory data in central compute servers (or the "cloud") places an enormous time and cost burden on network transfer, cloud storage, human annotation, and cloud computing resources. Hence, we introduce HarvestNet, an intelligent sampling algorithm that resides on-board a robot and reduces system bottlenecks by storing only rare, useful events to steadily improve perception models re-trained in the cloud. HarvestNet significantly improves the accuracy of machine-learning models on our novel dataset of road construction sites, field testing of self-driving cars, and streaming face recognition, while reducing cloud storage, dataset annotation time, and cloud compute time by 65.7-81.3%. Further, it is 1.05-2.58x more accurate than baseline algorithms and runs scalably on embedded deep learning hardware. We provide a suite of compute-efficient perception models for the Google Edge Tensor Processing Unit (TPU), an extended technical report, and a novel video dataset to the research community at this https URL.

115. Learning Consistent Deep Generative Models from Sparse Data via Prediction Constraints [PDF]
  Gabriel Hope, Madina Abdrakhmanova, Xiaoyin Chen, Michael C. Hughes, Erik B. Sudderth
Abstract: We develop a new framework for learning variational autoencoders and other deep generative models that balances generative and discriminative goals. Our framework optimizes model parameters to maximize a variational lower bound on the likelihood of observed data, subject to a task-specific prediction constraint that prevents model misspecification from leading to inaccurate predictions. We further enforce a consistency constraint, derived naturally from the generative model, that requires predictions on reconstructed data to match those on the original data. We show that these two contributions -- prediction constraints and consistency constraints -- lead to promising image classification performance, especially in the semi-supervised scenario where category labels are sparse but unlabeled data is plentiful. Our approach enables advances in generative modeling to directly boost semi-supervised classification performance, an ability we demonstrate by augmenting deep generative models with latent variables capturing spatial transformations.
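
In penalized form, the objective combines the variational lower bound with the task-specific prediction loss $\ell$ weighted by a multiplier $\lambda$; the rendering below is a standard formulation consistent with the abstract, not the paper's exact equation.

```latex
\max_{\theta,\phi}\;\; \mathrm{ELBO}(\theta,\phi)
\;-\; \lambda\, \mathbb{E}\big[\ell\big(y,\hat{y}_{\theta,\phi}(x)\big)\big]
```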
