Abstracts

1. Portrait Neural Radiance Fields from a Single Image
Chen Gao, Yichang Shih, Wei-Sheng Lai, Chia-Kai Liang, Jia-Bin Huang
Abstract: We present a method for estimating Neural Radiance Fields (NeRF) from a single headshot portrait. While NeRF has demonstrated high-quality view synthesis, it requires multiple images of static scenes and is thus impractical for casual captures and moving subjects. In this work, we propose to pretrain the weights of a multilayer perceptron (MLP), which implicitly models the volumetric density and colors, with a meta-learning framework using a light stage portrait dataset. To improve the generalization to unseen faces, we train the MLP in the canonical coordinate space approximated by 3D face morphable models. We quantitatively evaluate the method using controlled captures and demonstrate the generalization to real portrait images, showing favorable results against state-of-the-art methods.

2. Robust Consistent Video Depth Estimation
Johannes Kopf, Xuejian Rong, Jia-Bin Huang
Abstract: We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video. We integrate a learning-based depth prior, in the form of a convolutional neural network trained for single-image depth estimation, with geometric optimization, to estimate a smooth camera trajectory as well as detailed and stable depth reconstruction. Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details. In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations. Our method quantitatively outperforms state-of-the-art methods on the Sintel benchmark for both depth and pose estimation and attains favorable qualitative results across diverse in-the-wild datasets.

3. Are Fewer Labels Possible for Few-shot Learning?
Suichan Li, Dongdong Chen, Yinpeng Chen, Lu Yuan, Lei Zhang, Qi Chu, Nenghai Yu
Abstract: Few-shot learning is challenging due to its very limited data and labels. Recent studies in big transfer (BiT) show that few-shot learning can greatly benefit from pretraining on a large-scale labeled dataset in a different domain. This paper asks a more challenging question: "can we use as few labels as possible for few-shot learning in both pretraining (with no labels) and fine-tuning (with fewer labels)?". Our key insight is that the clustering of target samples in the feature space is all we need for few-shot finetuning. It explains why vanilla unsupervised pretraining (poor clustering) is worse than supervised pretraining. In this paper, we propose transductive unsupervised pretraining, which achieves better clustering by involving target data even though its amount is very limited. The improved clustering result is of great value for identifying the most representative samples ("eigen-samples") for users to label, and in return, continued finetuning with the labeled eigen-samples further improves the clustering. Thus, we propose eigen-finetuning to enable fewer-shot learning by leveraging the co-evolution of clustering and eigen-samples during finetuning. We conduct experiments on 10 different few-shot target datasets, and our average few-shot performance outperforms both vanilla inductive unsupervised transfer and supervised transfer by a large margin. For instance, when each target category only has 10 labeled samples, the mean accuracy gain over the above two baselines is 9.2% and 3.42 respectively.

4. AutoSelect: Automatic and Dynamic Detection Selection for 3D Multi-Object Tracking
Xinshuo Weng, Kris Kitani
Abstract: 3D multi-object tracking is an important component in robotic perception systems such as self-driving vehicles. Recent work follows a tracking-by-detection pipeline, which aims to match past tracklets with detections in the current frame. To avoid matching with false positive detections, prior work filters out detections with low confidence scores via a threshold. However, finding a proper threshold is non-trivial and requires extensive manual search via ablation study. Also, this threshold is sensitive to many factors such as target object category, so the threshold must be searched again if these factors change. To ease this process, we propose to automatically select high-quality detections and remove the effort needed for manual threshold search. Also, prior work often uses a single threshold per data sequence, which is sub-optimal in particular frames or for certain objects. Instead, we dynamically search for a threshold per frame or per object to further boost performance. Through experiments on KITTI and nuScenes, our method can filter out 45.7% of false positives while maintaining the recall, achieving new state-of-the-art performance and removing the need for manual threshold tuning.
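The confidence filtering that the abstract attributes to prior work can be sketched as follows. This is only an illustration of a fixed threshold per sequence versus a separate threshold per frame. The `Detection` class, the scores, and the hand-picked thresholds are hypothetical, and the paper's automatic selection procedure is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box_id: int    # hypothetical identifier of a detected box
    score: float   # detector confidence in [0, 1]


def filter_by_threshold(detections, threshold):
    """Keep the detections whose confidence is at least `threshold`."""
    return [d for d in detections if d.score >= threshold]


def filter_per_frame(frames, thresholds):
    """Apply a separate threshold to each frame instead of one per sequence."""
    return [filter_by_threshold(f, t) for f, t in zip(frames, thresholds)]


# Two frames of made-up detections.
frames = [
    [Detection(0, 0.92), Detection(1, 0.35), Detection(2, 0.61)],
    [Detection(3, 0.48), Detection(4, 0.15)],
]

# One threshold for the whole sequence drops every detection in frame 2.
single_ids = [[d.box_id for d in filter_by_threshold(f, 0.5)] for f in frames]

# A lower, hand-picked threshold for frame 2 keeps detection 3.
per_frame_ids = [[d.box_id for d in f] for f in filter_per_frame(frames, [0.5, 0.4])]

print(single_ids)
print(per_frame_ids)
```

The example shows why a single value can be sub-optimal: the second frame has uniformly lower scores, so a sequence-level threshold discards all of its detections.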

5. iNeRF: Inverting Neural Radiance Fields for Pose Estimation
Lin Yen-Chen, Pete Florence, Jonathan T. Barron, Alberto Rodriguez, Phillip Isola, Tsung-Yi Lin
Abstract: We present iNeRF, a framework that performs pose estimation by "inverting" a trained Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis - synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis with NeRF for 6DoF pose estimation - given an image, find the translation and rotation of a camera relative to a 3D model. Starting from an initial pose estimate, we use gradient descent to minimize the residual between pixels rendered from an already-trained NeRF and pixels in an observed image. In our experiments, we first study 1) how to sample rays during pose refinement for iNeRF to collect informative gradients and 2) how different batch sizes of rays affect iNeRF on a synthetic dataset. We then show that for complex real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF. Finally, we show iNeRF can be combined with feature-based pose initialization. The approach outperforms all other RGB-based methods relying on synthetic data on LineMOD.
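The analysis-by-synthesis loop described above (render from the current pose, compare with the observed pixels, follow the gradient) can be illustrated with a toy. This is a minimal sketch and not the authors' implementation: the "renderer" is a hypothetical one-dimensional Gaussian blob rather than a trained NeRF, the pose is a single translation rather than a 6DoF transform, and all constants are made up.

```python
import math

PIXELS = range(10)   # hypothetical 1D "image" with ten pixel centres
WIDTH = 1.5          # hypothetical blob width of the toy scene


def render(pose):
    """Toy differentiable renderer: a Gaussian blob centred at `pose`."""
    return [math.exp(-((u - pose) ** 2) / (2.0 * WIDTH ** 2)) for u in PIXELS]


def loss_and_grad(pose, observed):
    """Squared pixel residual and its derivative with respect to the pose."""
    rendered = render(pose)
    loss, grad = 0.0, 0.0
    for u, r, o in zip(PIXELS, rendered, observed):
        residual = r - o
        d_render = r * (u - pose) / WIDTH ** 2   # d(render)/d(pose)
        loss += residual ** 2
        grad += 2.0 * residual * d_render
    return loss, grad


def estimate_pose(observed, initial_pose, lr=0.3, steps=200):
    """Gradient descent on the pose, starting from an initial estimate."""
    pose = initial_pose
    for _ in range(steps):
        _, grad = loss_and_grad(pose, observed)
        pose -= lr * grad
    return pose


observed = render(4.0)                          # image taken at the true pose
estimated = estimate_pose(observed, initial_pose=3.0)
print(round(estimated, 3))
```

As in the abstract, the procedure needs a reasonable initial pose: the toy loss has vanishing gradients once the rendered and observed blobs no longer overlap.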

6. SPAA: Stealthy Projector-based Adversarial Attacks on Deep Image Classifiers
Bingyao Huang, Haibin Ling
Abstract: Light-based adversarial attacks aim to fool deep learning-based image classifiers by altering the physical light condition using a controllable light source, e.g., a projector. Compared with physical attacks that place carefully designed stickers or printed adversarial objects, projector-based ones obviate modifying the physical entities. Moreover, projector-based attacks can be performed transiently and dynamically by altering the projection pattern. However, existing approaches focus on projecting adversarial patterns that result in clearly perceptible camera-captured perturbations, while the more interesting yet challenging goal, stealthy projector-based attack, remains an open problem. In this paper, for the first time, we formulate this problem as an end-to-end differentiable process and propose Stealthy Projector-based Adversarial Attack (SPAA). In SPAA, we approximate the real project-and-capture operation using a deep neural network named PCNet, then we include PCNet in the optimization of projector-based attacks such that the generated adversarial projection is physically plausible. Finally, to generate robust and stealthy adversarial projections, we propose an optimization algorithm that uses minimum perturbation and adversarial confidence thresholds to alternate between the adversarial loss and stealthiness loss optimization. Our experimental evaluations show that the proposed SPAA clearly outperforms other methods by achieving higher attack success rates and meanwhile being stealthier.

7. Full-Glow: Fully conditional Glow for more realistic image generation
Moein Sorkhei, Gustav Eje Henter, Hedvig Kjellström
Abstract: Autonomous agents, such as driverless cars, require large amounts of labeled visual data for their training. A viable approach for acquiring such data is training a generative model with collected real data, and then augmenting the collected real dataset with synthetic images from the model, generated with control of the scene layout and ground truth labeling. In this paper we propose Full-Glow, a fully conditional Glow-based architecture for generating plausible and realistic images of novel street scenes given a semantic segmentation map indicating the scene layout. Benchmark comparisons show our model to outperform recent works in terms of the semantic segmentation performance of a pretrained PSPNet. This indicates that images from our model are, to a higher degree than from other models, similar to real images of the same kinds of scenes and objects, making them suitable as training data for a visual semantic segmentation or object recognition system.

8. Sylvester Matrix Based Similarity Estimation Method for Automation of Defect Detection in Textile Fabrics
R.M.L.N. Kumari, G.A.C.T. Bandara, Maheshi B. Dissanayake
Abstract: Fabric defect detection is a crucial quality control step in the textile manufacturing industry. In this article, a machine vision system based on the Sylvester Matrix Based Similarity Method (SMBSM) is proposed to automate the defect detection process. The algorithm involves six phases, namely resolution matching, image enhancement using Histogram Specification and Median-Mean Based Sub-Image-Clipped Histogram Equalization, image registration through alignment and hysteresis process, image subtraction, edge detection, and fault detection by means of the rank of the Sylvester matrix. The experimental results demonstrate that the proposed method is robust and yields an accuracy of 93.4% and a precision of 95.8%, with a computation time of 2275 ms.

9. Efficient Nonlinear RX Anomaly Detectors
José A. Padrón Hidalgo, Adrián Pérez-Suay, Fatih Nar, Gustau Camps-Valls
Abstract: Current anomaly detection algorithms are typically challenged by either accuracy or efficiency. More accurate nonlinear detectors are typically slow and not scalable. In this letter, we propose two families of techniques to improve the efficiency of the standard kernel Reed-Xiaoli (RX) method for anomaly detection by approximating the kernel function with either data-independent random Fourier features or a data-dependent basis with the Nyström approach. We compare all methods on real multi- and hyperspectral images. We show that the proposed efficient methods have a lower computational cost and perform similarly to (or outperform) the standard kernel RX algorithm thanks to their implicit regularization effect. Last but not least, the Nyström approach has an improved power of detection.
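The data-independent approximation mentioned above can be illustrated with random Fourier features. The sketch below assumes a Gaussian (RBF) kernel, which the abstract does not specify, and covers only the kernel approximation step, not the RX detector itself. The function names, the sample vectors, and the number of features are all hypothetical.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Exact Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_random_fourier_features(dim, n_features, sigma, rng):
    """Return a feature map z(.) such that z(x) . z(y) approximates k(x, y).

    Frequencies are drawn from N(0, 1/sigma^2) per coordinate and phases
    from U[0, 2*pi); neither depends on the data.
    """
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(freqs, phases)]

    return features


rng = random.Random(0)
x = [0.2, -0.4, 1.0]
y = [0.5, 0.1, 0.3]
sigma = 1.0

z = make_random_fourier_features(dim=3, n_features=2000, sigma=sigma, rng=rng)
approx = sum(a * b for a, b in zip(z(x), z(y)))
exact = rbf_kernel(x, y, sigma)
print(exact, approx)
```

Once samples are mapped through `z`, a linear method on the explicit features stands in for the kernel method, which is where the efficiency gain comes from: the cost scales with the number of features rather than with the number of pairwise kernel evaluations.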

10. Demystifying Pseudo-LiDAR for Monocular 3D Object Detection
Andrea Simonelli, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Elisa Ricci
Abstract: Pseudo-LiDAR-based methods for monocular 3D object detection have attracted considerable attention in the community due to performance gains shown on the KITTI3D benchmark dataset, in particular on the commonly reported validation split. This generated a distorted impression about the superiority of Pseudo-LiDAR approaches over methods working with RGB images only. Our first contribution consists in rectifying this view by analysing and showing experimentally that the validation results published by Pseudo-LiDAR-based methods are substantially biased. The source of the bias resides in an overlap between the KITTI3D object detection validation set and the training/validation sets used to train depth predictors feeding Pseudo-LiDAR-based methods. Surprisingly, the bias remains even after geographically removing the overlap, revealing the presence of a more structured contamination. This leaves the test set as the only reliable means of comparison, where published Pseudo-LiDAR-based methods do not excel. Our second contribution brings Pseudo-LiDAR-based methods back up in the ranking with the introduction of a 3D confidence prediction module. Thanks to the proposed architectural changes, our modified Pseudo-LiDAR-based methods exhibit extraordinary gains on the test scores (up to +8% 3D AP).

11. Machine Learning Information Fusion in Earth Observation: A Comprehensive Review of Methods, Applications and Data Sources
S. Salcedo-Sanz, P. Ghamisi, M. Piles, M. Werner, L. Cuadra, A. Moreno-Martínez, E. Izquierdo-Verdiguier, J. Muñoz-Marí, Amirhosein Mosavi, G. Camps-Valls
Abstract: This paper reviews the most important information fusion data-driven algorithms based on Machine Learning (ML) techniques for problems in Earth observation. Nowadays we observe and model the Earth with a wealth of observations, from a plethora of different sensors, measuring states, fluxes, processes and variables, at unprecedented spatial and temporal resolutions. Earth observation is well equipped with remote sensing systems, mounted on satellites and airborne platforms, but it also involves in-situ observations, numerical models and social media data streams, among other data sources. Data-driven approaches, and ML techniques in particular, are the natural choice to extract significant information from this data deluge. This paper produces a thorough review of the latest work on information fusion for Earth observation, with a practical intention, focusing not only on describing the most relevant previous works in the field, but also on the most important Earth observation applications where ML information fusion has obtained significant results. We also review some of the most currently used data sets, models and sources for Earth observation problems, describing their importance and how to obtain the data when needed. Finally, we illustrate the application of ML data fusion with a representative set of case studies, and we discuss the outlook for the near future of the field.

12. OneNet: Towards End-to-End One-Stage Object Detection
Peize Sun, Yi Jiang, Enze Xie, Zehuan Yuan, Changhu Wang, Ping Luo
Abstract: End-to-end one-stage object detection has trailed thus far. This paper discovers that the lack of classification cost between sample and ground-truth in label assignment is the main obstacle for one-stage detectors to remove Non-maximum Suppression (NMS) and reach end-to-end. Existing one-stage object detectors assign labels by location cost only, e.g. box IoU or point distance. Without classification cost, location cost alone leads to redundant boxes with high confidence scores in inference, making NMS a necessary post-processing step. To design an end-to-end one-stage object detector, we propose Minimum Cost Assignment. The cost is the summation of classification cost and location cost between sample and ground-truth. For each object ground-truth, only the one sample of minimum cost is assigned as the positive sample; all others are negative samples. To evaluate the effectiveness of our method, we design an extremely simple one-stage detector named OneNet. Our results show that when trained with Minimum Cost Assignment, OneNet avoids producing duplicated boxes and achieves end-to-end detection. On the COCO dataset, OneNet achieves 35.0 AP/80 FPS and 37.7 AP/50 FPS with an image size of 512 pixels. We hope OneNet could serve as an effective baseline for end-to-end one-stage object detection. The code is available at: this https URL.
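The assignment rule stated in the abstract (cost = classification cost + location cost, one minimum-cost positive per ground-truth object) can be sketched as below. The concrete cost terms used here, one minus the predicted class probability and one minus the box IoU, are assumptions chosen for illustration, and the function names are hypothetical. The paper's exact cost definitions and weights are not given in the abstract.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def minimum_cost_assignment(pred_boxes, pred_probs, gt_boxes, gt_labels):
    """For each ground-truth object, return the index of its one positive sample."""
    positives = []
    for gt_box, gt_label in zip(gt_boxes, gt_labels):
        costs = []
        for box, probs in zip(pred_boxes, pred_probs):
            cls_cost = 1.0 - probs[gt_label]        # assumed classification cost
            loc_cost = 1.0 - box_iou(box, gt_box)   # assumed location cost
            costs.append(cls_cost + loc_cost)
        # Only the single cheapest sample is positive; all others are negative.
        positives.append(min(range(len(costs)), key=costs.__getitem__))
    return positives


# Samples 0 and 1 predict the same box, so location cost alone cannot separate them.
pred_boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
gt_boxes = [(0, 0, 10, 10)]
gt_labels = [1]

positives = minimum_cost_assignment(pred_boxes, pred_probs, gt_boxes, gt_labels)
print(positives)
```

In the example, the classification term breaks the tie between the two identical boxes, which is the situation the abstract identifies as the source of redundant high-confidence predictions.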

13. R-AGNO-RPN: A LIDAR-Camera Region Deep Network for Resolution-Agnostic Detection
Ruddy Théodose, Dieumet Denis, Thierry Chateau, Vincent Frémont, Paul Checchin
Abstract: Current neural-network-based object detection approaches that process LiDAR point clouds are generally trained on data from one kind of LiDAR sensor. However, their performance decreases when they are tested with data coming from a different LiDAR sensor than the one used for training, i.e., with a different point cloud resolution. In this paper, R-AGNO-RPN, a region proposal network built on the fusion of 3D point clouds and RGB images, is proposed for 3D object detection regardless of point cloud resolution. As our approach is designed to also be applied to low point cloud resolutions, the proposed method focuses on object localization instead of estimating refined boxes on reduced data. The resilience to low-resolution point clouds is obtained through image features accurately mapped to Bird's Eye View and a specific data augmentation procedure that improves the contribution of the RGB images. To show the proposed network's ability to deal with different point cloud resolutions, experiments are conducted on data from both the KITTI 3D Object Detection and the nuScenes datasets. In addition, to assess its performance, our method is compared to PointPillars, a well-known 3D detection network. Experimental results show that even on point cloud data reduced by 80% of its original points, our method is still able to deliver relevant proposal localization.

14. HRCenterNet: An Anchorless Approach to Chinese Character Segmentation in Historical Documents
Chia-Wei Tang, Chao-Lin Liu, Po-Sen Chiu
Abstract: The information provided by historical documents has always been indispensable in the transmission of human civilization, but it has also made these books susceptible to damage due to various factors. Thanks to recent technology, the automatic digitization of these documents is one of the quickest and most effective means of preservation. Automatic text digitization can be divided into two main stages: character segmentation and character recognition, where the recognition results depend largely on the accuracy of segmentation. Therefore, in this study, we focus only on the character segmentation of historical Chinese documents. In this research, we propose a model named HRCenterNet, which combines an anchorless object detection method with a parallelized architecture. The MTHv2 dataset consists of over 3000 Chinese historical document images and over 1 million individual Chinese characters; with this large amount of data, the segmentation capability of our model achieves an IoU of 0.81 on average with the best speed-accuracy trade-off compared to the others. Our source code is available at this https URL.

15. Look Before you Speak: Visually Contextualized Utterances
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
Abstract: While most conversational AI systems focus on textual dialogue only, conditioning utterances on visual context (when it's available) can lead to more realistic conversations. Unfortunately, a major challenge for incorporating visual context into conversational dialogue is the lack of large-scale labeled datasets. We provide a solution in the form of a new visually conditioned Future Utterance Prediction task. Our task involves predicting the next utterance in a video, using both visual frames and transcribed speech as context. By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations. Leveraging recent advances in multimodal learning, our model consists of a novel co-attentional multimodal video transformer, and when trained on both textual and visual context, outperforms baselines that use textual inputs alone. Further, we demonstrate that our model trained for this task on unlabelled videos achieves state-of-the-art performance on a number of downstream VideoQA benchmarks such as MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.

16. TFPnP: Tuning-free Plug-and-Play Proximal Algorithm with Applications to Inverse Imaging Problems
Kaixuan Wei, Angelica Aviles-Rivero, Jingwei Liang, Ying Fu, Hua Huang, Carola-Bibiane Schönlieb
Abstract: Plug-and-Play (PnP) is a non-convex framework that combines proximal algorithms, for example the alternating direction method of multipliers (ADMM), with advanced denoiser priors. Over the past few years, great empirical success has been obtained by PnP algorithms, especially those integrated with deep learning-based denoisers. However, a crucial issue of PnP approaches is the need for manual parameter tweaking, which is essential to obtain high-quality results across the high discrepancy in imaging conditions and varying scene content. In this work, we present a tuning-free PnP proximal algorithm, which can automatically determine the internal parameters including the penalty parameter, the denoising strength and the termination time. A core part of our approach is to develop a policy network for automatic search of parameters, which can be effectively learned via mixed model-free and model-based deep reinforcement learning. We demonstrate, through a set of numerical and visual experiments, that the learned policy can customize different parameters for different states, and is often more efficient and effective than existing handcrafted criteria. Moreover, we discuss the practical considerations of the plugged denoisers, which together with our learned policy yield state-of-the-art results. This holds for both linear and nonlinear exemplary inverse imaging problems, and in particular, we show promising results on compressed sensing MRI, sparse-view CT and phase retrieval.

17. An Analysis of Deep Object Detectors For Diver Detection
Karin de Langis, Michael Fulton, Junaed Sattar
Abstract: With the end goal of selecting and using diver detection models to support human-robot collaboration capabilities such as diver following, we thoroughly analyze a large set of deep neural networks for diver detection. We begin by producing a dataset of approximately 105,000 annotated images of divers sourced from videos -- one of the largest and most varied diver detection datasets ever created. Using this dataset, we train a variety of state-of-the-art deep neural networks for object detection, including SSD with Mobilenet, Faster R-CNN, and YOLO. Along with these single-frame detectors, we also train networks designed for detection of objects in a video stream, using temporal information as well as single-frame image information. We evaluate these networks on typical accuracy and efficiency metrics, as well as on the temporal stability of their detections. Finally, we analyze the failures of these detectors, pointing out the most common scenarios of failure. Based on our results, we recommend SSDs or Tiny-YOLOv4 for real-time applications on robots and recommend further investigation of video object detection methods.

18. Independent Sign Language Recognition with 3D Body, Hands, and Face Reconstruction
Agelos Kratimenos, Georgios Pavlakos, Petros Maragos
Abstract: Independent Sign Language Recognition is a complex visual recognition problem that combines several challenging tasks of Computer Vision due to the necessity to exploit and fuse information from hand gestures, body features and facial expressions. While many state-of-the-art works have managed to deeply elaborate on these features independently, to the best of our knowledge, no work has adequately combined all three information channels to efficiently recognize Sign Language. In this work, we employ SMPL-X, a contemporary parametric model that enables joint extraction of 3D body shape, face and hands information from a single image. We use this holistic 3D reconstruction for SLR, demonstrating that it leads to higher accuracy than recognition from raw RGB images and their optical flow fed into the state-of-the-art I3D-type network for 3D action recognition and from 2D Openpose skeletons fed into a Recurrent Neural Network. Finally, a set of experiments on the body, face and hand features showed that neglecting any of these significantly reduces the classification accuracy, proving the importance of jointly modeling body shape, facial expression and hand pose for Sign Language Recognition.

19. Increased performance in DDM analysis by calculating structure functions through Fourier transform in time
M. Norouzisadeh, G. Cerchiari, F. Croccolo
Abstract: Differential Dynamic Microscopy (DDM) is the combination of optical microscopy with statistical analysis to obtain information about the dynamical behaviour of a variety of samples spanning from soft matter physics to biology. In DDM, the dynamical evolution of the samples is investigated separately at different length scales and extracted from a set of images recorded at different times. A specific result of interest is the structure function, which can be computed via spatial Fourier transforms and differences of signals. In this work, we present an algorithm to efficiently process a set of images according to the DDM analysis scheme. We benchmarked the new approach against the state-of-the-art algorithm reported in previous work. The new implementation computes the DDM analysis faster, thanks to an additional Fourier transform in time instead of performing differences of signals. This allows very fast analysis even on CPU-based machines. In order to test the new code, we performed the DDM analysis over sets of more than 1000 images with and without the help of GPU hardware acceleration. As an example, for images of 512 × 512 pixels, the new algorithm is 10 times faster than the previous GPU code. Without GPU hardware acceleration and for the same set of images, we found that the new algorithm is 300 times faster than the old one, with both running only on the CPU.
摘要：微分动态显微镜（DDM）是光学显微镜与统计分析的结合，以获取有关从软物质物理学到生物学的各种样品的动力学行为的信息。在DDM中，分别以不同的长度比例研究样本的动态演化，并从在不同时间记录的一组图像中提取样本。感兴趣的特定结果是可以通过空间傅立叶变换和信号差计算的结构函数。在这项工作中，我们提出了一种根据DDM分析方案有效处理一组图像的算法。我们将新方法标记为先前工作中报告的最新算法。由于采用了额外的及时傅里叶变换，而不是执行信号差，因此新的实现可以更快地计算DDM分析。这也允许在基于CPU的计算机中获得非常快速的分析。为了测试新代码，我们在有或没有GPU硬件加速的帮助下，对1000多个图像进行了DDM分析。例如，对于$ 512 x 512 $像素的图像，新算法比以前的GPU代码快10倍。如果没有GPU硬件加速并且对于相同的图像集，我们发现新算法比仅在CPU上运行的旧算法快300。
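
The abstract contrasts two ways of obtaining the structure function: differencing signals at every time lag, or applying an extra Fourier transform along the time axis. The sketch below illustrates that equivalence for a single time series (one Fourier mode or pixel), assuming the usual definition D(dt) = mean over t of |s(t+dt) - s(t)|^2 and computing the autocorrelation through the power spectrum. It is not the authors' code: `fft`, `structure_direct`, and `structure_fft` are illustrative names, and the small radix-2 FFT is included only to keep the example free of dependencies.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def structure_direct(sig):
    """D(dt) for dt = 1..n-1, computed from explicit signal differences."""
    n = len(sig)
    return [
        sum(abs(sig[t + dt] - sig[t]) ** 2 for t in range(n - dt)) / (n - dt)
        for dt in range(1, n)
    ]


def structure_fft(sig):
    """Same D(dt), using |a-b|^2 = |a|^2 + |b|^2 - 2 Re(a conj(b)) and an FFT in time."""
    n = len(sig)
    size = 1
    while size < 2 * n:  # zero-pad so the circular correlation equals the linear one
        size *= 2
    spectrum = fft(list(sig) + [0j] * (size - n))
    power = [abs(s) ** 2 for s in spectrum]
    # acf[dt] = sum over t of sig[t + dt] * conj(sig[t])
    acf = [v / size for v in fft(power, inverse=True)]
    prefix = [0.0]  # prefix[i] = sum of |sig[t]|^2 for t < i
    for s in sig:
        prefix.append(prefix[-1] + abs(s) ** 2)
    result = []
    for dt in range(1, n):
        head = prefix[n - dt]          # terms |sig[t]|^2 with t < n - dt
        tail = prefix[n] - prefix[dt]  # terms |sig[t + dt]|^2
        result.append((head + tail - 2 * acf[dt].real) / (n - dt))
    return result
```

For N frames, `structure_direct` needs on the order of N^2 operations per mode, while `structure_fft` needs on the order of N log N, which is the kind of saving the abstract attributes to transforming in time.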

20. Lookahead optimizer improves the performance of Convolutional Autoencoders for reconstruction of natural images
Sayan Nag
Abstract: Autoencoders are a class of artificial neural networks that have gained a lot of attention in the recent past. The encoder block of an autoencoder compresses the input image into a meaningful representation. A decoder then reconstructs, from the compressed representation, a version that resembles the input image. Autoencoders have many applications in data compression and denoising. Another variant of the autoencoder (AE), the Variational AE (VAE), acts as a generative model like a GAN. Recently, an optimizer known as the lookahead optimizer was introduced, which significantly enhances the performance of Adam as well as SGD. In this paper, we implement Convolutional Autoencoders (CAE) and Convolutional Variational Autoencoders (CVAE) with the lookahead optimizer (with Adam) and compare them with their Adam-only counterparts. For this purpose, we used a movie dataset comprising natural images for the former case and CIFAR100 for the latter. We show that the lookahead optimizer (with Adam) improves the performance of CAEs for reconstruction of natural images.
摘要：自动编码器是一类人工神经网络，近年来受到了广泛的关注。使用自动编码器的编码器块，可以将输入图像压缩为有意义的表示形式。然后，使用解码器将压缩的表示重新构造回看起来像输入图像的版本。它在数据压缩和降噪领域有大量应用。存在另一种形式的自动编码器（AE），称为变体AE（VAE），它像GAN一样用作生成模型。最近，引入了一种称为前瞻性优化器的优化器，该优化器显着提高了Adam和SGD的性能。在本文中，我们使用前瞻性优化器（使用Adam）实现了卷积自动编码器（CAE）和卷积变异自动编码器（CVAE），并将它们与Adam优化器（仅）进行了比较。为此，对于前一种情况，我们使用了包含自然图像的电影数据集；对于后一种情况，我们使用了CIFAR100。我们显示，超前优化器（与Adam一起使用）可提高CAE在重建自然图像方面的性能。
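
The abstract names the lookahead optimizer without describing it. As commonly formulated, lookahead keeps a set of slow weights, lets an inner optimizer (here SGD or Adam) take k fast steps starting from them, and then moves the slow weights a fraction alpha toward where the fast weights ended up. The sketch below shows that loop on plain Python lists. All names (`make_sgd`, `make_adam`, `lookahead`) and hyperparameter values are assumptions for illustration, not the paper's training code.

```python
import math


def make_sgd(lr):
    """Return a plain gradient-descent step function."""
    def step(theta, grad):
        return [p - lr * g for p, g in zip(theta, grad)]
    return step


def make_adam(n, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Return an Adam step function for n parameters; moments persist across calls."""
    m, v, t = [0.0] * n, [0.0] * n, [0]

    def step(theta, grad):
        t[0] += 1
        out = []
        for i, (p, g) in enumerate(zip(theta, grad)):
            m[i] = b1 * m[i] + (1 - b1) * g
            v[i] = b2 * v[i] + (1 - b2) * g * g
            m_hat = m[i] / (1 - b1 ** t[0])  # bias-corrected first moment
            v_hat = v[i] / (1 - b2 ** t[0])  # bias-corrected second moment
            out.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        return out
    return step


def lookahead(inner_step, grad_fn, init, k=5, alpha=0.5, outer_steps=100):
    """Wrap an inner optimizer: k fast steps, then interpolate the slow weights."""
    slow = list(init)
    for _ in range(outer_steps):
        fast = list(slow)                       # fast weights restart from slow weights
        for _ in range(k):
            fast = inner_step(fast, grad_fn(fast))
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow
```

A call such as `lookahead(make_adam(1), lambda th: [2 * (th[0] - 3.0)], [0.0])` runs the wrapper around Adam on the toy objective (x - 3)^2; swapping in `make_sgd` changes only the inner optimizer.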

21. Interactive Fusion of Multi-level Features for Compositional Activity Recognition
Rui Yan, Lingxi Xie, Xiangbo Shu, Jinhui Tang
Abstract: To understand a complex action, multiple sources of information, including appearance, positional, and semantic features, need to be integrated. However, these features are difficult to fuse since they often differ significantly in modality and dimensionality. In this paper, we present a novel framework that accomplishes this goal by interactive fusion, namely, projecting features across different spaces and guiding the process with an auxiliary prediction task. Specifically, we implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction. We evaluate our approach on two action recognition datasets, Something-Something and Charades. Interactive fusion achieves consistent accuracy gains over off-the-shelf action recognition algorithms. In particular, on Something-Else, the compositional setting of Something-Something, interactive fusion reports a remarkable gain of 2.9% in top-1 accuracy.
摘要：要理解复杂的动作，需要集成多种信息源，包括外观，位置和语义特征。但是，这些特征很难融合，因为它们在形式和尺寸上经常有很大差异。在本文中，我们提出了一个新颖的框架，该框架通过交互式融合来实现此目标，即跨不同空间投影特征并使用辅助预测任务对其进行指导。具体来说，我们通过三个步骤来实现该框架，即位置到外观特征提取，语义特征交互以及语义到位置预测。我们在两个动作识别数据集Something-Something和Charades上评估了我们的方法。交互式融合可实现超越现成动作识别算法的一致精度增益。特别是在Something-Else上，Something-Something的成分设置交互式融合报告说，就top-1准确性而言，可观的增益为2.9％。

22. Geometric Adversarial Attacks and Defenses on 3D Point Clouds
Itai Lang, Uriel Kotlicki, Shai Avidan
Abstract: Deep neural networks are prone to adversarial examples that maliciously alter the network's outcome. Due to the increasing popularity of 3D sensors in safety-critical systems and the vast deployment of deep learning models for 3D point sets, there is a growing interest in adversarial attacks and defenses for such models. So far, the research has focused on the semantic level, namely, deep point cloud classifiers. However, point clouds are also widely used in a geometric-related form that includes encoding and reconstructing the geometry. In this work, we explore adversarial examples at a geometric level. That is, a small change to a clean source point cloud leads, after passing through an autoencoder model, to a shape from a different target class. On the defense side, we show that remnants of the attack's target shape are still present at the reconstructed output after applying the defense to the adversarial input. Our code is publicly available at this https URL.
摘要：深度神经网络很容易出现对抗性示例，这些示例会恶意更改网络的结果。由于3D传感器在安全关键型系统中的日益普及以及针对3D点集的深度学习模型的广泛部署，人们越来越对这种模型的对抗性攻击和防御产生兴趣。到目前为止，研究集中在语义级别，即深点云分类器。但是，点云还以与几何相关的形式广泛使用，包括编码和重建几何。在这项工作中，我们从几何角度探索对抗性例子。也就是说，对纯净源点云的微小更改会在通过自动编码器模型后导致来自其他目标类的形状。在防御方面，我们表明，在将防御应用于对抗输入后，重构后的输出中仍会存在攻击目标形状的剩余部分。我们的代码可通过以下https URL公开获得。

23. Concept Generalization in Visual Representation Learning
Mert Bulent Sariyildiz, Yannis Kalantidis, Diane Larlus, Karteek Alahari
Abstract: Measuring concept generalization, i.e., the extent to which models trained on a set of (seen) visual concepts can be used to recognize a new set of (unseen) concepts, is a popular way of evaluating visual representations, especially when they are learned with self-supervised learning. Nonetheless, the choice of which unseen concepts to use is usually made arbitrarily, and independently of the seen concepts used to train representations, thus ignoring any semantic relationships between the two. In this paper, we argue that semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the ImageNet dataset that enables measuring concept generalization in a principled way. Our benchmark leverages expert knowledge from WordNet to define a sequence of unseen ImageNet concept sets that are semantically more and more distant from the ImageNet-1K subset, a ubiquitous training set. This allows us to benchmark visual representations learned on ImageNet-1K out of the box: we analyse a number of such models from supervised, semi-supervised and self-supervised approaches through the lens of concept generalization, and show how our benchmark is able to uncover a number of interesting insights. We will provide resources for the benchmark at this https URL.
摘要：测量概念泛化，即在一组（可见）视觉概念上训练的模型可以用来识别一组新（看不见）概念的程度，是一种评估视觉表示的流行方法，尤其是当它们通过自我监督学习来学习。尽管如此，通常会任意选择使用哪些看不见的概念，并且与用于训练表示形式的已见概念无关，从而忽略了两者之间的任何语义关系。在本文中，我们认为可见和不可见概念之间的语义关系会影响泛化性能，并提出了ImageNet-CoG，这是ImageNet数据集上的一种新颖基准，它能够以原则性方式测量概念泛化。我们的基准测试利用了WordNet的专业知识来定义一系列看不见的ImageNet概念集，这些概念集在语义上与无处不在的培训集ImageNet-1K子集越来越远。这使我们可以对在ImageNet-1K上学到的视觉表示进行开箱即用的基准测试：我们在概念概括的角度下分析了有监督，半监督和自我监督方法中的许多此类模型，并展示了我们的基准是能够发现许多有趣的见解。我们将通过此https URL提供基准测试资源。

24. Can we detect harmony in artistic compositions? A machine learning approach
Adam Vandor, Marie van Vollenhoven, Gerhard Weiss, Gerasimos Spanakis
Abstract: Harmony in visual compositions is a concept that cannot be defined or easily expressed mathematically, even by humans. The goal of the research described in this paper was to find a numerical representation of artistic compositions with different levels of harmony. We ask humans to rate a collection of grayscale images based on the harmony they convey. To represent the images, a set of special features was designed and extracted. By doing so, it became possible to assign objective measures to subjectively judged compositions. Given the ratings and the extracted features, we utilized machine learning algorithms to evaluate the efficiency of such representations in a harmony classification problem. The best performing model (SVM) achieved 80% accuracy in distinguishing between harmonic and disharmonic images, which reinforces the assumption that the concept of harmony can be expressed in a mathematical way that can be assessed by humans.
摘要：视觉构图中的和谐是一个甚至人类也无法定义或难以用数学方式表达的概念。本文所述研究的目的是找到具有不同和声水平的艺术作品的数值表示。我们要求人类根据他们传达的和谐度来评估一组灰度图像。为了表示图像，设计并提取了一组特殊功能。通过这样做，有可能将客观的度量分配给主观判断的构图。给定等级和提取的特征，我们利用机器学习算法来评估这种表示在和声分类问题中的效率。最佳表现模型（SVM）在区分谐波图像和不谐和图像时达到了80％的准确度，这进一步强化了这样的假设，即可以用人类可以评估的数学方式来表达和谐的概念。

25. Enhancing Human Pose Estimation in Ancient Vase Paintings via Perceptually-grounded Style Transfer Learning
Prathmesh Madhu, Angel Villar-Corrales, Ronak Kosti, Torsten Bendschus, Corinna Reinhardt, Peter Bell, Andreas Maier, Vincent Christlein
Abstract: Human pose estimation (HPE) is a central part of understanding the visual narration and body movements of characters depicted in artwork collections, such as Greek vase paintings. Unfortunately, existing HPE methods do not generalise well across domains, resulting in poorly recognized poses. Therefore, we propose a two-step approach. (1) We adapt a dataset of natural images with known person and pose annotations to the style of Greek vase paintings by means of image style transfer. We introduce a perceptually-grounded style transfer training to enforce perceptual consistency. Then, we fine-tune the base model with this newly created dataset. We show that using style-transfer learning significantly improves the SOTA performance on unlabelled data by more than 6% mean average precision (mAP) as well as mean average recall (mAR). (2) To improve the already strong results further, we created a small dataset (ClassArch) consisting of ancient Greek vase paintings from the 6th-5th century BCE with person and pose annotations. We show that fine-tuning on this data with a style-transferred model improves the performance further. In a thorough ablation study, we give a targeted analysis of the influence of style intensities, revealing that the model learns generic domain styles. Additionally, we provide a pose-based image retrieval to demonstrate the effectiveness of our method.
摘要：人体姿势估计（HPE）是理解艺术品收藏（例如希腊花瓶画）中人物的视觉叙述和身体动作的重要组成部分。不幸的是，现有的HPE方法无法在各个域中很好地推广，从而导致姿态识别不佳。因此，我们提出了一种两步法：（1）通过图像样式转移将已知人物的自然图像数据集和姿势注解适应希腊花瓶绘画的样式。我们引入了基于感知的样式转换训练，以增强感知一致性。然后，我们使用这个新创建的数据集微调基础模型。我们表明，使用样式转移学习可以显着提高未标记数据的SOTA性能，平均平均准确度（mAP）和平均平均回想度（mAR）超过6％。（2）为了进一步改善已经很强的效果，我们创建了一个小型数据集（ClassArch），其中包含来自公元前6-5世纪的古希腊花瓶画，并带有人物和姿势注释。我们显示，使用样式转移模型对该数据进行微调可以进一步提高性能。在彻底的消融研究中，我们对样式强度的影响进行了有针对性的分析，揭示了该模型学习通用的领域样式。此外，我们提供了基于姿势的图像检索，以证明我们方法的有效性。

26. Retinex-inspired Unrolling with Cooperative Prior Architecture Search for Low-light Image Enhancement
Risheng Liu, Long Ma, Jiaao Zhang, Xin Fan, Zhongxuan Luo
Abstract: Low-light image enhancement plays a very important role in the low-level vision field. Recent works have built a large variety of deep learning models to address this task. However, these approaches mostly rely on significant architecture engineering and suffer from a high computational burden. In this paper, we propose a new method, named Retinex-inspired Unrolling with Architecture Search (RUAS), to construct a lightweight yet effective enhancement network for low-light images in real-world scenarios. Specifically, building upon the Retinex rule, RUAS first establishes models to characterize the intrinsic underexposed structure of low-light images and unrolls their optimization processes to construct our holistic propagation structure. Then, by designing a cooperative reference-free learning strategy to discover low-light prior architectures from a compact search space, RUAS is able to obtain a top-performing image enhancement network that is fast and requires few computational resources. Extensive experiments verify the superiority of our RUAS framework against recently proposed state-of-the-art methods.
摘要：弱光图像增强在弱视领域中起着非常重要的作用。最近的工作已经建立了各种各样的深度学习模型来解决这一任务。但是，这些方法主要依赖于重要的体系结构工程，并且具有很高的计算负担。在本文中，我们提出了一种新的方法，该方法名为Retinex启发式的架构搜索展开（RUAS），可为现实世界中的弱光图像构建轻巧而有效的增强网络。具体而言，RUAS基于Retinex规则，首先建立模型以表征弱光图像的固有曝光不足结构，并展开其优化过程以构建整体传播结构。然后，通过设计一种无参考协作学习策略以从紧凑的搜索空间中发现低光先验体系结构，RUAS能够获得性能最高的图像增强网络，该网络速度快且需要很少的计算资源。大量的实验证明了我们的RUAS框架相对于最近提出的最新方法的优越性。

27. Exploiting Diverse Characteristics and Adversarial Ambivalence for Domain Adaptive Segmentation
Bowen Cai, Huan Fu, Rongfei Jia, Binqiang Zhao, Hua Li, Yinghui Xu
Abstract: Adapting semantic segmentation models to new domains is an important but challenging problem. Enlightening progress has been made recently, but the performance of existing methods is unsatisfactory on real datasets where the new target domain comprises heterogeneous sub-domains (e.g., diverse weather characteristics). We point out that carefully reasoning about the multiple modalities in the target domain can improve the robustness of adaptation models. To this end, we propose a condition-guided adaptation framework that is empowered by a special attentive progressive adversarial training (APAT) mechanism and a novel self-training policy. The APAT strategy progressively performs condition-specific alignment and attentive global feature matching. The new self-training scheme effectively exploits the adversarial ambivalences of easy and hard adaptation regions and the correlations among target sub-domains. We evaluate our method (DCAA) on various adaptation scenarios where the target images vary in weather conditions. Comparisons against baselines and state-of-the-art approaches demonstrate the superiority of DCAA over its competitors.
摘要：将语义分割模型应用于新领域是一个重要但具有挑战性的问题。最近已经取得了令人鼓舞的进展，但是现有方法的性能在真实数据集上并不令人满意，在真实数据集上，新的目标域包括异构子域（例如，不同的天气特征）。我们指出，对目标域中的多种模式进行仔细的推理可以提高适应模型的鲁棒性。为此，我们提出了一种条件指导的适应框架，该框架由特殊的专心渐进式对抗训练（APAT）机制和新颖的自我训练策略提供支持。 APAT策略逐步执行特定于条件的对齐和专心的全局特征匹配。新的自我训练方案有效地利用了容易适应和困难适应区域的对抗性矛盾以及目标子域之间的相关性。我们在目标图像在天气条件不同的各种适应情况下评估我们的方法（DCAA）。与基准的比较和最先进的方法证明了DCAA优于竞争对手。

28. Amodal Segmentation Based on Visible Region Segmentation and Shape Prior
Yuting Xiao, Yanyu Xu, Ziming Zhong, Weixin Luo, Jiawei Li, Shenghua Gao
Abstract: Almost all existing amodal segmentation methods make the inferences of occluded regions by using features corresponding to the whole image.
摘要：几乎所有现有的非模态分割方法都使用与整个图像相对应的特征来进行遮挡区域的推断。

29. An Asynchronous Kalman Filter for Hybrid Event Cameras
Ziwei Wang, Yonhon Ng, Cedric Scheerlinck, Robert Mahony
Abstract: We present an Asynchronous Kalman Filter (AKF) to reconstruct High Dynamic Range (HDR) videos by fusing low-dynamic range images with event data. Event cameras are ideally suited to capture HDR visual information without blur but perform poorly on static or slowly changing scenes. Conversely, conventional image sensors measure absolute intensity of slowly changing scenes effectively but do poorly on quickly changing scenes with high dynamic range. The proposed approach exploits advantages of hybrid sensors under a unifying uncertainty model for both conventional frames and events. We present a novel dataset targeting challenging HDR and fast motion scenes captured on two separate sensors: an RGB frame-based camera and an event camera. Our video reconstruction outperforms the state-of-the-art algorithms on existing datasets and our targeted HDR dataset.
摘要：我们提出了一种异步卡尔曼滤波器（AKF），通过将低动态范围图像与事件数据融合来重建高动态范围（HDR）视频。事件摄像机非常适合捕获HDR视觉信息而不会模糊，但在静态或缓慢变化的场景上表现不佳。相反，常规的图像传感器有效地测量缓慢变化的场景的绝对强度，但是在具有高动态范围的快速变化的场景上效果不佳。提出的方法在传统框架和事件的统一不确定性模型下利用了混合传感器的优势。我们提出了一个新颖的数据集，其目标是在两个单独的传感器（基于RGB帧的相机和事件相机）上捕获的具有挑战性的HDR和快速运动场景。我们的视频重建性能优于现有数据集和目标HDR数据集上的最新算法。

30. Full Matching on Low Resolution for Disparity Estimation
Hong Zhang, Shenglun Chen, Zhihui Wang, Haojie Li, Wanli Ouyang
Abstract: A Multistage Full Matching disparity estimation scheme (MFM) is proposed in this work. We demonstrate that decoupling all similarity scores directly from the low-resolution 4D volume step by step, instead of estimating a low-resolution 3D cost volume by iteratively optimizing the low-resolution 4D volume, leads to more accurate disparity. To this end, we first propose to decompose the full matching task into multiple stages of the cost aggregation module. Specifically, we decompose the high-resolution predicted results into multiple groups, and each stage of the newly designed cost aggregation module learns to estimate the results for only one group of points. This alleviates the internal competition among features that arises when the similarity scores of all candidates are learned from a single low-resolution 4D volume output by one stage. Then, we propose the Stages Mutual Aid strategy, which exploits the relationship between stages to boost the similarity score estimation of each stage, to address the unbalanced predictions across stages caused by the serial multistage framework. Experimental results demonstrate that the proposed method achieves more accurate disparity estimation results and outperforms state-of-the-art methods on the Scene Flow, KITTI 2012 and KITTI 2015 datasets.
摘要：提出了一种多阶段完全匹配视差估计方案（MFM）。我们证明，将所有相似性分数直接直接从低分辨率4D体积中解耦，而不是通过专注于迭代优化低分辨率4D体积来估计低分辨率3D成本量，从而导致更准确的差异。为此，我们首先建议将完全匹配任务分解为成本汇总模块的多个阶段。具体来说，我们将高分辨率的预测结果分解为多个组，新设计的成本汇总模块的每个阶段都仅学习估计一组点的结果。当从一个阶段输出的一个低分辨率4D体积中学习所有候选者的相似度得分时，这缓解了特征内部竞争的问题。然后，我们提出了\ emph {阶段互助}策略，该策略利用多阶段的关系来提高每个阶段的相似性得分估计，以解决由串行多阶段框架引起的多阶段不平衡预测。实验结果表明，该方法在场景流，KITTI 2012和KITTI 2015数据集上获得了更准确的视差估计结果，并且优于最新方法。

31. Image Matching with Scale Adjustment
Yves Dufournaud, Cordelia Schmid, Radu Horaud
Abstract: In this paper we address the problem of matching two images with two different resolutions: a high-resolution image and a low-resolution one. The difference in resolution between the two images is not known and without loss of generality one of the images is assumed to be the high-resolution one. On the premise that changes in resolution act as a smoothing equivalent to changes in scale, a scale-space representation of the high-resolution image is produced. Hence the one-to-one classical image matching paradigm becomes one-to-many because the low-resolution image is compared with all the scale-space representations of the high-resolution one. Key to the success of such a process is the proper representation of the features to be matched in scale-space. We show how to represent and extract interest points at variable scales and we devise a method allowing the comparison of two images at two different resolutions. The method comprises the use of photometric- and rotation-invariant descriptors, a geometric model mapping the high-resolution image onto a low-resolution image region, and an image matching strategy based on local constraints and on the robust estimation of this geometric model. Extensive experiments show that our matching method can be used for scale changes up to a factor of 6.
摘要：在本文中，我们解决了匹配两个具有两种不同分辨率的图像的问题：高分辨率图像和低分辨率图像。不知道两个图像之间的分辨率差异，并且在不失一般性的前提下，假定其中一个图像是高分辨率图像。在分辨率的变化等同于比例变化的平滑化的前提下，生成高分辨率图像的比例空间表示。因此，一对一的经典图像匹配范例变为一对多，因为将低分辨率图像与高分辨率图像的所有比例空间表示进行了比较。这种过程成功的关键是在比例空间中正确表示要匹配的特征。我们展示了如何以可变比例表示和提取兴趣点，并设计了一种方法，可以比较两种不同分辨率的图像。该方法包括使用光度和旋转不变描述符，将高分辨率图像映射到低分辨率图像区域的几何模型，以及基于局部约束和对该几何模型的鲁棒估计的图像匹配策略。大量实验表明，我们的匹配方法可用于比例尺变化高达6的情况。

32. Direct Depth Learning Network for Stereo Matching
Hong Zhang, Haojie Li, Shenglun Chen, Tiantian Yan, Zhihui Wang, Guo Lu, Wanli Ouyang
Abstract: As a crucial task in autonomous driving, stereo matching has made great progress in recent years. Existing stereo matching methods estimate disparity instead of depth. They treat the disparity error as the evaluation metric for the depth estimation error, since the depth can be calculated from the disparity according to the triangulation principle. However, we find that the depth error depends not only on the disparity error but also on the depth range of the points. Therefore, even if the disparity error is low, the depth error can still be large, especially for distant points. In this paper, a novel Direct Depth Learning Network (DDL-Net) is designed for stereo matching. DDL-Net consists of two stages, the Coarse Depth Estimation stage and the Adaptive-Grained Depth Refinement stage, both supervised by depth instead of disparity. Specifically, the Coarse Depth Estimation stage uniformly samples the matching candidates according to depth range to construct the cost volume and output a coarse depth. The Adaptive-Grained Depth Refinement stage performs further matching near the coarse depth to correct imprecise and wrong matches. To make the Adaptive-Grained Depth Refinement stage robust to the coarse depth and adaptive to the depth range of the points, Granularity Uncertainty is introduced into this stage. Granularity Uncertainty adjusts the matching range and selects the candidates' features according to the coarse prediction confidence and depth range. We verify the performance of DDL-Net on the SceneFlow and DrivingStereo datasets using different depth metrics. Results show that DDL-Net achieves an average improvement of 25% on the SceneFlow dataset and 12% on the DrivingStereo dataset compared with classical methods. More importantly, we achieve state-of-the-art accuracy at large distances.
摘要：立体匹配是自动驾驶的一项重要任务，近年来取得了长足的进步。现有的立体匹配方法估计差异而不是深度。他们将视差误差视为深度估计误差的评估指标，因为可以根据三角测量原理从视差中计算深度。然而，我们发现深度的误差不仅取决于视差的误差，而且还取决于点的深度范围。因此，即使视差误差低，深度误差仍然很大，尤其是对于远点。在本文中，设计了一种新颖的直接深度学习网络（DDL-Net）用于立体声匹配。 DDL-Net包含两个阶段：粗略深度估计阶段和自适应粒度深度细化阶段，它们均由深度而不是视差监督。具体来说，粗略深度估计阶段根据深度范围对匹配的候选对象进行均匀采样，以构建成本量并输出粗略深度。自适应深度精化阶段在粗略深度附近执行进一步的匹配，以纠正不精确的匹配和错误的匹配。为了使自适应粒度深度细化阶段对粗略深度具有鲁棒性并适应点的深度范围，将粒度不确定性引入了自适应粒度深度细化阶段。粒度不确定性会根据粗略的预测置信度和深度范围来调整匹配范围并选择候选者的特征。我们通过不同的深度指标来验证DDL-Net在SceneFlow数据集和DrivingStereo数据集上的性能。结果表明，与传统方法相比，DDL-Net在SceneFlow数据集上平均提高了25％，在DrivingStereo数据集上提高了$ 12 \％$。更重要的是，我们可以在很长的距离内达到最先进的精度。
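
The claim that a fixed disparity error produces a larger depth error for distant points follows from the triangulation relation Z = f * B / d for a rectified stereo pair: to first order, the depth error is about Z^2 / (f * B) times the disparity error, so it grows with the square of the depth. The short sketch below makes this concrete. The function names and the camera values (focal length 1000 px, baseline 0.5 m) are assumptions chosen for illustration, not values from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulation for a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, disparity_error_px, focal_px, baseline_m):
    """Depth error when the true disparity is under-estimated by a fixed amount."""
    true_disparity = focal_px * baseline_m / depth_m
    wrong_disparity = true_disparity - disparity_error_px
    if wrong_disparity <= 0:
        raise ValueError("disparity error is larger than the true disparity")
    return depth_from_disparity(wrong_disparity, focal_px, baseline_m) - depth_m


# Same 1-pixel disparity error at two depths (assumed camera: f = 1000 px, B = 0.5 m).
near_error = depth_error(10.0, 1.0, focal_px=1000.0, baseline_m=0.5)
far_error = depth_error(50.0, 1.0, focal_px=1000.0, baseline_m=0.5)
print(f"error at 10 m: {near_error:.3f} m, error at 50 m: {far_error:.3f} m")
```

With these assumed values, the same one-pixel mistake costs roughly 0.2 m at 10 m but more than 5 m at 50 m, which is why a disparity metric alone can hide large depth errors on distant points.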

33. Debiased-CAM for bias-agnostic faithful visual explanations of deep convolutional networks
Wencan Zhang, Mariella Dimiccoli, Brian Y. Lim
Abstract: Class activation maps (CAMs) explain convolutional neural network predictions by identifying salient pixels, but they become misaligned and misleading when explaining predictions on images under bias, such as images blurred accidentally or deliberately for privacy protection, or images with improper white balance. Despite model fine-tuning to improve prediction performance on these biased images, we demonstrate that CAM explanations become more deviated and unfaithful with increased image bias. We present Debiased-CAM to recover explanation faithfulness across various bias types and levels by training a multi-input, multi-task model with auxiliary tasks for CAM and bias level predictions. With CAM as a prediction task, explanations are made tunable by retraining the main model layers and made faithful by self-supervised learning from CAMs of unbiased images. The model provides representative, bias-agnostic CAM explanations about the predictions on biased images as if generated from their unbiased form. In four simulation studies with different biases and prediction tasks, Debiased-CAM improved both CAM faithfulness and task performance. We further conducted two controlled user studies to validate its truthfulness and helpfulness, respectively. Quantitative and qualitative analyses of participant responses confirmed Debiased-CAM as more truthful and helpful. Debiased-CAM thus provides a basis to generate more faithful and relevant explanations for a wide range of real-world applications with various sources of bias.
摘要：类激活图（CAM）通过识别显着像素来解释卷积神经网络的预测，但是当解释偏见图像的预测时（例如出于隐私保护目的而意外或故意模糊的图像或白平衡不正确的图像），它们会变得未对准和误导。尽管模型进行了微调，以提高对这些有偏差的图像的预测性能，但我们证明了CAM的解释随着图像偏差的增加而变得更加偏离和不忠实。我们提出了Debiased-CAM，以通过训练带有辅助任务的CAM和偏差水平预测的多输入，多任务模型来恢复各种偏差类型和水平下的解释真实性。通过将CAM作为预测任务，可以通过重新训练主模型层来使说明变得可调整，并通过从无偏图像的CAM中进行自我监督学习来使说明变得忠实。该模型提供了关于偏见的预测的代表性，与偏见无关的CAM解释，就好像是从其无偏形式生成的一样。在具有不同偏差和预测任务的四项模拟研究中，Debiased-CAM改善了CAM的忠诚度和任务性能。我们进一步进行了两项受控用户研究，以分别验证其真实性和有用性。参与者反应的定量和定性分析证实了Debiased-CAM更真实和有用。因此，Debiased-CAM提供了一个基础，可以针对各种带有各种偏见来源的实际应用生成更真实和相关的解释。
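
For readers unfamiliar with class activation maps, the sketch below shows how a basic CAM is assembled. The abstract does not say which CAM variant Debiased-CAM builds on, so this assumes the original formulation, in which the map for a class is the weighted sum of the last convolutional feature maps, using that class's classifier weights. The helper names and the clip-and-rescale step for display are illustrative assumptions.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: cam[y][x] = sum over k of w[k] * A[k][y][x]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(class_weights, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def normalise(cam):
    """Clip negatives to zero and rescale to [0, 1] for display as a heat map."""
    clipped = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in clipped)
    if peak == 0.0:
        return clipped
    return [[v / peak for v in row] for row in clipped]
```

The pixels with the largest values in the normalised map are the ones reported as salient for the chosen class.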

34. DI-Fusion: Online Implicit 3D Reconstruction with Deep Priors
Jiahui Huang, Shi-Sheng Huang, Haoxuan Song, Shi-Min Hu
Abstract: Previous online 3D dense reconstruction methods often consume massive memory while achieving unsatisfactory surface quality, mainly because they use a stagnant underlying geometry representation, such as TSDF (truncated signed distance functions) or surfels, without any knowledge of scene priors. In this paper, we present DI-Fusion (Deep Implicit Fusion), based on a novel 3D representation called Probabilistic Local Implicit Voxels (PLIVoxs), for online 3D reconstruction using a commodity RGB-D camera. Our PLIVox encodes scene priors considering both the local geometry and uncertainty, parameterized by a deep neural network. With such deep priors, we demonstrate through extensive experiments that we are able to perform online implicit 3D reconstruction with state-of-the-art mapping quality and camera trajectory estimation accuracy, while requiring much less storage than previous online 3D reconstruction approaches.
摘要：以前的在线3D密集重建方法通常要花费大量的内存存储，而表面质量却不能令人满意，这主要是由于使用了停滞的基础几何图形表示法，例如TSDF（截短的有符号距离函数）或冲浪，而没有对场景先验的任何了解。在本文中，我们基于一种称为概率局部隐式体素（PLIVoxs）的新型3D表示法，提出了使用传统RGB-D相机进行在线3D重建的DI融合（深度隐式融合）。我们的PLIVox对场景先验编码，同时考虑了局部几何结构和由深度神经网络参数化的不确定性。借助如此丰富的先验知识，我们通过广泛的实验证明，我们能够执行在线隐式3D重建，从而实现最新的映射质量和相机轨迹估计精度，同时与以前的在线3D重建方法相比，占用的存储空间要少得多。

35. Image Captioning with Context-Aware Auxiliary Guidance
Zeliang Song, Xiaofei Zhou, Zhendong Mao, Jianlong Tan
Abstract: Image captioning is a challenging computer vision task that aims to generate a natural language description of an image. Most recent research follows the encoder-decoder framework, which depends heavily on the previously generated words for the current prediction. Such methods cannot effectively take advantage of future predicted information to learn complete semantics. In this paper, we propose a Context-Aware Auxiliary Guidance (CAAG) mechanism that can guide the captioning model to perceive global contexts. On top of the captioning model, CAAG performs semantic attention that selectively concentrates on useful information from the global predictions to reproduce the current generation. To validate the adaptability of the method, we apply CAAG to three popular captioners, and our proposal achieves competitive performance on the challenging Microsoft COCO image captioning benchmark, e.g. a 132.2 CIDEr-D score on the Karpathy split and a 130.7 CIDEr-D (c40) score on the official online evaluation server.
摘要：图像字幕是一项具有挑战性的计算机视觉任务，旨在生成图像的自然语言描述。最近的研究遵循编码器-解码器框架，该框架在很大程度上取决于先前为当前预测生成的字。这样的方法不能有效地利用将来的预测信息来学习完整的语义。在本文中，我们提出了上下文感知辅助指导（CAAG）机制，该机制可以指导字幕模型感知全局上下文。在字幕模型上，CAAG执行语义注意力，该语义注意力选择性地集中在全局预测的有用信息上以重现当前的一代。为了验证该方法的适应性，我们将CAAG应用于三个流行的字幕制作工具，并且我们的提案在具有挑战性的Microsoft COCO图像字幕制作基准（例如，喀尔巴阡山脉裂隙上的CIDEr-D得分为132.2，官方在线评估服务器上的CIDEr-D（c40）得分为130.7。

36. Topology-Adaptive Mesh Deformation for Surface Evolution, Morphing, and Multi-View Reconstruction
Andrei Zaharescu, Edmond Boyer, Radu Horaud
Abstract: Triangulated meshes have become ubiquitous discrete-surface representations. In this paper we address the problem of how to maintain the manifold properties of a surface while it undergoes strong deformations that may cause topological changes. We introduce a new self-intersection removal algorithm, TransforMesh, and we propose a mesh evolution framework based on this algorithm. Numerous shape modelling applications use surface evolution in order to improve shape properties, such as appearance or accuracy. Both explicit and implicit representations can be considered for that purpose. However, explicit mesh representations, while allowing for accurate surface modelling, suffer from the inherent difficulty of reliably dealing with self-intersections and topological changes such as merges and splits. As a consequence, a majority of methods rely on implicit representations of surfaces, e.g. level-sets, that naturally overcome these issues. Nevertheless, these methods are based on volumetric discretizations, which introduce an unwanted precision-complexity trade-off. The method that we propose handles topological changes in a robust manner and removes self intersections, thus overcoming the traditional limitations of mesh-based approaches. To illustrate the effectiveness of TransforMesh, we describe two challenging applications, namely surface morphing and 3-D reconstruction.
摘要：三角网格已成为无处不在的离散表面表示。在本文中，我们解决了如何在表面经受可能引起拓扑变化的强烈变形的同时保持其流形特性的问题。我们引入了一种新的自交集去除算法TransforMesh，并基于该算法提出了一种网格演化框架。许多形状建模应用程序使用表面演变来改善形状属性，例如外观或准确性。为此，可以考虑使用显式表示和隐式表示。但是，尽管可以进行精确的曲面建模，但显式的网格表示形式却存在固有的困难，即难以可靠地处理自相交和拓扑更改（例如合并和拆分）。结果，大多数方法依赖于表面的隐式表示，例如表面的隐含表示。水平集，自然可以克服这些问题。然而，这些方法是基于体积离散化的，这会带来不必要的精度-复杂度的折衷。我们提出的方法以健壮的方式处理拓扑变化并消除了自相交，从而克服了基于网格的方法的传统局限性。为了说明TransforMesh的有效性，我们描述了两个具有挑战性的应用程序，即表面变形和3-D重建。

37. SSD-GAN: Measuring the Realness in the Spatial and Spectral Domains
Yuanqi Chen, Ge Li, Cece Jin, Shan Liu, Thomas Li
Abstract: This paper observes that the discriminator of a standard GAN misses high frequencies, and we reveal that this stems from the downsampling layers employed in the network architecture. This issue leaves the generator without an incentive from the discriminator to learn the high-frequency content of data, resulting in a significant spectrum discrepancy between generated images and real images. Since the Fourier transform is a bijective mapping, we argue that reducing this spectrum discrepancy would boost the performance of GANs. To this end, we introduce SSD-GAN, an enhancement of GANs that alleviates the spectral information loss in the discriminator. Specifically, we propose to embed a frequency-aware classifier into the discriminator to measure the realness of the input in both the spatial and spectral domains. With the enhanced discriminator, the generator of SSD-GAN is encouraged to learn the high-frequency content of real data and generate exact details. The proposed method is general and can be easily integrated into most existing GAN frameworks without excessive cost. The effectiveness of SSD-GAN is validated on various network architectures, objective functions, and datasets. Code will be available at this https URL.
摘要：本文观察到标准GAN的鉴别器中存在一个高频问题，我们发现它源于网络体系结构中使用的下采样层。这个问题使生成器缺乏鉴别器的动力来学习数据的高频内容，从而导致生成的图像和真实图像之间存在明显的频谱差异。由于傅立叶变换是双射映射，因此我们认为减少这种频谱差异将提高GAN的性能。为此，我们引入了SSD-GAN，这是GAN的增强功能，可减轻鉴别器中的频谱信息丢失。具体而言，我们建议将频率感知的分类器嵌入到鉴别器中，以在空间和频谱域中测量输入的真实性。通过增强的鉴别器，鼓励SSD-GAN的生成器学习真实数据的高频内容并生成确切的细节。所提出的方法是通用的，并且可以容易地集成到大多数现有的GAN框架中，而无需花费过多成本。 SSD-GAN的有效性已在各种网络体系结构，目标函数和数据集上得到验证。代码将在此https URL上可用。
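
The link between downsampling and missing high frequencies can be seen in a one-dimensional toy case. The sketch below is not the paper's analysis: it assumes a stride-2 average pooling layer (one common form of downsampling) and compares what survives for a signal at the highest representable frequency and for a slowly varying one. The names `avg_pool2` and `energy` are illustrative.

```python
def avg_pool2(signal):
    """Stride-2 average pooling of a 1-D signal of even length."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


# Alternates every sample: the highest frequency the grid can represent.
high = [1.0, -1.0] * 4
# One slow oscillation over the same eight samples.
low = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]

print("high-frequency energy after pooling:", energy(avg_pool2(high)))
print("low-frequency energy after pooling:", energy(avg_pool2(low)))
```

The pooled high-frequency signal is identically zero, so any later layer sees no trace of it, while the slow oscillation passes through with its shape intact. A discriminator built from such layers has little basis for penalising a generator that omits fine detail, which is the gap the frequency-aware classifier is meant to close.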

38. Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes
Jiashun Wang, Huazhe Xu, Jingwei Xu, Sifei Liu, Xiaolong Wang
Abstract: Synthesizing 3D human motion plays an important role in many graphics applications as well as in understanding human activity. While many efforts have been made to generate realistic and natural human motion, most approaches neglect the importance of modeling human-scene interactions and affordance. On the other hand, affordance reasoning (e.g., standing on the floor or sitting on a chair) has mainly been studied with static human poses and gestures, and it has rarely been addressed with human motion. In this paper, we propose to bridge human motion synthesis and scene affordance reasoning. We present a hierarchical generative framework to synthesize long-term 3D human motion conditioned on the 3D scene structure. Building on this framework, we further enforce multiple geometry constraints between the human mesh and scene point clouds via optimization to improve the realism of the synthesis. Our experiments show significant improvements over previous approaches in generating natural and physically plausible human motion in a scene.
摘要：合成3D人体运动在许多图形应用程序中以及理解人类活动方面都起着重要作用。尽管在生成逼真的自然人体运动方面已进行了许多努力，但大多数方法都忽略了对人与人之间的互动和承受能力进行建模的重要性。另一方面，负担能力推理（例如，站在地板上或坐在椅子上）主要是用静态的人类姿势和手势来研究的，而很少有人用动作来解决。在本文中，我们建议在人体运动合成和场景感知推理之间架起桥梁。我们提出了一个层次生成框架，以在3D场景结构上合成长期3D人类运动条件。在此框架的基础上，我们通过优化进一步增强了人类网格和场景点云之间的多个几何约束，以改善逼真的合成。我们的实验表明，与以前的方法相比，该方法在场景中产生自然和物理上合理的人类运动方面有显着改进。

39. Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation
Daizong Liu, Shuangjie Xu, Xiao-Yang Liu, Zichuan Xu, Wei Wei, Pan Zhou
Abstract: This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting. Although previous detection-based methods achieve relatively good performance, these approaches extract the best proposal with a greedy strategy, which may lose local patch details outside the chosen candidate. In this paper, we propose a novel spatiotemporal graph neural network (STG-Net) that reconstructs more accurate masks for video object segmentation and captures local contexts by utilizing all proposals. In the spatial graph, we treat the object proposals of a frame as nodes and represent their correlations with an edge weight strategy for mask context aggregation. To capture temporal information from previous frames, we use a memory network to refine the mask of the current frame by retrieving historic masks in a temporal graph. The joint use of local patch details and temporal relationships allows us to better address challenges such as object occlusion and missing objects. Without online learning and fine-tuning, our STG-Net achieves state-of-the-art performance on four large benchmarks (DAVIS, YouTube-VOS, SegTrack-v2, and YouTube-Objects), demonstrating the effectiveness of the proposed approach.
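Summary: This paper addresses the task of segmenting class-agnostic objects in the semi-supervised setting. Although previous detection-based methods achieve relatively good performance, they extract the best proposal with a greedy strategy, which may lose local patch details outside the chosen candidate. We propose a novel spatiotemporal graph neural network (STG-Net) that reconstructs more accurate masks for video object segmentation and captures local context by using all proposals. In the spatial graph, the object proposals of a frame are treated as nodes, and their correlations are represented with an edge weight strategy for mask context aggregation. To capture temporal information from previous frames, a memory network refines the mask of the current frame by retrieving historic masks in a temporal graph. Using local patch details and temporal relationships together lets us better address challenges such as object occlusion and missing objects. Without online learning and fine-tuning, STG-Net achieves state-of-the-art performance on four large benchmarks (DAVIS, YouTube-VOS, SegTrack-v2, and YouTube-Objects), demonstrating the effectiveness of the approach.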

40. Auto-MVCNN: Neural Architecture Search for Multi-view 3D Shape Recognition
Zhaoqun Li, Hongren Wang, Jinxing Li
Abstract: In 3D shape recognition, multi-view based methods leverage the human perspective to analyze 3D shapes and have achieved significant results. Most existing deep learning research adopts handcrafted networks as backbones because of their high feature-extraction capacity, and also benefits from ImageNet pretraining. However, whether these network architectures are suitable for 3D analysis remains unclear. In this paper, we propose a neural architecture search method named Auto-MVCNN, which is designed specifically for optimizing architectures in multi-view 3D shape recognition. Auto-MVCNN extends gradient-based frameworks to process multi-view images by automatically searching the fusion cell to explore the intrinsic correlation among view features. Moreover, we develop an end-to-end scheme to enhance retrieval performance through a trade-off parameter search. Extensive experimental results show that the searched architectures significantly outperform manually designed counterparts in various aspects, and that our method achieves state-of-the-art performance at the same time.
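Summary: In 3D shape recognition, multi-view methods leverage the human perspective to analyze 3D shapes and have achieved significant results. Most existing deep learning work adopts handcrafted networks as backbones because of their high feature-extraction capacity, and also benefits from ImageNet pretraining. However, whether these network architectures are suitable for 3D analysis remains unclear. This paper proposes a neural architecture search method named Auto-MVCNN, designed specifically for optimizing architectures in multi-view 3D shape recognition. Auto-MVCNN extends gradient-based frameworks to process multi-view images by automatically searching the fusion cell to explore the intrinsic correlation among view features. We also develop an end-to-end scheme that enhances retrieval performance through a trade-off parameter search. Extensive experiments show that the searched architectures significantly outperform manually designed counterparts in various respects, and that the method achieves state-of-the-art performance.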

41. One for More: Selecting Generalizable Samples for Generalizable ReID Model
Enwei Zhang, Xinyang Jiang, Hao Cheng, Ancong Wu, Fufu Yu, Ke Li, Xiaowei Guo, Feng Zheng, Wei-Shi Zheng, Xing Sun
Abstract: Current training objectives of existing person Re-IDentification (ReID) models only ensure that the loss of the model decreases on the selected training batch, with no regard to the performance on samples outside the batch. This inevitably causes the model to over-fit the data in the dominant position (e.g., head data in imbalanced classes, easy samples, or noisy samples). We call a sample that updates the model towards generalizing on more data a generalizable sample. The latest resampling methods address the issue by designing specific criteria to select samples that train the model to generalize better on certain types of data (e.g., hard samples, tail data), which is not adaptive to the inconsistent real-world ReID data distributions. Therefore, instead of simply presuming which samples are generalizable, this paper proposes a one-for-more training objective that directly takes the generalization ability of selected samples as a loss function and learns a sampler to automatically select generalizable samples. More importantly, our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework, which is able to simultaneously train ReID models and the sampler in an end-to-end fashion. The experimental results show that our method can effectively improve ReID model training and boost the performance of ReID models.
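Summary: The current training objectives of existing person re-identification (ReID) models only ensure that the loss decreases on the selected training batch, regardless of performance on samples outside the batch. This inevitably causes the model to overfit the data in the dominant position (e.g., head data in imbalanced classes, easy samples, or noisy samples). We call a sample that updates the model toward generalizing on more data a generalizable sample. The latest resampling methods address the issue by designing specific criteria to select samples that train the model to generalize better on certain types of data (e.g., hard samples, tail data), which does not adapt to the inconsistent data distributions of real-world ReID. Therefore, instead of simply presuming which samples are generalizable, this paper proposes a one-for-more training objective that directly takes the generalization ability of the selected samples as a loss function and learns a sampler to select generalizable samples automatically. More importantly, the proposed one-for-more sampler can be seamlessly integrated into the ReID training framework, which trains the ReID model and the sampler simultaneously in an end-to-end fashion. Experimental results show that the method effectively improves ReID model training and boosts the performance of ReID models.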

42. Tensor Composition Net for Visual Relationship Prediction
Yuting Qiang, Yongxin Yang, Yanwen Guo, Timothy M. Hospedales
Abstract: We present a novel Tensor Composition Network (TCN) to predict visual relationships in images. Visual relationships in subject-predicate-object form provide a more powerful query modality than simple image tags. However, Visual Relationship Prediction (VRP) also provides a more challenging test of image understanding than conventional image tagging, and is difficult to learn due to a large label space and incomplete annotation. The key idea of our TCN is to exploit the low-rank property of the visual relationship tensor, so as to leverage correlations within and across objects and relationships, and to make a structured prediction of all objects and their relations in an image. To show the effectiveness of our method, we first empirically compare our model with multi-label classification alternatives on VRP, and show that our model outperforms state-of-the-art MLIC methods. We then show that, thanks to our tensor (de)composition layer, our model can predict visual relationships that have not been seen in the training dataset. Finally, we show that our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
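Summary: We present a novel Tensor Composition Network (TCN) for predicting visual relationships in images. Visual relationships in subject-predicate-object form provide a more powerful query modality than simple image tags. However, Visual Relationship Prediction (VRP) is also a more challenging test of image understanding than conventional image tagging, and it is difficult to learn because of the large label space and incomplete annotation. The key idea of TCN is to exploit the low-rank property of the visual relationship tensor, so as to leverage correlations within and across objects and relationships and to make a structured prediction of all objects and their relations in an image. To show the effectiveness of the method, we first compare the model empirically with multi-label classification alternatives on VRP and show that it outperforms state-of-the-art MLIC methods. We then show that, thanks to the tensor (de)composition layer, the model can predict visual relationships not seen in the training dataset. Finally, we show that TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.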
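The low-rank idea can be sketched in plain Python. This assumes a CP-style factorization, in which the score of a (subject, predicate, object) triple is a sum over rank components of three factor entries. The abstract only states that the relationship tensor is low rank, so the factorization form and the names `relation_score` and `parameter_counts` are illustrative assumptions, not TCN's actual layer.

```python
def relation_score(subj_factors, pred_factors, obj_factors, s, p, o):
    """Score of triple (s, p, o) under an assumed rank-R CP factorization."""
    rank = len(subj_factors[s])
    return sum(
        subj_factors[s][r] * pred_factors[p][r] * obj_factors[o][r]
        for r in range(rank)
    )


def parameter_counts(n_subj, n_pred, n_obj, rank):
    """Entries in the full tensor versus in the three factor matrices."""
    full = n_subj * n_pred * n_obj
    low_rank = rank * (n_subj + n_pred + n_obj)
    return full, low_rank


# Toy factors with rank 2: two subjects, two predicates, one object.
subj = [[1.0, 0.0], [0.0, 1.0]]
pred = [[1.0, 1.0], [2.0, 0.0]]
obj = [[1.0, 0.5]]

score_a = relation_score(subj, pred, obj, s=0, p=1, o=0)
score_b = relation_score(subj, pred, obj, s=1, p=0, o=0)
counts = parameter_counts(100, 70, 100, rank=20)
print(score_a, score_b, counts)
```

Because every triple shares the same factor rows, a triple never observed in training still receives a score, which is consistent with the abstract's claim about predicting unseen relationships.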

43. Investigating Bias in Image Classification using Model Explanations
Schrasing Tong, Lalana Kagal
Abstract: We evaluated whether model explanations could efficiently detect bias in image classification by highlighting discriminating features, thereby removing the reliance on sensitive attributes for fairness calculations. To this end, we formulated important characteristics for bias detection and observed how explanations change as the degree of bias in models changes. The paper identifies strengths and best practices for detecting bias using explanations, as well as three main weaknesses: explanations poorly estimate the degree of bias, could potentially introduce additional bias into the analysis, and are sometimes inefficient in terms of the human effort involved.
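Summary: We evaluated whether model explanations can efficiently detect bias in image classification by highlighting discriminating features, thereby removing the reliance on sensitive attributes for fairness calculations. To this end, we formulated important characteristics for bias detection and observed how explanations change as the degree of bias in a model changes. The paper identifies strengths and best practices for detecting bias with explanations, as well as three main weaknesses: explanations estimate the degree of bias poorly, may introduce additional bias into the analysis, and are sometimes inefficient in terms of the human effort involved.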

44. Few-shot Medical Image Segmentation using a Global Correlation Network with Discriminative Embedding
Liyan Sun, Chenxin Li, Xinghao Ding, Yue Huang, Guisheng Wang, Yizhou Yu
Abstract: Although deep convolutional neural networks have achieved impressive progress in medical image computing and analysis, their supervised learning paradigm demands a large number of annotations for training in order to avoid overfitting and achieve promising results. In clinical practice, massive semantic annotations are difficult to acquire in conditions where specialized biomedical expert knowledge is required, and it is also common that only a few annotated classes are available. In this work, we propose a novel method for few-shot medical image segmentation, which enables a segmentation model to generalize quickly to an unseen class with few training images. We construct our few-shot image segmentor using a deep convolutional network trained episodically. Motivated by the spatial consistency and regularity in medical images, we develop an efficient global correlation module to capture the correlation between a support and a query image and incorporate it into a deep network called the global correlation network. Moreover, we enhance the discriminability of the deep embedding to encourage clustering of the feature domains of the same class while keeping the feature domains of different organs far apart. An ablation study demonstrates the effectiveness of the proposed global correlation module and discriminative embedding loss. Extensive experiments on anatomical abdomen images in both CT and MRI modalities demonstrate the state-of-the-art performance of our proposed model.
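Summary: Although deep convolutional neural networks have made impressive progress in medical image computing and analysis, their supervised learning paradigm demands a large number of annotations for training in order to avoid overfitting and achieve promising results. In clinical practice, massive semantic annotations are difficult to acquire where specialized biomedical expertise is required, and it is also common for only a few annotated classes to be available. This work proposes a novel method for few-shot medical image segmentation that enables a segmentation model to generalize quickly to an unseen class from few training images. We construct the few-shot image segmentor using a deep convolutional network trained episodically. Motivated by the spatial consistency and regularity of medical images, we develop an efficient global correlation module that captures the correlation between a support image and a query image, and incorporate it into a deep network called the global correlation network. We further enhance the discriminability of the deep embedding to encourage clustering of the feature domains of the same class while keeping the feature domains of different organs far apart. An ablation study demonstrates the effectiveness of the proposed global correlation module and discriminative embedding loss. Extensive experiments on anatomical abdomen images in both CT and MRI modalities demonstrate the state-of-the-art performance of the proposed model.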

45. Developing Motion Code Embedding for Action Recognition in Videos
Maxat Alibayev, David Paulius, Yu Sun
Abstract: In this work, we propose a motion embedding strategy known as motion codes, which is a vectorized representation of motions based on a manipulation's salient mechanical attributes. These motion codes provide a robust motion representation, and they are obtained using a hierarchy of features called the motion taxonomy. We developed and trained a deep neural network model that combines visual and semantic features to identify the features found in our motion taxonomy to embed or annotate videos with motion codes. To demonstrate the potential of motion codes as features for machine learning tasks, we integrated the extracted features from the motion embedding model into the current state-of-the-art action recognition model. The obtained model achieved higher accuracy than the baseline model for the verb classification task on egocentric videos from the EPIC-KITCHENS dataset.
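Summary: This work proposes a motion embedding strategy known as motion codes, a vectorized representation of motions based on a manipulation's salient mechanical attributes. Motion codes provide a robust motion representation and are obtained using a hierarchy of features called the motion taxonomy. We developed and trained a deep neural network model that combines visual and semantic features to identify the features found in the motion taxonomy, so that videos can be embedded or annotated with motion codes. To demonstrate the potential of motion codes as features for machine learning tasks, we integrated the features extracted by the motion embedding model into the current state-of-the-art action recognition model. The resulting model achieved higher accuracy than the baseline on the verb classification task on egocentric videos from the EPIC-KITCHENS dataset.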

46. Learning Optimization-inspired Image Propagation with Control Mechanisms and Architecture Augmentations for Low-level Vision
Risheng Liu, Zhu Liu, Pan Mu, Zhouchen Lin, Xin Fan, Zhongxuan Luo
Abstract: In recent years, building deep learning models from optimization perspectives has become a promising direction for solving low-level vision problems. The main idea of most existing approaches is to straightforwardly combine numerical iterations with manually designed network architectures to generate image propagations for specific kinds of optimization models. However, these heuristic learning models often lack mechanisms to control the propagation and rely heavily on architecture engineering. To mitigate the above issues, this paper proposes a unified optimization-inspired deep image propagation framework to aggregate Generative, Discriminative and Corrective (GDC for short) principles for a variety of low-level vision tasks. Specifically, we first formulate low-level vision tasks using a generic optimization objective and construct our fundamental propagative modules from three different viewpoints, i.e., the solution could be obtained/learned 1) in a generative manner, 2) based on a discriminative metric, and 3) with domain knowledge correction. By designing control mechanisms to guide image propagations, we then obtain convergence guarantees of GDC for both fully- and partially-defined optimization formulations. Furthermore, we introduce two architecture augmentation strategies (i.e., normalization and automatic search) to respectively enhance the propagation stability and task/data-adaption ability. Extensive experiments on different low-level vision applications demonstrate the effectiveness and flexibility of GDC.
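Summary: In recent years, building deep learning models from an optimization perspective has become a promising direction for solving low-level vision problems. The main idea of most existing approaches is to directly combine numerical iterations with manually designed network architectures to generate image propagations for specific kinds of optimization models. However, these heuristic learning models often lack mechanisms to control the propagation and rely heavily on architecture engineering. To mitigate these issues, this paper proposes a unified optimization-inspired deep image propagation framework that aggregates Generative, Discriminative, and Corrective (GDC) principles for a variety of low-level vision tasks. Specifically, we first formulate low-level vision tasks with a generic optimization objective and construct the fundamental propagative modules from three viewpoints: the solution can be obtained or learned 1) in a generative manner, 2) based on a discriminative metric, and 3) with domain knowledge correction. By designing control mechanisms to guide image propagation, we then obtain convergence guarantees for GDC under both fully and partially defined optimization formulations. We also introduce two architecture augmentation strategies, normalization and automatic search, to enhance propagation stability and task/data adaptation ability, respectively. Extensive experiments on different low-level vision applications demonstrate the effectiveness and flexibility of GDC.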

47. A Free Lunch for Unsupervised Domain Adaptive Object Detection without Source Data
Xianfeng Li, Weijie Chen, Di Xie, Shicai Yang, Peng Yuan, Shiliang Pu, Yueting Zhuang
Abstract: Unsupervised domain adaptation (UDA) assumes that source and target domain data are freely available and usually trained together to reduce the domain gap. However, considering data privacy and the inefficiency of data transmission, this is impractical in real scenarios. This motivates us to optimize the network in the target domain without accessing labeled source data. To explore this direction in object detection, for the first time, we propose a source data-free domain adaptive object detection (SFOD) framework by modeling it as a problem of learning with noisy labels. Generally, a straightforward method is to leverage the pre-trained network from the source domain to generate pseudo labels for target domain optimization. However, it is difficult to evaluate the quality of pseudo labels since no labels are available in the target domain. In this paper, self-entropy descent (SED) is a metric proposed to search for an appropriate confidence threshold for reliable pseudo label generation without using any handcrafted labels. Nonetheless, completely clean labels are still unattainable. After a thorough experimental analysis, false negatives are found to dominate in the generated noisy labels. False negative mining is therefore helpful for performance improvement, and we simplify it to false negative simulation through data augmentation such as Mosaic. Extensive experiments conducted on four representative adaptation tasks demonstrate that the proposed framework can easily achieve state-of-the-art performance. From another view, it also reminds the UDA community that the labeled source data are not fully exploited in existing methods.
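Summary: Unsupervised domain adaptation (UDA) assumes that source and target domain data are freely available and are usually trained together to reduce the domain gap. Given data privacy concerns and the inefficiency of data transmission, however, this is impractical in real scenarios. This motivates optimizing the network in the target domain without access to labeled source data. To explore this direction in object detection, we propose, for the first time, a source data-free domain adaptive object detection (SFOD) framework by modeling the task as a problem of learning with noisy labels. A straightforward method is to use the network pre-trained on the source domain to generate pseudo labels for target domain optimization. However, the quality of pseudo labels is difficult to evaluate because no labels are available in the target domain. In this paper, self-entropy descent (SED) is proposed as a metric for searching for an appropriate confidence threshold that yields reliable pseudo labels without any handcrafted labels. Even so, completely clean labels remain unattainable. A thorough experimental analysis finds that false negatives dominate the generated noisy labels. False negative mining therefore helps performance, and we simplify it to false negative simulation through data augmentation such as Mosaic. Extensive experiments on four representative adaptation tasks demonstrate that the proposed framework can easily achieve state-of-the-art performance. From another view, it also reminds the UDA community that labeled source data are not fully exploited by existing methods.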
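Two building blocks mentioned in this entry, confidence-thresholded pseudo labels and the self-entropy of predictions, can be sketched in plain Python. The abstract does not say how SED combines them to search for the threshold, so this sketch deliberately stops short of that step. The detection dictionary layout and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Self-entropy (in nats) of a single confidence score p in [0, 1]."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean_self_entropy(scores):
    """Average self-entropy over a list of confidence scores."""
    if not scores:
        return 0.0
    return sum(binary_entropy(s) for s in scores) / len(scores)


def filter_pseudo_labels(detections, threshold):
    """Keep detections whose confidence reaches the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detector outputs on one unlabeled target-domain image.
detections = [
    {"label": "car", "score": 0.9},
    {"label": "car", "score": 0.4},
    {"label": "person", "score": 0.7},
]
kept = filter_pseudo_labels(detections, threshold=0.6)
entropy = mean_self_entropy([d["score"] for d in detections])
print(len(kept), entropy)
```

A lower mean self-entropy indicates more confident predictions, which is the quantity a threshold search of this kind would monitor.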

48. MO-LTR: Multiple Object Localization, Tracking, and Reconstruction from Monocular RGB Videos
Kejie Li, Hamid Rezatofighi, Ian Reid
Abstract: Semantic-aware reconstruction is more advantageous than geometry-only reconstruction for future robotic and AR/VR applications because it represents not only where things are, but also what things are. Object-centric mapping is the task of building an object-level reconstruction where objects are separate and meaningful entities that convey both geometric and semantic information. In this paper, we present MO-LTR, a solution to object-centric mapping using only monocular image sequences and camera poses. It is able to localize, track, and reconstruct multiple objects in an online fashion when an RGB camera captures a video of the surroundings. Given a new RGB frame, MO-LTR first applies a monocular 3D detector to localize objects of interest and extract their shape codes, which represent the object shape in a learned embedding space. Detections are then merged into existing objects in the map after data association. The motion state (i.e., kinematics and the motion status) of each object is tracked by a multiple-model Bayesian filter, and object shape is progressively refined by fusing multiple shape codes. We evaluate localization, tracking, and reconstruction on benchmarking datasets for indoor and outdoor scenes, and show superior performance over previous approaches.
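Summary: Semantic-aware reconstruction is more advantageous than geometry-only reconstruction for future robotic and AR/VR applications because it represents not only where things are but also what they are. Object-centric mapping is the task of building an object-level reconstruction in which objects are separate, meaningful entities that convey both geometric and semantic information. This paper presents MO-LTR, a solution to object-centric mapping that uses only monocular image sequences and camera poses. It localizes, tracks, and reconstructs multiple objects online as an RGB camera captures a video of the surroundings. Given a new RGB frame, MO-LTR first applies a monocular 3D detector to localize objects of interest and extract their shape codes, which represent object shape in a learned embedding space. After data association, detections are merged into existing objects in the map. The motion state (i.e., kinematics and motion status) of each object is tracked by a multiple-model Bayesian filter, and object shape is progressively refined by fusing multiple shape codes. We evaluate localization, tracking, and reconstruction on benchmark datasets for indoor and outdoor scenes and show superior performance over previous approaches.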
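The progressive refinement of a shape by fusing shape codes can be sketched as follows. The abstract does not state the fusion rule, so this sketch assumes a plain running mean over the codes associated with one tracked object; the class `ShapeCodeTrack` and its fields are hypothetical and are not MO-LTR's API.

```python
class ShapeCodeTrack:
    """Keeps a fused shape code for one tracked object (assumed running mean)."""

    def __init__(self, dim):
        self.code = [0.0] * dim
        self.count = 0

    def fuse(self, new_code):
        """Fold one newly detected shape code into the running mean."""
        if len(new_code) != len(self.code):
            raise ValueError("shape code dimension mismatch")
        self.count += 1
        # Incremental mean update: mean += (x - mean) / n
        self.code = [
            old + (new - old) / self.count
            for old, new in zip(self.code, new_code)
        ]
        return self.code


track = ShapeCodeTrack(dim=2)
track.fuse([1.0, 2.0])          # code from the first detection
fused = track.fuse([3.0, 4.0])  # code from a later frame
print(fused)
```

The incremental form avoids storing every past code, which suits the online setting described in the entry.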

49. Automatic Diagnosis of Malaria from Thin Blood Smear Images using Deep Convolutional Neural Network with Multi-Resolution Feature Fusion
Tanvir Mahmud, Shaikh Anowarul Fattah
Abstract: Malaria, a life-threatening disease, infects millions of people every year throughout the world, demanding faster diagnosis for proper treatment before any damage occurs. In this paper, an end-to-end deep learning-based approach is proposed for faster diagnosis of malaria from thin blood smear images by efficiently optimizing features extracted from diversified receptive fields. First, an efficient, highly scalable deep neural network named DilationNet is proposed, which incorporates features from a large spectrum by varying the dilation rates of convolutions to extract features from different receptive areas. Next, the raw images are resampled to various resolutions to introduce variations in the receptive fields, and these are used for independently optimizing different forms of DilationNet scaled for different image resolutions. Afterward, a feature fusion scheme is introduced with the proposed DeepFusionNet architecture for jointly optimizing the feature space of these individually trained networks operating on different levels of observation. All the convolutional layers of the various forms of DilationNet that are optimized to extract spatial features from different image resolutions are directly transferred to provide a variegated feature space from any image. Later, joint optimization of these spatial features is carried out in DeepFusionNet to extract the most relevant representation of the sample image. This scheme offers the opportunity to explore the feature space extensively by varying the observation level to accurately diagnose the abnormality. Extensive experiments on a publicly available dataset show outstanding performance, with accuracy over 99.5%, outperforming other state-of-the-art approaches.
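Summary: Malaria is a life-threatening disease that infects millions of people worldwide every year, which demands faster diagnosis so that proper treatment can begin before any damage occurs. This paper proposes an end-to-end deep learning approach for faster diagnosis of malaria from thin blood smear images by efficiently optimizing features extracted from diversified receptive fields. First, an efficient, highly scalable deep neural network named DilationNet is proposed; it incorporates features from a large spectrum by varying the dilation rates of its convolutions to extract features from different receptive areas. Next, the raw images are resampled to various resolutions to introduce variation in the receptive fields, and these are used to independently optimize different forms of DilationNet scaled for different image resolutions. A feature fusion scheme is then introduced with the proposed DeepFusionNet architecture to jointly optimize the feature space of these individually trained networks, which operate on different levels of observation. All convolutional layers of the various DilationNets, optimized to extract spatial features from different image resolutions, are transferred directly to provide a variegated feature space for any image. These spatial features are then jointly optimized in DeepFusionNet to extract the most relevant representation of the sample image. The scheme makes it possible to explore the feature space extensively by varying the observation level in order to diagnose the abnormality accurately. Extensive experiments on a publicly available dataset show outstanding performance, with accuracy over 99.5%, outperforming other state-of-the-art approaches.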
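How varying the dilation rate changes the receptive field can be made concrete with the standard formula for dilated convolutions: a kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions. The sketch below assumes stride-1 layers; the layer configuration is illustrative and is not taken from DilationNet.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input span covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) tuples, first layer first.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf


# Three 3x3 layers with dilation rates 1, 2, 4 (an illustrative stack).
stack = [(3, 1), (3, 2), (3, 4)]
spans = [effective_kernel_size(k, d) for k, d in stack]
total = receptive_field(stack)
print(spans, total)
```

The same three 3x3 layers without dilation would cover only 7 input positions, so dilation enlarges the receptive area without adding parameters.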

50. Vulnerability Analysis of Face Morphing Attacks from Landmarks and Generative Adversarial Networks
Eklavya Sarkar, Pavel Korshunov, Laurent Colbois, Sébastien Marcel
Abstract: Morphing attacks are a threat to biometric systems in which the biometric reference in an identity document can be altered. This form of attack presents an important issue in applications relying on identity documents, such as border security or access control. Research in face morphing attack detection is developing rapidly; however, very few datasets with several forms of attacks are publicly available. This paper bridges this gap by providing a new dataset with four different types of morphing attacks, based on OpenCV, FaceMorpher, WebMorph and a generative adversarial network (StyleGAN), generated with original face images from three public face datasets. We also conduct extensive experiments to assess the vulnerability of state-of-the-art face recognition systems, notably FaceNet, VGG-Face, and ArcFace. The experiments demonstrate that VGG-Face, while being a less accurate face recognition system than FaceNet, is also less vulnerable to morphing attacks. We also observed that naïve morphs generated with a StyleGAN do not pose a significant threat.
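Summary: Morphing attacks are a threat to biometric systems in which the biometric reference in an identity document can be altered. This form of attack is an important issue in applications that rely on identity documents, such as border security or access control. Research on face morphing attack detection is developing rapidly, yet very few datasets covering several forms of attack are publicly available. This paper bridges the gap by providing a new dataset with four types of morphing attacks, based on OpenCV, FaceMorpher, WebMorph, and a generative adversarial network (StyleGAN), generated from original face images in three public face datasets. We also conduct extensive experiments to assess the vulnerability of state-of-the-art face recognition systems, notably FaceNet, VGG-Face, and ArcFace. The experiments demonstrate that VGG-Face, although a less accurate face recognition system than FaceNet, is also less vulnerable to morphing attacks. We also observed that naïve morphs generated with StyleGAN do not pose a significant threat.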

51. 3D attention mechanism for fine-grained classification of table tennis strokes using a Twin Spatio-Temporal Convolutional Neural Networks
Pierre-Etienne Martin, Jenny Benois-Pineau, Renaud Péteri, Julien Morlier
Abstract: The paper addresses the problem of recognizing actions in video with low inter-class variability, such as table tennis strokes. Two-stream "twin" convolutional neural networks are used with 3D convolutions on both RGB data and optical flow. Actions are recognized by classification of temporal windows. We introduce 3D attention modules and examine their impact on classification efficiency. In the context of the study of athletes' performance, a corpus of the particular actions of table tennis strokes is considered. The use of attention blocks in the network speeds up the training step and improves the classification scores by up to 5% with our twin model. We visualize the impact on the obtained features and notice a correlation between attention and player movements and position. A score comparison of a state-of-the-art action classification method and the proposed approach with attention blocks is performed on the corpus. The proposed model with attention blocks outperforms both the previous model without them and our baseline.
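Summary: The paper addresses the recognition of actions in video with low inter-class variability, such as table tennis strokes. Two-stream "twin" convolutional neural networks with 3D convolutions are applied to both RGB data and optical flow. Actions are recognized by classifying temporal windows. We introduce 3D attention modules and examine their impact on classification efficiency. In the context of studying athletes' performance, a corpus of table tennis stroke actions is considered. Using attention blocks in the network speeds up training and improves the classification score by up to 5% with the twin model. We visualize the impact on the learned features and observe a correlation between attention and player movement and position. Scores of a state-of-the-art action classification method and of the proposed approach with attention blocks are compared on the corpus. The proposed model with attention blocks outperforms both the previous model without them and the baseline.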

52. GAN Steerability without optimization
Nurit Spingarn-Eliezer, Ron Banner, Tomer Michaeli
Abstract: Recent research has shown remarkable success in revealing "steering" directions in the latent spaces of pre-trained GANs. These directions correspond to semantically meaningful image transformations (e.g., shift, zoom, color manipulations), and have similar interpretable effects across all categories that the GAN can generate. Some methods focus on user-specified transformations, while others discover transformations in an unsupervised manner. However, all existing techniques rely on an optimization procedure to expose those directions, and offer no control over the degree of allowed interaction between different transformations. In this paper, we show that "steering" trajectories can be computed in closed form directly from the generator's weights without any form of training or optimization. This applies to user-prescribed geometric transformations, as well as to unsupervised discovery of more complex effects. Our approach allows determining both linear and nonlinear trajectories, and has many advantages over previous methods. In particular, we can control whether one transformation is allowed to come at the expense of another (e.g., zoom-in with or without allowing translation to keep the object centered). Moreover, we can determine the natural end-point of the trajectory, which corresponds to the largest extent to which a transformation can be applied without incurring degradation. Finally, we show how transferring attributes between images can be achieved without optimization, even across different categories.
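Summary: Recent research has shown remarkable success in revealing "steering" directions in the latent spaces of pre-trained GANs. These directions correspond to semantically meaningful image transformations (e.g., shift, zoom, color manipulations) and have similar interpretable effects across all categories that the GAN can generate. Some methods focus on user-specified transformations, while others discover transformations in an unsupervised manner. However, all existing techniques rely on an optimization procedure to expose those directions and offer no control over the degree of interaction allowed between different transformations. This paper shows that "steering" trajectories can be computed in closed form directly from the generator's weights, without any training or optimization. This applies to user-prescribed geometric transformations as well as to the unsupervised discovery of more complex effects. The approach can determine both linear and nonlinear trajectories and has many advantages over previous methods. In particular, we can control whether one transformation is allowed to come at the expense of another (e.g., zooming in with or without allowing translation to keep the object centered). We can also determine the natural end-point of a trajectory, which corresponds to the largest extent to which a transformation can be applied without incurring degradation. Finally, we show how attributes can be transferred between images without optimization, even across different categories.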

53. Convolutional Neural Networks for Multispectral Image Cloud Masking
Gonzalo Mateo-García, Luis Gómez-Chova, Gustau Camps-Valls
Abstract: Convolutional neural networks (CNNs) have proven to be state-of-the-art methods for many image classification tasks, and their use is rapidly increasing in remote sensing problems. One of their major strengths is that, when enough data is available, CNNs perform end-to-end learning without the need for custom feature extraction methods. In this work, we study the use of different CNN architectures for cloud masking of Proba-V multispectral images. We compare such methods with the more classical machine learning approach based on feature extraction plus supervised classification. Experimental results suggest that CNNs are a promising alternative for solving cloud masking problems.
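Summary: Convolutional neural networks (CNNs) have proven to be state-of-the-art methods for many image classification tasks, and their use in remote sensing problems is increasing rapidly. One of their major strengths is that, when enough data is available, CNNs learn end to end without the need for custom feature extraction methods. This work studies different CNN architectures for cloud masking of Proba-V multispectral images. We compare these methods with the more classical machine learning approach based on feature extraction plus supervised classification. Experimental results suggest that CNNs are a promising alternative for solving cloud masking problems.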

54. Multi-Model Learning for Real-Time Automotive Semantic Foggy Scene Understanding via Domain Adaptation
Naif Alshammari, Samet Akcay, Toby P. Breckon
Abstract: Robust semantic scene segmentation for automotive applications is a challenging problem in two key aspects: (1) labelling every individual scene pixel and (2) performing this task under unstable weather and illumination changes (e.g., foggy weather), which results in poor outdoor scene visibility. Such visibility limitations lead to non-optimal performance of generalised deep convolutional neural network-based semantic scene segmentation. In this paper, we propose an efficient end-to-end automotive semantic scene understanding approach that is robust to foggy weather conditions. As an end-to-end pipeline, our proposed approach provides: (1) the transformation of imagery from foggy to clear weather conditions using a domain transfer approach (correcting for poor visibility) and (2) semantically segmenting the scene using a competitive encoder-decoder architecture with low computational complexity (enabling real-time performance). Our approach incorporates RGB colour, depth and luminance images via distinct encoders with dense connectivity and features fusion to effectively exploit information from different inputs, which contributes to an optimal feature representation within the overall model. Using this architectural formulation with dense skip connections, our model achieves comparable performance to contemporary approaches at a fraction of the overall model complexity.
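Summary: Robust semantic scene segmentation for automotive applications is challenging in two key respects: (1) labelling every individual scene pixel and (2) performing this task under unstable weather and illumination changes (e.g., foggy weather), which result in poor outdoor scene visibility. Such visibility limitations lead to non-optimal performance of generalised semantic scene segmentation based on deep convolutional neural networks. This paper proposes an efficient end-to-end automotive semantic scene understanding approach that is robust to foggy weather conditions. As an end-to-end pipeline, the approach provides (1) transformation of imagery from foggy to clear weather conditions using a domain transfer approach (correcting for poor visibility) and (2) semantic segmentation of the scene using a competitive encoder-decoder architecture with low computational complexity (enabling real-time performance). The approach incorporates RGB colour, depth, and luminance images via distinct encoders with dense connectivity and feature fusion to exploit information from the different inputs effectively, which contributes to an optimal feature representation within the overall model. With this architectural formulation and dense skip connections, the model achieves performance comparable to contemporary approaches at a fraction of the overall model complexity.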

55. Competitive Simplicity for Multi-Task Learning for Real-Time Foggy Scene Understanding via Domain Adaptation
Naif Alshammari, Samet Akcay, Toby P. Breckon
Abstract: Automotive scene understanding under adverse weather conditions raises a realistic and challenging problem attributable to poor outdoor scene visibility (e.g. foggy weather). However, because most contemporary scene understanding approaches are applied under ideal-weather conditions, such approaches may not provide genuinely optimal performance when compared to established a priori insights on extreme-weather understanding. In this paper, we propose a complex but competitive multi-task learning approach capable of performing in real-time semantic scene understanding and monocular depth estimation under foggy weather conditions by leveraging both recent advances in adversarial training and domain adaptation. As an end-to-end pipeline, our model provides a novel solution to surpass degraded visibility in foggy weather conditions by transferring scenes from foggy to normal using a GAN-based model. For optimal performance in semantic segmentation, our model generates depth to be used as complementary source information with RGB in the segmentation network. We provide a robust method for foggy scene understanding by training two models (normal and foggy) simultaneously with shared weights (each model is trained on each weather condition independently). Our model incorporates RGB colour, depth, and luminance images via distinct encoders with dense connectivity and features fusing, and leverages skip connections to produce consistent depth and segmentation predictions. Using this architectural formulation with light computational complexity at inference time, we are able to achieve comparable performance to contemporary approaches at a fraction of the overall model complexity.
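Summary: Automotive scene understanding under adverse weather conditions is a realistic and challenging problem because of poor outdoor scene visibility (e.g., foggy weather). Because most contemporary scene understanding approaches are applied under ideal weather conditions, they may not deliver genuinely optimal performance when compared with established a priori insights on extreme-weather understanding. This paper proposes a complex but competitive multi-task learning approach capable of real-time semantic scene understanding and monocular depth estimation under foggy weather conditions, leveraging recent advances in both adversarial training and domain adaptation. As an end-to-end pipeline, the model offers a novel way to overcome degraded visibility in fog by transferring scenes from foggy to normal with a GAN-based model. For optimal semantic segmentation performance, the model generates depth to be used as complementary source information alongside RGB in the segmentation network. We provide a robust method for foggy scene understanding by training two models (normal and foggy) simultaneously with shared weights (each model is trained on its own weather condition independently). The model incorporates RGB colour, depth, and luminance images via distinct encoders with dense connectivity and feature fusion, and leverages skip connections to produce consistent depth and segmentation predictions. With this architectural formulation, which has light computational complexity at inference time, we achieve performance comparable to contemporary approaches at a fraction of the overall model complexity.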

56. ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation
Siyuan Qiao, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
Abstract: In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring the point clouds from perspective image sequences while providing each point with instance-level semantic interpretations. Solving this problem requires the vision models to predict the spatial location, semantic class, and temporally consistent instance label for each 3D point. ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation. We name this joint task as Depth-aware Video Panoptic Segmentation, and propose a new evaluation metric along with two derived datasets for it, which will be made available to the public. On the individual sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, ranking 1st on the KITTI monocular depth estimation benchmark, and 1st on KITTI MOTS pedestrian. The datasets and the evaluation codes are made publicly available.
摘要：在本文中，我们提出了ViP-DeepLab，这是一个统一的模型，旨在解决视觉中长期存在的挑战性逆投影问题，我们将其建模为从透视图像序列还原点云，同时为每个点提供实例级语义解释。解决此问题需要视觉模型预测每个3D点的空间位置，语义类别和时间上一致的实例标签。 ViP-DeepLab通过联合执行单眼深度估计和视频全景分割来实现这一目标。我们将此联合任务命名为“深度感知视频全景分割”，并提出了一个新的评估指标以及针对该指标的两个派生数据集，并将向公众公开。在单独的子任务上，ViP-DeepLab还获得了最先进的结果，在Cityscapes-VPS上比以前的方法高出5.1％VPQ，在KITTI单眼深度估计基准上排名第一，在KITTI MOTS行人上排名第一。数据集和评估代码是公开可用的。

57. Flexible Few-Shot Learning with Contextual Similarity
Mengye Ren, Eleni Triantafillou, Kuan-Chieh Wang, James Lucas, Jake Snell, Xaq Pitkow, Andreas S. Tolias, Richard Zemel
Abstract: Existing approaches to few-shot learning deal with tasks that have persistent, rigid notions of classes. Typically, the learner observes data only from a fixed number of classes at training time and is asked to generalize to a new set of classes at test time. Two examples from the same class would always be assigned the same labels in any episode. In this work, we consider a realistic setting where the similarities between examples can change from episode to episode depending on the task context, which is not given to the learner. We define new benchmark datasets for this flexible few-shot scenario, where the tasks are based on images of faces (Celeb-A), shoes (Zappos50K), and general objects (ImageNet-with-Attributes). While classification baselines and episodic approaches learn representations that work well for standard few-shot learning, they suffer in our flexible tasks as novel similarity definitions arise during testing. We propose to build upon recent contrastive unsupervised learning techniques and use a combination of instance and class invariance learning, aiming to obtain general and flexible features. We find that our approach performs strongly on our new flexible few-shot learning benchmarks, demonstrating that unsupervised learning obtains more generalizable representations.
摘要：现有的少拍学习方法处理的任务具有持久的，僵化的类概念。通常，学习者只能在训练时观察固定数量的课程中的数据，并在测试时被要求归纳为一组新的课程。在任何情节中，来自同一类别的两个示例将始终被分配相同的标签。在这项工作中，我们考虑了一个现实的环境，其中示例之间的相似性可能会根据任务上下文在情节之间发生变化，而这不会提供给学习者。我们为这种灵活的少量场景定义了新的基准数据集，其中的任务是基于面部（Celeb-A），鞋子（Zappos50K）和常规对象（ImageNet-with-Attributes）的图像。尽管分类基准和情节方法学习的表示法很适合标准的一次性学习，但由于在测试过程中出现了新的相似性定义，因此它们在我们的灵活任务中受到了影响。我们建议以最新的对比无监督学习技术为基础，并结合实例和类不变性学习，以期获得通用且灵活的功能。我们发现，我们的方法在新的灵活的几次尝试学习基准上表现出色，表明无监督学习获得了更通用的表示形式。

58. Learning Tubule-Sensitive CNNs for Pulmonary Airway and Artery-Vein Segmentation in CT
Yulei Qin, Hao Zheng, Yun Gu, Xiaolin Huang, Jie Yang, Lihui Wang, Feng Yao, Yue-Min Zhu, Guang-Zhong Yang
Abstract: Training convolutional neural networks (CNNs) for segmentation of pulmonary airway, artery, and vein is challenging due to sparse supervisory signals caused by the severe class imbalance between tubular targets and background. We present a CNN-based method for accurate airway and artery-vein segmentation in non-contrast computed tomography. It enjoys superior sensitivity to tenuous peripheral bronchioles, arterioles, and venules. The method first uses a feature recalibration module to make the best use of features learned from the neural networks. Spatial information of features is properly integrated to retain relative priority of activated regions, which benefits the subsequent channel-wise recalibration. Then, an attention distillation module is introduced to reinforce representation learning of tubular objects. Fine-grained details in high-resolution attention maps are passed down from one layer to its previous layer recursively to enrich context. An anatomy prior consisting of a lung context map and a distance transform map is designed and incorporated for better artery-vein differentiation capacity. Extensive experiments demonstrated considerable performance gains brought by these components. Compared with state-of-the-art methods, our method extracted many more branches while maintaining competitive overall segmentation performance. Codes and models will be available later at this http URL.
摘要：由于管状目标和背景之间严重的类别不平衡，导致稀疏的监督信号，训练卷积神经网络（CNN）来分割肺气道，动脉和静脉非常具有挑战性。我们提出了一种基于CNNs的非对比计算机断层扫描中准确的气道和动脉静脉分割方法。它对周围的细支气管，小动脉和小静脉具有极高的敏感性。该方法首先使用特征重新校准模块，以充分利用从神经网络中学到的特征。要素的空间信息已正确集成，以保留激活区域的相对优先级，这有利于后续的通道方式重新校准。然后，引入注意蒸馏模块以加强对管状对象的表示学习。高分辨率注意力图中的细粒度细节递归地从一层传递到其上一层，以丰富上下文。肺上下文图和距离变换图的先验解剖图被设计并合并以具有更好的动脉-静脉分化能力。大量的实验表明，这些组件带来了可观的性能提升。与最先进的方法相比，我们的方法提取了更多的分支，同时保持了竞争优势的整体细分效果。代码和模型将稍后在此http URL上提供。

59. 3D Bounding Box Detection in Volumetric Medical Image Data: A Systematic Literature Review
Daria Kern, Andre Mastmeyer
Abstract: This paper discusses current methods and trends for 3D bounding box detection in volumetric medical image data. For this purpose, an overview of relevant papers from recent years is given. 2D and 3D implementations are discussed and compared. Multiple identified approaches for localizing anatomical structures are presented. The results show that most research recently focuses on Deep Learning methods, such as Convolutional Neural Networks vs. methods with manual feature engineering, e.g. Random-Regression-Forests. An overview of bounding box detection options is presented and helps researchers to select the most promising approach for their target objects.
摘要：本文讨论了体积医学图像数据中3D边界框检测的当前方法和趋势。为此，给出了近年来相关论文的概述。讨论并比较了2D和3D实现。介绍了多种用于定位解剖结构的方法。结果表明，大多数研究最近都集中在深度学习方法上，例如卷积神经网络与具有手动特征工程的方法，例如随机回归森林。概述了边界框检测选项，可帮助研究人员为目标对象选择最有前途的方法。

60. Effect of the regularization hyperparameter on deep learning-based segmentation in LGE-MRI
Olivier Rukundo
Abstract: In this preliminary evaluation, the author demonstrates the extent to which the arbitrary selection of the L2 regularization hyperparameter can affect the outcome of deep learning-based segmentation in LGE-MRI. Also, the author adopts manual adjustment, or tuning, of the other deep learning hyperparameters, performed only when 10% of all epochs are reached before 90% validation accuracy is attained. With the arbitrary L2 regularization values used in the experiments, the results showed that a smaller L2 regularization value can lead to better segmentation of the myocardium and/or higher accuracy.
摘要：在此初步评估中，作者证明了L2正则化超参数的任意选择可在多大程度上影响LGE-MRI中基于深度学习的分割的结果。此外，作者采用其他深度学习超参数的手动调整或调整，仅在达到90％的验证准确度之前，在达到所有纪元的10％时进行。在实验中使用任意L2正则化值时，结果表明，较小的L2正则化数可导致更好的心肌分割和/或更高的准确性。
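
To make the role of the L2 regularization hyperparameter concrete, the following is a minimal sketch on a one-parameter least-squares toy problem, not the segmentation network from the paper. The function name `ridge_weight` and the data are assumptions chosen only for illustration. Minimizing sum((y - w*x)^2) + l2 * w^2 has the closed-form solution w = sum(x*y) / (sum(x*x) + l2), so a larger `l2` pulls the fitted weight toward zero.

```python
def ridge_weight(xs, ys, l2):
    """Closed-form minimizer of sum((y - w*x)**2) + l2 * w**2 for a scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)

# Hypothetical data generated by y = 2x (illustration only).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for l2 in (0.0, 1.0, 10.0):
    # The fitted weight moves away from the true slope 2.0 as l2 grows.
    print(f"l2={l2:5.1f}  w={ridge_weight(xs, ys, l2):.4f}")
```

On this toy problem the unregularized fit recovers the true slope exactly, and each increase in `l2` biases the estimate further toward zero. This is one plausible reading of why an arbitrarily large value could hurt accuracy.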

61. Performance Comparison of Balanced and Unbalanced Cancer Datasets using Pre-Trained Convolutional Neural Network
Ali Narin
Abstract: Cancer disease is one of the leading causes of death all over the world. Breast cancer, which is a common cancer disease especially in women, is quite common. The most important tool used for early detection of this cancer type, which requires a long process to establish a definitive diagnosis, is histopathological images taken by biopsy. These obtained images are examined by pathologists and a definitive diagnosis is made. It is quite common to detect this process with the help of a computer. Detection of benign or malignant tumors, especially by using data with different magnification rates, takes place in the literature. In this study, two different balanced and unbalanced study groups have been formed by using the histopathological data in the BreakHis data set. We have examined how the performances of balanced and unbalanced data sets change in detecting tumor type. In conclusion, in the study performed using the InceptionV3 convolution neural network model, 93.55% accuracy, 99.19% recall and 87.10% specificity values have been obtained for balanced data, while 89.75% accuracy, 82.89% recall and 91.51% specificity values have been obtained for unbalanced data. According to the results obtained in two different studies, the balance of the data increases the overall performance as well as the detection performance of both benign and malignant tumors. It can be said that the model trained with the help of data sets created in a balanced way will give pathology specialists higher and accurate results.
摘要：癌症是全世界主要的死亡原因之一。乳腺癌是一种常见的癌症疾病，尤其是在女性中，这是很常见的。用于早期检测这种癌症类型的最重要工具是通过活组织检查拍摄的组织病理学图像，这需要很长的过程才能确定性诊断。由病理学家检查这些获得的图像，并做出明确的诊断。在计算机的帮助下检测此过程非常普遍。文献中特别是通过使用具有不同放大率的数据来检测良性或恶性肿瘤。在这项研究中，通过使用BreakHis数据集中的组织病理学数据形成了两个不同的平衡和不平衡研究组。我们已经检查了平衡和不平衡数据集的性能如何在检测肿瘤类型方面发生变化。总之，在使用InceptionV3卷积神经网络模型进行的研究中，平衡数据获得了93.55％的准确性，99.19％的召回率和87.10％的特异性值，而获得了89.75％的准确性，82.89％的召回率和91.51％的特异性值用于不平衡数据。根据在两项不同研究中获得的结果，数据的平衡提高了良性和恶性肿瘤的整体性能以及检测性能。可以说，在以平衡方式创建的数据集的帮助下训练的模型将为病理学专家提供更高，更准确的结果。
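
The accuracy, recall, and specificity figures quoted above are all derived from a binary confusion matrix. The sketch below shows the standard definitions. The function name `binary_metrics` and the counts are hypothetical and are not taken from the BreakHis experiments.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, recall, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total      # share of all samples classified correctly
    recall = tp / (tp + fn)           # sensitivity: share of positives that were found
    specificity = tn / (tn + fp)      # share of negatives that were correctly rejected
    return accuracy, recall, specificity

# Hypothetical counts for a balanced set of 100 malignant and 100 benign samples.
acc, rec, spec = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print(f"accuracy={acc:.2f} recall={rec:.2f} specificity={spec:.2f}")
```

Because accuracy pools both classes, it is dominated by the majority class on an unbalanced set, which is why recall and specificity are usually reported alongside it.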

62. Large-Scale Generative Data-Free Distillation
Liangchen Luo, Mark Sandler, Zi Lin, Andrey Zhmoginov, Andrew Howard
Abstract: Knowledge distillation is one of the most popular and effective techniques for knowledge transfer, model compression and semi-supervised learning. Most existing distillation approaches require the access to original or augmented training samples. But this can be problematic in practice due to privacy, proprietary and availability concerns. Recent work has put forward some methods to tackle this problem, but they are either highly time-consuming or unable to scale to large datasets. To this end, we propose a new method to train a generative image model by leveraging the intrinsic normalization layers' statistics of the trained teacher network. This enables us to build an ensemble of generators without training data that can efficiently produce substitute inputs for subsequent distillation. The proposed method pushes forward the data-free distillation performance on CIFAR-10 and CIFAR-100 to 95.02% and 77.02% respectively. Furthermore, we are able to scale it to ImageNet dataset, which to the best of our knowledge, has never been done using generative models in a data-free setting.
摘要：知识蒸馏是用于知识转移，模型压缩和半监督学习的最流行和有效的技术之一。大多数现有的蒸馏方法都需要使用原始或增强的训练样本。但是，由于隐私，所有权和可用性方面的考虑，这在实践中可能会出现问题。最近的工作提出了一些解决此问题的方法，但是它们要么很耗时，要么无法扩展到大型数据集。为此，我们提出了一种利用受过训练的教师网络的固有归一化层统计信息来训练生成图像模型的新方法。这样一来，我们就无需训练数据即可构建一组发电机，而训练数据却可以有效地为后续蒸馏提供替代输入。提出的方法将CIFAR-10和CIFAR-100的无数据蒸馏性能分别提高到95.02％和77.02％。此外，我们能够将其缩放到ImageNet数据集，据我们所知，它从未在无数据的环境中使用生成模型来完成。

63. Effect of Different Batch Size Parameters on Predicting of COVID19 Cases
Ali Narin, Ziynet Pamuk
Abstract: The new coronavirus 2019, also known as COVID19, is a very serious epidemic that has killed thousands or even millions of people since December 2019. It was defined as a pandemic by the world health organization in March 2020. It is stated that this virus is usually transmitted by droplets caused by sneezing or coughing, or by touching infected surfaces. The presence of the virus is detected by real-time reverse transcriptase polymerase chain reaction (rRT-PCR) tests with the help of a swab taken from the nose or throat. In addition, X-ray and CT imaging methods are also used to support this method. Since it is known that the accuracy sensitivity in rRT-PCR test is low, auxiliary diagnostic methods have a very important place. Computer-aided diagnosis and detection systems are developed especially with the help of X-ray and CT images. Studies on the detection of COVID19 in the literature are increasing day by day. In this study, the effect of different batch size (BH=3, 10, 20, 30, 40, and 50) parameter values on their performance in detecting COVID19 and other classes was investigated using data belonging to 4 different (Viral Pneumonia, COVID19, Normal, Bacterial Pneumonia) classes. The study was carried out using a pre-trained ResNet50 convolutional neural network. According to the obtained results, they performed closely on the training and test data. However, it was observed that the steady state in the test data was delayed as the batch size value increased. The highest COVID19 detection was 95.17% for BH = 3, while the overall accuracy value was 97.97% with BH = 20. According to the findings, it can be said that the batch size value does not affect the overall performance significantly, but the increase in the batch size value delays obtaining stable results.
摘要：新的冠状病毒2019，也称为COVID19，是一种非常严重的流行病，自2019年12月以来已导致数千甚至数百万人死亡。世界卫生组织在2020年3月将其定义为大流行病。据称，该病毒通常通过打喷嚏或咳嗽产生的飞沫传播，或通过触摸受感染的表面传播。借助从鼻子或喉咙采集的拭子，通过实时逆转录聚合酶链反应（rRT-PCR）测试检测病毒的存在。此外，X射线和CT成像方法也可用于支持该方法。由于已知rRT-PCR测试的准确性灵敏度较低，因此辅助诊断方法具有非常重要的地位。特别是借助X射线和CT图像，开发了计算机辅助的诊断和检测系统。文献中有关COVID19检测的研究日益增多。在这项研究中，使用属于4个不同类别（病毒性肺炎、COVID19、正常、细菌性肺炎）的数据，研究了不同批次大小（BH = 3、10、20、30、40和50）参数值对检测COVID19和其他类别的性能的影响。该研究是使用预先训练的ResNet50卷积神经网络进行的。根据获得的结果，它们在训练和测试数据上的表现接近。但是，观察到随着批次大小值的增加，测试数据中的稳态被延迟。对于BH = 3，最高的COVID19检测率为95.17％，而对于BH = 20，总体准确度值为97.97％。根据研究结果，可以说批次大小值对整体性能没有明显影响，但是批次大小值的增加会延迟获得稳定的结果。

64. Detection of Covid-19 Patients with Convolutional Neural Network Based Features on Multi-class X-ray Chest Images
Ali Narin
Abstract: Covid-19 is a very serious deadly disease that has been announced as a pandemic by the world health organization (WHO). The whole world is working with all its might to end Covid-19 pandemic, which puts countries in serious health and economic problems, as soon as possible. The most important of these is to correctly identify those who get the Covid-19. Methods and approaches to support the reverse transcription polymerase chain reaction (RT-PCR) test have begun to take place in the literature. In this study, chest X-ray images, which can be accessed easily and quickly, were used because the covid-19 attacked the respiratory systems. Classification performances with support vector machines have been obtained by using the features extracted with residual networks (ResNet-50), one of the convolutional neural network models, from these images. While Covid-19 detection is obtained with support vector machines (SVM)-quadratic with the highest sensitivity value of 96.35% with the 5-fold cross-validation method, the highest overall performance value has been detected with both SVM-quadratic and SVM-cubic above 99%. According to these high results, it is thought that this method, which has been studied, will help radiology specialists and reduce the rate of false detection.
摘要：Covid-19是一种非常严重的致命疾病，已被世界卫生组织（WHO）宣布为大流行病。全世界都在竭尽全力结束Covid-19大流行，这使各国尽快面临严重的健康和经济问题。其中最重要的是正确识别获得Covid-19的人。支持逆转录聚合酶链反应（RT-PCR）测试的方法和方法已在文献中开始发生。在这项研究中，由于covid-19攻击了呼吸系统，因此使用了可以轻松快捷地获取的胸部X射线图像。支持向量机的分类性能已经通过使用残差网络（ResNet-50）（这些是卷积神经网络模型之一）从这些图像中提取的特征来获得。虽然使用5倍交叉验证方法通过支持向量机（SVM）-二次获得Covid-19检测，最高灵敏度值为96.35％，但同时使用SVM-二次和SVM检测到最高整体性能值立方高于99％。根据这些较高的结果，可以认为已经研究过的这种方法将有助于放射学专家并减少错误检测的率。

65. COVID-MTL: Multitask Learning with Shift3D and Random-weighted Loss for Diagnosis and Severity Assessment of COVID-19
Guoqing Bao, Xiuying Wang
Abstract: The outbreak of COVID-19 has resulted in over 67 million infections with over 1.5 million deaths worldwide so far. Both computer tomography (CT) diagnosis and nucleic acid test (NAT) have their pros and cons. Here we present a multitask-learning (MTL) framework, termed COVID-MTL, which is capable of simultaneously detecting COVID-19 against both radiology and NAT as well as assessing infection severity. We proposed an active-contour based method to refine lung segmentation results on COVID-19 CT scans and a Shift3D real-time 3D augmentation algorithm to improve the convergence and accuracy of state-of-the-art 3D CNNs. A random-weighted multitask loss function was then proposed, which made simultaneous learning of different COVID-19 tasks more stable and accurate. By only using CT data and extracting lung imaging features, COVID-MTL was trained on 930 CT scans and tested on another 399 cases, which yielded AUCs of 0.939 and 0.846, and accuracies of 90.23% and 79.20% for detection of COVID-19 against radiology and NAT, respectively, and outperformed state-of-the-art models. COVID-MTL yielded AUC of 0.800 $\pm$ 0.020 and 0.813 $\pm$ 0.021 (with transfer learning) for classifying control/suspected (AUC of 0.841), mild/regular (AUC of 0.808), and severe/critically-ill (AUC of 0.789) cases. Besides, we identified top imaging biomarkers that are significantly related (P < 0.001) to the positivity and severity of COVID-19.
摘要：到目前为止，COVID-19的爆发已导致超过6700万例感染，其中超过150万人死亡。计算机断层扫描（CT）诊断和核酸测试（NAT）都有其优缺点。在这里，我们提出了一个称为COVID-MTL的多任务学习（MTL）框架，该框架能够同时针对放射学和NAT检测COVID-19，并评估感染的严重程度。我们提出了一种基于活动轮廓的方法来优化COVID-19 CT扫描的肺部分割结果，并提出了Shift3D实时3D增强算法以提高最新3D CNN的收敛性和准确性。然后提出了一个随机加权的多任务损失函数，该函数使同时学习不同的COVID-19任务更加稳定和准确。仅使用CT数据并提取肺部影像学特征，COVID-MTL在930个CT扫描上进行了训练，并在另外399例上进行了测试，针对放射学和NAT检测COVID-19的AUC分别为0.939和0.846，准确度分别为90.23％和79.20％，优于最新模型。COVID-MTL在对对照/疑似（AUC为0.841）、轻度/普通（AUC为0.808）和重症/危重（AUC为0.789）病例进行分类时，AUC为0.800 $\pm$ 0.020和0.813 $\pm$ 0.021（使用迁移学习）。此外，我们确定了与COVID-19的阳性和严重程度显著相关（P < 0.001）的顶级成像生物标志物。

66. Deep learning methods for SAR image despeckling: trends and perspectives
Giulia Fracastoro, Enrico Magli, Giovanni Poggi, Giuseppe Scarpa, Diego Valsesia, Luisa Verdoliva
Abstract: Synthetic aperture radar (SAR) images are affected by a spatially-correlated and signal-dependent noise called speckle, which is very severe and may hinder image exploitation. Despeckling is an important task that aims at removing such noise, so as to improve the accuracy of all downstream image processing tasks. The first despeckling methods date back to the 1970s, and several model-based algorithms have been developed in the subsequent years. The field has received growing attention, sparked by the availability of powerful deep learning models that have yielded excellent performance for inverse problems in image processing. This paper surveys the literature on deep learning methods applied to SAR despeckling, covering both the supervised and the more recent self-supervised approaches. We provide a critical analysis of existing methods with the objective to recognize the most promising research lines, to identify the factors that have limited the success of deep models, and to propose ways forward in an attempt to fully exploit the potential of deep learning for SAR despeckling.
摘要：合成孔径雷达（SAR）图像受到称为斑点的空间相关且依赖信号的噪声的影响，这种噪声非常严重，可能会阻碍图像的开发。去斑点化是一项重要任务，旨在消除此类噪声，从而提高所有下游图像处理任务的准确性。最早的去斑点方法可追溯到1970年代，并且在随后的几年中开发了几种基于模型的算法。强大的深度学习模型的推出激发了该领域的关注，这些模型为图像处理中的逆问题产生了出色的性能。本文对适用于SAR去斑点的深度学习方法的文献进行了调查，涵盖了监督方法和最新的自我监督方法。我们对现有方法进行了批判性分析，目的是识别最有前途的研究路线，确定限制深度模型成功的因素，并提出前进的道路，以尝试充分利用深度学习在SAR中的潜力去除斑点。

67. Automatic Generation of Interpretable Lung Cancer Scoring Models from Chest X-Ray Images
Michael J. Horry, Subrata Chakraborty, Biswajeet Pradhan, Manoranjan Paul, Douglas P. S. Gomes, Anwaar Ul-Haq
Abstract: Lung cancer is the leading cause of cancer death and morbidity worldwide with early detection being the key to a positive patient prognosis. Although a multitude of studies have demonstrated that machine learning, and particularly deep learning, techniques are effective at automatically diagnosing lung cancer, these techniques have yet to be clinically approved and accepted/adopted by the medical community. Rather than attempting to provide an artificial 'second reading' we instead focus on the automatic creation of viable decision tree models from publicly available data using computer vision and machine learning techniques. For a small inferencing dataset, this method achieves a best accuracy over 84% with a positive predictive value of 83% for the malignant class. Furthermore, the decision trees created by this process may be considered as a starting point for refinement by medical experts into clinically usable multi-variate lung cancer scoring and diagnostic models.
摘要：肺癌是全球癌症死亡和发病的主要原因，早期发现是积极预后的关键。尽管大量研究表明，机器学习（尤其是深度学习）技术可以有效地自动诊断肺癌，但是这些技术尚未得到医学界的临床认可和接受/采用。与其尝试提供人为的“二读”，不如将注意力集中在使用计算机视觉和机器学习技术根据可公开获得的数据自动创建可行的决策树模型上。对于较小的推理数据集，此方法可达到84％以上的最佳准确性，对于恶性类的阳性预测值为83％。此外，由该过程创建的决策树可被视为医学专家完善为临床可用的多变量肺癌评分和诊断模型的起点。

68. Composite Adversarial Attacks
Xiaofeng Mao, Yuefeng Chen, Shuhui Wang, Hang Su, Yuan He, Hui Xue
Abstract: Adversarial attack is a technique for deceiving Machine Learning (ML) models, which provides a way to evaluate the adversarial robustness. In practice, attack algorithms are artificially selected and tuned by human experts to break an ML system. However, manual selection of attackers tends to be sub-optimal, leading to a mistaken assessment of model security. In this paper, a new procedure called Composite Adversarial Attack (CAA) is proposed for automatically searching the best combination of attack algorithms and their hyper-parameters from a candidate pool of 32 base attackers. We design a search space where attack policy is represented as an attacking sequence, i.e., the output of the previous attacker is used as the initialization input for successors. Multi-objective NSGA-II genetic algorithm is adopted for finding the strongest attack policy with minimum complexity. The experimental result shows CAA beats 10 top attackers on 11 diverse defenses with less elapsed time (6$\times$ faster than AutoAttack), and achieves the new state-of-the-art on $l_{\infty}$, $l_{2}$ and unrestricted adversarial attacks.
摘要：对抗攻击是一种欺骗机器学习（ML）模型的技术，它提供了一种评估对抗鲁棒性的方法。在实践中，攻击算法由人类专家人工选择和调整，以攻破ML系统。但是，手动选择攻击者往往不是最佳选择，从而导致对模型安全性的错误评估。本文提出了一种称为“复合对抗攻击”（CAA）的新过程，用于从包含32个基础攻击者的候选池中自动搜索最佳的攻击算法组合及其超参数。我们设计一个搜索空间，其中将攻击策略表示为攻击序列，即将前一个攻击者的输出用作后继者的初始化输入。采用多目标NSGA-II遗传算法寻找复杂度最小的最强攻击策略。实验结果表明，CAA在11种不同的防御上击败了10个顶级攻击者，耗时更少（比AutoAttack快6倍），并在$l_{\infty}$、$l_{2}$和不受限制的对抗攻击上达到了新的最先进水平。
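
The attacking-sequence idea (each attacker starts from the previous attacker's output) can be sketched on a toy linear score. This is only an illustration of the sequence representation, not the CAA search itself: the NSGA-II policy search is omitted, and `sign_step_attacker`, `composite_attack`, the weights, and the inputs are all hypothetical.

```python
def project(x, x_orig, eps):
    """Clip x into the L-infinity ball of radius eps around x_orig."""
    return [min(max(xi, oi - eps), oi + eps) for xi, oi in zip(x, x_orig)]

def sign_step_attacker(step):
    """Build a toy base attacker: one signed-gradient step on a linear score."""
    def attack(x_init, x_orig, w, eps):
        # The gradient of w.x with respect to x is w, so moving against sign(w) lowers the score.
        moved = [xi - step * ((wi > 0) - (wi < 0)) for xi, wi in zip(x_init, w)]
        return project(moved, x_orig, eps)
    return attack

def composite_attack(attackers, x_orig, w, eps):
    """Run attackers in order; each one is initialized with its predecessor's output."""
    x = list(x_orig)
    for attack in attackers:
        x = attack(x, x_orig, w, eps)
    return x

def score(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

# Hypothetical linear model and input (illustration only).
w = [1.0, -2.0]
x = [0.5, 0.5]
policy = [sign_step_attacker(0.05), sign_step_attacker(0.1)]
x_adv = composite_attack(policy, x, w, eps=0.1)
print(score(x, w), score(x_adv, w))
```

Every intermediate result is projected back into the same perturbation budget, so chaining attackers changes the starting point of each step without enlarging the allowed perturbation.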

69. Facial expressions can detect Parkinson's disease: preliminary evidence from videos collected online
Mohammad Rafayet Ali, Taylor Myers, Ellen Wagner, Harshil Ratnu, E. Ray Dorsey, Ehsan Hoque
Abstract: One of the symptoms of Parkinson's disease (PD) is hypomimia or reduced facial expressions. In this paper, we present a digital biomarker for PD that utilizes the study of micro-expressions. We analyzed the facial action units (AU) from 1812 videos of 604 individuals (61 with PD and 543 without PD, mean age 63.9 yo, sd 7.8 ) collected online using a web-based tool (this http URL). In these videos, participants were asked to make three facial expressions (a smiling, disgusted, and surprised face) followed by a neutral face. Using techniques from computer vision and machine learning, we objectively measured the variance of the facial muscle movements and used it to distinguish between individuals with and without PD. The prediction accuracy using the facial micro-expressions was comparable to those methodologies that utilize motor symptoms. Logistic regression analysis revealed that participants with PD had less variance in AU6 (cheek raiser), AU12 (lip corner puller), and AU4 (brow lowerer) than non-PD individuals. An automated classifier using Support Vector Machine was trained on the variances and achieved 95.6% accuracy. Using facial expressions as a biomarker for PD could be potentially transformative for patients in need of physical separation (e.g., due to COVID) or are immobile.
摘要：帕金森氏病（PD）的症状之一是低视或面部表情减少。在本文中，我们介绍了利用微表达研究的PD数字生物标志物。我们分析了使用基于Web的工具（此http URL）在线收集的来自1812个视频的604个个体（61个有PD和543个无PD，平均年龄63.9岁，标准差7.8）的面部动作单位（AU）。在这些视频中，要求参与者做出三种面部表情（微笑，反感和惊讶的表情），然后是中性的表情。使用计算机视觉和机器学习的技术，我们客观地测量了面部肌肉运动的方差，并将其用于区分有无PD的个体。使用面部微表情的预测准确性与那些利用运动症状的方法相当。 Logistic回归分析显示，PD参与者与非PD个体相比，AU6（颊部抬高者），AU12（唇角拔除器）和AU4（低眉）的差异较小。使用支持向量机对自动分类器进行了方差训练，达到了95.6％的准确度。对于需要进行物理分离（例如由于COVID）或无法移动的患者而言，使用面部表情作为PD的生物标记可能具有潜在的变革性。

70. Neural Rate Control for Video Encoding using Imitation Learning
Hongzi Mao, Chenjie Gu, Miaosen Wang, Angie Chen, Nevena Lazic, Nir Levine, Derek Pang, Rene Claus, Marisabel Hechtman, Ching-Han Chiang, Cheng Chen, Jingning Han
Abstract: In modern video encoders, rate control is a critical component and has been heavily engineered. It decides how many bits to spend to encode each frame, in order to optimize the rate-distortion trade-off over all video frames. This is a challenging constrained planning problem because of the complex dependency among decisions for different video frames and the bitrate constraint defined at the end of the episode. We formulate the rate control problem as a Partially Observable Markov Decision Process (POMDP), and apply imitation learning to learn a neural rate control policy. We demonstrate that by learning from optimal video encoding trajectories obtained through evolution strategies, our learned policy achieves better encoding efficiency and has minimal constraint violation. In addition to imitating the optimal actions, we find that additional auxiliary losses, data augmentation/refinement and inference-time policy improvements are critical for learning a good rate control policy. We evaluate the learned policy against the rate control policy in libvpx, a widely adopted open source VP9 codec library, in the two-pass variable bitrate (VBR) mode. We show that over a diverse set of real-world videos, our learned policy achieves 8.5% median bitrate reduction without sacrificing video quality.
摘要：在现代视频编码器中，速率控制是至关重要的组件，并且经过了精心设计。它决定花费多少位来编码每个帧，以优化所有视频帧的速率失真权衡。这是一个具有挑战性的受约束的计划问题，因为针对不同视频帧的决策之间存在复杂的依赖性，并且在情节结束时定义了比特率约束。我们将速率控制问题公式化为部分可观察的马尔可夫决策过程（POMDP），并应用模仿学习来学习神经速率控制策略。我们证明，通过从通过进化策略获得的最佳视频编码轨迹中学习，我们的学习策略可以实现更好的编码效率，并且约束冲突最小。除了模仿最佳行动，我们发现额外的辅助损失，数据扩充/完善和推理时间策略的改进对于学习良好的速率控制策略至关重要。在两次通过可变比特率（VBR）模式下，我们针对libvpx（一种广泛采用的开源VP9编解码器库）中的速率控制策略，对学习的策略进行了评估。我们证明，在各种现实世界的视频中，我们的学习策略在不牺牲视频质量的情况下实现了中位数比特率降低8.5％。

71. Unsupervised Adversarial Domain Adaptation For Barrett's Segmentation
Numan Celik, Soumya Gupta, Sharib Ali, Jens Rittscher
Abstract: Barrett's oesophagus (BE) is one of the early indicators of esophageal cancer. Patients with BE are monitored and undergo ablation therapies to minimise the risk, making it imperative to identify the BE area precisely. Automated segmentation can help clinical endoscopists to assess and treat the BE area more accurately. Endoscopy imaging of BE can include multiple modalities in addition to the conventional white light (WL) modality. Supervised models require a large number of manual annotations incorporating all data variability in the training data. However, generating manual annotations is cumbersome, tedious and labour-intensive, and additionally requires modality-specific expertise. In this work, we aim to alleviate this problem by applying an unsupervised domain adaptation technique (UDA). Here, UDA is trained on white light endoscopy images as the source domain and is well adapted to generalise to produce segmentation on different imaging modalities as the target domain, namely narrow band imaging and post acetic-acid WL imaging. Our dataset consists of a total of 871 images covering both source and target domains. Our results show that the UDA-based approach outperforms traditional supervised U-Net segmentation by nearly 10% on both Dice similarity coefficient and intersection-over-union.
摘要：巴雷特食管（BE）是食道癌的早期指标之一。对BE患者进行监测并进行消融治疗，以最大程度地降低风险，因此准确地确定BE区域至关重要。自动分割可以帮助临床内镜医师更准确地评估和治疗BE区域。BE的内窥镜成像除了常规的白光（WL）模式外，还可以包括多种模式。监督模型需要大量的人工注释，并将所有数据可变性纳入训练数据中。但是，生成人工注释繁琐、乏味且费力，并且还需要特定于模态的专业知识。在这项工作中，我们旨在通过应用无监督域自适应技术（UDA）来缓解此问题。在这里，UDA以白光内窥镜图像作为源域进行训练，并能很好地泛化，在作为目标域的不同成像模态（即窄带成像和乙酸后WL成像）上进行分割。我们的数据集总共包含871个图像，涵盖源域和目标域。我们的结果表明，基于UDA的方法在Dice相似度系数和交并比上都比传统的监督型U-Net分割高出近10％。
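
The two segmentation metrics named in this abstract, the Dice similarity coefficient and intersection-over-union (IoU), can be computed from a pair of binary masks. The sketch below uses flat 0/1 lists and made-up masks. The function name `dice_and_iou` and the convention for two empty masks are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks given as flat lists of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou

# Hypothetical 6-pixel masks (illustration only).
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(f"dice={dice:.4f} iou={iou:.4f}")
```

The two scores are monotonically related by Dice = 2 * IoU / (1 + IoU), so they rank a single prediction the same way even though the numeric gaps differ.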

72. Topological Planning with Transformers for Vision-and-Language Navigation
Kevin Chen, Junshen K. Chen, Jo Chuang, Marynel Vázquez, Silvio Savarese
Abstract: Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
摘要：视觉和语言导航（VLN）的常规方法经过端到端培训，但在自由穿越的环境中很难取得良好的性能。受机器人技术界的启发，我们提出了使用拓扑图的VLN模块化方法。给定自然语言指令和拓扑图，我们的方法利用注意力机制来预测地图中的导航计划。然后使用健壮的控制器以低级动作（例如，前进，旋转）执行该计划。实验表明，我们的方法优于以前的端到端方法，生成可解释的导航计划，并展现出智能行为，例如回溯。

73. MetaInfoNet: Learning Task-Guided Information for Sample Reweighting
Hongxin Wei, Lei Feng, Rundong Wang, Bo An
Abstract: Deep neural networks have been shown to easily overfit to biased training data with label noise or class imbalance. Meta-learning algorithms are commonly designed to alleviate this issue in the form of sample reweighting, by learning a meta weighting network that takes training losses as inputs to generate sample weights. In this paper, we advocate that choosing proper inputs for the meta weighting network is crucial for desired sample weights in a specific task, while training loss is not always the correct answer. In view of this, we propose a novel meta-learning algorithm, MetaInfoNet, which automatically learns effective representations as inputs for the meta weighting network by emphasizing task-related information with an information bottleneck strategy. Extensive experimental results on benchmark datasets with label noise or class imbalance validate that MetaInfoNet is superior to many state-of-the-art methods.
摘要：已经证明，深度神经网络很容易因标签噪声或班级不平衡而过度拟合有偏差的训练数据。元学习算法通常旨在通过学习元加权网络来减轻样本重加权形式的问题，该元加权网络将训练损失作为输入来生成样本权重。在本文中，我们主张为元加权网络选择适当的输入对于特定任务中所需的样本权重至关重要，而训练损失并不总是正确的答案。鉴于此，我们提出了一种新颖的元学习算法MetaInfoNet，该算法通过使用信息瓶颈策略强调与任务相关的信息，自动学习有效表示形式作为元加权网络的输入。在带有标签噪声或类别不平衡的基准数据集上的大量实验结果证明，MetaInfoNet优于许多最新方法。
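
Loss-based sample reweighting, the baseline this abstract argues against, can be sketched as a weighted average of per-sample losses. In the sketch below a fixed sigmoid stands in for the meta weighting network. The function `weighting_fn`, its slope and bias, and the loss values are all hypothetical. In a meta-learning setup those parameters would be learned, and MetaInfoNet would feed learned representations rather than raw losses.

```python
import math

def weighting_fn(loss, slope=-2.0, bias=2.0):
    """Hypothetical stand-in for a meta weighting network: loss -> weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(slope * loss + bias)))

def reweighted_loss(losses):
    """Return the weight-normalized training loss and the per-sample weights."""
    weights = [weighting_fn(loss) for loss in losses]
    total = sum(w * loss for w, loss in zip(weights, losses))
    return total / sum(weights), weights

# Hypothetical per-sample losses; the last one mimics a sample with a noisy label.
losses = [0.1, 0.2, 5.0]
weighted, weights = reweighted_loss(losses)
plain = sum(losses) / len(losses)
print(f"plain mean={plain:.3f} reweighted={weighted:.3f}")
```

With a negative slope the high-loss sample is almost ignored, which suits label noise. Under class imbalance the opposite sign may be preferable, and that ambiguity is the kind of limitation of loss-only inputs the abstract points to.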

注：中文为机器翻译结果！封面为论文标题词云图！


【arxiv论文】 Computer Vision and Pattern Recognition 2020-12-11
