
[arXiv Papers] Computer Vision and Pattern Recognition 2020-12-09

Contents

1. Semantic Image Synthesis via Efficient Class-Adaptive Normalization [PDF] Abstract
2. The Lottery Ticket Hypothesis for Object Recognition [PDF] Abstract
3. Vid2CAD: CAD Model Alignment using Multi-View Constraints from Videos [PDF] Abstract
4. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [PDF] Abstract
5. Accurate 3D Object Detection using Energy-Based Models [PDF] Abstract
6. CASTing Your Model: Learning to Localize Improves Self-Supervised Representations [PDF] Abstract
7. MERANet: Facial Micro-Expression Recognition using 3D Residual Attention Network [PDF] Abstract
8. Bayesian Image Reconstruction using Deep Generative Models [PDF] Abstract
9. CoShaRP: A Convex Program for Single-shot Tomographic Shape Sensing [PDF] Abstract
10. Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting [PDF] Abstract
11. Digital Gimbal: End-to-end Deep Image Stabilization with Learnable Exposure Times [PDF] Abstract
12. Human Motion Tracking by Registering an Articulated Surface to 3-D Points and Normals [PDF] Abstract
13. SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation [PDF] Abstract
14. Rotation-Invariant Autoencoders for Signals on Spheres [PDF] Abstract
15. Multi-Objective Interpolation Training for Robustness to Label Noise [PDF] Abstract
16. SPU-Net: Self-Supervised Point Cloud Upsampling by Coarse-to-Fine Reconstruction with Self-Projection Optimization [PDF] Abstract
17. Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence [PDF] Abstract
18. Using Feature Alignment can Improve Clean Average Precision and Adversarial Robustness in Object Detection [PDF] Abstract
19. 3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection [PDF] Abstract
20. MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [PDF] Abstract
21. StacMR: Scene-Text Aware Cross-Modal Retrieval [PDF] Abstract
22. Continual Adaptation of Visual Representations via Domain Randomization and Meta-learning [PDF] Abstract
23. Perceptual Robust Hashing for Color Images with Canonical Correlation Analysis [PDF] Abstract
24. Context-Aware Graph Convolution Network for Target Re-identification [PDF] Abstract
25. Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [PDF] Abstract
26. UnrealPerson: An Adaptive Pipeline towards Costless Person Re-identification [PDF] Abstract
27. Learning to Generate Content-Aware Dynamic Detectors [PDF] Abstract
28. Active Visual Localization in Partially Calibrated Environments [PDF] Abstract
29. Overcomplete Representations Against Adversarial Videos [PDF] Abstract
30. Data Instance Prior for Transfer Learning in GANs [PDF] Abstract
31. Variational Interaction Information Maximization for Cross-domain Disentanglement [PDF] Abstract
32. Texture Transform Attention for Realistic Image Inpainting [PDF] Abstract
33. KNN-enhanced Deep Learning Against Noisy Labels [PDF] Abstract
34. Scale Aware Adaptation for Land-Cover Classification in Remote Sensing Imagery [PDF] Abstract
35. Cost Sensitive Optimization of Deepfake Detector [PDF] Abstract
36. VAE-Info-cGAN: Generating Synthetic Images by Combining Pixel-level and Feature-level Geospatial Conditional Inputs [PDF] Abstract
37. Multi-modal Visual Tracking: Review and Experimental Comparison [PDF] Abstract
38. Weakly-Supervised Cross-Domain Adaptation for Endoscopic Lesions Segmentation [PDF] Abstract
39. Learning Independent Instance Maps for Crowd Localization [PDF] Abstract
40. Learning Portrait Style Representations [PDF] Abstract
41. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection [PDF] Abstract
42. Performance Analysis of Keypoint Detectors and Binary Descriptors Under Varying Degrees of Photometric and Geometric Transformations [PDF] Abstract
43. Parameter Efficient Multimodal Transformers for Video Representation Learning [PDF] Abstract
44. SuperFront: From Low-resolution to High-resolution Frontal Face Synthesis [PDF] Abstract
45. Deformable Gabor Feature Networks for Biomedical Image Classification [PDF] Abstract
46. A New Window Loss Function for Bone Fracture Detection and Localization in X-ray Images with Point-based Annotation [PDF] Abstract
47. Semantic and Geometric Modeling with Neural Message Passing in 3D Scene Graphs for Hierarchical Mechanical Search [PDF] Abstract
48. Rotation-Invariant Point Convolution With Multiple Equivariant Alignments [PDF] Abstract
49. Generating unseen complex scenes: are we there yet? [PDF] Abstract
50. Learning an Animatable Detailed 3D Face Model from In-The-Wild Images [PDF] Abstract
51. GenScan: A Generative Method for Populating Parametric 3D Scan Datasets [PDF] Abstract
52. Shape From Tracing: Towards Reconstructing 3D Object Geometry and SVBRDF Material from Images via Differentiable Path Tracing [PDF] Abstract
53. Globetrotter: Unsupervised Multilingual Translation from Visual Alignment [PDF] Abstract
54. Generalized iterated-sums signatures [PDF] Abstract
55. Hierarchical Residual Attention Network for Single Image Super-Resolution [PDF] Abstract
56. GMM-Based Generative Adversarial Encoder Learning [PDF] Abstract
57. Multi-temporal and multi-source remote sensing image classification by nonlinear relative normalization [PDF] Abstract
58. Active Learning Methods for Efficient Hybrid Biophysical Variable Retrieval [PDF] Abstract
59. Planet cartography with neural learned regularization [PDF] Abstract
60. Reinforcement Based Learning on Classification Task Could Yield Better Generalization and Adversarial Accuracy [PDF] Abstract
61. Two-Phase Learning for Overcoming Noisy Labels [PDF] Abstract
62. Interpretable deep learning regression for breast density estimation on MRI [PDF] Abstract
63. CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions [PDF] Abstract
64. Raw Image Deblurring [PDF] Abstract
65. A Unifying Framework for Formal Theories of Novelty: Framework, Examples and Discussion [PDF] Abstract
66. Efficient Estimation of Influence of a Training Instance [PDF] Abstract
67. A Number Sense as an Emergent Property of the Manipulating Brain [PDF] Abstract
68. CEL-Net: Continuous Exposure for Extreme Low-Light Imaging [PDF] Abstract
69. Process of image super-resolution [PDF] Abstract

Abstracts

1. Semantic Image Synthesis via Efficient Class-Adaptive Normalization [PDF] Back to contents
  Zhentao Tan, Dongdong Chen, Qi Chu, Menglei Chai, Jing Liao, Mingming He, Lu Yuan, Gang Hua, Nenghai Yu
Abstract: Spatially-adaptive normalization (SPADE) is remarkably successful recently in conditional semantic image synthesis, which modulates the normalized activation with spatially-varying transformations learned from semantic layouts, to prevent the semantic information from being washed away. Despite its impressive performance, a more thorough understanding of the advantages inside the box is still highly demanded to help reduce the significant computation and parameter overhead introduced by this novel structure. In this paper, from a return-on-investment point of view, we conduct an in-depth analysis of the effectiveness of this spatially-adaptive normalization and observe that its modulation parameters benefit more from semantic-awareness rather than spatial-adaptiveness, especially for high-resolution input masks. Inspired by this observation, we propose class-adaptive normalization (CLADE), a lightweight but equally-effective variant that is only adaptive to semantic class. In order to further improve spatial-adaptiveness, we introduce intra-class positional map encoding calculated from semantic layouts to modulate the normalization parameters of CLADE and propose a truly spatially-adaptive variant of CLADE, namely CLADE-ICPE. Benefiting from this design, CLADE greatly reduces the computation cost while being able to preserve the semantic information in the generation. Through extensive experiments on multiple challenging datasets, we demonstrate that the proposed CLADE can be generalized to different SPADE-based methods while achieving comparable generation quality compared to SPADE, but it is much more efficient with fewer extra parameters and lower computational cost. The code is available at this https URL
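To make the key idea concrete, the class-adaptive modulation described above can be sketched in a few lines of PyTorch: each semantic class owns a single (gamma, beta) pair that is looked up from the segmentation map and applied to the normalized features. The module name, the parameter-free BatchNorm backbone and the initialization are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CLADELayer(nn.Module):
    """Minimal sketch of class-adaptive normalization (CLADE).

    Each semantic class owns one (gamma, beta) pair, so the modulation is
    class-adaptive but not spatially adaptive. Norm choice and init are
    illustrative assumptions.
    """
    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # one modulation vector per semantic class
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, seg):
        # x: (N, C, H, W) features, seg: (N, H, W) integer class labels
        x = self.norm(x)
        gamma = self.gamma(seg).permute(0, 3, 1, 2)  # (N, C, H, W)
        beta = self.beta(seg).permute(0, 3, 1, 2)
        return gamma * x + beta
```

Because the modulation is a per-class table lookup rather than a convolution over the layout (as in SPADE), the extra parameter and FLOP cost is negligible, which matches the efficiency argument made in the abstract.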

2. The Lottery Ticket Hypothesis for Object Recognition [PDF] Back to contents
  Sharath Girish, Shishira R. Maiya, Kamal Gupta, Hao Chen, Larry Davis, Abhinav Shrivastava
Abstract: Recognition tasks, such as object recognition and keypoint estimation, have seen widespread adoption in recent years. Most state-of-the-art methods for these tasks use deep networks that are computationally expensive and have huge memory footprints. This makes it exceedingly difficult to deploy these systems on low power embedded devices. Hence, the importance of decreasing the storage requirements and the amount of computation in such models is paramount. The recently proposed Lottery Ticket Hypothesis (LTH) states that deep neural networks trained on large datasets contain smaller subnetworks that achieve on par performance as the dense networks. In this work, we perform the first empirical study investigating LTH for model pruning in the context of object detection, instance segmentation, and keypoint estimation. Our studies reveal that lottery tickets obtained from ImageNet pretraining do not transfer well to the downstream tasks. We provide guidance on how to find lottery tickets with up to 80% overall sparsity on different sub-tasks without incurring any drop in the performance. Finally, we analyse the behavior of trained tickets with respect to various task attributes such as object size, frequency, and difficulty of detection.
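As a reference point for readers unfamiliar with lottery-ticket experiments, a single global magnitude-pruning step, the basic operation used to extract sparse "tickets", can be sketched as follows. The sparsity schedule, layer selection and weight rewinding used in the paper are not shown; everything below is an illustrative assumption.

```python
import torch

def magnitude_prune_masks(model: torch.nn.Module, sparsity: float):
    """Sketch of one global magnitude-pruning step in lottery-ticket style
    experiments: keep the largest-magnitude weights, zero the rest.
    The global threshold and considered layers are illustrative choices.
    """
    weights = [p for n, p in model.named_parameters() if p.dim() > 1]
    all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, k).values if k > 0 else all_scores.min() - 1
    masks = [(w.detach().abs() > threshold).float() for w in weights]
    # apply the masks; iterative pruning would then rewind and retrain
    with torch.no_grad():
        for w, m in zip(weights, masks):
            w.mul_(m)
    return masks
```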

3. Vid2CAD: CAD Model Alignment using Multi-View Constraints from Videos [PDF] Back to contents
  Kevis-Kokitsi Maninis, Stefan Popov, Matthias Nießner, Vittorio Ferrari
Abstract: We address the task of aligning CAD models to a video sequence of a complex scene containing multiple objects. Our method is able to process arbitrary videos and fully automatically recover the 9 DoF pose for each object appearing in it, thus aligning them in a common 3D coordinate frame. The core idea of our method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation. This integration process resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters. By leveraging multi-view constraints, our method also resolves occlusions and handles objects that are out of view in individual frames, thus reconstructing all objects into a single globally consistent CAD representation of the scene. In comparison to the state-of-the-art single-frame method Mask2CAD that we build on, we achieve substantial improvements on Scan2CAD (from 11.6% to 30.2% class average accuracy).

4. TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [PDF] Back to contents
  Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
Abstract: In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to the conventional vision-language pre-training that fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) in pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), TAP effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline. To further improve the performance, we build a large-scale dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million scene text-related image-text pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, and +10.2 CIDEr score on TextCaps.

5. Accurate 3D Object Detection using Energy-Based Models [PDF] Back to contents
  Fredrik K. Gustafsson, Martin Danelljan, Thomas B. Schön
Abstract: Accurate 3D object detection (3DOD) is crucial for safe navigation of complex environments by autonomous robots. Regressing accurate 3D bounding boxes in cluttered environments based on sparse LiDAR data is however a highly challenging problem. We address this task by exploring recent advances in conditional energy-based models (EBMs) for probabilistic regression. While methods employing EBMs for regression have demonstrated impressive performance on 2D object detection in images, these techniques are not directly applicable to 3D bounding boxes. In this work, we therefore design a differentiable pooling operator for 3D bounding boxes, serving as the core module of our EBM network. We further integrate this general approach into the state-of-the-art 3D object detector SA-SSD. On the KITTI dataset, our proposed approach consistently outperforms the SA-SSD baseline across all 3DOD metrics, demonstrating the potential of EBM-based regression for highly accurate 3DOD. Code is available at this https URL.
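The abstract does not spell out how the conditional EBM is used at inference time; in EBM-based regression, a common recipe is to refine the predicted box by gradient ascent on the learned scalar score. A hedged sketch of such a refinement loop is given below; the energy_net interface, step size and iteration count are assumptions, and the paper's differentiable 3D pooling module is not shown.

```python
import torch

def refine_box_with_ebm(energy_net, feats, box_init, steps: int = 10, lr: float = 1e-3):
    """Gradient-based refinement of a 3D box under a conditional EBM.

    energy_net(feats, box) is assumed to return a scalar score f(x, y)
    (higher = more likely box); we ascend its gradient w.r.t. the box
    parameters. All hyperparameters here are illustrative assumptions.
    """
    box = box_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.SGD([box], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        score = energy_net(feats, box)
        (-score).backward()          # maximize the score
        optimizer.step()
    return box.detach()
```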

6. CASTing Your Model: Learning to Localize Improves Self-Supervised Representations [PDF] Back to contents
  Ramprasaath R. Selvaraju, Karan Desai, Justin Johnson, Nikhil Naik
Abstract: Recent advances in self-supervised learning (SSL) have largely closed the gap with supervised ImageNet pretraining. Despite their success these methods have been primarily applied to unlabeled ImageNet images, and show marginal gains when trained on larger sets of uncurated images. We hypothesize that current SSL methods perform best on iconic images, and struggle on complex scene images with many objects. Analyzing contrastive SSL methods shows that they have poor visual grounding and receive poor supervisory signal when trained on scene images. We propose Contrastive Attention-Supervised Tuning(CAST) to overcome these limitations. CAST uses unsupervised saliency maps to intelligently sample crops, and to provide grounding supervision via a Grad-CAM attention loss. Experiments on COCO show that CAST significantly improves the features learned by SSL methods on scene images, and further experiments show that CAST-trained models are more robust to changes in backgrounds.
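The Grad-CAM attention loss mentioned above can be pictured as follows: compute a Grad-CAM map for the contrastive objective and penalize its disagreement with the unsupervised saliency map. The sketch below uses a generic Grad-CAM formulation with an L1 agreement term; the exact loss form, layer choice and normalization used in CAST are assumptions.

```python
import torch
import torch.nn.functional as F

def gradcam_attention_loss(feature_maps, score, saliency):
    """Sketch of a Grad-CAM style attention loss.

    feature_maps: (N, C, h, w) activations of a chosen layer (in the graph
    of `score`), score: scalar objective (e.g. contrastive similarity),
    saliency: (N, 1, H, W) unsupervised saliency map in [0, 1].
    The L1 agreement penalty is an illustrative choice.
    """
    grads = torch.autograd.grad(score, feature_maps, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)           # channel importance
    cam = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=saliency.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-6)  # normalize to [0, 1]
    return F.l1_loss(cam, saliency)
```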

7. MERANet: Facial Micro-Expression Recognition using 3D Residual Attention Network [PDF] Back to contents
  Viswanatha Reddy Gajjala, Sai Prasanna Teja Reddy, Snehasis Mukherjee, Shiv Ram Dubey
Abstract: We propose a facial micro-expression recognition model using 3D residual attention network called MERANet. The proposed model takes advantage of spatial-temporal attention and channel attention together, to learn deeper fine-grained subtle features for classification of emotions. The proposed model also encompasses both spatial and temporal information simultaneously using the 3D kernels and residual connections. Moreover, the channel features and spatio-temporal features are re-calibrated using the channel and spatio-temporal attentions, respectively in each residual module. The experiments are conducted on benchmark facial micro-expression datasets. A superior performance is observed as compared to the state-of-the-art for facial micro-expression recognition.

8. Bayesian Image Reconstruction using Deep Generative Models [PDF] Back to contents
  Razvan V Marinescu, Daniel Moyer, Polina Golland
Abstract: Machine learning models are commonly trained end-to-end and in a supervised setting, using paired (input, output) data. Classical examples include recent super-resolution methods that train on pairs of (low-resolution, high-resolution) images. However, these end-to-end approaches require re-training every time there is a distribution shift in the inputs (e.g., night images vs daylight) or relevant latent variables (e.g., camera blur or hand motion). In this work, we leverage state-of-the-art (SOTA) generative models (here StyleGAN2) for building powerful image priors, which enable application of Bayes' theorem for many downstream reconstruction tasks. Our method, called Bayesian Reconstruction through Generative Models (BRGM), uses a single pre-trained generator model to solve different image restoration tasks, i.e., super-resolution and in-painting, by combining it with different forward corruption models. We demonstrate BRGM on three large, yet diverse, datasets that enable us to build powerful priors: (i) 60,000 images from the Flick Faces High Quality dataset (Karras et al., 2019) (ii) 240,000 chest X-rays from MIMIC III and (iii) a combined collection of 5 brain MRI datasets with 7,329 scans. Across all three datasets and without any dataset-specific hyperparameter tuning, our approach yields state-of-the-art performance on super-resolution, particularly at low-resolution levels, as well as inpainting, compared to state-of-the-art methods that are specific to each reconstruction task. We will make our code and pre-trained models available online.
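In spirit, the Bayesian reconstruction described above is a MAP estimate over the generator's latent code under a known forward corruption model. A generic sketch, deliberately not tied to the real StyleGAN2 API, might look like the following; the Gaussian prior weight, optimizer and latent parameterization are assumptions for illustration.

```python
import torch

def map_reconstruct(generator, forward_op, y, latent_dim: int,
                    steps: int = 500, lr: float = 0.05, prior_weight: float = 1e-3):
    """Sketch of MAP image reconstruction with a pretrained generator prior.

    generator(z) -> image, forward_op(image) -> simulated measurement
    (e.g. downsampling for super-resolution, masking for inpainting),
    y -> observed measurement. Gaussian likelihood + Gaussian prior on z
    is the usual assumption; all weights here are illustrative.
    """
    z = torch.zeros(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x = generator(z)
        data_term = ((forward_op(x) - y) ** 2).mean()   # -log likelihood
        prior_term = prior_weight * (z ** 2).mean()     # -log prior on z
        (data_term + prior_term).backward()
        optimizer.step()
    return generator(z).detach()
```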

9. CoShaRP: A Convex Program for Single-shot Tomographic Shape Sensing [PDF] Back to contents
  Ajinkya Kadu, Tristan van Leeuwen, K. Joost Batenburg
Abstract: We introduce single-shot X-ray tomography that aims to estimate the target image from a single cone-beam projection measurement. This linear inverse problem is extremely under-determined since the measurements are far fewer than the number of unknowns. Moreover, it is more challenging than conventional tomography where a sufficiently large number of projection angles forms the measurements, allowing for a simple inversion process. However, single-shot tomography becomes less severe if the target image is only composed of known shapes. Hence, the shape prior transforms a linear ill-posed image estimation problem to a non-linear problem of estimating the roto-translations of the shapes. In this paper, we circumvent the non-linearity by using a dictionary of possible roto-translations of the shapes. We propose a convex program CoShaRP to recover the dictionary-coefficients successfully. CoShaRP relies on simplex-type constraint and can be solved quickly using a primal-dual algorithm. The numerical experiments show that CoShaRP recovers shapes stably from moderately noisy measurements.
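To illustrate what a convex program with a simplex-type constraint over roto-translation dictionary coefficients can look like, here is a hedged sketch using cvxpy. The least-squares data term and the exact constraint form are assumptions; the paper solves its own formulation with a primal-dual algorithm rather than a generic solver.

```python
import cvxpy as cp
import numpy as np

def coshap_like_recovery(A: np.ndarray, D: np.ndarray, y: np.ndarray, K: int):
    """Illustrative convex relaxation in the spirit of CoShaRP.

    A: (m, n) projection operator, D: (n, p) dictionary whose columns are
    roto-translated copies of the known shapes, y: (m,) measurement,
    K: number of shapes expected in the image. The K-simplex-style
    constraint and least-squares fit are assumptions for illustration.
    """
    c = cp.Variable(D.shape[1], nonneg=True)
    objective = cp.Minimize(cp.sum_squares(A @ (D @ c) - y))
    constraints = [cp.sum(c) == K]          # simplex-type constraint
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return c.value
```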

10. Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting [PDF] Back to contents
  Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, Liang Lin
Abstract: Crowd counting is a fundamental yet challenging problem, which desires rich information to generate pixel-wise crowd density maps. However, most previous methods only utilized the limited information of RGB images and may fail to discover the potential pedestrians in unconstrained environments. In this work, we find that incorporating optical and thermal information can greatly help to recognize pedestrians. To promote future researches in this field, we introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to facilitate the multimodal crowd counting, we propose a cross-modal collaborative representation learning framework, which consists of multiple modality-specific branches, a modality-shared branch, and an Information Aggregation-Distribution Module (IADM) to fully capture the complementary information of different modalities. Specifically, our IADM incorporates two collaborative information transfer components to dynamically enhance the modality-shared and modality-specific representations with a dual information propagation mechanism. Extensive experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting. Moreover, the proposed approach is universal for multimodal crowd counting and is also capable to achieve superior performance on the ShanghaiTechRGBD dataset.

11. Digital Gimbal: End-to-end Deep Image Stabilization with Learnable Exposure Times [PDF] Back to contents
  Omer Dahary, Matan Jacoby, Alex M. Bronstein
Abstract: Mechanical image stabilization using actuated gimbals enables capturing long-exposure shots without suffering from blur due to camera motion. These devices, however, are often physically cumbersome and expensive, limiting their widespread use. In this work, we propose to digitally emulate a mechanically stabilized system from the input of a fast unstabilized camera. To exploit the trade-off between motion blur at long exposures and low SNR at short exposures, we train a CNN that estimates a sharp high-SNR image by aggregating a burst of noisy short-exposure frames, related by unknown motion. We further suggest learning the burst's exposure times in an end-to-end manner, thus balancing the noise and blur across the frames. We demonstrate this method's advantage over the traditional approach of deblurring a single image or denoising a fixed-exposure burst.

12. Human Motion Tracking by Registering an Articulated Surface to 3-D Points and Normals [PDF] Back to contents
  Radu Horaud, Matti Niskanen, Guillaume Dewaele, Edmond Boyer
Abstract: We address the problem of human motion tracking by registering a surface to 3-D data. We propose a method that iteratively computes two things: Maximum likelihood estimates for both the kinematic and free-motion parameters of a kinematic human-body representation, as well as probabilities that the data are assigned either to a body part, or to an outlier cluster. We introduce a new metric between observed points and normals on one side, and a parameterized surface on the other side, the latter being defined as a blending over a set of ellipsoids. We claim that this metric is well suited when one deals with either visual-hull or visual-shape observations. We illustrate the method by tracking human motions using sparse visual-shape data (3-D surface points and normals) gathered from imperfect silhouettes.

13. SSCNav: Confidence-Aware Semantic Scene Completion for Visual Semantic Navigation [PDF] Back to contents
  Yiqing Liang, Boyuan Chen, Shuran Song
Abstract: This paper focuses on visual semantic navigation, the task of producing actions for an active agent to navigate to a specified target object category in an unknown environment. To complete this task, the algorithm should simultaneously locate and navigate to an instance of the category. In comparison to the traditional point goal navigation, this task requires the agent to have a stronger contextual prior of indoor environments. We introduce SSCNav, an algorithm that explicitly models scene priors using a confidence-aware semantic scene completion module to complete the scene and guide the agent's navigation planning. Given a partial observation of the environment, SSCNav first infers a complete scene representation with semantic labels for the unobserved scene together with a confidence map associated with its own prediction. Then, a policy network infers the action from the scene completion result and confidence map. Our experiments demonstrate that the proposed scene completion module improves the efficiency of the downstream navigation policies. this https URL

14. Rotation-Invariant Autoencoders for Signals on Spheres [PDF] Back to contents
  Suhas Lohit, Shubhendu Trivedi
Abstract: Omnidirectional images and spherical representations of $3D$ shapes cannot be processed with conventional 2D convolutional neural networks (CNNs) as the unwrapping leads to large distortion. Using fast implementations of spherical and $SO(3)$ convolutions, researchers have recently developed deep learning methods better suited for classifying spherical images. These newly proposed convolutional layers naturally extend the notion of convolution to functions on the unit sphere $S^2$ and the group of rotations $SO(3)$ and these layers are equivariant to 3D rotations. In this paper, we consider the problem of unsupervised learning of rotation-invariant representations for spherical images. In particular, we carefully design an autoencoder architecture consisting of $S^2$ and $SO(3)$ convolutional layers. As 3D rotations are often a nuisance factor, the latent space is constrained to be exactly invariant to these input transformations. As the rotation information is discarded in the latent space, we craft a novel rotation-invariant loss function for training the network. Extensive experiments on multiple datasets demonstrate the usefulness of the learned representations on clustering, retrieval and classification applications.

15. Multi-Objective Interpolation Training for Robustness to Label Noise [PDF] Back to contents
  Diego Ortego, Eric Arazo, Paul Albert, Noel E. O'Connor, Kevin McGuinness
Abstract: Deep neural networks trained with standard cross-entropy loss memorize noisy labels, which degrades their performance. Most research to mitigate this memorization proposes new robust classification loss functions. Conversely, we explore the behavior of supervised contrastive learning under label noise to understand how it can improve image classification in these scenarios. In particular, we propose a Multi-Objective Interpolation Training (MOIT) approach that jointly exploits contrastive learning and classification. We show that standard contrastive learning degrades in the presence of label noise and propose an interpolation training strategy to mitigate this behavior. We further propose a novel label noise detection method that exploits the robust feature representations learned via contrastive learning to estimate per-sample soft-labels whose disagreements with the original labels accurately identify noisy samples. This detection allows treating noisy samples as unlabeled and training a classifier in a semi-supervised manner. We further propose MOIT+, a refinement of MOIT by fine-tuning on detected clean samples. Hyperparameter and ablation studies verify the key components of our method. Experiments on synthetic and real-world noise benchmarks demonstrate that MOIT/MOIT+ achieves state-of-the-art results. Code is available at this https URL.
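The interpolation component suggested by the method's name can be illustrated with a standard mixup-style sketch that mixes inputs and their estimated soft labels; the Beta mixing distribution and the exact way MOIT combines this with the contrastive branch are assumptions.

```python
import numpy as np
import torch

def interpolate_batch(x, soft_targets, alpha: float = 1.0):
    """Mixup-style interpolation used as a sketch of interpolation training.

    x: (N, ...) inputs, soft_targets: (N, num_classes) soft labels
    (e.g. the per-sample soft-labels estimated from contrastive features).
    Beta(alpha, alpha) mixing is the standard choice; MOIT's exact recipe
    may differ.
    """
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * soft_targets + (1.0 - lam) * soft_targets[perm]
    return x_mix, y_mix
```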

16. SPU-Net: Self-Supervised Point Cloud Upsampling by Coarse-to-Fine Reconstruction with Self-Projection Optimization [PDF] Back to contents
  Xinhai Liu, Xinchen Liu, Zhizhong Han, Yu-Shen Liu
Abstract: The task of point cloud upsampling aims to acquire dense and uniform point sets from sparse and irregular point sets. Although significant progress has been made with deep learning models, they require ground-truth dense point sets as the supervision information, which can only be trained on synthetic paired training data and are not suitable for training under real-scanned sparse data. However, it is expensive and tedious to obtain large scale paired sparse-dense point sets for training from real scanned sparse data. To address this problem, we propose a self-supervised point cloud upsampling network, named SPU-Net, to capture the inherent upsampling patterns of points lying on the underlying object surface. Specifically, we propose a coarse-to-fine reconstruction framework, which contains two main components: point feature extraction and point feature expansion, respectively. In the point feature extraction, we integrate self-attention module with graph convolution network (GCN) to simultaneously capture context information inside and among local regions. In the point feature expansion, we introduce a hierarchically learnable folding strategy to generate the upsampled point sets with learnable 2D grids. Moreover, to further optimize the noisy points in the generated point sets, we propose a novel self-projection optimization associated with uniform and reconstruction terms, as a joint loss, to facilitate the self-supervised point cloud upsampling. We conduct various experiments on both synthetic and real-scanned datasets, and the results demonstrate that we achieve comparable performance to the state-of-the-art supervised methods.

17. Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence [PDF] Back to contents
  Siyue Yu, Bingfeng Zhang, Jimin Xiao, Eng Gee Lim
Abstract: Sparse labels have been attracting much attention in recent years. However, the performance gap between weakly supervised and fully supervised salient object detection methods is huge, and most previous weakly supervised works adopt complex training methods with many bells and whistles. In this work, we propose a one-round end-to-end training approach for weakly supervised salient object detection via scribble annotations without pre/post-processing operations or extra supervision data. Since scribble labels fail to offer detailed salient regions, we propose a local coherence loss to propagate the labels to unlabeled regions based on image features and pixel distance, so as to predict integral salient regions with complete object structures. We design a saliency structure consistency loss as self-supervision to ensure consistent saliency maps are predicted with different scales of the same image as input, which could be viewed as a regularization technique to enhance the model generalization ability. Additionally, we design an aggregation module (AGGM) to better integrate high-level features, low-level features and global context information for the decoder to aggregate various information. Extensive experiments show that our method achieves a new state-of-the-art performance on six benchmarks (e.g. for the ECSSD dataset: $F_\beta = 0.8995$, $E_\xi = 0.9079$ and $MAE = 0.0489$), with an average gain of 4.60% for F-measure, 2.05% for E-measure and 1.88% for MAE over the previous best method on this task. Source code is available at this http URL.
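A pairwise penalty of the kind described, where nearby pixels with similar colors are pushed toward similar saliency values, can be sketched as below. The bilateral kernel, bandwidths and neighborhood radius are illustrative assumptions rather than the paper's exact loss.

```python
import torch

def local_coherence_loss(pred, image, radius: int = 2,
                         sigma_rgb: float = 0.1, sigma_xy: float = 3.0):
    """Sketch of a local saliency-coherence penalty.

    pred: (N, 1, H, W) saliency probabilities, image: (N, 3, H, W) in [0, 1].
    Neighbouring pixels that look alike and are close should get similar
    saliency; weights and radius are illustrative. Note torch.roll wraps
    at the borders; a real implementation would mask border pairs.
    """
    loss, count = 0.0, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted_pred = torch.roll(pred, shifts=(dy, dx), dims=(2, 3))
            shifted_img = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
            color_dist = ((image - shifted_img) ** 2).sum(dim=1, keepdim=True)
            spatial_dist = float(dy * dy + dx * dx)
            weight = torch.exp(-color_dist / (2 * sigma_rgb ** 2)
                               - spatial_dist / (2 * sigma_xy ** 2))
            loss = loss + (weight * (pred - shifted_pred).abs()).mean()
            count += 1
    return loss / count
```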

18. Using Feature Alignment can Improve Clean Average Precision and Adversarial Robustness in Object Detection [PDF] Back to contents
  Weipeng Xu, Hongcheng Huang
Abstract: The 2D object detection in clean images has been a well studied topic, but its vulnerability against adversarial attacks is still worrying. Existing work has improved the robustness of object detector by adversarial training, but at the same time, the average precision (AP) on clean images drops significantly. In this paper, we improve object detection algorithm by guiding the output of intermediate feature layer. On the basis of adversarial training, we propose two feature alignment methods, namely Knowledge-Distilled Feature Alignment (KDFA) and Self-Supervised Feature Alignment (SSFA). The detector's clean AP and robustness can be improved by aligning the features of the middle layer of the network. We conduct extensive experiments on PASCAL VOC and MS-COCO datasets to verify the effectiveness of our proposed approach. The code of our experiments is available at this https URL.
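Both alignment variants boil down to penalizing the distance between intermediate features computed from clean and adversarial inputs. A minimal self-supervised (SSFA-like) sketch is shown below; the layer choice, distance function and how the term is weighted against the detection loss are assumptions, and KDFA would instead take the clean-image features from a distilled teacher network.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(backbone, x_clean, x_adv):
    """Sketch of a self-supervised feature-alignment term (SSFA-like).

    backbone(x) is assumed to return an intermediate feature map
    (N, C, H, W). Aligning adversarial features to detached clean
    features is an illustrative choice; this term would be added to the
    usual detection loss during adversarial training.
    """
    with torch.no_grad():
        f_clean = backbone(x_clean)
    f_adv = backbone(x_adv)
    return F.mse_loss(f_adv, f_clean)
```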

19. 3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection [PDF] Back to contents
  He Wang, Yezhen Cong, Or Litany, Yue Gao, Leonidas J. Guibas
Abstract: 3D object detection is an important yet demanding task that heavily relies on difficult to obtain 3D annotations. To reduce the required amount of supervision, we propose 3DIoUMatch, a novel method for semi-supervised 3D object detection. We adopt VoteNet, a popular point cloud-based object detector, as our backbone and leverage a teacher-student mutual learning framework to propagate information from the labeled to the unlabeled train set in the form of pseudo-labels. However, due to the high task complexity, we observe that the pseudo-labels suffer from significant noise and are thus not directly usable. To that end, we introduce a confidence-based filtering mechanism. The key to our approach is a novel differentiable 3D IoU estimation module. This module is used for filtering poorly localized proposals as well as for IoU-guided bounding box deduplication. At inference time, this module is further utilized to improve localization through test-time optimization. Our method consistently improves state-of-the-art methods on both ScanNet and SUN-RGBD benchmarks by significant margins. For example, when training using only 10\% labeled data on ScanNet, 3DIoUMatch achieves 7.7 absolute improvement on mAP@0.25 and 8.5 absolute improvement on mAP@0.5 upon the prior art.
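The confidence-based filtering step can be pictured as a simple double-threshold test on the teacher's proposals, using both the classification score and the IoU predicted by the estimation module; the field names and threshold values below are illustrative assumptions, not the paper's tuned settings.

```python
def filter_pseudo_labels(proposals, cls_thresh: float = 0.9, iou_thresh: float = 0.5):
    """Sketch of confidence-based pseudo-label filtering for semi-supervised
    3D detection. Each proposal is assumed to carry a classification score
    and a predicted IoU from the IoU-estimation head.
    """
    kept = []
    for p in proposals:
        if p["cls_score"] >= cls_thresh and p["pred_iou"] >= iou_thresh:
            kept.append(p)   # used as a pseudo ground-truth box for the student
    return kept
```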

20. MANGO: A Mask Attention Guided One-Stage Scene Text Spotter [PDF] Back to contents
  Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, Fei Wu
Abstract: Recently end-to-end scene text spotting has become a popular research topic due to its advantages of global optimization and high maintainability in real applications. Most methods attempt to develop various region of interest (RoI) operations to concatenate the detection part and the sequence recognition part into a two-stage text spotting framework. However, in such framework, the recognition part is highly sensitive to the detected results (e.g., the compactness of text contours). To address this problem, in this paper, we propose a novel Mask AttentioN Guided One-stage text spotting framework named MANGO, in which character sequences can be directly recognized without RoI operation. Concretely, a position-aware mask attention module is developed to generate attention weights on each text instance and its characters. It allows different text instances in an image to be allocated on different feature map channels which are further grouped as a batch of instance features. Finally, a lightweight sequence decoder is applied to generate the character sequences. It is worth noting that MANGO inherently adapts to arbitrary-shaped text spotting and can be trained end-to-end with only coarse position information (e.g., rectangular bounding box) and text annotations. Experimental results show that the proposed method achieves competitive and even new state-of-the-art performance on both regular and irregular text spotting benchmarks, i.e., ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500.

21. StacMR: Scene-Text Aware Cross-Modal Retrieval [PDF] Back to contents
  Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, Diane Larlus, Dimosthenis Karatzas
Abstract: Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconcile them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at this http URL

22. Continual Adaptation of Visual Representations via Domain Randomization and Meta-learning [PDF] Back to contents
  Riccardo Volpi, Diane Larlus, Grégory Rogez
Abstract: Most standard learning approaches lead to fragile models which are prone to drift when sequentially trained on samples of a different nature - the well-known "catastrophic forgetting" issue. In particular, when a model consecutively learns from different visual domains, it tends to forget the past ones in favor of the most recent. In this context, we show that one way to learn models that are inherently more robust against forgetting is domain randomization - for vision tasks, randomizing the current domain's distribution with heavy image manipulations. Building on this result, we devise a meta-learning strategy where a regularizer explicitly penalizes any loss associated with transferring the model from the current domain to different "auxiliary" meta-domains, while also easing adaptation to them. Such meta-domains, are also generated through randomized image manipulations. We empirically demonstrate in a variety of experiments - spanning from classification to semantic segmentation - that our approach results in models that are less prone to catastrophic forgetting when transferred to new domains.

23. Perceptual Robust Hashing for Color Images with Canonical Correlation Analysis [PDF] Back to contents
  Xinran Li, Chuan Qin, Zhenxing Qian, Heng Yao, Xinpeng Zhang
Abstract: In this paper, a novel perceptual image hashing scheme for color images is proposed based on ring-ribbon quadtree and color vector angle. First, original image is subjected to normalization and Gaussian low-pass filtering to produce a secondary image, which is divided into a series of ring-ribbons with different radii and the same number of pixels. Then, both textural and color features are extracted locally and globally. Quadtree decomposition (QD) is applied on luminance values of the ring-ribbons to extract local textural features, and the gray level co-occurrence matrix (GLCM) is used to extract global textural features. Local color features of significant corner points on outer boundaries of ring-ribbons are extracted through color vector angles (CVA), and color low-order moments (CLMs) is utilized to extract global color features. Finally, two types of feature vectors are fused via canonical correlation analysis (CCA) to produce the final hash after scrambling. Compared with direct concatenation, the CCA feature fusion method improves classification performance, which better reflects overall correlation between two sets of feature vectors. Receiver operating characteristic (ROC) curve shows that our scheme has satisfactory performances with respect to robustness, discrimination and security, which can be effectively used in copy detection and content authentication.
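The CCA fusion step can be sketched with scikit-learn: project the two feature sets into a shared correlated subspace, concatenate the projections, and binarize. The median-threshold binarization and the number of components below are illustrative assumptions; the paper additionally scrambles the hash with a secret key.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse_and_hash(X_texture: np.ndarray, X_color: np.ndarray, n_components: int = 8):
    """Sketch of CCA-based feature fusion for perceptual hashing.

    X_texture, X_color: (num_images, d1) and (num_images, d2) feature sets;
    n_components must not exceed min(d1, d2, num_images). The two sets are
    projected into a correlated subspace, concatenated, and thresholded at
    the per-dimension median to get a binary hash (illustrative choice).
    """
    cca = CCA(n_components=n_components)
    U, V = cca.fit_transform(X_texture, X_color)
    fused = np.concatenate([U, V], axis=1)
    bits = (fused > np.median(fused, axis=0, keepdims=True)).astype(np.uint8)
    return bits
```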

24. Context-Aware Graph Convolution Network for Target Re-identification [PDF] Back to contents
  Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, Junjie Yan
Abstract: Most existing re-identification methods focus on learning robust and discriminative features with deep convolution networks. However, many of them consider content similarity separately and fail to utilize the context information of the query and gallery sets, e.g. probe-gallery and gallery-gallery relations, thus hard samples may not be well solved due to the limited or even misleading information. In this paper, we present a novel Context-Aware Graph Convolution Network (CAGCN), where the probe-gallery relations are encoded into the graph nodes and the graph edge connections are well controlled by the gallery-gallery relations. In this way, hard samples can be addressed with the context information flows among other easy samples during the graph reasoning. Specifically, we adopt an effective hard gallery sampler to obtain high recall for positive samples while keeping a reasonable graph size, which can also weaken the imbalanced problem in training process with low computation complexity. Experiments show that the proposed method achieves state-of-the-art performance on both person and vehicle re-identification datasets in a plug and play fashion with limited overhead.

25. Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [PDF] Back to contents
  Hui Tang, Xiatian Zhu, Ke Chen, Kui Jia, C. L. Philip Chen
Abstract: Unsupervised domain adaptation (UDA) is to learn classification models that make predictions for unlabeled data on a target domain, given labeled data on a source domain whose distribution diverges from the target one. Mainstream UDA methods strive to learn domain-aligned features such that classifiers trained on the source features can be readily applied to the target ones. Although impressive results have been achieved, these methods have a potential risk of damaging the intrinsic data structures of target discrimination, raising an issue of generalization particularly for UDA tasks in an inductive setting. To address this issue, we are motivated by a UDA assumption of structural similarity across domains, and propose to directly uncover the intrinsic target discrimination via constrained clustering, where we constrain the clustering solutions using structural source regularization that hinges on the very same assumption. Technically, we propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one, and we thus term our method as SRDC++. Our hybrid model is based on a deep clustering framework that minimizes the Kullback-Leibler divergence between the distribution of network prediction and an auxiliary one, where we impose structural regularization by learning domain-shared classifier and cluster centroids. By enriching the structural similarity assumption, we are able to extend SRDC++ for a pixel-level UDA task of semantic segmentation. We conduct extensive experiments on seven UDA benchmarks of image classification and semantic segmentation. With no explicit feature alignment, our proposed SRDC++ outperforms all the existing methods under both the inductive and transductive settings. We make our implementation codes publicly available at this https URL.

26. UnrealPerson: An Adaptive Pipeline towards Costless Person Re-identification [PDF] Back to contents
  Tianyu Zhang, Lingxi Xie, Longhui Wei, Zijie Zhuang, Yongfei Zhang, Bo Li, Qi Tian
Abstract: The main difficulty of person re-identification (ReID) lies in collecting annotated data and transferring the model across different domains. This paper presents UnrealPerson, a novel pipeline that makes full use of unreal image data to decrease the costs in both the training and deployment stages. Its fundamental part is a system that can generate synthesized images of high-quality and from controllable distributions. Instance-level annotation goes with the synthesized data and is almost free. We point out some details in image synthesis that largely impact the data quality. With 3,000 IDs and 120,000 instances, our method achieves a 38.5% rank-1 accuracy when being directly transferred to MSMT17. It almost doubles the former record using synthesized data and even surpasses previous direct transfer records using real data. This offers a good basis for unsupervised domain adaption, where our pre-trained model is easily plugged into the state-of-the-art algorithms towards higher accuracy. In addition, the data distribution can be flexibly adjusted to fit some corner ReID scenarios, which widens the application of our pipeline. We will publish our data synthesis toolkit and synthesized data in this https URL.

27. Learning to Generate Content-Aware Dynamic Detectors [PDF] Back to contents
  Junyi Feng, Jiashen Hua, Baisheng Lai, Jianqiang Huang, Xi Li, Xian-sheng Hua
Abstract: Model efficiency is crucial for object detection. Most previous works rely on either hand-crafted design or auto-search methods to obtain a static architecture, regardless of the difference of inputs. In this paper, we introduce a new perspective of designing efficient detectors, which is automatically generating sample-adaptive model architecture on the fly. The proposed method is named content-aware dynamic detectors (CADDet). It first applies a multi-scale densely connected network with dynamic routing as the supernet. Furthermore, we introduce a coarse-to-fine strategy tailored for object detection to guide the learning of dynamic routing, which contains two metrics: 1) dynamic global budget constraint assigns data-dependent expected budgets for individual samples; 2) local path similarity regularization aims to generate more diverse routing paths. With these, our method achieves higher computational efficiency while maintaining good performance. To the best of our knowledge, our CADDet is the first work to introduce dynamic routing mechanism in object detection. Experiments on MS-COCO dataset demonstrate that CADDet achieves 1.8 higher mAP with 10% fewer FLOPs compared with vanilla routing strategy. Compared with the models based upon similar building blocks, CADDet achieves a 42% FLOPs reduction with a competitive mAP.

28. Active Visual Localization in Partially Calibrated Environments [PDF] Back to contents
  Yingda Yin, Qingnan Fan, Fei Xia, Qihang Fang, Siyan Dong, Leonidas Guibas, Baoquan Chen
Abstract: Humans can robustly localize themselves without a map after they get lost following prominent visual cues or landmarks. In this work, we aim at endowing autonomous agents the same ability. Such ability is important in robotics applications yet very challenging when an agent is exposed to partially calibrated environments, where camera images with accurate 6 Degree-of-Freedom pose labels only cover part of the scene. To address the above challenge, we explore using Reinforcement Learning to search for a policy to generate intelligent motions so as to actively localize the agent given visual information in partially calibrated environments. Our core contribution is to formulate the active visual localization problem as a Partially Observable Markov Decision Process and propose an algorithmic framework based on Deep Reinforcement Learning to solve it. We further propose an indoor scene dataset ACR-6, which consists of both synthetic and real data and simulates challenging scenarios for active visual localization. We benchmark our algorithm against handcrafted baselines for localization and demonstrate that our approach significantly outperforms them on localization success rate.

29. Overcomplete Representations Against Adversarial Videos [PDF] Back to contents
  Shao-Yuan Lo, Jeya Maria Jose Valanarasu, Vishal M. Patel
Abstract: Adversarial robustness of deep neural networks is an extensively studied problem in the literature and various methods have been proposed to defend against adversarial images. However, only a handful of defense methods have been developed for defending against attacked videos. In this paper, we propose a novel Over-and-Under complete restoration network for Defending against adversarial videos (OUDefend). Most restoration networks adopt an encoder-decoder architecture that first shrinks spatial dimension then expands it back. This approach learns undercomplete representations, which have large receptive fields to collect global information but overlooks local details. On the other hand, overcomplete representations have opposite properties. Hence, OUDefend is designed to balance local and global features by learning those two representations. We attach OUDefend to target video recognition models as a feature restoration block and train the entire network end-to-end. Experimental results show that the defenses focusing on images may be ineffective to videos, while OUDefend enhances robustness against different types of adversarial videos, ranging from additive attacks, multiplicative attacks to physically realizable attacks.
摘要:深度神经网络的对抗鲁棒性是文献中广泛研究的问题,并且已经提出了多种防御对抗图像的方法。但是,仅开发了几种防御方法来防御被攻击的视频。在本文中,我们提出了一种新颖的全面防御网络,用于防御对抗性视频(OUDefend)。大多数恢复网络采用编解码器架构,该架构首先缩小空间维度,然后再将其扩展回去。这种方法学习的是不完整的表示形式,这些表示形式具有很大的接受范围,可以收集全局信息,但是却忽略了局部细节。另一方面,过完备的表示具有相反的属性。因此,OUDefend旨在通过学习这两种表示来平衡本地和全局特征。我们将OUDefend附加到目标视频识别模型中作为功能恢复模块,并端到端训练整个网络。实验结果表明,针对图像的防御可能对视频无效,而OUDefend增强了针对不同类型的对抗视频的鲁棒性,从加性攻击,乘性攻击到可物理实现的攻击。

30. Data Instance Prior for Transfer Learning in GANs [PDF] 返回目录
  Puneet Mangla, Nupur Kumari, Mayank Singh, Vineeth N Balasubramanian, Balaji Krishnamurthy
Abstract: Recent advances in generative adversarial networks (GANs) have shown remarkable progress in generating high-quality images. However, this gain in performance depends on the availability of a large amount of training data. In limited data regimes, training typically diverges, and therefore the generated samples are of low quality and lack diversity. Previous works have addressed training in low data setting by leveraging transfer learning and data augmentation techniques. We propose a novel transfer learning method for GANs in the limited data domain by leveraging informative data prior derived from self-supervised/supervised pre-trained networks trained on a diverse source domain. We perform experiments on several standard vision datasets using various GAN architectures (BigGAN, SNGAN, StyleGAN2) to demonstrate that the proposed method effectively transfers knowledge to domains with few target images, outperforming existing state-of-the-art techniques in terms of image quality and diversity. We also show the utility of data instance prior in large-scale unconditional image generation and image editing tasks.
摘要:生成对抗网络(GAN)的最新进展已显示出在生成高质量图像方面的显着进步。但是,这种性能的提高取决于大量训练数据的可用性。在有限的数据方案中,训练通常会有所不同,因此生成的样本质量较低且缺乏多样性。先前的工作已经通过利用转移学习和数据增强技术来解决低数据设置方面的培训。我们提出了一种利用有限的数据域中GAN的新颖的转移学习方法,该方法利用了从在各种源域上训练的自我监督/监督的预训练网络中获得的先验信息数据。我们对使用各种GAN架构(BigGAN,SNGAN,StyleGAN2)的几个标准视觉数据集进行了实验,以证明该方法可有效地将知识转移到目标图像很少的领域,在图像质量方面优于现有的最新技术和多样性。我们还将展示数据实例优先级在大规模无条件图像生成和图像编辑任务中的实用性。

31. Variational Interaction Information Maximization for Cross-domain Disentanglement [PDF] 返回目录
  HyeongJoo Hwang, Geon-Hyeong Kim, Seunghoon Hong, Kee-Eung Kim
Abstract: Cross-domain disentanglement is the problem of learning representations partitioned into domain-invariant and domain-specific representations, which is a key to successful domain transfer or measuring semantic distance between two domains. Grounded in information theory, we cast the simultaneous learning of domain-invariant and domain-specific representations as a joint objective of multiple information constraints, which does not require adversarial training or gradient reversal layers. We derive a tractable bound of the objective and propose a generative model named Interaction Information Auto-Encoder (IIAE). Our approach reveals insights on the desirable representation for cross-domain disentanglement and its connection to Variational Auto-Encoder (VAE). We demonstrate the validity of our model in the image-to-image translation and the cross-domain retrieval tasks. We further show that our model achieves the state-of-the-art performance in the zero-shot sketch based image retrieval task, even without external knowledge. Our implementation is publicly available at: this https URL
摘要:跨域解缠结是学习将表示分为领域不变表示和领域特定表示的问题,这是成功进行域转移或测量两个域之间语义距离的关键。立足于信息论,我们将领域不变和领域特定表示的同步学习作为多种信息约束的共同目标,不需要对抗训练或梯度逆转层。我们得出了目标的一个可理解的界限,并提出了一个称为交互信息自动编码器(IIAE)的生成模型。我们的方法揭示了跨域解缠结的理想表示形式及其与变分自动编码器(VAE)的联系。我们证明了我们的模型在图像到图像翻译和跨域检索任务中的有效性。我们进一步证明,即使在没有外部知识的情况下,我们的模型也可以在基于零镜头草图的图像检索任务中实现最先进的性能。我们的实施可在以下网址公开获得:https URL

32. Texture Transform Attention for Realistic Image Inpainting [PDF] 返回目录
  Yejin Kim, Manri Cheon, Junwoo Lee
Abstract: Over the last few years, the performance of inpainting to fill missing regions has shown significant improvements by using deep neural networks. Most of inpainting work create a visually plausible structure and texture, however, due to them often generating a blurry result, final outcomes appear unrealistic and make feel heterogeneity. In order to solve this problem, the existing methods have used a patch based solution with deep neural network, however, these methods also cannot transfer the texture properly. Motivated by these observation, we propose a patch based method. Texture Transform Attention network(TTA-Net) that better produces the missing region inpainting with fine details. The task is a single refinement network and takes the form of U-Net architecture that transfers fine texture features of encoder to coarse semantic features of decoder through skip-connection. Texture Transform Attention is used to create a new reassembled texture map using fine textures and coarse semantics that can efficiently transfer texture information as a result. To stabilize training process, we use a VGG feature layer of ground truth and patch discriminator. We evaluate our model end-to-end with the publicly available datasets CelebA-HQ and Places2 and demonstrate that images of higher quality can be obtained to the existing state-of-the-art approaches.
摘要:在过去的几年中,通过使用深度神经网络,修补图像填充缺失区域的性能已显示出显着改善。大多数的修复工作会产生视觉上看似合理的结构和纹理,但是由于它们经常产生模糊的结果,因此最终结果显得不切实际,并让人感觉异质。为了解决这个问题,现有方法使用了具有深度神经网络的基于补丁的解决方案,但是,这些方法也不能正确地传递纹理。基于这些观察,我们提出了一种基于补丁的方法:纹理变换注意网络(TTA-Net),它可以更好地产生具有精细细节的缺失区域修复。该任务是一个单一的细化网络,采用U-Net架构的形式,该架构通过跳跃连接将编码器的精细纹理特征传递给解码器的粗略语义特征。Texture Transform Attention用于利用精细纹理和粗糙语义创建新的重组纹理图,从而可以有效地传递纹理信息。为了稳定训练过程,我们使用了真实图像(ground truth)的VGG特征层和patch鉴别器。我们使用可公开获取的数据集CelebA-HQ和Places2端到端评估我们的模型,并证明与现有的最新方法相比可以获得更高质量的图像。

33. KNN-enhanced Deep Learning Against Noisy Labels [PDF] 返回目录
  Shuyu Kong, You Li, Jia Wang, Amin Rezaei, Hai Zhou
Abstract: Supervised learning on Deep Neural Networks (DNNs) is data hungry. Optimizing performance of DNN in the presence of noisy labels has become of paramount importance since collecting a large dataset will usually bring in noisy labels. Inspired by the robustness of K-Nearest Neighbors (KNN) against data noise, in this work, we propose to apply deep KNN for label cleanup. Our approach leverages DNNs for feature extraction and KNN for ground-truth label inference. We iteratively train the neural network and update labels to simultaneously proceed towards higher label recovery rate and better classification performance. Experiment results show that under the same setting, our approach outperforms existing label correction methods and achieves better accuracy on multiple datasets, e.g.,76.78% on Clothing1M dataset.
摘要:深度神经网络(DNN)上的监督学习非常耗费数据。 在存在噪声标签的情况下优化DNN的性能已变得至关重要,因为收集大型数据集通常会带来噪声标签。 受K最近邻(KNN)抵御数据噪声的鲁棒性启发,在这项工作中,我们建议将深度KNN用于标签清除。 我们的方法利用DNN进行特征提取,并利用KNN进行真实标签推理。 我们迭代地训练神经网络并更新标签,以同时朝着更高的标签回收率和更好的分类性能前进。 实验结果表明,在相同的设置下,我们的方法优于现有的标签校正方法,并且在多个数据集上均具有更好的准确性,例如,在Clothing1M数据集上达到了76.78%。
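
The abstract describes the deep-KNN cleanup only at a high level, so the following is a minimal, hypothetical sketch of one cleanup pass (not the authors' code): features come from some trained backbone, a k-nearest-neighbour vote over those features proposes corrected labels, and the network would then be retrained on the updated labels. The neighbourhood size k=10 and the 0.8 agreement threshold are illustrative assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def knn_relabel(features, noisy_labels, k=10, agreement=0.8):
        """One label-cleanup pass: each sample is relabeled by the majority
        vote of its k nearest neighbours in feature space, but only when the
        vote is confident enough; otherwise the current label is kept."""
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(features, noisy_labels)
        proba = knn.predict_proba(features)          # neighbour vote fractions
        voted = knn.classes_[proba.argmax(axis=1)]   # majority-vote label
        confident = proba.max(axis=1) >= agreement   # only trust strong votes
        return np.where(confident, voted, noisy_labels)

    # toy usage: 2-D "features" with a handful of flipped labels
    rng = np.random.default_rng(0)
    feats = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    labels = np.array([0] * 50 + [1] * 50)
    labels[:5] = 1                                   # inject label noise
    cleaned = knn_relabel(feats, labels)
    print((cleaned[:5] == 0).sum(), "of 5 noisy labels recovered")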

34. Scale Aware Adaptation for Land-Cover Classification in Remote Sensing Imagery [PDF] 返回目录
  Xueqing Deng, Yi Zhu, Yuxin Tian, Shawn Newsam
Abstract: Land-cover classification using remote sensing imagery is an important Earth observation task. Recently, land cover classification has benefited from the development of fully connected neural networks for semantic segmentation. The benchmark datasets available for training deep segmentation models in remote sensing imagery tend to be small, however, often consisting of only a handful of images from a single location with a single scale. This limits the models' ability to generalize to other datasets. Domain adaptation has been proposed to improve the models' generalization but we find these approaches are not effective for dealing with the scale variation commonly found between remote sensing image collections. We therefore propose a scale aware adversarial learning framework to perform joint cross-location and cross-scale land-cover classification. The framework has a dual discriminator architecture with a standard feature discriminator as well as a novel scale discriminator. We also introduce a scale attention module which produces scale-enhanced features. Experimental results show that the proposed framework outperforms state-of-the-art domain adaptation methods by a large margin.
摘要:利用遥感影像进行土地覆盖分类是一项重要的地球观测任务。最近,土地覆盖分类受益于用于语义分割的全连接神经网络的发展。可用于在遥感影像中训练深度分割模型的基准数据集往往很小,但通常只包含来自单一位置,单一比例的少量图像。这限制了模型推广到其他数据集的能力。已经提出了领域自适应来改善模型的通用性,但是我们发现这些方法对于处理通常在遥感图像集合之间发现的尺度变化是无效的。因此,我们提出了一个具有规模意识的对抗性学习框架,以进行联合交叉定位和跨尺度土地覆盖分类。该框架具有双重鉴别器体系结构,其中包含标准特征鉴别器和新颖的比例鉴别器。我们还介绍了产生规模增强功能的规模注意模块。实验结果表明,所提出的框架在很大程度上优于最新的领域自适应方法。

35. Cost Sensitive Optimization of Deepfake Detector [PDF] 返回目录
  Ivan Kukanov, Janne Karttunen, Hannu Sillanpää, Ville Hautamäki
Abstract: Since the invention of cinema, the manipulated videos have existed. But generating manipulated videos that can fool the viewer has been a time-consuming endeavor. With the dramatic improvements in the deep generative modeling, generating believable looking fake videos has become a reality. In the present work, we concentrate on the so-called deepfake videos, where the source face is swapped with the targets. We argue that deepfake detection task should be viewed as a screening task, where the user, such as the video streaming platform, will screen a large number of videos daily. It is clear then that only a small fraction of the uploaded videos are deepfakes, so the detection performance needs to be measured in a cost-sensitive way. Preferably, the model parameters also need to be estimated in the same way. This is precisely what we propose here.
摘要:自从电影院发明以来,操纵视频就已经存在。 但是,制作可以欺骗观众的可操纵视频是一项耗时的工作。 随着深度生成模型的巨大改进,生成逼真的假视频已成为现实。 在当前的工作中,我们专注于所谓的Deepfake视频,其中源面部与目标互换。 我们认为Deepfake检测任务应被视为一项筛选任务,在该任务中,用户(例如视频流平台)每天将筛选大量视频。 显然,上传的视频中只有一小部分是伪造的,因此需要以对成本敏感的方式来衡量检测性能。 优选地,还需要以相同的方式估计模型参数。 这正是我们在这里提出的。
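
The abstract argues that deepfake screening should be measured (and the detector tuned) in a cost-sensitive way under a small prior of fakes, but no formula is given here. A common way to formalize this is an expected-cost criterion in the style of a detection cost function; the sketch below is such an interpretation, with an assumed prior p_fake=0.01 and assumed miss/false-alarm costs rather than values from the paper.

    import numpy as np

    def expected_cost(scores, labels, threshold,
                      p_fake=0.01, c_miss=10.0, c_false_alarm=1.0):
        """Expected screening cost at a given decision threshold.
        Assumes higher score = more likely fake; labels: 1 = fake, 0 = real.
        The prior p_fake and the two costs are illustrative, not from the paper."""
        decisions = scores >= threshold
        p_miss = np.mean(~decisions[labels == 1])    # fakes we let through
        p_fa = np.mean(decisions[labels == 0])       # real videos flagged
        return p_fake * c_miss * p_miss + (1 - p_fake) * c_false_alarm * p_fa

    # pick the operating point that minimizes the expected cost on held-out data
    scores = np.array([0.9, 0.2, 0.75, 0.4, 0.1, 0.85])
    labels = np.array([1,   0,   1,    0,   0,   1])
    thresholds = np.linspace(0, 1, 101)
    costs = [expected_cost(scores, labels, t) for t in thresholds]
    print("best threshold:", thresholds[int(np.argmin(costs))])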

36. VAE-Info-cGAN: Generating Synthetic Images by Combining Pixel-level and Feature-level Geospatial Conditional Inputs [PDF] 返回目录
  Xuerong Xiao, Swetava Ganguli, Vipul Pandey
Abstract: Training robust supervised deep learning models for many geospatial applications of computer vision is difficult due to dearth of class-balanced and diverse training data. Conversely, obtaining enough training data for many applications is financially prohibitive or may be infeasible, especially when the application involves modeling rare or extreme events. Synthetically generating data (and labels) using a generative model that can sample from a target distribution and exploit the multi-scale nature of images can be an inexpensive solution to address scarcity of labeled data. Towards this goal, we present a deep conditional generative model, called VAE-Info-cGAN, that combines a Variational Autoencoder (VAE) with a conditional Information Maximizing Generative Adversarial Network (InfoGAN), for synthesizing semantically rich images simultaneously conditioned on a pixel-level condition (PLC) and a macroscopic feature-level condition (FLC). Dimensionally, the PLC can only vary in the channel dimension from the synthesized image and is meant to be a task-specific input. The FLC is modeled as an attribute vector in the latent space of the generated image which controls the contributions of various characteristic attributes germane to the target distribution. An interpretation of the attribute vector to systematically generate synthetic images by varying a chosen binary macroscopic feature is explored. Experiments on a GPS trajectories dataset show that the proposed model can accurately generate various forms of spatio-temporal aggregates across different geographic locations while conditioned only on a raster representation of the road network. The primary intended application of the VAE-Info-cGAN is synthetic data (and label) generation for targeted data augmentation for computer vision-based modeling of problems relevant to geospatial analysis and remote sensing.
摘要:由于缺乏类别平衡且多样化的训练数据,很难为计算机视觉的许多地理空间应用训练鲁棒的监督式深度学习模型。相反,为许多应用程序获取足够的训练数据在经济上是难以承受的,或者可能是不可行的,尤其是在应用涉及对罕见或极端事件进行建模时。使用可从目标分布中采样并利用图像多尺度性质的生成模型来合成数据(和标签),对于解决标注数据的稀缺性而言,可能是一种廉价的解决方案。为了实现这一目标,我们提出了一种称为VAE-Info-cGAN的深层条件生成模型,该模型将变分自动编码器(VAE)与条件信息最大化生成对抗网络(InfoGAN)结合在一起,以同时以像素级条件(PLC)和宏观特征级条件(FLC)为条件来合成语义丰富的图像。在维度上,PLC仅能与合成图像在通道维度上不同,并且是特定于任务的输入。FLC被建模为生成图像潜在空间中的属性向量,该属性向量控制与目标分布密切相关的各种特征属性的贡献。我们还探索了对该属性向量的一种解释,即通过改变所选的二值宏观特征来系统地生成合成图像。GPS轨迹数据集上的实验表明,所提出的模型可以在仅以道路网络的栅格表示为条件的情况下,准确生成不同地理位置的各种形式的时空聚合。VAE-Info-cGAN的主要预期应用是生成合成数据(和标签),用于有针对性的数据增强,以支持与地理空间分析和遥感相关问题的基于计算机视觉的建模。

37. Multi-modal Visual Tracking: Review and Experimental Comparison [PDF] 返回目录
  Pengyu Zhang, Dong Wang, Huchuan Lu
Abstract: Visual object tracking, as a fundamental task in computer vision, has drawn much attention in recent years. To extend trackers to a wider range of applications, researchers have introduced information from multiple modalities to handle specific scenes, which is a promising research prospect with emerging methods and benchmarks. To provide a thorough review of multi-modal track-ing, we summarize the multi-modal tracking algorithms, especially visible-depth (RGB-D) tracking and visible-thermal (RGB-T) tracking in a unified taxonomy from different aspects. Second, we provide a detailed description of the related benchmarks and challenges. Furthermore, we conduct extensive experiments to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, we discuss various future directions from different perspectives, including model design and dataset construction for further research.
摘要:视觉对象跟踪作为计算机视觉中的一项基本任务,近年来引起了很多关注。 为了将跟踪器扩展到更广泛的应用范围,研究人员引入了来自多种模式的信息来处理特定的场景,这是新兴方法和基准的有前途的研究前景。 为了全面回顾多模式跟踪,我们从不同方面总结了多模式跟踪算法,特别是在统一分类法中的可见深度(RGB-D)跟踪和可见热(RGB-T)跟踪。 其次,我们提供了有关基准和挑战的详细描述。 此外,我们进行了广泛的实验,以分析跟踪器在五个数据集上的有效性:PTB,VOT19-RGBD,GTOT,RGBT234和VOT19-RGBT。 最后,我们从不同的角度讨论了各种未来的方向,包括模型设计和数据集构建以供进一步研究。

38. Weakly-Supervised Cross-Domain Adaptation for Endoscopic Lesions Segmentation [PDF] 返回目录
  Jiahua Dong, Yang Cong, Gan Sun, Yunsheng Yang, Xiaowei Xu, Zhengming Ding
Abstract: Weakly-supervised learning has attracted growing research attention on medical lesions segmentation due to significant saving in pixel-level annotation cost. However, 1) most existing methods require effective prior and constraints to explore the intrinsic lesions characterization, which only generates incorrect and rough prediction; 2) they neglect the underlying semantic dependencies among weakly-labeled target enteroscopy diseases and fully-annotated source gastroscope lesions, while forcefully utilizing untransferable dependencies leads to the negative performance. To tackle above issues, we propose a new weakly-supervised lesions transfer framework, which can not only explore transferable domain-invariant knowledge across different datasets, but also prevent the negative transfer of untransferable representations. Specifically, a Wasserstein quantified transferability framework is developed to highlight widerange transferable contextual dependencies, while neglecting the irrelevant semantic characterizations. Moreover, a novel selfsupervised pseudo label generator is designed to equally provide confident pseudo pixel labels for both hard-to-transfer and easyto-transfer target samples. It inhibits the enormous deviation of false pseudo pixel labels under the self-supervision manner. Afterwards, dynamically-searched feature centroids are aligned to narrow category-wise distribution shift. Comprehensive theoretical analysis and experiments show the superiority of our model on the endoscopic dataset and several public datasets.
摘要:由于大大节省了像素级注释成本,弱监督学习吸引了越来越多的研究关注医学病变分割。然而,1)大多数现有方法需要有效的先验条件和约束条件来探索内在病变的特征,这只会产生不正确和粗略的预测; 2)他们忽略了弱标签目标肠镜疾病和完整注释源胃镜病变之间潜在的语义依赖性,而强行利用不可转移的依赖性导致负面表现。为了解决上述问题,我们提出了一种新的弱监督病变转移框架,该框架不仅可以探索不同数据集之间可转移的领域不变知识,而且可以防止不可转移表示的负面转移。具体而言,开发了Wasserstein量化的可传递性框架,以突出显示广泛的可传递上下文相关性,同时忽略不相关的语义特征。此外,一种新颖的自我监督伪标签生成器被设计为为难以转移和易于转移的目标样本均等地提供可靠的伪像素标签。它以自监督的方式抑制了伪伪像素标签的巨大偏差。然后,将动态搜索的特征质心对齐以缩小类别范围的分布偏移。全面的理论分析和实验表明,我们的模型在内窥镜数据集和一些公共数据集上具有优越性。

39. Learning Independent Instance Maps for Crowd Localization [PDF] 返回目录
  Junyu Gao, Tao Han, Yuan Yuan, Qi Wang
Abstract: Accurately locating each head's position in the crowd scenes is a crucial task in the field of crowd analysis. However, traditional density-based methods only predict coarse prediction, and segmentation/detection-based methods cannot handle extremely dense scenes and large-range scale-variations crowds. To this end, we propose an end-to-end and straightforward framework for crowd localization, named Independent Instance Map segmentation (IIM). Different from density maps and boxes regression, each instance in IIM is non-overlapped. By segmenting crowds into independent connected components, the positions and the crowd counts (the centers and the number of components, respectively) are obtained. Furthermore, to improve the segmentation quality for different density regions, we present a differentiable Binarization Module (BM) to output structured instance maps. BM brings two advantages into localization models: 1) adaptively learn a threshold map for different images to detect each instance more accurately; 2) directly train the model using loss on binary predictions and labels. Extensive experiments verify the proposed method is effective and outperforms the-state-of-the-art methods on the five popular crowd datasets. Significantly, IIM improves F1-measure by 10.4\% on the NWPU-Crowd Localization task. The source code and pre-trained models will be released at \url{this https URL}.
摘要:在人群场景中准确定位每个头的位置是人群分析领域的一项关键任务。但是,传统的基于密度的方法只能预测粗略的预测,而基于分段/检测的方法不能处理极其密集的场景和大范围的比例变化人群。为此,我们提出了一个用于人群定位的端到端,直接的框架,称为独立实例映射分段(IIM)。与密度图和盒子回归不同,IIM中的每个实例都是不重叠的。通过将人群分为独立的相连组件,可以获得位置和人群计数(分别为组件的中心和数量)。此外,为了提高不同密度区域的分割质量,我们提出了可微分的二值化模块(BM)以输出结构化实例图。 BM为定位模型带来了两个优点:1)自适应地学习不同图像的阈值图,以更准确地检测每个实例; 2)使用损失的二进制预测和标签直接训练模型。大量实验证明了该方法是有效的,并且在五个流行的人群数据集上均优于最新方法。值得注意的是,IIM在NWPU人群本地化任务上将F1度量提高了10.4%。源代码和经过预训练的模型将在\ url {此https URL}上发布。
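
The Binarization Module is only described as differentiable and as taking a learned threshold map; a standard way to make per-pixel thresholding differentiable is a steep sigmoid of the difference between the confidence map and the threshold map, which is what this PyTorch sketch shows. The amplification factor k=50 and the tiny two-layer trunk are illustrative assumptions, not the IIM architecture.

    import torch
    import torch.nn as nn

    class DifferentiableBinarization(nn.Module):
        """Soft thresholding: B = sigmoid(k * (P - T)).
        P is a predicted confidence map, T a per-pixel threshold map; a large k
        makes the output close to binary while keeping gradients usable."""
        def __init__(self, k=50.0):
            super().__init__()
            self.k = k

        def forward(self, confidence, threshold):
            return torch.sigmoid(self.k * (confidence - threshold))

    # toy usage: a shared trunk predicts both maps, then they are binarized
    trunk = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 2, 1), nn.Sigmoid())   # 2 maps in [0,1]
    binarize = DifferentiableBinarization()
    image = torch.rand(1, 3, 64, 64)
    conf, thr = trunk(image).split(1, dim=1)
    instance_map = binarize(conf, thr)      # near-binary, still differentiable
    loss = nn.functional.binary_cross_entropy(instance_map,
                                              torch.zeros_like(instance_map))
    loss.backward()                         # gradients flow through the threshold map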

40. Learning Portrait Style Representations [PDF] 返回目录
  Sadat Shaik, Bernadette Bucher, Nephele Agrafiotis, Stephen Phillips, Kostas Daniilidis, William Schmenner
Abstract: Style analysis of artwork in computer vision predominantly focuses on achieving results in target image generation through optimizing understanding of low level style characteristics such as brush strokes. However, fundamentally different techniques are required to computationally understand and control qualities of art which incorporate higher level style characteristics. We study style representations learned by neural network architectures incorporating these higher level characteristics. We find variation in learned style features from incorporating triplets annotated by art historians as supervision for style similarity. Networks leveraging statistical priors or pretrained on photo collections such as ImageNet can also derive useful visual representations of artwork. We align the impact of these expert human knowledge, statistical, and photo realism priors on style representations with art historical research and use these representations to perform zero-shot classification of artists. To facilitate this work, we also present the first large-scale dataset of portraits prepared for computational analysis.
摘要:计算机视觉中艺术品的样式分析主要集中在通过优化对低级样式特征(例如笔触)的理解来实现目标图像生成结果。但是,从根本上来说,需要不同的技术来以计算的方式理解和控制结合了较高层次风格特征的艺术品质。我们研究结合了这些更高层次特征的神经网络体系结构所学习的样式表示。通过引入由艺术史学家标注的三元组(triplets)作为样式相似性的监督,我们发现学习到的样式特征有所不同。利用统计先验或在照片集(例如ImageNet)上进行预训练的网络也可以得出艺术品的有用视觉表示。我们将这些专家人类知识、统计和照片写实先验对风格表征的影响与艺术史研究相对照,并使用这些表征对艺术家进行零样本分类。为了促进这项工作,我们还展示了第一个为计算分析准备的大规模肖像数据集。

41. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection [PDF] 返回目录
  Qi Ming, Zhiqiang Zhou, Lingjuan Miao, Hongwei Zhang, Linhao Li
Abstract: Arbitrary-oriented objects widely appear in natural scenes, aerial photographs, remote sensing images, etc., thus arbitrary-oriented object detection has received considerable attention. Many current rotation detectors use plenty of anchors with different orientations to achieve spatial alignment with ground truth boxes, then Intersection-over-Union (IoU) is applied to sample the positive and negative candidates for training. However, we observe that the selected positive anchors cannot always ensure accurate detections after regression, while some negative samples can achieve accurate localization. It indicates that the quality assessment of anchors through IoU is not appropriate, and this further lead to inconsistency between classification confidence and localization accuracy. In this paper, we propose a dynamic anchor learning (DAL) method, which utilizes the newly defined matching degree to comprehensively evaluate the localization potential of the anchors and carry out a more efficient label assignment process. In this way, the detector can dynamically select high-quality anchors to achieve accurate object detection, and the divergence between classification and regression will be alleviated. With the newly introduced DAL, we achieve superior detection performance for arbitrary-oriented objects with only a few horizontal preset anchors. Experimental results on three remote sensing datasets HRSC2016, DOTA, UCAS-AOD as well as a scene text dataset ICDAR 2015 show that our method achieves substantial improvement compared with the baseline model. Besides, our approach is also universal for object detection using horizontal bound box. The code and models are available at this https URL.
摘要:面向对象的物体广泛出现在自然场景,航拍照片,遥感图像等中,因此面向对象的检测受到了广泛的关注。许多当前的旋转检测器使用大量具有不同方向的锚来实现与地面真值框的空间对齐,然后应用“联合上方交叉点”(IoU)对正负候选对象进行采样以进行训练。但是,我们观察到,选定的正锚并不总是能够确保回归后的准确检测,而某些负样本可以实现准确的定位。这表明通过IoU对锚进行质量评估是不合适的,这进一步导致了分类置信度和定位精度之间的不一致。在本文中,我们提出了一种动态锚学习(DAL)方法,该方法利用新定义的匹配度来全面评估锚的定位潜力,并执行更有效的标签分配过程。这样,检测器可以动态选择高质量的锚点以实现准确的对象检测,并且将减轻分类和回归之间的差异。借助新推出的DAL,我们仅需几个水平预设锚点即可实现针对任意方向物体的出色检测性能。在三个遥感数据集HRSC2016,DOTA,UCAS-AOD以及场景文本数据集ICDAR 2015上的实验结果表明,与基准模型相比,我们的方法取得了实质性的改进。此外,我们的方法对于使用水平装订框的对象检测也是通用的。可以在此https URL上找到代码和模型。

42. Performance Analysis of Keypoint Detectors and Binary Descriptors Under Varying Degrees of Photometric and Geometric Transformations [PDF] 返回目录
  Shuvo Kumar Paul, Pourya Hoseini, Mircea Nicolescu, Monica Nicolescu
Abstract: Detecting image correspondences by feature matching forms the basis of numerous computer vision applications. Several detectors and descriptors have been presented in the past, addressing the efficient generation of features from interest points (keypoints) in an image. In this paper, we investigate eight binary descriptors (AKAZE, BoostDesc, BRIEF, BRISK, FREAK, LATCH, LUCID, and ORB) and eight interest point detector (AGAST, AKAZE, BRISK, FAST, HarrisLapalce, KAZE, ORB, and StarDetector). We have decoupled the detection and description phase to analyze the interest point detectors and then evaluate the performance of the pairwise combination of different detectors and descriptors. We conducted experiments on a standard dataset and analyzed the comparative performance of each method under different image transformations. We observed that: (1) the FAST, AGAST, ORB detectors were faster and detected more keypoints, (2) the AKAZE and KAZE detectors performed better under photometric changes while ORB was more robust against geometric changes, (3) in general, descriptors performed better when paired with the KAZE and AKAZE detectors, (4) the BRIEF, LUCID, ORB descriptors were relatively faster, and (5) none of the descriptors did particularly well under geometric transformations, only BRISK, FREAK, and AKAZE showed reasonable resiliency.
摘要:通过特征匹配检测图像对应关系构成了许多计算机视觉应用程序的基础。过去已经提出了几种检测器和描述符,以解决从图像中的兴趣点(关键点)有效生成特征的问题。在本文中,我们研究了八个二进制描述符(AKAZE,BoostDesc,BRIEF,BRISK,FREAK,LATCH,LUCID和ORB)和八个兴趣点检测器(AGAST,AKAZE,BRISK,FAST,HarrisLapalce,KAZE,ORB和StarDetector) 。我们已将检测和描述阶段解耦以分析兴趣点检测器,然后评估不同检测器和描述符的成对组合的性能。我们在标准数据集上进行了实验,并分析了每种方法在不同图像变换下的比较性能。我们观察到:(1)FAST,AGAST,ORB检测器速度更快并且检测到更多关键点;(2)AKAZE和KAZE检测器在光度变化下表现更好,而ORB对几何变化更鲁棒;(3)通常,描述子与KAZE和AKAZE检测器配对时,性能更好;(4)Brief,LUCID,ORB描述符相对较快;(5)在几何变换下,没有一个描述符表现特别好,只有BRISK,FREAK和AKAZE表现出合理的弹性。
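
Since the study decouples detection from description and then evaluates pairwise combinations, a minimal version of that evaluation loop can be written with OpenCV, as sketched below. Only methods shipped in the main cv2 module are instantiated here (BoostDesc, FREAK, LATCH and LUCID live in opencv-contrib); the random stand-in image and the incompatibility handling are assumptions, and no timing or transformation benchmarking is included.

    import cv2
    import numpy as np

    detectors = {
        "FAST": cv2.FastFeatureDetector_create(),
        "AGAST": cv2.AgastFeatureDetector_create(),
        "ORB": cv2.ORB_create(),
        "AKAZE": cv2.AKAZE_create(),
        "BRISK": cv2.BRISK_create(),
    }
    descriptors = {
        "ORB": cv2.ORB_create(),
        "AKAZE": cv2.AKAZE_create(),   # note: AKAZE can only describe AKAZE/KAZE keypoints
        "BRISK": cv2.BRISK_create(),
    }

    img = np.random.randint(0, 255, (480, 640), dtype=np.uint8)  # stand-in image

    for det_name, det in detectors.items():
        keypoints = det.detect(img, None)                     # detection only
        for desc_name, desc in descriptors.items():
            try:
                kps, vectors = desc.compute(img, keypoints)   # description only
            except cv2.error:
                continue                                      # incompatible pairing
            print(f"{det_name}+{desc_name}: {len(kps)} keypoints described")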

43. Parameter Efficient Multimodal Transformers for Video Representation Learning [PDF] 返回目录
  Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song
Abstract: The recent success of Transformers in the language domain has motivated adapting it to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements from Transformers, existing work typically fixes the language model and train only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the weights of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces parameters up to 80$\%$, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
摘要:Transformer在语言领域的最新成功促使人们将其适配到多模态设置中,即在已预训练好的语言模型的基础上同时训练一个新的视觉模型。但是,由于Transformer对内存的过高需求,现有工作通常固定语言模型而仅训练视觉模块,这限制了模型以端到端方式学习跨模态信息的能力。在这项工作中,我们着重于在视听视频表示学习的背景下减少多模态Transformer的参数。我们通过跨层和跨模态共享Transformer的权重来减轻高内存需求;我们将Transformer分解为模态特定和模态共享的部分,以便模型既能单独又能联合地学习每种模态的动态,并提出一种基于低秩逼近的新颖参数共享方案。我们证明了我们的方法最多可将参数减少80%,从而使我们能够从头开始对模型进行端到端训练。我们还提出了一种负采样方法,其基于在模型与Transformer共同学习的CNN嵌入空间上度量的实例相似度。为了演示我们的方法,我们在Kinetics-700的30秒剪辑上对模型进行了预训练,并将其迁移到视听分类任务中。
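
Two of the parameter-reduction ideas in the abstract, reusing one transformer layer across depth and modalities and replacing full weight matrices with low-rank factorizations, can be sketched in PyTorch as below. The layer sizes, the rank, and the way the shared layer is applied to both audio and video tokens are assumptions for illustration, not the paper's modality-specific/modality-shared decomposition.

    import torch
    import torch.nn as nn

    class LowRankLinear(nn.Module):
        """W is approximated as U @ V with rank r << min(in, out); this cuts
        parameters from in*out to roughly r*(in+out)."""
        def __init__(self, in_dim, out_dim, rank=16):
            super().__init__()
            self.U = nn.Linear(in_dim, rank, bias=False)
            self.V = nn.Linear(rank, out_dim)

        def forward(self, x):
            return self.V(self.U(x))

    class SharedLayerEncoder(nn.Module):
        """One transformer layer reused at every depth and for both modalities,
        instead of num_layers separate layers per modality."""
        def __init__(self, dim=256, num_layers=6):
            super().__init__()
            self.shared_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                           batch_first=True)
            self.num_layers = num_layers

        def forward(self, tokens):
            for _ in range(self.num_layers):
                tokens = self.shared_layer(tokens)   # same weights at every depth
            return tokens

    encoder = SharedLayerEncoder()
    audio_tokens = torch.randn(2, 40, 256)
    video_tokens = torch.randn(2, 60, 256)
    out_a, out_v = encoder(audio_tokens), encoder(video_tokens)  # weights shared across modalities
    proj = LowRankLinear(256, 256)
    print(out_a.shape, proj(out_v).shape)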

44. SuperFront: From Low-resolution to High-resolution Frontal Face Synthesis [PDF] 返回目录
  Yu Yin, Joseph P. Robinson, Songyao Jiang, Yue Bai, Can Qin, Yun Fu
Abstract: Advances in face rotation, along with other face-based generative tasks, are more frequent as we advance further in topics of deep learning. Even as impressive milestones are achieved in synthesizing faces, the importance of preserving identity is needed in practice and should not be overlooked. Also, the difficulty should not be more for data with obscured faces, heavier poses, and lower quality. Existing methods tend to focus on samples with variation in pose, but with the assumption data is high in quality. We propose a generative adversarial network (GAN) -based model to generate high-quality, identity preserving frontal faces from one or multiple low-resolution (LR) faces with extreme poses. Specifically, we propose SuperFront-GAN (SF-GAN) to synthesize a high-resolution (HR), frontal face from one-to-many LR faces with various poses and with the identity-preserved. We integrate a super-resolution (SR) side-view module into SF-GAN to preserve identity information and fine details of the side-views in HR space, which helps model reconstruct high-frequency information of faces (i.e., periocular, nose, and mouth regions). Moreover, SF-GAN accepts multiple LR faces as input, and improves each added sample. We squeeze additional gain in performance with an orthogonal constraint in the generator to penalize redundant latent representations and, hence, diversify the learned features space. Quantitative and qualitative results demonstrate the superiority of SF-GAN over others.
摘要:随着我们在深度学习主题上的进一步发展,面部旋转的进展以及其他基于面部的生成任务变得更加频繁。即使在合成人脸方面达到了令人印象深刻的里程碑,在实践中保持身份的重要性仍然是必需的,不应被忽视。同样,对于脸部模糊,姿势较重且质量较低的数据,难度不应更大。现有方法倾向于集中于姿势变化的样本,但是假设数据质量很高。我们提出了一种基于生成对抗网络(GAN)的模型,可以从具有极端姿势的一个或多个低分辨率(LR)面孔生成高质量的,保留身份的正面面孔。具体来说,我们建议使用SuperFront-GAN(SF-GAN)从具有各种姿势和身份保留的一对多LR面孔合成高分辨率(HR)正面。我们将超分辨率(SR)侧视图模块集成到SF-GAN中,以保留HR空间中的身份信息和侧视图的精细细节,从而有助于对面部(即眼周,鼻子,和嘴巴区域)。此外,SF-GAN接受多个LR面作为输入,并改进每个添加的样本。我们通过在生成器中使用正交约束来挤压性能的额外增益,以惩罚冗余的潜在表示,从而使学习到的特征空间多样化。定量和定性结果证明了SF-GAN在其他方面的优越性。

45. Deformable Gabor Feature Networks for Biomedical Image Classification [PDF] 返回目录
  Xuan Gong, Xin Xia, Wentao Zhu, Baochang Zhang, David Doermann, Lian Zhuo
Abstract: In recent years, deep learning has dominated progress in the field of medical image analysis. We find however, that the ability of current deep learning approaches to represent the complex geometric structures of many medical images is insufficient. One limitation is that deep learning models require a tremendous amount of data, and it is very difficult to obtain a sufficient amount with the necessary detail. A second limitation is that there are underlying features of these medical images that are well established, but the black-box nature of existing convolutional neural networks (CNNs) do not allow us to exploit them. In this paper, we revisit Gabor filters and introduce a deformable Gabor convolution (DGConv) to expand deep networks interpretability and enable complex spatial variations. The features are learned at deformable sampling locations with adaptive Gabor convolutions to improve representativeness and robustness to complex objects. The DGConv replaces standard convolutional layers and is easily trained end-to-end, resulting in deformable Gabor feature network (DGFN) with few additional parameters and minimal additional training cost. We introduce DGFN for addressing deep multi-instance multi-label classification on the INbreast dataset for mammograms and on the ChestX-ray14 dataset for pulmonary x-ray images.
摘要:近年来,深度学习在医学图像分析领域占据了主导地位。但是,我们发现,当前的深度学习方法代表许多医学图像的复杂几何结构的能力不足。局限性之一是深度学习模型需要大量数据,而且很难获得足够数量的必要细节。第二个局限性是这些医学图像的基本特征已经被很好地建立,但是现有卷积神经网络(CNN)的黑盒性质不允许我们利用它们。在本文中,我们将重新审视Gabor滤波器,并引入可变形Gabor卷积(DGConv)以扩展深层网络的可解释性并实现复杂的空间变化。通过使用自适应Gabor卷积在可变形的采样位置学习特征,以提高对复杂对象的代表性和鲁棒性。 DGConv取代了标准的卷积层,并且易于端到端训练,从而产生了变形的Gabor特征网络(DGFN),它具有很少的附加参数和最小的附加训练成本。我们在乳腺X线照片的INbreast数据集和肺部X射线图像的ChestX-ray14数据集上引入DGFN,以解决深层多实例多标签分类问题。
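
The deformable Gabor convolution itself combines Gabor structure with deformable sampling, which would need a deformable-convolution operator; the sketch below illustrates only the Gabor side, building a fixed orientation bank and using it as convolution weights. The kernel size, wavelength and the four orientations are assumed values, and making the Gabor parameters learnable (as adaptive Gabor convolutions would require) is left out.

    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def gabor_bank(ksize=7, sigma=2.0, lambd=4.0, gamma=0.5, n_orientations=4):
        """Return an (n_orientations, 1, ksize, ksize) tensor of Gabor kernels."""
        kernels = []
        for i in range(n_orientations):
            theta = np.pi * i / n_orientations
            y, x = np.mgrid[-(ksize // 2):ksize // 2 + 1, -(ksize // 2):ksize // 2 + 1]
            x_t = x * np.cos(theta) + y * np.sin(theta)
            y_t = -x * np.sin(theta) + y * np.cos(theta)
            g = np.exp(-(x_t ** 2 + gamma ** 2 * y_t ** 2) / (2 * sigma ** 2)) \
                * np.cos(2 * np.pi * x_t / lambd)
            kernels.append(g)
        return torch.tensor(np.stack(kernels)[:, None], dtype=torch.float32)

    class GaborConv(nn.Module):
        """Fixed Gabor filter bank applied to a single-channel input; the rest
        of the network stays learnable."""
        def __init__(self):
            super().__init__()
            self.register_buffer("weight", gabor_bank())

        def forward(self, x):
            return F.conv2d(x, self.weight, padding=self.weight.shape[-1] // 2)

    layer = GaborConv()
    out = layer(torch.rand(1, 1, 64, 64))
    print(out.shape)   # (1, 4, 64, 64): one response map per orientation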

46. A New Window Loss Function for Bone Fracture Detection and Localization in X-ray Images with Point-based Annotation [PDF] 返回目录
  Xinyu Zhang, Yirui Wang, Chi-Tung Cheng, Le Lu, Jing Xiao, Chien-Hung Liao, Shun Miao
Abstract: Object detection methods are widely adopted for computer-aided diagnosis using medical images. Anomalous findings are usually treated as objects that are described by bounding boxes. Yet, many pathological findings, e.g., bone fractures, cannot be clearly defined by bounding boxes, owing to considerable instance, shape and boundary ambiguities. This makes bounding box annotations, and their associated losses, highly ill-suited. In this work, we propose a new bone fracture detection method for X-ray images, based on a labor effective and flexible annotation scheme suitable for abnormal findings with no clear object-level spatial extents or boundaries. Our method employs a simple, intuitive, and informative point-based annotation protocol to mark localized pathology information. To address the uncertainty in the fracture scales annotated via point(s), we convert the annotations into pixel-wise supervision that uses lower and upper bounds with positive, negative, and uncertain regions. A novel Window Loss is subsequently proposed to only penalize the predictions outside of the uncertain regions. Our method has been extensively evaluated on 4410 pelvic X-ray images of unique patients. Experiments demonstrate that our method outperforms previous state-of-the-art image classification and object detection baselines by healthy margins, with an AUROC of 0.983 and FROC score of 89.6%.
摘要:目标检测方法被广泛用于医学图像的计算机辅助诊断。异常发现通常被视为由边界框描述的对象。然而,由于实例,形状和边界的含糊不清,许多病理学发现,例如骨折,不能通过边界框清楚地定义。这使得边界框注释及其相关损失非常不适合。在这项工作中,我们提出了一种适用于X射线图像的新型骨折检测方法,该方法基于劳动有效且灵活的注释方案,适用于没有明确的对象级空间范围或边界的异常发现。我们的方法采用了一种简单,直观且信息丰富的基于点的注释协议来标记局部病理信息。为了解决通过点注释的裂缝尺度的不确定性,我们将注释转换为像素级监督,该监督使用具有正,负和不确定区域的上下限。随后提出一种新颖的窗口损失,仅对不确定区域之外的预测进行惩罚。我们的方法已在4410例独特患者的骨盆X射线图像上得到了广泛评估。实验表明,我们的方法在健康边缘方面优于以前的最新图像分类和对象检测基准,AUROC为0.983,FROC得分为89.6%。
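
The Window Loss is described as penalizing only predictions that fall outside per-pixel lower/upper bounds derived from the point annotations. A direct way to write such a loss is shown below; the quadratic penalty outside the window is an assumption (the paper may use a different penalty shape), and the bounds here are toy values.

    import torch

    def window_loss(pred, lower, upper):
        """Zero loss while lower <= pred <= upper, quadratic penalty outside.
        pred, lower, upper: tensors of the same shape (e.g. per-pixel maps)."""
        below = torch.clamp(lower - pred, min=0)   # how far under the lower bound
        above = torch.clamp(pred - upper, min=0)   # how far over the upper bound
        return (below ** 2 + above ** 2).mean()

    # toy example: a positive pixel wants pred in [0.6, 1], a negative one in
    # [0, 0.2], and an uncertain pixel gets the full [0, 1] window (no gradient).
    pred = torch.tensor([0.1, 0.5, 0.9], requires_grad=True)
    lower = torch.tensor([0.6, 0.0, 0.0])
    upper = torch.tensor([1.0, 1.0, 0.2])
    loss = window_loss(pred, lower, upper)
    loss.backward()
    print(loss.item(), pred.grad)   # only the 1st and 3rd entries receive gradient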

47. Semantic and Geometric Modeling with Neural Message Passing in 3D Scene Graphs for Hierarchical Mechanical Search [PDF] 返回目录
  Andrey Kurenkov, Roberto Martín-Martín, Jeff Ichnowski, Ken Goldberg, Silvio Savarese
Abstract: Searching for objects in indoor organized environments such as homes or offices is part of our everyday activities. When looking for a target object, we jointly reason about the rooms and containers the object is likely to be in; the same type of container will have a different probability of having the target depending on the room it is in. We also combine geometric and semantic information to infer what container is best to search, or what other objects are best to move, if the target object is hidden from view. We propose to use a 3D scene graph representation to capture the hierarchical, semantic, and geometric aspects of this problem. To exploit this representation in a search process, we introduce Hierarchical Mechanical Search (HMS), a method that guides an agent's actions towards finding a target object specified with a natural language description. HMS is based on a novel neural network architecture that uses neural message passing of vectors with visual, geometric, and linguistic information to allow HMS to reason across layers of the graph while combining semantic and geometric cues. HMS is evaluated on a novel dataset of 500 3D scene graphs with dense placements of semantically related objects in storage locations, and is shown to be significantly better than several baselines at finding objects and close to the oracle policy in terms of the median number of actions required. Additional qualitative results can be found at this https URL.
摘要:在室内有组织的环境中(例如家庭或办公室)搜索物体是我们日常活动的一部分。在寻找目标物体时,我们会共同推断该物体可能所在的房间和容器;相同类型的容器根据其所在的房间,包含目标的概率也不同。如果目标物体被遮挡在视野之外,我们还会结合几何和语义信息来推断最适合搜索哪个容器,或者最适合移动哪些其他物体。我们建议使用3D场景图表示法来捕获此问题的层次、语义和几何方面。为了在搜索过程中利用此表示形式,我们引入了层次机械搜索(HMS),该方法指导智能体的动作以查找用自然语言描述指定的目标物体。HMS基于一种新颖的神经网络体系结构,该体系结构使用带有视觉、几何和语言信息的向量的神经消息传递,使HMS能够在结合语义和几何线索的同时跨图的各层进行推理。HMS在一个包含500个3D场景图的新数据集上进行了评估,该数据集在存储位置密集放置了语义相关的物体;结果表明,HMS在查找物体方面显著优于多个基线,并且在所需动作次数的中位数上接近oracle策略。在此https URL上可以找到其他定性结果。

48. Rotation-Invariant Point Convolution With Multiple Equivariant Alignments [PDF] 返回目录
  Hugues Thomas
Abstract: Recent attempts at introducing rotation invariance or equivariance in 3D deep learning approaches have shown promising results, but these methods still struggle to reach the performances of standard 3D neural networks. In this work we study the relation between equivariance and invariance in 3D point convolutions. We show that using rotation-equivariant alignments, it is possible to make any convolutional layer rotation-invariant. Furthermore, we improve this simple alignment procedure by using the alignment themselves as features in the convolution, and by combining multiple alignments together. With this core layer, we design rotation-invariant architectures which improve state-of-the-art results in both object classification and semantic segmentation and reduces the gap between rotation-invariant and standard 3D deep learning approaches.
摘要:最近在3D深度学习方法中引入旋转不变性或等方差的尝试已显示出令人鼓舞的结果,但这些方法仍难以达到标准3D神经网络的性能。 在这项工作中,我们研究了3D点卷积中的等方差与不变性之间的关系。 我们表明,使用等速旋转比对,可以使任何卷积层旋转不变。 此外,我们通过将比对本身用作卷积中的特征,并将多个比对组合在一起,从而改进了此简单的比对过程。 借助这一核心层,我们设计了旋转不变的体系结构,该体系结构改进了对象分类和语义分割方面的最新技术成果,并缩小了旋转不变的方法与标准3D深度学习方法之间的差距。

49. Generating unseen complex scenes: are we there yet? [PDF] 返回目录
  Arantxa Casanova, Michal Drozdzal, Adriana Romero-Soriano
Abstract: Although recent complex scene conditional generation models generate increasingly appealing scenes, it is very hard to assess which models perform better and why. This is often due to models being trained to fit different data splits, and defining their own experimental setups. In this paper, we propose a methodology to compare complex scene conditional generation models, and provide an in-depth analysis that assesses the ability of each model to (1) fit the training distribution and hence perform well on seen conditionings, (2) to generalize to unseen conditionings composed of seen object combinations, and (3) generalize to unseen conditionings composed of unseen object combinations. As a result, we observe that recent methods are able to generate recognizable scenes given seen conditionings, and exploit compositionality to generalize to unseen conditionings with seen object combinations. However, all methods suffer from noticeable image quality degradation when asked to generate images from conditionings composed of unseen object combinations. Moreover, through our analysis, we identify the advantages of different pipeline components, and find that (1) encouraging compositionality through instance-wise spatial conditioning normalizations increases robustness to both types of unseen conditionings, (2) using semantically aware losses such as the scene-graph perceptual similarity helps improve some dimensions of the generation process, and (3) enhancing the quality of generated masks and the quality of the individual objects are crucial steps to improve robustness to both types of unseen conditionings.
摘要:尽管最近的复杂场景条件生成模型生成了越来越有吸引力的场景,但是很难评估哪些模型表现更好以及为什么表现更好。这通常是由于对模型进行了训练以适应不同的数据拆分,并定义了自己的实验设置。在本文中,我们提出了一种方法来比较复杂的场景条件生成模型,并提供了深入的分析,以评估每种模型满足以下条件的能力:(1)拟合训练分布,从而在可见条件下表现良好;(2)归纳为由可见对象组合组成的看不见的条件,(3)归纳为由看不见对象组合组成的不可见条件。结果,我们观察到最近的方法能够在给定可见条件的情况下生成可识别的场景,并利用合成性将可见对象组合概括为看不见的条件。但是,当要求从由看不见的对象组合组成的条件中生成图像时,所有方法都会遭受明显的图像质量下降。此外,通过我们的分析,我们确定了不同管道组件的优势,并发现(1)通过实例化空间条件规范化来鼓励组合性,提高了两种看不见的条件的鲁棒性;(2)使用语义感知的损失(例如场景)图的感知相似性有助于改善生成过程的某些维度,并且(3)增强生成的蒙版的质量和单个对象的质量是提高对这两种看不见的条件的鲁棒性的关键步骤。

50. Learning an Animatable Detailed 3D Face Model from In-The-Wild Images [PDF] 返回目录
  Yao Feng, Haiwen Feng, Michael J. Black, Timo Bolkart
Abstract: While current monocular 3D face reconstruction methods can recover fine geometric details, they suffer several limitations. Some methods produce faces that cannot be realistically animated because they do not model how wrinkles vary with expression. Other methods are trained on high-quality face scans and do not generalize well to in-the-wild images. We present the first approach to jointly learn a model with animatable detail and a detailed 3D face regressor from in-the-wild images that recovers shape details as well as their relationship to facial expressions. Our DECA (Detailed Expression Capture and Animation) model is trained to robustly produce a UV displacement map from a low-dimensional latent representation that consists of person-specific detail parameters and generic expression parameters, while a regressor is trained to predict detail, shape, albedo, expression, pose and illumination parameters from a single image. We introduce a novel detail-consistency loss to disentangle person-specific details and expression-dependent wrinkles. This disentanglement allows us to synthesize realistic person-specific wrinkles by controlling expression parameters while keeping person-specific details unchanged. DECA achieves state-of-the-art shape reconstruction accuracy on two benchmarks. Qualitative results on in-the-wild data demonstrate DECA's robustness and its ability to disentangle identity and expression dependent details enabling animation of reconstructed faces. The model and code are publicly available at this https URL.
摘要:虽然当前的单眼3D人脸重建方法可以恢复精细的几何细节,但它们仍受到一些限制。有些方法产生的脸部无法逼真地动画,因为它们无法模拟皱纹随表情变化的方式。其他方法都经过高质量面部扫描训练,无法很好地推广到野外图像。我们提出了第一种方法,可以从野生图像中共同学习具有可动画细节和详细3D人脸回归模型的模型,以恢复形状细节及其与面部表情的关系。我们的DECA(详细的表情捕获和动画)模型经过训练,可以从低维度的潜在表示中可靠地生成UV位移图,该低维的潜在表示包含特定于人的细节参数和通用表达参数,而回归器则经过训练,可以预测细节,形状,单幅图像的反照率,表情,姿势和照明参数。我们介绍了一种新颖的细节一致性损失,以解开特定于人的细节和表情相关的皱纹。这种解开关系使我们能够通过控制表情参数,同时保持特定于人的细节不变来合成逼真的特定于人的皱纹。 DECA在两个基准上均达到了最新的形状重建精度。关于野生数据的定性结果表明,DECA的鲁棒性及其分解身份和依赖表情的细节的能力使重建的面孔可以动画化。该模型和代码可从此https URL公开获得。

51. GenScan: A Generative Method for Populating Parametric 3D Scan Datasets [PDF] 返回目录
  Mohammad Keshavarzi, Oladapo Afolabi, Luisa Caldas, Allen Y. Yang, Avideh Zakhor
Abstract: The availability of rich 3D datasets corresponding to the geometrical complexity of the built environments is considered an ongoing challenge for 3D deep learning methodologies. To address this challenge, we introduce GenScan, a generative system that populates synthetic 3D scan datasets in a parametric fashion. The system takes an existing captured 3D scan as an input and outputs alternative variations of the building layout including walls, doors, and furniture with corresponding textures. GenScan is a fully automated system that can also be manually controlled by a user through an assigned user interface. Our proposed system utilizes a combination of a hybrid deep neural network and a parametrizer module to extract and transform elements of a given 3D scan. GenScan takes advantage of style transfer techniques to generate new textures for the generated scenes. We believe our system would facilitate data augmentation to expand the currently limited 3D geometry datasets commonly used in 3D computer vision, generative design, and general 3D deep learning tasks.
摘要:与构建环境的几何复杂度相对应的丰富3D数据集的可用性被认为是3D深度学习方法的一项持续挑战。为了应对这一挑战,我们引入了GenScan,这是一种生成系统,以参数方式填充合成3D扫描数据集。该系统将现有的捕获3D扫描作为输入,并输出建筑物布局的替代变体,包括具有相应纹理的墙壁,门和家具。 GenScan是全自动系统,也可以由用户通过分配的用户界面手动控制。我们提出的系统利用混合深度神经网络和参数化器模块的组合来提取和转换给定3D扫描的元素。 GenScan利用样式转移技术为生成的场景生成新的纹理。我们相信我们的系统将有助于数据扩充,以扩展当前有限的3D几何数据集,这些数据集通常用于3D计算机视觉,生成设计和一般3D深度学习任务。

52. Shape From Tracing: Towards Reconstructing 3D Object Geometry and SVBRDF Material from Images via Differentiable Path Tracing [PDF] 返回目录
  Purvi Goel, Loudon Cohen, James Guesman, Vikas Thamizharasan, James Tompkin, Daniel Ritchie
Abstract: Reconstructing object geometry and material from multiple views typically requires optimization. Differentiable path tracing is an appealing framework as it can reproduce complex appearance effects. However, it is difficult to use due to high computational cost. In this paper, we explore how to use differentiable ray tracing to refine an initial coarse mesh and per-mesh-facet material representation. In simulation, we find that it is possible to reconstruct fine geometric and material detail from low resolution input views, allowing high-quality reconstructions in a few hours despite the expense of path tracing. The reconstructions successfully disambiguate shading, shadow, and global illumination effects such as diffuse interreflection from material properties. We demonstrate the impact of different geometry initializations, including space carving, multi-view stereo, and 3D neural networks. Finally, with input captured using smartphone video and a consumer 360° camera for lighting estimation, we also show how to refine initial reconstructions of real-world objects in unconstrained environments.
摘要:从多个视图重建物体的几何形状和材质通常需要优化。可微分路径追踪是一个吸引人的框架,因为它可以重现复杂的外观效果。但是,由于计算成本高,它难以使用。在本文中,我们探索如何使用可微分路径追踪来细化初始的粗网格和逐网格面的材质表示。在仿真中,我们发现可以从低分辨率输入视图重建精细的几何和材质细节,尽管路径追踪开销很大,仍可以在几个小时内完成高质量的重建。重建成功地将着色、阴影和全局光照效果(例如漫反射互反射)与材质属性区分开来。我们演示了不同几何初始化的影响,包括空间雕刻、多视图立体和3D神经网络。最后,利用智能手机视频捕获的输入和用于光照估计的消费级360°相机,我们还展示了如何在不受约束的环境中细化真实世界物体的初始重建。

53. Globetrotter: Unsupervised Multilingual Translation from Visual Alignment [PDF] 返回目录
  Dídac Surís, Dave Epstein, Carl Vondrick
Abstract: Multi-language machine translation without parallel corpora is challenging because there is no explicit supervision between languages. Existing unsupervised methods typically rely on topological properties of the language representations. We introduce a framework that instead uses the visual modality to align multiple languages, using images as the bridge between them. We estimate the cross-modal alignment between language and images, and use this estimate to guide the learning of cross-lingual representations. Our language representations are trained jointly in one model with a single stage. Experiments with fifty-two languages show that our method outperforms baselines on unsupervised word-level and sentence-level translation using retrieval.
摘要:因为没有语言之间的明确监督,所以没有并行语料库的多语言机器翻译具有挑战性。 现有的无监督方法通常依赖于语言表示的拓扑属性。 我们引入了一个框架,该框架改为使用视觉模式来对齐多种语言,并使用图像作为它们之间的桥梁。 我们估计语言和图像之间的跨模式对齐方式,并使用此估计值来指导跨语言表示的学习。 我们的语言表示在一个模型中以单个阶段进行了联合培训。 使用52种语言进行的实验表明,在使用检索的无监督词级和句子级翻译中,我们的方法优于基线。

54. Generalized iterated-sums signatures [PDF] 返回目录
  Joscha Diehl, Kurusch Ebrahimi-Fard, Nikolas Tapia
Abstract: We explore the algebraic properties of a generalized version of the iterated-sums signature, inspired by previous work of F.~Király and H.~Oberhauser. In particular, we show how to recover the character property of the associated linear map over the tensor algebra by considering a deformed quasi-shuffle product of words on the latter. We introduce three non-linear transformations on iterated-sums signatures, close in spirit to Machine Learning applications, and show some of their properties.
摘要:我们从F.〜Király和H.〜Oberhauser的先前工作中探索了迭代和签名的广义版本的代数性质。 特别是,我们展示了如何通过考虑张量代数上的变形的准混洗积来恢复张量代数上的线性映射的特征。 我们在迭代求和签名上引入了三个非线性变换,它们在本质上与机器学习应用程序紧密相关,并显示了它们的某些属性。

55. Hierarchical Residual Attention Network for Single Image Super-Resolution [PDF] 返回目录
  Parichehr Behjati, Pau Rodriguez, Armin Mehri, Isabelle Hupont, Carles Fernández Tena, Jordi Gonzalez
Abstract: Convolutional neural networks are the most successful models in single image super-resolution. Deeper networks, residual connections, and attention mechanisms have further improved their performance. However, these strategies often improve the reconstruction performance at the expense of considerably increasing the computational cost. This paper introduces a new lightweight super-resolution model based on an efficient method for residual feature and attention aggregation. In order to make an efficient use of the residual features, these are hierarchically aggregated into feature banks for posterior usage at the network output. In parallel, a lightweight hierarchical attention mechanism extracts the most relevant features from the network into attention banks for improving the final output and preventing the information loss through the successive operations inside the network. Therefore, the processing is split into two independent paths of computation that can be simultaneously carried out, resulting in a highly efficient and effective model for reconstructing fine details on high-resolution images from their low-resolution counterparts. Our proposed architecture surpasses state-of-the-art performance in several datasets, while maintaining relatively low computation and memory footprint.
摘要:卷积神经网络是单图像超分辨率中最成功的模型。更深层的网络,剩余连接和注意力机制进一步提高了它们的性能。但是,这些策略通常以显着增加计算成本为代价来改善重建性能。本文介绍了一种基于残差特征和注意力聚集的有效方法的新型轻量级超分辨率模型。为了有效利用残留特征,将这些残留特征按层次结构聚合到特征库中,以便在网络输出处后用。同时,轻量级的分层注意机制将网络中最相关的功能提取到注意库中,以提高最终输出并通过网络内部的后续操作防止信息丢失。因此,该处理被分成可以同时执行的两个独立的计算路径,从而形成了一种高效且有效的模型,用于从低分辨率的对应对象中重建高分辨率图像的精细细节。我们提出的架构在几个数据集中超越了最先进的性能,同时保持了相对较低的计算和内存占用。

56. GMM-Based Generative Adversarial Encoder Learning [PDF] 返回目录
  Yuri Feigin, Hedva Spitzer, Raja Giryes
Abstract: While GAN is a powerful model for generating images, its inability to infer a latent space directly limits its use in applications requiring an encoder. Our paper presents a simple architectural setup that combines the generative capabilities of GAN with an encoder. We accomplish this by combining the encoder with the discriminator using shared weights, then training them simultaneously using a new loss term. We model the output of the encoder latent space via a GMM, which leads to both good clustering using this latent space and improved image generation by the GAN. Our framework is generic and can be easily plugged into any GAN strategy. In particular, we demonstrate it both with Vanilla GAN and Wasserstein GAN, where in both it leads to an improvement in the generated images in terms of both the IS and FID scores. Moreover, we show that our encoder learns a meaningful representation as its clustering results are competitive with the current GAN-based state-of-the-art in clustering.
摘要:尽管GAN是生成图像的强大模型,但其无法推断潜在空间直接限制了其在需要编码器的应用中的使用。我们的论文提出了一个简单的架构设置,将GAN的生成功能与编码器结合在一起。我们通过使用共享权重将编码器与鉴别器组合在一起,然后使用新的损耗项同时训练它们来实现此目的。我们通过GMM对编码器潜在空间的输出进行建模,这既可以使用该潜在空间实现良好的聚类,又可以通过GAN改善图像生成。我们的框架是通用的,可以轻松插入任何GAN策略。特别是,我们用Vanilla GAN和Wasserstein GAN演示了这两种方法,它们在IS和FID分数方面均可以改善生成的图像。此外,我们证明了我们的编码器可以学习有意义的表示形式,因为其聚类结果与当前基于GAN的聚类最新技术相比具有竞争力。
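
The GMM over the encoder's latent space is the part of the pipeline that is easy to isolate: once latent codes exist, fitting the mixture, clustering with it, and sampling codes for generation take a few lines of scikit-learn, as sketched below. The 16-dimensional stand-in latents and the three components are assumptions; the encoder/discriminator weight sharing and the joint GAN training are omitted.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # stand-in for encoder outputs: latent codes drawn from three loose groups
    latents = np.vstack([rng.normal(m, 0.5, (200, 16)) for m in (-2.0, 0.0, 2.0)])

    gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
    cluster_ids = gmm.fit_predict(latents)       # clustering in latent space
    print(np.bincount(cluster_ids))              # roughly 200 codes per component

    # the fitted GMM can also be sampled to drive the generator with plausible codes
    samples, _ = gmm.sample(5)
    print(samples.shape)                         # (5, 16)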

57. Multi-temporal and multi-source remote sensing image classification by nonlinear relative normalization [PDF] 返回目录
  Devis Tuia, Diego Marcos, Gustau Camps-Valls
Abstract: Remote sensing image classification exploiting multiple sensors is a very challenging problem: data from different modalities are affected by spectral distortions and mis-alignments of all kinds, and this hampers re-using models built for one image to be used successfully in other scenes. In order to adapt and transfer models across image acquisitions, one must be able to cope with datasets that are not co-registered, acquired under different illumination and atmospheric conditions, by different sensors, and with scarce ground references. Traditionally, methods based on histogram matching have been used. However, they fail when densities have very different shapes or when there is no corresponding band to be matched between the images. An alternative builds upon \emph{manifold alignment}. Manifold alignment performs a multidimensional relative normalization of the data prior to product generation that can cope with data of different dimensionality (e.g. different number of bands) and possibly unpaired examples. Aligning data distributions is an appealing strategy, since it allows to provide data spaces that are more similar to each other, regardless of the subsequent use of the transformed data. In this paper, we study a methodology that aligns data from different domains in a nonlinear way through {\em kernelization}. We introduce the Kernel Manifold Alignment (KEMA) method, which provides a flexible and discriminative projection map, exploits only a few labeled samples (or semantic ties) in each domain, and reduces to solving a generalized eigenvalue problem. We successfully test KEMA in multi-temporal and multi-source very high resolution classification tasks, as well as on the task of making a model invariant to shadowing for hyperspectral imaging.
摘要:利用多个传感器进行遥感影像分类是一个非常具有挑战性的问题:来自不同模态的数据会受到光谱畸变和各种失准的影响,这阻碍了将为某一影像建立的模型成功复用到其他场景中。为了在不同的图像采集之间适应和迁移模型,人们必须能够处理未配准、在不同光照和大气条件下由不同传感器获取、且地面参照稀缺的数据集。传统上使用基于直方图匹配的方法,但是当数据分布形状差异很大、或图像之间没有可匹配的对应波段时,这些方法会失效。另一种方法基于流形对齐(manifold alignment)。流形对齐在生成产品之前对数据执行多维相对归一化,可以处理不同维数(例如不同波段数)以及可能未配对的样本。对齐数据分布是一种吸引人的策略,因为无论随后如何使用变换后的数据,它都能提供彼此更为相似的数据空间。在本文中,我们研究了一种通过核化(kernelization)以非线性方式对齐来自不同域的数据的方法。我们引入了核流形对齐(KEMA)方法,该方法提供了灵活且有判别力的投影映射,在每个域中仅利用少量带标记的样本(或语义联系),并可归结为求解广义特征值问题。我们在多时相、多源的超高分辨率分类任务中成功地测试了KEMA,并将其用于使高光谱成像模型对阴影保持不变的任务。

58. Active Learning Methods for Efficient Hybrid Biophysical Variable Retrieval [PDF] 返回目录
  Jochem Verrelst, Sara Dethier, Juan Pablo Rivera, Jordi Muñoz-Marí, Gustau Camps-Valls, José Moreno
Abstract: Kernel-based machine learning regression algorithms (MLRAs) are potentially powerful methods for being implemented into operational biophysical variable retrieval schemes. However, they face difficulties in coping with large training datasets. With the increasing amount of optical remote sensing data made available for analysis and the possibility of using a large amount of simulated data from radiative transfer models (RTMs) to train kernel MLRAs, efficient data reduction techniques will need to be implemented. Active learning (AL) methods enable to select the most informative samples in a dataset. This letter introduces six AL methods for achieving optimized biophysical variable estimation with a manageable training dataset, and their implementation into a Matlab-based MLRA toolbox for semi-automatic use. The AL methods were analyzed on their efficiency of improving the estimation accuracy of leaf area index and chlorophyll content based on PROSAIL simulations. Each of the implemented methods outperformed random sampling, improving retrieval accuracy with lower sampling rates. Practically, AL methods open opportunities to feed advanced MLRAs with RTM-generated training data for development of operational retrieval models.
摘要:基于核的机器学习回归算法(MLRA)是潜在的功能强大的方法,可用于实施可运行的生物物理变量检索方案。但是,它们在应对大型训练数据集时面临困难。随着越来越多的光学遥感数据可用于分析,并且有可能使用来自辐射传输模型(RTM)的大量模拟数据来训练基于核的MLRA,将需要实施有效的数据缩减技术。主动学习(AL)方法可以选择数据集中信息量最大的样本。这封信介绍了六种用于通过可管理的训练数据集实现最佳生物物理变量估计的AL方法,并将其实现在基于Matlab的MLRA工具箱中以实现半自动使用。基于PROSAIL模拟,分析了各AL方法在提高叶面积指数和叶绿素含量估计准确性方面的效率。每种已实现的方法都优于随机采样,以更低的采样率提高了检索精度。实际上,AL方法为利用RTM生成的训练数据来训练先进的MLRA、从而开发可运行的检索模型提供了机会。
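
The letter compares several AL criteria for picking informative samples from a large simulated pool; one standard criterion in kernel-regression retrieval is to repeatedly add the pool sample with the highest predictive uncertainty. The sketch below shows that loop with a Gaussian-process regressor; the 1-D toy function, pool size, seed-set size and number of acquisitions are placeholders for PROSAIL-simulated data, and this is not necessarily one of the six methods implemented in the toolbox.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.default_rng(0)
    pool_X = rng.uniform(0, 10, (500, 1))            # stand-in for RTM simulations
    pool_y = np.sin(pool_X).ravel()                  # stand-in biophysical variable

    labeled = list(rng.choice(len(pool_X), 5, replace=False))   # small seed set
    for _ in range(20):                              # 20 AL acquisitions
        gpr = GaussianProcessRegressor().fit(pool_X[labeled], pool_y[labeled])
        _, std = gpr.predict(pool_X, return_std=True)
        std[labeled] = -np.inf                       # never re-pick labeled samples
        labeled.append(int(np.argmax(std)))          # most uncertain sample next

    final_rmse = np.sqrt(np.mean((gpr.predict(pool_X) - pool_y) ** 2))
    print(f"{len(labeled)} training samples selected, RMSE = {final_rmse:.3f}")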

59. Planet cartography with neural learned regularization [PDF] 返回目录
  A. Asensio Ramos, E. Pallé
Abstract: Finding potential life harboring exo-Earths is one of the aims of exoplanetary science. Detecting signatures of life in exoplanets will likely first be accomplished by determining the bulk composition of the planetary atmosphere via reflected/transmitted spectroscopy. However, a complete understanding of the habitability conditions will surely require mapping the presence of liquid water, continents and/or clouds. Spin-orbit tomography is a technique that allows us to obtain maps of the surface of exoplanets around other stars using the light scattered by the planetary surface. We leverage the potential of deep learning and propose a mapping technique for exo-Earths in which the regularization is learned from mock surfaces. The solution of the inverse mapping problem is posed as a deep neural network that can be trained end-to-end with suitable training data. We propose in this work to use methods based on the procedural generation of planets, inspired by what we found on Earth. We also consider mapping the recovery of surfaces and the presence of persistent cloud in cloudy planets. We show that the a reliable mapping can be carried out with our approach, producing very compact continents, even when using single passband observations. More importantly, if exoplanets are partially cloudy like the Earth is, we show that one can potentially map the distribution of persistent clouds that always occur on the same position on the surface (associated to orography and sea surface temperatures) together with non-persistent clouds that move across the surface. This will become the first test one can perform on an exoplanet for the detection of an active climate system. For small rocky planets in the habitable zone of their stars, this weather system will be driven by water, and the detection can be considered as a strong proxy for truly habitable conditions.
摘要:寻找隐伏着地球的潜在生命是系外科学的目标之一。检测系外行星生命特征的方法很可能首先是通过反射/透射光谱法确定行星大气的整体组成来完成的。但是,对可居住性条件的完全理解肯定会要求绘制液态水,大陆和/或云的存在的地图。自旋轨道层析成像技术是一种技术,它使我们能够利用行星表面散射的光来获取其他恒星周围系外行星表面的地图。我们利用深度学习的潜力,并为exo-Earths提出了一种映射技术,其中可以从模拟表面学习正则化。逆映射问题的解决方案是一个深度神经网络,该网络可以使用适当的训练数据进行端到端训练。在这项工作中,我们建议使用基于我们在地球上发现的东西启发的行星程序生成的方法。我们还考虑绘制地表恢复和多云星球中存在的持久云的映射。我们表明,即使使用单通带观测,也可以使用我们的方法进行可靠的映射,生成非常紧凑的大陆。更重要的是,如果系外行星像地球一样是部分多云的,那么我们表明,潜在地可以绘制出始终存在于地表同一位置(与地形和海面温度有关)的持久性云与非持久性云的分布图在整个表面上移动。这将成为第一个可以在系外行星上进行的,用于探测活跃气候系统的测试。对于处于其恒星宜居区域的小岩石行星而言,这种天气系统将由水驱动,而这种探测可被认为是真正宜居环境的有力替代。

60. Reinforcement Based Learning on Classification Task Could Yield Better Generalization and Adversarial Accuracy [PDF] 返回目录
  Shashi Kant Gupta
Abstract: Deep Learning has become interestingly popular in computer vision, mostly attaining near or above human-level performance in various vision tasks. But recent work has also demonstrated that these deep neural networks are very vulnerable to adversarial examples (adversarial examples - inputs to a model which are naturally similar to original data but fools the model in classifying it into a wrong class). Humans are very robust against such perturbations; one possible reason could be that humans do not learn to classify based on an error between "target label" and "predicted label" but possibly due to reinforcements that they receive on their predictions. In this work, we proposed a novel method to train deep learning models on an image classification task. We used a reward-based optimization function, similar to the vanilla policy gradient method used in reinforcement learning, to train our model instead of conventional cross-entropy loss. An empirical evaluation on the cifar10 dataset showed that our method learns a more robust classifier than the same model architecture trained using cross-entropy loss function (on adversarial training). At the same time, our method shows a better generalization with the difference in test accuracy and train accuracy $< 2\%$ for most of the time compared to the cross-entropy one, whose difference most of the time remains $> 2\%$.
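A minimal sketch of the training objective described above, assuming PyTorch: a class is sampled from the model's softmax output, a reward is assigned depending on whether it matches the label, and a vanilla policy-gradient (REINFORCE-style) loss replaces cross-entropy. The +1/-1 reward scheme and the absence of a variance-reducing baseline are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_step(model, images, labels, optimizer):
    """One reward-based update for image classification instead of cross-entropy."""
    logits = model(images)                        # (batch, num_classes)
    dist = Categorical(logits=logits)
    actions = dist.sample()                       # sampled class predictions
    rewards = (actions == labels).float() * 2.0 - 1.0   # +1 if correct, -1 otherwise
    # Vanilla policy gradient: maximize E[reward * log p(sampled class)]
    loss = -(rewards * dist.log_prob(actions)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```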

61. Two-Phase Learning for Overcoming Noisy Labels [PDF] 返回目录
  Hwanjun Song, Minseok Kim, Dongmin Park, Jae-Gil Lee
Abstract: To counter the challenge posed by noisy labels, the learning strategy of a deep neural network must change over the course of training. Therefore, we propose a novel two-phase learning method, MORPH, which automatically transitions between learning phases at the point when the network begins to rapidly memorize false-labeled samples. In the first phase, before the transition point, MORPH updates the network on all the training samples. Without any supervision, the learning phase is switched to the next one at the estimated best transition point. Subsequently, MORPH resumes training of the network only on a maximal safe set, which maintains a collection of almost certainly true-labeled samples at each epoch. Owing to its two-phase learning, MORPH realizes noise-free training for any type of label noise in practical use. Moreover, extensive experiments on six datasets verify that MORPH significantly outperforms five state-of-the-art methods in terms of test error and training time.
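The two-phase idea can be sketched roughly as follows, assuming PyTorch. The small-loss selection rule and the fixed 70% safe-set ratio are illustrative simplifications; they stand in for MORPH's actual estimation of the transition point and of the maximal safe set.

```python
import torch
import torch.nn.functional as F

def two_phase_epoch(model, loader, optimizer, epoch, transition_epoch):
    """One epoch of a simplified two-phase scheme against noisy labels: before the
    (estimated) transition point, update on every sample; afterwards, update only on
    a 'safe set' of low-loss samples assumed to be almost certainly clean."""
    model.train()
    for images, labels in loader:
        logits = model(images)
        per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
        if epoch < transition_epoch:
            loss = per_sample_loss.mean()                  # phase 1: all samples
        else:
            k = max(1, int(0.7 * len(labels)))             # assumed safe-set ratio
            safe_idx = torch.topk(-per_sample_loss, k).indices
            loss = per_sample_loss[safe_idx].mean()        # phase 2: safe set only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```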

62. Interpretable deep learning regression for breast density estimation on MRI [PDF] 返回目录
  Bas H.M. van der Velden, Max A.A. Ragusi, Markus H.A. Janse, Claudette E. Loo, Kenneth G.A. Gilhuijs
Abstract: Breast density, which is the ratio between fibroglandular tissue (FGT) and total breast volume, can be assessed qualitatively by radiologists and quantitatively by computer algorithms. These algorithms often rely on segmentation of breast and FGT volume. In this study, we propose a method to directly assess breast density on MRI, and provide interpretations of these assessments. We assessed breast density in 506 patients with breast cancer using a regression convolutional neural network (CNN). The inputs to the CNN were breast MRI slices of 128 x 128 voxels, and the output was a continuous density value between 0 (fatty breast) and 1 (dense breast). We used 350 patients to train the CNN, 75 for validation, and 81 for independent testing. We investigated how the CNN arrived at its predicted density using Deep SHapley Additive exPlanations (SHAP). The density predicted by the CNN on the testing set was significantly correlated with the ground truth densities (N = 81 patients, Spearman's rho = 0.86, P < 0.001). When inspecting what the CNN based its predictions on, we found that voxels in FGT commonly had positive SHAP values, voxels in fatty tissue commonly had negative SHAP values, and voxels in non-breast tissue commonly had SHAP values near zero. This means that the prediction of density is based on the structures we expect it to be based on, namely FGT and fatty tissue. To conclude, we presented an interpretable deep learning regression method for breast density estimation on MRI with promising results.
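As a rough illustration of the setup, the sketch below shows a small regression CNN mapping a 128 x 128 MRI slice to a single density value in [0, 1]; the architecture and training details are assumptions, not the one used in the paper. The trailing comment indicates how voxel-level Deep SHAP explanations could be obtained with the `shap` package.

```python
import torch
import torch.nn as nn

# Minimal regression CNN: one-channel 128 x 128 slice in, density in [0, 1] out.
density_cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 1),
    nn.Sigmoid(),            # density between 0 (fatty) and 1 (dense)
)

def train_step(slices, densities, optimizer):
    """slices: (batch, 1, 128, 128) tensor; densities: (batch, 1) ground-truth values."""
    loss = nn.functional.mse_loss(density_cnn(slices), densities)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Voxel-level explanations with Deep SHAP could then look roughly like:
#   explainer = shap.DeepExplainer(density_cnn, background_slices)
#   shap_values = explainer.shap_values(test_slices)
```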

63. CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions [PDF] 返回目录
  Tayfun Ates, Muhammed Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, Deniz Yuret
Abstract: Recent advances in Artificial Intelligence and deep learning have revived interest in studying the gap between the reasoning capabilities of humans and machines. In this ongoing work, we introduce CRAFT, a new visual question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 38K video and question pairs generated from 3K videos across 10 different virtual environments, each containing a varying number of moving objects that interact with each other. Two of CRAFT's question categories, descriptive and counterfactual questions, have been studied previously. In addition, inspired by the theory of force dynamics from human cognitive psychology, we introduce new question categories that involve understanding the intentions of objects through the notions of cause, enable, and prevent. Our preliminary results demonstrate that even though these tasks are very intuitive for humans, the implemented baselines cannot cope with the underlying challenges.

64. Raw Image Deblurring [PDF] 返回目录
  Chih-Hung Liang, Yu-An Chen, Yueh-Cheng Liu, Winston H. Hsu
Abstract: Deep learning-based blind image deblurring plays an essential role in solving image blur, since all existing kernels are limited in modeling real-world blur. Thus far, researchers have focused on powerful models to handle the deblurring problem and achieve decent results. In this work, we take a new perspective: we explore the opportunity of performing image enhancement (e.g., deblurring) directly on RAW images and investigate novel neural network structures that benefit from RAW-based learning. However, to the best of our knowledge, there is no available RAW image deblurring dataset. Therefore, we built a new dataset containing both RAW images and processed sRGB images and designed a new model to utilize the unique characteristics of RAW images. The proposed deblurring model, trained solely on RAW images, achieves state-of-the-art performance and outperforms models trained on processed sRGB images. Furthermore, with fine-tuning, the proposed model, trained on our new dataset, can generalize to other sensors. Additionally, through a series of experiments, we demonstrate that existing deblurring models can also be improved by training on the RAW images in our new dataset. Ultimately, we point to a new avenue of opportunities based on the devised RAW-based deblurring method and the brand-new Deblur-RAW dataset.

65. A Unifying Framework for Formal Theories of Novelty: Framework, Examples and Discussion [PDF] 返回目录
  T. E. Boult, P. A. Grabowicz, D. S. Prijatelj, R. Stern, L. Holder, J. Alspector, M. Jafarzadeh, T. Ahmad, A. R. Dhamija, C. Li, S. Cruz, A. Shrivastava, C. Vondrick, W. J. Scheirer
Abstract: Managing inputs that are novel, unknown, or out-of-distribution is critical as an agent moves from the lab to the open world. Novelty-related problems include being tolerant to novel perturbations of normal input, detecting when the input includes novel items, and adapting to novel inputs. While significant research has been undertaken in these areas, a noticeable gap remains: there is no formalized definition of novelty that transcends problem domains. As a team of researchers spanning multiple research groups and different domains, we have seen, first hand, the difficulties that arise from ill-specified novelty problems, as well as from inconsistent definitions and terminology. Therefore, we present the first unified framework for formal theories of novelty and use the framework to formally define a family of novelty types. Our framework can be applied across a wide range of domains, from symbolic AI to reinforcement learning, and beyond to open-world image recognition. Thus, it can be used to help kick-start new research efforts and accelerate ongoing work on these important novelty-related problems. This extended version of our AAAI 2021 paper includes more details and examples in multiple domains.

66. Efficient Estimation of Influence of a Training Instance [PDF] 返回目录
  Sosuke Kobayashi, Sho Yokoi, Jun Suzuki, Kentaro Inui
Abstract: Understanding the influence of a training instance on a neural network model leads to improved interpretability. However, evaluating this influence, i.e., how a model's prediction would change if a training instance were not used, is difficult and inefficient. In this paper, we propose an efficient method for estimating the influence. Our method is inspired by dropout, which zero-masks a sub-network and prevents that sub-network from learning each training instance. By switching between dropout masks, we can use sub-networks that did or did not learn each training instance and estimate its influence. Through experiments with BERT and VGGNet on classification datasets, we demonstrate that the proposed method can capture training influences, enhance the interpretability of error predictions, and cleanse the training dataset to improve generalization.
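A rough sketch of the mask-switching idea, assuming PyTorch: each training instance is tied to a deterministic dropout mask applied during its own update, so the masked-out sub-network never learns that instance; at evaluation time the influence on a test example is estimated as the loss gap between the sub-network that did not learn the instance and its complement. The mask derivation, the 50% dropout rate, and the `model(x, mask)` interface are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def instance_mask(instance_id, hidden_dim):
    """Deterministic 50% dropout mask derived from the instance id. Units zeroed here
    receive no gradient from this instance during training, so the complementary
    sub-network is the one that 'learned' it."""
    g = torch.Generator().manual_seed(int(instance_id))
    keep = (torch.rand(hidden_dim, generator=g) > 0.5).float()
    return keep * 2.0                       # inverted-dropout scaling for p = 0.5

def influence_on(model, loss_fn, instance_id, test_x, test_y, hidden_dim):
    """Estimated influence of one training instance on a test example: loss of the
    sub-network that did NOT learn it minus loss of the complementary sub-network.
    `model(x, mask)` is assumed to multiply its hidden activations by `mask`."""
    mask = instance_mask(instance_id, hidden_dim)
    flipped = (mask == 0).float() * 2.0     # complementary mask, same scaling
    with torch.no_grad():
        loss_not_learned = loss_fn(model(test_x, mask), test_y)
        loss_learned = loss_fn(model(test_x, flipped), test_y)
    return (loss_not_learned - loss_learned).item()
```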

67. A Number Sense as an Emergent Property of the Manipulating Brain [PDF] 返回目录
  Neehar Kondapaneni, Pietro Perona
Abstract: The ability to understand and manipulate numbers and quantities emerges during childhood, but the mechanism through which this ability is developed is still poorly understood. In particular, it is not known whether acquiring such a {\em number sense} is possible without supervision from a teacher. To explore this question, we propose a model in which spontaneous and undirected manipulation of small objects trains perception to predict the resulting scene changes. We find that, from this task, an image representation emerges that exhibits regularities that foreshadow numbers and quantity. These include distinct categories for zero and the first few natural numbers, a notion of order, and a signal that correlates with numerical quantity. As a result, our model acquires the ability to estimate the number of objects in the scene, as well as {\em subitization}, i.e. the ability to recognize at a glance the exact number of objects in small scenes. We conclude that important aspects of a facility with numbers and quantities may be learned without explicit teacher supervision.

68. CEL-Net: Continuous Exposure for Extreme Low-Light Imaging [PDF] 返回目录
  Michael Klyuchka, Evgeny Hershkovitch Neiterman, Gil Ben-Artzi
Abstract: Deep learning methods for enhancing dark images learn a mapping from input images to output images with pre-determined, discrete exposure levels. Often, at inference time the input and optimal output exposure levels of the given image differ from those seen during training. As a result, the enhanced image might suffer from visual distortions, such as low contrast or dark areas. We address this issue by introducing a deep learning model that can continuously generalize at inference time to unseen exposure levels without the need to retrain the model. To this end, we introduce a dataset of 1500 raw images captured in both outdoor and indoor scenes, with five different exposure levels and various camera parameters. Using this dataset, we develop a model for extreme low-light imaging that can continuously tune the input or output exposure level of the image to an unseen one. We investigate the properties of our model and validate its performance, showing promising results.
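A minimal sketch, not the CEL-Net architecture, of how a continuous exposure level can condition an enhancement network so that unseen levels can be requested at inference time without retraining; the FiLM-style feature modulation and the 4-channel packed-RAW input are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ExposureConditionedEnhancer(nn.Module):
    """Toy low-light enhancer conditioned on a continuous exposure level."""
    def __init__(self, channels=32):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(4, channels, 3, padding=1), nn.ReLU())
        # The exposure level enters as a global scale/shift of the features.
        self.film = nn.Linear(1, 2 * channels)
        self.decode = nn.Conv2d(channels, 4, 3, padding=1)

    def forward(self, raw, exposure):
        # raw: (batch, 4, H, W) packed Bayer input; exposure: (batch, 1) continuous level
        feats = self.encode(raw)
        gamma, beta = self.film(exposure).chunk(2, dim=1)
        feats = feats * gamma[..., None, None] + beta[..., None, None]
        return self.decode(feats)

# At inference, an intermediate exposure level never seen during training can be requested:
#   out = model(raw, torch.tensor([[2.5]]))
```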

69. Process of image super-resolution [PDF] 返回目录
  Sebastien Lablanche, Gerard Lablanche
Abstract: In this paper we explain a process of super-resolution reconstruction that increases the resolution of an image. The need for high-resolution digital images exists in diverse domains, for example the medical and spatial domains. High-resolution digital images can be obtained at shooting time, but this often comes at significant cost because of the equipment required. To avoid such costs, super-resolution reconstruction methods are used, which build a high-resolution image from one or several low-resolution images. The American patent US 9208537 describes such an algorithm: a zone of a low-resolution image is isolated and categorized according to the information contained in the pixels forming the borders of the zone, and the category of the zone determines the type of interpolation used to add pixels within it, increasing the sharpness of the image. It is also known how to reconstruct a high-resolution image from a low-resolution one using a super-resolution model whose learning is based on neural networks and an image library; the Chinese patent application CN 107563965 and the scientific publication "Pixel Recursive Super Resolution" (R. Dahl, M. Norouzi, J. Shlens) propose such methods. The aim of this paper is to demonstrate that it is possible to reconstruct coherent human faces from very degraded pixelated images with a very fast algorithm, much faster than compressed sensing (CS), easier to compute, and without deep learning, hence without heavy technological resources, i.e. a large database of thousands of training images (see arXiv:2003). This technological breakthrough was patented in 2018 through the French patent application FR 1855485 (this https URL; see the HAL reference at this https URL).
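As a generic illustration of zone-categorized interpolation (not the patented algorithm or the paper's face-reconstruction method), the sketch below classifies an image zone by the variability of its border pixels and picks a correspondingly smooth or edge-preserving interpolation order. The threshold and the two interpolation orders are arbitrary choices for illustration.

```python
import numpy as np
from scipy import ndimage

def upscale_zone(zone, factor=4, edge_threshold=20.0):
    """Upscale one low-resolution zone (2-D numpy array). The zone is categorized from
    its border pixels: a high-variance border suggests edges (use bilinear interpolation
    to limit ringing), a smooth border allows bicubic interpolation."""
    border = np.concatenate([zone[0, :], zone[-1, :], zone[:, 0], zone[:, -1]])
    order = 1 if border.std() > edge_threshold else 3
    return ndimage.zoom(zone.astype(np.float64), factor, order=order)
```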
