Contents
9. AdjointBackMap: Reconstructing Effective Decision Hypersurfaces from CNN Layers Using Adjoint Operators [PDF] Abstract
12. C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer [PDF] Abstract
15. SAfE: Self-Attention Based Unsupervised Road Safety Classification in Hazardous Environments [PDF] Abstract
17. FuseVis: Interpreting neural networks for image fusion using per-pixel saliency visualization [PDF] Abstract
22. Analysing the Direction of Emotional Influence in Nonverbal Dyadic Communication: A Facial-Expression Study [PDF] Abstract
23. Revisiting 3D Context Modeling with Supervised Pre-training for Universal Lesion Detection in CT Slices [PDF] Abstract
32. Training an Emotion Detection Classifier using Frames from a Mobile Therapeutic Game for Children with Developmental Disorders [PDF] Abstract
36. Does the dataset meet your expectations? Explaining sample representation in image data [PDF] Abstract
38. Spectral band selection for vegetation properties retrieval using Gaussian processes regression [PDF] Abstract
46. Exploration of Whether Skylight Polarization Patterns Contain Three-dimensional Attitude Information [PDF] Abstract
47. Personal Mental Health Navigator: Harnessing the Power of Data, Personal Models, and Health Cybernetics to Promote Psychological Well-being [PDF] Abstract
48. TEMImageNet and AtomSegNet Deep Learning Training Library and Models for High-Precision Atom Segmentation, Localization, Denoising, and Super-resolution Processing of Atom-Resolution Scanning TEM Images [PDF] Abstract
49. Evaluation of deep learning-based myocardial infarction quantification using Segment CMR software [PDF] Abstract
56. Cross-Cohort Generalizability of Deep and Conventional Machine Learning for MRI-based Diagnosis and Prediction of Alzheimer's Disease [PDF] Abstract
Abstracts
1. Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts [PDF] Back to Contents
Ji Hou, Benjamin Graham, Matthias Nießner, Saining Xie
Abstract: The rapid progress in 3D scene understanding has come with growing demand for data; however, collecting and annotating 3D scenes (e.g. point clouds) are notoriously hard. For example, the number of scenes (e.g. indoor rooms) that can be accessed and scanned might be limited; even given sufficient data, acquiring 3D labels (e.g. instance masks) requires intensive human labor. In this paper, we explore data-efficient learning for 3D point clouds. As a first step towards this direction, we propose Contrastive Scene Contexts, a 3D pre-training method that makes use of both point-level correspondences and spatial contexts in a scene. Our method achieves state-of-the-art results on a suite of benchmarks where training data or labels are scarce. Our study reveals that exhaustive labelling of 3D point clouds might be unnecessary; and remarkably, on ScanNet, even using 0.1% of point labels, we still achieve 89% (instance segmentation) and 96% (semantic segmentation) of the baseline performance that uses full annotations.
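A minimal sketch of how such a pre-training objective could look, assuming a PointInfoNCE-style contrastive loss (as in prior point-level pre-training work) with negatives restricted to spatial partitions of the scene, here simply the eight octants around the centroid. All names and the partitioning scheme are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_scene_context_loss(feat_a, feat_b, xyz, temperature=0.07):
    """Hypothetical PointInfoNCE with spatial partitions.

    feat_a, feat_b: (N, C) features of N matched points seen from two views.
    xyz:            (N, 3) point coordinates, used to assign spatial partitions.
    """
    feat_a = F.normalize(feat_a, dim=1)
    feat_b = F.normalize(feat_b, dim=1)
    # Partition points into 8 octants around the scene centroid, so that
    # negatives are drawn from the same spatial context as each positive.
    octant = ((xyz > xyz.mean(dim=0)).long() * torch.tensor([1, 2, 4])).sum(dim=1)
    loss, parts = 0.0, 0
    for k in range(8):
        idx = (octant == k).nonzero(as_tuple=True)[0]
        if len(idx) < 2:
            continue
        logits = feat_a[idx] @ feat_b[idx].T / temperature  # (n, n) similarities
        target = torch.arange(len(idx))                     # positives on the diagonal
        loss = loss + F.cross_entropy(logits, target)
        parts += 1
    return loss / max(parts, 1)
```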
2. Point Transformer [PDF] Back to Contents
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun
Abstract: Self-attention networks have revolutionized natural language processing and are making impressive strides in image analysis tasks such as image classification and object detection. Inspired by this success, we investigate the application of self-attention networks to 3D point cloud processing. We design self-attention layers for point clouds and use these to construct self-attention networks for tasks such as semantic scene segmentation, object part segmentation, and object classification. Our Point Transformer design improves upon prior work across domains and tasks. For example, on the challenging S3DIS dataset for large-scale semantic scene segmentation, the Point Transformer attains an mIoU of 70.4% on Area 5, outperforming the strongest prior model by 3.3 absolute percentage points and crossing the 70% mIoU threshold for the first time.
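A rough, single-head sketch of the kind of layer the abstract describes: subtraction-based vector self-attention over precomputed k-nearest-neighbor indices, with a learned relative positional encoding. Layer widths and MLP depths are simplifying assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class PointTransformerLayer(nn.Module):
    """Simplified vector self-attention over k nearest neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Linear(dim, dim)    # query
        self.psi = nn.Linear(dim, dim)    # key
        self.alpha = nn.Linear(dim, dim)  # value
        self.delta = nn.Sequential(       # positional encoding theta(p_i - p_j)
            nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gamma = nn.Sequential(       # attention-weight MLP
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, pos, knn_idx):
        # x: (N, C) features, pos: (N, 3) coordinates, knn_idx: (N, k) LongTensor
        x_j = x[knn_idx]                                   # (N, k, C) neighbor feats
        pos_enc = self.delta(pos[:, None] - pos[knn_idx])  # (N, k, C)
        attn = self.gamma(self.phi(x)[:, None] - self.psi(x_j) + pos_enc)
        attn = torch.softmax(attn, dim=1)                  # normalize over neighbors
        return (attn * (self.alpha(x_j) + pos_enc)).sum(dim=1)
```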
3. Learning Continuous Image Representation with Local Implicit Image Function [PDF] Back to Contents
Yinbo Chen, Sifei Liu, Xiaolong Wang
Abstract: How to represent an image? While the visual world is presented in a continuous manner, machines store and see images in a discrete way with 2D arrays of pixels. In this paper, we seek to learn a continuous representation for images. Inspired by the recent progress in 3D reconstruction with implicit functions, we propose the Local Implicit Image Function (LIIF), which takes an image coordinate and the 2D deep features around the coordinate as inputs and predicts the RGB value at the given coordinate as output. Since the coordinates are continuous, LIIF can be presented in arbitrary resolution. To generate the continuous representation for pixel-based images, we train an encoder with the LIIF representation via a self-supervised super-resolution task. The learned continuous representation can be presented in arbitrary resolution, even extrapolating to $\times 30$ higher resolutions for which no training tasks are provided. We further show that the LIIF representation builds a bridge between discrete and continuous representations in 2D: it naturally supports learning tasks with size-varied image ground-truths and significantly outperforms the baseline that resizes the ground-truths. Our project page with code is at this https URL.
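A minimal sketch of the core query, assuming nearest-neighbor selection of the latent code (the paper also describes refinements such as local ensembling); all names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class LIIFDecoder(nn.Module):
    """Minimal local implicit image function: rgb = f(z, x - v)."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, feat_map, coords):
        # feat_map: (C, H, W) encoder output; coords: (Q, 2) in [-1, 1], (row, col)
        C, H, W = feat_map.shape
        # Nearest latent code z (at grid position v) for each continuous query x.
        iy = ((coords[:, 0] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
        ix = ((coords[:, 1] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)
        z = feat_map[:, iy, ix].T                       # (Q, C)
        v = torch.stack([iy.float() / (H - 1) * 2 - 1,  # cell-center coordinate
                         ix.float() / (W - 1) * 2 - 1], dim=1)
        return self.mlp(torch.cat([z, coords - v], dim=1))  # (Q, 3) RGB
```

Because `coords` is continuous, querying a denser coordinate grid than the feature map directly renders a higher-resolution image.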
4. DECOR-GAN: 3D Shape Detailization by Conditional Refinement [PDF] Back to Contents
Zhiqin Chen, Vladimir Kim, Matthew Fisher, Noam Aigerman, Hao Zhang, Siddhartha Chaudhuri
Abstract: We introduce a deep generative network for 3D shape detailization, akin to stylization with the style being geometric details. We address the challenge of creating large varieties of high-resolution and detailed 3D geometry from a small set of exemplars by treating the problem as that of geometric detail transfer. Given a low-resolution coarse voxel shape, our network refines it, via voxel upsampling, into a higher-resolution shape enriched with geometric details. The output shape preserves the overall structure (or content) of the input, while its detail generation is conditioned on an input "style code" corresponding to a detailed exemplar. Our 3D detailization via conditional refinement is realized by a generative adversarial network, coined DECOR-GAN. The network utilizes a 3D CNN generator for upsampling coarse voxels and a 3D PatchGAN discriminator to enforce local patches of the generated model to be similar to those in the training detailed shapes. During testing, a style code is fed into the generator to condition the refinement. We demonstrate that our method can refine a coarse shape into a variety of detailed shapes with different styles. The generated results are evaluated in terms of content preservation, plausibility, and diversity. Comprehensive ablation studies are conducted to validate our network designs.
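The 3D PatchGAN discriminator mentioned in the abstract can be pictured as a small stack of 3D convolutions emitting a grid of real/fake logits, one per local voxel patch, so that only local geometric style is judged. Channel widths and input sizes below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# A 3D PatchGAN-style discriminator: instead of one global real/fake score,
# it outputs a grid of logits, each with a limited receptive field, so the
# adversarial signal acts on local patches of the upsampled voxel shape.
patch_d = nn.Sequential(
    nn.Conv3d(1, 32, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv3d(64, 1, kernel_size=3, padding=1),
)

voxels = torch.rand(2, 1, 64, 64, 64)  # a batch of generated detailed shapes
scores = patch_d(voxels)               # one real/fake logit per local patch
print(scores.shape)                    # torch.Size([2, 1, 16, 16, 16])
```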
5. Joint Generative and Contrastive Learning for Unsupervised Person Re-identification [PDF] Back to Contents
Hao Chen, Yaohui Wang, Benoit Lagadec, Antitza Dantcheva, Francois Bremond
Abstract: Annotating identity labels in large-scale datasets is labour-intensive work, which strongly limits the scalability of person re-identification (ReID) in the real world. Unsupervised ReID addresses this issue by learning representations directly from unlabeled images. Recent self-supervised contrastive learning provides an effective approach for unsupervised representation learning. In this paper, we incorporate a Generative Adversarial Network (GAN) and contrastive learning into one joint training framework. While the GAN provides online data augmentation for contrastive learning, the contrastive module learns view-invariant features for generation. In this context, we propose a mesh-based novel view generator. Specifically, mesh projections serve as references towards generating novel views of a person. In addition, we propose a view-invariant loss to facilitate contrastive learning between original and generated views. Deviating from previous GAN-based unsupervised ReID methods involving domain adaptation, we do not rely on a labeled source dataset, which makes our method more flexible. Extensive experimental results show that our method significantly outperforms state-of-the-art methods under both fully unsupervised and unsupervised domain-adaptive settings on several large-scale ReID datasets.
6. Towards Recognizing New Semantic Concepts in New Visual Domains [PDF] Back to Contents
Massimiliano Mancini
Abstract: Deep learning models heavily rely on large scale annotated datasets for training. Unfortunately, datasets cannot capture the infinite variability of the real world, thus neural networks are inherently limited by the restricted visual and semantic information contained in their training set. In this thesis, we argue that it is crucial to design deep architectures that can operate in previously unseen visual domains and recognize novel semantic concepts. In the first part of the thesis, we describe different solutions to enable deep models to generalize to new visual domains, by transferring knowledge from a labeled source domain(s) to a domain (target) where no labeled data are available. We will show how variants of batch-normalization (BN) can be applied to different scenarios, from domain adaptation when source and target are mixtures of multiple latent domains, to domain generalization, continuous domain adaptation, and predictive domain adaptation, where information about the target domain is available only in the form of metadata. In the second part of the thesis, we show how to extend the knowledge of a pretrained deep model to new semantic concepts, without access to the original training set. We address the scenarios of sequential multi-task learning (using transformed task-specific binary masks), open-world recognition (with end-to-end training and enforced clustering), and incremental class learning in semantic segmentation, where we highlight and address the problem of the semantic shift of the background class. In the final part, we tackle a more challenging problem: given images of multiple domains and semantic categories (with their attributes), how to build a model that recognizes images of unseen concepts in unseen domains? We also propose an approach based on domain and semantic mixing of inputs and features, which is a first, promising step towards solving this problem.
7. Improved StyleGAN Embedding: Where are the Good Latents? [PDF] Back to Contents
Peihao Zhu, Rameen Abdal, Yipeng Qin, Peter Wonka
Abstract: StyleGAN is able to produce photorealistic images almost indistinguishable from real ones. Embedding images into the StyleGAN latent space is not a trivial task due to the reconstruction quality and editing quality trade-off. In this paper, we first introduce a new normalized space to analyze the diversity and the quality of the reconstructed latent codes. This space can help answer the question of where good latent codes are located in latent space. Second, we propose a framework to analyze the quality of different embedding algorithms. Third, we propose an improved embedding algorithm based on our analysis. We compare our results with the current state-of-the-art methods and achieve a better trade-off between reconstruction quality and editing quality.
8. CompositeTasking: Understanding Images by Spatial Composition of Tasks [PDF] Back to Contents
Nikola Popovic, Danda Pani Paudel, Thomas Probst, Guolei Sun, Luc Van Gool
Abstract: We define the concept of CompositeTasking as the fusion of multiple, spatially distributed tasks, for various aspects of image understanding. Learning to perform spatially distributed tasks is motivated by the frequent availability of only sparse labels across tasks, and the desire for a compact multi-tasking network. To facilitate CompositeTasking, we introduce a novel task conditioning model -- a single encoder-decoder network that performs multiple, spatially varying tasks at once. The proposed network takes an image and a set of pixel-wise dense tasks as inputs, and makes task-related predictions for each pixel, including the decision of which task to apply where. As to the latter, we learn the composition of tasks that needs to be performed according to some CompositeTasking rules. It not only offers us a compact network for multi-tasking, but also allows for task-editing. The strength of the proposed method is demonstrated by only having to supply sparse supervision per task. The obtained results are on par with our baselines that use dense supervision and a multi-headed multi-tasking design. The source code will be made publicly available at this http URL.
9. AdjointBackMap: Reconstructing Effective Decision Hypersurfaces from CNN Layers Using Adjoint Operators [PDF] Back to Contents
Qing Wan, Yoonsuck Choe
Abstract: There are several effective methods in explaining the inner workings of convolutional neural networks (CNNs). However, in general, finding the inverse of the function performed by CNNs as a whole is an ill-posed problem. In this paper, we propose a method based on adjoint operators to reconstruct, given an arbitrary unit in the CNN (except for the first convolutional layer), its effective hypersurface in the input space that replicates that unit's decision surface conditioned on a particular input image. Our results show that the hypersurface reconstructed this way, when multiplied by the original input image, would give nearly exact output value of that unit. We find that the CNN unit's decision surface is largely conditioned on the input, and this may explain why adversarial inputs can effectively deceive CNNs.
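The claim that the reconstructed hypersurface, multiplied by the input, nearly reproduces the unit's output has a simple special case that can be checked with autograd: for a bias-free, piecewise-linear CNN, every unit is positively homogeneous in the input, so <grad_x u(x), x> = u(x) exactly. The sketch below is this gradient-based simplification, not the paper's adjoint-operator construction:

```python
import torch
import torch.nn as nn

# For a bias-free ReLU CNN, any unit u is locally linear in the input, so
# u(x) = <W_eff(x), x>, where W_eff(x) = grad_x u(x) is an input-conditioned
# "effective hypersurface" (illustrative simplification only).
net = nn.Sequential(
    nn.Conv2d(1, 8, 3, bias=False), nn.ReLU(),
    nn.Conv2d(8, 4, 3, bias=False), nn.ReLU(),
)
x = torch.randn(1, 1, 16, 16, requires_grad=True)
unit = net(x)[0, 2, 5, 5]              # an arbitrary unit in the last layer
(hypersurface,) = torch.autograd.grad(unit, x)
print(torch.allclose((hypersurface * x).sum(), unit))  # True, up to fp error
```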
10. I3DOL: Incremental 3D Object Learning without Catastrophic Forgetting [PDF] Back to Contents
Jiahua Dong, Yang Cong, Gan Sun, Bingtao Ma, Lichen Wang
Abstract: 3D object classification has attracted considerable attention in academic research and industrial applications. However, most existing methods need to access the training data of past 3D object classes when facing the common real-world scenario: new classes of 3D objects arrive in a sequence. Moreover, the performance of advanced approaches degrades dramatically on past learned classes (i.e., catastrophic forgetting), due to the irregular and redundant geometric structures of 3D point cloud data. To address these challenges, we propose a new Incremental 3D Object Learning (I3DOL) model, which is the first exploration to learn new classes of 3D objects continually. Specifically, an adaptive-geometric centroid module is designed to construct discriminative local geometric structures, which can better characterize the irregular point cloud representation for 3D objects. Afterwards, to prevent the catastrophic forgetting brought by redundant geometric information, a geometric-aware attention mechanism is developed to quantify the contributions of local geometric structures, and explore unique 3D geometric characteristics with high contributions for class-incremental learning. Meanwhile, a score fairness compensation strategy is proposed to further alleviate the catastrophic forgetting caused by unbalanced data between past and new classes of 3D objects, by compensating the biased prediction for new classes in the validation phase. Experiments on representative 3D datasets validate the superiority of our I3DOL framework.
11. Sketch Generation with Drawing Process Guided by Vector Flow and Grayscale [PDF] Back to Contents
Zhengyan Tong, Xuanhong Chen, Bingbing Ni, Xiaohang Wang
Abstract: We propose a novel image-to-pencil translation method that can not only generate high-quality pencil sketches but also offer the drawing process. Existing pencil sketch algorithms are based on texture rendering rather than the direct imitation of strokes, making them unable to show the drawing process, only a final result. To address this challenge, we first establish a pencil-stroke imitation mechanism. Next, we develop a framework with three branches to guide stroke drawing: the first branch guides the direction of the strokes, the second branch determines the shade of the strokes, and the third branch further enhances the details. Under this framework's guidance, we can produce a pencil sketch by drawing one stroke at a time. Our method is fully interpretable. Comparison with existing pencil drawing algorithms shows that our method is superior to others in terms of texture quality, style, and user evaluation.
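To make the first branch concrete, a generic edge-tangent-flow construction is sketched below: strokes run perpendicular to the smoothed intensity gradient, while the grayscale value itself would drive the stroke shade in the second branch. This is an assumed illustration, not the paper's actual branch:

```python
import numpy as np
from scipy.signal import convolve2d

def stroke_direction_field(gray, sigma=3):
    """Vector flow guiding stroke direction for a (H, W) image in [0, 1]."""
    gy, gx = np.gradient(gray)
    # Box-smooth the gradients so neighboring strokes stay coherent.
    k = 2 * sigma + 1
    kernel = np.ones((k, k)) / (k * k)
    gx = convolve2d(gx, kernel, mode="same")
    gy = convolve2d(gy, kernel, mode="same")
    theta = np.arctan2(gy, gx) + np.pi / 2   # rotate 90 degrees: along edges
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)  # (H, W, 2)
```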
12. C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer [PDF] Back to Contents
Dongxu Wei, Xiaowei Xu, Haibin Shen, Kejie Huang
Abstract: Human video motion transfer (HVMT) aims to synthesize videos in which one person imitates other persons' actions. Although existing GAN-based HVMT methods have achieved great success, they either fail to preserve appearance details due to the loss of spatial consistency between synthesized and exemplary images, or generate incoherent video results due to the lack of temporal consistency among video frames. In this paper, we propose the Coarse-to-Fine Flow Warping Network (C2F-FWN) for spatial-temporally consistent HVMT. In particular, C2F-FWN utilizes coarse-to-fine flow warping and Layout-Constrained Deformable Convolution (LC-DConv) to improve spatial consistency, and employs a Flow Temporal Consistency (FTC) loss to enhance temporal consistency. In addition, provided with multi-source appearance inputs, C2F-FWN can support appearance attribute editing with great flexibility and efficiency. Besides public datasets, we also collected a large-scale HVMT dataset named SoloDance for evaluation. Extensive experiments conducted on our SoloDance dataset and the iPER dataset show that our approach outperforms state-of-the-art HVMT methods in terms of both spatial and temporal consistency. Source code and the SoloDance dataset are available at this https URL.
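The basic operation such a network stacks at several resolutions is dense flow warping. Below is a standard warping routine via `grid_sample`; the networks that predict the flow are omitted, and the pixel-unit flow convention is an assumption:

```python
import torch
import torch.nn.functional as F

def flow_warp(source, flow):
    """Warp `source` (B, C, H, W) by a dense flow field (B, 2, H, W) in pixels."""
    B, _, H, W = source.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float()[None].expand(B, -1, -1, -1)
    new = grid + flow                           # displaced sampling positions
    # Normalize to [-1, 1] as grid_sample expects, shape (B, H, W, 2), x first.
    new_x = new[:, 0] / (W - 1) * 2 - 1
    new_y = new[:, 1] / (H - 1) * 2 - 1
    return F.grid_sample(source, torch.stack([new_x, new_y], dim=-1),
                         align_corners=True)
```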
13. SimuGAN: Unsupervised forward modeling and optimal design of a LIDAR Camera [PDF] Back to Contents
Nir Diamant, Tal Mund, Ohad Menashe, Aviad Zabatani, Alex M. Bronstein
Abstract: An energy-saving LIDAR camera for short distances estimates an object's distance using temporally intensity-coded laser light pulses and calculates the maximum correlation with the back-scattered pulse. At low power, however, the back-scattered pulse is noisy and unstable, which leads to inaccurate and unreliable depth estimation. To address this problem, we use GANs (Generative Adversarial Networks), which are two neural networks that can learn complicated class distributions through an adversarial process. We learn the LIDAR camera's hidden properties and behavior, creating a novel, fully unsupervised forward model that simulates the camera. Then, we use the model's differentiability to explore the camera parameter space and optimize those parameters in terms of depth, accuracy, and stability. To achieve this goal, we also propose a new custom loss function designed for the back-scattered code distribution's weaknesses and its circular behavior. The results are demonstrated on both synthetic and real data.
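The distance-by-maximum-correlation principle in the first sentence fits in a few lines of NumPy; the code length, delay, and noise level below are made-up stand-ins, and the GAN forward model itself is not reproduced:

```python
import numpy as np

C = 3e8                    # speed of light, m/s
DT = 1e-9                  # sample period of the pulse code, 1 ns

code = np.random.randint(0, 2, 64).astype(float)  # temporal intensity code
true_delay = 23                                   # time of flight, in samples
echo = np.zeros(256)
echo[true_delay:true_delay + 64] = 0.4 * code     # attenuated return
echo += 0.05 * np.random.randn(256)               # sensor noise

# The camera estimates distance at the lag of maximum cross-correlation.
delay = int(np.argmax(np.correlate(echo, code, mode="valid")))
distance = delay * DT * C / 2                     # round trip: divide by 2
print(delay, f"{distance:.2f} m")                 # ~23, ~3.45 m
```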
14. Deep Reinforcement Learning of Graph Matching [PDF] Back to Contents
Chang Liu, Runzhong Wang, Zetian Jiang, Junchi Yan
Abstract: Graph matching under node and pairwise constraints has been a building block in areas from combinatorial optimization and machine learning to computer vision, for effective structural representation and association. We present a reinforcement learning solver that seeks the node correspondence between two graphs, whereby the node embedding model on the association graph is learned to sequentially find the node-to-node matching. Our method differs from previous deep graph matching models in that they focus on front-end feature and affinity-function learning, while ours aims to learn the backend decision making given an affinity objective function, whether that function is learned or not. Such an objective-function-maximization setting naturally fits the reinforcement learning mechanism, whose learning procedure is label-free. Besides, the model is not restricted to a fixed number of nodes for matching. These features make it more suitable for practical usage. Extensive experimental results on synthetic datasets, natural images, and QAPLIB showcase superior performance regarding both matching accuracy and efficiency. To the best of our knowledge, this is the first deep reinforcement learning solver for graph matching.
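The sequential decision process can be pictured with a greedy rollout that, at each step, matches the remaining node pair of maximal affinity; in the paper, a learned policy acting on node embeddings of the association graph would replace the argmax, so treat the sketch as schematic only:

```python
import numpy as np

def greedy_sequential_matching(A):
    """Sequentially pick node pairs with maximal affinity.

    A[i, j]: affinity of matching node i of graph 1 to node j of graph 2.
    """
    A = A.astype(float).copy()
    matching = []
    for _ in range(min(A.shape)):
        i, j = np.unravel_index(np.argmax(A), A.shape)
        matching.append((int(i), int(j)))
        A[i, :] = -np.inf   # each node may be matched at most once
        A[:, j] = -np.inf
    return matching

print(greedy_sequential_matching(np.random.rand(4, 5)))
```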
15. SAfE: Self-Attention Based Unsupervised Road Safety Classification in Hazardous Environments [PDF] Back to Contents
Divya Kothandaraman, Rohan Chandra, Dinesh Manocha
Abstract: We present a novel approach, SAfE, that can identify parts of an outdoor scene that are safe for driving, based on attention models. Our formulation is designed for hazardous weather conditions that can impair the visibility of human drivers as well as autonomous vehicles, increasing the risk of accidents. Our approach is unsupervised and uses domain adaptation, with entropy minimization and attention transfer discriminators, to leverage the large amounts of labeled data corresponding to clear weather conditions. Our attention transfer discriminator uses attention maps from the clear weather image to help the network learn relevant regions to attend to, on the images from the hazardous weather dataset. We conduct experiments on CityScapes simulated datasets depicting various weather conditions such as rain, fog, and snow under different intensities, and additionally on Berkeley Deep Drive. Our results show that using attention models improves the standard unsupervised domain adaptation performance by 29.29%. Furthermore, we also compare with unsupervised domain adaptation methods and show an improvement of at least 12.02% (mIoU) over the state-of-the-art.
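Of the two adaptation signals named in the abstract, entropy minimization is the simpler to make concrete: the loss below is the mean per-pixel prediction entropy on unlabeled hazardous-weather images (the attention-transfer discriminator is omitted, and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def entropy_minimization_loss(logits):
    """Mean per-pixel prediction entropy of (B, num_classes, H, W) logits.

    Minimizing this on unlabeled target-domain images pushes the segmentation
    model toward confident predictions under hazardous weather.
    """
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    return -(p * log_p).sum(dim=1).mean()

loss = entropy_minimization_loss(torch.randn(2, 19, 64, 128))
```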
16. Copyspace: Where to Write on Images? [PDF] Back to Contents
Jessica M. Lundin, Michael Sollami, Brian Lonsdorf, Alan Ross, Owen Schoppe, David Woodward, Sönke Rohde
Abstract: The placement of text over an image is an important part of producing high-quality visual designs. Automating this work by determining appropriate position, orientation, and style for textual elements requires understanding the contents of the background image. We refer to the search for aesthetic parameters of text rendered over images as "copyspace detection", noting that this task is distinct from foreground-background separation. We have developed solutions using one- and two-stage object detection methodologies trained on expertly labeled data. This workshop will examine such algorithms for copyspace detection and demonstrate their application in generative design models and pipelines such as Einstein Designer.
17. FuseVis: Interpreting neural networks for image fusion using per-pixel saliency visualization [PDF] Back to Contents
Nishant Kumar, Stefan Gumhold
Abstract: Image fusion helps in merging two or more images to construct a more informative single fused image. Recently, unsupervised learning based convolutional neural networks (CNN) have been utilized for different types of image fusion tasks such as medical image fusion, infrared-visible image fusion for autonomous driving as well as multi-focus and multi-exposure image fusion for satellite imagery. However, it is challenging to analyze the reliability of these CNNs for the image fusion tasks since no groundtruth is available. This led to the use of a wide variety of model architectures and optimization functions yielding quite different fusion results. Additionally, due to the highly opaque nature of such neural networks, it is difficult to explain the internal mechanics behind its fusion results. To overcome these challenges, we present a novel real-time visualization tool, named FuseVis, with which the end-user can compute per-pixel saliency maps that examine the influence of the input image pixels on each pixel of the fused image. We trained several image fusion based CNNs on medical image pairs and then using our FuseVis tool, we performed case studies on a specific clinical application by interpreting the saliency maps from each of the fusion methods. We specifically visualized the relative influence of each input image on the predictions of the fused image and showed that some of the evaluated image fusion methods are better suited for the specific clinical application. To the best of our knowledge, currently, there is no approach for visual analysis of neural networks for image fusion. Therefore, this work opens up a new research direction to improve the interpretability of deep fusion networks. The FuseVis tool can also be adapted in other deep neural network based image processing applications to make them interpretable.
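The per-pixel saliency maps the tool visualizes can be approximated with plain autograd: the gradient of a single fused output pixel with respect to both input images. The two-input fusion interface below is an assumption; FuseVis's real-time computation strategy is not reproduced:

```python
import torch

def per_pixel_saliency(fusion_net, img_a, img_b, y, x):
    """Gradient-based saliency of fused pixel (y, x) w.r.t. both inputs.

    fusion_net maps two (1, 1, H, W) images to one fused (1, 1, H, W) image.
    """
    img_a = img_a.clone().requires_grad_(True)
    img_b = img_b.clone().requires_grad_(True)
    fused = fusion_net(img_a, img_b)
    fused[0, 0, y, x].backward()
    return img_a.grad[0, 0].abs(), img_b.grad[0, 0].abs()

# Example with a trivial stand-in "fusion network": a pixelwise average.
avg = lambda a, b: (a + b) / 2
sal_a, sal_b = per_pixel_saliency(avg, torch.rand(1, 1, 8, 8),
                                  torch.rand(1, 1, 8, 8), 4, 4)
```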
18. Unsupervised Image Segmentation using Mutual Mean-Teaching [PDF] 返回目录
Zhichao Wu, Lei Guo, Hao Zhang, Dan Xu
Abstract: Unsupervised image segmentation aims at assigning pixels with similar features to the same cluster without annotations, which is an important task in computer vision. Due to a lack of prior knowledge, most existing models usually need to be trained several times to obtain suitable results. To address this problem, we propose an unsupervised image segmentation model based on the Mutual Mean-Teaching (MMT) framework to produce more stable results. In addition, since the pixel labels from the two models are not matched, a label alignment algorithm based on the Hungarian algorithm is proposed to match the cluster labels. Experimental results demonstrate that the proposed model is able to segment various types of images and achieves better performance than the existing methods.
摘要:无监督图像分割的目的是将具有相似特征的像素分配到没有注释的同一群集中,这是计算机视觉中的一项重要任务。 由于缺乏先验知识,大多数现有模型通常需要进行几次训练才能获得合适的结果。 为了解决这个问题,我们提出了一种基于互均教学(MMT)框架的无监督图像分割模型,以产生更稳定的结果。 另外,由于两个模型的像素标签不匹配,提出了一种基于匈牙利算法的标签对齐算法来匹配聚类标签。 实验结果表明,所提出的模型能够分割各种类型的图像,并且比现有方法具有更好的性能。
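The label alignment step maps the cluster ids of one model onto the other by maximizing their overlap with the Hungarian algorithm. A minimal sketch using SciPy's implementation; the overlap-matrix construction is an illustrative assumption, not necessarily the paper's exact cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(labels_a, labels_b, n_clusters):
    """Map model B's cluster ids onto model A's via the Hungarian algorithm."""
    # Overlap count between every pair of cluster ids.
    overlap = np.zeros((n_clusters, n_clusters), dtype=np.int64)
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    # Maximizing total overlap = minimizing its negation.
    row, col = linear_sum_assignment(-overlap)
    mapping = dict(zip(col, row))
    return np.array([mapping[b] for b in labels_b])
```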
19. Self-Supervised Person Detection in 2D Range Data using a Calibrated Camera [PDF] 返回目录
Dan Jia, Mats Steinweg, Alexander Hermans, Bastian Leibe
Abstract: Deep learning is the essential building block of state-of-the-art person detectors in 2D range data. However, only a few annotated datasets are available for training and testing these deep networks, potentially limiting their performance when deployed in new environments or with different LiDAR models. We propose a method, which uses bounding boxes from an image-based detector (e.g. Faster R-CNN) on a calibrated camera to automatically generate training labels (called pseudo-labels) for 2D LiDAR-based person detectors. Through experiments on the JackRabbot dataset with two detector models, DROW3 and DR-SPAAM, we show that self-supervised detectors, trained or fine-tuned with pseudo-labels, outperform detectors trained using manual annotations from a different dataset. Combined with robust training techniques, the self-supervised detectors reach a performance close to the ones trained using manual annotations. Our method is an effective way to improve person detectors during deployment without any additional labeling effort, and we release our source code to support relevant robotic applications.
摘要:深度学习是2D范围数据中最新型人检测器的基本构建块。但是,只有少数带注释的数据集可用于训练和测试这些深度网络,从而在新环境或不同LiDAR模型中部署时可能会限制其性能。我们提出了一种方法,该方法使用经过校准的相机上基于图像的检测器(例如Faster R-CNN)的边界框来自动生成基于2D LiDAR的人员检测器的训练标签(称为伪标签)。通过在具有两种检测器模型DROW3和DR-SPAAM的JackRabbot数据集上进行的实验,我们表明,使用伪标签训练或微调的自监督检测器性能优于使用来自不同数据集的手动注释训练的检测器。结合强大的训练技术,自我监督探测器的性能接近使用手动注释训练的探测器。我们的方法是在部署过程中改进人员检测器的有效方法,而无需任何额外的标记工作,并且我们发布了源代码以支持相关的机器人应用程序。
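The core of the pseudo-labeling step is geometric: project the 2D LiDAR points into the image with the camera calibration and mark points that land inside a detector box as "person". A schematic sketch under a simple pinhole model; `K`, `R`, `t` and the box format are assumptions, not the released code:

```python
import numpy as np

def pseudo_labels(scan_xy, boxes, K, R, t):
    """Mark 2D LiDAR points that project inside an image-space person box.

    scan_xy: (N, 2) points from the 2D range scan (sensor plane, z = 0).
    boxes:   (M, 4) detector boxes as (x1, y1, x2, y2) in pixels.
    K, R, t: camera intrinsics and LiDAR-to-camera extrinsics (assumed given).
    """
    pts = np.column_stack([scan_xy, np.zeros(len(scan_xy))])   # lift to 3D
    cam = (R @ pts.T + t.reshape(3, 1)).T                      # camera frame
    uvw = (K @ cam.T).T
    in_front = uvw[:, 2] > 0                                   # visible points
    u = uvw[:, 0] / np.where(in_front, uvw[:, 2], 1.0)
    v = uvw[:, 1] / np.where(in_front, uvw[:, 2], 1.0)
    labels = np.zeros(len(scan_xy), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        labels |= in_front & (u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
    return labels
```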
20. Temporal Graph Modeling for Skeleton-based Action Recognition [PDF] 返回目录
Jianan Li, Xuemei Xie, Zhifu Zhao, Yuhan Cao, Qingzhe Pan, Guangming Shi
Abstract: Graph Convolutional Networks (GCNs), which model skeleton data as graphs, have obtained remarkable performance for skeleton-based action recognition. In particular, the temporal dynamics of a skeleton sequence convey significant information for the recognition task. For temporal dynamic modeling, GCN-based methods only stack multi-layer 1D local convolutions to extract temporal relations between adjacent time steps. As many local convolutions are repeated, key temporal information between non-adjacent time steps may be lost due to information dilution. It therefore remains unclear how these methods can fully explore the temporal dynamics of skeleton sequences. In this paper, we propose a Temporal Enhanced Graph Convolutional Network (TE-GCN) to tackle this limitation. The proposed TE-GCN constructs a temporal relation graph to capture complex temporal dynamics. Specifically, the constructed temporal relation graph explicitly builds connections between semantically related temporal features to model temporal relations between both adjacent and non-adjacent time steps. Meanwhile, to further explore the temporal dynamics, a multi-head mechanism is designed to investigate multiple kinds of temporal relations. Extensive experiments are performed on two widely used large-scale datasets, NTU-60 RGB+D and NTU-120 RGB+D. Experimental results show that the proposed model achieves state-of-the-art performance by contributing to temporal modeling for action recognition.
摘要:图卷积网络(GCN)将骨架数据建模为图形,在基于骨架的动作识别中获得了卓越的性能。特别是,骨架序列的时间动态在识别任务中传达了重要信息。对于时间动态建模,基于GCN的方法仅堆叠多层一维局部卷积以提取相邻时间步之间的时间关系。随着大量局部卷积的重复,由于信息稀释,具有不相邻时间距离的关键时间信息可能会被忽略。因此,这些方法仍不清楚如何充分探索骨骼序列的时间动态。在本文中,我们提出了一种时间增强图卷积网络(TE-GCN)来解决此限制。提出的TE-GCN构造时间关系图以捕获复杂的时间动态。具体而言,所构造的时间关系图显式地建立语义相关的时间特征之间的连接,以对相邻时间步长和非相邻时间步长之间的时间关系建模。同时,为了进一步探索足够的时间动态性,设计了多头机制来研究多种时间关系。在两个广泛使用的大型数据集NTU-60 RGB + D和NTU-120 RGB + D上进行了广泛的实验。实验结果表明,该模型通过为动作识别的时间建模做出了贡献,从而达到了最新的性能。
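The "temporal relation graph" with a multi-head mechanism can be read as an attention-style adjacency over time steps, letting non-adjacent steps exchange information directly. A hedged sketch of that reading (dimensions, projections and scoring are illustrative, not the paper's exact TE-GCN):

```python
import torch
import torch.nn.functional as F

def temporal_relation(x, wq, wk, n_heads=4):
    """Build per-head temporal adjacency and aggregate features over time.

    x:      (T, C) one joint's features across T time steps.
    wq, wk: (C, C) learned projections (assumed; shared by all heads).
    """
    T, C = x.shape
    d = C // n_heads
    q = (x @ wq).reshape(T, n_heads, d).transpose(0, 1)        # (H, T, d)
    k = (x @ wk).reshape(T, n_heads, d).transpose(0, 1)
    # Relation graph: every time step connects to every other, adjacent or not.
    adj = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)  # (H, T, T)
    v = x.reshape(T, n_heads, d).transpose(0, 1)
    out = adj @ v                                              # (H, T, d)
    return out.transpose(0, 1).reshape(T, C)
```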
21. Latent Space Conditioning on Generative Adversarial Networks [PDF] 返回目录
Ricard Durall, Kalun Ho, Franz-Josef Pfreundt, Janis Keuper
Abstract: Generative adversarial networks are the state-of-the-art approach to learned synthetic image generation. Although early successes were mostly unsupervised, bit by bit this trend has been superseded by approaches based on labelled data. These supervised methods allow much finer-grained control of the output image, offering more flexibility and stability. Nevertheless, the main drawback of such models is the necessity of annotated data. In this work, we introduce a novel framework that benefits from two popular learning techniques, adversarial training and representation learning, and takes a step towards unsupervised conditional GANs. In particular, our approach exploits the structure of a latent space (learned by the representation learning) and employs it to condition the generative model. In this way, we break the traditional dependency between condition and label, substituting the latter with unsupervised features coming from the latent space. Finally, we show that this new technique is able to produce samples on demand while keeping the quality of its supervised counterpart.
摘要:生成对抗网络是学习合成图像的最先进方法。尽管早期的成功几乎没有一点点地受到监督,但这种趋势已被基于标记数据的方法所取代。这些受监督的方法可以对输出图像进行更细粒度的控制,从而提供更大的灵活性和稳定性。但是,此类模型的主要缺点是必须带有注释的数据。在这项工作中,我们介绍了一个新颖的框架,该框架受益于两种流行的学习技术(对抗训练和表示学习),并朝着无监督的条件GAN迈了一步。尤其是,我们的方法利用了潜在空间的结构(通过表示学习来学习),并将其用于条件生成模型。通过这种方式,我们打破了条件和标签之间的传统依赖性,用来自潜在空间的无监督特征代替了后者。最后,我们证明了这项新技术能够按需生产样品,并保持其受监督同类产品的质量。
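Conditioning on latent-space structure instead of labels can be sketched as: embed real images with a representation-learning encoder, cluster the embeddings, and use cluster ids in place of class labels for the conditional GAN. A minimal sketch; the encoder and the k-means choice are assumptions:

```python
from sklearn.cluster import KMeans

def pseudo_conditions(embeddings, n_conditions=10, seed=0):
    """Replace class labels with cluster ids of self-supervised embeddings.

    embeddings: (N, D) features from any representation-learning encoder.
    Returns one pseudo-label per image, usable as a GAN condition.
    """
    km = KMeans(n_clusters=n_conditions, random_state=seed, n_init=10)
    return km.fit_predict(embeddings)
```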
22. Analysing the Direction of Emotional Influence in Nonverbal Dyadic Communication: A Facial-Expression Study [PDF] 返回目录
Maha Shadaydeh, Lea Mueller, Dana Schneider, Martin Thuemmel, Thomas Kessler, Joachim Denzler
Abstract: Identifying the direction of emotional influence in a dyadic dialogue is of increasing interest in the psychological sciences with applications in psychotherapy, analysis of political interactions, or interpersonal conflict behavior. Facial expressions are widely described as being automatic and thus hard to overtly influence. As such, they are a perfect measure for a better understanding of unintentional behavior cues about social-emotional cognitive processes. With this view, this study is concerned with the analysis of the direction of emotional influence in dyadic dialogue based on facial expressions only. We exploit computer vision capabilities along with causal inference theory for quantitative verification of hypotheses on the direction of emotional influence, i.e., causal effect relationships, in dyadic dialogues. We address two main issues. First, in a dyadic dialogue, emotional influence occurs over transient time intervals and with intensity and direction that are variant over time. To this end, we propose a relevant interval selection approach that we use prior to causal inference to identify those transient intervals where causal inference should be applied. Second, we propose to use fine-grained facial expressions that are present when strong distinct facial emotions are not visible. To specify the direction of influence, we apply the concept of Granger causality to the time series of facial expressions over selected relevant intervals. We tested our approach on newly, experimentally obtained data. Based on the quantitative verification of hypotheses on the direction of emotional influence, we were able to show that the proposed approach is most promising to reveal the causal effect pattern in various instructed interaction conditions.
摘要:在二元对话中确定情感影响的方向越来越受到心理学界的关注,这些心理学应用于心理治疗,政治互动分析或人际冲突行为中。面部表情被广泛描述为自动的,因此很难公开影响。因此,它们是更好地了解关于社会情感认知过程的无意识行为线索的完美措施。有鉴于此,本研究仅针对基于面部表情的二元对话中情感影响的方向分析。我们利用计算机视觉功能以及因果推理理论对二元对话中情感影响的方向(即因果关系)的假设进行定量验证。我们解决两个主要问题。首先,在二元对话中,情感影响发生在瞬态时间间隔内,强度和方向随时间变化。为此,我们提出了一种相关的区间选择方法,在因果推理之前使用它来识别应应用因果推理的那些瞬时间隔。其次,我们建议使用当看不到强烈的明显面部表情时出现的细粒度面部表情。为了指定影响的方向,我们将格兰杰因果关系的概念应用于在选定的相关间隔内的面部表情的时间序列。我们在实验获得的新数据上测试了我们的方法。基于对情绪影响方向的假设的定量验证,我们能够证明所提出的方法最有可能揭示各种指示的交互条件下的因果关系模式。
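The direction-of-influence test boils down to running Granger causality in both directions on two expression time series within a selected interval. A minimal sketch with statsmodels; the feature extraction and relevant-interval selection are outside this snippet:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def influence_direction(expr_a, expr_b, max_lag=5, alpha=0.05):
    """Test 'A Granger-causes B' and vice versa on one relevant interval.

    expr_a, expr_b: 1-D arrays, e.g. a facial action unit intensity over time.
    Returns whether each direction is significant at level alpha.
    """
    def best_p(cause, effect):
        # grangercausalitytests expects column 2 to (possibly) cause column 1.
        data = np.column_stack([effect, cause])
        res = grangercausalitytests(data, maxlag=max_lag, verbose=False)
        return min(r[0]["ssr_ftest"][1] for r in res.values())

    return {"A->B": best_p(expr_a, expr_b) < alpha,
            "B->A": best_p(expr_b, expr_a) < alpha}
```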
23. Revisiting 3D Context Modeling with Supervised Pre-training for Universal Lesion Detection in CT Slices [PDF] 返回目录
Shu Zhang, Jincheng Xu, Yu-Chun Chen, Jiechao Ma, Zihao Li, Yizhou Wang, Yizhou Yu
Abstract: Universal lesion detection from computed tomography (CT) slices is important for comprehensive disease screening. Since each lesion can locate in multiple adjacent slices, 3D context modeling is of great significance for developing automated lesion detection algorithms. In this work, we propose a Modified Pseudo-3D Feature Pyramid Network (MP3D FPN) that leverages depthwise separable convolutional filters and a group transform module (GTM) to efficiently extract 3D context enhanced 2D features for universal lesion detection in CT slices. To facilitate faster convergence, a novel 3D network pre-training method is derived using solely large-scale 2D object detection dataset in the natural image domain. We demonstrate that with the novel pre-training method, the proposed MP3D FPN achieves state-of-the-art detection performance on the DeepLesion dataset (3.48% absolute improvement in the sensitivity of FPs@0.5), significantly surpassing the baseline method by up to 6.06% (in MAP@0.5) which adopts 2D convolution for 3D context modeling. Moreover, the proposed 3D pre-trained weights can potentially be used to boost the performance of other 3D medical image analysis tasks.
摘要:从计算机断层扫描(CT)切片中检测出的普遍病变对于全面的疾病筛查至关重要。由于每个病变都可以位于多个相邻切片中,因此3D上下文建模对于开发自动化病变检测算法非常重要。在这项工作中,我们提出了一种改进的伪3D特征金字塔网络(MP3D FPN),该网络利用深度可分离卷积滤波器和组变换模块(GTM)来有效提取3D上下文增强的2D特征,以用于CT切片中的通用病变检测。为了促进更快的收敛,仅在自然图像域中使用大规模2D对象检测数据集就得出了一种新颖的3D网络预训练方法。我们证明,通过新颖的预训练方法,拟议的MP3D FPN在DeepLesion数据集上实现了最新的检测性能(FPs@0.5的灵敏度绝对提高了3.48%),远远超过了基线方法。达到6.06%(在MAP@0.5中),采用2D卷积进行3D上下文建模。此外,建议的3D预训练权重可以潜在地用于提高其他3D医学图像分析任务的性能。
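The depthwise separable convolutions used for 3D context can be sketched in PyTorch as a per-channel 3D convolution followed by a 1x1x1 pointwise convolution; the channel counts and kernel size below are illustrative:

```python
import torch.nn as nn

class DepthwiseSeparable3D(nn.Module):
    """3D depthwise + pointwise convolution: cheap cross-slice context."""

    def __init__(self, channels, out_channels, k=3):
        super().__init__()
        # One filter per channel (groups=channels): spatial/slice mixing only.
        self.depthwise = nn.Conv3d(channels, channels, k, padding=k // 2,
                                   groups=channels)
        # 1x1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv3d(channels, out_channels, 1)

    def forward(self, x):  # x: (B, C, D, H, W)
        return self.pointwise(self.depthwise(x))
```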
24. Difficulty in estimating visual information from randomly sampled images [PDF] 返回目录
Masaki Kitayama, Hitoshi Kiya
Abstract: In this paper, we evaluate dimensionality reduction methods in terms of the difficulty of estimating visual information of original images from dimensionally reduced ones. Recently, dimensionality reduction has been receiving attention as a process that not only reduces the number of random variables, but also protects visual information for privacy-preserving machine learning. For this reason, the difficulty of estimating visual information is discussed. In particular, the random sampling method that was proposed for privacy-preserving machine learning is compared with typical dimensionality reduction methods. In an image classification experiment, the random sampling method is demonstrated not only to make such estimation highly difficult, but also to be comparable to other dimensionality reduction methods, while maintaining the property that spatial information is invariant.
摘要:在本文中,我们从难以估计尺寸缩小的原始图像视觉信息的角度,评估了尺寸缩小的方法。 近来,降维已经受到关注,这不仅是减少随机变量数量的过程,而且是为了保护隐私机器学习而保护视觉信息的过程。 因此,讨论了估计视觉信息的困难。 尤其是,针对隐私保护机器学习提出的随机抽样方法与典型的降维方法进行了比较。 在图像分类实验中,证明了随机采样方法不仅难度高,而且在保持空间信息不变性的同时,还可以与其他降维方法相提并论。
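The random sampling method can be pictured as keeping one fixed random subset of pixel positions for every image, which reduces dimensionality while making the original image hard to reconstruct. A minimal sketch; the keep ratio is illustrative:

```python
import numpy as np

def random_sampling(images, keep_ratio=0.25, seed=0):
    """Keep the same random subset of pixels for every image.

    images: (N, H, W) array; returns (N, K) reduced feature vectors.
    """
    n, h, w = images.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(h * w, size=int(keep_ratio * h * w), replace=False)
    return images.reshape(n, -1)[:, idx]
```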
25. Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [PDF] 返回目录
Kecheng Zheng, Cuiling Lan, Wenjun Zeng, Zhizheng Zhan, Zheng-Jun Zha
Abstract: Many unsupervised domain adaptive (UDA) person re-identification (ReID) approaches combine clustering-based pseudo-label prediction with feature fine-tuning. However, because of the domain gap, the pseudo-labels are not always reliable and there are noisy/incorrect labels. This can mislead feature representation learning and deteriorate performance. In this paper, we propose to estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels, by suppressing the contribution of noisy samples. We build our baseline framework using the mean teacher method together with an additional contrastive loss. We have observed that a sample with a wrong pseudo-label obtained through clustering generally has a weaker consistency between the outputs of the mean teacher model and the student model. Based on this finding, we propose to exploit the uncertainty (measured by consistency levels) to evaluate the reliability of the pseudo-label of a sample and incorporate the uncertainty to re-weight its contribution within various ReID losses, including the per-sample identity (ID) classification loss, the triplet loss, and the contrastive loss. Our uncertainty-guided optimization brings significant improvement and achieves state-of-the-art performance on benchmark datasets.
摘要:许多无监督域自适应(UDA)人员重新识别(ReID)方法将基于聚类的伪标签预测与特征微调相结合。但是,由于存在域间隙,伪标签并不总是可靠的,并且存在嘈杂/错误的标签。这会误导特征表示学习并降低性能。在本文中,我们建议通过抑制噪声样本的贡献来估计和利用分配给每个样本的伪标签的可信度,以减轻噪声标签的影响。我们使用均值教师方法以及其他对比损失来构建基准框架。我们已经观察到,通常通过聚类而具有错误伪标签的样本在均值教师模型和学生模型之间的一致性较弱。基于此发现,我们建议利用不确定性(通过一致性水平衡量)来评估样本伪标签的可靠性,并结合不确定性以在各种ReID损失(包括身份(ID))中重新加权其贡献。每个样品的分类损失,三重态损失和对比损失。我们的不确定性指导优化带来了显着改善,并在基准数据集上实现了最新的性能。
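The consistency-based re-weighting can be sketched as: measure teacher-student disagreement per sample and shrink the loss contribution of inconsistent (likely mislabeled) samples. The KL-to-weight mapping below is an illustrative assumption, not necessarily the paper's exact uncertainty measure:

```python
import torch
import torch.nn.functional as F

def uncertainty_weights(student_logits, teacher_logits):
    """Down-weight samples whose teacher/student predictions disagree.

    Higher KL divergence => less reliable pseudo-label => smaller weight.
    """
    kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="none").sum(dim=1)
    return torch.exp(-kl)              # in (0, 1], 1 = full agreement

def weighted_id_loss(student_logits, teacher_logits, pseudo_labels):
    w = uncertainty_weights(student_logits, teacher_logits).detach()
    ce = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
    return (w * ce).mean()
```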
26. Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [PDF] 返回目录
Yi Zhou, Guillermo Gallego, Xiuyuan Lu, Siqi Liu, Shaojie Shen
Abstract: Identifying independently moving objects is an essential task for dynamic scene understanding. However, traditional cameras used in dynamic scenes may suffer from motion blur or exposure artifacts due to their sampling principle. By contrast, event-based cameras are novel bio-inspired sensors that offer advantages to overcome such limitations. They report pixel-wise intensity changes asynchronously, which enables them to acquire visual information at exactly the same rate as the scene dynamics. We have developed a method to identify independently moving objects acquired with an event-based camera, i.e., to solve the event-based motion segmentation problem. This paper describes how to formulate the problem as a weakly-constrained multi-model fitting one via energy minimization, and how to jointly solve its two subproblems, event-cluster assignment (labeling) and motion model fitting, in an iterative manner by exploiting the spatio-temporal structure of input events in the form of a space-time graph. Experiments on available datasets demonstrate the versatility of the method in scenes with different motion patterns and numbers of moving objects. The evaluation shows that the method performs on par with or better than the state of the art without having to predetermine the number of expected moving objects.
摘要:识别独立移动的对象是动态场景理解的一项基本任务。然而,由于其采样原理,用于动态场景的传统相机可能会遭受运动模糊或曝光伪影的困扰。相比之下,基于事件的相机是新颖的生物灵感传感器,具有克服此类限制的优势。他们异步报告像素方向的强度变化,这使他们能够以与场景动态完全相同的速率获取视觉信息。我们已经开发出一种方法来识别使用基于事件的摄像机获取的独立运动对象,即解决基于事件的运动分割问题。本文描述了如何通过能量最小化将问题表达为弱约束的多模型拟合问题,以及如何以迭代方式通过以下方式共同解决其两个子问题-事件群集分配(标记)和运动模型拟合-以时空图的形式利用输入事件的时空结构。在可用数据集上的实验证明了该方法在具有不同运动模式和运动对象数量的场景中的多功能性。评估表明,该方法的性能与现有技术相当或优于现有技术,而无需预先确定预期的运动对象的数量。
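Stripped of the graph-cut smoothness term, the iterative scheme alternates between assigning events to motion models and refitting each model on its assigned events. A generic sketch of that alternation; all function names are illustrative placeholders:

```python
import numpy as np

def fit_motions(events, models, fit_fn, cost_fn, n_iters=10):
    """Generic alternation between event-cluster assignment and model fitting.

    events: (N, 3) array of (x, y, t). models: list of initial motion-model
    parameters. fit_fn(events) -> params; cost_fn(events, params) -> per-event
    costs. The paper additionally enforces spatio-temporal smoothness via
    graph cuts on a space-time graph, which is omitted here.
    """
    labels = np.zeros(len(events), dtype=int)
    for _ in range(n_iters):
        # 1) Labeling: assign each event to its cheapest motion model.
        costs = np.stack([cost_fn(events, m) for m in models])  # (M, N)
        labels = np.argmin(costs, axis=0)
        # 2) Fitting: refit each model on its currently assigned events.
        models = [fit_fn(events[labels == j]) if np.any(labels == j) else m
                  for j, m in enumerate(models)]
    return labels, models
```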
27. Deep Learning to Segment Pelvic Bones: Large-scale CT Datasets and Baseline Models [PDF] 返回目录
Pengbo Liu, Hu Han, Yuanqi Du, Heqin Zhu, Yinhao Li, Feng Gu, Honghu Xiao, Jun Li, Chunpeng Zhao, Li Xiao, Xinbao Wu, S.Kevin Zhou
Abstract: Purpose: Pelvic bone segmentation in CT has always been an essential step in clinical diagnosis and surgery planning of pelvic bone diseases. Existing methods for pelvic bone segmentation are either hand-crafted or semi-automatic and achieve limited accuracy when dealing with image appearance variations due to the multi-site domain shift, the presence of contrasted vessels, coprolith and chyme, bone fractures, low dose, metal artifacts, etc. Due to the lack of a large-scale pelvic CT dataset with annotations, deep learning methods are not fully explored. Methods: In this paper, we aim to bridge the data gap by curating a large pelvic CT dataset pooled from multiple sources and different manufacturers, including 1,184 CT volumes and over 320,000 slices with different resolutions and a variety of the above-mentioned appearance variations. Then we propose, for the first time to the best of our knowledge, to learn a deep multi-class network for segmenting the lumbar spine, sacrum, left hip, and right hip from multiple-domain images simultaneously to obtain more effective and robust feature representations. Finally, we introduce a post-processing tool based on the signed distance function (SDF) to eliminate false predictions while retaining correctly predicted bone fragments. Results: Extensive experiments on our dataset demonstrate the effectiveness of our automatic method, achieving an average Dice of 0.987 for a metal-free volume. The SDF post-processor yields a decrease of 10.5% in Hausdorff distance by maintaining important bone fragments in the post-processing phase. Conclusion: We believe this large-scale dataset will promote the development of the whole community and plan to open source the images, annotations, codes, and trained baseline models at this URL1.
摘要:目的:CT骨盆骨分割术一直是骨盆骨疾病临床诊断和手术计划中必不可少的步骤。现有的骨盆骨分割方法是手工制作的或半自动的,由于多位域移位,造影剂的存在,coprolith和食糜的存在,骨折,低剂量,金属制品等。由于缺少带有注释的大型骨盆CT数据集,因此尚未充分探索深度学习方法。方法:在本文中,我们旨在通过整理从多个来源和不同制造商那里收集的大型骨盆CT数据集来弥合数据鸿沟,包括1,184个CT体积和320,000多个具有不同分辨率的切片以及多种以上的-提到的外观变化。然后,根据我们的知识,我们首次建议学习一种深度多类网络,用于同时从多域图像中分割腰椎,s骨,左髋和右髋,以获得更有效和更强大的功能表示形式。最后,我们介绍一种基于符号距离函数(SDF)的后处理工具,以消除错误的预测,同时保留正确预测的骨骼碎片。结果:在我们的数据集上进行的大量实验证明了我们的自动方法的有效性,无金属体积的平均Dice达到0.987。通过在后期处理阶段保持重要的骨骼碎片,SDF后处理器可使hausdorff距离减少10.5%。结论:我们相信,该大规模数据集将促进整个社区的发展,并计划在该URL1上开源图像,注释,代码和经过训练的基线模型。
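One way to read the SDF post-processor: measure each predicted fragment's distance to the dominant bone structure and discard fragments that lie implausibly far away. A hedged sketch with SciPy; the largest-component anchor and the voxel threshold are assumptions:

```python
import numpy as np
from scipy import ndimage

def sdf_filter(pred_mask, max_dist=30):
    """Drop predicted fragments far from the dominant bone structure.

    pred_mask: 3-D boolean segmentation. max_dist: voxel threshold (assumed).
    """
    labels, n = ndimage.label(pred_mask)
    if n <= 1:
        return pred_mask
    sizes = ndimage.sum(pred_mask, labels, index=range(1, n + 1))
    main = labels == (int(np.argmax(sizes)) + 1)
    # Unsigned distance to the main component (the outside half of an SDF).
    dist = ndimage.distance_transform_edt(~main)
    keep = np.zeros_like(pred_mask)
    for i in range(1, n + 1):
        comp = labels == i
        if dist[comp].min() <= max_dist:   # fragment close enough: keep it
            keep |= comp
    return keep
```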
28. SID-NISM: A Self-supervised Low-light Image Enhancement Framework [PDF] 返回目录
Lijun Zhang, Xiao Liu, Erik Learned-Miller, Hui Guan
Abstract: When capturing images in low-light conditions, the images often suffer from low visibility, which not only degrades the visual aesthetics of images, but also significantly degenerates the performance of many computer vision algorithms. In this paper, we propose a self-supervised low-light image enhancement framework (SID-NISM), which consists of two components, a Self-supervised Image Decomposition Network (SID-Net) and a Nonlinear Illumination Saturation Mapping function (NISM). As a self-supervised network, SID-Net can decompose the given low-light image into its reflectance, illumination and noise directly, without any prior training or reference image, which greatly distinguishes it from existing supervised-learning methods. The decomposed illumination map is then enhanced by NISM. With the restored illumination map, the enhancement can be achieved accordingly. Experiments on several public challenging low-light image datasets reveal that the images enhanced by SID-NISM are more natural and have fewer unexpected artifacts.
摘要:在弱光条件下捕获图像时,图像通常会遇到可见度低的问题,这不仅会降低图像的视觉美感,而且还会大大降低许多计算机视觉算法的性能。在本文中,我们提出了一种自我监督的弱光图像增强框架(SID-NISM),该框架由两个组件组成,一个自我监督的图像分解网络(SID-Net)和一个非线性照明饱和度映射函数(NISM) 。作为一个自我监督的网络,SID-Net可以将给定的低光照图像直接分解为反射率,照度和噪声,而无需任何事先的训练或参考图像,这使其与现有的监督学习方法有很大的区别。然后,将通过NISM增强分解后的照明图。具有恢复的照明图,可以相应地实现增强。对几个公开的具有挑战性的低光照图像数据集进行的实验表明,通过SID-NISM增强的图像更加自然,并且具有较少的意外伪影。
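In Retinex terms the pipeline is: SID-Net factors the image into reflectance R and illumination L, NISM remaps L, and the enhanced image is their product. The abstract does not give NISM's exact curve, so the gamma-style saturating mapping below is only a stand-in:

```python
import numpy as np

def enhance(reflectance, illumination, gamma=2.2):
    """Retinex-style recomposition with a remapped illumination map.

    The concave curve L**(1/gamma) brightens dark regions while saturating
    toward 1 in already-bright ones; it stands in for the paper's NISM,
    whose exact form is not specified in the abstract.
    """
    l = np.clip(illumination, 1e-4, 1.0)
    l_mapped = l ** (1.0 / gamma)          # placeholder nonlinear mapping
    return np.clip(reflectance * l_mapped, 0.0, 1.0)
```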
29. Two-Stage Copy-Move Forgery Detection with Self Deep Matching and Proposal SuperGlue [PDF] 返回目录
Yaqi Liu, Chao Xia, Xiaobin Zhu, Shengwei Xu
Abstract: Copy-move forgery detection identifies a tampered image by detecting pasted and source regions in the same image. In this paper, we propose a novel two-stage framework specially for copy-move forgery detection. The first stage is a backbone self deep matching network, and the second stage is named Proposal SuperGlue. In the first stage, atrous convolution and skip matching are incorporated to enrich spatial information and leverage hierarchical features. Spatial attention is built on self-correlation to reinforce the ability to find appearance-similar regions. In the second stage, Proposal SuperGlue is proposed to remove false-alarmed regions and remedy incomplete regions. Specifically, a proposal selection strategy is designed to enclose highly suspected regions based on proposal generation and backbone score maps. Then, pairwise matching is conducted among candidate proposals by deep learning based keypoint extraction and matching, i.e., SuperPoint and SuperGlue. Integrated score map generation and refinement methods are designed to integrate results of both stages and obtain optimized results. Our two-stage framework unifies end-to-end deep matching and keypoint matching by obtaining highly suspected proposals, and opens a new gate for deep learning research in copy-move forgery detection. Experiments on publicly available datasets demonstrate the effectiveness of our two-stage framework.
摘要:复制移动伪造检测通过检测同一图像中的粘贴区域和源区域来识别被篡改的图像。在本文中,我们提出了一个新颖的两阶段框架,专门用于复制移动伪造检测。第一个阶段是骨干网自我深度匹配网络,第二个阶段被称为提案SuperGlue。在第一阶段,将无规则卷积和跳过匹配结合在一起,以丰富空间信息并利用分层功能。空间注意力建立在自相关基础上,以增强发现相似区域外观的能力。在第二阶段,建议使用SuperGlue提案,以删除错误警报的区域并补救不完整的区域。具体来说,提案选择策略旨在根据提案生成和骨干评分图将高度可疑的区域包围起来。然后,通过基于深度学习的关键点提取和匹配,即SuperPoint和SuperGlue,在候选提案之间进行成对匹配。集成的分数图生成和优化方法旨在整合两个阶段的结果并获得优化的结果。我们的两阶段框架通过获取高度可疑的提案来统一端到端深度匹配和关键点匹配,并为复制移动伪造检测的深度学习研究打开了新的大门。在公开数据集上进行的实验证明了我们的两阶段框架的有效性。
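The self-correlation underlying the spatial attention can be sketched as cosine similarity between every pair of spatial locations of a feature map; high off-diagonal responses hint at duplicated appearance. The pooling and attention layers built on top are omitted:

```python
import torch.nn.functional as F

def self_correlation(feat):
    """Pairwise cosine similarity between all spatial locations.

    feat: (C, H, W) feature map. Returns an (H*W, H*W) correlation matrix;
    strong off-diagonal values suggest copy-moved (duplicated) regions.
    """
    c, h, w = feat.shape
    v = F.normalize(feat.reshape(c, h * w), dim=0)   # unit-norm descriptors
    return v.t() @ v
```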
30. Domain Adaptive Object Detection via Feature Separation and Alignment [PDF] 返回目录
Chengyang Liang, Zixiang Zhao, Junmin Liu, Jiangshe Zhang
Abstract: Recently, adversarial-based domain adaptive object detection (DAOD) methods have been developed rapidly. However, there are two issues that need to be resolved urgently. Firstly, numerous methods reduce the distributional shifts only by aligning all the features between the source and target domain, while ignoring the private information of each domain. Secondly, DAOD should consider the feature alignment on object-containing regions in images, but redundancy of the region proposals and background noise can reduce the domain transferability. Therefore, we establish a Feature Separation and Alignment Network (FSANet) which consists of a gray-scale feature separation (GSFS) module, a local-global feature alignment (LGFA) module and a region-instance-level alignment (RILA) module. The GSFS module uses a dual-stream framework to decompose features into distractive information, which is useless for detection, and shared information, which is useful, so as to focus on the intrinsic features of objects and resolve the first issue. Then, the LGFA and RILA modules reduce the distributional shifts of the multi-level features. Notably, scale-space filtering is exploited to implement adaptive searching for regions to be aligned, and instance-level features in each region are refined to reduce the redundancy and noise mentioned in the second issue. Various experiments on multiple benchmark datasets prove that our FSANet achieves better performance on target-domain detection and surpasses the state-of-the-art methods.
摘要:近年来,基于对抗的领域自适应对象检测(DAOD)方法得到了快速发展。但是,有两个问题需要紧急解决。首先,许多方法只能通过在源域和目标域之间对齐所有特征来减少分布偏移,而忽略每个域的私有信息。其次,DAOD应考虑图像中对象现有区域的特征对齐。但是区域提议的冗余和背景噪声会降低域的可转移性。因此,我们建立了一个特征分离与对齐网络(FSANet),该网络由一个灰度特征分离(GSFS)模块,一个局部全局特征对齐(LGFA)模块和一个区域实例级水平对齐(RILA)模块组成。 GSFS模块分解了分散/共享的信息,这些信息对于双流框架的检测是无用/有用的,以关注对象的固有特征并解决第一个问题。然后,LGFA和RILA模块减少了多层功能的分布偏移。值得注意的是,利用比例空间过滤对要对齐的区域实施自适应搜索,并对每个区域中的实例级特征进行了改进以减少第二期中提到的冗余和噪声。在多个基准数据集上进行的各种实验证明,我们的FSANet在目标域检测上取得了更好的性能,并超过了最新的方法。
31. StrokeGAN: Reducing Mode Collapse in Chinese Font Generation via Stroke Encoding [PDF] 返回目录
Jinshan Zeng, Qi Chen, Yunxin Liu, Mingwen Wang, Yuan Yao
Abstract: The generation of stylish Chinese fonts is an important problem involved in many applications. Most existing generation methods are based on deep generative models, particularly the generative adversarial network (GAN) based models. However, these deep generative models may suffer from the mode collapse issue, which significantly degrades the diversity and quality of generated results. In this paper, we introduce a one-bit stroke encoding to capture the key mode information of Chinese characters and then incorporate it into CycleGAN, a popular deep generative model for Chinese font generation. As a result, we propose an efficient method called StrokeGAN, mainly motivated by the observation that the stroke encoding contains a large amount of mode information of Chinese characters. In order to reconstruct the one-bit stroke encoding of the associated generated characters, we introduce a stroke-encoding reconstruction loss imposed on the discriminator. Equipped with such one-bit stroke encoding and stroke-encoding reconstruction loss, the mode collapse issue of CycleGAN can be significantly alleviated, with improved preservation of strokes and diversity of generated characters. The effectiveness of StrokeGAN is demonstrated by a series of generation tasks over nine datasets with different fonts. The numerical results demonstrate that StrokeGAN generally outperforms the state-of-the-art methods in terms of content and recognition accuracies, as well as certain stroke errors, and also generates more realistic characters.
摘要:时尚的中文字体的生成是许多应用中涉及的重要问题。现有的大多数生成方法都基于深度生成模型,尤其是基于生成对抗网络(GAN)的模型。但是,这些深度生成模型可能会遭受模式崩溃问题的困扰,这会严重降低所生成结果的多样性和质量。在本文中,我们介绍了一种位笔画编码,以捕获汉字的键模式信息,然后将其合并到CycleGAN中,这是一种流行的用于中文字体生成的深度生成模型。因此,我们提出了一种称为StrokeGAN的有效方法,其主要目的是观察笔划编码包含大量汉字模式信息。为了重建相关联的生成字符的一位笔画编码,我们引入了施加在鉴别器上的笔画编码重建损失。配备了这样的一位笔划编码和笔划编码重建损失,可以显着缓解CycleGAN的模式崩溃问题,并改善笔划的保留性和生成字符的多样性。通过在9种具有不同字体的数据集上进行的一系列生成任务,证明了StrokeGAN的有效性。数值结果表明,在内容和识别准确性以及某些笔划错误方面,StrokeGAN通常优于最新方法。
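A one-bit stroke encoding is a fixed-length binary vector with one bit per basic stroke type, set iff the character contains that stroke. The reconstruction loss on the discriminator can then be a plain binary cross-entropy; the head producing `pred_logits` is assumed:

```python
import torch.nn.functional as F

def stroke_reconstruction_loss(pred_logits, stroke_code):
    """BCE between predicted and ground-truth one-bit stroke encodings.

    stroke_code: (B, S) binary tensor, one bit per basic stroke type,
    1 iff the character contains that stroke (S = number of stroke types).
    Added to the adversarial loss with some weight when training
    the discriminator, e.g. d_loss = adv_loss + lam * stroke_loss.
    """
    return F.binary_cross_entropy_with_logits(pred_logits,
                                              stroke_code.float())
```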
32. Training an Emotion Detection Classifier using Frames from a Mobile Therapeutic Game for Children with Developmental Disorders [PDF] 返回目录
Peter Washington, Haik Kalantarian, Jack Kent, Arman Husic, Aaron Kline, Emilie Leblanc, Cathy Hou, Cezmi Mutlu, Kaitlyn Dunlap, Yordan Penev, Maya Varma, Nate Stockham, Brianna Chrisman, Kelley Paskov, Min Woo Sun, Jae-Yoon Jung, Catalin Voss, Nick Haber, Dennis P. Wall
Abstract: Automated emotion classification could aid those who struggle to recognize emotion, including children with developmental behavioral conditions such as autism. However, most computer vision emotion models are trained on adult affect and therefore underperform on child faces. In this study, we designed a strategy to gamify the collection and the labeling of child affect data in an effort to boost the performance of automatic child emotion detection to a level closer to what will be needed for translational digital healthcare. We leveraged our therapeutic smartphone game, GuessWhat, which was designed in large part for children with developmental and behavioral conditions, to gamify the secure collection of video data of children expressing a variety of emotions prompted by the game. Through a secure web interface gamifying the human labeling effort, we gathered and labeled 2,155 videos, 39,968 emotion frames, and 106,001 labels on all images. With this drastically expanded pediatric emotion-centric database (>30x larger than existing public pediatric affect datasets), we trained a pediatric emotion classification convolutional neural network (CNN) classifier of happy, sad, surprised, fearful, angry, disgust, and neutral expressions in children. The classifier achieved 66.9% balanced accuracy and 67.4% F1-score on the entirety of CAFE, as well as 79.1% balanced accuracy and 78.0% F1-score on CAFE Subset A, a subset containing at least 60% human agreement on emotion labels. This performance is at least 10% higher than all previously published classifiers, the best of which reached 56% balanced accuracy even when combining "anger" and "disgust" into a single class. This work validates that mobile games designed for pediatric therapies can generate high volumes of domain-relevant datasets to train state-of-the-art classifiers to perform tasks highly relevant to precision health efforts.
摘要:自动的情绪分类可以帮助那些难以识别情绪的人,包括患有自闭症等发育行为状况的儿童。但是,大多数计算机视觉情感模型都是针对成人情感进行训练的,因此在儿童脸上的表现不佳。在这项研究中,我们设计了一种策略来对儿童情感数据的收集和标记进行游戏化,以努力将儿童情感自动检测的性能提高到更接近转化数字医疗所需的水平。我们利用了我们的智能治疗手机游戏GuessWhat,该游戏在很大程度上是为有发育和行为状况的儿童设计的,它可以安全地收集表示游戏引起的各种情绪的儿童视频数据。通过一个将人工标注工作游戏化的安全网络界面,我们在所有图像上收集并标记了2155个视频,39968个情感帧和106001个标签。借助这个以儿童情感为中心的数据库(比现有的公共儿童情感数据集大30倍以上),我们训练了一个儿童情感分类卷积神经网络(CNN)分类器,用于识别儿童的快乐,悲伤,惊讶,恐惧,愤怒,厌恶和中立表情。该分类器在整个CAFE上达到了66.9%的平衡准确度和67.4%的F1得分,以及在CAFE子集A(该子集的情绪标签至少有60%的人工标注一致)上获得了79.1%的平衡准确度和78.0%的F1得分。该性能比所有以前发布的分类器至少高出10%,即使将"愤怒"和"厌恶"组合到一个类别中,其最好的平衡精度也达到了56%。这项工作验证了专为儿科治疗设计的手机游戏可以生成大量与领域相关的数据集,以训练最新的分类器来执行与精准健康工作高度相关的任务。
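Balanced accuracy, the headline metric here, is the unweighted mean of per-class recall, so frequent classes such as "neutral" cannot dominate the score. A minimal sketch (macro-averaged F1 is an assumption about how the F1-score was aggregated):

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

def report(y_true, y_pred):
    """Balanced accuracy = unweighted mean of per-class recall."""
    return {"balanced_acc": balanced_accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}
```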
33. A Closer Look at the Robustness of Vision-and-Language Pre-trained Models [PDF] 返回目录
Linjie Li, Zhe Gan, Jingjing Liu
Abstract: Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level. Although these models achieve impressive performance on standard tasks, it still remains unclear how robust they are. To investigate, we conduct a host of thorough evaluations of existing pre-trained models over 4 different types of V+L specific model robustness: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Interestingly, with standard model finetuning, pre-trained V+L models already exhibit better robustness than many task-specific state-of-the-art methods. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool pre-trained V+L models. Differing from previous studies focused on one specific type of robustness, Mango is task-agnostic, and enables universal performance lift for pre-trained models over diverse tasks designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves new state of the art on 7 out of 9 robustness benchmarks, surpassing existing methods by a significant margin. As the first comprehensive study on V+L robustness, this work puts robustness of pre-trained models into sharper focus, pointing to new directions for future study.
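The core mechanism named above, learning adversarial noise directly in the embedding space, lends itself to a compact illustration. The toy below is a hedged sketch, not Mango itself: it learns a single universal noise vector by gradient ascent against a frozen linear classifier on synthetic embeddings, and every name, dimension, and constant is a hypothetical stand-in.

    import numpy as np

    # Hedged toy sketch of embedding-space adversarial noise (not the actual
    # Mango generator): learn one universal vector `delta` that, added to all
    # embeddings, raises the loss of a frozen linear classifier.
    rng = np.random.default_rng(0)
    D, C, N = 16, 3, 200                    # embedding dim, classes, samples
    W = rng.normal(size=(D, C))             # frozen "pre-trained" classifier
    X = rng.normal(size=(N, D))             # synthetic input embeddings
    y = rng.integers(0, C, size=N)          # labels

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def loss_and_input_grads(X_pert):
        """Cross-entropy loss and per-sample gradients w.r.t. the inputs."""
        P = softmax(X_pert @ W)
        loss = -np.log(P[np.arange(N), y] + 1e-12).mean()
        G = P.copy()
        G[np.arange(N), y] -= 1.0           # dCE/dlogits, per sample
        return loss, G @ W.T                # chain rule back to the inputs

    delta = np.zeros(D)
    for _ in range(200):                    # gradient *ascent* on the loss
        loss, grads = loss_and_input_grads(X + delta)
        delta += 0.05 * grads.mean(axis=0)  # shared (universal) perturbation
        delta = np.clip(delta, -0.5, 0.5)   # keep the noise bounded
    print("loss with learned noise:", round(float(loss), 3))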
34. Automated system to measure Tandem Gait to assess executive functions in children [PDF] 返回目录
Mohammad Zaki Zadeh, Ashwin Ramesh Babu, Ashish Jaiswal, Maria Kyrarini, Fillia Makedon
Abstract: As mobile technologies have become ubiquitous in recent years, computer-based cognitive tests have become more popular and efficient. In this work, we focus on assessing motor function in children by analyzing their gait movements. Although there has been a lot of research on designing automated assessment systems for gait analysis, most of these efforts use obtrusive wearable sensors for measuring body movements. We have devised a computer vision-based assessment system that only requires a camera, which makes it easier to employ in school or home environments. A dataset has been created with 27 children performing the test. Furthermore, in order to improve the accuracy of the system, a deep learning based model was pre-trained on the NTU-RGB+D 120 dataset and then fine-tuned on our gait dataset. The results, with 76.61% classification accuracy, highlight the efficacy of the proposed work for automating the assessment of children's performance.
35. Enabling Collaborative Video Sensing at the Edge through Convolutional Sharing [PDF] 返回目录
Kasthuri Jayarajah, Dhanuja Wanniarachchige, Archan Misra
Abstract: While Deep Neural Network (DNN) models have provided remarkable advances in machine vision capabilities, their high computational complexity and model sizes present a formidable roadblock to deployment in AIoT-based sensing applications. In this paper, we propose a novel paradigm by which peer nodes in a network can collaborate to improve their accuracy on person detection, an exemplar machine vision task. The proposed methodology requires no re-training of the DNNs and incurs minimal processing latency, as it extracts scene summaries from the collaborators and injects them back into the DNNs of the reference cameras on the fly. Early results show promise, with improvements in recall as high as 10% with a single collaborator on benchmark datasets.
36. Does the dataset meet your expectations? Explaining sample representation in image data [PDF] 返回目录
Dhasarathy Parthasarathy, Anton Johansson
Abstract: Since the behavior of a neural network model is adversely affected by a lack of diversity in training data, we present a method that identifies and explains such deficiencies. When a dataset is labeled, we note that annotations alone are capable of providing a human-interpretable summary of sample diversity. This allows explaining any lack of diversity as the mismatch found when comparing the \textit{actual} distribution of annotations in the dataset with an \textit{expected} distribution of annotations, specified manually to capture essential label diversity. While, in many practical cases, labeling (samples $\rightarrow$ annotations) is expensive, its inverse, simulation (annotations $\rightarrow$ samples), can be cheaper. By mapping the expected distribution of annotations into test samples using parametric simulation, we present a method that explains sample representation using the mismatch in diversity between simulated and collected data. We then apply the method to examine a dataset of geometric shapes and to qualitatively and quantitatively explain sample representation in terms of comprehensible aspects such as size, position, and pixel brightness.
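The actual-versus-expected comparison at the heart of this method can be made concrete in a few lines. Below is a minimal sketch under assumed inputs: a single "size" annotation, a manually specified uniform expected distribution, and KL divergence as the mismatch score (the paper does not prescribe these exact choices).

    import numpy as np

    # Compare the *actual* distribution of an annotation in a dataset with a
    # manually specified *expected* distribution; the mismatch explains which
    # aspect of diversity is under-represented. Categories, counts, and the
    # KL-divergence score are illustrative assumptions.
    categories = ["small", "medium", "large"]
    expected = np.array([1/3, 1/3, 1/3])         # what we want covered
    counts = np.array([700.0, 250.0, 50.0])      # what the labels contain
    actual = counts / counts.sum()

    kl = float(np.sum(actual * np.log(actual / expected)))
    print(f"mismatch, KL(actual || expected) = {kl:.3f}")
    for c, a, e in zip(categories, actual, expected):
        print(f"{c:>6}: actual {a:.2f} vs expected {e:.2f}")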
37. A grid-point detection method based on U-net for a structured light system [PDF] 返回目录
Dieuthuy Pham, Minhtuan Ha, Changyan Xiao
Abstract: Accurate detection of the feature points of the projected pattern plays an extremely important role in one-shot 3D reconstruction systems, especially for those using a grid pattern. To solve this problem, this paper proposes a grid-point detection method based on U-net. A specific dataset is designed that includes images captured with the two-shot imaging method and images acquired with the one-shot imaging method. The images in the first group, labeled as the ground-truth images, together with the images captured at the same pose with the one-shot method, are cut into small patches of 64x64 pixels and then fed to the training set. The remaining images in the second group form the test set. The experimental results show that our method can achieve better detection performance with higher accuracy in comparison with previous methods.
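The dataset-preparation step, cutting labeled images into 64x64 training patches, is straightforward to sketch. The stride, array shapes, and stand-in data below are assumptions for illustration only.

    import numpy as np

    def extract_patches(image, label, size=64, stride=64):
        """Yield aligned (image_patch, label_patch) pairs of size x size."""
        H, W = image.shape[:2]
        for top in range(0, H - size + 1, stride):
            for left in range(0, W - size + 1, stride):
                yield (image[top:top + size, left:left + size],
                       label[top:top + size, left:left + size])

    img = np.random.rand(256, 320)         # stand-in captured pattern image
    gt = (img > 0.99).astype(np.uint8)     # stand-in grid-point annotations
    train_set = list(extract_patches(img, gt))
    print(len(train_set), "patches of shape", train_set[0][0].shape)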
38. Spectral band selection for vegetation properties retrieval using Gaussian processes regression [PDF] 返回目录
Jochem Verrelst, Juan Pablo Rivera, Anatoly Gitelson, Jesus Delegido, José Moreno, Gustau Camps-Valls
Abstract: With current and upcoming imaging spectrometers, automated band analysis techniques are needed to enable efficient identification of the most informative bands, facilitating optimized processing of spectral data into estimates of biophysical variables. This paper introduces an automated spectral band analysis tool (BAT) based on Gaussian processes regression (GPR) for the spectral analysis of vegetation properties. The GPR-BAT procedure sequentially removes, backwards, the least contributing band in the regression model for a given variable until only one band is kept. GPR-BAT is implemented within the framework of the free ARTMO's MLRA (machine learning regression algorithms) toolbox, which is dedicated to transforming optical remote sensing images into biophysical products. GPR-BAT allows one (1) to identify the most informative bands relating spectral data to a biophysical variable, and (2) to find the smallest number of bands that preserve optimized accurate predictions. This study concludes that a wise band selection of hyperspectral data is strictly required for optimal vegetation properties mapping.
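The sequential backward removal loop can be sketched with scikit-learn's Gaussian process regressor on synthetic data. This is only the skeleton of the procedure described above; GPR-BAT itself runs inside the ARTMO MLRA toolbox, and its kernel, scoring, and data handling differ.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    n_samples, n_bands = 120, 8
    X = rng.normal(size=(n_samples, n_bands))        # reflectance per band
    y = 2.0 * X[:, 2] - 1.5 * X[:, 5] + 0.1 * rng.normal(size=n_samples)

    bands = list(range(n_bands))
    while len(bands) > 1:
        scores = []
        for b in bands:                              # try dropping each band
            keep = [k for k in bands if k != b]
            gpr = GaussianProcessRegressor(normalize_y=True)
            scores.append(cross_val_score(gpr, X[:, keep], y, cv=3).mean())
        worst = bands[int(np.argmax(scores))]        # least contributing band
        bands.remove(worst)
        print(f"removed band {worst}; kept {bands}; CV R^2 {max(scores):.3f}")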
39. Sparsity-driven Digital Terrain Model Extraction [PDF] 返回目录
Fatih Nar, Erdal Yilmaz, Gustau Camps-Valls
Abstract: We here introduce an automatic Digital Terrain Model (DTM) extraction method. The proposed sparsity-driven DTM extractor (SD-DTM) takes a high-resolution Digital Surface Model (DSM) as an input and constructs a high-resolution DTM using the variational framework. To obtain an accurate DTM, an iterative approach is proposed for the minimization of the target variational cost function. Accuracy of the SD-DTM is shown in a real-world DSM data set. We show the efficiency and effectiveness of the approach both visually and quantitatively via residual plots in illustrative terrain types.
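A generic illustration of the idea, not the SD-DTM cost or solver, is to iterate a smoothing update that is never allowed to rise above the surface model, so off-terrain structures erode while bare earth is kept. The toy 1D update rule and all constants below are assumptions.

    import numpy as np

    n = 200
    ground = 5 + np.sin(2 * np.pi * np.arange(n) / n)   # smooth bare earth
    dsm = ground.copy()
    dsm[60:80] += 4.0                                   # a building on top

    dtm, tol = dsm.copy(), 1e-3
    for _ in range(3000):
        avg = 0.5 * (np.roll(dtm, 1) + np.roll(dtm, -1))
        dtm = np.minimum(dsm, avg + tol)        # smooth, but stay <= DSM

    print("building height in DSM :", round(float((dsm - ground).max()), 3))
    print("residual height in DTM :",
          round(float((dtm - ground)[60:80].max()), 3))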
40. Post-Hurricane Damage Assessment Using Satellite Imagery and Geolocation Features [PDF] 返回目录
Quoc Dung Cao, Youngjun Choe
Abstract: Gaining timely and reliable situation awareness after hazard events such as a hurricane is crucial to emergency managers and first responders. One effective way to achieve that goal is through damage assessment. Recently, disaster researchers have been utilizing imagery captured through satellites or drones to quantify the number of flooded/damaged buildings. In this paper, we propose a mixed data approach, which leverages publicly available satellite imagery and geolocation features of the affected area to identify damaged buildings after a hurricane. Based on a case study of Hurricane Harvey affecting the Greater Houston area in 2017, the method demonstrated significant improvement over performing a similar task using only imagery features. This result opens the door to a wide range of possibilities for unifying the advancement in computer vision algorithms such as convolutional neural networks with traditional methods in damage assessment, for example, using flood depth or bare-earth topology. In this work, a creative choice of the geolocation features was made to provide extra information to the imagery features, but it is up to the users to decide which other features can be included to model the physical behavior of the events, depending on their domain knowledge and the type of disaster. The dataset curated in this work is made openly available (DOI: 10.17603/ds2-3cca-f398).
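The mixed-data idea, concatenating image-derived features with geolocation features before classification, can be sketched in a few lines. The random stand-in features and the logistic-regression head below are illustrative assumptions; the paper's pipeline uses real imagery and learned CNN features.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(3)
    n = 500
    img_feats = rng.normal(size=(n, 32))     # e.g. CNN embedding of a roof
    geo_feats = rng.normal(size=(n, 4))      # e.g. elevation, dist. to coast
    signal = img_feats[:, 0] + 2.0 * geo_feats[:, 0]
    y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)  # damaged or not

    X = np.hstack([img_feats, geo_feats])    # the "mixed data" design matrix
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("held-out accuracy:", round(clf.score(X_te, y_te), 3))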
41. Pose Error Reduction for Focus Enhancement in Thermal Synthetic Aperture Visualization [PDF] 返回目录
Indrajit Kurmi, David C. Schedl, Oliver Bimber
Abstract: Airborne optical sectioning, an effective aerial synthetic aperture imaging technique for revealing artifacts occluded by forests, requires precise measurements of drone poses. In this article we present a new approach for reducing pose estimation errors beyond the possibilities of conventional Perspective-n-Point solutions by considering the underlying optimization as a focusing problem. We present an efficient image integration technique, which also reduces the parameter search space to achieve realistic processing times, and improves the quality of resulting synthetic integral images.
42. FoggySight: A Scheme for Facial Lookup Privacy [PDF] 返回目录
Ivan Evtimov, Pascal Sturmfels, Tadayoshi Kohno
Abstract: Advances in deep learning algorithms have enabled better-than-human performance on face recognition tasks. In parallel, private companies have been scraping social media and other public websites that tie photos to identities and have built up large databases of labeled face images. Searches in these databases are now being offered as a service to law enforcement and others and carry a multitude of privacy risks for social media users. In this work, we tackle the problem of providing privacy from such face recognition systems. We propose and evaluate FoggySight, a solution that applies lessons learned from the adversarial examples literature to modify facial photos in a privacy-preserving manner before they are uploaded to social media. FoggySight's core feature is a community protection strategy where users acting as protectors of privacy for others upload decoy photos generated by adversarial machine learning algorithms. We explore different settings for this scheme and find that it does enable protection of facial privacy -- including against a facial recognition service with unknown internals.
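The decoy mechanism can be caricatured in feature space: nudge an embedding until a nearest-neighbor lookup matches a different identity. A real decoy would perturb image pixels against a deep face model; the gallery, the signed-step update, and all constants below are toy assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    gallery = rng.normal(size=(5, 64))       # one embedding per identity
    gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

    probe = gallery[0] + 0.05 * rng.normal(size=64)  # photo of identity 0
    target = 3                                       # identity to mimic

    for _ in range(50):
        # crude FGSM-like signed step toward the target identity (a real
        # attack would follow model gradients on pixels, not embeddings)
        probe = probe + 0.01 * np.sign(gallery[target] - gallery[0])

    scores = gallery @ (probe / np.linalg.norm(probe))
    print("lookup matched identity:", int(np.argmax(scores)),
          "(target was", target, ")")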
43. Attentional Local Contrast Networks for Infrared Small Target Detection [PDF] 返回目录
Yimian Dai, Yiquan Wu, Fei Zhou, Kobus Barnard
Abstract: To mitigate the issue of minimal intrinsic features for pure data-driven methods, in this paper, we propose a novel model-driven deep network for infrared small target detection, which combines discriminative networks and conventional model-driven methods to make use of both labeled data and the domain knowledge. By designing a feature map cyclic shift scheme, we modularize a conventional local contrast measure method as a depth-wise parameterless nonlinear feature refinement layer in an end-to-end network, which encodes relatively long-range contextual interactions with clear physical interpretability. To highlight and preserve the small target features, we also exploit a bottom-up attentional modulation integrating the smaller scale subtle details of low-level features into high-level features of deeper layers. We conduct detailed ablation studies with varying network depths to empirically verify the effectiveness and efficiency of the design of each component in our network architecture. We also compare the performance of our network against other model-driven methods and deep networks on the open SIRST dataset as well. The results suggest that our network yields a performance boost over its competitors. Our code, trained models, and results are available online.
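One way to picture a cyclic-shift, parameterless local contrast computation is sketched below: roll the feature map in the four cardinal directions with wrap-around and keep, per pixel, the smallest difference to the shifted copies, so a small bright target must exceed all of its neighbors. This is an illustrative stand-in, not the paper's exact layer.

    import numpy as np

    def cyclic_shift_contrast(feat, d=2):
        """Min over directional differences to cyclically shifted copies."""
        shifts = [np.roll(feat, d, axis=0), np.roll(feat, -d, axis=0),
                  np.roll(feat, d, axis=1), np.roll(feat, -d, axis=1)]
        diffs = [feat - s for s in shifts]
        return np.maximum(np.minimum.reduce(diffs), 0.0)  # clip at zero

    img = 0.1 * np.random.default_rng(5).random((32, 32))  # clutter
    img[16, 16] += 1.0                                     # tiny "target"
    out = cyclic_shift_contrast(img)
    print("peak response at:", np.unravel_index(out.argmax(), out.shape))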
44. Equalization Loss v2: A New Gradient Balance Approach for Long-tailed Object Detection [PDF] 返回目录
Jingru Tan, Xin Lu, Gang Zhang, Changqing Yin, Quanquan Li
Abstract: Recently proposed decoupled training methods emerge as a dominant paradigm for long-tailed object detection. But they require an extra fine-tuning stage, and the disjointed optimization of representation and classifier might lead to suboptimal results. However, end-to-end training methods, like equalization loss (EQL), still perform worse than decoupled training methods. In this paper, we reveal that the main issue in long-tailed object detection is the imbalanced gradients between positives and negatives, and find that EQL does not solve it well. To address the problem of imbalanced gradients, we introduce a new version of equalization loss, called equalization loss v2 (EQL v2), a novel gradient-guided reweighting mechanism that re-balances the training process for each category independently and equally. Extensive experiments are performed on the challenging LVIS benchmark. EQL v2 outperforms the original EQL by about 4 points of overall AP, with 14-18 point improvements on the rare categories. More importantly, it also surpasses decoupled training methods. Without further tuning on the Open Images dataset, EQL v2 improves EQL by 6.3 points AP, showing strong generalization ability. Code will be released at this https URL
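The gradient-guided reweighting can be sketched as follows, with all constants and the sigmoid mapping being assumptions in the spirit of (not identical to) EQL v2: accumulate positive and negative gradient magnitudes per category, then derive weights that boost positives and damp negatives for classes whose positive-to-negative gradient ratio is small, i.e. the rare ones.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(6)
    n_classes = 5
    freq = np.array([0.5, 0.3, 0.1, 0.07, 0.03])   # long-tailed frequencies
    pos_grad = np.zeros(n_classes)
    neg_grad = np.zeros(n_classes)

    for _ in range(100):                    # simulate accumulated batch stats
        g = np.abs(rng.normal(size=n_classes))
        pos_grad += g * freq                # rare classes see few positives
        neg_grad += g

    ratio = pos_grad / (neg_grad + 1e-12)
    pos_w = 1.0 + 4.0 * (1.0 - sigmoid(10.0 * ratio))  # boost rare positives
    neg_w = sigmoid(10.0 * ratio)                      # damp rare negatives
    for c in range(n_classes):
        print(f"class {c}: ratio {ratio[c]:.2f}  w+ {pos_w[c]:.2f}  "
              f"w- {neg_w[c]:.2f}")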
45. End-to-end Generative Floor-plan and Layout with Attributes and Relation Graph [PDF] 返回目录
Xinhan Di, Pengqian Yu, Danfeng Yang, Hong Zhu, Changyu Sun, YinDong Liu
Abstract: In this paper, we propose an end-to-end model for producing furniture layouts for interior scene synthesis from a random vector. The proposed model aims to help professional interior designers produce interior decoration solutions more quickly. It combines a conditional floor-plan module of the room, a conditional graphical floor-plan module of the room, and a conditional layout module. Compared with prior work on scene synthesis, our three proposed modules enhance the ability of auto-layout generation given the dimensional category of the room. We conduct our experiments on the proposed real-world interior layout dataset, which contains 191,208 designs from professional designers. Our numerical results demonstrate that the proposed model yields higher-quality layouts in comparison with the state-of-the-art model. The dataset and code are released at this https URL (Dataset, Code).
46. Exploration of Whether Skylight Polarization Patterns Contain Three-dimensional Attitude Information [PDF] 返回目录
Huaju Liang, Hongyang Bai, Tong Zhou
Abstract: Our previous work has demonstrated that the Rayleigh model, which is widely used in polarized skylight navigation to describe skylight polarization patterns, does not contain three-dimensional (3D) attitude information [1]. However, it is still necessary to further explore whether skylight polarization patterns contain 3D attitude information. So, in this paper, a social spider optimization (SSO) method is proposed to estimate the three Euler angles; it considers the difference of each pixel among polarization images based on template matching (TM) to make full use of the captured polarization information. In addition, to explore this problem, we use not only angle of polarization (AOP) and degree of polarization (DOP) information, but also light intensity (LI) information. So, a sky model is established that combines the Berry model and the Hosek model to fully describe AOP, DOP, and LI information in the sky, and considers the influence of the four neutral points, ground albedo, atmospheric turbidity, and wavelength. The simulation results show that the SSO algorithm can estimate the 3D attitude and that the established sky model contains 3D attitude information. However, when there is measurement noise or model error, the accuracy of 3D attitude estimation drops significantly. Especially in the field experiment, it is very difficult to estimate the 3D attitude. Finally, the results are discussed in detail.
47. Personal Mental Health Navigator: Harnessing the Power of Data, Personal Models, and Health Cybernetics to Promote Psychological Well-being [PDF] 返回目录
Amir M. Rahmani, Jocelyn Lai, Salar Jafarlou, Asal Yunusova, Alex. P. Rivera, Sina Labbaf, Sirui Hu, Arman Anzanpour, Nikil Dutt, Ramesh Jain, Jessica L. Borelli
Abstract: Traditionally, the regime of mental healthcare has followed an episodic psychotherapy model wherein patients seek care from a provider through a prescribed treatment plan developed over multiple provider visits. Recent advances in wearable and mobile technology have generated increased interest in digital mental healthcare that enables individuals to address episodic mental health symptoms. However, these efforts are typically reactive and symptom-focused and do not provide comprehensive, wrap-around, customized treatments that capture an individual's holistic mental health model as it unfolds over time. Recognizing that each individual is unique, we present the notion of Personalized Mental Health Navigation (MHN): a therapist-in-the-loop, cybernetic goal-based system that deploys a continuous cyclic loop of measurement, estimation, guidance, to steer the individual's mental health state towards a healthy zone. We outline the major components of MHN that is premised on the development of an individual's personal mental health state, holistically represented by a high-dimensional cover of multiple knowledge layers such as emotion, biological patterns, sociology, behavior, and cognition. We demonstrate the feasibility of the personalized MHN approach via a 12-month pilot case study for holistic stress management in college students and highlight an instance of a therapist-in-the-loop intervention using MHN for monitoring, estimating, and proactively addressing moderately severe depression over a sustained period of time. We believe MHN paves the way to transform mental healthcare from the current passive, episodic, reactive process (where individuals seek help to address symptoms that have already manifested) to a continuous and navigational paradigm that leverages a personalized model of the individual, promising to deliver timely interventions to individuals in a holistic manner.
48. TEMImageNet and AtomSegNet Deep Learning Training Library and Models for High-Precision Atom Segmentation, Localization, Denoising, and Super-resolution Processing of Atom-Resolution Scanning TEM Images [PDF] 返回目录
Ruoqian Lin, Rui Zhang, Chunyang Wang, Xiao-Qing Yang, Huolin L. Xin
Abstract: Atom segmentation and localization, noise reduction, and super-resolution processing of atomic-resolution scanning transmission electron microscopy (STEM) images with high precision and robustness is a challenging task. Although several conventional algorithms, such as thresholding, edge detection, and clustering, can achieve reasonable performance in some predefined sceneries, they tend to fail when interferences from the background are strong and unpredictable. Particularly, for atomic-resolution STEM images, so far there is no well-established algorithm that is robust enough to segment or detect all atomic columns when there is large thickness variation in a recorded image. Herein, we report the development of a training library and a deep learning method that can perform robust and precise atom segmentation, localization, denoising, and super-resolution processing of experimental images. Despite using simulated images as training datasets, the deep-learning model can self-adapt to experimental STEM images and shows outstanding performance in atom detection and localization in challenging contrast conditions, and its precision is consistently better than the state-of-the-art two-dimensional Gaussian fit method. Taking a step further, we have deployed our deep-learning models to a desktop app with a graphical user interface, and the app is free and open-source. We have also built a TEM ImageNet project website for easy browsing and downloading of the training data.
49. Evaluation of deep learning-based myocardial infarction quantification using Segment CMR software [PDF] 返回目录
Olivier Rukundo
Abstract: In this paper, the author evaluates the preliminary work related to automating the quantification of the size of the myocardial infarction (MI) using deep learning in Segment cardiovascular magnetic resonance (CMR) software. Here, deep learning is used to automate the segmentation of myocardial boundaries before triggering the automatic quantification of the size of the MI using the expectation-maximization, weighted intensity, a priori information (EWA) algorithm incorporated in the Segment CMR software. Experimental evaluation of the size of the MI shows that more than 50 % (average infarct scar volume), 75% (average infarct scar percentage), and 65 % (average microvascular obstruction percentage) of the network-based results are approximately very close to the expert delineation-based results. Also, in an experiment involving the visualization of myocardial and infarct contours, in all images of the selected stack, the network and expert-based results tie in terms of the number of infarcted and contoured images.
50. PGMAN: An Unsupervised Generative Multi-adversarial Network for Pan-sharpening [PDF] 返回目录
Huanyu Zhou, Qingjie Liu, Yunhong Wang
Abstract: Pan-sharpening aims at fusing a low-resolution (LR) multi-spectral (MS) image and a high-resolution (HR) panchromatic (PAN) image acquired by a satellite to generate an HR MS image. Many deep learning based methods have been developed in the past few years. However, since there are no intended HR MS images as references for learning, almost all of the existing methods down-sample the MS and PAN images and regard the original MS images as targets to form a supervised setting for training. These methods may perform well on the down-scaled images, however, they generalize poorly to the full-resolution images. To conquer this problem, we design an unsupervised framework that is able to learn directly from the full-resolution images without any preprocessing. The model is built based on a novel generative multi-adversarial network. We use a two-stream generator to extract the modality-specific features from the PAN and MS images, respectively, and develop a dual-discriminator to preserve the spectral and spatial information of the inputs when performing fusion. Furthermore, a novel loss function is introduced to facilitate training under the unsupervised setting. Experiments and comparisons with other state-of-the-art methods on GaoFen-2 and QuickBird images demonstrate that the proposed method can obtain much better fusion results on the full-resolution images.
摘要:全景锐化旨在融合由卫星获取的低分辨率(LR)多光谱(MS)图像和高分辨率(HR)全色(PAN)图像,以生成HR MS图像。在过去的几年中,已经开发了许多基于深度学习的方法。但是,由于没有预期的HR MS图像作为学习参考,因此几乎所有现有方法都对MS和PAN图像进行降采样,并将原始MS图像视为目标,以形成用于训练的监督设置。这些方法在按比例缩小的图像上可能效果很好,但是,它们对一般分辨率的图像的推广效果很差。为了解决这个问题,我们设计了一个无监督的框架,该框架无需任何预处理即可直接从全分辨率图像中学习。该模型基于新颖的生成式多对抗网络构建。我们使用两流发生器分别从PAN和MS图像中提取特定于模态的特征,并开发出双重鉴别器以在执行融合时保留输入的光谱和空间信息。此外,引入了新的损失函数以促进在无人监督的情况下进行训练。实验和与其他最新技术对GaoFen-2和QuickBird图像的比较表明,该方法可以在全分辨率图像上获得更好的融合效果。
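To make the architecture concrete, here is a minimal PyTorch sketch of the two-stream generator / dual-discriminator layout described above; all layer widths, depths, and the discriminator wiring are illustrative assumptions, not the authors' implementation.

```python
# A minimal, illustrative PyTorch sketch of the two-stream generator /
# dual-discriminator idea. Layer sizes and loss wiring are assumptions.
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    """Extracts modality-specific features from PAN and MS, then fuses them."""
    def __init__(self, ms_bands: int = 4):
        super().__init__()
        self.pan_stream = nn.Sequential(  # spatial detail from 1-band PAN
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.ms_stream = nn.Sequential(   # spectral content from upsampled MS
            nn.Conv2d(ms_bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.fuse = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, ms_bands, 3, padding=1))

    def forward(self, pan, ms_up):
        feats = torch.cat([self.pan_stream(pan), self.ms_stream(ms_up)], dim=1)
        return self.fuse(feats)  # HR MS estimate

def patch_discriminator(in_ch: int) -> nn.Module:
    """One of the two discriminators: spectral D sees MS bands, spatial D sees PAN."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 3, padding=1))  # patch-wise real/fake scores

G = TwoStreamGenerator()
D_spectral = patch_discriminator(4)   # judges spectral fidelity vs. input MS
D_spatial = patch_discriminator(1)    # judges spatial fidelity vs. input PAN
pan = torch.randn(2, 1, 64, 64)
ms_up = torch.randn(2, 4, 64, 64)     # MS bicubically upsampled to PAN size
fused = G(pan, ms_up)
print(fused.shape, D_spectral(fused).shape)
```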
51. A Differential Model of the Complex Cell [PDF] 返回目录
Miles Hansard, Radu Horaud
Abstract: The receptive fields of simple cells in the visual cortex can be understood as linear filters. These filters can be modelled by Gabor functions, or by Gaussian derivatives. Gabor functions can also be combined in an 'energy model' of the complex cell response. This paper proposes an alternative model of the complex cell, based on Gaussian derivatives. It is most important to account for the insensitivity of the complex response to small shifts of the image. The new model uses a linear combination of the first few derivative filters, at a single position, to approximate the first derivative filter, at a series of adjacent positions. The maximum response, over all positions, gives a signal that is insensitive to small shifts of the image. This model, unlike previous approaches, is based on the scale space theory of visual processing. In particular, the complex cell is built from filters that respond to the two-dimensional differential structure of the image. The computational aspects of the new model are studied in one and two dimensions, using the steerability of the Gaussian derivatives. The response of the model to basic images, such as edges and gratings, is derived formally. The response to natural images is also evaluated, using statistical measures of shift insensitivity. The relevance of the new model to the cortical image representation is discussed.
摘要:视觉皮层中简单细胞的感受野可以理解为线性滤波器。这些滤波器可以用Gabor函数或高斯导数建模。Gabor函数还可以组合成复杂细胞响应的"能量模型"。本文提出了一种基于高斯导数的复杂细胞替代模型。其中最重要的是解释复杂细胞响应对图像微小偏移的不敏感性。新模型在单个位置使用前几阶导数滤波器的线性组合,来近似一系列相邻位置上的一阶导数滤波器。在所有位置上取最大响应,可得到对图像微小偏移不敏感的信号。与以前的方法不同,该模型基于视觉处理的尺度空间理论。特别地,复杂细胞由响应图像二维微分结构的滤波器构成。利用高斯导数的可操纵性(steerability),在一维和二维情形下研究了新模型的计算方面。形式化地推导了模型对边缘和光栅等基本图像的响应,并使用偏移不敏感性的统计度量评估了其对自然图像的响应。最后讨论了新模型与皮层图像表示的相关性。
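One way to make the shift-approximation step explicit (our paraphrase, not necessarily the paper's exact derivation) is as a truncated Taylor expansion of the shifted first-derivative filter in terms of higher-order derivatives at a single position:

```latex
% Shifted first-derivative filter via a truncated Taylor expansion about x:
G^{(1)}(x - a) \;=\; \sum_{n=0}^{N} \frac{(-a)^n}{n!}\, G^{(n+1)}(x) \;+\; O\!\left(a^{N+1}\right)
```

Read this way, a fixed linear combination of the first N+1 derivative filters at one position approximates the first-derivative filter at the nearby position x - a; taking the maximum response over a range of offsets a then yields the shift-insensitive signal described in the abstract.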
52. Learning-Based Algorithms for Vessel Tracking: A Review [PDF] 返回目录
Dengqiang Jia, Xiahai Zhuang
Abstract: Developing efficient vessel-tracking algorithms is crucial for imaging-based diagnosis and treatment of vascular diseases. Vessel tracking aims to solve recognition problems such as key (seed) point detection, centerline extraction, and vascular segmentation. Extensive image-processing techniques have been developed to overcome the problems of vessel tracking that are mainly attributed to the complex morphologies of vessels and image characteristics of angiography. This paper presents a literature review on vessel-tracking methods, focusing on machine-learning-based methods. First, the conventional machine-learning-based algorithms are reviewed, and then, a general survey of deep-learning-based frameworks is provided. On the basis of the reviewed methods, the evaluation issues are introduced. The paper is concluded with discussions about the remaining exigencies and future research.
摘要:开发有效的血管跟踪算法对于基于影像的血管疾病诊断和治疗至关重要。血管跟踪旨在解决关键(种子)点检测、中心线提取和血管分割等识别问题。为克服主要源于血管复杂形态和血管造影图像特性的血管跟踪难题,人们已经开发了大量图像处理技术。本文对血管跟踪方法进行了文献综述,重点是基于机器学习的方法:首先回顾传统的基于机器学习的算法,然后对基于深度学习的框架进行总体概述。在所综述方法的基础上,介绍了评估方面的问题。最后讨论了尚待解决的紧迫问题和未来的研究方向。
53. Secret Key Agreement with Physical Unclonable Functions: An Optimality Summary [PDF] 返回目录
Onur Günlü, Rafael F. Schaefer
Abstract: We address security and privacy problems for digital devices and biometrics from an information-theoretic optimality perspective, where a secret key is generated for authentication, identification, message encryption/decryption, or secure computations. A physical unclonable function (PUF) is a promising solution for local security in digital devices and this review gives the most relevant summary for information theorists, coding theorists, and signal processing community members who are interested in optimal PUF constructions. Low-complexity signal processing methods such as transform coding that are developed to make the information-theoretic analysis tractable are discussed. The optimal trade-offs between the secret-key, privacy-leakage, and storage rates for multiple PUF measurements are given. Proposed optimal code constructions that jointly design the vector quantizer and error-correction code parameters are listed. These constructions include modern and algebraic codes such as polar codes and convolutional codes, both of which can achieve small block-error probabilities at short block lengths, corresponding to a small number of PUF circuits. Open problems in the PUF literature from a signal processing, information theory, coding theory, and hardware complexity perspectives and their combinations are listed to stimulate further advancements in the research on local privacy and security.
摘要:我们从信息论最优性的角度研究数字设备和生物特征识别的安全与隐私问题,即生成用于身份验证、识别、消息加密/解密或安全计算的密钥。物理不可克隆函数(PUF)是实现数字设备本地安全性的一种有前途的方案,本综述为对最优PUF构造感兴趣的信息论、编码理论和信号处理领域的研究者提供了最相关的总结。文中讨论了为使信息论分析易于处理而开发的低复杂度信号处理方法(例如变换编码),给出了多次PUF测量下密钥速率、隐私泄漏速率和存储速率之间的最优折衷,并列出了联合设计矢量量化器和纠错码参数的最优编码构造。这些构造包括极化码和卷积码等现代码与代数码,二者都能在较短码长下实现较小的块错误概率,对应于较少的PUF电路。最后从信号处理、信息论、编码理论和硬件复杂度及其组合的角度列出了PUF文献中的开放问题,以推动本地隐私与安全研究的进一步发展。
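As a toy illustration of the key-agreement setting, here is a fuzzy-commitment sketch using a repetition code in place of the polar/convolutional codes the paper surveys; all parameters are illustrative.

```python
# Toy fuzzy-commitment sketch for PUF-based key agreement. A repetition code
# stands in for the optimal codes discussed above; parameters are illustrative.
import secrets

R = 5  # repetition factor; majority vote corrects up to 2 flips per key bit

def encode(key_bits):   # repetition-code encoder
    return [b for b in key_bits for _ in range(R)]

def decode(noisy):      # majority-vote decoder
    return [int(sum(noisy[i*R:(i+1)*R]) > R // 2) for i in range(len(noisy)//R)]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

# Enrollment: measure PUF response, publish helper data = codeword XOR response.
key = [secrets.randbelow(2) for _ in range(16)]
puf_enroll = [secrets.randbelow(2) for _ in range(16 * R)]
helper = xor(encode(key), puf_enroll)     # public; should leak little about key

# Reconstruction: re-measure the PUF (noisy), recover the key via decoding.
puf_noisy = puf_enroll.copy()
puf_noisy[3] ^= 1; puf_noisy[40] ^= 1     # two measurement errors
recovered = decode(xor(helper, puf_noisy))
assert recovered == key
print("key recovered:", recovered == key)
```

The optimal constructions in the paper replace the repetition code with codes that trade off secret-key rate, privacy leakage, and storage far more efficiently.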
54. Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces [PDF] 返回目录
Bert Moons, Parham Noorzad, Andrii Skliar, Giovanni Mariani, Dushyant Mehta, Chris Lott, Tijmen Blankevoort
Abstract: This work presents DONNA (Distilling Optimal Neural Network Architectures), a novel pipeline for rapid neural architecture search and search space exploration, targeting multiple different hardware platforms and user scenarios. In DONNA, a search consists of three phases. First, an accuracy predictor is built for a diverse search space using blockwise knowledge distillation. This predictor enables searching across diverse macro-architectural network parameters such as layer types, attention mechanisms, and channel widths, as well as across micro-architectural parameters such as block repeats, kernel sizes, and expansion rates. Second, a rapid evolutionary search phase finds a Pareto-optimal set of architectures in terms of accuracy and latency for any scenario using the predictor and on-device measurements. Third, Pareto-optimal models can be quickly finetuned to full accuracy. With this approach, DONNA finds architectures that outperform the state of the art. In ImageNet classification, architectures found by DONNA are 20% faster than EfficientNet-B0 and MobileNetV2 on a Nvidia V100 GPU at similar accuracy and 10% faster with 0.5% higher accuracy than MobileNetV2-1.4x on a Samsung S20 smartphone. In addition to neural architecture search, DONNA is used for search-space exploration and hardware-aware model compression.
摘要:这项工作提出了DONNA(蒸馏最优神经网络体系结构),一种用于快速神经体系结构搜索和搜索空间探索的新颖流程,面向多种不同的硬件平台和用户场景。在DONNA中,一次搜索包括三个阶段。首先,使用逐块知识蒸馏为多样化的搜索空间构建精度预测器。该预测器支持在层类型、注意力机制和通道宽度等宏观体系结构参数,以及块重复次数、卷积核大小和扩展率等微观体系结构参数上进行搜索。其次,快速进化搜索阶段利用该预测器和设备端实测数据,针对任意场景找到在精度和延迟上帕累托最优的一组架构。第三,帕累托最优模型可以快速微调至完全精度。通过这种方法,DONNA找到了超越现有最佳水平的架构。在ImageNet分类中,DONNA找到的架构在Nvidia V100 GPU上以相近精度比EfficientNet-B0和MobileNetV2快20%,在Samsung S20智能手机上比MobileNetV2-1.4x快10%且精度高0.5%。除神经体系结构搜索外,DONNA还可用于搜索空间探索和硬件感知的模型压缩。
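A minimal sketch of the first phase's idea, assuming the accuracy predictor is a simple regressor over per-block distillation-quality features (the feature definition and model choice here are ours, not necessarily DONNA's):

```python
# Sketch: fit an accuracy predictor from per-block distillation quality.
# Each candidate architecture is summarized by the blockwise-KD quality of
# its N blocks; a simple regressor maps these features to accuracy.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_archs, n_blocks = 30, 5
# Stand-in data: per-block losses from blockwise knowledge distillation.
block_quality = rng.uniform(0.0, 1.0, size=(n_archs, n_blocks))
# Stand-in targets: finetuned accuracies of the 30 sampled architectures.
accuracy = (0.8 - 0.1 * block_quality.sum(axis=1) / n_blocks
            + rng.normal(0, 0.005, n_archs))

predictor = Ridge(alpha=1.0).fit(block_quality, accuracy)
candidate = rng.uniform(0.0, 1.0, size=(1, n_blocks))
print("predicted accuracy:", float(predictor.predict(candidate)[0]))
# The evolutionary search phase can then rank thousands of candidates using
# this predictor plus measured on-device latency, without training them.
```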
55. Learning to Run with Potential-Based Reward Shaping and Demonstrations from Video Data [PDF] 返回目录
Aleksandra Malysheva, Daniel Kudenko, Aleksei Shpilman
Abstract: Learning to produce efficient movement behaviour for humanoid robots from scratch is a hard problem, as has been illustrated by the "Learning to run" competition at NIPS 2017. The goal of this competition was to train a two-legged model of a humanoid body to run in a simulated race course with maximum speed. All submissions took a tabula rasa approach to reinforcement learning (RL) and were able to produce relatively fast, but not optimal running behaviour. In this paper, we demonstrate how data from videos of human running (e.g. taken from YouTube) can be used to shape the reward of the humanoid learning agent to speed up the learning and produce a better result. Specifically, we are using the positions of key body parts at regular time intervals to define a potential function for potential-based reward shaping (PBRS). Since PBRS does not change the optimal policy, this approach allows the RL agent to overcome sub-optimalities in the human movements that are shown in the videos. We present experiments in which we combine selected techniques from the top ten approaches from the NIPS competition with further optimizations to create a high-performing agent as a baseline. We then demonstrate how video-based reward shaping improves the performance further, resulting in an RL agent that runs twice as fast as the baseline in 12 hours of training. We furthermore show that our approach can overcome sub-optimal running behaviour in videos, with the learned policy significantly outperforming that of the running agent from the video.
摘要:让类人机器人从零开始学会产生高效的运动行为是一个难题,NIPS 2017上的"学习奔跑(Learning to run)"竞赛已经说明了这一点。该竞赛的目标是训练一个人形身体的双腿模型,使其在模拟赛道上以最大速度奔跑。所有参赛方案都采用了从零开始(tabula rasa)的强化学习(RL)方法,能够产生相对较快但并非最优的奔跑行为。在本文中,我们展示了如何利用人类跑步视频(例如取自YouTube)中的数据来塑造类人学习智能体的奖励,从而加快学习并获得更好的结果。具体来说,我们使用关键身体部位在固定时间间隔上的位置来定义基于势的奖励塑形(PBRS)的势函数。由于PBRS不会改变最优策略,该方法使RL智能体能够克服视频中所示人类动作的次优性。我们的实验将NIPS竞赛前十名方案中的精选技术与进一步的优化相结合,构建了一个高性能智能体作为基线。随后我们展示了基于视频的奖励塑形如何进一步提升性能,使RL智能体经过12小时训练后的奔跑速度达到基线的两倍。此外,我们还表明该方法可以克服视频中的次优奔跑行为,学到的策略显著优于视频中的奔跑者。
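The shaping term itself is compact; a sketch assuming the potential Φ scores how closely the agent's key body-part positions match the reference video pose (the potential definition here is illustrative):

```python
# Potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
# Adding F to the environment reward leaves the optimal policy unchanged
# (Ng et al., 1999), which is why the agent can exceed the video's runner.
import numpy as np

GAMMA = 0.99

def phi(body_parts: np.ndarray, video_pose: np.ndarray) -> float:
    """Illustrative potential: negative distance to the reference video pose
    at the matching time step (body_parts: array of key-joint coordinates)."""
    return -float(np.linalg.norm(body_parts - video_pose))

def shaped_reward(r_env, s_parts, s_next_parts, pose_t, pose_t1):
    f = GAMMA * phi(s_next_parts, pose_t1) - phi(s_parts, pose_t)
    return r_env + f

# Example step: the agent moved closer to the video pose, so F > 0.
pose_t, pose_t1 = np.array([0.0, 1.0]), np.array([0.1, 1.0])
s, s_next = np.array([0.5, 0.8]), np.array([0.15, 0.97])
print(shaped_reward(1.0, s, s_next, pose_t, pose_t1))
```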
56. Cross-Cohort Generalizability of Deep and Conventional Machine Learning for MRI-based Diagnosis and Prediction of Alzheimer's Disease [PDF] 返回目录
Esther E. Bron, Stefan Klein, Janne M. Papma, Lize C. Jiskoot, Vikram Venkatraghavan, Jara Linders, Pauline Aalten, Peter Paul De Deyn, Geert Jan Biessels, Jurgen A.H.R. Claassen, Huub A.M. Middelkoop, Marion Smits, Wiro J. Niessen, John C. van Swieten, Wiesje M. van der Flier, Inez H.G.B. Ramakers, Aad van der Lugt
Abstract: This work validates the generalizability of MRI-based classification of Alzheimer's disease (AD) patients and controls (CN) to an external data set and to the task of prediction of conversion to AD in individuals with mild cognitive impairment (MCI). We used a conventional support vector machine (SVM) and a deep convolutional neural network (CNN) approach based on structural MRI scans that underwent either minimal pre-processing or more extensive pre-processing into modulated gray matter (GM) maps. Classifiers were optimized and evaluated using cross-validation in the ADNI (334 AD, 520 CN). Trained classifiers were subsequently applied to predict conversion to AD in ADNI MCI patients (231 converters, 628 non-converters) and in the independent Health-RI Parelsnoer data set. From this multi-center study representing a tertiary memory clinic population, we included 199 AD patients, 139 participants with subjective cognitive decline, 48 MCI patients converting to dementia, and 91 MCI patients who did not convert to dementia. AD-CN classification based on modulated GM maps resulted in a similar AUC for SVM (0.940) and CNN (0.933). Application to conversion prediction in MCI yielded significantly higher performance for SVM (0.756) than for CNN (0.742). In external validation, performance was slightly decreased. For AD-CN, it again gave similar AUCs for SVM (0.896) and CNN (0.876). For prediction in MCI, performances decreased for both SVM (0.665) and CNN (0.702). Both with SVM and CNN, classification based on modulated GM maps significantly outperformed classification based on minimally processed images. Deep and conventional classifiers performed equally well for AD classification and their performance decreased only slightly when applied to the external cohort. We expect that this work on external validation contributes towards translation of machine learning to clinical practice.
摘要:这项工作在外部数据集以及轻度认知障碍(MCI)个体向AD转化的预测任务上,验证了基于MRI的阿尔茨海默病(AD)患者与对照(CN)分类的可推广性。我们使用了传统支持向量机(SVM)和深度卷积神经网络(CNN)方法,其输入为仅经最少预处理的结构MRI扫描,或经更充分预处理得到的调制灰质(GM)图。分类器在ADNI(334 AD,520 CN)中通过交叉验证进行优化和评估。随后将训练好的分类器用于预测ADNI MCI患者(231名转化者,628名非转化者)以及独立的Health-RI Parelsnoer数据集中向AD的转化。这项代表三级记忆门诊人群的多中心研究纳入了199名AD患者、139名主观认知下降的参与者、48名转化为痴呆的MCI患者和91名未转化为痴呆的MCI患者。基于调制GM图的AD-CN分类对SVM(0.940)和CNN(0.933)得到了相近的AUC。应用于MCI转化预测时,SVM的性能(0.756)显著高于CNN(0.742)。在外部验证中性能略有下降:AD-CN任务上SVM(0.896)和CNN(0.876)的AUC依然相近;MCI预测任务上SVM(0.665)和CNN(0.702)的性能均有下降。无论SVM还是CNN,基于调制GM图的分类都显著优于基于最少处理图像的分类。深度与传统分类器在AD分类上表现相当,且应用于外部队列时性能仅略有下降。我们期望这项外部验证工作有助于将机器学习转化为临床实践。
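The conventional branch follows a standard cross-validated pipeline; a schematic sketch with synthetic stand-in features (real inputs would be flattened modulated GM maps):

```python
# Schematic sketch of the conventional-ML branch: linear SVM on modulated
# gray-matter features with cross-validated AUC. Data here is synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_ad, n_cn, n_vox = 60, 80, 500          # stand-ins for 334 AD / 520 CN
X = np.vstack([rng.normal(0.2, 1, (n_ad, n_vox)),   # AD GM features
               rng.normal(0.0, 1, (n_cn, n_vox))])  # CN GM features
y = np.array([1] * n_ad + [0] * n_cn)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.3f} +/- {auc.std():.3f}")
# The trained classifier is then applied unchanged to an external cohort
# (here: Health-RI Parelsnoer) to probe cross-cohort generalizability.
```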
57. Learning-Based Quality Assessment for Image Super-Resolution [PDF] 返回目录
Tiesong Zhao, Yuting Lin, Yiwen Xu, Weiling Chen, Zhou Wang
Abstract: Image Super-Resolution (SR) techniques improve visual quality by enhancing the spatial resolution of images. Quality evaluation metrics play a critical role in comparing and optimizing SR algorithms, but current metrics achieve only limited success, largely due to the lack of large-scale quality databases, which are essential for learning accurate and robust SR quality metrics. In this work, we first build a large-scale SR image database using a novel semi-automatic labeling approach, which allows us to label a large number of images with manageable human workload. The resulting SR Image quality database with Semi-Automatic Ratings (SISAR), so far the largest of SR-IQA database, contains 8,400 images of 100 natural scenes. We train an end-to-end Deep Image SR Quality (DISQ) model by employing two-stream Deep Neural Networks (DNNs) for feature extraction, followed by a feature fusion network for quality prediction. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics and achieves promising generalization performance in cross-database tests. The SISAR database and DISQ model will be made publicly available to facilitate reproducible research.
摘要:图像超分辨率(SR)技术通过增强图像的空间分辨率来提高视觉质量。质量评估指标在比较和优化SR算法中起着关键作用,但当前指标仅取得有限的成功,这在很大程度上是由于缺乏大规模质量数据库,而后者对学习准确而鲁棒的SR质量指标至关重要。在这项工作中,我们首先使用一种新颖的半自动标注方法构建了大规模SR图像数据库,使我们能够以可控的人工工作量标注大量图像。由此得到的半自动评分SR图像质量数据库(SISAR)是迄今最大的SR-IQA数据库,包含100个自然场景的8400张图像。我们训练了一个端到端的深度图像SR质量(DISQ)模型,先用双流深度神经网络(DNN)提取特征,再用特征融合网络进行质量预测。实验结果表明,所提方法优于最新指标,并在跨数据库测试中取得了良好的泛化性能。SISAR数据库和DISQ模型将公开提供,以促进可复现的研究。
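A minimal sketch of the two-stream-plus-fusion design; what each stream consumes (here the SR image and a second channel such as a gradient map) and all layer sizes are assumptions for illustration:

```python
# Sketch of a two-stream DNN + feature-fusion quality regressor. The stream
# inputs and all layer sizes are assumptions, not the DISQ architecture.
import torch
import torch.nn as nn

def stream() -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten())     # -> (B, 32) features

class QualityNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stream_img, self.stream_grad = stream(), stream()
        self.fusion = nn.Sequential(               # fuse, then regress a score
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, img, grad):
        f = torch.cat([self.stream_img(img), self.stream_grad(grad)], dim=1)
        return self.fusion(f).squeeze(1)            # predicted quality score

net = QualityNet()
img = torch.randn(8, 1, 96, 96)
grad = torch.randn(8, 1, 96, 96)
print(net(img, grad).shape)   # torch.Size([8]); train with MSE against ratings
```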
58. Responsible Disclosure of Generative Models Using Scalable Fingerprinting [PDF] 返回目录
Ning Yu, Vladislav Skripniuk, Dingfan Chen, Larry Davis, Mario Fritz
Abstract: Over the past five years, deep generative models have achieved a qualitative new level of performance. Generated data has become difficult, if not impossible, to be distinguished from real data. While there are plenty of use cases that benefit from this technology, there are also strong concerns on how this new technology can be misused to spoof sensors, generate deep fakes, and enable misinformation at scale. Unfortunately, current deep fake detection methods are not sustainable, as the gap between real and fake continues to close. In contrast, our work enables a responsible disclosure of such state-of-the-art generative models, that allows researchers and companies to fingerprint their models, so that the generated samples containing a fingerprint can be accurately detected and attributed to a source. Our technique achieves this by an efficient and scalable ad-hoc generation of a large population of models with distinct fingerprints. Our recommended operation point uses a 128-bit fingerprint which in principle results in more than $10^{36}$ identifiable models. Experimental results show that our method fulfills key properties of a fingerprinting mechanism and achieves effectiveness in deep fake detection and attribution.
摘要:在过去五年中,深度生成模型的性能达到了质的新高度,生成的数据已变得难以(甚至无法)与真实数据区分。尽管有许多用例可从该技术中受益,但人们也强烈担忧这项新技术被滥用于欺骗传感器、生成深度伪造内容以及大规模散布错误信息。不幸的是,由于真实与伪造之间的差距不断缩小,当前的深度伪造检测方法难以为继。相比之下,我们的工作实现了对此类最先进生成模型的负责任披露:研究人员和公司可以对其模型加入指纹,从而能够准确检测包含指纹的生成样本并将其归因于来源。我们的技术通过高效、可扩展地按需生成大量具有不同指纹的模型来实现这一目标。我们推荐的工作点使用128位指纹,原则上可得到超过$10^{36}$个可识别的模型。实验结果表明,该方法满足指纹机制的关键特性,并在深度伪造检测和归因方面取得了良好效果。
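At detection time, attribution reduces to nearest-fingerprint matching in Hamming distance; a toy sketch (how the fingerprint is embedded into and decoded from the generator is the paper's contribution and is omitted here):

```python
# Toy sketch: attribute a decoded 128-bit fingerprint to one model in a large
# fingerprinted population via minimum Hamming distance.
import secrets

FP_BITS = 128

def random_fingerprint() -> int:
    return secrets.randbits(FP_BITS)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

registry = {f"model_{i}": random_fingerprint() for i in range(1000)}

# Suppose a detector decoded this (noisy) fingerprint from a generated image:
true_fp = registry["model_42"]
decoded = true_fp ^ (1 << 7) ^ (1 << 99)   # two bit errors from decoding

best = min(registry, key=lambda name: hamming(registry[name], decoded))
print(best, hamming(registry[best], decoded))   # model_42, distance 2
# Random 128-bit fingerprints differ in ~64 bits on average, so a few decoding
# errors still attribute correctly; 2^128 is about 3.4e38 raw patterns, and the
# paper's "more than 10^36" figure presumably allows margin between fingerprints.
```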
59. Wasserstein Contrastive Representation Distillation [PDF] 返回目录
Liqun Chen, Zhe Gan, Dong Wang, Jingjing Liu, Ricardo Henao, Lawrence Carin
Abstract: The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former. Existing work, e.g., using Kullback-Leibler divergence for distillation, may fail to capture important structural knowledge in the teacher network and often lacks the ability for feature generalization, particularly in situations when teacher and student are built to address different classification tasks. We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for KD. The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks. The primal form is used for local contrastive knowledge transfer within a mini-batch, effectively matching the distributions of features between the teacher and the student networks. Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
摘要:知识蒸馏(KD)的主要目标是将从教师网络学到的模型信息封装到学生网络中,后者比前者更为紧凑。现有工作(例如使用Kullback-Leibler散度进行蒸馏)可能无法捕获教师网络中的重要结构知识,并且通常缺乏特征泛化能力,尤其是当教师网络和学生网络被构建用于解决不同分类任务时。我们提出了Wasserstein对比表示蒸馏(WCoRD),将Wasserstein距离的原始(primal)形式和对偶(dual)形式同时用于KD。对偶形式用于全局知识迁移,产生一个最大化教师与学生网络之间互信息下界的对比学习目标;原始形式用于小批量(mini-batch)内的局部对比知识迁移,有效匹配教师与学生网络之间的特征分布。实验表明,所提出的WCoRD方法在特权信息蒸馏、模型压缩和跨模态迁移方面优于最新方法。
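A compact sketch of the dual-form idea: a roughly 1-Lipschitz critic estimates W1 between teacher and student feature distributions, and the student minimizes that estimate. Weight clipping stands in for a proper Lipschitz constraint, and the whole setup is illustrative rather than the paper's exact loss:

```python
# Sketch of the dual (Kantorovich-Rubinstein) form: W1(T, S) is approximated
# by max_f E_T[f] - E_S[f] over ~1-Lipschitz critics f. Weight clipping is a
# crude Lipschitz surrogate used only to keep the sketch short.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def critic_step(feat_teacher, feat_student):
    opt_c.zero_grad()
    # Maximize E_T[f] - E_S[f]  <=>  minimize the negative.
    loss = -(critic(feat_teacher).mean() - critic(feat_student).mean())
    loss.backward()
    opt_c.step()
    for p in critic.parameters():            # enforce (roughly) 1-Lipschitz
        p.data.clamp_(-0.01, 0.01)

def student_w1_loss(feat_teacher, feat_student):
    # The student minimizes the current W1 estimate (teacher features fixed).
    return critic(feat_teacher.detach()).mean() - critic(feat_student).mean()

t = torch.randn(32, 128)                      # teacher penultimate features
s = torch.randn(32, 128, requires_grad=True)  # student features (stand-in)
critic_step(t, s.detach())
print(student_w1_loss(t, s))
```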
60. Mitigating bias in calibration error estimation [PDF] 返回目录
Rebecca Roelofs, Nicholas Cain, Jonathon Shlens, Michael C. Mozer
Abstract: Building reliable machine learning systems requires that we correctly understand their level of confidence. Calibration focuses on measuring the degree of accuracy in a model's confidence and most research in calibration focuses on techniques to improve an empirical estimate of calibration error, ECE_bin. Using simulation, we show that ECE_bin can systematically underestimate or overestimate the true calibration error depending on the nature of model miscalibration, the size of the evaluation data set, and the number of bins. Critically, ECE_bin is more strongly biased for perfectly calibrated models. We propose a simple alternative calibration error metric, ECE_sweep, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function. Evaluating our measure on distributions fit to neural network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that ECE_sweep produces a less biased estimator of calibration error and therefore should be used by any researcher wishing to evaluate the calibration of models trained on similar datasets.
摘要:构建可靠的机器学习系统要求我们正确理解其置信水平。校准关注模型置信度的准确程度,而大多数校准研究集中在改进校准误差的经验估计ECE_bin的技术上。通过仿真,我们表明,取决于模型失准的性质、评估数据集的规模以及分箱(bin)数量,ECE_bin可能系统性地低估或高估真实校准误差。关键的是,对于完美校准的模型,ECE_bin的偏差更大。我们提出了一种简单的替代校准误差度量ECE_sweep,它在保持校准函数单调性的前提下,将分箱数量选择得尽可能大。我们在拟合CIFAR-10、CIFAR-100和ImageNet上神经网络置信度得分的分布上评估了该度量,结果表明ECE_sweep给出的校准误差估计偏差更小,因此任何希望评估在类似数据集上训练的模型的校准情况的研究人员都应使用它。
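Both estimators are short to write down; a sketch of binned ECE over equal-mass bins plus the monotonic sweep selection (the binning details here are our reading of the criterion, not the paper's exact procedure):

```python
# Sketch: equal-mass binned ECE, plus ECE_sweep's bin selection -- use the
# largest number of bins for which bin accuracies are still monotonically
# non-decreasing with confidence.
import numpy as np

def ece_binned(conf, correct, n_bins):
    order = np.argsort(conf)
    bins = np.array_split(order, n_bins)          # equal-mass bins
    n = len(conf)
    ece, accs = 0.0, []
    for b in bins:
        acc, avg_conf = correct[b].mean(), conf[b].mean()
        ece += len(b) / n * abs(acc - avg_conf)
        accs.append(acc)
    return ece, accs

def ece_sweep(conf, correct):
    best = None
    for n_bins in range(1, len(conf) + 1):
        ece, accs = ece_binned(conf, correct, n_bins)
        if np.all(np.diff(accs) >= 0):            # accuracies still monotone
            best = ece
        else:
            break                                 # stop at first violation
    return best

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 2000)
correct = (rng.uniform(size=2000) < conf).astype(float)  # calibrated model
print(ece_binned(conf, correct, 15)[0], ece_sweep(conf, correct))
```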
61. CUDA-Optimized real-time rendering of a Foveated Visual System [PDF] 返回目录
Elian Malkin, Arturo Deza, Tomaso Poggio
Abstract: The spatially-varying field of the human visual system has recently received a resurgence of interest with the development of virtual reality (VR) and neural networks. The computational demands of high resolution rendering desired for VR can be offset by savings in the periphery, while neural networks trained with foveated input have shown perceptual gains in i.i.d and o.o.d generalization. In this paper, we present a technique that exploits the CUDA GPU architecture to efficiently generate Gaussian-based foveated images at high definition (1920x1080 px) in real-time (165 Hz), with a larger number of pooling regions than previous Gaussian-based foveation algorithms by several orders of magnitude, producing a smoothly foveated image that requires no further blending or stitching, and that can be well fit for any contrast sensitivity function. The approach described can be adapted from Gaussian blurring to any eccentricity-dependent image processing and our algorithm can meet demand for experimentation to evaluate the role of spatially-varying processing across biological and artificial agents, so that foveation can be added easily on top of existing systems rather than forcing their redesign (emulated foveated renderer). Altogether, this paper demonstrates how a GPU, with a CUDA block-wise architecture, can be employed for radially-variant rendering, with opportunities for more complex post-processing to ensure a metameric foveation scheme. Code is provided.
摘要:随着虚拟现实(VR)和神经网络的发展,人类视觉系统的空间变化特性重新引起了人们的兴趣。VR所需的高分辨率渲染的计算需求可以通过在视野外围的节省来抵消,而使用注视点化(foveated)输入训练的神经网络已在独立同分布(i.i.d)和分布外(o.o.d)泛化上显示出感知增益。在本文中,我们提出一种利用CUDA GPU架构的技术,能以实时帧率(165 Hz)高效生成高清(1920x1080 px)的基于高斯的注视点图像,其池化区域数量比以往基于高斯的注视点算法多出几个数量级,生成的平滑注视点化图像无需进一步融合或拼接,并且可以很好地拟合任意对比敏感度函数。所述方法可从高斯模糊推广到任何依赖离心率的图像处理,我们的算法可满足评估空间变化处理在生物与人工智能体中作用的实验需求,使注视点处理可以轻松添加到现有系统之上,而无需强制重新设计(模拟注视点渲染器)。总之,本文展示了如何利用具有CUDA分块架构的GPU进行径向变化渲染,并为更复杂的后处理提供了空间,以实现同色异谱(metameric)的注视点方案。文中提供了代码。
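The core operation is eccentricity-dependent Gaussian pooling; a CPU/numpy toy that blends precomputed blur levels so blur grows with distance from fixation (the paper's contribution is doing this in a CUDA kernel at 1080p and 165 Hz, which is not reproduced here):

```python
# Toy numpy foveation: blend a pyramid of Gaussian blurs so that blur sigma
# grows with eccentricity from a fixation point. Illustrative only.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(img, fix_y, fix_x, sigma_per_px=0.02, levels=6):
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ecc = np.hypot(yy - fix_y, xx - fix_x)      # eccentricity in pixels
    sigma = sigma_per_px * ecc                  # blur grows radially
    max_sigma = sigma.max()
    sigmas = np.linspace(0.0, max_sigma, levels)
    blurred = np.stack([img if s == 0 else gaussian_filter(img, s)
                        for s in sigmas])
    # Map each pixel's sigma to a fractional pyramid level, blend linearly.
    step = max_sigma / (levels - 1)
    level = sigma / step
    lo = np.clip(level.astype(int), 0, levels - 1)
    hi = np.clip(lo + 1, 0, levels - 1)
    frac = np.clip(level - lo, 0.0, 1.0)
    return (1 - frac) * blurred[lo, yy, xx] + frac * blurred[hi, yy, xx]

img = np.random.rand(128, 128)
out = foveate(img, 64, 64)
print(out.shape)  # (128, 128): sharp at fixation, progressively blurred outward
```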
62. An anatomically-informed 3D CNN for brain aneurysm classification with weak labels [PDF] 返回目录
Tommaso Di Noto, Guillaume Marie, Sébastien Tourbier, Yasser Alemán-Gómez, Guillaume Saliou, Meritxell Bach Cuadra, Patric Hagmann, Jonas Richiardi
Abstract: A commonly adopted approach to carry out detection tasks in medical imaging is to rely on an initial segmentation. However, this approach strongly depends on voxel-wise annotations which are repetitive and time-consuming to draw for medical experts. An interesting alternative to voxel-wise masks are so-called "weak" labels: these can either be coarse or oversized annotations that are less precise, but noticeably faster to create. In this work, we address the task of brain aneurysm detection as a patch-wise binary classification with weak labels, in contrast to related studies that rather use supervised segmentation methods and voxel-wise delineations. Our approach comes with the non-trivial challenge of the data set creation: as for most focal diseases, anomalous patches (with aneurysm) are outnumbered by those showing no anomaly, and the two classes usually have different spatial distributions. To tackle this frequent scenario of inherently imbalanced, spatially skewed data sets, we propose a novel, anatomically-driven approach by using a multi-scale and multi-input 3D Convolutional Neural Network (CNN). We apply our model to 214 subjects (83 patients, 131 controls) who underwent Time-Of-Flight Magnetic Resonance Angiography (TOF-MRA) and presented a total of 111 unruptured cerebral aneurysms. We compare two strategies for negative patch sampling that have an increasing level of difficulty for the network and we show how this choice can strongly affect the results. To assess whether the added spatial information helps improving performances, we compare our anatomically-informed CNN with a baseline, spatially-agnostic CNN. When considering the more realistic and challenging scenario including vessel-like negative patches, the former model attains the highest classification results (accuracy$\simeq$95\%, AUROC$\simeq$0.95, AUPR$\simeq$0.71), thus outperforming the baseline.
摘要:在医学影像中执行检测任务的常用方法是依赖初始分割。但是,这种方法强烈依赖体素级标注,而这类标注对医学专家而言绘制起来重复且耗时。体素级掩码的一种有趣替代是所谓的"弱"标签:它们可以是较粗略或超出范围的标注,精度较低,但创建速度明显更快。在这项工作中,我们将脑动脉瘤检测任务处理为带弱标签的基于图像块(patch)的二分类,这与使用有监督分割方法和体素级勾画的相关研究形成对比。我们的方法面临数据集构建上的非平凡挑战:与大多数局灶性疾病一样,异常图像块(含动脉瘤)的数量远少于无异常的图像块,且两类通常具有不同的空间分布。为应对这种固有不平衡、空间偏斜数据集的常见情形,我们提出了一种新颖的解剖学驱动方法,使用多尺度、多输入的3D卷积神经网络(CNN)。我们将模型应用于接受飞行时间磁共振血管造影(TOF-MRA)的214名受试者(83名患者,131名对照),共包含111个未破裂脑动脉瘤。我们比较了对网络难度递增的两种负样本图像块采样策略,并展示了这一选择如何显著影响结果。为评估加入的空间信息是否有助于提升性能,我们将融入解剖学信息的CNN与空间无关的基线CNN进行了比较。在考虑包含类血管负样本块的更现实、更具挑战性的场景时,前一种模型获得了最高的分类结果(准确率$\simeq$95%,AUROC$\simeq$0.95,AUPR$\simeq$0.71),优于基线。
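A minimal sketch of the multi-scale, multi-input patch classifier; the branch sizes, the fusion head, and the way anatomical context enters are illustrative assumptions:

```python
# Sketch of a multi-scale, multi-input 3D CNN for patch-wise classification:
# one branch sees a small patch, another a larger context patch resampled to
# the same grid. All sizes and the fusion head are illustrative.
import torch
import torch.nn as nn

def branch3d(in_ch: int = 1) -> nn.Module:
    return nn.Sequential(
        nn.Conv3d(in_ch, 8, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
        nn.Conv3d(8, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool3d(1), nn.Flatten())      # -> (B, 16)

class MultiScale3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fine, self.coarse = branch3d(), branch3d()
        self.head = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, patch_fine, patch_coarse):
        f = torch.cat([self.fine(patch_fine), self.coarse(patch_coarse)], dim=1)
        return self.head(f).squeeze(1)               # aneurysm logit per patch

net = MultiScale3DCNN()
fine = torch.randn(4, 1, 32, 32, 32)     # small TOF-MRA patch
coarse = torch.randn(4, 1, 32, 32, 32)   # larger context, resampled to 32^3
print(net(fine, coarse).shape)            # torch.Size([4])
```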
63. Jet tagging in the Lund plane with graph networks [PDF] 返回目录
Frédéric A. Dreyer, Huilin Qu
Abstract: The identification of boosted heavy particles such as top quarks or vector bosons is one of the key problems arising in experimental studies at the Large Hadron Collider. In this article, we introduce LundNet, a novel jet tagging method which relies on graph neural networks and an efficient description of the radiation patterns within a jet to optimally disentangle signatures of boosted objects from background events. We apply this framework to a number of different benchmarks, showing significantly improved performance for top tagging compared to existing state-of-the-art algorithms. We study the robustness of the LundNet taggers to non-perturbative and detector effects, and show how kinematic cuts in the Lund plane can mitigate overfitting of the neural network to model-dependent contributions. Finally, we consider the computational complexity of this method and its scaling as a function of kinematic Lund plane cuts, showing an order of magnitude improvement in speed over previous graph-based taggers.
摘要:识别顶夸克或矢量玻色子等高度助推(boosted)的重粒子,是大型强子对撞机实验研究中的关键问题之一。在本文中,我们介绍LundNet,一种新颖的喷注标记方法,它依赖图神经网络和对喷注内辐射模式的高效描述,以最优地将助推对象的特征与背景事件区分开。我们将该框架应用于多个不同的基准,结果显示与现有最佳算法相比,顶夸克标记的性能显著提升。我们研究了LundNet标记器对非微扰效应和探测器效应的鲁棒性,并展示了Lund平面中的运动学切割如何缓解神经网络对依赖模型的贡献的过拟合。最后,我们考察了该方法的计算复杂度及其随Lund平面运动学切割的变化,结果显示其速度比以前基于图的标记器提高了一个数量级。
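The Lund-plane coordinates attached to each declustering node are straightforward to compute; a sketch for a single splitting using the standard definitions Delta = DeltaR(a, b) and kt = min(pT_a, pT_b) * Delta (the declustering tree and the graph network itself are omitted):

```python
# Sketch: Lund-plane coordinates for one Cambridge/Aachen declustering step.
# Each node of the LundNet graph carries features like (ln 1/Delta, ln kt);
# the full pipeline (declustering tree -> graph -> GNN) is omitted.
import math

def lund_coordinates(pt_a, eta_a, phi_a, pt_b, eta_b, phi_b):
    dphi = (phi_a - phi_b + math.pi) % (2 * math.pi) - math.pi
    delta = math.hypot(eta_a - eta_b, dphi)        # angular distance Delta R
    kt = min(pt_a, pt_b) * delta                   # transverse momentum scale
    return math.log(1.0 / delta), math.log(kt)

# Example splitting: a hard prong and a softer, nearby prong.
ln_inv_delta, ln_kt = lund_coordinates(
    pt_a=300.0, eta_a=0.10, phi_a=1.00,
    pt_b=40.0,  eta_b=0.25, phi_b=1.20)
print(ln_inv_delta, ln_kt)   # each declustering node contributes one such pair
```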
注:中文为机器翻译结果!封面为论文标题词云图!