Abstracts

1. Entropy Maximization and Meta Classification for Out-Of-Distribution Detection in Semantic Segmentation
Robin Chan, Matthias Rottmann, Hanno Gottschalk
Abstract: Deep neural networks (DNNs) for the semantic segmentation of images are usually trained to operate on a predefined closed set of object classes. This is in contrast to the "open world" setting where DNNs are envisioned to be deployed to. From a functional safety point of view, the ability to detect so-called "out-of-distribution" (OoD) samples, i.e., objects outside of a DNN's semantic space, is crucial for many applications such as automated driving. A natural baseline approach to OoD detection is to threshold on the pixel-wise softmax entropy. We present a two-step procedure that significantly improves that approach. Firstly, we utilize samples from the COCO dataset as OoD proxy and introduce a second training objective to maximize the softmax entropy on these samples. Starting from pretrained semantic segmentation networks we re-train a number of DNNs on different in-distribution datasets and consistently observe improved OoD detection performance when evaluating on completely disjoint OoD datasets. Secondly, we perform a transparent post-processing step to discard false positive OoD samples by so-called "meta classification". To this end, we apply linear models to a set of hand-crafted metrics derived from the DNN's softmax probabilities. In our experiments we consistently observe a clear additional gain in OoD detection performance, cutting down the number of detection errors by up to 52% when comparing the best baseline with our results. We achieve this improvement sacrificing only marginally in original segmentation performance. Therefore, our method contributes to safer DNNs with more reliable overall system performance.
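The baseline mentioned in this abstract, thresholding the pixel-wise softmax entropy, can be illustrated with a minimal sketch. The function names, the normalization by log(number of classes), and the threshold value 0.7 are assumptions made for illustration; they are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(num_classes), so the result lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def ood_mask(logit_map, threshold=0.7):
    """Flag pixels whose normalized softmax entropy exceeds the (hypothetical) threshold."""
    return [
        [normalized_entropy(softmax(pixel)) > threshold for pixel in row]
        for row in logit_map
    ]

# Toy 1x2 "image" with 3 classes: a confident pixel and a near-uniform pixel.
logit_map = [[[8.0, 0.0, 0.0], [0.1, 0.0, 0.1]]]
mask = ood_mask(logit_map)
print(mask)  # [[False, True]]
```

The confident pixel has low entropy and is kept as in-distribution, while the near-uniform pixel is flagged as a possible OoD pixel. The entropy-maximization training and the meta classification step described in the abstract are not part of this sketch.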

2. A Comprehensive Study of Deep Video Action Recognition
Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li
Abstract: Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.

3. Exploring Facial Expressions and Affective Domains for Parkinson Detection
Luis Felipe Gomez-Gomez, Aythami Morales, Julian Fierrez, Juan Rafael Orozco-Arroyave
Abstract: Parkinson's Disease (PD) is a neurological disorder that affects facial movements and non-verbal communication. Patients with PD present a reduction in facial movements called hypomimia which is evaluated in item 3.2 of the MDS-UPDRS-III scale. In this work, we propose to use facial expression analysis from face images based on affective domains to improve PD detection. We propose different domain adaptation techniques to exploit the latest advances in face recognition and Face Action Unit (FAU) detection. The principal contributions of this work are: (1) a novel framework to exploit deep face architectures to model hypomimia in PD patients; (2) we experimentally compare PD detection based on single images vs. image sequences while the patients are evoked various face expressions; (3) we explore different domain adaptation techniques to exploit existing models initially trained either for Face Recognition or to detect FAUs for the automatic discrimination between PD patients and healthy subjects; and (4) a new approach to use triplet-loss learning to improve hypomimia modeling and PD detection. The results on real face images from PD patients show that we are able to properly model evoked emotions using image sequences (neutral, onset-transition, apex, offset-transition, and neutral) with accuracy improvements up to 5.5% (from 72.9% to 78.4%) with respect to single-image PD detection. We also show that our proposed affective-domain adaptation provides improvements in PD detection up to 8.9% (from 78.4% to 87.3% detection accuracy).

4. LayoutGMN: Neural Graph Matching for Structural Layout Similarity
Akshay Gadi Patil, Manyi Li, Matthew Fisher, Manolis Savva, Hao Zhang
Abstract: We present a deep neural network to predict structural similarity between 2D layouts by leveraging Graph Matching Networks (GMN). Our network, coined LayoutGMN, learns the layout metric via neural graph matching, using an attention-based GMN designed under a triplet network setting. To train our network, we utilize weak labels obtained by pixel-wise Intersection-over-Union (IoUs) to define the triplet loss. Importantly, LayoutGMN is built with a structural bias which can effectively compensate for the lack of structure awareness in IoUs. We demonstrate this on two prominent forms of layouts, viz., floorplans and UI designs, via retrieval experiments on large-scale datasets. In particular, retrieval results by our network better match human judgement of structural layout similarity compared to both IoUs and other baselines including a state-of-the-art method based on graph neural networks and image convolution. In addition, LayoutGMN is the first deep model to offer both metric learning of structural layout similarity and structural matching between layout elements.
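As a rough illustration of how pixel-wise IoU could act as a weak label for a triplet loss, the sketch below picks the higher-IoU candidate as the positive example and applies a standard margin-based triplet loss. The mask representation, the selection rule, the distances, and the margin are all assumptions for illustration; the abstract does not specify how LayoutGMN builds its triplets.

```python
def iou(mask_a, mask_b):
    """Pixel-wise Intersection-over-Union of two equally sized binary masks."""
    pairs = [
        (a, b)
        for row_a, row_b in zip(mask_a, mask_b)
        for a, b in zip(row_a, row_b)
    ]
    intersection = sum(1 for a, b in pairs if a and b)
    union = sum(1 for a, b in pairs if a or b)
    return intersection / union if union else 0.0

def triplet_loss(dist_anchor_pos, dist_anchor_neg, margin=1.0):
    """Standard margin-based triplet loss on two embedding distances."""
    return max(0.0, dist_anchor_pos - dist_anchor_neg + margin)

# Toy rasterized layouts (1 = occupied pixel).
anchor = [[1, 1], [0, 0]]
cand_a = [[1, 1], [1, 0]]
cand_b = [[0, 0], [1, 1]]

iou_a = iou(anchor, cand_a)  # 2 shared pixels out of 3 occupied
iou_b = iou(anchor, cand_b)  # no shared pixels
# Weak label (assumed rule): the candidate with higher IoU is the positive.
positive, negative = (cand_a, cand_b) if iou_a >= iou_b else (cand_b, cand_a)

# Hypothetical embedding distances produced by some learned network.
loss_good = triplet_loss(0.5, 2.0)  # positive is already closer: zero loss
loss_bad = triplet_loss(2.0, 0.5)   # positive is farther: penalized
```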

5. Dependency Decomposition and a Reject Option for Explainable Models
Jan Kronenberger, Anselm Haselhoff
Abstract: Deploying machine learning models in safety-related domains (e.g. autonomous driving, medical diagnosis) demands approaches that are explainable, robust against adversarial attacks and aware of the model uncertainty. Recent deep learning models perform extremely well in various inference tasks, but the black-box nature of these approaches leads to a weakness regarding the three requirements mentioned above. Recent advances offer methods to visualize features, describe attribution of the input (e.g. heatmaps), provide textual explanations or reduce dimensionality. However, are explanations for classification tasks dependent or are they independent of each other? For instance, is the shape of an object dependent on the color? What is the effect of using the predicted class for generating explanations and vice versa? In the context of explainable deep learning models, we present the first analysis of dependencies regarding the probability distribution over the desired image classification outputs and the explaining variables (e.g. attributes, texts, heatmaps). To this end, we perform an Explanation Dependency Decomposition (EDD). We analyze the implications of the different dependencies and propose two ways of generating the explanation. Finally, we use the explanation to verify (accept or reject) the prediction.

6. Detection of Binary Square Fiducial Markers Using an Event Camera
Hamid Sarmadi, Rafael Muñoz-Salinas, Miguel A. Olivares-Mendez, Rafael Medina-Carnicer
Abstract: Event cameras are a new type of image sensors that output changes in light intensity (events) instead of absolute intensity values. They have a very high temporal resolution and a high dynamic range. In this paper, we propose a method to detect and decode binary square markers using an event camera. We detect the edges of the markers by detecting line segments in an image created from events in the current packet. The line segments are combined to form marker candidates. The bit value of marker cells is decoded using the events on their borders. To the best of our knowledge, no other approach exists for detecting square binary markers directly from an event camera. Experimental results show that the performance of our proposal is much superior to the one from the RGB ArUco marker detector. Additionally, the proposed method can run on a single CPU core in real-time.

7. Objectness-Guided Open Set Visual Search and Closed Set Detection
Nathan Drenkow, Philippe Burlina, Neil Fendley, Kachi Odoemene, Jared Markowitz
Abstract: Searching for small objects in large images is currently challenging for deep learning systems, but is a task with numerous applications including remote sensing and medical imaging. Thorough scanning of very large images is computationally expensive, particularly at resolutions sufficient to capture small objects. The smaller an object of interest, the more likely it is to be obscured by clutter or otherwise deemed insignificant. We examine these issues in the context of two complementary problems: closed-set object detection and open-set target search. First, we present a method for predicting pixel-level objectness from a low resolution gist image, which we then use to select regions for subsequent evaluation at high resolution. This approach has the benefit of not being fixed to a predetermined grid, allowing fewer costly high-resolution glimpses than existing methods. Second, we propose a novel strategy for open-set visual search that seeks to find all objects in an image of the same class as a given target reference image. We interpret both detection problems through a probabilistic, Bayesian lens, whereby the objectness maps produced by our method serve as priors in a maximum-a-posteriori approach to the detection step. We evaluate the end-to-end performance of both the combination of our patch selection strategy with this target search approach and the combination of our patch selection strategy with standard object detection methods. Both our patch selection and target search approaches are seen to significantly outperform baseline strategies.

8. Confidence Estimation via Auxiliary Models
Charles Corbière, Nicolas Thome, Antoine Saporta, Tuan-Hung Vu, Matthieu Cord, Patrick Pérez
Abstract: Reliably quantifying the confidence of deep neural classifiers is a challenging yet fundamental requirement for deploying such models in safety-critical applications. In this paper, we introduce a novel target criterion for model confidence, namely the true class probability (TCP). We show that TCP offers better properties for confidence estimation than standard maximum class probability (MCP). Since the true class is by essence unknown at test time, we propose to learn TCP criterion from data with an auxiliary model, introducing a specific learning scheme adapted to this context. We evaluate our approach on the task of failure prediction and of self-training with pseudo-labels for domain adaptation, which both necessitate effective confidence estimates. Extensive experiments are conducted for validating the relevance of the proposed approach in each task. We study various network architectures and experiment with small and large datasets for image classification and semantic segmentation. In every tested benchmark, our approach outperforms strong baselines.
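The difference between the two confidence criteria named in this abstract can be seen on a toy example. The sketch below only evaluates maximum class probability (MCP) and true class probability (TCP) on hand-written softmax outputs; the function names and numbers are illustrative assumptions, and the auxiliary model that learns to estimate TCP at test time is not shown.

```python
def mcp(probs):
    """Maximum class probability: confidence of the predicted class."""
    return max(probs)

def tcp(probs, true_label):
    """True class probability: probability assigned to the ground-truth class."""
    return probs[true_label]

# Both samples have ground-truth class 0.
correct_prediction = [0.7, 0.2, 0.1]  # predicted class 0 (right)
wrong_prediction = [0.2, 0.7, 0.1]    # predicted class 1 (wrong)

for name, probs in [("correct", correct_prediction), ("wrong", wrong_prediction)]:
    print(name, "MCP =", mcp(probs), "TCP =", tcp(probs, true_label=0))
```

MCP is 0.7 for both samples, so it cannot tell the failure from the success, whereas TCP drops to 0.2 for the misclassified sample. Since the true label is unavailable at test time, TCP has to be estimated, which is the role the abstract assigns to the auxiliary model.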

9. DeepObjStyle: Deep Object-based Photo Style Transfer
Indra Deep Mastan, Shanmuganathan Raman
Abstract: One of the major challenges of style transfer is the appropriate image features supervision between the output image and the input (style and content) images. An efficient strategy would be to define an object map between the objects of the style and the content images. However, such a mapping is not well established when there are semantic objects of different types and numbers in the style and the content images. It also leads to content mismatch in the style transfer output, which could reduce the visual quality of the results. We propose an object-based style transfer approach, called DeepObjStyle, for the style supervision in the training data-independent framework. DeepObjStyle preserves the semantics of the objects and achieves better style transfer in the challenging scenario when the style and the content images have a mismatch of image features. We also perform style transfer of images containing a word cloud to demonstrate that DeepObjStyle enables an appropriate image features supervision. We validate the results using quantitative comparisons and user studies.

10. Unsupervised deep learning for individualized brain functional network identification
Hongming Li, Yong Fan
Abstract: A novel unsupervised deep learning method is developed to identify individual-specific large scale brain functional networks (FNs) from resting-state fMRI (rsfMRI) in an end-to-end learning fashion. Our method leverages deep Encoder-Decoder networks and conventional brain decomposition models to identify individual-specific FNs in an unsupervised learning framework and facilitate fast inference for new individuals with one forward pass of the deep network. Particularly, convolutional neural networks (CNNs) with an Encoder-Decoder architecture are adopted to identify individual-specific FNs from rsfMRI data by optimizing their data fitting and sparsity regularization terms that are commonly used in brain decomposition models. Moreover, a time-invariant representation learning module is designed to learn features invariant to temporal orders of time points of rsfMRI data. The proposed method has been validated based on a large rsfMRI dataset and experimental results have demonstrated that our method could obtain individual-specific FNs which are consistent with well-established FNs and are informative for predicting brain age, indicating that the individual-specific FNs identified truly captured the underlying variability of individualized functional neuroanatomy.

11. EventHands: Real-Time Neural 3D Hand Reconstruction from an Event Stream
Viktor Rudnev, Vladislav Golyanik, Jiayi Wang, Hans-Peter Seidel, Franziska Mueller, Mohamed Elgharib, Christian Theobalt
Abstract: 3D hand pose estimation from monocular videos is a long-standing and challenging problem, which is now seeing a strong upturn. In this work, we address it for the first time using a single event camera, i.e., an asynchronous vision sensor reacting on brightness changes. Our EventHands approach has characteristics previously not demonstrated with a single RGB or depth camera such as high temporal resolution at low data throughputs and real-time performance at 1000 Hz. Due to the different data modality of event cameras compared to classical cameras, existing methods cannot be directly applied to and re-trained for event streams. We thus design a new neural approach which accepts a new event stream representation suitable for learning, which is trained on newly-generated synthetic event streams and can generalise to real data. Experiments show that EventHands outperforms recent monocular methods using a colour (or depth) camera in terms of accuracy and its ability to capture hand motions of unprecedented speed. Our method, the event stream simulator and the dataset will be made publicly available.

12. DILIE: Deep Internal Learning for Image Enhancement
Indra Deep Mastan, Shanmuganathan Raman
Abstract: We consider the generic deep image enhancement problem where an input image is transformed into a perceptually better-looking image. Recent methods for image enhancement consider the problem by performing style transfer and image restoration. The methods mostly fall into two categories: training data-based and training data-independent (deep internal learning methods). We perform image enhancement in the deep internal learning framework. Our Deep Internal Learning for Image Enhancement framework enhances content features and style features and uses contextual content loss for preserving image context in the enhanced image. We show results on both hazy and noisy image enhancement. To validate the results, we use structure similarity and perceptual error, which is efficient in measuring the unrealistic deformation present in the images. We show that the proposed framework outperforms the relevant state-of-the-art works for image enhancement.

13. Cyclic orthogonal convolutions for long-range integration of features
Federica Freddi, Jezabel R Garcia, Michael Bromberg, Sepehr Jalali, Da-Shan Shiu, Alvin Chua, Alberto Bernacchia
Abstract: In Convolutional Neural Networks (CNNs) information flows across a small neighbourhood of each pixel of an image, preventing long-range integration of features before reaching deep layers in the network. We propose a novel architecture that allows flexible information flow between features $z$ and locations $(x,y)$ across the entire image with a small number of layers. This architecture uses a cycle of three orthogonal convolutions, not only in $(x,y)$ coordinates, but also in $(x,z)$ and $(y,z)$ coordinates. We stack a sequence of such cycles to obtain our deep network, named CycleNet. As this only requires a permutation of the axes of a standard convolution, its performance can be directly compared to a CNN. Our model obtains competitive results at image classification on CIFAR-10 and ImageNet datasets, when compared to CNNs of similar size. We hypothesise that long-range integration favours recognition of objects by shape rather than texture, and we show that CycleNet transfers better than CNNs to stylised images. On the Pathfinder challenge, where integration of distant features is crucial, CycleNet outperforms CNNs by a large margin. We also show that even when employing a small convolutional kernel, the size of receptive fields of CycleNet reaches its maximum after one cycle, while conventional CNNs require a large number of layers.

14. Relighting Images in the Wild with a Self-Supervised Siamese Auto-Encoder
Yang Liu, Alexandros Neophytou, Sunando Sengupta, Eric Sommerlade
Abstract: We propose a self-supervised method for image relighting of single view images in the wild. The method is based on an auto-encoder which deconstructs an image into two separate encodings, relating to the scene illumination and content, respectively. In order to disentangle this embedding information without supervision, we exploit the assumption that some augmentation operations do not affect the image content and only affect the direction of the light. A novel loss function, called spherical harmonic loss, is introduced that forces the illumination embedding to convert to a spherical harmonic vector. We train our model on large-scale datasets such as Youtube 8M and CelebA. Our experiments show that our method can correctly estimate scene illumination and realistically re-light input images, without any supervision or a prior shape model. Compared to supervised methods, our approach has similar performance and avoids common lighting artifacts.

15. D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations
Sanath Narayan, Hisham Cholakkal, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao
Abstract: This work proposes a weakly-supervised temporal action localization framework, called D2-Net, which strives to temporally localize actions using video-level supervision. Our main contribution is the introduction of a novel loss formulation, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision. The proposed formulation comprises a discriminative and a denoising loss term for enhancing temporal action localization. The discriminative term incorporates a classification loss and utilizes a top-down attention mechanism to enhance the separability of latent foreground-background embeddings. The denoising loss term explicitly addresses the foreground-background noise in class activations by simultaneously maximizing intra-video and inter-video mutual information using a bottom-up attention mechanism. As a result, activations in the foreground regions are emphasized whereas those in the background regions are suppressed, thereby leading to more robust predictions. Comprehensive experiments are performed on two benchmarks: THUMOS14 and ActivityNet1.2. Our D2-Net performs favorably in comparison to the existing methods on both datasets, achieving gains as high as 3.6% in terms of mean average precision on THUMOS14.

16. Iso-Points: Optimizing Neural Implicit Surfaces with Hybrid Representations
Wang Yifan, Shihao Wu, Cengiz Oztireli, Olga Sorkine-Hornung
Abstract: Neural implicit functions have emerged as a powerful representation for surfaces in 3D. Such a function can encode a high quality surface with intricate details into the parameters of a deep neural network. However, optimizing for the parameters for accurate and robust reconstructions remains a challenge, especially when the input data is noisy or incomplete. In this work, we develop a hybrid neural surface representation that allows us to impose geometry-aware sampling and regularization, which significantly improves the fidelity of reconstructions. We propose to use iso-points as an explicit representation for a neural implicit function. These points are computed and updated on-the-fly during training to capture important geometric features and impose geometric constraints on the optimization. We demonstrate that our method can be adopted to improve state-of-the-art techniques for reconstructing neural implicit surfaces from multi-view images or point clouds. Quantitative and qualitative evaluations show that, compared with existing sampling and optimization methods, our approach allows faster convergence, better generalization, and accurate recovery of details and topology.

17. Imitation-Based Active Camera Control with Deep Convolutional Neural Network
Christos Kyrkou
Abstract: The increasing need for automated visual monitoring and control for applications such as smart camera surveillance, traffic monitoring, and intelligent environments necessitates the improvement of methods for visual active monitoring. Traditionally, the active monitoring task has been handled through a pipeline of modules such as detection, filtering, and control. In this paper we frame active visual monitoring as an imitation learning problem to be solved in a supervised manner using deep learning, to go directly from visual information to camera movement in order to provide a satisfactory solution by combining computer vision and control. A deep convolutional neural network is trained end-to-end as the camera controller that learns the entire processing pipeline needed to control a camera to follow multiple targets and also estimate their density from a single image. Experimental results indicate that the proposed solution is robust to varying conditions and is able to achieve better monitoring performance than traditional approaches, both in terms of number of targets monitored and in monitoring time, while reaching up to 25 FPS. This makes it a practical and affordable solution for multi-target active monitoring in surveillance and smart-environment applications.

18. A Multi-task Joint Framework for Real-time Person Search
Ye Li, Kangning Yin, Jie Liang, Chunyu Wang, Guangqiang Yin
Abstract: Person search generally involves three important parts: person detection, feature extraction and identity comparison. However, person search integrating detection, extraction and comparison has the following drawbacks. Firstly, the accuracy of detection will affect the accuracy of comparison. Secondly, it is difficult to achieve real-time in real-world applications. To solve these problems, we propose a Multi-task Joint Framework for real-time person search (MJF), which optimizes the person detection, feature extraction and identity comparison respectively. For the person detection module, we proposed the YOLOv5-GS model, which is trained with person dataset. It combines the advantages of the Ghostnet and the Squeeze-and-Excitation (SE) block, and improves the speed and accuracy. For the feature extraction module, we design the Model Adaptation Architecture (MAA), which could select different network according to the number of people. It could balance the relationship between accuracy and speed. For identity comparison, we propose a Three Dimension (3D) Pooled Table and a matching strategy to improve identification accuracy. On the condition of 1920*1080 resolution video and 500 IDs table, the identification rate (IR) and frames per second (FPS) achieved by our method could reach 93.6% and 25.7,

19. A new automatic approach to seed image analysis: From acquisition to segmentation
A.M.P.G. Vale, M. Ucchesu, C. Di Ruberto, A. Loddo, J.M. Soares, G. Bacchetta
Abstract: Image Analysis offers a new tool for classifying vascular plant species based on the morphological and colorimetric features of the seeds, and has made significant contributions in systematic studies. However, in order to extract the morphological and colorimetric features, it is necessary to segment the image containing the samples to be analysed. This stage represents one of the most challenging steps in image processing, as it is difficult to separate uniform and homogeneous objects from the background. In this paper, we present a new, open source plugin for the automatic segmentation of an image of a seed sample. This plugin was written in Java to allow it to work with ImageJ open source software. The new plugin was tested on a total of 3,386 seed samples from 120 species belonging to the Fabaceae family. Digital images were acquired using a flatbed scanner. In order to test the efficacy of this approach in terms of identifying the edges of objects and separating them from the background, each sample was scanned using four different hues of blue for the background, and a total of 480 digital images were elaborated. The performance of the new plugin was compared with a method based on double image acquisition (with a black and white background) using the same seed samples, in which images were manually segmented using the Core ImageJ plugin. The results showed that the new plugin was able to segment all of the digital images without generating any object detection errors. In addition, the new plugin was able to segment images within an average of 0.02 s, while the average time for execution with the manual method was 63 s. This new open source plugin is proven to be able to work on a single image, and to be highly efficient in terms of time and segmentation when working with large numbers of images and a wide diversity of shapes.

20. Random Projections for Adversarial Attack Detection
Nathan Drenkow, Neil Fendley, Philippe Burlina
Abstract: Whilst adversarial attack detection has received considerable attention, it remains a fundamentally challenging problem from two perspectives. First, while threat models can be well-defined, attacker strategies may still vary widely within those constraints. Therefore, detection should be considered as an open-set problem, standing in contrast to most current detection strategies. These methods take a closed-set view and train binary detectors, thus biasing detection toward attacks seen during detector training. Second, information is limited at test time and confounded by nuisance factors including the label and underlying content of the image. Many of the current high-performing techniques use training sets for dealing with some of these issues, but can be limited by the overall size and diversity of those sets during the detection step. We address these challenges via a novel strategy based on random subspace analysis. We present a technique that makes use of special properties of random projections, whereby we can characterize the behavior of clean and adversarial examples across a diverse set of subspaces. We then leverage the self-consistency (or inconsistency) of model activations to discern clean from adversarial examples. Performance evaluation demonstrates that our technique outperforms ($>0.92$ AUC) competing state of the art (SOTA) attack strategies, while remaining truly agnostic to the attack method itself. It also requires significantly less training data, composed only of clean examples, when compared to competing SOTA methods, which achieve only chance performance, when evaluated in a more rigorous testing scenario.

21. Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Chiara Plizzari, Marco Cannici, Matteo Matteucci
Abstract: Skeleton-based human action recognition has attracted great interest in recent years, as skeleton data has been shown to be robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. Nevertheless, effectively encoding the latent information underlying the 3D skeleton is still an open problem. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) is used to model inter-frame correlations. The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data on both NTU-RGB+D 60 and NTU-RGB+D 120.
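
As a rough illustration of the self-attention operator mentioned in the abstract, the sketch below applies standard scaled dot-product attention across the joints of a single frame. It is not the ST-TR implementation: the function name, the random projection matrices, and the shapes (25 joints, 8 input channels, 4 output channels) are assumptions made only for this example.

```python
import numpy as np

def joint_self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the joints of one frame.

    x: (num_joints, channels) joint features.
    w_q, w_k, w_v: (channels, d) projection matrices (hypothetical here).
    Returns the attended features and the (num_joints, num_joints) weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 8))  # assumed: 25 joints, 8 channels per joint
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = joint_self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` describes how strongly one joint attends to every other joint in the same frame. Applying the same operator along the time axis for a fixed joint would be the temporal analogue.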

22. Cyclopean Geometry of Binocular Vision
Miles Hansard, Radu Horaud
Abstract: The geometry of binocular projection is analyzed, with reference to the primate visual system. In particular, the effects of coordinated eye movements on the retinal images are investigated. An appropriate oculomotor parameterization is defined, and is shown to complement the classical version and vergence angles. The midline horopter is identified, and subsequently used to construct the epipolar geometry of the system. It is shown that the Essential matrix can be obtained by combining the epipoles with the projection of the midline horopter. A local model of the scene is adopted, in which depth is measured relative to a plane containing the fixation point. The binocular disparity field is given a symmetric parameterization, in which the unknown scene-depths determine the location of corresponding image-features. The resulting Cyclopean depth-map can be combined with the estimated oculomotor parameters, to produce a local representation of the scene. The recovery of visual direction and depth from retinal images is discussed, with reference to the relevant psychophysical and neurophysiological literature.

23. Self-Growing Spatial Graph Network for Context-Aware Pedestrian Trajectory Prediction
Sirin Haddad, Siew-Kei Lam
Abstract: Pedestrian trajectory prediction is an active research area, with recent works embedding accurate models of pedestrians' social interactions and their contextual compliance into dynamic spatial graphs. However, existing works rely on spatial assumptions about the scene and dynamics, which makes it a significant challenge to adapt the graph structure in unknown environments for an online system. In addition, there is a lack of approaches for assessing the impact of relational modeling on prediction performance. To fill this gap, we propose the Social Trajectory Recommender-Gated Graph Recurrent Neighborhood Network (STR-GGRNN), which uses data-driven adaptive online neighborhood recommendation based on contextual scene features and pedestrian visual cues. The neighborhood recommendation is achieved by online Nonnegative Matrix Factorization (NMF), which constructs the graph adjacency matrices used to predict the pedestrians' trajectories. Experiments on widely used datasets show that our method outperforms the state of the art. Our best performing model achieves 12 cm ADE and ~15 cm FDE on the ETH-UCY dataset. The proposed method takes only 0.49 seconds when sampling a total of 20K future trajectories per frame.

24. Video Camera Identification from Sensor Pattern Noise with a Constrained ConvNet
Derrick Timmerman, Swaroop Bennabhaktula, Enrique Alegre, George Azzopardi
Abstract: The identification of source cameras from videos, though it is a highly relevant forensic analysis topic, has been studied much less than its counterpart that uses images. In this work we propose a method to identify the source camera of a video based on camera specific noise patterns that we extract from video frames. For the extraction of noise pattern features, we propose an extended version of a constrained convolutional layer capable of processing color inputs. Our system is designed to classify individual video frames which are in turn combined by a majority vote to identify the source camera. We evaluated this approach on the benchmark VISION data set consisting of 1539 videos from 28 different cameras. To the best of our knowledge, this is the first work that addresses the challenge of video camera identification on a device level. The experiments show that our approach is very promising, achieving up to 93.1% accuracy while being robust to the WhatsApp and YouTube compression techniques. This work is part of the EU-funded project 4NSEEK focused on forensics against child sexual abuse.
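
The frame-level voting step described in the abstract can be sketched as follows. This is a minimal illustration of a majority vote over per-frame predictions, not the authors' code: the function name and the camera labels are hypothetical, and the tie-breaking rule (first label seen wins) is an assumption, since the abstract does not specify one.

```python
from collections import Counter

def identify_source_camera(frame_predictions):
    """Return the camera label predicted for the most frames.

    Ties are resolved in favour of the label that appears first,
    which is an assumption made for this sketch.
    """
    if not frame_predictions:
        raise ValueError("need at least one frame prediction")
    return Counter(frame_predictions).most_common(1)[0][0]

# Hypothetical per-frame classifier outputs for one video.
votes = ["cam_07", "cam_07", "cam_12", "cam_07", "cam_03"]
video_label = identify_source_camera(votes)
```

Voting over many frames means that a few misclassified frames do not change the video-level decision as long as the correct camera still receives the most votes.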

25. One Point is All You Need: Directional Attention Point for Feature Learning
Liqiang Lin, Pengdi Huang, Chi-Wing Fu, Kai Xu, Hao Zhang, Hui Huang
Abstract: We present a novel attention-based mechanism for learning enhanced point features for tasks such as point cloud classification and segmentation. Our key message is that if the right attention point is selected, then "one point is all you need" -- not a sequence as in a recurrent model and not a pre-selected set as in all prior works. Also, the location of the attention point should be learned from data and be specific to the task at hand. Our mechanism is characterized by a new and simple convolution, which combines the feature at an input point with the feature at its associated attention point. We call such a point a directional attention point (DAP), since it is found by adding to the original point an offset vector that is learned by maximizing the task performance in training. We show that our attention mechanism can be easily incorporated into state-of-the-art point cloud classification and segmentation networks. Extensive experiments on common benchmarks such as ModelNet40, ShapeNetPart, and S3DIS demonstrate that our DAP-enabled networks consistently outperform the respective original networks, as well as all other competitive alternatives, including those employing pre-selected sets of attention points.

26. Garment Recommendation with Memory Augmented Neural Networks
Lavinia De Divitiis, Federico Becattini, Claudio Baecchi, Alberto Del Bimbo
Abstract: Fashion plays a pivotal role in society. Combining garments appropriately is essential for people to communicate their personality and style. Also, different events require outfits to be carefully chosen to comply with underlying social clothing rules. Therefore, combining garments appropriately might not be trivial. The fashion industry has turned this into a massive source of income, relying on complex recommendation systems to retrieve and suggest appropriate clothing items for customers. To make better recommendations, personalized suggestions can be offered that take into account user preferences or purchase histories. In this paper, we propose a garment recommendation system to pair different clothing items, namely tops and bottoms, exploiting a Memory Augmented Neural Network (MANN). By training a memory writing controller, we are able to store a non-redundant subset of samples, which is then used to retrieve a ranked list of suitable bottoms to complement a given top. In particular, we aim at retrieving a variety of modalities in which a certain garment can be combined. To refine our recommendations, we then include user preferences via Matrix Factorization. We experiment on IQON3000, a dataset collected from an online fashion community, reporting state-of-the-art results.

27. Learning Edge-Preserved Image Stitching from Large-Baseline Deep Homography
Lang Nie, Chunyu Lin, Kang Liao, Yao Zhao
Abstract: Image stitching is a classical and crucial technique in computer vision, which aims to generate an image with a wide field of view. Traditional methods depend heavily on feature detection and require scene features to be dense and evenly distributed in the image, leading to varying ghosting effects and poor robustness. Learning methods usually suffer from fixed view and input size limitations, showing a lack of generalization ability on other real datasets. In this paper, we propose an image stitching learning framework, which consists of a large-baseline deep homography module and an edge-preserved deformation module. First, we propose a large-baseline deep homography module to estimate the accurate projective transformation between the reference image and the target image at different feature scales. After that, an edge-preserved deformation module is designed to learn the deformation rules of image stitching from edge to content, eliminating the ghosting effects as much as possible. In particular, the proposed learning framework can stitch images of arbitrary views and input sizes, thus contributing to a supervised deep image stitching method with excellent generalization capability on other real images. Experimental results demonstrate that our homography module significantly outperforms the existing deep homography methods in large-baseline scenes. In image stitching, our method is superior to the existing learning method and shows competitive performance with state-of-the-art traditional methods.

28. Writer Identification and Writer Retrieval Based on NetVLAD with Re-ranking
Shervin Rasoulzadeh, Bagher Babaali
Abstract: This paper addresses writer identification and retrieval, which is a challenging problem in the document analysis field. In this work, a novel pipeline is proposed for the problem by employing a unified neural network architecture consisting of the ResNet-20 as a feature extractor and an integrated NetVLAD layer, inspired by the vectors of locally aggregated descriptors (VLAD), in the head of the latter part. Having defined this architecture, a triplet semi-hard loss function is used to directly learn an embedding for individual input image patches. Generalised max-pooling is used for the aggregation of embedded descriptors of each handwritten image. In the evaluation part, for identification and retrieval, re-ranking is performed based on query expansion and k-reciprocal nearest neighbours, and it is shown that the pipeline can benefit tremendously from this step. Experimental evaluation shows that our writer identification and writer retrieval pipeline is superior to the state-of-the-art pipelines, as our results on the publicly available ICDAR13 and CVL datasets set new standards by achieving 96.5% and 98.4% mAP, respectively.

29. Detailed 3D Human Body Reconstruction from Multi-view Images Combining Voxel Super-Resolution and Learned Implicit Representation
Zhongguo Li, Magnus Oskarsson, Anders Heyden
Abstract: The task of reconstructing detailed 3D human body models from images is interesting but challenging in computer vision due to the high freedom of human bodies. In order to tackle the problem, we propose a coarse-to-fine method to reconstruct a detailed 3D human body from multi-view images combining voxel super-resolution based on learning the implicit representation. Firstly, the coarse 3D models are estimated by learning an implicit representation based on multi-scale features which are extracted by multi-stage hourglass networks from the multi-view images. Then, taking the low resolution voxel grids which are generated by the coarse 3D models as input, the voxel super-resolution based on an implicit representation is learned through a multi-stage 3D convolutional neural network. Finally, the refined detailed 3D human body models can be produced by the voxel super-resolution which can preserve the details and reduce the false reconstruction of the coarse 3D models. Benefiting from the implicit representation, the training process in our method is memory efficient and the detailed 3D human body produced by our method from multi-view images is the continuous decision boundary with high-resolution geometry. In addition, the coarse-to-fine method based on voxel super-resolution can remove false reconstructions and preserve the appearance details in the final reconstruction, simultaneously. In the experiments, our method quantitatively and qualitatively achieves the competitive 3D human body reconstructions from images with various poses and shapes on both the real and synthetic datasets.

30. AViNet: Diving Deep into Audio-Visual Saliency Prediction
Samyak Jain, Pradeep Yarlagadda, Ramanathan Subramanian, Vineet Gandhi
Abstract: We propose the AViNet architecture for audiovisual saliency prediction. AViNet is a fully convolutional encoder-decoder architecture. The encoder combines visual features learned for action recognition with audio embeddings learned via an aural network designed to classify objects and scenes. The decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining hierarchical features. The overall architecture is conceptually simple, causal, and runs in real-time (60 fps). AViNet outperforms the state-of-the-art on ten (seven audiovisual and three visual-only) datasets while surpassing human performance on the CC, SIM, and AUC metrics for the AVE dataset. Visual features maximally account for saliency on existing datasets, with audio contributing only minor gains, except in specific contexts like social events. Our work, therefore, motivates the need to curate saliency datasets reflective of real life, where both the visual and aural modalities complementarily drive saliency. Our code and pre-trained models are available at this https URL

31. Few-Shot Segmentation Without Meta-Learning: A Good Transductive Inference Is All You Need?
Malik Boudiaf, Hoel Kervadec, Ziko Imtiaz Masud, Pablo Piantanida, Ismail Ben Ayed, Jose Dolz
Abstract: Few-shot segmentation has recently attracted substantial interest, with the popular meta-learning paradigm widely dominating the literature. We show that the way inference is performed for a given few-shot segmentation task has a substantial effect on performances, an aspect that has been overlooked in the literature. We introduce a transductive inference, which leverages the statistics of the unlabeled pixels of a task by optimizing a new loss containing three complementary terms: (i) a standard cross-entropy on the labeled pixels; (ii) the entropy of posteriors on the unlabeled query pixels; and (iii) a global KL-divergence regularizer based on the proportion of the predicted foreground region. Our inference uses a simple linear classifier of the extracted features, has a computational load comparable to inductive inference and can be used on top of any base training. Using standard cross-entropy training on the base classes, our inference yields highly competitive performances on well-known few-shot segmentation benchmarks. On PASCAL-5i, it brings about 5% improvement over the best performing state-of-the-art method in the 5-shot scenario, while being on par in the 1-shot setting. Even more surprisingly, this gap widens as the number of support samples increases, reaching up to 6% in the 10-shot scenario. Furthermore, we introduce a more realistic setting with domain shift, where the base and novel classes are drawn from different datasets. In this setting, we found that our method achieves the best performances.
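
To make the three-term objective in the abstract concrete, here is a minimal NumPy sketch for the binary foreground/background case. It is an illustration under stated assumptions, not the paper's formulation: the weights `lam_ent` and `lam_kl`, the direction of the KL term, the use of a fixed `fg_prior` as the target proportion, and all function and argument names are hypothetical.

```python
import numpy as np

def transductive_loss(p_support, y_support, p_query, fg_prior,
                      lam_ent=1.0, lam_kl=1.0):
    """Three-term objective on per-pixel foreground probabilities.

    p_support: predicted foreground probability for labeled pixels
    y_support: binary labels (1 = foreground) for those pixels
    p_query:   predicted foreground probability for unlabeled query pixels
    fg_prior:  assumed target foreground proportion for the KL term
    """
    eps = 1e-8
    y = np.asarray(y_support, dtype=float)
    p_s = np.clip(p_support, eps, 1 - eps)
    p_q = np.clip(p_query, eps, 1 - eps)

    # (i) standard cross-entropy on the labeled pixels
    ce = -np.mean(y * np.log(p_s) + (1 - y) * np.log(1 - p_s))
    # (ii) mean entropy of the posteriors on the unlabeled query pixels
    ent = -np.mean(p_q * np.log(p_q) + (1 - p_q) * np.log(1 - p_q))
    # (iii) KL divergence between predicted and prior foreground proportion
    fg_hat = np.clip(p_q.mean(), eps, 1 - eps)
    pi = np.clip(fg_prior, eps, 1 - eps)
    kl = (fg_hat * np.log(fg_hat / pi)
          + (1 - fg_hat) * np.log((1 - fg_hat) / (1 - pi)))
    return ce + lam_ent * ent + lam_kl * kl

confident = transductive_loss(np.array([1.0, 0.0]), np.array([1, 0]),
                              np.array([1.0, 1.0, 0.0, 0.0]), fg_prior=0.5)
uncertain = transductive_loss(np.array([1.0, 0.0]), np.array([1, 0]),
                              np.array([0.5, 0.5, 0.5, 0.5]), fg_prior=0.5)
```

In this toy setting the confident query prediction matches the prior proportion, so all three terms are near zero, while the uncertain prediction is penalized by the entropy term alone.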

32. Superpixel Segmentation Based on Spatially Constrained Subspace Clustering
Hua Li, Yuheng Jia, Runmin Cong, Wenhui Wu, Sam Kwong, Chuanbo Chen
Abstract: Superpixel segmentation aims at dividing the input image into representative regions containing pixels with similar and consistent intrinsic properties, without any prior knowledge about the shape and size of each superpixel. In this paper, to alleviate a limitation of superpixel segmentation in practical industrial tasks, namely that detailed boundaries are difficult to preserve, we regard each representative region with independent semantic information as a subspace, and correspondingly formulate superpixel segmentation as a subspace clustering problem to preserve more detailed content boundaries. We show that a simple integration of superpixel segmentation with conventional subspace clustering does not work effectively because of the spatial correlation of the pixels within a superpixel: ignoring this correlation may lead to boundary confusion and segmentation error. Consequently, we devise a spatial regularization and propose a novel convex locality-constrained subspace clustering model that constrains spatially adjacent pixels with similar attributes to be clustered into a superpixel and generates content-aware superpixels with more detailed boundaries. Finally, the proposed model is solved by an efficient alternating direction method of multipliers (ADMM) solver. Experiments on different standard datasets demonstrate that the proposed method achieves superior performance both quantitatively and qualitatively compared with some state-of-the-art methods.

33. Classifying Breast Histopathology Images with a Ductal Instance-Oriented Pipeline
Beibin Li, Ezgi Mercan, Sachin Mehta, Stevan Knezevich, Corey W. Arnold, Donald L. Weaver, Joann G. Elmore, Linda G. Shapiro
Abstract: In this study, we propose the Ductal Instance-Oriented Pipeline (DIOP) that contains a duct-level instance segmentation model, a tissue-level semantic segmentation model, and three levels of features for diagnostic classification. Based on recent advancements in instance segmentation and the Mask R-CNN model, our duct-level segmenter tries to identify each ductal individual inside a microscopic image; then, it extracts tissue-level information from the identified ductal instances. Leveraging three levels of information obtained from these ductal instances and also the histopathology image, the proposed DIOP outperforms previous approaches (both feature-based and CNN-based) in all diagnostic tasks; for the four-way classification task, the DIOP achieves performance comparable to that of general pathologists on this unique dataset. The proposed DIOP takes only a few seconds to run at inference time, so it could be used interactively on most modern computers. More clinical explorations are needed to study the robustness and generalizability of this system in the future.

34. Intrinsic Temporal Regularization for High-resolution Human Video Synthesis
Lingbo Yang, Zhanning Gao, Peiran Ren, Siwei Ma, Wen Gao
Abstract: Temporal consistency is crucial for extending image processing pipelines to the video domain, and it is often enforced with a flow-based warping error over adjacent frames. Yet for human video synthesis, such a scheme is less reliable due to the misalignment between source and target video as well as the difficulty of accurate flow estimation. In this paper, we propose an effective intrinsic temporal regularization scheme to mitigate these issues, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation via temporal loss modulation. This creates a shortcut for back-propagating temporal loss gradients directly to the front-end motion estimator, thus improving training stability and temporal coherence in output videos. We apply our intrinsic temporal regulation to a single-image generator, leading to a powerful "INTERnet" capable of generating 512×512 resolution human action videos with temporally coherent, realistic visual details. Extensive experiments demonstrate the superiority of the proposed INTERnet over several competitive baselines.

35. Color-related Local Binary Pattern: A Learned Local Descriptor for Color Image Recognition
Bin Xiao, Tao Geng, Xiuli Bi, Weisheng Li
Abstract: Local binary pattern (LBP), as a kind of local feature, has shown its simplicity, ease of implementation and strong discriminating power in image recognition. Although some LBP variants have been specifically investigated for color image recognition, these methods do not adequately consider the color information of images and easily suffer from the curse of dimensionality in classification. In this paper, a color-related local binary pattern (cLBP), which learns the dominant patterns from the decoded LBP, is proposed for color image recognition. This paper first proposes a relative similarity space (RSS) that represents the color similarity between image channels for describing a color image. Then, the decoded LBP, which can mine the correlation information between the LBP feature maps corresponding to each color channel of RSS traditional RGB spaces, is employed for feature extraction. Finally, a feature learning strategy is employed to learn the dominant color-related patterns, reducing the dimension of the feature vector and further improving the discriminability of the features. Theoretical analysis shows that the proposed RSS can provide more discriminative information, and has higher noise robustness as well as higher illumination variation robustness than the traditional RGB space. Experimental results on twelve public color image datasets in four groups show that the proposed method outperforms most LBP variants for color image recognition in terms of feature dimension and recognition accuracy under noise-free, noisy and illumination variation conditions.
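
For readers unfamiliar with the underlying descriptor, the sketch below computes the classic single-channel 8-neighbour LBP code that the abstract builds on. It does not implement the proposed cLBP, the decoded LBP, or the relative similarity space; the function name, the bit ordering (clockwise from the top-left neighbour), and the `>=` comparison convention are assumptions chosen for this example.

```python
import numpy as np

def lbp_3x3(gray):
    """Classic 8-neighbour LBP codes for the interior pixels of a 2-D array."""
    gray = np.asarray(gray)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left; bit i has weight 2**i.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the interior pixels.
        neighbour = gray[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes

# Corners are brighter than the centre, edges are darker.
patch = np.array([[9, 1, 9],
                  [1, 5, 1],
                  [9, 1, 9]])
code = lbp_3x3(patch)
```

A 256-bin histogram of such codes (for example via `np.bincount(codes.ravel(), minlength=256)`) is the usual feature vector. Applying this naively to each color channel multiplies the feature dimension, which is the dimensionality issue the abstract refers to.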

36. Learning Omni-frequency Region-adaptive Representations for Real Image Super-Resolution [PDF]
Xin Li, Xin Jin, Tao Yu, Yingxue Pang, Simeng Sun, Zhizheng Zhang, Zhibo Chen
Abstract: Traditional single image super-resolution (SISR) methods that focus on solving a single, uniform degradation (i.e., bicubic down-sampling) typically suffer from poor performance when applied to real-world low-resolution (LR) images because of their complicated, realistic degradations. The key to solving this more challenging real image super-resolution (RealSR) problem lies in learning feature representations that are both informative and content-aware. In this paper, we propose an Omni-frequency Region-adaptive Network (ORNet) to address both challenges; here we call features of all low, middle, and high frequencies omni-frequency features. Specifically, we start from the frequency perspective and design a Frequency Decomposition (FD) module to separate different frequency components and comprehensively compensate for the information lost in real LR images. Then, considering that different regions of a real LR image lose different frequency information, we further design a Region-adaptive Frequency Aggregation (RFA) module that leverages dynamic convolution and spatial attention to adaptively restore frequency components for different regions. Extensive experiments endorse the effective and scenario-agnostic nature of our OR-Net for RealSR.

37. A Dark Flash Normal Camera [PDF]
Zhihao Xia, Jason Lawrence, Supreeth Achar
Abstract: Casual photography is often performed in uncontrolled lighting that can result in low quality images and degrade the performance of downstream processing. We consider the problem of estimating surface normal and reflectance maps of scenes depicting people despite these conditions by supplementing the available visible illumination with a single near infrared (NIR) light source and camera, a so-called "dark flash image". Our method takes as input a single color image captured under arbitrary visible lighting and a single dark flash image captured under controlled front-lit NIR lighting at the same viewpoint, and computes a normal map, a diffuse albedo map, and a specular intensity map of the scene. Since ground truth normal and reflectance maps of faces are difficult to capture, we propose a novel training technique that combines information from two readily available and complementary sources: a stereo depth signal and photometric shading cues. We evaluate our method over a range of subjects and lighting conditions and describe two applications: optimizing stereo geometry and filling the shadows in an image.

38. A Log-likelihood Regularized KL Divergence for Video Prediction with A 3D Convolutional Variational Recurrent Network [PDF]
Haziq Razali, Basura Fernando
Abstract: The use of latent variable models has been shown to be a powerful tool for modeling probability distributions over sequences. In this paper, we introduce a new variational model that extends the recurrent network in two ways for the task of video frame prediction. First, we introduce 3D convolutions inside all modules including the recurrent model for future frame prediction, inputting and outputting a sequence of video frames at each timestep. This enables us to better exploit spatiotemporal information inside the variational recurrent model, allowing us to generate high-quality predictions. Second, we enhance the latent loss of the variational model by introducing a maximum likelihood estimate in addition to the KL divergence that is commonly used in variational models. This simple extension acts as a stronger regularizer in the variational autoencoder loss function and lets us obtain better results and generalizability. Experiments show that our model outperforms existing video prediction methods on several benchmarks while requiring fewer parameters.
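The abstract does not give the exact form of the regularized latent loss, so the following is only a sketch under stated assumptions: both the posterior q and the prior p are diagonal Gaussians, and the extra term is the log-likelihood of a posterior sample under the prior, subtracted with a hypothetical weight. The names `gaussian_kl`, `gaussian_log_likelihood`, and `regularized_latent_loss` are illustrative, not the authors'.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over the last axis."""
    mu_q, logvar_q, mu_p, logvar_p = (
        np.asarray(a, dtype=np.float64) for a in (mu_q, logvar_q, mu_p, logvar_p)
    )
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    terms = logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * np.sum(terms, axis=-1)


def gaussian_log_likelihood(z, mu, logvar):
    """Log density of z under a diagonal Gaussian, summed over the last axis."""
    z, mu, logvar = (np.asarray(a, dtype=np.float64) for a in (z, mu, logvar))
    terms = np.log(2.0 * np.pi) + logvar + (z - mu) ** 2 / np.exp(logvar)
    return -0.5 * np.sum(terms, axis=-1)


def regularized_latent_loss(z, mu_q, logvar_q, mu_p, logvar_p, weight=1.0):
    """Assumed combination: KL term minus a weighted log-likelihood term."""
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    ll = gaussian_log_likelihood(z, mu_p, logvar_p)
    return kl - weight * ll
```

With identical posterior and prior the KL term vanishes and only the likelihood term remains, which illustrates how the added term can keep regularizing the latent even when the KL is already small.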

39. How to Train PointGoal Navigation Agents on a (Sample and Compute) Budget [PDF]
Erik Wijmans, Irfan Essa, Dhruv Batra
Abstract: PointGoal navigation has seen significant recent interest and progress, spurred on by the Habitat platform and associated challenge. In this paper, we study PointGoal navigation under both a sample budget (75 million frames) and a compute budget (1 GPU for 1 day). We conduct an extensive set of experiments, cumulatively totaling over 50,000 GPU-hours, that let us identify and discuss a number of ostensibly minor but significant design choices -- the advantage estimation procedure (a key component in training), visual encoder architecture, and a seemingly minor hyper-parameter change. Overall, these design choices lead to considerable and consistent improvements over the baselines present in Savva et al. Under a sample budget, performance for RGB-D agents improves by 8 SPL on Gibson (14% relative improvement) and 20 SPL on Matterport3D (38% relative improvement). Under a compute budget, performance for RGB-D agents improves by 19 SPL on Gibson (32% relative improvement) and 35 SPL on Matterport3D (220% relative improvement). We hope our findings and recommendations will serve to make the community's experiments more efficient.
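Each reported gain pairs an absolute SPL increase with a relative one, so the baseline that each pair implies can be backed out as absolute gain divided by relative gain. A small sketch of that arithmetic follows; the function name `implied_baseline` is hypothetical, and because the abstract's figures are rounded the results are approximate.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and a relative gain (as a fraction)."""
    if relative_gain <= 0:
        raise ValueError("relative_gain must be positive")
    return absolute_gain / relative_gain


# (absolute SPL gain, relative gain) pairs quoted in the abstract.
reported = {
    "sample budget, Gibson": (8, 0.14),
    "sample budget, Matterport3D": (20, 0.38),
    "compute budget, Gibson": (19, 0.32),
    "compute budget, Matterport3D": (35, 2.20),
}

for setting, (gain, rel) in reported.items():
    base = implied_baseline(gain, rel)
    print(f"{setting}: baseline ~{base:.1f} SPL, improved ~{base + gain:.1f} SPL")
```

The 220% figure on Matterport3D under the compute budget corresponds to a much lower implied baseline than the other three settings, which is why the same order of absolute gain reads as a far larger relative one.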

40. A novel joint points and silhouette-based method to estimate 3D human pose and shape [PDF]
Zhongguo Li, Anders Heyden, Magnus Oskarsson
Abstract: This paper presents a novel method for 3D human pose and shape estimation from images with sparse views, using joint points and silhouettes, based on a parametric model. Firstly, the parametric model is fitted to the joint points estimated by deep learning-based human pose estimation. Then, we extract the correspondence between the parametric model of pose fitting and silhouettes on 2D and 3D space. A novel energy function based on the correspondence is built and minimized to fit parametric model to the silhouettes. Our approach uses sufficient shape information because the energy function of silhouettes is built from both 2D and 3D space. This also means that our method only needs images from sparse views, which balances data used and the required prior information. Results on synthetic data and real data demonstrate the competitive performance of our approach on pose and shape estimation of the human body.

41. Monocular Real-time Full Body Capture with Inter-part Correlations [PDF]
Yuxiao Zhou, Marc Habermann, Ikhsanul Habibie, Ayush Tewari, Christian Theobalt, Feng Xu
Abstract: We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image. Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency. Unlike previous works, our approach is jointly trained on multiple datasets focusing on hand, body or face separately, without requiring data where all the parts are annotated at the same time, which is much more difficult to create at sufficient variety. The possibility of such multi-dataset training enables superior generalization ability. In contrast to earlier monocular full body methods, our approach captures more expressive 3D face geometry and color by estimating the shape, expression, albedo and illumination parameters of a statistical face model. Our method achieves competitive accuracy on public benchmarks, while being significantly faster and providing more complete face reconstructions.

42. Spatio-attentive Graphs for Human-Object Interaction Detection [PDF]
Frederic Z. Zhang, Dylan Campbell, Stephen Gould
Abstract: We address the problem of detecting human-object interactions in images using graphical neural networks. Our network constructs a bipartite graph of nodes representing detected humans and objects, wherein messages passed between the nodes encode relative spatial and appearance information. Unlike existing approaches that separate appearance and spatial features, our method fuses these two cues within a single graphical model allowing information conditioned on both modalities to influence the prediction of interactions with neighboring nodes. Through extensive experimentation we demonstrate the advantages of fusing relative spatial information with appearance features in the computation of adjacency structure, message passing and the ultimate refined graph features. On the popular HICO-DET benchmark dataset, our model outperforms state-of-the-art with an mAP of 27.18, a 10% relative improvement.

43. Mesoscopic photogrammetry with an unstabilized phone camera [PDF]
Kevin C. Zhou, Colin Cooke, Jaehee Park, Ruobing Qian, Roarke Horstmeyer, Joseph A. Izatt, Sina Farsiu
Abstract: We present a feature-free photogrammetric technique that enables quantitative 3D mesoscopic (mm-scale height variation) imaging with tens-of-micron accuracy from sequences of images acquired by a smartphone at close range (several cm) under freehand motion without additional hardware. Our end-to-end, pixel-intensity-based approach jointly registers and stitches all the images by estimating a coaligned height map, which acts as a pixel-wise radial deformation field that orthorectifies each camera image to allow homographic registration. The height maps themselves are reparameterized as the output of an untrained encoder-decoder convolutional neural network (CNN) with the raw camera images as the input, which effectively removes many reconstruction artifacts. Our method also jointly estimates both the camera's dynamic 6D pose and its distortion using a nonparametric model, the latter of which is especially important in mesoscopic applications when using cameras not designed for imaging at short working distances, such as smartphone cameras. We also propose strategies for reducing computation time and memory, applicable to other multi-frame registration problems. Finally, we demonstrate our method using sequences of multi-megapixel images captured by an unstabilized smartphone on a variety of samples (e.g., painting brushstrokes, circuit board, seeds).

44. Uncertainty-Aware Deep Calibrated Salient Object Detection [PDF]
Jing Zhang, Yuchao Dai, Xin Yu, Mehrtash Harandi, Nick Barnes, Richard Hartley
Abstract: Existing deep neural network based salient object detection (SOD) methods mainly focus on pursuing high network accuracy. However, those methods overlook the gap between network accuracy and prediction confidence, known as the confidence uncalibration problem. Thus, state-of-the-art SOD networks are prone to be overconfident. In other words, the predicted confidence of the networks does not reflect the real probability of correctness of salient object detection, which significantly hinders their real-world applicability. In this paper, we introduce an uncertainty-aware deep SOD network, and propose two strategies from different perspectives to prevent deep SOD networks from being overconfident. The first strategy, namely Boundary Distribution Smoothing (BDS), generates continuous labels by smoothing the original binary ground-truth with respect to pixel-wise uncertainty. The second strategy, namely Uncertainty-Aware Temperature Scaling (UATS), exploits a relaxed Sigmoid function during both training and testing with spatially-variant temperature scaling to produce softened output. Both strategies can be incorporated into existing deep SOD networks with minimal effort. Moreover, we propose a new saliency evaluation metric, namely dense calibration measure C, to measure how the model is calibrated on a given dataset. Extensive experimental results on seven benchmark datasets demonstrate that our solutions can not only better calibrate SOD models, but also improve the network accuracy.
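The idea of a relaxed sigmoid with a spatially varying temperature can be illustrated in a few lines. This is a generic sketch of temperature scaling, not the authors' UATS: how the paper derives the per-pixel temperature from uncertainty is not stated in the abstract, so here the temperature map is simply an input, and the name `tempered_sigmoid` is hypothetical.

```python
import numpy as np


def tempered_sigmoid(logits, temperature):
    """Sigmoid of logits divided by a scalar or per-pixel temperature."""
    logits = np.asarray(logits, dtype=np.float64)
    # Broadcast so that a scalar or a full per-pixel temperature map both work.
    temperature = np.broadcast_to(
        np.asarray(temperature, dtype=np.float64), logits.shape
    )
    if np.any(temperature <= 0):
        raise ValueError("temperature must be strictly positive")
    return 1.0 / (1.0 + np.exp(-logits / temperature))


# A 2x2 saliency logit map with a higher temperature on the right column,
# standing in for pixels that are assumed to be more uncertain.
logits = np.array([[4.0, 4.0], [-4.0, -4.0]])
temperature_map = np.array([[1.0, 4.0], [1.0, 4.0]])
print(tempered_sigmoid(logits, temperature_map))
```

A temperature above 1 pulls the output toward 0.5, so the same logit yields a less confident prediction where the temperature map is high.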

45. A MAC-less Neural Inference Processor Supporting Compressed, Variable Precision Weights [PDF]
Vincenzo Liguori
Abstract: This paper introduces two architectures for the inference of convolutional neural networks (CNNs). Both architectures exploit weight sparsity and compression to reduce computational complexity and bandwidth. The first architecture uses multiply-accumulators (MACs) but avoids unnecessary multiplications by skipping zero weights. The second architecture exploits weight sparsity at the level of their bit representation by substituting resource-intensive MACs with much smaller Bit Layer Multiply Accumulators (BLMACs). The use of BLMACs also allows variable precision weights as variable size integers and even floating points. Some details of an implementation of the second architecture are given. Weight compression with arithmetic coding is also discussed as well as bandwidth implications. Finally, some implementation results for a pathfinder design and various technologies are presented.
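One plausible reading of exploiting weight sparsity "at the level of their bit representation" is to compute a dot product one weight bit plane at a time, using only additions and a doubling step, and skipping every zero bit. The sketch below follows that reading with sign-magnitude integer weights; it is an assumption about the general idea, not a description of the paper's BLMAC hardware, and `bit_layer_dot` is a hypothetical name.

```python
import numpy as np


def bit_layer_dot(weights, inputs, n_bits=8):
    """Integer dot product computed bit plane by bit plane, without multiplies.

    Returns the result and the number of additions performed, which equals
    the number of non-zero bits across all weight magnitudes.
    """
    weights = np.asarray(weights, dtype=np.int64)
    inputs = np.asarray(inputs, dtype=np.int64)
    magnitudes = np.abs(weights)
    if np.any(magnitudes >= (1 << n_bits)):
        raise ValueError("weight magnitude does not fit in n_bits")
    # Fold the weight sign into the input so each bit plane only needs adds.
    signed_inputs = np.where(weights < 0, -inputs, inputs)
    acc = 0
    additions = 0
    # Horner-style evaluation from the most significant bit plane down.
    for bit in range(n_bits - 1, -1, -1):
        mask = ((magnitudes >> bit) & 1).astype(bool)
        acc = 2 * acc + int(signed_inputs[mask].sum())
        additions += int(mask.sum())
    return acc, additions


result, additions = bit_layer_dot([3, -5, 0, 8], [2, 4, 7, -1])
print(result, additions)
```

In this formulation the work scales with the number of set bits rather than with the number of weights times the word length, so a weight with few non-zero bits is cheap regardless of its nominal precision.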

46. Vision-based Price Suggestion for Online Second-hand Items [PDF]
Liang Han, Zhaozheng Yin, Zhurong Xia, Li Guo, Mingqian Tang, Rong Jin
Abstract: Different from shopping in physical stores, where people have the opportunity to closely check a product (e.g., touching the surface of a T-shirt or smelling the scent of perfume) before making a purchase decision, online shoppers rely greatly on the uploaded product images to make any purchase decision. The decision-making is challenging when selling or purchasing second-hand items online since estimating the items' prices is not trivial. In this work, we present a vision-based price suggestion system for the online second-hand item shopping platform. The goal of vision-based price suggestion is to help sellers set effective prices for their second-hand listings with the images uploaded to the online platforms. First, we propose to better extract representative visual features from the images with the aid of some other image-based item information (e.g., category, brand). Then, we design a vision-based price suggestion module which takes the extracted visual features along with some statistical item features from the shopping platform as the inputs to determine whether an uploaded item image is qualified for price suggestion by a binary classification model, and provide price suggestions for items with qualified images by a regression model. According to two demands from the platform, two different objective functions are proposed to jointly optimize the classification model and the regression model. For better model training, we also propose a warm-up training strategy for the joint optimization. Extensive experiments on a large real-world dataset demonstrate the effectiveness of our vision-based price prediction system.

47. A Generative Approach for Detection-driven Underwater Image Enhancement [PDF]
Chelsey Edge, Md Jahidul Islam, Christopher Morse, Junaed Sattar
Abstract: In this paper, we introduce a generative model for image enhancement specifically for improving diver detection in the underwater domain. In particular, we present a model that integrates generative adversarial network (GAN)-based image enhancement with the diver detection task. Our proposed approach restructures the GAN objective function to include information from a pre-trained diver detector with the goal to generate images which would enhance the accuracy of the detector in adverse visual conditions. By incorporating the detector output into both the generator and discriminator networks, our model is able to focus on enhancing images beyond aesthetic qualities and specifically to improve robotic detection of scuba divers. We train our network on a large dataset of scuba divers, using a state-of-the-art diver detector, and demonstrate its utility on images collected from oceanic explorations of human-robot teams. Experimental evaluations demonstrate that our approach significantly improves diver detection performance over raw, unenhanced images, and even outperforms detection performance on the output of state-of-the-art underwater image enhancement algorithms. Finally, we demonstrate the inference performance of our network on embedded devices to highlight the feasibility of operating on board mobile robotic platforms.

48. Image-Graph-Image Translation via Auto-Encoding [PDF]
Chenyang Lu, Gijs Dubbelman
Abstract: This work presents the first convolutional neural network that learns an image-to-graph translation task without needing external supervision. Obtaining graph representations of image content, where objects are represented as nodes and their relationships as edges, is an important task in scene understanding. Current approaches follow a fully-supervised approach thereby requiring meticulous annotations. To overcome this, we are the first to present a self-supervised approach based on a fully-differentiable auto-encoder in which the bottleneck encodes the graph's nodes and edges. This self-supervised approach can currently encode simple line drawings into graphs and obtains comparable results to a fully-supervised baseline in terms of F1 score on triplet matching. Besides these promising results, we provide several directions for future research on how our approach can be extended to cover more complex imagery.

49. Super-resolution Guided Pore Detection for Fingerprint Recognition [PDF]
Syeda Nyma Ferdous, Ali Dabouei, Jeremy Dawson, Nasser M Nasrabadi
Abstract: The performance of fingerprint recognition algorithms relies substantially on fine features extracted from fingerprints. Apart from minutiae and ridge patterns, pore features have proven to be usable for fingerprint recognition. Although features from minutiae and ridge patterns are quite attainable from low-resolution images, using pore features is practical only if the fingerprint image is of high resolution, which necessitates a model that enhances the image quality of the conventional 500 ppi legacy fingerprints while preserving the fine details. To find a solution for recovering pore information from low-resolution fingerprints, we adopt a joint learning-based approach that combines both super-resolution and pore detection networks. Our modified single image Super-Resolution Generative Adversarial Network (SRGAN) framework helps to reliably reconstruct high-resolution fingerprint samples from low-resolution ones, assisting the pore detection network to identify pores with high accuracy. The network jointly learns a distinctive feature representation from a real low-resolution fingerprint sample and successfully synthesizes a high-resolution sample from it. To add discriminative information and uniqueness for all the subjects, we have integrated features extracted from a deep fingerprint verifier with the SRGAN quality discriminator. We also add a ridge reconstruction loss, utilizing ridge patterns to make the best use of extracted features. Our proposed method solves the recognition problem by improving the quality of fingerprint images. The high recognition accuracy of the synthesized samples, which is close to the accuracy achieved using the original high-resolution images, validates the effectiveness of our proposed model.

50. Risk & returns around FOMC press conferences: a novel perspective from computer vision [PDF]
Alexis Marchal
Abstract: I propose a new tool to characterize the resolution of uncertainty around FOMC press conferences. It relies on the construction of a measure capturing the level of discussion complexity between the Fed Chair and reporters during the Q&A sessions. I show that complex discussions are associated with higher equity returns and a drop in realized volatility. The method creates an attention score by quantifying how much the Chair needs to rely on reading internal documents to be able to answer a question. This is accomplished by building a novel dataset of video images of the press conferences and leveraging recent deep learning algorithms from computer vision. This alternative data provides new information on nonverbal communication that cannot be extracted from the widely analyzed FOMC transcripts. This paper can be seen as a proof of concept that certain videos contain valuable information for the study of financial markets.

51. Analyzing and Improving Generative Adversarial Training for Generative Modeling and Out-of-Distribution Detection [PDF]
Xuwang Yin, Shiying Li, Gustavo K. Rohde
Abstract: Generative adversarial training (GAT) is a recently introduced adversarial defense method. Previous works have focused on empirical evaluations of its application to training robust predictive models. In this paper we focus on theoretical understanding of the GAT method and extending its application to generative modeling and out-of-distribution detection. We analyze the optimal solutions of the maximin formulation employed by the GAT objective, and make a comparative analysis of the minimax formulation employed by GANs. We use theoretical analysis and 2D simulations to understand the convergence property of the training algorithm. Based on these results, we develop an incremental generative training algorithm, and conduct comprehensive evaluations of the algorithm's application to image generation and adversarial out-of-distribution detection. Our results suggest that generative adversarial training is a promising new direction for the above applications.

52. AIforCOVID: predicting the clinical outcomes in patients with COVID-19 applying AI to chest-X-rays. An Italian multicentre study [PDF]
Paolo Soda, Natascha Claudia D'Amico, Jacopo Tessadori, Giovanni Valbusa, Valerio Guarrasi, Chandra Bortolotto, Muhammad Usman Akbar, Rosa Sicilia, Ermanno Cordelli, Deborah Fazzini, Michaela Cellina, Giancarlo Oliva, Giovanni Callea, Silvia Panella, Maurizio Cariati, Diletta Cozzi, Vittorio Miele, Elvira Stellato, Gian Paolo Carrafiello, Giulia Castorani, Annalisa Simeone, Lorenzo Preda, Giulio Iannello, Alessio Del Bue, Fabio Tedoldi, Marco Alì, Diego Sona, Sergio Papa
Abstract: Recent epidemiological data report that worldwide more than 53 million people have been infected by SARS-CoV-2, resulting in 1.3 million deaths. The disease has been spreading very rapidly, and a few months after the identification of the first infected, shortage of hospital resources quickly became a problem. In this work we investigate whether chest X-ray (CXR) can be used as a possible tool for the early identification of patients at risk of severe outcome, like intensive care or death. CXR is a radiological technique that, compared to computed tomography (CT), is simpler, faster, more widespread, and induces a lower radiation dose. We present a dataset including data collected from 820 patients by six Italian hospitals in spring 2020 during the first COVID-19 emergency. The dataset includes CXR images, several clinical attributes and clinical outcomes. We investigate the potential of artificial intelligence to predict the prognosis of such patients, distinguishing between severe and mild cases, thus offering a baseline reference for other researchers and practitioners. To this end, we present three approaches that use features extracted from CXR images, either handcrafted or learned automatically by convolutional neural networks, which are then integrated with the clinical data. Exhaustive evaluation shows promising performance both in 10-fold and leave-one-centre-out cross-validation, implying that clinical data and images have the potential to provide useful information for the management of patients and hospital resources.
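Leave-one-centre-out cross-validation holds out all patients from one hospital in each fold, so every test set comes from a site the model has not seen. A minimal sketch of generating such splits is given below; the function name `leave_one_centre_out` and the centre labels are hypothetical, and nothing here reflects the study's actual data or pipeline.

```python
import numpy as np


def leave_one_centre_out(centres):
    """Yield (held_out_centre, train_indices, test_indices) for each centre."""
    centres = np.asarray(centres)
    for held_out in np.unique(centres):
        test_mask = centres == held_out
        # Train on every other centre, test on the held-out one.
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Hypothetical centre label per patient.
patient_centres = ["A", "A", "B", "C", "C", "C"]
for centre, train_idx, test_idx in leave_one_centre_out(patient_centres):
    print(centre, train_idx.tolist(), test_idx.tolist())
```

Unlike 10-fold cross-validation, which mixes patients from all hospitals in every fold, this scheme probes how well a model transfers across sites with different scanners and patient populations.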

53. Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search [PDF]
Fitash Ul Haq, Donghwan Shin, Lionel C. Briand, Thomas Stifter, Jun Wang
Abstract: Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation have remained a challenging problem because KP-DNNs predict many independent key-points at the same time -- where each individual key-point may be critical in the targeted application -- and images can vary a great deal according to many factors. In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-points detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites that cause, on average, more than 93% of all key-points to be severely mispredicted. In comparison, random search-based test data generation causes only 41% of them to be severely mispredicted. Many of these mispredictions, however, are not avoidable and should not therefore be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining.

54. Context Matters: Graph-based Self-supervised Representation Learning for Medical Images [PDF]
Li Sun, Ke Yu, Kayhan Batmanghelich
Abstract: Supervised learning methods require a large volume of annotated datasets. Collecting such datasets is time-consuming and expensive. Until now, very few annotated COVID-19 imaging datasets are available. Although self-supervised learning enables us to bootstrap the training by exploiting unlabeled data, the generic self-supervised methods for natural images do not sufficiently incorporate the context. For medical images, a desirable method should be sensitive enough to detect deviation from normal-appearing tissue of each anatomical region; here, anatomy is the context. We introduce a novel approach with two levels of self-supervised representation learning objectives: one on the regional anatomical level and another on the patient-level. We use graph neural networks to incorporate the relationship between different anatomical regions. The structure of the graph is informed by anatomical correspondences between each patient and an anatomical atlas. In addition, the graph representation has the advantage of handling any arbitrarily sized image in full resolution. Experiments on large-scale Computer Tomography (CT) datasets of lung images show that our approach compares favorably to baseline methods that do not account for the context. We use the learnt embedding to quantify the clinical progression of COVID-19 and show that our method generalizes well to COVID-19 patients from different hospitals. Qualitative results suggest that our model can identify clinically relevant regions in the images.

55. Uncertainty-driven refinement of tumor-core segmentation using 3D-to-2D networks with label uncertainty
Richard McKinley, Micheal Rebsamen, Katrin Daetwyler, Raphael Meier, Piotr Radojewski, Roland Wiest
Abstract: The BraTS dataset contains a mixture of high-grade and low-grade gliomas, which differ considerably in appearance. Previous studies have shown that performance can be improved by training separately on low-grade gliomas (LGGs) and high-grade gliomas (HGGs), but in practice the tumor grade is not available at test time to decide which model to use. In contrast to HGGs, LGGs often present no sharp boundary between the tumor core and the surrounding edema, but rather a gradual reduction of tumor-cell density. Utilizing our 3D-to-2D fully convolutional architecture, DeepSCAN, which ranked highly in the 2019 BraTS challenge and was trained using an uncertainty-aware loss, we separate cases into those with a confidently segmented core and those with a vaguely segmented or missing core. Since by assumption every tumor has a core, we reduce the threshold for classification of core tissue in those cases where the core, as segmented by the classifier, is vaguely defined or missing. We then predict survival of high-grade glioma patients using a fusion of linear regression and random forest classification, based on age, number of distinct tumor components, and number of distinct tumor cores. We present results on the validation dataset of the Multimodal Brain Tumor Segmentation Challenge 2020 (segmentation and uncertainty challenge), and on the testing set, where the method achieved 4th place in Segmentation, 1st place in uncertainty estimation, and 1st place in Survival prediction.

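The threshold-lowering step described in the abstract of entry 55 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function name `segment_core`, the thresholds 0.5 and 0.3, and the use of a simple voxel count to decide that a core is "missing or vague". The abstract states that the authors rely on an uncertainty-aware model for that decision and does not give numeric values.

```python
def segment_core(core_probs, threshold=0.5, fallback_threshold=0.3, min_voxels=3):
    """Return a boolean core mask from per-voxel core probabilities.

    Hypothetical rule: if too few voxels pass the default threshold, the core
    is treated as missing or vague and a lower threshold is used instead.
    All numeric defaults are illustrative, not taken from the paper.
    """
    mask = [p >= threshold for p in core_probs]
    if sum(mask) < min_voxels:
        # Every tumor is assumed to have a core, so relax the threshold.
        mask = [p >= fallback_threshold for p in core_probs]
    return mask


confident_case = segment_core([0.9, 0.8, 0.7, 0.2])
vague_case = segment_core([0.45, 0.4, 0.35, 0.1])
```

In the second call no voxel reaches 0.5, so the fallback threshold is applied and the three highest-probability voxels are labeled as core.
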
56. RENATA: REpreseNtation And Training Alteration for Bias Mitigation
William Paul, Armin Hadzic, Neil Joshi, Phil Burlina
Abstract: We propose a novel method for enforcing AI fairness with respect to protected or sensitive factors. This method uses a dual strategy performing Training And Representation Alteration (RENATA) to mitigate two of the most prominent causes of AI bias: a) representation learning alteration via adversarial independence, to suppress the bias-inducing dependence of the data representation on protected factors; and b) training set alteration via intelligent augmentation, to address bias-causing data imbalance, by using generative models that allow fine control of sensitive factors related to underrepresented populations. When testing our methods on image analytics, experiments demonstrate that RENATA significantly or fully debiases baseline models while outperforming competing debiasing methods, e.g., with (% overall accuracy, % accuracy gap) of (78.75, 0.5) vs. the baseline method's (71.75, 10.5) for EyePACS, and (73.71, 11.82) vs. the (69.08, 21.65) baseline for CelebA. As an additional contribution, recognizing certain limitations in current metrics used for assessing debiasing performance, this study proposes novel conjunctive debiasing metrics. Our experiments also demonstrate the ability of these novel metrics to assess the Pareto efficiency of the proposed methods.

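The abstract of entry 56 reports results as pairs of (% overall accuracy, % accuracy gap). A minimal sketch of how such a pair could be computed follows. It assumes the gap is the difference between the best and worst per-group accuracy over the protected attribute; the abstract does not define the gap, so that definition and the function name `accuracy_and_gap` are illustrative assumptions.

```python
def accuracy_and_gap(y_true, y_pred, groups):
    """Return (overall accuracy %, accuracy gap %) across protected groups.

    Assumption: gap = max per-group accuracy - min per-group accuracy.
    """
    correct = {}
    total = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + int(truth == pred)
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    per_group = {g: 100.0 * correct[g] / total[g] for g in total}
    gap = max(per_group.values()) - min(per_group.values())
    return overall, gap


y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
overall, gap = accuracy_and_gap(y_true, y_pred, groups)
```

Reporting both numbers together matters because a model can raise overall accuracy while leaving one group far behind; a small gap with high overall accuracy is the outcome the abstract highlights.
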
57. Parallelized Rate-Distortion Optimized Quantization Using Deep Learning
Dana Kianfar, Auke Wiggers, Amir Said, Reza Pourreza, Taco Cohen
Abstract: Rate-Distortion Optimized Quantization (RDOQ) has played an important role in the coding performance of recent video compression standards such as H.264/AVC, H.265/HEVC, VP9 and AV1. This scheme yields significant reductions in bit-rate at the expense of relatively small increases in distortion. Typically, RDOQ algorithms are prohibitively expensive to implement on real-time hardware encoders due to their sequential nature and their need to frequently obtain entropy coding costs. This work addresses this limitation using a neural network-based approach, which learns to trade off rate and distortion during offline supervised training. As these networks are based solely on standard arithmetic operations that can be executed on existing neural network hardware, no additional on-chip area needs to be reserved for dedicated RDOQ circuitry. We train two classes of neural networks, a fully-convolutional network and an auto-regressive network, and evaluate each as a post-quantization step designed to refine cheap quantization schemes such as scalar quantization (SQ). Both network architectures are designed to have a low computational overhead. After training they are integrated into the HM 16.20 implementation of HEVC, and their video coding performance is evaluated on a subset of the H.266/VVC SDR common test sequences. Comparisons are made to RDOQ and SQ implementations in HM 16.20. Our method achieves 1.64% BD-rate savings on luminosity compared to the HM SQ anchor, and on average reaches 45% of the performance of the iterative HM RDOQ algorithm.

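The rate-distortion trade-off that entry 57 builds on can be sketched for a single transform coefficient. This is a simplified illustration of classical RDOQ, not the neural approach of the paper: it picks the quantization level minimizing J = D + lambda * R among a few candidates near the scalar-quantized value. The bit-cost model `rate_proxy` is a made-up stand-in for the context-dependent entropy coding cost a real encoder such as HM would compute, and the function names are hypothetical.

```python
import math


def rate_proxy(level):
    """Hypothetical bit cost: zero is cheapest, larger magnitudes cost more."""
    return 1.0 if level == 0 else 2.0 * math.log2(abs(level) + 1) + 2.0


def rdoq_level(coeff, step, lam):
    """Choose the quantization level for one coefficient by minimizing D + lam * R."""
    base = int(abs(coeff) / step + 0.5)  # plain scalar quantization (round to nearest)
    sign = -1 if coeff < 0 else 1
    candidates = {base, max(base - 1, 0), 0}

    def cost(level):
        distortion = (abs(coeff) - level * step) ** 2
        return distortion + lam * rate_proxy(level)

    return sign * min(sorted(candidates), key=cost)


no_rate_penalty = rdoq_level(5.2, 4.0, 0.0)
strong_rate_penalty = rdoq_level(5.2, 4.0, 10.0)
```

With lambda set to 0 the result equals plain scalar quantization; with a large lambda the coefficient is zeroed out because the bits saved outweigh the added distortion. Real RDOQ makes this decision jointly over a block, which is the sequential dependency the abstract identifies as costly in hardware.
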
58. Feature Selection Based on Sparse Neural Network Layer with Normalizing Constraints
Peter Bugata, Peter Drotar
Abstract: Feature selection is an important step in machine learning, since it has been shown to improve prediction accuracy while mitigating the curse of dimensionality in high-dimensional data. Neural networks have experienced tremendous success in solving many nonlinear learning problems. Here, we propose a new neural-network-based feature selection approach that introduces two constraints, the satisfaction of which leads to a sparse FS layer. We have performed extensive experiments on synthetic and real-world data to evaluate the performance of the proposed FS. In the experiments we focus on high-dimension, low-sample-size data, since those represent the main challenge for feature selection. The results confirm that the proposed Feature Selection Based on Sparse Neural Network Layer with Normalizing Constraints (SNEL-FS) is able to select the important features and yields superior performance compared to other conventional FS methods.

59. Privacy-preserving medical image analysis
Alexander Ziller, Jonathan Passerat-Palmbach, Théo Ryffel, Dmitrii Usynin, Andrew Trask, Ionésio Da Lima Costa Junior, Jason Mancuso, Marcus Makowski, Daniel Rueckert, Rickmer Braren, Georgios Kaissis
Abstract: The utilisation of artificial intelligence in medicine and healthcare has led to successful clinical applications in several domains. The conflict between data usage and privacy protection requirements in such systems must be resolved for optimal results as well as ethical and legal compliance. This calls for innovative solutions such as privacy-preserving machine learning (PPML). We present PriMIA (Privacy-preserving Medical Image Analysis), a software framework designed for PPML in medical imaging. In a real-life case study we demonstrate significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets. Furthermore, we show an inference-as-a-service scenario for end-to-end encrypted diagnosis, where neither the data nor the model is revealed. Lastly, we empirically evaluate the framework's security against a gradient-based model inversion attack and demonstrate that no usable information can be recovered from the model.

60. State-of-the-art Machine Learning MRI Reconstruction in 2020: Results of the Second fastMRI Challenge
Matthew J. Muckley, Bruno Riemenschneider, Alireza Radmanesh, Sunwoo Kim, Geunu Jeong, Jingyu Ko, Yohan Jun, Hyungseob Shin, Dosik Hwang, Mahmoud Mostapha, Simon Arberet, Dominik Nickel, Zaccharie Ramzi, Philippe Ciuciu, Jean-Luc Starck, Jonas Teuwen, Dimitrios Karkalousos, Chaoping Zhang, Anuroop Sriram, Zhengnan Huang, Nafissa Yakubova, Yvonne Lui, Florian Knoll
Abstract: Accelerating MRI scans is one of the principal outstanding problems in the MRI research community. Towards this goal, we hosted the second fastMRI competition targeted towards reconstructing MR images with subsampled k-space data. We provided participants with data from 7,299 clinical brain scans (de-identified via a HIPAA-compliant procedure by NYU Langone Health), holding back the fully-sampled data from 894 of these scans for challenge evaluation purposes. In contrast to the 2019 challenge, we focused our radiologist evaluations on pathological assessment in brain images. We also debuted a new Transfer track that required participants to submit models evaluated on MRI scanners from outside the training set. We received 19 submissions from eight different groups. Results showed one team scoring best in both SSIM scores and qualitative radiologist evaluations. We also performed analysis on alternative metrics to mitigate the effects of background noise and collected feedback from the participants to inform future challenges. Lastly, we identify common failure modes across the submissions, highlighting areas of need for future research in the MRI reconstruction community.

61. Artificial Intelligence for COVID-19 Detection -- A state-of-the-art review
Parsa Sarosh, Shabir A. Parah, Romany F Mansur, G. M. Bhat
Abstract: The emergence of COVID-19 has necessitated many efforts by the scientific community for its proper management. An urgent clinical reaction is required in the face of the unending devastation being caused by the pandemic. These efforts include technological innovations for improvement in screening, treatment, vaccine development, contact tracing, and survival prediction. Deep Learning (DL) and Artificial Intelligence (AI) can be applied in all of the above-mentioned spheres. This paper aims to review the role of Deep Learning and Artificial Intelligence in various aspects of overall COVID-19 management, and particularly in COVID-19 detection and classification. DL models are developed to analyze clinical modalities such as CT scans and X-ray images of patients and to predict their pathological condition. A DL model aims to detect COVID-19 pneumonia and to classify and distinguish between COVID-19, Community-Acquired Pneumonia (CAP), viral and bacterial pneumonia, and normal conditions. Furthermore, sophisticated models can be built to segment the affected area in the lungs and quantify the infection volume for a better understanding of the extent of damage. Many models have been developed either independently or with the help of pre-trained models like VGG19, ResNet50, and AlexNet, leveraging the concept of transfer learning. Apart from model development, data preprocessing and augmentation are also performed to cope with the challenge of insufficient data samples often encountered in medical applications. The review concludes that DL and AI can be effectively implemented to withstand the challenges posed by the global emergency.

62. Learning Order Parameters from Videos of Dynamical Phases for Skyrmions with Neural Networks
Weidi Wang, Zeyuan Wang, Yinghui Zhang, Bo Sun, Ke Xia
Abstract: The ability to recognize dynamical phenomena (e.g., dynamical phases) and dynamical processes in physical events from videos, and then to abstract physical concepts and reveal physical laws, lies at the core of human intelligence. The main purposes of this paper are to use neural networks to classify the dynamical phases in videos and to demonstrate that neural networks can learn physical concepts from them. To this end, we employ multiple neural networks to recognize the static phases (image format) and dynamical phases (video format) of a particle-based skyrmion model. Our results show that neural networks, without any prior knowledge, can not only correctly classify these phases, but also predict phase boundaries that agree with those obtained by simulation. We further propose a parameter visualization scheme to interpret what the neural networks have learned. We show that neural networks can learn two order parameters from videos of dynamical phases and predict the critical values of these two order parameters. Finally, we demonstrate that only two order parameters are needed to identify videos of skyrmion dynamical phases. This shows that the parameter visualization scheme can be used to determine how many order parameters are needed to fully recognize the input phases. Our work sheds light on the future use of neural networks in discovering new physical concepts and revealing as-yet-unknown physical laws from videos.

63. Deep-Learning-Based Kinematic Reconstruction for DUNE
Junze Liu, Jordan Ott, Julian Collado, Benjamin Jargowsky, Wenjie Wu, Jianming Bian, Pierre Baldi
Abstract: In the framework of three-active-neutrino mixing, the charge parity phase, the neutrino mass ordering, and the octant of $\theta_{23}$ remain unknown. The Deep Underground Neutrino Experiment (DUNE) is a next-generation long-baseline neutrino oscillation experiment, which aims to address these questions by measuring the oscillation patterns of $\nu_\mu/\nu_e$ and $\bar\nu_\mu/\bar\nu_e$ over a range of energies spanning the first and second oscillation maxima. DUNE far detector modules are based on liquid argon TPC (LArTPC) technology. A LArTPC offers excellent spatial resolution, high neutrino detection efficiency, and superb background rejection, while reconstruction in LArTPC is challenging. Deep learning methods, in particular, Convolutional Neural Networks (CNNs), have demonstrated success in classification problems such as particle identification in DUNE and other neutrino experiments. However, reconstruction of neutrino energy and final state particle momenta with deep learning methods is yet to be developed for a full AI-based reconstruction chain. To precisely reconstruct these kinematic characteristics of detected interactions at DUNE, we have developed and will present two CNN-based methods, 2-D and 3-D, for the reconstruction of final state particle direction and energy, as well as neutrino energy. Combining particle masses with the kinetic energy and the direction reconstructed by our work, the four-momentum of final state particles can be obtained. Our models show considerable improvements compared to the traditional methods for both scenarios.

64. DSRNA: Differentiable Search of Robust Neural Architectures
Ramtin Hosseini, Xingyi Yang, Pengtao Xie
Abstract: In deep learning applications, the architectures of deep neural networks are crucial in achieving high accuracy. Many methods have been proposed to search for high-performance neural architectures automatically. However, these searched architectures are prone to adversarial attacks: a small perturbation of the input data can cause the architecture to change its prediction outcomes significantly. To address this problem, we propose methods to perform differentiable search of robust neural architectures. In our methods, two differentiable metrics are defined to measure architectures' robustness, based on a certified lower bound and a Jacobian norm bound. We then search for robust architectures by maximizing the robustness metrics. Unlike previous approaches, which aim to improve architectures' robustness implicitly by performing adversarial training or injecting random noise, our methods explicitly and directly maximize robustness metrics to harvest robust architectures. On CIFAR-10, ImageNet, and MNIST, we perform game-based evaluation and verification-based evaluation of the robustness of our methods. The experimental results show that our methods 1) are more robust to various norm-bound attacks than several robust NAS baselines; 2) are more accurate than baselines when there are no attacks; 3) have significantly higher certified lower bounds than baselines.

65. Provable Defense against Privacy Leakage in Federated Learning from Representation Perspective
Jingwei Sun, Ang Li, Binghui Wang, Huanrui Yang, Hai Li, Yiran Chen
Abstract: Federated learning (FL) is a popular distributed learning framework that can reduce privacy risks by not explicitly sharing private data. However, recent works demonstrated that sharing model updates makes FL vulnerable to inference attacks. In this work, we present our key observation that data representation leakage from gradients is the essential cause of privacy leakage in FL. We also provide an analysis of this observation to explain how the data representation is leaked. Based on this observation, we propose a defense against model inversion attacks in FL. The key idea of our defense is learning to perturb the data representation such that the quality of the reconstructed data is severely degraded, while FL performance is maintained. In addition, we derive a certified robustness guarantee for FL and a convergence guarantee for FedAvg after applying our defense. To evaluate our defense, we conduct experiments on MNIST and CIFAR10 for defending against the DLG attack and the GS attack. The results demonstrate that, without sacrificing accuracy, our proposed defense can increase the mean squared error between the reconstructed data and the raw data by more than 160X for both the DLG attack and the GS attack, compared with baseline defense methods. The privacy of the FL system is significantly improved.

66. 3D Scattering Tomography by Deep Learning with Architecture Tailored to Cloud Fields
Yael Sde-Chen, Yoav Y. Schechner, Vadim Holodovsky, Eshkol Eytan
Abstract: We present 3DeepCT, a deep neural network for computed tomography, which performs 3D reconstruction of scattering volumes from multi-view images. Our architecture is dictated by the stationary nature of atmospheric cloud fields. The task of volumetric scattering tomography aims at recovering a volume from its 2D projections. This problem has been studied extensively, leading to diverse inverse methods based on signal processing and physics models. However, such techniques are typically iterative, exhibiting high computational load and long convergence time. We show that 3DeepCT outperforms physics-based inverse scattering methods in terms of accuracy, while also improving computational time by orders of magnitude. To further improve the recovery accuracy, we introduce a hybrid model that combines 3DeepCT with a physics-based method. The resultant hybrid technique enjoys fast inference time and improved recovery performance.



[arXiv papers] Computer Vision and Pattern Recognition 2020-12-14

Contents

Abstracts