Contents
5. Data-efficient Weakly-supervised Learning for On-line Object Detection under Domain Shift in Robotics [PDF] Abstract
7. Adaptive Threshold for Better Performance of the Recognition and Re-identification Models [PDF] Abstract
9. Compositional Prototype Network with Multi-view Comparison for Few-Shot Point Cloud Semantic Segmentation [PDF] Abstract
12. Longitudinal diffusion MRI analysis using Segis-Net: a single-step deep-learning framework for simultaneous segmentation and registration [PDF] Abstract
17. Playing to distraction: towards a robust training of CNN classifiers through visual explanation techniques [PDF] Abstract
18. Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [PDF] Abstract
19. Deep Graph Normalizer: A Geometric Deep Learning Approach for Estimating Connectional Brain Templates [PDF] Abstract
21. Spectral Analysis for Semantic Segmentation with Applications on Feature Truncation and Weak Annotation [PDF] Abstract
42. Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain [PDF] Abstract
45. Hybrid and Non-Uniform quantization methods using retro synthesis data for efficient inference [PDF] Abstract
46. 2-D Respiration Navigation Framework for 3-D Continuous Cardiac Magnetic Resonance Imaging [PDF] Abstract
47. TSGCNet: Discriminative Geometric Feature Learning with Two-Stream Graph Convolutional Network for 3D Dental Model Segmentation [PDF] Abstract
50. Dual-Refinement: Joint Label and Feature Refinement for Unsupervised Domain Adaptive Person Re-Identification [PDF] Abstract
51. PaXNet: Dental Caries Detection in Panoramic X-ray using Ensemble Transfer Learning and Capsule Classifier [PDF] Abstract
54. A Simple Fine-tuning Is All You Need: Towards Robust Deep Learning Via Adversarial Fine-tuning [PDF] Abstract
60. A Cascaded Residual UNET for Fully Automated Segmentation of Prostate and Peripheral Zone in T2-weighted 3D Fast Spin Echo Images [PDF] Abstract
61. 1st Place Solution to VisDA-2020: Bias Elimination for Domain Adaptive Pedestrian Re-identification [PDF] Abstract
64. Real-Time Facial Expression Emoji Masking with Convolutional Neural Networks and Homography [PDF] Abstract
65. Commonsense Visual Sensemaking for Autonomous Driving: On Generalised Neurosymbolic Online Abduction Integrating Vision and Semantics [PDF] Abstract
71. Lesion Net -- Skin Lesion Segmentation Using Coordinate Convolution and Deep Residual Units [PDF] Abstract
72. Combining CNN and Hybrid Active Contours for Head and Neck Tumor Segmentation in CT and PET images [PDF] Abstract
73. Screening COVID-19 Based on CT/CXR Images & Building a Publicly Available CT-scan Dataset of COVID-19 [PDF] Abstract
75. A Google Earth Engine-enabled Python approach to improve identification of anthropogenic palaeo-landscape features [PDF] Abstract
77. Analysis of Macula on Color Fundus Images Using Heightmap Reconstruction Through Deep Learning [PDF] Abstract
78. Cascaded Convolutional Neural Network for Automatic Myocardial Infarction Segmentation from Delayed-Enhancement Cardiac MRI [PDF] Abstract
83. Domain Generalisation with Domain Augmented Supervised Contrastive Learning (Student Abstract) [PDF] Abstract
84. Generalized Categorisation of Digital Pathology Whole Image Slides using Unsupervised Learning [PDF] Abstract
85. Learning Generalized Spatial-Temporal Deep Feature Representation for No-Reference Video Quality Assessment [PDF] Abstract
86. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets for hyperspectral image classification [PDF] Abstract
87. Structure-Aware Layer Decomposition Learning Based on Gaussian Convolution Model for Inverse Halftoning [PDF] Abstract
88. Histogram Matching Augmentation for Domain Adaptation with Application to Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Image Segmentation [PDF] Abstract
94. COVIDX: Computer-aided diagnosis of Covid-19 and its severity prediction with raw digital chest X-ray images [PDF] Abstract
96. Three-dimensional Simultaneous Shape and Pose Estimation for Extended Objects Using Spherical Harmonics [PDF] Abstract
97. Comprehensive Graph-conditional Similarity Preserving Network for Unsupervised Cross-modal Hashing [PDF] Abstract
98. Prediction by Anticipation: An Action-Conditional Prediction Method based on Interaction Learning [PDF] Abstract
Abstracts
1. Deep Neural Models for color discrimination and color constancy [PDF] Back to Contents
Alban Flachot, Arash Akbarinia, Heiko H. Schütt, Roland W. Fleming, Felix A. Wichmann, Karl R. Gegenfurtner
Abstract: Color constancy is our ability to perceive constant colors across varying illuminations. Here, we trained deep neural networks to be color constant and evaluated their performance with varying cues. Inputs to the networks consisted of the cone excitations in 3D-rendered images of 2115 different 3D-shapes, with spectral reflectances of 1600 different Munsell chips, illuminated under 278 different natural illuminations. The models were trained to classify the reflectance of the objects. One network, Deep65, was trained under a fixed daylight D65 illumination, while DeepCC was trained under varying illuminations. Testing was done with 4 new illuminations with equally spaced CIEL*a*b* chromaticities, 2 along the daylight locus and 2 orthogonal to it. We found a high degree of color constancy for DeepCC, and constancy was higher along the daylight locus. When gradually removing cues from the scene, constancy decreased. High levels of color constancy were achieved with different DNN architectures. Both ResNets and classical ConvNets of varying degrees of complexity performed well. However, DeepCC, a convolutional network, represented colors along the 3 color dimensions of human color vision, while ResNets showed a more complex representation.
2. Enhanced Regularizers for Attributional Robustness [PDF] Back to Contents
Anindya Sarkar, Anirban Sarkar, Vineeth N Balasubramanian
Abstract: Deep neural networks are the default choice of learning models for computer vision tasks. Extensive work has been carried out in recent years on explaining deep models for vision tasks such as classification. However, recent work has shown that it is possible for these models to produce substantially different attribution maps even when two very similar images are given to the network, raising serious questions about trustworthiness. To address this issue, we propose a robust attribution training strategy to improve attributional robustness of deep neural networks. Our method carefully analyzes the requirements for attributional robustness and introduces two new regularizers that preserve a model's attribution map during attacks. Our method surpasses state-of-the-art attributional robustness methods by a margin of approximately 3% to 9% in terms of attribution robustness measures on several datasets including MNIST, FMNIST, Flower and GTSRB.
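To make the regularization idea concrete, here is a minimal sketch of one common way to encourage attributional robustness: penalize how much a saliency-style input-gradient attribution changes under a small perturbation of the input. This is an illustrative assumption, not the paper's two regularizers; the model, perturbation, and weighting constant are all placeholders.

```python
# Minimal attribution-consistency regularizer sketch (assumed formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def input_gradient_attribution(model, x, y):
    """Saliency-style attribution: gradient of the true-class logit w.r.t. x."""
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y.unsqueeze(1)).sum()
    # create_graph=True so the regularizer itself can be backpropagated through.
    (grad,) = torch.autograd.grad(score, x, create_graph=True)
    return grad

def attribution_robust_loss(model, x, y, eps=8 / 255, lam=0.5):
    ce = F.cross_entropy(model(x), y)
    attr_clean = input_gradient_attribution(model, x, y)
    x_pert = x + eps * torch.randn_like(x).sign()   # crude stand-in for an attack
    attr_pert = input_gradient_attribution(model, x_pert, y)
    reg = (attr_clean - attr_pert).flatten(1).norm(dim=1).mean()
    return ce + lam * reg                           # lam is an arbitrary weight

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 28 * 28, 10))
x, y = torch.rand(4, 1, 28, 28), torch.randint(0, 10, (4,))
attribution_robust_loss(model, x, y).backward()
```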
3. Tensor Representations for Action Recognition [PDF] Back to Contents
Piotr Koniusz, Lei Wang, Anoop Cherian
Abstract: Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) sequence compatibility kernel (SCK) and (ii) dynamics compatibility kernel (DCK); the former capitalizes on the spatio-temporal correlations between features, while the latter explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK+, that operates on subsequences to capture the local-global interplay of correlations and can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce linearizations of these kernels that lead to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences, thus detecting fine-grained relationships of features rather than merely counting features in scenes. We prove that a tensor of order r, built from Z*-dimensional features and coupled with EPN, indeed detects if at least one higher-order occurrence is `projected' into one of the binom(Z*,r) subspaces of dimension r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with binom(Z*,r) such `detectors'.
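For intuition about what a sequence compatibility kernel measures, the sketch below compares two sequences by coupling frame-feature similarity with temporal alignment. The RBF form, bandwidths, and normalization are assumptions for illustration; the paper's kernel linearization and EPN steps are not reproduced.

```python
# Toy sequence compatibility kernel: content similarity weighted by time alignment.
import numpy as np

def rbf(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def sck(X, Y, sigma_feat=1.0, sigma_time=0.2):
    """X: (n, d) and Y: (m, d) per-frame features of two sequences."""
    n, m = len(X), len(Y)
    k = 0.0
    for i in range(n):
        for j in range(m):
            t_i, t_j = i / (n - 1), j / (m - 1)     # normalized time stamps
            k += rbf(X[i], Y[j], sigma_feat) * rbf(np.array([t_i]),
                                                   np.array([t_j]), sigma_time)
    return k / (n * m)

X, Y = np.random.rand(20, 32), np.random.rand(25, 32)
print(sck(X, Y))
```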
4. Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention [PDF] Back to Contents
Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Chin-Hui Lee, Bao-Cai Yin
Abstract: In this paper, we propose a novel deep learning architecture to improve word-level lip-reading. On the one hand, we first introduce multi-scale processing into the spatial feature extraction for lip-reading. Specifically, we propose hierarchical pyramidal convolution (HPConv) to replace the standard convolution in the original module, improving the model's ability to discover fine-grained lip movements. On the other hand, we merge information across all time steps of the sequence by utilizing self-attention, making the model pay more attention to the relevant frames. These two advantages are combined to further enhance the model's classification power. Experiments on the Lip Reading in the Wild (LRW) dataset show that our proposed model achieves 86.83% accuracy, a 1.53% absolute improvement over the current state-of-the-art. We also conducted extensive experiments to better understand the behavior of the proposed model.
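A pyramidal convolution processes the same input with several kernel sizes in parallel, so fine and coarse lip movements are captured at once. The block below is a generic sketch under the assumption of equal channel splits per branch; HPConv's hierarchical design differs in its details.

```python
# Generic pyramidal convolution block (illustrative, not HPConv itself).
import torch
import torch.nn as nn

class PyramidalConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert out_ch % len(kernel_sizes) == 0
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # Each branch sees the same input at a different receptive-field size.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

block = PyramidalConv2d(64, 96)
out = block(torch.rand(2, 64, 22, 22))   # -> torch.Size([2, 96, 22, 22])
```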
5. Data-efficient Weakly-supervised Learning for On-line Object Detection under Domain Shift in Robotics [PDF] Back to Contents
Elisa Maiettini, Raffaello Camoriano, Giulia Pasquale, Vadim Tikhanoff, Lorenzo Rosasco, Lorenzo Natale
Abstract: Several object detection methods have recently been proposed in the literature, the vast majority based on Deep Convolutional Neural Networks (DCNNs). Such architectures have been shown to achieve remarkable performance, at the cost of computationally expensive batch training and extensive labeling. These methods have important limitations for robotics: learning solely on off-line data may introduce biases (the so-called domain shift) and prevent adaptation to novel tasks. In this work, we investigate how weakly-supervised learning can cope with these problems. We compare several techniques for weakly-supervised learning in detection pipelines to reduce model (re)training costs without compromising accuracy. In particular, we show that diversity sampling for constructing active learning queries and strong positives selection for self-supervised learning enable significant annotation savings and improve domain shift adaptation. By integrating our strategies into a hybrid DCNN/FALKON on-line detection pipeline [1], our method can be trained and updated efficiently with few labels, overcoming the limitations of previous work. We experimentally validate and benchmark our method on challenging robotic object detection tasks under domain shift.
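As one concrete instance of diversity sampling for building active learning queries, the sketch below uses greedy k-center selection over feature embeddings: each new pick is the unlabeled sample farthest from everything already selected. This is a standard diversity criterion, assumed here for illustration; the abstract does not specify the exact one used.

```python
# Greedy k-center diversity sampling over embeddings (illustrative).
import numpy as np

def k_center_greedy(feats, k):
    selected = [0]                                   # arbitrary starting point
    dist = np.linalg.norm(feats - feats[0], axis=1)  # distance to selected set
    for _ in range(k - 1):
        nxt = int(dist.argmax())                     # farthest from the set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(feats - feats[nxt], axis=1))
    return selected

pool = np.random.rand(1000, 128)          # embeddings of unlabeled detections
query_idx = k_center_greedy(pool, 16)     # 16 diverse samples to annotate
```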
6. GAKP: GRU Association and Kalman Prediction for Multiple Object Tracking [PDF] Back to Contents
Zhen Li, Sunzeng Cai, Xiaoyi Wang, Zhe Liu, Nian Xue
Abstract: Multiple Object Tracking (MOT) has been a useful yet challenging task in many real-world applications such as video surveillance, intelligent retail, and smart city. The challenge is how to model long-term temporal dependencies in an efficient manner. Some recent works employ Recurrent Neural Networks (RNN) to obtain good performance, which, however, requires a large amount of training data. In this paper, we propose a novel tracking method that integrates an auto-tuning Kalman method for prediction with the Gated Recurrent Unit (GRU) and achieves near-optimal performance with a small amount of training data. Experimental results show that our new algorithm achieves competitive performance on the challenging MOT benchmark, and is faster and more robust than the state-of-the-art RNN-based online MOT algorithms.
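For reference, the prediction/update cycle underlying any Kalman tracker looks like the constant-velocity sketch below. The noise covariances Q and R are fixed here; auto-tuning them (and fusing with a GRU for association) is the paper's contribution and is not reproduced.

```python
# Constant-velocity Kalman filter: predict and update steps (illustrative).
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],   # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # only positions are observed
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)           # process noise (auto-tuned in the paper)
R = 0.10 * np.eye(2)           # measurement noise (auto-tuned in the paper)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    return x + K @ (z - H @ x), (np.eye(4) - K @ H) @ P

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 0.5]), np.array([2.1, 1.1])]:
    x, P = predict(x, P)
    x, P = update(x, P, z)
```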
7. Adaptive Threshold for Better Performance of the Recognition and Re-identification Models [PDF] Back to Contents
Bharat Bohara
Abstract: Choosing a decision threshold is one of the challenging jobs in any classification task. However accurate the model is, if the decision boundary is not picked carefully, its entire performance can go in vain. Moreover, for imbalanced classification, where one of the classes is dominant over another, relying on the conventional method of choosing a threshold results in poor performance. Even if the threshold or decision boundary is properly chosen based on machine learning strategies like SVM and decision trees, it will fail at some point for dynamically varying databases and for identity features that are more or less similar, as in face recognition and person re-identification models. Hence, given the need for the decision-threshold selection to adapt to imbalanced classification and incremental database sizes, an online optimization-based statistical feature learning adaptive technique is developed and tested on the LFW dataset and self-prepared athlete datasets. Adopting an adaptive threshold yielded a 12-45% improvement in model accuracy compared to the fixed thresholds {0.3, 0.5, 0.7} that are usually chosen via hit-and-trial in classification and identification tasks. Source code for the complete algorithm is available at: this https URL
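To illustrate the difference between a fixed and an adaptive threshold, the sketch below re-estimates the decision threshold from the scores observed so far by maximizing balanced accuracy, which is robust to class imbalance. The criterion is an assumption for illustration, not the paper's exact objective.

```python
# Adaptive threshold from observed score statistics (illustrative criterion).
import numpy as np

def adaptive_threshold(scores, labels):
    """Scan candidate thresholds; return the one maximizing balanced accuracy."""
    best_t, best_ba = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        tnr = (~pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        ba = 0.5 * (tpr + tnr)
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.1, 50), rng.normal(0.4, 0.1, 500)])
labels = np.concatenate([np.ones(50, dtype=int), np.zeros(500, dtype=int)])
print(adaptive_threshold(scores, labels))  # typically far from a fixed 0.5
```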
8. Context-Aware Personality Inference in Dyadic Scenarios: Introducing the UDIVA Dataset [PDF] Back to Contents
Cristina Palmero, Javier Selva, Sorina Smeureanu, Julio C. S. Jacques Junior, Albert Clapés, Alexa Moseguí, Zejian Zhang, David Gallardo, Georgina Guilera, David Leiva, Sergio Escalera
Abstract: This paper introduces UDIVA, a new non-acted dataset of face-to-face dyadic interactions, where interlocutors perform competitive and collaborative tasks with different behavior elicitation and cognitive workload. The dataset consists of 90.5 hours of dyadic interactions among 147 participants distributed in 188 sessions, recorded using multiple audiovisual and physiological sensors. Currently, it includes sociodemographic, self- and peer-reported personality, internal state, and relationship profiling from participants. As an initial analysis on UDIVA, we propose a transformer-based method for self-reported personality inference in dyadic scenarios, which uses audiovisual data and different sources of context from both interlocutors to regress a target person's personality traits. Preliminary results from an incremental study show consistent improvements when using all available context information.
9. Compositional Prototype Network with Multi-view Comparison for Few-Shot Point Cloud Semantic Segmentation [PDF] Back to Contents
Xiaoyu Chen, Chi Zhang, Guosheng Lin, Jing Han
Abstract: Point cloud segmentation is a fundamental visual understanding task in 3D vision. A fully supervised point cloud segmentation network often requires a large amount of data with point-wise annotations, which is expensive to obtain. In this work, we present the Compositional Prototype Network, which can undertake point cloud segmentation with only a few labeled training samples. Inspired by the few-shot learning literature on images, our network directly transfers label information from the limited training data to unlabeled test data for prediction. The network decomposes the representations of complex point cloud data into a set of local regional representations and utilizes them to calculate the compositional prototypes of a visual concept. Our network includes a key Multi-View Comparison Component that exploits the redundant views of the support set. To evaluate the proposed method, we create a new segmentation benchmark dataset, ScanNet-$6^i$, which is built upon the ScanNet dataset. Extensive experiments show that our method outperforms baselines with a significant advantage. Moreover, when we use our network to handle the long-tail problem in a fully supervised point cloud segmentation dataset, it can also effectively boost the performance of the few-shot classes.
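For orientation, the prototype idea in its simplest form averages support features per class and assigns queries to the nearest prototype. This generic baseline sketch omits the paper's compositional decomposition and multi-view comparison.

```python
# Nearest-prototype few-shot classification baseline (illustrative).
import torch

def prototypes(support_feats, support_labels, n_classes):
    """Mean feature per class over the labeled support set."""
    return torch.stack([support_feats[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])

def classify(query_feats, protos):
    """Assign each query point to the class of its nearest prototype."""
    return torch.cdist(query_feats, protos).argmin(dim=1)

support = torch.rand(30, 64)
support_labels = torch.arange(30) % 3     # 3 classes, 10 shots each
query = torch.rand(100, 64)               # e.g. per-point features of a cloud
pred = classify(query, prototypes(support, support_labels, 3))
```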
10. Instance Segmentation of Industrial Point Cloud Data [PDF] Back to Contents
Eva Agapaki, Ioannis Brilakis
Abstract: The challenge that this paper addresses is how to efficiently minimize the cost and manual labour of automatically generating object-oriented geometric Digital Twins (gDTs) of industrial facilities, so that the benefits provide even more value compared to the initial investment to generate these models. Our previous work achieved the current state-of-the-art class segmentation performance (75% average accuracy per point and 90% average AUC in the CLOI dataset classes) as presented in (Agapaki and Brilakis 2020) and directly produces labelled point clusters of the most important objects to model (CLOI classes) from laser-scanned industrial data. CLOI stands for C-shapes, L-shapes, O-shapes, I-shapes and their combinations. However, the problem of automated segmentation of individual instances that can then be used to fit geometric shapes remains unsolved. We argue that the use of instance segmentation algorithms has the theoretical potential to provide the output needed for the generation of gDTs. We solve instance segmentation in this paper through (a) a CLOI-Instance graph connectivity algorithm that segments the point clusters of an object class into instances, and (b) boundary segmentation of points that improves step (a). Our method was tested on the CLOI benchmark dataset (Agapaki et al. 2019) and segmented instances with 76.25% average precision and 70% average recall per point among all classes. This proves that our method is the first to automatically segment industrial point cloud shapes with no prior knowledge other than the class point label, and it is the bedrock for efficient gDT generation in cluttered industrial point clouds.
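One standard way to split a class-labeled point cluster into instances, in the spirit of (but not identical to) the CLOI-Instance graph connectivity step, is to build a radius graph over the points and take its connected components:

```python
# Instances as connected components of a radius graph (illustrative).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def split_instances(points, radius):
    pairs = cKDTree(points).query_pairs(radius, output_type="ndarray")
    n = len(points)
    adj = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                     shape=(n, n))
    n_instances, labels = connected_components(adj, directed=False)
    return n_instances, labels

# Two well-separated clusters of the same class -> two instances.
pts = np.vstack([np.random.rand(100, 3), np.random.rand(100, 3) + 5.0])
n_inst, inst_labels = split_instances(pts, radius=0.5)
```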
11. DeepSurfels: Learning Online Appearance Fusion [PDF] Back to Contents
Marko Mihajlovic, Silvan Weder, Marc Pollefeys, Martin R. Oswald
Abstract: We present DeepSurfels, a novel hybrid scene representation for geometry and appearance information. DeepSurfels combines explicit and neural building blocks to jointly encode geometry and appearance information. In contrast to established representations, DeepSurfels better represents high-frequency textures, is well-suited for online updates of appearance information, and can be easily combined with machine learning methods. We further present an end-to-end trainable online appearance fusion pipeline that fuses information provided by RGB images into the proposed scene representation and is trained using self-supervision imposed by the reprojection error with respect to the input images. Our method compares favorably to classical texture mapping approaches as well as recently proposed learning-based techniques. Moreover, we demonstrate lower runtime, improved generalization capabilities, and better scalability to larger scenes compared to existing methods.
12. Longitudinal diffusion MRI analysis using Segis-Net: a single-step deep-learning framework for simultaneous segmentation and registration [PDF] Back to Contents
Bo Li, Wiro J. Niessen, Stefan Klein, Marius de Groot, M. Arfan Ikram, Meike W. Vernooij, Esther E. Bron
Abstract: This work presents a single-step deep-learning framework for longitudinal image analysis, coined Segis-Net. To optimally exploit information available in longitudinal data, this method concurrently learns a multi-class segmentation and nonlinear registration. Segmentation and registration are modeled using a convolutional neural network and optimized simultaneously for their mutual benefit. An objective function that optimizes spatial correspondence for the segmented structures across time-points is proposed. We applied Segis-Net to the analysis of white matter tracts from N=8045 longitudinal brain MRI datasets of 3249 elderly individuals. The Segis-Net approach showed a significant increase in registration accuracy, spatio-temporal segmentation consistency, and reproducibility compared with two multistage pipelines. This also led to a significant reduction in the sample size that would be required to achieve the same statistical power in analyzing tract-specific measures. Thus, we expect that Segis-Net can serve as a new reliable tool to support longitudinal imaging studies to investigate macro- and microstructural brain changes over time.
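A sketch of what a joint segmentation-and-registration objective can look like: a segmentation term on both time points, an image-similarity term after warping, and a spatial-correspondence term encouraging the warped source segmentation to overlap the target one. The individual terms, weights, and the identity stand-in for the learned warp are assumptions, not Segis-Net's actual losses.

```python
# Joint segmentation + registration loss sketch (assumed terms and weights).
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def joint_loss(seg_src, seg_tgt, gt_src, gt_tgt, img_src_warped, img_tgt,
               warp=None, w_sim=1.0, w_corr=1.0):
    l_seg = dice_loss(seg_src, gt_src) + dice_loss(seg_tgt, gt_tgt)
    l_sim = F.mse_loss(img_src_warped, img_tgt)                  # registration
    warped_seg = warp(seg_src) if warp is not None else seg_src  # identity here
    l_corr = dice_loss(warped_seg, seg_tgt)                      # correspondence
    return l_seg + w_sim * l_sim + w_corr * l_corr

s = torch.rand(1, 1, 16, 16, 16)                                 # toy 3D volumes
loss = joint_loss(s, s, (s > 0.5).float(), (s > 0.5).float(), s, s)
```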
13. TransPose: Towards Explainable Human Pose Estimation by Transformer [PDF] Back to Contents
Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang
Abstract: Deep Convolutional Neural Networks (CNNs) have made remarkable progress on human pose estimation task. However, there is no explicit understanding of how the locations of body keypoints are predicted by CNN, and it is also unknown what spatial dependency relationships between structural variables are learned in the model. To explore these questions, we construct an explainable model named TransPose based on Transformer architecture and low-level convolutional blocks. Given an image, the attention layers built in Transformer can capture long-range spatial relationships between keypoints and explain what dependencies the predicted keypoints locations highly rely on. We analyze the rationality of using attention as the explanation to reveal the spatial dependencies in this task. The revealed dependencies are image-specific and variable across different keypoint types, layer depths, or trained models. The experiments show that TransPose can accurately predict the positions of keypoints. It achieves state-of-the-art performance on COCO dataset, while being more interpretable, lightweight, and efficient than mainstream fully convolutional architectures.
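To see how attention exposes such dependencies, one can flatten CNN features into tokens and read each row of the attention matrix as a per-location dependency map. The snippet below is a generic illustration with a single attention layer, not TransPose's architecture.

```python
# Reading self-attention weights as spatial dependency maps (illustrative).
import torch
import torch.nn as nn

B, C, H, W = 1, 64, 16, 16
feats = torch.rand(B, C, H, W)                       # CNN feature map
tokens = feats.flatten(2).permute(0, 2, 1)           # (B, H*W, C)

attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens, need_weights=True)
# weights: (B, H*W, H*W); row q says which locations position q depends on,
# and can be reshaped to an H x W heat map for visualization.
dependency_map = weights[0, 100].reshape(H, W)
```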
14. Action Recognition with Kernel-based Graph Convolutional Networks [PDF] Back to Contents
Hichem Sahbi
Abstract: Learning graph convolutional networks (GCNs) is an emerging field which aims at generalizing deep learning to arbitrary non-regular domains. Most of the existing GCNs follow a neighborhood aggregation scheme, where the representation of a node is recursively obtained by aggregating its neighboring node representations using averaging or sorting operations. However, these operations are either ill-posed or weakly discriminant, or they increase the number of training parameters and thereby the computational complexity and the risk of overfitting. In this paper, we introduce a novel GCN framework that achieves spatial graph convolution in a reproducing kernel Hilbert space (RKHS). The latter makes it possible to design, via implicit kernel representations, convolutional graph filters in a high-dimensional and more discriminating space without increasing the number of training parameters. The particularity of our GCN model also resides in its ability to achieve convolutions without explicitly realigning nodes in the receptive fields of the learned graph filters with those of the input graphs, thereby making convolutions permutation agnostic and well defined. Experiments conducted on the challenging task of skeleton-based action recognition show the superiority of the proposed method against different baselines as well as the related work.
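As a toy illustration of kernel-weighted neighborhood aggregation, the sketch below lets neighbors contribute according to a Gaussian kernel on their features instead of learned per-edge weights. This is only loosely inspired by the RKHS construction; the paper's graph filters are defined far more carefully.

```python
# Kernel-weighted neighborhood aggregation on a graph (toy illustration).
import numpy as np

def rbf(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def kernel_graph_conv(X, A, gamma=1.0):
    """X: (n, d) node features; A: (n, n) adjacency matrix."""
    out = np.zeros_like(X)
    for i in range(len(X)):
        nbrs = np.flatnonzero(A[i])
        if len(nbrs) == 0:
            out[i] = X[i]                 # isolated node keeps its feature
            continue
        w = np.array([rbf(X[i], X[j], gamma) for j in nbrs])
        out[i] = (w[:, None] * X[nbrs]).sum(axis=0) / w.sum()
    return out

X = np.random.rand(5, 8)
A = (np.random.rand(5, 5) > 0.5).astype(float)
H = kernel_graph_conv(X, A)
```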
15. A cortical-inspired sub-Riemannian model for Poggendorff-type visual illusions [PDF] Back to Contents
Emre Baspinar, Luca Calatroni, Valentina Franceschi, Dario Prandi
Abstract: We consider Wilson-Cowan-type models for the mathematical, orientation-dependent description of Poggendorff-like illusions. Our modelling improves the cortical-inspired approaches used in [1,2] by encoding within the neuronal interaction term the sub-Riemannian heat kernel, in agreement with the intrinsically anisotropic functional architecture of V1 based on both local and lateral connections. For the numerical realisation of both models, we consider standard gradient descent algorithms combined with Fourier-based approaches for the efficient computation of the sub-Laplacian evolution. Our numerical results show that the use of the sub-Riemannian kernel allows us to numerically reproduce visual misperceptions and inpainting-type biases that standard approaches were not able to replicate [3].
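For orientation, a generic Wilson-Cowan-type evolution over the position-orientation space of V1 reads as below; this is a standard textbook form and an assumption here, the paper's contribution being the choice of the interaction weight $\omega$ as the sub-Riemannian heat kernel.

```latex
\partial_t a(\xi, t) = -\alpha\, a(\xi, t)
  + \mu \int \omega(\xi, \xi')\, \sigma\big(a(\xi', t)\big)\, \mathrm{d}\xi'
  + h(\xi, t)
```

Here $a(\xi, t)$ is the activation at a point $\xi = (x, y, \theta)$ of the position-orientation space, $\sigma$ a sigmoidal nonlinearity, $h$ the feed-forward input, and $\alpha, \mu$ decay and interaction strengths.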
16. Deep Visual Domain Adaptation [PDF] Back to Contents
Gabriela Csurka
Abstract: Domain adaptation (DA) aims at improving the performance of a model on target domains by transferring the knowledge contained in different but related source domains. With recent advances in deep learning models, which are extremely data-hungry, interest in visual DA has significantly increased in the last decade and the number of related works in the field has exploded. The aim of this paper, therefore, is to give a comprehensive overview of deep domain adaptation methods for computer vision applications. First, we detail and compare different possible ways of exploiting deep architectures for domain adaptation. Then, we propose an overview of recent trends in deep visual DA. Finally, we mention a few improvement strategies, orthogonal to these methods, that can be applied to these models. While we mainly focus on image classification, we give pointers to papers that extend these ideas to other applications such as semantic segmentation, object detection, person re-identification, and others.
17. Playing to distraction: towards a robust training of CNN classifiers through visual explanation techniques [PDF] 返回目录
David Morales, Estefania Talavera, Beatriz Remeseiro
Abstract: The field of deep learning is evolving in different directions, though more efficient training strategies are still needed. In this work, we present a novel and robust training scheme that integrates visual explanation techniques into the learning process. Unlike attention mechanisms that focus on the relevant parts of images, we aim to improve the robustness of the model by making it pay attention to other regions as well. Broadly speaking, the idea is to distract the classifier during learning, forcing it to focus not only on relevant regions but also on those that, a priori, are not so informative for discriminating the class. We tested the proposed approach by embedding it into the learning process of a convolutional neural network for the analysis and classification of two well-known datasets, namely Stanford Cars and FGVC-Aircraft. Furthermore, we evaluated our model on a real-case scenario for the classification of egocentric images, allowing us to obtain relevant information about people's lifestyles. In particular, we work on the challenging EgoFoodPlaces dataset, achieving state-of-the-art results with a lower level of complexity. The obtained results indicate the suitability of our proposed training scheme for image classification, improving the robustness of the final model.
摘要:深度学习领域正朝着不同的方向发展,但仍需要更高效的训练策略。在这项工作中,我们提出了一种新颖而鲁棒的训练方案,将视觉解释技术整合到学习过程中。与只关注图像相关部分的注意力机制不同,我们旨在通过使模型同时关注其他区域来提高其鲁棒性。广义上讲,其思想是在学习过程中分散分类器的注意力,迫使其不仅关注相关区域,还要关注那些先验上对类别判别信息不足的区域。我们将该方法嵌入卷积神经网络的学习过程中进行测试,对两个著名数据集(Stanford Cars和FGVC-Aircraft)进行分析和分类。此外,我们在一个真实场景中评估了模型,对以自我为中心的图像进行分类,从而获得有关人们生活方式的相关信息。特别地,我们在具有挑战性的EgoFoodPlaces数据集上以较低的复杂度取得了最先进的结果。所得结果表明我们提出的训练方案适用于图像分类,并提高了最终模型的鲁棒性。
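To make the "distraction" idea concrete, a minimal sketch under stated assumptions: a plain input-gradient saliency map stands in for the visual explanation techniques used in the paper (e.g. Grad-CAM-style maps), and the most salient pixels are occluded so that the classifier is also forced to learn from a-priori less informative regions. Function name and masking fraction are hypothetical.

```python
import torch
import torch.nn.functional as F

def distraction_batch(model, images, labels, mask_frac=0.1):
    # Hedged stand-in for the paper's visual-explanation step: rank pixels
    # by input-gradient saliency and occlude the most salient ones, so the
    # classifier must also learn from less informative regions.
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad, = torch.autograd.grad(loss, images)
    saliency = grad.abs().amax(dim=1, keepdim=True)            # (B, 1, H, W)
    k = max(1, int(mask_frac * saliency[0].numel()))
    thresh = saliency.flatten(1).topk(k, dim=1).values[:, -1]  # k-th largest
    mask = saliency >= thresh.view(-1, 1, 1, 1)
    return images.detach().masked_fill(mask, 0.0)              # occluded batch
```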
18. Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [PDF] 返回目录
Mélodie Boillet, Christopher Kermorvant, Thierry Paquet
Abstract: In this paper, we introduce a fully convolutional network for the document layout analysis task. While state-of-the-art methods use models pre-trained on natural scene images, our method Doc-UFCN relies on a U-shaped model trained from scratch for detecting objects from historical documents. We consider the line segmentation task, and more generally the layout analysis problem, as a pixel-wise classification task; our model then outputs a pixel-labeling of the input images. We show that Doc-UFCN outperforms state-of-the-art methods on various datasets and also demonstrate that pre-training on natural scene images is not required to reach good results. In addition, we show that pre-training on multiple document datasets can improve performance. We evaluate the models using various metrics to have a fair and complete comparison between the methods.
摘要:在本文中,我们介绍了一种用于文档版面分析任务的全卷积网络。尽管最先进的方法使用在自然场景图像上预训练的模型,我们的Doc-UFCN方法依靠从头训练的U形模型来检测历史文档中的对象。我们将文本行分割任务以及更一般的版面分析问题视为逐像素分类任务,模型输出输入图像的像素级标注。我们证明Doc-UFCN在各种数据集上的表现均优于最先进方法,并且表明无需在自然场景图像上预训练即可取得良好效果。此外,我们表明在多个文档数据集上进行预训练可以提高性能。我们使用多种指标评估模型,以便在各方法之间进行公平而完整的比较。
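For orientation, a minimal U-shaped fully convolutional sketch of the pixel-wise classification setup. This is not the actual Doc-UFCN architecture, just the smallest PyTorch model of the same shape: encoder, decoder, skip connection, per-pixel logits.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    # Hedged sketch of a U-shaped pixel classifier (one down/up level).
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, n_classes, 1))  # pixel logits

    def forward(self, x):
        e = self.enc(x)                         # full-resolution features
        d = self.up(self.down(e))               # down then back up
        return self.dec(torch.cat([e, d], 1))   # (B, n_classes, H, W)
```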
19. Deep Graph Normalizer: A Geometric Deep Learning Approach for Estimating Connectional Brain Templates [PDF] 返回目录
Mustafa Burak Gurbuz, Islem Rekik
Abstract: A connectional brain template (CBT) is a normalized graph-based representation of a population of brain networks, also regarded as an average connectome. CBTs are powerful tools for creating representative maps of brain connectivity in typical and atypical populations. In particular, estimating a well-centered and representative CBT for populations of multi-view brain networks (MVBN) is more challenging since these networks sit on complex manifolds and there is no easy way to fuse different heterogeneous network views. This problem remains unexplored, with the exception of a few recent works rooted in the assumption that the relationship between connectomes is mostly linear. However, such an assumption fails to capture complex patterns and non-linear variation across individuals. Moreover, existing methods are simply composed of sequential MVBN processing blocks without any feedback mechanism, leading to error accumulation. To address these issues, we propose Deep Graph Normalizer (DGN), the first geometric deep learning (GDL) architecture for normalizing a population of MVBNs by integrating them into a single connectional brain template. Our end-to-end DGN learns how to fuse multi-view brain networks while capturing non-linear patterns across subjects and preserving brain graph topological properties by capitalizing on graph convolutional neural networks. We also introduce a randomized weighted loss function which also acts as a regularizer to minimize the distance between the population of MVBNs and the estimated CBT, thereby enforcing its centeredness. We demonstrate that DGN significantly outperforms existing state-of-the-art methods at estimating CBTs on both small-scale and large-scale connectomic datasets in terms of both representativeness and discriminability (i.e., identifying distinctive connectivities fingerprinting each brain network population).
摘要:连接性脑模板(CBT)是对一组脑网络的基于图的归一化表示,也可视为平均连接组。CBT是为典型和非典型人群绘制具有代表性的大脑连接图的有力工具。特别地,为多视图脑网络(MVBN)群体估计一个良好居中且有代表性的CBT更具挑战性,因为这些网络位于复杂的流形上,且没有简便的方法融合不同的异构网络视图。除少数假设连接组之间关系基本为线性的近期工作外,这一问题仍未得到探索。然而,这样的假设无法捕捉个体间的复杂模式和非线性变化。此外,现有方法仅由顺序的MVBN处理模块组成,没有任何反馈机制,导致误差累积。为了解决这些问题,我们提出了深度图归一化器(DGN),这是第一个通过将MVBN群体整合为单个连接性脑模板来对其进行归一化的几何深度学习(GDL)架构。我们的端到端DGN借助图卷积神经网络学习如何融合多视图脑网络,同时捕捉跨受试者的非线性模式并保留脑图的拓扑特性。我们还引入了一种随机加权损失函数,它同时充当正则项,用于最小化MVBN群体与估计的CBT之间的距离,从而保证其居中性。我们证明,在小规模和大规模连接组数据集上,DGN在代表性和可判别性(即识别刻画每个脑网络群体的独特连接指纹)方面均显著优于现有的最先进方法。
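One plausible reading of the randomized weighted loss, in our own notation (the exact weighting scheme is in the paper): with T_i the tensor views of the N training subjects and C the estimated CBT,

```latex
\mathcal{L}(\mathbf{C}) \;=\; \sum_{i=1}^{N} \lambda_i \,\big\| \mathbf{C} - \mathbf{T}_i \big\|_F ,
\qquad \lambda_i \ \text{sampled at random with } \textstyle\sum_{i} \lambda_i = 1,
```

so that minimizing L pulls C towards the whole population rather than any single subject, which is what enforces centeredness.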
20. Joint Intensity-Gradient Guided Generative Modeling for Colorization [PDF] 返回目录
Kai Hong, Jin Li, Wanyun Li, Cailian Yang, Minghui Zhang, Yuhao Wang, Qiegen Liu
Abstract: This paper proposes an iterative generative model for solving the automatic colorization problem. Although previous research has shown the capability to generate plausible color, edge color overflow and the requirement of reference images still exist. The starting point of the unsupervised learning in this study is the observation that the gradient map possesses latent information of the image. Therefore, the inference process of the generative modeling is conducted in the joint intensity-gradient domain. Specifically, a set of high-dimensional tensors formed from intensity and gradient, as the network input, are used to train a powerful noise conditional score network at the training phase. Furthermore, a joint intensity-gradient constraint in the data-fidelity term is proposed to limit the degree of freedom within the generative model at the iterative colorization stage, which is conducive to edge preservation. Extensive experiments demonstrated that the system outperformed state-of-the-art methods in both quantitative comparisons and a user study.
摘要:本文提出了一种迭代生成模型,用于解决自动着色问题。尽管先前的研究已经展示了生成合理颜色的能力,但边缘颜色溢出和对参考图像的依赖仍然存在。本研究中无监督学习的出发点是观察到梯度图蕴含图像的潜在信息。因此,生成建模的推断过程在联合强度-梯度域中进行。具体来说,由强度和梯度构成的一组高维张量作为网络输入,用于在训练阶段训练强大的噪声条件得分网络。此外,提出了数据保真项中的联合强度-梯度约束,以限制迭代着色阶段生成模型内部的自由度,有利于边缘保持。大量实验表明,无论是定量比较还是用户研究,该系统均优于最先进方法。
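A schematic of a joint intensity-gradient data-fidelity term of the kind the abstract describes, in our notation (the paper's exact formulation may differ): given observed data x₀ and a score-based prior p_θ, the iterative sampler balances

```latex
\min_{x}\;\; \|\mathcal{A}x - x_0\|_2^2 \;+\; \lambda\,\|\nabla x - \nabla x_0\|_2^2 \;-\; \log p_\theta(x),
```

where A extracts the observed intensity channel and the gradient term is what limits the degrees of freedom at the edges.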
21. Spectral Analysis for Semantic Segmentation with Applications on Feature Truncation and Weak Annotation [PDF] 返回目录
Li-Wei Chen, Wei-Chen Chiu, Chin-Tien Wu
Abstract: Current neural networks for semantic segmentation usually predict the pixel-wise semantics on a down-sampled grid of images to alleviate the computational cost of dense maps. However, the accuracy of the resulting segmentation maps may also be degraded, particularly in the regions near object boundaries. In this paper, we investigate the sampling efficiency of the down-sampled grid more deeply. By applying spectral analysis to the network back-propagation process in the frequency domain, we discover that the cross-entropy is mainly contributed by the low-frequency components of segmentation maps, as well as those of the features in CNNs. The network performance is maintained as long as the resolution of the down-sampled grid meets the cut-off frequency. This finding leads us to propose a simple yet effective feature truncation method that limits the feature size in CNNs and removes the associated high-frequency components. This method can not only reduce the computational cost but also maintain the performance of semantic segmentation networks. Moreover, one can seamlessly integrate this method with typical network pruning approaches for further model reduction. On the other hand, we propose to employ a block-wise weak annotation for semantic segmentation that captures the low-frequency information of the segmentation map and is easy to collect. Using the proposed analysis scheme, one can easily estimate the efficacy of the block-wise annotation and the feature truncation method.
摘要:目前用于语义分割的神经网络通常在图像的下采样网格上预测逐像素语义,以减轻稠密预测图的计算成本。但是,所得分割图的精度也可能随之下降,尤其是在物体边界附近的区域。在本文中,我们对下采样网格的采样效率进行了更深入的研究。通过在频域中对网络反向传播过程进行频谱分析,我们发现交叉熵主要由分割图以及CNN特征的低频分量贡献。只要下采样网格的分辨率满足截止频率,网络性能就能保持。这一发现促使我们提出一种简单而有效的特征截断方法,该方法限制CNN中的特征尺寸并去除相应的高频分量。该方法不仅可以降低计算成本,而且可以保持语义分割网络的性能。此外,可以将该方法与典型的网络剪枝方法无缝结合,以进一步压缩模型。另一方面,我们提出对语义分割采用逐块弱标注,它捕获分割图的低频信息并且易于收集。使用所提出的分析方案,可以方便地估计逐块标注和特征截断方法的功效。
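A minimal sketch of frequency-domain feature truncation under stated assumptions: the paper truncates feature size inside the network, while here we simply zero a feature map's frequency components beyond a cut-off, with keep_frac as an assumed knob.

```python
import torch

def truncate_high_freq(feat, keep_frac=0.5):
    # Hedged sketch: low-pass a CNN feature map (B, C, H, W) by zeroing
    # everything outside a centered low-frequency window.
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    h, w = int(H * keep_frac / 2), int(W * keep_frac / 2)
    mask = torch.zeros(H, W, dtype=torch.bool, device=feat.device)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = True   # low-pass window
    spec = spec * mask                                          # broadcast over B, C
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real
```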
22. Towards A Category-extended Object Detector without Relabeling or Conflicts [PDF] 返回目录
Bowen Zhao, Chen Chen, Wanpeng Xiao, Xi Xiao, Qi Ju, Shutao Xia
Abstract: Object detectors are typically learned from fully-annotated training data with fixed pre-defined categories. However, not all possible categories of interest can be known beforehand, as classes often need to be increased progressively in many realistic applications. In such a scenario, only the original training set annotated with the old classes and some new training data labeled with the new classes are available. In this paper, we aim at learning a strong unified detector that can handle all categories based on the limited datasets without extra manual labor. Vanilla joint training without considering label ambiguity leads to heavy biases and poor performance due to the incomplete annotations. To avoid this situation, we propose a practical framework which focuses on three aspects: a better base model, a better unlabeled ground-truth mining strategy, and a better retraining method with pseudo annotations. First, a conflict-free loss is proposed to obtain a usable base detector. Second, we employ Monte Carlo Dropout to calculate the localization confidence, combined with the classification confidence, to mine more accurate bounding boxes. Third, we explore several strategies for making better use of pseudo annotations during retraining to achieve more powerful detectors. Extensive experiments conducted on multiple datasets demonstrate the effectiveness of our framework for category-extended object detectors.
摘要:目标检测器通常基于带有固定预定义类别的全标注训练数据进行学习。但是,并非所有可能感兴趣的类别都能事先已知,因为在许多实际应用中类别往往需要逐步增加。在这种情况下,只有以旧类别标注的原始训练集和以新类别标注的部分新训练数据可用。在本文中,我们旨在学习一个强大的统一检测器,它能够基于有限的数据集处理所有类别,而无需额外的人工标注。由于标注不完整,不考虑标签歧义的朴素联合训练会导致严重的偏差和较差的性能。为了避免这种情况,我们提出了一个实用框架,着眼于三个方面:更好的基础模型、更好的未标注真值挖掘策略,以及更好的基于伪标注的再训练方法。首先,提出无冲突损失以获得可用的基础检测器。其次,我们使用Monte Carlo Dropout计算定位置信度,并结合分类置信度来挖掘更准确的边界框。第三,我们探索了在再训练过程中更好地利用伪标注的几种策略,以获得更强大的检测器。在多个数据集上进行的大量实验证明了我们的框架对类别扩展目标检测器的有效性。
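A minimal sketch of the Monte Carlo Dropout step for localization confidence, assuming a hypothetical `detector(image) -> (4,) box` interface (not the paper's actual API): dropout is kept stochastic at inference, several passes are run, and the variance of the predicted box serves as an inverse confidence to be fused with the classification score.

```python
import torch
import torch.nn as nn

def mc_dropout_box(detector, image, n_samples=10):
    # Hedged sketch of MC Dropout for box confidence.
    detector.eval()
    for m in detector.modules():
        if isinstance(m, nn.Dropout):
            m.train()                      # keep only dropout stochastic
    with torch.no_grad():
        boxes = torch.stack([detector(image) for _ in range(n_samples)])
    mean_box = boxes.mean(dim=0)           # averaged localization
    loc_conf = 1.0 / (1.0 + boxes.var(dim=0).mean().item())  # low variance -> high conf
    return mean_box, loc_conf              # fuse with the classification score
```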
23. Human Expression Recognition using Facial Shape Based Fourier Descriptors Fusion [PDF] 返回目录
Ali Raza Shahid, Sheheryar Khan, Hong Yan
Abstract: Dynamic facial expression recognition has many useful applications in social networks, multimedia content analysis, security systems, and others. This challenging process must cope with recurring problems of image illumination and low resolution, which change under partial occlusion. This paper aims to produce a new facial expression recognition method based on the changes in the facial muscles. Geometric features are used to specify the facial regions, i.e., mouth, eyes, and nose. The generic Fourier shape descriptor, in conjunction with the elliptic Fourier shape descriptor, is used as an attribute to represent different emotions in terms of frequency-spectrum features. Afterwards, a multi-class support vector machine is applied for the classification of seven human expressions. The statistical analysis showed our approach obtained competent overall recognition using 5-fold cross validation with high accuracy on a well-known facial expression dataset.
摘要:动态面部表情识别在社交网络、多媒体内容分析、安全系统等方面有许多有用的应用。这一具有挑战性的过程必须应对图像光照和低分辨率等反复出现的问题,而这些问题在部分遮挡下还会变化。本文旨在基于面部肌肉的变化提出一种新的面部表情识别方法。几何特征用于指定面部区域,即嘴、眼睛和鼻子。通用傅立叶形状描述符与椭圆傅立叶形状描述符一起,用作在频谱特征下表示不同情绪的属性。之后,应用多类支持向量机对七种人类表情进行分类。统计分析表明,在著名的面部表情数据集上,我们的方法通过5折交叉验证获得了整体上可靠的高精度识别效果。
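A minimal sketch of the classification stage under stated assumptions: the fused generic + elliptic Fourier descriptors are taken as precomputed features (file names are hypothetical), and a multi-class SVM is evaluated with the paper's 5-fold cross-validation protocol.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical files: fused Fourier descriptors per face, 7 expression labels.
X = np.load("fourier_features.npy")
y = np.load("expression_labels.npy")

# Multi-class SVM (one-vs-rest decision function) on standardized features.
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", decision_function_shape="ovr"))
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross validation
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```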
24. From Point to Space: 3D Moving Human Pose Estimation Using Commodity WiFi [PDF] 返回目录
Yiming Wang, Lingchao Guo, Zhaoming Lu, Xiangming Wen, Shuang Zhou, Wanyu Meng
Abstract: In this paper, we present Wi-Mose, the first 3D moving human pose estimation system using commodity WiFi. Previous WiFi-based works have achieved 2D and 3D pose estimation. These solutions either capture poses from one perspective or construct poses of people who are at a fixed point, preventing their wide adoption in daily scenarios. To reconstruct 3D poses of people who move throughout the space rather than staying at a fixed point, we fuse the amplitude and phase into Channel State Information (CSI) images, which can provide both pose and position information. In addition, we design a neural network to extract features that are only associated with poses from CSI images and then convert the features into key-point coordinates. Experimental results show that Wi-Mose localizes key-points with 29.7mm and 37.8mm Procrustes analysis Mean Per Joint Position Error (P-MPJPE) in the Line of Sight (LoS) and Non-Line of Sight (NLoS) scenarios, respectively, achieving higher performance than the state-of-the-art method. The results indicate that Wi-Mose can capture high-precision 3D human poses throughout the space.
摘要:在本文中,我们介绍了Wi-Mose,这是第一个使用商用WiFi的3D移动人体姿态估计系统。以往基于WiFi的工作已实现2D和3D姿态估计,但这些方案要么只从单一视角捕获姿态,要么只能构建处于固定位置的人的姿态,限制了其在日常场景中的广泛应用。为了重建在整个空间中移动而非固定于一点的人的3D姿态,我们将幅度和相位融合为通道状态信息(CSI)图像,它可以同时提供姿态和位置信息。此外,我们设计了一个神经网络,从CSI图像中提取仅与姿态相关的特征,然后将这些特征转换为关键点坐标。实验结果表明,Wi-Mose在视距(LoS)和非视距(NLoS)场景下的普鲁克(Procrustes)分析平均每关节位置误差(P-MPJPE)分别为29.7mm和37.8mm,性能优于最先进方法。结果表明,Wi-Mose可以在整个空间中捕获高精度的3D人体姿态。
25. Adversarial Multi-scale Feature Learning for Person Re-identification [PDF] 返回目录
Xinglu Wang
Abstract: Person Re-identification (Person ReID) is an important topic in intelligent surveillance and computer vision. It aims to accurately measure visual similarities between person images in order to determine whether two images correspond to the same person. The key to accurately measuring visual similarities is learning discriminative features, which not only capture clues from different spatial scales but also jointly perform inference on multiple scales, with the ability to determine the reliability and ID-relativity of each clue. To achieve these goals, we propose to improve Person ReID system performance from two perspectives: 1) Multi-scale feature learning (MSFL), which consists of Cross-scale information propagation (CSIP) and Multi-scale feature fusion (MSFF), to dynamically fuse features across different scales; 2) Multi-scale gradient regularizer (MSGR), to emphasize ID-related factors and ignore irrelevant factors in an adversarial manner. Combining MSFL and MSGR, our method achieves state-of-the-art performance on four commonly used person ReID datasets with negligible test-time computation overhead.
摘要:行人重识别(Person ReID)是智能监控和计算机视觉中的重要课题,旨在准确度量行人图像之间的视觉相似度,以确定两幅图像是否对应同一个人。准确度量视觉相似度的关键是学习判别性特征,这些特征不仅要捕获来自不同空间尺度的线索,还要在多个尺度上联合推断,并能够确定每条线索的可靠性和与身份的相关性。为了实现这些目标,我们提出从两个角度改进Person ReID系统的性能:1)多尺度特征学习(MSFL),由跨尺度信息传播(CSIP)和多尺度特征融合(MSFF)组成,以动态融合不同尺度的特征;2)多尺度梯度正则器(MSGR),以对抗方式强调与身份相关的因素并忽略无关因素。结合MSFL和MSGR,我们的方法在四个常用的Person ReID数据集上取得了最先进的性能,而测试时的计算开销可以忽略不计。
26. Person Re-identification with Adversarial Triplet Embedding [PDF] 返回目录
Xinglu Wang
Abstract: Person re-identification is an important task and has widespread applications in video surveillance for public security. In the past few years, deep learning networks with triplet loss have become popular for this problem. However, the triplet loss usually suffers from poor local optima and relies heavily on the strategy of hard example mining. In this paper, we propose to address this problem with a new deep metric learning method called Adversarial Triplet Embedding (ATE), in which we simultaneously generate adversarial triplets and discriminative feature embeddings in a unified framework. In particular, adversarial triplets are generated by introducing adversarial perturbations into the training process. This adversarial game is converted into a minimax problem so as to admit an optimal solution from a theoretical viewpoint. Extensive experiments on several benchmark datasets demonstrate the effectiveness of the approach against the state-of-the-art literature.
摘要:行人重识别是一项重要任务,在公共安全视频监控中有广泛应用。在过去几年中,带三元组损失的深度学习网络在该问题上变得流行。但是,三元组损失通常容易陷入较差的局部最优,并严重依赖困难样本挖掘策略。在本文中,我们提出用一种称为对抗三元组嵌入(ATE)的新深度度量学习方法来解决这一问题:在统一框架中同时生成对抗三元组和判别性特征嵌入。特别地,对抗三元组是通过在训练过程中引入对抗扰动产生的。这一对抗博弈被转化为极小极大问题,从理论上保证存在最优解。在多个基准数据集上的大量实验证明了该方法相对于最新文献的有效性。
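For reference, the standard triplet loss the method builds on, together with a schematic of the adversarial minimax game (our notation; the paper's exact perturbation model may differ):

```latex
\mathcal{L}_{\mathrm{tri}}(a,p,n) = \Big[\, \|f(a)-f(p)\|_2^2 - \|f(a)-f(n)\|_2^2 + m \,\Big]_{+},
\qquad
\min_{\theta}\; \max_{\|\delta\| \le \epsilon}\; \mathcal{L}_{\mathrm{tri}}\big(a+\delta_a,\, p+\delta_p,\, n+\delta_n\big),
```

where f is the embedding network with parameters θ, m the margin, and δ = (δ_a, δ_p, δ_n) the adversarial perturbations applied to the anchor, positive, and negative samples.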
27. Aerial Imagery Pile burn detection using Deep Learning: the FLAME dataset [PDF] 返回目录
Alireza Shamsoshoara, Fatemeh Afghah, Abolfazl Razi, Liming Zheng, Peter Z Fulé, Erik Blasch
Abstract: Wildfires are one of the costliest and deadliest natural disasters in the US, causing damage to millions of hectares of forest resources and threatening the lives of people and animals. Of particular importance are risks to firefighters and operational forces, which highlights the need for leveraging technology to minimize danger to people and property. FLAME (Fire Luminosity Airborne-based Machine learning Evaluation) offers a dataset of aerial images of fires along with methods for fire detection and segmentation which can help firefighters and researchers to develop optimal fire management strategies. This paper provides a fire image dataset collected by drones during a prescribed burn of piled detritus in an Arizona pine forest. The dataset includes video recordings and thermal heatmaps captured by infrared cameras. The captured videos and images are annotated and labeled frame-wise to help researchers easily apply their fire detection and modeling algorithms. The paper also highlights solutions to two machine learning problems: (1) Binary classification of video frames based on the presence [and absence] of fire flames: an Artificial Neural Network (ANN) method is developed that achieved a 76% classification accuracy. (2) Fire detection using segmentation methods to precisely determine fire borders: a deep learning method is designed based on the U-Net up-sampling and down-sampling approach to extract a fire mask from the video frames. Our FLAME method approached a precision of 92% and a recall of 84%. Future research will extend the technique to free-burning broadcast fires using thermal images.
摘要:野火是美国代价最高、最致命的自然灾害之一,毁坏数百万公顷的森林资源,并威胁人类和动物的生命。消防员和作战力量面临的风险尤为重要,这凸显了利用技术将人员和财产危险降至最低的必要性。FLAME(Fire Luminosity Airborne-based Machine learning Evaluation,基于火光亮度的机载机器学习评估)提供了火灾航拍图像数据集以及火灾检测和分割方法,可帮助消防员和研究人员制定最佳的火灾管理策略。本文提供了在亚利桑那州松林中对堆积枯落物进行计划烧除期间由无人机收集的火灾图像数据集。数据集包括红外摄像机捕获的视频记录和热图。捕获的视频和图像均逐帧进行了标注,以帮助研究人员方便地应用其火灾检测和建模算法。本文还重点给出了两个机器学习问题的解决方案:(1)根据火焰的有无对视频帧进行二分类,开发的人工神经网络(ANN)方法达到了76%的分类精度;(2)使用分割方法进行火灾检测,以精确确定火灾边界,基于U-Net上采样和下采样方法设计了一种深度学习方法,从视频帧中提取火焰掩膜。我们的FLAME方法精确率接近92%,召回率接近84%。未来的研究将把该技术扩展到利用热图像的自由燃烧散布火。
28. Power Normalizations in Fine-grained Image, Few-shot Image and Graph Classification [PDF] 返回目录
Piotr Koniusz, Hongguang Zhang
Abstract: Power Normalizations (PN) are useful non-linear operators which tackle feature imbalances in classification problems. We study PNs in the deep learning setup via a novel PN layer that pools feature maps. Our layer combines the feature vectors and their respective spatial locations in the feature maps produced by the last convolutional layer of a CNN into a positive definite matrix with second-order statistics to which PN operators are applied, forming so-called Second-order Pooling (SOP). As the main goal of this paper is to study Power Normalizations, we investigate the role and meaning of MaxExp and Gamma, two popular PN functions. To this end, we provide probabilistic interpretations of such element-wise operators and discover surrogates with well-behaved derivatives for end-to-end training. Furthermore, we look at the spectral applicability of MaxExp and Gamma by studying Spectral Power Normalizations (SPN). We show that SPN on the autocorrelation/covariance matrix and the Heat Diffusion Process (HDP) on a graph Laplacian matrix are closely related, thus sharing their properties. Such a finding leads us to the culmination of our work, a fast spectral MaxExp which is a variant of HDP for covariance/autocorrelation matrices. We evaluate our ideas on fine-grained recognition, scene recognition, and material classification, as well as in few-shot learning and graph classification.
摘要:幂归一化(PN)是有用的非线性算子,用于解决分类问题中的特征不平衡。我们通过一个对特征图进行池化的新型PN层,在深度学习设置中研究PN。我们的层将CNN最后一个卷积层产生的特征图中的特征向量及其各自的空间位置组合成一个带二阶统计量的正定矩阵,并对其应用PN算子,形成所谓的二阶池化(SOP)。由于本文的主要目标是研究幂归一化,我们考察了两个常用PN函数MaxExp和Gamma的作用与含义。为此,我们为这类逐元素算子提供了概率解释,并找到了导数性质良好、适合端到端训练的替代形式。此外,我们通过研究谱幂归一化(SPN)来考察MaxExp和Gamma在谱域的适用性。我们证明,自相关/协方差矩阵上的SPN与图拉普拉斯矩阵上的热扩散过程(HDP)密切相关,从而共享二者的性质。这一发现引出了我们工作的核心成果:一种快速的谱MaxExp,它是HDP在协方差/自相关矩阵上的变体。我们在细粒度识别、场景识别和材料分类以及小样本学习和图分类上评估了我们的想法。
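The two PN functions under study, as commonly defined in the power-normalization literature (p is a normalized second-order statistic in [0,1]):

```latex
\text{Gamma:}\quad \hat{p} = p^{\gamma}, \;\; 0 < \gamma \le 1,
\qquad
\text{MaxExp:}\quad \hat{p} = 1 - (1-p)^{\eta}, \;\; \eta \ge 1.
```

MaxExp admits the probabilistic reading the abstract alludes to: the probability that at least one of η trials detects a given feature co-occurrence.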
29. Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition [PDF] 返回目录
Hengshun Zhou, Debin Meng, Yuanyuan Zhang, Xiaojiang Peng, Jun Du, Kai Wang, Yu Qiao
Abstract: Audio-video based emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features with both speech-spectrogram and Log Mel-spectrogram and evaluate several facial features with different CNN models and different emotion-pretrained strategies. For fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing attention mechanisms to highlight important emotion features, and exploring feature concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set, ranking third in the challenge.
摘要:基于音视频的情感识别旨在将给定视频归类为基本情感。在本文中,我们介绍了我们在EmotiW 2019中的方法,主要探索音频与视觉模态的情感特征及特征融合策略。对于情感特征,我们利用语音频谱图和对数梅尔频谱图探索音频特征,并用不同的CNN模型和不同的情感预训练策略评估多种面部特征。对于融合策略,我们探索模态内和跨模态融合方法,例如设计注意力机制以突出重要的情感特征,以及探索特征拼接和分解双线性池化(FBP)进行跨模态特征融合。经过仔细评估,我们在AFEW验证集上获得65.5%,在测试集上获得62.48%,在挑战赛中排名第三。
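For concreteness, one common factorized bilinear pooling form (cf. the MFB family; whether the paper uses exactly this variant is not stated in the abstract): with x an audio feature and y a visual feature,

```latex
z = \operatorname{SumPool}\big( (U^{\top}x) \circ (V^{\top}y),\, k \big), \qquad
\tilde{z} = \frac{\operatorname{sign}(z)\sqrt{|z|}}{\big\|\operatorname{sign}(z)\sqrt{|z|}\big\|_2},
```

where ∘ is the element-wise product, U and V are learned low-rank projections, and the sum-pooling over windows of size k is followed by power and ℓ₂ normalization.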
30. Lidar and Camera Self-Calibration using CostVolume Network [PDF] 返回目录
Xudong Lv, Boya Wang, Dong Ye, Shuo Wang
Abstract: In this paper, we propose a novel online self-calibration approach for Light Detection and Ranging (LiDAR) and camera sensors. Compared to previous CNN-based methods that concatenate the feature maps of the RGB image and the decalibrated depth image, we exploit a cost volume inspired by PWC-Net for feature matching. Besides the smooth L1 loss on the predicted extrinsic calibration parameters, an additional point cloud loss is applied. Instead of regressing the extrinsic parameters between LiDAR and camera directly, we predict the decalibration deviation from the initial calibration to the ground truth. During inference, the calibration error decreases further with the use of iterative refinement and a temporal filtering approach. The evaluation results on the KITTI dataset illustrate that our approach outperforms CNN-based state-of-the-art methods, with a mean absolute calibration error of 0.297cm in translation and 0.017° in rotation under miscalibration magnitudes of up to 1.5m and 20°.
摘要:在本文中,我们为激光雷达(LiDAR)和相机传感器提出了一种新颖的在线自标定方法。与以往将RGB图像和失准深度图像的特征图直接拼接的基于CNN的方法相比,我们利用受PWC-Net启发的成本量(cost volume)进行特征匹配。除了对预测的外参标定参数施加平滑L1损失外,还额外引入了点云损失。我们不直接回归LiDAR与相机之间的外参,而是预测从初始标定到真值的失准偏差。在推断过程中,标定误差通过迭代细化和时间滤波方法进一步降低。在KITTI数据集上的评估结果表明,在最大1.5m和20°的失准幅度下,我们的方法平移平均绝对标定误差为0.297cm、旋转为0.017°,优于基于CNN的最先进方法。
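A minimal sketch of the two-term objective under stated assumptions (tensor names and the weighting alpha are ours; T_pred / T_gt are 4x4 homogeneous extrinsics assembled from the predicted and ground-truth parameters):

```python
import torch
import torch.nn.functional as F

def calibration_loss(pred_params, gt_params, pts_lidar, T_pred, T_gt, alpha=1.0):
    # Hedged sketch: smooth L1 on the predicted extrinsic deviation plus a
    # point cloud loss comparing LiDAR points transformed by the predicted
    # vs. ground-truth extrinsics.
    param_loss = F.smooth_l1_loss(pred_params, gt_params)
    pts_h = F.pad(pts_lidar, (0, 1), value=1.0)          # (N, 4) homogeneous
    cloud_loss = ((pts_h @ T_pred.T)[:, :3] -
                  (pts_h @ T_gt.T)[:, :3]).norm(dim=1).mean()
    return param_loss + alpha * cloud_loss
```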
31. ANL: Anti-Noise Learning for Cross-Domain Person Re-Identification [PDF] 返回目录
Hongliang Zhang, Shoudong Han, Xiaofeng Pan, Jun Zhao
Abstract: Due to the lack of labels and the domain diversities, it is a challenge to study person re-identification in the cross-domain setting. An admirable method is to optimize the target model by assigning pseudo-labels to unlabeled samples through clustering. Usually, owing to the domain gaps, the pre-trained source domain model cannot extract appropriate target domain features, which dramatically affects the clustering performance and the accuracy of pseudo-labels. Extensive label noise doubtlessly leads to sub-optimal solutions. To solve these problems, we propose an Anti-Noise Learning (ANL) approach, which contains two modules. The Feature Distribution Alignment (FDA) module is designed to gather id-related samples and disperse id-unrelated samples through camera-wise contrastive learning and adversarial adaptation, creating a friendly cross-feature foundation for clustering that reduces clustering noise. Besides, the Reliable Sample Selection (RSS) module utilizes an Auxiliary Model to correct noisy labels and select reliable samples for the Main Model. In order to effectively utilize the outlier information generated by the clustering algorithm and the RSS module, we train these samples at the instance level. The experiments demonstrate that our proposed ANL framework can effectively reduce the domain conflicts and alleviate the influence of noisy samples, achieving superior performance compared with the state-of-the-art methods.
摘要:由于缺乏标签以及域间差异,在跨域设置下研究行人重识别是一个挑战。一种可取的方法是通过聚类为未标注样本分配伪标签来优化目标模型。通常,由于域差距,预训练的源域模型无法提取合适的目标域特征,这将极大地影响聚类性能和伪标签的准确性。大量的标签噪声无疑会导致次优解。为了解决这些问题,我们提出了一种包含两个模块的抗噪声学习(ANL)方法。特征分布对齐(FDA)模块旨在通过相机级对比学习和对抗自适应来聚集与身份相关的样本并分散与身份无关的样本,为聚类建立友好的跨特征基础,从而降低聚类噪声。此外,可靠样本选择(RSS)模块利用辅助模型来纠正噪声标签并为主模型选择可靠样本。为了有效利用聚类算法和RSS模块产生的离群点信息,我们在实例级别训练这些样本。实验表明,与最先进方法相比,我们提出的ANL框架可以有效减少域冲突、减轻噪声样本的影响,并取得更优的性能。
32. SparsePipe: Parallel Deep Learning for 3D Point Clouds [PDF] 返回目录
Keke Zhai, Pan He, Tania Banerjee, Anand Rangarajan, Sanjay Ranka
Abstract: We propose SparsePipe, an efficient and asynchronous parallelism approach for handling 3D point clouds with multi-GPU training. SparsePipe is built to support 3D sparse data such as point clouds. It achieves this by adopting generalized convolutions with sparse tensor representation to build expressive high-dimensional convolutional neural networks. Compared to dense solutions, the new models can efficiently process irregular point clouds without densely sliding over the entire space, significantly reducing the memory requirements and allowing higher resolutions of the underlying 3D volumes for better performance. SparsePipe exploits intra-batch parallelism that partitions input data across multiple processors and further improves the training throughput with inter-batch pipelining to overlap communication and computing. In addition, it suitably partitions the model when the GPUs are heterogeneous, such that the computing is load-balanced with reduced communication overhead. Using experimental results on an eight-GPU platform, we show that SparsePipe can parallelize effectively and obtain better performance on current point cloud benchmarks for both training and inference, compared to dense solutions.
摘要:我们提出了SparsePipe,一种用于多GPU训练下处理3D点云的高效异步并行方法。SparsePipe旨在支持点云等3D稀疏数据:它采用带稀疏张量表示的广义卷积来构建表达力强的高维卷积神经网络。与稠密方案相比,新模型可以高效处理不规则点云,而无需在整个空间上稠密滑动,从而显著降低内存需求,并允许更高分辨率的底层3D体数据以获得更好的性能。SparsePipe利用批内并行将输入数据划分到多个处理器上,并通过批间流水线使通信与计算重叠,进一步提高训练吞吐量。此外,当GPU异构时,它会合理地划分模型,使计算负载均衡并减少通信开销。基于八GPU平台的实验结果表明,与稠密方案相比,SparsePipe可以有效并行化,并在当前点云基准上获得更好的训练和推理性能。
33. Spatial Contrastive Learning for Few-Shot Classification [PDF] 返回目录
Yassine Ouali, Céline Hudelot, Myriam Tami
Abstract: Existing few-shot classification methods rely to some degree on the cross-entropy (CE) loss to learn transferable representations that facilitate test-time adaptation to unseen classes with limited data. However, the CE loss has several shortcomings, e.g., inducing representations with excessive discrimination towards seen classes, which reduces their transferability to unseen classes and results in sub-optimal generalization. In this work, we explore contrastive learning as an additional auxiliary training objective, acting as a data-dependent regularizer to promote more general and transferable features. Instead of using the standard contrastive objective, which suppresses local discriminative features, we propose a novel attention-based spatial contrastive objective to learn locally discriminative and class-agnostic features. With extensive experiments, we show that the proposed method outperforms state-of-the-art approaches, confirming the importance of learning good and transferable embeddings for few-shot learning.
摘要:现有的小样本分类方法在一定程度上依赖交叉熵(CE)损失来学习可迁移的表示,以便在数据有限的情况下在测试时适应未见类别。然而,CE损失有若干缺点,例如诱导出对已见类别过度判别的表示,这降低了其向未见类别的可迁移性,导致次优的泛化。在这项工作中,我们将对比学习作为额外的辅助训练目标,充当依赖数据的正则项,以促进更通用、更可迁移的特征。我们不使用会抑制局部判别特征的标准对比目标,而是提出一种新颖的基于注意力的空间对比目标,以学习局部判别且与类别无关的特征。通过大量实验,我们表明所提方法优于最先进方法,证实了学习良好且可迁移的嵌入对小样本学习的重要性。
34. Skeleton-DML: Deep Metric Learning for Skeleton-Based One-Shot Action Recognition [PDF] 返回目录
Raphael Memmesheimer, Simon Häring, Nick Theisen, Dietrich Paulus
Abstract: One-shot action recognition allows the recognition of human-performed actions with only a single training example. This can influence human-robot interaction positively by enabling the robot to react to previously unseen behaviour. We formulate the one-shot action recognition problem as a deep metric learning problem and propose a novel image-based skeleton representation that performs well in a metric learning setting. To that end, we train a model that projects the image representations into an embedding space. In the embedding space, similar actions have a low Euclidean distance, while dissimilar actions have a higher distance. The one-shot action recognition problem then becomes a nearest-neighbor search in a set of activity reference samples. We evaluate the performance of our proposed representation against a variety of other skeleton-based image representations. In addition, we present an ablation study that shows the influence of different embedding vector sizes, losses, and augmentation. Our approach lifts the state-of-the-art by 3.3% for the one-shot action recognition protocol on the NTU RGB+D 120 dataset under a comparable training setup. With additional augmentation, our result improves by over 7.7%.
摘要:单样本动作识别只需要一个训练示例就可以识别人类执行的动作。通过使机器人能够对以前未见过的行为做出反应,这可以对人机交互产生积极影响。我们将单样本动作识别问题表述为深度度量学习问题,并提出了一种新颖的基于图像的骨架表示,该表示在度量学习设置中表现良好。为此,我们训练了一个将图像表示投影到嵌入空间中的模型。在嵌入空间中,相似的动作具有较低的欧氏距离,而不同的动作具有较高的距离。单样本动作识别问题于是成为一组活动参考样本中的最近邻搜索。我们将所提出的表示与各种其他基于骨架的图像表示进行了性能对比。此外,我们还提供了一项消融研究,显示了不同嵌入向量大小、损失和数据增强的影响。在可比较的训练设置下,我们的方法将NTU RGB+D 120数据集上单样本动作识别协议的最新水平提升了3.3%。通过额外的数据增强,我们的结果提高了7.7%以上。
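At test time, recognition then reduces to a nearest-neighbor lookup in the learned embedding space. A minimal sketch of that final step (names are illustrative, not from the paper):

```python
import numpy as np

def one_shot_classify(query_emb, ref_embs, ref_labels):
    """Assign the label of the closest reference embedding (1-NN search).

    query_emb:  (D,) embedding of the query action sequence
    ref_embs:   (N, D) one embedding per activity reference sample
    ref_labels: (N,) class ids of the reference samples
    """
    dists = np.linalg.norm(ref_embs - query_emb, axis=1)    # Euclidean distances
    return ref_labels[int(np.argmin(dists))]
```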
35. Achieving Real-Time LiDAR 3D Object Detection on a Mobile Device [PDF] 返回目录
Pu Zhao, Wei Niu, Geng Yuan, Yuxuan Cai, Hsin-Hsuan Sung, Wujie Wen, Sijia Liu, Xipeng Shen, Bin Ren, Yanzhi Wang, Xue Lin
Abstract: 3D object detection is an important task, especially in the autonomous driving application domain. However, it is challenging to support real-time performance with the limited computation and memory resources on edge-computing devices in self-driving cars. To achieve this, we propose a compiler-aware unified framework incorporating network enhancement and pruning search with reinforcement learning techniques, to enable real-time inference of 3D object detection on resource-limited edge-computing devices. Specifically, a generator Recurrent Neural Network (RNN) is employed to provide a unified scheme for both network enhancement and pruning search automatically, without human expertise and assistance. The evaluated performance of the unified schemes can then be fed back to train the generator RNN. The experimental results demonstrate that the proposed framework is the first to achieve real-time 3D object detection on mobile devices (a Samsung Galaxy S20 phone) with competitive detection performance.
摘要:3D目标检测是一项重要任务,尤其是在自动驾驶应用领域。然而,在自动驾驶汽车的边缘计算设备上以有限的计算和内存资源支持实时性能具有挑战性。为了实现这一目标,我们提出了一个编译器感知的统一框架,该框架将网络增强和剪枝搜索与强化学习技术结合在一起,以便能够在资源受限的边缘计算设备上对3D目标检测进行实时推理。具体而言,使用生成器递归神经网络(RNN)来自动地为网络增强和剪枝搜索提供统一方案,而无需人工专业知识和协助。统一方案的评估性能可以反馈用于训练生成器RNN。实验结果表明,所提出的框架首次在移动设备(三星Galaxy S20手机)上实现了具有竞争力检测性能的实时3D目标检测。
36. An Affine moment invariant for multi-component shapes [PDF] 返回目录
Jovisa Zunic, Milos Stojmenovic
Abstract: We introduce an image-based algorithmic tool for analyzing multi-component shapes. Due to the generic concept of multi-component shapes, our method can be applied to a wide spectrum of applications where real objects are analyzed based on their shapes, i.e., on their corresponding black-and-white images. The method allocates a number to a shape, herein called a multi-component shape measure. This number/measure is invariant with respect to affine transformations and is established based on the theoretical frame developed in this paper. In addition, the method is easy to implement and is robust (e.g., with respect to noise). We provide two small but illustrative examples related to aerial image analysis and galaxy image analysis. Also, we provide some synthetic examples for a better understanding of the measure's behavior.
摘要:我们在此介绍一种用于分析多组件形状的基于图像的算法工具。由于多组件形状概念的通用性,我们的方法可应用于各种根据真实物体的形状(即其对应的黑白图像)进行分析的应用。该方法为形状分配一个数值,在此称为多组件形状度量。该数值/度量在仿射变换下保持不变,并基于本文建立的理论框架得出。另外,该方法易于实现并且稳健(例如对于噪声)。我们提供了两个与航空影像分析和星系影像分析相关的小而具有说明性的示例。此外,我们还提供了一些合成示例,以便更好地理解该度量的行为。
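For context, the classical first affine moment invariant of a single binary shape, I1 = (mu20*mu02 - mu11^2) / mu00^4, illustrates the kind of quantity involved; the paper's multi-component measure is a different construction, so treat the sketch below as background only.

```python
import numpy as np

def affine_moment_invariant(img):
    """Classical first affine moment invariant of a binary shape image:
    I1 = (mu20 * mu02 - mu11**2) / mu00**4."""
    ys, xs = np.nonzero(img)
    mu00 = float(len(xs))                 # area (number of shape pixels)
    xc, yc = xs.mean(), ys.mean()
    mu20 = ((xs - xc) ** 2).sum()         # central second-order moments
    mu02 = ((ys - yc) ** 2).sum()
    mu11 = ((xs - xc) * (ys - yc)).sum()
    return (mu20 * mu02 - mu11 ** 2) / mu00 ** 4
```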
37. Balance-Oriented Focal Loss with Linear Scheduling for Anchor Free Object Detection [PDF] 返回目录
Hopyong Gil, Sangwoo Park, Yusang Park, Wongoo Han, Juyean Hong, Juneyoung Jung
Abstract: Most existing object detectors suffer from class imbalance problems that hinder balanced performance. In particular, anchor-free object detectors have to solve the background imbalance problem caused by detection in a per-pixel prediction fashion, as well as the foreground imbalance problem, simultaneously. In this work, we propose a balance-oriented focal loss that can induce balanced learning by considering both background and foreground balance comprehensively. This work aims to address the imbalance problem when using generally imbalanced data with a non-extreme distribution (excluding the few-shot regime) and the focal loss for anchor-free object detectors. We use a batch-wise alpha-balanced variant of the focal loss to deal with this imbalance problem elaborately. It is a simple and practical solution using only re-weighting for general unbalanced data. It requires neither additional learning cost nor structural change during inference, and grouping classes is also unnecessary. Through extensive experiments, we show the performance improvement for each component and analyze the effect of linear scheduling when using re-weighting for the loss. By improving the focal loss in terms of balancing foreground classes, our method achieves an AP gain of +1.2 on MS-COCO for the anchor-free real-time detector.
摘要:大多数现有的目标检测器都存在阻碍性能均衡的类别不平衡问题。特别地,无锚目标检测器必须同时解决由逐像素预测方式的检测所引起的背景不平衡问题以及前景不平衡问题。在这项工作中,我们提出了面向平衡的焦点损失,它可以通过综合考虑背景和前景平衡来诱导平衡学习。这项工作旨在解决在使用非极端分布的一般不平衡数据(不包括少样本场景)并将焦点损失用于无锚目标检测器时的不平衡问题。我们使用焦点损失的逐批alpha平衡变体来细致地处理此不平衡问题。这是一种简单实用的解决方案,仅对一般的不平衡数据使用重新加权。它既不需要额外的学习成本,也不需要在推理过程中进行结构更改,也无需对类别进行分组。通过广泛的实验,我们展示了每个组件带来的性能提升,并分析了对损失进行重新加权时线性调度的效果。通过在平衡前景类别方面改进焦点损失,我们的方法使无锚实时检测器在MS-COCO上获得了+1.2的AP增益。
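A rough sketch of a batch-wise alpha-balanced focal loss with a linear schedule is given below. The exact weighting rule and schedule endpoints here are assumptions for illustration, not the paper's precise formulation.

```python
import torch

def batchwise_alpha_focal_loss(p, fg_mask, gamma=2.0, eps=1e-6):
    """Binary focal loss whose alpha is recomputed from the current batch.

    p:       (N,) predicted foreground probabilities for all locations
    fg_mask: (N,) boolean mask marking foreground (positive) locations
    """
    # batch-wise alpha: the rarer the foreground in this batch, the more
    # weight its locations receive
    alpha = 1.0 - fg_mask.float().mean()
    pos = -alpha * (1 - p) ** gamma * torch.log(p + eps)
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p + eps)
    return torch.where(fg_mask, pos, neg).mean()

def linear_schedule(step, total_steps, start=0.0, end=1.0):
    """Linearly ramp the re-weighting strength over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)
```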
38. Direct Quantization for Training Highly Accurate Low Bit-width Deep Neural Networks [PDF] 返回目录
Tuan Hoang, Thanh-Toan Do, Tam V. Nguyen, Ngai-Man Cheung
Abstract: This paper proposes two novel techniques to train deep convolutional neural networks with low bit-width weights and activations. First, to obtain low bit-width weights, most existing methods obtain the quantized weights by performing quantization on the full-precision network weights. However, this approach results in some mismatch: gradient descent updates the full-precision weights, but it does not update the quantized weights. To address this issue, we propose a novel method that enables direct updating of the quantized weights, with learnable quantization levels, to minimize the cost function using gradient descent. Second, to obtain low bit-width activations, existing works consider all channels equally. However, the activation quantizers could be biased toward a few channels with high variance. To address this issue, we propose a method that takes into account the quantization errors of individual channels. With this approach, we can learn activation quantizers that minimize the quantization errors in the majority of channels. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on the image classification task, using AlexNet, ResNet and MobileNetV2 architectures on the CIFAR-100 and ImageNet datasets.
摘要:本文提出了两种新颖的技术来训练具有低位宽权重和激活的深度卷积神经网络。首先,为了获得低位宽权重,大多数现有方法通过对全精度网络权重执行量化来获得量化权重。但是,这种方法会导致某些不匹配:梯度下降会更新全精度权重,但不会更新量化权重。为了解决这个问题,我们提出了一种新颖的方法,能够直接更新具有可学习量化等级的量化权重,以使用梯度下降来最小化成本函数。其次,为了获得低位宽激活,现有工作将所有通道同等对待。但是,激活量化器可能会偏向少数具有高方差的通道。为了解决这个问题,我们提出了一种考虑各个通道量化误差的方法。通过这种方法,我们可以学习使大多数通道中量化误差最小的激活量化器。实验结果表明,我们提出的方法在CIFAR-100和ImageNet数据集上使用AlexNet、ResNet和MobileNetV2体系结构,在图像分类任务上实现了最先进的性能。
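One common way to obtain gradients for quantized weights with learnable quantization levels is an LSQ-style straight-through estimator; a generic sketch follows (this illustrates the general mechanism, not necessarily the paper's exact update rule).

```python
import torch
import torch.nn as nn

def round_ste(x):
    """Round with a straight-through gradient (identity in the backward pass)."""
    return x + (torch.round(x) - x).detach()

class QuantLinear(nn.Module):
    """Linear layer whose weights are quantized with a learnable level spacing."""

    def __init__(self, in_f, out_f, bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.1)
        self.scale = nn.Parameter(torch.tensor(0.1))   # learnable quantization level
        self.qmax = 2 ** (bits - 1) - 1

    def forward(self, x):
        # forward uses quantized weights; round_ste lets gradients reach both
        # the underlying weights and the learnable scale
        q = torch.clamp(round_ste(self.weight / self.scale),
                        -self.qmax - 1, self.qmax)
        return x @ (q * self.scale).t()
```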
39. Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving [PDF] 返回目录
Hsu-kuang Chiu, Jie Li, Rares Ambrus, Jeannette Bohg
Abstract: Multi-object tracking is an important ability for an autonomous vehicle to safely navigate a traffic scene. Current state-of-the-art follows the tracking-by-detection paradigm where existing tracks are associated with detected objects through some distance metric. The key challenges to increase tracking accuracy lie in data association and track life cycle management. We propose a probabilistic, multi-modal, multi-object tracking system consisting of different trainable modules to provide robust and data-driven tracking results. First, we learn how to fuse features from 2D images and 3D LiDAR point clouds to capture the appearance and geometric information of an object. Second, we propose to learn a metric that combines the Mahalanobis and feature distances when comparing a track and a new detection in data association. And third, we propose to learn when to initialize a track from an unmatched object detection. Through extensive quantitative and qualitative results, we show that our method outperforms current state-of-the-art on the NuScenes Tracking dataset.
摘要:多目标跟踪是自动驾驶汽车安全导航交通场景的一项重要功能。当前的最新技术遵循“按检测跟踪”范例,其中现有轨道通过某种距离度量与检测到的对象相关联。提高跟踪准确性的关键挑战在于数据关联和跟踪生命周期管理。我们提出了一种概率,多模式,多对象的跟踪系统,该系统由不同的可训练模块组成,以提供可靠且由数据驱动的跟踪结果。首先,我们学习如何融合2D图像和3D LiDAR点云中的特征以捕获对象的外观和几何信息。其次,我们建议在比较轨迹和数据关联中的新检测时,学习一种结合了马氏距离和特征距离的度量。第三,我们建议从不匹配的对象检测中学习何时初始化轨道。通过广泛的定量和定性结果,我们证明了我们的方法优于NuScenes跟踪数据集上的最新技术。
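The learned association cost blends a motion term with an appearance term. A plain, unlearned version of such a combination looks as follows; the mixing weight `w` and the exact functional form are illustrative assumptions (the paper learns this combination):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance between a detection state x and a track's predicted
    state mu, with innovation covariance cov (e.g., from a Kalman filter)."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def association_cost(det_state, det_feat, trk_mu, trk_cov, trk_feat, w=0.5):
    """Blend the motion (Mahalanobis) and appearance (feature) distances."""
    d_motion = mahalanobis(det_state, trk_mu, trk_cov)
    d_appear = float(np.linalg.norm(det_feat - trk_feat))   # embedding L2 distance
    return w * d_motion + (1.0 - w) * d_appear
```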
40. Few Shot Learning With No Labels [PDF] 返回目录
Aditya Bharti, N.B. Vineeth, C.V. Jawahar
Abstract: Few-shot learners aim to recognize new categories given only a small number of training samples. The core challenge is to avoid overfitting to the limited data while ensuring good generalization to novel classes. Existing literature makes use of vast amounts of annotated data by simply shifting the label requirement from novel classes to base classes. Since data annotation is time-consuming and costly, reducing the label requirement even further is an important goal. To that end, our paper presents a more challenging few-shot setting where no label access is allowed during training or testing. By leveraging self-supervision for learning image representations and image similarity for classification at test time, we achieve competitive baselines while using zero labels, which is strictly fewer labels than the state-of-the-art. We hope that this work is a step towards developing few-shot learning methods that do not depend on annotated data at all. Our code will be publicly released.
摘要:少样本学习器旨在仅凭少量训练样本识别新类别。核心挑战是在确保对新类别良好泛化的同时,避免对有限数据的过拟合。现有文献通过简单地将标签要求从新类别转移到基类来利用大量带注释的数据。由于数据注释既费时又昂贵,因此进一步降低标签要求是一个重要目标。为此,我们的论文提出了一种更具挑战性的少样本设置:在训练或测试期间都不允许访问标签。通过利用自监督学习图像表示,并在测试时利用图像相似度进行分类,我们在使用零标签的情况下获得了有竞争力的基线,这比最新方法所需的标签更少。我们希望这项工作是朝着开发完全不依赖注释数据的少样本学习方法迈出的一步。我们的代码将公开发布。
41. Image Synthesis with Adversarial Networks: a Comprehensive Survey and Case Studies [PDF] 返回目录
Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Huiyu Zhou, Ruili Wang, M. Emre Celebi, Jie Yang
Abstract: Generative Adversarial Networks (GANs) have been extremely successful in various application domains such as computer vision, medicine, and natural language processing. Moreover, transforming an object or person into a desired shape has become a well-studied research topic in GANs. GANs are powerful models for learning complex distributions to synthesize semantically meaningful samples. However, there is a lack of comprehensive review in this field, especially a collected treatment of GAN loss variants, evaluation metrics, remedies for diverse image generation, and stable training. Given the current fast development of GANs, in this survey we provide a comprehensive review of adversarial models for image synthesis. We summarize the synthetic image generation methods and discuss the categories, including image-to-image translation, fusion image generation, label-to-image mapping, and text-to-image translation. We organize the literature based on their base models and developed ideas related to architectures, constraints, loss functions, evaluation metrics, and training datasets. We present milestones of adversarial models, review an extensive selection of previous works in various categories, and present insights on the development route from model-based to data-driven methods. Further, we highlight a range of potential future research directions. One of the unique features of this review is that all software implementations of these GAN methods and datasets have been collected and made available in one place at this https URL.
摘要:生成对抗网络(GANs)在计算机视觉,医学和自然语言处理等各种应用领域中都非常成功。此外,将物体或人转变为所需形状已成为GAN中经过充分研究的研究。 GAN是用于学习复杂分布以合成语义上有意义的样本的强大模型。但是,该领域缺乏全面的审查,尤其是缺少GAN损耗变量,评估指标,用于生成多种图像的补救措施以及稳定的培训的集合。鉴于当前GAN的快速发展,在本次调查中,我们对图像合成的对抗模型进行了全面回顾。我们总结了合成图像生成方法,并讨论了包括图像到图像翻译,融合图像生成,标签到图像映射以及文本到图像翻译的类别。我们根据其基本模型,与体系结构,约束,损失函数,评估指标和训练数据集有关的发展思路组织文献。我们介绍了对抗性模型的里程碑,回顾了各种类别的以前的著作,并提出了从基于模型的方法到数据驱动方法的发展路线的见解。此外,我们重点介绍了一系列潜在的未来研究方向。该评论的独特功能之一是,这些GAN方法和数据集的所有软件实现都已被收集并在此https URL的一个位置提供。
42. Faster and Accurate Compressed Video Action Recognition Straight from the Frequency Domain [PDF] 返回目录
Samuel Felipe dos Santos, Jurandy Almeida
Abstract: Human action recognition has become one of the most active fields of research in computer vision due to its wide range of applications, such as surveillance, medicine, industrial environments, and smart homes, among others. Recently, deep learning has been successfully used to learn powerful and interpretable features for recognizing human actions in videos. Most existing deep learning approaches have been designed for processing video information as RGB image sequences. For this reason, a preliminary decoding process is required, since video data are often stored in a compressed format. However, a high computational load and memory usage are demanded for decoding a video. To overcome this problem, we propose a deep neural network capable of learning straight from compressed video. Our approach was evaluated on two public benchmarks, the UCF-101 and HMDB-51 datasets, demonstrating comparable recognition performance to state-of-the-art methods, with the advantage of running up to 2 times faster in terms of inference speed.
摘要:由于人类动作识别的广泛应用,例如监视,医疗,工业环境,智能家居等,已成为计算机视觉研究中最活跃的领域之一。最近,深度学习已成功用于学习强大和可解释的功能,以识别视频中的人类动作。大多数现有的深度学习方法都已设计为将视频信息处理为RGB图像序列。因此,由于视频数据通常以压缩格式存储,因此需要初步解码过程。然而,需要高计算量和存储器使用来解码视频。为了克服这个问题,我们提出了一种能够直接从压缩视频中学习的深度神经网络。我们的方法在两个公开基准(UCF-101和HMDB-51数据集)上进行了评估,展示了与最新方法可比的识别性能,其推理速度提高了2倍。
43. Assigning Apples to Individual Trees in Dense Orchards using 3D Color Point Clouds [PDF] 返回目录
Mouad Zine-El-Abidine, Helin Dutagaci, Gilles Galopin, David Rousseau
Abstract: We propose a 3D color point cloud processing pipeline to count apples on individual apple trees in trellis-structured orchards. Fruit counting at the tree level requires separating trees, which is challenging in dense orchards. We employ point clouds acquired from the leaf-off orchard in winter, when the branch structure is visible, to delineate tree crowns. We localize apples in point clouds acquired during the harvest period. Alignment of the two point clouds enables mapping apple locations to the delineated winter cloud and assigning each apple to its bearing tree. Our apple assignment method achieves an accuracy rate higher than 95%. In addition to presenting a first proof of feasibility, we also provide suggestions for further improvement of our apple assignment pipeline.
摘要:我们提出了一种3D彩色点云处理流水线,用于对棚架结构果园中单棵苹果树上的苹果进行计数。在单棵树的层面上数果实需要分离树木,这在密集的果园中具有挑战性。我们使用在冬季落叶期(此时树枝结构可见)从果园采集的点云来描绘树冠。我们在收获期采集的点云中定位苹果。通过将两个点云对齐,可以将苹果位置映射到所描绘的冬季点云,并将每个苹果分配给其所属果树。我们的苹果分配方法可达到95%以上的准确率。除了提供第一个可行性证明之外,我们还提供了进一步改进苹果分配流水线的建议。
44. Learning Inter- and Intra-frame Representations for Non-Lambertian Photometric Stereo [PDF] 返回目录
Yanlong Cao, Binjie Ding, Zewei He, Jiangxin Yang, Jingxi Chen, Yanpeng Cao, Xin Li
Abstract: In this paper, we build a two-stage Convolutional Neural Network (CNN) architecture to construct inter- and intra-frame representations based on an arbitrary number of images captured under different light directions, performing accurate normal estimation of non-Lambertian objects. We experimentally investigate numerous network design alternatives to identify the optimal scheme for deploying inter-frame and intra-frame feature extraction modules for the photometric stereo problem. Moreover, we propose to utilize the easily obtained object mask to eliminate adverse interference from invalid background regions in intra-frame spatial convolutions, thus effectively improving the accuracy of normal estimation for surfaces made of dark materials or with cast shadows. Experimental results demonstrate that the proposed masked two-stage photometric stereo CNN model (MT-PS-CNN) performs favorably against state-of-the-art photometric stereo techniques in terms of both accuracy and efficiency. In addition, the proposed method is capable of predicting accurate and rich surface normal details for non-Lambertian objects of complex geometry, and performs stably on inputs captured under both sparse and dense lighting distributions.
摘要:在本文中,我们建立了一个两阶段卷积神经网络(CNN)架构,以基于在不同光方向下捕获的任意数量的图像构造帧间和帧内表示,对非朗伯对象进行准确的法线估计。我们通过实验研究了许多网络设计替代方案,以确定用于为光度立体问题部署帧间和帧内特征提取模块的最佳方案。此外,我们提出利用容易获得的目标蒙版消除帧内空间卷积中无效背景区域的不利干扰,从而有效地提高由深色材料或带有阴影的表面进行法线估计的准确性。实验结果表明,在准确性和效率方面,所提出的掩蔽式两级光度立体CNN模型(MT-PS-CNN)优于最新的光度立体技术。另外,所提出的方法能够预测复杂几何形状的非朗伯对象的准确且丰富的表面法线细节,并稳定地执行在稀疏和密集照明分布中捕获的给定输入。
45. Hybrid and Non-Uniform quantization methods using retro synthesis data for efficient inference [PDF] 返回目录
Tej pratap GVSL, Raja Kumar
Abstract: Existing quantization-aware training methods attempt to compensate for the quantization loss by leveraging training data, as do most post-training quantization methods, and are also time-consuming. Neither family of methods is effective for privacy-constrained applications, as they are tightly coupled with training data. In contrast, this paper proposes a data-independent post-training quantization scheme that eliminates the need for training data. This is achieved by generating a faux dataset, hereafter referred to as Retro-Synthesis Data, from the FP32 model layer statistics and further using it for quantization. This approach outperformed state-of-the-art methods, including but not limited to ZeroQ and DFQ, on models with and without Batch-Normalization layers for 8, 6, and 4 bit precisions on the ImageNet and CIFAR-10 datasets. We also introduce two futuristic variants of post-training quantization methods, namely Hybrid Quantization and Non-Uniform Quantization.
摘要:现有的量化感知训练方法试图像大多数训练后量化方法一样,通过利用训练数据来补偿量化损失,而且也很耗时。这两类方法对于受隐私约束的应用均无效,因为它们与训练数据紧密耦合。相比之下,本文提出了一种与数据无关的训练后量化方案,该方案无需训练数据。这是通过从FP32模型层统计信息生成伪数据集(以下称为Retro-Synthesis数据)并进一步将其用于量化来实现的。在ImageNet和CIFAR-10数据集上,在8、6和4位精度下,此方法在带有和不带有批归一化层的模型上均优于包括但不限于ZeroQ和DFQ的最新方法。我们还介绍了训练后量化方法的两种面向未来的变体,即混合量化(Hybrid Quantization)和非均匀量化(Non-Uniform Quantization)。
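Generating a faux dataset from FP32 layer statistics typically means optimizing random inputs until internal activation statistics match the stored BatchNorm running means and variances. The sketch below follows that general ZeroQ-style recipe and is an assumption about the mechanism, not the authors' exact procedure.

```python
import torch
import torch.nn as nn

def synthesize_batch(model, shape=(32, 3, 224, 224), steps=200, lr=0.1):
    """Optimize random noise so that activation statistics at every BatchNorm
    layer match the stored running mean/variance (no real data needed)."""
    model.eval()
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    bns = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    acts = []
    hooks = [m.register_forward_hook(
        lambda mod, inp, out, store=acts: store.append(inp[0])) for m in bns]
    for _ in range(steps):
        acts.clear()
        opt.zero_grad()
        model(x)
        loss = x.new_zeros(())
        for bn, a in zip(bns, acts):
            mean, var = a.mean(dim=(0, 2, 3)), a.var(dim=(0, 2, 3))
            loss = loss + ((mean - bn.running_mean) ** 2).sum() \
                        + ((var - bn.running_var) ** 2).sum()
        loss.backward()
        opt.step()
    for h in hooks:
        h.remove()
    return x.detach()
```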
46. 2-D Respiration Navigation Framework for 3-D Continuous Cardiac Magnetic Resonance Imaging [PDF] 返回目录
Elisabeth Hoppe, Jens Wetzl, Philipp Roser, Lina Felsner, Alexander Preuhs, Andreas Maier
Abstract: Continuous protocols for cardiac magnetic resonance imaging enable sampling of the cardiac anatomy simultaneously resolved into cardiac phases. To avoid respiration artifacts, associated motion during the scan has to be compensated for during reconstruction. In this paper, we propose a sampling adaption to acquire 2-D respiration information during a continuous scan. Further, we develop a pipeline to extract the different respiration states from the acquired signals, which are used to reconstruct data from one respiration phase. Our results show the benefit of the proposed workflow on the image quality compared to no respiration compensation, as well as a previous 1-D respiration navigation approach.
摘要:连续的心脏磁共振成像协议可以对同时分解为心脏相位的心脏解剖结构进行采样。 为了避免呼吸伪像,必须在重建期间补偿扫描期间的相关运动。 在本文中,我们提出了在连续扫描过程中获取二维呼吸信息的采样适应方法。 此外,我们开发了一种管道,用于从采集的信号中提取不同的呼吸状态,这些信号用于从一个呼吸阶段重建数据。 我们的结果表明,与无呼吸补偿相比,拟议的工作流程对图像质量的好处,以及以前的一维呼吸导航方法。
47. TSGCNet: Discriminative Geometric Feature Learning with Two-Stream GraphConvolutional Network for 3D Dental Model Segmentation [PDF] 返回目录
Lingming Zhang, Yue Zhao, Deyu Meng, Zhiming Cui, Chenqiang Gao, Xinbo Gao, Chunfeng Lian, Dinggang Shen
Abstract: The ability to segment teeth precisely from digitized 3D dental models is an essential task in computer-aided orthodontic surgical planning. To date, deep learning based methods have been popularly used to handle this task. State-of-the-art methods directly concatenate the raw attributes of 3D inputs, namely coordinates and normal vectors of mesh cells, to train a single-stream network for fully-automated tooth segmentation. This, however, has the drawback of ignoring the different geometric meanings provided by those raw attributes. This issue might possibly confuse the network in learning discriminative geometric features and result in many isolated false predictions on the dental model. Against this issue, we propose a two-stream graph convolutional network (TSGCNet) to learn multi-view geometric information from different geometric attributes. Our TSGCNet adopts two graph-learning streams, designed in an input-aware fashion, to extract more discriminative high-level geometric representations from coordinates and normal vectors, respectively. These feature representations learned from the designed two different streams are further fused to integrate the multi-view complementary information for the cell-wise dense prediction task. We evaluate our proposed TSGCNet on a real-patient dataset of dental models acquired by 3D intraoral scanners, and experimental results demonstrate that our method significantly outperforms state-of-the-art methods for 3D shape segmentation.
摘要:从数字化3D牙齿模型精确分割牙齿的能力是计算机辅助正畸外科手术计划中的一项基本任务。迄今为止,基于深度学习的方法已广泛用于处理此任务。最先进的方法直接将3D输入的原始属性(即网格单元的坐标和法线向量)串联起来,以训练用于全自动牙齿分割的单流网络。但是,这具有忽略那些原始属性提供的不同几何含义的缺点。此问题可能会使网络在学习区分几何特征时感到困惑,并导致对牙齿模型进行许多孤立的错误预测。针对此问题,我们提出了一种两流图卷积网络(TSGCNet),以从不同的几何属性中学习多视图几何信息。我们的TSGCNet采用了两种以输入感知方式设计的图形学习流,分别从坐标和法向矢量中提取出更具区分性的高级几何表示。从设计的两个不同流中学习到的这些特征表示将进一步融合,以集成用于单元级密集预测任务的多视图互补信息。我们在由3D口腔扫描仪获得的牙科模型的真实患者数据集上评估了我们提出的TSGCNet,实验结果表明,我们的方法明显优于3D形状分割的最新方法。
48. Sparse Adversarial Attack to Object Detection [PDF] 返回目录
Jiayu Bao
Abstract: Adversarial examples have gained a great deal of attention in recent years. Many adversarial attacks have been proposed to attack image classifiers, but few works shift attention to object detectors. In this paper, we propose the Sparse Adversarial Attack (SAA), which enables adversaries to perform an effective evasion attack on detectors with a bounded l0-norm perturbation. We select the fragile positions of the image and design an evasion loss function for the task. Experimental results on YOLOv4 and Faster R-CNN reveal the effectiveness of our method. In addition, our SAA shows great transferability across different detectors in the black-box attack setting. Code is available at this https URL.
摘要:近年来,对抗样本引起了广泛关注。人们已经提出了许多针对图像分类器的对抗攻击,但很少有工作将注意力转移到目标检测器上。在本文中,我们提出了稀疏对抗攻击(SAA),它使攻击者能够以有界的l0范数扰动对检测器执行有效的逃避攻击。我们选择图像的脆弱位置,并为该任务设计了逃避损失函数。在YOLOv4和Faster R-CNN上的实验结果证明了我们方法的有效性。此外,我们的SAA在黑盒攻击设置中显示出跨不同检测器的出色可迁移性。代码可在 this https URL 获取。
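An l0 bound can be enforced by restricting the perturbation to the k most sensitive pixels. A toy single-step sketch (the budget `k`, step rule, and saliency choice are illustrative; the actual SAA algorithm differs):

```python
import torch

def sparse_attack_step(x, grad, k=200, step=0.05):
    """One l0-constrained update: perturb only the k most sensitive pixels.

    x:    (C, H, W) input image in [0, 1]
    grad: (C, H, W) gradient of the evasion loss w.r.t. x
    """
    saliency = grad.abs().sum(dim=0)                        # (H, W) per-pixel score
    idx = saliency.flatten().topk(k).indices                # k most fragile pixels
    mask = torch.zeros(saliency.numel(), device=x.device)
    mask = mask.scatter_(0, idx, 1.0).view_as(saliency)     # l0 support mask
    x_adv = x + step * grad.sign() * mask                   # broadcast over channels
    return x_adv.clamp(0.0, 1.0)
```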
49. One-Shot Object Localization Using Learnt Visual Cues via Siamese Networks [PDF] 返回目录
Sagar Gubbi Venkatesh, Bharadwaj Amrutur
Abstract: A robot that can operate in novel and unstructured environments must be capable of recognizing new, previously unseen, objects. In this work, a visual cue is used to specify a novel object of interest which must be localized in new environments. An end-to-end neural network equipped with a Siamese network is used to learn the cue, infer the object of interest, and then to localize it in new environments. We show that a simulated robot can pick-and-place novel objects pointed to by a laser pointer. We also evaluate the performance of the proposed approach on a dataset derived from the Omniglot handwritten character dataset and on a small dataset of toys.
摘要:可以在新颖和非结构化环境中运行的机器人必须能够识别以前看不见的新物体。 在这项工作中,使用视觉提示来指定必须在新环境中定位的新颖感兴趣的对象。 配备了暹罗网络的端到端神经网络用于学习提示,推断感兴趣的对象,然后将其定位在新的环境中。 我们证明了模拟机器人可以拾取和放置激光指示器指向的新颖物体。 我们还在从Omniglot手写字符数据集和小型玩具数据集获得的数据集上评估了该方法的性能。
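The matching step of such a Siamese localizer can be pictured as embedding the cue patch and candidate windows with shared weights and scoring by cosine similarity; a toy sliding-window sketch, where `embed` is an assumed shared-weight network:

```python
import torch
import torch.nn.functional as F

def localize(embed, cue_patch, scene, win=64, stride=16):
    """Slide a window over the scene; return the best-matching top-left corner.

    embed:     shared-weight network mapping (1, C, win, win) -> (1, D)
    cue_patch: (C, win, win) crop around the visual cue
    scene:     (C, H, W) image of the new environment
    """
    z_cue = F.normalize(embed(cue_patch.unsqueeze(0)), dim=1)
    best_score, best_xy = float('-inf'), (0, 0)
    _, H, W = scene.shape
    for y0 in range(0, H - win + 1, stride):
        for x0 in range(0, W - win + 1, stride):
            patch = scene[:, y0:y0 + win, x0:x0 + win].unsqueeze(0)
            z = F.normalize(embed(patch), dim=1)
            score = (z_cue * z).sum().item()                # cosine similarity
            if score > best_score:
                best_score, best_xy = score, (x0, y0)
    return best_xy, best_score
```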
50. Dual-Refinement: Joint Label and Feature Refinement for Unsupervised Domain Adaptive Person Re-Identification [PDF] 返回目录
Yongxing Dai, Jun Liu, Yan Bai, Zekun Tong, Ling-Yu Duan
Abstract: Unsupervised domain adaptive (UDA) person re-identification (re-ID) is a challenging task due to the absence of labels for the target-domain data. To handle this problem, some recent works adopt clustering algorithms to generate pseudo labels off-line, which can then be used as the supervision signal for on-line feature learning in the target domain. However, the off-line generated labels often contain lots of noise that significantly hinders the discriminability of the on-line learned features, and thus limits the final UDA re-ID performance. To this end, we propose a novel approach, called Dual-Refinement, that jointly refines pseudo labels at the off-line clustering phase and features at the on-line training phase, to alternately boost the label purity and feature discriminability in the target domain for more reliable re-ID. Specifically, at the off-line phase, a new hierarchical clustering scheme is proposed, which selects representative prototypes for every coarse cluster. Thus, labels can be effectively refined by using the inherent hierarchical information of person images. Besides, at the on-line phase, we propose an instant memory spread-out (IM-spread-out) regularization, which takes advantage of the proposed instant memory bank to store sample features of the entire dataset and enables spread-out feature learning over the entire training data instantly. Our Dual-Refinement method reduces the influence of noisy labels and refines the learned features within the alternating training process. Experiments demonstrate that our method outperforms the state-of-the-art methods by a large margin.
摘要:由于缺少目标域数据的标签,因此无监督域自适应(UDA)人员重新识别(re-ID)是一项具有挑战性的任务。为了解决这个问题,最近的一些工作采用聚类算法离线生成伪标签,然后将其用作目标域中在线特征学习的监督信号。但是,脱机生成的标签通常包含大量噪声,这些噪声显着阻碍了在线学习功能的可分辨性,从而限制了最终的UDA re-ID性能。为此,我们提出了一种称为双重精化的新方法,该方法可以在离线聚类阶段和在线训练阶段共同完善伪标签,以替代地提高目标域中的标签纯度和特征可分辨性以获得更可靠的re-ID。具体来说,在离线阶段,提出了一种新的层次聚类方案,该方案为每个粗聚类选择代表性的原型。因此,可以通过使用人图像的固有分层信息来有效地完善标签。此外,在在线阶段,我们提出了即时存储扩展(IM-spread-out)正则化方法,该规范化利用建议的即时存储库来存储整个数据集的样本特征并实现扩展特征学习即时了解整个训练数据。我们的双重精炼方法减少了噪音标签的影响,并在替代训练过程中完善了学习到的功能。实验表明,我们的方法在很大程度上优于最新方法。
51. PaXNet: Dental Caries Detection in Panoramic X-ray using Ensemble Transfer Learning and Capsule Classifier [PDF] 返回目录
Arman Haghanifar, Mahdiyar Molahasani Majdabadi, Seok-Bum Ko
Abstract: Dental caries is one of the most prevalent chronic diseases, involving the majority of the population during their lifetime. Caries lesions are typically diagnosed by radiologists relying only on visual inspection of dental x-rays. In many cases, dental caries is hard to identify using x-rays and can be misinterpreted as shadows for various reasons, such as low image quality. Hence, developing a decision support system for caries detection has been a topic of interest in recent years. Here, we propose an automatic diagnosis system to detect dental caries in panoramic images for the first time, to the best of the authors' knowledge. The proposed model benefits from various pretrained deep learning models through transfer learning to extract relevant features from x-rays, and uses a capsule network to draw prediction results. On a dataset of 470 panoramic images used for feature extraction, including 240 labeled images for classification, our model achieved an accuracy score of 86.05% on the test set. The obtained score demonstrates acceptable detection performance and an increase in caries detection speed, as long as the challenges of using panoramic x-rays of real patients are taken into account. Among images with caries lesions in the test set, our model acquired recall scores of 69.44% and 90.52% for mild and severe lesions respectively, confirming that severe caries spots are more straightforward to detect, while efficient mild caries detection needs a more robust and larger dataset. Considering the novelty of the current research in using panoramic images, this work is a step towards developing a fully automated, efficient decision support system to assist domain experts.
摘要:龋齿是最常见的慢性疾病之一,大多数人一生中都会受到影响。龋损通常由放射科医生仅依靠对牙科X射线的目视检查来诊断。在许多情况下,龋齿很难通过X射线识别,并且由于图像质量低等不同原因而可能被误认为是阴影。因此,近年来,开发用于龋齿检测的决策支持系统成为人们关注的话题。在这里,据作者所知,我们首次提出了一种检测全景图像中龋齿的自动诊断系统。所提出的模型通过迁移学习受益于各种预训练的深度学习模型,以从X射线中提取相关特征,并使用胶囊网络得出预测结果。在用于特征提取的470张全景图像的数据集(包括用于分类的240张标记图像)上,我们的模型在测试集上的准确率为86.05%。只要考虑到使用真实患者的全景X射线所带来的挑战,所获得的分数就表明了可接受的检测性能和龋齿检测速度的提高。在测试集中具有龋损的图像中,我们的模型对轻度和重度龋齿的召回率分别为69.44%和90.52%,证实了重度龋斑更容易检测,而高效的轻度龋齿检测需要更强大和更大的数据集。考虑到当前研究使用全景图像的新颖性,这项工作是朝着开发全自动高效决策支持系统以协助领域专家迈出的一步。
52. Coarse to Fine: Multi-label Image Classification with Global/Local Attention [PDF] 返回目录
Fan Lyu, Fuyuan Hu, Victor S. Sheng, Zhengtian Wu, Qiming Fu, Baochuan Fu
Abstract: In our daily life, the scenes around us almost always carry multiple labels, especially in a smart city, e.g., when recognizing information about city operations for response and control. Great efforts have been made to recognize multi-label images using deep neural networks. Since multi-label image classification is very complicated, attention mechanisms are often used to guide the classification process. However, conventional attention-based methods analyze images directly and aggressively, which makes it difficult for them to understand complicated scenes well. In this paper, we propose a global/local attention method that recognizes an image from coarse to fine by mimicking how human beings observe images. Specifically, our global/local attention method first concentrates on the whole image, and then focuses on specific local objects in the image. We also propose a joint max-margin objective function, which enforces that the minimum score of the positive labels is larger than the maximum score of the negative labels, both horizontally and vertically. This function further improves our multi-label image classification method. We evaluate the effectiveness of our method on two popular multi-label image datasets (Pascal VOC and MS-COCO). Our experimental results show that our method outperforms state-of-the-art methods.
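The core constraint of the max-margin objective (minimum positive score above maximum negative score) is easy to state as a hinge loss. The sketch below implements one plausible per-image reading of it in PyTorch; the margin value and the exact handling of the paper's horizontal and vertical directions are assumptions.

```python
import torch

def max_margin_loss(scores, labels, margin=1.0):
    """Hinge loss pushing min positive score above max negative score + margin.

    scores: (B, C) raw label scores; labels: (B, C) binary multi-label targets.
    Only the per-image direction is shown; a per-label term would apply the
    same idea along dim=0.
    """
    pos = scores.masked_fill(labels == 0, float('inf')).min(dim=1).values
    neg = scores.masked_fill(labels == 1, float('-inf')).max(dim=1).values
    return torch.clamp(margin + neg - pos, min=0).mean()
```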
53. Detecting Road Obstacles by Erasing Them [PDF] 返回目录
Krzysztof Lis, Sina Honari, Pascal Fua, Mathieu Salzmann
Abstract: Vehicles can encounter a myriad of obstacles on the road, and it is not feasible to record them all beforehand to train a detector. Our method selects image patches and inpaints them with the surrounding road texture, which tends to remove obstacles from those patches. It then uses a network trained to recognize discrepancies between the original patch and the inpainted one, which signal an erased obstacle. We also contribute a new dataset for monocular road obstacle detection, and show that our approach outperforms the state-of-the-art methods on both our new dataset and the standard Fishyscapes Lost & Found benchmark.
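The erase-and-compare idea can be prototyped with classical tools. The sketch below substitutes OpenCV's Telea inpainting for the paper's learned inpainter and a raw pixel difference for its trained discrepancy network, so it illustrates the shape of the pipeline rather than the method itself.

```python
import cv2
import numpy as np

def obstacle_score(image_bgr: np.ndarray, patch_mask: np.ndarray) -> float:
    """patch_mask: uint8 mask, 255 inside the candidate patch."""
    # Fill the patch from its surroundings; obstacles tend to be erased.
    inpainted = cv2.inpaint(image_bgr, patch_mask, inpaintRadius=5,
                            flags=cv2.INPAINT_TELEA)
    # Large reconstruction error inside the patch suggests an erased obstacle.
    diff = cv2.absdiff(image_bgr, inpainted).mean(axis=2)
    return float(diff[patch_mask > 0].mean())
```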
54. A Simple Fine-tuning Is All You Need: Towards Robust Deep Learning Via Adversarial Fine-tuning [PDF] 返回目录
Ahmadreza Jeddi, Mohammad Javad Shafiee, Alexander Wong
Abstract: Adversarial Training (AT) with Projected Gradient Descent (PGD) is an effective approach for improving the robustness of deep neural networks. However, PGD AT has been shown to suffer from two main limitations: i) high computational cost, and ii) extreme overfitting during training that leads to reduced model generalization. While the effect of factors such as model capacity and the scale of training data on adversarial robustness has been extensively studied, little attention has been paid to the effect of a very important parameter in every network optimization on adversarial robustness: the learning rate. In particular, we hypothesize that effective learning rate scheduling during adversarial training can significantly reduce the overfitting issue, to a degree where one does not even need to adversarially train a model from scratch but can instead simply adversarially fine-tune a pre-trained model. Motivated by this hypothesis, we propose a simple yet very effective adversarial fine-tuning approach based on a $\textit{slow start, fast decay}$ learning rate scheduling strategy which not only significantly decreases the computational cost required, but also greatly improves the accuracy and robustness of a deep neural network. Experimental results show that the proposed adversarial fine-tuning approach outperforms the state-of-the-art methods on the CIFAR-10, CIFAR-100 and ImageNet datasets in both test accuracy and robustness, while reducing the computational cost by 8-10$\times$. Furthermore, a very important benefit of the proposed adversarial fine-tuning approach is that it enables improving the robustness of any pre-trained deep neural network without needing to train the model from scratch, which to the best of the authors' knowledge has not been previously demonstrated in the research literature.
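One plausible reading of a slow-start, fast-decay schedule is a short warm-up followed by a steep exponential drop, as sketched below; the exact shape and constants used in the paper are not reproduced here and all values are placeholders.

```python
def slow_start_fast_decay(step: int, warmup_steps: int = 500,
                          base_lr: float = 0.01, decay: float = 0.9,
                          decay_every: int = 100) -> float:
    """Learning rate for a given optimizer step (illustrative constants)."""
    if step < warmup_steps:                        # slow start: linear warm-up
        return base_lr * (step + 1) / warmup_steps
    n = (step - warmup_steps) // decay_every       # fast decay: exponential drop
    return base_lr * (decay ** n)
```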
55. Inception Convolution with Efficient Dilation Search [PDF] 返回目录
Jie Liu, Chuming Li, Feng Liang, Chen Lin, Ming Sun, Junjie Yan, Wanli Ouyang, Dong Xu
Abstract: Dilated convolution is a critical variant of the standard convolutional neural network for controlling effective receptive fields and handling large scale variance of objects without introducing additional computation. However, fitting the effective receptive field to data with dilated convolution is rarely discussed in the literature. To fully explore its potential, we propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilations across different axes, channels and layers. To find a practical method for fitting the complex inception convolution to the data, we develop a simple yet effective dilation search algorithm (EDO) based on statistical optimization. The search method operates in a zero-cost manner and is extremely fast to apply to large-scale datasets. Empirical results reveal that our method obtains consistent performance gains on an extensive range of benchmarks. For instance, by simply replacing the 3x3 standard convolutions in the ResNet-50 backbone with inception convolutions, we improve the mAP of Faster-RCNN on MS-COCO from 36.4% to 39.2%. Furthermore, using the same replacement in the ResNet-101 backbone, we achieve a large improvement in AP score, from 60.2% to 68.5%, on COCO val2017 for bottom-up human pose estimation.
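A minimal form of an inception (dilated) convolution splits the output channels across parallel 3x3 branches, each with its own (height, width) dilation, and concatenates the results, giving channel groups independent dilations within one layer. The dilation pattern below is a placeholder for what the paper's EDO search would select.

```python
import torch
import torch.nn as nn

class InceptionDilatedConv(nn.Module):
    """Parallel 3x3 branches with independent per-axis dilations."""
    def __init__(self, in_ch, out_ch,
                 dilations=((1, 1), (1, 2), (2, 1), (2, 2))):
        super().__init__()
        assert out_ch % len(dilations) == 0
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch // len(dilations), kernel_size=3,
                      padding=d, dilation=d)   # padding=dilation keeps spatial size
            for d in dilations)

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)
```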
56. Camouflaged Object Detection and Tracking: A Survey [PDF] 返回目录
Ajoy Mondal
Abstract: Moving object detection and tracking have various applications, including surveillance, anomaly detection, and vehicle navigation. The literature on object detection and tracking is rich, and several essential survey papers exist. However, research on camouflaged object detection and tracking is limited due to the complexity of the problem. Existing work on this problem has been based on either the biological characteristics of camouflaged objects or computer vision techniques. In this article, we review existing camouflaged object detection and tracking techniques that use computer vision algorithms from a theoretical point of view. This article also addresses several issues of interest as well as future research directions in this area. We hope this review will help the reader learn about recent advances in camouflaged object detection and tracking.
57. Revisiting Edge Detection in Convolutional Neural Networks [PDF] 返回目录
Minh Le, Subhradeep Kayal
Abstract: The ability to detect edges is a fundamental attribute necessary to truly capture visual concepts. In this paper, we prove that edges cannot be represented properly in the first convolutional layer of a neural network, and further show that they are poorly captured in popular neural network architectures such as VGG-16 and ResNet. The neural networks are found to rely on color information, which might vary in unexpected ways outside of the datasets used for their evaluation. To improve their robustness, we propose edge-detection units and show that they reduce performance loss and generate qualitatively different representations. By comparing various models, we show that the robustness of edge detection is an important factor contributing to the robustness of models against color noise.
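A concrete way to build edge sensitivity into a network is a fixed convolutional unit carrying classical edge filters. The sketch below freezes Sobel kernels inside a conv layer; it illustrates the general idea of an edge-detection unit, and the paper's actual unit design may differ.

```python
import torch
import torch.nn as nn

# Sobel kernels for horizontal and vertical gradients.
sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
sobel_y = sobel_x.t()

edge_unit = nn.Conv2d(1, 2, kernel_size=3, padding=1, bias=False)
edge_unit.weight.data = torch.stack([sobel_x, sobel_y]).unsqueeze(1)  # (2,1,3,3)
edge_unit.weight.requires_grad = False   # fixed, not learned

gray = torch.randn(1, 1, 64, 64)         # single-channel input, e.g. luminance
edge_maps = edge_unit(gray)              # horizontal and vertical edge responses
```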
58. Implicit Feature Pyramid Network for Object Detection [PDF] 返回目录
Tiancai Wang, Xiangyu Zhang, Jian Sun
Abstract: In this paper, we present an implicit feature pyramid network (i-FPN) for object detection. Existing FPNs stack several cross-scale blocks to obtain a large receptive field. We propose to use an implicit function, recently introduced in the deep equilibrium model (DEQ), to model the transformation of the FPN. We develop a residual-like iteration to update the hidden states efficiently. Experimental results on the MS COCO dataset show that i-FPN can significantly boost detection performance compared to baseline detectors with ResNet-50-FPN: +3.4, +3.2, +3.5, +4.2 and +3.2 mAP on RetinaNet, Faster-RCNN, FCOS, ATSS and AutoAssign, respectively.
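The DEQ-style idea is to replace a stack of blocks with one transform iterated to a fixed point. A minimal sketch of such a residual-like iteration is below; f stands for any module mapping (hidden, input) to a same-shaped hidden state, and training a real DEQ would additionally use implicit differentiation, which is omitted here.

```python
import torch

def fixed_point(f, x, max_iter=30, tol=1e-4):
    """Iterate h <- h + f(h, x) until the hidden state stops changing."""
    h = torch.zeros_like(x)
    for _ in range(max_iter):
        h_next = h + f(h, x)                       # residual-like update
        if (h_next - h).norm() < tol * (h.norm() + 1e-8):
            return h_next
        h = h_next
    return h
```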
59. Generative VoxelNet: Learning Energy-Based Models for 3D Shape Synthesis and Analysis [PDF] 返回目录
Jianwen Xie, Zilong Zheng, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu
Abstract: 3D data that contains rich geometry information of objects and scenes is valuable for understanding the 3D physical world. With the recent emergence of large-scale 3D datasets, it has become increasingly crucial to have a powerful 3D generative model for 3D shape synthesis and analysis. This paper proposes a deep 3D energy-based model to represent volumetric shapes. The maximum likelihood training of the model follows an "analysis by synthesis" scheme. The benefits of the proposed model are six-fold: first, unlike GANs and VAEs, the model training does not rely on any auxiliary models; second, the model can synthesize realistic 3D shapes by Markov chain Monte Carlo (MCMC); third, the conditional model can be applied to 3D object recovery and super-resolution; fourth, the model can serve as a building block in a multi-grid modeling and sampling framework for high-resolution 3D shape synthesis; fifth, the model can be used to train a 3D generator via MCMC teaching; sixth, the unsupervisedly trained model provides a powerful feature extractor for 3D data, which is useful for 3D object classification. Experiments demonstrate that the proposed model can generate high-quality 3D shape patterns and is useful for a wide variety of 3D shape analyses.
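The synthesis half of analysis-by-synthesis is MCMC sampling from the learned energy. The sketch below shows short-run Langevin dynamics for a generic differentiable energy over voxel grids; the step count, step size and noise scale are illustrative, not the paper's settings.

```python
import torch

def langevin_sample(energy, shape, steps=60, step_size=0.01):
    """Draw a sample by noisy gradient descent on the energy landscape."""
    x = torch.randn(shape, requires_grad=True)   # e.g. (B, 1, 32, 32, 32) voxels
    for _ in range(steps):
        grad, = torch.autograd.grad(energy(x).sum(), x)
        with torch.no_grad():                    # Langevin update step
            x += -0.5 * step_size ** 2 * grad + step_size * torch.randn_like(x)
    return x.detach()
```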
60. A Cascaded Residual UNET for Fully Automated Segmentation of Prostate and Peripheral Zone in T2-weighted 3D Fast Spin Echo Images [PDF] 返回目录
Lavanya Umapathy, Wyatt Unger, Faryal Shareef, Hina Arif, Diego Martin, Maria Altbach, Ali Bilgin
Abstract: Multi-parametric MR images have been shown to be effective in the non-invasive diagnosis of prostate cancer. Automated segmentation of the prostate eliminates the need for time-consuming manual annotation by a radiologist. This improves efficiency in the extraction of imaging features for the characterization of prostate tissues. In this work, we propose a fully automated cascaded deep learning architecture with residual blocks, Cascaded MRes-UNET, for segmentation of the prostate gland and the peripheral zone in one pass through the network. The network yields high Dice scores ($0.91\pm.02$), precision ($0.91\pm.04$), and recall scores ($0.92\pm.03$) in prostate segmentation compared to manual annotations by an experienced radiologist. The average difference in total prostate volume estimation is less than 5%.
61. 1st Place Solution to VisDA-2020: Bias Elimination for Domain Adaptive Pedestrian Re-identification [PDF] 返回目录
Jianyang Gu, Hao Luo, Weihua Chen, Yiqi Jiang, Yuqi Zhang, Shuting He, Fan Wang, Hao Li, Wei Jiang
Abstract: This paper presents our proposed methods for the domain adaptive pedestrian re-identification (Re-ID) task in the Visual Domain Adaptation Challenge (VisDA-2020). Considering the large gap between the source domain and target domain, we focused on solving two biases that influence performance on domain adaptive pedestrian Re-ID and propose a two-stage training procedure. In the first stage, a baseline model is trained with images transferred from the source domain to the target domain and from a single camera to multiple camera styles. Then we introduce a domain adaptation framework to train the model on source data and target data simultaneously. Different pseudo-label generation strategies are adopted to continuously improve the discriminative ability of the model. Finally, with multiple models ensembled and additional post-processing approaches adopted, our methods achieve 76.56% mAP and 84.25% rank-1 on the test set. Codes are available at this https URL
62. Self-supervised Pre-training with Hard Examples Improves Visual Representations [PDF] 返回目录
Chunyuan Li, Xiujun Li, Lei Zhang, Baolin Peng, Mingyuan Zhou, Jianfeng Gao
Abstract: Self-supervised pre-training (SSP) employs random image transformations to generate training data for visual representation learning. In this paper, we first present a modeling framework that unifies existing SSP methods as learning to predict pseudo-labels. Then, we propose new data augmentation methods of generating training examples whose pseudo-labels are harder to predict than those generated via random image transformations. Specifically, we use adversarial training and CutMix to create hard examples (HEXA) to be used as augmented views for MoCo-v2 and DeepCluster-v2, leading to two variants HEXA_{MoCo} and HEXA_{DCluster}, respectively. In our experiments, we pre-train models on ImageNet and evaluate them on multiple public benchmarks. Our evaluation shows that the two new algorithm variants outperform their original counterparts, and achieve new state-of-the-art on a wide range of tasks where limited task supervision is available for fine-tuning. These results verify that hard examples are instrumental in improving the generalization of the pre-trained models.
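A bare-bones CutMix, one of the two operations used here to build hard views, is sketched below: a random rectangle from another image in the batch is pasted in and the labels are mixed by area. The adversarial-perturbation half of HEXA is not shown, and the Beta parameter is illustrative.

```python
import torch

def cutmix(images, labels, alpha=1.0):
    """Return mixed images plus the two label sets and their mixing weight."""
    B, _, H, W = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B)
    rh, rw = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    y = torch.randint(0, H - rh + 1, (1,)).item()
    x = torch.randint(0, W - rw + 1, (1,)).item()
    mixed = images.clone()
    mixed[:, :, y:y + rh, x:x + rw] = images[perm, :, y:y + rh, x:x + rw]
    lam = 1 - (rh * rw) / (H * W)      # recompute weight from the actual box
    return mixed, labels, labels[perm], lam
```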
63. GraNet: Global Relation-aware Attentional Network for ALS Point Cloud Classification [PDF] 返回目录
Rong Huang, Yusheng Xu, Uwe Stilla
Abstract: In this work, we propose a novel neural network for the semantic labeling of ALS point clouds, which investigates the importance of long-range spatial and channel-wise relations and is termed the global relation-aware attentional network (GraNet). GraNet first learns local geometric descriptions and local dependencies using a local spatial discrepancy attention convolution module (LoSDA). In LoSDA, orientation information, spatial distribution, and elevation differences are fully considered by stacking several local spatial geometric learning modules, and the local dependencies are embedded using an attention pooling module. Then, a global relation-aware attention module (GRA), consisting of a spatial relation-aware attention module (SRA) and a channel relation-aware attention module (CRA), is investigated to further learn the global spatial and channel-wise relationships between spatial positions and feature vectors. The aforementioned two important modules are embedded in a multi-scale network architecture to further account for scale changes in large urban areas. We conducted comprehensive experiments on two ALS point cloud datasets to evaluate the performance of our proposed framework. The results show that our method can achieve higher classification accuracy than other commonly used advanced classification methods. The overall accuracy (OA) of our method on the ISPRS benchmark dataset reaches 84.5% when classifying nine semantic classes, with an average F1 measure (AvgF1) of 73.5%. In detail, we obtain the following F1 values for each object class: powerlines: 66.3%, low vegetation: 82.8%, impervious surface: 91.8%, car: 80.7%, fence: 51.2%, roof: 94.6%, facades: 62.1%, shrub: 49.9%, trees: 82.1%. In addition, experiments were conducted using a new ALS point cloud dataset covering highly dense urban areas.
64. Real-Time Facial Expression Emoji Masking with Convolutional Neural Networks and Homography [PDF] 返回目录
Qinchen Wang, Sixuan Wu, Tingfeng Xia
Abstract: Neural network based algorithms have shown success in many applications. In image processing, Convolutional Neural Networks (CNNs) can be trained to categorize facial expressions in images of human faces. In this work, we create a system that masks a student's face with an emoji of the corresponding emotion. Our system consists of three building blocks: face detection using a Histogram of Gradients (HoG) and a Support Vector Machine (SVM), facial expression categorization using a CNN trained on the FER2013 dataset, and finally masking the respective emoji back onto the student's face via homography estimation. (Demo: this https URL) Our results show that this pipeline is deployable in real time and is usable in educational settings.
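The final masking step maps the emoji's corners onto the detected face region with a homography. The OpenCV sketch below assumes face_quad is a float32 array of the four face-box corners produced by the HoG+SVM detector; the hard-mask blending is an assumption, not necessarily the paper's compositing.

```python
import cv2
import numpy as np

def mask_with_emoji(frame, emoji, face_quad):
    """Warp an emoji image (BGR) onto the face quadrilateral in `frame`."""
    h, w = emoji.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])   # emoji corners
    H, _ = cv2.findHomography(src, face_quad)
    size = (frame.shape[1], frame.shape[0])
    warped = cv2.warpPerspective(emoji, H, size)
    mask = cv2.warpPerspective(np.full((h, w), 255, np.uint8), H, size)
    out = frame.copy()
    out[mask > 0] = warped[mask > 0]                     # hard paste
    return out
```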
65. Commonsense Visual Sensemaking for Autonomous Driving: On Generalised Neurosymbolic Online Abduction Integrating Vision and Semantics [PDF] 返回目录
Jakob Suchan, Mehul Bhatt, Srikrishna Varadarajan
Abstract: We demonstrate the need for and potential of systematically integrated vision and semantics solutions for visual sensemaking in the backdrop of autonomous driving. A general neurosymbolic method for online visual sensemaking using answer set programming (ASP) is systematically formalised and fully implemented. The method integrates the state of the art in visual computing, and is developed as a modular framework that is generally usable within hybrid architectures for realtime perception and control. We evaluate and demonstrate with the community-established benchmarks KITTIMOD, MOT-2017, and MOT-2020. As a use-case, we focus on the significance of human-centred visual sensemaking (e.g., involving semantic representation and explainability, question-answering, commonsense interpolation) in safety-critical autonomous driving situations. The developed neurosymbolic framework is domain-independent, with the case of autonomous driving designed to serve as an exemplar for online visual sensemaking in diverse cognitive interaction settings in the backdrop of select human-centred AI technology design considerations. Keywords: Cognitive Vision, Deep Semantics, Declarative Spatial Reasoning, Knowledge Representation and Reasoning, Commonsense Reasoning, Visual Abduction, Answer Set Programming, Autonomous Driving, Human-Centred Computing and Design, Standardisation in Driving Technology, Spatial Cognition and AI.
66. TSEQPREDICTOR: Spatiotemporal Extreme Earthquakes Forecasting for Southern California [PDF] 返回目录
Bo Feng, Geoffrey C. Fox
Abstract: Over the past few decades, seismology has utilized the most advanced technologies and equipment to monitor seismic events globally. However, forecasting disasters like earthquakes remains an underdeveloped topic historically. Recent research in spatiotemporal forecasting has revealed some possibilities of successful prediction, which has become an important topic in many scientific research fields. Many of these studies successfully apply deep neural networks. In geoscience, earthquake prediction is one of the world's most challenging problems, for which cutting-edge deep learning technologies may help discover useful patterns. In this project, we propose a joint deep learning modeling method for earthquake forecasting, namely TSEQPREDICTOR. In TSEQPREDICTOR, we combine comprehensive deep learning technologies with domain knowledge in seismology and approach the prediction problem using encoder-decoder and temporal convolutional neural networks. Compared to state-of-the-art recurrent neural networks, our experiments show that our method is promising for predicting major shocks of earthquakes in Southern California.
67. A method to integrate and classify normals [PDF] 返回目录
Abhranil Das, Wilson S Geisler
Abstract: Univariate and multivariate normal probability distributions are widely used when modeling decisions under uncertainty. Computing the performance of such models requires integrating these distributions over specific domains, which can vary widely across models. Besides some special cases where these integrals are easy to calculate, there exists no general analytical expression, standard numerical method or software tool for these integrals. Here we present mathematical methods and software that provide (i) the probability in any domain of a normal in any dimensions with any parameters, (ii) the probability density, distribution, and percentage points of any scalar or vector function of a normal vector, (iii) quantities, such as the error matrix and discriminability, which summarize classification performance amongst any number of normal distributions, (iv) dimension reduction and visualizations for all such problems, and (v) tests for how reliably these methods can be used on given data. We illustrate these tools with models for detecting occluding targets in natural scenes and for detecting camouflage.
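The quantity in point (i), the probability of a normal over an arbitrary domain, has a simple general-purpose baseline: Monte Carlo integration against an indicator function, sketched below in NumPy. This converges only at the O(1/sqrt(n)) rate that dedicated methods improve on, so it shows what is being computed rather than how the paper's toolbox computes it.

```python
import numpy as np

def normal_prob(mean, cov, in_domain, n=1_000_000, seed=0):
    """P(X in domain) for X ~ N(mean, cov); `in_domain` maps an (n, d)
    array of sample points to a boolean array."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n)
    return in_domain(samples).mean()

# Example: P(x0 > 0 and x1 > 0) for a correlated 2-D normal.
p = normal_prob([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]],
                lambda s: (s > 0).all(axis=1))
```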
68. Online Photometric Calibration of Automatic Gain Thermal Infrared Cameras [PDF] 返回目录
Manash Pratim Das, Larry Matthies, Shreyansh Daftry
Abstract: Thermal infrared cameras are increasingly being used in various applications such as robot vision, industrial inspection and medical imaging, thanks to their improved resolution and portability. However, the performance of traditional computer vision techniques developed for electro-optical imagery does not directly translate to the thermal domain for two major reasons: these algorithms require photometric assumptions to hold, and methods for photometric calibration of RGB cameras cannot be applied to thermal-infrared cameras due to differences in data acquisition and sensor phenomenology. In this paper, we take a step in this direction and introduce a novel algorithm for online photometric calibration of thermal-infrared cameras. Our proposed method does not require any specific driver/hardware support and hence can be applied to any commercial off-the-shelf thermal IR camera. We present this in the context of visual odometry and SLAM algorithms, and demonstrate the efficacy of our proposed system through extensive experiments on both standard benchmark datasets and real-world field tests with a thermal-infrared camera in natural outdoor environments.
69. Learning by Ignoring [PDF] 返回目录
Xingchen Zhao, Pengtao Xie
Abstract: Learning by ignoring, which identifies less important things and excludes them from the learning process, is an effective learning technique in human learning. Psychological studies have shown that learning to ignore certain things is a powerful tool for helping people focus. We are interested in investigating whether this powerful learning technique can be borrowed from humans to improve the learning abilities of machines. We propose a novel learning approach called learning by ignoring (LBI). Our approach automatically identifies pretraining data examples that have a large domain shift from the target distribution by learning an ignoring variable for each example, and excludes them from the pretraining process. We propose a three-level optimization framework to formulate LBI, which involves three stages of learning: pretraining by minimizing the losses weighted by the ignoring variables; finetuning; and updating the ignoring variables by minimizing the validation loss. We develop an efficient algorithm to solve the LBI problem. Experiments on various datasets demonstrate the effectiveness of our method.
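The ignoring variables can be pictured as one learnable logit per pretraining example, squashed to a [0, 1] keep-weight that scales that example's loss, as in the PyTorch sketch below. The surrounding three-level optimization (and how the logits would be updated from the validation loss, e.g. via hypergradients) is only indicated in comments; this parameterization is an assumption.

```python
import torch
import torch.nn.functional as F

num_examples = 10_000
# One 'ignoring variable' per pretraining example (sigmoid ~ 0 means ignore).
ignore_logits = torch.zeros(num_examples, requires_grad=True)

def weighted_pretrain_loss(logits, targets, example_ids):
    """Per-example loss scaled by each example's learned keep-weight."""
    per_example = F.cross_entropy(logits, targets, reduction='none')
    keep = torch.sigmoid(ignore_logits[example_ids])
    return (keep * per_example).mean()

# Outer level (not shown): update ignore_logits to minimize the validation
# loss of the finetuned model, closing the three-level optimization loop.
```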
70. Latent Compass: Creation by Navigation [PDF] 返回目录
Sarah Schwettmann, Hendrik Strobelt, Mauro Martino
Abstract: In Marius von Senden's Space and Sight, a newly sighted blind patient describes the experience of a corner as lemon-like, because corners "prick" sight like lemons prick the tongue. Prickliness, here, is a dimension in the feature space of sensory experience, an effect of the perceived on the perceiver that arises where the two interact. In the account of the newly sighted, an effect familiar from one interaction translates to a novel context. Perception serves as the vehicle for generalization, in that an effect shared across different experiences produces a concrete abstraction grounded in those experiences. Cezanne and the post-impressionists, fluent in the language of experience translation, realized that the way to paint a concrete form that best reflected reality was to paint not what they saw, but what it was like to see. We envision a future of creation using AI where what it is like to see is replicable, transferrable, manipulable - part of the artist's palette that is both grounded in a particular context, and generalizable beyond it. An active line of research maps human-interpretable features onto directions in GAN latent space. Supervised and self-supervised approaches that search for anticipated directions or use off-the-shelf classifiers to drive image manipulation in embedding space are limited in the variety of features they can uncover. Unsupervised approaches that discover useful new directions show that the space of perceptually meaningful directions is nowhere close to being fully mapped. As this space is broad and full of creative potential, we want tools for direction discovery that capture the richness and generalizability of human perception. Our approach puts creators in the discovery loop during real-time tool use, in order to identify directions that are perceptually meaningful to them, and generate interpretable image translations along those directions.
摘要:在马里乌斯·冯·森登(Marius von Senden)的《空间与视觉》(Space and Sight)一书中,一位新近复明的盲人患者把墙角的体验描述为像柠檬一样,因为墙角“刺痛”视觉,就像柠檬刺痛舌头。在这里,“刺感”是感官体验特征空间中的一个维度,是被感知之物作用于感知者、在二者交互之处产生的效应。在这位新近复明者的叙述中,来自一种熟悉交互的效应被迁移到了全新的情境。感知是泛化的载体:不同体验所共享的效应会产生植根于这些体验的具体抽象。精通体验转译语言的塞尚与后印象派画家们意识到,要画出最能反映现实的具体形态,不是去画他们所看到的东西,而是去画“看见”本身的感受。我们设想一个使用AI进行创作的未来:“看见的感受”是可复制、可迁移、可操纵的,成为艺术家调色板的一部分,既植根于特定情境,又能泛化到情境之外。一条活跃的研究路线将人类可解释的特征映射到GAN潜在空间中的方向上。通过搜索预期方向、或使用现成分类器在嵌入空间中驱动图像操纵的有监督和自监督方法,在能够发现的特征种类上受到限制。能发现有用新方向的无监督方法表明,感知上有意义的方向空间还远未被完全绘制。由于这一空间广阔且充满创造潜力,我们需要能捕捉人类感知的丰富性和可泛化性的方向发现工具。我们的方法让创作者在实时使用工具的过程中参与发现循环,以识别对他们而言有感知意义的方向,并沿这些方向生成可解释的图像变换。
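The latent-direction editing that this abstract builds on can be summarized in a few lines; the generator and direction below are random stand-ins for illustration only, not the authors' tool:

```python
# Walking a latent code along a (hypothetical) perceptually meaningful
# direction d: each step decodes to one frame of an image translation.
import torch

latent_dim = 128
G = torch.nn.Linear(latent_dim, 3 * 64 * 64)   # stand-in for a GAN generator

z = torch.randn(1, latent_dim)                 # starting latent code
d = torch.randn(latent_dim)
d = d / d.norm()                               # unit-norm direction

frames = [G(z + alpha * d).view(3, 64, 64) for alpha in (-3.0, 0.0, 3.0)]
```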
71. Lesion Net -- Skin Lesion Segmentation Using Coordinate Convolution and Deep Residual Units [PDF] 返回目录
Sabari Nathan, Priya Kansal
Abstract: Skin lesion segmentation is an important step in the process of automated diagnosis of skin melanoma. However, accurately segmenting melanoma skin lesions is quite challenging due to limited training data, irregular shapes, unclear boundaries, and varying skin colors. Our proposed approach helps in improving the accuracy of skin lesion segmentation. Firstly, we introduce a coordinate convolutional layer before passing the input image into the encoder. This layer helps the network to decide on features related to translation invariance, which further improves the generalization capacity of the model. Secondly, we leverage the properties of deep residual units along with the convolutional layers. At last, instead of using only cross-entropy or Dice loss, we combine the two loss functions to optimize the training metrics, which helps the loss converge more quickly and smoothly. After training and validating the proposed model on ISIC 2018 (60% as the train set + 20% as the validation set), we tested the robustness of our trained model on various other datasets, namely ISIC 2018 (20% as the test set), ISIC 2017, ISIC 2016, and the PH2 dataset. The results show that the proposed model either outperforms or is on par with the existing skin lesion segmentation methods.
摘要:皮肤病变分割是皮肤黑色素瘤自动诊断过程中的重要步骤。然而,由于训练数据较少、形状不规则、边界不清晰以及肤色各异,准确分割黑色素瘤皮肤病变是一项颇具挑战的任务。我们提出的方法有助于提高皮肤病变分割的准确性。首先,我们在将输入图像传入编码器之前引入了坐标卷积层。该层帮助网络确定与平移不变性相关的特征,从而进一步提高模型的泛化能力。其次,我们利用了深度残差单元与卷积层相结合的特性。最后,我们不再仅使用交叉熵或Dice损失,而是将两种损失函数结合起来优化训练指标,这有助于损失更快、更平稳地收敛。在ISIC 2018上对所提模型进行训练和验证后(60%作为训练集,20%作为验证集),我们在ISIC 2018(20%作为测试集)、ISIC 2017、ISIC 2016和PH2等多个数据集上测试了所训练模型的稳健性。结果表明,所提模型优于或不逊于现有的皮肤病变分割方法。
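The two ingredients named in the abstract, a CoordConv-style input layer and a combined cross-entropy plus Dice loss, can be sketched as below; a single convolution stands in for the real encoder-decoder, and all shapes are assumptions rather than the authors' code:

```python
# CoordConv input channels and a combined BCE + Dice loss for binary masks.
import torch
import torch.nn.functional as F

def add_coord_channels(img):
    """Append normalized x/y coordinate channels (CoordConv input)."""
    b, _, h, w = img.shape
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([img, xs, ys], dim=1)

def combined_loss(logits, target, eps=1e-6):
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return bce + dice

x = torch.randn(2, 3, 64, 64)                     # RGB dermoscopy tiles
x = add_coord_channels(x)                         # now 5 input channels
logits = torch.nn.Conv2d(5, 1, 3, padding=1)(x)   # stand-in for the network
loss = combined_loss(logits, torch.randint(0, 2, (2, 1, 64, 64)).float())
```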
72. Combining CNN and Hybrid Active Contours for Head and Neck Tumor Segmentation in CT and PET images [PDF] 返回目录
Jun Ma, Xiaoping Yang
Abstract: Automatic segmentation of head and neck tumors plays an important role in radiomics analysis. In this short paper, we propose an automatic segmentation method for head and neck tumors from PET and CT images based on the combination of convolutional neural networks (CNNs) and hybrid active contours. Specifically, we first introduce a multi-channel 3D U-Net to segment the tumor with the concatenated PET and CT images. Then, we estimate the segmentation uncertainty by model ensembles and define a segmentation quality score to select the cases with high uncertainties. Finally, we develop a hybrid active contour model to refine the high uncertainty cases. Our method ranked second place in the MICCAI 2020 HECKTOR challenge with average Dice Similarity Coefficient, precision, and recall of 0.752, 0.838, and 0.717, respectively.
摘要:头颈部肿瘤的自动分割在放射组学分析中起着重要作用。在这篇简短的论文中,我们提出了一种基于卷积神经网络(CNN)与混合主动轮廓相结合的方法,用于从PET和CT图像中自动分割头颈部肿瘤。具体来说,我们首先引入多通道3D U-Net,利用拼接的PET和CT图像分割肿瘤。然后,我们通过模型集成估计分割不确定性,并定义分割质量得分以挑选出不确定性较高的病例。最后,我们开发了一种混合主动轮廓模型来细化这些高不确定性病例。我们的方法在MICCAI 2020 HECKTOR挑战赛中排名第二,平均Dice相似系数、精确率和召回率分别为0.752、0.838和0.717。
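The ensemble-based uncertainty selection can be sketched as follows, with random volumes and a hypothetical threshold standing in for the real models and quality score; the hybrid active-contour refinement itself is not shown:

```python
# Estimate per-case segmentation uncertainty from an ensemble of models.
import torch

n_models = 5
# Stand-in for the probability maps of five trained 3D U-Nets on one case.
preds = torch.stack([torch.rand(1, 1, 32, 32, 32) for _ in range(n_models)])

mean_prob = preds.mean(dim=0)              # consensus segmentation
uncertainty = preds.var(dim=0).mean()      # scalar proxy for a quality score

threshold = 0.05                           # hypothetical cut-off
if uncertainty > threshold:
    pass  # route this case to the hybrid active-contour refinement step
```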
73. Screening COVID-19 Based on CT/CXR Images & Building a Publicly Available CT-scan Dataset of COVID-19 [PDF] 返回目录
Maryam Dialameh, Ali Hamzeh, Hossein Rahmani, Amir Reza Radmard, Safoura Dialameh
Abstract: The rapid outbreak of COVID-19 threatens human lives all around the world. Due to insufficient diagnostic infrastructures, developing an accurate, efficient, inexpensive, and quick diagnostic tool is of great importance. As chest radiography, such as chest X-ray (CXR) and computed tomography (CT), is a possible way for screening COVID-19, developing an automatic image classification tool is immensely helpful for detecting the patients with COVID-19. To date, researchers have proposed several different screening methods; however, none of them could achieve a reliable and highly sensitive performance yet. The main drawbacks of current methods are the lack of enough training data, low generalization performance, and a high rate of false-positive detection. To tackle such limitations, this study firstly builds a large-size publicly available CT-scan dataset, consisting of more than 13k CT images of more than 1000 individuals, in which 8k images are taken from 500 patients infected with COVID-19. Secondly, we propose a deep learning model for screening COVID-19 using our proposed CT dataset and report the baseline results. Finally, we extend the proposed CT model for screening COVID-19 from CXR images using a transfer learning approach. The experimental results show that the proposed CT and CXR methods achieve AUC scores of 0.886 and 0.984, respectively.
摘要:COVID-19的迅速爆发威胁着世界各地的人类生命。由于诊断基础设施不足,开发准确、高效、廉价且快速的诊断工具至关重要。由于胸部X光(CXR)和计算机断层扫描(CT)等胸部影像是筛查COVID-19的一种可能手段,开发自动图像分类工具对检测COVID-19患者非常有帮助。迄今为止,研究人员已经提出了几种不同的筛查方法,但它们都尚未达到可靠且高度敏感的性能。当前方法的主要缺点是缺乏足够的训练数据、泛化性能低以及假阳性率高。为了解决这些局限,本研究首先构建了一个大规模公开的CT扫描数据集,包含1000多名个体的13k多张CT图像,其中8k张图像来自500名感染COVID-19的患者。其次,我们提出了使用该CT数据集筛查COVID-19的深度学习模型,并报告了基线结果。最后,我们采用迁移学习方法,将所提出的CT模型扩展到基于CXR图像的COVID-19筛查。实验结果表明,所提出的CT和CXR方法的AUC得分分别为0.886和0.984。
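A minimal sketch of the CXR transfer-learning step, assuming a ResNet-18 backbone and a hypothetical checkpoint path; the abstract does not state the actual architecture:

```python
# Fine-tune a CT-trained classifier on CXR images via transfer learning.
import torch
import torchvision.models as models

model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)      # COVID vs. non-COVID
# model.load_state_dict(torch.load("ct_screening.pt"))   # hypothetical CT weights

# Freeze most of the backbone and fine-tune the remaining layers slowly.
for p in list(model.parameters())[:-4]:
    p.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```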
74. Data augmentation and image understanding [PDF] 返回目录
Alex Hernandez-Garcia
Abstract: Interdisciplinary research is often at the core of scientific progress. This dissertation explores some advantageous synergies between machine learning, cognitive science and neuroscience. In particular, this thesis focuses on vision and images. The human visual system has been widely studied from both behavioural and neuroscientific points of view, as vision is the dominant sense of most people. In turn, machine vision has also been an active area of research, currently dominated by the use of artificial neural networks. This work focuses on learning representations that are more aligned with visual perception and the biological vision. For that purpose, I have studied tools and aspects from cognitive science and computational neuroscience, and attempted to incorporate them into machine learning models of vision. A central subject of this dissertation is data augmentation, a commonly used technique for training artificial neural networks to augment the size of data sets through transformations of the images. Although often overlooked, data augmentation implements transformations that are perceptually plausible, since they correspond to the transformations we see in our visual world -- changes in viewpoint or illumination, for instance. Furthermore, neuroscientists have found that the brain invariantly represents objects under these transformations. Throughout this dissertation, I use these insights to analyse data augmentation as a particularly useful inductive bias, a more effective regularisation method for artificial neural networks, and as the framework to analyse and improve the invariance of vision models to perceptually plausible transformations. Overall, this work aims to shed more light on the properties of data augmentation and demonstrate the potential of interdisciplinary research.
摘要:跨学科研究通常是科学进步的核心。本论文探讨了机器学习、认知科学与神经科学之间若干有益的协同作用,并特别关注视觉和图像。由于视觉是大多数人的主要感官,人们已经从行为学和神经科学的角度对人类视觉系统进行了广泛研究。同时,机器视觉也一直是一个活跃的研究领域,目前以人工神经网络的使用为主。这项工作着重于学习与视觉感知和生物视觉更契合的表征。为此,我研究了认知科学和计算神经科学中的工具与思想,并尝试将其纳入视觉的机器学习模型。本论文的核心主题是数据增强,这是训练人工神经网络时常用的技术,通过对图像进行变换来扩充数据集的规模。尽管常被忽视,但数据增强实现的是在感知上合理的变换,因为它们对应于我们在视觉世界中看到的变换,例如视角或光照的变化。此外,神经科学家发现,大脑对这些变换下的物体具有不变的表征。在整篇论文中,我利用这些见解,将数据增强作为一种特别有用的归纳偏置和一种对人工神经网络更有效的正则化方法来分析,并将其作为分析和改进视觉模型对感知上合理变换的不变性的框架。总体而言,这项工作旨在进一步阐明数据增强的特性,并展示跨学科研究的潜力。
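As a concrete example of the perceptually plausible transformations discussed above, a standard torchvision pipeline covering viewpoint and illumination changes might look like this (parameter values are illustrative, not from the dissertation):

```python
# Perceptually plausible data augmentation: viewpoint and illumination.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1)),  # viewpoint
    T.ColorJitter(brightness=0.3, contrast=0.3),                         # illumination
    T.RandomHorizontalFlip(),
])
# augmented = augment(pil_image)  # applied on the fly during training
```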
75. A Google Earth Engine-enabled Python approach to improve identification of anthropogenic palaeo-landscape features [PDF] 返回目录
Filippo Brandolini, Guillem Domingo Ribas, Andrea Zerboni, Sam Turner
Abstract: The necessity of sustainable development for landscapes has emerged as an important theme in recent decades. Current methods take a holistic approach to landscape heritage and promote an interdisciplinary dialogue to facilitate complementary landscape management strategies. With the socio-economic values of the natural and cultural landscape heritage increasingly recognised worldwide, remote sensing tools are being used more and more to facilitate the recording and management of landscape heritage. Satellite remote sensing technologies have enabled significant improvements in landscape research. The advent of the cloud-based platform of Google Earth Engine has allowed the rapid exploration and processing of satellite imagery such as the Landsat and Copernicus Sentinel datasets. In this paper, the use of Sentinel-2 satellite data in the identification of palaeo-riverscape features has been assessed in the Po Plain, selected because it is characterized by human exploitation since the Mid-Holocene. A multi-temporal approach has been adopted to investigate the potential of satellite imagery to detect buried hydrological and anthropogenic features along with Spectral Index and Spectral Decomposition analysis. This research represents one of the first applications of the GEE Python API in landscape studies. The complete FOSS-cloud protocol proposed here consists of a Python code script developed in Google Colab, which could be simply adapted and replicated in different areas of the world.
摘要:近几十年来,景观的可持续发展已成为一个重要主题。当前的方法对景观遗产采取整体视角,并促进跨学科对话,以推动互补的景观管理策略。随着自然和文化景观遗产的社会经济价值在世界范围内得到越来越多的认可,遥感工具正被越来越多地用于促进景观遗产的记录和管理。卫星遥感技术使景观研究得到了显著改进。Google Earth Engine这一云平台的出现,使得Landsat和Copernicus Sentinel等卫星影像数据集能够被快速探索和处理。本文在波河平原(Po Plain)评估了利用Sentinel-2卫星数据识别古河道景观特征的效果;选择该区域是因为它自中全新世以来一直以人类开发活动为特征。我们采用多时相方法,结合光谱指数和光谱分解分析,研究卫星影像探测被掩埋的水文和人为特征的潜力。这项研究是GEE Python API在景观研究中的最早应用之一。本文提出的完整FOSS云协议由一个在Google Colab中开发的Python代码脚本构成,可以方便地在世界其他地区改编和复用。
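In the spirit of the paper's FOSS-cloud protocol, a minimal GEE Python API call sequence is sketched below; the region, dates, and the choice of NDVI as the spectral index are assumptions for illustration, not the paper's actual script:

```python
# Build a median Sentinel-2 composite and a spectral index in Earth Engine.
import ee

ee.Initialize()  # assumes prior `earthengine authenticate`
region = ee.Geometry.Rectangle([10.0, 44.8, 11.0, 45.2])  # hypothetical AOI

s2 = (ee.ImageCollection("COPERNICUS/S2")
        .filterBounds(region)
        .filterDate("2019-01-01", "2019-12-31")
        .median())                                   # multi-temporal composite
ndvi = s2.normalizedDifference(["B8", "B4"]).rename("NDVI")  # spectral index
```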
76. Perception Consistency Ultrasound Image Super-resolution via Self-supervised CycleGAN [PDF] 返回目录
Heng Liu, Jianyong Liu, Tao Tao, Shudong Hou, Jungong Han
Abstract: Due to the limitations of sensors, the transmission medium, and the intrinsic properties of ultrasound, the quality of ultrasound imaging is often not ideal, especially its low spatial resolution. To remedy this situation, deep learning networks have recently been developed for ultrasound image super-resolution (SR) because of their powerful approximation capability. However, most current supervised SR methods are not suitable for ultrasound medical images because medical image samples are always rare and, usually, there are no low-resolution (LR) and high-resolution (HR) training pairs in reality. In this work, based on self-supervision and the cycle generative adversarial network (CycleGAN), we propose a new perception-consistency ultrasound image super-resolution (SR) method, which only requires the LR ultrasound data and can ensure that the re-degenerated image of the generated SR one is consistent with the original LR image, and vice versa. We first generate the HR fathers and the LR sons of the test ultrasound LR image through image enhancement, and then make full use of the cycle loss of LR-SR-LR and HR-LR-SR and the adversarial characteristics of the discriminator to promote the generator to produce better perceptually consistent SR results. The evaluation of PSNR/IFC/SSIM, inference efficiency, and visual effects under the benchmark CCA-US and CCA-US datasets illustrates that our proposed approach is effective and superior to other state-of-the-art methods.
摘要:由于传感器的局限性、传输介质以及超声本身的固有特性,超声成像的质量往往不够理想,尤其是空间分辨率较低。为改善这一状况,凭借强大的逼近能力,深度学习网络近来被用于超声图像超分辨率(SR)。然而,当前大多数有监督的SR方法并不适合超声医学图像,因为医学图像样本总是稀缺,而且现实中通常不存在低分辨率(LR)与高分辨率(HR)训练对。在这项工作中,基于自监督和循环生成对抗网络(CycleGAN),我们提出了一种新的感知一致性超声图像超分辨率(SR)方法,该方法仅需要LR超声数据,并能确保由生成的SR图像再退化得到的图像与原始LR图像保持一致,反之亦然。我们首先通过图像增强生成测试超声LR图像的HR“父”图像和LR“子”图像,然后充分利用LR-SR-LR和HR-LR-SR的循环损失以及判别器的对抗特性,促使生成器产生感知上更一致的SR结果。在基准CCA-US和CCA-US数据集上对PSNR/IFC/SSIM、推理效率和视觉效果的评估表明,我们提出的方法是有效的,并且优于其他最新方法。
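The LR-SR-LR consistency constraint can be sketched with stand-in operators as below; the real method uses learned generators, an additional HR-LR-SR cycle, and adversarial terms:

```python
# Cycle consistency for self-supervised SR: re-degrading the SR output
# should recover the original LR input.
import torch
import torch.nn.functional as F

up = torch.nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
G_sr = lambda lr: up(lr)                      # stand-in SR generator
G_down = lambda hr: F.avg_pool2d(hr, 2)       # stand-in re-degradation

lr = torch.rand(1, 1, 32, 32)                 # LR ultrasound patch
sr = G_sr(lr)                                 # 64x64 SR estimate
cycle_loss = F.l1_loss(G_down(sr), lr)        # LR -> SR -> LR consistency
```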
77. Analysis of Macula on Color Fundus Images Using Heightmap Reconstruction Through Deep Learning [PDF] 返回目录
Peyman Tahghighi, Reza A.Zoroofi, Sare Safi, Alireza Ramezani
Abstract: For medical diagnosis based on retinal images, a clear understanding of 3D structure is often required, but due to the 2D nature of the images captured, we cannot infer that information. However, by utilizing 3D reconstruction methods, we can recover the height information of the macula area on a fundus image which can be helpful for diagnosis and screening of macular disorders. Recent approaches have used shading information for heightmap prediction but their output was not accurate since they ignored the dependency between nearby pixels and only utilized shading information. Additionally, other methods were dependent on the availability of more than one image of the retina which is not available in practice. In this paper, motivated by the success of Conditional Generative Adversarial Networks (cGANs) and deeply supervised networks, we propose a novel architecture for the generator which enhances the details and the quality of output by progressive refinement and the use of deep supervision to reconstruct the height information of macula on a color fundus image. Comparisons on our own dataset illustrate that the proposed method outperforms all of the state-of-the-art methods in image translation and medical image translation on this particular task. Additionally, perceptual studies also indicate that the proposed method can provide additional information for ophthalmologists for diagnosis.
摘要:对于基于视网膜图像的医学诊断,通常需要对3D结构有清晰的了解,但由于所采集图像的2D性质,我们无法直接推断出这些信息。然而,通过使用3D重建方法,我们可以在眼底图像上恢复黄斑区域的高度信息,这有助于黄斑疾病的诊断和筛查。近期的方法利用明暗信息进行高度图预测,但由于忽略了邻近像素之间的依赖关系且仅利用了明暗信息,其输出并不准确。另外,其他方法依赖于多张视网膜图像的可用性,而这在实践中往往无法获得。在本文中,受条件生成对抗网络(cGAN)和深度监督网络成功的启发,我们提出了一种新颖的生成器架构,该架构通过渐进式细化和深度监督,从彩色眼底图像重建黄斑的高度信息,从而提升输出的细节和质量。在我们自有数据集上的比较表明,在这一特定任务上,所提方法优于图像转换和医学图像转换领域的所有最新方法。此外,感知研究也表明,所提方法可以为眼科医生的诊断提供额外信息。
78. Cascaded Convolutional Neural Network for Automatic Myocardial Infarction Segmentation from Delayed-Enhancement Cardiac MRI [PDF] 返回目录
Yichi Zhang
Abstract: Automatic segmentation of myocardial contours and relevant areas like infarction and no-reflow is an important step for the quantitative evaluation of myocardial infarction. In this work, we propose a cascaded convolutional neural network for automatic myocardial infarction segmentation from delayed-enhancement cardiac MRI. We first use a 2D U-Net to focus on the intra-slice information to perform a preliminary segmentation. After that, we use a 3D U-Net to utilize the volumetric spatial information for a subtle segmentation. Our method is evaluated on the MICCAI 2020 EMIDEC challenge dataset and achieves average Dice scores of 0.8786, 0.7124 and 0.7851 for myocardium, infarction and no-reflow respectively, outperforming all the other teams in the segmentation contest.
摘要:心肌轮廓及梗死区、无复流区等相关区域的自动分割,是定量评估心肌梗死的重要步骤。在这项工作中,我们提出了一个级联卷积神经网络,用于从延迟增强心脏MRI中自动分割心肌梗死。我们首先使用2D U-Net聚焦切片内信息以执行初步分割。之后,我们使用3D U-Net利用体数据的空间信息进行精细分割。我们的方法在MICCAI 2020 EMIDEC挑战数据集上进行了评估,心肌、梗死和无复流的平均Dice得分分别为0.8786、0.7124和0.7851,优于分割竞赛中的所有其他参赛队伍。
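The 2D-then-3D cascade can be sketched with single convolutions standing in for the two U-Nets (shapes are assumptions): a slice-wise pass produces a coarse mask that a volumetric pass refines.

```python
# Slice-wise coarse segmentation followed by volumetric refinement.
import torch

coarse_2d = torch.nn.Conv2d(1, 1, 3, padding=1)     # stand-in 2D U-Net
refine_3d = torch.nn.Conv3d(2, 1, 3, padding=1)     # stand-in 3D U-Net

vol = torch.rand(1, 1, 16, 64, 64)                  # (B, C, D, H, W) MRI volume
slices = vol.squeeze(0).permute(1, 0, 2, 3)         # (D, C, H, W) slice batch
coarse = coarse_2d(slices).permute(1, 0, 2, 3).unsqueeze(0)  # back to a volume
refined = refine_3d(torch.cat([vol, coarse], dim=1))         # 3D refinement
```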
79. 3D Axial-Attention for Lung Nodule Classification [PDF] 返回目录
Mundher Al-Shabi, Kelvin Shak, Maxine Tan
Abstract: Purpose: In recent years, Non-Local based methods have been successfully applied to lung nodule classification. However, these methods offer 2D attention or a limited 3D attention to low-resolution feature maps. Moreover, they still depend on a convenient local filter such as convolution as full 3D attention is expensive to compute and requires a big dataset, which might not be available. Methods: We propose to use 3D Axial-Attention, which requires a fraction of the computing power of a regular Non-Local network. Additionally, we solve the position invariant problem of the Non-Local network by proposing adding 3D positional encoding to shared embeddings. Results: We validated the proposed method on the LIDC-IDRI dataset by following a rigorous experimental setup using only nodules annotated by at least three radiologists. Our results show that the 3D Axial-Attention model achieves state-of-the-art performance on all evaluation metrics including AUC and Accuracy. Conclusions: The proposed model provides full 3D attention effectively, which can be used in all layers without the need for local filters. The experimental results show the importance of full 3D attention for classifying lung nodules.
摘要:目的:近年来,基于非局部(Non-Local)的方法已成功应用于肺结节分类。然而,这些方法只能对低分辨率特征图提供2D注意力或受限的3D注意力。而且,它们仍然依赖卷积之类的便捷局部滤波器,因为完整的3D注意力计算代价高昂,且需要大规模数据集,而这可能难以获得。方法:我们提出使用3D轴向注意力,其所需计算量仅为常规非局部网络的一小部分。此外,我们提出在共享嵌入中加入3D位置编码,以解决非局部网络的位置不变性问题。结果:我们遵循严格的实验设置,仅使用至少由三名放射科医生标注的结节,在LIDC-IDRI数据集上验证了所提方法。结果表明,3D轴向注意力模型在包括AUC和准确率在内的所有评估指标上均达到了最先进的性能。结论:所提模型有效地提供了完整的3D注意力,可以在所有层中使用而无需局部滤波器。实验结果表明了完整3D注意力对肺结节分类的重要性。
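One axial-attention pass, along the depth axis only, is sketched below with illustrative shapes; the full model repeats this along H and W and adds 3D positional encodings to the shared embeddings:

```python
# 3D axial attention along D: fold H and W into the batch dimension so
# self-attention runs over depth only, at a fraction of full 3D cost.
import torch

b, c, d, h, w = 1, 32, 8, 8, 8
x = torch.randn(b, c, d, h, w)
attn = torch.nn.MultiheadAttention(embed_dim=c, num_heads=4, batch_first=True)

seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, d, c)  # (B*H*W, D, C)
out, _ = attn(seq, seq, seq)
out = out.reshape(b, h, w, d, c).permute(0, 4, 3, 1, 2)  # back to (B, C, D, H, W)
```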
80. Diagnosis/Prognosis of COVID-19 Images: Challenges, Opportunities, and Applications [PDF] 返回目录
Arash Mohammadi, Yingxu Wang, Nastaran Enshaei, Parnian Afshar, Farnoosh Naderkhani, Anastasia Oikonomou, Moezedin Javad Rafiee, Helder C. R. Oliveira, Svetlana Yanushkevich, Konstantinos N. Plataniotis
Abstract: The novel Coronavirus disease, COVID-19, has rapidly and abruptly changed the world as we knew it in 2020. It has become the most unprecedented challenge to analytic epidemiology in general and signal processing theories in particular. Given its high contingency nature and adverse effects across the world, it is important to develop efficient processing/learning models to overcome this pandemic and be prepared for potential future ones. In this regard, medical imaging plays an important role for the management of COVID-19. Human-centered interpretation of medical images is, however, tedious and can be subjective. This has resulted in a surge of interest to develop Radiomics models for analysis and interpretation of medical images. Signal Processing (SP) and Deep Learning (DL) models can assist in development of robust Radiomics solutions for diagnosis/prognosis, severity assessment, treatment response, and monitoring of COVID-19 patients. In this article, we aim to present an overview of the current state, challenges, and opportunities of developing SP/DL-empowered models for diagnosis (screening/monitoring) and prognosis (outcome prediction and severity assessment) of COVID-19 infection. More specifically, the article starts by elaborating the latest development on the theoretical framework of analytic epidemiology and hypersignal processing for COVID-19. Afterwards, imaging modalities and radiological characteristics of COVID-19 are discussed. SP/DL-based Radiomic models specific to the analysis of COVID-19 infection are then described, covering the following four domains: segmentation of COVID-19 lesions; predictive models for outcome prediction; severity assessment; and diagnosis/classification models. Finally, open problems and opportunities are presented in detail.
摘要:新型冠状病毒病COVID-19迅速而突然地改变了我们在2020年所认识的世界,成为分析流行病学整体、尤其是信号处理理论所面临的前所未有的挑战。鉴于其高度的偶发性及其在全球范围内的不利影响,开发高效的处理/学习模型以战胜这场大流行并为未来可能的大流行做好准备,是十分重要的。在这方面,医学影像对COVID-19的管理起着重要作用。然而,以人为中心的医学图像解读既繁琐又可能带有主观性,这引发了开发放射组学(Radiomics)模型来分析和解读医学图像的热潮。信号处理(SP)和深度学习(DL)模型可以协助开发稳健的放射组学解决方案,用于COVID-19患者的诊断/预后、严重程度评估、治疗反应和监测。在本文中,我们旨在概述开发SP/DL赋能模型用于COVID-19感染的诊断(筛查/监测)和预后(结果预测与严重程度评估)的现状、挑战和机遇。更具体地说,本文首先阐述COVID-19分析流行病学和超信号处理理论框架的最新进展;随后讨论COVID-19的成像方式和放射学特征;然后描述专门用于COVID-19感染分析的基于SP/DL的放射组学模型,涵盖以下四个领域:COVID-19病灶的分割、用于结果预测的预测模型、严重程度评估以及诊断/分类模型。最后,详细介绍了尚待解决的问题与机遇。
81. Model Optimization for Deep Space Exploration via Simulators and Deep Learning [PDF] 返回目录
James Bird, Kellan Colburn, Linda Petzold, Philip Lubin
Abstract: Machine learning, and eventually true artificial intelligence techniques, are extremely important advancements in astrophysics and astronomy. We explore the application of deep learning using neural networks in order to automate the detection of astronomical bodies for future exploration missions, such as missions to search for signatures or suitability of life. The ability to acquire images, analyze them, and send back those that are important, as determined by the deep learning algorithm, is critical in bandwidth-limited applications. Our previous foundational work solidified the concept of using simulator images and deep learning in order to detect planets. Optimization of this process is of vital importance, as even a small loss in accuracy might be the difference between capturing and completely missing a possibly-habitable nearby planet. Through computer vision, deep learning, and simulators, we introduce methods that optimize the detection of exoplanets. We show that maximum achieved accuracy can hit above 98% for multiple model architectures, even with a relatively small training set.
摘要:机器学习以及最终真正的人工智能技术,是天体物理学和天文学中极为重要的进展。我们探索使用神经网络的深度学习应用,以便在未来的探索任务(例如寻找生命迹象或宜居性的任务)中自动检测天体。按照深度学习算法的判断来获取图像、分析图像并回传重要图像的能力,在带宽受限的应用中至关重要。我们之前的基础工作确立了使用模拟器图像和深度学习来检测行星的思路。优化这一过程至关重要,因为即使很小的精度损失,也可能意味着捕获到还是完全错过一颗可能宜居的邻近行星之间的差别。借助计算机视觉、深度学习和模拟器,我们介绍了优化系外行星检测的方法。我们证明,即使训练集相对较小,多种模型架构的最高精度也能超过98%。
82. Generative Partial Visual-Tactile Fused Object Clustering [PDF] 返回目录
Tao Zhang, Yang Cong, Gan Sun, Jiahua Dong, Yuyang Liu, Zhengming Ding
Abstract: Visual-tactile fused sensing for object clustering has achieved significant progress recently, since the involvement of the tactile modality can effectively improve clustering performance. However, missing data (i.e., partial data) issues always happen due to occlusion and noises during the data collecting process. This issue is not well solved by most existing partial multi-view clustering methods for the heterogeneous modality challenge. Naively employing these methods would inevitably induce a negative effect and further hurt the performance. To solve the mentioned challenges, we propose a Generative Partial Visual-Tactile Fused (i.e., GPVTF) framework for object clustering. More specifically, we first do partial visual and tactile feature extraction from the partial visual and tactile data, respectively, and encode the extracted features in modality-specific feature subspaces. A conditional cross-modal clustering generative adversarial network is then developed to synthesize one modality conditioned on the other modality, which can compensate for missing samples and align the visual and tactile modalities naturally by adversarial learning. In the end, two pseudo-label based KL-divergence losses are employed to update the corresponding modality-specific encoders. Extensive comparative experiments on three public visual-tactile datasets prove the effectiveness of our method.
摘要:由于触觉模态的参与可以有效改善聚类性能,视觉-触觉融合感知在目标聚类方面近来取得了重大进展。然而,由于数据采集过程中的遮挡和噪声,数据缺失(即部分数据)问题时常发生。对于这一异构模态挑战,大多数现有的部分多视图聚类方法都无法很好地解决该问题。直接套用这些方法将不可避免地引入负面影响,并进一步损害性能。为了解决上述挑战,我们提出了一种用于目标聚类的生成式部分视觉-触觉融合(GPVTF)框架。更具体地说,我们首先分别从部分视觉和触觉数据中提取部分视觉和触觉特征,并将提取的特征编码到模态专属的特征子空间中。随后构建条件跨模态聚类生成对抗网络,以一种模态为条件合成另一种模态,从而补偿缺失样本,并通过对抗学习自然地对齐视觉与触觉模态。最后,使用两个基于伪标签的KL散度损失来更新相应的模态专属编码器。在三个公开视觉-触觉数据集上的大量对比实验证明了我们方法的有效性。
83. Domain Generalisation with Domain Augmented Supervised Contrastive Learning (Student Abstract) [PDF] 返回目录
Hoang Son Le, Rini Akmeliawati, Gustavo Carneiro
Abstract: Domain generalisation (DG) methods address the problem of domain shift, when there is a mismatch between the distributions of training and target domains. Data augmentation approaches have emerged as a promising alternative for DG. However, data augmentation alone is not sufficient to achieve lower generalisation errors. This project proposes a new method that combines data augmentation and domain distance minimisation to address the problems associated with data augmentation and provide a guarantee on the learning performance, under an existing framework. Empirically, our method outperforms baseline results on DG benchmarks.
摘要:当训练域与目标域的分布不匹配时,领域泛化(DG)方法旨在解决由此产生的域偏移问题。数据增强方法已成为DG的一种有前景的替代方案。然而,仅靠数据增强不足以实现更低的泛化误差。本项目提出了一种新方法,在现有框架下将数据增强与域距离最小化相结合,以解决与数据增强相关的问题,并为学习性能提供保证。实验表明,我们的方法在DG基准上优于基线结果。
84. Generalized Categorisation of Digital Pathology Whole Image Slides using Unsupervised Learning [PDF] 返回目录
Mostafa Ibrahim, Kevin Bryson
Abstract: This project aims to break down large pathology images into small tiles and then cluster those tiles into distinct groups without the knowledge of true labels. Our analysis shows how difficult certain aspects of clustering tumorous and non-tumorous cells can be, and also shows that comparing the results of different unsupervised approaches is not a trivial task. The project also provides a software package to be used by the digital pathology community, which uses some of the approaches developed to perform unsupervised tile classification, which could then be easily manually labelled. The project uses a mixture of techniques ranging from classical clustering algorithms such as K-Means and Gaussian Mixture Models to more complicated feature extraction techniques such as deep Autoencoders and multi-loss learning. Throughout the project, we attempt to set a benchmark for evaluation using a few measures such as completeness scores and cluster plots. Throughout our results we show that Convolutional Autoencoders manage to slightly outperform the rest of the approaches due to their powerful internal representation learning abilities. Moreover, we show that Gaussian Mixture Models produce better results than K-Means on average due to their flexibility in capturing different clusters. We also show the huge difference in the difficulty of classifying different types of pathology textures.
摘要:本项目旨在将大型病理图像分解为小图块,然后在不知道真实标签的情况下将这些图块聚类成不同的组。我们的分析显示了聚类肿瘤与非肿瘤细胞的某些方面可能有多困难,同时也表明比较不同无监督方法的结果并非易事。本项目还为数字病理学界提供了一个软件包,它使用其中开发的一些方法来执行无监督的图块分类,随后便可以轻松地进行人工标注。项目混合使用了多种技术,从K-Means和高斯混合模型等经典聚类算法,到深度自动编码器和多损失学习等更复杂的特征提取技术。在整个项目中,我们尝试使用完整性得分和聚类图等若干指标来建立评估基准。结果表明,卷积自动编码器凭借其强大的内部表示学习能力,略微优于其余方法。此外,高斯混合模型由于在捕获不同聚类方面更为灵活,平均而言比K-Means产生更好的结果。我们还展示了不同类型病理纹理在分类难度上的巨大差异。
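The core clustering comparison reduces to a few scikit-learn calls; in the sketch below, random vectors stand in for the tile embeddings that the autoencoders would produce:

```python
# Cluster tile embeddings with K-Means and a Gaussian Mixture Model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.random((500, 64))            # one row per tile embedding

km_labels = KMeans(n_clusters=5, n_init=10).fit_predict(features)
gmm = GaussianMixture(n_components=5).fit(features)
gmm_labels = gmm.predict(features)          # soft model, hard assignment
```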
85. Learning Generalized Spatial-Temporal Deep Feature Representation for No-Reference Video Quality Assessment [PDF] 返回目录
Baoliang Chen, Lingyu Zhu, Guo Li, Hongfei Fan, Shiqi Wang
Abstract: In this work, we propose a no-reference video quality assessment method, aiming to achieve high-generalization capability in cross-content, -resolution and -frame rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in spatial-temporal domain. In the spatial domain, to tackle the resolution and content variations, we impose the Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in a more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality. Experiments show that our method outperforms the state-of-the-art methods on cross-dataset settings, and achieves comparable performance on intra-dataset configurations, demonstrating the high-generalization capability of the proposed method.
摘要:在这项工作中,我们提出了一种无参考视频质量评估方法,旨在实现跨内容、跨分辨率和跨帧率质量预测的高泛化能力。具体而言,我们通过学习时空域中的有效特征表示来评估视频质量。在空间域,为了应对分辨率和内容的变化,我们对质量特征施加高斯分布约束。统一的分布可以显著缩小不同视频样本之间的域差距,从而获得更具泛化性的质量特征表示。在时间维度上,受视觉感知机制的启发,我们提出了一个金字塔时间聚合模块,结合短期和长期记忆来聚合帧级质量。实验表明,我们的方法在跨数据集设置上优于最新方法,并在数据集内配置上取得了可比的性能,证明了所提方法的高泛化能力。
86. WHU-Hi: UAV-borne hyperspectral with high spatial resolution (H2) benchmark datasets for hyperspectral image classification [PDF] 返回目录
Xin Hu, Yanfei Zhong, Chang Luo, Xinyu Wang
Abstract: Classification is an important aspect of hyperspectral image processing and application. At present, researchers mostly use classic airborne hyperspectral imagery as the benchmark dataset. However, existing datasets suffer from three bottlenecks: (1) low spatial resolution; (2) low labeled pixel proportion; (3) low degree of subclass distinction. In this paper, a new benchmark dataset named the Wuhan UAV-borne hyperspectral image (WHU-Hi) dataset was built for hyperspectral image classification. The WHU-Hi dataset was acquired with a high spectral resolution (nm level) and a very high spatial resolution (cm level), which we refer to here as H2 imagery. Besides, the WHU-Hi dataset has a higher pixel labeling ratio and finer subclasses. Some state-of-the-art hyperspectral image classification methods were benchmarked on the WHU-Hi dataset, and the experimental results show that WHU-Hi is a challenging dataset. We hope the WHU-Hi dataset can become a strong benchmark to accelerate future research.
87. Structure-Aware Layer Decomposition Learning Based on Gaussian Convolution Model for Inverse Halftoning [PDF]
Chang-Hwan Son
Abstract: Layer decomposition, which separates an input image into base and detail layers, has been steadily used for image restoration. Existing residual networks based on an additive model require residual layers with a small output range for fast convergence and visual quality improvement. However, in inverse halftoning, homogeneous dot patterns prevent the residual layers from maintaining a small output range. Therefore, a new layer decomposition network based on the Gaussian convolution model (GCM) and a structure-aware deblurring strategy is presented to achieve residual learning for both the base and detail layers. For the base layer, a new GCM-based residual subnetwork is presented. The GCM exploits a statistical observation: the image difference between a continuous-tone image and a halftoned image, both blurred with the same Gaussian filter, has a narrow output range. Accordingly, the GCM-based residual subnetwork takes a Gaussian-filtered halftoned image as input and outputs the image difference as a residual, thereby generating the base layer, i.e., the Gaussian-blurred continuous-tone image. For the detail layer, a new structure-aware residual deblurring subnetwork (SARDS) is presented. To remove the Gaussian blurring of the base layer, the SARDS uses the predicted base layer as input and outputs its deblurred version. To more effectively restore image structures such as lines and text, a new image structure map predictor is incorporated into the deblurring network to induce structure-adaptive learning. This paper thus provides a method to realize residual learning for both the base and detail layers based on the GCM and SARDS. In addition, it is verified that the proposed method surpasses state-of-the-art methods based on U-Net, direct deblurring networks, and progressive residual networks.
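The GCM described above rests on a simple observation: after blurring with the same Gaussian filter, a halftone and its continuous-tone original differ only by a narrow-range residual. Below is a sketch of how the base-layer training pair could be formed, assuming SciPy; the variable names and the sigma value are illustrative only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gcm_base_layer_pair(continuous_tone: np.ndarray,
                        halftone: np.ndarray,
                        sigma: float = 2.0):
    """Build (network input, regression target) for the base-layer subnetwork:
    the Gaussian-filtered halftone and its narrow-range difference to the
    Gaussian-blurred continuous-tone image."""
    blurred_ct = gaussian_filter(continuous_tone.astype(np.float32), sigma)
    blurred_ht = gaussian_filter(halftone.astype(np.float32), sigma)
    residual = blurred_ct - blurred_ht  # narrow output range per the GCM
    return blurred_ht, residual
```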
88. Histogram Matching Augmentation for Domain Adaptation with Application to Multi-Centre, Multi-Vendor and Multi-Disease Cardiac Image Segmentation [PDF]
Jun Ma
Abstract: Convolutional Neural Networks (CNNs) have achieved high accuracy for cardiac structure segmentation when training cases and testing cases are drawn from the same distribution. However, performance degrades when the testing cases come from a distinct domain (e.g., new MRI scanners or clinical centers). In this paper, we propose a histogram matching (HM) data augmentation method to eliminate the domain gap. Specifically, our method generates new training cases by using HM to transfer the intensity distribution of testing cases to existing training cases. The proposed method is quite simple and can be used in a plug-and-play way in many segmentation tasks. The method is evaluated on the MICCAI 2020 M&Ms challenge, and achieves average Dice scores of 0.9051, 0.8405, and 0.8749, and Hausdorff Distances of 9.996, 12.49, and 12.68 for the left ventricle, myocardium, and right ventricle, respectively. Our results ranked third in the MICCAI 2020 M&Ms challenge. The code and trained models are publicly available at this https URL.
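Since the augmentation is plain histogram matching, it can be reproduced with off-the-shelf tools. Below is a minimal sketch using scikit-image's match_histograms; per the abstract, the test-domain scan serves as the intensity reference, while the surrounding helper names are hypothetical.

```python
import numpy as np
from skimage.exposure import match_histograms

def hm_augment(train_image: np.ndarray, test_image: np.ndarray) -> np.ndarray:
    """Create a new training case whose intensity distribution matches
    a (test-domain) reference image."""
    return match_histograms(train_image, test_image)

# Hypothetical usage: pair every training case with a random test case.
# augmented = [hm_augment(x, rng.choice(test_cases)) for x in train_cases]
```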
89. Improving the Generalization of End-to-End Driving through Procedural Generation [PDF]
Quanyi Li, Zhenghao Peng, Qihang Zhang, Cong Qiu, Chunxiao Liu, Bolei Zhou
Abstract: Recently there has been growing interest in end-to-end training for autonomous driving, where the entire driving pipeline from perception to control is modeled as a neural network and jointly optimized. End-to-end driving is usually first developed and validated in simulators. However, most of the existing driving simulators only contain a fixed set of maps and a limited number of configurations. As a result, the deep models are prone to overfitting the training scenarios. Furthermore, it is difficult to assess how well the trained models generalize to unseen scenarios. To better evaluate and improve the generalization of end-to-end driving, we introduce an open-ended and highly configurable driving simulator called PGDrive. PGDrive first defines multiple basic road blocks, such as ramps, forks, and roundabouts, with configurable settings. A range of diverse maps can then be assembled from those blocks through procedural generation and turned into interactive environments. The experiments show that a driving agent trained by reinforcement learning on a small fixed set of maps generalizes poorly to unseen maps. We further validate that training with an increasing number of procedurally generated maps significantly improves the generalization of the agent across scenarios of different traffic densities and map structures. Code is available at: this https URL
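The block-based procedural generation can be pictured as sampling a sequence of typed, parameterised road blocks and chaining them into a map. A toy sketch under our own assumptions follows; the block types and parameters are invented for illustration and do not reflect the actual PGDrive API.

```python
import random

BLOCK_TYPES = ["straight", "ramp", "fork", "roundabout", "curve"]  # illustrative

def generate_map(num_blocks, seed=None):
    """Assemble a map description as a list of parameterised road blocks."""
    rng = random.Random(seed)
    return [{"type": rng.choice(BLOCK_TYPES),
             "length_m": rng.uniform(50.0, 200.0),
             "lanes": rng.randint(1, 4)} for _ in range(num_blocks)]

# A training set of procedurally generated maps, one per seed.
maps = [generate_map(10, seed=s) for s in range(100)]
```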
90. Multidimensional Uncertainty-Aware Evidential Neural Networks [PDF]
Yibo Hu, Yuzhe Ou, Xujiang Zhao, Jin-Hee Cho, Feng Chen
Abstract: Traditional deep neural networks (NNs) have significantly contributed to state-of-the-art performance in classification tasks across various application domains. However, NNs have not considered the inherent uncertainty in the class probabilities, where misclassification under uncertainty may easily introduce high risk in real-world decision making (e.g., misclassification of objects on roads leads to serious accidents). Unlike Bayesian NNs, which indirectly infer uncertainty through weight uncertainties, evidential NNs (ENNs) have recently been proposed to explicitly model the uncertainty of class probabilities and use it for classification tasks. An ENN formulates the predictions of an NN as subjective opinions and learns a deterministic function that collects, from data, the evidence forming those opinions. However, an ENN is trained as a black box, without explicitly considering the different root causes of the inherent uncertainty in data, such as vacuity (i.e., uncertainty due to a lack of evidence) or dissonance (i.e., uncertainty due to conflicting evidence). By considering multidimensional uncertainty, we propose a novel uncertainty-aware evidential NN called WGAN-ENN (WENN) for solving the out-of-distribution (OOD) detection problem. We take a hybrid approach that combines a Wasserstein Generative Adversarial Network (WGAN) with ENNs to jointly train a model with prior knowledge of a certain class, which yields high vacuity for OOD samples. Via extensive empirical experiments on both synthetic and real-world datasets, we demonstrate that the uncertainty estimated by WENN can significantly help distinguish OOD samples from boundary samples. WENN outperforms other competitive counterparts in OOD detection.
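In the subjective-logic formulation that evidential networks build on, belief masses and vacuity follow directly from the predicted Dirichlet parameters. A short NumPy sketch of these standard quantities is given below; the paper's dissonance measure and the WGAN training loop are not reproduced here.

```python
import numpy as np

def belief_and_vacuity(alpha):
    """Given Dirichlet parameters alpha over K classes:
    evidence e_k = alpha_k - 1,  S = sum(alpha),
    belief  b_k = e_k / S,       vacuity u = K / S,  with sum(b) + u = 1."""
    alpha = np.asarray(alpha, dtype=float)
    K, S = alpha.size, alpha.sum()
    return (alpha - 1.0) / S, K / S

# Little evidence -> vacuity near 1, the signature of an OOD input.
print(belief_and_vacuity([1.1, 1.2, 1.1]))  # vacuity ~ 0.88
```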
91. Deep Learning Framework Applied for Predicting Anomaly of Respiratory Sounds [PDF]
Dat Ngo, Lam Pham, Anh Nguyen, Ben Phan, Khoa Tran, Truong Nguyen
Abstract: This paper proposes a robust deep learning framework for classifying anomalies in respiratory cycles. Our framework starts with a front-end feature extraction step, which transforms the respiratory input sound into a two-dimensional spectrogram in which both spectral and temporal features are well represented. Next, an ensemble of C-DNN and Autoencoder networks is applied to classify respiratory cycles into four anomaly categories. In this work, we conducted experiments on the 2017 International Conference on Biomedical and Health Informatics (ICBHI) benchmark dataset. As a result, we achieve competitive performance, with an ICBHI average score of 0.49 and an ICBHI harmonic score of 0.42.
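The front-end step, turning a respiratory recording into a two-dimensional spectrogram, can be sketched with librosa; the sampling rate and mel-band count below are illustrative, not the paper's settings.

```python
import librosa
import numpy as np

def audio_to_logmel(path, sr=16000, n_mels=128):
    """Load a respiratory cycle and return a log-mel spectrogram
    (frequency x time), the 2-D input fed to the classifier ensemble."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```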
92. Taxonomy of multimodal self-supervised representation learning [PDF]
Alex Fedorov, Tristan Sylvain, Margaux Luck, Lei Wu, Thomas P. DeRamus, Alex Kirilin, Dmitry Bleklov, Sergey M. Plis, Vince D. Calhoun
Abstract: Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors and get combined based on the factors they share. This system has motivated the design of powerful unsupervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We show empirically, on two versions of multimodal MNIST and a multimodal brain imaging dataset, that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task, and (3) maximizing the similarity between representations has a regularizing effect on a neural network, which can sometimes reduce downstream performance but still reveal multimodal relations. Consequently, we outperform previous unsupervised encoder-decoder methods based on CCA or variational mixtures (MMVAE) on various datasets under the linear evaluation protocol.
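Most objectives organised by such a taxonomy are variants of a similarity metric between paired model components. As one representative building block, here is a minimal PyTorch sketch of a symmetric InfoNCE loss between embeddings of two modalities; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def infonce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Symmetric contrastive loss between two batches of paired
    modality embeddings of shape (batch, dim)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau  # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```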
93. Teaching Robots Novel Objects by Pointing at Them [PDF]
Sagar Gubbi Venkatesh, Raviteja Upadrashta, Shishir Kolathaya, Bharadwaj Amrutur
Abstract: Robots that must operate in novel environments and collaborate with humans must be capable of acquiring new knowledge from human experts during operation. We propose teaching a robot novel objects it has not encountered before by pointing a hand at the new object of interest. An end-to-end neural network is used to attend to the novel object of interest indicated by the pointing hand and then to localize the object in new scenes. In order to attend to the novel object indicated by the pointing hand, we propose a spatial attention modulation mechanism that learns to focus on the highlighted object while ignoring the other objects in the scene. We show that a robot arm can manipulate novel objects that are highlighted by pointing a hand at them. We also evaluate the performance of the proposed architecture on a synthetic dataset constructed using emojis and on a real-world dataset of common objects.
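The spatial attention modulation can be pictured as gating the backbone feature map with a heatmap derived from the pointing hand. A toy PyTorch sketch under our own assumptions follows; the paper's actual mechanism is learned end-to-end and may differ.

```python
import torch

def modulate_features(features: torch.Tensor,
                      point_heatmap: torch.Tensor) -> torch.Tensor:
    """Gate a feature map (B, C, H, W) with a pointing heatmap (B, 1, H, W)
    so that responses near the indicated object are emphasised."""
    attn = torch.sigmoid(point_heatmap)  # soft spatial mask in (0, 1)
    return features * attn               # broadcast over channels
```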
94. COVIDX: Computer-aided diagnosis of Covid-19 and its severity prediction with raw digital chest X-ray images [PDF]
Wajid Arshad Abbasi, Syed Ali Abbas, Saiqa Andleeb
Abstract: Coronavirus disease (COVID-19) is a contagious infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and it has infected and killed millions of people across the globe. In the absence of specific drugs or vaccines for the treatment of COVID-19, and given the limitations of prevailing diagnostic techniques, there is a need for alternate automatic screening systems that physicians can use to quickly identify and isolate infected patients. A chest X-ray (CXR) image can be used as an alternative modality to detect and diagnose COVID-19. In this study, we present an automatic COVID-19 diagnostic and severity prediction (COVIDX) system that uses deep feature maps from CXR images to diagnose COVID-19 and predict its severity. The proposed system uses a three-phase classification approach (healthy vs. unhealthy, COVID-19 vs. pneumonia, and COVID-19 severity) with different shallow supervised classification algorithms. We evaluated COVIDX not only through 10-fold cross-validation and an external validation dataset, but also in real settings involving an experienced radiologist. In all evaluation settings, COVIDX outperforms all existing state-of-the-art methods designed for this purpose. We made COVIDX easily accessible through a cloud-based web server and Python code, available at this https URL and this https URL, respectively.
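The three-phase approach amounts to a cascade of shallow classifiers over deep CXR features. A schematic sketch assuming scikit-learn and pre-extracted feature vectors follows; the staging mirrors the abstract, while the classifier choice and label encoding are hypothetical.

```python
from sklearn.svm import SVC

def train_cascade(X_all, y_sick, X_sick, y_covid, X_covid, y_sev):
    """Fit the three stages on pre-extracted deep CXR feature vectors."""
    s1 = SVC().fit(X_all, y_sick)    # healthy (0) vs unhealthy (1)
    s2 = SVC().fit(X_sick, y_covid)  # pneumonia (0) vs COVID-19 (1)
    s3 = SVC().fit(X_covid, y_sev)   # COVID-19 severity grade
    return s1, s2, s3

def diagnose(x, s1, s2, s3):
    if s1.predict([x])[0] == 0:
        return "healthy"
    if s2.predict([x])[0] == 0:
        return "pneumonia"
    return f"COVID-19, severity {s3.predict([x])[0]}"
```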
95. Deep Learning Methods for Screening Pulmonary Tuberculosis Using Chest X-rays [PDF]
Chirath Dasanayakaa, Maheshi Buddhinee Dissanayake
Abstract: Tuberculosis (TB) is a contagious airborne bacterial disease and one of the top 10 causes of death worldwide. According to the World Health Organization (WHO), around 1.8 billion people are infected with TB, and 1.6 million deaths were reported in 2018. More importantly, 95% of cases and deaths were from developing countries. Yet TB is a completely curable disease if diagnosed early. To achieve this goal, one of the key requirements is efficient utilization of existing diagnostic technologies, among which the chest X-ray is the first-line diagnostic tool used for screening for active TB. The presented deep learning pipeline consists of three different state-of-the-art deep learning architectures to generate, segment, and classify lung X-rays. In addition, image preprocessing, image augmentation, genetic-algorithm-based hyperparameter tuning, and model ensembling were used to improve the diagnostic process. We achieve a classification accuracy of 97.1% (Youden's index of 0.941, sensitivity of 97.9%, and specificity of 96.2%), a considerable improvement over the existing work in the literature. In our work, we present a highly accurate, automated TB screening system using chest X-rays, which would be especially helpful for low-income countries with limited access to qualified medical professionals.
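The reported Youden's index is consistent with the stated sensitivity and specificity, since J = sensitivity + specificity - 1:

```python
sensitivity, specificity = 0.979, 0.962
youden_j = sensitivity + specificity - 1  # 0.941, matching the reported value
```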
96. Three-dimensional Simultaneous Shape and Pose Estimation for Extended Objects Using Spherical Harmonics [PDF]
Gerhard Kurz, Florian Faion, Florian Pfaff, Antonio Zea, Uwe D. Hanebeck
Abstract: We propose a new recursive method for simultaneous estimation of both the pose and the shape of a three-dimensional extended object. The key idea of the presented method is to represent the shape of the object using spherical harmonics, similar to the way Fourier series can be used in the two-dimensional case. This allows us to derive a measurement equation that can be used within the framework of nonlinear filters such as the UKF. We provide both simulative and experimental evaluations of the novel techniques.
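The shape model is a truncated spherical-harmonic expansion of the radial extent as a function of direction, the 3-D analogue of a Fourier series. A sketch using SciPy follows; the coefficient layout is illustrative, not the paper's parameterisation.

```python
import numpy as np
from scipy.special import sph_harm

def radial_extent(coeffs, theta, phi):
    """Evaluate r(theta, phi) = sum_{l,m} c_{l,m} Y_l^m(theta, phi).
    coeffs maps (l, m) -> coefficient; theta is the azimuthal angle and
    phi the polar angle, following SciPy's convention."""
    r = sum(c * sph_harm(m, l, theta, phi) for (l, m), c in coeffs.items())
    return float(np.real(r))

# A unit sphere (Y_0^0 = 1 / (2*sqrt(pi))) plus a small l=2 deformation.
coeffs = {(0, 0): 2.0 * np.sqrt(np.pi), (2, 0): 0.3}
print(radial_extent(coeffs, theta=0.5, phi=1.0))  # ~0.99
```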
97. Comprehensive Graph-conditional Similarity Preserving Network for Unsupervised Cross-modal Hashing [PDF]
Jun Yu, Hao Zhou, Yibing Zhan, Dacheng Tao
Abstract: Unsupervised cross-modal hashing (UCMH) has recently become a hot topic. Current UCMH work focuses on exploring data similarities. However, current UCMH methods calculate the similarity between two samples mainly by relying on their cross-modal features. These methods suffer from inaccurate similarity estimates that result in a suboptimal retrieval Hamming space, because the cross-modal features are not sufficient to describe complex data relationships, such as situations where two samples have different feature representations but share the same underlying concepts. In this paper, we devise a deep graph-neighbor coherence preserving network (DGCPN). Specifically, DGCPN stems from graph models and explores graph-neighbor coherence by consolidating the information between data and their neighbors. DGCPN regulates comprehensive similarity-preserving losses by exploiting three types of data similarity (i.e., graph-neighbor coherence, coexistent similarity, and intra- and inter-modality consistency) and designs a half-real and half-binary optimization strategy to reduce the quantization errors during hashing. Essentially, DGCPN addresses the inaccurate similarity problem by exploring and exploiting the data's intrinsic relationships in a graph. We conduct extensive experiments on three public UCMH datasets. The experimental results demonstrate the superiority of DGCPN, e.g., improving the mean average precision from 0.722 to 0.751 on MIRFlickr-25K when using 64-bit hashing codes to retrieve texts from images. We will release the source code package and the trained model at this https URL.
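Downstream of the learned encoders, cross-modal retrieval with binary codes reduces to sign-quantising the continuous codes and ranking by Hamming distance. A short NumPy sketch of that final step only; the DGCPN network itself is out of scope here.

```python
import numpy as np

def binarize(codes: np.ndarray) -> np.ndarray:
    """Half-binary -> binary: sign-quantise continuous hash codes to {0, 1}."""
    return (codes > 0).astype(np.uint8)

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Return database indices sorted by Hamming distance to the query."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)
```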
98. Prediction by Anticipation: An Action-Conditional Prediction Method based on Interaction Learning [PDF]
Ershad Banijamali, Mohsen Rohani, Elmira Amirloo, Jun Luo, Pascal Poupart
Abstract: In autonomous driving (AD), accurately predicting changes in the environment can effectively improve safety and comfort. Due to complex interactions among traffic participants, however, it is very hard to achieve accurate prediction over a long horizon. To address this challenge, we propose prediction by anticipation, which views interaction in terms of a latent probabilistic generative process wherein some vehicles move partly in response to the anticipated motion of other vehicles. Under this view, consecutive data frames can be factorized into sequential samples from an action-conditional distribution that effectively generalizes to a wider range of actions and driving situations. Our proposed prediction model, variational Bayesian in nature, is trained to maximize the evidence lower bound (ELBO) of the log-likelihood of this conditional distribution. Evaluations of our approach on the prominent AD datasets NGSIM I-80 and Argoverse show significant improvement over the current state of the art in both accuracy and generalization.
99. Mixed-Privacy Forgetting in Deep Networks [PDF]
Aditya Golatkar, Alessandro Achille, Avinash Ravichandran, Marzia Polito, Stefano Soatto
Abstract: We show that the influence of a subset of the training samples can be removed -- or "forgotten" -- from the weights of a network trained on large-scale image classification tasks, and we provide strong computable bounds on the amount of remaining information after forgetting. Inspired by real-world applications of forgetting techniques, we introduce a novel notion of forgetting in a mixed-privacy setting, where we know that a "core" subset of the training samples does not need to be forgotten. While this variation of the problem is conceptually simple, we show that working in this setting significantly improves the accuracy and guarantees of forgetting methods applied to vision classification tasks. Moreover, our method allows efficient removal of all information contained in non-core data by simply setting a subset of the weights to zero, with minimal loss in performance. We achieve these results by replacing a standard deep network with a suitable linear approximation. With opportune changes to the network architecture and training procedure, we show that such a linear approximation achieves performance comparable to the original network and that the forgetting problem becomes quadratic and can be solved efficiently even for large models. Unlike previous forgetting methods for deep networks, ours can achieve close to state-of-the-art accuracy on large-scale vision tasks. In particular, we show that our method allows forgetting without having to trade off model accuracy.
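The removal mechanism itself is described as zeroing a subset of the weights of the linearised model. A schematic PyTorch sketch of that final step follows; how the mask is chosen (the quadratic forgetting optimisation) is the paper's core contribution and is not reproduced here.

```python
import torch

@torch.no_grad()
def forget_by_zeroing(model: torch.nn.Module, masks: dict) -> None:
    """Zero the weight entries selected by the forgetting procedure.
    masks maps parameter names to boolean tensors of matching shape;
    True marks weights carrying information about the forgotten samples."""
    for name, param in model.named_parameters():
        if name in masks:
            param[masks[name]] = 0.0
```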
Note: The cover image is a word cloud of the paper titles.