Contents
10. Correct block-design experiments mitigate temporal correlation bias in EEG classification [PDF] Abstract
13. Self-supervised asymmetric deep hashing with margin-scalable constraint for image retrieval [PDF] Abstract
16. Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion [PDF] Abstract
18. Combining Deep Transfer Learning with Signal-image Encoding for Multi-Modal Mental Wellbeing Classification [PDF] Abstract
20. A New Framework for Registration of Semantic Point Clouds from Stereo and RGB-D Cameras [PDF] Abstract
21. Generic Semi-Supervised Adversarial Subject Translation for Sensor-Based Human Activity Recognition [PDF] Abstract
25. G-RCN: Optimizing the Gap between Classification and Localization Tasks for Object Detection [PDF] Abstract
26. w-Net: Dual Supervised Medical Image Segmentation Model with Multi-Dimensional Attention and Cascade Multi-Scale Convolution [PDF] Abstract
30. Learned Block Iterative Shrinkage Thresholding Algorithm for Photothermal Super Resolution Imaging [PDF] Abstract
41. Selective Pseudo-Labeling with Reinforcement Learning for Semi-Supervised Domain Adaptation [PDF] Abstract
42. Interpreting Deep Neural Networks with Relative Sectional Propagation by Analyzing Comparative Gradients and Hostile Activations [PDF] Abstract
43. Boosting Image Super-Resolution Via Fusion of Complementary Information Captured by Multi-Modal Sensors [PDF] Abstract
47. Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [PDF] Abstract
50. Select, Label, and Mix: Learning Discriminative Invariant Feature Representations for Partial Domain Adaptation [PDF] Abstract
52. Improving Auto-Encoders' self-supervised image classification using pseudo-labelling via data augmentation and the perceptual loss [PDF] Abstract
53. Efficient Human Pose Estimation with Depthwise Separable Convolution and Person Centroid Guided Joint Grouping [PDF] Abstract
55. Pedestrian Behavior Prediction via Multitask Learning and Categorical Interaction Modeling [PDF] Abstract
58. Spatiotemporal tomography based on scattered multiangular signals and its application for resolving evolving clouds using moving platforms [PDF] Abstract
60. MVHM: A Large-Scale Multi-View Hand Mesh Benchmark for Accurate 3D Hand Pose Estimation [PDF] Abstract
62. DGGAN: Depth-image Guided Generative Adversarial Networks for Disentangling RGB and Depth Images in 3D Hand Pose Estimation [PDF] Abstract
67. Food Classification with Convolutional Neural Networks and Multi-Class Linear Discernment Analysis [PDF] Abstract
69. Automatic sampling and training method for wood-leaf classification based on tree terrestrial point cloud [PDF] Abstract
70. YieldNet: A Convolutional Neural Network for Simultaneous Corn and Soybean Yield Prediction Based on Remote Sensing Data [PDF] Abstract
75. Semantic Segmentation of Medium-Resolution Satellite Imagery using Conditional Generative Adversarial Networks [PDF] Abstract
90. Cosine-Pruned Medial Axis: A new method for isometric equivariant and noise-free medial axis extraction [PDF] Abstract
92. Driver Glance Classification In-the-wild: Towards Generalization Across Domains and Subjects [PDF] Abstract
96. Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification [PDF] Abstract
99. Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems [PDF] Abstract
100. Overcoming Barriers to Data Sharing with Medical Image Generation: A Comprehensive Evaluation [PDF] Abstract
103. Learning normal appearance for fetal anomaly screening: Application to the unsupervised detection of Hypoplastic Left Heart Syndrome [PDF] Abstract
107. Deep Metric Learning-based Image Retrieval System for Chest Radiograph and its Clinical Applications in COVID-19 [PDF] Abstract
108. Noise2Kernel: Adaptive Self-Supervised Blind Denoising using a Dilated Convolutional Kernel Architecture [PDF] Abstract
109. Efficient Kernel based Matched Filter Approach for Segmentation of Retinal Blood Vessels [PDF] Abstract
112. Robustness Investigation on Deep Learning CT Reconstruction for Real-Time Dose Optimization [PDF] Abstract
120. CoEdge: Cooperative DNN Inference with Adaptive Workload Partitioning over Heterogeneous Edge Devices [PDF] Abstract
123. Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification [PDF] Abstract
127. A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations? [PDF] Abstract
128. Automatic Segmentation and Location Learning of Neonatal Cerebral Ventricles in 3D Ultrasound Data Combining CNN and CPPN [PDF] Abstract
129. SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams [PDF] Abstract
Abstracts
1. Identity-Driven DeepFake Detection [PDF] Back to contents
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, Baining Guo
Abstract: DeepFake detection has so far been dominated by "artifact-driven" methods and the detection performance significantly degrades when either the type of image artifacts is unknown or the artifacts are simply too hard to find. In this work, we present an alternative approach: Identity-Driven DeepFake Detection. Our approach takes as input the suspect image/video as well as the target identity information (a reference image or video). We output a decision on whether the identity in the suspect image/video is the same as the target identity. Our motivation is to prevent the most common and harmful DeepFakes that spread false information of a targeted person. The identity-based approach is fundamentally different in that it does not attempt to detect image artifacts. Instead, it focuses on whether the identity in the suspect image/video is true. To facilitate research on identity-based detection, we present a new large scale dataset "Vox-DeepFake", in which each suspect content is associated with multiple reference images collected from videos of a target identity. We also present a simple identity-based detection algorithm called the OuterFace, which may serve as a baseline for further research. Even trained without fake videos, the OuterFace algorithm achieves superior detection accuracy and generalizes well to different DeepFake methods, and is robust with respect to video degradation techniques -- a performance not achievable with existing detection algorithms.
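The decision rule can be illustrated with a minimal verification-style sketch in Python: embed the suspect frames and the reference images with any face-embedding network, build an identity prototype from the references, and threshold the similarity. This is a generic sketch under assumed inputs, not the paper's OuterFace algorithm; the mean-pooling and the threshold value are illustrative choices.

    import numpy as np

    def identity_decision(suspect_embs, reference_embs, threshold=0.5):
        # suspect_embs:   (n, d) L2-normalized embeddings of suspect video frames
        # reference_embs: (m, d) L2-normalized embeddings of reference images
        # Aggregate the reference set into a single identity prototype.
        prototype = reference_embs.mean(axis=0)
        prototype /= np.linalg.norm(prototype)
        # Cosine similarity of every suspect frame to the prototype; a face swap
        # that replaces the target identity drags the mean similarity down.
        sims = suspect_embs @ prototype
        return float(sims.mean()) >= threshold  # True = identity judged genuine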
2. NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis [PDF] Back to contents
Pratul P. Srinivasan, Boyang Deng, Xiuming Zhang, Matthew Tancik, Ben Mildenhall, Jonathan T. Barron
Abstract: We present a method that takes as input a set of images of a scene illuminated by unconstrained known lighting, and produces as output a 3D representation that can be rendered from novel viewpoints under arbitrary lighting conditions. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D location and whose outputs are the following scene properties at that input location: volume density, surface normal, material parameters, distance to the first surface intersection in any direction, and visibility of the external environment in any direction. Together, these allow us to render novel views of the object under arbitrary lighting, including indirect illumination effects. The predicted visibility and surface intersection fields are critical to our model's ability to simulate direct and indirect illumination during training, because the brute-force techniques used by prior work are intractable for lighting conditions outside of controlled setups with a single light. Our method outperforms alternative approaches for recovering relightable 3D scene representations, and performs well in complex lighting settings that have posed a significant challenge to prior work.
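A minimal PyTorch sketch of an MLP with the input/output signature the abstract describes (the layer widths, the three-parameter material model, and the way the per-direction outputs are conditioned on a query direction are assumptions, not the paper's architecture):

    import torch
    import torch.nn as nn

    class NeRVStyleMLP(nn.Module):
        # Maps a 3D location x to volume density (1), surface normal (3) and
        # material parameters (assumed 3), plus, for a query direction d, the
        # distance to the first surface intersection (1) and the visibility
        # of the external environment (1).
        def __init__(self, width=256):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(3, width), nn.ReLU(),
                                       nn.Linear(width, width), nn.ReLU())
            self.density_normal_material = nn.Linear(width, 1 + 3 + 3)
            self.dist_vis = nn.Sequential(nn.Linear(width + 3, width), nn.ReLU(),
                                          nn.Linear(width, 2))

        def forward(self, x, d):
            h = self.trunk(x)
            out = self.density_normal_material(h)
            dist, vis = self.dist_vis(torch.cat([h, d], dim=-1)).unbind(-1)
            return out[..., :1], out[..., 1:4], out[..., 4:], dist, vis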
3. NeRD: Neural Reflectance Decomposition from Image Collections [PDF] Back to contents
Mark Boss, Raphael Braun, Varun Jampani, Jonathan T. Barron, Ce Liu, Hendrik P.A. Lensch
Abstract: Decomposing a scene into its shape, reflectance, and illumination is a challenging but essential problem in computer vision and graphics. This problem is inherently more challenging when the illumination is not a single light source under laboratory conditions but is instead an unconstrained environmental illumination. Though recent work has shown that implicit representations can be used to model the radiance field of an object, these techniques only enable view synthesis and not relighting. Additionally, evaluating these radiance fields is resource and time-intensive. By decomposing a scene into explicit representations, any rendering framework can be leveraged to generate novel views under any illumination in real-time. NeRD is a method that achieves this decomposition by introducing physically-based rendering to neural radiance fields. Even challenging non-Lambertian reflectances, complex geometry, and unknown illumination can be decomposed to high-quality models. The datasets and code are available at the project page: this https URL
4. MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation [PDF] Back to contents
Saim Wani, Shivansh Patel, Unnat Jain, Angel X. Chang, Manolis Savva
Abstract: Navigation tasks in photorealistic 3D environments are challenging because they require perception and effective planning under partial observability. Recent work shows that map-like memory is useful for long-horizon navigation tasks. However, a focused investigation of the impact of maps on navigation tasks of varying complexity has not yet been performed. We propose the multiON task, which requires navigation to an episode-specific sequence of objects in a realistic environment. MultiON generalizes the ObjectGoal navigation task and explicitly tests the ability of navigation agents to locate previously observed goal objects. We perform a set of multiON experiments to examine how a variety of agent models perform across a spectrum of navigation task complexities. Our experiments show that: i) navigation performance degrades dramatically with escalating task complexity; ii) a simple semantic map agent performs surprisingly well relative to more complex neural image feature map agents; and iii) even oracle map agents achieve relatively low performance, indicating the potential for future work in training embodied navigation agents using maps. Video summary: this https URL
5. Learning Video Instance Segmentation with Recurrent Graph Neural Networks [PDF] Back to contents
Joakim Johnander, Emil Brissman, Martin Danelljan, Michael Felsberg
Abstract: Most existing approaches to video instance segmentation comprise multiple modules that are heuristically combined to produce the final output. Formulating a purely learning-based method instead, which models both the temporal aspect as well as a generic track management required to solve the video instance segmentation task, is a highly challenging problem. In this work, we propose a novel learning formulation, where the entire video instance segmentation problem is modelled jointly. We fit a flexible model to our formulation that, with the help of a graph neural network, processes all available new information in each frame. Past information is considered and processed via a recurrent connection. We demonstrate the effectiveness of the proposed approach in comprehensive experiments. Our approach, operating at over 25 FPS, outperforms previous video real-time methods. We further conduct detailed ablative experiments that validate the different aspects of our approach.
6. Model Compression Using Optimal Transport [PDF] Back to contents
Suhas Lohit, Michael Jones
Abstract: Model compression methods are important to allow for easier deployment of deep learning models in compute, memory and energy-constrained environments such as mobile phones. Knowledge distillation is a class of model compression algorithm where knowledge from a large teacher network is transferred to a smaller student network thereby improving the student's performance. In this paper, we show how optimal transport-based loss functions can be used for training a student network which encourages learning student network parameters that help bring the distribution of student features closer to that of the teacher features. We present image classification results on CIFAR-100, SVHN and ImageNet and show that the proposed optimal transport loss functions perform comparably to or better than other loss functions.
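As a concrete illustration of an optimal-transport training signal, here is a minimal log-domain Sinkhorn sketch measuring the entropy-regularized transport cost between a batch of student features and a batch of teacher features. The squared-Euclidean cost, uniform batch marginals, eps and the iteration count are generic assumptions, not the paper's exact loss.

    import math
    import torch

    def sinkhorn_distill_loss(student_feats, teacher_feats, eps=0.05, n_iters=50):
        # student_feats: (n, d), teacher_feats: (m, d)
        n, m = student_feats.size(0), teacher_feats.size(0)
        C = torch.cdist(student_feats, teacher_feats) ** 2   # pairwise cost matrix
        log_K = -C / eps
        log_a = torch.full((n,), -math.log(n))               # uniform marginals
        log_b = torch.full((m,), -math.log(m))
        log_u, log_v = torch.zeros(n), torch.zeros(m)
        for _ in range(n_iters):                             # Sinkhorn iterations
            log_u = log_a - torch.logsumexp(log_K + log_v[None, :], dim=1)
            log_v = log_b - torch.logsumexp(log_K + log_u[:, None], dim=0)
        log_P = log_u[:, None] + log_K + log_v[None, :]      # transport plan (log)
        return (log_P.exp() * C).sum()                       # <P, C>, differentiable

Minimizing this with respect to the student parameters pulls the distribution of student features toward that of the teacher features, which is the effect the abstract describes.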
7. IHashNet: Iris Hashing Network based on efficient multi-index hashing [PDF] Back to contents
Avantika Singh, Chirag Vashist, Pratyush Gaurav, Aditya Nigam, Rameshwar Pratap
Abstract: Massive biometric deployments are pervasive in today's world. But despite the high accuracy of biometric systems, their computational efficiency degrades drastically with an increase in the database size. Thus, it is essential to index them. An ideal indexing scheme needs to generate codes that preserve the intra-subject similarity as well as inter-subject dissimilarity. In this paper, we propose an iris indexing scheme using real-valued deep iris features binarized to iris bar codes (IBC) compatible with the indexing structure. Firstly, for extracting robust iris features, we have designed a network utilizing the domain knowledge of ordinal filtering and learning their nonlinear combinations. Later these real-valued features are binarized. Finally, for indexing the iris dataset, we have proposed a loss that can transform the binary feature into an improved feature compatible with the Multi-Index Hashing scheme. This loss function ensures that the Hamming distance is equally distributed among all the contiguous disjoint sub-strings. To the best of our knowledge, this is the first work in the iris indexing domain that presents an end-to-end iris indexing structure. Experimental results on four datasets are presented to demonstrate the efficacy of the proposed approach.
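To see why equally distributing the Hamming distance over the contiguous disjoint sub-strings matters, recall the multi-index hashing lookup the loss targets: a b-bit code is split into m disjoint substrings, each indexed in its own table, and by the pigeonhole principle any code within Hamming distance m-1 of a query matches at least one of the query's substrings exactly. A minimal Python sketch (function names are illustrative):

    from collections import defaultdict

    def build_mih_tables(codes, m=4):
        # codes: list of equal-length 0/1 tuples; one hash table per substring.
        b = len(codes[0])
        chunk = b // m
        tables = [defaultdict(list) for _ in range(m)]
        for idx, code in enumerate(codes):
            for t in range(m):
                tables[t][code[t * chunk:(t + 1) * chunk]].append(idx)
        return tables

    def mih_candidates(query, tables, m=4):
        # Exact lookups in the m tables return every indexed code whose total
        # Hamming distance to the query is at most m - 1 (pigeonhole principle).
        chunk = len(query) // m
        cands = set()
        for t in range(m):
            cands.update(tables[t].get(query[t * chunk:(t + 1) * chunk], []))
        return cands

Balancing the Hamming distance across substrings keeps every table's buckets discriminative, which is the motivation for the proposed loss.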
8. Traffic flow prediction using Deep Sedenion Networks [PDF] Back to contents
Alabi Bojesomo, Panos Liatsis, Hasan Al Marzouqi
Abstract: Traffic4cast2020 is the second year of the NeurIPS competition in which participants are to predict traffic parameters (speed and volume) for 3 different cities. The information provided includes multi-channel inputs over multiple time steps and the predicted output. Multiple output time steps were viewed as a multi-task segmentation in this work, forming the basis for applying a hypercomplex-number-based UNet structure. The presence of 12 spatio-temporal multi-channel dynamic inputs and a single static input, together with the fact that sedenion numbers are 16-dimensional (16-ions), forms the basis for using sedenion hypercomplex numbers. We use 12 of the 15 sedenion imaginary parts for the dynamic inputs, and the real part is used for the static input. The sedenion network makes this challenge tractable through multi-task learning, since the traffic prediction must be computed for several different time steps.
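Sedenions are the 16-dimensional algebra produced by one more step of the Cayley-Dickson doubling construction beyond the octonions. A minimal numpy sketch of the product, using the standard convention (a, b)(c, d) = (ac - d*b, da + bc*) with * denoting conjugation, shows how the 1 real and 15 imaginary components mentioned in the abstract multiply:

    import numpy as np

    def cd_conj(x):
        # Cayley-Dickson conjugate: negate every imaginary component.
        out = -x.copy()
        out[0] = x[0]
        return out

    def cd_mul(a, b):
        # Cayley-Dickson product for arrays whose length is a power of two:
        # 1 = real, 2 = complex, 4 = quaternion, 8 = octonion, 16 = sedenion.
        n = len(a)
        if n == 1:
            return a * b
        h = n // 2
        p, q, r, s = a[:h], a[h:], b[:h], b[h:]
        return np.concatenate([cd_mul(p, r) - cd_mul(cd_conj(s), q),
                               cd_mul(s, p) + cd_mul(q, cd_conj(r))])

Applied to two length-16 arrays, cd_mul implements the sedenion product; in the setup above, the real slot carries the static input and 12 of the 15 imaginary slots carry the dynamic inputs.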
9. End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network [PDF] Back to contents
Denis Coquenet, Clément Chatelain, Thierry Paquet
Abstract: Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph text recognition is traditionally achieved by two models: the first one for line segmentation and the second one for text line recognition. We propose a unified end-to-end model using hybrid attention to tackle this task. We achieve state-of-the-art character error rate at line and paragraph levels on three popular datasets: 1.90% for RIMES, 4.32% for IAM and 3.63% for READ 2016. The proposed model can be trained from scratch, without using any segmentation label contrary to the standard approach. Our code and trained model weights are available at this https URL.
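The character error rates quoted above are edit-distance based. For reference, a generic implementation of the metric (not the authors' evaluation code):

    def character_error_rate(reference, hypothesis):
        # CER = Levenshtein edit distance / reference length.
        m, n = len(reference), len(hypothesis)
        d = list(range(n + 1))  # DP row for the empty reference prefix
        for i in range(1, m + 1):
            prev, d[0] = d[0], i
            for j in range(1, n + 1):
                cur = d[j]
                d[j] = min(d[j] + 1,      # deletion
                           d[j - 1] + 1,  # insertion
                           prev + (reference[i - 1] != hypothesis[j - 1]))
                prev = cur
        return d[n] / max(m, 1)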
10. Correct block-design experiments mitigate temporal correlation bias in EEG classification [PDF] Back to contents
Simone Palazzo, Concetto Spampinato, Joseph Schmidt, Isaak Kavasidis, Daniela Giordano, Mubarak Shah
Abstract: It is argued in [1] that [2] was able to classify EEG responses to visual stimuli solely because of the temporal correlation that exists in all EEG data and the use of a block design. We here show that the main claim in [1] is drastically overstated and their other analyses are seriously flawed by wrong methodological choices. To validate our counter-claims, we evaluate the performance of state-of-the-art methods on the dataset in [2] reaching about 50% classification accuracy over 40 classes, lower than in [2], but still significant. We then investigate the influence of EEG temporal correlation on classification accuracy by testing the same models in two additional experimental settings: one that replicates [1]'s rapid-design experiment, and another one that examines the data between blocks while subjects are shown a blank screen. In both cases, classification accuracy is at or near chance, in contrast to what [1] reports, indicating a negligible contribution of temporal correlation to classification accuracy. We, instead, are able to replicate the results in [1] only when intentionally contaminating our data by inducing a temporal correlation. This suggests that what Li et al. [1] demonstrate is that their data are strongly contaminated by temporal correlation and low signal-to-noise ratio. We argue that the reason why Li et al. [1] observe such high correlation in EEG data is their unconventional experimental design and settings that violate the basic cognitive neuroscience design recommendations, first and foremost the one of limiting the experiments' duration, as instead done in [2]. Our analyses in this paper refute the claims of the "perils and pitfalls of block-design" in [1]. Finally, we conclude the paper by examining a number of other oversimplistic statements, inconsistencies, misinterpretation of machine learning concepts, speculations and misleading claims in [1].
11. Sparse Fooling Images: Fooling Machine Perception through Unrecognizable Images [PDF] Back to contents
Soichiro Kumano, Hiroshi Kera, Toshihiko Yamasaki
Abstract: In recent years, deep neural networks (DNNs) have achieved equivalent or even higher accuracy in various recognition tasks than humans. However, some images exist that lead DNNs to a completely wrong decision, whereas humans never fail with these images. Among others, fooling images are those that are not recognizable as natural objects such as dogs and cats, but DNNs classify these images into classes with high confidence scores. In this paper, we propose a new class of fooling images, sparse fooling images (SFIs), which are single color images with a small number of altered pixels. Unlike existing fooling images, which retain some characteristic features of natural objects, SFIs do not have any local or global features that can be recognizable to humans; however, in machine perception (i.e., by DNN classifiers), SFIs are recognizable as natural objects and classified to certain classes with high confidence scores. We propose two methods to generate SFIs for different settings (semi-black-box and white-box). We also experimentally demonstrate the vulnerability of DNNs through out-of-distribution detection and compare three architectures in terms of the robustness against SFIs. This study gives rise to questions on the structure and robustness of CNNs and discusses the differences between human and machine perception.
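As a heavily hedged illustration of the "single color image with a small number of altered pixels" idea in the semi-black-box setting, the sketch below runs a naive random search over a fixed budget of k altered pixels, keeping mutations that raise the target-class score. The abstract does not specify the paper's generation methods; every name and parameter here is illustrative.

    import numpy as np

    def random_search_sfi(model, base_color, target_class, k=10, iters=1000, size=32):
        # model: any callable mapping an (size, size, 3) image in [0, 1] to a
        # vector of class probabilities (queried as a black box).
        def render(pos, col):
            img = np.full((size, size, 3), base_color, dtype=np.float32)
            img[pos[:, 0], pos[:, 1]] = col  # paint the k altered pixels
            return img

        pos = np.random.randint(0, size, (k, 2))  # positions of altered pixels
        col = np.random.rand(k, 3)                # colors of altered pixels
        best = model(render(pos, col))[target_class]
        for _ in range(iters):
            new_pos, new_col = pos.copy(), col.copy()
            i = np.random.randint(k)              # mutate one altered pixel
            new_pos[i] = np.random.randint(0, size, 2)
            new_col[i] = np.random.rand(3)
            score = model(render(new_pos, new_col))[target_class]
            if score > best:                      # keep only improving mutations
                pos, col, best = new_pos, new_col, score
        return render(pos, col), best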
12. CycleQSM: Unsupervised QSM Deep Learning using Physics-Informed CycleGAN [PDF] Back to contents
Gyutaek Oh, Hyokyoung Bae, Hyun-Seo Ahn, Sung-Hong Park, Jong Chul Ye
Abstract: Quantitative susceptibility mapping (QSM) is a useful magnetic resonance imaging (MRI) technique which provides spatial distribution of magnetic susceptibility values of tissues. QSMs can be obtained by deconvolving the dipole kernel from phase images, but the spectral nulls in the dipole kernel make the inversion ill-posed. In recent times, deep learning approaches have shown QSM reconstruction performance comparable to that of the classic approaches, despite their much faster reconstruction times. Most of the existing deep learning methods are, however, based on supervised learning, so matched pairs of input phase images and the ground-truth maps are needed. Moreover, it was reported that supervised learning often leads to underestimated QSM values. To address this, here we propose a novel unsupervised QSM deep learning method using physics-informed cycleGAN, which is derived from an optimal transport perspective. In contrast to the conventional cycleGAN, our novel cycleGAN has only one generator and one discriminator thanks to the known dipole kernel. Experimental results confirm that the proposed method provides more accurate QSM maps compared to the existing deep learning approaches, and performs competitively with the best classical approaches despite the ultra-fast reconstruction.
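The "spectral nulls in the dipole kernel" refer to the zero cone of the standard k-space unit dipole kernel D(k) = 1/3 - (k · B0)^2/|k|^2, which vanishes wherever k makes the magic angle (about 54.7 degrees) with the main field. A minimal numpy sketch (the B0 direction and the DC handling are conventions):

    import numpy as np

    def dipole_kernel(shape, voxel_size=(1.0, 1.0, 1.0), b0_dir=(0, 0, 1)):
        # D(k) = 1/3 - (k . B0)^2 / |k|^2 on a 3D k-space grid.
        ks = [np.fft.fftfreq(n, d=v) for n, v in zip(shape, voxel_size)]
        kx, ky, kz = np.meshgrid(*ks, indexing="ij")
        k2 = kx**2 + ky**2 + kz**2
        k2[k2 == 0] = np.inf  # treat the DC term as a convention choice
        kb = kx * b0_dir[0] + ky * b0_dir[1] + kz * b0_dir[2]
        return 1.0 / 3.0 - kb**2 / k2

Dividing the measured field by this kernel blows up near the zero cone, which is exactly why the phase-to-susceptibility inversion is ill-posed.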
13. Self-supervised asymmetric deep hashing with margin-scalable constraint for image retrieval [PDF] Back to contents
Zhengyang Yu, Zhihao Dou, Song Wu
Abstract: Due to its validity and rapidity, image retrieval based on deep hashing approaches is of wide concern, especially in large-scale visual search. However, many existing deep hashing methods inadequately utilize label information to guide the feature learning network and do not explore the semantic space more deeply; in addition, the similarity correlations in Hamming space are not fully discovered and embedded into hash codes, so retrieval quality is diminished by inefficient preservation of pairwise correlations and multi-label semantics. To cope with these problems, we propose a novel self-supervised asymmetric deep hashing with margin-scalable constraint (SADH) approach for image retrieval. SADH implements a self-supervised network to preserve supreme semantic information in a semantic feature map and a semantic code map for each semantics of the given dataset, which efficiently and precisely guides a feature learning network to preserve multi-label semantic information with an asymmetric learning strategy. Moreover, for the feature learning part, by further exploiting semantic maps, a new margin-scalable constraint is employed for both highly-accurate construction of pairwise correlations in the Hamming space and a more discriminative hash code representation. Extensive empirical research on three benchmark datasets validates that the proposed method outperforms several state-of-the-art approaches.
14. Pose-Guided Human Animation from a Single Image in the Wild [PDF] Back to contents
Jae Shin Yoon, Lingjie Liu, Vladislav Golyanik, Kripasindhu Sarkar, Hyun Soo Park, Christian Theobalt
Abstract: We present a new pose transfer method for synthesizing a human animation from a single image of a person controlled by a sequence of body poses. Existing pose transfer methods exhibit significant visual artifacts when applying to a novel scene, resulting in temporal inconsistency and failures in preserving the identity and textures of the person. To address these limitations, we design a compositional neural network that predicts the silhouette, garment labels, and textures. Each modular network is explicitly dedicated to a subtask that can be learned from the synthetic data. At the inference time, we utilize the trained network to produce a unified representation of appearance and its labels in UV coordinates, which remains constant across poses. The unified representation provides an incomplete yet strong guidance to generating the appearance in response to the pose change. We use the trained network to complete the appearance and render it with the background. With these strategies, we are able to synthesize human animations that can preserve the identity and appearance of the person in a temporally coherent way without any fine-tuning of the network on the testing scene. Experiments show that our method outperforms the state-of-the-arts in terms of synthesis quality, temporal coherence, and generalization ability.
15. Matching Distributions via Optimal Transport for Semi-Supervised Learning [PDF] Back to contents
Fariborz Taherkhani, Hadi Kazemi, Ali Dabouei, Jeremy Dawson, Nasser M. Nasrabadi
Abstract: Semi-Supervised Learning (SSL) approaches have been an influential framework for the usage of unlabeled data when there is not a sufficient amount of labeled data available over the course of training. SSL methods based on Convolutional Neural Networks (CNNs) have recently provided successful results on standard benchmark tasks such as image classification. In this work, we consider the general setting of SSL problem where the labeled and unlabeled data come from the same underlying probability distribution. We propose a new approach that adopts an Optimal Transport (OT) technique serving as a metric of similarity between discrete empirical probability measures to provide pseudo-labels for the unlabeled data, which can then be used in conjunction with the initial labeled data to train the CNN model in an SSL manner. We have evaluated and compared our proposed method with state-of-the-art SSL algorithms on standard datasets to demonstrate the superiority and effectiveness of our SSL algorithm.
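A minimal sketch of how an OT plan can supply pseudo-labels in this spirit (not necessarily the paper's exact construction): transport a batch of unlabeled features onto class prototypes computed from the labeled data, with uniform marginals acting as a class-balance prior, then label each sample by its most-transported class. The Sinkhorn loop mirrors the one sketched under entry 6.

    import math
    import torch

    def ot_pseudo_labels(unlabeled_feats, class_prototypes, eps=0.05, n_iters=50):
        # unlabeled_feats: (n, d); class_prototypes: (c, d) labeled class means.
        C = torch.cdist(unlabeled_feats, class_prototypes) ** 2
        n, c = C.shape
        log_K = -C / eps
        log_u, log_v = torch.zeros(n), torch.zeros(c)
        for _ in range(n_iters):
            log_u = -math.log(n) - torch.logsumexp(log_K + log_v[None, :], dim=1)
            log_v = -math.log(c) - torch.logsumexp(log_K + log_u[:, None], dim=0)
        log_P = log_u[:, None] + log_K + log_v[None, :]  # entropic OT plan (log)
        return log_P.argmax(dim=1)                       # pseudo-label per sample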
16. Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion [PDF] Back to contents
Xu Yan, Jiantao Gao, Jie Li, Ruimao Zhang, Zhen Li, Rui Huang, Shuguang Cui
Abstract: LiDAR point cloud analysis is a core task for 3D computer vision, especially for autonomous driving. However, due to the severe sparsity and noise interference in the single sweep LiDAR point cloud, accurate semantic segmentation is non-trivial to achieve. In this paper, we propose a novel sparse LiDAR point cloud semantic segmentation framework assisted by learned contextual shape priors. In practice, an initial semantic segmentation (SS) of a single sweep point cloud can be achieved by any appealing network, and it then flows into the semantic scene completion (SSC) module as the input. By merging multiple frames in the LiDAR sequence as supervision, the optimized SSC module has learned the contextual shape priors from sequential LiDAR data, completing the sparse single sweep point cloud into a dense one. Thus, it inherently improves SS optimization through fully end-to-end training. Besides, a Point-Voxel Interaction (PVI) module is proposed to further enhance the knowledge fusion between the SS and SSC tasks, i.e., promoting the interaction of the incomplete local geometry of the point cloud and the complete voxel-wise global structure. Furthermore, the auxiliary SSC and PVI modules can be discarded during inference without extra burden for SS. Extensive experiments confirm that our JS3C-Net achieves superior performance on both the SemanticKITTI and SemanticPOSS benchmarks, i.e., 4% and 3% improvement respectively.
摘要:LiDAR点云分析是3D计算机视觉(尤其是自动驾驶)的一项核心任务。但是,由于单扫描LiDAR点云中存在严重的稀疏性和噪声干扰,因此实现精确的语义分割并非易事。在本文中,我们提出了一种新的稀疏LiDAR点云语义分割框架,该框架借助学习的上下文形状先验来辅助。实际上,可以通过任何吸引人的网络来实现单个扫描点云的初始语义分割(SS),然后将其作为输入流到语义场景完成(SSC)模块中。通过在LiDAR序列中合并多个帧作为监督,优化的SSC模块从连续的LiDAR数据中了解了上下文形状先验,从而完成了稀疏的单个扫描点云到密集的云。因此,它通过完全的端到端培训从本质上提高了SS优化。此外,提出了点-体素交互(PVI)模块,以进一步增强SS和SSC任务之间的知识融合,即促进点云的不完整局部几何形状和完整的体素智能全局结构之间的相互作用。此外,在推理过程中可以丢弃辅助SSC和PVI模块,而不会给SS带来额外负担。大量的实验证实,我们的JS3C-Net在SemanticKITTI和SemanticPOSS基准上均具有出色的性能,即分别提高了4%和3%。
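The train-only auxiliary design described above can be illustrated with a minimal PyTorch sketch. The layer shapes and module names (seg_net, ssc_head) are hypothetical stand-ins for JS3C-Net's actual SS and SSC networks, and the PVI module is omitted.

```python
import torch
import torch.nn as nn

class SegWithCompletionAux(nn.Module):
    """Hypothetical sketch: a segmentation backbone plus an auxiliary
    scene-completion head that is used only during training."""
    def __init__(self, n_classes=20):
        super().__init__()
        self.seg_net = nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU(),
                                     nn.Conv1d(64, n_classes, 1))
        self.ssc_head = nn.Sequential(nn.Conv1d(n_classes, 64, 1), nn.ReLU(),
                                      nn.Conv1d(64, 2, 1))  # occupied / empty

    def forward(self, pts):
        seg_logits = self.seg_net(pts)           # per-point semantics
        if self.training:                        # auxiliary branch: train only
            ssc_logits = self.ssc_head(seg_logits)
            return seg_logits, ssc_logits
        return seg_logits                        # SSC head discarded at inference

model = SegWithCompletionAux()
x = torch.randn(2, 3, 1024)                      # batch of point clouds (xyz)
model.train();  seg, ssc = model(x)              # both losses supervise training
model.eval();   seg = model(x)                   # no extra inference burden
```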
17. Unsupervised Pre-training for Person Re-identification [PDF] 返回目录
Dengpan Fu, Dongdong Chen, Jianmin Bao, Hao Yang, Lu Yuan, Lei Zhang, Houqiang Li, Dong Chen
Abstract: In this paper, we present a large-scale unlabeled person re-identification (Re-ID) dataset, "LUPerson", and make the first attempt at performing unsupervised pre-training to improve the generalization ability of the learned person Re-ID feature representation. This addresses the problem that all existing person Re-ID datasets are of limited scale due to the costly effort required for data annotation. Previous research tries to leverage models pre-trained on ImageNet to mitigate the shortage of person Re-ID data but suffers from the large domain gap between ImageNet and person Re-ID data. LUPerson is an unlabeled dataset of 4M images of over 200K identities, which is 30X larger than the largest existing Re-ID dataset. It also covers a much more diverse range of capturing environments (e.g., camera settings, scenes, etc.). Based on this dataset, we systematically study the key factors for learning Re-ID features from two perspectives: data augmentation and contrastive loss. Unsupervised pre-training performed on this large-scale dataset effectively leads to a generic Re-ID feature that can benefit all existing person Re-ID methods. Using our pre-trained model in some basic frameworks, our methods achieve state-of-the-art results without bells and whistles on four widely used Re-ID datasets: CUHK03, Market1501, DukeMTMC, and MSMT17. Our results also show that the performance improvement is more significant on small-scale target datasets or under the few-shot setting.
摘要:在本文中,我们提出了一个大规模的无标签人员重识别(Re-ID)数据集"LUPerson",并首次尝试通过无监督预训练来提高所学习的人员Re-ID特征表示的泛化能力。这是为了解决由于数据标注成本高昂而导致所有现有人员Re-ID数据集规模有限的问题。先前的研究试图利用在ImageNet上预训练的模型来缓解人员Re-ID数据的短缺,但受限于ImageNet与人员Re-ID数据之间的巨大领域鸿沟。LUPerson是一个包含400万张图像、超过20万个身份的未标记数据集,比现有最大的Re-ID数据集大30倍。它还涵盖了更加多样的采集环境(例如摄像机设置、场景等)。基于此数据集,我们从数据增强和对比损失两个方面系统地研究了学习Re-ID特征的关键因素。在此大规模数据集上进行的无监督预训练有效地产生了可以使所有现有人员Re-ID方法受益的通用Re-ID特征。在一些基本框架中使用我们预训练的模型,无需任何花哨的技巧,我们的方法即可在四个广泛使用的Re-ID数据集(CUHK03、Market1501、DukeMTMC和MSMT17)上取得最新的最优结果。我们的结果还表明,在小规模目标数据集或少样本设置下,性能提升更为显著。
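The abstract studies contrastive loss as one of the two key factors. A common instantiation is the InfoNCE objective, sketched below; the temperature and feature sizes are illustrative, and this is not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.07):
    """InfoNCE: the i-th query matches the i-th key; all others are negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(q.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# two augmented views of the same batch would be encoded into q and k
q, k = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(q, k).item())
```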
18. Combining Deep Transfer Learning with Signal-image Encoding for Multi-Modal Mental Wellbeing Classification [PDF] 返回目录
Kieran Woodward, Eiman Kanjo, Athanasios Tsanas
Abstract: The quantification of emotional states is an important step towards understanding wellbeing. Time series data from multiple modalities, such as physiological and motion sensor data, have proven to be integral for measuring and quantifying emotions. Monitoring emotional trajectories over long periods of time faces some critical limitations in relation to the size of the training data. This shortcoming may hinder the development of reliable and accurate machine learning models. To address this problem, this paper proposes a framework to tackle the limitations of performing emotional state recognition on multiple multimodal datasets: 1) encoding multivariate time series data into coloured images; 2) leveraging pre-trained object recognition models to apply a Transfer Learning (TL) approach using the images from step 1; 3) utilising a 1D Convolutional Neural Network (CNN) to perform emotion classification from physiological data; 4) concatenating the pre-trained TL model with the 1D CNN. Furthermore, the possibility of performing TL to infer stress from physiological data is explored by initially training a 1D CNN using a large physical activity dataset and then applying the learned knowledge to the target dataset. We demonstrate that model performance when inferring real-world wellbeing rated on a 5-point Likert scale can be enhanced using our framework, resulting in up to 98.5% accuracy, outperforming a conventional CNN by 4.5%. Subject-independent models using the same approach resulted in an average of 72.3% accuracy (SD 0.038). The proposed CNN-TL-based methodology may overcome problems with small training datasets, thus improving on the performance of conventional deep learning methods.
摘要:情绪状态的量化是理解幸福感的重要步骤。来自多种模态的时间序列数据(例如生理和运动传感器数据)已被证明是测量和量化情绪不可或缺的部分。长时间监测情绪轨迹面临与训练数据规模有关的一些关键限制。该缺点可能会阻碍可靠、准确的机器学习模型的开发。为了解决这个问题,本文提出了一个框架来应对在多个多模态数据集上进行情绪状态识别的局限性:1)将多元时间序列数据编码为彩色图像;2)利用预训练的物体识别模型,使用步骤1中的图像应用迁移学习(TL)方法;3)利用一维卷积神经网络(CNN)从生理数据进行情绪分类;4)将预训练的TL模型与一维CNN串联起来。此外,通过首先使用大型体育活动数据集训练一维CNN,然后将学到的知识应用于目标数据集,探索了通过TL从生理数据推断压力的可能性。我们证明,使用我们的框架可以提升推断以5点Likert量表评定的现实世界幸福感时的模型性能,准确率最高可达98.5%,比传统CNN高出4.5%。使用相同方法的与受试者无关的模型的平均准确率为72.3%(SD 0.038)。所提出的基于CNN-TL的方法可以克服训练数据集较小的问题,从而改进传统深度学习方法的性能。
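Step 4 (concatenating the pre-trained TL branch with the 1D CNN) can be sketched as a two-branch PyTorch model. The backbone choice (ResNet-18), layer sizes, and the 5-class head are assumptions for illustration; pretrained weights would be loaded in practice.

```python
import torch
import torch.nn as nn
import torchvision

class TwoBranchWellbeing(nn.Module):
    """Hypothetical sketch: a pretrained image branch for signal-images
    concatenated with a 1D CNN branch on the raw physiological signal."""
    def __init__(self, n_classes=5):
        super().__init__()
        cnn2d = torchvision.models.resnet18(weights=None)  # pretrained in practice
        cnn2d.fc = nn.Identity()                 # keep the 512-d global feature
        self.img_branch = cnn2d
        self.sig_branch = nn.Sequential(
            nn.Conv1d(1, 32, 7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())  # 32-d signal feature
        self.head = nn.Linear(512 + 32, n_classes)

    def forward(self, img, sig):
        f = torch.cat([self.img_branch(img), self.sig_branch(sig)], dim=1)
        return self.head(f)

m = TwoBranchWellbeing()
out = m(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 256))
print(out.shape)  # (2, 5): one Likert-scale class score per sample
```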
19. An Enriched Automated PV Registry: Combining Image Recognition and 3D Building Data [PDF] 返回目录
Benjamin Rausch, Kevin Mayer, Marie-Louise Arlt, Gunther Gust, Philipp Staudt, Christof Weinhardt, Dirk Neumann, Ram Rajagopal
Abstract: While photovoltaic (PV) systems are installed at an unprecedented rate, reliable information on an installation level remains scarce. As a result, automatically created PV registries are a timely contribution to optimize grid planning and operations. This paper demonstrates how aerial imagery and three-dimensional building data can be combined to create an address-level PV registry, specifying area, tilt, and orientation angles. We demonstrate the benefits of this approach for PV capacity estimation. In addition, this work presents, for the first time, a comparison between automated and officially-created PV registries. Our results indicate that our enriched automated registry proves to be useful to validate, update, and complement official registries.
摘要:尽管光伏(PV)系统以空前的速度安装,但在单个安装层面上仍然缺乏可靠的信息。因此,自动创建的PV注册表是对优化电网规划和运营的及时贡献。本文演示了如何将航空影像和三维建筑数据结合起来,创建指定面积、倾角和朝向角的地址级PV注册表。我们展示了这种方法在光伏容量估算方面的好处。此外,这项工作首次给出了自动创建与官方创建的PV注册表之间的比较。我们的结果表明,我们的增强型自动注册表对于验证、更新和补充官方注册表非常有用。
20. A New Framework for Registration of Semantic Point Clouds from Stereo and RGB-D Cameras [PDF] 返回目录
Ray Zhang, Tzu-Yuan Lin, Chien Erh Lin, Steven A. Parkison, William Clark, Jessy W. Grizzle, Ryan M. Eustice, Maani Ghaffari
Abstract: This paper reports on a novel nonparametric rigid point cloud registration framework that jointly integrates geometric and semantic measurements such as color or semantic labels into the alignment process and does not require explicit data association. The point clouds are represented as nonparametric functions in a reproducing kernel Hilbert space. The alignment problem is formulated as maximizing the inner product between two functions, essentially a sum of weighted kernels, each of which exploits the local geometric and semantic features. As a result of the continuous models, analytical gradients can be computed, and a local solution can be obtained by optimization over the rigid body transformation group. Besides, we present a new point cloud alignment metric that is intrinsic to the proposed framework and takes into account geometric and semantic information. The evaluations using publicly available stereo and RGB-D datasets show that the proposed method outperforms state-of-the-art outdoor and indoor frame-to-frame registration methods. An open-source GPU implementation is also provided.
摘要:本文报道了一种新颖的非参数刚性点云配准框架,该框架将几何和语义度量(例如颜色或语义标签)共同集成到对齐过程中,并且不需要显式的数据关联。点云被表示为再生核希尔伯特空间中的非参数函数。对齐问题被表述为最大化两个函数之间的内积,该内积本质上是加权核的总和,每个核都利用了局部几何和语义特征。得益于连续模型,可以计算解析梯度,并可以通过在刚体变换群上进行优化来获得局部解。此外,我们提出了一种新的点云对齐度量,该度量是所提框架固有的,并考虑了几何和语义信息。使用公开的立体相机和RGB-D数据集进行的评估表明,该方法优于最新的室外和室内帧间配准方法。我们还提供了一个开源GPU实现。
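The objective described above, an inner product of kernel embeddings that couples geometry with semantics, can be written down directly. The sketch below uses RBF kernels on positions and colors with illustrative length scales; a full registration method would maximize this score over rigid transformations of one cloud.

```python
import numpy as np

def alignment_score(X, Y, fx, fy, ell=0.5, ell_f=0.3):
    """Inner product of two kernel-embedded clouds: sum over point pairs of
    a geometric RBF kernel times a semantic (e.g., color) RBF kernel."""
    d2_geo = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    d2_sem = ((fx[:, None, :] - fy[None, :, :]) ** 2).sum(-1)
    return np.sum(np.exp(-d2_geo / (2 * ell ** 2)) *
                  np.exp(-d2_sem / (2 * ell_f ** 2)))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3)); fx = rng.uniform(size=(100, 3))   # xyz + color
Y = X + 0.01 * rng.normal(size=(100, 3)); fy = fx                # near-aligned
print(alignment_score(X, Y, fx, fy))   # large when the clouds are well aligned
```

Because the score is a smooth function of the transformation applied to Y, its analytical gradient can drive a local optimizer over the rigid-body group, as the abstract states.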
21. Generic Semi-Supervised Adversarial Subject Translation for Sensor-Based Human Activity Recognition [PDF] 返回目录
Elnaz Soleimani, Ghazaleh Khodabandelou, Abdelghani Chibani, Yacine Amirat
Abstract: The performance of Human Activity Recognition (HAR) models, particularly deep neural networks, is highly contingent upon the availability of a massive amount of sufficiently labeled training data. However, data acquisition and manual annotation in the HAR domain are prohibitively expensive due to the skilled human resources required in both steps. Hence, domain adaptation techniques have been proposed to adapt the knowledge from existing sources of data. More recently, adversarial transfer learning methods have shown very promising results in image classification, yet they remain limited for sensor-based HAR problems, which are still prone to the unfavorable effects of the imbalanced distribution of samples. This paper presents a novel generic and robust approach for semi-supervised domain adaptation in HAR, which capitalizes on the advantages of the adversarial framework to tackle these shortcomings by leveraging knowledge from annotated samples exclusively from the source subject and unlabeled ones of the target subject. Extensive subject translation experiments are conducted on three large, middle, and small-size datasets with different levels of imbalance to assess the robustness and effectiveness of the proposed model with respect to the scale as well as the imbalance in the data. The results demonstrate the effectiveness of our proposed algorithms over state-of-the-art methods, which led to up to 13%, 4%, and 13% improvement of our high-level activity recognition metrics for the Opportunity, LISSI, and PAMAP2 datasets, respectively. The LISSI dataset is the most challenging one owing to its less populated and imbalanced distribution. Compared to the SA-GAN adversarial domain adaptation method, the proposed approach enhances the final classification performance by an average of 7.5% for the three datasets, which emphasizes the effectiveness of micro-mini-batch training.
摘要:人类活动识别(HAR)模型(尤其是深度神经网络)的性能在很大程度上取决于是否有大量经充分标注的训练数据。然而,由于两个步骤都需要熟练的人力资源,HAR领域中的数据采集和手动标注成本过高。因此,已经提出了领域自适应技术来利用来自现有数据源的知识。最近,对抗迁移学习方法在图像分类中显示出非常有希望的结果,但对于基于传感器的HAR问题仍然有限,这些问题还容易受到样本分布不均衡的不利影响。本文提出了一种新颖、通用且健壮的HAR半监督域自适应方法,它利用对抗框架的优势来解决上述缺点,仅利用来自源受试者的带标注样本和目标受试者的未标记样本中的知识。我们在失衡程度不同的大、中、小三个数据集上进行了广泛的受试者迁移实验,以评估所提模型对数据规模及不均衡性的鲁棒性和有效性。结果证明了我们提出的算法相对于最新方法的有效性,在Opportunity、LISSI和PAMAP2数据集上分别将高级活动识别指标提高了多达13%、4%和13%。LISSI数据集由于样本较少且分布不均衡而最具挑战性。与SA-GAN对抗域自适应方法相比,所提方法将三个数据集的最终分类性能平均提高了7.5%,这凸显了微型小批量训练的有效性。
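One standard way to implement the adversarial component of such semi-supervised domain adaptation is a gradient reversal layer placed in front of a subject/domain discriminator. The sketch below shows the reversal mechanism itself; the paper's actual adversarial architecture may differ.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambda backward,
    so the feature extractor is trained to fool the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

feat = torch.randn(8, 16, requires_grad=True)
domain_logit = GradReverse.apply(feat, 1.0).sum()  # stand-in for a domain head
domain_logit.backward()
print(feat.grad[0, 0])  # gradient sign flipped relative to a plain sum()
```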
22. Roof fall hazard detection with convolutional neural networks using transfer learning [PDF] 返回目录
Ergin Isleyen, Sebnem Duzgun, McKell R. Carter
Abstract: Roof falls due to geological conditions are major safety hazards in the mining and tunneling industries, causing lost work time, injuries, and fatalities. Several large-opening limestone mines in the Eastern and Midwestern United States have roof fall problems caused by high horizontal stresses. The typical hazard management approach for this type of roof fall hazard relies heavily on visual inspections and expert knowledge. In this study, we propose an artificial intelligence (AI) based system for the detection of roof fall hazards caused by high horizontal stresses. We use images depicting hazardous and non-hazardous roof conditions to develop a convolutional neural network for autonomous detection of hazardous roof conditions. To compensate for limited input data, we utilize a transfer learning approach. In transfer learning, an already-trained network is used as a starting point for classification in a similar domain. Results confirm that this approach works well for classifying roof conditions as hazardous or safe, achieving a statistical accuracy of 86%. However, accuracy alone is not enough to ensure a reliable hazard management system. System constraints and reliability are improved when the features being used by the network are understood. Therefore, we used a deep learning interpretation technique called integrated gradients to identify the important geologic features in each image for prediction. The analysis of integrated gradients shows that the system mimics expert judgment on roof fall hazard detection. The system developed in this paper demonstrates the potential of deep learning in geological hazard management to complement human experts, and it is likely to become an essential part of autonomous tunneling operations in those cases where hazard identification heavily depends on expert knowledge.
摘要:地质条件导致的顶板坠落是采矿和隧道行业的主要安全隐患,会造成工时损失、人员受伤和死亡。美国东部和中西部的几座大型开放式石灰岩矿山都存在由高水平应力引起的顶板坠落问题。针对此类顶板坠落危险的典型危险管理方法在很大程度上依赖目视检查和专家知识。在这项研究中,我们提出了一种基于人工智能(AI)的系统,用于检测由高水平应力引起的顶板坠落危险。我们使用描绘危险和非危险顶板状况的图像来开发卷积神经网络,以自动检测危险的顶板状况。为了弥补输入数据有限的问题,我们采用了迁移学习方法。在迁移学习中,已训练好的网络被用作相似领域中分类的起点。结果证实,该方法可以很好地将顶板状况分类为危险或安全,统计准确率达到86%。然而,仅有准确率不足以确保可靠的危险管理系统。当理解了网络所使用的特征后,系统的约束和可靠性会得到改善。因此,我们使用了一种称为积分梯度的深度学习解释技术来识别每幅图像中对预测重要的地质特征。积分梯度的分析表明,该系统模仿了专家对顶板坠落危险检测的判断。本文开发的系统展示了深度学习在地质灾害管理中辅助人类专家的潜力,并且在危险识别严重依赖专家知识的情况下,有可能成为自主隧道作业的重要组成部分。
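Integrated gradients, the interpretation technique mentioned above, attributes a prediction to input pixels by averaging gradients along a straight path from a baseline to the input. A minimal sketch follows; the toy model, zero baseline, and step count are illustrative.

```python
import torch

def integrated_gradients(model, x, baseline, target, steps=50):
    """IG: (x - x') times the average gradient along the path from x' to x."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    path = baseline + alphas * (x - baseline)     # (steps, C, H, W)
    path.requires_grad_(True)
    out = model(path)[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]
    return (x - baseline) * grads.mean(dim=0)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 2))
x = torch.rand(3, 8, 8); baseline = torch.zeros_like(x)
attr = integrated_gradients(model, x, baseline, target=1)
print(attr.shape)  # per-pixel attribution, same shape as the input
```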
23. UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality [PDF] 返回目录
Mathias Parger, Chengcheng Tang, Yuanlu Xu, Christopher Twigg, Lingling Tao, Yijing Li, Robert Wang, Markus Steinberger
Abstract: Tracking body and hand motions in 3D space is essential for social and self-presence in augmented and virtual environments. Unlike the popular 3D pose estimation setting, the problem is often formulated as inside-out tracking based on embodied perception (e.g., egocentric cameras, handheld sensors). In this paper, we propose a new data-driven framework for inside-out body tracking, targeting the challenge of omnipresent occlusions in optimization-based methods (e.g., inverse kinematics solvers). We first collect a large-scale motion capture dataset with both body and finger motions using optical markers and inertial sensors. This dataset focuses on social scenarios and captures ground truth poses under self-occlusions and body-hand interactions. We then simulate the occlusion patterns in head-mounted camera views on the captured ground truth using a ray casting algorithm and learn a deep neural network to infer the occluded body parts. In the experiments, we show that the proposed method is able to generate high-fidelity embodied poses on the tasks of real-time inside-out body tracking, finger motion synthesis, and 3-point inverse kinematics.
摘要:在增强和虚拟环境中,追踪3D空间中的身体和手部运动对于社交临场感和自我临场感至关重要。与流行的3D姿势估计设置不同,该问题通常被表述为基于具身感知(例如以自我为中心的相机、手持式传感器)的由内而外跟踪。在本文中,我们提出了一种新的数据驱动的由内而外身体跟踪框架,针对基于优化的方法(例如逆运动学求解器)中无处不在的遮挡挑战。我们首先使用光学标记和惯性传感器收集了包含身体和手指运动的大规模动作捕捉数据集。该数据集专注于社交场景,并捕获了自我遮挡和身体-手部交互下的真值姿势。然后,我们使用光线投射算法在捕获的真值数据上模拟头戴式摄像机视图中的遮挡模式,并训练一个深度神经网络来推断被遮挡的身体部位。在实验中,我们表明,将所提方法应用于实时由内而外身体跟踪、手指运动合成和三点逆运动学任务时,能够生成高保真的具身姿势。
24. Generating Natural Questions from Images for Multimodal Assistants [PDF] 返回目录
Alkesh Patel, Akanksha Bindal, Hadas Kotek, Christopher Klein, Jason Williams
Abstract: Generating natural, diverse, and meaningful questions from images is an essential task for multimodal assistants as it confirms whether they have understood the object and scene in the images properly. The research in visual question answering (VQA) and visual question generation (VQG) is a great step. However, this research does not capture questions that a visually-abled person would ask multimodal assistants. Recently published datasets such as KB-VQA, FVQA, and OK-VQA try to collect questions that look for external knowledge, which makes them appropriate for multimodal assistants. However, they still contain many obvious and common-sense questions that humans would not usually ask a digital assistant. In this paper, we provide a new benchmark dataset that contains questions generated by human annotators keeping in mind what they would ask multimodal digital assistants. Large-scale annotations for several hundred thousand images are expensive and time-consuming, so we also present an effective way of automatically generating questions from unseen images. We further present an approach for generating diverse and meaningful questions that considers image content and image metadata (e.g., location, associated keywords). We evaluate our approach using standard evaluation metrics such as BLEU, METEOR, ROUGE, and CIDEr to show the relevance of the generated questions to human-provided questions. We also measure the diversity of the generated questions using generative strength and inventiveness metrics. We report new state-of-the-art results on the public and our datasets.
摘要:从图像中生成自然、多样且有意义的问题是多模态助手的一项基本任务,因为它可以确认助手是否正确理解了图像中的物体和场景。视觉问答(VQA)和视觉问题生成(VQG)的研究是重要的一步。但是,这些研究并未捕捉到有视觉能力的人会向多模态助手提出的问题。最近发布的数据集(例如KB-VQA、FVQA和OK-VQA)尝试收集需要外部知识的问题,这使它们适合于多模态助手。但是,它们仍然包含许多人类通常不会问数字助理的显而易见的常识性问题。在本文中,我们提供了一个新的基准数据集,其中包含人工标注者在设想向多模态数字助理提问时所产生的问题。对数十万张图像进行大规模标注既昂贵又费时,因此我们还提出了一种从未见过的图像中自动生成问题的有效方法。我们进一步提出了一种生成多样且有意义的问题的方法,该方法考虑图像内容和图像元数据(例如位置、关联的关键词)。我们使用BLEU、METEOR、ROUGE和CIDEr等标准评估指标来评估我们的方法,以显示所生成问题与人工提供问题的相关性。我们还使用生成力和创造性指标来衡量所生成问题的多样性。我们在公开数据集和我们自己的数据集上报告了新的最优结果。
25. G-RCN: Optimizing the Gap between Classification and Localization Tasks for Object Detection [PDF] 返回目录
Yufan Luo, Li Xiao
Abstract: Multi-task learning is widely used in computer vision. Currently, object detection models utilize a shared feature map to complete the classification and localization tasks simultaneously. By comparing the performance of the original Faster R-CNN with that of a variant with partially separated feature maps, we show that: (1) sharing high-level features for the classification and localization tasks is sub-optimal; (2) a large stride is beneficial for classification but harmful for localization; and (3) global context information could improve the performance of classification. Based on these findings, we propose a paradigm called Gap-optimized Region-based Convolutional Network (G-RCN), which aims to separate these two tasks and optimize the gap between them. The paradigm is first applied to correct the current ResNet protocol by simply reducing the stride and moving the Conv5 block from the head to the feature extraction network, which brings an improvement of 3.6 in AP70 on the PASCAL VOC dataset and 1.5 in AP on the COCO dataset for ResNet50. Next, the new method is applied to Faster R-CNN with VGG16, ResNet50, and ResNet101 backbones, which brings an improvement of above 2.0 in AP70 on the PASCAL VOC dataset and above 1.9 in AP on the COCO dataset. Noticeably, the implementation of G-RCN involves only a few structural modifications, with no extra modules added.
摘要:多任务学习在计算机视觉中得到了广泛的应用。当前,目标检测模型利用共享特征图来同时完成分类和定位任务。通过比较原始Faster R-CNN与特征图部分分离的版本的性能,我们发现:(1)在分类和定位任务间共享高层特征并非最优;(2)大步幅对分类有利,但对定位不利;(3)全局上下文信息可以提高分类性能。基于这些发现,我们提出了一种名为间隙优化的基于区域的卷积网络(G-RCN)的范式,旨在分离这两个任务并优化它们之间的差距。该范式首先被用于改进当前的ResNet协议,只需减小步幅并将Conv5块从头部移至特征提取网络:对于ResNet50,这在PASCAL VOC数据集上带来了3.6的AP70提升,在COCO数据集上带来了1.5的AP提升。接下来,将新方法应用于以VGG16、ResNet50和ResNet101为主干的Faster R-CNN,在PASCAL VOC数据集上带来了2.0以上的AP70提升,在COCO数据集上带来了1.9以上的AP提升。值得注意的是,G-RCN的实现仅涉及少量结构修改,没有添加额外的模块。
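The stride-reduction part of the recipe is easy to reproduce on a stock torchvision ResNet-50; the snippet below only illustrates that single modification (moving the Conv5 block into the feature extractor, as G-RCN also does, is omitted).

```python
import torch
import torchvision

# Reduce the stride of the last ResNet stage from 2 to 1, so layer4 keeps
# the stride-16 resolution instead of dropping to stride-32.
m = torchvision.models.resnet50(weights=None)
m.layer4[0].conv2.stride = (1, 1)          # Bottleneck puts its stride on conv2
m.layer4[0].downsample[0].stride = (1, 1)  # keep the shortcut path consistent

x = torch.randn(1, 3, 224, 224)
feat = torch.nn.Sequential(*list(m.children())[:-2])(x)  # drop avgpool and fc
print(feat.shape)  # (1, 2048, 14, 14) instead of (1, 2048, 7, 7)
```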
26. w-Net: Dual Supervised Medical Image Segmentation Model with Multi-Dimensional Attention and Cascade Multi-Scale Convolution [PDF] 返回目录
Bo Wang, Lei Wang, Junyang Chen, Zhenghua Xu, Thomas Lukasiewicz, Zhigang Fu
Abstract: Deep learning-based medical image segmentation technology aims at automatically recognizing and annotating objects in medical images. Non-local attention and multi-scale feature learning are widely used in network modeling, which drives progress in medical image segmentation. However, those attention mechanisms provide only weakly strengthened non-local receptive-field connections for small objects in medical images. As a result, the features of important small objects in abstract or coarse feature maps may be discarded, which leads to unsatisfactory performance. Moreover, the existing multi-scale methods simply focus on different sizes of view, and the sparse multi-scale features they collect are not rich enough for small-object segmentation. In this work, a multi-dimensional attention segmentation model with cascade multi-scale convolution is proposed to predict accurate segmentations for small objects in medical images. As the weight function, the multi-dimensional attention modules provide coefficient modification for significant/informative small-object features. Furthermore, the cascade multi-scale convolution modules in each skip-connection path are exploited to capture multi-scale features at different semantic depths. The proposed method is evaluated on three datasets: KiTS19, Pancreas CT of Decathlon-10, and the MICCAI 2018 LiTS Challenge, demonstrating better segmentation performance than the state-of-the-art baselines.
摘要:基于深度学习的医学图像分割技术旨在自动识别和标注医学图像上的目标。非局部注意力和多尺度特征学习方法被广泛用于网络建模,推动了医学图像分割的进展。然而,这些注意力机制方法对医学图像中的小目标只能提供较弱的非局部感受野连接强化。于是,抽象或粗糙特征图中重要小目标的特征可能会被丢弃,导致性能不尽如人意。而且,现有的多尺度方法只是简单地关注不同大小的视野,所收集的稀疏多尺度特征对于小目标分割来说不够丰富。在这项工作中,提出了一种具有级联多尺度卷积的多维注意力分割模型,以对医学图像中的小目标进行精确分割。作为权重函数,多维注意力模块为重要的/信息丰富的小目标特征提供系数修正。此外,利用每条跳跃连接路径中的级联多尺度卷积模块来捕获不同语义深度的多尺度特征。所提方法在三个数据集(KiTS19、Decathlon-10的胰腺CT和MICCAI 2018 LiTS挑战赛)上进行了评估,显示出优于最新基线的分割性能。
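A cascade multi-scale convolution block can be sketched generically as parallel dilated convolutions whose outputs feed into one another before fusion. The dilation rates and fusion choice below are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CascadeMultiScaleConv(nn.Module):
    """Hypothetical sketch: dilated 3x3 convs at growing rates, cascaded so
    each branch sees the previous (smaller-scale) branch's response."""
    def __init__(self, ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in dilations)
        self.fuse = nn.Conv2d(ch * len(dilations), ch, 1)

    def forward(self, x):
        outs, prev = [], x
        for conv in self.branches:
            prev = torch.relu(conv(prev))   # cascade: feed previous output in
            outs.append(prev)
        return self.fuse(torch.cat(outs, dim=1))

m = CascadeMultiScaleConv(16)
print(m(torch.randn(1, 16, 64, 64)).shape)  # (1, 16, 64, 64)
```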
27. Confidence-aware Non-repetitive Multimodal Transformers for TextCaps [PDF] 返回目录
Zhaokai Wang, Renda Bao, Qi Wu, Si Liu
Abstract: When describing an image, reading text in the visual scene is crucial to understanding the key information. Recent work explores the TextCaps task, i.e., image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover it in the generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; and (3) repetition of words in predicted captions. To this end, we propose a Confidence-aware Non-repetitive Multimodal Transformer (CNMT) to tackle the above challenges. Our CNMT consists of reading, reasoning, and generation modules, in which the Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated words in captions. Our model outperforms state-of-the-art models on the TextCaps dataset, improving from 81.0 to 93.0 in CIDEr. Our source code is publicly available.
摘要:在描述图像时,阅读视觉场景中的文本对于理解关键信息至关重要。最近的工作探索了TextCaps任务,即结合光学字符识别(OCR)令牌读取的图像描述任务,要求模型读取文本并将其涵盖在生成的描述中。现有方法无法生成准确的描述,原因在于:(1)阅读能力差;(2)无法在所有提取的OCR令牌中选择关键词;(3)预测的描述中单词重复。为此,我们提出了一种置信度感知的非重复多模态Transformer(CNMT)来应对上述挑战。我们的CNMT由阅读、推理和生成模块组成,其中阅读模块采用更好的OCR系统来增强文本阅读能力,并采用置信度嵌入来选择最值得关注的令牌。为了解决描述中单词冗余的问题,我们的生成模块包含一个重复掩码,以避免在描述中预测重复的单词。我们的模型在TextCaps数据集上优于最新模型,将CIDEr从81.0提高到93.0。我们的源代码是公开可用的。
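The repetition mask is conceptually simple: at each decoding step, the logits of copy-able (e.g., OCR) tokens that already appear in the partial caption are set to negative infinity. The sketch below shows this idea with hypothetical token ids; CNMT's exact mask may differ.

```python
import torch

def mask_repeats(logits, generated_ids, banned_vocab_ids):
    """Forbid re-emitting tokens (e.g., OCR tokens) already in the caption."""
    logits = logits.clone()
    for t in generated_ids:
        if t in banned_vocab_ids:          # only mask the copy-able tokens
            logits[t] = float('-inf')
    return logits

vocab_logits = torch.randn(10)
caption_so_far = [3, 7]                     # token ids emitted so far
ocr_token_ids = {3, 4, 5}                   # ids corresponding to OCR tokens
step_logits = mask_repeats(vocab_logits, caption_so_far, ocr_token_ids)
print(step_logits[3])                       # -inf: token 3 cannot repeat
```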
28. Ada-Segment: Automated Multi-loss Adaptation for Panoptic Segmentation [PDF] 返回目录
Gengwei Zhang, Yiming Gao, Hang Xu, Hao Zhang, Zhenguo Li, Xiaodan Liang
Abstract: Panoptic segmentation, which unifies instance segmentation and semantic segmentation, has recently attracted increasing attention. While most existing methods focus on designing novel architectures, we steer toward a different perspective: performing automated multi-loss adaptation (named Ada-Segment) on the fly to flexibly adjust multiple training losses over the course of training, using a controller trained to capture the learning dynamics. This offers a few advantages: it bypasses manual tuning of the sensitive loss combination, a decisive factor for panoptic segmentation; it allows us to explicitly model the learning dynamics and reconcile the learning of multiple objectives (up to ten in our experiments); and with an end-to-end architecture, it generalizes to different datasets without the need to re-tune hyperparameters or laboriously re-adjust the training process. Our Ada-Segment brings 2.7% panoptic quality (PQ) improvement on the COCO val split over the vanilla baseline, achieving a state-of-the-art 48.5% PQ on the COCO test-dev split and 32.9% PQ on the ADE20K dataset. The extensive ablation studies reveal the ever-changing dynamics throughout the training process, necessitating the incorporation of an automated and adaptive learning strategy as presented in this paper.
摘要:统一实例分割和语义分割的全景分割最近引起了越来越多的关注。虽然大多数现有方法都专注于设计新颖的体系结构,但我们转向一个不同的视角:在训练过程中动态执行自动多损失自适应(称为Ada-Segment),使用一个为捕捉学习动态而训练的控制器,在训练进程中灵活调整多个训练损失。这带来了几个优势:它绕过了对敏感损失组合的手动调整,而损失组合是全景分割的决定性因素;它允许显式地建模学习动态,并协调多个目标的学习(在我们的实验中最多为十个);凭借端到端的体系结构,它可以推广到不同的数据集,而无需重新调整超参数或费力地重新调整训练过程。我们的Ada-Segment在COCO val划分上相对于原始基线带来了2.7%的全景质量(PQ)提升,在COCO test-dev划分上实现了最先进的48.5% PQ,在ADE20K数据集上实现了32.9% PQ。广泛的消融研究揭示了整个训练过程中不断变化的动态,因此有必要采用本文所提出的自动化自适应学习策略。
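The core mechanics of adapting multiple loss weights with a controller can be sketched as follows. The controller architecture and its input (recent loss values) are assumptions for illustration, not the actual Ada-Segment controller or its training signal.

```python
import torch
import torch.nn as nn

class LossController(nn.Module):
    """Hypothetical controller: maps recent loss values to per-loss weights."""
    def __init__(self, n_losses=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_losses, 16), nn.ReLU(),
                                 nn.Linear(16, n_losses))
    def forward(self, loss_history):
        return torch.softmax(self.net(loss_history), dim=-1)  # sums to 1

ctrl = LossController()
losses = torch.tensor([0.9, 2.1, 0.4])           # current task losses
weights = ctrl(losses.detach())                  # weights from learning dynamics
total = (weights * losses).sum()                 # adaptive loss combination
print(weights, total.item())
```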
29. PSCNet: Pyramidal Scale and Global Context Guided Network for Crowd Counting [PDF] 返回目录
Guangshuai Gao, Qingjie Liu, Qi Wen, Yunhong Wang
Abstract: Crowd counting, which aims to accurately count the number of objects in images, has attracted more and more attention from researchers recently. However, challenges from severe occlusion, large scale variation, complex background interference, and non-uniform density distribution limit the crowd number estimation accuracy. To mitigate the above issues, this paper proposes a novel crowd counting approach based on a pyramidal scale module (PSM) and a global context module (GCM), dubbed PSCNet. Moreover, a reliable supervision manner combining Bayesian and counting loss (BCL) is utilized to learn the density probability and then compute the count expectation at each annotation point. Specifically, PSM is used to adaptively capture multi-scale information, which can identify a fine boundary of crowds with different image scales. GCM is devised in a low-complexity and lightweight manner to make the interactive information across the channels of the feature maps more efficient, and meanwhile to guide the model to select more suitable scales generated from PSM. Furthermore, BCL is leveraged to construct a credible density contribution probability supervision manner, which relieves non-uniform density distribution in crowds to a certain extent. Extensive experiments on four crowd counting datasets show the effectiveness and superiority of the proposed model. Additionally, some experiments extended on a remote sensing object counting (RSOC) dataset further validate the generalization ability of the model. Our source code will be released upon the acceptance of this work.
摘要:人群计数旨在精确计数图像中的目标数量,近年来受到越来越多研究者的关注。然而,严重遮挡、大尺度变化、复杂背景干扰和不均匀密度分布等挑战限制了人群数量估计的准确性。为缓解上述问题,本文提出了一种基于金字塔尺度模块(PSM)和全局上下文模块(GCM)的新颖人群计数方法,称为PSCNet。此外,利用结合贝叶斯和计数损失(BCL)的可靠监督方式来学习密度概率,然后计算每个标注点处的计数期望。具体而言,PSM用于自适应地捕获多尺度信息,可以识别不同图像尺度下人群的精细边界。GCM以低复杂度和轻量级的方式设计,使特征图各通道间的交互信息更加高效,同时引导模型选择由PSM生成的更合适的尺度。此外,利用BCL构建了可信的密度贡献概率监督方式,在一定程度上缓解了人群中密度分布的不均匀性。在四个人群计数数据集上的大量实验表明了所提模型的有效性和优越性。此外,在遥感目标计数(RSOC)数据集上扩展的一些实验进一步验证了模型的泛化能力。我们的源代码将在这项工作被接收后发布。
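The "count expectation at each annotation point" follows the Bayesian-loss idea: a posterior over annotation points is computed per pixel, and the predicted density is expected to contribute one count per annotated person. A simplified sketch follows (no background modeling; the Gaussian likelihood width is an assumption).

```python
import torch

def bayesian_count_loss(density, pixel_xy, ann_xy, sigma=8.0):
    """Expected count per annotation: E[c_n] = sum_m p(n|x_m) D(x_m),
    with the posterior p(n|x_m) from a Gaussian around each annotated point."""
    d2 = ((ann_xy[:, None, :] - pixel_xy[None, :, :]) ** 2).sum(-1)  # (N, M)
    lik = torch.exp(-d2 / (2 * sigma ** 2))
    post = lik / lik.sum(dim=0, keepdim=True).clamp_min(1e-12)       # p(n|x_m)
    expected = (post * density[None, :]).sum(dim=1)                  # E[c_n]
    return (expected - 1.0).abs().mean()       # each person should count to 1

# toy data: a 32x32 density map and two annotated head positions
h = w = 32
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
pix = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()
density = torch.rand(h * w) * 0.01
ann = torch.tensor([[8.0, 8.0], [20.0, 24.0]])
print(bayesian_count_loss(density, pix, ann).item())
```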
30. Learned Block Iterative Shrinkage Thresholding Algorithm for Photothermal Super Resolution Imaging [PDF] 返回目录
Samim Ahmadi, Jan Christian Hauffen, Linh Kästner, Peter Jung, Giuseppe Caire, Mathias Ziegler
Abstract: Block-sparse regularization is already well-known in active thermal imaging and is used for multiple measurement based inverse problems. The main bottleneck of this method is the choice of regularization parameters, which differs for each experiment. To avoid time-consuming manual selection of the regularization parameters, we propose a learned block-sparse optimization approach using an iterative algorithm unfolded into a deep neural network. More precisely, we show the benefits of using a learned block iterative shrinkage thresholding algorithm that is able to learn the choice of regularization parameters. In addition, this algorithm enables the determination of a suitable weight matrix to solve the underlying inverse problem. Therefore, in this paper we present the algorithm and compare it with state-of-the-art block iterative shrinkage thresholding using synthetically generated test data and experimental test data from active thermography for defect reconstruction. Our results show that the learned block-sparse optimization approach provides smaller normalized mean square errors for a small fixed number of iterations than without learning. Thus, this new approach improves the convergence speed and only needs a few iterations to generate an accurate defect reconstruction in photothermal super resolution imaging.
摘要:块稀疏正则化在主动热成像中已广为人知,并用于基于多次测量的逆问题。该方法的主要瓶颈是正则化参数的选择,而这一选择因实验而异。为了避免费时的手动选择正则化参数,我们提出了一种学习型块稀疏优化方法,它使用一个展开为深度神经网络的迭代算法。更确切地说,我们展示了使用学习型块迭代收缩阈值算法的好处,该算法能够学习正则化参数的选择。另外,该算法还能够确定合适的权重矩阵来求解潜在的逆问题。因此,在本文中我们介绍了该算法,并使用合成生成的测试数据和来自主动热成像的实验测试数据,将其与最先进的块迭代收缩阈值方法在缺陷重建上进行了比较。我们的结果表明,与不学习相比,使用学习型块稀疏优化方法可在较少的固定迭代次数下获得更小的归一化均方误差。因此,这种新方法可以提高收敛速度,只需几次迭代即可在光热超分辨率成像中生成准确的缺陷重建。
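The unrolled algorithm is in the LISTA family: each network layer performs one soft-thresholded update whose matrices and threshold are learned. The sketch below shows the computation with classic ISTA-style initialization; in the learned version We, S, and theta would be trained by backpropagation (block-wise thresholding and the photothermal weight matrix are omitted).

```python
import numpy as np

def lista(y, We, S, theta, n_layers=5):
    """Unrolled ISTA: each 'layer' computes x <- soft(We @ y + S @ x, theta).
    We, S, and theta would be learned by backprop in the real method."""
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    x = soft(We @ y, theta)
    for _ in range(n_layers - 1):
        x = soft(We @ y + S @ x, theta)
    return x

# toy setup: dictionary A, classic ISTA-style initialization of We and S
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 50)) / np.sqrt(20)
x_true = np.zeros(50); x_true[[3, 17, 41]] = [1.0, -0.5, 2.0]
y = A @ x_true
L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
We = A.T / L
S = np.eye(50) - (A.T @ A) / L
print(lista(y, We, S, theta=0.01, n_layers=20)[[3, 17, 41]])  # near x_true
```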
31. End-to-End Object Detection with Fully Convolutional Network [PDF] 返回目录
Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, Nanning Zheng
Abstract: Mainstream object detectors based on fully convolutional networks have achieved impressive performance. However, most of them still need hand-designed non-maximum suppression (NMS) post-processing, which impedes fully end-to-end training. In this paper, we analyze the effect of discarding NMS, and the results reveal that a proper label assignment plays a crucial role. To this end, for fully convolutional detectors, we introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection, which obtains performance comparable to that with NMS. Besides, a simple 3D Max Filtering (3DMF) is proposed to utilize the multi-scale features and improve the discriminability of convolutions in the local region. With these techniques, our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on the COCO and CrowdHuman datasets. The code is available at this https URL.
摘要:基于全卷积网络的主流目标检测器取得了令人印象深刻的性能。然而,它们中的大多数仍需要手工设计的非极大值抑制(NMS)后处理,这阻碍了完全的端到端训练。在本文中,我们分析了丢弃NMS的影响,结果表明适当的标签分配起着至关重要的作用。为此,对于全卷积检测器,我们引入了一种预测感知的一对一(POTO)标签分配进行分类,以实现端到端检测,并获得了与使用NMS相当的性能。此外,提出了一种简单的3D最大滤波(3DMF),以利用多尺度特征并提高局部区域内卷积的可分辨性。借助这些技术,我们的端到端框架在COCO和CrowdHuman数据集上与许多使用NMS的最新检测器相比具有竞争力的性能。该代码可从此https URL获得。
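As a rough illustration of the duplicate-suppression idea behind 3D Max Filtering, the sketch below keeps a score only where it is the maximum of a local (scale x height x width) window, so top detections can be read off without NMS; the 3x3x3 window and function name are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def max_filter_3d(scores: torch.Tensor, spatial: int = 3) -> torch.Tensor:
    """scores: (L, H, W) classification scores over L FPN levels. Keeps a
    score only where it equals the max of its 3 x spatial x spatial
    neighborhood across scales and space, zeroing likely duplicates."""
    x = scores.unsqueeze(0).unsqueeze(0)          # (1, 1, L, H, W)
    pooled = F.max_pool3d(x, kernel_size=(3, spatial, spatial),
                          stride=1, padding=(1, spatial // 2, spatial // 2))
    return (x * (x == pooled).float()).squeeze(0).squeeze(0)
```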
32. Fine-Grained Dynamic Head for Object Detection [PDF] 返回目录
Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Hongbin Sun, Jian Sun, Nanning Zheng
Abstract: The Feature Pyramid Network (FPN) presents a remarkable approach to alleviate the scale variance in object representation by performing instance-level assignments. Nevertheless, this strategy ignores the distinct characteristics of different sub-regions in an instance. To this end, we propose a fine-grained dynamic head to conditionally select a pixel-level combination of FPN features from different scales for each instance, which further releases the ability of multi-scale feature representation. Moreover, we design a spatial gate with the new activation function to reduce computational complexity dramatically through spatially sparse convolutions. Extensive experiments demonstrate the effectiveness and efficiency of the proposed method on several state-of-the-art detection benchmarks. Code is available at this https URL.
摘要:特征金字塔网络(FPN)提供了一种出色的方法,可以通过执行实例级别的分配来缓解目标表示中的尺度差异。然而,这种策略忽略了实例中不同子区域的独特特征。为此,我们提出了一种细粒度的动态头,可以针对每个实例有条件地从不同尺度中选择FPN特征的像素级组合,从而进一步释放了多尺度特征表示的能力。此外,我们设计了具有新激活函数的空间门,以通过空间稀疏卷积显著降低计算复杂性。大量实验证明了该方法在几种最新检测基准上的有效性和效率。代码可在此https URL获得。
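A minimal sketch of per-pixel gating over FPN levels, assuming the level features have already been resized to a common resolution and channel width; the 1x1-conv softmax gate is a plausible simplification of the paper's spatial gate and new activation function.

```python
import torch
import torch.nn as nn

class PixelScaleGate(nn.Module):
    """Predicts a per-pixel softmax over FPN levels and mixes the
    channel-aligned, same-size level features accordingly."""
    def __init__(self, channels: int, n_levels: int):
        super().__init__()
        self.gate = nn.Conv2d(channels * n_levels, n_levels, kernel_size=1)

    def forward(self, feats):                 # list of L tensors, each (B, C, H, W)
        stacked = torch.stack(feats, dim=1)   # (B, L, C, H, W)
        weights = self.gate(torch.cat(feats, dim=1)).softmax(dim=1)  # (B, L, H, W)
        return (stacked * weights.unsqueeze(2)).sum(dim=1)           # (B, C, H, W)
```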
33. A Singular Value Perspective on Model Robustness [PDF] 返回目录
Malhar Jere, Maghav Kumar, Farinaz Koushanfar
Abstract: Convolutional Neural Networks (CNNs) have made significant progress on several computer vision benchmarks, but are fraught with numerous non-human biases such as vulnerability to adversarial samples. Their lack of explainability makes identification and rectification of these biases difficult, and understanding their generalization behavior remains an open problem. In this work we explore the relationship between the generalization behavior of CNNs and the Singular Value Decomposition (SVD) of images. We show that naturally trained and adversarially robust CNNs exploit highly different features for the same dataset. We demonstrate that these features can be disentangled by SVD for ImageNet and CIFAR-10 trained networks. Finally, we propose Rank Integrated Gradients (RIG), the first rank-based feature attribution method to understand the dependence of CNNs on image rank.
摘要:卷积神经网络(CNN)在一些计算机视觉基准上已取得了重大进展,但存在许多非人类的偏差,例如对对抗样本的脆弱性。它们缺乏可解释性,使得难以识别和纠正这些偏差,而理解其泛化行为仍然是一个悬而未决的问题。在这项工作中,我们探索了CNN的泛化行为与图像的奇异值分解(SVD)之间的关系。我们证明了自然训练的CNN和对抗鲁棒的CNN对同一数据集利用了截然不同的特征。我们证明,对于在ImageNet和CIFAR-10上训练的网络,这些特征可以通过SVD进行解耦。最后,我们提出了秩积分梯度(RIG),这是第一种基于秩的特征归因方法,用于理解CNN对图像秩的依赖性。
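The SVD viewpoint is easy to probe: reconstruct an image from its top-k singular components and observe how a classifier's prediction changes with k. The snippet below is a generic rank-truncation utility, not the paper's RIG attribution method.

```python
import numpy as np

def rank_k_image(img: np.ndarray, k: int) -> np.ndarray:
    """Reconstruct a grayscale image (H, W) from its top-k singular components."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```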
34. Fine-grained Angular Contrastive Learning with Coarse Labels [PDF] 返回目录
Guy Bukchin, Eli Schwartz, Kate Saenko, Ori Shahar, Rogerio Feris, Raja Giryes, Leonid Karlinsky
Abstract: Few-shot learning methods offer pre-training techniques optimized for easier later adaptation of the model to new classes (unseen during training) using one or a few examples. This adaptivity to unseen classes is especially important for many practical applications where the pre-trained label space cannot remain fixed for effective use and the model needs to be "specialized" to support new categories on the fly. One particularly interesting scenario, essentially overlooked by the few-shot literature, is Coarse-to-Fine Few-Shot (C2FS), where the training classes (e.g. animals) are of much `coarser granularity' than the target (test) classes (e.g. breeds). A very practical example of C2FS is when the target classes are sub-classes of the training classes. Intuitively, it is especially challenging as (both regular and few-shot) supervised pre-training tends to learn to ignore intra-class variability which is essential for separating sub-classes. In this paper, we introduce a novel 'Angular normalization' module that allows to effectively combine supervised and self-supervised contrastive pre-training to approach the proposed C2FS task, demonstrating significant gains in a broad study over multiple baselines and datasets. We hope that this work will help to pave the way for future research on this new, challenging, and very practical topic of C2FS classification.
摘要:小样本学习方法提供了经过优化的预训练技术,以便以后使用一个或几个示例更容易地将模型适应于新类别(在训练过程中未见)。这种对未见类别的适应性对于许多实际应用尤其重要,在这些应用中,预训练的标签空间无法保持固定以供有效使用,并且模型需要"专门化"以动态支持新类别。一个特别有趣、但基本上被小样本学习文献所忽略的场景是粗到细小样本学习(C2FS),其中训练类别(例如动物)的粒度比目标(测试)类别(例如品种)"粗"得多。C2FS的一个非常实际的示例是目标类别是训练类别的子类。直观地讲,这特别具有挑战性,因为(常规和小样本的)有监督预训练往往会学会忽略类内变异性,而这对于区分子类至关重要。在本文中,我们介绍了一个新颖的"角度归一化"模块,该模块可以有效地结合有监督和自监督的对比预训练来处理所提出的C2FS任务,在涵盖多个基线和数据集的广泛研究中展示了显著增益。我们希望这项工作将有助于为C2FS分类这一新的、具有挑战性且非常实用的主题的未来研究铺平道路。
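A common way to realize 'angular' classification, which a module like the proposed one builds on, is a cosine head in which both embeddings and class weights are l2-normalized so logits depend only on angles; the scale value below is an assumed hyperparameter, not the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularHead(nn.Module):
    """Cosine classification head: logits are scaled cosine similarities
    between l2-normalized embeddings and l2-normalized class weights."""
    def __init__(self, dim: int, n_classes: int, scale: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, dim))
        self.scale = scale

    def forward(self, z: torch.Tensor) -> torch.Tensor:   # z: (B, dim)
        z = F.normalize(z, dim=-1)
        w = F.normalize(self.weight, dim=-1)
        return self.scale * z @ w.t()                      # (B, n_classes)
```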
35. Rethinking Learnable Tree Filter for Generic Feature Transform [PDF] 返回目录
Lin Song, Yanwei Li, Zhengkai Jiang, Zeming Li, Xiangyu Zhang, Hongbin Sun, Jian Sun, Nanning Zheng
Abstract: The Learnable Tree Filter presents a remarkable approach to model structure-preserving relations for semantic segmentation. Nevertheless, the intrinsic geometric constraint forces it to focus on the regions with close spatial distance, hindering the effective long-range interactions. To relax the geometric constraint, we give the analysis by reformulating it as a Markov Random Field and introduce a learnable unary term. Besides, we propose a learnable spanning tree algorithm to replace the original non-differentiable one, which further improves the flexibility and robustness. With the above improvements, our method can better capture long-range dependencies and preserve structural details with linear complexity, which is extended to several vision tasks for more generic feature transform. Extensive experiments on object detection/instance segmentation demonstrate the consistent improvements over the original version. For semantic segmentation, we achieve leading performance (82.1% mIoU) on the Cityscapes benchmark without bells-and-whistles. Code is available at this https URL.
摘要:可学习树滤波器为语义分割提供了一种建模保留结构关系的出色方法。然而,固有的几何约束迫使其将注意力集中在空间距离较近的区域上,从而阻碍了有效的长程交互。为了放松几何约束,我们通过将其重新表述为马尔可夫随机场来进行分析,并引入了可学习的一元项。此外,我们提出了一种可学习的生成树算法来代替原来的不可微算法,从而进一步提高了灵活性和鲁棒性。通过上述改进,我们的方法可以更好地捕获长程依赖关系,并以线性复杂度保留结构细节,并被扩展到多个视觉任务以实现更通用的特征变换。在目标检测/实例分割上的大量实验证明了相对于原始版本的一致改进。在语义分割方面,我们在不使用任何额外技巧的情况下在Cityscapes基准测试中取得了领先的性能(82.1% mIoU)。代码可在此https URL获得。
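For intuition, the classic construction the paper sets out to replace builds a (non-differentiable) minimum spanning tree over the pixel grid with feature-distance edge weights; a SciPy sketch with a hypothetical function name:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def grid_mst(feat: np.ndarray):
    """Minimum spanning tree over a 4-connected pixel grid; edge weights
    are feature distances between neighboring pixels. feat: (H, W, C)."""
    H, W, _ = feat.shape
    idx = np.arange(H * W).reshape(H, W)
    right = np.linalg.norm(feat[:, :-1] - feat[:, 1:], axis=-1).ravel()
    down = np.linalg.norm(feat[:-1] - feat[1:], axis=-1).ravel()
    rows = np.concatenate([idx[:, :-1].ravel(), idx[:-1].ravel()])
    cols = np.concatenate([idx[:, 1:].ravel(), idx[1:].ravel()])
    weights = np.concatenate([right, down]) + 1e-8   # keep zero-distance edges
    g = coo_matrix((weights, (rows, cols)), shape=(H * W, H * W))
    return minimum_spanning_tree(g)                  # sparse matrix of tree edges
```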
36. Meta Ordinal Regression Forest For Learning with Unsure Lung Nodules [PDF] 返回目录
Yiming Lei, Haiping Zhu, Junping Zhang, Hongming Shan
Abstract: Deep learning-based methods have achieved promising performance in early detection and classification of lung nodules, most of which discard unsure nodules and simply deal with a binary classification -- malignant vs benign. Recently, an unsure data model (UDM) was proposed to incorporate those unsure nodules by formulating this problem as an ordinal regression, showing better performance over traditional binary classification. To further explore the ordinal relationship for lung nodule classification, this paper proposes a meta ordinal regression forest (MORF), which improves upon the state-of-the-art ordinal regression method, deep ordinal regression forest (DORF), in three major ways. First, MORF can alleviate the biases of the predictions by making full use of deep features while DORF needs to fix the composition of decision trees before training. Second, MORF has a novel grouped feature selection (GFS) module to re-sample the split nodes of decision trees. Last, combined with GFS, MORF is equipped with a meta learning-based weighting scheme to map the features selected by GFS to tree-wise weights while DORF assigns equal weights for all trees. Experimental results on the LIDC-IDRI dataset demonstrate superior performance over existing methods, including the state-of-the-art DORF.
摘要:基于深度学习的方法在肺结节的早期检测和分类中取得了令人鼓舞的性能,其中大多数方法丢弃不确定的结节,仅处理恶性与良性的二分类。最近,有人提出了不确定数据模型(UDM),通过将该问题表述为序数回归来纳入那些不确定的结节,显示出优于传统二分类的性能。为了进一步探索肺结节分类的序数关系,本文提出了一种元序数回归森林(MORF),它在三个主要方面对最先进的序数回归方法,即深度序数回归森林(DORF),进行了改进。首先,MORF可以通过充分利用深度特征来减轻预测的偏差,而DORF需要在训练之前固定决策树的组成。其次,MORF具有新颖的分组特征选择(GFS)模块,用于对决策树的分裂节点进行重采样。最后,结合GFS,MORF配备了基于元学习的加权方案,将GFS选择的特征映射为逐树权重,而DORF为所有树分配相等的权重。在LIDC-IDRI数据集上的实验结果证明其性能优于包括最先进的DORF在内的现有方法。
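Ordinal regression methods in this family typically reduce a K-grade label to K-1 ordered binary decisions; a minimal encoding helper showing the reduction itself (not the forest):

```python
import torch

def ordinal_targets(labels: torch.Tensor, n_grades: int) -> torch.Tensor:
    """Encode an ordinal label g as K-1 binary targets [g>0, g>1, ...].
    labels: (B,) integer grades in [0, n_grades)."""
    thresholds = torch.arange(n_grades - 1, device=labels.device)
    return (labels.unsqueeze(1) > thresholds).float()   # (B, K-1)

# e.g. with 5 malignancy grades, label 3 -> [1., 1., 1., 0.]
```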
37. Attention-based Saliency Hashing for Ophthalmic Image Retrieval [PDF] 返回目录
Jiansheng Fang, Yanwu Xu, Xiaoqing Zhang, Yan Hu, Jiang Liu
Abstract: Deep hashing methods have been proved to be effective for large-scale medical image search assisting reference-based diagnosis for clinicians. However, when the salient region plays a maximal discriminative role in ophthalmic images, existing deep hashing methods do not fully exploit the learning ability of the deep network to capture the features of salient regions pointedly. Different grades or classes of ophthalmic images may share similar overall performance but have subtle differences that can be differentiated by mining salient regions. To address this issue, we propose a novel end-to-end network, named Attention-based Saliency Hashing (ASH), for learning compact hash-codes to represent ophthalmic images. ASH embeds a spatial-attention module to focus more on the representation of salient regions and highlights their essential role in differentiating ophthalmic images. Benefiting from the spatial-attention module, the information of salient regions can be mapped into the hash-code for similarity calculation. In the training stage, we input image pairs to share the weights of the network, and a pairwise loss is designed to maximize the discriminability of the hash-code. In the retrieval stage, ASH obtains the hash-code of an input image in an end-to-end manner, then the hash-code is used for similarity calculation to return the most similar images. Extensive experiments on two different modalities of ophthalmic image datasets demonstrate that the proposed ASH can further improve the retrieval performance compared to the state-of-the-art deep hashing methods due to the huge contributions of the spatial-attention module.
摘要:事实证明,深度哈希方法对于辅助临床医生进行基于参考诊断的大规模医学图像搜索是有效的。然而,当显著区域在眼科图像中发挥最大的判别作用时,现有的深度哈希方法无法充分利用深度网络的学习能力来有针对性地捕捉显著区域的特征。眼科图像的不同等级或类别可能具有相似的整体表现,但存在可以通过挖掘显著区域来区分的细微差异。为了解决这个问题,我们提出了一种新颖的端到端网络,称为基于注意力的显著性哈希(ASH),用于学习紧凑的哈希码来表示眼科图像。ASH嵌入了一个空间注意力模块,以更多地关注显著区域的表示,并突出其在区分眼科图像中的重要作用。得益于空间注意力模块,显著区域的信息可以被映射到哈希码中以进行相似度计算。在训练阶段,我们输入图像对以共享网络的权重,并设计了成对损失以最大化哈希码的可判别性。在检索阶段,ASH以端到端的方式输入图像来获取哈希码,然后将该哈希码用于相似度计算以返回最相似的图像。在两种不同模态的眼科图像数据集上进行的广泛实验表明,得益于空间注意力模块的巨大贡献,所提出的ASH与最先进的深度哈希方法相比可以进一步提高检索性能。
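A minimal sketch of the kind of pairwise objective the training stage describes, on relaxed hash codes in [-1, 1]; the contrastive form and margin value are assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pairwise_hash_loss(h1: torch.Tensor, h2: torch.Tensor,
                       same: torch.Tensor, margin: float = 2.0) -> torch.Tensor:
    """Pull codes of matching pairs together; push non-matching pairs at
    least `margin` apart. same: (B,) with 1 for matching pairs, else 0."""
    d = (h1 - h2).pow(2).sum(dim=1)                          # squared distance
    push = F.relu(margin - d.clamp_min(1e-12).sqrt()).pow(2)
    return (same * d + (1 - same) * push).mean()
```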
38. PFA-GAN: Progressive Face Aging with Generative Adversarial Network [PDF] 返回目录
Zhizhong Huang, Shouzhen Chen, Junping Zhang, Hongming Shan
Abstract: Face aging is to render a given face to predict its future appearance, which plays an important role in the information forensics and security field as the appearance of the face typically varies with age. Although impressive results have been achieved with conditional generative adversarial networks (cGANs), the existing cGANs-based methods typically use a single network to learn various aging effects between any two different age groups. However, they cannot simultaneously meet three essential requirements of face aging - including image quality, aging accuracy, and identity preservation -- and usually generate aged faces with strong ghost artifacts when the age gap becomes large. Inspired by the fact that faces gradually age over time, this paper proposes a novel progressive face aging framework based on generative adversarial network (PFA-GAN) to mitigate these issues. Unlike the existing cGANs-based methods, the proposed framework contains several sub-networks to mimic the face aging process from young to old, each of which only learns some specific aging effects between two adjacent age groups. The proposed framework can be trained in an end-to-end manner to eliminate accumulative artifacts and blurriness. Moreover, this paper introduces an age estimation loss to take into account the age distribution for an improved aging accuracy, and proposes to use the Pearson correlation coefficient as an evaluation metric measuring the aging smoothness for face aging methods. Extensively experimental results demonstrate superior performance over existing (c)GANs-based methods, including the state-of-the-art one, on two benchmarked datasets. The source code is available at~\url{this https URL}.
摘要:人脸老化旨在渲染给定的人脸以预测其未来外观,由于人脸外观通常随年龄变化,这在信息取证和安全领域起着重要作用。尽管条件生成对抗网络(cGAN)已取得令人印象深刻的结果,但现有的基于cGAN的方法通常使用单个网络来学习任意两个不同年龄组之间的各种老化效果。但是,它们不能同时满足人脸老化的三个基本要求(包括图像质量、老化精度和身份保持),并且在年龄差距变大时通常会生成带有强烈鬼影伪像的老化人脸。受人脸随时间逐渐老化这一事实的启发,本文提出了一种基于生成对抗网络的新颖渐进式人脸老化框架(PFA-GAN)来缓解这些问题。与现有的基于cGAN的方法不同,所提出的框架包含多个子网络来模拟从年轻到年老的人脸老化过程,每个子网络仅学习两个相邻年龄组之间的特定老化效果。所提出的框架可以端到端地训练,以消除累积的伪像和模糊。此外,本文引入了一种考虑年龄分布的年龄估计损失以提高老化精度,并建议使用Pearson相关系数作为评估人脸老化方法老化平滑度的指标。大量实验结果表明,在两个基准数据集上,其性能优于现有的基于(c)GAN的方法,包括最先进的方法。源代码可从此https URL获得。
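The proposed aging-smoothness metric is straightforward to compute: the Pearson correlation between ages estimated on the generated faces and the target ages they were conditioned on. A sketch assuming an external age estimator has already produced `estimated_ages`:

```python
import numpy as np

def aging_smoothness(estimated_ages: np.ndarray, target_ages: np.ndarray) -> float:
    """Pearson correlation between estimated and target ages; values near 1
    indicate a smooth, monotonic aging trajectory."""
    return float(np.corrcoef(estimated_ages, target_ages)[0, 1])
```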
39. VideoMix: Rethinking Data Augmentation for Video Classification [PDF] 返回目录
Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Jinhyung Kim
Abstract: State-of-the-art video action classifiers often suffer from overfitting. They tend to be biased towards specific objects and scene cues, rather than the foreground action content, leading to sub-optimal generalization performances. Recent data augmentation strategies have been reported to address the overfitting problems in static image classifiers. Despite the effectiveness on the static image classifiers, data augmentation has rarely been studied for videos. For the first time in the field, we systematically analyze the efficacy of various data augmentation strategies on the video classification task. We then propose a powerful augmentation strategy VideoMix. VideoMix creates a new training video by inserting a video cuboid into another video. The ground truth labels are mixed proportionally to the number of voxels from each video. We show that VideoMix lets a model learn beyond the object and scene biases and extract more robust cues for action recognition. VideoMix consistently outperforms other augmentation baselines on Kinetics and the challenging Something-Something-V2 benchmarks. It also improves the weakly-supervised action localization performance on THUMOS'14. VideoMix pretrained models exhibit improved accuracies on the video detection task (AVA).
摘要:最先进的视频动作分类器经常遭受过拟合的困扰。它们倾向于偏向特定的对象和场景线索,而不是前景的动作内容,从而导致次优的泛化性能。据报道,最近的数据增强策略解决了静态图像分类器中的过拟合问题。尽管在静态图像分类器上有效,但针对视频的数据增强很少被研究。我们在该领域首次系统地分析了各种数据增强策略对视频分类任务的功效。然后,我们提出了一种强大的增强策略VideoMix。VideoMix通过将一个视频的长方体插入另一个视频来创建新的训练视频。真实标签按照来自每个视频的体素数量成比例混合。我们证明了VideoMix可以让模型在对象和场景偏差之外进行学习,并提取出更鲁棒的动作识别线索。VideoMix在Kinetics和具有挑战性的Something-Something-V2基准上始终优于其他增强基线。它还提高了THUMOS'14上的弱监督动作定位性能。VideoMix预训练模型在视频检测任务(AVA)上表现出更高的准确性。
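A sketch of the cuboid-mixing idea under stated assumptions: one-hot labels, a Beta-distributed mixing ratio, and a cuboid whose volume fraction matches the label mixing weight; the paper's exact sampling scheme and variants (e.g. spatial-only) may differ.

```python
import torch

def videomix(videos: torch.Tensor, labels: torch.Tensor, alpha: float = 1.0):
    """videos: (B, C, T, H, W); labels: (B, n_classes) one-hot. Pastes a
    random spatio-temporal cuboid from a shuffled batch into each video and
    mixes labels by the fraction of voxels replaced."""
    B, _, T, H, W = videos.shape
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    t, h, w = (int(round(s * (1 - lam) ** (1 / 3))) for s in (T, H, W))
    t0 = int(torch.randint(0, T - t + 1, (1,)))
    h0 = int(torch.randint(0, H - h + 1, (1,)))
    w0 = int(torch.randint(0, W - w + 1, (1,)))
    perm = torch.randperm(B)
    out = videos.clone()
    out[:, :, t0:t0 + t, h0:h0 + h, w0:w0 + w] = \
        videos[perm][:, :, t0:t0 + t, h0:h0 + h, w0:w0 + w]
    frac = (t * h * w) / (T * H * W)            # fraction of replaced voxels
    return out, (1 - frac) * labels + frac * labels[perm]
```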
40. Hyperspectral Classification Based on Lightweight 3-D-CNN With Transfer Learning [PDF] 返回目录
Haokui Zhang, Ying Li, Yenan Jiang, Peng Wang, Qiang Shen, Chunhua Shen
Abstract: Recently, hyperspectral image (HSI) classification approaches based on deep learning (DL) models have been proposed and shown promising performance. However, because of very limited available training samples and massive model parameters, DL methods may suffer from overfitting. In this paper, we propose an end-to-end 3-D lightweight convolutional neural network (CNN) (abbreviated as 3-D-LWNet) for HSI classification with limited samples. Compared with conventional 3-D-CNN models, the proposed 3-D-LWNet has a deeper network structure, fewer parameters, and lower computation cost, resulting in better classification performance. To further alleviate the small sample problem, we also propose two transfer learning strategies: 1) a cross-sensor strategy, in which we pretrain a 3-D model on source HSI data sets containing a greater number of labeled samples and then transfer it to the target HSI data sets, and 2) a cross-modal strategy, in which we pretrain a 3-D model on 2-D RGB image data sets containing a large number of samples and then transfer it to the target HSI data sets. In contrast to previous approaches, we do not impose restrictions on the source data sets, which do not have to be collected by the same sensors as the target data sets. Experiments on three public HSI data sets captured by different sensors demonstrate that our model achieves competitive performance for HSI classification compared to several state-of-the-art methods.
摘要:最近,基于深度学习(DL)模型的高光谱图像(HSI)分类方法已被提出,并显示出有希望的性能。但是,由于可用的训练样本非常有限而模型参数庞大,DL方法可能会过拟合。在本文中,我们提出了一种端到端的三维轻量级卷积神经网络(CNN)(缩写为3-D-LWNet),用于有限样本下的HSI分类。与传统的3-D-CNN模型相比,所提出的3-D-LWNet具有更深的网络结构、更少的参数和更低的计算成本,从而获得更好的分类性能。为了进一步缓解小样本问题,我们还提出了两种迁移学习策略:1)跨传感器策略,即我们在包含大量标记样本的源HSI数据集上预训练三维模型,然后将其迁移到目标HSI数据集;2)跨模态策略,即我们在包含大量样本的二维RGB图像数据集上预训练三维模型,然后将其迁移到目标HSI数据集。与以前的方法相比,我们不对源数据集施加限制,即它们不必由与目标数据集相同的传感器采集。在由不同传感器捕获的三个公共HSI数据集上的实验表明,与几种最先进的方法相比,我们的模型在HSI分类上取得了有竞争力的性能。
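One standard way to make a 3-D CNN 'lightweight' is a depthwise-separable 3-D convolution, which cuts parameters roughly by the kernel volume; the block below is a hypothetical building block illustrating the idea, not the paper's exact architecture.

```python
import torch.nn as nn

def lightweight_conv3d(cin: int, cout: int) -> nn.Sequential:
    """Depthwise-separable 3-D conv: per-channel 3x3x3 conv followed by a
    1x1x1 pointwise conv, instead of one dense 3-D convolution."""
    return nn.Sequential(
        nn.Conv3d(cin, cin, kernel_size=3, padding=1, groups=cin, bias=False),
        nn.Conv3d(cin, cout, kernel_size=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )
```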
41. Selective Pseudo-Labeling with Reinforcement Learning for Semi-Supervised Domain Adaptation [PDF] 返回目录
Bingyu Liu, Yuhong Guo, Jieping Ye, Weihong Deng
Abstract: Recent domain adaptation methods have demonstrated impressive improvement on unsupervised domain adaptation problems. However, in the semi-supervised domain adaptation (SSDA) setting where the target domain has a few labeled instances available, these methods can fail to improve performance. Inspired by the effectiveness of pseudo-labels in domain adaptation, we propose a reinforcement learning based selective pseudo-labeling method for semi-supervised domain adaptation. It is difficult for conventional pseudo-labeling methods to balance the correctness and representativeness of pseudo-labeled data. To address this limitation, we develop a deep Q-learning model to select both accurate and representative pseudo-labeled instances. Moreover, motivated by large margin loss's capacity on learning discriminative features with little data, we further propose a novel target margin loss for our base model training to improve its discriminability. Our proposed method is evaluated on several benchmark datasets for SSDA, and demonstrates superior performance to all the comparison methods.
摘要:最近的领域自适应方法已经在无监督领域自适应问题上展示了显著的改进。但是,在目标域具有少量可用标记实例的半监督领域自适应(SSDA)设置中,这些方法可能无法提高性能。受伪标签在领域自适应中的有效性的启发,我们提出了一种基于强化学习的选择性伪标记方法,用于半监督领域自适应。传统的伪标记方法很难平衡伪标记数据的正确性和代表性。为了解决这个限制,我们开发了一个深度Q学习模型来选择既准确又具有代表性的伪标记实例。此外,受大间隔损失在少量数据下学习判别特征的能力的启发,我们进一步为基础模型训练提出了一种新颖的目标间隔损失,以提高其可判别性。我们提出的方法在SSDA的多个基准数据集上进行了评估,并展示了优于所有对比方法的性能。
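A minimal sketch of a target margin loss in the spirit described: subtract a margin from the ground-truth logit before cross-entropy, so the correct class must win by at least that margin; the margin value is an assumption.

```python
import torch
import torch.nn.functional as F

def target_margin_loss(logits: torch.Tensor, labels: torch.Tensor,
                       margin: float = 0.5) -> torch.Tensor:
    """Cross-entropy on margin-adjusted logits. logits: (B, n_classes)."""
    adjusted = logits.clone()
    adjusted[torch.arange(len(labels)), labels] -= margin
    return F.cross_entropy(adjusted, labels)
```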
42. Interpreting Deep Neural Networks with Relative Sectional Propagation by Analyzing Comparative Gradients and Hostile Activations [PDF] 返回目录
Woo-Jeoung Nam, Jaesik Choi, Seong-Whan Lee
Abstract: The clear transparency of Deep Neural Networks (DNNs) is hampered by complex internal structures and nonlinear transformations along deep hierarchies. In this paper, we propose a new attribution method, Relative Sectional Propagation (RSP), for fully decomposing the output predictions with the characteristics of class-discriminative attributions and clear objectness. We carefully revisit some shortcomings of backpropagation-based attribution methods, which are trade-off relations in decomposing DNNs. We define hostile factor as an element that interferes with finding the attributions of the target and propagate it in a distinguishable way to overcome the non-suppressed nature of activated neurons. As a result, it is possible to assign the bi-polar relevance scores of the target (positive) and hostile (negative) attributions while maintaining each attribution aligned with the importance. We also present the purging techniques to prevent the decrement of the gap between the relevance scores of the target and hostile attributions during backward propagation by eliminating the conflicting units to channel attribution map. Therefore, our method makes it possible to decompose the predictions of DNNs with clearer class-discriminativeness and detailed elucidations of activation neurons compared to the conventional attribution methods. In a verified experimental environment, we report the results of the assessments: (i) Pointing Game, (ii) mIoU, and (iii) Model Sensitivity with PASCAL VOC 2007, MS COCO 2014, and ImageNet datasets. The results demonstrate that our method outperforms existing backward decomposition methods, including distinctive and intuitive visualizations.
摘要:复杂的内部结构和沿深层层次的非线性变换阻碍了深度神经网络(DNN)的透明性。在本文中,我们提出了一种新的归因方法,即相对分段传播(RSP),用于将输出预测完全分解为具有类判别性和清晰对象性特征的归因。我们仔细回顾了基于反向传播的归因方法的一些缺点,它们是分解DNN时的权衡关系。我们将敌对因素定义为干扰寻找目标归因的元素,并以可区分的方式传播它,以克服被激活神经元的非抑制特性。因此,可以在保持每个归因与重要性一致的同时,分配目标(正)和敌对(负)归因的双极相关性得分。我们还提出了清除技术,通过消除通道归因图中相互冲突的单元,防止反向传播期间目标归因与敌对归因的相关性得分之间的差距缩小。因此,与传统的归因方法相比,我们的方法能够以更清晰的类判别性和对激活神经元的详细阐释来分解DNN的预测。在经过验证的实验环境中,我们报告了以下评估结果:(i)指向游戏(Pointing Game),(ii)mIoU,以及(iii)使用PASCAL VOC 2007、MS COCO 2014和ImageNet数据集的模型敏感性。结果表明,我们的方法优于现有的反向分解方法,并具有独特且直观的可视化效果。
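For intuition only: a crude gradient-times-input analogue of separating target-supporting and 'hostile' contributions; RSP itself is a backpropagation rule defined layer by layer, not this input-gradient heuristic.

```python
import torch

def signed_attribution(model, x: torch.Tensor, target: int):
    """Split gradient-times-input into positive (target-supporting) and
    negative ('hostile') parts. x: (1, C, H, W); model outputs (1, n_classes)."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    a = (x.grad * x).detach()
    return a.clamp(min=0), (-a).clamp(min=0)
```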
43. Boosting Image Super-Resolution Via Fusion of Complementary Information Captured by Multi-Modal Sensors [PDF] 返回目录
Fan Wang, Jiangxin Yang, Yanlong Cao, Yanpeng Cao, Michael Ying Yang
Abstract: Image Super-Resolution (SR) provides a promising technique to enhance the image quality of low-resolution optical sensors, facilitating better-performing target detection and autonomous navigation in a wide range of robotics applications. It is noted that the state-of-the-art SR methods are typically trained and tested using single-channel inputs, neglecting the fact that the cost of capturing high-resolution images in different spectral domains varies significantly. In this paper, we attempt to leverage complementary information from a low-cost channel (visible/depth) to boost image quality of an expensive channel (thermal) using fewer parameters. To this end, we first present an effective method to virtually generate pixel-wise aligned visible and thermal images based on real-time 3D reconstruction of multi-modal data captured at various viewpoints. Then, we design a feature-level multispectral fusion residual network model to perform high-accuracy SR of thermal images by adaptively integrating co-occurrence features presented in multispectral images. Experimental results demonstrate that this new approach can effectively alleviate the ill-posed inverse problem of image SR by taking into account complementary information from an additional low-cost channel, significantly outperforming state-of-the-art SR approaches in terms of both accuracy and efficiency.
摘要:图像超分辨率(SR)提供了一种有前途的技术,可提高低分辨率光学传感器的图像质量,从而在各种机器人应用中促进性能更好的目标检测和自主导航。值得注意的是,最新的SR方法通常使用单通道输入进行训练和测试,而忽略了在不同光谱域中捕获高分辨率图像的成本差异很大这一事实。在本文中,我们尝试利用来自低成本通道(可见光/深度)的互补信息,以较少的参数提升昂贵通道(热成像)的图像质量。为此,我们首先提出一种有效的方法,基于对各视点捕获的多模态数据的实时3D重建,虚拟生成逐像素对齐的可见光图像和热图像。然后,我们设计了一个特征级多光谱融合残差网络模型,通过自适应地整合多光谱图像中呈现的共现特征来执行热图像的高精度SR。实验结果表明,这种新方法通过利用来自额外低成本通道的互补信息,可以有效缓解图像SR的不适定逆问题,在准确性和效率方面均明显优于最先进的SR方法。
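A minimal sketch of feature-level fusion with a residual connection, assuming the visible/depth guidance features are already spatially aligned with the thermal branch; channel counts and layer layout are illustrative.

```python
import torch
import torch.nn as nn

class FusionResidualBlock(nn.Module):
    """Fuse guidance-channel features into the thermal SR branch through a
    residual path, so guidance can only add detail to the thermal signal."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, thermal, guidance):      # both (B, C, H, W)
        return thermal + self.fuse(torch.cat([thermal, guidance], dim=1))
```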
44. PMP-Net: Point Cloud Completion by Learning Multi-step Point Moving Paths [PDF] 返回目录
Xin Wen, Peng Xiang, Zhizhong Han, Yan-Pei Cao, Pengfei Wan, Wen Zheng, Yu-Shen Liu
Abstract: The task of point cloud completion aims to predict the missing part for an incomplete 3D shape. A widely used strategy is to generate a complete point cloud from the incomplete one. However, the unordered nature of point clouds will degrade the generation of high-quality 3D shapes, as the detailed topology and structure of discrete points are hard to be captured by the generative process only using a latent code. In this paper, we address the above problem by reconsidering the completion task from a new perspective, where we formulate the prediction as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net, to mimic the behavior of an earth mover. It moves each point of the incomplete input to complete the point cloud, where the total distance of point moving paths (PMP) should be shortest. Therefore, PMP-Net predicts a unique point moving path for each point according to the constraint of total point moving distances. As a result, the network learns a strict and unique correspondence on point-level, which can capture the detailed topology and structure relationships between the incomplete shape and the complete target, and thus improves the quality of the predicted complete shape. We conduct comprehensive experiments on Completion3D and PCN datasets, which demonstrate our advantages over the state-of-the-art point cloud completion methods.
摘要:点云完成的任务旨在预测3D形状不完整的缺失部分。一种广泛使用的策略是从不完整的点云生成完整的点云。但是,点云的无序性质将降低高质量3D形状的生成,因为仅使用潜在代码,生成过程很难捕获离散点的详细拓扑和结构。在本文中,我们通过从新的角度重新考虑完工任务来解决上述问题,在此我们将预测公式化为点云变形过程。具体来说,我们设计了一种新型的神经网络,称为PMP-Net,以模仿推土机的行为。它移动不完整输入的每个点以完成点云,其中点移动路径(PMP)的总距离应最短。因此,PMP-Net根据总点移动距离的约束为每个点预测唯一的点移动路径。结果,网络在点级上学习到严格而唯一的对应关系,可以捕获不完整形状和完整目标之间的详细拓扑和结构关系,从而提高了预测完整形状的质量。我们对Completion3D和PCN数据集进行了全面的实验,证明了我们相对于最新的点云完成方法的优势。
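The shortest-total-path constraint can be expressed as a regularizer over the per-step displacements of a multi-step deformation; a sketch assuming a decoder has already produced the displacement list.

```python
import torch

def pmp_regularizer(displacements) -> torch.Tensor:
    """Mean total path length over the batch. displacements: list of
    (B, N, 3) per-step offsets applied to the N moving points."""
    return sum(d.norm(dim=-1).sum(dim=-1) for d in displacements).mean()
```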
45. CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation [PDF] 返回目录
Yang Fu, Linjie Yang, Ding Liu, Thomas S. Huang, Humphrey Shi
Abstract: Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video. Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects and they suffer in the video scenario due to several distinct challenges such as motion blur and drastic appearance change. To eliminate ambiguities introduced by only using single-frame features, we propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both frame-level and object-level with temporal and spatial context information. The aggregation process is carefully designed with a new attention mechanism which significantly increases the discriminative power of the learned features. We further improve the tracking capability of our model through a siamese design by incorporating both feature similarities and spatial similarities. Experiments conducted on the YouTube-VIS dataset validate the effectiveness of proposed CompFeat. Our code will be available at this https URL.
摘要:视频实例分割是一项复杂的任务,我们需要在任意给定视频中检测、分割和跟踪每个对象。先前的方法仅利用单帧特征进行对象的检测、分割和跟踪,由于运动模糊和剧烈的外观变化等若干独特挑战,它们在视频场景中表现不佳。为了消除仅使用单帧特征引入的歧义,我们提出了一种新颖的综合特征聚合方法(CompFeat),以利用时间和空间上下文信息在帧级和对象级细化特征。聚合过程采用一种新的注意力机制精心设计,显著提高了所学特征的判别力。通过同时结合特征相似性和空间相似性,我们借助孪生设计进一步提高了模型的跟踪能力。在YouTube-VIS数据集上进行的实验验证了所提出的CompFeat的有效性。我们的代码将在此https URL上提供。
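A minimal sketch of frame-level temporal aggregation with similarity attention, the flavor of refinement described above; the cosine-softmax weighting is an assumption, not CompFeat's exact mechanism.

```python
import torch
import torch.nn.functional as F

def aggregate_frames(query: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
    """query: (C,) current-frame feature; neighbors: (T, C) nearby-frame
    features. Returns an attention-weighted sum of the neighbors."""
    sim = F.cosine_similarity(neighbors, query.unsqueeze(0), dim=1)  # (T,)
    return (sim.softmax(dim=0).unsqueeze(1) * neighbors).sum(dim=0)  # (C,)
```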
46. Art Style Classification with Self-Trained Ensemble of AutoEncoding Transformations [PDF] 返回目录
Akshay Joshi, Ankit Agrawal, Sushmita Nair
Abstract: The artistic style of a painting is a rich descriptor that reveals both visual and deep intrinsic knowledge about how an artist uniquely portrays and expresses their creative vision. Accurate categorization of paintings across different artistic movements and styles is critical for large-scale indexing of art databases. However, the automatic extraction and recognition of these highly dense artistic features has received little to no attention in the field of computer vision research. In this paper, we investigate the use of deep self-supervised learning methods to solve the problem of recognizing complex artistic styles with high intra-class and low inter-class variation. Further, we outperform existing approaches by almost 20% on a highly class imbalanced WikiArt dataset with 27 art categories. To achieve this, we train the EnAET semi-supervised learning model (Wang et al., 2019) with limited annotated data samples and supplement it with self-supervised representations learned from an ensemble of spatial and non-spatial transformations.
摘要:绘画的艺术风格是一种丰富的描述符,它揭示了关于艺术家如何独特地描绘和表达其创作视角的视觉和深层内在知识。跨不同艺术流派和风格对绘画进行准确分类对于艺术数据库的大规模索引至关重要。但是,这些高度密集的艺术特征的自动提取和识别在计算机视觉研究领域几乎没有受到关注。在本文中,我们研究了使用深度自监督学习方法来解决识别类内差异大、类间差异小的复杂艺术风格的问题。此外,在具有27个艺术类别、类别高度不平衡的WikiArt数据集上,我们的表现比现有方法高出近20%。为了实现这一目标,我们使用有限的带注释数据样本训练了EnAET半监督学习模型(Wang等,2019),并通过从空间和非空间变换集成中学习到的自监督表示对其进行了补充。
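The AET-style self-supervision that EnAET ensembles can be sketched as: sample a transformation, then train a network to regress its parameters from the (original, transformed) pair. A minimal rotation-only example; the angle range is an assumption.

```python
import torch
import torchvision.transforms.functional as TF

def aet_pair(img: torch.Tensor):
    """img: (C, H, W). Returns (original, rotated, normalized angle); an
    AET head is trained to predict the angle from the image pair."""
    angle = float(torch.empty(1).uniform_(-30.0, 30.0))
    return img, TF.rotate(img, angle), torch.tensor([angle / 30.0])
```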
47. Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models [PDF] 返回目录
Dong Wang, Yuewei Yang, Chenyang Tao, Fanjie Kong, Ricardo Henao, Lawrence Carin
Abstract: Deep neural networks have shown significant promise in comprehending complex visual signals, delivering performance on par or even superior to that of human experts. However, these models often lack a mechanism for interpreting their predictions, and in some cases, particularly when the sample size is small, existing deep learning solutions tend to capture spurious correlations that compromise model generalizability on unseen inputs. In this work, we propose a contrastive causal representation learning strategy that leverages proactive interventions to identify causally-relevant image features, called Proactive Pseudo-Intervention (PPI). This approach is complemented with a causal salience map visualization module, i.e., Weight Back Propagation (WBP), that identifies important pixels in the raw input image, which greatly facilitates the interpretability of predictions. To validate its utility, our model is benchmarked extensively on both standard natural images and challenging medical image datasets. We show this new contrastive causal representation learning model consistently improves model performance relative to competing solutions, particularly for out-of-domain predictions or when dealing with data integration from heterogeneous sources. Further, our causal saliency maps are more succinct and meaningful relative to their non-causal counterparts.
摘要:深度神经网络在理解复杂的视觉信号方面显示出巨大的希望,可提供与人类专家同等甚至更高的性能。但是,这些模型通常缺乏解释其预测的机制,在某些情况下,尤其是在样本量较小的情况下,现有的深度学习解决方案往往会捕获虚假的相关性,从而损害模型在未见输入上的泛化能力。在这项工作中,我们提出了一种对比性因果表征学习策略,该策略利用主动干预来识别因果相关的图像特征,称为主动伪干预(PPI)。此方法辅以因果显着图可视化模块,即权重反向传播(WBP),该模块可识别原始输入图像中的重要像素,从而大大提高了预测的可解释性。为了验证其实用性,我们的模型在标准自然图像和具有挑战性的医学图像数据集上进行了广泛的基准测试。我们证明,这种新的对比性因果表征学习模型相对于竞争方案能够持续提升模型性能,尤其是在域外预测或整合来自异构来源的数据时。此外,相对于非因果对应图,我们的因果显着图更为简洁和有意义。
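WBP propagates weights back through the network to score pixels; as a rough, generic stand-in for the idea of attributing a prediction to input pixels (an assumption, not the paper's algorithm), a plain input-gradient saliency map looks like this:

```python
import torch

def gradient_saliency(model, x, target_class):
    # Vanilla gradient saliency: attribute the class score to input pixels.
    # Generic stand-in only; WBP itself back-propagates weights, not gradients.
    x = x.clone().requires_grad_(True)   # x: (1, C, H, W)
    score = model(x)[0, target_class]
    score.backward()
    return x.grad.abs().amax(dim=1)      # (1, H, W): max over color channels
```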
48. Visual Aware Hierarchy Based Food Recognition [PDF] 返回目录
Runyu Mao, Jiangpeng He, Zeman Shao, Sri Kalyan Yarlagadda, Fengqing Zhu
Abstract: Food recognition is one of the most important components in image-based dietary assessment. However, due to the varying complexity of food images and the inter-class similarity of food categories, it is challenging for an image-based food recognition system to achieve high accuracy across a variety of publicly available datasets. In this work, we propose a new two-step food recognition system that includes food localization and hierarchical food classification using Convolutional Neural Networks (CNNs) as the backbone architecture. The food localization step is based on an implementation of the Faster R-CNN method to identify food regions. In the food classification step, visually similar food categories are clustered together automatically to generate a hierarchical structure that represents the semantic visual relations among food categories, and a multi-task CNN model is then proposed to perform the classification task based on this visual-aware hierarchical structure. Since the size and quality of the dataset are key to data-driven methods, we introduce a new food image dataset, the VIPER-FoodNet (VFN) dataset, which consists of 82 food categories with 15k images of the most commonly consumed foods in the United States. A semi-automatic crowdsourcing tool is used to provide the ground-truth information for this dataset, including food object bounding boxes and food object labels. Experimental results demonstrate that our system can significantly improve both classification and recognition performance on 4 publicly available datasets and the new VFN dataset.
摘要:食物识别是基于图像的饮食评估中最重要的组成部分之一。但是,由于食物图像的复杂程度不同以及食物类别之间的类间相似性,基于图像的食物识别系统要在各种公开数据集上实现高精度具有挑战性。在这项工作中,我们提出了一个新的两步式食物识别系统,该系统包括以卷积神经网络(CNN)为主干架构的食物定位和分层食物分类。食物定位步骤基于Faster R-CNN方法的实现来识别食物区域。在食物分类步骤中,视觉相似的食物类别可以自动聚类,生成代表食物类别之间语义视觉关系的层次结构,然后提出一种多任务CNN模型,基于该视觉感知的层次结构执行分类任务。由于数据集的大小和质量是数据驱动方法的关键,我们引入了一个新的食品图像数据集,即VIPER-FoodNet(VFN)数据集,该数据集基于美国最常食用的食物,包含82种食品类别和15,000张图像。我们使用半自动众包工具为该数据集提供真值标注信息,包括食物对象边界框和食物对象标签。实验结果表明,我们的系统可以显着提高在4个公开数据集和新的VFN数据集上的分类和识别性能。
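One simple way to build such a visual hierarchy is agglomerative clustering over class-level CNN features; the sketch below assumes per-class mean embeddings are available (the paper's exact clustering procedure may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# class_means: one mean CNN embedding per food class (placeholder values here)
class_means = np.random.randn(82, 512)

Z = linkage(class_means, method="ward")           # agglomerative hierarchy
coarse = fcluster(Z, t=10, criterion="maxclust")  # e.g. 10 coarse food groups
print(coarse[:5])                                 # coarse group id per class
```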
49. Self-Training for Class-Incremental Semantic Segmentation [PDF] 返回目录
Lu Yu, Xialei Liu, Joost van de Weijer
Abstract: We study incremental learning for semantic segmentation where when learning new classes we have no access to the labeled data of previous tasks. When incrementally learning new classes, deep neural networks suffer from catastrophic forgetting of previous learned knowledge. To address this problem, we propose to apply a self-training approach that leverages unlabeled data, which is used for rehearsal of previous knowledge. Additionally, conflict reduction is proposed to resolve the conflicts of pseudo labels generated from both the old and new models. We show that maximizing self-entropy can further improve results by smoothing the overconfident predictions. The experiments demonstrate state-of-the-art results: obtaining a relative gain of up to 114% on Pascal-VOC 2012 and 8.5% on the more challenging ADE20K compared to previous state-of-the-art methods.
摘要:我们研究用于语义分割的增量学习,其中在学习新类别时,我们无法访问先前任务的标记数据。当增量学习新类别时,深度神经网络会对先前所学知识产生灾难性遗忘。为了解决此问题,我们建议采用一种自我训练方法,利用未标记的数据对先前知识进行复习。另外,提出了冲突消减机制,以解决由新旧模型生成的伪标签之间的冲突。我们表明,最大化自我熵可以通过平滑过度自信的预测来进一步改善结果。实验证明了最先进的结果:与以前最先进的方法相比,在Pascal-VOC 2012上获得了高达114%的相对增益,在更具挑战性的ADE20K上获得了8.5%的相对增益。
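The self-entropy term has a direct expression from the softmax predictions; a minimal sketch (the sign and weighting conventions are assumptions):

```python
import torch
import torch.nn.functional as F

def self_entropy(logits):
    # Mean prediction entropy; works for (B, C) or (B, C, H, W) logits.
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1).mean()

# Maximizing entropy smooths overconfident predictions:
#   total_loss = task_loss - lam * self_entropy(logits)
```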
50. Select, Label, and Mix: Learning Discriminative Invariant Feature Representations for Partial Domain Adaptation [PDF] 返回目录
Aadarsh Sahoo, Rameswar Panda, Rogerio Feris, Kate Saenko, Abir Das
Abstract: Partial domain adaptation which assumes that the unknown target label space is a subset of the source label space has attracted much attention in computer vision. Despite recent progress, existing methods often suffer from three key problems: negative transfer, lack of discriminability and domain invariance in the latent space. To alleviate the above issues, we develop a novel 'Select, Label, and Mix' (SLM) framework that aims to learn discriminative invariant feature representations for partial domain adaptation. First, we present a simple yet efficient "select" module that automatically filters out the outlier source samples to avoid negative transfer while aligning distributions across both domains. Second, the "label" module iteratively trains the classifier using both the labeled source domain data and the generated pseudo-labels for the target domain to enhance the discriminability of the latent space. Finally, the "mix" module utilizes domain mixup regularization jointly with the other two modules to explore more intrinsic structures across domains leading to a domain-invariant latent space for partial domain adaptation. Extensive experiments on several benchmark datasets demonstrate the superiority of our proposed framework over state-of-the-art methods.
摘要:假设未知目标标签空间是源标签空间的子集的部分域自适应已引起计算机视觉的广泛关注。尽管取得了新的进展,但是现有方法通常会遇到三个关键问题:负迁移,缺乏可分辨性和潜在空间中的域不变性。为了缓解上述问题,我们开发了一种新颖的“选择,标记和混合”(SLM)框架,旨在学习用于部分域自适应的区分不变特征表示。首先,我们提供一个简单而有效的“选择”模块,该模块会自动滤除异常源样本,从而避免在调整两个域的分布时产生负向转移。第二,“标签”模块使用已标记的源域数据和针对目标域生成的伪标签来迭代地训练分类器,以增强潜在空间的可分辨性。最后,“混合”模块将域混合正则化与其他两个模块一起使用,以探索跨域的更多固有结构,从而为部分域自适应提供域不变的潜在空间。在几个基准数据集上的大量实验证明了我们提出的框架优于最新方法的优越性。
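The "mix" module's domain mixup can be sketched as interpolating source and target batches with a Beta-sampled coefficient (a generic mixup sketch under the assumption of equal batch sizes; the framework's exact formulation may differ):

```python
import torch

def domain_mixup(x_src, x_tgt, alpha=0.2):
    # Interpolate source and target images to encourage domain invariance.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x_tgt.size(0))
    return lam * x_src + (1.0 - lam) * x_tgt[idx], lam
```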
51. Rethinking FUN: Frequency-Domain Utilization Networks [PDF] 返回目录
Kfir Goldberg, Stav Shapiro, Elad Richardson, Shai Avidan
Abstract: The search for efficient neural network architectures has gained much focus in recent years, where modern architectures focus not only on accuracy but also on inference time and model size. Here, we present FUN, a family of novel Frequency-domain Utilization Networks. These networks utilize the inherent efficiency of the frequency-domain by working directly in that domain, represented with the Discrete Cosine Transform. Using modern techniques and building blocks such as compound-scaling and inverted-residual layers we generate a set of such networks allowing one to balance between size, latency and accuracy while outperforming competing RGB-based models. Extensive evaluations verifies that our networks present strong alternatives to previous approaches. Moreover, we show that working in frequency domain allows for dynamic compression of the input at inference time without any explicit change to the architecture.
摘要:近年来,寻找有效的神经网络体系结构已成为人们关注的焦点,现代体系结构不仅关注准确性,而且关注推理时间和模型大小。 在这里,我们介绍FUN,这是一种新颖的频域利用网络。 这些网络通过直接在离散余弦变换表示的频域中工作,从而利用了频域的固有效率。 使用现代技术和复合缩放和反向残差层等构建模块,我们生成了一组这样的网络,可以使它们在大小,延迟和准确性之间取得平衡,同时胜过基于RGB的竞争模型。 广泛的评估证明,我们的网络可以替代以前的方法。 而且,我们表明在频域中工作可以在推理时动态压缩输入,而无需对体系结构进行任何显式更改。
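Working in the frequency domain means feeding DCT coefficients instead of RGB pixels. A minimal JPEG-style block-wise DCT front end (the exact input pipeline used by FUN is an assumption here):

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct(gray, block=8):
    # gray: (H, W) array with H and W divisible by `block`.
    H, W = gray.shape
    out = np.empty((H, W), dtype=np.float64)
    for i in range(0, H, block):
        for j in range(0, W, block):
            out[i:i + block, j:j + block] = dctn(
                gray[i:i + block, j:j + block], norm="ortho")
    return out
```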
52. Improving Auto-Encoders' self-supervised image classification using pseudo-labelling via data augmentation and the perceptual loss [PDF] 返回目录
Aymene Mohammed Bouayed, Karim Atif, Rachid Deriche, Abdelhakim Saim
Abstract: In this paper, we introduce a novel method to pseudo-label unlabelled images and train an Auto-Encoder to classify them in a self-supervised manner that allows for a high accuracy and consistency across several datasets. The proposed method consists of first applying a randomly sampled set of data augmentation transformations to each training image. As a result, each initial image can be considered as a pseudo-label to its corresponding augmented ones. Then, an Auto-Encoder is used to learn the mapping between each set of the augmented images and its corresponding pseudo-label. Furthermore, the perceptual loss is employed to take into consideration the existing dependencies between the pixels in the same neighbourhood of an image. This combination encourages the encoder to output richer encodings that are highly informative of the input's class. Consequently, the Auto-Encoder's performance on unsupervised image classification is improved in terms of both stability and accuracy, becoming more uniform and more consistent across all tested datasets. Previous state-of-the-art accuracy on the MNIST, CIFAR-10 and SVHN datasets is improved by 0.3\%, 3.11\% and 9.21\% respectively.
摘要:在本文中,我们介绍了一种对未标记图像进行伪标记的新方法,并训练自动编码器以自我监督的方式对它们进行分类,从而在多个数据集之间实现高精度和一致性。所提出的方法包括首先将随机采样的一组数据增强变换应用于每个训练图像。结果,每个初始图像可以被认为是其对应的增强图像的伪标签。然后,使用自动编码器来学习每组扩增图像与其对应的伪标签之间的映射。此外,采用感知损失来考虑图像的相同邻域中的像素之间的现有依赖性。这种组合鼓励编码器输出更丰富的编码,这些编码对于输入的类别具有很高的信息价值。因此,在所有测试数据集的稳定性和准确性方面,自动编码器在无监督图像分类方面的性能得到了改善,变得更加统一和一致。 MNIST,CIFAR-10和SVHN数据集的先前最先进的准确性分别提高了0.3 \%,3.11 \%和9.21 \%。
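The perceptual loss compares fixed pretrained-network activations rather than raw pixels. A common VGG-16 variant (the layer cut-off and the use of VGG are assumptions; inputs are expected to be ImageNet-normalized):

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    def __init__(self, cut=9):  # features[:9] ends at relu2_2 (assumed choice)
        super().__init__()
        self.vgg = vgg16(pretrained=True).features[:cut].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, x, y):
        # Distance between feature activations of prediction x and target y.
        return F.mse_loss(self.vgg(x), self.vgg(y))
```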
53. Efficient Human Pose Estimation with Depthwise Separable Convolution and Person Centroid Guided Joint Grouping [PDF] 返回目录
Jie Ou, Hong Wu
Abstract: In this paper, we propose efficient and effective methods for 2D human pose estimation. A new ResBlock is proposed based on depthwise separable convolution and is utilized instead of the original one in Hourglass network. It can be further enhanced by replacing the vanilla depthwise convolution with a mixed depthwise convolution. Based on it, we propose a bottom-up multi-person pose estimation method. A rooted tree is used to represent human pose by introducing person centroid as the root which connects to all body joints directly or hierarchically. Two branches of sub-networks are used to predict the centroids, body joints and their offsets to their parent nodes. Joints are grouped by tracing along their offsets to the closest centroids. Experimental results on the MPII human dataset and the LSP dataset show that both our single-person and multi-person pose estimation methods can achieve competitive accuracies with low computational costs.
摘要:在本文中,我们提出了高效且有效的二维人体姿态估计方法。提出了一种基于深度可分离卷积的新型ResBlock,用以代替Hourglass网络中的原始模块。通过用混合深度卷积代替普通深度卷积,可以进一步增强它。在此基础上,我们提出了一种自下而上的多人姿态估计方法。通过引入人体质心作为根节点,直接或分层地连接到所有身体关节,用一棵有根树来表示人体姿态。使用两个子网络分支来预测质心、身体关节及其相对父节点的偏移量。沿偏移量追踪到最近的质心,从而对关节进行分组。在MPII人体数据集和LSP数据集上的实验结果表明,我们的单人和多人姿态估计方法都能以较低的计算成本达到有竞争力的精度。
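A depthwise separable convolution factors a standard convolution into a per-channel spatial filter followed by a 1x1 pointwise projection, cutting parameters and FLOPs. A minimal building block (the residual wiring of the paper's ResBlock is not reproduced here):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        # groups=in_ch makes the first conv purely spatial, one filter per channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```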
54. TediGAN: Text-Guided Diverse Image Generation and Manipulation [PDF] 返回目录
Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu
Abstract: In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module is to train an image encoder to map real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity is to learn the text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can provide the lowest effect guarantee, and produce diverse and high-quality images with an unprecedented resolution at 1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels with or without instance (text or real image) guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at this https URL.
摘要:在这项工作中,我们提出了TediGAN,一种基于文本描述进行多模态图像生成和操纵的新颖框架。该方法由三部分组成:StyleGAN反演模块,视觉-语言相似性学习和实例级优化。反演模块用于训练图像编码器,以将真实图像映射到训练有素的StyleGAN的潜在空间。视觉-语言相似性通过将图像和文本映射到公共嵌入空间中来学习文本-图像匹配。实例级优化用于在操纵中保留身份。我们的模型可以提供最低的效果保证,并以前所未有的1024分辨率生成多样化且高质量的图像。借助基于样式混合的控制机制,我们的TediGAN天然支持多模态输入的图像合成,例如草图或语义标签,并可带有或不带有实例(文本或真实图像)指导。为了促进文本指导的多模态合成,我们提出了Multi-Modal CelebA-HQ,这是一个由真实人脸图像及相应的语义分割图、草图和文本描述组成的大规模数据集。在该数据集上进行的大量实验证明了我们所提方法的优越性能。代码和数据可从此 https URL 获得。
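Visual-linguistic similarity learning amounts to pulling paired image and text embeddings together in a common space. A minimal cosine-similarity objective (a sketch; the projection networks producing the embeddings are assumed to exist):

```python
import torch.nn.functional as F

def vl_similarity_loss(img_emb, txt_emb):
    # img_emb, txt_emb: (B, D) projections into the common embedding space.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    return 1.0 - (img * txt).sum(dim=-1).mean()  # 0 when perfectly aligned
```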
55. Pedestrian Behavior Prediction via Multitask Learning and Categorical Interaction Modeling [PDF] 返回目录
Amir Rasouli, Mohsen Rohani, Jun Luo
Abstract: Pedestrian behavior prediction is one of the major challenges for intelligent driving systems. Pedestrians often exhibit complex behaviors influenced by various contextual elements. To address this problem, we propose a multitask learning framework that simultaneously predicts trajectories and actions of pedestrians by relying on multimodal data. Our method benefits from 1) a hybrid mechanism to encode different input modalities independently allowing them to develop their own representations, and jointly to produce a representation for all modalities using shared parameters; 2) a novel interaction modeling technique that relies on categorical semantic parsing of the scenes to capture interactions between target pedestrians and their surroundings; and 3) a dual prediction mechanism that uses both independent and shared decoding of multimodal representations. Using public pedestrian behavior benchmark datasets for driving, PIE and JAAD, we highlight the benefits of multitask learning for behavior prediction and show that our model achieves state-of-the-art performance and improves trajectory and action prediction by up to 22% and 6% respectively. We further investigate the contributions of the proposed processing and interaction modeling techniques via extensive ablation studies.
摘要:行人行为预测是智能驾驶系统的主要挑战之一。行人经常表现出受各种环境因素影响的复杂行为。为了解决这个问题,我们提出了一个多任务学习框架,该框架依赖多模态数据同时预测行人的轨迹和动作。我们的方法得益于:1)一种混合机制,既独立编码不同的输入模态,使其形成各自的表示,又使用共享参数为所有模态共同生成统一表示;2)一种新颖的交互建模技术,依赖场景的类别化语义解析来捕获目标行人及其周围环境之间的交互;3)一种对多模态表示同时使用独立解码和共享解码的双重预测机制。在公开的行人行为基准数据集PIE和JAAD上,我们强调了多任务学习在行为预测中的优势,并表明我们的模型达到了最先进的性能,将轨迹和动作预测分别提高了多达22%和6%。我们还通过广泛的消融研究进一步考察了所提出的处理和交互建模技术的贡献。
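The multitask objective combines a regression loss on future trajectories with a classification loss on actions; a minimal weighted-sum sketch (loss choices and weights are assumptions):

```python
import torch.nn.functional as F

def multitask_loss(traj_pred, traj_gt, act_logits, act_gt,
                   w_traj=1.0, w_act=1.0):
    traj = F.mse_loss(traj_pred, traj_gt)                          # trajectory head
    act = F.binary_cross_entropy_with_logits(act_logits, act_gt)   # action head
    return w_traj * traj + w_act * act
```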
56. Towards Better Object Detection in Scale Variation with Adaptive Feature Selection [PDF] 返回目录
Zehui Gong, Dong Li
Abstract: It is a common practice to exploit pyramidal feature representation to tackle the problem of scale variation in object instances. However, most of them still predict the objects in a certain range of scales based solely or mainly on a single-level representation, yielding inferior detection performance. To this end, we propose a novel adaptive feature selection module (AFSM), to automatically learn the way to fuse multi-level representations in the channel dimension, in a data-driven manner. It significantly improves the performance of the detectors that have a feature pyramid structure, while introducing nearly free inference overhead. Moreover, we propose a class-aware sampling mechanism to tackle the class imbalance problem, which automatically assigns the sampling weight to each of the images during training, according to the number of objects in each class. Experimental results demonstrate the effectiveness of the proposed method, with 83.04\% mAP at 15.96 FPS on the VOC dataset, and 39.48\% AP on the VisDrone-DET validation subset, respectively, outperforming other state-of-the-art detectors considerably. The code will be available soon at this https URL.
摘要:利用金字塔特征表示来解决对象实例中尺度变化的问题是一种常见的做法。然而,它们中的大多数仍然仅基于或仅基于单级表示来在一定范围的比例范围内预测对象,从而产生较差的检测性能。为此,我们提出了一种新颖的自适应特征选择模块(AFSM),以数据驱动的方式自动学习在通道维度上融合多级表示的方法。它显着提高了具有特征金字塔结构的检测器的性能,同时引入了几乎免费的推理开销。此外,我们提出了一种基于类的采样机制来解决类不平衡问题,该机制会根据每个类中的对象数量在训练过程中自动为每个图像分配采样权重。实验结果证明了该方法的有效性,VOC数据集上15.96 FPS时的mAP为83.04 \%,VisDrone-DET验证子集上的AP为39.48 \%,大大优于其他最新的检测器。该代码即将在此https URL上可用。
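Adaptive feature selection across pyramid levels can be realized as learned per-channel fusion weights normalized across levels; a minimal sketch (assumes the level features were already resized to a common resolution):

```python
import torch
import torch.nn as nn

class AFSM(nn.Module):
    # Learned per-level, per-channel fusion weights (a sketch of the idea).
    def __init__(self, num_levels, channels):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_levels, channels))

    def forward(self, feats):  # feats: list of (B, C, H, W) at one resolution
        w = torch.softmax(self.logits, dim=0)  # normalize across levels
        return sum(w[i].view(1, -1, 1, 1) * f for i, f in enumerate(feats))
```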
57. Cross-Layer Distillation with Semantic Calibration [PDF] 返回目录
Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, Chun Chen
Abstract: Recently proposed knowledge distillation approaches based on feature-map transfer validate that intermediate layers of a teacher model can serve as effective targets for training a student model to obtain better generalization ability. Existing studies mainly focus on particular representation forms for knowledge transfer between manually specified pairs of teacher-student intermediate layers. However, semantics of intermediate layers may vary in different networks and manual association of layers might lead to negative regularization caused by semantic mismatch between certain teacher-student layer pairs. To address this problem, we propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer with an attention mechanism. With a learned attention distribution, each student layer distills knowledge contained in multiple layers rather than a single fixed intermediate layer from the teacher model for appropriate cross-layer supervision in training. Consistent improvements over state-of-the-art approaches are observed in extensive experiments with various network architectures for teacher and student models, demonstrating the effectiveness and flexibility of the proposed attention based soft layer association mechanism for cross-layer distillation.
摘要:最近提出的基于特征图转移的知识提炼方法验证了教师模型的中间层可以作为培训学生模型以获得更好的泛化能力的有效目标。现有研究主要集中于在手动指定的一对师生中间层之间进行知识转移的特定表示形式。但是,中间层的语义在不同的网络中可能会有所不同,并且各层的手动关联可能会由于某些师生层对之间的语义不匹配而导致负面的正则化。为了解决此问题,我们提出了跨层知识蒸馏的语义校准(SemCKD),它可以通过关注机制自动为每个学生层分配教师模型的适当目标层。通过学习注意力的分布,每个学生层都从教师模型中提取多层而不是单个固定中间层中包含的知识,以在培训中进行适当的跨层监督。在针对教师和学生模型的各种网络体系结构的广泛实验中,观察到了对最新技术的一致改进,证明了提出的基于注意的跨层蒸馏软层关联机制的有效性和灵活性。
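The attention-based layer association can be sketched as a softmax over similarities between a student layer's pooled features and each candidate teacher layer's pooled features (projection to a shared dimension is assumed to have been done already):

```python
import torch
import torch.nn.functional as F

def layer_attention(student_feat, teacher_feats):
    # student_feat: (B, D); teacher_feats: list of (B, D) pooled teacher
    # activations already projected to the same dimension D (assumption).
    t = torch.stack(teacher_feats, dim=1)                # (B, L, D)
    sim = (t @ student_feat.unsqueeze(-1)).squeeze(-1)   # (B, L)
    return F.softmax(sim, dim=1)                         # weight per teacher layer
```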
58. Spatiotemporal tomography based on scattered multiangular signals and its application for resolving evolving clouds using moving platforms [PDF] 返回目录
Roi Ronen, Yoav Y. Schechner, Eshkol Eytan
Abstract: We derive computed tomography (CT) of a time-varying volumetric translucent object, using a small number of moving cameras. We particularly focus on passive scattering tomography, which is a non-linear problem. We demonstrate the approach on dynamic clouds, as clouds have a major effect on Earth's climate. State of the art scattering CT assumes a static object. Existing 4D CT methods rely on a linear image formation model and often on significant priors. In this paper, the angular and temporal sampling rates needed for a proper recovery are discussed. If these rates are used, the paper leads to a representation of the time-varying object, which simplifies 4D CT tomography. The task is achieved using gradient-based optimization. We demonstrate this in physics-based simulations and in an experiment that had yielded real-world data.
摘要:我们使用少量移动相机,获得时变半透明体积物体的计算机断层扫描(CT)。我们特别关注被动散射层析成像,这是一个非线性问题。由于云对地球气候有重大影响,我们在动态云上演示了该方法。最先进的散射CT假定物体是静态的。现有的4D CT方法依赖线性成像模型,并且通常依赖较强的先验。本文讨论了正确恢复所需的角度和时间采样率。若采用这些采样率,本文可得到时变物体的一种表示,从而简化4D CT层析成像。该任务通过基于梯度的优化来完成。我们在基于物理的仿真以及产生真实数据的实验中证明了这一点。
59. Skeleton-Based Typing Style Learning For Person Identification [PDF] 返回目录
Lior Gelberg, David Mendlovic, Dan Raviv
Abstract: We present a novel architecture for person identification based on typing style, built on an adaptive non-local spatio-temporal graph convolutional network. Since typing style dynamics convey meaningful information that can be useful for person identification, we extract the joint positions and then learn their movement dynamics. Our non-local approach increases our model's robustness to noisy input data, while analyzing joint locations instead of RGB data provides remarkable robustness to changing environmental conditions, e.g., lighting, noise, etc. We further present two new datasets for the typing-style-based person identification task, along with an extensive evaluation that demonstrates our model's superior discriminative and generalization abilities compared with state-of-the-art skeleton-based models.
摘要:我们提出了一种基于打字风格的新型人员识别体系结构,它构建于自适应非局部时空图卷积网络之上。由于打字风格的动态特征传达了可用于人员识别的有意义信息,我们提取关节位置,然后学习其运动动态。我们的非局部方法提高了模型对含噪输入数据的鲁棒性,而分析关节位置而非RGB数据则对变化的环境条件(例如光照、噪声等)具有显着的鲁棒性。我们还提出了两个用于基于打字风格的人员识别任务的新数据集,并进行了广泛的评估,结果表明,与最先进的基于骨架的模型相比,我们的模型具有更出色的判别和泛化能力。
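The core step of a graph convolution over skeleton joints is feature propagation along a joint adjacency matrix; a generic single-layer sketch (not the paper's adaptive non-local formulation):

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)  # (J, J) normalized joint adjacency
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):  # x: (B, J, in_dim) per-joint features
        return torch.relu(self.lin(self.A @ x))  # propagate along the skeleton
```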
60. MVHM: A Large-Scale Multi-View Hand Mesh Benchmark for Accurate 3D Hand Pose Estimation [PDF] 返回目录
Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, Xiaohui Xie
Abstract: Estimating 3D hand poses from a single RGB image is challenging because depth ambiguity makes the problem ill-posed. Training hand pose estimators with 3D hand mesh annotations and multi-view images often results in significant performance gains. However, existing multi-view datasets are relatively small, with hand joints annotated by off-the-shelf trackers or automated through model predictions, both of which may be inaccurate and can introduce biases. Collecting large-scale multi-view 3D hand pose images with accurate mesh and joint annotations is valuable but strenuous. In this paper, we design a spin match algorithm that enables a rigid mesh model to be matched with any target mesh ground truth. Based on the match algorithm, we propose an efficient pipeline to generate a large-scale multi-view hand mesh (MVHM) dataset with accurate 3D hand mesh and joint labels. We further present a multi-view hand pose estimation approach to verify that training a hand pose estimator with our generated dataset greatly enhances the performance. Experimental results show that our approach achieves a performance of 0.990 in $\text{AUC}_{\text{20-50}}$ on the MHP dataset, compared to the previous state-of-the-art of 0.939 on this dataset. Our dataset is publicly available at this https URL.
摘要:由于深度歧义使问题变得病态,从单个RGB图像估计3D手部姿势具有挑战性。使用3D手部网格注释和多视图图像训练手部姿势估计器通常可以显着提高性能。但是,现有的多视图数据集相对较小,手部关节由现成的跟踪器注释,或者通过模型预测自动标注,这两种方法都可能不准确并引入偏差。收集具有准确网格和关节注释的大规模多视图3D手部姿势图像很有价值,但也很费力。在本文中,我们设计了一种旋转匹配算法,该算法可以使刚性网格模型与任何目标网格真值匹配。基于该匹配算法,我们提出了一种高效的管道,用于生成具有精确3D手部网格和关节标签的大规模多视图手部网格(MVHM)数据集。我们进一步提出了一种多视图手部姿势估计方法,以验证使用我们生成的数据集训练手部姿势估计器可以大大提高性能。实验结果表明,我们的方法在MHP数据集上的$\text{AUC}_{\text{20-50}}$达到0.990,而此前该数据集上的最先进水平为0.939。我们的数据集已公开,可从此 https URL 获得。
61. Temporal-Aware Self-Supervised Learning for 3D Hand Pose and Mesh Estimation in Videos [PDF] 返回目录
Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, Xiaohui Xie
Abstract: Estimating 3D hand pose directly from RGB images is challenging but has gained steady progress recently by training deep models with annotated 3D poses. However, annotating 3D poses is difficult and as such only a few 3D hand pose datasets are available, all with limited sample sizes. In this study, we propose a new framework of training 3D pose estimation models from RGB images without using explicit 3D annotations, i.e., trained with only 2D information. Our framework is motivated by two observations: 1) Videos provide richer information for estimating 3D poses as opposed to static images; 2) Estimated 3D poses ought to be consistent whether the videos are viewed in the forward order or reverse order. We leverage these two observations to develop a self-supervised learning model called temporal-aware self-supervised network (TASSN). By enforcing temporal consistency constraints, TASSN learns 3D hand poses and meshes from videos with only 2D keypoint position annotations. Experiments show that our model achieves surprisingly good results, with 3D estimation accuracy on par with the state-of-the-art models trained with 3D annotations, highlighting the benefit of the temporal consistency in constraining 3D prediction models.
摘要:直接从RGB图像估计3D手部姿势具有挑战性,但最近通过使用带注释的3D姿势训练深度模型取得了稳步进展。但是,注释3D姿势很困难,因此只有少数3D手部姿势数据集可用,且样本规模都有限。在这项研究中,我们提出了一种从RGB图像训练3D姿势估计模型的新框架,无需使用显式3D注释,即仅使用2D信息进行训练。我们的框架基于两个观察:1)与静态图像相比,视频为估计3D姿势提供了更丰富的信息;2)无论视频按正序还是倒序观看,估计的3D姿势都应保持一致。我们利用这两个观察来开发一种称为时间感知自监督网络(TASSN)的自监督学习模型。通过施加时间一致性约束,TASSN仅凭2D关键点位置注释即可从视频中学习3D手部姿势和网格。实验表明,我们的模型取得了出人意料的好结果,其3D估计精度与使用3D注释训练的最先进模型相当,突显了时间一致性在约束3D预测模型方面的优势。
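The forward/reverse constraint has a direct expression: predictions on a time-reversed clip, flipped back, should match predictions on the original clip. A minimal sketch:

```python
import torch

def temporal_consistency_loss(poses_fwd, poses_rev):
    # poses_fwd: (T, J, 3) predictions on the clip in forward order;
    # poses_rev: (T, J, 3) predictions on the same clip played backwards.
    return torch.mean((poses_fwd - torch.flip(poses_rev, dims=[0])) ** 2)
```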
62. DGGAN: Depth-image Guided Generative Adversarial Networks for Disentangling RGB and Depth Images in 3D Hand Pose Estimation [PDF] 返回目录
Liangjian Chen, Shih-Yao Lin, Yusheng Xie, Yen-Yu Lin, Wei Fan, Xiaohui Xie
Abstract: Estimating 3D hand poses from RGB images is essential to a wide range of potential applications, but is challenging owing to substantial ambiguity in the inference of depth information from RGB images. State-of-the-art estimators address this problem by regularizing 3D hand pose estimation models during training to enforce the consistency between the predicted 3D poses and the ground-truth depth maps. However, these estimators rely on both RGB images and the paired depth maps during training. In this study, we propose a conditional generative adversarial network (GAN) model, called Depth-image Guided GAN (DGGAN), to generate realistic depth maps conditioned on the input RGB image, and use the synthesized depth maps to regularize the 3D hand pose estimation model, therefore eliminating the need for ground-truth depth maps. Experimental results on multiple benchmark datasets show that the synthesized depth maps produced by DGGAN are quite effective in regularizing the pose estimation model, yielding new state-of-the-art results in estimation accuracy, notably reducing the mean 3D end-point errors (EPE) by 4.7%, 16.5%, and 6.8% on the RHD, STB and MHP datasets, respectively.
摘要:从RGB图像估计3D手部姿势对广泛的潜在应用必不可少,但由于从RGB图像推断深度信息存在很大的歧义,因此具有挑战性。最先进的估计器通过在训练期间对3D手部姿势估计模型进行正则化,以增强预测的3D姿势与真值深度图之间的一致性来解决这个问题;但是,这些估计器在训练期间同时依赖RGB图像和配对的深度图。在这项研究中,我们提出了一种条件生成对抗网络(GAN)模型,称为深度图像引导GAN(DGGAN),以生成以输入RGB图像为条件的逼真深度图,并使用合成的深度图对3D手部姿势估计模型进行正则化,从而无需真值深度图。在多个基准数据集上的实验结果表明,DGGAN生成的合成深度图在正则化姿势估计模型方面非常有效,在估计精度上取得了新的最先进结果,在RHD、STB和MHP数据集上分别将平均3D端点误差(EPE)降低了4.7%、16.5%和6.8%。
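At a high level, the generator is trained with an adversarial term on the synthesized depth plus the pose-estimation objective it regularizes; a schematic loss composition (the weighting and exact conditioning are assumptions):

```python
import torch
import torch.nn.functional as F

def generator_objective(d_logits_fake, pose_loss, lam=0.1):
    # d_logits_fake: discriminator logits on (RGB, synthesized-depth) pairs.
    adv = F.binary_cross_entropy_with_logits(
        d_logits_fake, torch.ones_like(d_logits_fake))  # fool the discriminator
    return adv + lam * pose_loss  # synthesized depth also regularizes the pose model
```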
63. Online Adaptation for Consistent Mesh Reconstruction in the Wild [PDF] 返回目录
Xueting Li, Sifei Liu, Shalini De Mello, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
Abstract: This paper presents an algorithm to reconstruct temporally consistent 3D meshes of deformable object instances from videos in the wild. Without requiring annotations of 3D mesh, 2D keypoints, or camera pose for each video frame, we pose video-based reconstruction as a self-supervised online adaptation problem applied to any incoming test video. We first learn a category-specific 3D reconstruction model from a collection of single-view images of the same category that jointly predicts the shape, texture, and camera pose of an image. Then, at inference time, we adapt the model to a test video over time using self-supervised regularization terms that exploit temporal consistency of an object instance to enforce that all reconstructed meshes share a common texture map, a base shape, as well as parts. We demonstrate that our algorithm recovers temporally consistent and reliable 3D structures from videos of non-rigid objects including those of animals captured in the wild -- an extremely challenging task rarely addressed before.
摘要:本文提出了一种从野外视频中重建可变形对象实例的时间一致3D网格的算法。无需为每个视频帧提供3D网格、2D关键点或相机姿态的注释,我们将基于视频的重建表述为应用于任何输入测试视频的自监督在线适应问题。我们首先从同一类别的单视图图像集合中学习特定于类别的3D重建模型,该模型联合预测图像的形状、纹理和相机姿态。然后,在推断时,我们使用自监督正则化项使模型随时间适应测试视频,这些正则化项利用对象实例的时间一致性,强制所有重建的网格共享相同的纹理贴图、基础形状以及部件。我们证明了我们的算法能从非刚性物体(包括在野外拍摄的动物)的视频中恢复时间上一致且可靠的3D结构——这是一项极具挑战性、此前鲜有研究涉及的任务。
64. Depth Completion using Piecewise Planar Model [PDF] 返回目录
Yiran Zhong, Yuchao Dai, Hongdong Li
Abstract: A depth map can be represented by a set of learned bases and can be efficiently solved for in a closed-form solution. However, one issue with this method is that it may create artifacts when colour boundaries are inconsistent with depth boundaries. In fact, this is very common in natural images. To address this issue, we enforce a stricter model in depth recovery: a piecewise planar model. More specifically, we represent the desired depth map as a collection of 3D planes, and the reconstruction problem is formulated as the optimization of planar parameters. Such a problem can be formulated as a continuous CRF optimization problem and can be solved through a particle based method (MP-PBP) \cite{Yamaguchi14}. Extensive experimental evaluations on the KITTI visual odometry dataset show that our proposed method is highly resistant to false object boundaries and can generate useful and visually pleasant 3D point clouds.
摘要:深度图可以由一组学习的基础来表示,并且可以在封闭形式的解决方案中有效地求解。 但是,这种方法的一个问题是,当颜色边界与深度边界不一致时,它可能会产生伪像。 实际上,这在自然图像中非常普遍。 为了解决此问题,我们在深度恢复中实施了更严格的模型:分段平面模型。 更具体地说,我们将所需的深度图表示为3D平面的集合,并将重构问题表述为平面参数的优化。 这样的问题可以表述为连续的CRF优化问题,可以通过基于粒子的方法(MP-PBP)\ cite {Yamaguchi14}解决。 在KITTI视觉里程表数据集上进行的广泛实验评估表明,我们提出的方法对虚假物体边界具有很高的抵抗力,并且可以生成有用且视觉上令人愉悦的3D点云。
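Under a pinhole camera, a 3D plane induces an inverse depth that is affine in pixel coordinates, so each planar segment needs only three parameters; a minimal sketch of rendering depth from plane parameters:

```python
import numpy as np

def depth_from_plane(a, b, c, H, W):
    # For a 3D plane, inverse depth is affine in pixel coordinates:
    #   1/Z(u, v) = a*u + b*v + c   (one (a, b, c) triple per planar segment)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    inv_z = a * u + b * v + c
    return 1.0 / np.clip(inv_z, 1e-6, None)
```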
65. Computer Stereo Vision for Autonomous Driving [PDF] 返回目录
Rui Fan, Li Wang, Mohammud Junaid Bocus, Ioannis Pitas
Abstract: For a long time, autonomous cars were found only in science-fiction movies and series, but now they are becoming a reality. The opportunity to pick up such a vehicle at your garage forecourt is closer than you think. As an important component of autonomous systems, autonomous car perception has taken a big leap with recent advances in parallel computing architectures, such as OpenMP for multi-threading CPUs and OpenCL for GPUs. With the use of tiny but full-featured embedded supercomputers, computer stereo vision has been widely applied in autonomous cars for depth perception. The two key aspects of computer stereo vision are speed and accuracy. They are two desirable but conflicting properties -- algorithms with better disparity accuracy usually have higher computational complexity. Therefore, the main aim of developing a computer stereo vision algorithm for resource-limited hardware is to improve the trade-off between speed and accuracy. In this chapter, we first introduce the autonomous car system, from the hardware aspect to the software aspect. We then discuss four autonomous car perception functionalities: 1) visual feature detection, description and matching; 2) 3D information acquisition; 3) object detection/recognition; and 4) semantic image segmentation. Finally, we introduce the principles of computer stereo vision and parallel computing.
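For reference, the basic relation behind stereo depth perception discussed in the chapter: for a rectified stereo pair, depth is inversely proportional to disparity. A minimal sketch (variable names are mine, not the chapter's):

    def depth_from_disparity(disparity_px, focal_px, baseline_m, eps=1e-6):
        # depth (m) = focal length (px) * baseline (m) / disparity (px)
        return focal_px * baseline_m / (disparity_px + eps)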
66. Maximum Entropy Subspace Clustering Network [PDF] 返回目录
Zhihao Peng, Yuheng Jia, Hui Liu, Junhui Hou, Qingfu Zhang
Abstract: The deep subspace clustering network (DSC-Net) and its numerous variants have achieved impressive performance for subspace clustering, in which an auto-encoder non-linearly maps input data into a latent space, and a fully connected layer, named the self-expressiveness module, is introduced between the encoder and the decoder to learn an affinity matrix. However, the regularization adopted on the affinity matrix (e.g., sparse, Tikhonov, or low-rank) is still insufficient to drive the learning of an ideal affinity matrix, thus limiting performance. In addition, in DSC-Net, the self-expressiveness module and the auto-encoder module are tightly coupled, making the training of DSC-Net non-trivial. To this end, in this paper, we propose a novel deep learning-based clustering method named the Maximum Entropy Subspace Clustering Network (MESC-Net). Specifically, MESC-Net maximizes the learned affinity matrix's entropy to encourage it to exhibit an ideal affinity matrix structure. We theoretically prove that the affinity matrix driven by MESC-Net obeys the block-diagonal property, and experimentally show that its elements corresponding to the same subspace are uniformly and densely distributed, which gives better clustering performance. Moreover, we explicitly decouple the auto-encoder module and the self-expressiveness module. Extensive quantitative and qualitative results on commonly used benchmark datasets validate that MESC-Net significantly outperforms state-of-the-art methods.
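A minimal sketch of the entropy term at the heart of MESC-Net, assuming each row of the affinity matrix is normalized into a distribution (the paper's exact normalization and loss weighting may differ):

    import torch

    def affinity_entropy(C, eps=1e-12):
        # C: (N, N) nonnegative affinity matrix learned by the
        # self-expressiveness module; normalize each row to a distribution.
        P = C / (C.sum(dim=1, keepdim=True) + eps)
        return -(P * (P + eps).log()).sum(dim=1).mean()

    # Training would maximize this term, e.g.
    # loss = reconstruction_loss - lam * affinity_entropy(C)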
67. Food Classification with Convolutional Neural Networks and Multi-Class Linear Discernment Analysis [PDF] 返回目录
Joshua Ball
Abstract: Convolutional neural networks (CNNs) have been successful in representing the fully-connected inferencing ability perceived in the human brain: they take full advantage of the hierarchical patterns commonly seen in complex data and develop further patterns from simple features. Countless implementations of CNNs have shown how strong their ability is to learn these complex patterns, particularly in the realm of image classification. However, getting a high-performance CNN to a so-called "state of the art" level is computationally costly. Even with transfer learning, which utilizes the very deep layers of models such as MobileNetV2, CNNs still take a great amount of time and resources. Linear discriminant analysis (LDA), a generalization of Fisher's linear discriminant, can be implemented as a multi-class classification method that increases the separability of class features without needing a high-performance system for image classification, and we believe LDA also holds promise for performing well. In this paper, we discuss our process of developing a robust CNN for food classification as well as our effective implementation of multi-class LDA, and show (1) that the CNN is superior to LDA for image classification and (2) why LDA should not be ruled out for image classification, particularly in binary cases.
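Multi-class LDA itself is a few lines with scikit-learn; a self-contained sketch on random stand-in features (the paper's actual features and food classes are not reproduced here):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))      # stand-in for flattened images / CNN features
    y = rng.integers(0, 10, size=200)   # ten hypothetical food classes
    lda = LinearDiscriminantAnalysis().fit(X[:150], y[:150])
    print("held-out accuracy:", lda.score(X[150:], y[150:]))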
68. Any-Width Networks [PDF] 返回目录
Thanh Vu, Marc Eder, True Price, Jan-Michael Frahm
Abstract: Despite remarkable improvements in speed and accuracy, convolutional neural networks (CNNs) still typically operate as monolithic entities at inference time. This poses a challenge for resource-constrained practical applications, where both computational budgets and performance needs can vary with the situation. To address these constraints, we propose the Any-Width Network (AWN), an adjustable-width CNN architecture and associated training routine that allow for fine-grained control over speed and accuracy during inference. Our key innovation is the use of lower-triangular weight matrices which explicitly address width-varying batch statistics while being naturally suited for multi-width operations. We also show that this design facilitates an efficient training routine based on random width sampling. We empirically demonstrate that our proposed AWNs compare favorably to existing methods while providing maximally granular control during inference.
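A minimal sketch of the lower-triangular weight idea, shown for a fully connected layer for clarity (the paper applies it to convolutions and also handles width-varying batch statistics, omitted here). Masking the weights with torch.tril makes output unit j depend only on input units 0..j, so truncating to the first k units yields a consistent sub-network at any width k:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TriLinear(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
            self.register_buffer("mask", torch.tril(torch.ones(dim, dim)))

        def forward(self, x, k):
            # x: (..., k_in) activations already truncated to the current width;
            # keep the first k output units and the matching input columns.
            w = (self.weight * self.mask)[:k, : x.shape[-1]]
            return F.linear(x, w)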
69. Automatic sampling and training method for wood-leaf classification based on tree terrestrial point cloud [PDF] 返回目录
Zichu Liu, Qing Zhang, Pei Wang, Yaxin Li, Jingqian Sun
Abstract: Terrestrial laser scanning technology provides an efficient and accurate solution for acquiring three-dimensional information about plants. The leaf-wood classification of plant point cloud data is a fundamental step for much forestry and biological research. We propose an automatic sampling and training method for classification based on tree point cloud data. A plane fitting method is used to select leaf sample points and wood sample points automatically; two local features are then calculated and used for training and classification with the support vector machine (SVM) algorithm. The point cloud data of ten trees were tested using the proposed method and a manual selection method. The average correct classification rate and kappa coefficient are 0.9305 and 0.7904, respectively. The results show that the proposed method achieves better efficiency and accuracy compared with the manual selection method.
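The two local features are not specified in the abstract; a plausible sketch uses eigenvalue-based planarity and linearity of each point's neighborhood, followed by an SVM (the feature choice here is an assumption, not the paper's):

    import numpy as np
    from sklearn.svm import SVC

    def local_features(neighbors):
        # neighbors: (k, 3) points around a query point; use the eigenvalues
        # of the local covariance as planarity / linearity descriptors.
        cov = np.cov((neighbors - neighbors.mean(axis=0)).T)
        w = np.sort(np.linalg.eigvalsh(cov))[::-1]
        planarity = (w[1] - w[2]) / (w[0] + 1e-12)
        linearity = (w[0] - w[1]) / (w[0] + 1e-12)
        return [planarity, linearity]

    # feats: (n, 2) features of the automatically selected samples,
    # labels: 0 = leaf, 1 = wood
    # clf = SVC(kernel="rbf").fit(feats, labels)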
70. YieldNet: A Convolutional Neural Network for Simultaneous Corn and Soybean Yield Prediction Based on Remote Sensing Data [PDF] 返回目录
Saeed Khaki, Hieu Pham, Lizhi Wang
Abstract: Large-scale crop yield estimation is, in part, made possible by the availability of remote sensing data, which allows for continuous monitoring of crops throughout their growth. Having this information allows stakeholders to make real-time decisions to maximize yield potential. Although various models exist that predict yield from remote sensing data, there is currently no approach that estimates yield for multiple crops simultaneously, even though a model that predicts the yields of multiple crops concurrently and considers the interactions between them should lead to more accurate predictions. We propose a new model called YieldNet, which utilizes a novel deep learning framework that transfers learning between corn and soybean yield predictions by sharing the weights of the backbone feature extractor. Additionally, to handle the multi-target response variable, we propose a new loss function. Numerical results demonstrate that our proposed method accurately predicts yield from one to four months before harvest and is competitive with other state-of-the-art approaches.
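A minimal sketch of the shared-backbone, two-head design the abstract describes (the module names and loss weighting are hypothetical):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoHeadYieldNet(nn.Module):
        # Shared backbone so feature-extractor weights transfer between crops.
        def __init__(self, backbone, feat_dim):
            super().__init__()
            self.backbone = backbone
            self.corn_head = nn.Linear(feat_dim, 1)
            self.soy_head = nn.Linear(feat_dim, 1)

        def forward(self, x):
            f = self.backbone(x)
            return self.corn_head(f), self.soy_head(f)

    # loss = F.mse_loss(corn_pred, corn_y) + lam * F.mse_loss(soy_pred, soy_y)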
71. It's All Around You: Range-Guided Cylindrical Network for 3D Object Detection [PDF] 返回目录
Meytal Rapoport-Lavie, Dan Raviv
Abstract: Modern perception systems in the field of autonomous driving rely on 3D data analysis. LiDAR sensors are frequently used to acquire such data due to their increased resilience to different lighting conditions. Although rotating LiDAR scanners produce ring-shaped patterns in space, most networks analyze their data using an orthogonal voxel sampling strategy. This work presents a novel approach for analyzing 3D data produced by 360-degree depth scanners, utilizing a more suitable coordinate system that is aligned with the scanning pattern. Furthermore, we introduce a novel notion of range-guided convolutions, adapting the receptive field to the distance from the ego vehicle and the object's scale. Our network demonstrates powerful results on the nuScenes challenge, comparable to current state-of-the-art architectures. The backbone architecture introduced in this work can be easily integrated into other pipelines as well.
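A scan-aligned cylindrical coordinate system as described can be obtained from Cartesian LiDAR points as follows (a standard conversion, not code from the paper):

    import numpy as np

    def to_cylindrical(points):
        # points: (N, 3) LiDAR returns in Cartesian (x, y, z) around the ego vehicle
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        rho = np.hypot(x, y)        # range in the ground plane
        phi = np.arctan2(y, x)      # azimuth, aligned with the ring-shaped scan
        return np.stack([rho, phi, z], axis=1)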
72. LandCoverNet: A global benchmark land cover classification training dataset [PDF] 返回目录
Hamed Alemohammad, Kevin Booth
Abstract: Regularly updated and accurate land cover maps are essential for monitoring 14 of the 17 Sustainable Development Goals. Multispectral satellite imagery provides high-quality and valuable information at global scale that can be used to develop land cover classification models. However, such a global application requires a geographically diverse training dataset. Here, we present LandCoverNet, a global training dataset for land cover classification based on Sentinel-2 observations at 10m spatial resolution. Land cover class labels are defined based on annual time-series of Sentinel-2 and verified by consensus among three human annotators.
73. Spectral Distribution aware Image Generation [PDF] 返回目录
Steffen Jung, Margret Keuper
Abstract: Recent advances in deep generative models for photo-realistic images have led to high-quality visual results. Such models learn to generate data from a given training distribution such that the generated images cannot be easily distinguished from real images by the human eye. Yet, recent work on the detection of such fake images has pointed out that they are easily distinguishable by artifacts in their frequency spectra. In this paper, we propose to generate images according to the frequency distribution of the real data by employing a spectral discriminator. The proposed discriminator is lightweight, modular, and works stably with different commonly used GAN losses. We show that the resulting models can better generate images with realistic frequency spectra, which are thus harder to detect by this cue.
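A minimal sketch of the input a spectral discriminator could consume: the shifted log-magnitude frequency spectrum of a batch of images (the paper's exact discriminator architecture is not reproduced):

    import torch

    def log_magnitude_spectrum(img):
        # img: (B, C, H, W) batch of images; returns per-pixel log-magnitude
        # of the 2D FFT, with the zero frequency moved to the center.
        spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
        return torch.log1p(spec.abs())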
74. Generating Synthetic Multispectral Satellite Imagery from Sentinel-2 [PDF] 返回目录
Tharun Mohandoss, Aditya Kulkarni, Daniel Northrup, Ernest Mwebaze, Hamed Alemohammad
Abstract: Multi-spectral satellite imagery provides valuable data at global scale for many environmental and socio-economic applications. Building supervised machine learning models based on such imagery, however, may require ground reference labels which are not available at global scale. Here, we propose a generative model to produce multi-resolution multi-spectral imagery based on Sentinel-2 data. The resulting synthetic images are indistinguishable from real ones by humans. This technique paves the way for future work to generate labeled synthetic imagery that can be used for data augmentation in data-scarce regions and applications.
75. Semantic Segmentation of Medium-Resolution Satellite Imagery using Conditional Generative Adversarial Networks [PDF] 返回目录
Aditya Kulkarni, Tharun Mohandoss, Daniel Northrup, Ernest Mwebaze, Hamed Alemohammad
Abstract: Semantic segmentation of satellite imagery is a common approach to identify patterns and detect changes around the planet. Most state-of-the-art semantic segmentation models are trained in a fully supervised way using Convolutional Neural Networks (CNNs). The generalization of CNNs is poor for satellite imagery because the data can be very diverse in terms of landscape types and image resolutions, and labels are scarce for different geographies and seasons. Hence, CNN performance does not translate well to images from unseen regions or seasons. Inspired by the Conditional Generative Adversarial Network (CGAN) based approach to image-to-image translation for high-resolution satellite imagery, we propose a CGAN framework for land cover classification using medium-resolution Sentinel-2 imagery. We find that the CGAN model outperforms a CNN model of similar complexity by a significant margin on an unseen, imbalanced test dataset.
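A minimal sketch of a pix2pix-style generator objective for this setting, combining an adversarial term with per-pixel cross-entropy against land-cover labels (the weight lam and the exact loss form are assumptions, not the paper's objective):

    import torch
    import torch.nn.functional as F

    def generator_loss(d_fake_logits, seg_logits, labels, lam=10.0):
        # Adversarial term: fool the discriminator on (image, predicted map) pairs.
        adv = F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))
        # Supervised term: per-pixel cross-entropy against land-cover labels.
        seg = F.cross_entropy(seg_logits, labels)
        return adv + lam * seg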
76. MyFood: A Food Segmentation and Classification System to Aid Nutritional Monitoring [PDF] 返回目录
Charles N. C. Freitas, Filipe R. Cordeiro, Valmir Macario
Abstract: The absence of food monitoring has contributed significantly to the increase in the population's weight. Due to lack of time and busy routines, most people do not control and record what is consumed in their diet. Some solutions have been proposed in computer vision to recognize food images, but few are specialized in nutritional monitoring. This work presents the development of an intelligent system that classifies and segments food presented in images to help the automatic monitoring of user diet and nutritional intake. It includes a comparative study of state-of-the-art methods for image classification and segmentation applied to food recognition. In our methodology, we compare the FCN, ENet, SegNet, DeepLabV3+, and Mask RCNN algorithms. We build a dataset composed of the most consumed Brazilian food types, containing nine classes and a total of 1250 images. The models were evaluated using the following metrics: Intersection over Union, Sensitivity, Specificity, Balanced Precision, and Positive Predefined Value. We also propose a system integrated into a mobile application that automatically recognizes and estimates the nutrients in a meal, assisting people with better nutritional monitoring. The proposed solution showed better results than existing ones in the market. The dataset is publicly available at the following link: this http URL
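The reported segmentation metrics follow directly from the confusion matrix; a minimal per-class sketch (these are the standard definitions; "Balanced Precision" and "Positive Predefined Value" are omitted because the abstract does not define them):

    import numpy as np

    def per_class_metrics(pred, gt):
        # pred, gt: boolean masks for one food class over the test set
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        tn = np.logical_and(~pred, ~gt).sum()
        return {
            "iou": tp / (tp + fp + fn + 1e-12),
            "sensitivity": tp / (tp + fn + 1e-12),
            "specificity": tn / (tn + fp + 1e-12),
        }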
77. Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction [PDF] 返回目录
Guy Gafni, Justus Thies, Michael Zollhöfer, Matthias Nießner
Abstract: We present dynamic neural radiance fields for modeling the appearance and dynamics of a human face. Digitally modeling and reconstructing a talking human is a key building block for a variety of applications. In particular, telepresence applications in AR or VR require a faithful reproduction of the appearance, including novel viewpoints and head poses. In contrast to state-of-the-art approaches that model the geometry and material properties explicitly, or are purely image-based, we introduce an implicit representation of the head based on scene representation networks. To handle the dynamics of the face, we combine our scene representation network with a low-dimensional morphable model which provides explicit control over pose and expressions. We use volumetric rendering to generate images from this hybrid representation and demonstrate that such a dynamic neural scene representation can be learned from monocular input data only, without the need for a specialized capture setup. In our experiments, we show that this learned volumetric representation allows for photo-realistic image generation that surpasses the quality of state-of-the-art video-based reenactment methods.
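For context, volumetric rendering composites colors along a ray from predicted densities; a minimal single-ray sketch of the standard quadrature (not the paper's code):

    import torch

    def render_ray(sigma, rgb, deltas):
        # sigma: (S,) densities, rgb: (S, 3) colors, deltas: (S,) step sizes
        alpha = 1.0 - torch.exp(-sigma * deltas)
        trans = torch.cumprod(
            torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
        weights = trans * alpha                      # contribution per sample
        return (weights[:, None] * rgb).sum(dim=0)   # composited pixel color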
78. Self-Supervised Visual Representation Learning from Hierarchical Grouping [PDF] 返回目录
Xiao Zhang, Michael Maire
Abstract: We create a framework for bootstrapping visual representation learning from a primitive visual grouping capability. We operationalize grouping via a contour detector that partitions an image into regions, followed by merging of those regions into a tree hierarchy. A small supervised dataset suffices for training this grouping primitive. Across a large unlabeled dataset, we apply this learned primitive to automatically predict hierarchical region structure. These predictions serve as guidance for self-supervised contrastive feature learning: we task a deep network with producing per-pixel embeddings whose pairwise distances respect the region hierarchy. Experiments demonstrate that our approach can serve as state-of-the-art generic pre-training, benefiting downstream tasks. We additionally explore applications to semantic region search and video-based object instance tracking.
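A minimal sketch of a hierarchy-guided contrastive term: pixel pairs from the same predicted region should be closer in embedding space than pairs from different regions (the triplet form and margin are assumptions; the paper's loss respects the full region hierarchy):

    import torch
    import torch.nn.functional as F

    def region_contrastive_loss(emb, anchor, pos, neg, margin=0.5):
        # emb: (N, D) per-pixel embeddings; (anchor, pos) share a region in the
        # predicted hierarchy, (anchor, neg) do not; all are index tensors.
        d_pos = F.pairwise_distance(emb[anchor], emb[pos])
        d_neg = F.pairwise_distance(emb[anchor], emb[neg])
        return F.relu(d_pos - d_neg + margin).mean()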
79. Understanding Bird's-Eye View Semantic HD-Maps Using an Onboard Monocular Camera [PDF] 返回目录
Yigit Baran Can, Alexander Liniger, Ozan Unal, Danda Paudel, Luc Van Gool
Abstract: Autonomous navigation requires scene understanding of the action-space to move or anticipate events. For planner agents moving on the ground plane, such as autonomous vehicles, this translates to scene understanding in the bird's-eye view. However, the onboard cameras of autonomous cars are customarily mounted horizontally for a better view of the surroundings. In this work, we study scene understanding in the form of online estimation of semantic bird's-eye-view HD-maps using the video input from a single onboard camera. We study three key aspects of this task: image-level understanding, BEV-level understanding, and the aggregation of temporal information. Based on these three pillars, we propose a novel architecture that combines all three aspects. In our extensive experiments, we demonstrate that the considered aspects are complementary to each other for HD-map understanding. Furthermore, the proposed architecture significantly surpasses the current state-of-the-art.
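For intuition, mapping an image pixel to the bird's-eye view under a flat-ground assumption is a ray-plane intersection; a minimal sketch with known intrinsics K and world-to-camera extrinsics (R, t) (the paper learns this mapping rather than assuming a flat ground):

    import numpy as np

    def pixel_to_ground(u, v, K, R, t, ground_z=0.0):
        # Back-project pixel (u, v) onto the z = ground_z plane in world frame.
        ray = R.T @ np.linalg.inv(K) @ np.array([u, v, 1.0])
        cam_center = -R.T @ t
        s = (ground_z - cam_center[2]) / ray[2]
        return cam_center + s * ray   # BEV location seen under this pixel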
80. ParaNet: Deep Regular Representation for 3D Point Clouds [PDF] 返回目录
Qijian Zhang, Junhui Hou, Yue Qian, Juyong Zhang, Ying He
Abstract: Although convolutional neural networks have achieved remarkable success in analyzing 2D images/videos, it is still non-trivial to apply the well-developed 2D techniques in regular domains to the irregular 3D point cloud data. To bridge this gap, we propose ParaNet, a novel end-to-end deep learning framework, for representing 3D point clouds in a completely regular and nearly lossless manner. To be specific, ParaNet converts an irregular 3D point cloud into a regular 2D color image, named point geometry image (PGI), where each pixel encodes the spatial coordinates of a point. In contrast to conventional regular representation modalities based on multi-view projection and voxelization, the proposed representation is differentiable and reversible. Technically, ParaNet is composed of a surface embedding module, which parameterizes 3D surface points onto a unit square, and a grid resampling module, which resamples the embedded 2D manifold over regular dense grids. Note that ParaNet is unsupervised, i.e., the training simply relies on reference-free geometry constraints. The PGIs can be seamlessly coupled with a task network established upon standard and mature techniques for 2D images/videos to realize a specific task for 3D point clouds. We evaluate ParaNet over shape classification and point cloud upsampling, in which our solutions perform favorably against the existing state-of-the-art methods. We believe such a paradigm will open up many possibilities to advance the progress of deep learning-based point cloud processing and understanding.
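A naive stand-in for the PGI format (ParaNet learns the surface parameterization; this sketch only shows how per-pixel xyz coordinates turn a point cloud into an image a 2D CNN can consume):

    import numpy as np

    def to_point_geometry_image(points, H=32, W=32):
        # points: (N, 3) with N >= H*W; each PGI pixel stores normalized (x, y, z).
        pts = points[: H * W]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        return ((pts - lo) / (hi - lo + 1e-12)).reshape(H, W, 3)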
81. Depth estimation from 4D light field videos [PDF] 返回目录
Takahiro Kinoshita, Satoshi Ono
Abstract: Depth (disparity) estimation from 4D Light Field (LF) images has been a research topic for the last couple of years. Most studies have focused on depth estimation from static 4D LF images while not considering temporal information, i.e., LF videos. This paper proposes an end-to-end neural network architecture for depth estimation from 4D LF videos. This study also constructs a medium-scale synthetic 4D LF video dataset that can be used for training deep learning-based methods. Experimental results using synthetic and real-world 4D LF videos show that temporal information contributes to the improvement of depth estimation accuracy in noisy regions. Dataset and code are available at: this https URL
82. CIA-SSD: Confident IoU-Aware Single-Stage Object Detector From Point Cloud [PDF] 返回目录
Wu Zheng, Weiliang Tang, Sijin Chen, Li Jiang, Chi-Wing Fu
Abstract: Existing single-stage detectors for locating objects in point clouds often treat object localization and category classification as separate tasks, so the localization accuracy and classification confidence may not align well. To address this issue, we present a new single-stage detector named the Confident IoU-Aware Single-Stage object Detector (CIA-SSD). First, we design a lightweight Spatial-Semantic Feature Aggregation module to adaptively fuse high-level abstract semantic features and low-level spatial features for accurate predictions of bounding boxes and classification confidence. Also, the predicted confidence is further rectified with our designed IoU-aware confidence rectification module to make it more consistent with the localization accuracy. Based on the rectified confidence, we further formulate a Distance-variant IoU-weighted NMS to obtain smoother regressions and avoid redundant predictions. We evaluate CIA-SSD on 3D car detection with the KITTI test set and show that it attains top performance in terms of the official ranking metric (moderate AP 80.28%) with an inference speed above 32 FPS, outperforming all prior single-stage detectors. The code is available at this https URL.
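A minimal sketch of IoU-aware confidence rectification: blending the classification score with the predicted localization IoU so that box ranking reflects localization quality (the exponent beta and exact blending are assumptions; the distance-variant weighted NMS is omitted):

    import numpy as np

    def rectify_confidence(cls_conf, iou_pred, beta=0.5):
        # cls_conf, iou_pred: (N,) per-box classification score and predicted IoU
        return cls_conf * np.clip(iou_pred, 0.0, 1.0) ** beta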
83. ProMask: Probability Mask for Skeleton Detection [PDF] 返回目录
Xiuxiu Bai, Lele Ye, Zhe Liu
Abstract: Detecting object skeletons in natural images is challenging due to varied object scales, complex backgrounds, and various noises. The skeleton is a highly compressed shape representation, which brings some essential advantages but makes detection difficult. A skeleton line occupies a very small proportion of an image and is overly sensitive to spatial position. Motivated by these issues, we propose ProMask, a novel skeleton detection model. ProMask includes a probability mask and a vector router. The skeleton probability mask representation explicitly encodes skeletons with segmentation signals, which provides more supervised information to learn from and pays more attention to ground-truth skeleton pixels. Moreover, the vector router module possesses two sets of orthogonal basis vectors in a two-dimensional space, which can dynamically adjust the predicted skeleton position. We evaluate our method on the well-known skeleton datasets, achieving better performance than state-of-the-art approaches. In particular, ProMask significantly outperforms the competitive DeepFlux by 6.2% on the challenging SYM-PASCAL dataset. We consider that our proposed skeleton probability mask could serve as a solid baseline for future skeleton detection, since it is very effective and requires about 10 lines of code.
84. Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition [PDF] 返回目录
Jin Ye, Junjun He, Xiaojiang Peng, Wenhao Wu, Yu Qiao
Abstract: Recent studies often exploit Graph Convolutional Network (GCN) to model label dependencies to improve recognition accuracy for multi-label image recognition. However, constructing a graph by counting the label co-occurrence possibilities of the training data may degrade model generalizability, especially when there exist occasional co-occurrence objects in test images. Our goal is to eliminate such bias and enhance the robustness of the learnt features. To this end, we propose an Attention-Driven Dynamic Graph Convolutional Network (ADD-GCN) to dynamically generate a specific graph for each image. ADD-GCN adopts a Dynamic Graph Convolutional Network (D-GCN) to model the relation of content-aware category representations that are generated by a Semantic Attention Module (SAM). Extensive experiments on public multi-label benchmarks demonstrate the effectiveness of our method, which achieves mAPs of 85.2%, 96.0%, and 95.5% on MS-COCO, VOC2007, and VOC2012, respectively, and outperforms current state-of-the-art methods with a clear margin. All codes can be found at this https URL.
摘要:最近的研究经常利用图卷积网络(GCN)对标签依赖关系进行建模,以提高多标签图像识别的精度。但是,通过统计训练数据中标签共现的可能性来构建图,可能会降低模型的泛化能力,尤其是当测试图像中存在偶然共现的物体时。我们的目标是消除这种偏差并增强所学特征的鲁棒性。为此,我们提出了一种注意力驱动的动态图卷积网络(ADD-GCN),可以为每幅图像动态生成特定的图。ADD-GCN采用动态图卷积网络(D-GCN)对由语义注意模块(SAM)生成的内容感知类别表示之间的关系进行建模。在公共多标签基准上进行的大量实验证明了我们方法的有效性:该方法在MS-COCO、VOC2007和VOC2012上分别达到了85.2%、96.0%和95.5%的mAP,并以明显的优势超越了当前最先进的方法。所有代码均可在此https URL中找到。
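A minimal sketch of a dynamic graph-convolution layer in PyTorch, illustrating the core idea of predicting an image-specific adjacency from the content-aware category representations themselves (layer layout and hyper-parameters are illustrative, not the authors' code):

import torch
import torch.nn as nn

class DynamicGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim, bias=False)   # query projection
        self.phi = nn.Linear(dim, dim, bias=False)     # key projection
        self.update = nn.Linear(dim, dim)              # node update

    def forward(self, v):                              # v: (B, C, D) category features
        # Image-specific adjacency from pairwise feature affinity.
        logits = self.theta(v) @ self.phi(v).transpose(1, 2) / v.size(-1) ** 0.5
        a = torch.softmax(logits, dim=-1)              # (B, C, C), one graph per image
        return torch.relu(self.update(a @ v)) + v      # residual graph update

Because the adjacency is recomputed per image, no fixed co-occurrence statistics from the training set are baked into the graph.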
85. Spatially-Adaptive Pixelwise Networks for Fast Image Translation [PDF] 返回目录
Tamar Rott Shaham, Michael Gharbi, Richard Zhang, Eli Shechtman, Tomer Michaeli
Abstract: We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation. We design the generator to be an extremely lightweight function of the full-resolution image. In fact, we use pixel-wise networks; that is, each pixel is processed independently of others, through a composition of simple affine transformations and nonlinearities. We take three important steps to equip such a seemingly simple function with adequate expressivity. First, the parameters of the pixel-wise networks are spatially varying so they can represent a broader function class than simple 1x1 convolutions. Second, these parameters are predicted by a fast convolutional network that processes an aggressively low-resolution representation of the input; Third, we augment the input image with a sinusoidal encoding of spatial coordinates, which provides an effective inductive bias for generating realistic novel high-frequency image content. As a result, our model is up to 18x faster than state-of-the-art baselines. We achieve this speedup while generating comparable visual quality across different image resolutions and translation domains.
摘要:我们介绍了一种新的生成器体系结构,旨在实现快速、高效的高分辨率图像到图像转换。我们将生成器设计为作用于全分辨率图像的极其轻量的函数。实际上,我们使用逐像素网络:每个像素通过简单仿射变换与非线性的组合,独立于其他像素进行处理。我们采取了三个重要步骤,使这种看似简单的函数具有足够的表达能力。首先,逐像素网络的参数在空间上变化,因此与简单的1x1卷积相比,它们可以表示更广泛的函数类。其次,这些参数由一个快速卷积网络预测,该网络处理输入的极低分辨率表示。第三,我们用空间坐标的正弦编码增强输入图像,这为生成逼真、新颖的高频图像内容提供了有效的归纳偏置。因此,我们的模型比最先进的基线快达18倍,同时在不同的图像分辨率和转换域上产生可比的视觉质量。
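Two of the building blocks described above are easy to sketch: the sinusoidal encoding of pixel coordinates and the per-pixel affine transform whose parameters vary spatially (a hedged illustration in PyTorch; in the paper these parameters come from a fast convnet run on a low-resolution input, which is omitted here):

import math
import torch

def sinusoidal_encoding(h, w, num_freqs=6):
    # Normalised pixel coordinates in [-1, 1], encoded at several frequencies.
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=0)  # (2, H, W)
    feats = []
    for k in range(num_freqs):
        feats += [torch.sin(2 ** k * math.pi * grid),
                  torch.cos(2 ** k * math.pi * grid)]
    return torch.cat(feats, dim=0)  # (4 * num_freqs, H, W)

def pixelwise_affine(x, weight, bias):
    # x: (B, Cin, H, W); weight: (B, Cout, Cin, H, W); bias: (B, Cout, H, W).
    # Every pixel gets its own affine map, i.e. a spatially-varying 1x1 conv.
    return torch.einsum("boihw,bihw->bohw", weight, x) + bias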
86. Multi Scale Temporal Graph Networks For Skeleton-based Action Recognition [PDF] 返回目录
Tingwei Li, Ruiwen Zhang, Qing Li
Abstract: Graph convolutional networks (GCNs) can effectively capture the features of related nodes and improve the performance of the model. Increasing attention has been paid to employing GCNs in skeleton-based action recognition. However, existing GCN-based methods have two problems. First, extracting features node by node and frame by frame ignores the consistency of temporal and spatial features. To obtain spatiotemporal features simultaneously, we design a generic representation of skeleton sequences for action recognition and propose a novel model called Temporal Graph Networks (TGN). Secondly, the adjacency matrix of the graph describing the relation of joints mostly depends on the physical connections between joints. To appropriately describe the relations between joints in the skeleton graph, we propose a multi-scale graph strategy, adopting a full-scale graph, part-scale graph, and core-scale graph to capture the local features of each joint and the contour features of important joints. Experiments were carried out on two large datasets and the results show that TGN with our graph strategy outperforms state-of-the-art methods.
摘要:图卷积网络(GCN)可以有效地捕获相关节点的特征并提高模型的性能,因此在基于骨骼的动作识别中受到越来越多的关注。但是现有的基于GCN的方法存在两个问题。首先,逐节点、逐帧地提取特征忽略了时间和空间特征的一致性。为了同时获得时空特征,我们设计了用于动作识别的骨架序列的通用表示,并提出了一种称为时态图网络(TGN)的新颖模型。其次,描述关节关系的图的邻接矩阵主要取决于关节之间的物理连接。为了恰当地描述骨架图中关节之间的关系,我们提出了一种多尺度图策略,采用全尺度图、部分尺度图和核心尺度图来捕获每个关节的局部特征以及重要关节的轮廓特征。在两个大型数据集上进行的实验结果表明,采用我们的图策略的TGN优于最新方法。
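A simplified stand-in for the multi-scale graph strategy: build normalized adjacency matrices over increasingly large joint neighbourhoods (the paper derives full-/part-/core-scale graphs from the skeleton's part hierarchy; here only the neighbourhood radius varies, as an illustration):

import numpy as np

def normalized_adjacency(a):
    a = a + np.eye(a.shape[0])              # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(1)))
    return d_inv_sqrt @ a @ d_inv_sqrt      # D^-1/2 (A + I) D^-1/2

def multi_scale_graphs(a, hops=(1, 2, 4)):
    # a: (J, J) binary bone-connectivity matrix of the skeleton.
    graphs, reach = [], np.eye(a.shape[0])
    for k in range(1, max(hops) + 1):
        reach = ((reach + reach @ a) > 0).astype(float)  # k-hop reachability
        if k in hops:
            graphs.append(normalized_adjacency(reach))
    return graphs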
87. FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding [PDF] 返回目录
Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, Robin Murphy
Abstract: Visual scene understanding is the core task in making any crucial decision in any computer vision system. Although popular computer vision datasets like Cityscapes, MS-COCO, and PASCAL provide good benchmarks for several tasks (e.g. image classification, segmentation, object detection), these datasets are hardly suitable for post-disaster damage assessment. On the other hand, existing natural disaster datasets consist mainly of satellite imagery, which has low spatial resolution and a long revisit period. Therefore, they are not well suited for quick and efficient damage assessment. Unmanned Aerial Vehicles (UAVs) can effortlessly access difficult places during any disaster and collect the high-resolution imagery that the aforementioned computer vision tasks require. To address these issues, we present FloodNet, a high-resolution UAV imagery dataset captured after Hurricane Harvey. This dataset documents the post-flood damage of the affected areas. The images are labeled pixel-wise for the semantic segmentation task, and questions are produced for the task of visual question answering. FloodNet poses several challenges, including the detection of flooded roads and buildings and the distinction between natural water and flood water. With the advancement of deep learning algorithms, we can analyze the impact of any disaster and gain a precise understanding of the affected areas. In this paper, we compare and contrast the performance of baseline methods for image classification, semantic segmentation, and visual question answering on our dataset.
摘要:视觉场景理解是任何计算机视觉系统做出关键决策的核心任务。尽管流行的计算机视觉数据集(例如Cityscapes、MS-COCO、PASCAL)为多项任务(例如图像分类、分割、物体检测)提供了良好的基准,但这些数据集几乎不适合灾后损失评估。另一方面,现有的自然灾害数据集主要由卫星图像组成,其空间分辨率低且重访周期长,因此无法支持快速、高效的损失评估任务。无人机(UAV)可以在任何灾难中毫不费力地进入难以到达的地方,并收集上述计算机视觉任务所需的高分辨率图像。为了解决这些问题,我们提出了在飓风“哈维”(Harvey)过后采集的高分辨率无人机图像数据集FloodNet。该数据集记录了受灾地区的洪灾后破坏情况。这些图像针对语义分割任务进行了逐像素标注,并为视觉问答任务构建了问题。FloodNet带来了一些挑战,包括检测被淹的道路和建筑物,以及区分自然水体和洪水。随着深度学习算法的进步,我们可以分析任何灾难的影响,从而准确了解受灾区域。在本文中,我们在该数据集上比较和对比了图像分类、语义分割和视觉问答等基线方法的性能。
88. Cirrus: A Long-range Bi-pattern LiDAR Dataset [PDF] 返回目录
Ze Wang, Sihao Ding, Ying Li, Jonas Fenn, Sohini Roychowdhury, Andreas Wallin, Lane Martin, Scott Ryvola, Guillermo Sapiro, Qiang Qiu
Abstract: In this paper, we introduce Cirrus, a new long-range bi-pattern LiDAR public dataset for autonomous driving tasks such as 3D object detection, critical to highway driving and timely decision making. Our platform is equipped with a high-resolution video camera and a pair of LiDAR sensors with a 250-meter effective range, which is significantly longer than existing public datasets. We record paired point clouds simultaneously using both Gaussian and uniform scanning patterns. Point density varies significantly across such a long range, and different scanning patterns further diversify object representation in LiDAR. In Cirrus, eight categories of objects are exhaustively annotated in the LiDAR point clouds for the entire effective range. To illustrate the kind of studies supported by this new dataset, we introduce LiDAR model adaptation across different ranges, scanning patterns, and sensor devices. Promising results show the great potential of this new dataset to the robotics and computer vision communities.
摘要:在本文中,我们介绍了Cirrus,这是一种新的远程双模式LiDAR公共数据集,用于自动驾驶任务,例如3D对象检测,对高速公路驾驶和及时决策至关重要。我们的平台配备了高分辨率摄像机和一对有效距离为250米的LiDAR传感器,比现有的公共数据集长得多。我们使用高斯和均匀扫描模式同时记录配对的点云。点密度在如此长的范围内变化很大,并且不同的扫描模式进一步使LiDAR中的对象表示多样化。在Cirrus中,在整个有效范围内,在LiDAR点云中详尽注释了八类对象。为了说明此新数据集支持的研究类型,我们介绍了跨不同范围,扫描模式和传感器设备的LiDAR模型自适应。有希望的结果表明,该新数据集对机器人技术和计算机视觉社区具有巨大的潜力。
89. Multi-head Knowledge Distillation for Model Compression [PDF] 返回目录
Huan Wang, Suhas Lohit, Michael Jones, Yun Fu
Abstract: Several methods of knowledge distillation have been developed for neural network compression. While they all use the KL divergence loss to align the soft outputs of the student model more closely with that of the teacher, the various methods differ in how the intermediate features of the student are encouraged to match those of the teacher. In this paper, we propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features, which we refer to as multi-head knowledge distillation (MHKD). We add loss terms for training the student that measure the dissimilarity between student and teacher outputs of the auxiliary classifiers. At the same time, the proposed method also provides a natural way to measure differences at the intermediate layers even though the dimensions of the internal teacher and student features may be different. Through several experiments in image classification on multiple datasets we show that the proposed method outperforms prior relevant approaches presented in the literature.
摘要:目前已经开发了几种用于神经网络压缩的知识蒸馏方法。尽管它们都使用KL散度损失使学生模型的软输出与教师模型更紧密地对齐,但在如何鼓励学生的中间特征与教师匹配方面,各种方法有所不同。在本文中,我们提出了一种易于实现的方法,在中间层使用辅助分类器来匹配特征,我们将其称为多头知识蒸馏(MHKD)。我们为学生的训练添加了损失项,用于度量辅助分类器的学生输出与教师输出之间的差异。同时,即使教师和学生内部特征的维度可能不同,所提出的方法也提供了一种自然的方式来度量中间层的差异。通过在多个数据集上进行图像分类实验,我们表明所提出的方法优于文献中已有的相关方法。
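A hedged sketch of the loss structure described above: the usual temperature-scaled KL term on the final logits, plus one such term per matching pair of intermediate auxiliary classifiers (names and the weighting scheme are illustrative, not the authors' exact formulation):

import torch
import torch.nn.functional as F

def mhkd_loss(student_logits, teacher_logits, aux_pairs, T=4.0, alpha=1.0):
    def kd(s, t):
        # Standard soft-label distillation term with temperature T.
        return F.kl_div(F.log_softmax(s / T, dim=1),
                        F.softmax(t / T, dim=1),
                        reduction="batchmean") * T * T
    loss = kd(student_logits, teacher_logits)
    for s_aux, t_aux in aux_pairs:        # aux-classifier logits at matching depths
        loss = loss + alpha * kd(s_aux, t_aux.detach())
    return loss

Because each auxiliary head maps its intermediate feature map to class logits, student and teacher are compared in a shared space even when their internal feature dimensions differ.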
90. Cosine-Pruned Medial Axis: A new method for isometric equivariant and noise-free medial axis extraction [PDF] 返回目录
Diego Patiño, John Branch
Abstract: We present the CPMA, a new method for medial axis pruning with noise robustness and equivariance to isometric transformations. Our method leverages the discrete cosine transform to create smooth versions of a shape $\Omega$. We use the smooth shapes to compute a score function $\scorefunction$ that filters out spurious branches from the medial axis. We extensively compare the CPMA with state-of-the-art pruning methods and highlight our method's noise robustness and isometric equivariance. We found that our pruning approach achieves competitive results and yields stable medial axes even in scenarios with significant contour perturbations.
摘要:我们提出了CPMA,这是一种新的中轴修剪方法,具有噪声鲁棒性并对等距变换具有等变性。我们的方法利用离散余弦变换来创建形状$\Omega$的平滑版本,并使用这些平滑形状来计算得分函数$\scorefunction$,以从中轴中过滤掉虚假分支。我们将CPMA与最先进的修剪方法进行了广泛的比较,突出了该方法的噪声鲁棒性和等距等变性。我们发现,即使在轮廓扰动明显的情况下,我们的修剪方法也能获得有竞争力的结果并产生稳定的中轴。
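The smoothing step can be sketched with a DCT low-pass filter on the binary shape (a hedged illustration of the smoothing idea only; the pruning score itself is computed from a family of such smoothed shapes, and the keep fraction is an illustrative parameter):

import numpy as np
from scipy.fft import dctn, idctn

def dct_smooth_shape(mask, keep=0.1):
    # mask: binary (H, W) shape. Keep only the lowest `keep` fraction of
    # DCT frequencies per axis, then threshold back to a smooth binary shape.
    coeffs = dctn(mask.astype(float), norm="ortho")
    h, w = coeffs.shape
    lowpass = np.zeros_like(coeffs)
    kh, kw = max(1, int(h * keep)), max(1, int(w * keep))
    lowpass[:kh, :kw] = coeffs[:kh, :kw]
    return idctn(lowpass, norm="ortho") > 0.5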
91. Knowledge Distillation Thrives on Data Augmentation [PDF] 返回目录
Huan Wang, Suhas Lohit, Michael Jones, Yun Fu
Abstract: Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model. Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far. In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not. We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA. By this explanation, we propose to enhance KD via a stronger data augmentation scheme (e.g., mixup, CutMix). Furthermore, an even stronger new DA approach is developed specifically for KD based on the idea of active learning. The findings and merits of the proposed method are validated by extensive experiments on the CIFAR-100, Tiny ImageNet, and ImageNet datasets. We can achieve improved performance simply by using the original KD loss combined with stronger augmentation schemes, compared to existing state-of-the-art methods, which employ more advanced distillation losses. In addition, when our approaches are combined with more advanced distillation losses, we can advance the state-of-the-art performance even more. On top of the encouraging performance, this paper also sheds some light on explaining the success of knowledge distillation. The discovered interplay between KD and DA may inspire more advanced KD algorithms.
摘要:知识蒸馏(KD)是一种通用的深度神经网络训练框架,使用教师模型指导学生模型。许多工作探讨了其成功的原理,但是到目前为止,它与数据增强(DA)的相互作用还没有得到很好的认识。在本文中,我们受到分类中一个有趣观察的启发:KD损失可以受益于更长的训练迭代,而交叉熵损失却不能。我们证明这种差异源于数据增强:KD损失可以利用DA带来的来自不同输入视图的额外信息。基于这一解释,我们建议通过更强的数据增强方案(例如mixup、CutMix)来增强KD。此外,基于主动学习的思想,我们专门针对KD开发了一种更强的新DA方法。通过在CIFAR-100、Tiny ImageNet和ImageNet数据集上的大量实验,验证了该方法的发现和优点。与采用更先进蒸馏损失的现有最新方法相比,我们仅使用原始的KD损失结合更强的增强方案即可获得更好的性能。此外,将我们的方法与更先进的蒸馏损失结合使用时,还可以进一步刷新最先进的性能。除了令人鼓舞的表现,本文还为解释知识蒸馏的成功提供了一些启示。所发现的KD与DA之间的相互作用可能会激发更先进的KD算法。
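A minimal sketch of combining the plain KD loss with a stronger augmentation such as mixup, as proposed above (PyTorch; the label-free form below exploits the fact that the student only has to match the teacher on the same mixed image, and all hyper-parameters are illustrative):

import torch
import torch.nn.functional as F

def kd_with_mixup_step(student, teacher, x, T=4.0, beta=1.0):
    lam = torch.distributions.Beta(beta, beta).sample().item()
    x_mix = lam * x + (1 - lam) * x[torch.randperm(x.size(0))]  # mixup inputs
    with torch.no_grad():
        t_logits = teacher(x_mix)            # teacher soft targets on mixed view
    s_logits = student(x_mix)
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T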
92. Driver Glance Classification In-the-wild: Towards Generalization Across Domains and Subjects [PDF] 返回目录
Sandipan Banerjee, Ajjen Joshi, Jay Turcot, Bryan Reimer, Taniya Mishra
Abstract: Distracted drivers are dangerous drivers. Equipping advanced driver assistance systems (ADAS) with the ability to detect driver distraction can help prevent accidents and improve driver safety. In order to detect driver distraction, an ADAS must be able to monitor their visual attention. We propose a model that takes as input a patch of the driver's face along with a crop of the eye-region and classifies their glance into 6 coarse regions-of-interest (ROIs) in the vehicle. We demonstrate that an hourglass network, trained with an additional reconstruction loss, allows the model to learn stronger contextual feature representations than a traditional encoder-only classification module. To make the system robust to subject-specific variations in appearance and behavior, we design a personalized hourglass model tuned with an auxiliary input representing the driver's baseline glance behavior. Finally, we present a weakly supervised multi-domain training regimen that enables the hourglass to jointly learn representations from different domains (varying in camera type, angle), utilizing unlabeled samples and thereby reducing annotation cost.
摘要:分心的驾驶员是危险的驾驶员。为高级驾驶员辅助系统(ADAS)配备检测驾驶员分心的能力,可以帮助预防事故并提高驾驶安全性。为了检测驾驶员分心,ADAS必须能够监视其视觉注意力。我们提出了一个模型,该模型以驾驶员面部图像块及其眼部区域裁剪作为输入,并将其视线分类为车辆内的6个粗略感兴趣区域(ROI)。我们证明,使用额外重建损失训练的沙漏网络,比传统的仅编码器分类模块能学习到更强的上下文特征表示。为了使系统对外观和行为上的个体差异具有鲁棒性,我们设计了个性化的沙漏模型,以表示驾驶员基线视线行为的辅助输入进行调整。最后,我们提出了一种弱监督的多域训练方案,使沙漏网络能够利用未标记的样本共同学习来自不同域(摄像机类型、角度各不相同)的表示,从而降低标注成本。
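The joint objective suggested by the abstract, i.e. 6-way glance classification plus an auxiliary reconstruction of the input that forces the hourglass bottleneck to retain contextual appearance, can be sketched as follows (the weighting `lam` is an illustrative assumption):

import torch.nn.functional as F

def glance_loss(class_logits, glance_target, recon, face_input, lam=0.1):
    cls = F.cross_entropy(class_logits, glance_target)  # 6 coarse ROIs
    rec = F.mse_loss(recon, face_input)                 # reconstruction head
    return cls + lam * rec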
93. Automated Calibration of Mobile Cameras for 3D Reconstruction of Mechanical Pipes [PDF] 返回目录
Reza Maalek, Derek Lichti
Abstract: This manuscript provides a new framework for calibration of optical instruments, in particular mobile cameras, using large-scale circular black and white target fields. New methods were introduced for (i) matching targets between images; (ii) adjusting the systematic eccentricity error of target centers; and (iii) iteratively improving the calibration solution through a free-network self-calibrating bundle adjustment. It was observed that the proposed target matching effectively matched circular targets in 270 mobile phone images from a complete calibration laboratory, with robustness to Type II errors. The proposed eccentricity adjustment, which requires only camera projective matrices from two views, behaved equivalently to available closed-form solutions, which require several additional pieces of object-space target information a priori. Finally, specifically for the case of mobile devices, the calibration parameters obtained using our framework were found superior to in-situ calibration for estimating the 3D reconstructed radius of a mechanical pipe (approximately 45% improvement).
摘要:本文为使用大型圆形黑白标靶场的光学仪器(尤其是移动相机)标定提供了一个新框架。我们引入了新的方法用于:(i)在图像之间匹配标靶;(ii)调整标靶中心的系统性偏心误差;(iii)通过自由网自检校光束法平差迭代改进标定解。实验表明,所提出的标靶匹配方法在来自完整标定实验室的270幅手机图像中有效地匹配了圆形标靶,并对II型错误具有鲁棒性。所提出的偏心调整仅需要两个视图的相机投影矩阵,其表现与现有的闭式解等效,而后者需要预先已知若干额外的物方标靶信息。最后,特别是对于移动设备,在估计机械管道的3D重建半径时,使用我们的框架获得的标定参数优于现场标定(提高了约45%)。
94. Discovering Underground Maps from Fashion [PDF] 返回目录
Utkarsh Mall, Kavita Bala, Tamara Berg, Kristen Grauman
Abstract: The fashion sense -- meaning the clothing styles people wear -- in a geographical region can reveal information about that region. For example, it can reflect the kind of activities people do there, or the type of crowds that frequently visit the region (e.g., tourist hot spot, student neighborhood, business center). We propose a method to automatically create underground neighborhood maps of cities by analyzing how people dress. Using publicly available images from across a city, our method finds neighborhoods with a similar fashion sense and segments the map without supervision. For 37 cities worldwide, we show promising results in creating good underground maps, as evaluated using experiments with human judges and underground map benchmarks derived from non-image data. Our approach further allows detecting distinct neighborhoods (what is the most unique region of LA?) and answering analogy questions between cities (what is the "Downtown LA" of Bogota?).
摘要:某个地理区域的时尚感(即人们穿着的服装样式)可以揭示该区域的信息。例如,它可以反映人们在那里进行的活动的类型,或经常到访该地区的人群的类型(例如旅游热点、学生社区、商务中心)。我们提出了一种通过分析人们的着装方式自动创建城市地下社区地图的方法。通过使用来自整个城市的公开图像,我们的方法可以找到具有相似时尚感的街区,并在没有监督的情况下对地图进行分割。对于全球37个城市,通过人类评判者的实验以及从非图像数据得出的地下地图基准进行评估,我们在创建良好的地下地图方面显示出令人鼓舞的结果。我们的方法还可以检测独特的街区(洛杉矶最独特的地区是什么?)并回答城市之间的类比问题(波哥大的“洛杉矶市中心”是什么?)。
95. MPG: A Multi-ingredient Pizza Image Generator with Conditional StyleGANs [PDF] 返回目录
Fangda Han, Guoyao Hao, Ricardo Guerrero, Vladimir Pavlovic
Abstract: Multilabel conditional image generation is a challenging problem in computer vision. In this work we propose the Multi-ingredient Pizza Generator (MPG), a conditional Generative Adversarial Network (GAN) framework for synthesizing multilabel images. We design MPG based on a state-of-the-art GAN structure called StyleGAN2, in which we develop a new conditioning technique by enforcing intermediate feature maps to learn scalewise label information. Because of the complex nature of the multilabel image generation problem, we also regularize the synthetic images by predicting the corresponding ingredients, as well as encourage the discriminator to distinguish between matched and mismatched images. To verify the efficacy of MPG, we test it on Pizza10, a carefully annotated multi-ingredient pizza image dataset. MPG can successfully generate photo-realistic pizza images with the desired ingredients. The framework can be easily extended to other multilabel image generation scenarios.
摘要:多标签条件图像生成是计算机视觉中一个具有挑战性的问题。在这项工作中,我们提出了多成分比萨生成器(MPG),这是一种用于合成多标签图像的条件生成对抗网络(GAN)框架。我们基于名为StyleGAN2的最新GAN结构设计MPG,并在其中开发了一种新的条件化技术:通过约束中间特征图来学习逐尺度的标签信息。由于多标签图像生成问题的复杂性,我们还通过预测相应的成分来正则化合成图像,并鼓励判别器区分匹配的图像和不匹配的图像。为了验证MPG的功效,我们在Pizza10(一个经过仔细标注的多成分比萨图像数据集)上对其进行了测试。MPG可以成功生成具有所需成分的逼真比萨图像。该框架可以轻松扩展到其他多标签图像生成场景。
96. Encoding the latent posterior of Bayesian Neural Networks for uncertainty quantification [PDF] 返回目录
Gianni Franchi, Andrei Bursuc, Emanuel Aldea, Severine Dubuisson, Isabelle Bloch
Abstract: Bayesian neural networks (BNNs) have been long considered an ideal, yet unscalable solution for improving the robustness and the predictive uncertainty of deep neural networks. While they could capture more accurately the posterior distribution of the network parameters, most BNN approaches are either limited to small networks or rely on constraining assumptions such as parameter independence. These drawbacks have enabled prominence of simple, but computationally heavy approaches such as Deep Ensembles, whose training and testing costs increase linearly with the number of networks. In this work we aim for efficient deep BNNs amenable to complex computer vision architectures, e.g. ResNet50 DeepLabV3+, and tasks, e.g. semantic segmentation, with fewer assumptions on the parameters. We achieve this by leveraging variational autoencoders (VAEs) to learn the interaction and the latent distribution of the parameters at each network layer. Our approach, Latent-Posterior BNN (LP-BNN), is compatible with the recent BatchEnsemble method, leading to highly efficient (in terms of computation and memory during both training and testing) ensembles. LP-BNNs attain competitive results across multiple metrics in several challenging benchmarks for image classification, semantic segmentation and out-of-distribution detection.
摘要:贝叶斯神经网络(BNN)长期以来被认为是提高深度神经网络鲁棒性和预测不确定性的理想但难以扩展的解决方案。尽管它们可以更准确地捕获网络参数的后验分布,但大多数BNN方法要么局限于小型网络,要么依赖于诸如参数独立性之类的约束性假设。这些缺点使得像Deep Ensembles这样简单但计算量大的方法得以盛行,而这类方法的训练和测试成本随网络数量线性增长。在这项工作中,我们旨在开发可用于复杂计算机视觉架构(如ResNet50 DeepLabV3+)和任务(如语义分割)的高效深度BNN,并减少对参数的假设。我们通过利用变分自动编码器(VAE)来学习每个网络层参数的交互作用和潜在分布来实现这一目标。我们的方法,即潜在后验BNN(LP-BNN),与最新的BatchEnsemble方法兼容,从而得到在训练和测试期间的计算与内存开销方面都非常高效的集成模型。LP-BNN在图像分类、语义分割和分布外检测等多个具有挑战性的基准上,在多个指标上均取得了有竞争力的结果。
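A compact sketch of the central ingredient: a small VAE over a layer's ensemble weight vectors (for example BatchEnsemble rank-1 factors), from which weights can be re-sampled at test time. Dimensions, the single hidden layer, and the training recipe are all illustrative assumptions, not the authors' exact architecture:

import torch
import torch.nn as nn

class WeightVAE(nn.Module):
    def __init__(self, w_dim, z_dim=16, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(w_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, w_dim))

    def forward(self, w):                    # w: (n_members, w_dim)
        h = self.enc(w)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(1).mean()
        return self.dec(z), kl               # add MSE(recon, w) + kl to the task loss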
97. The Role of Regularization in Shaping Weight and Node Pruning Dependency and Dynamics [PDF] 返回目录
Yael Ben-Guigui, Jacob Goldberger, Tammy Riklin-Raviv
Abstract: The pressing need to reduce the capacity of deep neural networks has stimulated the development of network dilution methods and their analysis. While the ability of $L_1$ and $L_0$ regularization to encourage sparsity is often mentioned, $L_2$ regularization is seldom discussed in this context. We present a novel framework for weight pruning by sampling from a probability function that favors the zeroing of smaller weights. In addition, we examine the contribution of $L_1$ and $L_2$ regularization to the dynamics of node pruning while optimizing for weight pruning. We then demonstrate the effectiveness of the proposed stochastic framework when used together with a weight decay regularizer on popular classification models in removing 50% of the nodes in an MLP for MNIST classification, 60% of the filters in VGG-16 for CIFAR10 classification, and on medical image models in removing 60% of the channels in a U-Net for instance segmentation and 50% of the channels in CNN model for COVID-19 detection. For these node-pruned networks, we also present competitive weight pruning results that are only slightly less accurate than the original, dense networks.
摘要:降低深度神经网络容量的迫切需求推动了网络稀疏化方法及其分析的发展。虽然$L_1$和$L_0$正则化鼓励稀疏性的能力经常被提及,但$L_2$正则化在这一背景下很少被讨论。我们提出了一种新颖的权重修剪框架,通过从一个倾向于将较小权重置零的概率函数中采样来实现。此外,我们在为权重修剪进行优化的同时,研究了$L_1$和$L_2$正则化对节点修剪动态的贡献。然后,我们证明了所提出的随机框架与权重衰减正则化器一起使用时的有效性:在流行的分类模型上,可以去除用于MNIST分类的MLP中50%的节点、用于CIFAR10分类的VGG-16中60%的滤波器;在医学图像模型上,可以去除用于实例分割的U-Net中60%的通道以及用于COVID-19检测的CNN模型中50%的通道。对于这些经过节点修剪的网络,我们还给出了有竞争力的权重修剪结果,其准确性仅比原始的稠密网络略低。
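A hedged sketch of sampling-based weight pruning in the spirit described above: instead of deterministically cutting below a magnitude threshold, sample a keep-mask whose per-weight probability grows with magnitude (the exact probability function in the paper may differ; the temperature and rescaling below are illustrative):

import numpy as np

def sample_prune_mask(weights, sparsity=0.5, temperature=1.0):
    w = np.abs(weights).ravel()
    # Standardised magnitude -> keep-probability via a sigmoid.
    logits = (w - w.mean()) / ((w.std() + 1e-9) * temperature)
    p_keep = 1.0 / (1.0 + np.exp(-logits))
    # Rescale so the expected number of kept weights matches the target.
    p_keep = np.clip(p_keep * (1.0 - sparsity) * w.size / (p_keep.sum() + 1e-9), 0, 1)
    mask = np.random.rand(w.size) < p_keep
    return mask.reshape(weights.shape)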
98. Perspectives on Sim2Real Transfer for Robotics: A Summary of the R:SS 2020 Workshop [PDF] 返回目录
Sebastian Höfer, Kostas Bekris, Ankur Handa, Juan Camilo Gamboa, Florian Golemo, Melissa Mozifian, Chris Atkeson, Dieter Fox, Ken Goldberg, John Leonard, C. Karen Liu, Jan Peters, Shuran Song, Peter Welinder, Martha White
Abstract: This report presents the debates, posters, and discussions of the Sim2Real workshop held in conjunction with the 2020 edition of the "Robotics: Science and System" conference. Twelve leaders of the field took competing debate positions on the definition, viability, and importance of transferring skills from simulation to the real world in the context of robotics problems. The debaters also joined a large panel discussion, answering audience questions and outlining the future of Sim2Real in robotics. Furthermore, we invited extended abstracts to this workshop which are summarized in this report. Based on the workshop, this report concludes with directions for practitioners exploiting this technology and for researchers further exploring open problems in this area.
摘要:本报告介绍了与2020年“机器人:科学与系统”会议同期举行的Sim2Real研讨会上的辩论、海报和讨论。该领域的十二位领军人物就机器人问题背景下将技能从仿真迁移到现实世界的定义、可行性和重要性展开了针锋相对的辩论。辩论者还参加了一场大型小组讨论,回答了听众的问题,并展望了Sim2Real在机器人技术中的未来。此外,我们为本研讨会征集了扩展摘要,并在本报告中进行了总结。基于研讨会的内容,本报告最后为利用该技术的从业人员以及进一步探索该领域开放问题的研究人员给出了方向。
99. Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems [PDF] 返回目录
Ruan van der Merwe
Abstract: We present several methods to improve the generalisation of language identification (LID) systems to new speakers and to new domains. These methods include spectral augmentation, where spectrograms are masked in frequency or time bands during training, and CNN architectures that are pre-trained on the ImageNet dataset. The paper also introduces the novel Triplet Entropy Loss training method, which involves training a network simultaneously using Cross Entropy and Triplet loss. It was found that all three methods improved the generalisation of the models, though not significantly. Even though the models trained using Triplet Entropy Loss showed a better understanding of the languages and higher accuracies, it appears as though the models still memorise word patterns present in the spectrograms rather than learning the finer nuances of a language. The research shows that Triplet Entropy Loss has great potential and should be investigated further, not only in language identification tasks but in any classification task.
摘要:我们提出了几种方法来提高语言识别(LID)系统对新说话者和新领域的泛化能力。这些方法包括频谱增强(在训练期间对频谱图的频带或时间段进行掩蔽)以及在ImageNet数据集上预训练的CNN架构。本文还介绍了新颖的三重熵损失训练方法,该方法使用交叉熵和三重损失同时训练网络。研究发现,这三种方法都改善了模型的泛化能力,尽管效果不显著。即使使用三重熵损失训练的模型显示出对语言更好的理解和更高的准确率,这些模型似乎仍然在记忆频谱图中存在的单词模式,而不是学习语言更细微的差别。研究表明,三重熵损失具有巨大的潜力,不仅在语言识别任务中,而且在任何分类任务中都应进一步研究。
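The Triplet Entropy Loss itself is straightforward to sketch: cross-entropy on the language logits plus a triplet margin term on the embeddings, where anchor and positive come from the same language and the negative from a different one (PyTorch; `gamma` and the margin are illustrative assumptions):

import torch.nn as nn

ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=1.0)

def triplet_entropy_loss(logits, labels, anchor, positive, negative, gamma=1.0):
    # Joint objective: classification term plus embedding-separation term.
    return ce(logits, labels) + gamma * triplet(anchor, positive, negative)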
100. Overcoming Barriers to Data Sharing with Medical Image Generation: A Comprehensive Evaluation [PDF] 返回目录
August DuMont Schütte, Jürgen Hetzel, Sergios Gatidis, Tobias Hepp, Benedikt Dietz, Stefan Bauer, Patrick Schwab
Abstract: Privacy concerns around sharing personally identifiable information are a major practical barrier to data sharing in medical research. However, in many cases, researchers have no interest in a particular individual's information but rather aim to derive insights at the level of cohorts. Here, we utilize Generative Adversarial Networks (GANs) to create derived medical imaging datasets consisting entirely of synthetic patient data. The synthetic images ideally have, in aggregate, similar statistical properties to those of a source dataset but do not contain sensitive personal information. We assess the quality of synthetic data generated by two GAN models for chest radiographs with 14 different radiology findings and brain computed tomography (CT) scans with six types of intracranial hemorrhages. We measure the synthetic image quality by the performance difference of predictive models trained on either the synthetic or the real dataset. We find that synthetic data performance disproportionately benefits from a reduced number of unique label combinations and determine at what number of samples per class overfitting effects start to dominate GAN training. Our open-source benchmark findings also indicate that synthetic data generation can benefit from higher levels of spatial resolution. We additionally conducted a reader study in which trained radiologists did not perform better than random, to a statistically significant extent, at discriminating between synthetic and real medical images for either data modality. Our study offers valuable guidelines and outlines practical conditions under which insights derived from synthetic medical images are similar to those that would have been derived from real imaging data. Our results indicate that synthetic data sharing may be an attractive and privacy-preserving alternative to sharing real patient-level data in the right settings.
摘要:围绕共享个人可识别信息的隐私问题是医学研究中数据共享的主要实际障碍。但是,在许多情况下,研究人员对特定个体的信息并不感兴趣,而是旨在获得队列层面的见解。在这里,我们利用生成对抗网络(GAN)来创建完全由合成患者数据组成的派生医学成像数据集。理想情况下,合成图像在总体上具有与源数据集相似的统计特性,但不包含敏感的个人信息。我们评估了两种GAN模型生成的合成数据的质量:一种针对具有14种不同放射学发现的胸部X光片,另一种针对具有六种颅内出血类型的脑部计算机断层扫描(CT)。我们通过在合成数据集或真实数据集上训练的预测模型的性能差异来衡量合成图像质量。我们发现,合成数据的性能从减少独特标签组合数量中获得了不成比例的收益,并确定在每类样本数达到多少时过拟合效应开始主导GAN训练。我们的开源基准测试结果还表明,合成数据生成可以从更高的空间分辨率中受益。我们还进行了一项阅片研究:在两种数据模态下,受过训练的放射科医生区分合成医学图像与真实医学图像的表现在统计上均不显著优于随机猜测。我们的研究提供了有价值的指导方针,并概述了在哪些实际条件下从合成医学图像获得的见解与从真实影像数据获得的见解相似。我们的结果表明,在合适的设置下,合成数据共享可能是共享真实患者级数据的一种有吸引力且保护隐私的替代方案。
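The evaluation protocol in the abstract, i.e. measuring synthetic-image quality as the test-performance gap between a model trained on real data and the same model trained on synthetic data, can be sketched as follows (a simple linear probe on binary labels stands in for the imaging models actually used; the function and variable names are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def utility_gap(x_real, y_real, x_syn, y_syn, x_test, y_test):
    auc_real = roc_auc_score(y_test, LogisticRegression(max_iter=1000)
                             .fit(x_real, y_real).predict_proba(x_test)[:, 1])
    auc_syn = roc_auc_score(y_test, LogisticRegression(max_iter=1000)
                            .fit(x_syn, y_syn).predict_proba(x_test)[:, 1])
    return auc_real - auc_syn   # smaller gap = more useful synthetic data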
101. Learning Tactile Models for Factor Graph-based State Estimation [PDF] 返回目录
Paloma Sodhi, Michael Kaess, Mustafa Mukadam, Stuart Anderson
Abstract: We address the problem of estimating object pose from touch during manipulation under occlusion. Vision-based tactile sensors provide rich, local measurements at the point of contact. A single such measurement, however, contains limited information and multiple measurements are needed to infer latent object state. We solve this inference problem using a factor graph. In order to incorporate tactile measurements in the graph, we need local observation models that can map high-dimensional tactile images onto a low-dimensional state space. Prior work has used low-dimensional force measurements or hand-designed functions to interpret tactile measurements. These methods, however, can be brittle and difficult to scale across objects and sensors. Our key insight is to directly learn tactile observation models that predict the relative pose of the sensor given a pair of tactile images. These relative poses can then be incorporated as factors within a factor graph. We propose a two-stage approach: first we learn local tactile observation models supervised with ground truth data, and then integrate these models along with physics and geometric factors within a factor graph optimizer. We demonstrate reliable object tracking using only tactile feedback for over 150 real-world planar pushing sequences with varying trajectories across three object shapes. Supplementary video: this https URL
摘要:我们解决了在遮挡下操作时通过触摸估计物体姿势的问题。基于视觉的触觉传感器可在接触点提供丰富的本地测量。然而,单个这样的测量包含有限的信息,并且需要多个测量来推断潜在物体状态。我们使用因子图解决了这一推理问题。为了将触觉测量值包含在图中,我们需要可以将高维触觉图像映射到低维状态空间的局部观察模型。先前的工作使用低维力测量或手动设计的功能来解释触觉测量。但是,这些方法可能很脆弱,并且很难在对象和传感器上缩放。我们的主要见解是直接学习触觉观察模型,这些模型在给定一对触觉图像的情况下预测传感器的相对姿势。然后,可以将这些相对姿势作为因子合并到因子图中。我们提出了一种分两个阶段的方法:首先,我们学习在地面实况数据的监督下的本地触觉观察模型,然后将这些模型与物理和几何因素集成到因素图优化器中。我们展示了仅使用触觉反馈就超过150种现实世界中的平面推动序列(在三个物体形状上具有不同轨迹)的可靠物体跟踪。补充视频:此https URL
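A minimal sketch of the learned observation model described above, assuming vision-based tactile images and a planar state: a shared CNN encoder maps a pair of tactile images to a relative pose (dx, dy, dtheta), which could then be added as a relative-pose factor in a factor graph. Architecture and sizes are illustrative, not the authors'.

```python
import torch
import torch.nn as nn

class TactileRelPoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(      # shared encoder for both images
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, 3)       # concatenated features -> (dx, dy, dtheta)

    def forward(self, img_t, img_tp1):
        f = torch.cat([self.encoder(img_t), self.encoder(img_tp1)], dim=1)
        return self.head(f)

net = TactileRelPoseNet()
img_t, img_tp1 = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
rel_pose = net(img_t, img_tp1)             # (8, 3); supervised with ground truth
print(rel_pose.shape)
```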
102. Multi-Decoder Networks with Multi-Denoising Inputs for Tumor Segmentation [PDF] 返回目录
Minh H. Vu, Tufve Nyholm, Tommy Löfstedt
Abstract: Automatic segmentation of brain glioma from multimodal MRI scans plays a key role in clinical trials and practice. Unfortunately, manual segmentation is very challenging, time-consuming, costly, and often inaccurate despite human expertise due to the high variance and high uncertainty in the human annotations. In the present work, we develop an end-to-end deep-learning-based segmentation method using a multi-decoder architecture by jointly learning three separate sub-problems using a partly shared encoder. We also propose to apply smoothing methods to the input images to generate denoised versions as additional inputs to the network. The validation performance indicates an improvement when using the proposed method. The proposed method was ranked 2nd in the task of Quantification of Uncertainty in Segmentation in the Multimodal Brain Tumor Segmentation (BraTS) Challenge 2020.
摘要:从多模式MRI扫描中自动分割脑胶质瘤在临床试验和实践中起着关键作用。 不幸的是,由于人工注释的高度变异性和高度不确定性,尽管具有人工专业知识,但是手动分割非常具有挑战性,耗时,成本高且常常不准确。 在当前的工作中,我们通过使用部分共享的编码器共同学习三个单独的子问题,开发了一种使用多解码器体系结构的端到端基于深度学习的分割方法。 我们还建议对输入图像应用平滑方法,以生成降噪后的版本,作为网络的其他输入。 使用所提出的方法时,验证性能表明有所改进。 在2020多模态磁共振成像挑战赛中,该方法在量化脑肿瘤分割不确定性的任务中排名第二。
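The combined cross-entropy and Jaccard objective mentioned above can be written compactly; the sketch below assumes binary masks and an equal weighting, which is an assumption rather than the paper's exact setting.

```python
import torch
import torch.nn.functional as F

def soft_jaccard_loss(logits, target, eps=1e-6):
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = (p + target - p * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()   # 1 - soft IoU

def combined_loss(logits, target, w=0.5):               # weight w is assumed
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return w * ce + (1 - w) * soft_jaccard_loss(logits, target)

logits = torch.randn(2, 1, 64, 64, requires_grad=True)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
combined_loss(logits, target).backward()
```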
103. Learning normal appearance for fetal anomaly screening: Application to the unsupervised detection of Hypoplastic Left Heart Syndrome [PDF] 返回目录
Elisa Chotzoglou, Thomas Day, Jeremy Tan, Jacqueline Matthew, David Lloyd, Reza Razavi, John Simpson, Bernhard Kainz
Abstract: Congenital heart disease is considered one of the most common groups of congenital malformations, affecting $6-11$ per $1000$ newborns. In this work, an automated framework for detection of cardiac anomalies during ultrasound screening is proposed and evaluated on the example of Hypoplastic Left Heart Syndrome (HLHS), a sub-category of congenital heart disease. We propose an unsupervised approach that learns healthy anatomy exclusively from clinically confirmed normal control patients. We evaluate a number of known anomaly detection frameworks together with a new model architecture based on the $\alpha$-GAN network and find evidence that the proposed model performs significantly better than the state-of-the-art in image-based anomaly detection, yielding an average AUC of $0.81$ and better robustness towards initialisation compared to previous works.
摘要:先天性心脏病被认为是最常见的先天性畸形类别之一,每1000名新生儿中约有6-11名受到影响。在这项工作中,提出了一种在超声筛查过程中自动检测心脏异常的框架,并以左心发育不良综合征(HLHS)这一先天性心脏病子类别为例进行了评估。我们提出了一种无监督的方法,专门从临床确认的正常对照患者那里学习健康的解剖结构。我们评估了许多已知的异常检测框架,以及一个基于$\alpha$-GAN网络的新模型架构,并发现所提出的模型在基于图像的异常检测中的性能明显优于最新技术,平均AUC达到0.81,并且与以前的工作相比,在初始化方面具有更好的鲁棒性。
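The core scoring idea, learning normal appearance and flagging deviations, can be sketched with a reconstruction-based anomaly score; the toy autoencoder below is a stand-in for the paper's $\alpha$-GAN-based architecture.

```python
import torch
import torch.nn as nn

ae = nn.Sequential(                      # toy autoencoder, to be trained on
    nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),   # normal images only
    nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1))

def anomaly_score(x):
    with torch.no_grad():
        return ((ae(x) - x) ** 2).mean(dim=(1, 2, 3))     # per-image error

scores = anomaly_score(torch.randn(4, 1, 64, 64))
# Rank or threshold the scores (e.g., for the reported AUC) to flag anomalies.
```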
104. Binary Segmentation of Seismic Facies Using Encoder-Decoder Neural Networks [PDF] 返回目录
Gefersom Lima, Gabriel Ramos, Sandro Rigo, Felipe Zeiser, Ariane da Silveira
Abstract: The interpretation of seismic data is vital for characterizing sediments' shape in areas of geological study. In seismic interpretation, deep learning becomes useful for reducing the dependence on handcrafted facies segmentation geometry and the time required to study geological areas. This work presents a Deep Neural Network for Facies Segmentation (DNFS) to obtain state-of-the-art results for seismic facies segmentation. DNFS is trained using a combination of cross-entropy and Jaccard loss functions. Our results show that DNFS obtains highly detailed predictions for seismic facies segmentation using fewer parameters than StNet and U-Net.
摘要:地震数据的解释对于表征地质研究区域的沉积物形状至关重要。在地震解释中,深度学习有助于减少对手工相分割几何形状的依赖以及研究地质区域所需的时间。这项工作提出了一种用于相分割的深度神经网络(DNFS),以获得地震相分割的最新结果。DNFS使用交叉熵和Jaccard损失函数的组合进行训练。我们的结果表明,DNFS使用比StNet和U-Net更少的参数获得了高度精细的地震相分割预测。
105. Efficient Medical Image Segmentation with Intermediate Supervision Mechanism [PDF] 返回目录
Di Yuan, Junyang Chen, Zhenghua Xu, Thomas Lukasiewicz, Zhigang Fu, Guizhi Xu
Abstract: Because the expansion path of U-Net may ignore the characteristics of small targets, an intermediate supervision mechanism is proposed, in which the original mask is also fed into the network as a label for an intermediate output. However, U-Net is designed for segmentation: its extracted features target segmentation location information, and its input and output differ. The label we need is one where input and output are both the original mask, which is closer to a reconstruction process, so we propose another intermediate supervision mechanism. However, the features extracted by the contraction path of this intermediate supervision mechanism are not necessarily consistent between the two tasks. For example, U-Net's contraction path extracts transverse features, while an auto-encoder extracts longitudinal features, which may cause the output of the expansion path to be inconsistent with the label. Therefore, we put forward an intermediate supervision mechanism based on a shared-weight decoder module. Although the intermediate supervision mechanism improves segmentation accuracy, the training time becomes too long due to the extra input and multiple loss functions. To address this, we introduce a tied-weight decoder. To reduce the redundancy of the model, we combine the shared-weight decoder module with the tied-weight decoder module.
摘要:由于U-Net的扩展路径可能会忽略小目标的特征,因此提出了中间监督机制。原始掩码也作为中间输出的标签输入网络。但是,U-Net主要从事分割,提取的特征也针对分割位置信息,并且输入和输出是不同的。我们需要的标签是输入和输出都是原始掩码,这与重构过程更相似,因此我们提出了另一种中间监督机制。但是,此中间监督机制的收缩路径提取的特征不一定一致。例如,U-Net的收缩路径提取横向特征,而自动编码器提取纵向特征,这可能导致扩展路径的输出与标签不一致。因此,我们提出了基于共享权重解码器模块的中间监督机制。尽管中间监督机制提高了分割精度,但是由于额外的输入和多重损失函数,训练时间过长。为了解决这一问题,我们引入了捆绑权重解码器。为了减少模型的冗余度,我们将共享权重解码器模块与捆绑权重解码器模块结合在一起。
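A sketch of the intermediate supervision idea, under the assumption of a single shared encoder with a segmentation head and an auxiliary head that are both supervised with the original mask; the auxiliary weight 0.4 is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNetIS(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 16, 3, padding=1)        # shared encoder
        self.seg_dec = nn.Conv2d(16, 1, 3, padding=1)    # segmentation head
        self.aux_dec = nn.Conv2d(16, 1, 3, padding=1)    # intermediate head
        # reusing self.seg_dec for both heads would give a tied-weight decoder

    def forward(self, x):
        f = F.relu(self.enc(x))
        return self.seg_dec(f), self.aux_dec(f)

model = TinyNetIS()
img = torch.randn(2, 1, 64, 64)
mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
seg, aux = model(img)
loss = F.binary_cross_entropy_with_logits(seg, mask) \
     + 0.4 * F.binary_cross_entropy_with_logits(aux, mask)
loss.backward()
```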
106. Impact of Power Supply Noise on Image Sensor Performance in Automotive Applications [PDF] 返回目录
Shane Gilroy
Abstract: Vision Systems are quickly becoming a large component of Active Automotive Safety Systems. To be effective in critical safety applications, these systems must produce high-quality images in both daytime and night-time scenarios in order to provide the large informational content required for software analysis in applications such as lane departure, pedestrian detection, and collision detection. The challenge in capturing high-quality images in low-light scenarios is that the signal-to-noise ratio is greatly reduced, which can result in noise becoming the dominant factor in a captured image, thereby making these safety systems less effective at night. Research has been undertaken to develop a systematic method of characterising image sensor performance in response to electrical noise in order to improve the design and performance of automotive cameras in low-light scenarios. The root cause of image row noise has been established, and a mathematical algorithm for determining the magnitude of row noise in an image has been devised. An automated characterisation method has been developed to allow performance characterisation in response to a large frequency spectrum of electrical noise on the image sensor power supply. Various strategies for improving image sensor performance in low-light applications have also been proposed from the research outcomes.
摘要:视觉系统正迅速成为主动汽车安全系统的重要组成部分。为了在关键安全应用中有效,这些系统必须在白天和夜间情况下都产生高质量的图像,以便提供车道偏离,行人检测和碰撞检测等应用中软件分析所需的大量信息内容。在弱光情况下捕获高质量图像的挑战在于信噪比大大降低,这可能导致噪声成为捕获图像中的主要因素,从而使这些安全系统在夜间效率较低。已经进行了研究以开发一种系统的方法来表征响应电噪声的图像传感器的性能,以改善低照度场景下汽车相机的设计和性能。已经建立了图像行噪声的根本原因,并且已经设计了用于确定图像中行噪声的大小的数学算法。已经开发出一种自动表征方法,以允许响应于图像传感器电源上的大电噪声频谱进行性能表征。根据研究成果,还提出了各种针对弱光应用提高图像传感器性能的策略。
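The paper's row-noise algorithm is not reproduced here; as a purely hypothetical illustration, one way to quantify row noise is to measure each row's mean deviation from a smoothed estimate of the frame and report the RMS of those offsets.

```python
import numpy as np

def row_noise_rms(img, win=9):
    smooth = np.apply_along_axis(                 # column-wise running mean
        lambda c: np.convolve(c, np.ones(win) / win, mode="same"), 0, img)
    row_offsets = (img - smooth).mean(axis=1)     # per-row residual offset
    return np.sqrt(np.mean(row_offsets ** 2))

# Synthetic frame with an injected sinusoidal row disturbance:
frame = np.random.randn(480, 640) + 5 * np.sin(np.arange(480))[:, None]
print(f"row-noise RMS: {row_noise_rms(frame):.2f}")
```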
107. Deep Metric Learning-based Image Retrieval System for Chest Radiograph and its Clinical Applications in COVID-19 [PDF] 返回目录
Aoxiao Zhong, Xiang Li, Dufan Wu, Hui Ren, Kyungsang Kim, Younggon Kim, Varun Buch, Nir Neumark, Bernardo Bizzo, Won Young Tak, Soo Young Park, Yu Rim Lee, Min Kyu Kang, Jung Gil Park, Byung Seok Kim, Woo Jin Chung, Ning Guo, Ittai Dayan, Mannudeep K. Kalra, Quanzheng Li
Abstract: In recent years, deep learning-based image analysis methods have been widely applied in computer-aided detection, diagnosis, and prognosis, and have shown their value during the public health crisis of the novel coronavirus disease 2019 (COVID-19) pandemic. The chest radiograph (CXR) has been playing a crucial role in COVID-19 patient triaging, diagnosing, and monitoring, particularly in the United States. Considering the mixed and unspecific signals in CXR, an image retrieval model of CXR that provides both similar images and associated clinical information can be more clinically meaningful than a direct image diagnostic model. In this work we develop a novel CXR image retrieval model based on deep metric learning. Unlike traditional diagnostic models, which aim at learning the direct mapping from images to labels, the proposed model aims at learning an optimized embedding space of images, where images with the same labels and similar contents are pulled together. It utilizes a multi-similarity loss with a hard-mining sampling strategy and an attention mechanism to learn the optimized embedding space, and provides images similar to the query image. The model is trained and validated on an international multi-site COVID-19 dataset collected from 3 different sources. Experimental results on COVID-19 image retrieval and diagnosis tasks show that the proposed model can serve as a robust solution for CXR analysis and patient management for COVID-19. The model is also tested for its transferability on a different clinical decision support task, where the pre-trained model is applied to extract image features from a new dataset without any further training. These results demonstrate that our deep metric learning based image retrieval model is highly efficient in CXR retrieval, diagnosis, and prognosis, and thus has great clinical value for the treatment and management of COVID-19 patients.
摘要:近年来,基于深度学习的图像分析方法已广泛应用于计算机辅助检测,诊断和预后,并已证明其在新型冠状病毒病2019(COVID-19)大流行的公共卫生危机中具有价值。胸部X光片(CXR)在COVID-19患者的分类,诊断和监测中一直发挥着至关重要的作用,尤其是在美国。考虑到CXR中的混合信号和非特异性信号,与直接图像诊断模型相比,提供相似图像和相关临床信息的CXR图像检索模型在临床上更有意义。在这项工作中,我们开发了一种基于深度度量学习的新颖CXR图像检索模型。与旨在学习从图像到标签的直接映射的传统诊断模型不同,所提出的模型旨在学习图像的优化嵌入空间,其中具有相同标签和相似内容的图像被拉在一起。它利用多相似度损失,难例挖掘采样策略和注意力机制来学习优化的嵌入空间,并为查询图像提供相似的图像。该模型在从3个不同来源收集的国际多站点COVID-19数据集上进行了训练和验证。COVID-19图像检索和诊断任务的实验结果表明,该模型可以为COVID-19的CXR分析和患者管理提供强大的解决方案。还对该模型在不同的临床决策支持任务上的可移植性进行了测试,其中将预先训练的模型应用于无需任何进一步训练即可从新数据集中提取图像特征的情况。这些结果表明我们基于深度度量学习的图像检索模型在CXR检索,诊断和预后方面非常有效,因此对于COVID-19患者的治疗和管理具有重要的临床价值。
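A sketch of the multi-similarity loss with hard-pair mining (following the general form of Wang et al., 2019); the hyperparameters are common defaults and not necessarily those used in this paper.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(emb, labels, alpha=2.0, beta=50.0, lam=0.5, eps=0.1):
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.t()                            # cosine similarity matrix
    losses = []
    for i in range(len(labels)):
        pos = labels.eq(labels[i]); pos[i] = False
        neg = labels.ne(labels[i])
        if not pos.any() or not neg.any():
            continue
        s_pos, s_neg = sim[i][pos], sim[i][neg]
        hard_pos = s_pos[s_pos - eps < s_neg.max()]   # keep informative positives
        hard_neg = s_neg[s_neg + eps > s_pos.min()]   # keep informative negatives
        li = sim.new_zeros(())
        if hard_pos.numel():
            li = li + torch.log1p(torch.exp(-alpha * (hard_pos - lam)).sum()) / alpha
        if hard_neg.numel():
            li = li + torch.log1p(torch.exp(beta * (hard_neg - lam)).sum()) / beta
        losses.append(li)
    return torch.stack(losses).mean()

emb = torch.randn(16, 128, requires_grad=True)     # embeddings from the CNN
labels = torch.randint(0, 4, (16,))
multi_similarity_loss(emb, labels).backward()
```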
108. Noise2Kernel: Adaptive Self-Supervised Blind Denoising using a Dilated Convolutional Kernel Architecture [PDF] 返回目录
Kanggeun Lee, Won-Ki Jeong
Abstract: With recent advances in unsupervised learning, efficient training of a deep network for image denoising without pairs of noisy and clean images has become feasible. However, most current unsupervised denoising methods are built on the assumption of zero-mean noise under the signal-independent condition. This assumption causes blind denoising techniques to suffer brightness-shifting problems on images that are heavily corrupted by extreme noise such as salt-and-pepper noise. Moreover, most blind denoising methods require a random masking scheme for training to ensure the invariance of the denoising process. In this paper, we propose a dilated convolutional network that satisfies an invariant property, allowing efficient kernel-based training without random masking. We also propose an adaptive self-supervision loss to circumvent the requirement of the zero-mean constraint, which is specifically effective in removing salt-and-pepper or hybrid noise where prior knowledge of the noise statistics is not readily available. We demonstrate the efficacy of the proposed method by comparing it with state-of-the-art denoising methods using various examples.
摘要:随着无监督学习的最新进展的出现,一种有效的训练深度网络的图像去噪而没有成对的噪声和干净图像的方法已经变得可行。然而,大多数当前的无监督降噪方法都是建立在独立于信号的情况下零均值噪声的假设之下的。这种假设导致盲降噪技术在图像上出现亮度偏移问题,而这些图像由于诸如盐和胡椒噪声之类的极端噪声而大大受损。而且,大多数盲式去噪方法都需要用于训练的随机掩蔽方案,以确保去噪过程的不变性。在本文中,我们提出了一个满足不变性质的膨胀卷积网络,可以在不进行随机掩码的情况下进行有效的基于核的训练。我们还提出了一种自适应的自我监督损失来规避零均值约束的要求,这种方法在消除不易获得噪声统计数据先验知识的椒盐或混合噪声时特别有效。我们通过使用各种示例将其与最新的去噪方法进行比较,证明了该方法的有效性。
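As an illustration of why a masked, dilated kernel avoids random masking: a convolution whose center tap is zeroed never sees the pixel it predicts, so training with the noisy image as its own target cannot collapse to the identity. This is a stand-in sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterMaskedConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight.clone()
        w[:, :, self.kernel_size[0] // 2, self.kernel_size[1] // 2] = 0  # blind center
        return F.conv2d(x, w, self.bias, self.stride, self.padding, self.dilation)

net = nn.Sequential(
    CenterMaskedConv2d(1, 16, 3, padding=2, dilation=2),  # dilated, blind-spot
    nn.ReLU(),
    nn.Conv2d(16, 1, 1))

noisy = torch.rand(1, 1, 64, 64)
loss = ((net(noisy) - noisy) ** 2).mean()   # self-supervised: target = input
loss.backward()
```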
109. Efficient Kernel based Matched Filter Approach for Segmentation of Retinal Blood Vessels [PDF] 返回目录
Sushil Kumar Saroj, Vikas Ratna, Rakesh Kumar, Nagendra Pratap Singh
Abstract: The structure of retinal blood vessels contains information about diseases such as obesity, diabetes, hypertension, and glaucoma. This information is very useful in the identification and treatment of these fatal diseases. To obtain this information, the retinal vessels need to be segmented. Many kernel-based methods have been proposed for segmenting retinal vessels, but their kernels do not match the vessel profile well, which causes poor performance. To overcome this, a new and efficient kernel-based matched filter approach has been proposed. The new matched filter is used to generate the matched filter response (MFR) image. We apply Otsu's thresholding method to the obtained MFR image to extract the vessels. We have conducted extensive experiments to choose the best parameter values for the proposed matched filter kernel. The proposed approach has been examined and validated on two publicly available datasets, DRIVE and STARE. It achieves specificities of 98.50% and 98.23% and accuracies of 95.77% and 95.13% on the DRIVE and STARE datasets, respectively. The obtained results confirm that the proposed method performs better than others. The improved performance is due to the proposed kernel, which matches the retinal blood vessel profile more accurately.
摘要:视网膜血管结构包含有关肥胖,糖尿病,高血压和青光眼等疾病的信息。该信息对于识别和治疗这些致命疾病非常有用。为了获得该信息,需要分割这些视网膜血管。已经给出了许多基于核的方法来分割视网膜血管,但是它们的核不适用于血管轮廓而导致较差的性能。为了克服这个问题,已经提出了一种新的,有效的基于核的匹配滤波器方法。新的匹配滤波器用于生成匹配滤波器响应(MFR)图像。我们对获得的MFR图像应用了Otsu阈值化方法来提取血管。我们进行了广泛的实验,为拟议的匹配滤波器内核选择最佳参数值。所提出的方法已经在两个在线可用的DRIVE和STARE数据集上进行了检查和验证。该方法对DRIVE和STARE数据集的特异性分别为98.50%,98.23%和准确度95.77%,95.13%。所得结果证实了该方法具有比其他方法更好的性能。性能提高背后的原因是由于提出了合适的内核,可以更准确地匹配视网膜血管轮廓。
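A sketch of the matched-filter-plus-Otsu pipeline; the paper's contribution is its kernel design, so the plain zero-mean Gaussian profile below is only a stand-in.

```python
import numpy as np
from scipy.ndimage import convolve, rotate
from skimage.filters import threshold_otsu

def matched_kernel(length=9, sigma=2.0):
    x = np.arange(length) - length // 2
    profile = -np.exp(-x**2 / (2 * sigma**2))   # dark vessel on bright background
    profile -= profile.mean()                   # zero-mean, as matched filters require
    return np.tile(profile, (length, 1))        # extend along the vessel direction

def segment_vessels(img):
    k = matched_kernel()
    mfr = np.max([convolve(img, rotate(k, a, reshape=False))
                  for a in range(0, 180, 15)], axis=0)   # best orientation response
    return mfr > threshold_otsu(mfr)            # Otsu threshold on the MFR image

mask = segment_vessels(np.random.rand(128, 128))
print(mask.mean())
```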
110. Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology [PDF] 返回目录
Olivier Dehaene, Axel Camara, Olivier Moindrot, Axel de Lavergne, Pierre Courtiol
Abstract: One of the biggest challenges for applying machine learning to histopathology is weak supervision: whole-slide images have billions of pixels yet often only one global label. The state of the art therefore relies on strongly-supervised model training using additional local annotations from domain experts. However, in the absence of detailed annotations, most weakly-supervised approaches depend on a frozen feature extractor pre-trained on ImageNet. We identify this as a key weakness and propose to train an in-domain feature extractor on histology images using MoCo v2, a recent self-supervised learning algorithm. Experimental results on Camelyon16 and TCGA show that the proposed extractor greatly outperforms its ImageNet counterpart. In particular, our results improve the weakly-supervised state of the art on Camelyon16 from 91.4% to 98.7% AUC, thereby closing the gap with strongly-supervised models that reach 99.3% AUC. Through these experiments, we demonstrate that feature extractors trained via self-supervised learning can act as drop-in replacements to significantly improve existing machine learning techniques in histology. Lastly, we show that the learned embedding space exhibits biologically meaningful separation of tissue structures.
摘要:将机器学习应用于组织病理学的最大挑战之一是监管薄弱:整张幻灯片的图像有数十亿像素,但通常只有一个全局标签。因此,现有技术依赖于使用领域专家提供的其他本地注释的强监督模型训练。但是,在没有详细注释的情况下,大多数弱监督方法都依赖于在ImageNet上预先训练的冻结特征提取器。我们将其确定为主要弱点,并建议使用MoCo v2(一种最新的自我监督学习算法)在组织学图像上训练域内特征提取器。在Camelyon16和TCGA上的实验结果表明,所提出的提取器大大优于ImageNet的提取器。特别是,我们的结果将Camelyon16上的弱监督技术水平从91.4%提高到了98.7%AUC,从而缩小了达到99.3%AUC的强监督模型的差距。通过这些实验,我们证明了通过自我监督学习训练的特征提取器可以作为直接替代,以显着改善组织学中现有的机器学习技术。最后,我们表明,学习到的嵌入空间表现出生物学上有意义的组织结构分离。
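A sketch of the drop-in replacement use described above: tile features come from a frozen encoder (torchvision's ResNet-50 here as a placeholder for the MoCo v2 in-domain weights), and only a lightweight weakly-supervised head is trained on aggregated tile features per slide.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

encoder = resnet50()                   # load in-domain MoCo v2 weights here
encoder.fc = nn.Identity()             # expose the 2048-d features
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False            # frozen feature extractor

head = nn.Linear(2048, 1)              # trainable slide-level classifier

tiles = torch.randn(32, 3, 224, 224)   # tiles from one whole-slide image
with torch.no_grad():
    feats = encoder(tiles)             # (32, 2048)
slide_logit = head(feats.max(dim=0).values)   # simple max-pooling aggregation
```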
111. DIPPAS: A Deep Image Prior PRNU Anonymization Scheme [PDF] 返回目录
Francesco Picetti, Sara Mandelli, Paolo Bestagini, Vincenzo Lipari, Stefano Tubaro
Abstract: Source device identification is an important topic in image forensics since it allows tracing back the origin of an image. Its forensic counterpart is source device anonymization, that is, masking any trace on the image that could be useful for identifying the source device. A typical trace exploited for source device identification is the Photo Response Non-Uniformity (PRNU), a noise pattern left by the device on the acquired images. In this paper, we devise a methodology for suppressing such a trace from natural images without significant impact on image quality. Specifically, we turn PRNU anonymization into an optimization problem in a Deep Image Prior (DIP) framework. In a nutshell, a Convolutional Neural Network (CNN) acts as a generator and returns an image that is anonymized with respect to the source PRNU while still maintaining high visual quality. In contrast to widely adopted deep learning paradigms, our proposed CNN is not trained on a set of input-target pairs of images. Instead, it is optimized to reconstruct the PRNU-free image from the original image under analysis itself. This makes the approach particularly suitable in scenarios where large heterogeneous databases are analyzed and prevents any problem due to lack of generalization. Through numerical examples on publicly available datasets, we prove our methodology to be effective compared to state-of-the-art techniques.
摘要:源设备识别是图像取证中的重要主题,因为它可以追溯图像的来源。它的取证对手是源设备匿名化,也就是说,掩盖图像上对识别源设备有用的任何迹线。用于源设备识别的典型迹线是光响应非均匀性(PRNU),即设备在所采集图像上留下的噪声图案。在本文中,我们设计了一种方法来抑制自然图像中的这种痕迹,而对图像质量没有重大影响。具体来说,我们将PRNU匿名化转变为Deep Image Prior(DIP)框架中的优化问题。简而言之,卷积神经网络(CNN)充当生成器并返回相对于源PRNU匿名化的图像,仍然保持较高的视觉质量。关于广泛采用的深度学习范例,我们提出的CNN并未在一组输入目标图像对上进行训练。取而代之的是,对它进行了优化以从分析后的原始图像中重建无PRNU图像。这使得该方法特别适用于分析大型异构数据库的场景,并可以防止由于缺乏通用性而引起的任何问题。通过公开数据集上的数值示例,我们证明了我们的方法论与最新技术相比是有效的。
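A minimal deep-image-prior loop to make the optimization framing concrete; the PRNU-specific term in the paper's objective is reduced here to plain reconstruction, so this is only the DIP ingredient, not DIPPAS itself.

```python
import torch
import torch.nn as nn

g = nn.Sequential(                         # small CNN generator g(z)
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
z = torch.randn(1, 8, 64, 64)              # fixed random input
target = torch.rand(1, 1, 64, 64)          # image under analysis

opt = torch.optim.Adam(g.parameters(), lr=1e-3)
for step in range(200):                    # stopping time controls the trade-off
    opt.zero_grad()
    loss = ((g(z) - target) ** 2).mean()   # DIPPAS adds a PRNU-suppression term
    loss.backward()
    opt.step()
```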
112. Robustness Investigation on Deep Learning CT Reconstruction for Real-Time Dose Optimization [PDF] 返回目录
Chang Liu, Yixing Huang, Joscha Maier, Laura Klein, Marc Kachelrieß, Andreas Maier
Abstract: In computed tomography (CT), automatic exposure control (AEC) is frequently used to reduce radiation dose exposure to patients. For organ-specific AEC, a preliminary CT reconstruction is necessary to estimate organ shapes for dose optimization, where only a few projections are allowed for real-time reconstruction. In this work, we investigate the performance of automated transform by manifold approximation (AUTOMAP) in such applications. For proof of concept, we first investigate its performance on the MNIST dataset, where the dataset containing all 10 digits is randomly split into a training set and a test set. We train the AUTOMAP model for image reconstruction directly from 2 projections or 4 projections. The test results demonstrate that AUTOMAP is able to reconstruct most digits well, with false rates of 1.6% and 6.8%, respectively. In our subsequent experiment, the MNIST dataset is split such that the training set contains only 9 digits while the test set contains only the excluded digit, for instance "2". In the test results, the digit "2" is falsely predicted as "3" or "5" when using 2 projections for reconstruction, reaching a false rate of 94.4%. For application to medical images, AUTOMAP is also trained on patients' CT images. The test images reach an average root-mean-square error of 290 HU. Although the coarse body outlines are well reconstructed, some organs are misshaped.
摘要:在计算机断层扫描(CT)中,自动曝光控制(AEC)经常用于减少对患者的辐射剂量暴露。对于特定于器官的AEC,必须进行初步的CT重建以估计器官形状以进行剂量优化,在这种情况下,仅允许进行一些投影以进行实时重建。在这项工作中,我们研究了在此类应用中通过流形近似(AUTOMAP)进行自动变换的性能。为了进行概念验证,我们首先研究其在MNIST数据集上的性能,其中将包含所有10个数字的数据集随机分为训练集和测试集。我们训练AUTOMAP模型直接从2个投影或4个投影重建图像。测试结果表明,AUTOMAP能够很好地重构大多数数字,错误率分别为1.6%和6.8%。在我们的后续实验中,MNIST数据集的拆分方式是,训练集仅包含9个数字,而测试集仅包含被排除的数字,例如“2”。在测试结果中,当使用2个投影进行重建时,数字“2”被错误地预测为“3”或“5”,达到94.4%的错误率。为了在医学图像中应用,AUTOMAP也在患者的CT图像上进行了训练。测试图像的平均均方根误差为290 HU。尽管粗略的身体轮廓得到了很好的重建,但某些器官的形状仍然不正确。
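An AUTOMAP-style network for the 2-projection MNIST setting can be sketched as fully connected layers mapping projection data into the image domain followed by convolutional refinement; layer sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

n_views, n_dets, img = 2, 28, 28           # 2 projections of a 28x28 image
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(n_views * n_dets, img * img), nn.Tanh(),
    nn.Linear(img * img, img * img), nn.Tanh(),
    nn.Unflatten(1, (1, img, img)),
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))

sinogram = torch.randn(4, n_views, n_dets)  # batch of 2-view projection data
recon = model(sinogram)                     # (4, 1, 28, 28) reconstructions
```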
113. Backpropagating Linearly Improves Transferability of Adversarial Examples [PDF] 返回目录
Yiwen Guo, Qizhang Li, Hao Chen
Abstract: The vulnerability of deep neural networks (DNNs) to adversarial examples has drawn great attention from the community. In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. We revisit a not-so-new but definitely noteworthy hypothesis of Goodfellow et al. and disclose that the transferability can be enhanced by improving the linearity of DNNs in an appropriate manner. We introduce linear backpropagation (LinBP), a method that performs backpropagation in a more linear fashion using off-the-shelf attacks that exploit gradients. More specifically, it calculates the forward pass as normal but backpropagates the loss as if some nonlinear activations were not encountered in the forward pass. Experimental results demonstrate that this simple yet effective method obviously outperforms the current state of the art in crafting transferable adversarial examples on CIFAR-10 and ImageNet, leading to more effective attacks on a variety of DNNs.
摘要:深度神经网络(DNN)容易受到对抗性示例的攻击已引起社区的极大关注。在本文中,我们研究了此类示例的可移植性,从而为DNN上的许多黑盒攻击奠定了基础。我们回顾了Goodfellow等人的一个不太新但绝对值得关注的假设,并公开了通过以适当的方式改善DNN的线性可以增强可传递性。我们介绍了线性反向传播(LinBP),这是一种利用利用梯度的现成攻击以更线性的方式执行反向传播的方法。更具体地说,它正常计算正向,但反向传播损耗,就好像在正向传递中未遇到某些非线性激活一样。实验结果表明,在CIFAR-10和ImageNet上制作可转移对抗性示例时,这种简单而有效的方法显然优于当前的最新技术,从而可以对各种DNN进行更有效的攻击。
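The "backpropagate as if the nonlinearity were absent" idea maps directly onto a custom autograd function: ReLU in the forward pass, identity in the backward pass. A minimal sketch:

```python
import torch

class LinBPReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clamp(min=0)        # standard ReLU forward

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out              # identity backward: gradients pass linearly

x = torch.randn(5, requires_grad=True)
LinBPReLU.apply(x).sum().backward()
print(x.grad)                        # all ones, even where x < 0
```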
114. An Approach to Intelligent Pneumonia Detection and Integration [PDF] 返回目录
Bonaventure F. P. Dossou, Alena Iureva, Sayali R. Rajhans, Vamsi S. Pidikiti
Abstract: Each year, over 2.5 million people, most of them in developed countries, die from pneumonia [1]. Since many studies have proved that pneumonia is successfully treatable when timely and correctly diagnosed, many diagnostic aids have been developed, with AI-based methods achieving high accuracies [2]. However, the usage of AI in pneumonia detection is currently limited, in particular due to challenges in generalizing locally achieved results. In this report, we propose a roadmap for creating and integrating a system that attempts to solve this challenge. We also address various technical, legal, ethical, and logistical issues, with a blueprint of possible solutions.
摘要:每年有超过250万人死于肺炎[1],其中大多数在发达国家。 由于许多研究证明,及时,正确地诊断出肺炎是可以成功治疗的,因此已经开发出了许多诊断辅助手段,其中基于AI的方法具有很高的准确性[2]。 然而,当前,由于在概括局部获得的结果方面的挑战,AI在肺炎检测中的使用受到限制。 在本报告中,我们提出了一个路线图,用于创建和集成尝试解决此挑战的系统。 我们还将解决各种技术,法律,道德和后勤问题,并提供可能的解决方案蓝图。
115. Multi-Instrumentalist Net: Unsupervised Generation of Music from Body Movements [PDF] 返回目录
Kun Su, Xiulong Liu, Eli Shlizerman
Abstract: We propose a novel system that takes as input the body movements of a musician playing a musical instrument and generates music in an unsupervised setting. Learning to generate multi-instrumental music from videos without labeling the instruments is a challenging problem. To achieve the transformation, we built a pipeline named 'Multi-instrumentalistNet' (MI Net). At its base, the pipeline learns a discrete latent representation of various instruments' music from the log-spectrogram using a Vector Quantized Variational Autoencoder (VQ-VAE) with multi-band residual blocks. The pipeline is then trained along with an autoregressive prior conditioned on the musician's body keypoint movements encoded by a recurrent neural network. Joint training of the prior with the body movements encoder succeeds in disentangling the music into latent features indicating the musical components and the instrumental features. The latent space results in distributions that are clustered into distinct instruments, from which new music can be generated. Furthermore, the VQ-VAE architecture supports detailed music generation with additional conditioning. We show that MIDI can further condition the latent space such that the pipeline will generate the exact content of the music being played by the instrument in the video. We evaluate MI Net on two datasets containing videos of 13 instruments and obtain generated music of reasonable audio quality, easily associated with the corresponding instrument, and consistent with the music audio content.
摘要:我们提出了一种新颖的系统,该系统将演奏乐器的音乐家的身体动作作为输入,并在无人监督的情况下产生音乐。学会在不标记乐器的情况下从视频生成多乐器音乐是一个具有挑战性的问题。为了实现转换,我们建立了一个名为“Multi-instrumentalistNet”(MI Net)的管道。在其基础上,管道使用带有多频带残差块的矢量量化变分自动编码器(VQ-VAE)从对数声谱图中学习了各种乐器音乐的离散潜在表示。然后训练管道,并根据以递归神经网络编码的音乐家的身体关键点运动为条件的自回归先验。先验与人体运动编码器的联合训练成功地将音乐分解为潜在特征,指示音乐成分和乐器特征。潜在空间中的分布会聚类为不同的乐器,从而可以生成新音乐。此外,VQ-VAE架构通过附加的条件支持详细的音乐生成。我们证明了MIDI可以进一步调节潜在空间,以便管道可以生成视频中乐器正在播放的音乐的确切内容。我们在包含13种乐器视频的两个数据集上评估MI Net,并获得生成的音频质量合理,易于与相应乐器关联并且与音乐音频内容一致的音乐。
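The VQ-VAE bottleneck at the core of the pipeline snaps each latent vector to its nearest codebook entry and uses a straight-through estimator so gradients still reach the encoder; a minimal sketch (codebook size and dimensions are assumptions):

```python
import torch

def vector_quantize(z, codebook):
    # z: (N, D) encoder outputs; codebook: (K, D) learned code vectors
    idx = torch.cdist(z, codebook).argmin(dim=1)   # nearest code per vector
    z_q = codebook[idx]
    return z + (z_q - z).detach(), idx             # straight-through gradient

codebook = torch.randn(512, 64)
z = torch.randn(10, 64, requires_grad=True)
z_q, idx = vector_quantize(z, codebook)
z_q.sum().backward()                               # gradients flow to z unchanged
```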
116. Multivariate Density Estimation with Deep Neural Mixture Models [PDF] 返回目录
Edmondo Trentin
Abstract: Albeit worryingly underrated in the recent literature on machine learning in general (and on deep learning in particular), multivariate density estimation is a fundamental task in many applications, at least implicitly, and is still an open issue. With a few exceptions, deep neural networks (DNNs) have seldom been applied to density estimation, mostly due to the unsupervised nature of the estimation task, and (especially) due to the need for constrained training algorithms that end up realizing proper probabilistic models satisfying Kolmogorov's axioms. Moreover, in spite of the well-known improvement in modeling capabilities yielded by mixture models over plain single-density statistical estimators, no proper mixtures of multivariate DNN-based component densities have been investigated so far. The paper fills this gap by extending our previous work on Neural Mixture Densities (NMMs) to multivariate DNN mixtures. A maximum-likelihood (ML) algorithm for estimating Deep NMMs (DNMMs) is presented, which satisfies numerically a combination of hard and soft constraints aimed at ensuring the satisfaction of Kolmogorov's axioms. The class of probability density functions that can be modeled to any degree of precision via DNMMs is formally defined. A procedure for the automatic selection of the DNMM architecture, as well as of the hyperparameters for its ML training algorithm, is presented (exploiting the probabilistic nature of the DNMM). Experimental results on univariate and multivariate data are reported, corroborating the effectiveness of the approach and its superiority over the most popular statistical estimation techniques.
摘要:尽管最近关于一般机器学习(尤其是深度学习)的文献令人担忧地低估了它,但多元密度估计是许多应用程序中的一项基本任务,至少是隐式的,并且仍然是一个未解决的问题。除少数例外,深度神经网络(DNN)很少用于密度估计,这主要是由于估计任务的无监督性质,并且(尤其是)由于需要受限的训练算法而最终实现了适当的概率模型,满足柯尔莫哥洛夫的公理。此外,尽管混合模型相对于普通单密度统计估计器在建模能力方面的众所周知的改进,到目前为止,尚未研究基于多元DNN的组件密度的适当混合物。本文通过将我们先前在神经混合物密度(NMM)上的工作扩展到多元DNN混合物来填补这一空白。提出了用于估计深度NMM(DNMM)的最大似然(ML)算法,该算法在数值上满足硬约束和软约束的组合,旨在确保满足Kolmogorov公理的要求。可以正式定义可以通过DNMM建模到任何精确度的概率密度函数的类别。提出了自动选择DNMM体系结构及其ML训练算法的超参数的过程(利用了DNMM的概率性质)。报告了单变量和多变量数据的实验结果,证实了该方法的有效性及其相对于最流行的统计估计技术的优越性。
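To make the constraint satisfaction concrete: parameterizing the mixing weights through a softmax keeps them nonnegative and summing to one by construction, one of the conditions for a proper density. In the sketch below, Gaussian components stand in for the paper's DNN-based component densities.

```python
import torch

logits = torch.zeros(3, requires_grad=True)   # mixing-weight parameters
means = torch.randn(3, requires_grad=True)    # component parameters
x = torch.randn(100)                          # observed univariate data

log_pi = torch.log_softmax(logits, dim=0)     # valid mixing weights by construction
comp = torch.distributions.Normal(means, 1.0)
log_px = torch.logsumexp(log_pi[None, :] + comp.log_prob(x[:, None]), dim=1)
nll = -log_px.mean()                          # minimize NLL = maximize likelihood
nll.backward()
```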
117. Spatio-Temporal Graph Scattering Transform [PDF] 返回目录
Chao Pan, Siheng Chen, Antonio Ortega
Abstract: Although spatio-temporal graph neural networks have achieved great empirical success in handling multiple correlated time series, they may be impractical in some real-world scenarios due to a lack of sufficient high-quality training data. Furthermore, spatio-temporal graph neural networks lack theoretical interpretation. To address these issues, we put forth a novel, mathematically designed framework to analyze spatio-temporal data. Our proposed spatio-temporal graph scattering transform (ST-GST) extends traditional scattering transforms to the spatio-temporal domain. It performs iterative applications of spatio-temporal graph wavelets and nonlinear activation functions, which can be viewed as a forward pass of spatio-temporal graph convolutional networks without training. Since all the filter coefficients in ST-GST are mathematically designed, it is promising for real-world scenarios with limited training data, and it also allows for a theoretical analysis, which shows that the proposed ST-GST is stable to small perturbations of input signals and structures. Finally, our experiments show that i) ST-GST outperforms spatio-temporal graph convolutional networks by 35% in accuracy on the MSR Action3D dataset; ii) it is better and computationally more efficient to design the transform based on separable spatio-temporal graphs than on joint ones; and iii) the nonlinearity in ST-GST is critical to empirical performance.
摘要:尽管时空图神经网络在处理多个相关时间序列方面取得了巨大的经验成功,但是由于缺乏足够的高质量训练数据,它们在某些实际场景中可能不切实际。此外,时空图神经网络缺乏理论解释。为了解决这些问题,我们提出了一种新颖的数学设计框架来分析时空数据。我们提出的时空图散射变换(ST-GST)将传统的散射变换扩展到时空域。它执行时空图小波和非线性激活函数的迭代应用,可以将其视为时空图卷积网络的前向传递而无需训练。由于ST-GST中的所有滤波器系数都是经过数学设计的,因此对于训练数据有限的现实情况很有希望,并且还可以进行理论分析,这表明所提出的ST-GST对输入的微小扰动是稳定的信号和结构。最后,我们的实验表明:i)ST-GST的时空图卷积网络的性能比MSR Action3D数据集的精度提高了35%; ii)基于可分离的时空图比联合图更好,更有效地设计变换; iii)ST-GST中的非线性对于经验性能至关重要。
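One scattering layer can be sketched with mathematically designed diffusion wavelets and the modulus nonlinearity, with no trained parameters; the ring graph and wavelet family below are simple stand-ins for the paper's spatio-temporal construction.

```python
import numpy as np

n = 8
A = np.zeros((n, n))
for i in range(n):                                    # ring graph as a toy example
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
T = 0.5 * (np.eye(n) + A / A.sum(1, keepdims=True))   # lazy diffusion operator

def scattering_layer(T, x, J=3):
    outs, M_prev, M = [], np.eye(len(T)), T.copy()
    for _ in range(J):
        outs.append(np.abs((M_prev - M) @ x))  # wavelet filtering, then modulus
        M_prev, M = M, M @ M                   # dyadic scales: T, T^2, T^4, ...
    return outs

x = np.random.randn(n)                  # one graph signal (e.g., one time frame)
coeffs = scattering_layer(T, x)         # iterate layers for higher-order coefficients
```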
118. An Uncertainty-Driven GCN Refinement Strategy for Organ Segmentation [PDF] 返回目录
Roger D. Soberanis-Mukul, Nassir Navab, Shadi Albarqouni
Abstract: Organ segmentation in CT volumes is an important pre-processing step in many computer assisted intervention and diagnosis methods. In recent years, convolutional neural networks have dominated the state of the art in this task. However, since this problem presents a challenging environment due to high variability in the organ's shape and similarity between tissues, the generation of false negative and false positive regions in the output segmentation is a common issue. Recent works have shown that the uncertainty analysis of the model can provide us with useful information about potential errors in the segmentation. In this context, we propose a segmentation refinement method based on uncertainty analysis and graph convolutional networks. We employ the uncertainty levels of the convolutional network in a particular input volume to formulate a semi-supervised graph learning problem that is solved by training a graph convolutional network. To test our method, we refine the initial output of a 2D U-Net. We validate our framework with the NIH pancreas dataset and the spleen dataset of the medical segmentation decathlon. We show that our method outperforms the state-of-the-art CRF refinement method by improving the Dice score by 1% for the pancreas and 2% for the spleen, with respect to the original U-Net's prediction. Finally, we perform a sensitivity analysis on the parameters of our proposal and discuss the applicability to other CNN architectures, the results, and current limitations of the model for future work in this research direction. For reproducibility purposes, we make our code publicly available at this https URL.
摘要:CT体积中的器官分割是许多计算机辅助干预和诊断方法中重要的预处理步骤。近年来,卷积神经网络在该任务中占据了最先进的地位。然而,由于器官形状的高度可变性以及组织之间的相似性,该问题颇具挑战性,输出分割中出现假阴性和假阳性区域是常见问题。最近的工作表明,对模型的不确定性分析可以提供有关分割中潜在错误的有用信息。在此背景下,我们提出了一种基于不确定性分析和图卷积网络的分割细化方法。我们利用卷积网络在特定输入体积上的不确定性水平来构造一个半监督图学习问题,并通过训练图卷积网络来求解。为了测试我们的方法,我们细化了2D U-Net的初始输出,并使用NIH胰腺数据集和医学分割十项全能(Medical Segmentation Decathlon)的脾脏数据集验证了我们的框架。结果显示,相对于原始U-Net的预测,我们的方法将胰腺的Dice得分提高1%,脾脏提高2%,优于最新的CRF细化方法。最后,我们对所提方法的参数进行了敏感性分析,并讨论了其对其他CNN架构的适用性、实验结果以及模型当前的局限性,以指导该研究方向的后续工作。为了可复现性,我们在此https URL上公开了代码。
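The following sketch illustrates the refinement recipe under our own simplifications: Monte-Carlo dropout over the CNN yields per-voxel uncertainty, confident voxels become pseudo-labeled nodes, and a small two-layer GCN is trained semi-supervised on a precomputed normalized adjacency. The `model`, node features, and graph construction are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

def mc_uncertainty(model, volume, passes=10):
    # Keep dropout active at inference; variance across passes ~ uncertainty.
    model.train()
    probs = torch.stack([torch.sigmoid(model(volume)) for _ in range(passes)])
    return probs.mean(0), probs.var(0)

class TinyGCN(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.w1 = nn.Linear(d_in, d_hid)
        self.w2 = nn.Linear(d_hid, 1)
    def forward(self, A_hat, H):           # A_hat: normalized dense adjacency
        H = torch.relu(A_hat @ self.w1(H))
        return (A_hat @ self.w2(H)).squeeze(-1)

def refine(A_hat, feats, pseudo_labels, confident_mask, epochs=200):
    # Semi-supervised: the loss touches confident nodes only, but the GCN
    # propagates their evidence to uncertain nodes through A_hat.
    gcn = TinyGCN(feats.shape[1], 32)
    bce = nn.BCEWithLogitsLoss()
    opt = torch.optim.Adam(gcn.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        logits = gcn(A_hat, feats)
        loss = bce(logits[confident_mask], pseudo_labels[confident_mask])
        loss.backward()
        opt.step()
    return torch.sigmoid(gcn(A_hat, feats))  # refined labels for all nodes

# Tiny synthetic demo: 50 nodes, 8-d features, self-loops-only adjacency.
A_hat = torch.eye(50)
feats = torch.randn(50, 8)
pseudo = (feats[:, 0] > 0).float()          # stand-in CNN pseudo-labels
confident = torch.rand(50) > 0.5            # stand-in low-uncertainty mask
print(refine(A_hat, feats, pseudo, confident)[:5])
```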
119. Global Unifying Intrinsic Calibration for Spinning and Solid-State LiDARs [PDF] 返回目录
Jiunn-Kai Huang, Chenxi Feng, Madhav Achar, Maani Ghaffari, Jessy W. Grizzle
Abstract: Sensor calibration, which can be intrinsic or extrinsic, is an essential step to achieve the measurement accuracy required for modern perception and navigation systems deployed on autonomous robots. To date, intrinsic calibration models for spinning LiDARs have been hypothesized based on their physical mechanisms, resulting in anywhere from three to ten parameters to be estimated from data, while no phenomenological models have yet been proposed for solid-state LiDARs. Instead of going down that road, we propose to abstract away from the physics of a LiDAR type (spinning vs solid-state, for example), and focus on the spatial geometry of the point cloud generated by the sensor. By modeling the calibration parameters as an element of a special matrix Lie Group, we achieve a unifying view of calibration for different types of LiDARs. We further prove mathematically that the proposed model is well-constrained (has a unique answer) given four appropriately orientated targets. The proof provides a guideline for target positioning in the form of a tetrahedron. Moreover, an existing semidefinite programming global solver for SE(3) can be modified to efficiently compute the optimal calibration parameters. For solid state LiDARs, we illustrate how the method works in simulation. For spinning LiDARs, we show with experimental data that the proposed matrix Lie Group model performs equally well as physics-based models in terms of reducing the P2P distance, while being more robust to noise.
摘要:传感器校准(可分为内参校准与外参校准)是实现部署在自主机器人上的现代感知与导航系统所需测量精度的重要步骤。迄今为止,旋转式LiDAR的内参校准模型都是基于其物理机制假设出来的,需要从数据中估计三到十个参数,而针对固态LiDAR尚未提出任何现象学模型。我们不走这条路,而是提议抽象掉LiDAR类型(例如旋转式与固态)的物理细节,专注于传感器生成的点云的空间几何。通过将校准参数建模为一个特殊矩阵李群的元素,我们获得了针对不同类型LiDAR的统一校准视角。我们进一步从数学上证明,给定四个朝向合适的标靶,所提模型是良好约束的(具有唯一解),该证明以四面体的形式为标靶摆放提供了指导。此外,可以修改现有的SE(3)半定规划全局求解器,以高效计算最优校准参数。对于固态LiDAR,我们在仿真中演示了该方法的工作方式;对于旋转式LiDAR,实验数据表明,所提出的矩阵李群模型在减小P2P距离方面与基于物理的模型表现相当,同时对噪声更鲁棒。
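As a worked illustration of the Lie-group view, the sketch below parameterizes the calibration as the exponential of a 6-dimensional twist (using SE(3) as a stand-in for the paper's matrix Lie group) and minimizes the point-to-point (P2P) distance to target points by finite-difference gradient descent; the paper instead solves this with a modified semidefinite-programming global solver.

```python
import numpy as np
from scipy.linalg import expm

def se3_exp(xi):
    # xi = (rx, ry, rz, tx, ty, tz): twist -> 4x4 homogeneous transform.
    G = np.zeros((4, 4))
    G[:3, :3] = [[0.0, -xi[2], xi[1]],
                 [xi[2], 0.0, -xi[0]],
                 [-xi[1], xi[0], 0.0]]
    G[:3, 3] = xi[3:]
    return expm(G)

def transform(T, pts):
    return pts @ T[:3, :3].T + T[:3, 3]

def p2p_cost(xi, raw_pts, target_pts):
    diff = transform(se3_exp(xi), raw_pts) - target_pts
    return np.mean(np.sum(diff ** 2, axis=1))

def calibrate(raw_pts, target_pts, steps=500, lr=0.05, eps=1e-6):
    xi = np.zeros(6)
    for _ in range(steps):   # finite-difference gradient on the twist
        grad = np.array([
            (p2p_cost(xi + eps * e, raw_pts, target_pts)
             - p2p_cost(xi - eps * e, raw_pts, target_pts)) / (2 * eps)
            for e in np.eye(6)])
        xi -= lr * grad
    return xi

# Synthetic check: recover a known miscalibration from target correspondences.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 3))
T_true = se3_exp(np.array([0.02, -0.01, 0.03, 0.05, -0.02, 0.01]))
raw = transform(np.linalg.inv(T_true), pts)     # miscalibrated measurements
xi_hat = calibrate(raw, pts)
print(p2p_cost(np.zeros(6), raw, pts), "->", p2p_cost(xi_hat, raw, pts))
```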
120. CoEdge: Cooperative DNN Inference with Adaptive Workload Partitioning over Heterogeneous Edge Devices [PDF] 返回目录
Liekang Zeng, Xu Chen, Zhi Zhou, Lei Yang, Junshan Zhang
Abstract: Recent advances in artificial intelligence have driven increasingly intelligent applications at the network edge, such as smart home, smart factory, and smart city. To deploy computationally intensive Deep Neural Networks (DNNs) on resource-constrained edge devices, traditional approaches have relied on either offloading workload to the remote cloud or optimizing computation at the end device locally. However, the cloud-assisted approaches suffer from the unreliable and delay-significant wide-area network, and the local computing approaches are limited by the constrained computing capability. Towards high-performance edge intelligence, the cooperative execution mechanism offers a new paradigm, which has attracted growing research interest recently. In this paper, we propose CoEdge, a distributed DNN computing system that orchestrates cooperative DNN inference over heterogeneous edge devices. CoEdge utilizes available computation and communication resources at the edge and dynamically partitions the DNN inference workload adaptive to devices' computing capabilities and network conditions. Experimental evaluations based on a realistic prototype show that CoEdge outperforms status-quo approaches in saving energy while maintaining close inference latency, achieving up to 25.5%~66.9% energy reduction for four widely-adopted CNN models.
摘要:人工智能的最新进展推动了网络边缘智能应用的增长,例如智能家居、智能工厂和智能城市。为了在资源受限的边缘设备上部署计算密集型深度神经网络(DNN),传统方法要么将工作负载卸载到远程云,要么在终端设备本地优化计算。但是,云辅助方法受制于不可靠且延迟显著的广域网,而本地计算方法则受限于有限的计算能力。面向高性能边缘智能,协同执行机制提供了一种新的范式,最近引起了越来越多的研究兴趣。在本文中,我们提出了CoEdge,这是一个分布式DNN计算系统,可在异构边缘设备上编排协同DNN推理。CoEdge利用边缘可用的计算和通信资源,并根据设备的计算能力和网络条件自适应地动态划分DNN推理工作负载。基于真实原型的实验评估表明,CoEdge在推理延迟相近的情况下,节能效果优于现有方法,对四个广泛采用的CNN模型实现了25.5%~66.9%的能耗降低。
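A toy version of the adaptive partitioning idea: split the rows of a layer's input across devices in proportion to each device's effective rate, folding per-row compute time and link transfer time together. This is our own heuristic with made-up device numbers, not CoEdge's actual optimizer.

```python
# Our heuristic sketch of adaptive workload partitioning (not the paper's
# algorithm): balance estimated compute + transfer time across devices.
def partition_rows(total_rows, flops_per_row, devices):
    # devices: dicts with 'gflops' (compute), 'mbps' (link), 'bytes_per_row'.
    rates = []
    for d in devices:
        t_compute = flops_per_row / (d["gflops"] * 1e9)
        t_comm = d["bytes_per_row"] * 8 / (d["mbps"] * 1e6)
        rates.append(1.0 / (t_compute + t_comm))    # effective rows/second
    total_rate = sum(rates)
    shares = [round(total_rows * r / total_rate) for r in rates]
    shares[-1] = total_rows - sum(shares[:-1])      # absorb rounding error
    return shares

devices = [
    {"gflops": 4.0, "mbps": 100, "bytes_per_row": 224 * 3 * 4},
    {"gflops": 1.0, "mbps": 300, "bytes_per_row": 224 * 3 * 4},
    {"gflops": 8.0, "mbps": 50,  "bytes_per_row": 224 * 3 * 4},
]
print(partition_rows(224, flops_per_row=2.0e7, devices=devices))
```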
121. Learning to Reduce Defocus Blur by Realistically Modeling Dual-Pixel Data [PDF] 返回目录
Abdullah Abuolaim, Mauricio Delbracio, Damien Kelly, Michael S. Brown, Peyman Milanfar
Abstract: Recent work has shown impressive results on data-driven defocus deblurring using the two-image views available on modern dual-pixel (DP) sensors. One significant challenge in this line of research is access to DP data. Despite many cameras having DP sensors, only a limited number provide access to the low-level DP sensor images. In addition, capturing training data for defocus deblurring involves a time-consuming and tedious setup requiring the camera's aperture to be adjusted. Some cameras with DP sensors (e.g., smartphones) do not have adjustable apertures, further limiting the ability to produce the necessary training data. We address the data capture bottleneck by proposing a procedure to generate realistic DP data synthetically. Our synthesis approach mimics the optical image formation found on DP sensors and can be applied to virtual scenes rendered with standard computer software. Leveraging these realistic synthetic DP images, we introduce a new recurrent convolutional network (RCN) architecture that can improve deblurring results and is suitable for use with single-frame and multi-frame data captured by DP sensors. Finally, we show that our synthetic DP data is useful for training DNN models targeting video deblurring applications where access to DP data remains challenging.
摘要:最近的工作利用现代双像素(DP)传感器提供的双图像视图,在数据驱动的散焦去模糊方面取得了令人印象深刻的结果。这一研究方向的一个重大挑战是获取DP数据。尽管许多相机配有DP传感器,但只有少数相机提供对底层DP传感器图像的访问。此外,采集用于散焦去模糊的训练数据需要耗时且繁琐的设置,并要求调整相机的光圈;某些带有DP传感器的相机(例如智能手机)没有可调光圈,进一步限制了产生必要训练数据的能力。我们通过提出一种合成生成逼真DP数据的流程来解决这一数据采集瓶颈。我们的合成方法模仿DP传感器上的光学成像过程,可应用于使用标准计算机软件渲染的虚拟场景。利用这些逼真的合成DP图像,我们引入了一种新的循环卷积网络(RCN)架构,该架构可以改善去模糊效果,并适用于DP传感器捕获的单帧和多帧数据。最后,我们证明了我们的合成DP数据对于训练面向视频去模糊应用的DNN模型很有用,而在这些应用中获取DP数据仍然困难。
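The optics being mimicked can be sketched compactly: each dual-pixel sub-aperture integrates one half of the defocus point spread function, so the left/right DP views are the sharp image blurred with mirrored half-disc kernels. The snippet below is a heavily simplified, single-depth-plane rendering, not the paper's full synthesis pipeline.

```python
# Toy dual-pixel rendering (single depth plane, our simplification): each DP
# view blurs the image with one half of a disc-shaped defocus PSF.
import numpy as np
from scipy.ndimage import convolve

def half_disc_kernels(radius):
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    left = disc * (x <= 0)                # left sub-aperture PSF
    right = disc * (x >= 0)               # right sub-aperture PSF (mirror)
    return left / left.sum(), right / right.sum()

def render_dp_pair(sharp, defocus_radius):
    kl, kr = half_disc_kernels(defocus_radius)
    return convolve(sharp, kl), convolve(sharp, kr)

sharp = np.zeros((64, 64)); sharp[28:36, 28:36] = 1.0   # synthetic scene
left, right = render_dp_pair(sharp, defocus_radius=4)
disparity_cue = left - right              # DP defocus cue, zero when in focus
print(abs(disparity_cue).max())
```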
122. Esophageal Tumor Segmentation in CT Images using a 3D Convolutional Neural Network [PDF] 返回目录
Sahar Yousefi, Hessam Sokooti, Mohamed S. Elmahdy, Irene M. Lips, Mohammad T. Manzuri Shalmani, Roel T. Zinkstok, Frank J.W.M. Dankers, Marius Staring
Abstract: Manual or automatic delineation of the esophageal tumor in CT images is known to be very challenging. This is due to the low contrast between the tumor and adjacent tissues, the anatomical variation of the esophagus, as well as the occasional presence of foreign bodies (e.g. feeding tubes). Physicians therefore usually exploit additional knowledge such as endoscopic findings, clinical history, and additional imaging modalities like PET scans. Obtaining this additional information is time-consuming, while the results are error-prone and might lead to non-deterministic results. In this paper we aim to investigate if and to what extent a simplified clinical workflow based on CT alone allows one to automatically segment the esophageal tumor with sufficient quality. For this purpose, we present a fully automatic end-to-end esophageal tumor segmentation method based on convolutional neural networks (CNNs). The proposed network, called Dilated Dense Attention Unet (DDAUnet), leverages spatial and channel attention gates in each dense block to selectively concentrate on determinant feature maps and regions. Dilated convolutional layers are used to manage GPU memory and increase the network receptive field. We collected a dataset of 792 scans from 288 distinct patients including varying anatomies with air pockets, feeding tubes and proximal tumors. Repeatability and reproducibility studies were conducted for three distinct splits of training and validation sets. The proposed network achieved a $\mathrm{DSC}$ value of $0.79 \pm 0.20$, a mean surface distance of $5.4 \pm 20.2mm$ and $95\%$ Hausdorff distance of $14.7 \pm 25.0mm$ for 287 test scans, demonstrating promising results with a simplified clinical workflow based on CT alone. Our code is publicly available at this https URL.
摘要:众所周知,在CT图像中手动或自动描绘食管肿瘤非常具有挑战性。这是由于肿瘤与相邻组织之间的对比度低、食管的解剖学变化以及偶尔存在的异物(例如饲管)。因此,医师通常会利用其他信息,例如内窥镜检查结果、临床病史以及PET扫描等其他成像方式。获取这些附加信息非常耗时,而且结果容易出错,并可能导致不确定的结论。在本文中,我们旨在研究仅基于CT的简化临床工作流程是否以及在多大程度上能够以足够的质量自动分割食管肿瘤。为此,我们提出了一种基于卷积神经网络(CNN)的全自动端到端食管肿瘤分割方法。所提出的网络称为扩张密集注意力U-Net(DDAUnet),它在每个密集块中利用空间和通道注意力门,选择性地聚焦于关键的特征图和区域,并使用扩张卷积层来控制GPU内存占用并增大网络感受野。我们收集了来自288位不同患者的792次扫描,其中包括带有气腔、饲管和近端肿瘤的各种解剖结构,并针对训练集和验证集的三种不同划分进行了重复性和可再现性研究。所提网络在287次测试扫描上取得了0.79±0.20的DSC值、5.4±20.2mm的平均表面距离和14.7±25.0mm的95% Hausdorff距离,表明仅基于CT的简化临床工作流程可以得到令人鼓舞的结果。我们的代码可通过此https URL公开获得。
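A generic reading of the attention gating inside each dense block, using a squeeze-and-excitation style channel gate and a 1x1-convolution spatial gate together with a dilated 3D convolution; the exact DDAUnet layer design differs, so treat this purely as a schematic.

```python
# Schematic of spatial + channel attention gates plus a dilated 3D conv
# (our generic stand-in, not the exact DDAUnet layers).
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, x):                        # x: (B, C, D, H, W)
        w = self.fc(x.mean(dim=(2, 3, 4)))       # squeeze, then excite
        return x * w[:, :, None, None, None]

class SpatialGate(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv3d(ch, 1, kernel_size=1)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))   # per-voxel gate

block = nn.Sequential(
    nn.Conv3d(16, 16, 3, padding=2, dilation=2),  # dilated conv, same size
    nn.ReLU(),
    ChannelGate(16),
    SpatialGate(16),
)
print(block(torch.randn(1, 16, 8, 32, 32)).shape)  # (1, 16, 8, 32, 32)
```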
123. Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification [PDF] 返回目录
Zhuoning Yuan, Yan Yan, Milan Sonka, Tianbao Yang
Abstract: Deep AUC Maximization (DAM) is a paradigm for learning a deep neural network by maximizing the AUC score of the model on a dataset. Most previous works of AUC maximization focus on the perspective of optimization by designing efficient stochastic algorithms, and studies on generalization performance of DAM on difficult tasks are missing. In this work, we aim to make DAM more practical for interesting real-world applications (e.g., medical image classification). First, we propose a new margin-based surrogate loss function for the AUC score (named as the AUC margin loss). It is more robust than the commonly used AUC square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. Second, we conduct empirical studies of our DAM method on difficult medical image classification tasks, namely classification of chest x-ray images for identifying many threatening diseases and classification of images of skin lesions for identifying melanoma. Our DAM method has achieved great success on these difficult tasks, i.e., the 1st place on Stanford CheXpert competition (by the paper submission date) and Top 1% rank (rank 33 out of 3314 teams) on Kaggle 2020 Melanoma classification competition. We also conduct extensive ablation studies to demonstrate the advantages of the new AUC margin loss over the AUC square loss on benchmark datasets. To the best of our knowledge, this is the first work that makes DAM succeed on large-scale medical image datasets.
摘要:深度AUC最大化(DAM)是通过在数据集上最大化模型的AUC分数来学习深度神经网络的范式。以往关于AUC最大化的工作大多着眼于优化视角,设计高效的随机算法,而缺少关于DAM在困难任务上泛化性能的研究。在这项工作中,我们旨在使DAM在有意义的现实应用(例如医学图像分类)中更加实用。首先,我们为AUC分数提出了一种新的基于边际的替代损失函数(称为AUC边际损失)。它比常用的AUC平方损失更鲁棒,同时在大规模随机优化方面享有相同的优势。其次,我们在困难的医学图像分类任务上对DAM方法进行了实证研究,即用于识别多种威胁性疾病的胸部X射线图像分类,以及用于识别黑色素瘤的皮肤病变图像分类。我们的DAM方法在这些困难任务上取得了巨大成功,即在斯坦福CheXpert竞赛中排名第一(截至论文提交日期),在Kaggle 2020黑色素瘤分类竞赛中排名前1%(3314个团队中排名第33)。我们还进行了广泛的消融研究,在基准数据集上证明了新的AUC边际损失相对于AUC平方损失的优势。据我们所知,这是首个使DAM在大规模医学图像数据集上取得成功的工作。
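The margin idea is easiest to see in a direct pairwise form: where a square surrogate penalizes every positive-negative pair quadratically, a squared hinge around a margin m stops penalizing pairs that are already ranked correctly by more than m. The sketch below enumerates pairs for clarity only; the paper itself optimizes a min-max reformulation suited to large-scale stochastic training, so treat this as an illustration of the surrogate, not the authors' algorithm.

```python
# Pairwise illustration of a margin-based AUC surrogate: pairs ranked
# correctly by more than `margin` incur zero loss.
import torch

def auc_margin_loss(scores, labels, margin=1.0):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]       # all positive-negative pairs
    return torch.clamp(margin - diff, min=0).pow(2).mean()

scores = torch.tensor([2.1, 0.3, 1.5, -0.2, 0.8])
labels = torch.tensor([1, 0, 1, 0, 0])
print(auc_margin_loss(scores, labels))
```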
124. Selective Eye-gaze Augmentation To Enhance Imitation Learning In Atari Games [PDF] 返回目录
Chaitanya Thammineni, Hemanth Manjunatha, Ehsan T. Esfahani
Abstract: This paper presents the selective use of eye-gaze information in learning human actions in Atari games. Vast evidence suggests that our eye movements convey a wealth of information about the direction of our attention and mental states and encode the information necessary to complete a task. Based on this evidence, we hypothesize that selective use of eye-gaze, as a clue for attention direction, will enhance the learning from demonstration. For this purpose, we propose a selective eye-gaze augmentation (SEA) network that learns when to use the eye-gaze information. The proposed network architecture consists of three sub-networks: gaze prediction, gating, and action prediction network. Using the prior 4 game frames, a gaze map is predicted by the gaze prediction network which is used for augmenting the input frame. The gating network will determine whether the predicted gaze map should be used in learning and is fed to the final network to predict the action at the current frame. To validate this approach, we use the publicly available Atari Human Eye-Tracking And Demonstration (Atari-HEAD) dataset, which consists of 20 Atari games with 28 million human demonstrations and 328 million eye-gazes (over game frames) collected from four subjects. We demonstrate the efficacy of selective eye-gaze augmentation in comparison with state-of-the-art Attention Guided Imitation Learning (AGIL) and Behavior Cloning (BC). The results indicate that the selective augmentation approach (the SEA network) performs significantly better than AGIL and BC. Moreover, to demonstrate the significance of selective use of gaze through the gating network, we compare our approach with the random selection of the gaze. Even in this case, the SEA network performs significantly better, validating the advantage of selectively using the gaze in demonstration learning.
摘要:本文介绍了在Atari游戏中选择性使用人眼注视信息来学习人类行为的方法。大量证据表明,眼球运动传达了关于注意方向和心理状态的丰富信息,并编码了完成任务所需的信息。基于这一证据,我们假设选择性地使用注视信息作为注意方向的线索将增强示范学习。为此,我们提出了一种选择性注视增强(SEA)网络,用于学习何时使用注视信息。所提出的网络架构由三个子网络组成:注视预测网络、门控网络和动作预测网络。利用之前的4个游戏帧,注视预测网络预测出一张注视图,用于增强输入帧;门控网络决定预测的注视图是否应在学习中使用,其输出被馈入最终网络以预测当前帧的动作。为了验证这种方法,我们使用了公开的Atari人眼跟踪与演示(Atari-HEAD)数据集,该数据集包含20个Atari游戏、2800万条人类演示以及从四名受试者收集的3.28亿次注视(覆盖游戏帧)。与最新的注意力引导模仿学习(AGIL)和行为克隆(BC)相比,我们证明了选择性注视增强的有效性:结果表明,选择性增强方法(SEA网络)的性能明显优于AGIL和BC。此外,为了证明通过门控网络选择性使用注视的重要性,我们将我们的方法与随机选择注视进行了比较;即使在这种情况下,SEA网络的表现也明显更好,验证了在示范学习中选择性使用注视的优势。
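The three sub-networks and the gating logic can be summarized schematically as below, with the convolutional stacks collapsed to single layers and all shapes (84x84 Atari frames, 18 actions) chosen only for illustration; the multiplicative augmentation of the frame by the gated gaze map is our simplified reading of the architecture.

```python
# Schematic SEA forward pass (our simplified stand-in, not the paper's nets).
import torch
import torch.nn as nn

class SEA(nn.Module):
    def __init__(self, n_actions=18):
        super().__init__()
        self.gaze = nn.Conv2d(4, 1, 3, padding=1)        # 4 past frames -> map
        self.gate = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84 * 4, 1))
        self.act = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, n_actions))
    def forward(self, past4, frame):
        gaze_map = torch.sigmoid(self.gaze(past4))       # (B, 1, 84, 84)
        g = torch.sigmoid(self.gate(past4))              # (B, 1), in [0, 1]
        augmented = frame * (1 + g[:, :, None, None] * gaze_map)
        return self.act(augmented.squeeze(1))            # action logits

model = SEA()
logits = model(torch.randn(2, 4, 84, 84), torch.randn(2, 1, 84, 84))
print(logits.shape)   # torch.Size([2, 18])
```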
125. Development and Characterization of a Chest CT Atlas [PDF] 返回目录
Kaiwen Xu, Riqiang Gao, Mirza S. Khan, Shunxing Bao, Yucheng Tang, Steve A. Deppen, Yuankai Huo, Kim L. Sandler, Pierre P. Massion, Mattias P. Heinrich, Bennett A. Landman
Abstract: A major goal of lung cancer screening is to identify individuals with particular phenotypes that are associated with high risk of cancer. Identifying relevant phenotypes is complicated by the variation in body position and body composition. In the brain, standardized coordinate systems (e.g., atlases) have enabled separate consideration of local features from gross/global structure. To date, no analogous standard atlas has been presented to enable spatial mapping and harmonization in chest computational tomography (CT). In this paper, we propose a thoracic atlas built upon a large low dose CT (LDCT) database of a lung cancer screening program. The study cohort includes 466 male and 387 female subjects with no screening-detected malignancy (age 46-79 years, mean 64.9 years). To provide spatial mapping, we optimize a multi-stage inter-subject non-rigid registration pipeline for the entire thoracic space. We evaluate the optimized pipeline relative to two baselines with alternative non-rigid registration modules: the same software with default parameters and an alternative software. We achieve a significant improvement in terms of registration success rate based on manual QA. For the entire study cohort, the optimized pipeline achieves a registration success rate of 91.7%. The application validity of the developed atlas is evaluated in terms of discriminative capability for different anatomic phenotypes, including body mass index (BMI), chronic obstructive pulmonary disease (COPD), and coronary artery calcification (CAC).
摘要:肺癌筛查的主要目标是识别具有与高癌症风险相关的特定表型的个体。身体姿态和身体组成的变化使相关表型的识别变得复杂。在大脑研究中,标准化坐标系(例如图谱)使局部特征可以与整体/全局结构分开考虑;但迄今为止,尚无类似的标准图谱用于胸部计算机断层扫描(CT)的空间映射与归一化。在本文中,我们提出了一个基于肺癌筛查项目的大型低剂量CT(LDCT)数据库构建的胸部图谱。研究队列包括466名男性和387名女性受试者,筛查均未发现恶性肿瘤(年龄46-79岁,平均64.9岁)。为了提供空间映射,我们针对整个胸腔优化了一条多阶段的受试者间非刚性配准管线,并以带有替代非刚性配准模块的两个基线(使用默认参数的同一软件和一个替代软件)为参照进行评估。基于人工质量检查,我们在配准成功率方面取得了显著改善:在整个研究队列中,优化后的管线实现了91.7%的配准成功率。我们还从对不同解剖表型的判别能力的角度评估了所建图谱的应用有效性,包括体重指数(BMI)、慢性阻塞性肺疾病(COPD)和冠状动脉钙化(CAC)。
126. When Do Curricula Work? [PDF] 返回目录
Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur
Abstract: Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the \emph{implicit curricula} resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of \emph{explicit curricula}, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered. We find that for standard benchmark datasets, curricula have only marginal benefits, and that randomly ordered samples perform as well or better than curricula and anti-curricula, suggesting that any benefit is entirely due to the dynamic training set size. Inspired by common use cases of curriculum learning in practice, we investigate the role of limited training time budget and noisy data in the success of curriculum learning. Our experiments demonstrate that curriculum, but not anti-curriculum can indeed improve the performance either with limited training time budget or in existence of noisy data.
摘要:受人类学习的启发,研究人员提出在训练过程中根据样本难度对其进行排序。课程学习(在训练早期让网络接触更容易的样本)和反课程学习(先展示最困难的样本)都被认为是对标准i.i.d.训练的改进。在这项工作中,我们着手研究有序学习的相对收益。我们首先考察由架构和优化偏差产生的隐式课程,发现样本是以高度一致的顺序被学会的。接下来,为了量化显式课程的收益,我们在数千种排序上进行了大量实验,涵盖三种学习方式:课程、反课程和随机课程。在这些设置中,训练数据集的规模随时间动态增加,而随机课程中的样本顺序是随机的。我们发现,对于标准基准数据集,课程仅带来边际收益,而随机排序的样本表现与课程和反课程相当甚至更好,这表明任何收益完全来自动态增长的训练集规模。受课程学习在实践中常见用例的启发,我们研究了有限训练时间预算和噪声数据对课程学习成功的作用。我们的实验表明,在训练时间预算有限或存在噪声数据的情况下,课程学习(而非反课程学习)确实可以提高性能。
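A minimal sketch of the three orderings being compared, assuming difficulty is scored by a reference model's loss: curriculum sorts easiest-first, anti-curriculum hardest-first, and random-curriculum keeps the same pacing (the schedule by which the usable training set grows) but randomizes the order. The linear pacing function is one arbitrary choice among those studied.

```python
# Sketch of curriculum / anti-curriculum / random-curriculum orderings with
# a shared pacing function (our simplification of the experimental setup).
import numpy as np

def ordering(difficulty, kind):
    if kind == "curriculum":
        return np.argsort(difficulty)              # easiest first
    if kind == "anti":
        return np.argsort(-difficulty)             # hardest first
    return np.random.permutation(len(difficulty))  # random-curriculum

def pacing(step, total_steps, n, start_frac=0.1):
    frac = start_frac + (1 - start_frac) * step / total_steps
    return max(1, int(frac * n))                   # linear pacing function

difficulty = np.random.rand(1000)                  # e.g. reference-model loss
order = ordering(difficulty, "curriculum")
for step in range(0, 100, 20):
    usable = order[: pacing(step, 100, len(order))]
    batch = np.random.choice(usable, size=32)      # sample from exposed set
    print(step, len(usable), batch[:3])
```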
127. A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations? [PDF] 返回目录
Filipe R. Cordeiro, Gustavo Carneiro
Abstract: Noisy labels are commonly present in data sets automatically collected from the internet, mislabeled by non-specialist annotators, or even by specialists in challenging tasks, such as in the medical field. Although deep learning models have shown significant improvements in different domains, an open issue is their ability to memorize noisy labels during training, reducing their generalization potential. As deep learning models depend on correctly labeled data sets and label correctness is difficult to guarantee, it is crucial to consider the presence of noisy labels for deep learning training. Several approaches have been proposed in the literature to improve the training of deep learning models in the presence of noisy labels. This paper presents a survey on the main techniques in literature, in which we classify the algorithms into the following groups: robust losses, sample weighting, sample selection, meta-learning, and combined approaches. We also present the commonly used experimental setup, data sets, and results of the state-of-the-art models.
摘要:噪声标签常见于自动从互联网收集的数据集中,它们可能被非专业标注者错误标注,甚至在医疗等具有挑战性的任务中被专家错误标注。尽管深度学习模型在不同领域取得了显著进展,但一个尚未解决的问题是它们会在训练过程中记住噪声标签,从而降低泛化潜力。由于深度学习模型依赖于正确标注的数据集,而标签的正确性难以保证,因此在深度学习训练中考虑噪声标签的存在至关重要。文献中已经提出了多种方法来改善存在噪声标签时深度学习模型的训练。本文综述了文献中的主要技术,并将算法分为以下几类:鲁棒损失、样本加权、样本选择、元学习和组合方法。我们还介绍了常用的实验设置、数据集以及最新模型的结果。
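As one concrete instance of the survey's "robust losses" category, the generalized cross entropy of Zhang & Sabuncu (2018) interpolates between ordinary cross entropy (as q approaches 0) and the noise-robust MAE (q = 1):

```python
# Generalized cross entropy: a representative robust loss for noisy labels.
import torch

def gce_loss(logits, targets, q=0.7):
    p = torch.softmax(logits, dim=1)
    p_y = p.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob of given label
    return ((1.0 - p_y.pow(q)) / q).mean()

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
print(gce_loss(logits, targets))
```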
128. Automatic Segmentation and Location Learning of Neonatal Cerebral Ventricles in 3D Ultrasound Data Combining CNN and CPPN [PDF] 返回目录
Matthieu Martin, Bruno Sciolla, Michael Sdika, Philippe Quetin, Philippe Delachartre
Abstract: Preterm neonates are highly likely to suffer from ventriculomegaly, a dilation of the Cerebral Ventricular System (CVS). This condition can develop into life-threatening hydrocephalus and is correlated with future neuro-developmental impairments. Consequently, it must be detected and monitored by physicians. In clinical routine, manual 2D measurements are performed on 2D ultrasound (US) images to estimate the CVS volume, but this practice is imprecise due to the unavailability of 3D information. A way to tackle this problem would be to develop automatic CVS segmentation algorithms for 3D US data. In this paper, we investigate the potential of 2D and 3D Convolutional Neural Networks (CNN) to solve this complex task and propose to use Compositional Pattern Producing Networks (CPPN) to enable the CNNs to learn CVS location. Our database was composed of 25 3D US volumes collected on 21 preterm neonates at the age of $35.8 \pm 1.6$ gestational weeks. We found that the CPPN enables the encoding of CVS location, which increases the accuracy of the CNNs when they have few layers. Accuracy of the 2D and 3D CNNs reached intraobserver variability (IOV) in the case of dilated ventricles with Dice of $0.893 \pm 0.008$ and $0.886 \pm 0.004$ respectively (IOV = $0.898 \pm 0.008$) and with volume errors of $0.45 \pm 0.42$ cm$^3$ and $0.36 \pm 0.24$ cm$^3$ respectively (IOV = $0.41 \pm 0.05$ cm$^3$). 3D CNNs were more accurate than 2D CNNs in the case of normal ventricles with Dice of $0.797 \pm 0.041$ against $0.776 \pm 0.038$ (IOV = $0.816 \pm 0.009$) and volume errors of $0.35 \pm 0.29$ cm$^3$ against $0.35 \pm 0.24$ cm$^3$ (IOV = $0.2 \pm 0.11$ cm$^3$). The best segmentation time of volumes of size $320 \times 320 \times 320$ was obtained by a 2D CNN in $3.5 \pm 0.2$ s.
摘要:早产新生儿极易患脑室扩大,即脑室系统(CVS)的扩张。这种情况可能发展为危及生命的脑积水,并与日后的神经发育障碍相关,因此必须由医生进行检测和监测。在临床常规中,医生在2D超声(US)图像上进行手动2D测量以估计CVS体积,但由于缺乏3D信息,这种做法并不精确。解决此问题的一种途径是为3D US数据开发自动CVS分割算法。在本文中,我们研究了2D和3D卷积神经网络(CNN)解决这一复杂任务的潜力,并提出使用组合模式生成网络(CPPN)使CNN能够学习CVS的位置。我们的数据库由25个3D US体积组成,采集自21名早产新生儿,孕周为35.8±1.6周。我们发现CPPN能够编码CVS位置,从而在CNN层数较少时提高其精度。对于扩张的脑室,2D和3D CNN的精度达到了观察者内变异(IOV)水平:Dice分别为0.893±0.008和0.886±0.004(IOV = 0.898±0.008),体积误差分别为0.45±0.42 cm³和0.36±0.24 cm³(IOV = 0.41±0.05 cm³)。对于正常脑室,3D CNN比2D CNN更准确:Dice为0.797±0.041对0.776±0.038(IOV = 0.816±0.009),体积误差为0.35±0.29 cm³对0.35±0.24 cm³(IOV = 0.2±0.11 cm³)。尺寸为320×320×320的体积的最佳分割时间由2D CNN获得,为3.5±0.2秒。
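The location-encoding idea can be sketched as follows (our CoordConv-flavoured reading, shown in 2D for brevity): a small CPPN maps normalized pixel coordinates to extra feature channels that are concatenated with the ultrasound image before the first convolution, giving the CNN explicit access to position.

```python
# Sketch: a CPPN turns normalized (y, x) coordinates into location feature
# maps that are concatenated to the input image (our simplified 2D reading).
import torch
import torch.nn as nn

class CPPN(nn.Module):
    def __init__(self, out_ch=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 16), nn.Tanh(),
                                 nn.Linear(16, out_ch))
    def forward(self, h, w):
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
        feats = self.mlp(coords).reshape(h, w, -1)
        return feats.permute(2, 0, 1)            # (out_ch, H, W)

cppn = CPPN()
image = torch.randn(1, 1, 64, 64)                # one US slice
loc = cppn(64, 64).unsqueeze(0)                  # location feature maps
x = torch.cat([image, loc], dim=1)               # (1, 1+4, 64, 64)
first_conv = nn.Conv2d(5, 16, 3, padding=1)
print(first_conv(x).shape)
```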
129. SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams [PDF] 返回目录
Madina Abdrakhmanova, Askat Kuzdeuov, Sheikh Jarju, Yerbolat Khassanov, Michael Lewis, Huseyin Atakan Varol
Abstract: We present SpeakingFaces as a publicly-available large-scale dataset developed to support multimodal machine learning research in contexts that utilize a combination of thermal, visual, and audio data streams; examples include human-computer interaction (HCI), biometric authentication, recognition systems, domain transfer, and speech recognition. SpeakingFaces is comprised of well-aligned high-resolution thermal and visual spectra image streams of fully-framed faces synchronized with audio recordings of each subject speaking approximately 100 imperative phrases. Data was collected from 142 subjects, yielding over 13,000 instances of synchronized data (~3.8 TB). For technical validation, we demonstrate two baseline examples. The first baseline shows classification by gender, utilizing different combinations of the three data streams in both clean and noisy environments. The second example consists of thermal-to-visual facial image translation, as an instance of domain transfer.
摘要:我们提出SpeakingFaces,这是一个公开可用的大规模数据集,旨在支持结合热成像、可见光和音频数据流的多模态机器学习研究;应用示例包括人机交互(HCI)、生物特征认证、识别系统、域迁移和语音识别。SpeakingFaces由对齐良好的高分辨率热成像与可见光图像流组成,内容为完整取景的人脸,并与每位受试者朗读约100条命令式短语的录音同步。数据采集自142名受试者,产生了超过13,000个同步数据实例(约3.8 TB)。为了进行技术验证,我们演示了两个基线示例:第一个基线利用三种数据流在干净和嘈杂环境下的不同组合进行性别分类;第二个示例是热成像到可见光的人脸图像转换,作为域迁移的一个实例。
130. iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [PDF] 返回目录
Bokui Shen, Fei Xia, Chengshu Li, Roberto Martín-Martín, Linxi Fan, Guanzhi Wang, Shyamal Buch, Claudia D'Arpino, Sanjana Srivastava, Lyne P. Tchapmi, Micael E. Tchapmi, Kent Vainio, Li Fei-Fei, Silvio Savarese
Abstract: We present iGibson, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects. The scenes are replicas of 3D scanned real-world homes, aligning the distribution of objects and layout to that of the real world. iGibson integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality visual virtual sensor signals (RGB, depth, segmentation, LiDAR, flow, among others), ii) domain randomization to change the materials of the objects (both visual texture and dynamics) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human demonstrated behaviors. iGibson is open-sourced with comprehensive examples and documentation. For more information, visit our project website: this http URL
摘要:我们提出iGibson,这是一种新颖的仿真环境,用于为大规模真实场景中的交互式任务开发机器人解决方案。我们的环境包含十五个完全可交互的家庭尺度场景,其中布置了刚性和铰接物体。这些场景是对真实住宅3D扫描的复刻,使物体的分布和布局与现实世界保持一致。iGibson集成了若干关键特性以便于交互式任务的研究:i)生成高质量的视觉虚拟传感器信号(RGB、深度、分割、LiDAR、光流等);ii)域随机化,以改变物体的材质(视觉纹理和动力学)和/或形状;iii)集成的基于采样的运动规划器,为机器人底座和机械臂生成无碰撞轨迹;iv)直观的人-iGibson交互界面,可高效收集人类演示。通过实验,我们表明场景的完全可交互性使智能体能够学习有用的视觉表征,从而加速下游操纵任务的训练。我们还表明,iGibson的特性有助于导航智能体的泛化,并且人-iGibson界面与集成的运动规划器可促进对简单人类演示行为的高效模仿学习。iGibson是开源的,附有完整的示例和文档。欲了解更多信息,请访问我们的项目网站:此http URL
131. Cross-Modal Generalization: Learning in Low Resource Modalities via Meta-Alignment [PDF] 返回目录
Paul Pu Liang, Peter Wu, Liu Ziyin, Louis-Philippe Morency, Ruslan Salakhutdinov
Abstract: The natural world is abundant with concepts expressed via visual, acoustic, tactile, and linguistic modalities. Much of the existing progress in multimodal learning, however, focuses primarily on problems where the same set of modalities are present at train and test time, which makes learning in low-resource modalities particularly difficult. In this work, we propose algorithms for cross-modal generalization: a learning paradigm to train a model that can (1) quickly perform new tasks in a target modality (i.e. meta-learning) and (2) doing so while being trained on a different source modality. We study a key research question: how can we ensure generalization across modalities despite using separate encoders for different source and target modalities? Our solution is based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. We study this problem on 3 classification tasks: text to image, image to audio, and text to speech. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.
摘要:自然世界充满了通过视觉、听觉、触觉和语言等模态表达的概念。然而,多模态学习的现有进展大多集中于训练和测试时具有相同模态集合的问题,这使得在低资源模态中的学习尤为困难。在这项工作中,我们提出了跨模态泛化的算法:这一学习范式训练的模型能够(1)快速执行目标模态中的新任务(即元学习),并且(2)在不同的源模态上训练的同时做到这一点。我们研究了一个关键问题:在对不同的源模态和目标模态使用各自独立的编码器的情况下,如何确保跨模态的泛化?我们的解决方案基于元对齐,这是一种利用强配对和弱配对的跨模态数据来对齐表示空间的新方法,同时确保快速泛化到不同模态中的新任务。我们在3个分类任务上研究了该问题:文本到图像、图像到音频以及文本到语音。即使新的目标模态只有少量(1-10个)标注样本且存在噪声标签(这一情形在低资源模态中尤为普遍),我们的结果也展示了强劲的性能。
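The alignment step can be illustrated with a generic contrastive objective standing in for the paper's meta-alignment: modality-specific encoders map paired examples into a shared space where matching pairs score higher than mismatched ones, so task knowledge learned in the source modality can transfer to the target one. All dimensions and the loss form below are our own illustrative choices.

```python
# Generic cross-modal alignment sketch (not the paper's meta-alignment):
# symmetric contrastive loss over paired source/target examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

src_enc = nn.Linear(300, 128)   # e.g. text features  -> shared space
tgt_enc = nn.Linear(512, 128)   # e.g. image features -> shared space

def alignment_loss(src, tgt, temperature=0.1):
    zs = F.normalize(src_enc(src), dim=1)
    zt = F.normalize(tgt_enc(tgt), dim=1)
    logits = zs @ zt.T / temperature           # pairwise similarities
    labels = torch.arange(len(src))            # i-th src pairs with i-th tgt
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

loss = alignment_loss(torch.randn(16, 300), torch.randn(16, 512))
loss.backward()                                # trains both encoders
```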
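The meta-alignment idea above, separate encoders for source and target modalities whose representation spaces are pulled together on paired data, can be sketched with a standard contrastive (InfoNCE-style) objective. This is a minimal sketch under assumptions, not the authors' implementation; the encoders themselves and the meta-learning outer loop are omitted.

```python
import torch
import torch.nn.functional as F

def alignment_loss(src_emb, tgt_emb, temperature=0.1):
    """InfoNCE-style contrastive loss for aligning two modality encoders.

    src_emb, tgt_emb: (batch, dim) embeddings from the source- and
    target-modality encoders, where row i of each tensor describes the
    same underlying concept. A sketch of the alignment idea only; the
    paper's meta-alignment also handles weakly paired data and a
    meta-learning loop, which are not shown here.
    """
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0))     # positives lie on the diagonal
    # Symmetric cross-entropy: source rows match target rows and vice versa.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Toy usage with random stand-ins for encoder outputs.
src = torch.randn(8, 128)
tgt = torch.randn(8, 128)
print(alignment_loss(src, tgt).item())
```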
132. Visually Grounded Compound PCFGs [PDF] Back to contents
Yanpeng Zhao, Ivan Titov
Abstract: Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings. Existing work on this task (Shi et al., 2019) optimizes a parser via Reinforce and derives the learning signal only from the alignment of images and sentences. While their model is relatively accurate overall, its error distribution is very uneven, with low performance on certain constituent types (e.g., 26.2% recall on verb phrases, VPs) and high on others (e.g., 79.6% recall on noun phrases, NPs). This is not surprising, as the learning signal is likely insufficient for deriving all aspects of phrase-structure syntax and gradient estimates are noisy. We show that using an extension of the probabilistic context-free grammar model we can do fully differentiable end-to-end visually grounded learning. Additionally, this enables us to complement the image-text alignment loss with a language modeling objective. On the MSCOCO test captions, our model establishes a new state of the art, outperforming its non-grounded version and thus confirming the effectiveness of visual groundings in constituency grammar induction. It also substantially outperforms the previous grounded model, with the largest improvements on more 'abstract' categories (e.g., +55.1% recall on VPs).
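The grounded training signal described above combines a differentiable image-sentence alignment term with a language-modeling term. The sketch below shows one common differentiable choice, a max-margin alignment loss, added to a sentence negative log-likelihood. The weighting `lam` and all function names are assumptions, and the compound-PCFG likelihood computation itself is omitted.

```python
import torch
import torch.nn.functional as F

def margin_alignment_loss(img_emb, txt_emb, margin=0.5):
    """Max-margin image-sentence alignment over a batch of matched pairs.

    img_emb, txt_emb: (batch, dim) embeddings where row i forms a pair.
    A sketch only; the paper aligns phrases/spans rather than whole
    sentences, which this simplification does not capture.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                        # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)              # similarity of matched pairs
    cost = (margin + sim - pos).clamp(min=0)   # hinge on mismatched pairs
    cost.fill_diagonal_(0.0)                   # no penalty for true pairs
    return cost.mean()

def joint_objective(img_emb, txt_emb, sentence_nll, lam=1.0):
    # sentence_nll stands in for the negative log-likelihood of the
    # sentence under the grammar (the language-modeling term); lam is a
    # hypothetical weighting hyperparameter.
    return margin_alignment_loss(img_emb, txt_emb) + lam * sentence_nll

img, txt = torch.randn(4, 64), torch.randn(4, 64)
print(joint_objective(img, txt, sentence_nll=torch.tensor(3.2)).item())
```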