Abstracts

1. What makes for good views for contrastive learning
Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola
Abstract: Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improves downstream classification accuracy. As a by-product, we also achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification ($73\%$ top-1 linear readoff with a ResNet-50). In addition, transferring our models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised pre-training. Code:this http URL
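
The link between a contrastive objective and the mutual information between views can be made concrete with a small sketch. It assumes an InfoNCE-style loss, which the abstract does not name; the function names, the temperature value, and the similarity matrices are illustrative only. Under that assumption, log(N) minus the loss is a lower bound on the MI between the two views.

```python
import math


def info_nce_loss(sim, temperature=0.1):
    """Mean InfoNCE loss for a square similarity matrix (assumed objective).

    sim[i][j] is the similarity between view 1 of sample i and view 2 of
    sample j; the diagonal holds the positive pairs.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


def mi_lower_bound(sim, temperature=0.1):
    """InfoNCE bound on mutual information: I(v1; v2) >= log(N) - loss."""
    return math.log(len(sim)) - info_nce_loss(sim, temperature)


# Hypothetical similarity matrices for a batch of two samples.
uninformative_views = [[0.0, 0.0], [0.0, 0.0]]
aligned_views = [[1.0, 0.0], [0.0, 1.0]]
print(mi_lower_bound(uninformative_views))
print(mi_lower_bound(aligned_views))
```

Because the loss is non-negative, the bound can never exceed log(N), so estimates of this kind saturate at the batch size. This is one reason to treat the values as relative indicators when comparing view choices.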

2. Intra- and Inter-Action Understanding via Temporal Action Parsing
Dian Shao, Yue Zhao, Bo Dai, Dahua Lin
Abstract: Current methods for action recognition primarily rely on deep convolutional networks to derive feature embeddings of visual and motion features. While these methods have demonstrated remarkable performance on standard benchmarks, we are still in need of a better understanding as to how the videos, in particular their internal structures, relate to high-level semantics, which may lead to benefits in multiple aspects, e.g. interpretable predictions and even new methods that can take the recognition performances to a next level. Towards this goal, we construct TAPOS, a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top. Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition. We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing the labels of them. On the constructed TAPOS, the proposed method is shown to reveal intra-action information, i.e. how action instances are made of sub-actions, and inter-action information, i.e. one specific sub-action may commonly appear in various actions.

3. Compute-Bound and Low-Bandwidth Distributed 3D Graph-SLAM
Jincheng Zhang, Andrew R. Willis, Jamie Godwin
Abstract: This article describes a new approach for distributed 3D SLAM map building. The key contribution of this article is the creation of a distributed graph-SLAM map-building architecture responsive to bandwidth and computational needs of the robotic platform. Responsiveness is afforded by the integration of a 3D point cloud to plane cloud compression algorithm that approximates dense 3D point cloud using local planar patches. Compute bound platforms may restrict the computational duration of the compression algorithm and low-bandwidth platforms can restrict the size of the compression result. The backbone of the approach is an ultra-fast adaptive 3D compression algorithm that transforms swaths of 3D planar surface data into planar patches attributed with image textures. Our approach uses DVO SLAM, a leading algorithm for 3D mapping, and extends it by computationally isolating map integration tasks from local Guidance, Navigation, and Control tasks and includes an addition of a network protocol to share the compressed plane clouds. The joint effect of these contributions allows agents with 3D sensing capabilities to calculate and communicate compressed map information commensurate with their onboard computational resources and communication channel capacities. This opens SLAM mapping to new categories of robotic platforms that may have computational and memory limits that prohibit other SLAM solutions.

4. Reducing Overlearning through Disentangled Representations by Suppressing Unknown Tasks
Naveen Panwar, Tarun Tater, Anush Sankaran, Senthil Mani
Abstract: Existing deep learning approaches for learning visual features tend to overlearn and extract more information than is required for the task at hand. From a privacy preservation perspective, the input visual information is not protected from the model, which lets the model become more intelligent than it is trained to be. Current approaches for suppressing additional task learning assume the presence of ground truth labels for the tasks to be suppressed during training time. In this research, we propose a three-fold novel contribution: (i) a model-agnostic solution for reducing model overlearning by suppressing all the unknown tasks, (ii) a novel metric to measure the trust score of a trained deep learning model, and (iii) a simulated benchmark dataset, PreserveTask, having five different fundamental image classification tasks to study the generalization nature of models. In the first set of experiments, we learn disentangled representations and suppress overlearning of five popular deep learning models: VGG16, VGG19, Inception-v1, MobileNet, and DenseNet on the PreserveTask dataset. Additionally, we show results of our framework on the color-MNIST dataset and practical applications of face attribute preservation in the Diversity in Faces (DiF) and IMDB-Wiki datasets.

5. Discriminative Dictionary Design for Action Classification in Still Images and Videos
Abhinaba Roy, Biplab Banerjee
Abstract: In this paper, we address the problem of action recognition from still images and videos. Traditional local features such as SIFT, STIP etc. invariably pose two potential problems: 1) they are not evenly distributed in different entities of a given category and 2) many of such features are not exclusive of the visual concept the entities represent. In order to generate a dictionary taking the aforementioned issues into account, we propose a novel discriminative method for identifying robust and category specific local features which maximize the class separability to a greater extent. Specifically, we pose the selection of potent local descriptors as filtering based feature selection problem which ranks the local features per category based on a novel measure of distinctiveness. The underlying visual entities are subsequently represented based on the learned dictionary and this stage is followed by action classification using the random forest model followed by label propagation refinement. The framework is validated on the action recognition datasets based on still images (Stanford-40) as well as videos (UCF-50) and exhibits superior performances than the representative methods from the literature.

6. Classification of Industrial Control Systems screenshots using Transfer Learning
Pablo Blanco Medina, Eduardo Fidalgo Fernandez, Enrique Alegre, Francisco Jáñez Martino, Roberto A. Vasco-Carofilis, Víctor Fidalgo Villar
Abstract: Industrial Control Systems depend heavily on security and monitoring protocols. Several tools are available for this purpose, which scout vulnerabilities and take screenshots from various control panels for later analysis. However, they do not adequately classify images into specific control groups, which can make operations performed by manual operators difficult. In order to solve this problem, we use transfer learning with five CNN architectures, pre-trained on ImageNet, to determine which one best classifies screenshots obtained from Industrial Control Systems. Using 337 manually labeled images, we train these architectures and study their performance both in accuracy and in CPU and GPU time. We find that MobileNetV1 is the best architecture based on its F1-score of 97.95% and its speed on CPU of 0.47 seconds per image. In systems where time is critical and a GPU is available, VGG16 is preferable because it takes 0.04 seconds to process an image, although its performance drops to 87.67%.

7. Label Efficient Visual Abstractions for Autonomous Driving
Aseem Behl, Kashyap Chitta, Aditya Prakash, Eshed Ohn-Bar, Andreas Geiger
Abstract: It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this work, we seek to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents. We analyze several segmentation-based intermediate representations. We use these visual abstractions to systematically study the trade-off between annotation efficiency and driving performance, i.e., the types of classes labeled, the number of image samples used to learn the visual abstraction model, and their granularity (e.g., object masks vs. 2D bounding boxes). Our analysis uncovers several practical insights into how segmentation-based visual abstractions can be exploited in a more label efficient manner. Surprisingly, we find that state-of-the-art driving performance can be achieved with orders of magnitude reduction in annotation cost. Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.

8. Perceptual Hashing applied to Tor domains recognition
Rubel Biswas, Roberto A. Vasco-Carofilis, Eduardo Fidalgo Fernandez, Francisco Jáñez Martino, Pablo Blanco Medina
Abstract: The Tor darknet hosts different types of illegal content, which are monitored by cybersecurity agencies. However, manually classifying Tor content can be slow and error-prone. To support this task, we introduce Frequency-Dominant Neighborhood Structure (F-DNS), a new perceptual hashing method for automatically classifying domains by their screenshots. First, we evaluated F-DNS using images subject to various content preserving operations. We compared them with their original images, achieving better correlation coefficients than other state-of-the-art methods, especially in the case of rotation. Then, we applied F-DNS to categorize Tor domains using the Darknet Usage Service Images-2K (DUSI-2K), a dataset with screenshots of active Tor service domains. Finally, we measured the performance of F-DNS against an image classification approach and a state-of-the-art hashing method. Our proposal obtained 98.75% accuracy in Tor images, surpassing all other methods compared.

9. Classifying Suspicious Content in Tor Darknet
Eduardo Fidalgo Fernandez, Roberto Andrés Vasco Carofilis, Francisco Jáñez Martino, Pablo Blanco Medina
Abstract: One of the tasks of law enforcement agencies is to find evidence of criminal activity in the Darknet. However, manually visiting thousands of domains to locate visual information containing illegal acts requires a considerable amount of time and resources. Furthermore, the background of the images can pose a challenge when performing classification. To solve this problem, in this paper, we explore the automatic classification of Tor Darknet images using Semantic Attention Keypoint Filtering (SAKF), a strategy that filters out, at the pixel level, non-significant features that do not belong to the object of interest, by combining saliency maps with Bag of Visual Words (BoVW). We evaluated SAKF on a custom Tor image dataset against CNN features (MobileNet v1 and ResNet50) and BoVW using dense SIFT descriptors, achieving 87.98% accuracy and outperforming all other approaches.

10. Map Generation from Large Scale Incomplete and Inaccurate Data Labels
Rui Zhang, Conrad Albrecht, Wei Zhang, Xiaodong Cui, Ulrich Finkler, David Kung, Siyuan Lu
Abstract: Accurately and globally mapping human infrastructure is an important and challenging task with applications in routing, regulation compliance monitoring, natural disaster response management, etc. In this paper we present progress in developing an algorithmic pipeline and distributed compute system that automates the process of map creation using high resolution aerial images. Unlike previous studies, most of which use datasets that are available only in a few cities across the world, we utilize publicly available imagery and map data, both of which cover the contiguous United States (CONUS). We approach the technical challenge of inaccurate and incomplete training data by adopting state-of-the-art convolutional neural network architectures such as the U-Net and the CycleGAN to incrementally generate maps with increasingly accurate and complete labels of man-made infrastructure such as roads and houses. Since scaling the mapping task to CONUS calls for parallelization, we then adopted an asynchronous distributed stochastic parallel gradient descent training scheme to distribute the computational workload onto a cluster of GPUs with nearly linear speed-up.

11. Deep learning with 4D spatio-temporal data representations for OCT-based force estimation
Nils Gessert, Marcel Bengs, Matthias Schlüter, Alexander Schlaefer
Abstract: Estimating the forces acting between instruments and tissue is a challenging problem for robot-assisted minimally-invasive surgery. Recently, numerous vision-based methods have been proposed to replace electro-mechanical approaches. Moreover, optical coherence tomography (OCT) and deep learning have been used for estimating forces based on deformation observed in volumetric image data. The method demonstrated the advantage of deep learning with 3D volumetric data over 2D depth images for force estimation. In this work, we extend the problem of deep learning-based force estimation to 4D spatio-temporal data with streams of 3D OCT volumes. For this purpose, we design and evaluate several methods extending spatio-temporal deep learning to 4D which is largely unexplored so far. Furthermore, we provide an in-depth analysis of multi-dimensional image data representations for force estimation, comparing our 4D approach to previous, lower-dimensional methods. Also, we analyze the effect of temporal information and we study the prediction of short-term future force values, which could facilitate safety features. For our 4D force estimation architectures, we find that efficient decoupling of spatial and temporal processing is advantageous. We show that using 4D spatio-temporal data outperforms all previously used data representations with a mean absolute error of 10.7mN. We find that temporal information is valuable for force estimation and we demonstrate the feasibility of force prediction.

12. Dynamic Refinement Network for Oriented and Densely Packed Object Detection
Xingjia Pan, Yuqiang Ren, Kekai Sheng, Weiming Dong, Haolei Yuan, Xiaowei Guo, Chongyang Ma, Changsheng Xu
Abstract: Object detection has achieved remarkable progress in the past decade. However, the detection of oriented and densely packed objects remains challenging because of following inherent reasons: (1) receptive fields of neurons are all axis-aligned and of the same shape, whereas objects are usually of diverse shapes and align along various directions; (2) detection models are typically trained with generic knowledge and may not generalize well to handle specific objects at test time; (3) the limited dataset hinders the development on this task. To resolve the first two issues, we present a dynamic refinement network that consists of two novel components, i.e., a feature selection module (FSM) and a dynamic refinement head (DRH). Our FSM enables neurons to adjust receptive fields in accordance with the shapes and orientations of target objects, whereas the DRH empowers our model to refine the prediction dynamically in an object-aware manner. To address the limited availability of related benchmarks, we collect an extensive and fully annotated dataset, namely, SKU110K-R, which is relabeled with oriented bounding boxes based on SKU110K. We perform quantitative evaluations on several publicly available benchmarks including DOTA, HRSC2016, SKU110K, and our own SKU110K-R dataset. Experimental results show that our method achieves consistent and substantial gains compared with baseline approaches. The code and dataset are available at this https URL.

13. Range Conditioned Dilated Convolutions for Scale Invariant 3D Object Detection
Alex Bewley, Pei Sun, Thomas Mensink, Dragomir Anguelov, Cristian Sminchisescu
Abstract: This paper presents a novel 3D object detection framework that processes LiDAR data directly on a representation of the sensor's native range images. When operating in the range image view, one faces learning challenges, including occlusion and considerable scale variation, limiting the obtainable accuracy. To address these challenges, a range-conditioned dilated block (RCD) is proposed to dynamically adjust a continuous dilation rate as a function of the measured range, achieving scale invariance. Furthermore, soft range gating helps mitigate the effect of occlusion. An end-to-end trained box-refinement network brings additional performance improvements in occluded areas, and produces more accurate bounding box predictions. On the challenging Waymo Open Dataset, our improved range-based detector outperforms state of the art at long range detection. Our framework is superior to prior multiview, voxel-based methods over all ranges, setting a new baseline for range-based 3D detection on this large scale public dataset.

14. Rethinking Performance Estimation in Neural Architecture Search
Xiawu Zheng, Rongrong Ji, Qiang Wang, Qixiang Ye, Zhenguo Li, Yonghong Tian, Qi Tian
Abstract: Neural architecture search (NAS) remains a challenging problem, which is attributed to the indispensable and time-consuming component of performance estimation (PE). In this paper, we provide a novel yet systematic rethinking of PE in a resource constrained regime, termed budgeted PE (BPE), which precisely and effectively estimates the performance of an architecture sampled from an architecture space. Since searching for an optimal BPE is extremely time-consuming, as it requires training a large number of networks for evaluation, we propose a Minimum Importance Pruning (MIP) approach. Given a dataset and a BPE search space, MIP estimates the importance of hyper-parameters using a random forest and subsequently prunes the least important one from the next iteration. In this way, MIP effectively prunes less important hyper-parameters to allocate more computational resources to more important ones, thus achieving an effective exploration. By combining BPE with various search algorithms including reinforcement learning, evolution algorithm, random search, and differentiable architecture search, we achieve a 1,000x NAS speed-up with a negligible performance drop compared to the SOTA.

15. Deep Learning for LiDAR Point Clouds in Autonomous Driving: A Review
Ying Li, Lingfei Ma, Zilong Zhong, Fei Liu, Dongpu Cao, Jonathan Li, Michael A. Chapman
Abstract: Recently, the advancement of deep learning in discriminative feature learning from 3D LiDAR data has led to rapid development in the field of autonomous driving. However, automated processing of uneven, unstructured, noisy, and massive 3D point clouds is a challenging and tedious task. In this paper, we provide a systematic review of existing compelling deep learning architectures applied to LiDAR point clouds, with details for specific tasks in autonomous driving such as segmentation, detection, and classification. Although several published research papers focus on specific topics in computer vision for autonomous vehicles, to date no general survey exists on deep learning applied to LiDAR point clouds for autonomous vehicles. Thus, the goal of this paper is to narrow the gap in this topic. More than 140 key contributions from the past five years are summarized in this survey, including the milestone 3D deep architectures; the remarkable deep learning applications in 3D semantic segmentation, object detection, and classification; and specific datasets, evaluation metrics, and state-of-the-art performance. Finally, we conclude with the remaining challenges and future research directions.

16. Relevant Region Prediction for Crowd Counting
Xinya Chen, Yanrui Bin, Changxin Gao, Nong Sang, Hao Tang
Abstract: Crowd counting is a widely studied and challenging task in computer vision. Existing density map based methods focus excessively on localizing individuals, which harms crowd counting performance in highly congested scenes. In addition, the dependency between regions of different density is ignored. In this paper, we propose Relevant Region Prediction (RRP) for crowd counting, which consists of the Count Map and the Region Relation-Aware Module (RRAM). Each pixel in the count map represents the number of heads falling into the corresponding local area in the input image, which discards the detailed spatial information and forces the network to pay more attention to counting rather than localizing individuals. Based on the Graph Convolutional Network (GCN), the Region Relation-Aware Module is proposed to capture and exploit the important region dependency. The module builds a fully connected directed graph between the regions of different density, where each node (region) is represented by a weighted global pooled feature, and a GCN is trained to map this region graph to a set of relation-aware region representations. Experimental results on three datasets show that our method clearly outperforms other existing state-of-the-art methods.
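
The count map described above can be illustrated with a minimal sketch. It assumes head annotations are given as (row, column) points and that each count-map pixel covers a square cell of the input image; the function name, the cell size, and the example points are hypothetical, and the paper may construct its count map differently.

```python
def count_map(head_points, image_h, image_w, cell):
    """Bin head annotations into a grid where each entry counts the heads
    that fall inside one cell x cell region of the input image."""
    rows = (image_h + cell - 1) // cell  # ceil division
    cols = (image_w + cell - 1) // cell
    grid = [[0] * cols for _ in range(rows)]
    for r, c in head_points:
        # Ignore annotations that fall outside the image.
        if 0 <= r < image_h and 0 <= c < image_w:
            grid[int(r) // cell][int(c) // cell] += 1
    return grid


# Hypothetical annotations on an 8x8 image with 4x4 cells.
heads = [(0, 0), (1, 1), (5, 7), (7, 7)]
grid = count_map(heads, image_h=8, image_w=8, cell=4)
print(grid)                          # [[2, 0], [0, 2]]
print(sum(sum(row) for row in grid)) # 4, the total crowd count
```

Exact head positions inside a cell are discarded, while the sum over the grid still equals the total count, which matches the abstract's emphasis on counting over localization.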

17. Active Speakers in Context
Juan Leon Alcazar, Fabian Caba Heilbron, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbelaez, Bernard Ghanem
Abstract: Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already benefits the active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state-of-the-art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies that verify that this result is a direct consequence of our long-term multi-speaker analysis.

18. On Evaluating Weakly Supervised Action Segmentation Methods
Yaser Souri, Alexander Richard, Luca Minciullo, Juergen Gall
Abstract: Action segmentation is the task of temporally segmenting every frame of an untrimmed video. Weakly supervised approaches to action segmentation, especially from transcripts, have been of considerable interest to the computer vision community. In this work, we focus on two aspects of the use and evaluation of weakly supervised action segmentation approaches that are often overlooked: the performance variance over multiple training runs and the impact of selecting feature extractors for this task. To tackle the first problem, we train each method on the Breakfast dataset 5 times and provide the average and standard deviation of the results. Our experiments show that the standard deviation over these repetitions is between 1 and 2.5% and significantly affects the comparison between different approaches. Furthermore, our investigation on feature extraction shows that, for the studied weakly-supervised action segmentation methods, higher-level I3D features perform worse than classical IDT features.
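
The evaluation protocol described above (repeat training several times, then report the average and standard deviation) can be sketched as follows. The scores are made-up placeholders rather than results from the paper, and the comparison rule is a deliberately crude heuristic chosen for illustration, not the authors' procedure.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated training runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(scores_a, scores_b):
    """Crude check (an assumption of this sketch): is the gap between the two
    means larger than the larger of the two run-to-run standard deviations?"""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) > max(std_a, std_b)


# Hypothetical frame accuracies (%) from 5 training runs of two methods.
method_a = [60, 62, 61, 59, 63]
method_b = [61, 62, 60, 63, 62]
print(summarize_runs(method_a))
print(summarize_runs(method_b))
print(gap_exceeds_noise(method_a, method_b))
```

With a run-to-run spread of this size, a difference of well under one point between two methods is not enough to rank them from a single run each.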

19. Ventral-Dorsal Neural Networks: Object Detection via Selective Attention
Mohammad K. Ebrahimpour, Jiayun Li, Yen-Yun Yu, Jackson L. Reese, Azadeh Moghtaderi, Ming-Hsuan Yang, David C. Noelle
Abstract: Deep Convolutional Neural Networks (CNNs) have been repeatedly proven to perform well on image classification tasks. Object detection methods, however, are still in need of significant improvements. In this paper, we propose a new framework called Ventral-Dorsal Networks (VDNets) which is inspired by the structure of the human visual system. Roughly, the visual input signal is analyzed along two separate neural streams, one in the temporal lobe and the other in the parietal lobe. The coarse functional distinction between these streams is between object recognition -- the "what" of the signal -- and extracting location related information -- the "where" of the signal. The ventral pathway from primary visual cortex, entering the temporal lobe, is dominated by "what" information, while the dorsal pathway, into the parietal lobe, is dominated by "where" information. Inspired by this structure, we propose the integration of a "Ventral Network" and a "Dorsal Network", which are complementary. Information about object identity can guide localization, and location information can guide attention to relevant image regions, improving object recognition. This new dual network framework sharpens the focus of object detection. Our experimental results reveal that the proposed method outperforms state-of-the-art object detection approaches on PASCAL VOC 2007 by 8% (mAP) and PASCAL VOC 2012 by 3% (mAP). Moreover, a comparison of techniques on Yearbook images displays substantial qualitative and quantitative benefits of VDNet.

20. Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting
Zili Yi, Qiang Tang, Shekoofeh Azizi, Daesik Jang, Zhan Xu
Abstract: Recently, data-driven image inpainting methods have made inspiring progress, impacting fundamental image editing tasks such as object removal and damaged image repair. These methods are more effective than classic approaches; however, due to memory limitations, they can only handle low-resolution inputs, typically smaller than 1K. Meanwhile, the resolution of photos captured with mobile devices has increased up to 8K. Naive up-sampling of the low-resolution inpainted result merely yields a large yet blurry result, whereas adding a high-frequency residual image onto the large blurry image can generate a sharp result, rich in details and textures. Motivated by this, we propose a Contextual Residual Aggregation (CRA) mechanism that can produce high-frequency residuals for missing contents by weighted aggregation of residuals from contextual patches, thus only requiring a low-resolution prediction from the network. Since the convolutional layers of the neural network only need to operate on low-resolution inputs and outputs, the cost of memory and computing power is well suppressed. Moreover, the need for high-resolution training datasets is alleviated. In our experiments, we train the proposed model on small images with a resolution of 512x512 and perform inference on high-resolution images, achieving compelling inpainting quality. Our model can inpaint images as large as 8K with considerable hole sizes, which is intractable with previous learning-based approaches. We further elaborate on the light-weight design of the network architecture, achieving real-time performance on 2K images on a GTX 1080 Ti GPU. Code is available at: Atlas200dk/sample-imageinpainting-HiFill.

21. Representation Learning with Fine-grained Patterns
Yuanhong Xu, Qi Qian, Hao Li, Rong Jin, Juhua Hu
Abstract: With the development of computational power and techniques for data collection, deep learning demonstrates superior performance over many existing algorithms on benchmark data sets. Many efforts have been devoted to studying the mechanism of deep learning. One important observation is that deep learning can learn discriminative patterns directly from raw materials in a task-dependent manner. As a result, the patterns obtained by deep learning significantly outperform hand-crafted features. However, those patterns can be misled by the training task when the target task is different. In this work, we investigate a prevalent problem in real-world applications, where the training set only has access to supervised information from superclasses but the target task is defined on fine-grained classes. Each superclass can contain multiple fine-grained classes. In this scenario, fine-grained patterns are essential for classifying examples from fine-grained classes, yet they can be neglected when training only with labels from superclasses. To mitigate the challenge, we propose an algorithm that explores the fine-grained patterns sufficiently without additional supervised information. Besides, our analysis indicates that the performance of learned patterns on the fine-grained classes can be theoretically guaranteed. Finally, an efficient algorithm is developed to reduce the cost of optimization. Experiments on real-world data sets verify that the proposed algorithm can significantly improve the performance on the fine-grained classes with information from superclasses only.

22. Tensor completion via nonconvex tensor ring rank minimization with guaranteed convergence
Meng Ding, Ting-Zhu Huang, Xi-Le Zhao, Tian-Hui Ma
Abstract: In recent studies, the tensor ring (TR) rank has shown high effectiveness in tensor completion owing to its ability to capture the intrinsic structure within high-order tensors. A recently proposed TR rank minimization method is based on convex relaxation, penalizing the weighted sum of the nuclear norms of the TR unfolding matrices. However, this method treats each singular value equally and neglects their physical meanings, which usually leads to suboptimal solutions in practice. In this paper, we propose to use a logdet-based function as a nonconvex smooth relaxation of the TR rank for tensor completion, which can more accurately approximate the TR rank and better promote the low-rankness of the solution. To solve the proposed nonconvex model efficiently, we develop an alternating direction method of multipliers algorithm and theoretically prove that, under some mild assumptions, our algorithm converges to a stationary point. Extensive experiments on color images, multispectral images, and color videos demonstrate that the proposed method outperforms several state-of-the-art competitors in both visual and quantitative comparisons. Key words: nonconvex optimization, tensor ring rank, logdet function, tensor completion, alternating direction method of multipliers.

23. An Innovative Approach to Determine Rebar Depth and Size by Comparing GPR Data with a Theoretical Database
Zhongming Xiang, Ge Ou, Abbas Rashidi
Abstract: Ground penetrating radar (GPR) is an efficient technique for rapidly recognizing embedded rebar in concrete structures. However, due to the difficulty of extracting signals from GPR data and the intrinsic coupling between rebar depth and size shown in the data, simultaneously determining rebar depth and size is challenging. This paper proposes an innovative algorithm to address this issue. First, the hyperbola signal from the GPR data is identified by direct wave removal, signal reconstruction and separation. Subsequently, a database is developed from a series of theoretical hyperbolas and then compared with the extracted hyperbola outlines. Finally, the rebar depth and size are determined by searching for the closest counterpart in the database. The obtained results are very promising and indicate that: (1) implementing the method presented in this paper can completely remove the direct wave noise from the GPR data, and can successfully extract the outlines from the interlaced hyperbolas; and (2) the proposed method can simultaneously determine the rebar depth and size with accuracies of 100% and 95.11%, respectively.

24. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs
Yujun Shen, Ceyuan Yang, Xiaoou Tang, Bolei Zhou
Abstract: Although Generative Adversarial Networks (GANs) have made significant progress in face synthesis, there is still insufficient understanding of what GANs have learned in the latent representation to map a randomly sampled code to a photo-realistic face image. In this work, we propose a framework, called InterFaceGAN, to interpret the disentangled face representation learned by state-of-the-art GAN models and thoroughly analyze the properties of the facial semantics in the latent space. We first find that GANs actually learn various semantics in some linear subspaces of the latent space when being trained to synthesize high-quality faces. After identifying the subspaces of the corresponding latent semantics, we are able to realistically manipulate the facial attributes occurring in the synthesized images without retraining the model. We then conduct a detailed study on the correlation between different semantics and manage to better disentangle them via subspace projection, resulting in more precise control of the attribute manipulation. Besides manipulating gender, age, expression, and the presence of eyeglasses, we can even alter the face pose as well as fix the artifacts accidentally generated by GANs. Furthermore, we perform in-depth face identity analysis and layer-wise analysis to quantitatively evaluate the editing results. Finally, we apply our approach to real face editing by involving GAN inversion approaches as well as explicitly training additional feed-forward models based on the synthetic data established by InterFaceGAN. Extensive experimental results suggest that learning to synthesize faces spontaneously brings a disentangled and controllable face representation.
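
The "subspace projection" step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so the following is only one plausible reading: each attribute is assumed to correspond to a direction vector in the latent space, and the direction to edit is made orthogonal to the direction that should stay fixed. The function names (`unit`, `project_out`, `edit_latent`) and the 16-dimensional latent codes are hypothetical and chosen for illustration only.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def project_out(primary, condition):
    """Remove from `primary` its component along `condition`.

    Assumption: both are attribute directions in the latent space.
    The result is undefined if the two directions are parallel.
    """
    p, c = unit(primary), unit(condition)
    return unit(p - np.dot(p, c) * c)

def edit_latent(z, direction, step):
    """Move a latent code `z` by `step` along a unit-length direction."""
    return z + step * unit(direction)

# Hypothetical example: edit "age" while leaving the "eyeglasses" direction untouched.
rng = np.random.default_rng(0)
age_dir, glasses_dir, z = rng.normal(size=(3, 16))
age_only = project_out(age_dir, glasses_dir)
z_edited = edit_latent(z, age_only, step=3.0)
```

Because `age_only` is orthogonal to `glasses_dir`, moving along it leaves the latent code's coordinate along the eyeglasses direction unchanged, which is the intuition behind more precise attribute control.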

25. Automated Copper Alloy Grain Size Evaluation Using a Deep-learning CNN
George S. Baggs, Paul Guerrier, Andrew Loeb, Jason C. Jones
Abstract: Moog Inc. has automated the evaluation of copper (Cu) alloy grain size using a deep-learning convolutional neural network (CNN). The proof-of-concept automated image acquisition and batch-wise image processing offers the potential for significantly reduced labor, improved accuracy of grain evaluation, and decreased overall turnaround times for approving Cu alloy bar stock for use in flight critical aircraft hardware. A classification accuracy of 91.1% on individual sub-images of the Cu alloy coupons was achieved. Process development included minimizing the variation in acquired image color, brightness, and resolution to create a dataset with 12300 sub-images, and then optimizing the CNN hyperparameters on this dataset using statistical design of experiments (DoE). Over the development of the automated Cu alloy grain size evaluation, a degree of "explainability" in the artificial intelligence (XAI) output was realized, based on the decomposition of the large raw images into many smaller dataset sub-images, through the ability to explain the CNN ensemble image output via inspection of the classification results from the individual smaller sub-images.

26. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere
Tongzhou Wang, Phillip Isola
Abstract: Contrastive representation learning has been outstandingly successful in practice. In this work, we identify two key properties related to the contrastive loss: (1) alignment (closeness) of features from positive pairs, and (2) uniformity of the induced distribution of the (normalized) features on the hypersphere. We prove that, asymptotically, the contrastive loss optimizes these properties, and analyze their positive effects on downstream tasks. Empirically, we introduce an optimizable metric to quantify each property. Extensive experiments on standard vision and language datasets confirm the strong agreement between both metrics and downstream task performance. Remarkably, directly optimizing for these two metrics leads to representations with comparable or better performance at downstream tasks than contrastive learning. Project Page: this https URL Code: this https URL
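
The two properties named in this abstract can be made concrete with a short sketch. The abstract does not state the metrics themselves, so the forms below are an assumption: alignment is taken as the mean distance between positive-pair features raised to a power `alpha`, and uniformity as the log of the mean Gaussian potential over all distinct pairs with temperature `t`. The defaults `alpha=2` and `t=2` are also assumptions, as are the function names.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def align_loss(x, y, alpha=2.0):
    """Mean of ||x_i - y_i||^alpha over positive pairs (lower means closer)."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniform_loss(x, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower means more uniform)."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(x.shape[0], k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

# Four well-spread points on the unit circle versus a fully collapsed set.
spread = l2_normalize(np.array([[1.0, 0.0], [-2.0, 0.0], [0.0, 3.0], [0.0, -4.0]]))
collapsed = np.tile(spread[:1], (4, 1))
```

With these definitions, `uniform_loss(collapsed)` is zero (all pairwise distances vanish) while `uniform_loss(spread)` is negative, so minimizing the uniformity term pushes features apart on the sphere, and `align_loss` pulls positive pairs together.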

27. Lung Segmentation from Chest X-rays using Variational Data Imputation
Raghavendra Selvan, Erik B. Dam, Sofus Rischel, Kaining Sheng, Mads Nielsen, Akshay Pai
Abstract: Pulmonary opacification is the inflammation in the lungs caused by many respiratory ailments, including the novel corona virus disease 2019 (COVID-19). Chest X-rays (CXRs) with such opacifications render regions of lungs imperceptible, making it difficult to perform automated image analysis on them. In this work, we focus on segmenting lungs from such abnormal CXRs as part of a pipeline aimed at automated risk scoring of COVID-19 from CXRs. We treat the high opacity regions as missing data and present a modified CNN-based image segmentation network that utilizes a deep generative model for data imputation. We train this model on normal CXRs with extensive data augmentation and demonstrate the usefulness of this model to extend to cases with extreme abnormalities.

28. Data Consistent CT Reconstruction from Insufficient Data with Learned Prior Images
Yixing Huang, Alexander Preuhs, Michael Manhart, Guenter Lauritsch, Andreas Maier
Abstract: Image reconstruction from insufficient data is common in computed tomography (CT), e.g., image reconstruction from truncated data, limited-angle data and sparse-view data. Deep learning has achieved impressive results in this field. However, the robustness of deep learning methods is still a concern for clinical applications due to the following two challenges: a) With limited access to sufficient training data, a learned deep learning model may not generalize well to unseen data; b) Deep learning models are sensitive to noise. Therefore, the quality of images processed by neural networks only may be inadequate. In this work, we investigate the robustness of deep learning in CT image reconstruction by showing false negative and false positive lesion cases. Since learning-based images with incorrect structures are likely not consistent with measured projection data, we propose a data consistent reconstruction (DCR) method to improve their image quality, which combines the advantages of compressed sensing and deep learning: First, a prior image is generated by deep learning. Afterwards, unmeasured projection data are inpainted by forward projection of the prior image. Finally, iterative reconstruction with reweighted total variation regularization is applied, integrating data consistency for measured data and learned prior information for missing data. The efficacy of the proposed method is demonstrated in cone-beam CT with truncated data, limited-angle data and sparse-view data, respectively. For example, for truncated data, DCR achieves a mean root-mean-square error of 24 HU and a mean structure similarity index of 0.999 inside the field-of-view for different patients in the noisy case, while the state-of-the-art U-Net method achieves 55 HU and 0.995 respectively for these two metrics.

29. A Modified Fourier-Mellin Approach for Source Device Identification on Stabilized Videos
Sara Mandelli, Fabrizio Argenti, Paolo Bestagini, Massimo Iuliani, Alessandro Piva, Stefano Tubaro
Abstract: To decide whether a digital video has been captured by a given device, multimedia forensic tools usually exploit characteristic noise traces left by the camera sensor on the acquired frames. This analysis requires that the noise pattern characterizing the camera and the noise pattern extracted from the video frames under analysis are geometrically aligned. However, in many practical scenarios this does not occur, so a re-alignment or synchronization has to be performed. Current solutions often require a time-consuming search of the realignment transformation parameters. In this paper, we propose to overcome this limitation by searching scaling and rotation parameters in the frequency domain. The proposed algorithm, tested on real videos from a well-known state-of-the-art dataset, shows promising results.
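
As background for searching alignment parameters in the frequency domain: Fourier-Mellin style methods generally resample spectra so that rotation and scaling turn into shifts, and then estimate those shifts by phase correlation. The sketch below shows only that last step, for a circular shift between two 2-D arrays. It is an illustration under these assumptions, not the modified algorithm proposed in the paper, and the function name `phase_correlation_shift` is hypothetical.

```python
import numpy as np

def phase_correlation_shift(a, b):
    """Estimate (dy, dx) such that b is approximately np.roll(a, (dy, dx), axis=(0, 1)).

    Shifts are returned modulo the array size, so they are non-negative.
    """
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = fb * np.conj(fa)
    cross /= np.maximum(np.abs(cross), 1e-12)  # keep only the phase
    corr = np.fft.ifft2(cross).real            # peaks at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)

# Hypothetical usage on a synthetic noise pattern.
rng = np.random.default_rng(0)
pattern = rng.normal(size=(32, 48))
shifted = np.roll(pattern, shift=(5, 11), axis=(0, 1))
estimate = phase_correlation_shift(pattern, shifted)
```

The estimate costs a few FFTs regardless of the shift, which is the general reason frequency-domain search can replace an exhaustive search over transformation parameters.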

30. AutoML Segmentation for 3D Medical Image Data: Contribution to the MSD Challenge 2018
Oliver Rippel, Leon Weninger, Dorit Merhof
Abstract: Fueled by recent advances in machine learning, there has been tremendous progress in the field of semantic segmentation for the medical image computing community. However, the algorithms developed are often optimized and validated by hand based on one task only. In combination with small datasets, interpreting the generalizability of the results is often difficult. The Medical Segmentation Decathlon challenge addresses this problem, and aims to facilitate the development of generalizable 3D semantic segmentation algorithms that require no manual parametrization. Such an algorithm was developed and is presented in this paper. It consists of a 3D convolutional neural network with an encoder-decoder architecture employing residual connections, skip connections and multi-level generation of predictions. It works on anisotropic voxel geometries and has anisotropic depth, i.e., the number of downsampling steps is a task-specific parameter. These depths are automatically inferred for each task prior to training. By combining this flexible architecture with on-the-fly data augmentation and little-to-no pre- or postprocessing, promising results could be achieved. The code developed for this challenge will be available online after the final deadline at: this https URL

31. Iterative Network for Image Super-Resolution
Yuqing Liu, Shiqi Wang, Jian Zhang, Shanshe Wang, Siwei Ma, Wen Gao
Abstract: Single image super-resolution (SISR), as a traditional ill-conditioned inverse problem, has been greatly revitalized by the recent development of convolutional neural networks (CNN). These CNN-based methods generally map a low-resolution image to its corresponding high-resolution version with sophisticated network structures and loss functions, showing impressive performance. This paper proposes a substantially different approach relying on iterative optimization in HR space with an iterative super-resolution network (ISRN). We first analyze the observation model of the image SR problem, inspiring a feasible solution by mimicking and fusing each iteration in a more general and efficient manner. Considering the drawbacks of batch normalization, we propose a feature normalization (FNorm) method to regulate the features in the network. Furthermore, a novel block with FNorm, termed FNB, is developed to improve the network representation. A residual-in-residual structure is proposed to form a very deep network, which groups FNBs with a long skip connection for better information delivery and for stabilizing the training phase. Extensive experimental results on testing benchmarks with bicubic (BI) degradation show that our ISRN can not only recover more structural information, but also achieve competitive or better PSNR/SSIM results with far fewer parameters compared to other works. Besides BI, we simulate real-world degradation with blur-downscale (BD) and downscale-noise (DN). ISRN and its extension ISRN+ both achieve better performance than others with BD and DN degradation models.

32. Interactive exploration of population scale pharmacoepidemiology datasets
Tengel Ekrem Skar, Einar Holsbø, Kristian Svendsen, Lars Ailo Bongo
Abstract: Population-scale drug prescription data linked with adverse drug reaction (ADR) data supports the fitting of models large enough to detect drug use and ADR patterns that are not detectable using traditional methods on smaller datasets. However, detecting ADR patterns in large datasets requires tools for scalable data processing, machine learning for data analysis, and interactive visualization. To our knowledge, no existing pharmacoepidemiology tool supports all three requirements. We have therefore created a tool for interactive exploration of patterns in prescription datasets with millions of samples. We use Spark to preprocess the data for machine learning and for analyses using SQL queries. We have implemented models in Keras and the scikit-learn framework. The model results are visualized and interpreted using live Python coding in Jupyter. We apply our tool to explore a data set of 384 million prescriptions from the Norwegian Prescription Database combined with 62 million prescriptions for elders who were hospitalized. We preprocess the data in two minutes, train models in seconds, and plot the results in milliseconds. Our results show the power of combining computational power, short computation times, and ease of use for analysis of population scale pharmacoepidemiology datasets. The code is open source and available at: this https URL

33. Attention-based network for low-light image enhancement
Cheng Zhang, Qingsen Yan, Yu zhu, Jinqiu Sun, Yanning Zhang
Abstract: Images captured under low-light conditions often suffer from insufficient brightness and notorious noise. Hence, low-light image enhancement is a key and challenging task in computer vision. A variety of methods have been proposed for this task, but these methods often fail in extreme low-light environments and amplify the underlying noise in the input image. To address such a difficult problem, this paper presents a novel attention-based neural network to generate high-quality enhanced low-light images from the raw sensor data. Specifically, we first employ an attention strategy (i.e. channel attention and spatial attention modules) to suppress undesired chromatic aberration and noise. The channel attention module guides the network to refine redundant colour features. The spatial attention module focuses on denoising by taking advantage of the non-local correlation in the image. Furthermore, we propose a new pooling layer, called the inverted shuffle layer, which adaptively selects useful information from previous features. Extensive experiments demonstrate the superiority of the proposed network in terms of suppressing the chromatic aberration and noise artifacts in enhancement, especially when the low-light image has severe noise.

34. FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
Dehong Gao, Linbo Jin, Ben Chen, Minghui Qiu, Yi Wei, Yi Hu, Hao Wang
Abstract: In this paper, we address text and image matching for cross-modal retrieval in the fashion industry. Unlike matching in the general domain, fashion matching must pay much more attention to the fine-grained information in fashion images and texts. Pioneering approaches detect regions of interest (i.e., RoIs) from images and use the RoI embeddings as image representations. In general, RoIs tend to represent the "object-level" information in the fashion images, while fashion texts tend to describe more detailed information, e.g., styles and attributes. RoIs are thus not fine-grained enough for fashion text and image matching. To this end, we propose FashionBERT, which leverages patches as image features. With the pre-trained BERT model as the backbone network, FashionBERT learns high-level representations of texts and images. Meanwhile, we propose an adaptive loss to trade off multitask learning in the FashionBERT modeling. Two tasks (i.e., text and image matching and cross-modal retrieval) are incorporated to evaluate FashionBERT. On the public dataset, experiments demonstrate that FashionBERT achieves significant performance improvements over the baseline and state-of-the-art approaches. In practice, FashionBERT is applied in a concrete cross-modal retrieval application. We provide a detailed analysis of matching performance and inference efficiency.

35. Self-supervised Dynamic CT Perfusion Image Denoising with Deep Neural Networks
Dufan Wu, Hui Ren, Quanzheng Li
Abstract: Dynamic computed tomography perfusion (CTP) imaging is a promising approach for acute ischemic stroke diagnosis and evaluation. Hemodynamic parametric maps of cerebral parenchyma are calculated from repeated CT scans of the first pass of iodinated contrast through the brain. It is necessary to reduce the dose of CTP for routine applications due to the high radiation exposure from the repeated scans, and image denoising is then necessary to achieve a reliable diagnosis. In this paper, we proposed a self-supervised deep learning method for CTP denoising, which did not require any high-dose reference images for training. The network was trained by mapping each frame of CTP to an estimation from its adjacent frames. Because the noise in the source and target was independent, this approach could effectively remove the noise. Being free from high-dose training images made the proposed method easier to adapt to different scanning protocols. The method was validated on both simulation and a public real dataset. The proposed method achieved improved image quality compared to conventional denoising methods. On the real data, the proposed method also had improved spatial resolution and contrast-to-noise ratio compared to supervised learning trained on the simulation data.

36. Inverse problems with second-order Total Generalized Variation constraints
Kristian Bredies, Tuomo Valkonen
Abstract: Total Generalized Variation (TGV) has recently been introduced as a penalty functional for modelling images with edges as well as smooth variations. It can be interpreted as a "sparse" penalization of optimal balancing from the first up to the $k$-th distributional derivative and leads to desirable results when applied to image denoising, i.e., $L^2$-fitting with TGV penalty. The present paper studies TGV of second order in the context of solving ill-posed linear inverse problems. Existence and stability for solutions of Tikhonov-functional minimization with respect to the data are shown and applied to the problem of recovering an image from blurred and noisy data.
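
For orientation, the objects named in this abstract can be written out as follows. The abstract itself gives no formulas, so this is a sketch in commonly used notation rather than the paper's exact statement: $K$ (the linear forward operator), $f$ (the data), the weights $\alpha_0, \alpha_1 > 0$, and the auxiliary vector field $w$ are all symbols assumed here, and the norms are taken in the sense of Radon measures.

```latex
% Second-order TGV in its minimum ("optimal balancing") form:
% w absorbs the smooth part of the gradient, the remainder is penalized sparsely.
\mathrm{TGV}_\alpha^2(u) = \min_{w}\; \alpha_1 \,\|\nabla u - w\|_{\mathcal{M}}
    + \alpha_0 \,\|\mathcal{E}(w)\|_{\mathcal{M}},
\qquad \mathcal{E}(w) = \tfrac{1}{2}\bigl(\nabla w + \nabla w^{\top}\bigr)

% Tikhonov functional for the linear inverse problem Ku = f:
\min_{u}\; \tfrac{1}{2}\,\|Ku - f\|^2 + \mathrm{TGV}_\alpha^2(u)
```

Taking $K$ as the identity recovers the denoising case ($L^2$-fitting with TGV penalty), while a blurring operator $K$ gives the deblurring problem mentioned at the end of the abstract.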

37. Webpage Segmentation for Extracting Images and Their Surrounding Contextual Information
F. Fauzi, H. J. Long, M. Belkhatir
Abstract: Web images come hand in hand with valuable contextual information. Although this information has long been mined for various uses such as image annotation, clustering of images, inference of image semantic content, etc., insufficient attention has been given to addressing issues in mining this contextual information. In this paper, we propose a webpage segmentation algorithm targeting the extraction of web images and their contextual information based on their characteristics as they appear on webpages. We conducted a user study to obtain a human-labeled dataset to validate the effectiveness of our method, and experiments demonstrated that our method achieves better results than an existing segmentation algorithm.

Note: the Chinese text is machine-translated.


[arXiv papers] Computer Vision and Pattern Recognition 2020-05-21
