
[arXiv Papers] Computer Vision and Pattern Recognition 2020-08-25

Contents

1. The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement [PDF] Abstract
2. Semantic View Synthesis [PDF] Abstract
3. 3D for Free: Crossmodal Transfer Learning using HD Maps [PDF] Abstract
4. What makes fake images detectable? Understanding properties that generalize [PDF] Abstract
5. Classification of Noncoding RNA Elements Using Deep Convolutional Neural Networks [PDF] Abstract
6. LMSCNet: Lightweight Multiscale 3D Semantic Completion [PDF] Abstract
7. Certainty Pooling for Multiple Instance Learning [PDF] Abstract
8. Products-10K: A Large-scale Product Recognition Dataset [PDF] Abstract
9. TORNADO-Net: mulTiview tOtal vaRiatioN semAntic segmentation with Diamond inceptiOn module [PDF] Abstract
10. Decision Support for Video-based Detection of Flu Symptoms [PDF] Abstract
11. EfficientFCN: Holistically-guided Decoding for Semantic Segmentation [PDF] Abstract
12. Lossy Image Compression with Normalizing Flows [PDF] Abstract
13. 3rd Place Solution to "Google Landmark Retrieval 2020" [PDF] Abstract
14. CSCL: Critical Semantic-Consistent Learning for Unsupervised Domain Adaptation [PDF] Abstract
15. FOCAL: A Forgery Localization Framework based on Video Coding Self-Consistency [PDF] Abstract
16. Transferring Inter-Class Correlation [PDF] Abstract
17. Cross-Modality 3D Object Detection [PDF] Abstract
18. Global-local Enhancement Network for NMFs-aware Sign Language Recognition [PDF] Abstract
19. INSIDE: Steering Spatial Attention with Non-Imaging Information in CNNs [PDF] Abstract
20. An Ensemble of Simple Convolutional Neural Network Models for MNIST Digit Recognition [PDF] Abstract
21. Visual Attack and Defense on Text [PDF] Abstract
22. Model Generalization in Deep Learning Applications for Land Cover Mapping [PDF] Abstract
23. Multi-scale Deep Feature Representation Based on SR-GAN for Low Resolution Person Re-identfication [PDF] Abstract
24. LCA-Net: Light Convolutional Autoencoder for Image Dehazing [PDF] Abstract
25. A Single Frame and Multi-Frame Joint Network for 360-degree Panorama Video Super-Resolution [PDF] Abstract
26. Improved Mutual Mean-Teaching for Unsupervised Domain Adaptive Re-ID [PDF] Abstract
27. Self-Supervised Learning for Large-Scale Unsupervised Image Clustering [PDF] Abstract
28. LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks [PDF] Abstract
29. CA-GAN: Weakly Supervised Color Aware GAN for Controllable Makeup Transfer [PDF] Abstract
30. Automated Search for Resource-Efficient Branched Multi-Task Networks [PDF] Abstract
31. Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images, and Noisy OSM Training Labels [PDF] Abstract
32. Explainable Disease Classification via weakly-supervised segmentation [PDF] Abstract
33. A Dataset for Evaluating Blood Detection in Hyperspectral Images [PDF] Abstract
34. Monocular Reconstruction of Neural Face Reflectance Fields [PDF] Abstract
35. Dogs as Model for Human Breast Cancer: A Completely Annotated Whole Slide Image Dataset [PDF] Abstract
36. VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval [PDF] Abstract
37. Strawberry Detection using Mixed Training on Simulated and Real Data [PDF] Abstract
38. Affinity-aware Compression and Expansion Network for Human Parsing [PDF] Abstract
39. Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars [PDF] Abstract
40. Learning Kernel for Conditional Moment-Matching Discrepancy-based Image Classification [PDF] Abstract
41. Hierarchical Style-based Networks for Motion Synthesis [PDF] Abstract
42. m2caiSeg: Semantic Segmentation of Laparoscopic Images using Convolutional Neural Networks [PDF] Abstract
43. Good Graph to Optimize: Cost-Effective, Budget-Aware Bundle Adjustment in Visual SLAM [PDF] Abstract
44. Robust Vision Challenge 2020 -- 1st Place Report for Panoptic Segmentation [PDF] Abstract
45. Developing and Defeating Adversarial Examples [PDF] Abstract
46. Let me join you! Real-time F-formation recognition by a socially aware robot [PDF] Abstract
47. Multi-Person Full Body Pose Estimation [PDF] Abstract
48. Holistic Multi-View Building Analysis in the Wild with Projection Pooling [PDF] Abstract
49. Dual Adversarial Auto-Encoders for Clustering [PDF] Abstract
50. Seesaw Loss for Long-Tailed Instance Segmentation [PDF] Abstract
51. A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild [PDF] Abstract
52. Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations [PDF] Abstract
53. Visible Feature Guidance for Crowd Pedestrian Detection [PDF] Abstract
54. Neighbourhood-Insensitive Point Cloud Normal Estimation Network [PDF] Abstract
55. Matching Guided Distillation [PDF] Abstract
56. Few-Shot Image Classification via Contrastive Self-Supervised Learning [PDF] Abstract
57. Quantitative Survey of the State of the Art in Sign Language Recognition [PDF] Abstract
58. Few-Shot Learning with Intra-Class Knowledge Transfer [PDF] Abstract
59. Online Visual Tracking with One-Shot Context-Aware Domain Adaptation [PDF] Abstract
60. Supervision Levels Scale (SLS) [PDF] Abstract
61. Joint Modeling of Chest Radiographs and Radiology Reports for Pulmonary Edema Assessment [PDF] Abstract
62. Unsupervised Deep Metric Learning via Orthogonality based Probabilistic Loss [PDF] Abstract
63. Symbolic Semantic Segmentation and Interpretation of COVID-19 Lung Infections in Chest CT volumes based on Emergent Languages [PDF] Abstract
64. Emergent symbolic language based deep medical image classification [PDF] Abstract
65. Data augmentation techniques for the Video Question Answering task [PDF] Abstract
66. Revisiting Anchor Mechanisms for Temporal Action Localization [PDF] Abstract
67. RANSIP : From noisy point clouds to complete ear models, unsupervised [PDF] Abstract
68. Memory-based Jitter: Improving Visual Recognition on Long-tailed Data with Diversity In Memory [PDF] Abstract
69. Identity-Aware Multi-Sentence Video Description [PDF] Abstract
70. Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model [PDF] Abstract
71. Chest Area Segmentation in Depth Images of Sleeping Patients [PDF] Abstract
72. A Benchmark for Studying Diabetic Retinopathy: Segmentation, Grading, and Transferability [PDF] Abstract
73. Unsupervised Hyperspectral Mixed Noise Removal Via Spatial-Spectral Constrained Deep Image Prior [PDF] Abstract
74. Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors [PDF] Abstract
75. Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data [PDF] Abstract
76. A Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals [PDF] Abstract
77. PNEN: Pyramid Non-Local Enhanced Networks [PDF] Abstract
78. ScribbleBox: Interactive Annotation Framework for Video Object Segmentation [PDF] Abstract
79. Detection perceptual underwater image enhancement with deep learning and physical priors [PDF] Abstract
80. Many-shot from Low-shot: Learning to Annotate using Mixed Supervision for Object Detection [PDF] Abstract
81. Toward Quantifying Ambiguities in Artistic Images [PDF] Abstract
82. Towards Autonomous Driving: a Multi-Modal 360$^{\circ}$ Perception Proposal [PDF] Abstract
83. Training Sparse Neural Networks using Compressed Sensing [PDF] Abstract
84. DeepLandscape: Adversarial Modeling of Landscape Video [PDF] Abstract
85. Semantic Segmentation and Data Fusion of Microsoft Bing 3D Cities and Small UAV-based Photogrammetric Data [PDF] Abstract
86. Generating synthetic photogrammetric data for training deep learning based 3D point cloud segmentation models [PDF] Abstract
87. Blending of Learning-based Tracking and Object Detection for Monocular Camera-based Target Following [PDF] Abstract
88. Orderly Disorder in Point Cloud Domain [PDF] Abstract
89. Audio-Visual Waypoints for Navigation [PDF] Abstract
90. Accurate Alignment Inspection System for Low-resolution Automotive and Mobility LiDAR [PDF] Abstract
91. Automatic LiDAR Extrinsic Calibration System using Photodetector and Planar Board for Large-scale Applications [PDF] Abstract
92. Generate High Resolution Images With Generative Variational Autoencoder [PDF] Abstract
93. Fidelity-Controllable Extreme Image Compression with Generative Adversarial Networks [PDF] Abstract
94. Bosch Deep Learning Hardware Benchmark [PDF] Abstract
95. Multi-view Graph Learning by Joint Modeling of Consistency and Inconsistency [PDF] Abstract
96. Hierarchical Adaptive Lasso: Learning Sparse Neural Networks with Shrinkage via Single Stage Training [PDF] Abstract
97. Unsupervised Domain Adaptation via Discriminative Manifold Propagation [PDF] Abstract
98. One Weight Bitwidth to Rule Them All [PDF] Abstract
99. LT4REC:A Lottery Ticket Hypothesis Based Multi-task Practice for Video Recommendation System [PDF] Abstract
100. Self-Competitive Neural Networks [PDF] Abstract
101. Comparative performance analysis of the ResNet backbones of Mask RCNN to segment the signs of COVID-19 in chest CT scans [PDF] Abstract
102. A Unified Taylor Framework for Revisiting Attribution Methods [PDF] Abstract
103. HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN [PDF] Abstract

Abstracts

1. The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement [PDF] Back to Contents
  William Peebles, John Peebles, Jun-Yan Zhu, Alexei Efros, Antonio Torralba
Abstract: Existing disentanglement methods for deep generative models rely on hand-picked priors and complex encoder-based architectures. In this paper, we propose the Hessian Penalty, a simple regularization term that encourages the Hessian of a generative model with respect to its input to be diagonal. We introduce a model-agnostic, unbiased stochastic approximation of this term based on Hutchinson's estimator to compute it efficiently during training. Our method can be applied to a wide range of deep generators with just a few lines of code. We show that training with the Hessian Penalty often causes axis-aligned disentanglement to emerge in latent space when applied to ProGAN on several datasets. Additionally, we use our regularization term to identify interpretable directions in BigGAN's latent space in an unsupervised fashion. Finally, we provide empirical evidence that the Hessian Penalty encourages substantial shrinkage when applied to over-parameterized latent spaces.
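The "model-agnostic, unbiased stochastic approximation ... based on Hutchinson's estimator" lends itself to a compact sketch. The PyTorch snippet below is a minimal illustration, not the authors' released code; the probe count k, the finite-difference step epsilon, and the max reduction are assumptions. It estimates the variance of second directional differences of a generator G along random Rademacher directions, which vanishes exactly when the Hessian is diagonal.

    import torch

    def hessian_penalty(G, z, k=2, epsilon=0.1):
        # Hedged sketch: Hutchinson-style estimate of the off-diagonal
        # Hessian energy of G(z) w.r.t. z via central finite differences.
        second_diffs = []
        for _ in range(k):
            # Rademacher probe (+/-1 entries) scaled by the step size
            v = epsilon * (2 * torch.randint(0, 2, z.shape, device=z.device).to(z.dtype) - 1)
            # (G(z+v) - 2 G(z) + G(z-v)) / eps^2 ~ v^T H v, per output element
            second_diffs.append((G(z + v) - 2 * G(z) + G(z - v)) / epsilon ** 2)
        stacked = torch.stack(second_diffs)  # (k, *G(z).shape)
        # Variance over probes is zero iff off-diagonal Hessian terms vanish
        return stacked.var(dim=0, unbiased=True).max()

Adding this term, scaled by a weight, to the generator loss is how the abstract's "few lines of code" claim plays out in practice.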

2. Semantic View Synthesis [PDF] Back to Contents
  Hsin-Ping Huang, Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang
Abstract: We tackle a new problem of semantic view synthesis -- generating free-viewpoint rendering of a synthesized scene using a semantic label map as input. We build upon recent advances in semantic image synthesis and view synthesis for handling photographic image content generation and view extrapolation. Direct application of existing image/view synthesis methods, however, results in severe ghosting/blurry artifacts. To address the drawbacks, we propose a two-step approach. First, we focus on synthesizing the color and depth of the visible surface of the 3D scene. We then use the synthesized color and depth to impose explicit constraints on the multiple-plane image (MPI) representation prediction process. Our method produces sharp contents at the original view and geometrically consistent renderings across novel viewpoints. The experiments on numerous indoor and outdoor images show favorable results against several strong baselines and validate the effectiveness of our approach.

3. 3D for Free: Crossmodal Transfer Learning using HD Maps [PDF] Back to Contents
  Benjamin Wilson, Zsolt Kira, James Hays
Abstract: 3D object detection is a core perceptual challenge for robotics and autonomous driving. However, the class-taxonomies in modern autonomous driving datasets are significantly smaller than many influential 2D detection datasets. In this work, we address the long-tail problem by leveraging both the large class-taxonomies of modern 2D datasets and the robustness of state-of-the-art 2D detection methods. We proceed to mine a large, unlabeled dataset of images and LiDAR, and estimate 3D object bounding cuboids, seeded from an off-the-shelf 2D instance segmentation model. Critically, we constrain this ill-posed 2D-to-3D mapping by using high-definition maps and object size priors. The result of the mining process is 3D cuboids with varying confidence. This mining process is itself a 3D object detector, although not especially accurate when evaluated as such. However, when we then train a 3D object detection model on these cuboids, we find that the resulting model is fairly robust to the noisy supervision that our mining process provides, consistent with other recent observations in the deep learning literature. We mine a collection of 1151 unlabeled, multimodal driving logs from an autonomous vehicle and use the discovered objects to train a LiDAR-based object detector. We show that detector performance increases as we mine more unlabeled data. With our full, unlabeled dataset, our method performs competitively with fully supervised methods, even exceeding the performance for certain object categories, without any human 3D annotations.

4. What makes fake images detectable? Understanding properties that generalize [PDF] Back to Contents
  Lucy Chai, David Bau, Ser-Nam Lim, Phillip Isola
Abstract: The quality of image generation and manipulation is reaching impressive levels, making it increasingly difficult for a human to distinguish between what is real and what is fake. However, deep networks can still pick up on the subtle artifacts in these doctored images. We seek to understand what properties of fake images make them detectable and identify what generalizes across different model architectures, datasets, and variations in training. We use a patch-based classifier with limited receptive fields to visualize which regions of fake images are more easily detectable. We further show a technique to exaggerate these detectable properties and demonstrate that, even when the image generator is adversarially finetuned against a fake image classifier, it is still imperfect and leaves detectable artifacts in certain image patches. Code is available at this https URL.
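The "patch-based classifier with limited receptive fields" can be pictured as a small fully-convolutional network whose output is a grid of per-patch logits rather than one image-level score. A minimal PyTorch sketch follows; the layer sizes are assumptions, not the authors' architecture.

    import torch.nn as nn

    class PatchClassifier(nn.Module):
        # Hedged sketch: each output logit sees only a small local patch,
        # so the logit map doubles as a heatmap of detectable artifacts.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
                nn.Conv2d(64, 1, kernel_size=1),  # per-patch real/fake logit
            )

        def forward(self, x):   # x: (B, 3, H, W)
            return self.net(x)  # (B, 1, H-4, W-4) patch logits

Averaging (or max-pooling) the logit map yields an image-level decision, while the map itself visualizes which regions are most easily detected.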

5. Classification of Noncoding RNA Elements Using Deep Convolutional Neural Networks [PDF] Back to Contents
  Brian McClannahan, Krushi Patel, Usman Sajid, Cuncong Zhong, Guanghui Wang
Abstract: The paper proposes to employ deep convolutional neural networks (CNNs) to classify noncoding RNA (ncRNA) sequences. To this end, we first propose an efficient approach to convert the RNA sequences into images characterizing their base-pairing probability. As a result, classifying RNA sequences is converted to an image classification problem that can be efficiently solved by available CNN-based classification models. The paper also considers the folding potential of the ncRNAs in addition to their primary sequence. Based on the proposed approach, a benchmark image classification dataset is generated from the RFAM database of ncRNA sequences. In addition, three classical CNN models have been implemented and compared to demonstrate the superior performance and efficiency of the proposed approach. Extensive experimental results show the great potential of using deep learning approaches for RNA classification.
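The conversion the abstract describes, turning a sequence into an image characterizing its base-pairing probability, amounts to rendering the N x N pairing-probability matrix as a single-channel image. In this hedged sketch, bpp_fn is a hypothetical stand-in for a folding tool's partition-function output; it is not part of the paper's code.

    import numpy as np

    def rna_to_image(seq, bpp_fn):
        # bpp_fn is a hypothetical helper returning an (N, N) matrix of
        # base-pairing probabilities in [0, 1] for the given sequence.
        bpp = np.asarray(bpp_fn(seq), dtype=np.float64)
        return (255.0 * bpp).astype(np.uint8)  # grayscale input for a CNN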

6. LMSCNet: Lightweight Multiscale 3D Semantic Completion [PDF] Back to Contents
  Luis Roldão, Raoul de Charette, Anne Verroust-Blondet
Abstract: We introduce a new approach for multiscale 3D semantic scene completion from sparse 3D occupancy grid like voxelized LiDAR scans. As opposed to the literature, we use a 2D UNet backbone with comprehensive multiscale skip connections to enhance feature flow, along with 3D segmentation heads. On the SemanticKITTI benchmark, our method performs on par on semantic completion and better on completion than all other published methods - while being significantly lighter and faster. As such it provides a great performance/speed trade-off for mobile-robotics applications. The ablation studies demonstrate our method is robust to lower density inputs, and that it enables very high speed semantic completion at the coarsest level. Qualitative results of our approach are provided at this http URL.

7. Certainty Pooling for Multiple Instance Learning [PDF] Back to Contents
  Jacob Gildenblat, Ido Ben-Shaul, Zvi Lapp, Eldad Klaiman
Abstract: Multiple Instance Learning is a form of weakly supervised learning in which the data is arranged in sets of instances called bags with one label assigned per bag. The bag level class prediction is derived from the multiple instances through application of a permutation invariant pooling operator on instance predictions or embeddings. We present a novel pooling operator called \textbf{Certainty Pooling} which incorporates the model certainty into bag predictions resulting in a more robust and explainable model. We compare our proposed method with other pooling operators in controlled experiments with low evidence ratio bags based on MNIST, as well as on a real life histopathology dataset - Camelyon16. Our method outperforms other methods in both bag level and instance level prediction, especially when only small training sets are available. We discuss the rationale behind our approach and the reasons for its superiority for these types of datasets.
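As a rough illustration of a pooling operator that "incorporates the model certainty into bag predictions", one can weight each instance score by a certainty estimate, for example the inverse variance of Monte-Carlo dropout predictions. This is a hedged sketch of the idea, not the paper's exact operator.

    import torch

    def certainty_pooling(instance_scores, certainty, eps=1e-8):
        # Hedged sketch: certainty-weighted average of instance scores.
        # `certainty` could be, e.g., 1 / (MC-dropout variance + eps).
        weights = certainty / (certainty.sum() + eps)
        return (weights * instance_scores).sum()  # bag-level score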

8. Products-10K: A Large-scale Product Recognition Dataset [PDF] Back to Contents
  Yalong Bai, Yuxiang Chen, Wei Yu, Linfang Wang, Wei Zhang
Abstract: With the rapid development of electronic commerce, the way of shopping has experienced a revolutionary evolution. To fully meet customers' massive and diverse online shopping needs with quick response, the retailing AI system needs to automatically recognize products from images and videos at the stock-keeping unit (SKU) level with high accuracy. However, product recognition is still a challenging task, since many SKU-level products are fine-grained and visually similar at a rough glimpse. Although there are already some product benchmarks available, these datasets are either too small (limited number of products) or noisy-labeled (lack of human labeling). In this paper, we construct a human-labeled product image dataset named "Products-10K", which contains 10,000 fine-grained SKU-level products frequently bought by online customers in this http URL. Based on our new database, we also introduce several useful tips and tricks for fine-grained product recognition. The Products-10K dataset is available via this https URL.

9. TORNADO-Net: mulTiview tOtal vaRiatioN semAntic segmentation with Diamond inceptiOn module [PDF] Back to Contents
  Martin Gerdzhev, Ryan Razani, Ehsan Taghavi, Bingbing Liu
Abstract: Semantic segmentation of point clouds is a key component of scene understanding for robotics and autonomous driving. In this paper, we introduce TORNADO-Net - a neural network for 3D LiDAR point cloud semantic segmentation. We incorporate a multi-view (bird-eye and range) projection feature extraction with an encoder-decoder ResNet architecture with a novel diamond context block. Current projection-based methods do not take into account that neighboring points usually belong to the same class. To better utilize this local neighbourhood information and reduce noisy predictions, we introduce a combination of Total Variation, Lovasz-Softmax, and Weighted Cross-Entropy losses. We also take advantage of the fact that the LiDAR data encompasses 360 degrees field of view and uses circular padding. We demonstrate state-of-the-art results on the SemanticKITTI dataset and also provide thorough quantitative evaluations and ablation results.
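The three loss terms the abstract names compose naturally as a weighted sum over the projected predictions. In this hedged sketch, written for image-shaped (B, C, H, W) logits, the loss weights are assumptions and lovasz_softmax is a hypothetical helper standing in for a Lovász-Softmax implementation.

    import torch.nn.functional as F

    def tornado_loss(logits, labels, class_weights, w_ce=1.0, w_ls=1.0, w_tv=1.0):
        # Weighted cross-entropy handles class imbalance
        ce = F.cross_entropy(logits, labels, weight=class_weights)
        probs = F.softmax(logits, dim=1)
        # lovasz_softmax: hypothetical helper implementing Lovász-Softmax
        ls = lovasz_softmax(probs, labels)
        # Total variation on the probability maps encourages smooth labels
        tv = (probs[:, :, 1:, :] - probs[:, :, :-1, :]).abs().mean() \
           + (probs[:, :, :, 1:] - probs[:, :, :, :-1]).abs().mean()
        return w_ce * ce + w_ls * ls + w_tv * tv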

10. Decision Support for Video-based Detection of Flu Symptoms [PDF] Back to Contents
  Kenneth Lai, Svetlana N. Yanushkevich
Abstract: The development of decision support systems is a growing domain that can be applied in the area of disease control and diagnostics. Using video-based surveillance data, skeleton features are extracted to perform action recognition, specifically the detection and recognition of coughing and sneezing motions. Providing evidence of flu-like symptoms, a decision support system based on causal networks is capable of providing the operator with vital information for decision-making. A modified residual temporal convolutional network is proposed for action recognition using skeleton features. This paper addresses the capability of using results from a machine-learning model as evidence for a cognitive decision support system. We propose risk and trust measures as a metric to bridge between machine-learning and machine-reasoning. We provide experiments on evaluating the performance of the proposed network and how these performance measures can be combined with risk to generate trust.

11. EfficientFCN: Holistically-guided Decoding for Semantic Segmentation [PDF] Back to Contents
  Jianbo Liu, Junjun He, Jiawei Zhang, Jimmy S. Ren, Hongsheng Li
Abstract: Both performance and efficiency are important to semantic segmentation. State-of-the-art semantic segmentation algorithms are mostly based on dilated Fully Convolutional Networks (dilatedFCN), which adopt dilated convolutions in the backbone networks to extract high-resolution feature maps and achieve high segmentation performance. However, because many convolution operations are conducted on the high-resolution feature maps, such dilatedFCN-based methods incur large computational complexity and memory consumption. To balance performance and efficiency, there also exist encoder-decoder structures that gradually recover the spatial information by combining multi-level feature maps from the encoder. However, the performances of existing encoder-decoder methods are far from comparable with the dilatedFCN-based methods. In this paper, we propose the EfficientFCN, whose backbone is a common ImageNet pre-trained network without any dilated convolution. A holistically-guided decoder is introduced to obtain the high-resolution semantic-rich feature maps via the multi-scale features from the encoder. The decoding task is converted to a novel codebook generation and codeword assembly task, which takes advantage of the high-level and low-level features from the encoder. Such a framework achieves comparable or even better performance than state-of-the-art methods with only 1/3 of the computational cost. Extensive experiments on PASCAL Context, PASCAL VOC, and ADE20K validate the effectiveness of the proposed EfficientFCN.

12. Lossy Image Compression with Normalizing Flows [PDF] Back to Contents
  Leonhard Helminger, Abdelaziz Djelouah, Markus Gross, Christopher Schroers
Abstract: Deep learning based image compression has recently witnessed exciting progress and in some cases even managed to surpass transform coding based approaches that have been established and refined over many decades. However, state-of-the-art solutions for deep image compression typically employ autoencoders which map the input to a lower dimensional latent space and thus irreversibly discard information already before quantization. Due to that, they inherently limit the range of quality levels that can be covered. In contrast, traditional approaches in image compression allow for a larger range of quality levels. Interestingly, they employ an invertible transformation before performing the quantization step which explicitly discards information. Inspired by this, we propose a deep image compression method that is able to go from low bit-rates to near lossless quality by leveraging normalizing flows to learn a bijective mapping from the image space to a latent representation. In addition to this, we demonstrate further advantages unique to our solution, such as the ability to maintain constant quality results through re-encoding, even when performed multiple times. To the best of our knowledge, this is the first work to explore the opportunities for leveraging normalizing flows for lossy image compression.
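The re-encoding stability claim rests on the transform being bijective. Below is a minimal sketch of the standard building block of such flows, an affine coupling layer; the dimensions and the conditioning network are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        # Hedged sketch: transforms half the dimensions conditioned on the
        # other half; forward() and inverse() are exact inverses, which is
        # what makes repeated re-encoding essentially lossless.
        def __init__(self, dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim // 2, dim), nn.ReLU(),
                nn.Linear(dim, dim),  # predicts per-dimension scale and shift
            )

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=1)
            s, t = self.net(x1).chunk(2, dim=1)
            return torch.cat([x1, x2 * s.exp() + t], dim=1)

        def inverse(self, y):
            y1, y2 = y.chunk(2, dim=1)
            s, t = self.net(y1).chunk(2, dim=1)
            return torch.cat([y1, (y2 - t) * (-s).exp()], dim=1)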

13. 3rd Place Solution to "Google Landmark Retrieval 2020" [PDF] Back to Contents
  Ke Mei, Lei li, Jinchang Xu, Yanhua Cheng, Yugeng Lin
Abstract: Image retrieval is a fundamental problem in computer vision. This paper presents our 3rd place detailed solution to the Google Landmark Retrieval 2020 challenge. We focus on the exploration of data cleaning and models with metric learning. We use a data cleaning strategy based on embedding clustering. Besides, we employ a data augmentation method called Corner-Cutmix, which improves the model's ability to recognize multi-scale and occluded landmark images. We show in detail the ablation experiments and results of our method.

14. CSCL: Critical Semantic-Consistent Learning for Unsupervised Domain Adaptation [PDF] Back to Contents
  Jiahua Dong, Yang Cong, Gan Sun, Yuyang Liu, Xiaowei Xu
Abstract: Unsupervised domain adaptation, which avoids a costly annotation process for unlabeled target data, has attracted appealing interest in semantic segmentation. However, 1) existing methods neglect the fact that not all semantic representations across domains are transferable, which cripples domain-wise transfer with untransferable knowledge; 2) they fail to narrow category-wise distribution shift due to category-agnostic feature alignment. To address the above challenges, we develop a new Critical Semantic-Consistent Learning (CSCL) model, which mitigates the discrepancy of both domain-wise and category-wise distributions. Specifically, a critical transfer based adversarial framework is designed to highlight transferable domain-wise knowledge while neglecting untransferable knowledge. A transferability-critic guides the transferability-quantizer to maximize positive transfer gain in a reinforcement-learning manner, even when negative transfer of untransferable knowledge occurs. Meanwhile, with the help of a confidence-guided pseudo-label generator for target samples, a symmetric soft divergence loss is presented to explore inter-class relationships and facilitate category-wise distribution alignment. Experiments on several datasets demonstrate the superiority of our model.

15. FOCAL: A Forgery Localization Framework based on Video Coding Self-Consistency [PDF] Back to Contents
  Sebastiano Verde, Paolo Bestagini, Simone Milani, Giancarlo Calvagno, Stefano Tubaro
Abstract: Forgery operations on video contents are nowadays within the reach of anyone, thanks to the availability of powerful and user-friendly editing software. Integrity verification and authentication of videos represent a major interest in both journalism (e.g., fake news debunking) and legal environments dealing with digital evidence (e.g., a court of law). While several strategies and different forensics traces have been proposed in recent years, latest solutions aim at increasing the accuracy by combining multiple detectors and features. This paper presents a video forgery localization framework that verifies the self-consistency of coding traces between and within video frames, by fusing the information derived from a set of independent feature descriptors. The feature extraction step is carried out by means of an explainable convolutional neural network architecture, specifically designed to look for and classify coding artifacts. The overall framework was validated in two typical forgery scenarios: temporal and spatial splicing. Experimental results show an improvement to the state-of-the-art on temporal splicing localization and also promising performance in the newly tackled case of spatial splicing, on both synthetic and real-world videos.

16. Transferring Inter-Class Correlation [PDF] Back to Contents
  Hui Wen, Yue Wu, Chenming Yang, Jingjing Li, Yue Zhu, Xu Jiang, Hancong Duan
Abstract: The Teacher-Student (T-S) framework is widely utilized in classification tasks, through which the performance of one neural network (the student) can be improved by transferring knowledge from another trained neural network (the teacher). Since the transferred knowledge is related to the network capacities and structures of the teacher and the student, how to define efficient knowledge remains an open question. To address this issue, we design a novel transferred knowledge, the Self-Attention based Inter-Class Correlation (ICC) map in the output layer, and propose our T-S framework, Inter-Class Correlation Transfer (ICCT).

17. Cross-Modality 3D Object Detection [PDF] Back to Contents
  Ming Zhu, Chao Ma, Pan Ji, Xiaokang Yang
Abstract: In this paper, we focus on exploring the fusion of images and point clouds for 3D object detection in view of the complementary nature of the two modalities, i.e., images possess more semantic information while point clouds specialize in distance sensing. To this end, we present a novel two-stage multi-modal fusion network for 3D object detection, taking both binocular images and raw point clouds as input. The whole architecture facilitates two-stage fusion. The first stage aims at producing 3D proposals through sparse point-wise feature fusion. Within the first stage, we further exploit a joint anchor mechanism that enables the network to utilize 2D-3D classification and regression simultaneously for better proposal generation. The second stage works on the 2D and 3D proposal regions and fuses their dense features. In addition, we propose to use pseudo LiDAR points from stereo matching as a data augmentation method to densify the LiDAR points, as we observe that objects missed by the detection network mostly have too few points especially for far-away objects. Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
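The "pseudo LiDAR points from stereo matching" rest on the standard stereo back-projection Z = f * baseline / disparity. A hedged NumPy sketch, where the intrinsics f, cx, cy and the baseline are assumptions:

    import numpy as np

    def disparity_to_points(disparity, f, baseline, cx, cy, eps=1e-6):
        # Back-project a disparity map into 3D points with the pinhole model
        v, u = np.indices(disparity.shape)
        z = f * baseline / np.maximum(disparity, eps)
        x = (u - cx) * z / f
        y = (v - cy) * z / f
        return np.stack([x, y, z], axis=-1)  # (H, W, 3) pseudo-LiDAR points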

18. Global-local Enhancement Network for NMFs-aware Sign Language Recognition [PDF] Back to Contents
  Hezhen Hu, Wengang Zhou, Junfu Pu, Houqiang Li
Abstract: Sign language recognition (SLR) is a challenging problem, involving complex manual features, i.e., hand gestures, and fine-grained non-manual features (NMFs), i.e., facial expression, mouth shapes, etc. Although manual features are dominant, non-manual features also play an important role in the expression of a sign word. Specifically, many sign words convey different meanings due to non-manual features, even though they share the same hand gestures. This ambiguity introduces great challenges in the recognition of sign words. To tackle the above issue, we propose a simple yet effective architecture called Global-local Enhancement Network (GLE-Net), including two mutually promoted streams towards different crucial aspects of SLR. Of the two streams, one captures the global contextual relationship, while the other models the discriminative fine-grained cues. Moreover, due to the lack of datasets explicitly focusing on this kind of features, we introduce the first non-manual features-aware isolated Chinese sign language dataset (NMFs-CSL) with a total vocabulary size of 1,067 sign words in daily life. Extensive experiments on NMFs-CSL and SLR500 datasets demonstrate the effectiveness of our method.

19. INSIDE: Steering Spatial Attention with Non-Imaging Information in CNNs [PDF] Back to Contents
  Grzegorz Jacenków, Alison Q. O'Neil, Brian Mohr, Sotirios A. Tsaftaris
Abstract: We consider the problem of integrating non-imaging information into segmentation networks to improve performance. Conditioning layers such as FiLM provide the means to selectively amplify or suppress the contribution of different feature maps in a linear fashion. However, spatial dependency is difficult to learn within a convolutional paradigm. In this paper, we propose a mechanism to allow for spatial localisation conditioned on non-imaging information, using a feature-wise attention mechanism comprising a differentiable parametrised function (e.g. Gaussian), prior to applying the feature-wise modulation. We name our method INstance modulation with SpatIal DEpendency (INSIDE). The conditioning information might comprise any factors that relate to spatial or spatio-temporal information such as lesion location, size, and cardiac cycle phase. Our method can be trained end-to-end and does not require additional supervision. We evaluate the method on two datasets: a new CLEVR-Seg dataset where we segment objects based on location, and the ACDC dataset conditioned on cardiac phase and slice location within the volume. Code and the CLEVR-Seg dataset are available at this https URL.
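Conditioning layers such as FiLM, which the abstract builds on, are compact: the non-imaging vector predicts a per-channel scale and shift. A hedged sketch follows; INSIDE's additional parametrised spatial attention (e.g. a differentiable Gaussian over locations) is omitted here.

    import torch.nn as nn

    class FiLM(nn.Module):
        # Hedged sketch of feature-wise linear modulation: the conditioning
        # vector selectively amplifies or suppresses feature maps linearly.
        def __init__(self, cond_dim, n_channels):
            super().__init__()
            self.to_gamma_beta = nn.Linear(cond_dim, 2 * n_channels)

        def forward(self, feat, cond):  # feat: (B, C, H, W), cond: (B, D)
            gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=1)
            return gamma[..., None, None] * feat + beta[..., None, None]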

20. An Ensemble of Simple Convolutional Neural Network Models for MNIST Digit Recognition [PDF] Back to Contents
  Sanghyeon An, Minjun Lee, Sanglee Park, Heerin Yang, Jungmin So
Abstract: We report that a very high accuracy on the MNIST test set can be achieved by using simple convolutional neural network (CNN) models. We use three different models with 3x3, 5x5, and 7x7 kernel size in the convolution layers. Each model consists of a set of convolution layers followed by a single fully connected layer. Every convolution layer uses batch normalization and ReLU activation, and pooling is not used. Rotation and translation is used to augment training data, which is frequently used in most image classification tasks. A majority voting using the three models independently trained on the training data set can achieve up to 99.87% accuracy on the test set, which is one of the state-of-the-art results. A two-layer ensemble, a heterogeneous ensemble of three homogeneous ensemble networks, can achieve up to 99.91% test accuracy. The results can be reproduced by using the code at: this https URL
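The majority-voting step is straightforward; a hedged PyTorch sketch over the independently trained models:

    import torch

    def majority_vote(models, x):
        # Hard vote over per-model argmax predictions
        votes = torch.stack([m(x).argmax(dim=1) for m in models])  # (n_models, B)
        return votes.mode(dim=0).values                            # (B,) digits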

21. Visual Attack and Defense on Text [PDF] Back to Contents
  Shengjun Liu, Ningkang Jiang, Yuanbin Wu
Abstract: Modifying characters of a piece of text to their visually similar ones often appears in spam in order to fool inspection systems and other conditions, which we regard as a kind of adversarial attack on neural models. We propose a way of generating such visual text attacks and show that the attacked text is readable by humans but greatly misleads a neural classifier. We apply a vision-based model and adversarial training to defend against the attack without losing the ability to understand normal text. Our results also show that visual attacks are extremely sophisticated and diverse; more work needs to be done to solve this.

22. Model Generalization in Deep Learning Applications for Land Cover Mapping [PDF] Back to Contents
  Lucas Hu, Caleb Robinson, Bistra Dilkina
Abstract: Recent work has shown that deep learning models can be used to classify land-use data from geospatial satellite imagery. We show that when these deep learning models are trained on data from specific continents/seasons, there is a high degree of variability in model performance on out-of-sample continents/seasons. This suggests that just because a model accurately predicts land-use classes in one continent or season does not mean that the model will accurately predict land-use classes in a different continent or season. We then use clustering techniques on satellite imagery from different continents to visualize the differences in landscapes that make geospatial generalization particularly difficult, and summarize our takeaways for future satellite imagery-related applications.

23. Multi-scale Deep Feature Representation Based on SR-GAN for Low Resolution Person Re-identfication [PDF] Back to Contents
  Guoqing Zhang, zhenxing Wang, yuhui Zheng, Jianwei Zhang
Abstract: As a key technology of automatic video surveillance, person re-identification (Re-ID) has attracted widespread interest and has been applied in many applications. Most existing Re-ID methods usually assume that pedestrian images taken from different cameras have uniform resolutions. However, in many practical scenes, due to variations of distances between persons and cameras as well as the deployment setting of cameras, the resolution of the pedestrian images is usually different. Such a situation results in a resolution mismatch problem. If we directly match pedestrian images with different resolutions, the performance of Re-ID will be adversely affected because of the discrepancy in information amount. To tackle this issue, one potential solution is to combine super-resolution (SR) technology with a Re-ID method. In this paper, we propose a super-resolution GAN (SR-GAN) based Multi-scale Deep Feature Representation (MFR-GAN) framework for low-resolution person re-identification, which aims to jointly optimize the super-resolution of images and pedestrian matching. We first design three cascaded SR-GANs to increase the resolution of person images with different upscaling factors and then introduce a re-identification network after each SR-GAN to strengthen the representation capability of image features. Experiments on two synthetic datasets and a common Re-ID dataset confirm that our method (MFR-GAN) can solve the resolution mismatch problem effectively.

24. LCA-Net: Light Convolutional Autoencoder for Image Dehazing [PDF] Back to Contents
  Pavan A, Adithya Bennur, Mohit Gaggar, Shylaja S S
Abstract: Image dehazing is a crucial image pre-processing task aimed at removing the incoherent noise generated by haze to improve the visual appeal of the image. The existing models use sophisticated networks and custom loss functions which are computationally inefficient and require heavy hardware to run. Time is of the essence in image pre-processing, since real-time outputs can be obtained instantly. To overcome these problems, our proposed generic model uses a very light convolutional encoder-decoder network which does not depend on any atmospheric models. The network complexity-image quality trade-off is handled well in this neural network, and the performance of this network is not limited by low-spec systems. This network achieves optimum dehazing performance at a much faster rate, on several standard datasets, comparable to the state-of-the-art methods in terms of image quality.

25. A Single Frame and Multi-Frame Joint Network for 360-degree Panorama Video Super-Resolution [PDF] Back to Contents
  Hongying Liu, Zhubo Ruan, Chaowei Fang, Peng Zhao, Fanhua Shang, Yuanyuan Liu, Lijun Wang
Abstract: Spherical videos, also known as 360° (panorama) videos, can be viewed with various virtual reality devices such as computers and head-mounted displays. They attract a large amount of interest since awesome immersion can be experienced when watching spherical videos. However, capturing, storing and transmitting high-resolution spherical videos are extremely expensive. In this paper, we propose a novel single frame and multi-frame joint network (SMFN) for recovering high-resolution spherical videos from low-resolution inputs. To take advantage of pixel-level inter-frame consistency, deformable convolutions are used to eliminate the motion difference between feature maps of the target frame and its neighboring frames. A mixed attention mechanism is devised to enhance the feature representation capability. The dual learning strategy is exerted to constrain the space of solution so that a better solution can be found. A novel loss function based on the weighted mean square error is proposed to emphasize the super-resolution of the equatorial regions. This is the first attempt to settle the super-resolution of spherical videos, and we collect a novel dataset from the Internet, MiG Panorama Video, which includes 204 videos. Experimental results on 4 representative video clips demonstrate the efficacy of the proposed method. The dataset and code are available at this https URL.

26. Improved Mutual Mean-Teaching for Unsupervised Domain Adaptive Re-ID [PDF] Back to Contents
  Yixiao Ge, Shijie Yu, Dapeng Chen
Abstract: In this technical report, we present our submission to the VisDA Challenge in ECCV 2020, where we achieved one of the top-performing results on the leaderboard. Our solution is based on the Structured Domain Adaptation (SDA) and Mutual Mean-Teaching (MMT) frameworks. SDA, a domain-translation-based framework, focuses on carefully translating the source-domain images to the target domain. MMT, a pseudo-label-based framework, focuses on conducting pseudo-label refinery with robust soft labels. Specifically, there are three main steps in our training pipeline. (i) We adopt SDA to generate source-to-target translated images, and (ii) such images serve as informative training samples to pre-train the network. (iii) The pre-trained network is further fine-tuned by MMT on the target domain. Note that we design an improved MMT (dubbed MMT+) to further mitigate the label noise by modeling inter-sample relations across two domains and maintaining the instance discrimination. Our proposed method achieved an mAP of 74.78%, ranking 2nd out of 153 teams.

27. Self-Supervised Learning for Large-Scale Unsupervised Image Clustering [PDF] Back to Contents
  Evgenii Zheltonozhskii, Chaim Baskin, Alex M. Bronstein, Avi Mendelson
Abstract: Unsupervised learning has always been appealing to machine learning researchers and practitioners, allowing them to avoid an expensive and complicated process of labeling the data. However, unsupervised learning of complex data is challenging, and even the best approaches show much weaker performance than their supervised counterparts. Self-supervised deep learning has become a strong instrument for representation learning in computer vision. However, those methods have not been evaluated in a fully unsupervised setting. In this paper, we propose a simple scheme for unsupervised classification based on self-supervised representations. We evaluate the proposed approach with several recent self-supervised methods showing that it achieves competitive results for ImageNet classification (39% accuracy on ImageNet with 1000 clusters and 46% with overclustering). We suggest adding the unsupervised evaluation to a set of standard benchmarks for self-supervised learning. The code is available at this https URL
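The evaluation protocol implied by the abstract, clustering frozen self-supervised features and scoring against labels, can be sketched as k-means plus a Hungarian matching of clusters to classes. This hedged sketch assumes one cluster per class; with overclustering, each cluster would instead be mapped to its majority class.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.cluster import KMeans

    def clustering_accuracy(features, labels, n_classes):
        # k-means on self-supervised features, then the best one-to-one
        # cluster-to-class assignment via the Hungarian algorithm
        preds = KMeans(n_clusters=n_classes, n_init=10).fit_predict(features)
        counts = np.zeros((n_classes, n_classes), dtype=np.int64)
        for p, y in zip(preds, labels):
            counts[p, y] += 1
        rows, cols = linear_sum_assignment(counts.max() - counts)  # maximize hits
        mapping = dict(zip(rows, cols))
        return float(np.mean([mapping[p] == y for p, y in zip(preds, labels)]))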

28. LC-NAS: Latency Constrained Neural Architecture Search for Point Cloud Networks [PDF] Back to Contents
  Guohao Li, Mengmeng Xu, Silvio Giancola, Ali Thabet, Bernard Ghanem
Abstract: Point cloud architecture design has become a crucial problem for 3D deep learning. Several efforts exist to manually design architectures with high accuracy in point cloud tasks such as classification, segmentation, and detection. Recent progress in automatic Neural Architecture Search (NAS) minimizes the human effort in network design and optimizes high performing architectures. However, these efforts fail to consider important factors such as latency during inference. Latency is of high importance in time critical applications like self-driving cars, robot navigation, and mobile applications, that are generally bound by the available hardware. In this paper, we introduce a new NAS framework, dubbed LC-NAS, where we search for point cloud architectures that are constrained to a target latency. We implement a novel latency constraint formulation to trade-off between accuracy and latency in our architecture search. Contrary to previous works, our latency loss guarantees that the final network achieves latency under a specified target value. This is crucial when the end task is to be deployed in a limited hardware setting. Extensive experiments show that LC-NAS is able to find state-of-the-art architectures for point cloud classification in ModelNet40 with minimal computational cost. We also show how our searched architectures achieve any desired latency with a reasonably low drop in accuracy. Finally, we show how our searched architectures easily transfer to a different task, part segmentation on PartNet, where we achieve state-of-the-art results while lowering latency by a factor of 10.
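One simple way to realize a "latency constraint formulation" as a differentiable training term is a hinge on the architecture's expected latency: zero below the target budget and growing above it. This is a hedged sketch of the general idea, not necessarily the paper's exact loss.

    import torch

    def latency_penalty(expected_latency_ms, target_ms):
        # expected_latency_ms: a differentiable estimate (tensor) of the
        # sampled architecture's latency; penalize only budget violations.
        return torch.relu(expected_latency_ms - target_ms)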

29. CA-GAN: Weakly Supervised Color Aware GAN for Controllable Makeup Transfer [PDF] Back to Contents
  Robin Kips, Pietro Gori, Matthieu Perrot, Isabelle Bloch
Abstract: While existing makeup style transfer models perform an image synthesis whose results cannot be explicitly controlled, the ability to modify makeup color continuously is a desirable property for virtual try-on applications. We propose a new formulation for the makeup style transfer task, with the objective to learn a color controllable makeup style synthesis. We introduce CA-GAN, a generative model that learns to modify the color of specific objects (e.g. lips or eyes) in the image to an arbitrary target color while preserving background. Since color labels are rare and costly to acquire, our method leverages weakly supervised learning for conditional GANs. This enables to learn a controllable synthesis of complex objects, and only requires a weak proxy of the image attribute that we desire to modify. Finally, we present for the first time a quantitative analysis of makeup style transfer and color control performance.

30. Automated Search for Resource-Efficient Branched Multi-Task Networks [PDF] Back to Contents
  David Bruggemann, Menelaos Kanakis, Stamatios Georgoulis, Luc Van Gool
Abstract: The multi-modal nature of many vision problems calls for neural network architectures that can perform multiple tasks concurrently. Typically, such architectures have been handcrafted in the literature. However, given the size and complexity of the problem, this manual architecture exploration likely exceeds human design abilities. In this paper, we propose a principled approach, rooted in differentiable neural architecture search, to automatically define branching (tree-like) structures in the encoding stage of a multi-task neural network. To allow flexibility within resource-constrained environments, we introduce a proxyless, resource-aware loss that dynamically controls the model size. Evaluations across a variety of dense prediction tasks show that our approach consistently finds high-performing branching structures within limited resource budgets.

31. Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images, and Noisy OSM Training Labels [PDF] Back to Contents
  Bharath Comandur, Avinash C. Kak
Abstract: We present a novel multi-view training framework and CNN architecture for combining information from multiple overlapping satellite images and noisy training labels derived from OpenStreetMap (OSM) for semantic labeling of buildings and roads across large geographic regions (100 km$^2$). Our approach to multi-view semantic segmentation yields a 4-7% improvement in the per-class IoU scores compared to the traditional approaches that use the views independently of one another. A unique (and, perhaps, surprising) property of our system is that modifications that are added to the tail-end of the CNN for learning from the multi-view data can be discarded at the time of inference with a relatively small penalty in the overall performance. This implies that the benefits of training using multiple views are absorbed by all the layers of the network. Additionally, our approach only adds a small overhead in terms of the GPU-memory consumption even when training with as many as 32 views per scene. The system we present is end-to-end automated, which facilitates comparing the classifiers trained directly on true orthophotos vis-a-vis first training them on the off-nadir images and subsequently translating the predicted labels to geographical coordinates. With no human supervision, our IoU scores for the buildings and roads classes are 0.8 and 0.64, respectively, which is better than state-of-the-art approaches that use OSM labels and that are not completely automated.

32. Explainable Disease Classification via weakly-supervised segmentation [PDF] Back to Contents
  Aniket Joshi, Gaurav Mishra, Jayanthi Sivaswamy
Abstract: Deep learning based approaches to Computer Aided Diagnosis (CAD) typically pose the problem as an image classification (Normal or Abnormal) problem. These systems achieve high to very high accuracy for the specific disease detection they are trained on, but lack an explanation for the provided decision/classification result. The activation maps which correspond to decisions do not correlate well with regions of interest for specific diseases. This paper examines this problem and proposes an approach which mimics the clinical practice of looking for evidence prior to diagnosis. A CAD model is learnt using a mixed set of information: class labels for the entire training set of images, plus a rough localisation of suspect regions as an extra input for a smaller subset of training images for guiding the learning. The proposed approach is illustrated with detection of diabetic macular edema (DME) from OCT slices. Results of testing on a large public dataset show that with just a third of the images roughly segmented for fluid-filled regions, the classification accuracy is on par with state-of-the-art methods while providing a good explanation in the form of an anatomically accurate heatmap/region of interest. The proposed solution is then adapted to breast cancer detection from mammographic images. Good evaluation results on public datasets underscore the generalisability of the proposed solution.

33. A Dataset for Evaluating Blood Detection in Hyperspectral Images [PDF]
  Michał Romaszewski, Przemysław Głomb, Arkadiusz Sochan, Michał Cholewa
Abstract: The sensitivity of imaging spectroscopy to hemoglobin derivatives makes it a promising tool for detecting blood. However, due to the complexity and high dimensionality of hyperspectral images, the development of hyperspectral blood detection algorithms is challenging. To facilitate their development, we present a new hyperspectral blood detection dataset. This dataset, published in accordance with the open access mandate, consists of multiple detection scenarios with varying levels of complexity. It allows testing the performance of machine learning methods with respect to different acquisition environments, types of background, age of blood, and the presence of other blood-like substances. We explored the dataset with blood detection experiments, using a hyperspectral target detection algorithm based on the well-known Matched Filter detector. Our results and their discussion highlight the challenges of blood detection in hyperspectral data and form a reference for further work.
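For readers unfamiliar with the baseline, the Matched Filter referenced above is a classical linear detector. Below is a minimal numpy sketch of its scoring rule, not the authors' code; the array shapes, the pseudo-inverse for numerical stability, and the toy data are all assumptions.

```python
import numpy as np

def matched_filter_scores(pixels, target, background):
    """Classical Matched Filter score for hyperspectral target detection.

    pixels:     (N, B) test spectra (N pixels, B bands)
    target:     (B,)   reference target spectrum (e.g., a blood signature)
    background: (M, B) spectra used to estimate background statistics
    Returns (N,) scores, normalized so the target spectrum itself scores 1.
    """
    mu = background.mean(axis=0)                 # background mean
    cov_inv = np.linalg.pinv(np.cov(background, rowvar=False))
    d = target - mu
    w = cov_inv @ d / (d @ cov_inv @ d)          # matched-filter weight vector
    return (pixels - mu) @ w

# toy usage with random spectra standing in for a real hyperspectral cube
rng = np.random.default_rng(0)
bg = rng.normal(size=(500, 100))
scores = matched_filter_scores(rng.normal(size=(10, 100)), rng.normal(size=100), bg)
```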

34. Monocular Reconstruction of Neural Face Reflectance Fields [PDF]
  Mallikarjun B R., Ayush Tewari, Tae-Hyun Oh, Tim Weyrich, Bernd Bickel, Hans-Peter Seidel, Hanspeter Pfister, Wojciech Matusik, Mohamed Elgharib, Christian Theobalt
Abstract: The reflectance field of a face describes the reflectance properties responsible for complex lighting effects including diffuse, specular, inter-reflection and self-shadowing. Most existing methods for estimating the face reflectance from a monocular image assume faces to be diffuse, with very few approaches adding a specular component. This still leaves out important perceptual aspects of reflectance, as higher-order global illumination effects and self-shadowing are not modeled. We present a new neural representation for face reflectance where we can estimate all components of the reflectance responsible for the final appearance from a single monocular image. Instead of modeling each component of the reflectance separately using parametric models, our neural representation allows us to generate a basis set of faces in a geometric deformation-invariant space, parameterized by the input light direction, viewpoint and face geometry. We learn to reconstruct this reflectance field of a face just from a monocular image, which can be used to render the face from any viewpoint in any light condition. Our method is trained on a light-stage training dataset, which captures 300 people illuminated with 150 light conditions from 8 viewpoints. We show that our method outperforms existing monocular reflectance reconstruction methods in terms of photorealism, due to better capturing of physical primitives such as sub-surface scattering, specularities, self-shadows and other higher-order effects.

35. Dogs as Model for Human Breast Cancer: A Completely Annotated Whole Slide Image Dataset [PDF]
  Marc Aubreville, Christof A. Bertram, Taryn A. Donovan, Christian Marzahl, Andreas Maier, Robert Klopfleisch
Abstract: Canine mammary carcinoma (CMC) has been used as a model to investigate the pathogenesis of human breast cancer and the same grading scheme is commonly used to assess tumor malignancy in both. One key component of this grading scheme is the density of mitotic figures (MF). Current publicly available datasets on human breast cancer only provide annotations for small subsets of whole slide images (WSIs). We present a novel dataset of 21 WSIs of CMC completely annotated for MF. For this, a pathologist screened all WSIs for potential MF and structures with a similar appearance. A second expert blindly assigned labels, and for non-matching labels, a third expert assigned the final labels. Additionally, we used machine learning to identify previously undetected MF. Finally, we performed representation learning and two-dimensional projection to further increase the consistency of the annotations. Our dataset consists of 13,907 MF and 36,379 hard negatives. We achieved a mean F1-score of 0.791 on the test set and of up to 0.696 on a human breast cancer dataset.

36. VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval [PDF]
  Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, Chang D. Yoo
Abstract: Video Moment Retrieval (VMR) is a task to localize the temporal moment in an untrimmed video specified by a natural language query. For VMR, several methods that require full supervision for training have been proposed. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is a labor-intensive process. This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels but only with the text query that describes a segment of the video. Existing methods on wVMR generate multi-scale proposals and apply query-guided attention mechanisms to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which predicts higher scores for the correct video-query pairs than for the incorrect pairs. It has been observed that a large number of candidate proposals, coarse query representation, and a one-way attention mechanism lead to blurry attention maps which limit the localization performance. To handle this issue, Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning out spurious candidate proposals and applying a multi-directional attention mechanism with fine-grained query representation. The Surrogate Proposal Selection module selects a proposal based on its proximity to the query in the joint embedding space, and thus substantially reduces the candidate proposals, which leads to lower computation load and sharper attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flow to learn the multi-modal alignment. VLANet is trained end-to-end using a contrastive loss which pulls semantically similar videos and queries together. The experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets.
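The contrastive criterion described above, scoring correct video-query pairs above incorrect ones, can be illustrated with a simple hinge-style ranking loss. This is a hedged sketch, not VLANet's exact formulation; the margin value and tensor names are assumptions.

```python
import torch

def contrastive_pair_loss(pos_scores, neg_scores, margin=0.5):
    # correct video-query pairs should outscore incorrect ones by `margin`
    return torch.clamp(margin + neg_scores - pos_scores, min=0).mean()

pos = torch.randn(8, requires_grad=True)  # scores of matching video-query pairs
neg = torch.randn(8)                      # scores of mismatched pairs
contrastive_pair_loss(pos, neg).backward()
```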

37. Strawberry Detection using Mixed Training on Simulated and Real Data [PDF]
  Sunny Goondram, Akansel Cosgun, Dana Kulic
Abstract: This paper demonstrates how simulated images can be useful for object detection tasks in the agricultural sector, where labeled data can be scarce and costly to collect. We consider training on mixed datasets with real and simulated data for strawberry detection in real images. Our results show that using the real dataset augmented by the simulated dataset resulted in slightly higher accuracy.

38. Affinity-aware Compression and Expansion Network for Human Parsing [PDF]
  Xinyan Zhang, Yunfeng Wang, Pengfei Xiong
Abstract: As a fine-grained segmentation task, human parsing is still faced with two challenges: inter-part indistinction and intra-part inconsistency, due to the ambiguous definitions and confusing relationships between similar human parts. To tackle these two problems, this paper proposes a novel \textit{Affinity-aware Compression and Expansion} Network (ACENet), which mainly consists of two modules: Local Compression Module (LCM) and Global Expansion Module (GEM). Specifically, LCM compresses parts-correlation information through structural skeleton points, obtained from an extra skeleton branch. It can decrease inter-part interference and strengthen structural relationships between ambiguous parts. Furthermore, GEM expands the semantic information of each part into a complete piece by incorporating spatial affinity with boundary guidance, which can effectively enhance intra-part semantic consistency as well. ACENet achieves new state-of-the-art performance on the challenging LIP and Pascal-Person-Part datasets. In particular, 58.1% mean IoU is achieved on the LIP benchmark.

39. Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars [PDF]
  Egor Zakharov, Aleksei Ivakhnenko, Aliaksandra Shysheya, Victor Lempitsky
Abstract: We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person's appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system.

40. Learning Kernel for Conditional Moment-Matching Discrepancy-based Image Classification [PDF]
  Chuan-Xian Ren, Pengfei Ge, Dao-Qing Dai, Hong Yan
Abstract: Conditional Maximum Mean Discrepancy (CMMD) can capture the discrepancy between conditional distributions by drawing support from nonlinear kernel functions, and it has thus been successfully used for pattern classification. However, CMMD does not work well on complex distributions, especially when the kernel function fails to correctly characterize the difference between intra-class similarity and inter-class similarity. In this paper, a new kernel learning method is proposed to improve the discrimination performance of CMMD. It operates iteratively on deep network features and is thus abbreviated as KLN. The CMMD loss and an auto-encoder (AE) are used to learn an injective function. By considering the compound kernel, i.e., the injective function composed with a characteristic kernel, the effectiveness of CMMD for data category description is enhanced. KLN can simultaneously learn a more expressive kernel and the label prediction distribution; thus, it can be used to improve the classification performance in both supervised and semi-supervised learning scenarios. In particular, the kernel-based similarities are iteratively learned on the deep network features, and the algorithm can be implemented in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets, including MNIST, SVHN, CIFAR-10 and CIFAR-100. The results indicate that KLN achieves state-of-the-art classification performance.
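CMMD extends the (unconditional) Maximum Mean Discrepancy by conditioning on labels through kernel embeddings. As background, here is a minimal sketch of the biased squared-MMD estimator with a Gaussian kernel that such methods build on; the bandwidth and the biased estimator are illustrative choices, not the paper's exact setup.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # x: (n, d), y: (m, d) -> (n, m) Gram matrix
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimator of the squared MMD between two samples."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

print(mmd2(torch.randn(64, 16), torch.randn(64, 16) + 1.0))  # larger for shifted samples
```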

41. Hierarchical Style-based Networks for Motion Synthesis [PDF]
  Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, Xiaolong Wang, Trevor Darrell
Abstract: Generating diverse and natural human motion is one of the long-standing goals for creating intelligent characters in the animated world. In this paper, we propose a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location. Our proposed method learns to model the motion of human by decomposing a long-range generation task in a hierarchical manner. Given the starting and ending states, a memory bank is used to retrieve motion references as source material for short-range clip generation. We first propose to explicitly disentangle the provided motion material into style and content counterparts via bi-linear transformation modelling, where diverse synthesis is achieved by free-form combination of these two components. The short-range clips are then connected to form a long-range motion sequence. Without ground truth annotation, we propose a parameterized bi-directional interpolation scheme to guarantee the physical validity and visual naturalness of generated results. On large-scale skeleton dataset, we show that the proposed method is able to synthesise long-range, diverse and plausible motion, which is also generalizable to unseen motion data during testing. Moreover, we demonstrate the generated sequences are useful as subgoals for actual physical execution in the animated world.

42. m2caiSeg: Semantic Segmentation of Laparoscopic Images using Convolutional Neural Networks [PDF]
  Salman Maqbool, Aqsa Riaz, Hasan Sajid, Osman Hasan
Abstract: Autonomous surgical procedures, in particular minimally invasive surgeries, are the next frontier for Artificial Intelligence research. However, the existing challenges include precise identification of the human anatomy and the surgical settings, and modeling the environment for training of an autonomous agent. To address the identification of human anatomy and the surgical settings, we propose a deep learning based semantic segmentation algorithm to identify and label the tissues and organs in the endoscopic video feed of the human torso region. We present an annotated dataset, m2caiSeg, created from endoscopic video feeds of real-world surgical procedures. Overall, the data consists of 307 images, each of which is annotated for the organs and different surgical instruments present in the scene. We propose and train a deep convolutional neural network for the semantic segmentation task. To cater for the low quantity of annotated data, we use unsupervised pre-training and data augmentation. The trained model is evaluated on an independent test set of the proposed dataset. We obtained an F1 score of 0.33 while using all the labeled categories for the semantic segmentation task. Secondly, we grouped all instruments into an 'Instruments' superclass to evaluate the model's performance on discerning the various organs, and obtained an F1 score of 0.57. We propose a new dataset and a deep learning method for pixel-level identification of various organs and instruments in an endoscopic surgical scene. Surgical scene understanding is one of the first steps towards automating surgical procedures.

43. Good Graph to Optimize: Cost-Effective, Budget-Aware Bundle Adjustment in Visual SLAM [PDF]
  Yipu Zhao, Justin S. Smith, Patricio A. Vela
Abstract: The cost-efficiency of visual(-inertial) SLAM (VSLAM) is a critical characteristic of resource-limited applications. While hardware and algorithm advances have significantly improved the cost-efficiency of VSLAM front-ends, the cost-efficiency of VSLAM back-ends remains a bottleneck. This paper describes a novel, rigorous method to improve the cost-efficiency of local BA in a BA-based VSLAM back-end. An efficient algorithm, called Good Graph, is developed to select size-reduced graphs optimized in local BA with condition preservation. To better suit BA-based VSLAM back-ends, Good Graph predicts future estimation needs, dynamically assigns an appropriate size budget, and selects a condition-maximized subgraph for BA estimation. Evaluations are conducted on two scenarios: 1) VSLAM as a standalone process, and 2) VSLAM as part of a closed-loop navigation system. Results from the first scenario show Good Graph improves the accuracy and robustness of VSLAM estimation when computational limits exist. Results from the second scenario indicate that Good Graph benefits the trajectory tracking performance of VSLAM-based closed-loop navigation systems, which is a primary application of VSLAM.

44. Robust Vision Challenge 2020 -- 1st Place Report for Panoptic Segmentation [PDF]
  Rohit Mohan, Abhinav Valada
Abstract: In this technical report, we present key details of our winning panoptic segmentation architecture EffPS_b1bs4_RVC. Our network is a lightweight version of our state-of-the-art EfficientPS architecture that consists of our proposed shared backbone with a modified EfficientNet-B5 model as the encoder, followed by the 2-way FPN to learn semantically rich multi-scale features. It consists of two task-specific heads, a modified Mask R-CNN instance head and our novel semantic segmentation head that processes features of different scales with specialized modules for coherent feature refinement. Finally, our proposed panoptic fusion module adaptively fuses logits from each of the heads to yield the panoptic segmentation output. The Robust Vision Challenge 2020 benchmarking results show that our model is ranked #1 on Microsoft COCO, VIPER and WildDash, and is ranked #2 on Cityscapes and Mapillary Vistas, thereby achieving the overall rank #1 for the panoptic segmentation task.

45. Developing and Defeating Adversarial Examples [PDF]
  Ian McDiarmid-Sterling, Allan Moser
Abstract: Breakthroughs in machine learning have resulted in state-of-the-art deep neural networks (DNNs) performing classification tasks in safety-critical applications. Recent research has demonstrated that DNNs can be attacked through adversarial examples, which are small perturbations to input data that cause the DNN to misclassify objects. The proliferation of DNNs raises important safety concerns about designing systems that are robust to adversarial examples. In this work we develop adversarial examples to attack the Yolo V3 object detector [1] and then study strategies to detect and neutralize these examples. Python code for this project is available at this https URL
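The abstract does not spell out the attack recipe, so as a hedged illustration, the Fast Gradient Sign Method is one standard way to craft adversarial examples against a differentiable model. The function names, the epsilon of 8/255, and the generic loss interface below are assumptions, not necessarily what the authors used against Yolo V3.

```python
import torch

def fgsm_attack(model, loss_fn, images, targets, epsilon=8 / 255):
    """One-step FGSM: perturb inputs along the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = loss_fn(model(images), targets)   # any differentiable detection/classification loss
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0, 1).detach()          # keep pixels in the valid range
```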

46. Let me join you! Real-time F-formation recognition by a socially aware robot [PDF]
  Hrishav Bakul Barua, Pradip Pramanick, Chayan Sarkar, Theint Haythi Mg
Abstract: This paper presents a novel architecture to detect social groups in real-time from a continuous image stream of an ego-vision camera. F-formation defines social orientations in space where two or more persons tend to communicate in a social place. Thus, essentially, we detect F-formations in social gatherings such as meetings, discussions, etc. and predict the robot's approach angle if it wants to join the social group. Additionally, we also detect outliers, i.e., the persons who are not part of the group under consideration. Our proposed pipeline consists of -- a) a skeletal key point estimator (a total of 17 points) for each detected human in the scene, b) a learning model (using a feature vector based on the skeletal points) using CRF to detect groups of people and outlier persons in a scene, and c) a separate learning model using a multi-class Support Vector Machine (SVM) to predict the exact F-formation of the group of people in the current scene and the angle of approach for the viewing robot. The system is evaluated using two datasets. The results show that group and outlier detection in a scene using our method achieves an accuracy of 91%. We have made rigorous comparisons of our system with a state-of-the-art F-formation detection system and found that it outperforms the state-of-the-art by 29% for formation detection and by 55% for combined detection of the formation and approach angle.
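The final stage described above is a standard multi-class SVM over features derived from the skeletal key points. A minimal sklearn sketch follows; the 34-dimensional features (17 key points x 2 coordinates), the four F-formation classes, and the RBF kernel are stand-in assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 34))            # stand-in skeletal feature vectors
y = rng.integers(0, 4, size=200)          # stand-in F-formation class labels

clf = SVC(kernel="rbf", decision_function_shape="ovr")  # multi-class SVM
clf.fit(X, y)
predicted_formation = clf.predict(rng.normal(size=(5, 34)))
```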

47. Multi-Person Full Body Pose Estimation [PDF]
  Haoyi Zhu, Cheng Jie, Shaofei Jiang
Abstract: Multi-person pose estimation plays an important role in many fields. Although previous works have studied different parts of human pose estimation extensively, full-body pose estimation for multiple persons still needs further research. Our work has developed an integrated model, through knowledge distillation, which can estimate full-body poses. Trained based on the AlphaPose system and the MSCOCO2017 dataset, our model achieves 51.5 mAP on our manually annotated validation dataset. Related resources are available at this https URL.

48. Holistic Multi-View Building Analysis in the Wild with Projection Pooling [PDF]
  Zbigniew Wojna, Krzysztof Maziarz, Łukasz Jocz, Robert Pałuba, Robert Kozikowski, Iasonas Kokkinos
Abstract: We address six different classification tasks related to fine-grained building attributes: construction type, number of floors, roof pitch, roof geometry, facade material, and occupancy class. Tackling such a problem of remote building analysis became possible only recently, due to growing large-scale datasets of urban scenes. To this end, we introduce a new benchmarking dataset, consisting of 49426 top-view and street-view images of 9674 buildings. These photos are further assembled together with the geometric metadata. The dataset showcases a variety of real-world challenges, such as occlusions, blur, partially visible objects, and a broad spectrum of buildings. We propose a new projection pooling layer, creating a unified, top-view representation of the top view and the side views in a high-dimensional space. It allows us to utilize the building and imagery metadata seamlessly. Introducing this layer improves classification accuracy -- compared to highly tuned baseline models -- indicating its suitability for building analysis.

49. Dual Adversarial Auto-Encoders for Clustering [PDF]
  Pengfei Ge, Chuan-Xian Ren, Jiashi Feng, Shuicheng Yan
Abstract: As a powerful approach for exploratory data analysis, unsupervised clustering is a fundamental task in computer vision and pattern recognition. Many clustering algorithms have been developed, but most of them perform unsatisfactorily on data with complex structures. Recently, the Adversarial Auto-Encoder (AAE) has shown effectiveness in tackling such data by combining an Auto-Encoder (AE) with adversarial training, but it cannot effectively extract classification information from the unlabeled data. In this work, we propose the Dual Adversarial Auto-encoder (Dual-AAE), which simultaneously maximizes the likelihood function and the mutual information between observed examples and a subset of latent variables. By performing variational inference on the objective function of Dual-AAE, we derive a new reconstruction loss which can be optimized by training a pair of auto-encoders. Moreover, to avoid mode collapse, we introduce a clustering regularization term for the category variable. Experiments on four benchmarks show that Dual-AAE achieves superior performance over state-of-the-art clustering methods. Besides, by adding a reject option, the clustering accuracy of Dual-AAE can reach that of supervised CNN algorithms. Dual-AAE can also be used for disentangling the style and content of images without using supervised information.

50. Seesaw Loss for Long-Tailed Instance Segmentation [PDF]
  Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change Loy, Dahua Lin
Abstract: This report presents the approach used in team MMDet's submission to the LVIS Challenge 2020. In the submission, we propose Seesaw Loss, which dynamically rebalances the penalty for each category according to the relative ratio of cumulative training instances between different categories. Furthermore, we propose HTC-Lite, a light-weight version of Hybrid Task Cascade (HTC), which replaces the semantic segmentation branch with a global context encoder. Seesaw Loss improves the strong baseline by 6.9% AP on the LVIS v1 val split. With a single model, and without using external data and annotations except for the standard ImageNet-1k classification dataset for backbone pre-training, our submission achieves 38.92% AP on the test-dev split of the LVIS v1 benchmark.
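The rebalancing idea can be sketched as a per-sample scaling of negative-class logits by the ratio of cumulative instance counts. The sketch below covers only this mitigation effect and omits the compensation term of the full Seesaw Loss; the exponent p and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def seesaw_style_ce(logits, labels, class_counts, p=0.8):
    """For a sample of class i, down-weight the penalty on a rarer negative
    class j by (N_j / N_i)^p; more frequent negatives keep their full penalty."""
    n = class_counts.float()
    ratio = n[None, :] / n[labels][:, None]          # N_j / N_i, shape (B, C)
    scale = torch.clamp(ratio ** p, max=1.0)         # < 1 only for rarer negatives
    scale.scatter_(1, labels[:, None], 1.0)          # never rescale the positive class
    return F.cross_entropy(logits + torch.log(scale), labels)

loss = seesaw_style_ce(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                       torch.randint(1, 1000, (10,)))
```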

51. A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild [PDF]
  K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
Abstract: In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model and evaluation benchmarks on our website: \url{this http URL}. The code and models are released at this GitHub repository: \url{this http URL}. You can also try out the interactive demo at this link: \url{this http URL}.
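The "lip-sync expert" is a pre-trained sync discriminator; a common SyncNet-style recipe scores audio-video synchrony via cosine similarity trained with binary cross-entropy. The sketch below follows that general recipe under stated assumptions and is not claimed to be the paper's exact discriminator.

```python
import torch
import torch.nn.functional as F

def sync_probability_loss(audio_emb, video_emb, in_sync):
    """audio_emb, video_emb: (B, D) embeddings of an audio window and the
    matching lip-region crop; in_sync: (B,) float labels, 1 = synced pair."""
    sim = F.cosine_similarity(audio_emb, video_emb, dim=-1)
    prob = ((sim + 1) / 2).clamp(1e-6, 1 - 1e-6)     # map [-1, 1] to (0, 1)
    return F.binary_cross_entropy(prob, in_sync)
```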

52. Discriminative Residual Analysis for Image Set Classification with Posture and Age Variations [PDF]
  Chuan-Xian Ren, You-Wei Luo, Xiao-Lin Xu, Dao-Qing Dai, Hong Yan
Abstract: Image set recognition has been widely applied in many practical problems like real-time video retrieval and image captioning tasks. Due to its superior performance, it has grown into a significant topic in recent years. However, images with complicated variations, e.g., postures and human ages, are difficult to address, as these variations are continuous and gradual with respect to image appearance. Consequently, the crucial point of image set recognition is to mine the intrinsic connection or structural information from the image batches with variations. In this work, a Discriminant Residual Analysis (DRA) method is proposed to improve the classification performance by discovering discriminant features in related and unrelated groups. Specifically, DRA attempts to obtain a powerful projection which casts the residual representations into a discriminant subspace. Such a projection subspace is expected to magnify the useful information of the input space as much as possible; the relation between the training set and the test set described by the given metric or distance will then be more precise in the discriminant subspace. We also propose a nonfeasance strategy by defining another approach to construct the unrelated groups, which helps to further reduce the cost of sampling errors. Two regularization approaches are used to deal with the probable small sample size problem. Extensive experiments are conducted on benchmark databases, and the results show the superiority and efficiency of the new methods.

53. Visible Feature Guidance for Crowd Pedestrian Detection [PDF]
  Zhida Huang, Kaiyu Yue, Jiangfan Deng, Feng Zhou
Abstract: Heavy occlusion and dense gathering in crowd scenes make pedestrian detection a challenging problem, because it is difficult to guess a precise full bounding box for the invisible human parts. To crack this nut, we propose a mechanism called Visible Feature Guidance (VFG) for both training and inference. During training, we adopt the visible feature to regress the simultaneous outputs of the visible bounding box and the full bounding box. Then we perform NMS only on visible bounding boxes to obtain the best-fitting full boxes at inference. This alleviates the adverse influence of NMS in crowd scenes and makes full bounding boxes more precise. Furthermore, in order to ease feature association in downstream applications such as pedestrian tracking, we apply the Hungarian algorithm to associate parts with a human instance. Our proposed method can stably bring about 2~3% improvements in mAP and AP50 for both two-stage and one-stage detectors. It is also more effective for MR-2, especially with stricter IoU thresholds. Experiments on the Crowdhuman, Cityperson, Caltech and KITTI datasets show that visible feature guidance can help detectors achieve promisingly better performance. Moreover, parts association produces a strong benchmark on Crowdhuman for the vision community.
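The inference step, suppressing duplicates by visible-box overlap while emitting full boxes, can be sketched directly. This is a hedged reading of VFG's NMS stage, with a plain greedy NMS and an illustrative threshold:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area + areas - inter + 1e-9)

def visible_guided_nms(visible, full, scores, thresh=0.5):
    """Greedy NMS computed on *visible* boxes; returns the paired *full* boxes
    of the surviving detections."""
    order, keep = scores.argsort()[::-1], []
    while order.size:
        i, rest = order[0], order[1:]
        keep.append(i)
        order = rest[iou(visible[i], visible[rest]) <= thresh]
    return full[keep]
```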

54. Neighbourhood-Insensitive Point Cloud Normal Estimation Network [PDF]
  Zirui Wang, Victor Adrian Prisacariu
Abstract: We introduce a novel self-attention-based normal estimation network that is able to focus softly on relevant points and adjust the softness by learning a temperature parameter, making it able to work naturally and effectively within a large neighbourhood range. As a result, our model outperforms all existing normal estimation algorithms by a large margin, achieving 94.1% accuracy in comparison with the previous state of the art of 91.2%, with a 25x smaller model and 12x faster inference time. We also use point-to-plane Iterative Closest Point (ICP) as an application case to show that our normal estimations lead to faster convergence than normal estimations from other methods, without manually fine-tuning neighbourhood range parameters. Code available at https://code.active.vision.
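The learnable softness described above amounts to a temperature inside the attention softmax. A hedged sketch of that single idea (not the full network) follows; the parameterization via a log-temperature is an assumption made to keep the temperature positive.

```python
import torch
import torch.nn.functional as F

class SoftNeighbourhoodAttention(torch.nn.Module):
    """Attention over a point's K neighbours with a learned temperature: a
    small temperature sharpens the focus, a large one softens it."""
    def __init__(self):
        super().__init__()
        self.log_t = torch.nn.Parameter(torch.zeros(1))   # learned log-temperature

    def forward(self, query, neigh):
        # query: (B, D), neigh: (B, K, D)
        sim = torch.einsum("bd,bkd->bk", query, neigh)    # similarity to each neighbour
        attn = F.softmax(sim / self.log_t.exp(), dim=-1)  # temperature-scaled weights
        return torch.einsum("bk,bkd->bd", attn, neigh)    # soft aggregation
```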

55. Matching Guided Distillation [PDF]
  Kaiyu Yue, Jiangfan Deng, Feng Zhou
Abstract: Feature distillation is an effective way to improve the performance of a smaller student model, which has fewer parameters and lower computation cost than the larger teacher model. Unfortunately, there is a common obstacle - the gap in semantic feature structure between the intermediate features of teacher and student. The classic scheme transforms intermediate features by adding an adaptation module, such as a naive convolutional, attention-based or more complicated one. However, this introduces two problems: a) the adaptation module brings more parameters into training; b) an adaptation module with random initialization or a special transformation isn't friendly for distilling a pre-trained student. In this paper, we present Matching Guided Distillation (MGD) as an efficient and parameter-free manner of solving these problems. The key idea of MGD is to pose the matching of teacher channels with student channels as an assignment problem. We compare three solutions of the assignment problem for reducing channels from teacher features with a partial distillation loss. The overall training takes a coordinate-descent approach between two optimization objects - assignment updates and parameter updates. Since MGD only contains normalization or pooling operations with negligible computation cost, it is flexible to plug into networks together with other distillation methods.
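The assignment step can be illustrated with the Hungarian solver from scipy: pair each student channel with a teacher channel by minimizing pairwise feature distances. This is a hedged sketch under assumed shapes, not the paper's exact matching variant:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_channels(student_feat, teacher_feat):
    """student_feat: (Cs, N) flattened per-channel activations; teacher_feat:
    (Ct, N) with Cs <= Ct. Returns the teacher channels matched, in order,
    to the student channels."""
    cost = torch.cdist(student_feat, teacher_feat)         # (Cs, Ct) L2 distances
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return teacher_feat[cols]                              # aligned (Cs, N) targets

matched = match_channels(torch.randn(64, 196), torch.randn(256, 196))
```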

56. Few-Shot Image Classification via Contrastive Self-Supervised Learning [PDF]
  Jianyi Li, Guizhong Liu
Abstract: Most previous few-shot learning algorithms are based on meta-training with fake few-shot tasks as training samples, where large labeled base classes are required. The trained model is also limited by the type of tasks. In this paper we propose a new paradigm of unsupervised few-shot learning to repair these deficiencies. We solve the few-shot tasks in two phases: meta-training a transferable feature extractor via contrastive self-supervised learning, and training a classifier using graph aggregation, self-distillation and manifold augmentation. Once meta-trained, the model can be used in any type of task with a task-dependent classifier training. Our method achieves state-of-the-art performance in a variety of established few-shot tasks on the standard few-shot visual classification datasets, with an 8-28% increase compared to the available unsupervised few-shot learning methods.

57. Quantitative Survey of the State of the Art in Sign Language Recognition [PDF]
  Oscar Koller
Abstract: This work presents a meta study covering around 300 published sign language recognition papers with over 400 experimental results. It includes most papers between the start of the field in 1983 and 2020. Additionally, it covers a fine-grained analysis of over 25 studies that have compared their recognition approaches on RWTH-PHOENIX-Weather 2014, the standard benchmark task of the field. Research in the domain of sign language recognition has progressed significantly in the last decade, reaching a point where the task attracts much more attention than ever before. This study compiles the state of the art in a concise way to help advance the field and reveal open questions. Moreover, all of this meta study's source data is made public, easing future work with it and further expansion. The analyzed papers have been manually labeled with a set of categories. The data reveals many insights, such as, among others, shifts in the field from intrusive to non-intrusive capturing, from local to global features, and the lack of non-manual parameters in medium and larger vocabulary recognition systems. Surprisingly, RWTH-PHOENIX-Weather, with a vocabulary of 1080 signs, represents the only resource for large vocabulary continuous sign language recognition benchmarking worldwide.

58. Few-Shot Learning with Intra-Class Knowledge Transfer [PDF]
  Vivek Roy, Yan Xu, Yu-Xiong Wang, Kris Kitani, Ruslan Salakhutdinov, Martial Hebert
Abstract: We consider the few-shot classification task with an unbalanced dataset, in which some classes have sufficient training samples while other classes only have limited training samples. Recent works have proposed to solve this task by augmenting the training data of the few-shot classes using generative models, with the few-shot training samples as the seeds. However, due to the limited number of the few-shot seeds, the generated samples usually have small diversity, making it difficult to train a discriminative classifier for the few-shot classes. To enrich the diversity of the generated samples, we propose to leverage the intra-class knowledge from the neighboring many-shot classes, with the intuition that neighboring classes share similar statistical information. Such intra-class information is obtained with a two-step mechanism. First, a regressor trained only on the many-shot classes is used to estimate the few-shot class means from only a few samples. Second, superclasses are clustered, and the statistical mean and feature variance of each superclass are used as transferable knowledge inherited by the children few-shot classes. Such knowledge is then used by a generator to augment the sparse training data and help the downstream classification tasks. Extensive experiments show that our method achieves the state of the art across different datasets and $n$-shot settings.
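The two-step transfer can be condensed into a tiny sketch: take the (regressed) few-shot class mean, borrow the superclass feature variance, and sample synthetic features from the resulting Gaussian. Function names and the Gaussian sampling are illustrative assumptions, not the paper's exact generator.

```python
import numpy as np

def augment_few_shot(few_shot_feats, superclass_var, n_new=100, seed=0):
    """few_shot_feats: (k, D) seed features of a few-shot class;
    superclass_var: (D,) feature variance inherited from its superclass."""
    rng = np.random.default_rng(seed)
    mean = few_shot_feats.mean(axis=0)     # class mean estimated from the few seeds
    return rng.normal(loc=mean, scale=np.sqrt(superclass_var), size=(n_new, mean.size))
```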

59. Online Visual Tracking with One-Shot Context-Aware Domain Adaptation [PDF]
  Hossein Kashiani, Amir Abbas Hamidi Imani, Shahriar Baradaran Shokouhi, Ahmad Ayatollahi
Abstract: The online learning policy makes visual trackers more robust against different distortions through learning domain-specific cues. However, trackers adopting this policy fail to fully leverage the discriminative context of the background areas. Moreover, owing to the lack of sufficient data at each time step, the online learning approach can also make the trackers prone to over-fitting to the background regions. In this paper, we propose a domain adaptation approach to strengthen the contributions of the semantic background context. The domain adaptation approach relies only on an off-the-shelf deep model as its backbone. The strength of the proposed approach comes from its discriminative ability to handle severe occlusion and background clutter challenges. We further introduce a cost-sensitive loss alleviating the dominance of non-semantic background candidates over the semantic candidates, thereby dealing with the data imbalance issue. Experimental results demonstrate that our tracker achieves competitive results at real-time speed compared to the state-of-the-art trackers.

60. Supervision Levels Scale (SLS) [PDF]
  Dima Damen, Michael Wray
Abstract: We propose a three-dimensional discrete and incremental scale to encode a method's level of supervision - i.e. the data and labels used when training a model to achieve a given performance. We capture three aspects of supervision that are known to give methods an advantage while requiring additional costs: pre-training, training labels and training data. The proposed three-dimensional scale can be included in result tables or leaderboards to handily compare methods not only by their performance, but also by the level of data supervision utilised by each method. The Supervision Levels Scale (SLS) is first presented generally, for any task/dataset/challenge. It is then applied to the EPIC-KITCHENS-100 dataset, to be used for the various leaderboards and challenges associated with this dataset.
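As a toy illustration of how such a discrete scale could be carried alongside leaderboard entries, one might encode the three axes as integer levels; the level semantics and the tag format below are assumptions, not the paper's official encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SupervisionLevel:
    pretraining: int   # assumed discrete level of pre-training used
    labels: int        # assumed discrete level of training-label supervision
    data: int          # assumed discrete level of training data used

    def tag(self):
        return f"SLS-PT{self.pretraining}-TL{self.labels}-TD{self.data}"

print(SupervisionLevel(pretraining=2, labels=3, data=1).tag())  # SLS-PT2-TL3-TD1
```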

61. Joint Modeling of Chest Radiographs and Radiology Reports for Pulmonary Edema Assessment [PDF]
  Geeticka Chauhan, Ruizhi Liao, William Wells, Jacob Andreas, Xin Wang, Seth Berkowitz, Steven Horng, Peter Szolovits, Polina Golland
Abstract: We propose and demonstrate a novel machine learning algorithm that assesses pulmonary edema severity from chest radiographs. While large publicly available datasets of chest radiographs and free-text radiology reports exist, only limited numerical edema severity labels can be extracted from radiology reports. This is a significant challenge in learning such models for image classification. To take advantage of the rich information present in the radiology reports, we develop a neural network model that is trained on both images and free-text to assess pulmonary edema severity from chest radiographs at inference time. Our experimental results suggest that the joint image-text representation learning improves the performance of pulmonary edema assessment compared to a supervised model trained on images only. We also show the use of the text for explaining the image classification by the joint model. To the best of our knowledge, our approach is the first to leverage free-text radiology reports for improving the image model performance in this application. Our code is available at this https URL.

62. Unsupervised Deep Metric Learning via Orthogonality based Probabilistic Loss [PDF]
  Ujjal Kr Dutta, Mehrtash Harandi, Chellu Chandra Sekhar
Abstract: Metric learning is an important problem in machine learning. It aims to group similar examples together. Existing state-of-the-art metric learning approaches require class labels to learn a metric. As obtaining class labels in all applications is not feasible, we propose an unsupervised approach that learns a metric without making use of class labels. The lack of class labels is compensated by obtaining pseudo-labels of data using a graph-based clustering approach. The pseudo-labels are used to form triplets of examples, which guide the metric learning. We propose a probabilistic loss that minimizes the chances of each triplet violating an angular constraint. A weight function and an orthogonality constraint in the objective speed up the convergence and avoid a model collapse. We also provide a stochastic formulation of our method to scale up to large-scale datasets. Our studies demonstrate the competitiveness of our approach against state-of-the-art methods. We also thoroughly study the effect of the different components of our method.
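For intuition, one established way to express an angular constraint on a triplet comes from the angular-loss formulation in the metric learning literature: penalize triplets whose geometry violates an upper bound alpha on the angle at the negative. The sketch below shows that constraint alone; the paper wraps such a constraint in a probabilistic loss, so this is background, not the authors' objective.

```python
import torch

def angular_violation(anchor, positive, negative, alpha_deg=45.0):
    """Hinge on the angular triplet constraint: zero when the triplet
    satisfies the angle bound, positive otherwise. Inputs: (B, D)."""
    tan2 = torch.tan(torch.deg2rad(torch.tensor(alpha_deg))) ** 2
    center = (anchor + positive) / 2
    viol = ((anchor - positive).pow(2).sum(-1)
            - 4 * tan2 * (negative - center).pow(2).sum(-1))
    return torch.clamp(viol, min=0).mean()
```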

63. Symbolic Semantic Segmentation and Interpretation of COVID-19 Lung Infections in Chest CT volumes based on Emergent Languages [PDF]
  Aritra Chowdhury, Alberto Santamaria-Pang, James R. Kubricht, Jianwei Qiu, Peter Tu
Abstract: The coronavirus disease (COVID-19) has resulted in a pandemic crippling a breadth of services critical to daily life. Segmentation of lung infections in computerized tomography (CT) slices could be used to improve the diagnosis and understanding of COVID-19 in patients. Deep learning systems lack interpretability because of their black-box nature. Inspired by human communication of complex ideas through language, we propose a symbolic framework based on emergent languages for the segmentation of COVID-19 infections in CT scans of lungs. We model the cooperation between two artificial agents - a Sender and a Receiver. These agents synergistically cooperate using emergent symbolic language to solve the task of semantic segmentation. Our game-theoretic approach is to model the cooperation between agents, unlike Generative Adversarial Networks (GANs). The Sender retrieves information from one of the higher layers of the deep network and generates a symbolic sentence sampled from a categorical distribution of vocabularies. The Receiver ingests the stream of symbols and cogenerates the segmentation mask. A private emergent language is developed that forms the communication channel used to describe the task of segmentation of COVID infections. We augment existing state-of-the-art semantic segmentation architectures with our symbolic generator to form symbolic segmentation models. Our symbolic segmentation framework achieves state-of-the-art performance for segmentation of lung infections caused by COVID-19. Our results show direct interpretation of symbolic sentences to discriminate between normal and infected regions, infection morphology and image characteristics. We show state-of-the-art results for segmentation of COVID-19 lung infections in CT.

64. Emergent symbolic language based deep medical image classification [PDF]
  Aritra Chowdhury, Alberto Santamaria-Pang, James R. Kubricht, Peter Tu
Abstract: Modern deep learning systems for medical image classification have demonstrated exceptional capabilities for distinguishing between image based medical categories. However, they are severely hindered by their inability to explain the reasoning behind their decision making. This is partly due to the uninterpretable continuous latent representations of neural networks. Emergent languages (EL) have recently been shown to enhance the capabilities of neural networks by equipping them with symbolic representations in the framework of referential games. Symbolic representations are one of the cornerstones of highly explainable good old fashioned AI (GOFAI) systems. In this work, we demonstrate for the first time the emergence of deep symbolic representations of emergent language in the framework of image classification. We show that EL based classification models can perform as well as, if not better than, state of the art deep learning models. In addition, they provide a symbolic representation that opens up an entire field of possibilities of interpretable GOFAI methods involving symbol manipulation. We demonstrate the EL classification framework on immune cell marker based cell classification and chest X-ray classification using the CheXpert dataset. Code is available online at this https URL.

65. Data augmentation techniques for the Video Question Answering task [PDF]
  Alex Falcon, Oswald Lanz, Giuseppe Serra
Abstract: Video Question Answering (VideoQA) is a task that requires a model to analyze and understand both the visual content given by the input video and the textual part given by the question, as well as the interaction between them, in order to produce a meaningful answer. In our work we focus on the Egocentric VideoQA task, which exploits first-person videos, because of the importance of such a task, which can have an impact on many different fields, such as those pertaining to social assistance and industrial training. Recently, an Egocentric VideoQA dataset, called EgoVQA, has been released. Given its small size, models tend to overfit quickly. To alleviate this problem, we propose several augmentation techniques which give us a +5.5% improvement on the final accuracy over the considered baseline.

66. Revisiting Anchor Mechanisms for Temporal Action Localization [PDF]
  Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, Junwei Han
Abstract: Most of the current action localization methods follow an anchor-based pipeline: depicting action instances by pre-defined anchors, learning to select the anchors closest to the ground truth, and predicting the confidence of anchors with refinements. Pre-defined anchors set a prior on the location and duration of action instances, which facilitates the localization of common action instances but limits the flexibility for tackling action instances with drastic variety, especially extremely short or extremely long ones. To address this problem, this paper proposes a novel anchor-free action localization module that assists action localization through temporal points. Specifically, this module represents an action instance as a point with its distances to the starting boundary and ending boundary, alleviating the pre-defined anchor restrictions in terms of action localization and duration. The proposed anchor-free module is capable of predicting action instances whose duration is either extremely short or extremely long. By combining the proposed anchor-free module with a conventional anchor-based module, we propose a novel action localization framework, called A2Net. The cooperation between the anchor-free and anchor-based modules achieves superior performance to the state-of-the-art on THUMOS14 (45.5% vs. 42.8%). Furthermore, comprehensive experiments demonstrate the complementarity between the anchor-free and the anchor-based modules, making A2Net simple but effective.
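The point-based representation makes decoding trivial: each temporal location carries its distances to the action's start and end, so a segment of any duration is recovered without anchors. A minimal sketch follows; the names and top-k selection are illustrative.

```python
import numpy as np

def decode_action_points(times, d_start, d_end, scores, top_k=5):
    """times, d_start, d_end, scores: (T,) arrays; returns (top_k, 2) segments
    [t - d_start, t + d_end] for the highest-confidence temporal points."""
    starts, ends = times - d_start, times + d_end
    keep = scores.argsort()[::-1][:top_k]
    return np.stack([starts[keep], ends[keep]], axis=1)
```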

67. RANSIP: From noisy point clouds to complete ear models, unsupervised [PDF]
  Filipa Valdeira, Ricardo Ferreira, Alessandra Micheletti, Cláudia Soares
Abstract: Ears are a particularly difficult region of the human face to model, not only due to the non-rigid deformations existing between shapes but also to the challenges in processing the retrieved data. The first step towards obtaining a good model is to have complete scans in correspondence, but these usually present a higher amount of occlusions, noise and outliers when compared to most face regions, thus requiring a specific procedure. Therefore, we propose a complete pipeline taking as input unordered 3D point clouds with the aforementioned problems, and producing as output a dataset in correspondence, with completion of the missing data. We provide a comparison of several state-of-the-art registration methods and propose a new approach for one of the steps of the pipeline, with better performance for our data.

68. Memory-based Jitter: Improving Visual Recognition on Long-tailed Data with Diversity In Memory [PDF] 返回目录
  Jialun Liu, Jingwei Zhang, Wenhui Li, Chi Zhang, Yifan Sun
Abstract: This paper considers deep visual recognition on long-tailed data, where the majority of categories occupy relatively few samples. The tail categories are prone to lack of within-class diversity, which compromises the representative ability of the learned visual concepts. A radical solution is to augment the tail categories with higher diversity. To this end, we introduce a simple and reliable method named Memory-based Jitter (MBJ) to gain extra diversity for the tail data. We observe that the deep model keeps jittering from one historical edition to another, even when it already approaches convergence. The "jitter" refers to the small variations between historical models. We argue that such jitter largely originates from the within-class diversity of the overall data and thus encodes the within-class distribution pattern. To utilize such jitter for tail data augmentation, we store the jitter among historical models into a memory bank, yielding the so-called Memory-based Jitter. With slight modifications, MBJ is applicable to two fundamental visual recognition tasks, i.e., image classification and deep metric learning (on long-tailed data). On image classification, MBJ collects the historical embeddings to learn an accurate classifier. In contrast, on deep metric learning, it collects the historical prototypes of each class to learn a robust deep embedding. Under both scenarios, MBJ enforces higher concentration on tail classes, so as to compensate for their lack of diversity. Extensive experiments on three long-tailed classification benchmarks and two deep metric learning benchmarks (person re-identification, in particular) demonstrate significant improvement. Moreover, the achieved performance is on par with the state-of-the-art on both tasks.
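A minimal sketch of the memory-bank idea follows: embeddings from successive model editions are stored per class and later drawn as extra "jitter" samples for tail classes. Capacity, sampling rule, and all names here are illustrative assumptions, not the paper's exact recipe.

import torch
from collections import defaultdict, deque

class MemoryBank:
    def __init__(self, capacity=8):
        # One FIFO queue of historical embeddings per class.
        self.bank = defaultdict(lambda: deque(maxlen=capacity))

    def update(self, feats, labels):
        # Store detached embeddings produced by the current model edition.
        for f, y in zip(feats.detach(), labels):
            self.bank[int(y)].append(f)

    def jitter(self, label, k=4):
        # Draw up to k historical embeddings of one (tail) class.
        hist = list(self.bank[int(label)])
        return torch.stack(hist[-k:]) if hist else None

bank = MemoryBank()
feats = torch.randn(16, 128)          # a batch of embeddings
labels = torch.randint(0, 4, (16,))   # four toy classes
bank.update(feats, labels)
extra = bank.jitter(label=0)          # historical diversity for class 0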

69. Identity-Aware Multi-Sentence Video Description [PDF] 返回目录
  Jae Sung Park, Trevor Darrell, Anna Rohrbach
Abstract: Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. We introduce an auxiliary task of Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities. To be able to tackle both tasks, we augment the Large Scale Movie Description Challenge (LSMDC) benchmark with new annotations suited to our problem statement. Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works, and allows us to generate descriptions with locally re-identified people.

70. Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model [PDF] 返回目录
  Hung-Min Hsu, Yizhou Wang, Jenq-Neng Hwang
Abstract: Multi-target multi-camera tracking (MTMCT), i.e., tracking multiple targets across multiple cameras, is a crucial technique for smart city applications. In this paper, we propose an effective and reliable MTMCT framework for vehicles, which consists of a traffic-aware single-camera tracking (TSCT) algorithm, a trajectory-based camera link model (CLM) for vehicle re-identification (ReID), and a hierarchical clustering algorithm to obtain the cross-camera vehicle trajectories. First, the TSCT, which jointly considers vehicle appearance, geometric features, and some common traffic scenarios, is proposed to track the vehicles in each camera separately. Second, the trajectory-based CLM is adopted to facilitate the relationship between each pair of adjacently connected cameras and add spatio-temporal constraints for the subsequent vehicle ReID with temporal attention. Third, the hierarchical clustering algorithm is used to merge the vehicle trajectories among all the cameras to obtain the final MTMCT results. Our proposed MTMCT is evaluated on the CityFlow dataset and achieves new state-of-the-art performance with an IDF1 of 74.93%.
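The final merging step can be pictured with off-the-shelf agglomerative clustering over per-trajectory ReID descriptors, as sketched below. The cosine metric, average linkage, and the cut threshold are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical per-track ReID descriptors (one mean 256-D feature per
# single-camera trajectory).
rng = np.random.default_rng(0)
track_feats = rng.normal(size=(12, 256))

dists = pdist(track_feats, metric="cosine")   # pairwise track distances
Z = linkage(dists, method="average")          # agglomerative clustering
vehicle_ids = fcluster(Z, t=0.7, criterion="distance")
print(vehicle_ids)  # tracks sharing an ID form one cross-camera trajectory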

71. Chest Area Segmentation in Depth Images of Sleeping Patients [PDF] 返回目录
  Yoav Goldstein, Martin Schätz, Mireille Avigal
Abstract: Although the field of sleep study has greatly developed over recent years, the most common and efficient way to detect sleep issues remains a sleep examination performed in a sleep laboratory, in a procedure called Polysomnography (PSG). This examination measures several vital signals during a full night's sleep using multiple sensors connected to the patient's body. Yet, despite being the gold standard, the connection of the sensors and the unfamiliar environment inevitably impact the quality of the patient's sleep and the examination itself. Therefore, with the development of more accurate and affordable 3D sensing devices, new approaches for non-contact sleep study have emerged. These methods utilize different techniques to extract the same sleep parameters, but remotely, eliminating the need for any physical connection to the patient's body. However, in order to enable reliable remote extraction, these methods require accurate identification of the basic Region of Interest (ROI), i.e. the chest area of the patient, a task that is currently holding back the development process, as it is performed manually for each patient. In this study, we propose an automatic chest area segmentation algorithm that, given an input set of 3D frames of a sleeping patient, outputs a segmentation image with the pixels that correspond to the chest area, which can then be used as an input to subsequent sleep analysis algorithms. Besides significantly speeding up the development process of the non-contact methods, accurate automatic segmentation can also enable more precise feature extraction, and it is shown to already improve the sensitivity of prior solutions by 46.9% on average compared to manual ROI selection. Altogether, this positions the extraction algorithms of the non-contact methods as a leading candidate to replace the traditional methods used today.

72. A Benchmark for Studying Diabetic Retinopathy: Segmentation, Grading, and Transferability [PDF] 返回目录
  Yi Zhou, Boyang Wang, Shanshan Cui, Ling Shao
Abstract: People with diabetes are at risk of developing an eye disease called diabetic retinopathy (DR). This disease occurs when high blood glucose levels cause damage to blood vessels in the retina. Computer-aided DR diagnosis is a promising tool for early detection of DR and severity grading, due to the great success of deep learning. However, most current DR diagnosis systems do not achieve satisfactory performance or interpretability for ophthalmologists, due to the lack of training data with consistent and fine-grained annotations. To address this problem, we construct a large-scale fine-grained annotated DR dataset containing 2,842 images (FGADR). This dataset has 1,842 images with pixel-level DR-related lesion annotations, and 1,000 images with image-level labels graded by six board-certified ophthalmologists with intra-rater consistency. The proposed dataset will enable extensive studies on DR diagnosis. We set up three benchmark tasks for evaluation: 1. DR lesion segmentation; 2. DR grading by joint classification and segmentation; 3. Transfer learning for ocular multi-disease identification. Moreover, a novel inductive transfer learning method is introduced for the third task. Extensive experiments using different state-of-the-art methods are conducted on our FGADR dataset, which can serve as baselines for future research.

73. Unsupervised Hyperspectral Mixed Noise Removal Via Spatial-Spectral Constrained Deep Image Prior [PDF] 返回目录
  Yi-Si Luo, Xi-Le Zhao, Tai-Xiang Jiang, Yu-Bang Zheng, Yi Chang
Abstract: Hyperspectral images (HSIs) are unavoidably corrupted by mixed noise, which hinders subsequent applications. Traditional methods exploit the structure of the HSI via optimization-based models for denoising, but their capacity is inferior to that of convolutional neural network (CNN)-based methods, which learn the noisy-to-denoised mapping from large amounts of data in a supervised manner. However, as clean-noisy pairs of hyperspectral data are often unavailable in many applications, it is desirable to build an unsupervised HSI denoising method with high model capability. To remove the mixed noise in HSIs, we suggest the spatial-spectral constrained deep image prior (S2DIP), which capitalizes on the high model representation ability brought by the CNN in an unsupervised manner and does not need any extra training data. Specifically, we employ separable 3D convolution blocks to faithfully encode the HSI in the framework of DIP, and a spatial-spectral total variation (SSTV) term is tailored to exploit the spatial-spectral smoothness of HSIs. Moreover, our method favorably addresses the semi-convergence behavior of prevailing unsupervised methods, e.g., DIP 2D and DIP 3D. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art optimization-based HSI denoising methods in terms of effectiveness and robustness.
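The sketch below shows one way such an SSTV regularizer could look in PyTorch: anisotropic total variation along the two spatial axes plus the spectral axis of the HSI tensor. The exact differencing and weighting in the paper may differ; this is a minimal stand-in.

import torch

def sstv_loss(hsi):
    # hsi: (bands, H, W). Mean absolute first differences along rows,
    # columns, and the spectral axis encourage spatial-spectral smoothness.
    dh = (hsi[:, 1:, :] - hsi[:, :-1, :]).abs().mean()
    dw = (hsi[:, :, 1:] - hsi[:, :, :-1]).abs().mean()
    ds = (hsi[1:, :, :] - hsi[:-1, :, :]).abs().mean()
    return dh + dw + ds

x = torch.rand(31, 64, 64, requires_grad=True)  # toy 31-band HSI
loss = sstv_loss(x)
loss.backward()  # usable as a regularizer alongside the DIP fitting loss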

74. Multidomain Multimodal Fusion For Human Action Recognition Using Inertial Sensors [PDF] 返回目录
  Zeeshan Ahmad, Naimul Khan
Abstract: One of the major reasons for misclassification of multiplex actions during action recognition is the unavailability of complementary features that provide semantic information about the actions. These features are present in different domains at different scales and intensities. In the existing literature, features are extracted independently in different domains, but the benefits of fusing these multidomain features are not realized. To address this challenge and extract a complete set of complementary information, in this paper we propose a novel multidomain multimodal fusion framework that extracts complementary and distinct features from different domains of the input modality. We transform input inertial data into signal images, and then make the input modality multidomain and multimodal by transforming the spatial-domain information into the frequency and time-spectrum domains using the Discrete Fourier Transform (DFT) and the Gabor wavelet transform (GWT), respectively. Features in different domains are extracted by Convolutional Neural Networks (CNNs) and then fused by Canonical Correlation based Fusion (CCF) to improve the accuracy of human action recognition. Experimental results on three inertial datasets show the superiority of the proposed method compared to the state-of-the-art.
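The transformation step can be sketched as follows. The frequency view uses NumPy's 2D DFT directly; since a Gabor wavelet transform needs an extra dependency, a windowed DFT over columns stands in for the time-spectrum view here, purely as an assumption for illustration.

import numpy as np

def to_multidomain(signal_image, win=16):
    # Frequency-domain view: 2D DFT magnitude of the signal image.
    freq = np.abs(np.fft.fft2(signal_image))
    # Crude time-spectrum view: non-overlapping windowed DFT over columns.
    spec = np.stack([np.abs(np.fft.fft(signal_image[:, i:i + win], axis=1))
                     for i in range(0, signal_image.shape[1] - win + 1, win)])
    return freq, spec

img = np.random.rand(64, 64)   # toy signal image built from inertial data
freq, spec = to_multidomain(img)
print(freq.shape, spec.shape)  # (64, 64) (4, 64, 16)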

75. Towards Improved Human Action Recognition Using Convolutional Neural Networks and Multimodal Fusion of Depth and Inertial Sensor Data [PDF] 返回目录
  Zeeshan Ahmad, Naimul Khan
Abstract: This paper attempts to improve the accuracy of Human Action Recognition (HAR) by fusing depth and inertial sensor data. First, we transform the depth data into Sequential Front view Images (SFI) and fine-tune the pre-trained AlexNet on these images. Then, inertial data is converted into Signal Images (SI) and another convolutional neural network (CNN) is trained on these images. Finally, learned features are extracted from both CNNs and fused together to make a shared feature layer, and these features are fed to the classifier. We experiment with two classifiers, namely Support Vector Machines (SVM) and a softmax classifier, and compare their performance. The recognition accuracies of each modality, depth data alone and sensor data alone, are also calculated and compared with the fusion-based accuracies to highlight the fact that fusing modalities yields better results than individual modalities. Experimental results on the UTD-MHAD and Kinect 2D datasets show that the proposed method achieves state-of-the-art results compared to other recently proposed visual-inertial action recognition methods.

76. A Efficient Multimodal Framework for Large Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals [PDF] 返回目录
  Guanghao Yin, Shouqian Sun, Dian Yu, Kejun Zhang
Abstract: Considerable attention has been paid to physiological-signal-based emotion recognition in the field of affective computing. Thanks to its reliability and user-friendly acquisition, Electrodermal Activity (EDA) has great advantages in practical applications. However, EDA-based emotion recognition with hundreds of subjects still lacks an effective solution. In this paper, our work attempts to fuse the subjects' individual EDA features with externally evoked music features. We propose an end-to-end multimodal framework, the 1-dimensional residual temporal and channel attention network (RTCAN-1D). For EDA features, the novel convex-optimization-based EDA (CvxEDA) method is applied to decompose EDA signals into phasic and tonic signals, mining both dynamic and steady features. A channel-temporal attention mechanism for EDA-based emotion recognition is introduced for the first time to improve the temporal- and channel-wise representations. For music features, we process the music signal with the open-source toolkit openSMILE to obtain external feature vectors. The individual emotion features from EDA signals and the external emotion benchmarks from music are fused in the classifying layers. We have conducted systematic comparisons on three multimodal datasets (PMEmo, DEAP, AMIGOS) for two-class valence/arousal emotion recognition. Our proposed RTCAN-1D outperforms existing state-of-the-art models, which validates that our work provides a reliable and efficient solution for large-scale emotion recognition. Our code has been released at this https URL.

77. PNEN: Pyramid Non-Local Enhanced Networks [PDF] 返回目录
  Feida Zhu, Chaowei Fang, Kai-Kuang Ma
Abstract: Existing neural networks proposed for low-level image processing tasks are usually implemented by stacking convolution layers with limited kernel size. Every convolution layer merely involves context information from a small local neighborhood. More contextual features can be explored as more convolution layers are adopted. However, it is difficult and costly to take full advantage of long-range dependencies. We propose a novel non-local module, the Pyramid Non-local Block, to build up connections between every pixel and all remaining pixels. The proposed module is capable of efficiently exploiting pairwise dependencies between different scales of low-level structures. This is achieved by first learning a query feature map at full resolution and a pyramid of reference feature maps at downscaled resolutions. Then correlations with multi-scale reference features are exploited to enhance pixel-level feature representation. The calculation procedure is economical in terms of memory consumption and computational cost. Based on the proposed module, we devise a Pyramid Non-local Enhanced Network for edge-preserving image smoothing which achieves state-of-the-art performance in imitating three classical image smoothing algorithms. Additionally, the pyramid non-local block can be directly incorporated into convolutional neural networks for other image restoration tasks. We integrate it into two existing methods for image denoising and single image super-resolution, achieving consistently improved performance.
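One pyramid level of such a block can be sketched as scaled dot-product attention in which every full-resolution query pixel attends over a downscaled reference map, which is what keeps memory and compute affordable. Projection convolutions are omitted, and the exact normalization in the paper may differ.

import torch
import torch.nn.functional as F

def nonlocal_attend(query, reference):
    # query: (B, C, H, W); reference: (B, C, h, w) with h*w << H*W.
    B, C, H, W = query.shape
    q = query.flatten(2).transpose(1, 2)            # (B, H*W, C)
    k = reference.flatten(2)                        # (B, C, h*w)
    v = reference.flatten(2).transpose(1, 2)        # (B, h*w, C)
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)  # (B, H*W, h*w)
    out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
    return query + out                              # residual connection

x = torch.randn(1, 32, 64, 64)
ref = F.avg_pool2d(x, kernel_size=4)   # a 16x16 pyramid level
y = nonlocal_attend(x, ref)            # same shape as x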

78. ScribbleBox: Interactive Annotation Framework for Video Object Segmentation [PDF] 返回目录
  Bowen Chen, Huan Ling, Xiaohui Zeng, Gao Jun, Ziyue Xu, Sanja Fidler
Abstract: Manually labeling video datasets for segmentation tasks is extremely time consuming. In this paper, we introduce ScribbleBox, a novel interactive framework for annotating object instances with masks in videos. In particular, we split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory using a parametric curve with a small number of control points which the annotator can interactively correct. Our approach tolerates a modest amount of noise in the box placements, thus typically only a few clicks are needed to annotate tracked boxes to a sufficient accuracy. Segmentation masks are corrected via scribbles which are efficiently propagated through time. We show significant performance gains in annotation efficiency over past work. We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with 9.14 clicks per box track, and 4 frames of scribble annotation.
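The parametric-curve idea for box tracks can be illustrated with a smoothing spline: a full per-frame trajectory is summarized by a curve an annotator could adjust through a handful of control points. The spline choice and smoothing factor below are assumptions standing in for ScribbleBox's specific curve.

import numpy as np
from scipy.interpolate import splev, splprep

# Hypothetical box-center trajectory over 30 frames.
t = np.linspace(0.0, 1.0, 30)
xs = 100 + 200 * t + 5 * np.sin(12 * t)
ys = 50 + 80 * t

# Fit a smoothing spline; larger s means fewer effective control points.
tck, u = splprep([xs, ys], s=25.0)
fit_x, fit_y = splev(u, tck)
print(np.abs(fit_x - xs).max())  # small residual despite the sparse curve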

79. Detection perceptual underwater image enhancement with deep learning and physical priors [PDF] 返回目录
  Long Chen, Zheheng Jiang, Lei Tong, Zhihua Liu, Aite Zhao, Qianni Zhang, Junyu Dong, Huiyu Zhou
Abstract: Underwater image enhancement, as a pre-processing step to improve the accuracy of the subsequent object detection task, has drawn considerable attention in the field of underwater navigation and ocean exploration. However, most existing underwater image enhancement strategies tend to consider enhancement and detection as two independent modules with no interaction, and the practice of separate optimization does not always help the underwater object detection task. In this paper, we propose two perceptual enhancement models, each of which uses a deep enhancement model with a detection perceptor. The detection perceptor provides coherent information in the form of gradients to the enhancement model, guiding the enhancement model to generate patch-level visually pleasing or detection-favourable images. In addition, due to the lack of training data, a hybrid underwater image synthesis model, which fuses physical priors and data-driven cues, is proposed to synthesize training data and generalise our enhancement model to real-world underwater images. Experimental results show the superiority of our proposed method over several state-of-the-art methods on both real-world and synthetic underwater datasets.

80. Many-shot from Low-shot: Learning to Annotate using Mixed Supervision for Object Detection [PDF] 返回目录
  Carlo Biffi, Steven McDonagh, Philip Torr, Ales Leonardis, Sarah Parisot
Abstract: Annotating large-scale object detection datasets is highly time consuming and expensive, which motivates the development of weakly supervised and few-shot object detection methods. However, these methods largely underperform with respect to their strongly supervised counterparts, as weak training signals often result in partial or oversized detections. Towards solving this problem we introduce, for the first time, an online annotation module (OAM) that learns to generate a many-shot set of reliable annotations from a larger volume of weakly labelled images. Our OAM can be jointly trained with any fully supervised two-stage object detection method, providing additional training annotations on the fly. This results in a fully end-to-end strategy that only requires a low-shot set of fully annotated images. The integration of the OAM with Fast(er) R-CNN improves their performance by 17% mAP and 9% AP50 on the PASCAL VOC 2007 and MS-COCO benchmarks, and significantly outperforms competing methods using mixed supervision.

81. Toward Quantifying Ambiguities in Artistic Images [PDF] 返回目录
  Xi Wang, Zoya Bylinskii, Aaron Hertzmann, Robert Pepperell
Abstract: It has long been hypothesized that perceptual ambiguities play an important role in aesthetic experience: a work with some ambiguity engages a viewer more than one that does not. However, current frameworks for testing this theory are limited by the availability of stimuli and data collection methods. This paper presents an approach to measuring the perceptual ambiguity of a collection of images. Crowdworkers are asked to describe image content after different viewing durations. Experiments are performed on images created with Generative Adversarial Networks via the Artbreeder website. We show that text processing of viewer responses can provide a fine-grained way to measure and describe image ambiguities.

82. Towards Autonomous Driving: a Multi-Modal 360$^{\circ}$ Perception Proposal [PDF] 返回目录
  Jorge Beltrán, Carlos Guindel, Irene Cortés, Alejandro Barrera, Armando Astudillo, Jesús Urdiales, Mario Álvarez, Farid Bekka, Vicente Milanés, Fernando García
Abstract: In this paper, a multi-modal 360$^{\circ}$ framework for 3D object detection and tracking for autonomous vehicles is presented. The process is divided into four main stages. First, images are fed into a CNN to obtain instance segmentation of the surrounding road participants. Second, LiDAR-to-image association is performed for the estimated mask proposals. Then, the isolated points of every object are processed by a PointNet ensemble to compute their corresponding 3D bounding boxes and poses. Lastly, a tracking stage based on the Unscented Kalman Filter is used to track the agents over time. The solution, based on a novel sensor fusion configuration, provides accurate and reliable road environment detection. A wide variety of tests of the system, deployed in an autonomous vehicle, have successfully assessed the suitability of the proposed perception stack in a real autonomous driving application.
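The LiDAR-to-image association step relies on the standard pinhole projection, sketched below: points are moved into the camera frame with the extrinsic transform and projected with the intrinsics, after which each pixel can be tested against the instance masks. The matrices and names here are illustrative placeholders.

import numpy as np

def project_lidar_to_image(points, K, T_cam_lidar):
    # points: (N, 3) in the LiDAR frame; K: 3x3 intrinsics;
    # T_cam_lidar: 4x4 extrinsic transform from calibration.
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]   # camera frame
    cam = cam[cam[:, 2] > 0]                 # keep points in front of camera
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]            # pixel coordinates (u, v)

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])
T = np.eye(4)  # placeholder extrinsics
pts = np.random.uniform([-5, -2, 2], [5, 2, 30], size=(100, 3))
uv = project_lidar_to_image(pts, K, T)
# Points whose (u, v) fall inside a mask proposal are assigned to it.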

83. Training Sparse Neural Networks using Compressed Sensing [PDF] 返回目录
  Jonathan W. Siegel, Jianhong Chen, Jinchao Xu
Abstract: Pruning the weights of neural networks is an effective and widely-used technique for reducing model size and inference complexity. We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step. Specifically, we utilize an adaptively weighted $\ell^1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks. The adaptive weighting we introduce corresponds to a novel regularizer based on the logarithm of the absolute value of the weights. Numerical experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate that our method 1) trains sparser, more accurate networks than existing state-of-the-art methods; 2) can also be used effectively to obtain structured sparsity; 3) can be used to train sparse networks from scratch, i.e. from a random initialization, as opposed to initializing with a well-trained base model; 4) acts as an effective regularizer, improving generalization accuracy.
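A sketch of the regularizer described in the abstract, a log-of-absolute-value penalty, is given below; its gradient grows as weights shrink, which is what drives them to exact zero. The eps floor and the penalty coefficient are implementation assumptions, and the paper's RDA-based optimizer is replaced by plain backprop for brevity.

import torch

def log_abs_penalty(weights, eps=1e-8):
    # Regularizer based on the logarithm of the absolute value of the
    # weights; eps avoids log(0).
    return torch.log(weights.abs() + eps).sum()

w = torch.randn(1000, requires_grad=True)
data_loss = (w ** 2).mean()                    # stand-in for the task loss
loss = data_loss + 1e-4 * log_abs_penalty(w)   # sparsity-inducing term
loss.backward()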

84. DeepLandscape: Adversarial Modeling of Landscape Video [PDF] 返回目录
  Elizaveta Logacheva, Roman Suvorov, Oleg Khomenko, Anton Mashikhin, Victor Lempitsky
Abstract: We build a new model of landscape videos that can be trained on a mixture of static landscape images as well as landscape animations. Our architecture extends the StyleGAN model by augmenting it with parts that allow modeling dynamic changes in a scene. Once trained, our model can be used to generate realistic time-lapse landscape videos with moving objects and time-of-day changes. Furthermore, by fitting the learned model to a static landscape image, the latter can be reenacted in a realistic way. We propose simple but necessary modifications to the StyleGAN inversion procedure, which lead to in-domain latent codes and allow manipulating real images. Quantitative comparisons and user studies suggest that our model produces more compelling animations of given photographs than previously proposed methods. The results of our approach, including comparisons with prior art, can be seen in the supplementary materials and on the project page this https URL

85. Semantic Segmentation and Data Fusion of Microsoft Bing 3D Cities and Small UAV-based Photogrammetric Data [PDF] 返回目录
  Meida Chen, Andrew Feng, Kyle McCullough, Pratusha Bhuvana Prasad, Ryan McAlinden, Lucio Soibelman
Abstract: With state-of-the-art sensing and photogrammetric techniques, the Microsoft Bing Maps team has created over 125 highly detailed 3D cities from 11 different countries, covering hundreds of thousands of square kilometers. The 3D city models were created using photogrammetric techniques with high-resolution images captured from aircraft-mounted cameras. Such a large 3D city database has caught the attention of the US Army for creating virtual simulation environments to support military operations. However, the 3D city models do not have semantic information such as buildings, vegetation, and ground, and cannot allow sophisticated user-level and system-level interaction. At I/ITSEC 2019, the authors presented a fully automated data segmentation and object information extraction framework for creating simulation terrain using UAV-based photogrammetric data. This paper discusses the next steps in extending our data segmentation framework to segmenting 3D city data. In this study, the authors first investigated the strengths and limitations of the existing framework when applied to the Bing data. The main differences between UAV-based and aircraft-based photogrammetric data are highlighted. The data quality issues in the aircraft-based photogrammetric data, which can negatively affect segmentation performance, are identified. Based on the findings, a workflow was designed specifically for segmenting Bing data while considering its characteristics. In addition, since the ultimate goal is to combine the use of both small unmanned aerial vehicle (UAV) collected data and the Bing data in a virtual simulation environment, data from these two sources needed to be aligned and registered together. To this end, the authors also proposed a data registration workflow that utilizes the traditional iterative closest point (ICP) algorithm with the extracted semantic information.
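A minimal sketch of the registration step with classical point-to-point ICP in Open3D follows; the file names, the distance threshold, and the implied semantic pre-filtering (e.g., registering only points labeled as ground on both sides) are placeholders and assumptions.

import numpy as np
import open3d as o3d

source = o3d.io.read_point_cloud("uav_cloud.ply")   # placeholder path
target = o3d.io.read_point_cloud("bing_cloud.ply")  # placeholder path

result = o3d.pipelines.registration.registration_icp(
    source, target,
    max_correspondence_distance=1.0,   # metres; an assumption
    init=np.eye(4),
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
print(result.transformation)  # 4x4 alignment from source to target frame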

86. Generating synthetic photogrammetric data for training deep learning based 3D point cloud segmentation models [PDF] 返回目录
  Meida Chen, Andrew Feng, Kyle McCullough, Pratusha Bhuvana Prasad, Ryan McAlinden, Lucio Soibelman
Abstract: At I/ITSEC 2019, the authors presented a fully-automated workflow to segment 3D photogrammetric point-clouds/meshes and extract object information, including individual tree locations and ground materials (Chen et al., 2019). The ultimate goal is to create realistic virtual environments and provide the necessary information for simulation. We tested the generalizability of the previously proposed framework using a database created under the U.S. Army's One World Terrain (OWT) project with a variety of landscapes (i.e., various building styles, types of vegetation, and urban density) and different data qualities (i.e., flight altitudes and overlap between images). Although the database is considerably larger than existing databases, it remains unknown whether deep-learning algorithms have truly achieved their full potential in terms of accuracy, as sizable data sets for training and validation are currently lacking. Obtaining large annotated 3D point-cloud databases is time-consuming and labor-intensive, not only from a data annotation perspective, in which the data must be manually labeled by well-trained personnel, but also from a raw data collection and processing perspective. Furthermore, it is generally difficult for segmentation models to differentiate objects, such as buildings and tree masses, and these types of scenarios do not always exist in the collected data set. Thus, the objective of this study is to investigate using synthetic photogrammetric data to substitute for real-world data in training deep-learning algorithms. We have investigated methods for generating synthetic UAV-based photogrammetric data to provide a sufficiently sized database for training a deep-learning algorithm, with the ability to enlarge the data size for scenarios in which deep-learning models have difficulties.

87. Blending of Learning-based Tracking and Object Detection for Monocular Camera-based Target Following [PDF] 返回目录
  Pranoy Panda, Martin Barczyk
Abstract: Deep learning has recently started being applied to visual tracking of generic objects in video streams. For the purposes of robotics applications, it is very important for a target tracker to recover its track if it is lost due to heavy or prolonged occlusions or motion blur of the target. We present a real-time approach which fuses a generic target tracker and object detection module with a target re-identification module. Our work focuses on improving the performance of Convolutional Recurrent Neural Network-based object trackers in cases where the object of interest belongs to the category of familiar objects. Our proposed approach is sufficiently lightweight to track objects at 85-90 FPS while attaining competitive results on challenging benchmarks.

88. Orderly Disorder in Point Cloud Domain [PDF] 返回目录
  Morteza Ghahremani, Bernard Tiddeman, Yonghuai Liu, Ardhendu Behera
Abstract: In the real world, out-of-distribution samples, noise and distortions exist in test data. Existing deep networks developed for point cloud data analysis are prone to overfitting, and a partial change in test data leads to unpredictable behaviour of the networks. In this paper, we propose a smart yet simple deep network for the analysis of 3D models using the 'orderly disorder' theory. Orderly disorder is a way of describing the complex structure of disorders within complex systems. Our method extracts the deep patterns inside a 3D object by creating a dynamic link that seeks the most stable patterns and at once throws away the unstable ones. Patterns are more robust to changes in data distribution, especially those that appear in the top layers. Features are extracted via an innovative cloning decomposition technique and then linked to each other to form stable complex patterns. Our model alleviates the vanishing-gradient problem, strengthens dynamic link propagation and substantially reduces the number of parameters. Extensive experiments on challenging benchmark datasets verify the superiority of our light network on segmentation and classification tasks, especially in the presence of noise, where our network's performance drops by less than 10% while state-of-the-art networks fail to work.

89. Audio-Visual Waypoints for Navigation [PDF] 返回目录
  Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, Kristen Grauman
Abstract: In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements 1) audio-visual waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on the challenging Replica environments of real-world 3D scenes. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation.

90. Accurate Alignment Inspection System for Low-resolution Automotive and Mobility LiDAR [PDF] 返回目录
  Seontake Oh, Ji-Hwan You, Azim Eskandarian, Young-Keun Kim
Abstract: A misalignment of LiDAR as small as a few degrees could cause a significant error in obstacle detection and mapping, which could lead to safety and quality issues. In this paper, an accurate inspection system is proposed for estimating the LiDAR alignment error after sensor attachment on a mobility system such as a vehicle or robot. The proposed method uses only a single target board at a fixed position to estimate the three orientations (roll, tilt, and yaw) and the horizontal position of the LiDAR attachment with sub-degree and millimeter-level accuracy. After the proposed preprocessing steps, the feature beam points that are closest to each target corner are extracted and used to calculate the sensor attachment pose with respect to the target board frame using a nonlinear optimization method with a low computational cost. The performance of the proposed method is evaluated using a test bench that can control the reference yaw and horizontal translation of the LiDAR within ranges of 3 degrees and 30 millimeters, respectively. The experimental results for a low-resolution 16-channel LiDAR (Velodyne VLP-16) confirmed that misalignment could be estimated with an accuracy within 0.2 degrees and 4 mm. The high accuracy and simplicity of the proposed system make it practical for large-scale industrial applications, such as automobile or robot manufacturing processes that inspect the sensor attachment for safety quality control.
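The nonlinear optimization step can be pictured with scipy's least_squares: given the known target-board corner geometry and the measured feature points, solve for the three orientations and the horizontal offsets. The Euler parameterization and the residual definition are assumptions for illustration.

import numpy as np
from scipy.optimize import least_squares

def rot(roll, tilt, yaw):
    cr, sr = np.cos(roll), np.sin(roll)
    ct, st = np.cos(tilt), np.sin(tilt)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[ct, 0, st], [0, 1, 0], [-st, 0, ct]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def residuals(params, measured, corners):
    roll, tilt, yaw, tx, ty = params   # orientations (rad), horizontal offsets (m)
    pred = (rot(roll, tilt, yaw) @ corners.T).T + np.array([tx, ty, 0.0])
    return (pred - measured).ravel()

# Known board corners and simulated measurements with a small pose error.
corners = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], float)
measured = (rot(0.01, -0.02, 0.03) @ corners.T).T + np.array([0.004, -0.002, 0.0])

sol = least_squares(residuals, x0=np.zeros(5), args=(measured, corners))
print(sol.x)  # recovered (roll, tilt, yaw, tx, ty)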

91. Automatic LiDAR Extrinsic Calibration System using Photodetector and Planar Board for Large-scale Applications [PDF] 返回目录
  Ji-Hwan You, Seon Taek Oh, Jae-Eun Park, Azim Eskandarian, Young-Keun Kim
Abstract: This paper presents a novel automatic calibration system to estimate the extrinsic parameters of a LiDAR mounted on a mobile platform for sensor misalignment inspection in the large-scale production of highly automated vehicles. To obtain sub-degree and sub-centimeter accuracy levels of extrinsic calibration, this study proposes a new concept of a target board with embedded photodetector arrays, named the PD-target system, to find the precise positions of the corresponding laser beams on the target surface. Furthermore, the proposed system requires only a simple target board at a fixed pose in close range to be readily applicable in the automobile manufacturing environment. The experimental evaluation of the proposed system on a low-resolution LiDAR showed that the LiDAR offset pose can be estimated within 0.1 degree and 3 mm levels of precision. The high accuracy and simplicity of the proposed calibration system make it practical for large-scale applications for the reliability and safety of autonomous systems.

92. Generate High Resolution Images With Generative Variational Autoencoder [PDF] 返回目录
  Abhinav Sagar
Abstract: In this work, we present a novel neural network to generate high resolution images. We replace the decoder of a VAE with a discriminator while keeping the encoder as it is. The encoder is fed data from a normal distribution while the generator samples from a Gaussian distribution. The combination of both is given to a discriminator which tells whether the generated images are correct or not. We evaluate our network on 3 different datasets: MNIST, LSUN and the CelebA-HQ dataset. Our network beats the previous state of the art using MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence as the evaluation metrics, while generating much sharper images. This work is potentially very exciting as we are able to combine the advantages of generative models and inference models in a principled Bayesian manner.

93. Fidelity-Controllable Extreme Image Compression with Generative Adversarial Networks [PDF] 返回目录
  Shoma Iwai, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi
Abstract: We propose a GAN-based image compression method working at extremely low bitrates below 0.1 bpp. Most existing learned image compression methods suffer from blur at extremely low bitrates. Although a GAN can help to reconstruct sharp images, there are two drawbacks. First, GANs make training unstable. Second, the reconstructions often contain unpleasant noise or artifacts. To address both drawbacks, our method adopts two-stage training and network interpolation. The two-stage training is effective in stabilizing training. Moreover, the network interpolation utilizes the models from both stages and reduces undesirable noise and artifacts while maintaining important edges. Hence, we can control the trade-off between perceptual quality and fidelity without re-training models. The experimental results show that our model can reconstruct high quality images. Furthermore, our user study confirms that our reconstructions are preferable to those of a state-of-the-art GAN-based image compression model. The code will be made available.
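Network interpolation itself is simple enough to sketch directly: the weights of the stage-one (fidelity-oriented) and stage-two (GAN-finetuned) models are blended parameter by parameter, so a single alpha knob trades fidelity against perceptual quality without retraining. The alpha value below is illustrative.

import torch

def interpolate_networks(state_a, state_b, alpha=0.8):
    # Per-parameter convex combination of two state dicts with identical keys.
    return {k: alpha * state_a[k] + (1 - alpha) * state_b[k] for k in state_a}

# Toy usage with two identically shaped models.
net_fidelity = torch.nn.Linear(4, 4)   # stands in for the stage-one model
net_gan = torch.nn.Linear(4, 4)        # stands in for the GAN-finetuned model
blended = interpolate_networks(net_fidelity.state_dict(), net_gan.state_dict())
net_fidelity.load_state_dict(blended)  # now runs with interpolated weights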

94. Bosch Deep Learning Hardware Benchmark [PDF] 返回目录
  Armin Runge, Thomas Wenzel, Dimitrios Bariamis, Benedikt Sebastian Staffler, Lucas Rego Drumond, Michael Pfeiffer
Abstract: The widespread use of Deep Learning (DL) applications in science and industry has created a large demand for efficient inference systems. This has resulted in a rapid increase of available Hardware Accelerators (HWAs), making comparisons challenging and laborious. To address this, several DL hardware benchmarks have been proposed, aiming at a comprehensive comparison across many models, tasks, and hardware platforms. Here, we present our DL hardware benchmark, which has been specifically developed for inference on embedded HWAs and tasks required for autonomous driving. In addition to previous benchmarks, we propose a new granularity level to evaluate common submodules of DL models, a twofold benchmark procedure that accounts for hardware and model optimizations done by HWA manufacturers, and an extended set of performance indicators that can help to identify a mismatch between a HWA and the DL models used in our benchmark.

95. Multi-view Graph Learning by Joint Modeling of Consistency and Inconsistency [PDF] 返回目录
  Youwei Liang, Dong Huang, Chang-Dong Wang, Philip S. Yu
Abstract: Graph learning has emerged as a promising technique for multi-view clustering with its ability to learn a unified and robust graph from multiple views. However, existing graph learning methods mostly focus on the multi-view consistency issue, yet often neglect the inconsistency across multiple views, which makes them vulnerable to possibly low-quality or noisy datasets. To overcome this limitation, we propose a new multi-view graph learning framework, which for the first time simultaneously and explicitly models multi-view consistency and multi-view inconsistency in a unified objective function, through which the consistent and inconsistent parts of each single-view graph as well as the unified graph that fuses the consistent parts can be iteratively learned. Though optimizing the objective function is NP-hard, we design a highly efficient optimization algorithm which is able to obtain an approximate solution with linear time complexity in the number of edges in the unified graph. Furthermore, our multi-view graph learning approach can be applied to both similarity graphs and dissimilarity graphs, which lead to two graph fusion-based variants in our framework. Experiments on twelve multi-view datasets have demonstrated the robustness and efficiency of the proposed approach.

96. Hierarchical Adaptive Lasso: Learning Sparse Neural Networks with Shrinkage via Single Stage Training [PDF] 返回目录
  Skyler Seto, Martin T. Wells, Wenyu Zhang
Abstract: Deep neural networks achieve state-of-the-art performance in a variety of tasks; however, this performance is closely tied to model size. Sparsity is one approach to limiting model size. Modern techniques for inducing sparsity in neural networks are (1) network pruning, a procedure involving iteratively training a model initialized with a previous run's weights and hard thresholding, (2) training in one stage with a sparsity-inducing penalty (usually based on the Lasso), and (3) training a binary mask jointly with the weights of the network. In this work, we study different sparsity-inducing penalties from the perspective of Bayesian hierarchical models, with the goal of designing penalties which perform well without retraining subnetworks in isolation. With this motivation, we present a novel penalty called Hierarchical Adaptive Lasso (HALO) which learns to adaptively sparsify the weights of a given network via trainable parameters, without learning a mask. When used to train over-parametrized networks, our penalty yields small subnetworks with high accuracy (winning tickets) even when the subnetworks are not trained in isolation. Empirically, on the CIFAR-100 dataset, we find that HALO is able to learn highly sparse networks (only $5\%$ of the parameters) with approximately a $2\%$ and $4\%$ gain in performance over state-of-the-art magnitude pruning methods at the same level of sparsity.

97. Unsupervised Domain Adaptation via Discriminative Manifold Propagation [PDF] 返回目录
  You-Wei Luo, Chuan-Xian Ren, Dao-Qing Dai, Hong Yan
Abstract: Unsupervised domain adaptation is effective in leveraging rich information from a labeled source domain to an unlabeled target domain. Though deep learning and adversarial strategies have made significant breakthroughs in the adaptability of features, there are two issues to be further studied. First, hard-assigned pseudo labels on the target domain are arbitrary and error-prone, and their direct application may destroy the intrinsic data structure. Second, batch-wise training of deep learning limits the characterization of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability simultaneously. For the first issue, this framework establishes a probabilistic discriminant criterion on the target domain via soft labels. Based on pre-built prototypes, this criterion is extended to a global approximation scheme for the second issue. Manifold metric alignment is adopted to be compatible with the embedding space. The theoretical error bounds of different alignment metrics are derived for constructive guidance. The proposed method can be used to tackle a series of variants of domain adaptation problems, including both vanilla and partial settings. Extensive experiments have been conducted to investigate the method, and a comparative study shows the superiority of the discriminative manifold learning framework.

98. One Weight Bitwidth to Rule Them All [PDF] 返回目录
  Ting-Wu Chin, Pierce I-Jen Chuang, Vikas Chandra, Diana Marculescu
Abstract: Weight quantization for deep ConvNets has shown promising results for applications such as image classification and semantic segmentation and is especially important for applications where memory storage is limited. However, when aiming for quantization without accuracy degradation, different tasks may end up with different bitwidths. This creates complexity for software and hardware support and the complexity accumulates when one considers mixed-precision quantization, in which case each layer's weights use a different bitwidth. Our key insight is that optimizing for the least bitwidth subject to no accuracy degradation is not necessarily an optimal strategy. This is because one cannot decide optimality between two bitwidths if one has a smaller model size while the other has better accuracy. In this work, we take the first step to understand if some weight bitwidth is better than others by aligning all to the same model size using a width-multiplier. Under this setting, somewhat surprisingly, we show that using a single bitwidth for the whole network can achieve better accuracy compared to mixed-precision quantization targeting zero accuracy degradation when both have the same model size. In particular, our results suggest that when the number of channels becomes a target hyperparameter, a single weight bitwidth throughout the network shows superior results for model compression.
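The size-matching setup can be made concrete with a little arithmetic: model size in bits is parameter count times weight bitwidth, and for convolutional layers the parameter count grows roughly quadratically with the width multiplier. The numbers below are a toy example, not from the paper.

# Match every bitwidth choice to the size of a 4-bit baseline.
params_base = 1_000_000               # parameters of the base network
budget_bits = params_base * 4         # fixed model-size budget in bits
for bits in (8, 4, 2, 1):
    params_allowed = budget_bits / bits
    width_mult = (params_allowed / params_base) ** 0.5  # ~quadratic scaling
    print(f"{bits}-bit weights -> width multiplier ~ {width_mult:.2f}")
# Lower bitwidths buy wider networks at the same model size, which is the
# trade-off the paper evaluates with a single bitwidth across the network.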

99. LT4REC:A Lottery Ticket Hypothesis Based Multi-task Practice for Video Recommendation System [PDF] 返回目录
  Xuanji Xiao, Huabin Chen, Yuzhen Liu, Xing Yao, Pei Liu, Chaosheng Fan, Nian Ji, Xirong Jiang
Abstract: Click-through rate prediction (CTR) and post-click conversion rate prediction (CVR) play key roles across all industrial ranking systems, such as recommendation systems, online advertising, and search engines. Unlike the extensively researched CTR, there is much less research on CVR estimation, whose main challenge is extreme data sparsity, with one to two orders of magnitude fewer samples than CTR. People try to solve this problem with the paradigm of multi-task learning using the abundant CTR samples, but the typical hard-sharing method cannot effectively solve it, because it is difficult to analyze which parts of the network components can be shared and which parts are in conflict; i.e., artificially designed neuron sharing is highly inaccurate. In this paper, we model CVR in a brand-new way by adopting lottery-ticket-hypothesis-based sparse sharing multi-task learning, which can automatically and flexibly learn which neuron weights to share, without relying on manual experience. Experiments on a dataset gathered from traffic logs of Tencent video's recommendation system demonstrate that sparse sharing in the CVR model significantly outperforms competitive methods. Due to the weight sparsity inherent in sparse sharing, it can also significantly reduce computational complexity and memory usage, which are very important in industrial recommendation systems.

100. Self-Competitive Neural Networks [PDF]
  Iman Saberi, Fathiyeh Faghih
Abstract: Deep Neural Networks (DNNs) have improved the accuracy of classification in many applications. One of the challenges in training a DNN is its need to be fed an enriched dataset to increase its accuracy and avoid overfitting. One way to improve the generalization of DNNs is to augment the training data with newly synthesized adversarial samples. Recently, researchers have worked extensively on methods for data augmentation. In this paper, we generate adversarial samples to refine the domains of attraction (DoAs) of each class. In this approach, at each stage we use the model learned from the primary data and the adversarial data generated up to that stage to manipulate the primary data in a way that looks difficult to the DNN. The DNN is then retrained on the augmented data, and again generates adversarial data that are hard for it to predict. As the DNN tries to improve its accuracy by competing with itself (generating hard samples and then learning them), we call the technique a Self-Competitive Neural Network (SCNN). To generate such samples, we pose the problem as an optimization task in which the network weights are fixed, and use a gradient-descent-based method to synthesize adversarial samples that lie on the boundary between their true labels and the nearest wrong labels. Our experimental results show that data augmentation using SCNNs can significantly increase the accuracy of the original network; for example, it improves the accuracy of a CNN trained on a limited set of 1,000 MNIST training samples from 94.26% to 98.25%.
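A sketch of the boundary-seeking step under stated assumptions (single input, integer class index, PyTorch; the function and hyperparameters are illustrative, not the authors' code): with the weights frozen, gradient descent shrinks the margin between the true logit and the nearest wrong logit, pushing the sample toward the decision boundary:

    import torch

    def synthesize_boundary_sample(model, x, y_true, steps=20, lr=0.05):
        """Move x toward the boundary between its true label and the
        nearest wrong label while the network weights stay fixed."""
        x_adv = x.clone().detach().requires_grad_(True)
        for _ in range(steps):
            logits = model(x_adv)               # shape (1, num_classes)
            wrong = logits.detach().clone()
            wrong[0, y_true] = float("-inf")
            y_near = int(wrong.argmax(dim=1))   # highest-scoring wrong label
            margin = logits[0, y_true] - logits[0, y_near]
            margin.backward()                   # d(margin)/dx
            with torch.no_grad():
                x_adv -= lr * x_adv.grad.sign() # shrink the margin
                x_adv.grad.zero_()
        return x_adv.detach()

Retraining on these samples together with the primary data, then regenerating, gives the self-competition loop described above.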

101. Comparative performance analysis of the ResNet backbones of Mask RCNN to segment the signs of COVID-19 in chest CT scans [PDF]
  Muhammad Aleem, Rahul Raj, Arshad Khan
Abstract: COVID-19 has been devastating in terms of the number of fatalities and the rising number of critical patients across the world. According to the UNDP (United Nations Development Programme) Socio-Economic programme, aimed at the COVID-19 crisis, the pandemic is far more than a health crisis: it is affecting societies and economies at their core. There have recently been notable developments in chest X-ray-based imaging techniques as part of COVID-19 diagnosis, especially using Convolutional Neural Networks (CNNs) for recognising and classifying images. However, given the limited availability of supervised, labelled imaging data, the classification and predictive risk modelling of medical diagnoses tend to be compromised. This paper aims to identify and monitor the effects of COVID-19 on the human lungs by employing deep neural networks on axial CT (chest computed tomography) scans of the lungs. We adopt Mask RCNN, with ResNet50 and ResNet101 as its backbone, to segment the regions affected by the COVID-19 coronavirus. Using the regions of the lungs where symptoms have manifested, the model classifies the patient's condition as either "Mild" or "Alarming". Moreover, the model is deployed on the Google Cloud Platform (GCP) to simulate online usage for performance evaluation and accuracy improvement. The ResNet101 backbone model produces an F1 score of 0.85 and faster predictions, with an average time of 9.04 seconds per inference.
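The abstract does not spell out how the Mild/Alarming decision is derived from the segmented regions, so the following is only a plausible toy rule with a hypothetical area threshold:

    def classify_severity(infected_area: float, lung_area: float,
                          alarming_fraction: float = 0.25) -> str:
        """Toy rule: label the case Alarming when the Mask RCNN-segmented
        infected area covers a large fraction of the lungs. The 25%
        threshold is an illustrative assumption, not from the paper."""
        return "Alarming" if infected_area / lung_area >= alarming_fraction else "Mild"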

102. A Unified Taylor Framework for Revisiting Attribution Methods [PDF]
  Huiqi Deng, Na Zou, Mengnan Du, Weifu Chen, Guocan Feng, Xia Hu
Abstract: Attribution methods have been developed to understand the decision-making process of machine learning models, especially deep neural networks, by assigning importance scores to individual features. Existing attribution methods are often built upon empirical intuitions and heuristics; a unified framework that provides deeper understanding of their rationales, theoretical fidelity, and limitations is still lacking. To bridge the gap, we present a Taylor attribution framework to theoretically characterize the fidelity of explanations. The key idea is to decompose model behaviors into first-order, high-order independent, and high-order interactive terms, which clarifies the attribution of high-order effects and complex feature interactions. Three desired properties are proposed for Taylor attributions: low model approximation error, and accurate assignment of both independent and interactive effects. Moreover, several popular attribution methods are mathematically reformulated under the unified Taylor attribution framework. Our theoretical investigation indicates that these attribution methods implicitly reflect high-order terms involving complex feature interdependencies. Among these methods, Integrated Gradients is the only one satisfying all three desired properties. New attribution methods are then proposed based on Integrated Gradients by utilizing the Taylor framework. Experimental results show that the proposed methods outperform existing ones in model interpretation.
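As a sketch of the decomposition described above, a second-order Taylor expansion of a model output f around a baseline point already separates the three kinds of terms (the paper's framework generalizes beyond second order):

    f(x) \approx f(\tilde{x})
      + \sum_i \frac{\partial f}{\partial x_i}\,\Delta x_i                                        % first-order
      + \frac{1}{2}\sum_i \frac{\partial^2 f}{\partial x_i^2}\,\Delta x_i^2                       % high-order independent
      + \frac{1}{2}\sum_{i \neq j} \frac{\partial^2 f}{\partial x_i \partial x_j}\,\Delta x_i \Delta x_j,   % high-order interactive
    \quad \text{where } \Delta x = x - \tilde{x}.

An attribution method is then faithful to the extent that the scores it assigns account for all three groups of terms with low approximation error.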

103. HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN [PDF]
  Abhinav Sagar
Abstract: In this paper, we present a novel network for high-resolution video generation. Our network borrows ideas from Wasserstein GANs, enforcing a k-Lipschitz constraint on the loss term, and from Conditional GANs, using class labels for training and testing. We present layer-wise details of the generator and discriminator networks, along with the combined network architecture, optimization details, and the algorithm used in this work. Our network is trained with a combination of two loss terms: a mean-squared pixel loss and an adversarial loss. The datasets used for training and testing are UCF101, Golf, and Aeroplane. Using Inception Score and Fréchet Inception Distance as the evaluation metrics, our network outperforms previous state-of-the-art networks on unsupervised video generation.
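A minimal sketch (assumed names and weighting, not the paper's exact objective) of the combined generator loss the abstract mentions: a mean-squared pixel term plus a Wasserstein-style adversarial term:

    import torch
    import torch.nn.functional as F

    def generator_loss(fake_frames, real_frames, critic_scores, pixel_weight=1.0):
        """critic_scores = critic(fake_frames); the generator raises the
        critic's score on its outputs (WGAN-style adversarial term) while
        matching the target frames pixel-wise."""
        pixel_loss = F.mse_loss(fake_frames, real_frames)
        adv_loss = -critic_scores.mean()
        return pixel_weight * pixel_loss + adv_loss

The k-Lipschitz constraint from the WGAN formulation applies to the critic (e.g., via weight clipping or a gradient penalty) and is enforced separately from this generator objective.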
