Contents
4. Fast Object Classification and Meaningful Data Representation of Segmented Lidar Instances [PDF] Abstract
7. WhoAmI: An Automatic Tool for Visual Recognition of Tiger and Leopard Individuals in the Wild [PDF] Abstract
12. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings [PDF] Abstract
25. LRPD: Long Range 3D Pedestrian Detection Leveraging Specific Strengths of LiDAR and RGB [PDF] Abstract
29. Revealing the Invisible with Model and Data Shrinking for Composite-database Micro-expression Recognition [PDF] Abstract
41. On the Inference of Soft Biometrics from Typing Patterns Collected in a Multi-device Environment [PDF] Abstract
44. Universal Lower-Bounds on Classification Error under Adversarial Attacks and Random Corruption [PDF] Abstract
50. Intelligent Protection & Classification of Transients in Two-Core Symmetric Phase Angle Regulating Transformers [PDF] Abstract
53. StatAssist & GradBoost: A Study on Optimal INT8 Quantization-aware Training from Scratch [PDF] Abstract
55. On sparse connectivity, adversarial robustness, and a novel model of the artificial neuron [PDF] Abstract
Abstracts
1. Learning to Detect 3D Reflection Symmetry for Single-View Reconstruction [PDF] Back to contents
Yichao Zhou, Shichen Liu, Yi Ma
Abstract: 3D reconstruction from a single RGB image is a challenging problem in computer vision. Previous methods are usually solely data-driven, which leads to inaccurate 3D shape recovery and limited generalization capability. In this work, we focus on object-level 3D reconstruction and present a geometry-based end-to-end deep learning framework that first detects the mirror plane of reflection symmetry that commonly exists in man-made objects and then predicts depth maps by finding the intra-image pixel-wise correspondence of the symmetry. Our method fully utilizes the geometric cues from symmetry during the test time by building plane-sweep cost volumes, a powerful tool that has been used in multi-view stereopsis. To our knowledge, this is the first work that uses the concept of cost volumes in the setting of single-image 3D reconstruction. We conduct extensive experiments on the ShapeNet dataset and find that our reconstruction method significantly outperforms the previous state-of-the-art single-view 3D reconstruction networks in terms of the accuracy of camera poses and depth maps, without requiring objects to be completely symmetric. Code is available at this https URL.
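To make the symmetry correspondence concrete: given a mirror-plane hypothesis, every 3D point has a reflected counterpart, and matching the image projections of such pairs is what the cost volume scores. The snippet below is a minimal NumPy illustration of that reflection step (our own sketch, not the authors' released code; `reflect_points` is a hypothetical helper name):

```python
import numpy as np

def reflect_points(points, n, d):
    """Reflect 3D points across the plane {x : n.x + d = 0}.

    points: (N, 3) array of 3D points.
    n:      (3,) plane normal (normalized below).
    d:      scalar plane offset.
    Each point p and its mirror image p' form a symmetry correspondence
    that a plane-sweep cost volume can score in image space.
    """
    n = n / np.linalg.norm(n)      # ensure the normal is unit length
    dist = points @ n + d          # signed distance of each point to the plane
    return points - 2.0 * dist[:, None] * n

# Toy check: reflecting across the x = 0 plane flips the sign of x.
pts = np.array([[1.0, 2.0, 3.0], [-0.5, 0.0, 1.0]])
print(reflect_points(pts, n=np.array([1.0, 0.0, 0.0]), d=0.0))
```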
2. LSD-C: Linearly Separable Deep Clusters [PDF] Back to contents
Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman
Abstract: We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our algorithm first establishes pairwise connections in the feature space between the samples of the minibatch based on a similarity metric. Then it regroups the connected samples into clusters and enforces a linear separation between clusters. This is achieved by using the pairwise connections as targets together with a binary cross-entropy loss on the predictions that the associated pairs of samples belong to the same cluster. This way, the feature representation of the network will evolve such that similar samples in this feature space will belong to the same linearly separated cluster. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
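The pairwise objective described above is simple to write down. The following is a hedged PyTorch sketch, not the authors' implementation: pairwise targets come from thresholded feature similarity (`sim_threshold` is a hypothetical hyperparameter), and the probability that two samples share a cluster is the inner product of their softmax cluster assignments, trained with binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def pairwise_cluster_loss(features, logits, sim_threshold=0.9):
    """Pairwise BCE in the spirit of LSD-C (sketch, not the reference code).

    features: (B, D) minibatch embeddings used to build pairwise targets.
    logits:   (B, K) cluster-head outputs for K clusters.
    """
    f = F.normalize(features, dim=1)
    targets = (f @ f.t() > sim_threshold).float()      # 1 if the pair is "connected"
    p = logits.softmax(dim=1)
    # P(pair in same cluster) = sum_k p_i[k] * p_j[k], always in [0, 1].
    same_cluster = (p @ p.t()).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(same_cluster, targets)
```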
3. Semantic Visual Navigation by Watching YouTube Videos [PDF] Back to contents
Matthew Chang, Arjun Gupta, Saurabh Gupta
Abstract: Semantic cues and statistical regularities in real-world environment layouts can improve efficiency for navigation in novel environments. This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. This is challenging because YouTube videos don't come with labels for actions or goals, and may not even showcase optimal behavior. Our proposed method tackles these challenges through the use of Q-learning on pseudo-labeled transition quadruples (image, action, next image, reward). We show that such off-policy Q-learning from passive data is able to learn meaningful semantic cues for navigation. These cues, when used in a hierarchical navigation policy, lead to improved efficiency at the ObjectGoal task in visually realistic simulations. We improve upon end-to-end RL methods by 66%, while using 250x fewer interactions. Code, data, and models will be made available.
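The learning rule itself is ordinary off-policy Q-learning applied to the pseudo-labeled quadruples. A minimal tabular sketch follows (the paper trains a neural Q-function on image observations; the tabular form is only to show the update):

```python
import numpy as np

def q_update(Q, s, a, s_next, r, alpha=0.1, gamma=0.99):
    """One tabular off-policy Q-learning step on a pseudo-labeled
    (state, action, next state, reward) quadruple."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the best next action
    Q[s, a] += alpha * (td_target - Q[s, a])    # move estimate toward the TD target
    return Q

# Toy usage with 4 states and 2 actions.
Q = np.zeros((4, 2))
Q = q_update(Q, s=0, a=1, s_next=2, r=1.0)
```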
4. Fast Object Classification and Meaningful Data Representation of Segmented Lidar Instances [PDF] Back to contents
Lukas Hahn, Frederik Hasecke, Anton Kummert
Abstract: Object detection algorithms for Lidar data have seen numerous publications in recent years, reporting good results on dataset benchmarks oriented towards automotive requirements. Nevertheless, many of these are not deployable to embedded vehicle systems, as they require immense computational power to be executed close to real time. In this work, we propose a way to facilitate real-time Lidar object classification on CPU. We show how our approach uses segmented object instances to extract important features, enabling a computationally efficient batch-wise classification. For this, we introduce a data representation which translates three-dimensional information into small image patches, using decomposed normal vector images. We couple this with dedicated object statistics to handle edge cases. We apply our method on the tasks of object detection and semantic segmentation, as well as the relatively new challenge of panoptic segmentation. Through evaluation, we show, that our algorithm is capable of producing good results on public data, while running in real time on CPU without using specific optimisation.
5. Deeply Learned Spectral Total Variation Decomposition [PDF] Back to contents
Tamara G. Grossmann, Yury Korolev, Guy Gilboa, Carola-Bibiane Schönlieb
Abstract: Non-linear spectral decompositions of images based on one-homogeneous functionals such as total variation have gained considerable attention in the last few years. Due to their ability to extract spectral components corresponding to objects of different size and contrast, such decompositions enable filtering, feature transfer, image fusion and other applications. However, obtaining this decomposition involves solving multiple non-smooth optimisation problems and is therefore computationally highly intensive. In this paper, we present a neural network approximation of a non-linear spectral decomposition. We report up to four orders of magnitude ($\times 10,000$) speedup in processing of mega-pixel size images, compared to classical GPU implementations. Our proposed network, TVSpecNET, is able to implicitly learn the underlying PDE and, despite being entirely data driven, inherits invariances of the model based transform. To the best of our knowledge, this is the first approach towards learning a non-linear spectral decomposition of images. Not only do we gain a staggering computational advantage, but this approach can also be seen as a step towards studying neural networks that can decompose an image into spectral components defined by a user rather than a handcrafted functional.
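For readers unfamiliar with the functional in question, here is a short NumPy sketch of discrete (anisotropic) total variation and its one-homogeneity, the property the spectral framework relies on. Computing the actual spectral decomposition requires solving the TV flow, which this toy does not attempt:

```python
import numpy as np

def total_variation(u):
    """Discrete (anisotropic) total variation of a grayscale image u:
    the sum of absolute differences between neighboring pixels.
    TV is one-homogeneous: total_variation(c * u) == c * total_variation(u)
    for any c >= 0."""
    dx = np.abs(np.diff(u, axis=1)).sum()   # horizontal finite differences
    dy = np.abs(np.diff(u, axis=0)).sum()   # vertical finite differences
    return dx + dy

u = np.random.rand(64, 64)
assert np.isclose(total_variation(2.0 * u), 2.0 * total_variation(u))
```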
6. Noise or Signal: The Role of Image Backgrounds in Object Recognition [PDF] Back to contents
Kai Xiao, Logan Engstrom, Andrew Ilyas, Aleksander Madry
Abstract: We assess the tendency of state-of-the-art object recognition models to depend on signals from image backgrounds. We create a toolkit for disentangling foreground and background signal on ImageNet images, and find that (a) models can achieve non-trivial accuracy by relying on the background alone, (b) models often misclassify images even in the presence of correctly classified foregrounds--up to 87.5% of the time with adversarially chosen backgrounds, and (c) more accurate models tend to depend on backgrounds less. Our analysis of backgrounds brings us closer to understanding which correlations machine learning models use, and how they determine models' out of distribution performance.
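Finding (a) can be probed with a few lines of code. The sketch below assumes foreground masks are available and simply zero-fills the foreground before evaluating; the zero-fill choice and the function name are our assumptions, not the paper's exact toolkit:

```python
import torch

@torch.no_grad()
def background_only_accuracy(model, images, fg_masks, labels):
    """Accuracy when the foreground is blacked out, so that only the
    background signal remains.

    images:   (B, 3, H, W) batch of images.
    fg_masks: (B, 1, H, W) with 1 on foreground pixels.
    labels:   (B,) ground-truth class indices.
    """
    bg_images = images * (1.0 - fg_masks)        # keep background pixels only
    preds = model(bg_images).argmax(dim=1)
    return (preds == labels).float().mean().item()
```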
7. WhoAmI: An Automatic Tool for Visual Recognition of Tiger and Leopard Individuals in the Wild [PDF] Back to contents
Rita Pucci, Jitendra Shankaraiah, Devcharan Jathanna, Ullas Karanth, Kartic Subr
Abstract: Photographs of wild animals in their natural habitats can be recorded unobtrusively via cameras that are triggered by motion nearby. The installation of such camera traps is becoming increasingly common across the world. Although this is a convenient source of invaluable data for biologists, ecologists and conservationists, the arduous task of poring through potentially millions of pictures each season introduces prohibitive costs and frustrating delays. We develop automatic algorithms that are able to detect animals, identify the species of animals and recognize individual animals for two species. We propose the first fully-automatic tool that can recognize specific individuals of leopard and tiger due to their characteristic body markings. We adopt a supervised machine learning approach in which a Deep Convolutional Neural Network (DCNN) is trained using several instances of manually-labelled images for each of the three classification tasks. We demonstrate the effectiveness of our approach on a data set of camera-trap images recorded in the jungles of Southern India.
8. When We First Met: Visual-Inertial Person Localization for Co-Robot Rendezvous [PDF] Back to contents
Xi Sun, Xinshuo Weng, Kris Kitani
Abstract: We aim to enable robots to visually localize a target person through the aid of an additional sensing modality -- the target person's 3D inertial measurements. The need for such technology may arise when a robot is to meet a person in a crowd for the first time or when an autonomous vehicle must rendezvous with a rider amongst a crowd without knowing the appearance of the person in advance. A person's inertial information can be measured with a wearable device such as a smart-phone and can be shared selectively with an autonomous system during the rendezvous. We propose a method to learn a visual-inertial feature space in which the motion of a person in video can be easily matched to the motion measured by a wearable inertial measurement unit (IMU). The transformation of the two modalities into the joint feature space is learned through the use of a contrastive loss which forces inertial motion features and video motion features generated by the same person to lie close in the joint feature space. To validate our approach, we compose a dataset of over 60,000 video segments of moving people along with wearable IMU data. Our experiments show that our proposed method is able to accurately localize a target person with 80.7% accuracy using only 5 seconds of IMU data and video.
9. Contrastive Learning for Weakly Supervised Phrase Grounding [PDF] Back to contents
Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem
Abstract: Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on Flickr30K Entities benchmark.
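The "lower bound on mutual information" here is the familiar InfoNCE-style contrastive bound. A hedged PyTorch sketch (our simplification; the temperature `tau` and the pooling of regions into one vector per word are assumptions): score each word feature against the attention-weighted regions of its own image versus regions pooled from non-corresponding images:

```python
import torch
import torch.nn.functional as F

def info_nce(word_feats, region_feats_pos, region_feats_negs, tau=0.07):
    """InfoNCE-style bound sketch for word-region compatibility.

    word_feats:        (B, D) word embeddings.
    region_feats_pos:  (B, D) attention-weighted regions of the matching image.
    region_feats_negs: (B, N, D) pooled regions from non-matching images.
    """
    pos = (word_feats * region_feats_pos).sum(-1, keepdim=True)      # (B, 1)
    neg = torch.einsum('bd,bnd->bn', word_feats, region_feats_negs)  # (B, N)
    logits = torch.cat([pos, neg], dim=1) / tau
    # The positive pair always sits at index 0 of each row.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```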
10. FISHING Net: Future Inference of Semantic Heatmaps In Grids [PDF] Back to contents
Noureldin Hendy, Cooper Sloan, Feng Tian, Pengfei Duan, Nick Charchut, Yuesong Xie, Chuang Wang, James Philbin
Abstract: For autonomous robots to navigate a complex environment, it is crucial to understand the surrounding scene both geometrically and semantically. Modern autonomous robots employ multiple sets of sensors, including lidars, radars, and cameras. Managing the different reference frames and characteristics of the sensors, and merging their observations into a single representation complicates perception. Choosing a single unified representation for all sensors simplifies the task of perception and fusion. In this work, we present an end-to-end pipeline that performs semantic segmentation and short term prediction using a top-down representation. Our approach consists of an ensemble of neural networks which take in sensor data from different sensor modalities and transform them into a single common top-down semantic grid representation. We find this representation favorable as it is agnostic to sensor-specific reference frames and captures both the semantic and geometric information for the surrounding scene. Because the modalities share a single output representation, they can be easily aggregated to produce a fused output. In this work we predict short-term semantic grids but the framework can be extended to other tasks. This approach offers a simple, extensible, end-to-end approach for multi-modal perception and prediction.
11. Vision-Aided Dynamic Blockage Prediction for 6G Wireless Communication Networks [PDF] Back to contents
Gouranga Charan, Muhammad Alrabeiah, Ahmed Alkhateeb
Abstract: Unlocking the full potential of millimeter-wave and sub-terahertz wireless communication networks hinges on realizing unprecedented low-latency and high-reliability requirements. The challenge in meeting those requirements lies partly in the sensitivity of signals in the millimeter-wave and sub-terahertz frequency ranges to blockages. One promising way to tackle that challenge is to help a wireless network develop a sense of its surrounding using machine learning. This paper attempts to do that by utilizing deep learning and computer vision. It proposes a novel solution that proactively predicts \textit{dynamic} link blockages. More specifically, it develops a deep neural network architecture that learns from observed sequences of RGB images and beamforming vectors how to predict possible future link blockages. The proposed architecture is evaluated on a publicly available dataset that represents a synthetic dynamic communication scenario with multiple moving users and blockages. It scores a link-blockage prediction accuracy in the neighborhood of 86\%, a performance that is unlikely to be matched without utilizing visual data.
12. Let's face it: Probabilistic multi-modal interlocutor-aware generation of facial gestures in dyadic settings [PDF] Back to contents
Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow
Abstract: To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior. Those that do, typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we introduce a probabilistic method to synthesize interlocutor-aware facial gestures - represented by highly expressive FLAME parameters - in dyadic conversations. Our contributions are: a) a method for feature extraction from multi-party video and speech recordings, resulting in a representation that allows for independent control and manipulation of expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a recent motion-synthesis method based on normalizing flows, to also take multi-modal signals from the interlocutor as input and subsequently output interlocutor-aware facial gestures; and c) subjective and objective experiments assessing the use and relative importance of the different modalities in the synthesized output. The results show that the model successfully leverages the input from the interlocutor to generate more appropriate behavior.
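MoGlow, which this paper extends, is built on normalizing flows, whose training signal is the change-of-variables log-likelihood. Below is a single-step toy version in PyTorch; a real flow stacks many invertible layers and conditions on control inputs, so this only shows the log p(x) = log p(z) + log|det dz/dx| bookkeeping:

```python
import torch

def affine_flow_logprob(x, scale, shift):
    """Log-likelihood under a toy one-layer flow: z = (x - shift) / scale,
    with a standard-normal base distribution.

    x:     (B, D) data; scale, shift: (D,) invertible affine parameters.
    """
    z = (x - shift) / scale
    base = torch.distributions.Normal(0.0, 1.0)
    # Jacobian of z(x) is diag(1 / scale), so log|det| = -sum log|scale|.
    log_det = -torch.log(scale.abs()).sum(-1)
    return base.log_prob(z).sum(-1) + log_det
```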
13. Quantification of groundnut leaf defects using image processing algorithms [PDF] Back to contents
Asharf, Balasubramanian E, Sankarasrinivasan S
Abstract: Identification, classification, and quantification of crop defects are of paramount interest to farmers for preventive measures and for decreasing yield loss through necessary remedial actions. Due to the vast agricultural field, manual inspection of crops is tedious and time-consuming. UAV-based data collection, observation, identification, and quantification of defective leaf area are considered to be an effective solution. The present work attempts to estimate the percentage of affected groundnut leaf area across four regions of Andhra Pradesh using image processing techniques. The proposed method involves colour space transformation combined with a thresholding technique to perform the segmentation. The calibration measures are performed during acquisition with respect to UAV capturing distance, angle and other relevant camera parameters. Finally, our method can estimate the consolidated leaf area and the defective area. The image analysis results across these four regions reveal that around 14-28% of the leaf area is affected across the groundnut field, and the yield will be diminished correspondingly. Hence, it is recommended to spray pesticides on the affected regions alone, improving plant growth and thereby increasing yield.
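A colour-space transform plus thresholding, as the abstract describes, can be sketched with OpenCV. The HSV bounds below are illustrative guesses, since the abstract does not state the exact colour space or thresholds used:

```python
import cv2
import numpy as np

def affected_leaf_percentage(bgr_image,
                             healthy_lo=(25, 40, 40),
                             healthy_hi=(95, 255, 255)):
    """Estimate the % of leaf area that is not healthy green via HSV
    thresholding. bgr_image is a uint8 BGR image; the HSV bounds are
    illustrative assumptions, not the paper's calibrated values."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    healthy = cv2.inRange(hsv,
                          np.array(healthy_lo, dtype=np.uint8),
                          np.array(healthy_hi, dtype=np.uint8)) > 0
    # Treat sufficiently saturated pixels as leaf; the rest as soil/background.
    leaf = hsv[..., 1] > 30
    affected = leaf & ~healthy
    return 100.0 * affected.sum() / max(leaf.sum(), 1)
```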
14. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [PDF] Back to contents
Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin
Abstract: Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or views) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a swapped prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. We validate our findings by achieving $75.3\%$ top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.
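The swapped prediction mechanism reduces to a symmetric cross-entropy between each view's code and the other view's prediction. A simplified PyTorch sketch follows; note the real SwAV computes codes with a Sinkhorn-Knopp equal-partition step, which is replaced by a plain softmax here for brevity:

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(scores_v1, scores_v2, tau=0.1):
    """SwAV-style swapped prediction (simplified sketch).

    scores_v*: (B, K) similarities of each view's embedding to K prototypes.
    """
    with torch.no_grad():                    # codes act as fixed targets
        q1 = scores_v1.softmax(dim=1)
        q2 = scores_v2.softmax(dim=1)
    p1 = F.log_softmax(scores_v1 / tau, dim=1)
    p2 = F.log_softmax(scores_v2 / tau, dim=1)
    # Predict view 1's code from view 2 and vice versa.
    return -0.5 * ((q1 * p2).sum(1) + (q2 * p1).sum(1)).mean()
```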
15. Using Wavelets and Spectral Methods to Study Patterns in Image-Classification Datasets [PDF] Back to contents
Roozbeh Yousefzadeh, Furong Huang
Abstract: Deep learning models extract, before a final classification layer, features or patterns which are key for their unprecedented advantageous performance. However, the process of complex nonlinear feature extraction is not well understood, a major reason why interpretation, adversarial robustness, and generalization of deep neural nets are all open research problems. In this paper, we use wavelet transformation and spectral methods to analyze the contents of image classification datasets, extract specific patterns from the datasets and find the associations between patterns and classes. We show that each image can be written as the summation of a finite number of rank-1 patterns in the wavelet space, providing a low rank approximation that captures the structures and patterns essential for learning. Regarding the studies on memorization vs learning, our results clearly reveal disassociation of patterns from classes, when images are randomly labeled. Our method can be used as a pattern recognition approach to understand and interpret learnability of these datasets. It may also be used for gaining insights about the features and patterns that deep classifiers learn from the datasets.
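The "summation of a finite number of rank-1 patterns in the wavelet space" can be illustrated with one Haar analysis step followed by a truncated SVD of a coefficient band. This NumPy sketch is our own illustration of the idea, not the paper's pipeline:

```python
import numpy as np

def haar_level(u):
    """One 2D Haar analysis step (assumes even dimensions): split u into
    an approximation band and three detail bands."""
    a = (u[0::2, :] + u[1::2, :]) / 2.0
    d = (u[0::2, :] - u[1::2, :]) / 2.0
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0   # low-low (approximation)
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def rank1_sum(band, k):
    """Keep the top-k rank-1 terms of a wavelet band via SVD: a low-rank
    approximation built from a finite sum of rank-1 patterns."""
    U, s, Vt = np.linalg.svd(band, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

img = np.random.rand(128, 128)
ll, lh, hl, hh = haar_level(img)
ll_lowrank = rank1_sum(ll, k=8)   # low-rank approximation of the coarse band
```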
16. Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues [PDF] Back to contents
Jianrong Wang, Ge Zhang, Zhenyu Wu, XueWei Li, Li Liu
Abstract: In self-supervised monocular depth estimation, depth discontinuities and moving objects' artifacts are still challenging problems. Existing self-supervised methods usually utilize a single view to train the depth estimation network. Compared with static views, abundant dynamic properties between video frames are beneficial to refined depth estimation, especially for dynamic objects. In this work, we propose a novel self-supervised joint learning framework for depth estimation using consecutive frames from monocular and stereo videos. The main idea is using an implicit depth cue extractor which leverages dynamic and static cues to generate useful depth proposals. These cues can predict distinguishable motion contours and geometric scene structures. Furthermore, a new high-dimensional attention module is introduced to extract clear global transformation, which effectively suppresses uncertainty of local descriptors in high-dimensional space, resulting in a more reliable optimization in the learning framework. Experiments demonstrate that the proposed framework outperforms the state-of-the-art (SOTA) on the KITTI and Make3D datasets.
17. Shallow Feature Based Dense Attention Network for Crowd Counting [PDF] Back to contents
Yunqi Miao, Zijia Lin, Guiguang Ding, Jungong Han
Abstract: While the performance of crowd counting via deep learning has been improved dramatically in the recent years, it remains an ingrained problem due to cluttered backgrounds and varying scales of people within an image. In this paper, we propose a Shallow feature based Dense Attention Network (SDANet) for crowd counting from still images, which diminishes the impact of backgrounds via involving a shallow feature based attention model, and meanwhile, captures multi-scale information via densely connecting hierarchical image features. Specifically, inspired by the observation that backgrounds and human crowds generally have noticeably different responses in shallow features, we decide to build our attention model upon shallow-feature maps, which results in accurate background-pixel detection. Moreover, considering that the most representative features of people across different scales can appear in different layers of a feature extraction network, to better keep them all, we propose to densely connect hierarchical image features of different layers and subsequently encode them for estimating crowd density. Experimental results on three benchmark datasets clearly demonstrate the superiority of SDANet when dealing with different scenarios. Particularly, on the challenging UCF CC 50 dataset, our method outperforms other existing methods by a large margin, as is evident from a remarkable 11.9% Mean Absolute Error (MAE) drop of our SDANet.
18. Burst Photography for Learning to Enhance Extremely Dark Images [PDF] Back to contents
Ahmet Serdar Karadeniz, Erkut Erdem, Aykut Erdem
Abstract: Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quality. Motivated by these studies, in this paper, we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods by producing more detailed and considerably higher quality images.
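The permutation-invariant merge can be sketched with a symmetric reduction over the burst axis; the toy PyTorch snippet below (our simplification, with made-up layer sizes) shows only that order-independence property, not the authors' full coarse-to-fine pipeline.

```python
# Minimal sketch of permutation-invariant burst fusion: each low-light frame
# is encoded independently, and per-frame features are merged with a symmetric
# (order-independent) max-pool before decoding.
import torch
import torch.nn as nn

class BurstFusion(nn.Module):
    def __init__(self, in_ch=4, feat_ch=32):   # 4 channels: packed raw (assumed)
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Conv2d(feat_ch, 3, 3, padding=1)  # RGB output

    def forward(self, burst):                   # burst: (B, N, C, H, W)
        b, n, c, h, w = burst.shape
        feats = self.encoder(burst.view(b * n, c, h, w)).view(b, n, -1, h, w)
        merged = feats.max(dim=1).values        # symmetric merge over the burst
        return self.decoder(merged)

if __name__ == "__main__":
    x = torch.randn(2, 8, 4, 64, 64)            # burst of 8 frames
    print(BurstFusion()(x).shape)               # torch.Size([2, 3, 64, 64])
```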
19. Sketch-Guided Scenery Image Outpainting [PDF] 返回目录
Yaxiong Wang, Yunchao Wei, Xueming Qian, Li Zhu, Yi Yang
Abstract: The outpainting results produced by existing approaches are often too random to meet users' requirements. In this work, we take image outpainting one step forward by allowing users to harvest personal custom outpainting results using sketches as the guidance. To this end, we propose an encoder-decoder based network to conduct sketch-guided outpainting, where two alignment modules are adopted to constrain the generated content to be realistic and consistent with the provided sketches. First, we apply a holistic alignment module to make the synthesized part similar to the real one from a global view. Second, we reversely produce sketches from the synthesized part and encourage them to be consistent with the ground-truth ones using a sketch alignment module. In this way, the learned generator is pushed to pay more attention to fine details and to be sensitive to the guiding sketches. To our knowledge, this work is the first attempt to explore the challenging yet meaningful task of conditional scenery image outpainting. We conduct extensive experiments on two collected benchmarks to qualitatively and quantitatively validate the effectiveness of our approach compared with other state-of-the-art generative models.
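The sketch alignment module can be pictured as a cycle-style consistency loss: re-derive a sketch from the generated region and penalize its distance to the guiding sketch. The snippet below is our toy rendering of that idea; the `sketcher` stand-in is hypothetical, not the paper's actual module.

```python
# Hypothetical sketch-alignment loss: an auxiliary network re-derives a sketch
# from the synthesized region, and an L1 penalty ties it to the guiding sketch.
import torch
import torch.nn as nn

sketcher = nn.Conv2d(3, 1, 3, padding=1)   # stand-in for a sketch extractor

def sketch_alignment_loss(generated_rgb, guide_sketch):
    predicted_sketch = torch.sigmoid(sketcher(generated_rgb))
    return nn.functional.l1_loss(predicted_sketch, guide_sketch)

gen = torch.randn(2, 3, 128, 128)          # synthesized outpainted region
guide = torch.rand(2, 1, 128, 128)         # user-provided guiding sketch
print(sketch_alignment_loss(gen, guide).item())
```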
20. Self-supervised Knowledge Distillation for Few-shot Learning [PDF] 返回目录
Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Mubarak Shah
Abstract: The real world contains an overwhelmingly large number of object classes, and learning all of them at once is infeasible. Few-shot learning is a promising learning paradigm due to its ability to learn out-of-order distributions quickly with only a few samples. Recent works [7, 41] show that simply learning a good feature embedding can outperform more sophisticated meta-learning and metric learning algorithms for few-shot learning. In this paper, we propose a simple approach to improve the representation capacity of deep neural networks for few-shot learning tasks. We follow a two-stage learning process: first, we train a neural network to maximize the entropy of the feature embedding, thus creating an optimal output manifold using a self-supervised auxiliary loss. In the second stage, we minimize the entropy on the feature embedding by bringing self-supervised twins together, while constraining the manifold with student-teacher distillation. Our experiments show that, even in the first stage, self-supervision can outperform current state-of-the-art methods, with further gains achieved by our second-stage distillation process. Our codes are available at: this https URL.
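The second-stage distillation term can be written as a standard temperature-scaled KL divergence between teacher and student logits; this is a generic sketch under our reading of the two-stage recipe, with hyperparameters chosen arbitrarily.

```python
# Sketch of the stage-two objective: hard-label cross-entropy plus a
# soft-label KL term against the frozen stage-one teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kl

s, t = torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 64, (8,))
print(distillation_loss(s, t, y).item())
```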
21. Mitosis Detection Under Limited Annotation: A Joint Learning Approach [PDF] 返回目录
Pushpak Pati, Antonio Foncubierta-Rodriguez, Orcun Goksel, Maria Gabrani
Abstract: Mitotic counting is a vital prognostic marker of tumor proliferation in breast cancer. Deep learning-based mitotic detection is on par with pathologists, but it requires large labeled data for training. We propose a deep classification framework for enhancing mitosis detection by leveraging class label information, via a softmax loss, and spatial distribution information among samples, via distance metric learning. We also investigate strategies for steadily providing informative samples to boost the learning. The efficacy of the proposed framework is established through evaluation on the ICPR 2012 and AMIDA 2013 mitotic data. Our framework significantly improves detection with small training data and achieves performance on par with or superior to state-of-the-art methods that use the entire training data.
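The joint objective pairs a softmax cross-entropy term with a metric-learning term; below is a minimal sketch assuming a triplet margin loss as the distance-metric component (the paper may use a different metric formulation).

```python
# Sketch of a joint classification + metric-learning loss: cross-entropy on
# class labels plus a triplet term shaping the embedding space.
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, anchor, positive, negative, lam=1.0):
    ce = F.cross_entropy(logits, labels)
    tri = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
    return ce + lam * tri

logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
a, p, n = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128)
print(joint_loss(logits, labels, a, p, n).item())
```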
22. Maximum Roaming Multi-Task Learning [PDF] 返回目录
Lucas Pascal, Pietro Michiardi, Xavier Bost, Benoit Huet, Maria A. Zuluaga
Abstract: Multi-task learning has gained popularity due to the advantages it provides with respect to resource usage and performance. Nonetheless, the joint optimization of parameters with respect to multiple tasks remains an active research topic. Sub-partitioning the parameters between different tasks has proven to be an efficient way to relax the optimization constraints over the shared weights, whether the partitions are disjoint or overlapping. However, one drawback of this approach is that it can weaken the inductive bias generally set up by the joint task optimization. In this work, we present a novel way to partition the parameter space without weakening the inductive bias. Specifically, we propose Maximum Roaming, a method inspired by dropout that randomly varies the parameter partitioning, forcing the parameters to visit as many tasks as possible at a regulated frequency, so that the network fully adapts to each update. We study the properties of our method through experiments on a variety of visual multi-task data sets. Experimental results suggest that the regularization brought by roaming has more impact on performance than usual partitioning optimization strategies. The overall method is flexible, easily applicable, provides superior regularization, and consistently achieves improved performance compared to recent multi-task learning formulations.
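The roaming update can be pictured as a random swap inside each task's binary parameter mask at a regulated frequency; this toy function is our own illustration of the idea, not the released implementation.

```python
# Toy roaming step: each task owns a {0,1} mask over shared units; one
# masked-out unit is swapped in for a masked-in one, so every unit
# eventually "visits" every task while mask sizes stay fixed.
import torch

def roam(masks: torch.Tensor) -> torch.Tensor:
    # masks: (num_tasks, num_units) binary
    masks = masks.clone()
    for t in range(masks.size(0)):
        off = (masks[t] == 0).nonzero().flatten()
        on = (masks[t] == 1).nonzero().flatten()
        if len(off) and len(on):
            masks[t, off[torch.randint(len(off), (1,))]] = 1
            masks[t, on[torch.randint(len(on), (1,))]] = 0
    return masks

m = (torch.rand(3, 16) > 0.5).float()
print(m.sum(1), roam(m).sum(1))   # per-task mask sizes are preserved
```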
23. Evaluation of 3D CNN Semantic Mapping for Rover Navigation [PDF] 返回目录
Sebastiano Chiodini, Luca Torresin, Marco Pertile, Stefano Debei
Abstract: Terrain assessment is a key aspect for autonomous exploration rovers; recognition of the surrounding environment is required for multiple purposes, such as optimal trajectory planning and autonomous target identification. In this work we present a technique to generate accurate three-dimensional semantic maps for the Martian environment. The algorithm uses as input a stereo image acquired by a camera mounted on a rover. First, images are labeled with DeepLabv3+, an encoder-decoder Convolutional Neural Network (CNN). Then, the labels obtained by the semantic segmentation are combined with stereo depth maps in a voxel representation. We evaluate our approach on the ESA Katwijk Beach Planetary Rover Dataset.
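The label-and-depth fusion step reduces to back-projecting labeled pixels into 3D and voting per voxel; here is a minimal NumPy sketch under that assumption (camera back-projection is omitted and 3D points are taken as given).

```python
# Sketch of semantic voxel mapping: snap labeled 3D points to a voxel grid
# and assign each occupied voxel its majority label.
import numpy as np

def voxelize(points_xyz, labels, voxel_size=0.1, num_classes=5):
    keys = np.floor(points_xyz / voxel_size).astype(np.int64)
    votes = {}
    for k, lab in zip(map(tuple, keys), labels):
        votes.setdefault(k, np.zeros(num_classes, int))[lab] += 1
    return {k: int(v.argmax()) for k, v in votes.items()}  # voxel -> label

pts = np.random.rand(1000, 3) * 5.0            # stand-in back-projected points
labs = np.random.randint(0, 5, 1000)           # stand-in per-pixel labels
print(len(voxelize(pts, labs)), "occupied voxels")
```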
24. Probabilistic orientation estimation with matrix Fisher distributions [PDF] 返回目录
D. Mohlin, G. Bianchi, J. Sullivan
Abstract: This paper focuses on estimating probability distributions over the set of 3D rotations ($SO(3)$) using deep neural networks. Learning to regress models to the set of rotations is inherently difficult due to differences in topology between $\mathbb{R}^N$ and $SO(3)$. We overcome this issue by using a neural network to output the parameters of a matrix Fisher distribution, since these parameters are homeomorphic to $\mathbb{R}^9$. By using a negative log-likelihood loss for this distribution, we obtain a loss that is convex with respect to the network outputs. By optimizing this loss we improve on the state of the art on several challenging datasets, namely Pascal3D+, ModelNet10-$SO(3)$ and UPNA head pose.
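For intuition, the 9 predicted parameters can be reshaped into a 3x3 matrix whose SVD-based projection onto $SO(3)$ gives the distribution's mode; the snippet below sketches only that projection (the Fisher normalizing constant, needed for the actual negative log-likelihood, is non-trivial and omitted here).

```python
# Sketch: project the 3x3 matrix-Fisher parameter matrix onto SO(3),
# which yields the most likely rotation under the distribution.
import torch

def fisher_mode(params9: torch.Tensor) -> torch.Tensor:
    A = params9.view(3, 3)
    U, _, Vh = torch.linalg.svd(A)
    sign = torch.sign(torch.det(U @ Vh))        # determinant correction
    D = torch.diag(torch.stack([torch.tensor(1.0), torch.tensor(1.0), sign]))
    return U @ D @ Vh

R = fisher_mode(torch.randn(9))
print(torch.det(R).item())                      # ~1.0 -> a valid rotation
```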
25. LRPD: Long Range 3D Pedestrian Detection Leveraging Specific Strengths of LiDAR and RGB [PDF] 返回目录
Michael Fürst, Oliver Wasenmüller, Didier Stricker
Abstract: While short-range 3D pedestrian detection is sufficient for emergency braking, long-range detections are required for smooth braking and for gaining trust in autonomous vehicles. The current state-of-the-art on the KITTI benchmark performs suboptimally in detecting the position of pedestrians at long range. Thus, we propose an approach specifically targeting long-range 3D pedestrian detection (LRPD), leveraging the density of RGB and the precision of LiDAR. For proposals, RGB instance segmentation and LiDAR-point-based proposal generation are combined, followed by a second stage that uses both sensor modalities symmetrically. This leads to a significant improvement in mAP at long range compared to the current state of the art. The evaluation of our LRPD approach was done on the pedestrians from the KITTI benchmark.
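The symmetric second stage can be caricatured as two identical per-proposal branches whose features are fused before box regression; the layer sizes and the 7-parameter box encoding below are our own assumptions for illustration, not the paper's architecture.

```python
# Sketch of symmetric two-modality fusion: per-proposal RGB and LiDAR
# features pass through identical branches, so neither modality dominates.
import torch
import torch.nn as nn

class SymmetricFusion(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.rgb = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.lidar = nn.Sequential(nn.Linear(d, d), nn.ReLU())
        self.head = nn.Linear(2 * d, 7)   # 3D box: x, y, z, w, h, l, theta

    def forward(self, f_rgb, f_lidar):
        return self.head(torch.cat([self.rgb(f_rgb), self.lidar(f_lidar)], -1))

print(SymmetricFusion()(torch.randn(10, 128), torch.randn(10, 128)).shape)
```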
26. Adversarial Defense by Latent Style Transformations [PDF] 返回目录
Shuo Wang, Surya Nepal, Marthie Grobler, Carsten Rudolph, Tianle Chen, Shangyu Chen
Abstract: Machine learning models have demonstrated vulnerability to adversarial attacks, more specifically misclassification of adversarial examples. In this paper, we investigate an attack-agnostic defense against adversarial attacks on high-resolution images by detecting suspicious inputs. The intuition behind our approach is that the essential characteristics of a normal image are generally consistent with non-essential style transformations, e.g., slightly changing the facial expression of human portraits. In contrast, adversarial examples are generally sensitive to such transformations. In our approach to detect adversarial instances, we propose an inVertible Autoencoder based on the StyleGAN2 generator via Adversarial training (VASA) to invert images into disentangled latent codes that reveal hierarchical styles. We then build a set of edited copies with non-essential style transformations by performing latent shifting and reconstruction, based on the correspondences between latent codes and style transformations. The classification-based consistency of these edited copies is used to distinguish adversarial instances.
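The detection rule itself is easy to sketch: classify several style-shifted reconstructions and flag inputs whose predictions are unstable. In the toy snippet below, `encoder`, `decoder` and the thresholds are stand-ins for the inverted-StyleGAN2 machinery, which we do not reproduce.

```python
# Sketch of consistency-based detection: apply small latent shifts, decode
# edited copies, and flag the input when the classifier disagrees with itself.
import torch
import torch.nn as nn

def is_adversarial(x, encoder, decoder, classifier, n_copies=8, sigma=0.1,
                   min_agreement=0.75):
    base = classifier(x).argmax(1)
    z = encoder(x)
    agree = 0
    for _ in range(n_copies):
        edited = decoder(z + sigma * torch.randn_like(z))  # latent shifting
        agree += int(classifier(edited).argmax(1).eq(base).item())
    return agree / n_copies < min_agreement  # unstable -> likely adversarial

# toy stand-ins (batch size 1) just to make the snippet executable
encoder = lambda x: x.flatten(1)
decoder = lambda z: z.view(1, 3, 8, 8)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(192, 10))
print(is_adversarial(torch.randn(1, 3, 8, 8), encoder, decoder, classifier))
```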
27. 3D Shape Reconstruction from Free-Hand Sketches [PDF] 返回目录
Jiayun Wang, Jierui Lin, Qian Yu, Runtao Liu, Yubei Chen, Stella X. Yu
Abstract: Sketches are the most abstract 2D representations of real-world objects. Although a sketch usually has geometric distortion and lacks visual cues, humans can effortlessly envision a 3D object from it. This indicates that sketches encode the information appropriate for recovering 3D shapes. Although great progress has been achieved in 3D reconstruction from distortion-free line drawings, such as CAD and edge maps, little effort has been made to reconstruct 3D shapes from free-hand sketches. We pioneer the study of this task and aim to enhance the power of sketches in 3D-related applications such as interactive design and VR/AR games. Further, we propose an end-to-end sketch-based 3D reconstruction framework. Instead of the commonly used edge maps, synthesized sketches are adopted as training data. Additionally, we propose a sketch standardization module to handle different sketch styles and distortions. With extensive experiments, we demonstrate the effectiveness of our model and its strong generalizability to various free-hand sketches.
28. A Real-time Action Representation with Temporal Encoding and Deep Compression [PDF] 返回目录
Kun Liu, Wu Liu, Huadong Ma, Mingkui Tan, Chuang Gan
Abstract: Deep neural networks have achieved remarkable success for video-based action recognition. However, most existing approaches cannot be deployed in practice due to their high computational cost. To address this challenge, we propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high processing speed. Specifically, we propose a residual 3D Convolutional Neural Network (CNN) to capture complementary information on the appearance of a single frame and the motion between consecutive frames. Based on this CNN, we develop a new temporal encoding method to explore the temporal dynamics of the whole video. Furthermore, we integrate deep compression techniques with T-C3D to further accelerate the deployment of models by reducing model size. By these means, heavy calculations can be avoided during inference, which enables the method to process videos beyond real-time speed while keeping promising performance. Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2x faster inference, with a storage footprint of less than 5 MB. We validate our approach by studying its action representation performance on four different benchmarks over three different tasks. Extensive experiments demonstrate recognition performance comparable to state-of-the-art methods. The source code and the pre-trained models are publicly available at this https URL.
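The temporal-encoding idea can be sketched as scoring one clip per video segment with a shared 3D CNN and aggregating the clip scores into a video-level prediction; the backbone below is a stand-in, and the max aggregation is one plausible choice, not necessarily the paper's.

```python
# Sketch of segment-level temporal encoding with a shared 3D CNN backbone.
import torch
import torch.nn as nn

backbone = nn.Sequential(                 # stand-in for a residual 3D CNN
    nn.Conv3d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(8, 101),
)

def video_score(clips):                   # clips: (num_segments, C, T, H, W)
    clip_logits = backbone(clips)         # shared weights on every clip
    return clip_logits.max(dim=0).values  # aggregate over segments

clips = torch.randn(4, 3, 8, 32, 32)      # 4 segments of 8 frames each
print(video_score(clips).shape)           # torch.Size([101])
```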
29. Revealing the Invisible with Model and Data Shrinking for Composite-database Micro-expression Recognition [PDF] 返回目录
Zhaoqiang Xia, Wei Peng, Huai-Qian Khor, Xiaoyi Feng, Guoying Zhao
Abstract: Composite-database micro-expression recognition is attracting increasing attention as it is more practical for real-world applications. Though a composite database provides more sample diversity for learning good representation models, the important subtle dynamics are prone to disappearing under the domain shift, such that models greatly degrade in performance, especially deep models. In this paper, we analyze the influence of learning complexity, including input complexity and model complexity, and discover that lower-resolution input data and a shallower-architecture model help to ease the degradation of deep models on the composite-database task. Based on this, we propose a recurrent convolutional network (RCN) to explore the shallower architecture and lower-resolution input data, shrinking model and input complexities simultaneously. Furthermore, we develop three parameter-free modules (i.e., wide expansion, shortcut connection and attention unit) that integrate with RCN without increasing any learnable parameters. These three modules can enhance the representation ability from various perspectives while preserving a not-very-deep architecture for lower-resolution data. Besides, the three modules can further be combined by an automatic strategy (a neural architecture search strategy), and the searched architecture becomes more robust. Extensive experiments on the MEGC2019 dataset (composed of the existing SMIC, CASME II and SAMM datasets) have verified the influence of learning complexity and shown that RCNs with the three modules and the searched combination outperform state-of-the-art approaches.
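A recurrent convolutional block reuses one convolution for several steps, so effective depth grows with no extra parameters; the toy block below (our construction) also folds in a shortcut connection, one of the parameter-free modules mentioned above.

```python
# Toy recurrent convolutional block: a single shared conv applied repeatedly
# with a residual shortcut, suiting small low-resolution inputs.
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    def __init__(self, ch, steps=3):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)  # reused at every step
        self.steps = steps

    def forward(self, x):
        h = x
        for _ in range(self.steps):
            h = torch.relu(self.conv(h) + x)         # shortcut connection
        return h

print(RecurrentConv(16)(torch.randn(1, 16, 28, 28)).shape)
```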
30. MetaSDF: Meta-learning Signed Distance Functions [PDF] 返回目录
Vincent Sitzmann, Eric R. Chan, Richard Tucker, Noah Snavely, Gordon Wetzstein
Abstract: Neural implicit shape representations are an emerging paradigm that offers many potential benefits over conventional discrete representations, including memory efficiency at a high spatial resolution. Generalizing across shapes with such neural implicit representations amounts to learning priors over the respective function space and enables geometry reconstruction from partial or noisy observations. Existing generalization methods rely on conditioning a neural network on a low-dimensional latent code that is either regressed by an encoder or jointly optimized in the auto-decoder framework. Here, we formalize learning of a shape space as a meta-learning problem and leverage gradient-based meta-learning algorithms to solve this task. We demonstrate that this approach performs on par with auto-decoder based approaches while being an order of magnitude faster at test-time inference. We further demonstrate that the proposed gradient-based method outperforms encoder-decoder based methods that leverage pooling-based set encoders.
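The gradient-based meta-learning loop can be condensed to a few differentiable inner SGD steps on one shape's samples, with the outer loss backpropagated to the shared initialization; this is a heavily simplified functional sketch in the MAML style, not the authors' code.

```python
# Sketch of meta-learning an SDF initialization: inner-loop adaptation to one
# shape, outer loss differentiated back through the adaptation.
import torch

# parameters of a tiny SDF MLP (3 -> 64 -> 1), the shared meta-initialization
W1 = (torch.randn(64, 3) * 0.1).requires_grad_()
b1 = torch.zeros(64, requires_grad=True)
W2 = (torch.randn(1, 64) * 0.1).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)
theta = [W1, b1, W2, b2]

def sdf(p, xyz):                                  # functional forward pass
    h = torch.relu(xyz @ p[0].T + p[1])
    return h @ p[2].T + p[3]

def adapt(p, xyz, d, lr=1e-2, steps=3):           # inner loop: fit one shape
    for _ in range(steps):
        loss = ((sdf(p, xyz) - d) ** 2).mean()
        grads = torch.autograd.grad(loss, p, create_graph=True)
        p = [pi - lr * gi for pi, gi in zip(p, grads)]
    return p

xyz, d = torch.randn(128, 3), torch.randn(128, 1)  # points and SDF targets
outer_loss = ((sdf(adapt(theta, xyz, d), xyz) - d) ** 2).mean()
outer_loss.backward()                              # meta-gradient to theta
print(all(t.grad is not None for t in theta))      # True
```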
31. Implicit Neural Representations with Periodic Activation Functions [PDF] 返回目录
Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, Gordon Wetzstein
Abstract: Implicitly defined, continuous, differentiable signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal's spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations. We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.
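A sinusoidal layer is easy to reproduce from the description: a linear layer followed by sin(w0 * x), with a principled uniform initialization; the snippet below follows the commonly cited recipe (w0 = 30 and the +-sqrt(6/fan_in)/w0 bounds are the values usually quoted for this paper).

```python
# Compact SIREN-style network: sine activations with frequency w0 and the
# commonly cited uniform weight initialization for hidden layers.
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_f, out_f, w0=30.0, first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_f, out_f)
        bound = 1.0 / in_f if first else math.sqrt(6.0 / in_f) / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

siren = nn.Sequential(SineLayer(2, 64, first=True), SineLayer(64, 64),
                      nn.Linear(64, 1))
coords = torch.rand(1024, 2) * 2 - 1      # pixel coordinates in [-1, 1]
print(siren(coords).shape)                # torch.Size([1024, 1])
```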
32. Self-Supervised Representation Learning for Visual Anomaly Detection [PDF] 返回目录
Rabia Ali, Muhammad Umar Karim Khan, Chong Min Kyung
Abstract: Self-supervised learning allows for better utilization of unlabelled data. The feature representation obtained by self-supervision can be used in downstream tasks such as classification, object detection, segmentation, and anomaly detection. While classification, object detection, and segmentation have been investigated with self-supervised learning, anomaly detection needs more attention. We consider the problem of anomaly detection in images and videos, and present a new visual anomaly detection technique for videos. Numerous seminal and state-of-the-art self-supervised methods are evaluated for anomaly detection on a variety of image datasets. The best performing image-based self-supervised representation learning method is then used for video anomaly detection to see the importance of spatial features in visual anomaly detection in videos. We also propose a simple self-supervision approach for learning temporal coherence across video frames without the use of any optical flow information. At its core, our method identifies the frame indices of a jumbled video sequence allowing it to learn the spatiotemporal features of the video. This intuitive approach shows superior performance of visual anomaly detection compared to numerous methods for images and videos on UCF101 and ILSVRC2015 video datasets.
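The temporal pretext task can be sketched as predicting each frame's original index after the clip has been shuffled; the encoder below is a deliberately tiny stand-in so the snippet stays self-contained.

```python
# Sketch of the frame-order pretext task: shuffle a clip, encode each frame,
# and train a head to recover each frame's original index; solving this
# forces the encoder to learn temporal structure without optical flow.
import torch
import torch.nn as nn

T = 6
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
head = nn.Linear(128, T)                  # per-frame index logits

clip = torch.randn(T, 3, 32, 32)          # one ordered clip
perm = torch.randperm(T)                  # jumbled ordering
logits = head(encoder(clip[perm]))        # (T, T)
loss = nn.functional.cross_entropy(logits, perm)
print(loss.item())
```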
33. Learning Visual Commonsense for Robust Scene Graph Generation [PDF] 返回目录
Alireza Zareian, Haoxuan You, Zhecan Wang, Shih-Fu Chang
Abstract: Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense such as affordance and intuitive physics automatically from data, and use that to enhance scene graph generation. To this end, we extend transformers to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our commonsense model can be applied on any perception model and correct its obvious mistakes, resulting in a more commonsensical scene graph. We show the proposed model learns commonsense better than any alternative, and improves the accuracy of any scene graph generation model. Nevertheless, strong disproportions in real-world datasets could bias commonsense to miscorrect already confident perceptions. We address this problem by devising a fusion module that compares predictions made by the perception and commonsense models, and the confidence of each, to make a hybrid decision. Our full model learns commonsense and knows when to use it, which is shown effective through experiments, resulting in a new state of the art.
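A confidence-gated fusion can be caricatured in a few lines: trust the perception model when it is confident and fall back to the commonsense prediction otherwise. The paper learns this decision; the fixed threshold below is purely our illustration.

```python
# Toy hybrid decision: compare perception confidence against a threshold and
# take the commonsense label only when perception is unsure.
import torch

def fuse(perception_logits, commonsense_logits, tau=0.7):
    p = perception_logits.softmax(-1)
    c = commonsense_logits.softmax(-1)
    use_commonsense = p.max(-1).values < tau   # perception not confident
    return torch.where(use_commonsense, c.argmax(-1), p.argmax(-1))

pl, cl = torch.randn(5, 50), torch.randn(5, 50)
print(fuse(pl, cl))                            # fused predicate labels
```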
34. Multi-Subspace Neural Network for Image Recognition [PDF] 返回目录
Chieh-Ning Fang, Chin-Teng Lin
Abstract: In image classification tasks, feature extraction is always a big issue. Intra-class variability increases the difficulty of designing the extractors. Furthermore, hand-crafted feature extractors cannot simply adapt to new situations. Recently, deep learning has drawn much attention for automatically learning features from data. In this study, we propose a multi-subspace neural network (MSNN) which integrates a key component of convolutional neural networks (CNNs), the receptive field, with the subspace concept. Associating subspaces with a deep network is a novel design, providing various viewpoints of the data. Basis vectors, trained by the adaptive subspace self-organizing map (ASSOM), span the subspace, serve as a transfer function to access axial components, and define the receptive field to extract basic patterns of data without distorting the topology in the visual task. Moreover, the multiple-subspace strategy is implemented as parallel blocks to adapt to real-world data and to contribute various interpretations of the data, aiming to be more robust in dealing with intra-class variability issues. To this end, handwritten digit and object image datasets (i.e., MNIST and COIL-20) are employed to validate the proposed MSNN architecture for classification. Experimental results show MSNN is competitive with other state-of-the-art approaches.
35. Learning Sparse Masks for Efficient Image Super-Resolution [PDF] 返回目录
Longguang Wang, Xiaoyu Dong, Yingqian Wang, Xinyi Ying, Zaiping Lin, Wei An, Yulan Guo
Abstract: Current CNN-based super-resolution (SR) methods process all locations equally, with computational resources uniformly assigned in space. However, since high-frequency details mainly lie around edges and textures, fewer computational resources are required for flat regions. Therefore, existing CNN-based methods involve much redundant computation in flat regions, which increases their computational cost and limits their application on mobile devices. To address this limitation, we develop an SR network (SMSR) that learns sparse masks to prune redundant computation conditioned on the input image. Within our SMSR, spatial masks learn to identify "important" locations while channel masks learn to mark redundant channels in those "unimportant" regions. Consequently, redundant computation can be accurately located and skipped while maintaining comparable performance. It is demonstrated that our SMSR achieves state-of-the-art performance with 41%/33%/27% of FLOPs reduced for x2/x3/x4 SR.
36. Cross-Correlated Attention Networks for Person Re-Identification [PDF] 返回目录
Jieming Zhou, Soumava Kumar Roy, Pengfei Fang, Mehrtash Harandi, Lars Petersson
Abstract: Deep neural networks need to make robust inference in the presence of occlusion, background clutter, pose and viewpoint variations -- to name a few -- when the task of person re-identification is considered. Attention mechanisms have recently proven to be successful in handling the aforementioned challenges to some degree. However, previous designs fail to capture inherent inter-dependencies between the attended features, leading to restricted interactions between the attention blocks. In this paper, we propose a new attention module called Cross-Correlated Attention (CCA), which aims to overcome such limitations by maximizing the information gain between different attended regions. Moreover, we also propose a novel deep network that makes use of different attention mechanisms to learn robust and discriminative representations of person images. The resulting model is called the Cross-Correlated Attention Network (CCAN). Extensive experiments demonstrate that the CCAN comfortably outperforms current state-of-the-art algorithms by a tangible margin.
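The abstract leaves the exact formulation to the paper; as a purely illustrative sketch of coupling attention blocks, the snippet below computes two spatial-attention maps over a feature map and their cross-correlation, the kind of term a CCA-style module could use to model inter-dependencies between attended features rather than treating the blocks independently.

import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 64, 16, 8)                  # person-image feature map (sizes are placeholders)
att1, att2 = nn.Conv2d(64, 1, 1), nn.Conv2d(64, 1, 1)

def spatial_attention(a, x):
    w = F.softmax(a(x).flatten(2), dim=-1)        # distribution over locations
    return w, (x.flatten(2) * w).sum(-1)          # attended descriptor

w1, d1 = spatial_attention(att1, feat)
w2, d2 = spatial_attention(att2, feat)
xcorr = (w1 * w2).sum()                           # overlap between the two attended regions
print(d1.shape, float(xcorr))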
37. Mining Label Distribution Drift in Unsupervised Domain Adaptation [PDF] 返回目录
Peizhao Li, Zhengming Ding, Hongfu Liu
Abstract: Unsupervised domain adaptation aims to transfer task knowledge from a labeled source domain to a related yet unlabeled target domain, and is attracting extensive interest from academia and industry. Although tremendous efforts along this direction have been made to minimize the domain divergence, unfortunately, most existing methods only manage part of the picture by aligning feature representations from different domains. Beyond the discrepancy in feature space, the gap between the known source label distribution and the unknown target label distribution, recognized as label distribution drift, is another crucial factor raising domain divergence, and it has neither received enough attention nor been well explored. From this point, in this paper, we first experimentally reveal how label distribution drift brings negative effects on current domain adaptation methods. Next, we propose the Label distribution Matching Domain Adversarial Network (LMDAN) to handle data distribution shift and label distribution drift jointly. In LMDAN, the label distribution drift problem is addressed by the proposed source-sample weighting strategy, which selects samples that contribute to positive adaptation and avoids the negative effects brought by the mismatch in label distributions. Finally, different from general domain adaptation experiments, we modify domain adaptation datasets to create considerable label distribution drift between the source and target domains. Numerical results and empirical model analysis show that LMDAN delivers superior performance compared to other state-of-the-art domain adaptation methods under such scenarios.
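A classic way to realize source-sample weighting against label distribution drift (a hedged stand-in for LMDAN's learned strategy) is importance weighting by the ratio of target to source label frequencies; in practice the target distribution would itself have to be estimated, e.g. from model predictions.

import numpy as np

def source_weights(source_labels, target_label_dist):
    """Weight each source sample by p_T(y) / p_S(y) so the reweighted
    source set matches the (estimated) target label distribution."""
    classes, counts = np.unique(source_labels, return_counts=True)
    p_s = counts / counts.sum()
    w = {c: target_label_dist[c] / p for c, p in zip(classes, p_s)}
    return np.array([w[y] for y in source_labels])

y_src = np.array([0, 0, 0, 1])        # source labels are 75% class 0
p_tgt = {0: 0.5, 1: 0.5}              # estimated target label distribution
print(source_weights(y_src, p_tgt))   # class 0 down-weighted, class 1 up-weighted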
38. Explanation-based Weakly-supervised Learning of Visual Relations with Graph Networks [PDF] 返回目录
Federico Baldassarre, Kevin Smith, Josephine Sullivan, Hossein Azizpour
Abstract: Visual relationship detection is fundamental for holistic image understanding. However, localizing and classifying (subject, predicate, object) triplets constitutes a hard learning objective due to the combinatorial explosion of possible relationships, their long-tail distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies only on image-level predicate annotations. A graph neural network is trained to classify the predicates in an image from the graph representation of all objects, implicitly encoding an inductive bias for pairwise relationships. We then frame relationship detection as the explanation of such a predicate classifier, i.e. we reconstruct a complete relationship by recovering the subject and the object of a predicted predicate. Using this novel technique and minimal labels, we present comparable results to recent fully-supervised and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relationships, and UnRel for unusual relationships.
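One simple reading of "explanation of a predicate classifier" is a multiple-instance sketch (an illustration, not the paper's graph-network model): score every (subject, object) pair, train the image-level predicate prediction as the maximum over pairs, and recover the relationship as the arg-max pair.

import torch
import torch.nn as nn

obj_feats = torch.randn(5, 64)            # 5 detected objects (sizes are placeholders)
pair_scorer = nn.Linear(128, 10)          # 10 predicate classes (assumed)

i, j = torch.meshgrid(torch.arange(5), torch.arange(5), indexing="ij")
pairs = torch.cat([obj_feats[i.flatten()], obj_feats[j.flatten()]], dim=1)
scores = pair_scorer(pairs)               # 25 pairs x 10 predicates

# Image-level score per predicate = max over pairs (trainable from
# image-level labels alone); the arg-max pair recovers subject and object.
img_scores, best_pair = scores.max(dim=0)
print(img_scores.shape, best_pair[:3])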
39. Visual Chirality [PDF] 返回目录
Zhiqiu Lin, Jin Sun, Abe Davis, Noah Snavely
Abstract: How can we tell whether an image has been mirrored? While we understand the geometry of mirror reflections very well, less has been said about how it affects distributions of imagery at scale, despite widespread use for data augmentation in computer vision. In this paper, we investigate how the statistics of visual data are changed by reflection. We refer to these changes as "visual chirality", after the concept of geometric chirality - the notion of objects that are distinct from their mirror image. Our analysis of visual chirality reveals surprising results, including low-level chiral signals pervading imagery stemming from image processing in cameras, to the ability to discover visual chirality in images of people and faces. Our work has implications for data augmentation, self-supervised learning, and image forensics.
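The basic training signal behind visual chirality is easy to set up: randomly mirror images and ask a classifier to predict whether each one was flipped. A minimal data-generation sketch:

import numpy as np

rng = np.random.default_rng(0)

def make_batch(images):
    """Label 1 if mirrored, 0 otherwise. A classifier that beats chance on
    held-out data is evidence of chirality in the image distribution."""
    flip = rng.integers(0, 2, size=len(images))
    out = np.stack([img[:, ::-1] if f else img for img, f in zip(images, flip)])
    return out, flip

x, y = make_batch(rng.standard_normal((4, 8, 8)))
print(x.shape, y)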
40. A generalizable saliency map-based interpretation of model outcome [PDF] 返回目录
Shailja Thakur, Sebastian Fischmeister
Abstract: One of the significant challenges of deep neural networks is that the complex nature of the network prevents human comprehension of its outcome. Consequently, the applicability of complex machine learning models is limited in safety-critical domains, which incurs risk to life and property. To fully exploit the capabilities of complex neural networks, we propose a non-intrusive interpretability technique that uses the input and output of the model to generate a saliency map. The method works by empirically optimizing a randomly initialized input mask, localizing and weighting individual pixels according to their sensitivity towards the target class. Our experiments show that the proposed model interpretability approach performs better than the existing saliency map-based approaches at localizing the relevant input pixels. Furthermore, to obtain a global perspective on the target-specific explanation, we propose a saliency map reconstruction approach to generate acceptable variations of the salient inputs from the space of the input data distribution for which the model outcome remains unaltered. Experiments show that our interpretability method can reconstruct the salient part of the input with a classification accuracy of 89%.
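In the spirit of the abstract (hyperparameters and the regularizer are assumptions, not the paper's exact objective), a randomly initialized mask can be optimized to keep the target-class score high while staying sparse; the optimized mask then serves as the saliency map.

import torch

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.randn(1, 1, 28, 28)
target = 3
mask = torch.rand(1, 1, 28, 28, requires_grad=True)
opt = torch.optim.Adam([mask], lr=0.1)

for _ in range(100):
    m = torch.sigmoid(mask)
    score = model(x * m)[0, target]
    loss = -score + 0.05 * m.mean()   # keep the score high, the mask sparse
    opt.zero_grad(); loss.backward(); opt.step()

saliency = torch.sigmoid(mask).detach()   # per-pixel importance
print(float(saliency.mean()))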
41. On the Inference of Soft Biometrics from Typing Patterns Collected in a Multi-device Environment [PDF] 返回目录
Vishaal Udandarao, Mohit Agrawal, Rajesh Kumar, Rajiv Ratn Shah
Abstract: In this paper, we study the inference of gender, major/minor (computer science, non-computer science), typing style, age, and height from the typing patterns collected from 117 individuals in a multi-device environment. The inference of the first three identifiers was considered as classification tasks, while the rest as regression tasks. For classification tasks, we benchmark the performance of six classical machine learning (ML) and four deep learning (DL) classifiers. On the other hand, for regression tasks, we evaluated three ML and four DL-based regressors. The overall experiment consisted of two text-entry (free and fixed) and four device (Desktop, Tablet, Phone, and Combined) configurations. The best arrangements achieved accuracies of 96.15%, 93.02%, and 87.80% for typing style, gender, and major/minor, respectively, and mean absolute errors of 1.77 years and 2.65 inches for age and height, respectively. The results are promising considering the variety of application scenarios that we have listed in this work.
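The evaluation protocol is a standard benchmarking loop; a skeleton with synthetic stand-in features (the real pipeline would use keystroke features and the paper's ten models) might look like:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
for clf in (SVC(), RandomForestClassifier(random_state=0)):
    acc = cross_val_score(clf, X, y, cv=5).mean()   # e.g., gender prediction
    print(type(clf).__name__, round(acc, 3))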
42. Intriguing generalization and simplicity of adversarially trained neural networks [PDF] 返回目录
Chirag Agarwal, Peijie Chen, Anh Nguyen
Abstract: Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains unknown (a) how adversarially-trained classifiers (a.k.a. "robust" classifiers) generalize to new types of out-of-distribution examples; and (b) what hidden representations were learned by robust networks. In this paper, we perform a thorough, systematic study to answer these two questions on AlexNet, GoogLeNet, and ResNet-50 trained on ImageNet. While robust models often perform on par with or worse than standard models on unseen distorted, texture-preserving images (e.g. blurred), they are consistently more accurate on texture-less images (i.e. silhouettes and stylized). That is, robust models rely heavily on shapes, in stark contrast to the strong texture bias in standard ImageNet classifiers (Geirhos et al. 2018). Remarkably, adversarial training causes three significant shifts in the functions of hidden neurons. That is, each convolutional neuron often changes to (1) detect pixel-wise smoother patterns; (2) detect more lower-level features, i.e. textures and colors (instead of objects); and (3) be simpler in terms of complexity, i.e. detect more limited sets of concepts.
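For reference, the adversarial training the abstract studies typically follows the PGD recipe: generate a worst-case perturbation within an $\epsilon$-ball, then train on it. A minimal sketch (toy model, placeholder hyperparameters):

import torch
import torch.nn.functional as F

def pgd_example(model, x, y, eps=0.03, alpha=0.01, steps=7):
    """L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x + (x_adv + alpha * grad.sign() - x).clamp(-eps, eps)
    return x_adv.detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x, y = torch.rand(4, 3, 8, 8), torch.randint(0, 10, (4,))
loss = F.cross_entropy(model(pgd_example(model, x, y)), y)  # train on the adversarial batch
loss.backward()
print(float(loss))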
43. Big Self-Supervised Models are Strong Semi-Supervised Learners [PDF] 返回目录
Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton
Abstract: One paradigm for learning from few labeled examples while making best use of a large amount of unlabeled data is unsupervised pretraining followed by supervised fine-tuning. Although this paradigm uses unlabeled data in a task-agnostic way, in contrast to most previous approaches to semi-supervised learning for computer vision, we show that it is surprisingly effective for semi-supervised learning on ImageNet. A key ingredient of our approach is the use of a big (deep and wide) network during pretraining and fine-tuning. We find that, the fewer the labels, the more this approach (task-agnostic use of unlabeled data) benefits from a bigger network. After fine-tuning, the big network can be further improved and distilled into a much smaller one with little loss in classification accuracy by using the unlabeled examples for a second time, but in a task-specific way. The proposed semi-supervised learning algorithm can be summarized in three steps: unsupervised pretraining of a big ResNet model using SimCLRv2 (a modification of SimCLR), supervised fine-tuning on a few labeled examples, and distillation with unlabeled examples for refining and transferring the task-specific knowledge. This procedure achieves 73.9\% ImageNet top-1 accuracy with just 1\% of the labels ($\le$13 labeled images per class) using ResNet-50, a $10\times$ improvement in label efficiency over the previous state-of-the-art. With 10\% of labels, ResNet-50 trained with our method achieves 77.5\% top-1 accuracy, outperforming standard supervised training with all of the labels.
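The third step, distillation on unlabeled data, reduces to matching the teacher's temperature-softened predictions; a minimal sketch of the loss (the temperature is a placeholder):

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened teacher and student distributions;
    no ground-truth labels are needed, so unlabeled data suffices."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

s, t = torch.randn(8, 100), torch.randn(8, 100)
print(float(distill_loss(s, t)))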
44. Universal Lower-Bounds on Classification Error under Adversarial Attacks and Random Corruption [PDF] 返回目录
Elvis Dohmatob
Abstract: We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification. Our work focuses on deriving bounds which uniformly apply to all classifiers (i.e., all measurable functions from features to labels) for a given problem. Our contributions are three-fold. (1) In the classical framework of adversarial attacks, we use optimal transport theory to derive variational formulae for the Bayes-optimal error a classifier can make on a given classification problem, subject to adversarial attacks. We also propose a simple algorithm for computing the corresponding universal (i.e., classifier-independent) adversarial attacks, based on maximal matching on bipartite graphs. (2) We derive explicit lower-bounds on the Bayes-optimal error in the case of the popular distance-based attacks. These bounds are universal in the sense that they depend on the geometry of the class-conditional distributions of the data, but not on a particular classifier. This is in contrast with the existing literature, wherein adversarial vulnerability of classifiers is derived as a consequence of nonzero ordinary test error. (3) For our third contribution, we study robustness to random noise corruption, wherein the attacker (or nature) is allowed to inject random noise into examples at test time. We establish nonlinear data-processing inequalities induced by such corruptions, and use them to obtain lower-bounds on the Bayes-optimal error.
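For intuition (a schematic reading, not the paper's exact statement): for a balanced binary problem with class-conditionals $P_0$ and $P_1$, the classical Bayes-optimal error is

\[ \mathrm{err}^{*} \;=\; \tfrac{1}{2}\bigl(1 - \mathrm{TV}(P_0, P_1)\bigr), \]

where $\mathrm{TV}$ is the total-variation distance. The variational formulae above can be read as adversarial generalizations of this identity: under an attack of budget $\epsilon$, $\mathrm{TV}$ is replaced by an optimal-transport discrepancy whose cost accounts for $\epsilon$-perturbations, so the resulting lower bound depends only on the geometry of $P_0$ and $P_1$ and holds for every classifier.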
45. Multilevel Image Thresholding Using a Fully Informed Cuckoo Search Algorithm [PDF] 返回目录
Xiaotao Huang, Liang Shen, Chongyi Fan, Jiahua zhu, Sixian Chen
Abstract: Though effective in segmentation, conventional multilevel thresholding methods are computationally expensive, as exhaustive search is used to find the optimal thresholds that optimize the objective functions. To overcome this problem, population-based metaheuristic algorithms are widely used to improve the searching capacity. In this paper, we improve a popular metaheuristic called cuckoo search using a ring-topology-based fully informed strategy. In this strategy, each individual in the population learns from its neighborhood to improve the cooperation of the population and the learning efficiency. The best solution or best fitness value can be obtained from the initial random threshold values, whose quality is evaluated by the correlation function. Experimental results are examined for various numbers of thresholds. The results demonstrate that the proposed algorithm is more accurate and efficient than four other popular methods.
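The fitness being optimized is typically the Otsu between-class variance over the image histogram; the sketch below evaluates it for two thresholds, with plain random search standing in for the ring-topology fully informed cuckoo search (so this illustrates the objective, not the proposed optimizer).

import numpy as np

rng = np.random.default_rng(0)
hist = rng.random(256); hist /= hist.sum()    # stand-in grey-level histogram
levels = np.arange(256)

def between_class_variance(thresholds):
    """Otsu-style multilevel thresholding fitness."""
    cuts = [0, *sorted(int(t) for t in thresholds), 256]
    mu_total = (hist * levels).sum()
    var = 0.0
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        w = hist[lo:hi].sum()
        if w > 0:
            mu = (hist[lo:hi] * levels[lo:hi]).sum() / w
            var += w * (mu - mu_total) ** 2
    return var

best = max((rng.integers(1, 255, size=2) for _ in range(2000)),
           key=between_class_variance)
print(sorted(best), round(between_class_variance(best), 6))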
46. High-Fidelity Generative Image Compression [PDF] 返回目录
Fabian Mentzer, George Toderici, Michael Tschannen, Eirikur Agustsson
Abstract: We extensively study how to combine Generative Adversarial Networks and learned compression to obtain a state-of-the-art generative lossy compression system. In particular, we investigate normalization layers, generator and discriminator architectures, training strategies, as well as perceptual losses. In contrast to previous work, i) we obtain visually pleasing reconstructions that are perceptually similar to the input, ii) we operate in a broad range of bitrates, and iii) our approach can be applied to high-resolution images. We bridge the gap between rate-distortion-perception theory and practice by evaluating our approach both quantitatively with various perceptual metrics and a user study. The study shows that our method is preferred to previous approaches even if they use more than 2x the bitrate.
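The overall generator objective in generative compression usually combines a rate term with distortion, perceptual, and adversarial terms; the sketch below shows only that shape (the weights, the perceptual metric, and the discriminator are placeholders, not the paper's configuration).

import torch

def generator_loss(rate_bits, x, x_hat, percep_fn, disc_fn,
                   lam_dist=10.0, lam_percep=1.0, lam_adv=0.1):
    """Schematic rate-distortion-perception objective."""
    distortion = torch.mean((x - x_hat) ** 2)
    perceptual = percep_fn(x, x_hat)            # e.g., an LPIPS-style metric
    adversarial = -torch.mean(disc_fn(x_hat))   # hinge/non-saturating variants also common
    return rate_bits + lam_dist * distortion + lam_percep * perceptual + lam_adv * adversarial

x = torch.rand(1, 3, 8, 8); x_hat = x + 0.01 * torch.randn_like(x)
print(float(generator_loss(torch.tensor(0.3), x, x_hat,
                           lambda a, b: (a - b).abs().mean(),
                           lambda z: z.mean(dim=(1, 2, 3)))))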
47. Universally Quantized Neural Compression [PDF] 返回目录
Eirikur Agustsson, Lucas Theis
Abstract: A popular approach to learning encoders for lossy compression is to use additive uniform noise during training as a differentiable approximation to test-time quantization. We demonstrate that a uniform noise channel can also be implemented at test time using universal quantization (Ziv, 1985). This allows us to eliminate the mismatch between training and test phases while maintaining a completely differentiable loss function. Implementing the uniform noise channel is a special case of a more general problem to communicate a sample, which we prove is computationally hard if we do not make assumptions about its distribution. However, the uniform special case is efficient as well as easy to implement and thus of great interest from a practical point of view. Finally, we show that quantization can be obtained as a limiting case of a soft quantizer applied to the uniform noise channel, bridging compression with and without quantization.
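Universal quantization itself is compact enough to show exactly: if the sender and receiver share a dither u ~ U(-1/2, 1/2) (e.g., via a synchronized PRNG), then round(x + u) - u behaves like x plus uniform noise, which is what lets the training-time noise channel be realized at test time.

import numpy as np

rng = np.random.default_rng(0)

def universal_quantize(x, u):
    """Universal quantization (Ziv, 1985): with shared dither u,
    the error (round(x + u) - u) - x is uniform and independent of x."""
    return np.round(x + u) - u

x = rng.standard_normal(100_000)
u = rng.uniform(-0.5, 0.5, size=x.shape)   # shared randomness
err = universal_quantize(x, u) - x
print(err.min(), err.max())                # errors stay within [-1/2, 1/2]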
48. CoSE: Compositional Stroke Embeddings [PDF] 返回目录
Emre Aksan, Thomas Deselaers, Andrea Tagliasacchi, Otmar Hilliges
Abstract: We present a generative model for stroke-based drawing tasks which is able to model complex free-form structures. While previous approaches rely on sequence-based models for drawings of basic objects or handwritten text, we propose a model that treats drawings as a collection of strokes that can be composed into complex structures such as diagrams (e.g., flow-charts). At the core of the approach lies a novel auto-encoder that projects variable-length strokes into a latent space of fixed dimension. This representation space allows a relational model, operating in latent space, to better capture the relationship between strokes and to predict subsequent strokes. We demonstrate qualitatively and quantitatively that our proposed approach is able to model the appearance of individual strokes, as well as the compositional structure of larger diagram drawings. Our approach is suitable for interactive use cases such as auto-completing diagrams.
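The core ingredient is an encoder that maps a variable-length stroke to a fixed-dimension latent; a minimal recurrent sketch (the paper's actual encoder-decoder is more elaborate):

import torch
import torch.nn as nn

encoder = nn.GRU(input_size=2, hidden_size=32, batch_first=True)

def embed_stroke(points):
    """Map an Nx2 stroke (2D point sequence) to a fixed 32-d latent."""
    _, h = encoder(points.unsqueeze(0))
    return h.squeeze()

short, long_ = torch.randn(5, 2), torch.randn(40, 2)
print(embed_stroke(short).shape, embed_stroke(long_).shape)   # both 32-d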
49. Learning Colour Representations of Search Queries [PDF] 返回目录
Paridhi Maheshwari, Manoj Ghuhan, Vishwa Vinay
Abstract: Image search engines rely on appropriately designed ranking features that capture various aspects of the content semantics as well as the historic popularity. In this work, we consider the role of colour in this relevance matching process. Our work is motivated by the observation that a significant fraction of user queries have an inherent colour associated with them. While some queries contain explicit colour mentions (such as 'black car' and 'yellow daisies'), other queries have implicit notions of colour (such as 'sky' and 'grass'). Furthermore, grounding queries in colour is not a mapping to a single colour, but a distribution in colour space. For instance, a search for 'trees' tends to have a bimodal distribution around the colours green and brown. We leverage historical clickthrough data to produce a colour representation for search queries and propose a recurrent neural network architecture to encode unseen queries into colour space. We also show how this embedding can be learnt alongside a cross-modal relevance ranker from impression logs where a subset of the result images were clicked. We demonstrate that the use of a query-image colour distance feature leads to an improvement in the ranker performance as measured by users' preferences of clicked versus skipped images.
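A hedged sketch of the encoding step (layer sizes, the colour-bin count, and the character vocabulary are assumptions): a character-level recurrent network maps a query string to a distribution over colour-space bins, which would be trained against click-derived colour histograms.

import torch
import torch.nn as nn

VOCAB, BINS = 128, 16    # ASCII characters; 16 colour bins (placeholders)

class QueryColour(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.gru = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, BINS)

    def forward(self, query):
        ids = torch.tensor([[min(ord(c), VOCAB - 1) for c in query]])
        _, h = self.gru(self.emb(ids))
        return torch.softmax(self.head(h[-1]), dim=-1)   # colour distribution

p = QueryColour()("yellow daisies")
print(p.shape, float(p.sum()))   # training would minimize divergence to click histograms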
50. Intelligent Protection & Classification of Transients in Two-Core Symmetric Phase Angle Regulating Transformers [PDF] 返回目录
Pallav Kumar Bera, Can Isik
Abstract: This paper investigates the applicability of classifiers based on time and time-frequency features to distinguish internal faults from other transients - magnetizing inrush, sympathetic inrush, external faults with current transformer saturation, and overexcitation - for Indirect Symmetrical Phase Angle Regulating Transformers (ISPAR). Then the faulty transformer unit (series/exciting) of the ISPAR is located, or else the transient disturbance is identified. An event detector detects variation in differential currents and registers one cycle of 3-phase post-transient samples, which are used to extract the time and time-frequency features for training seven classifiers. Three different sets of features are used - wavelet coefficients, time-domain features, and a combination of time and wavelet energy - obtained from exhaustive search using Decision Tree, random forest feature selection, and maximum Relevance Minimum Redundancy. The internal fault is detected with a balanced accuracy of 99.9%, the faulty unit is localized with a balanced accuracy of 98.7%, and the no-fault transients are classified with a balanced accuracy of 99.5%. The results show potential for accurate internal fault detection and localization, and transient identification. The proposed scheme can supervise the operation of existing microprocessor-based differential relays, resulting in higher stability and dependability. The ISPAR is modeled and the transients are simulated in PSCAD/EMTDC by varying several parameters.
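One of the three feature families is easy to illustrate: per-band wavelet energies of a one-cycle current window, which would then feed the classifiers (a sketch with random data; assumes the pywt package):

import numpy as np
import pywt

rng = np.random.default_rng(0)
window = rng.standard_normal(64)   # stand-in one-cycle differential-current samples

def wavelet_energies(x, wavelet="db4", level=3):
    """Energy of each band of the discrete wavelet decomposition."""
    return np.array([np.sum(c ** 2) for c in pywt.wavedec(x, wavelet, level=level)])

print(wavelet_energies(window))    # one feature vector per registered event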
51. Optimizing Grouped Convolutions on Edge Devices [PDF] 返回目录
Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, Michael O'Boyle, Amos Storkey
Abstract: When deploying a deep neural network on constrained hardware, it is possible to replace the network's standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by 3.4x, 8x and 4x on average respectively. Code is available at this https URL
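The memory savings the abstract mentions follow directly from the groups parameter: groups=g splits the channels into g independent convolutions, dividing weights and multiply-adds by roughly g. In PyTorch:

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense = nn.Conv2d(64, 64, 3, padding=1)
grouped = nn.Conv2d(64, 64, 3, padding=1, groups=8)
print(n_params(dense), n_params(grouped))   # 36928 vs 4672: roughly 8x fewer weights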
52. Neural Anisotropy Directions [PDF] 返回目录
Guillermo Ortiz-Jimenez, Apostolos Modas, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard
Abstract: In this work, we analyze the role of the network architecture in shaping the inductive bias of deep classifiers. To that end, we start by focusing on a very simple problem, i.e., classifying a class of linearly separable distributions, and show that, depending on the direction of the discriminative feature of the distribution, many state-of-the-art deep convolutional neural networks (CNNs) have a surprisingly hard time solving this simple task. We then define as neural anisotropy directions (NADs) the vectors that encapsulate the directional inductive bias of an architecture. These vectors, which are specific for each architecture and hence act as a signature, encode the preference of a network to separate the input data based on some particular features. We provide an efficient method to identify NADs for several CNN architectures and thus reveal their directional inductive biases. Furthermore, we show that, for the CIFAR-10 dataset, NADs characterize features used by CNNs to discriminate between different classes.
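The toy setup is easy to reproduce (a sketch; the paper varies the direction over a basis such as the Fourier one): data whose only discriminative feature lies along a chosen direction v, with pure noise in every other direction. How hard a fixed architecture finds this task as v rotates reveals its NADs.

import numpy as np

rng = np.random.default_rng(0)
d = 32
v = np.zeros(d); v[7] = 1.0      # discriminative direction (placeholder choice)

def sample(n):
    """Class = sign of the component along v; all other directions are noise."""
    y = rng.integers(0, 2, n) * 2 - 1
    x = rng.standard_normal((n, d))
    x += np.outer(y - x @ v, v)  # set the v-component exactly to the label
    return x, y

x, y = sample(1000)
print(np.all(np.sign(x @ v) == y))   # linearly separable along v only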
53. StatAssist & GradBoost: A Study on Optimal INT8 Quantization-aware Training from Scratch [PDF] 返回目录
Taehoon Kim, Youngjoon Yoo, Jihoon Yang
Abstract: This paper studies the from-scratch training of quantization-aware training (QAT), which has been applied to the lossless conversion of lower-bit models, especially for INT8 quantization. Due to its training instability, QAT has required a full-precision (FP) pre-trained weight for fine-tuning, and its performance is bound to the original FP model with floating-point computations. Here, we propose critical but straightforward optimization methods which enable scratch training: floating-point statistic assisting (StatAssist) and stochastic-gradient boosting (GradBoost). We discovered that, first, scratch QAT gets comparable performance to, and often surpasses, the floating-point counterpart without any help of a pre-trained model, especially when the model becomes complicated. We also show that our method can even train the minimax generation loss, which is very unstable and hence difficult to apply QAT fine-tuning to. From extensive experiments, we show that our method successfully enables QAT to train various deep models from scratch: classification, object detection, semantic segmentation, and style transfer, with comparable or often better performance than their FP baselines.
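For context, the basic QAT mechanism the paper builds on is fake quantization with a straight-through estimator: quantize in the forward pass, pass gradients through unchanged in the backward pass (a generic sketch, not StatAssist or GradBoost themselves):

import torch

def fake_quant(x, bits=8):
    """Forward: simulated INT8 values; backward: identity (straight-through)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return x + (q - x).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quant(w).sum().backward()
print(w.grad.unique())   # all ones: gradients flow as if no rounding happened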
54. Visor: Privacy-Preserving Video Analytics as a Cloud Service [PDF] 返回目录
Rishabh Poddar, Ganesh Ananthanarayanan, Srinath Setty, Stavros Volos, Raluca Ada Popa
Abstract: Video-analytics-as-a-service is becoming an important offering for cloud providers. A key concern in such services is the privacy of the videos being analyzed. While trusted execution environments (TEEs) are promising options for preventing the direct leakage of private video content, they remain vulnerable to side-channel attacks. We present Visor, a system that provides confidentiality for the user's video stream as well as the ML models in the presence of a compromised cloud platform and untrusted co-tenants. Visor executes video pipelines in a hybrid TEE that spans both the CPU and GPU enclaves. It protects against any side-channel attack induced by data-dependent access patterns of video modules, and also protects the CPU-GPU communication channel. Visor is up to $1000\times$ faster than naïve oblivious solutions, and its overheads relative to a non-oblivious baseline are limited to $2\times$--$6\times$.
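Defenses against access-pattern side channels hinge on data-obliviousness: neither branches nor memory addresses may depend on secret values. The fragment below is a toy illustration of that principle only, not Visor's implementation; Python cannot guarantee constant-time execution, and real oblivious code is written as carefully vetted native code inside the enclave.

```python
def oselect(cond_bit: int, a: int, b: int) -> int:
    """Return a if cond_bit == 1 else b, without a secret-dependent branch."""
    mask = -cond_bit                 # 0 -> ...000, 1 -> ...111 (two's complement)
    return (a & mask) | (b & ~mask)

def oaccess(array, secret_index: int) -> int:
    """Read array[secret_index] while touching every slot, so the memory
    access pattern reveals nothing about which index was requested."""
    result = 0
    for i, v in enumerate(array):
        result = oselect(int(i == secret_index), v, result)
    return result
```

A pipeline built from primitives like these pays a constant-factor overhead for scanning instead of indexing, which is the kind of cost the abstract's $2\times$--$6\times$ figure reflects.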
55. On sparse connectivity, adversarial robustness, and a novel model of the artificial neuron [PDF] 返回目录
Sergey Bochkanov
Abstract: Deep neural networks have achieved human-level accuracy on almost all perceptual benchmarks. It is interesting that these advances were made using two ideas that are decades old: (a) an artificial neuron based on a linear summator and (b) SGD training. However, there are important metrics beyond accuracy: computational efficiency and stability against adversarial perturbations. In this paper, we propose two closely connected methods to improve these metrics on contour recognition tasks: (a) a novel model of an artificial neuron, a "strong neuron," with low hardware requirements and inherent robustness against adversarial perturbations and (b) a novel constructive training algorithm that generates sparse networks with $O(1)$ connections per neuron. We demonstrate the feasibility of our approach through experiments on SVHN and GTSRB benchmarks. We achieved an impressive 10x-100x reduction in operations count (10x when compared with other sparsification approaches, 100x when compared with dense networks) and a substantial reduction in hardware requirements (8-bit fixed-point math was used) with no reduction in model accuracy. Superior stability against adversarial perturbations (exceeding that of adversarial training) was achieved without any counteradversarial measures, relying on the robustness of strong neurons alone. We also proved that constituent blocks of our strong neuron are the only activation functions with perfect stability against adversarial attacks.
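The $O(1)$-connections-per-neuron claim can be made concrete with a fixed-fan-in layer. The sketch below is a generic stand-in, not the paper's constructive training algorithm or its "strong neuron" definition: each output unit is wired at construction time to a small constant number of inputs, so compute grows linearly in neuron count rather than quadratically.

```python
import torch
import torch.nn as nn

class FixedFanInLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, fan_in: int = 4):
        super().__init__()
        # Each output neuron samples `fan_in` input indices once, at build time.
        self.register_buffer(
            "idx", torch.randint(0, in_features, (out_features, fan_in))
        )
        self.weight = nn.Parameter(torch.randn(out_features, fan_in) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gathered = x[:, self.idx]              # (batch, out_features, fan_in)
        return (gathered * self.weight).sum(-1) + self.bias

layer = FixedFanInLinear(784, 128, fan_in=4)   # 4 connections per neuron
out = layer(torch.randn(32, 784))              # -> (32, 128)
```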
56. Interpretable multimodal fusion networks reveal mechanisms of brain cognition [PDF] 返回目录
Wenxing Hu, Xianghe Meng, Yuntong Bai, Aiying Zhang, Biao Cai, Gemeng Zhang, Tony W. Wilson, Julia M. Stephen, Vince D. Calhoun, Yu-Ping Wang
Abstract: Multimodal fusion benefits disease diagnosis by providing a more comprehensive perspective. Developing algorithms is challenging due to data heterogeneity and the complex within- and between-modality associations. Deep-network-based data-fusion models have been developed to capture the complex associations and the performance in diagnosis has been improved accordingly. Moving beyond diagnosis prediction, evaluation of disease mechanisms is critically important for biomedical research. Deep-network-based data-fusion models, however, are difficult to interpret, bringing about difficulties for studying biological mechanisms. In this work, we develop an interpretable multimodal fusion model, namely gCAM-CCL, which can perform automated diagnosis and result interpretation simultaneously. The gCAM-CCL model can generate interpretable activation maps, which quantify pixel-level contributions of the input features. This is achieved by combining intermediate feature maps using gradient-based weights. Moreover, the estimated activation maps are class-specific, and the captured cross-data associations are interest/label related, which further facilitates class-specific analysis and biological mechanism analysis. We validate the gCAM-CCL model on a brain imaging-genetic study, and show that gCAM-CCL performs well for both classification and mechanism analysis. Mechanism analysis suggests that during task-fMRI scans, several object recognition related regions of interest (ROIs) are first activated and then several downstream encoding ROIs get involved. Results also suggest that the higher cognition performing group may have stronger neurotransmission signaling, while the lower cognition performing group may have problems in brain/neuron development, resulting from genetic variations.
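"Combining intermediate feature maps using gradient-based weights" is the recipe popularized by Grad-CAM, so a generic gradient-weighted class-activation map conveys the mechanism; gCAM-CCL's exact formulation and its class-specific refinements are defined in the paper, and the helper below is only an illustrative sketch with hypothetical names.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, class_idx):
    """Gradient-weighted activation map for one input (x: (1, C, H, W))."""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        score = model(x)[0, class_idx]        # class-specific output score
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    fmap, grad = feats[0], grads[0]           # (1, C, H, W) each
    weights = grad.mean(dim=(2, 3), keepdim=True)  # global-average-pool grads
    cam = F.relu((weights * fmap).sum(dim=1))      # weighted sum over channels
    return cam / (cam.max() + 1e-8)                # normalize to [0, 1]
```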
57. Noise2Inpaint: Learning Referenceless Denoising by Inpainting Unrolling [PDF] 返回目录
Burhaneddin Yaman, Seyed Amir Hossein Hosseini, Mehmet Akçakaya
Abstract: Deep learning based image denoising methods have recently become popular due to their improved performance. Traditionally, these methods are trained in a supervised manner, requiring a set of noisy input and clean target image pairs. More recently, self-supervised approaches have been proposed to learn denoising from noisy images only, without requiring clean ground truth during training. Succinctly, these methods assume that an image pixel is correlated with its neighboring pixels, while the noise is independent. In this work, building on these approaches and recent methods from image reconstruction, we introduce Noise2Inpaint (N2I), a training approach that recasts the denoising problem into a regularized image inpainting framework. This allows us to use an objective function, which can incorporate different statistical properties of the noise as needed. We use algorithm unrolling to unroll an iterative optimization for solving this objective function and train the unrolled network end-to-end. The training is self-supervised without requiring clean target images: pixels in the noisy image are split into two disjoint sets. One of these is used to impose data fidelity in the unrolled network, while the other one defines the loss. We demonstrate that N2I performs successful denoising on real-world datasets, while preserving better details compared to its self-supervised counterpart Noise2Void.
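The pixel split described in the last sentences is the crux of the self-supervision. Below is a hedged sketch of one training step under that scheme; `recon_net` stands in for the unrolled inpainting network, the fidelity/loss split ratio is a guess, and all names are hypothetical rather than the authors' code.

```python
import torch

def split_pixels(noisy: torch.Tensor, p_fidelity: float = 0.6):
    """Randomly assign each pixel to the fidelity set or the held-out loss set."""
    fidelity_mask = (torch.rand_like(noisy) < p_fidelity).float()
    loss_mask = 1.0 - fidelity_mask            # disjoint complement
    return fidelity_mask, loss_mask

def n2i_training_step(recon_net, noisy, opt):
    fid_mask, loss_mask = split_pixels(noisy)
    # The network only ever sees the fidelity pixels; held-out pixels are zeroed,
    # so it must inpaint them from their neighbors.
    recon = recon_net(noisy * fid_mask, fid_mask)
    # The loss is computed only on the held-out set, so the identity map cannot
    # trivially minimize it -- this is what makes the training self-supervised.
    loss = ((recon - noisy) * loss_mask).pow(2).sum() / loss_mask.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```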