
[arXiv Papers] Computer Vision and Pattern Recognition 2020-11-09

Contents

1. Large-scale multilingual audio visual dubbing [PDF] Abstract
2. Deep Cross-modal Proxy Hashing [PDF] Abstract
3. Illumination Normalization by Partially Impossible Encoder-Decoder Cost Function [PDF] Abstract
4. Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [PDF] Abstract
5. Domain Adaptive Person Re-Identification via Coupling Optimization [PDF] Abstract
6. Self Supervised Learning for Object Localisation in 3D Tomographic Images [PDF] Abstract
7. Towards Efficient Scene Understanding via Squeeze Reasoning [PDF] Abstract
8. Learning to Orient Surfaces by Self-supervised Spherical CNNs [PDF] Abstract
9. Event-VPR: End-to-End Weakly Supervised Network Architecture for Event-based Visual Place Recognition [PDF] Abstract
10. "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [PDF] Abstract
11. Hi-UCD: A Large-scale Dataset for Urban Semantic Change Detection in Remote Sensing Imagery [PDF] Abstract
12. Channel Pruning via Multi-Criteria based on Weight Dependency [PDF] Abstract
13. Efficient image retrieval using multi neural hash codes and bloom filters [PDF] Abstract
14. Learning a Geometric Representation for Data-Efficient Depth Estimation via Gradient Field and Contrastive Loss [PDF] Abstract
15. ULSD: Unified Line Segment Detection across Pinhole, Fisheye, and Spherical Cameras [PDF] Abstract
16. GHFP: Gradually Hard Filter Pruning [PDF] Abstract
17. Confusable Learning for Large-class Few-Shot Classification [PDF] Abstract
18. Affinity LCFCN: Learning to Segment Fish with Weak Supervision [PDF] Abstract
19. Ellipse Loss for Scene-Compliant Motion Prediction [PDF] Abstract
20. Uncertainty-Aware Vehicle Orientation Estimation for Joint Detection-Prediction Models [PDF] Abstract
21. Learning Rolling Shutter Correction from Real Data without Camera Motion Assumption [PDF] Abstract
22. Can Human Sex Be Learned Using Only 2D Body Keypoint Estimations? [PDF] Abstract
23. Smart Time-Multiplexing of Quads Solves the Multicamera Interference Problem [PDF] Abstract
24. End-to-end Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales [PDF] Abstract
25. Towards Keypoint Guided Self-Supervised Depth Estimation [PDF] Abstract
26. A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs [PDF] Abstract
27. A Tree-structure Convolutional Neural Network for Temporal Features Exaction on Sensor-based Multi-resident Activity Recognition [PDF] Abstract
28. A Comprehensive Comparison of Multi-Dimensional Image Denoising Methods [PDF] Abstract
29. Noise2Sim -- Similarity-based Self-Learning for Image Denoising [PDF] Abstract
30. Intra-Domain Task-Adaptive Transfer Learning to Determine Acute Ischemic Stroke Onset Time [PDF] Abstract
31. Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog [PDF] Abstract
32. Deep coastal sea elements forecasting using U-Net based models [PDF] Abstract
33. Modular Primitives for High-Performance Differentiable Rendering [PDF] Abstract
34. Task-relevant Representation Learning for Networked Robotic Perception [PDF] Abstract
35. Automatic Head and Neck Tumor Segmentation in PET/CT with Scale Attention Network [PDF] Abstract
36. MorphEyes: Variable Baseline Stereo For Quadrotor Navigation [PDF] Abstract
37. Identifying and interpreting tuning dimensions in deep networks [PDF] Abstract

Abstracts

1. Large-scale multilingual audio visual dubbing [PDF] Back to Contents
  Yi Yang, Brendan Shillingford, Yannis Assael, Miaosen Wang, Wendi Liu, Yutian Chen, Yu Zhang, Eren Sezener, Luis C. Cobo, Misha Denil, Yusuf Aytar, Nando de Freitas
Abstract: We describe a system for large-scale audiovisual translation and dubbing, which translates videos from one language to another. The source language's speech content is transcribed to text, translated, and automatically synthesized into target language speech using the original speaker's voice. The visual content is translated by synthesizing lip movements for the speaker to match the translated audio, creating a seamless audiovisual experience in the target language. The audio and visual translation subsystems each contain a large-scale generic synthesis model trained on thousands of hours of data in the corresponding domain. These generic models are fine-tuned to a specific speaker before translation, either using an auxiliary corpus of data from the target speaker, or using the video to be translated itself as the input to the fine-tuning process. This report gives an architectural overview of the full system, as well as an in-depth discussion of the video dubbing component. The role of the audio and text components in relation to the full system is outlined, but their design is not discussed in detail. Translated and dubbed demo videos generated using our system can be viewed at this https URL

2. Deep Cross-modal Proxy Hashing [PDF] Back to Contents
  Rong-Cheng Tu, Xian-Ling Mao, Rongxin Tu, Wei Wei, Heyan Huang
Abstract: Due to their high retrieval efficiency and low storage cost for cross-modal search task, cross-modal hashing methods have attracted considerable attention. For supervised cross-modal hashing methods, how to make the learned hash codes preserve semantic structure information sufficiently is a key point to further enhance the retrieval performance. As far as we know, almost all supervised cross-modal hashing methods preserve semantic structure information depending on at-least-one similarity definition fully or partly, i.e., it defines two datapoints as similar ones if they share at least one common category otherwise they are dissimilar. Obviously, the at-least-one similarity misses abundant semantic structure information. To tackle this problem, in this paper, we propose a novel Deep Cross-modal Proxy Hashing, called DCPH. Specifically, DCPH first learns a proxy hashing network to generate a discriminative proxy hash code for each category. Then, by utilizing the learned proxy hash code as supervised information, a novel Margin-SoftMax-like loss is proposed without defining the at-least-one similarity between datapoints. By minimizing the novel Margin-SoftMax-like loss, the learned hash codes will simultaneously preserve the cross-modal similarity and abundant semantic structure information well. Extensive experiments on two benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in cross-modal retrieval task.
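
For concreteness, here is a minimal PyTorch sketch of what a proxy-based margin-softmax-style hashing loss could look like; the function name, the margin and scale values, and the multi-label normalization are illustrative assumptions, not the exact DCPH formulation.

```python
import torch
import torch.nn.functional as F

def margin_softmax_like_loss(h, proxies, labels, margin=0.2, scale=10.0):
    """Hypothetical proxy-based margin loss (not the paper's exact form).

    h:       (B, K) continuous hash outputs in [-1, 1], e.g. tanh activations
    proxies: (C, K) learned proxy hash codes, one per category
    labels:  (B, C) multi-hot category labels as floats
    """
    # Cosine similarity between each hash output and every class proxy.
    sim = F.normalize(h, dim=1) @ F.normalize(proxies, dim=1).t()  # (B, C)
    # Subtract a margin from the similarities of the true categories so the
    # network must separate positives from negatives by at least `margin`.
    logits = scale * (sim - margin * labels)
    # One common multi-label extension of softmax cross-entropy: match the
    # normalized multi-hot label distribution.
    target = labels / labels.sum(dim=1, keepdim=True).clamp(min=1.0)
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```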

3. Illumination Normalization by Partially Impossible Encoder-Decoder Cost Function [PDF] Back to Contents
  Steve Dias Da Cruz, Bertram Taetz, Thomas Stifter, Didier Stricker
Abstract: Images recorded during the lifetime of computer vision based systems undergo a wide range of illumination and environmental conditions affecting the reliability of previously trained machine learning models. Image normalization is hence a valuable preprocessing component to enhance the models' robustness. To this end, we introduce a new strategy for the cost function formulation of encoder-decoder networks to average out all the unimportant information in the input images (e.g. environmental features and illumination changes) to focus on the reconstruction of the salient features (e.g. class instances). Our method exploits the availability of identical sceneries under different illumination and environmental conditions for which we formulate a partially impossible reconstruction target: the input image will not convey enough information to reconstruct the target in its entirety. Its applicability is assessed on three publicly available datasets. We combine the triplet loss as a regularizer in the latent space representation and a nearest neighbour search to improve the generalization to unseen illuminations and class instances. The importance of the aforementioned post-processing is highlighted on an automotive application. To this end, we release a synthetic dataset of sceneries from three different passenger compartments where each scenery is rendered under ten different illumination and environmental conditions: see this https URL
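
The core of the cost function can be illustrated with a short PyTorch sketch: because the reconstruction target is a different illumination variant of the same scenery, exact reconstruction is partially impossible and the network learns to average illumination out. The tensor layout and the sampling rule below are assumptions, not the paper's exact formulation.

```python
import torch

def partially_impossible_target(variants):
    """Pick, for each scene, a different illumination/environment variant of
    the same scenery as the reconstruction target (sampling rule assumed).

    variants: (B, N, C, H, W) -- N >= 2 recorded variants per scene;
    variants[:, 0] is assumed to be the encoder input.
    """
    b, n = variants.shape[:2]
    idx = torch.randint(1, n, (b,))           # any variant except the input
    return variants[torch.arange(b), idx]     # (B, C, H, W) target images

# Usage sketch:
#   target = partially_impossible_target(variants)
#   loss = torch.nn.functional.l1_loss(decoder(encoder(variants[:, 0])), target)
```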

4. Disentangling 3D Prototypical Networks For Few-Shot Concept Learning [PDF] Back to Contents
  Mihir Prabhudesai, Shamit Lal, Darshan Patil, Hsiao-Yu Tung, Adam W Harley, Katerina Fragkiadaki
Abstract: We present neural architectures that disentangle RGB-D images into objects' shapes and styles and a map of the background scene, and explore their applications for few-shot 3D object detection and few-shot concept classification. Our networks incorporate architectural biases that reflect the image formation process, 3D geometry of the world scene, and shape-style interplay. They are trained end-to-end self-supervised by predicting views in static scenes, alongside a small number of 3D object boxes. Objects and scenes are represented in terms of 3D feature grids in the bottleneck of the network. We show that the proposed 3D neural representations are compositional: they can generate novel 3D scene feature maps by mixing object shapes and styles, resizing and adding the resulting object 3D feature maps over background scene feature maps. We show that classifiers for object categories, color, materials, and spatial relationships trained over the disentangled 3D feature sub-spaces generalize better with dramatically fewer examples than the current state-of-the-art, and enable a visual question answering system that uses them as its modules to generalize one-shot to novel objects in the scene.

5. Domain Adaptive Person Re-Identification via Coupling Optimization [PDF] Back to Contents
  Xiaobin Liu, Shiliang Zhang
Abstract: Domain adaptive person Re-Identification (ReID) is challenging owing to the domain gap and shortage of annotations on target scenarios. To handle those two challenges, this paper proposes a coupling optimization method including the Domain-Invariant Mapping (DIM) method and the Global-Local distance Optimization (GLO), respectively. Different from previous methods that transfer knowledge in two stages, the DIM achieves a more efficient one-stage knowledge transfer by mapping images in labeled and unlabeled datasets to a shared feature space. GLO is designed to train the ReID model with unsupervised setting on the target domain. Instead of relying on existing optimization strategies designed for supervised training, GLO involves more images in distance optimization, and achieves better robustness to noisy label prediction. GLO also integrates distance optimizations in both the global dataset and local training batch, thus exhibits better training efficiency. Extensive experiments on three large-scale datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that our coupling optimization outperforms state-of-the-art methods by a large margin. Our method also works well in unsupervised training, and even outperforms several recent domain adaptive methods.

6. Self Supervised Learning for Object Localisation in 3D Tomographic Images [PDF] Back to Contents
  Yaroslav Zharov, Alexey Ershov, Tilo Baumbach
Abstract: While a lot of work is dedicated to self-supervised learning, most of it is dealing with 2D images of natural scenes and objects. In this paper, we focus on volumetric images obtained by means of the X-Ray Computed Tomography (CT). We describe two pretext training tasks which are designed taking into account the specific properties of volumetric data. We propose two ways to transfer a trained network to the downstream task of object localization with a zero amount of manual markup. Despite its simplicity, the proposed method shows its applicability to practical tasks of object localization and data reduction.

7. Towards Efficient Scene Understanding via Squeeze Reasoning [PDF] Back to Contents
  Xiangtai Li, Xia Li, Ansheng You, Li Zhang, Guangliang Cheng, Kuiyuan Yang, Yunhai Tong, Zhouchen Lin
Abstract: Graph-based convolutional models such as the non-local block have been shown to be effective for strengthening the context modeling ability in convolutional neural networks (CNNs). However, their pixel-wise computational overhead is prohibitive, which renders them unsuitable for high resolution imagery. In this paper, we explore the efficiency of context graph reasoning and propose a novel framework called Squeeze Reasoning. Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector and perform reasoning within the single vector, where the computation cost can be significantly reduced. Specifically, we build the node graph in the vector, where each node represents an abstract semantic concept. The refined features within the same semantic category turn out to be consistent, which is thus beneficial for downstream tasks. We show that our approach can be modularized as an end-to-end trained block and can be easily plugged into existing networks. Despite its simplicity and being lightweight, our strategy allows us to establish a new state-of-the-art on semantic segmentation and shows significant improvements with respect to strong, state-of-the-art baselines on various other scene understanding tasks, including object detection, instance segmentation and panoptic segmentation. Code will be made available to foster further research.
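
A minimal PyTorch sketch of the squeeze-reasoning idea described above: pool the spatial map into a single channel-wise vector, perform one reasoning step among the resulting "nodes", and use the output to re-weight the input. The layer sizes and the exact graph-reasoning step are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeReasoning(nn.Module):
    """Hedged sketch: reasoning happens in a squeezed channel vector instead
    of on the full spatial map, so the cost is independent of H x W."""

    def __init__(self, channels, nodes=64):
        super().__init__()
        self.down = nn.Linear(channels, nodes)   # project to node space
        self.graph = nn.Linear(nodes, nodes)     # one graph-reasoning step
        self.up = nn.Linear(nodes, channels)     # back to channel space

    def forward(self, x):                        # x: (B, C, H, W)
        v = x.mean(dim=(2, 3))                   # squeeze: (B, C)
        z = torch.relu(self.graph(torch.relu(self.down(v))))
        w = torch.sigmoid(self.up(z))            # channel re-weighting
        return x * w.unsqueeze(-1).unsqueeze(-1) # broadcast over H, W
```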

8. Learning to Orient Surfaces by Self-supervised Spherical CNNs [PDF] Back to Contents
  Riccardo Spezialetti, Federico Stella, Marlon Marcon, Luciano Silva, Samuele Salti, Luigi Di Stefano
Abstract: Defining and reliably finding a canonical orientation for 3D surfaces is key to many Computer Vision and Robotics applications. This task is commonly addressed by handcrafted algorithms exploiting geometric cues deemed as distinctive and robust by the designer. Yet, one might conjecture that humans learn the notion of the inherent orientation of 3D objects from experience and that machines may do so alike. In this work, we show the feasibility of learning a robust canonical orientation for surfaces represented as point clouds. Based on the observation that the quintessential property of a canonical orientation is equivariance to 3D rotations, we propose to employ Spherical CNNs, a recently introduced machinery that can learn equivariant representations defined on the Special Orthogonal group SO(3). Specifically, spherical correlations compute feature maps whose elements define 3D rotations. Our method learns such feature maps from raw data by a self-supervised training procedure and robustly selects a rotation to transform the input point cloud into a learned canonical orientation. Thereby, we realize the first end-to-end learning approach to define and extract the canonical orientation of 3D shapes, which we aptly dub Compass. Experiments on several public datasets prove its effectiveness at orienting local surface patches as well as whole objects.

9. Event-VPR: End-to-End Weakly Supervised Network Architecture for Event-based Visual Place Recognition [PDF] Back to Contents
  Delei Kong, Zheng Fang, Haojia Li, Kuanxu Hou, Sonya Coleman, Dermot Kerr
Abstract: Traditional visual place recognition (VPR) methods generally use frame-based cameras, which is easy to fail due to dramatic illumination changes or fast motions. In this paper, we propose an end-to-end visual place recognition network for event cameras, which can achieve good place recognition performance in challenging environments. The key idea of the proposed algorithm is firstly to characterize the event streams with the EST voxel grid, then extract features using a convolution network, and finally aggregate features using an improved VLAD network to realize end-to-end visual place recognition using event streams. To verify the effectiveness of the proposed algorithm, we compare the proposed method with classical VPR methods on the event-based driving datasets (MVSEC, DDD17) and the synthetic datasets (Oxford RobotCar). Experimental results show that the proposed method can achieve much better performance in challenging scenarios. To our knowledge, this is the first end-to-end event-based VPR method. The accompanying source code is available at this https URL.

10. "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences [PDF] Back to Contents
  Wout Boerdijk, Martin Sundermeyer, Maximilian Durner, Rudolph Triebel
Abstract: We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator. Our method successively learns an agnostic foreground segmentation followed by a distinction between manipulator and object solely by observing the motion between consecutive RGB frames. In contrast to previous approaches, we propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge. Furthermore, while the motion of the manipulator and the object are substantial cues for our algorithm, we present means to robustly deal with distraction objects moving in the background, as well as with completely static scenes. Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data. By extensive experimental evaluation we demonstrate the superiority of our framework and provide detailed insights on its capability of dealing with the aforementioned extreme cases of motion. We also show that training a semantic segmentation network with the automatically labeled data achieves results on par with manually annotated training data. Code and pretrained models will be made publicly available.

11. Hi-UCD: A Large-scale Dataset for Urban Semantic Change Detection in Remote Sensing Imagery [PDF] Back to Contents
  Shiqi Tian, Yanfei Zhong, Ailong Ma, Zhuo Zheng
Abstract: With the acceleration of urban expansion, urban change detection (UCD), as a significant and effective approach, can provide change information with respect to geospatial objects for dynamic urban analysis. However, existing datasets suffer from three bottlenecks: (1) lack of high spatial resolution images; (2) lack of semantic annotation; (3) lack of long-range multi-temporal images. In this paper, we propose a large scale benchmark dataset, termed Hi-UCD. This dataset uses aerial images with a spatial resolution of 0.1 m provided by the Estonia Land Board, covering three temporal phases and semantically annotated with nine classes of land cover to capture the direction of ground-object change. It can be used for detecting and analyzing refined urban changes. We benchmark our dataset using some classic methods in binary and multi-class change detection. Experimental results show that Hi-UCD is challenging yet useful. We hope that Hi-UCD can become a strong benchmark accelerating future research.

12. Channel Pruning via Multi-Criteria based on Weight Dependency [PDF] Back to Contents
  Yangchun Yan, Chao Li, Rongzuo Guo, Kang Yang, Yongjun Xu
Abstract: Channel pruning has demonstrated its effectiveness in compressing ConvNets. In many prior works, the importance of an output feature map is determined only by its associated filter. However, these methods ignore a small part of the weights in the next layer, which disappears when the feature map is removed. They ignore the dependency of the weights, so that some weights are pruned without being evaluated. In addition, many pruning methods use only one criterion for evaluation and find a sweet spot of pruning structure and accuracy in a trial-and-error fashion, which can be time-consuming. To address the above issues, we propose a channel pruning algorithm via multi-criteria based on weight dependency, CPMC, which can compress a variety of models efficiently. We assess the importance of a feature map in three aspects: its associated weight value, computational cost and parameter quantity. Using the phenomenon of weight dependency, we obtain the importance by assessing the associated filter and the corresponding partial weights of the next layer. Then we use global normalization to achieve cross-layer comparison. Our method can compress various CNN models, including VGGNet, ResNet and DenseNet, on various image classification datasets. Extensive experiments have shown that CPMC outperforms the others significantly.
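
A hedged sketch of a multi-criteria, weight-dependency-aware importance score is shown below; the abstract does not spell out how CPMC combines the three criteria, so the combination rule here is an assumption.

```python
import torch

def channel_importance(conv_w, next_w, flops, params):
    """Hypothetical multi-criteria importance with weight dependency.

    conv_w: (C_out, C_in, k, k) weights of the filters producing each channel
    next_w: (C_next, C_out, k, k) next-layer weights consuming those channels
    flops, params: per-channel cost estimates, each of shape (C_out,)
    """
    filter_norm = conv_w.flatten(1).norm(p=1, dim=1)                # own filter
    # Weight dependency: the next-layer weights that vanish with the channel.
    dependent_norm = next_w.transpose(0, 1).flatten(1).norm(p=1, dim=1)
    score = filter_norm * dependent_norm
    # Fold in computational cost and parameter count (assumed rule), then
    # normalize globally so scores are comparable across layers.
    score = score * flops * params
    return score / score.sum()
```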

13. Efficient image retrieval using multi neural hash codes and bloom filters [PDF] Back to Contents
  Sourin Chakrabarti
Abstract: This paper aims to deliver an efficient, modified approach for image retrieval that uses multiple neural hash codes and limits the number of queries by using Bloom filters to identify false positives beforehand. Traditional approaches involving neural networks for image retrieval tasks tend to use higher layers for feature extraction. However, it has been observed that the activations of lower layers are more effective in a number of scenarios. In our approach, we leverage local deep convolutional neural networks, which combine the strengths of features from both lower and higher layers, to create feature maps which are then compressed using PCA and fed to a Bloom filter after binary sequencing using a modified multi k-means approach. The feature maps obtained are further used in the image retrieval process in a hierarchical coarse-to-fine manner, by first comparing the images in the higher layers for semantically similar images and then gradually moving towards the lower layers searching for structural similarities. While searching, the neural hashes for the query image are again calculated and queried in the Bloom filter, which tells us whether the query image is absent from the set or possibly present. If the Bloom filter does not rule out the query, it goes into the image retrieval process. This approach can be particularly helpful in cases where the image store is distributed, since the approach supports parallel querying.
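
The Bloom-filter gate can be illustrated with a short, self-contained Python sketch: hash codes of stored images are inserted into the filter, and a query whose code is definitely absent skips the full hierarchical search. The filter parameters and the byte encoding of hash codes are assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, size=1 << 20, n_hashes=4):
        self.size, self.n_hashes = size, n_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, key: bytes):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# Usage sketch: index binary neural hash codes, skip retrieval on misses.
bf = BloomFilter()
bf.add(b"0110100101")            # hash code of a stored image (format assumed)
if b"0110100101" in bf:          # only then run the hierarchical search
    pass
```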

14. Learning a Geometric Representation for Data-Efficient Depth Estimation via Gradient Field and Contrastive Loss [PDF] Back to Contents
  Dongseok Shim, H. Jin Kim
Abstract: Estimating a depth map from a single RGB image has been investigated widely for localization, mapping, and 3-dimensional object detection. Recent studies on a single-view depth estimation are mostly based on deep Convolutional neural Networks (ConvNets) which require a large amount of training data paired with densely annotated labels. Depth annotation tasks are both expensive and inefficient, so it is inevitable to leverage RGB images which can be collected very easily to boost the performance of ConvNets without depth labels. However, most self-supervised learning algorithms are focused on capturing the semantic information of images to improve the performance in classification or object detection, not in depth estimation. In this paper, we show that existing self-supervised methods do not perform well on depth estimation and propose a gradient-based self-supervised learning algorithm with momentum contrastive loss to help ConvNets extract the geometric information with unlabeled images. As a result, the network can estimate the depth map accurately with a relatively small amount of annotated data. To show that our method is independent of the model structure, we evaluate our method with two different monocular depth estimation algorithms. Our method outperforms the previous state-of-the-art self-supervised learning algorithms and shows the efficiency of labeled data in triple compared to random initialization on the NYU Depth v2 dataset.
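
As a reference point, here is a minimal sketch of a MoCo-style momentum contrastive (InfoNCE) loss of the kind the abstract refers to; the gradient-field encoding itself is omitted, and the temperature and queue usage are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, queue, temperature=0.07):
    """Momentum-contrastive loss sketch (hyperparameters assumed).

    q:     (B, D) query features from the online encoder
    k:     (B, D) positive keys from the momentum encoder
    queue: (K, D) negative keys stored from previous batches
    """
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)            # (B, 1)
    l_neg = q @ F.normalize(queue, dim=1).t()           # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    # The positive key sits at index 0 for every query.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```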

15. ULSD: Unified Line Segment Detection across Pinhole, Fisheye, and Spherical Cameras [PDF] Back to Contents
  Hao Li, Huai Yu, Wen Yang, Lei Yu, Sebastian Scherer
Abstract: Line segment detection is essential for high-level tasks in computer vision and robotics. Currently, most state-of-the-art (SOTA) methods are dedicated to detecting straight line segments in undistorted pinhole images, so distortions in fisheye or spherical images may largely degrade their performance. Targeting the unified line segment detection (ULSD) for both distorted and undistorted images, we propose to represent line segments with the Bezier curve model. The line segment detection is then tackled by Bezier curve regression with an end-to-end network, which is model-free and requires no undistortion preprocessing. Experimental results on pinhole, fisheye, and spherical image datasets validate the superiority of the proposed ULSD over the SOTA methods in both accuracy and efficiency (40.6 fps for pinhole images). The source code is available at this https URL.
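
The Bezier parameterization can be made concrete with a short NumPy sketch that samples image points from an order-n Bezier segment via the Bernstein basis; the curve order ULSD uses per camera model is an assumption here.

```python
import numpy as np
from math import comb

def bezier_points(ctrl, ts):
    """Sample an order-n Bezier segment at parameters ts (layout assumed).

    ctrl: (n+1, 2) array of control points; ts: (T,) values in [0, 1]
    """
    n = len(ctrl) - 1
    # Bernstein basis: B_{i,n}(t) = C(n, i) * t^i * (1 - t)^(n - i)
    basis = np.stack([comb(n, i) * ts**i * (1 - ts)**(n - i)
                      for i in range(n + 1)], axis=1)    # (T, n+1)
    return basis @ ctrl                                  # (T, 2) image points

# e.g. a 4th-order segment sampled at 32 points:
# pts = bezier_points(np.random.rand(5, 2), np.linspace(0, 1, 32))
```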

16. GHFP: Gradually Hard Filter Pruning [PDF] Back to Contents
  Linhang Cai, Zhulin An, Yongjun Xu
Abstract: Filter pruning is widely used to reduce the computation of deep learning, enabling the deployment of Deep Neural Networks (DNNs) in resource-limited devices. The conventional Hard Filter Pruning (HFP) method zeroizes pruned filters and stops updating them, thus reducing the search space of the model. By contrast, Soft Filter Pruning (SFP) simply zeroizes pruned filters but keeps updating them in the following training epochs, thus maintaining the capacity of the network. However, SFP, together with its variants, converges much more slowly than HFP due to its larger search space. Our question is whether SFP-based methods and HFP can be combined to achieve better performance and speed up convergence. Firstly, we generalize SFP-based methods and HFP to analyze their characteristics. Then we propose a Gradually Hard Filter Pruning (GHFP) method that smoothly switches from SFP-based methods to HFP during training and pruning, thus maintaining a large search space at first and gradually reducing the capacity of the model to ensure a moderate convergence speed. Experimental results on CIFAR-10/100 show that our method achieves state-of-the-art performance.

17. Confusable Learning for Large-class Few-Shot Classification [PDF] Back to Contents
  Bingcong Li, Bo Han, Zhuowei Wang, Jing Jiang, Guodong Long
Abstract: Few-shot image classification is challenging due to the lack of ample samples in each class. Such a challenge becomes even tougher when the number of classes is very large, i.e., the large-class few-shot scenario. In this novel scenario, existing approaches do not perform well because they ignore confusable classes, namely similar classes that are difficult to distinguish from each other. These classes carry more information. In this paper, we propose a biased learning paradigm called Confusable Learning, which focuses more on confusable classes. Our method can be applied to mainstream meta-learning algorithms. Specifically, our method maintains a dynamically updating confusion matrix, which analyzes confusable classes in the dataset. Such a confusion matrix helps meta learners to emphasize on confusable classes. Comprehensive experiments on Omniglot, Fungi, and ImageNet demonstrate the efficacy of our method over state-of-the-art baselines.
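
One plausible reading of how a running confusion matrix can bias episode construction is sketched below; the actual sampling rule in Confusable Learning is an assumption here, not taken from the paper.

```python
import torch

def sample_confusable_classes(confusion, n_way):
    """Pick an episode's classes biased toward hard-to-distinguish pairs
    (hypothetical sampling rule, not the paper's exact procedure).

    confusion: (C, C) running confusion matrix, rows indexed by true class
    """
    off_diag = confusion.clone().float()
    off_diag.fill_diagonal_(0)                 # keep only misclassifications
    # Draw an anchor class proportionally to how often it is confused.
    weights = off_diag.sum(dim=1) + 1e-6
    anchor = int(torch.multinomial(weights, 1))
    # Episode = anchor plus the classes it is most often confused with.
    partners = off_diag[anchor].topk(n_way - 1).indices.tolist()
    return [anchor] + partners
```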

18. Affinity LCFCN: Learning to Segment Fish with Weak Supervision [PDF] Back to Contents
  Issam Laradji, Alzayat Saleh, Pau Rodriguez, Derek Nowrouzezahrai, Mostafa Rahimi Azghadi, David Vazquez
Abstract: Aquaculture industries rely on the availability of accurate fish body measurements, e.g., length, width and mass. Manual methods that rely on physical tools like rulers are time and labour intensive. Leading automatic approaches rely on fully-supervised segmentation models to acquire these measurements but these require collecting per-pixel labels -- also time consuming and laborious: i.e., it can take up to two minutes per fish to generate accurate segmentation labels, almost always requiring at least some manual intervention. We propose an automatic segmentation model efficiently trained on images labeled with only point-level supervision, where each fish is annotated with a single click. This labeling process requires significantly less manual intervention, averaging roughly one second per fish. Our approach uses a fully convolutional neural network with one branch that outputs per-pixel scores and another that outputs an affinity matrix. We aggregate these two outputs using a random walk to obtain the final, refined per-pixel segmentation output. We train the entire model end-to-end with an LCFCN loss, resulting in our A-LCFCN method. We validate our model on the DeepFish dataset, which contains many fish habitats from the north-eastern Australian region. Our experimental results confirm that A-LCFCN outperforms a fully-supervised segmentation model at fixed annotation budget. Moreover, we show that A-LCFCN achieves better segmentation results than LCFCN and a standard baseline. We have released the code at this https URL.
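
The random-walk refinement step can be sketched in a few lines of PyTorch: per-pixel class scores are propagated along a row-normalized affinity matrix for a few iterations. The shapes and number of steps are assumptions.

```python
import torch
import torch.nn.functional as F

def random_walk_refine(scores, affinity, n_steps=3):
    """Affinity-based score propagation (step count assumed).

    scores:   (N, C) per-pixel class scores, N = H * W pixels
    affinity: (N, N) non-negative pairwise affinities
    """
    # Row-normalize the affinities into a transition matrix.
    trans = affinity / affinity.sum(dim=1, keepdim=True).clamp(min=1e-8)
    out = scores
    for _ in range(n_steps):
        out = trans @ out        # each pixel absorbs its neighbours' scores
    return F.softmax(out, dim=1)
```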

19. Ellipse Loss for Scene-Compliant Motion Prediction [PDF] Back to Contents
  Henggang Cui, Hoda Shajari, Sai Yalamanchi, Nemanja Djuric
Abstract: Motion prediction is a critical part of self-driving technology, responsible for inferring future behavior of traffic actors in autonomous vehicle's surroundings. In order to ensure safe and efficient operations, prediction models need to output accurate trajectories that obey the map constraints. In this paper, we address this task and propose a novel ellipse loss that allows the models to better reason about scene compliance and predict more realistic trajectories. Ellipse loss penalizes off-road predictions directly in a supervised manner, by projecting the output trajectories into the top-down map frame using a differentiable trajectory rasterizer module. Moreover, it takes into account the actor dimension and orientation, providing more direct training signals to the model. We applied ellipse loss to a recently proposed state-of-the-art joint detection-prediction model to showcase its benefits. Evaluation results on a large-scale autonomous driving data set strongly indicate that our method allows for more accurate and more realistic trajectory predictions.

20. Uncertainty-Aware Vehicle Orientation Estimation for Joint Detection-Prediction Models [PDF] Back to Contents
  Henggang Cui, Fang-Chieh Chou, Jake Charland, Carlos Vallespi-Gonzalez, Nemanja Djuric
Abstract: Object detection is a critical component of a self-driving system, tasked with inferring the current states of the surrounding traffic actors. While there exist a number of studies on the problem of inferring the position and shape of vehicle actors, understanding actors' orientation remains a challenge for existing state-of-the-art detectors. Orientation is an important property for downstream modules of an autonomous system, particularly relevant for motion prediction of stationary or reversing actors where current approaches struggle. We focus on this task and present a method that extends the existing models that perform joint object detection and motion prediction, allowing us to more accurately infer vehicle orientations. In addition, the approach is able to quantify prediction uncertainty, outputting the probability that the inferred orientation is flipped, which allows for improved motion prediction and safer autonomous operations. Empirical results show the benefits of the approach, obtaining state-of-the-art performance on the open-sourced nuScenes data set.

21. Learning Rolling Shutter Correction from Real Data without Camera Motion Assumption [PDF] Back to Contents
  Jiawei Mo, Md Jahidul Islam, Junaed Sattar
Abstract: The rolling shutter mechanism in modern cameras generates distortions as the images are formed on the sensor through a row-by-row readout process; this is highly undesirable for photography and vision-based algorithms (e.g., structure-from-motion and visual SLAM). In this paper, we propose a deep neural network to predict depth and camera poses for single-frame rolling shutter correction. Compared to the state-of-the-art, the proposed method has no assumptions on camera motion. It is enabled by training on real images captured by rolling shutter cameras instead of synthetic ones generated with certain motion assumption. Consequently, the proposed method performs better for real rolling shutter images. This makes it possible for numerous vision-based algorithms to use imagery captured using rolling shutter cameras and produce highly accurate results. Our evaluations on the TUM rolling shutter dataset using DSO and COLMAP validate the accuracy and robustness of the proposed method.

22. Can Human Sex Be Learned Using Only 2D Body Keypoint Estimations? [PDF] Back to Contents
  Kristijan Bartol, Tomislav Pribanic, David Bojanic, Tomislav Petkovic
Abstract: In this paper, we analyze the human male and female sex recognition problem and present a fully automated classification system using only 2D keypoints. The keypoints represent human joints. A keypoint set consists of 15 joints, and the keypoint estimations are obtained using an OpenPose 2D keypoint detector. We learn a deep learning model to distinguish males and females using the keypoints as input and binary labels as output. We use two public datasets in the experimental section - 3DPeople and PETA. On the PETA dataset, we report a 77% accuracy. We provide model performance details on both PETA and 3DPeople. To measure the effect of noisy 2D keypoint detections on performance, we run separate experiments on 3DPeople ground truth and noisy keypoint data. Finally, we extract a set of factors that affect the classification accuracy and propose future work. The advantage of the approach is that the input is small and the architecture is simple, which enables us to run many experiments and keep real-time performance during inference. The source code, with the experiments and data preparation scripts, is available on GitHub (this https URL).
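
Since the input is just 15 OpenPose joints, i.e. a 30-dimensional vector of (x, y) coordinates, the classifier can be very small; the layer sizes below are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

# 15 joints x (x, y) = 30 inputs; hidden sizes are assumed for illustration.
sex_classifier = nn.Sequential(
    nn.Linear(30, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # probability of one of the two classes
)
```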

23. Smart Time-Multiplexing of Quads Solves the Multicamera Interference Problem [PDF] Back to Contents
  Tomislav Pribanic, Tomislav Petkovic, David Bojanic, Kristijan Bartol
Abstract: Time-of-flight (ToF) cameras are becoming increasingly popular for 3D imaging. Their optimal usage has been studied from several aspects. One of the open research problems is the possibility of a multicamera interference problem when two or more ToF cameras are operating simultaneously. In this work we present an efficient method to synchronize multiple operating ToF cameras. Our method is based on time-division multiplexing, but unlike traditional time multiplexing, it does not decrease the effective camera frame rate. Additionally, for unsynchronized cameras, we provide a robust method to extract, from their corresponding video streams, frames which are not subject to the multicamera interference problem. We demonstrate our approach through a series of experiments and with different levels of support available for triggering, ranging from hardware triggering to purely random software triggering.

24. End-to-end Deep Learning Methods for Automated Damage Detection in Extreme Events at Various Scales [PDF] Back to Contents
  Yongsheng Bai, Halil Sezen, Alper Yilmaz
Abstract: Robust Mask R-CNN (Mask Regional Convolutional Neural Network) methods are proposed and tested for automatic detection of cracks on structures or their components that may be damaged during extreme events, such as earthquakes. We curated a new dataset with 2,021 labeled images for training and validation and aimed to find end-to-end deep neural networks for crack detection in the field. With data augmentation and parameters fine-tuning, Path Aggregation Network (PANet) with spatial attention mechanisms and High-resolution Network (HRNet) are introduced into Mask R-CNNs. The tests on three public datasets with low or high-resolution images demonstrate that the proposed methods can achieve a big improvement over alternative networks, so the proposed method may be sufficient for crack detection for a variety of scales in real applications.

25. Towards Keypoint Guided Self-Supervised Depth Estimation [PDF] Back to Contents
  Kristijan Bartol, David Bojanic, Tomislav Petkovic, Tomislav Pribanic, Yago Diez Donoso
Abstract: This paper proposes to use keypoints as a self-supervision clue for learning depth map estimation from a collection of input images. As ground truth depth from real images is difficult to obtain, many unsupervised and self-supervised approaches to depth estimation have been proposed. Most of these unsupervised approaches use depth map and ego-motion estimations to reproject the pixels from the current image into the adjacent image from the image collection. Depth and ego-motion estimations are evaluated based on pixel intensity differences between the corresponding original and reprojected pixels. Instead of reprojecting the individual pixels, we propose to first select image keypoints in both images and then reproject and compare the corresponding keypoints of the two images. The keypoints should describe the distinctive image features well. By learning a deep model with and without the keypoint extraction technique, we show that using the keypoints improves the depth estimation learning. We also propose some future directions for keypoint-guided learning of structure-from-motion problems.

26. A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs [PDF] Back to Contents
  Souvik Kundu, Mahdi Nazemi, Peter A. Beerel, Massoud Pedram
Abstract: This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet-18, on CIFAR-10, CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20x compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives.
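
The spirit of a hybrid loss that merges compression with adversarial training can be sketched as follows; this is a simplified stand-in (clean plus adversarial cross-entropy plus an L1 sparsity term), not the actual DNR objective, and sparsity_weight is an assumed hyperparameter.

```python
import torch.nn.functional as F

def hybrid_loss(model, x_clean, x_adv, y, sparsity_weight=1e-4):
    """Simplified compression + robustness objective (NOT the exact DNR
    formulation; the weighting scheme is an assumed choice)."""
    ce_clean = F.cross_entropy(model(x_clean), y)
    ce_adv = F.cross_entropy(model(x_adv), y)   # x_adv e.g. from a PGD attack
    # L1 term pushes weights toward zero so they can be pruned.
    l1 = sum(p.abs().sum() for p in model.parameters())
    return ce_clean + ce_adv + sparsity_weight * l1
```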

27. A Tree-structure Convolutional Neural Network for Temporal Features Exaction on Sensor-based Multi-resident Activity Recognition [PDF] Back to Contents
  Jingjing Cao, Fukang Guo, Xin Lai, Qiang Zhou, Jinshan Dai
Abstract: With the proliferation of sensor devices in smart homes, activity recognition has attracted huge interest, and most existing works assume that there is only one inhabitant. In reality, however, there are generally multiple residents at home, which brings a greater challenge for recognizing activities. In addition, many conventional approaches rely on manual time series data segmentation, ignoring the inherent characteristics of events, and their heuristic hand-crafted feature generation algorithms have difficulty exploiting distinctive features to accurately classify different activities. To address these issues, we propose an end-to-end Tree-Structure Convolutional neural network based framework for Multi-Resident Activity Recognition (TSC-MRAR). First, we treat each sample as an event and obtain the current event embedding through the previous sensor readings in the sliding window, without splitting the time series data. Then, in order to automatically generate the temporal features, a tree-structure network is designed to derive the temporal dependence of nearby readings. The extracted features are fed into the fully connected layer, which can jointly learn the resident labels and the activity labels simultaneously. Finally, experiments on CASAS datasets demonstrate the high performance of our model in multi-resident activity recognition compared to state-of-the-art techniques.

28. A Comprehensive Comparison of Multi-Dimensional Image Denoising Methods [PDF] Back to Contents
  Zhaoming Kong, Xiaowei Yang, Lifang He
Abstract: Filtering multi-dimensional images such as color images, color videos, multispectral images and magnetic resonance images is challenging in terms of both effectiveness and efficiency. Leveraging the nonlocal self-similarity (NLSS) characteristic of images and sparse representation in the transform domain, the block-matching and 3D filtering (BM3D) based methods show powerful denoising performance. Recently, numerous new approaches with different regularization terms, transforms and advanced deep neural network (DNN) architectures are proposed to improve denoising quality. In this paper, we extensively compare over 60 methods on both synthetic and real-world datasets. We also introduce a new color image and video dataset for benchmarking, and our evaluations are performed from four different perspectives including quantitative metrics, visual effects, human ratings and computational cost. Comprehensive experiments demonstrate: (i) the effectiveness and efficiency of the BM3D family for various denoising tasks, (ii) a simple matrix-based algorithm could produce similar results compared with its tensor counterparts, and (iii) several DNN models trained with synthetic Gaussian noise show state-of-the-art performance on real-world color image and video datasets. Despite the progress in recent years, we discuss shortcomings and possible extensions of existing techniques. Datasets and codes for evaluation are made publicly available at this https URL.

29. Noise2Sim -- Similarity-based Self-Learning for Image Denoising [PDF] Back to Contents
  Chuang Niu, Ge Wang
Abstract: The key idea behind denoising methods is to perform a mean/averaging operation, either locally or non-locally. An observation on classic denoising methods is that non-local mean (NLM) outcomes are typically superior to locally denoised results. Despite achieving the best performance in image denoising, the supervised deep denoising methods require paired noise-clean data which are often unavailable. To address this challenge, Noise2Noise methods are based on the fact that paired noise-clean images can be replaced by paired noise-noise images which are easier to collect. However, in many scenarios the collection of paired noise-noise images are still impractical. To bypass labeled images, Noise2Void methods predict masked pixels from their surroundings in a single noisy image only. It is pitiful that neither Noise2Noise nor Noise2Void methods utilize self-similarities in an image as NLM methods do, while self-similarities/symmetries play a critical role in modern sciences. Here we propose Noise2Sim, an NLM-inspired self-learning method for image denoising. Specifically, Noise2Sim leverages self-similarities of image patches and learns to map between the center pixels of similar patches for self-consistent image denoising. Our statistical analysis shows that Noise2Sim tends to be equivalent to Noise2Noise under mild conditions. To accelerate the process of finding similar image patches, we design an efficient two-step procedure to provide data for Noise2Sim training, which can be iteratively conducted if needed. Extensive experiments demonstrate the superiority of Noise2Sim over Noise2Noise and Noise2Void on common benchmark datasets.
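
A minimal sketch of the self-similarity idea: for every noisy patch, the most similar other patch serves as the training target, in the spirit of Noise2Noise but built from self-similarity. The brute-force search below is purely illustrative; the paper's efficient two-step procedure is not reproduced.

```python
import torch

def similarity_pairs(patches):
    """Build (input, target) training pairs from each patch's most similar
    other patch (brute-force illustration under assumed patch layout).

    patches: (N, P) flattened noisy patches from one or more images
    """
    d = torch.cdist(patches, patches)       # (N, N) pairwise L2 distances
    d.fill_diagonal_(float("inf"))          # a patch may not match itself
    nn_idx = d.argmin(dim=1)
    return patches, patches[nn_idx]         # train a denoiser f: x -> x_sim
```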

30. Intra-Domain Task-Adaptive Transfer Learning to Determine Acute Ischemic Stroke Onset Time [PDF] Back to Contents
  Haoyue Zhang, Jennifer S Polson, Kambiz Nael, Noriko Salamon, Bryan Yoo, Suzie El-Saden, Fabien Scalzo, William Speier, Corey W Arnold
Abstract: Treatment of acute ischemic strokes (AIS) is largely contingent upon the time since stroke onset (TSS). However, TSS may not be readily available in up to 25% of patients with unwitnessed AIS. Current clinical guidelines for patients with unknown TSS recommend the use of MRI to determine eligibility for thrombolysis, but radiology assessments have high inter-reader variability. In this work, we present deep learning models that leverage MRI diffusion series to classify TSS based on clinically validated thresholds. We propose an intra-domain task-adaptive transfer learning method, which involves training a model on an easier clinical task (stroke detection) and then refining the model with different binary thresholds of TSS. We apply this approach to both 2D and 3D CNN architectures with our top model achieving an ROC-AUC value of 0.74, with a sensitivity of 0.70 and a specificity of 0.81 for classifying TSS < 4.5 hours. Our pretrained models achieve better classification metrics than the models trained from scratch, and these metrics exceed those of previously published models applied to our dataset. Furthermore, our pipeline accommodates a more inclusive patient cohort than previous work, as we did not exclude imaging studies based on clinical, demographic, or image processing criteria. When applied to this broad spectrum of patients, our deep learning model achieves an overall accuracy of 75.78% when classifying TSS < 4.5 hours, carrying potential therapeutic implications for patients with unknown TSS.

31. Learning to Respond with Your Favorite Stickers: A Framework of Unifying Multi-Modality and User Preference in Multi-Turn Dialog [PDF] Back to Contents
  Shen Gao, Xiuying Chen, Li Liu, Dongyan Zhao, Rui Yan
Abstract: Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching the sticker image with previous utterances. However, existing methods usually focus on measuring the matching degree between the dialog context and the sticker image, which ignores the user's preference for using stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on the multi-turn dialog context and the user's sticker-usage history. Two main challenges are confronted in this task. One is to model the sticker preference of the user based on the previous sticker selection history. Another challenge is to jointly fuse the user preference and the matching between dialog context and candidate sticker into the final prediction. To tackle these challenges, we propose a Preference Enhanced Sticker Response Selector (PESRS) model. Specifically, PESRS first employs a convolutional sticker image encoder and a self-attention based multi-turn dialog encoder to obtain the representations of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance. Then, we model the user preference by using the recently selected stickers as input, and use a key-value memory network to store the preference representation. PESRS then learns the short-term and long-term dependency between all interaction results by a fusion network, and dynamically fuses the user preference representation into the final sticker selection prediction. Extensive experiments conducted on a large-scale real-world dialog dataset show that our model achieves state-of-the-art performance for all commonly-used metrics. Experiments also verify the effectiveness of each component of PESRS.

32. Deep coastal sea elements forecasting using U-Net based models [PDF] 返回目录
  Jesús García Fernández, Ismail Alaoui Abdellaoui, Siamak Mehrkanoon
Abstract: Due to the development of deep learning techniques applied to satellite imagery, weather forecasting that uses remote sensing data has also seen major progress. The present paper investigates multi-step-ahead frame prediction for coastal sea elements in the Netherlands using U-Net based architectures. Hourly data from the Copernicus observation programme, spanning a period of 2 years, has been used to train the models and produce the forecasts, including seasonal predictions. We propose a variation of the U-Net architecture and further extend this model with residual connections, parallel convolutions, and asymmetric convolutions, yielding three additional architectures. In particular, we show that the architecture equipped with parallel and asymmetric convolutions as well as skip connections is particularly well suited for this task, outperforming the other three models.
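A block combining parallel branches, asymmetric (1xk and kx1) convolutions, and a skip connection, the ingredients the abstract singles out, might look like the following PyTorch sketch. The layer counts and kernel sizes here are illustrative guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ParallelAsymmetricBlock(nn.Module):
    """Parallel branches with asymmetric (1xk / kx1) convolutions plus a skip
    connection, in the spirit of the best-performing variant described above."""
    def __init__(self, ch, k=5):
        super().__init__()
        self.branch3 = nn.Conv2d(ch, ch, 3, padding=1)     # plain 3x3 branch
        self.branch_asym = nn.Sequential(                  # k x k receptive field
            nn.Conv2d(ch, ch, (1, k), padding=(0, k // 2)),
            nn.Conv2d(ch, ch, (k, 1), padding=(k // 2, 0)),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual skip: input is added back to both parallel branches.
        return self.act(x + self.branch3(x) + self.branch_asym(x))

x = torch.randn(1, 32, 64, 64)
print(ParallelAsymmetricBlock(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```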

33. Modular Primitives for High-Performance Differentiable Rendering [PDF] 返回目录
  Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, Timo Aila
Abstract: We present a modular differentiable renderer design that yields performance superior to previous methods by leveraging existing, highly optimized hardware graphics pipelines. Our design supports all crucial operations in a modern graphics pipeline: rasterizing large numbers of triangles, attribute interpolation, filtered texture lookups, as well as user-programmable shading and geometry processing, all at high resolution. Our modular primitives allow custom, high-performance graphics pipelines to be built directly within automatic differentiation frameworks such as PyTorch or TensorFlow. As a motivating application, we formulate facial performance capture as an inverse rendering problem and show that it can be solved efficiently using our tools. Our results indicate that this simple and straightforward approach achieves excellent geometric correspondence between rendered results and reference imagery.
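The primitives compose directly under autograd. The sketch below follows the publicly released nvdiffrast PyTorch API (rasterize, interpolate, antialias); the geometry and loss are placeholders, and a CUDA-capable GPU is assumed.

```python
import torch
import nvdiffrast.torch as dr  # library released alongside this paper

glctx = dr.RasterizeGLContext()

# Placeholder inputs: clip-space vertices (1, V, 4), int32 triangles (T, 3),
# and per-vertex colors (1, V, 3); in practice these come from a learned model.
pos = torch.randn(1, 8, 4, device="cuda")
tri = torch.randint(0, 8, (12, 3), device="cuda", dtype=torch.int32)
col = torch.rand(1, 8, 3, device="cuda", requires_grad=True)

rast, _ = dr.rasterize(glctx, pos, tri, resolution=[256, 256])  # coverage + barycentrics
color, _ = dr.interpolate(col, rast, tri)                       # attribute interpolation
color = dr.antialias(color, rast, pos, tri)                     # differentiable AA

loss = color.mean()   # any image-space loss; gradients flow back to `col`
loss.backward()
```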

34. Task-relevant Representation Learning for Networked Robotic Perception [PDF] 返回目录
  Manabu Nakanoya, Sandeep Chinchali, Alexandros Anemogiannis, Akul Datta, Sachin Katti, Marco Pavone
Abstract: Today, even the most compute- and power-constrained robots can measure complex, high-data-rate video and LIDAR sensory streams. Often, such robots, ranging from low-power drones to space and subterranean rovers, need to transmit high-bitrate sensory data to a remote compute server when they are uncertain or cannot scalably run complex perception or mapping tasks locally. However, today's representations for sensory data are mostly designed for human, not robotic, perception, and thus often waste precious compute or wireless network resources transmitting unimportant parts of a scene that are unnecessary for a high-level robotic task. This paper presents an algorithm that learns task-relevant representations of sensory data, co-designed with a pre-trained robotic perception model's ultimate objective. Our algorithm aggressively compresses robotic sensory data, by up to 11x more than competing methods. Further, it achieves high accuracy and robust generalization on diverse tasks including Mars terrain classification with low-power deep learning accelerators, neural motion planning, and environmental time-series classification.
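The core idea, training a compact codec against the frozen task model's loss rather than pixel reconstruction error, can be sketched in a few lines. Everything below (model sizes, the 32-dimensional code, the placeholder data) is an illustrative assumption, not the paper's pipeline.

```python
import torch
import torch.nn as nn

# Frozen, pre-trained perception model (stand-in); its task loss shapes the codec.
task_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
for p in task_model.parameters():
    p.requires_grad_(False)

# Hypothetical low-dimensional codec: only task-relevant bits need to survive.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 32))   # ~96x smaller
decoder = nn.Sequential(nn.Linear(32, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
x = torch.randn(8, 3, 32, 32)          # sensory batch (placeholder data)
y = torch.randint(0, 10, (8,))         # task labels (placeholder)

# Train the codec against the *task* loss, not pixel reconstruction error.
logits = task_model(decoder(encoder(x)))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
opt.step()
```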

35. Automatic Head and Neck Tumor Segmentation in PET/CT with Scale Attention Network [PDF] 返回目录
  Yading Yuan
Abstract: Automatic segmentation is an essential but challenging step for extracting quantitative imaging biomarkers to characterize head and neck tumors in tumor detection, diagnosis, prognosis, treatment planning, and assessment. The HEad and neCK TumOR Segmentation Challenge 2020 (HECKTOR 2020) provides a common platform for comparing different automatic algorithms for segmenting the primary gross target volume (GTV) in the oropharynx region on FDG-PET and CT images. We participated in the image segmentation challenge by developing a fully automatic segmentation network based on an encoder-decoder architecture. In order to better integrate information across different scales, we proposed a dynamic scale attention mechanism that incorporates low-level details with high-level semantics from feature maps at different scales. Our framework was trained using the 201 challenge training cases provided by HECKTOR 2020, and achieved an average Dice Similarity Coefficient (DSC) of 0.7505 with cross validation. On the 53 test cases, our model achieved an average DSC, precision, and recall of 0.7318, 0.7851, and 0.7319, respectively, which ranked our method fourth in the challenge (id: deepX).
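One plausible reading of such a scale attention mechanism is a learned, per-pixel weighting of upsampled multi-scale feature maps before fusion. The PyTorch sketch below is a simplified stand-in under that assumption, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAttention(nn.Module):
    """Weights feature maps from different scales before fusing them;
    a simplified stand-in for the paper's dynamic scale attention."""
    def __init__(self, ch, n_scales):
        super().__init__()
        self.gate = nn.Conv2d(ch * n_scales, n_scales, kernel_size=1)

    def forward(self, feats):           # feats: list of (B, ch, Hi, Wi)
        size = feats[0].shape[-2:]
        ups = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
               for f in feats]          # bring every scale to a common size
        w = torch.softmax(self.gate(torch.cat(ups, dim=1)), dim=1)  # per-pixel weights
        return sum(w[:, i:i + 1] * ups[i] for i in range(len(ups)))

feats = [torch.randn(1, 16, 64, 64), torch.randn(1, 16, 32, 32)]
print(ScaleAttention(16, 2)(feats).shape)  # torch.Size([1, 16, 64, 64])
```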

36. MorphEyes: Variable Baseline Stereo For Quadrotor Navigation [PDF] 返回目录
  Nitin J. Sanket, Chahat Deep Singh, Varun Asthana, Cornelia Fermüller, Yiannis Aloimonos
Abstract: Morphable design and depth-based visual control are two upcoming trends driving advances in the field of quadrotor autonomy. Stereo cameras strike an excellent balance between weight and depth-estimation accuracy, but suffer from a limited depth range that is dictated by the baseline chosen at design time. In this paper, we present a framework for quadrotor navigation based on a stereo camera system whose baseline can be adapted on the fly. We present a method to calibrate the system at a small number of discrete baselines and interpolate the parameters over the entire baseline range. We present an extensive theoretical analysis of calibration and synchronization errors. We showcase three different applications of such a system for quadrotor navigation: (a) flying through a forest, (b) flying through a static or dynamic gap of unknown shape and location, and (c) accurate 3D pose detection of an independently moving object. We show that our variable baseline system is more accurate and robust in all three scenarios. To our knowledge, this is the first work that applies the concept of morphable design to achieve a variable baseline stereo vision system on a quadrotor.
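The value of a variable baseline follows from the standard stereo relation Z = fB/d, and per-baseline calibration parameters can be interpolated between the discretely calibrated points. A small illustrative sketch, with made-up calibration numbers:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Standard stereo relation Z = f * B / d: a longer baseline B pushes the
    usable depth range farther out for the same disparity resolution."""
    return focal_px * baseline_m / disparity_px

# Hypothetical calibration: a parameter measured at a few discrete baselines,
# linearly interpolated for any commanded baseline in between.
calib_baselines = np.array([0.10, 0.20, 0.30])   # meters
calib_cx_offset = np.array([1.5, 2.1, 2.8])      # e.g., principal-point drift (px)

def interp_param(baseline_m):
    return np.interp(baseline_m, calib_baselines, calib_cx_offset)

print(depth_from_disparity(disparity_px=20.0, focal_px=600.0, baseline_m=0.25))  # 7.5 m
print(interp_param(0.25))  # interpolated calibration parameter at B = 0.25 m
```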

37. Identifying and interpreting tuning dimensions in deep networks [PDF] 返回目录
  Nolan S. Dey, J. Eric Taylor, Bryan P. Tripp, Alexander Wong, Graham W. Taylor
Abstract: In neuroscience, a tuning dimension is a stimulus attribute that accounts for much of the activation variance of a group of neurons. Such dimensions are commonly used to decipher the responses of these groups. While researchers have attempted to manually identify an analogue of these tuning dimensions in deep neural networks, we are unaware of any automatic way to discover them. This work contributes an unsupervised framework for identifying and interpreting "tuning dimensions" in deep networks. Our method correctly identifies the tuning dimensions of a synthetic Gabor filter bank and the tuning dimensions of the first two layers of InceptionV1 trained on ImageNet.
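While the paper's framework is not reproduced here, the underlying notion, a direction that accounts for much of a unit group's activation variance, can be illustrated with a plain PCA over activations:

```python
import numpy as np

# One simple reading of "tuning dimensions": directions in activation space
# that explain most of a unit group's response variance (placeholder data).
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))        # activations: 1000 stimuli x 64 units
acts -= acts.mean(axis=0)                 # center before PCA

# PCA via SVD: top right-singular vectors are candidate tuning dimensions.
_, s, vt = np.linalg.svd(acts, full_matrices=False)
explained = s**2 / np.sum(s**2)
tuning_dims = vt[:2]                      # two leading candidate dimensions
print(explained[:2])                      # variance each dimension accounts for
```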

Note: the cover image is a word cloud of the paper titles.