Table of Contents
1. Transmission Map and Atmospheric Light Guided Iterative Updater Network for Single Image Dehazing [PDF] Abstract
4. Applying Incremental Deep Neural Networks-based Posture Recognition Model for Injury Risk Assessment in Construction [PDF] Abstract
9. Evaluating the performance of the LIME and Grad-CAM explanation methods on a LEGO multi-label image classification task [PDF] Abstract
15. Tracking Skin Colour and Wrinkle Changes During Cosmetic Product Trials Using Smartphone Images [PDF] Abstract
22. Addressing the Cold-Start Problem in Outfit Recommendation Using Visual Preference Modelling [PDF] Abstract
25. Predicting the Blur Visual Discomfort for Natural Scenes by the Loss of Positional Information [PDF] Abstract
26. Hyperspectral Image Classification with Spatial Consistence Using Fully Convolutional Spatial Propagation Network [PDF] Abstract
27. MSDPN: Monocular Depth Prediction with Partial Laser Observation using Multi-stage Neural Networks [PDF] Abstract
28. Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [PDF] Abstract
44. Multi-Class 3D Object Detection Within Volumetric 3D Computed Tomography Baggage Security Screening Imagery [PDF] Abstract
45. Generalized Zero-Shot Domain Adaptation via Coupled Conditional Variational Autoencoders [PDF] Abstract
51. Multiple instance learning on deep features for weakly supervised object detection with extreme domain shifts [PDF] Abstract
56. A Visual Analytics Framework for Reviewing Multivariate Time-Series Data with Dimensionality Reduction [PDF] Abstract
57. Land Use and Land Cover Classification using a Human Group based Particle Swarm Optimization Algorithm with a LSTM classifier on hybrid-pre-processing Remote Sensing Images [PDF] Abstract
64. Two-Stage Deep Learning for Accelerated 3D Time-of-Flight MRA without Matched Training Data [PDF] Abstract
66. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization [PDF] Abstract
67. Design and Deployment of Photo2Building: A Cloud-based Procedural Modeling Tool as a Service [PDF] Abstract
68. Generalisable Cardiac Structure Segmentation via Attentional and Stacked Image Adaptation [PDF] Abstract
73. 3D B-mode ultrasound speckle reduction using deep learning for 3D registration applications [PDF] Abstract
Abstracts
1. Transmission Map and Atmospheric Light Guided Iterative Updater Network for Single Image Dehazing [PDF] Back to Contents
Aupendu Kar, Sobhan Kanti Dhara, Debashis Sen, Prabir Kumar Biswas
Abstract: Hazy images obscure content visibility and hinder several subsequent computer vision tasks. For dehazing in a wide variety of hazy conditions, an end-to-end deep network jointly estimating the dehazed image along with suitable transmission map and atmospheric light for guidance could prove effective. To this end, we propose an Iterative Prior Updated Dehazing Network (IPUDN) based on a novel iterative update framework. We present a novel convolutional architecture to estimate channel-wise atmospheric light, which along with an estimated transmission map are used as priors for the dehazing network. Use of channel-wise atmospheric light allows our network to handle color casts in hazy images. In our IPUDN, the transmission map and atmospheric light estimates are updated iteratively using corresponding novel updater networks. The iterative mechanism is leveraged to gradually modify the estimates toward those appropriately representing the hazy condition. These updates occur jointly with the iterative estimation of the dehazed image using a convolutional neural network with LSTM driven recurrence, which introduces inter-iteration dependencies. Our approach is qualitatively and quantitatively found effective for synthetic and real-world hazy images depicting varied hazy conditions, and it outperforms the state-of-the-art. Thorough analyses of IPUDN through additional experiments and detailed ablation studies are also presented.
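As background for this entry, single-image dehazing methods of this kind build on the atmospheric scattering model I(x) = J(x) t(x) + A (1 - t(x)), where I is the hazy image, J the clear scene, t the transmission map, and A the atmospheric light. The sketch below illustrates only the iterative prior-update loop the abstract describes; the module internals, tensor shapes, and updater inputs are placeholders of ours, not the authors' architecture.

```python
import torch
import torch.nn as nn

class IterativeDehazer(nn.Module):
    """Schematic IPUDN-style loop: estimate the priors once, then jointly
    refine the dehazed image, transmission map t, and atmospheric light A."""

    def __init__(self, est_t, est_a, upd_t, upd_a, dehazer, steps=4):
        super().__init__()
        self.est_t, self.est_a = est_t, est_a  # initial prior estimators
        self.upd_t, self.upd_a = upd_t, upd_a  # per-iteration updater networks
        self.dehazer = dehazer                 # recurrent dehazing CNN (keeps LSTM state)
        self.steps = steps

    def forward(self, hazy):
        t = self.est_t(hazy)   # transmission map, e.g. (B, 1, H, W)
        a = self.est_a(hazy)   # channel-wise atmospheric light, e.g. (B, 3, 1, 1)
        out, state = hazy, None
        for _ in range(self.steps):
            # the LSTM state introduces the inter-iteration dependencies
            out, state = self.dehazer(hazy, out, t, a, state)
            t = t + self.upd_t(torch.cat([hazy, out, t], dim=1))  # residual update of t
            a = a + self.upd_a(out)                               # residual update of A
        return out
```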
2. MOR-UAV: A Benchmark Dataset and Baselines for Moving Object Recognition in UAV Videos [PDF] Back to Contents
Murari Mandal, Lav Kush Kumar, Santosh Kumar Vipparthi
Abstract: Visual data collected from Unmanned Aerial Vehicles (UAVs) has opened a new frontier of computer vision that requires automated analysis of aerial images/videos. However, the existing UAV datasets primarily focus on object detection. An object detector does not differentiate between moving and non-moving objects. Given a real-time UAV video stream, how can we both localize and classify the moving objects, i.e. perform moving object recognition (MOR)? MOR is one of the essential tasks for supporting various UAV vision-based applications including aerial surveillance, search and rescue, event recognition, and urban and rural scene understanding. To the best of our knowledge, no labeled dataset is available for MOR evaluation in UAV videos. Therefore, in this paper, we introduce MOR-UAV, a large-scale video dataset for MOR in aerial videos. We achieve this by labeling axis-aligned bounding boxes for moving objects, which requires fewer computational resources than producing pixel-level estimates. We annotate 89,783 moving object instances collected from 30 UAV videos, consisting of 10,948 frames in various scenarios such as weather conditions, occlusion, changing flying altitude and multiple camera views. We assigned labels for two categories of vehicles (car and heavy vehicle). Furthermore, we propose a deep unified framework, MOR-UAVNet, for MOR in UAV videos. Since this is a first attempt at MOR in UAV videos, we present 16 baseline results based on the proposed framework over the MOR-UAV dataset through quantitative and qualitative experiments. We also analyze the motion-salient regions in the network through multiple layer visualizations. MOR-UAVNet works online at inference as it requires only a few past frames. Moreover, it doesn't require predefined target initialization from the user. Experiments also demonstrate that the MOR-UAV dataset is quite challenging.
3. Multimodal Image-to-Image Translation via a Single Generative Adversarial Network [PDF] Back to Contents
Shihua Huang, Cheng He, Ran Cheng
Abstract: Although significant advances in image-to-image (I2I) translation with Generative Adversarial Networks (GANs) have been made, it remains challenging to effectively translate an image to a set of diverse images in multiple target domains using a single pair of generator and discriminator. Existing multimodal I2I translation methods adopt multiple domain-specific content encoders for different domains, where each domain-specific content encoder is trained with images from the same domain only. Nevertheless, we argue that the content (domain-invariant) features should be learned from images across all the domains. Consequently, each domain-specific content encoder of existing schemes fails to extract the domain-invariant features efficiently. To address this issue, we present a flexible and general SoloGAN model for efficient multimodal I2I translation among multiple domains with unpaired data. In contrast to existing methods, the SoloGAN algorithm uses a single projection discriminator with an additional auxiliary classifier, and shares the encoder and generator for all domains. As such, the SoloGAN model can be trained effectively with images from all domains such that the domain-invariant content representation can be efficiently extracted. Qualitative and quantitative results over a wide range of datasets against several counterparts and variants of the SoloGAN model demonstrate the merits of the method, especially for the challenging I2I translation tasks, i.e., tasks that involve extreme shape variations or need to keep the complex backgrounds unchanged after translation. Furthermore, we demonstrate the contribution of each component using ablation studies.
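To make the "single projection discriminator with an additional auxiliary classifier" concrete, here is a generic sketch of that discriminator pattern; the backbone, feature width, and layer choices are invented for illustration and are not SoloGAN's exact architecture.

```python
import torch
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    """One discriminator for all domains: a real/fake score conditioned on the
    domain via an embedding projection, plus an auxiliary domain classifier."""

    def __init__(self, num_domains, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_adv = nn.Linear(feat_dim, 1)               # unconditional score
        self.embed = nn.Embedding(num_domains, feat_dim)   # projection term
        self.fc_cls = nn.Linear(feat_dim, num_domains)     # auxiliary classifier

    def forward(self, img, domain):
        h = self.backbone(img)                             # (B, feat_dim)
        adv = self.fc_adv(h) + (self.embed(domain) * h).sum(dim=1, keepdim=True)
        return adv, self.fc_cls(h)                         # adversarial score, domain logits
```

Because the domain enters only through the embedding, a single discriminator (and likewise a single encoder/generator) can serve every domain.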
4. Applying Incremental Deep Neural Networks-based Posture Recognition Model for Injury Risk Assessment in Construction [PDF] Back to Contents
Junqi Zhao, Esther Obonyo
Abstract: Monitoring awkward postures is a proactive prevention measure for Musculoskeletal Disorders (MSDs) in construction. Machine Learning (ML) models have shown promising results for posture recognition from wearable sensors. However, further investigations are needed concerning: i) Incremental Learning (IL), where trained models adapt to learn new postures while controlling the forgetting of learned postures; ii) MSDs assessment with recognized postures. This study proposed an incremental Convolutional Long Short-Term Memory (CLN) model, investigated effective IL strategies, and evaluated MSDs assessment using recognized postures. Tests with nine workers showed that the CLN model with shallow convolutional layers achieved high recognition performance (F1 score) under personalized (0.87) and generalized (0.84) modeling. The generalized shallow CLN model under the Many-to-One IL scheme can balance adaptation (0.73) and forgetting of learnt subjects (0.74). MSDs assessment using postures recognized by the incremental CLN model differed only slightly from the ground truth, which demonstrates the high potential for automated MSDs monitoring in construction.
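The "shallow convolutional layers + LSTM" pairing for windows of wearable-sensor data can be sketched as below; the channel counts, layer sizes, and number of posture classes are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ShallowCLN(nn.Module):
    """A minimal convolutional + LSTM posture classifier over sensor windows."""

    def __init__(self, channels=6, num_postures=8):
        super().__init__()
        self.conv = nn.Sequential(              # shallow convolutional front end
            nn.Conv1d(channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2))
        self.lstm = nn.LSTM(32, 64, batch_first=True)  # temporal modelling
        self.head = nn.Linear(64, num_postures)

    def forward(self, x):                       # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)        # -> (batch, time, features)
        _, (h_n, _) = self.lstm(h)
        return self.head(h_n[-1])               # posture logits
```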
5. Simultaneous Semantic Alignment Network for Heterogeneous Domain Adaptation [PDF] Back to Contents
Shuang Li, Binhui Xie, Jiashu Wu, Ying Zhao, Chi Harold Liu, Zhengming Ding
Abstract: Heterogeneous domain adaptation (HDA) transfers knowledge across source and target domains that present heterogeneities, e.g., distinct domain distributions and differences in feature type or dimension. Most previous HDA methods tackle this problem by learning a domain-invariant feature subspace to reduce the discrepancy between domains. However, the intrinsic semantic properties contained in the data are under-explored in such an alignment strategy, even though they are indispensable for achieving promising adaptability. In this paper, we propose a Simultaneous Semantic Alignment Network (SSAN) to simultaneously exploit correlations among categories and align the centroids for each category across domains. In particular, we propose an implicit semantic correlation loss to transfer the correlation knowledge of source categorical prediction distributions to the target domain. Meanwhile, by leveraging target pseudo-labels, a robust triplet-centroid alignment mechanism is explicitly applied to align feature representations for each category. Notably, a pseudo-label refinement procedure involving geometric similarity is introduced to enhance the target pseudo-label assignment accuracy. Comprehensive experiments on various HDA tasks across text-to-image, image-to-image and text-to-text settings successfully validate the superiority of our SSAN against state-of-the-art HDA methods. The code is publicly available at this https URL.
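One ingredient the abstract names, aligning per-category centroids across domains with the help of target pseudo-labels, can be illustrated with the simplified loss below; SSAN's actual triplet-centroid mechanism is richer, so treat this as a minimal reading.

```python
import torch

def centroid_alignment_loss(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes):
    """Pull the source and (pseudo-labelled) target centroid of each class together."""
    loss, used = src_feats.new_zeros(()), 0
    for c in range(num_classes):
        s_mask, t_mask = src_labels == c, tgt_pseudo == c
        if s_mask.any() and t_mask.any():
            s_centroid = src_feats[s_mask].mean(dim=0)
            t_centroid = tgt_feats[t_mask].mean(dim=0)
            loss = loss + (s_centroid - t_centroid).pow(2).sum()
            used += 1
    return loss / max(used, 1)
```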
6. Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos [PDF] Back to Contents
Yanhui Guo, Xi Zhang, Xiaolin Wu
Abstract: We propose a novel deep multi-modality neural network for restoring very low bit rate videos of talking heads. Such video contents are very common in social media, teleconferencing, distance education, tele-medicine, etc., and often need to be transmitted with limited bandwidth. The proposed CNN method exploits the correlations among three modalities, namely the video, audio and emotion state of the speaker, to remove the video compression artifacts caused by spatial downsampling and quantization. The deep learning approach turns out to be ideally suited for the video restoration task, as the complex non-linear cross-modality correlations are very difficult to model analytically and explicitly. The new method is a video post-processor that can significantly boost the perceptual quality of aggressively compressed talking head videos, while being fully compatible with all existing video compression standards.
7. PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations [PDF] Back to Contents
Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Carsten Stoll, Christian Theobalt
Abstract: Implicit surface representations, such as signed-distance functions, combined with deep learning have led to impressive models which can represent detailed shapes of objects with arbitrary topology. Since a continuous function is learned, the reconstructions can also be extracted at any arbitrary resolution. However, large datasets such as ShapeNet are required to train such models. In this paper, we present a new mid-level patch-based surface representation. At the level of patches, objects across different categories share similarities, which leads to more generalizable models. We then introduce a novel method to learn this patch-based representation in a canonical space, such that it is as object-agnostic as possible. We show that our representation trained on one category of objects from ShapeNet can also well represent detailed shapes from any other category. In addition, it can be trained using much fewer shapes, compared to existing approaches. We show several applications of our new representation, including shape interpolation and partial point cloud completion. Due to explicit control over positions, orientations and scales of patches, our representation is also more controllable compared to object-level representations, which enables us to deform encoded shapes non-rigidly.
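As a rough picture of a patch-based implicit representation, the sketch below evaluates one shared MLP in each patch's local frame and blends the per-patch signed distances; the pose parameterization (translation and scale only, no rotation) and the Gaussian blending weights are our simplifications, not PatchNets' exact formulation.

```python
import torch
import torch.nn as nn

# shared across all patches and objects: maps (local coords, patch latent) -> SDF
mlp = nn.Sequential(nn.Linear(3 + 32, 128), nn.ReLU(), nn.Linear(128, 1))

def patch_sdf(query, centers, scales, latents):
    """query: (3,) point; centers: (P, 3); scales: (P,); latents: (P, 32)."""
    local = (query - centers) / scales.unsqueeze(1)           # patch-local coordinates
    sdf = mlp(torch.cat([local, latents], dim=1)).squeeze(1)  # per-patch SDF values
    w = torch.exp(-local.norm(dim=1) ** 2)                    # Gaussian blending weights
    return (w * sdf * scales).sum() / w.sum().clamp_min(1e-8) # blended world-scale SDF
```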
8. Shape Consistent 2D Keypoint Estimation under Domain Shift [PDF] Back to Contents
Levi O. Vasconcelos, Massimiliano Mancini, Davide Boscaini, Samuel Rota Bulo, Barbara Caputo, Elisa Ricci
Abstract: Recent unsupervised domain adaptation methods based on deep architectures have shown remarkable performance not only in traditional classification tasks but also in more complex problems involving structured predictions (e.g. semantic segmentation, depth estimation). Following this trend, in this paper we present a novel deep adaptation framework for estimating keypoints under domain shift, i.e. when the training (source) and the test (target) images significantly differ in terms of visual appearance. Our method seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, our deep architecture leverages domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a novel loss is proposed which combines an adversarial term for ensuring aligned predictions in the output space and a geometric consistency term which guarantees coherent predictions between a target sample and its perturbed version. Our extensive experimental evaluation conducted on three publicly available benchmarks shows that our approach outperforms state-of-the-art domain adaptation methods in the 2D keypoint prediction task.
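The geometric consistency term has a simple generic form: predictions on a warped image should agree with the warped predictions on the original. A minimal sketch, with `model`, `warp_image`, and `warp_points` as stand-ins for whatever the actual pipeline provides:

```python
import torch.nn.functional as F

def geometric_consistency_loss(model, image, warp_image, warp_points):
    pred = model(image)                          # (B, K, 2) keypoints on the original
    pred_on_warped = model(warp_image(image))    # keypoints on the perturbed image
    return F.mse_loss(pred_on_warped, warp_points(pred))
```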
9. Evaluating the performance of the LIME and Grad-CAM explanation methods on a LEGO multi-label image classification task [PDF] Back to Contents
David Cian, Jan van Gemert, Attila Lengyel
Abstract: In this paper, we run two methods of explanation, namely LIME and Grad-CAM, on a convolutional neural network trained to label images with the LEGO bricks that are visible in them. We evaluate them on two criteria: the improvement of the network's core performance, and the trust they are able to generate for users of the system. We find that in general, Grad-CAM seems to outperform LIME on this specific task: it yields more detailed insight from the point of view of core performance, and 80% of respondents, asked to choose between them when it comes to the trust they inspire in the model, choose Grad-CAM. However, we also posit that it is more useful to employ these two methods together, as the insights they yield are complementary.
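For readers wanting to reproduce this kind of comparison, a generic Grad-CAM can be computed with forward and backward hooks; this is the textbook formulation, not necessarily the authors' exact evaluation code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Heat map for `class_idx`: layer activations weighted by pooled gradients."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads['g'].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = F.relu((weights * acts['a']).sum(dim=1))       # weighted sum over channels
    return cam / cam.max().clamp_min(1e-8)               # normalised (1, h, w) heat map
```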
10. Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions [PDF] Back to Contents
Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, Hongsheng Li
Abstract: We propose a novel algorithm, named Open-Edit, which is the first attempt on open-domain image manipulation with open-vocabulary instructions. It is a challenging task considering the large variation of image domains and the lack of training supervision. Our approach takes advantage of the unified visual-semantic embedding space pretrained on a general image-caption dataset, and manipulates the embedded visual features by applying text-guided vector arithmetic on the image feature maps. A structure-preserving image decoder then generates the manipulated images from the manipulated feature maps. We further propose an on-the-fly sample-specific optimization approach with cycle-consistency constraints to regularize the manipulated images and force them to preserve details of the source images. Our approach shows promising results in manipulating open-vocabulary color, texture, and high-level attributes for various scenarios of open-domain images.
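The "text-guided vector arithmetic on the image feature maps" admits a compact illustration: shift spatial features along the direction between two word embeddings, weighted by where the source concept appears. The weighting below is our guess for illustration; Open-Edit's actual manipulation and grounding are more elaborate.

```python
import torch

def edit_features(feat_map, e_src, e_tgt, alpha=1.0):
    """feat_map: (C, H, W); e_src, e_tgt: (C,) embeddings in the joint space."""
    direction = (e_tgt - e_src).view(-1, 1, 1)
    # soft spatial mask: how strongly each location expresses the source concept
    sim = torch.cosine_similarity(feat_map, e_src.view(-1, 1, 1), dim=0)
    return feat_map + alpha * sim.clamp(min=0) * direction
```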
11. Simultaneous Consensus Maximization and Model Fitting [PDF] Back to Contents
Fei Wen, Hewen Wei, Yipeng Liu, Peilin Liu
Abstract: Maximum consensus (MC) robust fitting is a fundamental problem in low-level vision for processing raw data. Typically, it first finds a consensus set of inliers and then fits a model on the consensus set. This work proposes a new formulation to achieve simultaneous maximum consensus and model estimation (MCME), which has two significant features compared with traditional MC robust fitting. First, it takes the fitting residual into account in finding inliers, hence its lowest achievable residual in model fitting is lower than that of MC robust fitting. Second, it has an unconstrained formulation involving binary variables, which facilitates the use of the effective semidefinite relaxation (SDR) method to handle the underlying challenging combinatorial optimization problem. Though still nonconvex after SDR, it becomes biconvex in some applications, for which we use an alternating minimization algorithm to solve. Further, the sparsity of the problem is exploited in combination with low-rank factorization to develop an efficient algorithm. Experiments show that MCME significantly outperforms RANSAC and deterministic approximate MC methods at high outlier ratios. Besides, in rotation and Euclidean registration, it also compares favorably with state-of-the-art registration methods, especially with high noise and outliers. Code is available at this https URL.
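For reference, the maximum consensus objective, and one common way of writing a joint consensus-plus-fitting objective with binary inlier indicators, can be stated as follows; the second line is an illustrative reading of the MCME idea, not the paper's exact formulation.

```latex
% maximum consensus: maximize the number of inliers within threshold \epsilon
\max_{\theta}\ \bigl|\{\, i : |r_i(\theta)| \le \epsilon \,\}\bigr|

% joint form with binary indicators u_i \in \{0,1\}: pay 1 per outlier and a
% weighted fitting residual per inlier (weighting assumed for illustration)
\min_{\theta,\ u \in \{0,1\}^n}\ \sum_{i=1}^{n} \Bigl( (1 - u_i) + \lambda\, u_i\, r_i(\theta)^2 \Bigr)
```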
12. Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation [PDF] Back to Contents
Hui Zhou, Xinge Zhu, Xiao Song, Yuexin Ma, Zhe Wang, Hongsheng Li, Dahua Lin
Abstract: State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space. The projection methods include spherical projection, bird's-eye-view projection, etc. Although this process makes the point cloud suitable for 2D CNN-based networks, it inevitably alters and abandons the 3D topology and geometric relations. A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space. In this work, we first perform an in-depth analysis of different representations and backbones in 2D and 3D spaces, and reveal the effectiveness of 3D representations and networks on LiDAR segmentation. Then, we develop a 3D cylinder partition and a 3D cylinder convolution based framework, termed Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds. Moreover, a dimension-decomposition based context modeling module is introduced to explore the high-rank context information in point clouds in a progressive manner. We evaluate the proposed model on a large-scale driving-scene dataset, i.e., SemanticKITTI. Our method achieves state-of-the-art performance and outperforms existing methods by 6% in terms of mIoU.
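The cylinder partition itself is easy to picture: points are binned by radius, azimuth, and height instead of x, y, z, so that cell sizes grow with distance the way LiDAR point density falls off. A small NumPy sketch, with grid sizes and ranges invented for illustration:

```python
import numpy as np

def cylindrical_voxel_indices(points, grid=(480, 360, 32),
                              rho_range=(0.0, 50.0), z_range=(-4.0, 2.0)):
    """points: (N, 3) array of x, y, z in metres -> (N, 3) integer voxel indices."""
    rho = np.hypot(points[:, 0], points[:, 1])              # radial distance
    phi = np.arctan2(points[:, 1], points[:, 0])            # azimuth in [-pi, pi)
    rho_i = (rho - rho_range[0]) / (rho_range[1] - rho_range[0]) * grid[0]
    phi_i = (phi + np.pi) / (2 * np.pi) * grid[1]
    z_i = (points[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * grid[2]
    idx = np.stack([rho_i, phi_i, z_i], axis=1).astype(np.int64)
    return np.clip(idx, 0, np.array(grid) - 1)              # clamp out-of-range points
```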
13. Online Continual Learning under Extreme Memory Constraints [PDF] Back to Contents
Enrico Fini, Stéphane Lathuilière, Enver Sangineto, Moin Nabi, Elisa Ricci
Abstract: Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible algorithm can use to avoid catastrophic forgetting. As most, if not all, previous CL methods violate these constraints, we propose an algorithmic solution to MC-OCL: Batch-level Distillation (BLD), a regularization-based CL approach, which effectively balances stability and plasticity in order to learn from data streams, while preserving the ability to solve old tasks through distillation. Our extensive experimental evaluation, conducted on three publicly available benchmarks, empirically demonstrates that our approach successfully addresses the MC-OCL problem and achieves comparable accuracy to prior distillation methods requiring higher memory overhead.
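Batch-level distillation has its own specifics, but the backbone it builds on is the standard soft-target distillation term between a frozen snapshot (teacher) and the current model (student) on the incoming batch, which is what lets it avoid a replay memory. A minimal sketch with an assumed temperature:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL between temperature-softened teacher and student predictions."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * T * T
```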
14. Learning Stereo from Single Images [PDF] Back to Contents
Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, Michael Firman
Abstract: Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.
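The core trick, turning a single image plus a predicted disparity map into a training stereo pair, amounts to forward-warping each pixel by its disparity. A deliberately naive NumPy sketch; real pipelines also handle collisions, occlusions, and hole filling much more carefully.

```python
import numpy as np

def forward_warp(left, disparity):
    """left: (H, W, 3) image; disparity: (H, W) pixels -> synthetic right view + hole mask."""
    H, W, _ = left.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        x_new = np.clip((xs - disparity[y]).round().astype(int), 0, W - 1)
        order = np.argsort(disparity[y])       # paint far pixels first, near pixels last
        right[y, x_new[order]] = left[y, order]
        filled[y, x_new[order]] = True
    return right, ~filled                      # holes need inpainting downstream
```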
15. Tracking Skin Colour and Wrinkle Changes During Cosmetic Product Trials Using Smartphone Images [PDF] Back to Contents
Alan F. Smeaton, Swathikiran Srungavarapu, Cyril Messaraa, Claire Tansey
Abstract: Background: To explore how the efficacy of product trials for skin cosmetics can be improved through the use of consumer-level images taken by volunteers using a conventional smartphone. Materials and Methods: 12 women aged 30 to 60 years participated in a product trial and had close-up images of the cheek and temple regions of their faces taken with a high-resolution Antera 3D CS camera at the start and end of a 4-week period. Additionally, they each had "selfies" of the same regions of their faces taken regularly throughout the trial period. Automatic image analysis used three kinds of colour normalisation to identify changes in skin colour, while the analysis of wrinkle composition identified edges and calculated their magnitude. Results: Images taken at the start and end of the trial acted as baseline ground truth for normalisation of the smartphone images, and showed large changes in both colour and wrinkle magnitude during the trial for many volunteers. Conclusions: Results demonstrate that regular use of selfie smartphone images within trial periods can add value to interpretation of the efficacy of the trial.
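The abstract does not name the three colour normalisations, so as a concrete stand-in here is one widely used option (grey-world) together with a Sobel edge-magnitude proxy for wrinkle strength; both are assumptions for illustration, not the study's exact pipeline.

```python
import numpy as np
import cv2

def grey_world(img):
    """img: (H, W, 3) float RGB; scale each channel so all share the global mean."""
    means = img.reshape(-1, 3).mean(axis=0)
    return np.clip(img * (means.mean() / means), 0, 255)

def wrinkle_magnitude(img):
    """Mean Sobel gradient magnitude as a simple wrinkle-strength proxy."""
    grey = cv2.cvtColor(img.astype(np.uint8), cv2.COLOR_RGB2GRAY)
    gx = cv2.Sobel(grey, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(grey, cv2.CV_32F, 0, 1)
    return float(np.hypot(gx, gy).mean())
```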
16. Guiding CNNs towards Relevant Concepts by Multi-task and Adversarial Learning [PDF] Back to Contents
Mara Graziani, Sebastian Otalora, Henning Muller, Vincent Andrearczyk
Abstract: The opaqueness of deep learning limits its deployment in critical application scenarios such as cancer grading in medical images. In this paper, a framework for guiding CNN training is built on top of successful existing techniques of hard parameter sharing, with the main goal of explicitly introducing expert knowledge into the training objectives. The learning process is guided by identifying concepts that are relevant or misleading for the task. Relevant concepts are encouraged to appear in the representation through multi-task learning. Undesired and misleading concepts are discouraged by a gradient reversal operation. In this way, a shift in the deep representations can be corrected to match the clinicians' assumptions. The application on breast lymph node histopathology data from the Camelyon challenge shows a significant increase in the generalization performance on unseen patients (from 0.839 to 0.864 average AUC, p-value = 0.0002) when the internal representations are controlled by prior knowledge regarding the acquisition center and visual features of the tissue. The code will be shared for reproducibility on our GitHub repository.
17. Weight-Sharing Neural Architecture Search: A Battle to Shrink the Optimization Gap [PDF] 返回目录
Lingxi Xie, Xin Chen, Kaifeng Bi, Longhui Wei, Yuhui Xu, Zhengsu Chen, Lanfei Wang, An Xiao, Jianlong Chang, Xiaopeng Zhang, Qi Tian
Abstract: Neural architecture search (NAS) has attracted increasing attention in both academia and industry. In the early days, researchers mostly applied individual search methods, which sample and evaluate candidate architectures separately and thus incur heavy computational overheads. To alleviate the burden, weight-sharing methods were proposed in which exponentially many architectures share weights in the same super-network, and the costly training procedure is performed only once. These methods, though much faster, often suffer from instability. This paper provides a literature review of NAS, in particular the weight-sharing methods, and points out that the major challenge comes from the optimization gap between the super-network and the sub-architectures. From this perspective, we summarize existing approaches into several categories according to their efforts in bridging the gap, and analyze the advantages and disadvantages of these methodologies. Finally, we share our opinions on the future directions of NAS and AutoML. Due to the expertise of the authors, this paper mainly focuses on the application of NAS to computer vision problems and may be biased towards the work of our group.
18. Spherical Feature Transform for Deep Metric Learning [PDF] 返回目录
Yuke Zhu, Yan Bai, Yichen Wei
Abstract: Data augmentation in feature space is an effective way to increase data diversity. Previous methods assume that different classes have the same covariance in their feature distributions, so the feature transform between different classes is performed via translation. However, this approach is no longer valid for recent deep metric learning scenarios, where feature normalization is widely adopted and all features lie on a hypersphere. This work proposes a novel spherical feature transform approach. It relaxes the assumption of identical covariance between classes to an assumption of similar covariances of different classes on a hypersphere. Consequently, the feature transform is performed by a rotation that respects the spherical data distributions. We provide a simple and effective training method, and an in-depth analysis of the relation between the two different transforms. Comprehensive experiments on various deep metric learning benchmarks and different baselines verify that our method achieves consistent performance improvement and state-of-the-art results.
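The abstract describes the transform as a rotation that respects the spherical data distribution; one generic way to realise this is the planar rotation that carries one class centre onto another while leaving the orthogonal complement untouched. A hedged sketch (the paper's exact construction may differ):

```python
import torch
import torch.nn.functional as F

def rotate_towards(x, u, v):
    """Planar rotation that carries unit vector u onto unit vector v,
    applied to feature x; the identity on the orthogonal complement of
    span{u, v}, so it respects the spherical geometry."""
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    cos = (u * v).sum(-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
    sin = torch.sqrt(1 - cos ** 2)
    w = F.normalize(v - cos * u, dim=-1)   # orthonormal partner of u in span{u, v}
    xu = (x * u).sum(-1, keepdim=True)     # coordinates of x in the rotation plane
    xw = (x * w).sum(-1, keepdim=True)
    return x + (cos - 1) * (xu * u + xw * w) + sin * (xu * w - xw * u)

# e.g. transplant a sample's intra-class variation from class A to class B:
#   x_b = rotate_towards(x_a, center_a, center_b)
```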
19. Prime-Aware Adaptive Distillation [PDF] 返回目录
Youcai Zhang, Zhonghao Lan, Yuchen Dai, Fangao Zeng, Yan Bai, Jie Chang, Yichen Wei
Abstract: Knowledge distillation (KD) aims to improve the performance of a student network by mimicking the knowledge from a powerful teacher network. Existing methods focus on studying what knowledge should be transferred and treat all samples equally during training. This paper introduces adaptive sample weighting to KD. We discover that previous effective hard mining methods are not appropriate for distillation. Furthermore, we propose Prime-Aware Adaptive Distillation (PAD) by incorporating uncertainty learning. PAD perceives the prime samples in distillation and then emphasizes their effect adaptively. PAD is fundamentally different from, and would refine, existing methods with the innovative view of unequal training. For this reason, PAD is versatile and has been applied in various tasks including classification, metric learning, and object detection. With ten teacher-student combinations on six datasets, PAD promotes the performance of existing distillation methods and outperforms recent state-of-the-art methods.
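The abstract couples adaptive sample weighting with uncertainty learning; a common way to realise this is heteroscedastic-uncertainty weighting, in which a predicted per-sample variance damps unreliable samples. A sketch under that assumption (PAD's exact parameterisation may differ):

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_kd(student_feat, teacher_feat, log_var):
    """Per-sample distillation loss weighted by a predicted variance:
    low-variance ('prime') samples receive large weight, noisy samples
    are damped, and the log_var term keeps the variance from exploding.
    log_var would come from a small auxiliary head on the student."""
    per_sample = F.mse_loss(student_feat, teacher_feat,
                            reduction="none").mean(dim=1)   # (batch,)
    return (per_sample * torch.exp(-log_var) + log_var).mean()
```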
20. Prior Guided Feature Enrichment Network for Few-Shot Segmentation [PDF] 返回目录
Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, Jiaya Jia
Abstract: State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results and hardly work on unseen classes without fine-tuning. Few-shot segmentation is thus proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. These frameworks still face the challenge of reduced generalization ability on unseen classes due to inappropriate use of high-level semantic information of training classes and spatial inconsistency between query and support targets. To alleviate these issues, we propose the Prior Guided Feature Enrichment Network (PFENet). It consists of novel designs of (1) a training-free prior mask generation method that not only retains generalization power but also improves model performance and (2) a Feature Enrichment Module (FEM) that overcomes spatial inconsistency by adaptively enriching query features with support features and prior masks. Extensive experiments on PASCAL-5$^i$ and COCO prove that the proposed prior generation method and FEM both improve the baseline method significantly. Our PFENet also outperforms state-of-the-art methods by a large margin without efficiency loss. It is surprising that our model even generalizes to cases without labeled support samples. Our code is available at this https URL.
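The training-free prior mask can be read as a pixel-wise best-match score between query and support features; a sketch of that computation, with tensor shapes assumed by us:

```python
import torch
import torch.nn.functional as F

def prior_mask(query_feat, support_feat, support_mask, eps=1e-7):
    """Training-free prior: for every query location, the maximum cosine
    similarity to any foreground support location, min-max normalised.
    Shapes: query_feat / support_feat (C, H, W), support_mask (H, W)."""
    c, h, w = query_feat.shape
    q = F.normalize(query_feat.reshape(c, -1), dim=0)    # (C, HW_q)
    s = F.normalize(support_feat.reshape(c, -1), dim=0)  # (C, HW_s)
    sim = q.t() @ s                                      # (HW_q, HW_s) cosine map
    fg = support_mask.reshape(-1) > 0.5                  # foreground support pixels
    prior = sim[:, fg].max(dim=1).values                 # best match per query pixel
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(h, w)
```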
21. Controlling Information Capacity of Binary Neural Network [PDF] 返回目录
Dmitry Ignatov, Andrey Ignatov
Abstract: Despite the growing popularity of deep learning technologies, high memory requirements and power consumption are essentially limiting their application in mobile and IoT areas. While binary convolutional networks can alleviate these problems, the limited bitwidth of weights often leads to significant degradation of prediction accuracy. In this paper, we present a method for training binary networks that maintains a stable predefined level of their information capacity throughout the training process by applying a Shannon entropy based penalty to convolutional filters. The results of experiments conducted on SVHN, CIFAR and ImageNet datasets demonstrate that the proposed approach can statistically significantly improve the accuracy of binary networks.
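As a sketch of the mechanism the abstract describes, a penalty that holds each filter's sign-distribution entropy at a predefined level could be written as follows (the sigmoid relaxation and the squared deviation from a target entropy are our assumptions):

```python
import torch

def entropy_capacity_penalty(conv_weight, target_entropy=1.0):
    """Penalise deviation of each filter's sign-distribution entropy from
    a predefined level. p is a sigmoid relaxation of P(w = +1) so the
    penalty stays differentiable; conv_weight has shape (out_ch, ...)."""
    p = torch.sigmoid(conv_weight).flatten(1).mean(dim=1)        # per-filter P(+1)
    h = -(p * torch.log2(p + 1e-8) + (1 - p) * torch.log2(1 - p + 1e-8))
    return ((h - target_entropy) ** 2).mean()
```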
22. Addressing the Cold-Start Problem in Outfit Recommendation Using Visual Preference Modelling [PDF] 返回目录
Dhruv Verma, Kshitij Gulati, Rajiv Ratn Shah
Abstract: With the global transformation of the fashion industry and a rise in the demand for fashion items worldwide, the need for effective fashion recommendation has never been greater. Despite various cutting-edge solutions proposed in the past for personalising fashion recommendation, the technology is still limited by its poor performance on new entities, i.e. the cold-start problem. In this paper, we attempt to address the cold-start problem for new users, by leveraging a novel visual preference modelling approach on a small set of input images. We demonstrate the use of our approach with feature-weighted clustering to personalise occasion-oriented outfit recommendation. Quantitatively, our results show that the proposed visual preference modelling approach outperforms the state of the art in terms of clothing attribute prediction. Qualitatively, through a pilot study, we demonstrate the efficacy of our system in providing diverse and personalised recommendations in cold-start scenarios.
23. Boundary Content Graph Neural Network for Temporal Action Proposal Generation [PDF] 返回目录
Yueran Bai, Yingying Wang, Yunhai Tong, Yang Yang, Qiyue Liu, Junhui Liu
Abstract: Temporal action proposal generation plays an important role in video action understanding, which requires localizing high-quality action content precisely. However, generating temporal proposals with both precise boundaries and high-quality action content is extremely challenging. To address this issue, we propose a novel Boundary Content Graph Neural Network (BC-GNN) to model the insightful relations between the boundary and action content of temporal proposals by the graph neural networks. In BC-GNN, the boundaries and content of temporal proposals are taken as the nodes and edges of the graph neural network, respectively, where they are spontaneously linked. Then a novel graph computation operation is proposed to update features of edges and nodes. After that, one updated edge and the two nodes it connects are used to predict boundary probabilities and a content confidence score, which are combined to generate a final high-quality proposal. Experiments are conducted on two mainstream datasets: ActivityNet-1.3 and THUMOS14. Without bells and whistles, BC-GNN outperforms previous state-of-the-art methods in both temporal action proposal and temporal action detection tasks.
24. A non-discriminatory approach to ethical deep learning [PDF] 返回目录
Enzo Tartaglione, Marco Grangetto
Abstract: Artificial neural networks achieve state-of-the-art performance in an ever-growing number of tasks and are nowadays used to solve an incredibly large variety of problems. However, typical training strategies do not take into account the potential legal, ethical and discriminatory issues the trained ANN models could incur. In this work we propose NDR, a non-discriminatory regularization strategy that prevents the ANN model from solving the target task using discriminatory features such as, for example, ethnicity in an image classification task for human faces. In particular, a part of the ANN model is trained to hide the discriminatory information such that the rest of the network focuses on learning the given task. Our experiments show that NDR can be exploited to achieve non-discriminatory models with minimal computational overhead and performance loss.
25. Predicting the Blur Visual Discomfort for Natural Scenes by the Loss of Positional Information [PDF] 返回目录
Elio D. Di Claudio, Paolo Giannitrapani, Giovanni Jacovitti
Abstract: The perception of blur due to accommodation failures, insufficient optical correction or imperfect image reproduction is a common source of visual discomfort, usually attributed to an anomalous and annoying distribution of the image spectrum in the spatial frequency domain. In the present paper, this discomfort is attributed to a loss of the localization accuracy of the observed patterns. It is assumed, as a starting perceptual principle, that the visual system is optimally adapted to pattern localization in a natural environment. Thus, since the best possible accuracy of image pattern localization is indicated by the positional Fisher information, it is argued that blur discomfort is highly correlated with a loss of this information. Following this concept, a receptive field functional model, tuned to common and stable features of natural scenes, is adopted to predict the visual discomfort. It is a complex-valued operator, orientation-selective in both the space domain and the spatial frequency domain. Starting from the case of Gaussian blur, the analysis is extended to a generic blur type by applying a positional Fisher information equivalence criterion. Out-of-focus blur and astigmatic blur are presented as significant examples. The validity of the proposed model is verified by comparing its predictions with subjective ratings of the quality loss of blurred natural images. The model fits linearly to the experimental results reported in independent databases, based on different protocols and settings.
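For intuition (our gloss of the principle, not a formula taken from the paper): for a pattern $s(x-\tau)$ observed in additive white Gaussian noise of variance $\sigma^2$, the Fisher information about the position $\tau$ is the classical quantity
$$I(\tau) = \frac{1}{\sigma^2}\int |s'(x)|^2\,dx \;\propto\; \int \omega^2\,|S(\omega)|^2\,d\omega,$$
so any blur kernel that attenuates high spatial frequencies directly shrinks $\int \omega^2 |S(\omega)|^2\,d\omega$ and hence the best achievable localization accuracy, which is the loss the model correlates with discomfort.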
26. Hyperspectral Image Classification with Spatial Consistence Using Fully Convolutional Spatial Propagation Network [PDF] 返回目录
Yenan Jiang, Ying Li, Shanrong Zou, Haokui Zhang, Yunpeng Bai
Abstract: In recent years, deep convolutional neural networks (CNNs) have shown an impressive ability to represent hyperspectral images (HSIs) and have achieved encouraging results in HSI classification. However, the existing CNN-based models operate at the patch level, in which each pixel is separately classified using a patch of the image around it. This patch-level classification leads to a large number of repeated calculations, and it is difficult to determine the patch size most beneficial to classification accuracy. In addition, conventional CNN models operate convolutions with local receptive fields, which causes failures in modeling contextual spatial information. To overcome the aforementioned limitations, we propose a novel end-to-end, pixels-to-pixels fully convolutional spatial propagation network (FCSPN) for HSI classification. Our FCSPN consists of a 3D fully convolutional network (3D-FCN) and a convolutional spatial propagation network (CSPN). Specifically, the 3D-FCN is first introduced for reliable preliminary classification, in which a novel dual separable residual (DSR) unit is proposed to effectively capture spectral and spatial information simultaneously with fewer parameters. Moreover, a channel-wise attention mechanism is adapted in the 3D-FCN to grasp the most informative channels from redundant channel information. Finally, the CSPN is introduced to capture the spatial correlations of the HSI via learning a local linear spatial propagation, which allows maintaining the HSI spatial consistency and further refining the classification results. Experimental results on three HSI benchmark datasets demonstrate that the proposed FCSPN achieves state-of-the-art performance on HSI classification.
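The channel-wise attention the authors adapt into the 3D-FCN is in the spirit of squeeze-and-excitation; a minimal 3D variant for illustration (the reduction ratio and exact placement are assumptions on our part):

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style channel attention for 3D feature maps:
    global-average-pool each channel, gate it through a small bottleneck
    MLP, and reweight the channels (reduction ratio r is an assumed choice)."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (N, C, D, H, W)
        w = self.fc(x.mean(dim=(2, 3, 4)))       # per-channel gate in (0, 1)
        return x * w.view(*w.shape, 1, 1, 1)     # emphasise informative channels
```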
27. MSDPN: Monocular Depth Prediction with Partial Laser Observation using Multi-stage Neural Networks [PDF] 返回目录
Hyungtae Lim, Hyeonjae Gil, Hyun Myung
Abstract: In this study, a deep-learning-based multi-stage network architecture called the Multi-Stage Depth Prediction Network (MSDPN) is proposed to predict a dense depth map using a 2D LiDAR and a monocular camera. Our proposed network consists of a multi-stage encoder-decoder architecture and Cross Stage Feature Aggregation (CSFA). The proposed multi-stage encoder-decoder architecture alleviates the partial observation problem caused by the characteristics of a 2D LiDAR, and CSFA prevents the multi-stage network from diluting the features and allows the network to better learn the inter-spatial relationship between features. Previous works use sub-sampled data from the ground truth as an input rather than actual 2D LiDAR data. In contrast, our approach trains the model and conducts experiments with a physically-collected 2D LiDAR dataset. To this end, we acquired our own dataset, the KAIST RGBD-scan dataset, and validated the effectiveness and robustness of MSDPN under realistic conditions. As verified experimentally, our network yields promising performance against state-of-the-art methods. Additionally, we analyzed the performance of different input methods and confirmed that the reference depth map is robust in untrained scenarios.
28. Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [PDF] 返回目录
Daizong Liu, Xiaoye Qu, Xiaoyang Liu, Jianfeng Dong, Pan Zhou, Zichuan Xu
Abstract: Query-based moment localization is a new task that localizes the best matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative messages passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, CMG highlights relevant instances across video and sentence, and then SMG models the pairwise relation inside each modality for frame (word) correlating. With multiple layers of such a joint graph, our CSMGAN is able to effectively capture high-order interactions between two modalities, thus enabling further precise localization. Besides, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder to enhance the query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms the state of the art.
29. Learning Visual Representations with Caption Annotations [PDF] 返回目录
Mert Bulent Sariyildiz, Julien Perez, Diane Larlus
Abstract: Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM), a proxy task to learn visual representations over image-caption pairs. ICMLM consists in predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks. Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations. Project website: this https URL.
30. Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [PDF] 返回目录
Jogendra Nath Kundu, Ambareesh Revanur, Govind Vitthal Waghmare, Rahul Mysore Venkatesh, R. Venkatesh Babu
Abstract: We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation. We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation. This is realized by learning a generative pose embedding which not only ensures plausible 3D pose predictions, but also eliminates the usual keypoint grouping operation as employed in prior bottom-up approaches. Further, we propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable. In the absence of any paired supervision, we leverage a frozen network, as a teacher model, which is trained on an auxiliary task of multi-person 2D pose estimation. We cast the learning as a cross-modal alignment problem and propose training objectives to realize a shared latent space between two diverse modalities. We aim to enhance the model's ability to perform beyond the limiting teacher network by enriching the latent-to-3D pose mapping using artificially synthesized multi-person 3D scene samples. Our approach not only generalizes to in-the-wild images, but also yields a superior trade-off between speed and performance, compared to prior top-down approaches. Our approach also yields state-of-the-art multi-person 3D pose estimation performance among the bottom-up approaches under consistent supervision levels.
31. ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval [PDF] 返回目录
Quan Cui, Qing-Yuan Jiang, Xiu-Shen Wei, Wu-Jun Li, Osamu Yoshie
Abstract: Retrieving content-relevant images from a large-scale fine-grained dataset could suffer from intolerably slow query speed and highly redundant storage cost, due to high-dimensional real-valued embeddings which aim to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed ExchNet. Based on attention mechanisms and proposed attention constraints, it can first obtain both local and global features to represent object parts and whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic meaning's consistency of these part-level features across images, we design a local feature alignment approach by performing a feature exchanging operation. Later, an alternative learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our proposal consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets, demonstrating its effectiveness. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality.
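For context on the retrieval mechanics the abstract relies on, here is the generic deep-hashing relaxation and Hamming-distance ranking (a sketch of the common pattern; ExchNet's full pipeline adds the part-feature alignment and alternative optimization described above):

```python
import torch

def to_codes(features, training):
    """Standard deep-hashing relaxation: tanh keeps the codes
    differentiable during training, sign yields the final bits."""
    return torch.tanh(features) if training else torch.sign(features)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query; for codes
    in {-1, +1}^bits, distance = (bits - dot product) / 2."""
    bits = query_code.numel()
    return torch.argsort((bits - db_codes @ query_code) / 2)
```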
32. 1st Place Solutions of Waymo Open Dataset Challenge 2020 -- 2D Object Detection Track [PDF] 返回目录
Zehao Huang, Zehui Chen, Qiaofei Li, Hongkai Zhang, Naiyan Wang
Abstract: In this technical report, we present our solution to the Waymo Open Dataset (WOD) Challenge 2020 - 2D Object Track. We adopt FPN as our basic framework. Cascade RCNN, stacked PAFPN Neck and Double-Head are used for performance improvements. In order to handle the small object detection problem in WOD, we use very large image scales for both training and testing. Using our methods, our team RW-TSDet achieved 1st place in the 2D Object Detection Track.
33. A Novel Indoor Positioning System for unprepared firefighting scenarios [PDF] 返回目录
Vamsi Karthik Vadlamani, Manish Bhattarai, Meenu Ajith, Manel Martínez-Ramón
Abstract: Situational awareness and indoor location tracking for firefighters are of paramount importance in search and rescue operations. For indoor positioning systems (IPS), GPS is not the best possible solution. A few other techniques exist, such as dead reckoning, WiFi- and Bluetooth-based triangulation, and Structure from Motion (SfM) based scene reconstruction. However, due to high temperatures, the rapidly changing environment of fires, and low parallax in thermal images, these techniques are not suitable for relaying the information needed to increase situational awareness in real time. In firefighting environments, thermal imaging cameras are used due to smoke and low visibility; hence, obtaining relative orientation from vanishing point estimation is very difficult. This research implements a novel optical flow based video compass for orientation estimation and fused IMU data based activity recognition for IPS. The technique helps first responders go into unprepared, unknown environments while still maintaining situational awareness, such as the orientation and position of the victim firefighters.
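As an illustration of the optical-flow-based video compass idea: a panning camera shifts the whole image sideways, so the median horizontal flow gives a yaw increment. A minimal OpenCV sketch (the field of view and flow parameters are illustrative, not values from the paper):

```python
import cv2
import numpy as np

def yaw_increment(prev_grey, curr_grey, hfov_deg=57.0):
    """Estimate the yaw change between two (thermal) frames from the
    median horizontal optical flow. hfov_deg and the Farneback
    parameters are illustrative choices, not values from the paper."""
    flow = cv2.calcOpticalFlowFarneback(prev_grey, curr_grey, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    median_dx = float(np.median(flow[..., 0]))   # robust to moving people
    deg_per_px = hfov_deg / prev_grey.shape[1]
    return median_dx * deg_per_px                # degrees of yaw for this frame pair
```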
34. Appearance Consensus Driven Self-Supervised Human Mesh Recovery [PDF] 返回目录
Jogendra Nath Kundu, Mugalodi Rakesh, Varun Jampani, Rahul Mysore Venkatesh, R. Venkatesh Babu
Abstract: We present a self-supervised human mesh recovery framework to infer human pose and shape from monocular images in the absence of any paired supervision. Recent advances have shifted the interest towards directly regressing parameters of a parametric human model by supervising them on large-scale datasets with 2D landmark annotations. This limits the generalizability of such approaches to operate on images from unlabeled wild environments. Acknowledging this we propose a novel appearance consensus driven self-supervised objective. To effectively disentangle the foreground (FG) human we rely on image pairs depicting the same person (consistent FG) in varied pose and background (BG) which are obtained from unlabeled wild videos. The proposed FG appearance consistency objective makes use of a novel, differentiable Color-recovery module to obtain vertex colors without the need for any appearance network; via efficient realization of color-picking and reflectional symmetry. We achieve state-of-the-art results on the standard model-based 3D pose estimation benchmarks at comparable supervision levels. Furthermore, the resulting colored mesh prediction opens up the usage of our framework for a variety of appearance-related tasks beyond the pose and shape estimation, thus establishing our superior generalizability.
35. Hierarchical Context Embedding for Region-based Object Detection [PDF] 返回目录
Zhao-Min Chen, Xin Jin, Borui Zhao, Xiu-Shen Wei, Yanwen Guo
Abstract: State-of-the-art two-stage object detectors apply a classifier to a sparse set of object proposals, relying on region-wise features extracted by RoIPool or RoIAlign as inputs. The region-wise features, despite aligning well with the proposal locations, may still lack the crucial context information that is necessary for filtering out noisy background detections and for recognizing objects with no distinctive appearance. To address this issue, we present a simple but effective Hierarchical Context Embedding (HCE) framework, which can be applied as a plug-and-play component, to improve the classification ability of a series of region-based detectors by mining contextual cues. Specifically, to advance the recognition of context-dependent object categories, we propose an image-level categorical embedding module which leverages the holistic image-level context to learn object-level concepts. Then, novel RoI features are generated by exploiting hierarchically embedded context information from both whole images and regions of interest, and these features are also complementary to conventional RoI features. Moreover, to make full use of our hierarchical contextual RoI features, we propose early-and-late fusion strategies (i.e., feature fusion and confidence fusion), which can be combined to boost the classification accuracy of region-based detectors. Comprehensive experiments demonstrate that our HCE framework is flexible and generalizable, leading to significant and consistent improvements upon various region-based detectors, including FPN, Cascade R-CNN and Mask R-CNN.
36. Context Encoding for Video Retrieval with Contrastive Learning [PDF] 返回目录
Jie Shao, Xin Wen, Bingchen Zhao, Changhu Wang, Xiangyang Xue
Abstract: Content-based video retrieval plays an important role in areas such as video recommendation, copyright protection, etc. Existing video retrieval methods mainly extract frame-level features independently; they therefore lack efficient aggregation of features across frames and have difficulty dealing with poor-quality frames, such as frames with motion blur or frames that are out of focus. In this paper, we propose CECL (Context Encoding for video retrieval with Contrastive Learning), a video representation learning framework that aggregates the context information of frame-level descriptors, together with a supervised contrastive learning method that performs automatic hard negative mining and utilizes a memory bank mechanism to increase the capacity of negative samples. Extensive experiments are conducted on multiple video retrieval tasks, such as FIVR, CC_WEB_VIDEO and EVVE. The proposed method shows a significant performance advantage (~17% mAP on FIVR-200K) over state-of-the-art methods with video-level features, and delivers competitive results with much lower computational cost when compared with frame-level features.
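The supervised contrastive objective with a memory bank can be sketched compactly; the temperature and interface below are assumptions, and all features are assumed L2-normalized:

```python
import torch

def supcon_loss(anchor, positives, memory_bank, tau=0.07):
    """anchor: (D,); positives: (P, D); memory_bank: (N, D) of negatives.

    The memory bank enlarges the pool of negatives beyond the mini-batch;
    hard negatives dominate the logsumexp denominator automatically.
    """
    pos = anchor @ positives.t() / tau               # (P,) positive logits
    neg = anchor @ memory_bank.t() / tau             # (N,) negative logits
    log_prob = pos - torch.logsumexp(torch.cat([pos, neg]), dim=0)
    return -log_prob.mean()
```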
37. Structural Plan of Indoor Scenes with Personalized Preferences [PDF] 返回目录
Xinhan Di, Pengqian Yu, Hong Zhu, Lei Cai, Qiuyan Sheng, Changyu Sun
Abstract: In this paper, we propose an assistive model that supports professional interior designers in producing industrial interior decoration solutions and meeting the personalized preferences of property owners. The proposed model is able to automatically produce the layout of objects for a particular indoor scene according to property owners' preferences. In particular, the model consists of the extraction of an abstract graph, conditional graph generation, and conditional scene instantiation. We provide an interior layout dataset that contains 11,000 real-world designs from professional designers. Our numerical results on the dataset demonstrate the effectiveness of the proposed model compared with state-of-the-art methods.
38. Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping [PDF] 返回目录
Adam W. Harley, Shrinidhi K. Lakshmikanth, Paul Schydlo, Katerina Fragkiadaki
Abstract: We hypothesize that an agent that can look around in static scenes can learn rich visual representations applicable to 3D object tracking in complex dynamic scenes. We are motivated in this pursuit by the fact that the physical world itself is mostly static, and multiview correspondence labels are relatively cheap to collect in static scenes, e.g., by triangulation. We propose to leverage multiview data of \textit{static points} in arbitrary scenes (static or dynamic), to learn a neural 3D mapping module which produces features that are correspondable across time. The neural 3D mapper consumes RGB-D data as input, and produces a 3D voxel grid of deep features as output. We train the voxel features to be correspondable across viewpoints, using a contrastive loss, and correspondability across time emerges automatically. At test time, given an RGB-D video with approximate camera poses, and given the 3D box of an object to track, we track the target object by generating a map of each timestep and locating the object's features within each map. In contrast to models that represent video streams in 2D or 2.5D, our model's 3D scene representation is disentangled from projection artifacts, is stable under camera motion, and is robust to partial occlusions. We test the proposed architectures in challenging simulated and real data, and show that our unsupervised 3D object trackers outperform prior unsupervised 2D and 2.5D trackers, and approach the accuracy of supervised trackers. This work demonstrates that 3D object trackers can emerge without tracking labels, through multiview self-supervision on static data.
39. Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation [PDF] 返回目录
Mingmin Zhen, Shiwei Li, Lei Zhou, Jiaxiang Shang, Haoan Feng, Tian Fang, Long Quan
Abstract: In this paper, we introduce a novel network, called the discriminative feature network (DFNet), to address the unsupervised video object segmentation task. To capture the inherent correlation among video frames, we learn discriminative features (D-features) from the input images that reveal the feature distribution from a global perspective. The D-features are then used to establish correspondence with all features of the test image under a conditional random field (CRF) formulation, which is leveraged to enforce consistency between pixels. The experiments verify that DFNet outperforms state-of-the-art methods by a large margin, with a mean IoU score of 83.4%, and ranks first on the DAVIS-2016 leaderboard while using far fewer parameters and achieving much more efficient inference. We further evaluate DFNet on the FBMS dataset and the video saliency dataset ViSal, reaching a new state-of-the-art. To further demonstrate the generalizability of our framework, DFNet is also applied to the image object co-segmentation task. We perform experiments on the challenging PASCAL-VOC dataset and observe the superiority of DFNet. The thorough experiments verify that DFNet is able to capture and mine the underlying relations of images and discover the common foreground objects.
40. Robust Uncertainty-Aware Multiview Triangulation [PDF] 返回目录
Seong Hun Lee, Javier Civera
Abstract: We propose a robust and efficient method for multiview triangulation and uncertainty estimation. Our contribution is threefold: First, we propose an outlier rejection scheme using two-view RANSAC with the midpoint method. By prescreening the two-view samples prior to triangulation, we achieve the state-of-the-art efficiency. Second, we compare different local optimization methods for refining the initial solution and the inlier set. With an iterative update of the inlier set, we show that the optimization provides significant improvement in accuracy and robustness. Third, we model the uncertainty of a triangulated point as a function of three factors: the number of cameras, the mean reprojection error and the maximum parallax angle. Learning this model allows us to quickly interpolate the uncertainty at test time. We validate our method through an extensive evaluation.
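The two-view midpoint method used for prescreening has a simple closed form: find the closest points on the two backprojected rays and return their midpoint. A numpy sketch (variable names illustrative):

```python
import numpy as np

def midpoint_triangulation(c1, d1, c2, d2):
    """c1, c2: camera centers (3,); d1, d2: unit ray directions (3,).

    Solves min over (t1, t2) of ||(c1 + t1*d1) - (c2 + t2*d2)||^2 and
    returns the midpoint of the closest points on the two rays.
    """
    A = np.stack([d1, -d2], axis=1)                  # (3, 2)
    t, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1, p2 = c1 + t[0] * d1, c2 + t[1] * d2
    return 0.5 * (p1 + p2)
```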
41. Geometric Interpretations of the Normalized Epipolar Error [PDF] 返回目录
Seong Hun Lee, Javier Civera
Abstract: In this work, we provide geometric interpretations of the normalized epipolar error. Most notably, we show that it is directly related to the following quantities: (1) the shortest distance between the two backprojected rays, (2) the dihedral angle between the two bounding epipolar planes, and (3) the $L_1$-optimal angular reprojection error.
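Interpretation (1) is easy to check numerically: with unit bearing vectors f1, f2 and relative pose (R, t), the epipolar error |f2^T [t]x R f1| equals the shortest distance between the two backprojected rays scaled by ||R f1 x f2||. The sketch below verifies this identity on random geometry; the paper's exact normalization convention may differ:

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

rng = np.random.default_rng(0)
f1 = rng.normal(size=3); f1 /= np.linalg.norm(f1)   # unit bearing, camera 1
f2 = rng.normal(size=3); f2 /= np.linalg.norm(f2)   # unit bearing, camera 2
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                                   # ensure a proper rotation
t = rng.normal(size=3)                              # camera-1 center in frame 2

e = abs(f2 @ skew(t) @ R @ f1)                      # epipolar error
cross = np.cross(R @ f1, f2)
# Shortest distance between the rays (t + s * R @ f1) and (s' * f2):
dist = abs(t @ cross) / np.linalg.norm(cross)
assert np.isclose(e / np.linalg.norm(cross), dist)
```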
42. Central object segmentation by deep learning for fruits and other roundish objects [PDF] 返回目录
Motohisa Fukuda, Takashi Okuno, Shinya Yuki
Abstract: We present CROP (Central Roundish Object Painter), which identifies and paints the object at the center of an RGB image. CROP primarily works on roundish fruits under various illumination conditions but, surprisingly, it can also deal with images of other organic or inorganic materials, or images taken with optical and electron microscopes, although CROP was trained solely on 172 images of fruits. The method involves image segmentation by deep learning, and the architecture of the neural network is a deeper version of the original U-Net. This technique could provide a means of automatically collecting statistical data on fruit growth in farms. Our trained neural network CROP is available on GitHub, together with a user-friendly interface program.
43. Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition [PDF] 返回目录
M. Esat Kalfaoglu, Sinan Kalkan, A. Aydin Alatan
Abstract: In this work, we combine 3D convolution with late temporal modeling for action recognition. To this end, we replace the conventional Temporal Global Average Pooling (TGAP) layer at the end of a 3D convolutional architecture with a Bidirectional Encoder Representations from Transformers (BERT) layer, in order to better utilize temporal information through BERT's attention mechanism. We show that this replacement improves the performance of many popular 3D convolution architectures for action recognition, including ResNeXt, I3D, SlowFast and R(2+1)D. Moreover, we provide state-of-the-art results on both the HMDB51 and UCF101 datasets, with 83.99% and 98.65% top-1 accuracy, respectively. The code is publicly available.
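Replacing temporal global average pooling with an attention-based aggregation can be sketched in a few lines of PyTorch: a learned classification token attends bidirectionally over the backbone's temporal features, and its output drives the classifier. The hyper-parameters below are assumptions (requires torch >= 1.9 for batch_first):

```python
import torch
import torch.nn as nn

class AttentionTemporalPooling(nn.Module):
    """Drop-in replacement for temporal global average pooling."""
    def __init__(self, dim=2048, n_heads=8, n_layers=1, n_classes=51, max_t=16):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))        # aggregate token
        self.pos = nn.Parameter(torch.zeros(1, max_t + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, feats):                  # feats: (B, T, dim) from 3D CNN
        B, T, _ = feats.shape                  # requires T <= max_t
        x = torch.cat([self.cls.expand(B, -1, -1), feats], dim=1)
        x = self.encoder(x + self.pos[:, :T + 1])
        return self.head(x[:, 0])              # classify from the aggregate token
```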
44. Multi-Class 3D Object Detection Within Volumetric 3D Computed Tomography Baggage Security Screening Imagery [PDF] 返回目录
Qian Wang, Neelanjan Bhowmik, Toby P. Breckon
Abstract: Automatic detection of prohibited objects within passenger baggage is important for aviation security. X-ray Computed Tomography (CT) based 3D imaging is widely used in airports for aviation security screening, whilst prior work on automatic prohibited item detection focuses primarily on 2D X-ray imagery. These works have proven the possibility of extending deep convolutional neural network (CNN) based automatic prohibited item detection from 2D X-ray imagery to volumetric 3D CT baggage security screening imagery. However, previous work on 3D object detection in baggage security screening imagery focused on the detection of one specific type of object (e.g., either bottles or handguns). As a result, multiple models are needed if more than one type of prohibited item must be detected in practice. In this paper, we consider the detection of multiple object categories of interest using one unified framework. To this end, we formulate a more challenging multi-class 3D object detection problem within 3D CT imagery and propose a viable solution (3D RetinaNet) to tackle it. To enhance detection performance, we investigate a variety of strategies, including data augmentation and varying backbone networks. Experimentation is carried out to provide both quantitative and qualitative evaluations of the proposed approach to multi-class 3D object detection within 3D CT baggage security screening imagery. Experimental results demonstrate that the combination of the 3D RetinaNet and a series of favorable strategies can achieve a mean Average Precision (mAP) of 65.3% over five object classes (i.e., bottles, handguns, binoculars, glock frames, iPods). The overall performance is affected by the poor performance on glock frames and iPods, due to the lack of data and their resemblance to the baggage clutter.
45. Generalized Zero-Shot Domain Adaptation via Coupled Conditional Variational Autoencoders [PDF] 返回目录
Qian Wang, Toby P. Breckon
Abstract: Domain adaptation approaches aim to exploit useful information from the source domain, where supervised learning examples are easier to obtain, to address a learning problem in the target domain, where there is no or limited availability of such examples. In classification problems, domain adaptation has been studied under varying supervised, unsupervised and semi-supervised conditions. However, a common situation, in which labelled samples are available for only a subset of target domain classes, has been overlooked. In this paper, we formulate this particular domain adaptation problem within a generalized zero-shot learning framework by treating the labelled source-domain samples as semantic representations for zero-shot learning. For this particular problem, neither conventional domain adaptation approaches nor zero-shot learning algorithms directly apply. To address this generalized zero-shot domain adaptation problem, we present a novel Coupled Conditional Variational Autoencoder (CCVAE), which can generate synthetic target-domain features for unseen classes from their source-domain counterparts. Extensive experiments have been conducted on three domain adaptation datasets, including a bespoke X-ray security checkpoint dataset, to simulate a real-world application in aviation security. The results demonstrate the effectiveness of our proposed approach both against established benchmarks and in terms of real-world applicability.
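One conditional-VAE branch of the kind a coupled model pairs across domains can be sketched as follows; the dimensions and coupling interface are assumptions (the actual CCVAE couples a source and a target branch):

```python
import torch
import torch.nn as nn

class CondVAE(nn.Module):
    """One conditional VAE branch; a coupled model trains two of these."""
    def __init__(self, x_dim=2048, c_dim=85, z_dim=64, h_dim=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):                   # x: features, c: class condition
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar
```

Sampling z for an unseen class condition c then yields synthetic target-domain features for that class.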
46. Mixup-CAM: Weakly-supervised Semantic Segmentation via Uncertainty Regularization [PDF] 返回目录
Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, Ming-Hsuan Yang
Abstract: Obtaining object response maps is one important step toward weakly-supervised semantic segmentation using image-level labels. However, existing methods rely on the classification task, which can result in a response map attending only to discriminative object regions, as the network does not need to see the entire object to optimize the classification loss. To tackle this issue, we propose a principled and end-to-end trainable framework that allows the network to pay attention to other parts of the object, producing a more complete and uniform response map. Specifically, we introduce the mixup data augmentation scheme into the classification network and design two uncertainty regularization terms to better interact with the mixup strategy. In experiments, we conduct extensive analysis to demonstrate the proposed method and show favorable performance against state-of-the-art approaches.
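For reference, the standard mixup scheme the method builds on blends random image pairs and their labels with a Beta-distributed weight; the uncertainty regularizers themselves are specific to the paper and not sketched here:

```python
import torch

def mixup(images, labels, alpha=0.4):
    """Returns mixed images plus both label sets and the mixing weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    return mixed, labels, labels[perm], lam  # loss: lam*L(a) + (1-lam)*L(b)
```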
47. PhraseCut: Language-based Image Segmentation in the Wild [PDF] 返回目录
Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, Subhransu Maji
Abstract: We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs. Our dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated. Phrases in our dataset correspond to multiple regions and describe a large number of object and stuff categories as well as their attributes such as color, shape, parts, and relationships with other entities in the image. Our experiments show that the scale and diversity of concepts in our dataset poses significant challenges to the existing state-of-the-art. We systematically handle the long-tail nature of these concepts and present a modular approach to combine category, attribute, and relationship cues that outperforms existing approaches.
48. Weakly-Supervised Semantic Segmentation via Sub-category Exploration [PDF] 返回目录
Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, Ming-Hsuan Yang
Abstract: Existing weakly-supervised semantic segmentation methods using image-level annotations typically rely on initial responses to locate object regions. However, such response maps generated by the classification network usually focus on discriminative object parts, due to the fact that the network does not need the entire object to optimize the objective function. To force the network to pay attention to other parts of an object, we propose a simple yet effective approach that introduces a self-supervised task by exploiting sub-category information. Specifically, we perform clustering on image features to generate pseudo sub-category labels within each annotated parent class, and construct a sub-category objective that assigns the network a more challenging task. By iteratively clustering image features, the training process does not limit itself to the most discriminative object parts, hence improving the quality of the response maps. We conduct extensive analysis to validate the proposed method and show that our approach performs favorably against state-of-the-art approaches.
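The pseudo sub-category labeling step amounts to within-class clustering of image features; a sketch using k-means (the clustering algorithm and the value of k are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_subcategory_labels(features, parent_labels, k=10):
    """features: (N, D) image features; parent_labels: (N,) class ids.

    Clusters features within each annotated parent class and returns
    pseudo sub-category labels in a disjoint label space per parent.
    """
    sub = np.zeros(len(features), dtype=np.int64)
    for c in np.unique(parent_labels):
        idx = np.flatnonzero(parent_labels == c)
        km = KMeans(n_clusters=min(k, len(idx)), n_init=10,
                    random_state=0).fit(features[idx])
        sub[idx] = c * k + km.labels_
    return sub
```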
49. Describing Textures using Natural Language [PDF] 返回目录
Chenyun Wu, Mikayla Timm, Subhransu Maji
Abstract: Textures in natural images can be characterized by color, shape, periodicity of elements within them, and other attributes that can be described using natural language. In this paper, we study the problem of describing visual attributes of texture on a novel dataset containing rich descriptions of textures, and conduct a systematic study of current generative and discriminative models for grounding language to images on this dataset. We find that while these models capture some properties of texture, they fail to capture several compositional properties, such as the colors of dots. We provide critical analysis of existing models by generating synthetic but realistic textures with different descriptions. Our dataset also allows us to train interpretable models and generate language-based explanations of what discriminative features are learned by deep networks for fine-grained categorization where texture plays a key role. We present visualizations of several fine-grained domains and show that texture attributes learned on our dataset offer improvements over expert-designed attributes on the Caltech-UCSD Birds dataset.
50. End-to-end Birds-eye-view Flow Estimation for Autonomous Driving [PDF] 返回目录
Kuan-Hui Lee, Matthew Kliemann, Adrien Gaidon, Jie Li, Chao Fang, Sudeep Pillai, Wolfram Burgard
Abstract: In autonomous driving, accurately estimating the state of surrounding obstacles is critical for safe and robust path planning. However, this perception task is difficult, particularly for generic obstacles/objects, due to appearance and occlusion changes. To tackle this problem, we propose an end-to-end deep learning framework for LIDAR-based flow estimation in bird's eye view (BeV). Our method takes consecutive point cloud pairs as input and produces a 2-D BeV flow grid describing the dynamic state of each cell. The experimental results show that the proposed method not only estimates 2-D BeV flow accurately but also improves tracking performance of both dynamic and static objects.
51. Multiple instance learning on deep features for weakly supervised object detection with extreme domain shifts [PDF] 返回目录
Nicolas Gonthier, Saïd Ladjal, Yann Gousseau
Abstract: Weakly supervised object detection (WSOD) using only image-level annotations has attracted growing attention over the past few years. Whereas such a task is typically addressed with a domain-specific solution focused on natural images, we show that a simple multiple-instance approach applied to pre-trained deep features yields excellent performance on non-photographic datasets, possibly including new classes. The approach does not include any fine-tuning or cross-domain learning and is therefore efficient and possibly applicable to arbitrary datasets and classes. We investigate several flavors of the proposed approach, some including multi-layer perceptrons and polyhedral classifiers. Despite its simplicity, our method shows competitive results on a range of publicly available datasets, including paintings (People-Art, IconArt), watercolors, cliparts and comics, and allows unseen visual categories to be learned quickly.
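A minimal multiple-instance baseline of the kind described: a linear scorer on frozen, pre-extracted region features, aggregated to an image-level prediction by max-pooling (the paper also explores perceptron and polyhedral variants, not sketched here):

```python
import torch
import torch.nn as nn

class MaxPoolMIL(nn.Module):
    """Linear scorer on pre-extracted region features; no fine-tuning."""
    def __init__(self, feat_dim=2048, n_classes=20):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, n_classes)

    def forward(self, region_feats):        # (B, R, feat_dim), frozen features
        scores = self.scorer(region_feats)  # per-region class evidence
        return scores.max(dim=1).values     # image-level logits for BCE loss
```

At test time, the per-region scores themselves localize the detected objects.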
52. Reducing Label Noise in Anchor-Free Object Detection [PDF] 返回目录
Nermin Samet, Samet Hicsonmez, Emre Akbas
Abstract: Current anchor-free object detectors label as positive all the features that spatially fall inside a predefined central region of a ground-truth box. This approach causes label noise during training, since some of these positively labeled features may lie on the background or an occluder object, or may simply not be discriminative. In this paper, we propose a new labeling strategy aimed at reducing the label noise in anchor-free detectors. We sum-pool predictions stemming from individual features into a single prediction. This allows the model to reduce the contributions of non-discriminative features during training. We develop a new one-stage, anchor-free object detector, PPDet, to employ this labeling strategy during training and a similar prediction pooling method during inference. On the COCO dataset, PPDet achieves the best performance among anchor-free top-down detectors and performs on par with the other state-of-the-art methods. It also outperforms all state-of-the-art methods in the detection of small objects ($AP_S$ 31.4). Code is available at this https URL
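The prediction pooling idea can be sketched as follows: rather than supervising every feature location inside a ground-truth box independently, per-location predictions are summed into one prediction, so weakly discriminative locations contribute little (the interface is an assumption, not PPDet's exact formulation):

```python
import torch

def sum_pooled_prediction(logits_in_box):
    """logits_in_box: (N, C) class logits for the N features inside one box.

    Softmax each location, then sum: locations confident in a class push
    more mass there, damping the effect of noisy positive labels.
    """
    probs = logits_in_box.softmax(dim=-1)   # (N, C)
    pooled = probs.sum(dim=0)               # aggregate per-class evidence
    return pooled / pooled.sum()            # single normalized prediction
```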
53. Recognition and 3D Localization of Pedestrian Actions from Monocular Video [PDF] 返回目录
Jun Hayakawa, Behzad Dariush
Abstract: Understanding and predicting pedestrian behavior is an important and challenging area of research for realizing safe and effective navigation strategies in automated and advanced driver assistance technologies in urban scenes. This paper focuses on monocular pedestrian action recognition and 3D localization from an egocentric view for the purpose of predicting intention and forecasting future trajectory. A challenge in addressing this problem in urban traffic scenes is the unpredictable behavior of pedestrians, whereby actions and intentions are constantly in flux and depend on the pedestrians' pose, their 3D spatial relations, and their interaction with other agents as well as with the environment. To partially address these challenges, we consider the importance of pose in the recognition and 3D localization of pedestrian actions. In particular, we propose an action recognition framework using a two-stream temporal relation network with inputs corresponding to the raw RGB image sequence of the tracked pedestrian as well as the pedestrian pose. The proposed method outperforms methods using a single-stream temporal relation network, based on evaluations using the JAAD public dataset. The estimated pose and associated body key-points are also used as input to a network that estimates the 3D location of the pedestrian using a unique loss function. The evaluation of our 3D localization method on the KITTI dataset indicates an improvement in average localization error compared to existing state-of-the-art methods. Finally, we conduct qualitative tests of action recognition and 3D localization on HRI's H3D driving dataset.
54. Weakly Supervised Multi-Organ Multi-Disease Classification of Body CT Scans [PDF] 返回目录
Fakrul Islam Tushar, Vincent M. D'Anniballe, Rui Hou, Maciej A. Mazurowski, Wanyi Fu, Ehsan Samei, Geoffrey D. Rubin, Joseph Y. Lo
Abstract: We designed a multi-organ, multi-label disease classification algorithm for computed tomography (CT) scans using case-level labels from radiology text reports. A rule-based algorithm extracted 19,255 disease labels from reports of 13,667 body CT scans from 12,092 subjects. A 3D DenseVNet was trained to segment 3 organ systems: lungs/pleura, liver/gallbladder, and kidneys. From patches guided by segmentations, a 3D convolutional neural network provided multi-label disease classification for normality versus four common diseases per organ. The process was tested on 2,158 CT volumes with 2,875 manually obtained labels. Manual validation of the rule-based labels confirmed 91 to 99% accuracy. Results were characterized using the receiver operating characteristic area under the curve (AUC). Classification AUCs for lungs/pleura labels were as follows: atelectasis 0.77 (95% confidence interval 0.74 to 0.81), nodule 0.65 (0.61 to 0.69), emphysema 0.89 (0.86 to 0.92), effusion 0.97 (0.96 to 0.98), and normal 0.89 (0.87 to 0.91). For liver/gallbladder, AUCs were: stone 0.62 (0.56 to 0.67), lesion 0.73 (0.69 to 0.77), dilation 0.87 (0.84 to 0.90), fatty 0.89 (0.86 to 0.92), and normal 0.82 (0.78 to 0.85). For kidneys, AUCs were: stone 0.83 (0.79 to 0.87), atrophy 0.92 (0.89 to 0.94), lesion 0.68 (0.64 to 0.72), cyst 0.70 (0.66 to 0.73), and normal 0.79 (0.75 to 0.83). In conclusion, by using automated extraction of disease labels from radiology reports, we created a weakly supervised, multi-organ, multi-disease classifier that can be easily adapted to efficiently leverage massive amounts of unannotated data associated with medical images.
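The per-label evaluation above reduces to one AUC computation per disease from a binary label vector and a score vector; a self-contained sketch on synthetic stand-in data (names and numbers are made up, only the 2,158-volume test-set size is taken from the abstract):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2158)                          # stand-in disease labels
y_score = np.clip(0.6 * y_true + 0.5 * rng.random(2158), 0, 1)  # stand-in classifier scores
print(f"AUC = {roc_auc_score(y_true, y_score):.2f}")
```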
55. AiRound and CV-BrCT: Novel Multi-View Datasets for Scene Classification [PDF] 返回目录
Gabriel Machado, Edemir Ferreira, Keiller Nogueira, Hugo Oliveira, Pedro Gama, Jefersson A. dos Santos
Abstract: It is undeniable that aerial/satellite images can provide useful information for a large variety of tasks. But, since these images always look from above, some applications can benefit from complementary information provided by other perspective views of the scene, such as ground-level images. Despite a large number of public repositories for both georeferenced photographs and aerial images, there is a lack of benchmark datasets that allow the development of approaches that exploit the benefits and complementarity of aerial/ground imagery. In this paper, we present two new publicly available datasets named AiRound and CV-BrCT. The first one contains triplets of images from the same geographic coordinate with different perspectives of view extracted from various places around the world. Each triplet is composed of an aerial RGB image, a ground-level perspective image, and a Sentinel-2 sample. The second dataset contains pairs of aerial and street-level images extracted from southeast Brazil. We design an extensive set of experiments concerning multi-view scene classification, using early and late fusion. Such experiments were conducted to show that image classification can be enhanced using multi-view data.
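For reference, a toy contrast of the two fusion strategies on paired per-view features; the dimensions and linear heads are assumptions for illustration, since the paper builds on CNN backbones per view:

```python
import torch
import torch.nn as nn

aerial, ground = torch.randn(8, 256), torch.randn(8, 256)   # per-view features

# Early fusion: concatenate the per-view features, then classify jointly.
early_logits = nn.Linear(512, 10)(torch.cat([aerial, ground], dim=-1))

# Late fusion: classify each view separately, then average the predictions.
late_logits = (nn.Linear(256, 10)(aerial) + nn.Linear(256, 10)(ground)) / 2
```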
56. A Visual Analytics Framework for Reviewing Multivariate Time-Series Data with Dimensionality Reduction [PDF] 返回目录
Takanori Fujiwara, Shilpika, Naohisa Sakamoto, Jorji Nonaka, Keiji Yamamoto, Kwan-Liu Ma
Abstract: Data-driven problem solving in many real-world applications involves analysis of time-dependent multivariate data, for which dimensionality reduction (DR) methods are often used to uncover the intrinsic structure and features of the data. However, DR is usually applied to a subset of data that is either single-time-point multivariate or univariate time-series, resulting in the need to manually examine and correlate the DR results out of different data subsets. When the number of dimensions is large either in terms of the number of time points or attributes, this manual task becomes too tedious and infeasible. In this paper, we present MulTiDR, a new DR framework that enables processing of time-dependent multivariate data as a whole to provide a comprehensive overview of the data. With the framework, we employ DR in two steps. When treating the instances, time points, and attributes of the data as a 3D array, the first DR step reduces the three axes of the array to two, and the second DR step visualizes the data in a lower-dimensional space. In addition, by coupling with a contrastive learning method and interactive visualizations, our framework enhances analysts' ability to interpret DR results. We demonstrate the effectiveness of our framework with four case studies using real-world datasets.
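A loose two-step sketch of the idea on a synthetic (instances, time points, attributes) array, using PCA for both steps; MulTiDR's actual DR choices and axis handling differ in detail:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50, 8)     # 100 instances, 50 time points, 8 attributes
n, t, a = X.shape

# Step 1: collapse the attribute axis, turning the 3D array into a 2D matrix.
X2d = PCA(n_components=1).fit_transform(X.reshape(n * t, a)).reshape(n, t)

# Step 2: embed the instances in 2D for a visual overview.
overview = PCA(n_components=2).fit_transform(X2d)   # shape (100, 2)
```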
57. Land Use and Land Cover Classification using a Human Group based Particle Swarm Optimization Algorithm with a LSTM classifier on hybrid-pre-processing Remote Sensing Images [PDF] 返回目录
T. Kowsalya, S.L. Ullo, C. Zarro, K. L. Hemalatha, B. D. Parameshachari
Abstract: Land use and land cover (LULC) classification using remote sensing imagery plays a vital role in many environment modeling and land use inventories. In this study, a hybrid feature optimization algorithm along with a deep learning classifier is proposed to improve the performance of LULC classification, helping to predict wildlife habitat, deteriorating environmental quality, haphazard development, etc. LULC classification is assessed on the Sat 4, Sat 6 and Eurosat datasets. After the selection of the remote sensing images, normalization and histogram equalization methods are used to improve the quality of the images. Then, hybrid feature extraction is performed on the selected images using the Local Gabor Binary Pattern Histogram Sequence (LGBPHS), the Histogram of Oriented Gradients (HOG) and Haralick texture features. The benefits of this hybrid optimization are a high discriminative power and invariance to color and grayscale images. Next, a Human Group based Particle Swarm Optimization (PSO) algorithm is applied to select the optimal features, whose benefits are a fast convergence rate and ease of implementation. After selecting the optimal feature values, a Long Short Term Memory (LSTM) network is utilized to classify the LULC classes. Experimental results showed that the Human Group based PSO algorithm with an LSTM classifier effectively differentiates the land use and land cover classes in terms of classification accuracy, recall and precision. When the proposed method is applied, an improvement in accuracy of between 0.01% and 2.56% is achieved over the existing models GoogleNet, VGG, AlexNet and ConvNet.
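A minimal sketch of the image-quality stage described above (min-max normalization followed by histogram equalization), run on a synthetic band; the paper's full pipeline adds the hybrid feature extraction and PSO-based selection on top of this:

```python
import numpy as np
from skimage import exposure

band = np.random.rand(64, 64)                             # stand-in remote sensing band
normalized = (band - band.min()) / (np.ptp(band) + 1e-8)  # min-max normalization
equalized = exposure.equalize_hist(normalized)            # spread intensities for contrast
```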
58. Multi-Slice Fusion for Sparse-View and Limited-Angle 4D CT Reconstruction [PDF] 返回目录
Soumendu Majee, Thilo Balke, Craig A.J. Kemp, Gregery T. Buzzard, Charles A. Bouman
Abstract: Inverse problems spanning four or more dimensions such as space, time and other independent parameters have become increasingly important. State-of-the-art 4D reconstruction methods use model-based iterative reconstruction (MBIR), but depend critically on the quality of the prior modeling. Recently, plug-and-play (PnP) methods have been shown to be an effective way to incorporate advanced prior models using state-of-the-art denoising algorithms. However, state-of-the-art denoisers such as BM4D and deep convolutional neural networks (CNNs) are primarily available for 2D or 3D images, and extending them to higher dimensions is difficult due to algorithmic complexity and the increased difficulty of effective training. In this paper, we present multi-slice fusion, a novel algorithm for 4D reconstruction based on the fusion of multiple low-dimensional denoisers. Our approach uses multi-agent consensus equilibrium (MACE), an extension of plug-and-play, as a framework for integrating the multiple lower-dimensional models. We apply our method to 4D cone-beam X-ray CT reconstruction for non-destructive evaluation (NDE) of samples that are dynamically moving during acquisition. We implement multi-slice fusion on distributed, heterogeneous clusters in order to reconstruct large 4D volumes in reasonable time and demonstrate the inherently parallelizable nature of the algorithm. We present simulated and real experimental results on sparse-view and limited-angle CT data to demonstrate that multi-slice fusion can substantially improve the quality of reconstructions relative to traditional methods, while also being practical to implement and train.
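For context, the MACE equilibrium that the reconstruction builds on can be stated compactly (our paraphrase of the plug-and-play literature in the equal-weight case, not the paper's exact notation): given agents $F_1, \dots, F_K$, e.g., a data-fitting proximal map and several lower-dimensional denoisers, MACE seeks points $w_1^*, \dots, w_K^*$ such that $F_i(w_i^*) = \bar{w}^*$ for every $i$, where $\bar{w}^* = \frac{1}{K}\sum_{i=1}^{K} w_i^*$ is both the consensus average and the returned reconstruction. All agents thus agree on one image even though each only models a low-dimensional slice of the problem.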
59. Multiple Code Hashing for Efficient Image Retrieval [PDF] 返回目录
Ming-Wei Li, Qing-Yuan Jiang, Wu-Jun Li
Abstract: Due to its low storage cost and fast query speed, hashing has been widely used in large-scale image retrieval tasks. Hash bucket search returns data points within a given Hamming radius to each query, which can enable search at a constant or sub-linear time cost. However, existing hashing methods cannot achieve satisfactory retrieval performance for hash bucket search in complex scenarios, since they learn only one hash code for each image. More specifically, by using one hash code to represent one image, existing methods might fail to put similar image pairs to the buckets with a small Hamming distance to the query when the semantic information of images is complex. As a result, a large number of hash buckets need to be visited for retrieving similar images, based on the learned codes. This will deteriorate the efficiency of hash bucket search. In this paper, we propose a novel hashing framework, called multiple code hashing (MCH), to improve the performance of hash bucket search. The main idea of MCH is to learn multiple hash codes for each image, with each code representing a different region of the image. Furthermore, we propose a deep reinforcement learning algorithm to learn the parameters in MCH. To the best of our knowledge, this is the first work that proposes to learn multiple hash codes for each image in image retrieval. Experiments demonstrate that MCH can achieve a significant improvement in hash bucket search, compared with existing methods that learn only one hash code for each image.
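A small sketch of hash bucket search under multiple codes per image; the array layout and the any-code matching rule are our reading of the idea, not the paper's exact procedure:

```python
import numpy as np

def bucket_search(query: np.ndarray, codes: np.ndarray, radius: int) -> np.ndarray:
    """Return indices of stored codes within `radius` Hamming distance of `query`.

    query: (B,) 0/1 array; codes: (N, B) 0/1 array. With MCH, an image owns
    several rows of `codes` (one per image region) and is retrieved if any of
    its codes falls inside the radius.
    """
    dists = (codes != query).sum(axis=1)   # Hamming distance per stored code
    return np.flatnonzero(dists <= radius)
```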
60. Memory Efficient Class-Incremental Learning for Image Classification [PDF] 返回目录
Hanbin Zhao, Hui Wang, Yongjian Fu, Fei Wu, Xi Li
Abstract: Under memory-resource-limited constraints, class-incremental learning (CIL) usually suffers from the "catastrophic forgetting" problem when updating the joint classification model upon the arrival of newly added classes. To cope with the forgetting problem, many CIL methods transfer the knowledge of old classes by preserving some exemplar samples in a size-constrained memory buffer. To utilize the memory buffer more efficiently, we propose to keep more auxiliary low-fidelity exemplar samples rather than the original real high-fidelity exemplar samples. Such a memory-efficient exemplar-preserving scheme makes old-class knowledge transfer more effective. However, the low-fidelity exemplar samples are often distributed in a domain different from that of the original exemplar samples, that is, a domain shift. To alleviate this problem, we propose a duplet learning scheme that seeks to construct domain-compatible feature extractors and classifiers, which greatly narrows the above domain gap. As a result, these low-fidelity auxiliary exemplar samples can moderately replace the original exemplar samples at a lower memory cost. In addition, we present a robust classifier adaptation scheme, which further refines the biased classifier (learned with samples containing distillation label knowledge about old classes) with the help of samples with pure true class labels. Experimental results demonstrate the effectiveness of this work against state-of-the-art approaches. We will release the code, baselines, and training statistics for all models to facilitate future research.
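A minimal sketch of storing a low-fidelity exemplar by downsampling; the factor and interpolation mode are illustrative choices, and the paper pairs this with the duplet scheme to handle the induced domain shift:

```python
import torch
import torch.nn.functional as F

def to_low_fidelity(exemplar: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Downsample a (C, H, W) exemplar, cutting memory roughly by factor**2."""
    small = F.interpolate(exemplar.unsqueeze(0), scale_factor=1.0 / factor,
                          mode="bilinear", align_corners=False)
    return small.squeeze(0)   # store this instead of the full-fidelity image
```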
61. Deep Parallel MRI Reconstruction Network Without Coil Sensitivities [PDF] 返回目录
Wanyu Bian, Yunmei Chen, Xiaojing Ye
Abstract: We propose a novel deep neural network architecture by mapping the robust proximal gradient scheme for fast image reconstruction in parallel MRI (pMRI), with a regularization function trained from data. The proposed network learns to adaptively combine the multi-coil images from incomplete pMRI data into a single image with uniform contrast, which is then passed to a nonlinear encoder to efficiently extract sparse features of the image. Unlike most existing deep image reconstruction networks, our network does not require knowledge of sensitivity maps, which are notoriously difficult to estimate and have been a major bottleneck of image reconstruction in real-world pMRI applications. The experimental results demonstrate the promising performance of our method on a variety of pMRI imaging data sets.
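For reference, the generic proximal gradient iteration being unrolled here reads $x^{(k+1)} = \mathrm{prox}_{\alpha R}\big(x^{(k)} - \alpha \nabla f(x^{(k)})\big)$, with data fidelity $f(x) = \frac{1}{2}\|Ax - y\|^2$ and regularizer $R$; in the proposed network, the proximal map of the learned regularizer is realized by trained layers rather than a closed form. This is our compact statement of the standard scheme, not the paper's exact variant.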
62. Class-Incremental Domain Adaptation [PDF] 返回目录
Jogendra Nath Kundu, Rahul Mysore Venkatesh, Naveen Venkat, Ambareesh Revanur, R. Venkatesh Babu
Abstract: We introduce a practical Domain Adaptation (DA) paradigm called Class-Incremental Domain Adaptation (CIDA). Existing DA methods tackle domain-shift but are unsuitable for learning novel target-domain classes. Meanwhile, class-incremental (CI) methods enable learning of new classes in absence of source training data but fail under a domain-shift without labeled supervision. In this work, we effectively identify the limitations of these approaches in the CIDA paradigm. Motivated by theoretical and empirical observations, we propose an effective method, inspired by prototypical networks, that enables classification of target samples into both shared and novel (one-shot) target classes, even under a domain-shift. Our approach yields superior performance as compared to both DA and CI methods in the CIDA paradigm.
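A generic nearest-prototype classification sketch in the spirit of prototypical networks; this is only the inference-time skeleton, not the full CIDA training procedure:

```python
import torch

def nearest_prototype(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """features: (N, D) target-sample embeddings; prototypes: (K, D) class
    prototypes, e.g., means of embedded examples (a single embedded example
    for one-shot novel classes)."""
    dists = torch.cdist(features, prototypes)   # (N, K) Euclidean distances
    return dists.argmin(dim=1)                  # predicted class index per sample
```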
63. Neuromorphic Computing for Content-based Image Retrieval [PDF] 返回目录
Te-Yuan Liu, Ata Mahjoubfar, Daniel Prusinski, Luis Stevens
Abstract: Neuromorphic computing mimics the neural activity of the brain by emulating spiking neural networks. In numerous machine learning tasks, neuromorphic chips are expected to provide superior solutions in terms of cost and power efficiency. Here, we explore the application of Loihi, a neuromorphic computing chip developed by Intel, to the computer vision task of image retrieval. We evaluated the functionalities and the performance metrics that are critical in context-based visual search and recommender systems using deep-learning embeddings. Our results show that the neuromorphic solution is about 3.2 times more energy-efficient than an Intel Core i7 CPU and 12.5 times more energy-efficient than an Nvidia T4 GPU for inference by a lightweight convolutional neural network without batching, while maintaining the same level of matching accuracy. The study validates the long-term potential of neuromorphic computing in machine learning, as a complementary paradigm to existing Von Neumann architectures.
64. Two-Stage Deep Learning for Accelerated 3D Time-of-Flight MRA without Matched Training Data [PDF] 返回目录
Hyungjin Chung, Eunju Cha, Leonard Sunwoo, Jong Chul Ye
Abstract: Time-of-flight magnetic resonance angiography (TOF-MRA) is one of the most widely used non-contrast MR imaging methods to visualize blood vessels, but its 3-D volume acquisition makes highly accelerated acquisition necessary. Accordingly, high-quality reconstruction from undersampled TOF-MRA is an important research topic for deep learning. However, most existing deep learning approaches require matched reference data for supervised training, which are often difficult to obtain. By extending the recent theoretical understanding of cycleGAN from optimal transport theory, here we propose a novel two-stage unsupervised deep learning approach, composed of a multi-coil reconstruction network along the coronal plane followed by a multi-planar refinement network along the axial plane. Specifically, the first network is trained in the square-root of sum of squares (SSoS) domain to achieve high-quality parallel image reconstruction, whereas the second refinement network is designed to efficiently learn the characteristics of highly activated blood flow using a double-headed max-pool discriminator. Extensive experiments demonstrate that the proposed learning process without matched references exceeds the performance of a state-of-the-art compressed sensing (CS)-based method and provides comparable or even better results than supervised learning approaches.
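The SSoS combination mentioned above has a standard closed form; a short sketch (the array layout is an assumption):

```python
import numpy as np

def ssos(coil_images: np.ndarray) -> np.ndarray:
    """Square-root of sum of squares over coils.

    coil_images: complex array of shape (n_coils, H, W). SSoS yields a single
    magnitude image without requiring coil sensitivity maps.
    """
    return np.sqrt((np.abs(coil_images) ** 2).sum(axis=0))
```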
65. LoCo: Local Contrastive Representation Learning [PDF] 返回目录
Yuwen Xiong, Mengye Ren, Raquel Urtasun
Abstract: Deep neural nets typically perform end-to-end backpropagation to learn the weights, a procedure that creates synchronization constraints in the weight update step across layers and is not biologically plausible. Recent advances in unsupervised contrastive representation learning point to the question of whether a learning algorithm can also be made local, that is, the updates of lower layers do not directly depend on the computation of upper layers. While Greedy InfoMax separately learns each block with a local objective, we found that it consistently hurts readout accuracy in state-of-the-art unsupervised contrastive learning algorithms, possibly due to the greedy objective as well as gradient isolation. In this work, we discover that by overlapping local blocks stacking on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedbacks to lower blocks. This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time. Aside from standard ImageNet experiments, we also show results on complex downstream tasks such as object detection and instance segmentation directly using readout features.
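A toy illustration of gradient isolation between locally trained blocks: each block receives the previous output detached, so no gradient crosses block boundaries. LoCo's key ingredient, overlapping (shared) layers between adjacent blocks, and the contrastive objective itself are omitted; the cross-entropy heads are stand-ins:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

blocks = nn.ModuleList([nn.Sequential(nn.Linear(128, 128), nn.ReLU()) for _ in range(3)])
heads = nn.ModuleList([nn.Linear(128, 10) for _ in range(3)])   # stand-in local objectives

x, target = torch.randn(32, 128), torch.randint(0, 10, (32,))
total = torch.zeros(())
for block, head in zip(blocks, heads):
    x = block(x.detach())                             # gradient isolation at the boundary
    total = total + F.cross_entropy(head(x), target)  # each block trains on its own loss
total.backward()
```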
66. Faster Stochastic Alternating Direction Method of Multipliers for Nonconvex Optimization [PDF] 返回目录
Feihu Huang, Songcan Chen, Heng Huang
Abstract: In this paper, we propose a faster stochastic alternating direction method of multipliers (ADMM) for nonconvex optimization by using a new stochastic path-integrated differential estimator (SPIDER), called SPIDER-ADMM. Moreover, we prove that SPIDER-ADMM achieves a record-breaking incremental first-order oracle (IFO) complexity of $\mathcal{O}(n+n^{1/2}\epsilon^{-1})$ for finding an $\epsilon$-approximate stationary point, which improves on the deterministic ADMM by a factor of $\mathcal{O}(n^{1/2})$, where $n$ denotes the sample size. As one of the major contributions of this paper, we provide a new theoretical analysis framework for nonconvex stochastic ADMM methods that yields the optimal IFO complexity. Based on this new analysis framework, we study the previously unresolved optimal IFO complexity of the existing non-convex SVRG-ADMM and SAGA-ADMM methods, and prove that they have the optimal IFO complexity of $\mathcal{O}(n+n^{2/3}\epsilon^{-1})$. Thus, SPIDER-ADMM improves on the existing stochastic ADMM methods by a factor of $\mathcal{O}(n^{1/6})$. Moreover, we extend SPIDER-ADMM to the online setting, and propose a faster online SPIDER-ADMM. Our theoretical analysis shows that the online SPIDER-ADMM has an IFO complexity of $\mathcal{O}(\epsilon^{-\frac{3}{2}})$, which improves the existing best results by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$. Finally, the experimental results on benchmark datasets validate that the proposed algorithms have a faster convergence rate than the existing ADMM algorithms for nonconvex optimization.
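For context, the SPIDER estimator referenced above maintains a recursive, variance-reduced gradient estimate of the generic form $v_t = \nabla f_{\mathcal{S}_t}(x_t) - \nabla f_{\mathcal{S}_t}(x_{t-1}) + v_{t-1}$, where $\mathcal{S}_t$ is a mini-batch, refreshed by a full (or large-batch) gradient $v_t = \nabla f(x_t)$ every $q$ iterations. This is the standard construction from the SPIDER literature, not this paper's exact coupling with the ADMM updates.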
67. Design and Deployment of Photo2Building: A Cloud-based Procedural Modeling Tool as a Service [PDF] 返回目录
Manush Bhatt, Rajesh Kalyanam, Gen Nishida, Liu He, Christopher May, Dev Niyogi, Daniel Aliaga
Abstract: We present a Photo2Building tool to create a plausible 3D model of a building from only a single photograph. Our tool is based on a prior desktop version which, as described in this paper, is converted into a client-server model, with job queuing, web-page support, and support for concurrent usage. The reported cloud-based, web-accessible tool can reconstruct a building in 40 seconds on average, at a cost of only 0.60 USD under current pricing. This provides an extremely scalable and possibly widespread tool for creating building models for use in urban design and planning applications. With the growing impact of rapid urbanization on weather, climate, and resource availability, access to such a service is expected to help a wide variety of users worldwide, such as city planners and urban meteorologists, in the quest for improved prediction of urban weather and the design of climate-resilient cities of the future.
68. Generalisable Cardiac Structure Segmentation via Attentional and Stacked Image Adaptation [PDF] 返回目录
Hongwei Li, Jianguo Zhang, Bjoern Menze
Abstract: Tackling domain shifts in multi-centre and multi-vendor data sets remains challenging for cardiac image segmentation. In this paper, we propose a generalisable segmentation framework for cardiac image segmentation involving multi-centre, multi-vendor, multi-disease datasets. A generative adversarial network with an attention loss is proposed to translate images from existing source domains to a target domain, thereby generating good-quality synthetic cardiac structures and enlarging the training set. A stack of data augmentation techniques is further used to simulate real-world transformations and boost segmentation performance on unseen domains. We achieved an average Dice score of 90.3% for the left ventricle, 85.9% for the myocardium, and 86.5% for the right ventricle on the hidden validation set across four vendors. We show that the domain shifts in heterogeneous cardiac imaging datasets can be drastically reduced in two ways: 1) generating good-quality synthetic data by learning the underlying target-domain distribution, and 2) stacking classical image processing techniques for data augmentation.
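The Dice score reported per structure has a simple closed form; a short reference implementation on binary masks:

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """Dice overlap between two binary segmentation masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum() + eps)
```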
69. Including Images into Message Veracity Assessment in Social Media [PDF] 返回目录
Abderrazek Azri, Cécile Favre, Nouria Harbi, Jérôme Darmont
Abstract: The extensive use of social media in the diffusion of information has also laid fertile ground for the spread of rumors, which can significantly affect the credibility of social media. An ever-increasing number of users post news that includes, in addition to text, multimedia data such as images and videos. Yet, such multimedia content is easily editable due to the broad availability of simple and effective image and video processing tools. The problem of assessing the veracity of social network posts has attracted much attention from researchers in recent years. However, almost all previous work has focused on analyzing textual content to determine veracity, while visual content, and images in particular, remains ignored or little exploited in the literature. In this position paper, we propose a framework that explores two novel ways to assess the veracity of messages published on social networks by analyzing the credibility of both their textual and visual contents.
70. Deep Learning Techniques for Future Intelligent Cross-Media Retrieval [PDF] 返回目录
Sadaqat ur Rehman, Muhammad Waqas, Shanshan Tu, Anis Koubaa, Obaid ur Rehman, Jawad Ahmad, Muhammad Hanif, Zhu Han
Abstract: With the advancement of technology and the expansion of broadcasting, cross-media retrieval has gained much attention. It plays a significant role in big data applications and consists in searching for and retrieving data across different types of media. In this paper, we provide a novel taxonomy organized around the challenges faced by multi-modal deep learning approaches in solving cross-media retrieval, namely: representation, alignment, and translation. These challenges are evaluated on deep learning (DL) based methods, which are categorized into four main groups: 1) unsupervised methods, 2) supervised methods, 3) pairwise-based methods, and 4) rank-based methods. Then, we present some well-known cross-media datasets used for retrieval, considering the importance of these datasets in the context of deep-learning-based cross-media retrieval approaches. Moreover, we present an extensive review of state-of-the-art problems and their corresponding solutions for encouraging deep learning in cross-media retrieval. The fundamental objective of this work is to exploit Deep Neural Networks (DNNs) for bridging the "media gap", and to provide researchers and developers with a better understanding of the underlying problems and the potential solutions of deep-learning-assisted cross-media retrieval. To the best of our knowledge, this is the first comprehensive survey to address cross-media retrieval using deep learning methods.
71. Action sequencing using visual permutations [PDF] 返回目录
Michael Burke, Kartic Subr, Subramanian Ramamoorthy
Abstract: Humans can easily reason about the sequence of high-level actions needed to complete tasks, but it is particularly difficult to instil this ability in robots trained from relatively few examples. This work considers the task of neural action sequencing conditioned on a single reference visual state. The task is extremely challenging: it is not only subject to the significant combinatorial complexity that arises from large action sets, but also requires a model that can perform some form of symbol grounding, mapping high-dimensional input data to actions while reasoning about action relationships. Drawing on the human cognitive ability to rearrange objects in scenes to create new configurations, we take a permutation perspective and argue that action sequencing benefits from the ability to reason about both permutations and ordering concepts. Empirical analysis shows that neural models trained with latent permutations outperform standard neural architectures in constrained action sequencing tasks. Results also show that action sequencing using visual permutations is an effective mechanism for initialising and speeding up traditional planning techniques, and successfully scales to far greater action set sizes than models considered previously.
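The abstract does not specify how latent permutations are parameterized; a common differentiable choice in the literature is the Sinkhorn operator, which relaxes a score matrix into a doubly-stochastic soft permutation. A minimal numpy sketch under that assumption:

```python
import numpy as np

def _logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn(log_alpha, n_iters=20):
    """Normalize an n x n score matrix into an (approximately)
    doubly-stochastic soft permutation by alternating row and
    column normalization in log space."""
    for _ in range(n_iters):
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=1)  # rows sum to 1
        log_alpha = log_alpha - _logsumexp(log_alpha, axis=0)  # cols sum to 1
    return np.exp(log_alpha)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))        # e.g. a network's pairwise action scores
P = sinkhorn(scores)
actions = np.arange(4)                  # placeholder action indices
print(P @ actions)                      # softly reordered sequence
print(P.sum(axis=0), P.sum(axis=1))     # both rows and columns close to 1
```

A hard permutation can then be decoded from P, e.g. with the Hungarian algorithm, when a discrete action sequence is needed.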
72. HAMLET: A Hierarchical Multimodal Attention-based Human Activity Recognition Algorithm [PDF] 返回目录
Md Mofijul Islam, Tariq Iqbal
Abstract: To fluently collaborate with people, robots need the ability to recognize human activities accurately. Although modern robots are equipped with various sensors, robust human activity recognition (HAR) remains a challenging task for robots due to difficulties related to multimodal data fusion. To address these challenges, in this work we introduce a deep neural-network-based multimodal HAR algorithm, HAMLET. HAMLET incorporates a hierarchical architecture, where the lower layer encodes spatio-temporal features from unimodal data by adopting a multi-head self-attention mechanism. We develop a novel multimodal attention mechanism for disentangling and fusing the salient unimodal features to compute the multimodal features in the upper layer. Finally, the multimodal features are used in a fully connected neural network to recognize human activities. We evaluated our algorithm by comparing its performance to several state-of-the-art activity recognition algorithms on three human activity datasets. The results suggest that HAMLET outperformed all other evaluated baselines across all datasets and metrics tested, with the highest top-1 accuracies of 95.12% and 97.45% on the UTD-MHAD [1] and UT-Kinect [2] datasets respectively, and an F1-score of 81.52% on the UCSD-MIT [3] dataset. We further visualize the unimodal and multimodal attention maps, which provide us with a tool to interpret the impact of attention mechanisms on HAR.
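As a rough illustration of the two-level idea — unimodal multi-head self-attention in the lower layer, attention-weighted fusion in the upper layer — the following PyTorch sketch uses illustrative dimensions and a simple softmax fusion rule; it is a sketch of the general pattern, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MultimodalHAR(nn.Module):
    def __init__(self, dim=64, heads=4, n_modalities=2, n_classes=10):
        super().__init__()
        # Lower layer: one self-attention encoder per modality.
        self.encoders = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(n_modalities)
        )
        # Upper layer: scalar attention weights over modality summaries.
        self.fusion_score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, modalities):
        # modalities: list of (batch, time, dim) tensors, one per modality.
        summaries = []
        for x, attn in zip(modalities, self.encoders):
            h, _ = attn(x, x, x)             # unimodal self-attention
            summaries.append(h.mean(dim=1))  # pool over time
        s = torch.stack(summaries, dim=1)    # (batch, n_modalities, dim)
        w = torch.softmax(self.fusion_score(s), dim=1)
        fused = (w * s).sum(dim=1)           # attention-weighted fusion
        return self.classifier(fused)

model = MultimodalHAR()
rgb = torch.randn(8, 16, 64)    # e.g. visual features
imu = torch.randn(8, 16, 64)    # e.g. inertial features
print(model([rgb, imu]).shape)  # torch.Size([8, 10])
```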
73. 3D B-mode ultrasound speckle reduction using deep learning for 3D registration applications [PDF] 返回目录
Hongliang Li, Tal Mezheritsky, Liset Vazquez Romaguera, Samuel Kadoury
Abstract: Ultrasound (US) speckles are granular patterns that can impede image post-processing tasks, such as image segmentation and registration. Conventional filtering approaches are commonly used to remove US speckles, but their main drawback is a long run-time in 3D scenarios. Although a few studies have used deep learning to remove speckles from 2D US images, to our knowledge no study has performed speckle reduction of 3D B-mode US using deep learning. In this study, we propose a 3D dense U-Net model to process 3D US B-mode data from a clinical US system. The model's results were applied to 3D registration. We show that our deep learning framework can obtain a similar speckle suppression and mean preservation index (1.066) to conventional filtering approaches (0.978), while reducing the run-time by two orders of magnitude. Moreover, we find that speckle reduction using our deep learning model contributes to improved 3D registration performance: the mean square error of 3D registration on data despeckled with the 3D U-Net is reduced by half compared to registration on data with speckles.
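The exact 3D dense U-Net is not specified in the abstract, but its core ingredient — densely connected 3D convolutions in which each layer sees the concatenation of all previous feature maps — can be sketched as follows; layer counts and widths are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class Dense3DBlock(nn.Module):
    """A small densely connected 3D conv block: each layer receives the
    concatenation of the input and all previous layer outputs."""
    def __init__(self, in_ch=1, growth=8, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv3d(ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            ch += growth
        # A 1x1x1 conv maps the dense features back to one output volume.
        self.out = nn.Conv3d(ch, 1, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.out(torch.cat(feats, dim=1))

vol = torch.randn(1, 1, 32, 64, 64)   # (batch, channel, depth, height, width)
print(Dense3DBlock()(vol).shape)      # torch.Size([1, 1, 32, 64, 64])
```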
74. Sub-Pixel Back-Projection Network For Lightweight Single Image Super-Resolution [PDF] 返回目录
Supratik Banerjee, Cagri Ozcinar, Aakanksha Rana, Aljosa Smolic, Michael Manzke
Abstract: Convolutional neural network (CNN)-based methods have achieved great success in single-image super-resolution (SISR). However, most models improve reconstruction accuracy at the cost of an increased number of model parameters. To tackle this problem, in this paper we study how to reduce the number of parameters and the computational cost of CNN-based SISR methods while maintaining super-resolution reconstruction accuracy. To this end, we introduce a novel network architecture for SISR that strikes a good trade-off between reconstruction quality and low computational complexity. Specifically, we propose an iterative back-projection architecture using sub-pixel convolution instead of deconvolution layers. We evaluate the computational and reconstruction accuracy of our proposed model with extensive quantitative and qualitative evaluations. Experimental results reveal that our proposed method uses fewer parameters and reduces the computational cost while maintaining reconstruction accuracy comparable to state-of-the-art SISR methods over four well-known SR benchmark datasets. Code is available at "this https URL".
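The key building block, sub-pixel convolution, replaces a deconvolution layer with an ordinary convolution that produces r^2 times more channels, followed by a PixelShuffle rearrangement into an r-times larger feature map. A minimal PyTorch sketch of that block, with illustrative channel counts:

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Upsample by factor r with conv + PixelShuffle instead of a
    transposed convolution (deconvolution)."""
    def __init__(self, channels=64, r=2):
        super().__init__()
        # Produce r*r times more channels, then rearrange them spatially.
        self.conv = nn.Conv2d(channels, channels * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

x = torch.randn(1, 64, 32, 32)
print(SubPixelUpsample()(x).shape)  # torch.Size([1, 64, 64, 64])
```

Because the rearrangement is a fixed reshaping, this upsampling adds no parameters beyond the preceding convolution, which is one reason it suits lightweight SISR models.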