Abstracts

1. Deep Learning-Based Human Pose Estimation: A Survey
Ce Zheng, Wenhan Wu, Taojiannan Yang, Sijie Zhu, Chen Chen, Ruixu Liu, Ju Shen, Nasser Kehtarnavaz, Mubarak Shah
Abstract: Human pose estimation aims to locate human body parts and build a human body representation (e.g., a body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been used in a wide range of applications, including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although recently developed deep learning-based solutions have achieved high performance in human pose estimation, challenges remain due to insufficient training data, depth ambiguities, and occlusions. The goal of this survey is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers published since 2014 are covered. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are outlined. We also provide a regularly updated project page at this https URL.

2. Global Context Networks
Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, Han Hu
Abstract: The Non-Local Network (NLNet) presents a pioneering approach for capturing long-range dependencies within an image, via aggregating query-specific global context to each query position. However, through a rigorous empirical analysis, we have found that the global contexts modeled by the non-local network are almost the same for different query positions. In this paper, we take advantage of this finding to create a simplified network based on a query-independent formulation, which maintains the accuracy of NLNet but with significantly less computation. We further replace the one-layer transformation function of the non-local block by a two-layer bottleneck, which further reduces the parameter number considerably. The resulting network element, called the global context (GC) block, effectively models global context in a lightweight manner, allowing it to be applied at multiple layers of a backbone network to form a global context network (GCNet). Experiments show that GCNet generally outperforms NLNet on major benchmarks for various recognition tasks. The code and network configurations are available at this https URL.
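
The query-independent idea in the abstract above can be illustrated with a small sketch. It assumes a softmax attention map for pooling, a ReLU between the two bottleneck layers, and fusion by addition; the abstract does not state these details, normalization layers are left out, and the function and weight names (`gc_block`, `w_key`, `w_down`, `w_up`) are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    # Multiply a row-major matrix by a vector.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def gc_block(features, w_key, w_down, w_up):
    """features: N positions, each a list of C channel values.
    w_key: C weights; w_down: (C/r) x C matrix; w_up: C x (C/r) matrix."""
    # One attention map shared by all query positions (query-independent).
    attn = softmax([sum(k * x for k, x in zip(w_key, pos)) for pos in features])
    channels = len(features[0])
    # Global context: attention-weighted average of all positions.
    context = [sum(a * pos[c] for a, pos in zip(attn, features))
               for c in range(channels)]
    # Two-layer bottleneck transform (ReLU in between is an assumption).
    hidden = [max(0.0, h) for h in matvec(w_down, context)]
    delta = matvec(w_up, hidden)
    # The same context vector is added to every position.
    return [[x + d for x, d in zip(pos, delta)] for pos in features]
```

Because the context vector is computed once and broadcast, the cost of the attention step grows linearly with the number of positions rather than quadratically.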

3. Unsupervised deep clustering and reinforcement learning can accurately segment MRI brain tumors with very small training sets
Joseph Stember, Hrithwik Shalu
Abstract: Purpose: Lesion segmentation in medical imaging is key to evaluating treatment response. We have recently shown that reinforcement learning can be applied to radiological images for lesion localization. Furthermore, we demonstrated that reinforcement learning addresses important limitations of supervised deep learning; namely, it can eliminate the requirement for large amounts of annotated training data and can provide valuable intuition lacking in supervised approaches. However, we did not address the fundamental task of lesion/structure-of-interest segmentation. Here we introduce a method combining unsupervised deep learning clustering with reinforcement learning to segment brain lesions on MRI. Materials and Methods: We initially clustered images using unsupervised deep learning clustering to generate candidate lesion masks for each MRI image. The user then selected the best mask for each of 10 training images. We then trained a reinforcement learning algorithm to select the masks. We tested the corresponding trained deep Q network on a separate testing set of 10 images. For comparison, we also trained and tested a U-net supervised deep learning network on the same set of training/testing images. Results: Whereas the supervised approach quickly overfit the training data and predictably performed poorly on the testing set (16% average Dice score), the unsupervised deep clustering and reinforcement learning achieved an average Dice score of 83%. Conclusion: We have demonstrated a proof-of-principle application of unsupervised deep clustering and reinforcement learning to segment brain tumors. The approach represents human-allied AI that requires minimal input from the radiologist without the need for hand-traced annotation.
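
The results above are reported as average Dice scores. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient for two binary masks; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two equally sized binary masks
    given as nested lists of 0/1 values."""
    flat_pred = [p for row in pred for p in row]
    flat_truth = [t for row in truth for t in row]
    # Pixels marked as lesion in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.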

4. Person Re-Identification using Deep Learning Networks: A Systematic Review
Ankit Yadav, Dinesh Kumar Vishwakarma
Abstract: Person re-identification has received a lot of attention from the research community in recent times. Due to its vital role in security-based applications, person re-identification lies at the heart of research relevant to tracking robberies, preventing terrorist attacks, and other security-critical events. While the last decade has seen tremendous growth in re-id approaches, very little review literature exists to comprehend and summarize this progress. This review deals with the latest state-of-the-art deep learning-based approaches for person re-identification. While the few existing re-id review works have analysed re-id techniques from a single aspect, this review evaluates numerous re-id techniques from multiple deep learning aspects such as deep architecture types, common re-id challenges (variation in pose, lighting, view, and scale, partial or complete occlusion, background clutter), multi-modal re-id, cross-domain re-id challenges, metric learning approaches, and video re-id contributions. This review also includes several re-id benchmarks collected over the years, describing their characteristics, specifications, and the top re-id results obtained on them. The inclusion of the latest deep re-id works makes this a significant contribution to the re-id literature. Lastly, the conclusion and future directions are included.

5. Seed Phenotyping on Neural Networks using Domain Randomization and Transfer Learning
Venkat Margapuri, Mitchell Neilsen
Abstract: Seed phenotyping is the idea of analyzing the morphometric characteristics of a seed to predict its behavior in terms of development, tolerance, and yield in various environmental conditions. The focus of the work is the application and feasibility analysis of the state-of-the-art object detection and localization neural networks, Mask R-CNN and YOLO (You Only Look Once), for seed phenotyping using TensorFlow. One of the major bottlenecks of such an endeavor is the need for large amounts of training data. While capturing a multitude of seed images is daunting, the images must also be annotated to indicate the boundaries of the seeds and converted to data formats that the neural networks are able to consume. Although tools to perform the annotation manually are available for free, the amount of time required is enormous. To tackle such a scenario, the idea of domain randomization, i.e., the technique of applying models trained on images containing simulated objects to real-world objects, is considered. In addition, transfer learning, i.e., the idea of applying the knowledge obtained while solving one problem to a different problem, is used. The networks are trained starting from pre-trained weights from the popular ImageNet and COCO data sets. As part of the work, experiments with different parameters are conducted on five different seed types, namely canola, rough rice, sorghum, soy, and wheat.

6. Interpolating Points on a Non-Uniform Grid using a Mixture of Gaussians
Ivan Skorokhodov
Abstract: In this work, we propose an approach to perform non-uniform image interpolation based on a Gaussian Mixture Model. Traditional image interpolation methods, such as nearest neighbor, bilinear, Hamming, and Lanczos, assume that the coordinates to interpolate from are positioned on a uniform grid. However, this is not always the case in practice, so we develop an interpolation method that is able to generate an image from arbitrarily positioned pixel values. We do this by representing each known pixel as a 2D normal distribution and considering each output image pixel as a sample from the mixture of all the known ones. Apart from the ability to reconstruct an image from an arbitrarily positioned set of pixels, this also allows us to differentiate through the interpolation procedure, which might be helpful for downstream applications. Our optimized CUDA kernel and the source code to reproduce the benchmarks are located at this https URL.
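
A rough way to picture interpolation from arbitrarily positioned pixels is sketched below. It is a simplification under stated assumptions rather than the paper's method: every known pixel gets an isotropic Gaussian with the same `sigma` and equal mixture weight, and each output value is the responsibility-weighted mean of the known values instead of a sample. The function name `interpolate` is hypothetical.

```python
import math

def interpolate(known, out_coords, sigma=1.0):
    """known: list of ((x, y), value) pairs at arbitrary positions.
    out_coords: list of (x, y) positions to fill in.
    Returns one interpolated value per output coordinate."""
    result = []
    for ox, oy in out_coords:
        # Log-density (up to a constant) of each Gaussian at the output position.
        logs = [-((ox - x) ** 2 + (oy - y) ** 2) / (2.0 * sigma ** 2)
                for (x, y), _ in known]
        # Shift by the maximum so the largest weight is 1 and the sum never underflows to 0.
        m = max(logs)
        weights = [math.exp(l - m) for l in logs]
        total = sum(weights)
        result.append(sum(w * v for w, (_, v) in zip(weights, known)) / total)
    return result
```

Every operation here is smooth in the pixel positions and values, which is the property that makes differentiating through the interpolation possible.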

7. Dynamic Facial Expression Recognition under Partial Occlusion with Optical Flow Reconstruction
Delphine Poux, Benjamin Allaert, Nacim Ihaddadene, Ioan Marius Bilasco, Chaabane Djeraba, Mohammed Bennamoun
Abstract: Video facial expression recognition is useful for many applications and has received much interest lately. Although some solutions give very good results in a controlled environment (no occlusion), recognition in the presence of partial facial occlusion remains a challenging task. To handle occlusions, solutions based on the reconstruction of the occluded part of the face have been proposed. These solutions are mainly based on the texture or the geometry of the face. However, the similarity of the face movement between different persons making the same expression seems to be a real asset for the reconstruction. In this paper we exploit this asset and propose a new solution based on an auto-encoder with skip connections to reconstruct the occluded part of the face in the optical flow domain. To the best of our knowledge, this is the first proposal to directly reconstruct the movement for facial expression recognition. We validated our approach on the controlled dataset CK+, on which different occlusions were generated. Our experiments show that the proposed method significantly reduces the gap, in terms of recognition accuracy, between occluded and non-occluded situations. We also compare our approach with existing state-of-the-art solutions. To lay the basis for a reproducible and fair comparison in the future, we also propose a new experimental protocol that includes occlusion generation and reconstruction evaluation.

8. Memory-Efficient Hierarchical Neural Architecture Search for Image Restoration
Haokui Zhang, Ying Li, Chengrong Gong, Hao Chen, Zongwen Bai, Chunhua Shen
Abstract: Recently, much attention has been paid to neural architecture search (NAS) approaches, which often outperform manually designed architectures on high-level vision tasks. Inspired by this, we attempt to leverage NAS techniques to automatically design efficient network architectures for low-level image restoration tasks. In this paper, we propose a memory-efficient hierarchical NAS method (HiNAS) and apply it to two such tasks: image denoising and image super-resolution. HiNAS adopts gradient-based search strategies and builds a flexible hierarchical search space, including an inner search space and an outer search space, which are in charge of designing cell architectures and deciding cell widths, respectively. For the inner search space, we propose a layerwise architecture sharing strategy (LWAS), resulting in more flexible architectures and better performance. For the outer search space, we propose a cell sharing strategy to save memory and considerably accelerate the search. The proposed HiNAS is both memory and computation efficient. With a single GTX1080Ti GPU, it takes only about 1 hour to search for a denoising network on BSD 500 and 3.5 hours to search for the super-resolution structure on DIV2K. Experimental results show that the architectures found by HiNAS have fewer parameters and enjoy a faster inference speed, while achieving highly competitive performance compared with state-of-the-art methods.

9. Appearance-Invariant 6-DoF Visual Localization using Generative Adversarial Networks
Yimin Lin, Jianfeng Huang, Shiguo Lian
Abstract: We propose a novel visual localization network for settings in which the outdoor environment has changed, for example through different illumination, weather, and season. The visual localization network is composed of a feature extraction network and a pose regression network. The feature extraction network is made up of an encoder network based on the Generative Adversarial Network CycleGAN, which can capture intrinsic appearance-invariant feature maps from unpaired samples of different weathers and seasons. With such an invariant feature, we use a 6-DoF pose regression network to tackle long-term visual localization in the presence of outdoor illumination, weather, and season changes. A variety of challenging datasets for place recognition and localization are used to validate our visual localization network, and the results show that our method outperforms state-of-the-art methods in scenarios with various environment changes.

10. Control of computer pointer using hand gesture recognition in motion pictures
Yalda Foroutan, Ahmad Kalhor, Saeid Mohammadi Nejati, Samad Sheikhaei
Abstract: A user interface is designed to control the computer cursor by hand detection and classification of its gesture. A hand dataset with 6720 image samples is collected, including four classes: fist, palm, pointing to the left, and pointing to the right. The images are captured from 15 persons in simple backgrounds and different perspectives and light conditions. A CNN network is trained on this dataset to predict a label for each captured image and measure the similarity of them. Finally, commands are defined to click, right-click and move the cursor. The algorithm has 91.88% accuracy and can be used in different backgrounds.

11. Unveiling Real-Life Effects of Online Photo Sharing
Van-Khoa Nguyen, Adrian Popescu, Jerome Deshayes-Chossart
Abstract: Social networks give free access to their services in exchange for the right to exploit their users' data. Data sharing is done in an initial context which is chosen by the users. However, data are used by social networks and third parties in different contexts which are often not transparent. We propose a new approach which unveils potential effects of data sharing in impactful real-life situations. Focus is put on visual content because of its strong influence in shaping online user profiles. The approach relies on three components: (1) a set of concepts with associated situation impact ratings obtained by crowdsourcing, (2) a corresponding set of object detectors used to analyze users' photos and (3) a ground truth dataset made of 500 visual user profiles which are manually rated for each situation. These components are combined in LERVUP, a method which learns to rate visual user profiles in each situation. LERVUP exploits a new image descriptor which aggregates concept ratings and object detections at user level. It also uses an attention mechanism to boost the detections of highly-rated concepts to prevent them from being overwhelmed by low-rated ones. Performance is evaluated per situation by measuring the correlation between the automatic ranking of profile ratings and a manual ground truth. Results indicate that LERVUP is effective since a strong correlation of the two rankings is obtained. This finding indicates that providing meaningful automatic situation-related feedback about the effects of data sharing is feasible.

12. Adversarial Momentum-Contrastive Pre-Training
Cong Xu, Min Yang
Abstract: Deep neural networks are vulnerable to semantic invariant corruptions and imperceptible artificial perturbations. Although data augmentation can improve the robustness against the former, it offers no guarantees against the latter. Adversarial training, on the other hand, is quite the opposite. Recent studies have shown that adversarial self-supervised pre-training helps to extract representations that are invariant under both data augmentations and adversarial perturbations. Building on the idea of MoCo, this paper proposes a novel adversarial momentum-contrastive (AMOC) pre-training approach, which designs two dynamic memory banks to maintain the historical clean and adversarial representations respectively, so as to exploit the discriminative representations that are consistent over a long period. Compared with the existing self-supervised pre-training approaches, AMOC can use a smaller batch size and fewer training epochs but learn more robust features. Empirical results show that the developed approach further improves the current state-of-the-art adversarial robustness.

13. Objective Class-based Micro-Expression Recognition through Simultaneous Action Unit Detection and Feature Aggregation
Ling Zhou, Qirong Mao, Ming Dong
Abstract: Micro-expression recognition (MER) has attracted much attention from researchers due to its potential value in many practical applications. MER is a challenging task because subtle changes occur over different action regions of a face. Changes in facial action regions are formed as Action Units (AUs), and AUs in micro-expressions can be seen as the actors in cooperative group activities. In this paper, we propose a novel deep neural network model for objective class-based MER, which simultaneously detects AUs and aggregates AU-level features into a micro-expression-level representation through Graph Convolutional Networks (GCN). Specifically, we propose two new strategies in our AU detection module for more effective AU feature learning: the attention mechanism and the balanced detection loss function. With those two strategies, features are learned for all the AUs in a unified model, eliminating the error-prone landmark detection process and tedious separate training for each AU. Moreover, our model incorporates a tailored objective class-based AU knowledge graph, which helps the GCN aggregate the AU-level features into a micro-expression-level feature representation. Extensive experiments on two tasks in MEGC 2018 show that our approach significantly outperforms the current state of the art in MER. Additionally, we report our single model-based micro-expression AU detection results.

14. A non-alternating graph hashing algorithm for large scale image search
Sobhan Hemati, Mohammad Hadi Mehdizavareh, Shojaeddin Chenouri, Hamid R Tizhoosh
Abstract: In the era of big data, methods for improving memory and computational efficiency have become crucial for successful deployment of technologies. Hashing is one of the most effective approaches to deal with computational limitations that come with big data. One natural way for formulating this problem is spectral hashing that directly incorporates affinity to learn binary codes. However, due to binary constraints, the optimization becomes intractable. To mitigate this challenge, different relaxation approaches have been proposed to reduce the computational load of obtaining binary codes and still attain a good solution. The problem with all existing relaxation methods is resorting to one or more additional auxiliary variables to attain high quality binary codes while relaxing the problem. The existence of auxiliary variables leads to coordinate descent approach which increases the computational complexity. We argue that introducing these variables is unnecessary. To this end, we propose a novel relaxed formulation for spectral hashing that adds no additional variables to the problem. Furthermore, instead of solving the problem in original space where number of variables is equal to the data points, we solve the problem in a much smaller space and retrieve the binary codes from this solution. This trick reduces both the memory and computational complexity at the same time. We apply two optimization techniques, namely projected gradient and optimization on manifold, to obtain the solution. Using comprehensive experiments on four public datasets, we show that the proposed efficient spectral hashing (ESH) algorithm achieves highly competitive retrieval performance compared with state of the art at low complexity.

15. WEmbSim: A Simple yet Effective Metric for Image Captioning
Naeha Sharif, Lyndon White, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah
Abstract: The area of automatic image caption evaluation is still undergoing intensive research to address the need for generating captions that meet adequacy and fluency requirements. Based on our past attempts at developing highly sophisticated learning-based metrics, we have discovered that a simple cosine similarity measure using the Mean of Word Embeddings (MOWE) of captions can actually achieve surprisingly high performance on unsupervised caption evaluation. This inspires our proposed work on an effective metric, WEmbSim, which beats complex measures such as SPICE, CIDEr, and WMD at system-level correlation with human judgments. Moreover, it also achieves the best accuracy at matching human consensus scores for caption pairs, against commonly used unsupervised methods. Therefore, we believe that WEmbSim sets a new baseline for any complex metric to be justified.
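
The core of the metric described above, cosine similarity between the mean word embeddings of two captions, is simple enough to sketch. The tokenization (lowercase, whitespace split), the handling of out-of-vocabulary words (skipped), the single-reference setup, and the function names are assumptions for illustration; the abstract does not specify them.

```python
import math

def mean_embedding(tokens, embeddings):
    # Average the vectors of all tokens that have an embedding.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the caption has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define the similarity as 0 when either vector is all zeros.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def wembsim(candidate, reference, embeddings):
    """Cosine similarity between the mean word embeddings of two captions."""
    cand = mean_embedding(candidate.lower().split(), embeddings)
    ref = mean_embedding(reference.lower().split(), embeddings)
    return cosine(cand, ref)
```

In practice `embeddings` would be a pre-trained word-vector table; a plain dictionary mapping each word to a list of floats is enough to try the function.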

16. MRDet: A Multi-Head Network for Accurate Oriented Object Detection in Aerial Images
Ran Qin, Qingjie Liu, Guangshuai Gao, Di Huang, Yunhong Wang
Abstract: Objects in aerial images usually have arbitrary orientations and are densely located over the ground, making them extremely challenging to detect. Many recently developed methods attempt to solve these issues by estimating an extra orientation parameter and placing dense anchors, which results in high model complexity and computational costs. In this paper, we propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors. The AO-RPN is very efficient, adding only a small number of parameters to the original RPN. Furthermore, to obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network to accomplish them. Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately. For convenience, we name it MRDet, short for Multi-head Rotated object Detector. We test the proposed MRDet on two challenging benchmarks, i.e., DOTA and HRSC2016, and compare it with several state-of-the-art methods. Our method achieves very promising results which clearly demonstrate its effectiveness.

17. Hausdorff Point Convolution with Geometric Priors
Pengdi Huang, Liqiang Lin, Fuyou Xue, Kai Xu, Danny Cohen-Or, Hui Huang
Abstract: Without a shape-aware response, it is hard to characterize the 3D geometry of a point cloud efficiently with a compact set of kernels. In this paper, we advocate the use of Hausdorff distance as a shape-aware distance measure for calculating point convolutional responses. The technique we present, coined Hausdorff Point Convolution (HPC), is shape-aware. We show that HPC provides powerful point feature learning with a rather compact set of only four types of geometric priors as kernels. We further develop an HPC-based deep neural network (HPC-DNN). Task-specific learning can be achieved by tuning the network weights for combining the shortest distances between input and kernel point sets. We also realize hierarchical feature learning by designing a multi-kernel HPC for multi-scale feature encoding. Extensive experiments demonstrate that HPC-DNN outperforms strong point convolution baselines (e.g., KPConv), achieving a 2.8% mIoU performance boost on S3DIS and 1.5% on SemanticKITTI for the semantic segmentation task.
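
For readers unfamiliar with the distance measure named above, here is a minimal sketch of the classical Hausdorff distance between two point sets. It only illustrates the measure itself; how HPC turns the shortest distances between input and kernel points into a learned convolutional response is not described in the abstract and is not modeled here.

```python
import math

def directed_hausdorff(a, b):
    # For each point in a, find its nearest point in b; return the worst such gap.
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

The distance is small only when every point of each set lies close to some point of the other set, which is why it responds to shape rather than to individual point positions.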

18. FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training
Yonggan Fu, Haoran You, Yang Zhao, Yue Wang, Chaojian Li, Kailash Gopalakrishnan, Zhangyang Wang, Yingyan Lin
Abstract: Recent breakthroughs in deep neural networks (DNNs) have fueled a tremendous demand for intelligent edge devices featuring on-site learning, while the practical realization of such systems remains a challenge due to the limited resources available at the edge and the required massive training costs for state-of-the-art (SOTA) DNNs. As reducing precision is one of the most effective knobs for boosting training time/energy efficiency, there has been a growing interest in low-precision DNN training. In this paper, we explore from an orthogonal direction: how to fractionally squeeze out more training cost savings from the most redundant bit level, progressively along the training trajectory and dynamically per input. Specifically, we propose FracTrain that integrates (i) progressive fractional quantization which gradually increases the precision of activations, weights, and gradients that will not reach the precision of SOTA static quantized DNN training until the final training stage, and (ii) dynamic fractional quantization which assigns precisions to both the activations and gradients of each layer in an input-adaptive manner, for only "fractionally" updating layer parameters. Extensive simulations and ablation studies (six models, four datasets, and three training settings including standard, adaptation, and fine-tuning) validate the effectiveness of FracTrain in reducing computational cost and hardware-quantified energy/latency of DNN training while achieving a comparable or better (-0.12%~+1.87%) accuracy. For example, when training ResNet-74 on CIFAR-10, FracTrain achieves 77.6% and 53.5% computational cost and training latency savings, respectively, compared with the best SOTA baseline, while achieving a comparable (-0.07%) accuracy. Our codes are available at: this https URL.
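
The progressive part of the idea above (low precision early, full target precision only in the final stage) can be illustrated with a minimal sketch. The linear ramp, the bit widths 3 and 8, and the function names `precision_schedule` and `quantize` are all assumptions made for illustration; the abstract does not specify FracTrain's actual schedule or quantizer.

```python
def precision_schedule(epoch, total_epochs, start_bits=3, final_bits=8):
    """Hypothetical linear ramp of bit width over training.

    The target precision (final_bits) is reached only in the last epoch.
    """
    if total_epochs <= 1:
        return final_bits
    step = round(epoch * (final_bits - start_bits) / (total_epochs - 1))
    return start_bits + step


def quantize(x, bits):
    """Uniform symmetric quantization of a value clipped to [-1, 1].

    Requires bits >= 2 so that at least one non-zero level exists.
    """
    levels = 2 ** (bits - 1) - 1
    x = max(-1.0, min(1.0, x))
    return round(x * levels) / levels


if __name__ == "__main__":
    total = 6
    for epoch in range(total):
        bits = precision_schedule(epoch, total)
        print(epoch, bits, quantize(0.3, bits))
```

Early epochs use coarse values that are cheap to compute with, and the quantization error shrinks as the bit width grows toward the final stage.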

19. MobileSal: Extremely Efficient RGB-D Salient Object Detection [PDF]
Yu-Huan Wu, Yun Liu, Jun Xu, Jia-Wang Bian, Yuchao Gu, Ming-Ming Cheng
Abstract: The high computational cost of neural networks has prevented recent successes in RGB-D salient object detection (SOD) from benefiting real-world applications. Hence, this paper introduces a novel network, MobileSal, which focuses on efficient RGB-D SOD by using mobile networks for deep feature extraction. The problem is that mobile networks are less powerful in feature representation than cumbersome networks. To this end, we observe that the depth information of color images can strengthen the feature representation related to SOD if leveraged properly. Therefore, we propose an implicit depth restoration (IDR) technique to strengthen the feature representation capability of mobile networks for RGB-D SOD. IDR is only adopted in the training phase and is omitted during testing, so it is computationally free. Besides, we propose compact pyramid refinement (CPR) for efficient multi-level feature aggregation so that we can derive salient objects with clear boundaries. With IDR and CPR incorporated, MobileSal performs favorably against state-of-the-art methods on seven challenging RGB-D SOD datasets with much faster speed (450fps) and fewer parameters (6.5M). The code will be released.

20. EDN: Salient Object Detection via Extremely-Downsampled Network [PDF]
Yu-Huan Wu, Yun Liu, Le Zhang, Ming-Ming Cheng
Abstract: Recent progress on salient object detection (SOD) mainly benefits from multi-scale learning, where the high-level and low-level features work collaboratively in locating salient objects and discovering fine details, respectively. However, most efforts are devoted to low-level feature learning by fusing multi-scale features or enhancing boundary representations. In this paper, we show another direction that improving high-level feature learning is essential for SOD as well. To verify this, we introduce an Extremely-Downsampled Network (EDN), which employs an extreme downsampling technique to effectively learn a global view of the whole image, leading to accurate salient object localization. A novel Scale-Correlated Pyramid Convolution (SCPC) is also designed to build an elegant decoder for recovering object details from the above extreme downsampling. Extensive experiments demonstrate that EDN achieves state-of-the-art performance with real-time speed. Hence, this work is expected to spark some new thinking in SOD. The code will be released.

21. P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding [PDF]
Yunze Liu, Li Yi, Shanghang Zhang, Qingnan Fan, Thomas Funkhouser, Hao Dong
Abstract: Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization for more efficient and effective training on downstream tasks. A promising approach is to use contrastive learning to learn a latent space where features are close for similar data samples and far apart for dissimilar ones. This approach has demonstrated tremendous success for pretraining both image and point cloud feature extractors, but it has been barely investigated for multi-modal RGB-D scans, especially with the goal of facilitating high-level scene understanding. To solve this problem, we propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed and/or the two RGB-D points are not in correspondence. This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two. Experiments show that this proposed approach yields better performance on three large-scale RGB-D scene understanding benchmarks (ScanNet, SUN RGB-D, and 3RScan) than previous pretraining approaches.
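
The contrastive objective described above (features close for positives, far apart for negatives) is commonly written as an InfoNCE-style loss. The sketch below is a generic version of that loss, not the loss from the paper: the abstract does not give P4Contrast's exact formulation, and the function name `info_nce`, the temperature value, and the use of precomputed similarity scores are assumptions.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.1):
    """Generic InfoNCE loss for one anchor.

    pos_sim  : similarity between the anchor and its positive pair
    neg_sims : similarities between the anchor and each negative pair
    Returns -log( exp(pos/t) / (exp(pos/t) + sum_i exp(neg_i/t)) ).
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    # A well-separated positive gives a small loss; a hard negative raises it.
    print(info_nce(0.9, [0.1, 0.2]))
    print(info_nce(0.9, [0.1, 0.85]))
```

In this framing, a "hard negative" (for example a pair in which one modality has been disturbed) is one whose similarity to the anchor is close to the positive's, which is what makes the loss informative.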

22. Rotation Equivariant Siamese Networks for Tracking [PDF]
Deepak K. Gupta, Devanshu Arya, Efstratios Gavves
Abstract: Rotation is among the long prevailing, yet still unresolved, hard challenges encountered in visual object tracking. The existing deep learning-based tracking algorithms use regular CNNs that are inherently translation equivariant, but not designed to tackle rotations. In this paper, we first demonstrate that in the presence of rotation instances in videos, the performance of existing trackers is severely affected. To circumvent the adverse effect of rotations, we present rotation-equivariant Siamese networks (RE-SiamNets), built through the use of group-equivariant convolutional layers comprising steerable filters. SiamNets allow estimating the change in orientation of the object in an unsupervised manner, thereby facilitating its use in relative 2D pose estimation as well. We further show that this change in orientation can be used to impose an additional motion constraint in Siamese tracking through imposing restriction on the change in orientation between two consecutive frames. For benchmarking, we present Rotation Tracking Benchmark (RTB), a dataset comprising a set of videos with rotation instances. Through experiments on two popular Siamese architectures, we show that RE-SiamNets handle the problem of rotation very well and out-perform their regular counterparts. Further, RE-SiamNets can accurately estimate the relative change in pose of the target in an unsupervised fashion, namely the in-plane rotation the target has sustained with respect to the reference frame.
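
To make the notion of rotation equivariance concrete, the following sketch checks it numerically for the simplest case: 90-degree rotations and a plain "valid" cross-correlation. This is only an illustration of the property; it is not the steerable-filter, group-equivariant layer used in RE-SiamNets, and the helper names `rot90` and `correlate_valid` are hypothetical.

```python
def rot90(m):
    """Rotate a 2D list counter-clockwise by 90 degrees."""
    return [list(row) for row in zip(*m)][::-1]


def correlate_valid(image, kernel):
    """Plain 2D cross-correlation without padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 0, 1],
         [0, 1, 3, 2],
         [2, 0, 1, 1],
         [1, 1, 0, 2]]
kernel = [[1, 0],
          [0, -1]]

reference = rot90(correlate_valid(image, kernel))

# Rotating the input AND the filter rotates the output: equivariance holds.
print(correlate_valid(rot90(image), rot90(kernel)) == reference)  # True

# A regular CNN keeps its filter fixed, so the rotated input gives a
# different response map.
print(correlate_valid(rot90(image), kernel) == reference)  # False
```

The second check shows why a regular CNN, whose filters stay fixed, responds differently to a rotated target, while a layer that also applies rotated copies of its filters can recover the same response up to rotation.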

23. Task-Adaptive Negative Class Envision for Few-Shot Open-Set Recognition [PDF]
Shiyuan Huang, Jiawei Ma, Guangxing Han, Shih-Fu Chang
Abstract: Recent works seek to endow recognition systems with the ability to handle the open world. Few shot learning aims for fast learning of new classes from limited examples, while open-set recognition considers unknown negative class from the open world. In this paper, we study the problem of few-shot open-set recognition (FSOR), which learns a recognition system robust to queries from new sources with few examples and from unknown open sources. To achieve that, we mimic human capability of envisioning new concepts from prior knowledge, and propose a novel task-adaptive negative class envision method (TANE) to model the open world. Essentially we use an external memory to estimate a negative class representation. Moreover, we introduce a novel conjugate episode training strategy that strengthens the learning process. Extensive experiments on four public benchmarks show that our approach significantly improves the state-of-the-art performance on few-shot open-set recognition. Besides, we extend our method to generalized few-shot open-set recognition (GFSOR), where we also achieve performance gains on MiniImageNet.

24. Union-net: A deep neural network model adapted to small data sets [PDF]
Qingfang He, Guang Cheng, Zhiying Lin
Abstract: In real applications, generally only small data sets can be obtained. At present, most practical applications of machine learning use classic models designed for big data to solve problems on small data sets. However, deep neural network models have complex structures and huge numbers of parameters, and training them requires more advanced equipment, which brings certain difficulties to their application. Therefore, this paper proposes the concept of union convolution and designs a light deep network model, union-net, with a shallow network structure that adapts to small data sets. This model combines convolutional network units with different combinations of the same input to form a union module. Each union module is equivalent to a convolutional layer. The serial input and output between the 3 modules constitute a "3-layer" neural network. The outputs of the union modules are fused and added as the input of the last convolutional layer to form a complex network with a 4-layer network structure. This addresses the problem that, when a deep network model is too deep and the transmission path is too long, the underlying information is lost in transmission. Because the model has fewer parameters and fewer channels, it can better adapt to small data sets. This addresses the problem that deep network models are prone to overfitting when trained on small data sets. Multi-class classification experiments are conducted on the public data sets cifar10 and 17flowers. Experiments show that the Union-net model can perform well in classification of large data sets and small data sets. It has high practical value in daily application scenarios. The model code is published at this https URL

25. Private-Shared Disentangled Multimodal VAE for Learning of Hybrid Latent Representations [PDF]
Mihee Lee, Vladimir Pavlovic
Abstract: Multi-modal generative models represent an important family of deep models, whose goal is to facilitate representation learning on data with multiple views or modalities. However, current deep multi-modal models focus on the inference of shared representations, while neglecting the important private aspects of data within individual modalities. In this paper, we introduce a disentangled multi-modal variational autoencoder (DMVAE) that utilizes disentangled VAE strategy to separate the private and shared latent spaces of multiple modalities. We specifically consider the instance where the latent factor may be of both continuous and discrete nature, leading to the family of general hybrid DMVAE models. We demonstrate the utility of DMVAE on a semi-supervised learning task, where one of the modalities contains partial data labels, both relevant and irrelevant to the other modality. Our experiments on several benchmarks indicate the importance of the private-shared disentanglement as well as the hybrid latent representation.

26. Physics-based Shadow Image Decomposition for Shadow Removal [PDF]
Hieu Le, Dimitris Samaras
Abstract: We propose a novel deep learning method for shadow removal. Inspired by physical models of shadow formation, we use a linear illumination transformation to model the shadow effects in the image that allows the shadow image to be expressed as a combination of the shadow-free image, the shadow parameters, and a matte layer. We use two deep networks, namely SP-Net and M-Net, to predict the shadow parameters and the shadow matte respectively. This system allows us to remove the shadow effects from images. We then employ an inpainting network, I-Net, to further refine the results. We train and test our framework on the most challenging shadow removal dataset (ISTD). Our method improves the state-of-the-art in terms of root mean square error (RMSE) for the shadow area by 20%. Furthermore, this decomposition allows us to formulate a patch-based weakly-supervised shadow removal method. This model can be trained without any shadow-free images (that are cumbersome to acquire) and achieves competitive shadow removal results compared to state-of-the-art methods that are trained with fully paired shadow and shadow-free images. Last, we introduce SBU-Timelapse, a video shadow removal dataset for evaluating shadow removal methods.
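
One plausible reading of the decomposition above is sketched below for a single pixel value. The abstract only states that a linear illumination transformation and a matte layer are combined; the specific form used here (a relit value `w * shadow + b`, blended with the original by a matte `alpha` that is 1 for lit pixels and 0 for fully shadowed ones) and the function names are assumptions for illustration.

```python
def relight(shadow_value, w, b):
    """Linear illumination transform: scale and offset a shadowed value.

    w and b stand in for the 'shadow parameters' (assumed per channel).
    """
    return w * shadow_value + b


def remove_shadow(shadow_value, w, b, alpha):
    """Blend the original and relit values with a matte.

    alpha = 1.0 : pixel is outside the shadow, keep it unchanged
    alpha = 0.0 : pixel is fully shadowed, use the relit value
    Values in between model the soft shadow boundary.
    """
    relit = relight(shadow_value, w, b)
    return alpha * shadow_value + (1.0 - alpha) * relit


if __name__ == "__main__":
    for alpha in (1.0, 0.5, 0.0):
        print(alpha, remove_shadow(0.2, w=2.0, b=0.1, alpha=alpha))
```

Under this reading, one network only has to predict a few scalars (`w`, `b`) per image while another predicts the per-pixel matte, which matches the division of labor between SP-Net and M-Net described in the abstract.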

27. Low-latency Perception in Off-Road Dynamical Low Visibility Environments [PDF]
Nelson Alves, Marco Ruiz, Marco Reis, Tiago Cajahyba, Davi Oliveira, Ana Barreto, Eduardo F. Simas Filho, Wagner L. A. de Oliveira, Leizer Schnitman, Roberto L. S. Monteiro
Abstract: This work proposes a perception system for autonomous vehicles and advanced driver assistance specialized for unpaved roads and off-road environments. In this research, the authors have investigated the behavior of Deep Learning algorithms applied to semantic segmentation of off-road environments and unpaved roads under different adverse visibility conditions. Almost 12,000 images of different unpaved and off-road environments were collected and labeled. An off-road proving ground was assembled exclusively for its development. The proposed dataset also contains many adverse situations such as rain, dust, and low light. To develop the system, we have used convolutional neural networks trained to segment obstacles and areas where the car can pass through. We developed a Configurable Modular Segmentation Network (CMSNet) framework to help create different architecture arrangements and test them on the proposed dataset. Besides, we have also ported some CMSNet configurations by removing and fusing many layers using TensorRT, C++, and CUDA to achieve embedded real-time inference and allow field tests. The main contributions of this work are: a new dataset for unpaved roads and off-road environments containing many adverse conditions such as night, rain, and dust; a CMSNet framework; an investigation regarding the feasibility of applying deep learning to detect the region where the vehicle can pass through when there is no clear boundary of the track; a study of how our proposed segmentation algorithms behave in different severity levels of visibility impairment; and an evaluation of field tests carried out with semantic segmentation architectures ported for real-time inference.

28. Semantic Segmentation on Swiss3DCities: A Benchmark Study on Aerial Photogrammetric 3D Pointcloud Dataset [PDF]
Gülcan Can, Dario Mantegazza, Gabriele Abbate, Sébastien Chappuis, Alessandro Giusti
Abstract: We introduce a new outdoor urban 3D pointcloud dataset, covering a total area of 2.7 $km^2$, sampled from three Swiss cities with different characteristics. The dataset is manually annotated for semantic segmentation with per-point labels, and is built using photogrammetry from images acquired by multirotors equipped with high-resolution cameras. In contrast to datasets acquired with ground LiDAR sensors, the resulting point clouds are uniformly dense and complete, and are useful to disparate applications, including autonomous driving, gaming and smart city planning. As a benchmark, we report quantitative results of PointNet++, an established point-based deep 3D semantic segmentation model; on this model, we additionally study the impact of using different cities for model generalization.

29. SyNet: An Ensemble Network for Object Detection in UAV Images [PDF]
Berat Mert Albaba, Sedat Ozer
Abstract: Recent advances in camera-equipped drone applications and their widespread use have increased the demand for vision-based object detection algorithms for aerial images. Object detection is inherently a challenging task as a generic computer vision problem. However, since the use of object detection algorithms on UAVs (or on drones) is a relatively new area, detecting objects in aerial images remains an even more challenging problem. There are several reasons for that, including: (i) the lack of large drone datasets including large object variance, (ii) the large orientation and scale variance in drone images when compared to the ground images, and (iii) the difference in texture and shape features between the ground and the aerial images. Deep learning based object detection algorithms can be classified under two main categories: (a) single-stage detectors and (b) multi-stage detectors. Both single-stage and multi-stage solutions have their advantages and disadvantages over each other. However, a technique to combine the good sides of each of those solutions could yield an even stronger solution than each of those solutions individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of the single-stage detector proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. We report the state-of-the-art results obtained by our proposed solution on two different datasets, namely MS-COCO and VisDrone: 52.1% $mAP_{IoU=0.75}$ on the MS-COCO val2017 dataset and 26.2% $mAP_{IoU=0.75}$ on the VisDrone test set.
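
The $mAP_{IoU=0.75}$ figures above count a detection as correct only when its box overlaps a ground-truth box with an intersection-over-union of at least 0.75. A minimal sketch of that matching test follows; the `(x1, y1, x2, y2)` box layout and the function names are assumptions, and the full mAP computation (ranking by confidence, per-class averaging) is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred_box, gt_box, threshold=0.75):
    """True if the prediction counts as a hit at the given IoU threshold."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    print(iou((0, 0, 10, 8), gt), matches((0, 0, 10, 8), gt))
    print(iou((5, 0, 15, 10), gt), matches((5, 0, 15, 10), gt))
```

The 0.75 threshold is stricter than the more common 0.5, so it rewards detectors that localize boxes tightly, which is particularly demanding for the small objects typical of aerial images.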

30. Convolutional Neural Network for Elderly Wandering Prediction in Indoor Scenarios [PDF]
Rafael F. C. Oliveira, Fabio Barreto, Raphael Abreu
Abstract: This work proposes a way to detect the wandering activity of Alzheimer's patients from path data collected from non-intrusive indoor sensors around the house. Due to the lack of adequate data, we've manually generated a dataset of 220 paths using our own developed application. Wandering patterns in the literature are normally identified by visual features (such as loops or random movement), thus our dataset was transformed into images and augmented. Convolutional layers were used on the neural network model since they tend to have good results finding patterns, especially on images. The Convolutional Neural Network model was trained with the generated data and achieved an F1 score (relation between precision and recall) of 75%, recall of 60%, and precision of 100% on our 10-sample validation slice.
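
The three reported numbers are mutually consistent: the F1 score is the harmonic mean of precision and recall, and 2 × 1.00 × 0.60 / (1.00 + 0.60) = 0.75. The short check below reproduces that arithmetic; the function name `f1_score` is just an illustrative helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values reported in the abstract: precision 100%, recall 60%.
    print(f"F1 = {f1_score(1.00, 0.60):.2f}")  # F1 = 0.75
```

Perfect precision with 60% recall means the model raised no false alarms on the validation slice but missed 40% of the wandering paths.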

31. Spatio-temporal Multi-task Learning for Cardiac MRI Left Ventricle Quantification [PDF]
Sulaiman Vesal, Mingxuan Gu, Andreas Maier, Nishant Ravikumar
Abstract: Quantitative assessment of cardiac left ventricle (LV) morphology is essential to assess cardiac function and improve the diagnosis of different cardiovascular diseases. In current clinical practice, LV quantification depends on the measurement of myocardial shape indices, which is usually achieved by manual contouring of the endo- and epicardium. However, this process is subject to inter- and intra-observer variability, and it is a time-consuming and tedious task. In this paper, we propose a spatio-temporal multi-task learning approach to obtain a complete set of measurements quantifying cardiac LV morphology, regional-wall thickness (RWT), and additionally detecting the cardiac phase cycle (systole and diastole) for a given 3D Cine-magnetic resonance (MR) image sequence. We first segment cardiac LVs using an encoder-decoder network and then introduce a multitask framework to regress 11 LV indices and classify the cardiac phase, as parallel tasks during model optimization. The proposed deep learning model is based on 3D spatio-temporal convolutions, which extract spatial and temporal features from MR images. We demonstrate the efficacy of the proposed method using cine-MR sequences of 145 subjects and comparing the performance with other state-of-the-art quantification methods. The proposed method obtained high prediction accuracy, with an average mean absolute error (MAE) of 129 $mm^2$, 1.23 $mm$, 1.76 $mm$, Pearson correlation coefficient (PCC) of 96.4%, 87.2%, and 97.5% for LV and myocardium (Myo) cavity regions, 6 RWTs, 3 LV dimensions, and an error rate of 9.0% for phase classification. The experimental results highlight the robustness of the proposed method, despite varying degrees of cardiac morphology, image appearance, and low contrast in the cardiac MR sequences.
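
The two regression metrics quoted above, mean absolute error (MAE) and the Pearson correlation coefficient (PCC), can be computed as in the sketch below. The sample values are made up purely to exercise the functions and are not data from the paper.

```python
import math


def mean_absolute_error(pred, target):
    """Average absolute difference between predictions and references."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def pearson_correlation(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


if __name__ == "__main__":
    # Hypothetical wall-thickness values in mm, for illustration only.
    predicted = [9.8, 11.1, 7.9, 10.4]
    reference = [10.0, 11.5, 8.2, 10.1]
    print(mean_absolute_error(predicted, reference))
    print(pearson_correlation(predicted, reference))
```

MAE is reported in the units of the quantity (mm or mm^2), while PCC is unitless, which is why the abstract gives both: one measures the size of the error, the other how well predictions track the reference across subjects.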

32. Parallel-beam X-ray CT datasets of apples with internal defects and label balancing for machine learning [PDF]
Sophia Bethany Coban, Vladyslav Andriiashen, Poulami Somanya Ganguly, Maureen van Eijnatten, Kees Joost Batenburg
Abstract: We present three parallel-beam tomographic datasets of 94 apples with internal defects along with defect label files. The datasets are prepared for development and testing of data-driven, learning-based image reconstruction, segmentation and post-processing methods. The three versions are a noiseless simulation; simulation with added Gaussian noise, and with scattering noise. The datasets are based on real 3D X-ray CT data and their subsequent volume reconstructions. The ground truth images, based on the volume reconstructions, are also available through this project. Apples contain various defects, which naturally introduce a label bias. We tackle this by formulating the bias as an optimization problem. In addition, we demonstrate solving this problem with two methods: a simple heuristic algorithm and through mixed integer quadratic programming. This ensures the datasets can be split into test, training or validation subsets with the label bias eliminated. Therefore the datasets can be used for image reconstruction, segmentation, automatic defect detection, and testing the effects of (as well as applying new methodologies for removing) label bias in machine learning.

33. AudioViewer: Learning to Visualize Sound [PDF]
Yuchi Zhang, Willis Peng, Bastian Wandt, Helge Rhodin
Abstract: Sensory substitution can help persons with perceptual deficits. In this work, we attempt to visualize audio with video. Our long-term goal is to create sound perception for hearing impaired people, for instance, to facilitate feedback for training deaf speech. Different from existing models that translate between speech and text or text and images, we target an immediate and low-level translation that applies to generic environment sounds and human speech without delay. No canonical mapping is known for this artificial translation task. Our design is to translate from audio to video by compressing both into a common latent space with shared structure. Our core contribution is the development and evaluation of learned mappings that respect human perception limits and maximize user comfort by enforcing priors and combining strategies from unpaired image translation and disentanglement. We demonstrate qualitatively and quantitatively that our AudioViewer model maintains important audio features in the generated video and that generated videos of faces and numbers are well suited for visualizing high-dimensional audio features since they can easily be parsed by humans to match and distinguish between sounds, words, and speakers.

34. Joint super-resolution and synthesis of 1 mm isotropic MP-RAGE volumes from clinical MRI exams with scans of different orientation, resolution and contrast [PDF]
Juan Eugenio Iglesias, Benjamin Billot, Yael Balbastre, Azadeh Tabari, John Conklin, Daniel C. Alexander, Polina Golland, Brian L. Edlow, Bruce Fischl
Abstract: Most existing algorithms for automatic 3D morphometry of human brain MRI scans are designed for data with near-isotropic voxels at approximately 1 mm resolution, and frequently have contrast constraints as well - typically requiring T1 scans (e.g., MP-RAGE). This limitation prevents the analysis of millions of MRI scans acquired with large inter-slice spacing ("thick slice") in clinical settings every year. The inability to quantitatively analyze these scans hinders the adoption of quantitative neuroimaging in healthcare, and precludes research studies that could attain huge sample sizes and hence greatly improve our understanding of the human brain. Recent advances in CNNs are producing outstanding results in super-resolution and contrast synthesis of MRI. However, these approaches are very sensitive to the contrast, resolution and orientation of the input images, and thus do not generalize to diverse clinical acquisition protocols - even within sites. Here we present SynthSR, a method to train a CNN that receives one or more thick-slice scans with different contrast, resolution and orientation, and produces an isotropic scan of canonical contrast (typically a 1 mm MP-RAGE). The presented method does not require any preprocessing, e.g., skull stripping or bias field correction. Crucially, SynthSR trains on synthetic input images generated from 3D segmentations, and can thus be used to train CNNs for any combination of contrasts, resolutions and orientations without high-resolution training data. We test the images generated with SynthSR in an array of common downstream analyses, and show that they can be reliably used for subcortical segmentation and volumetry, image registration (e.g., for tensor-based morphometry), and, if some image quality requirements are met, even cortical thickness morphometry. The source code is publicly available at this http URL.

35. LEUGAN: Low-Light Image Enhancement by Unsupervised Generative Attentional Networks [PDF]
Yangyang Qu, Chao liu, Yongsheng Ou
Abstract: Restoring images from low-light data is a challenging problem. Most existing deep-network based algorithms are designed to be trained with pairwise images. Due to the lack of real-world datasets, they usually perform poorly when generalized in practice in terms of loss of image edge and color information. In this paper, we propose an unsupervised generation network with attention-guidance to handle the low-light image enhancement task. Specifically, our network contains two parts: an edge auxiliary module that restores sharper edges and an attention guidance module that recovers more realistic colors. Moreover, we propose a novel loss function to make the edges of the generated images more visible. Experiments validate that our proposed algorithm performs favorably against state-of-the-art methods, especially for real-world images in terms of image clarity and noise control.

36. Soft-IntroVAE: Analyzing and Improving the Introspective Variational Autoencoder
Tal Daniel, Aviv Tamar
Abstract: The recently introduced introspective variational autoencoder (IntroVAE) exhibits outstanding image generation, and allows for amortized inference using an image encoder. The main idea in IntroVAE is to train a VAE adversarially, using the VAE encoder to discriminate between generated and real data samples. However, the original IntroVAE loss function relied on a particular hinge-loss formulation that is very hard to stabilize in practice, and its theoretical convergence analysis ignored important terms in the loss. In this work, we take a step towards better understanding of the IntroVAE model, its practical implementation, and its applications. We propose the Soft-IntroVAE, a modified IntroVAE that replaces the hinge-loss terms with a smooth exponential loss on generated samples. This change significantly improves training stability, and also enables theoretical analysis of the complete algorithm. Interestingly, we show that the IntroVAE converges to a distribution that minimizes a sum of KL distance from the data distribution and an entropy term. We discuss the implications of this result, and demonstrate that it induces competitive image generation and reconstruction. Finally, we describe two applications of Soft-IntroVAE to unsupervised image translation and out-of-distribution detection, and demonstrate compelling results. Code and additional information are available on the project website - this https URL

37. Detecting Hateful Memes Using a Multimodal Deep Ensemble
Vlad Sandulescu
Abstract: While significant progress has been made using machine learning algorithms to detect hate speech, important technical challenges still remain to be solved in order to bring their performance closer to human accuracy. We investigate several of the most recent visual-linguistic Transformer architectures and propose improvements to increase their performance for this task. The proposed model outperforms the baselines by a large margin and ranks 5th on the leaderboard out of 3,100+ participants.

38. Effective Deployment of CNNs for 3DoF Pose Estimation and Grasping in Industrial Settings
Daniele De Gregorio, Riccardo Zanella, Gianluca Palli, Luigi Di Stefano
Abstract: In this paper we investigate how to effectively deploy deep learning in practical industrial settings, such as robotic grasping applications. When a deep-learning based solution is proposed, it usually lacks any simple method to generate the training data. In the industrial field, where automation is the main goal, not bridging this gap is one of the main reasons why deep learning is not as widespread as it is in the academic world. For this reason, in this work we developed a system composed of a 3-DoF Pose Estimator based on Convolutional Neural Networks (CNNs) and an effective procedure to gather massive amounts of training images in the field with minimal human intervention. By automating the labeling stage, we also obtain very robust systems suitable for production-level usage. An open source implementation of our solution is provided, along with the dataset used for the experimental evaluation.

39. UMLE: Unsupervised Multi-discriminator Network for Low Light Enhancement
Yangyang Qu, Kai Chen, Yongsheng Ou
Abstract: Low-light image enhancement, such as recovering color and texture details from low-light images, is a complex and vital task. For automated driving, low-light scenarios will have serious implications for vision-based applications. To address this problem, we propose a real-time unsupervised generative adversarial network (GAN) containing multiple discriminators, i.e. a multi-scale discriminator, a texture discriminator, and a color discriminator. These distinct discriminators allow the evaluation of images from different perspectives. Further, considering that different channel features contain different information and the illumination is uneven in the image, we propose a feature fusion attention module. This module combines channel attention with pixel attention mechanisms to extract image features. Additionally, to reduce training time, we adopt a shared encoder for the generator and the discriminator. This makes the structure of the model more compact and the training more stable. Experiments indicate that our method is superior to the state-of-the-art methods in qualitative and quantitative evaluations, and significant improvements are achieved for both autopilot positioning and detection results.

40. Global Convergence of Model Function Based Bregman Proximal Minimization Algorithms
Mahesh Chandra Mukkamala, Jalal Fadili, Peter Ochs
Abstract: Lipschitz continuity of the gradient mapping of a continuously differentiable function plays a crucial role in designing various optimization algorithms. However, many functions arising in practical applications such as low rank matrix factorization or deep neural network problems do not have a Lipschitz continuous gradient. This led to the development of a generalized notion known as the $L$-smad property, which is based on generalized proximity measures called Bregman distances. However, the $L$-smad property cannot handle nonsmooth functions: even simple nonsmooth functions such as $|x^4-1|$, as well as many practical composite problems, are out of scope. We fix this issue by proposing the MAP property, which generalizes the $L$-smad property and is also valid for a large class of nonconvex nonsmooth composite problems. Based on the proposed MAP property, we propose a globally convergent algorithm called Model BPG that unifies several existing algorithms. The convergence analysis is based on a new Lyapunov function. We also numerically illustrate the superior performance of Model BPG on standard phase retrieval problems, robust phase retrieval problems, and Poisson linear inverse problems, when compared to a state-of-the-art optimization method that is valid for generic nonconvex nonsmooth optimization problems.
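For readers unfamiliar with the terms in this abstract, the standard background definitions can be written as below. They follow the usual convention in the Bregman proximal gradient literature and are not quoted from the paper. The symbols $h$ (kernel), $f$ (objective), $C$ (convex domain) and $L$ (constant) are the customary ones and are assumptions here. The paper's own MAP property is not reproduced.

```latex
% Bregman distance generated by a differentiable convex kernel h
D_h(x, y) = h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle .

% The pair (f, h) is L-smad on a convex set C if both functions
L h - f \quad \text{and} \quad L h + f \quad \text{are convex on } C ,

% which is equivalent to the two-sided bound
\bigl| f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle \bigr| \le L \, D_h(x, y) .

% Sanity check: with h(x) = \tfrac{1}{2}\|x\|^2 one gets
D_h(x, y) = \tfrac{1}{2}\|x - y\|^2 ,
% so the bound reduces to the classical descent lemma
% for functions with an L-Lipschitz continuous gradient.
```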

41. SubICap: Towards Subword-informed Image Captioning
Naeha Sharif, Mohammed Bennamoun, Wei Liu, Syed Afaq Ali Shah
Abstract: Existing Image Captioning (IC) systems model words as atomic units in captions and are unable to exploit the structural information in the words. This makes representation of rare words very difficult and out-of-vocabulary words impossible. Moreover, to avoid computational complexity, existing IC models operate over a modest sized vocabulary of frequent words, such that the identity of rare words is lost. In this work we address this common limitation of IC systems in dealing with rare words in the corpora. We decompose words into smaller constituent units, 'subwords', and represent captions as a sequence of subwords instead of words. This helps represent all words in the corpora using a significantly smaller subword vocabulary, leading to better parameter learning. Using subword language modeling, our captioning system improves various metric scores, with a training vocabulary size approximately 90% less than the baseline and various state-of-the-art word-level models. Our quantitative and qualitative results and analysis demonstrate the efficacy of our proposed approach.
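The idea of representing a caption as a sequence of subwords can be illustrated with a minimal sketch. The abstract does not say which segmentation algorithm the authors use, so the greedy longest-match rule below is an assumption chosen for brevity. The toy vocabulary `VOCAB` and the function names are hypothetical.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`.

    Assumption: any single character is allowed as a fallback piece,
    so every word (even an unseen one) can be represented.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining candidate first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces


def caption_to_subwords(caption, vocab):
    """Lower-case a caption, split on whitespace, and segment each word."""
    out = []
    for word in caption.lower().split():
        out.extend(segment(word, vocab))
    return out


# Hypothetical toy vocabulary; a real one would be learned from the corpus.
VOCAB = {"snow", "board", "ing", "er", "a", "man", "is"}

print(caption_to_subwords("A man is snowboarding", VOCAB))
# -> ['a', 'man', 'is', 'snow', 'board', 'ing']
```

Note that "snowboarding" and "snowboarder" are both covered by four vocabulary entries here, which is the kind of vocabulary reduction the abstract describes. The sketch omits word-boundary markers, which a practical system would need in order to reassemble words from generated subwords.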

42. Improving the Certified Robustness of Neural Networks via Consistency Regularization
Mengting Xu, Tao Zhang, Zhongnian Li, Wei Shao, Daoqiang Zhang
Abstract: A range of defense methods have been proposed to improve the robustness of neural networks on adversarial examples, among which provable defense methods have been demonstrated to be effective in training neural networks that are certifiably robust to the attacker. However, most of these provable defense methods treat all examples equally during the training process, which ignores the inconsistent constraint of certified robustness between correctly classified (natural) and misclassified examples. In this paper, we explore this inconsistency caused by misclassified examples and add a novel consistency regularization term to make better use of the misclassified examples. Specifically, we identified that the certified robustness of a network can be significantly improved if the constraint of certified robustness on misclassified examples and correctly classified examples is consistent. Motivated by this discovery, we design a new defense regularization term called Misclassification Aware Adversarial Regularization (MAAR), which constrains the output probability distributions of all examples in the certified region of the misclassified example. Experimental results show that our proposed MAAR achieves the best certified robustness and comparable accuracy on CIFAR-10 and MNIST datasets in comparison with several state-of-the-art methods.

43. White matter hyperintensities volume and cognition: Assessment of a deep learning based lesion detection and quantification algorithm on the Alzheimer's Disease Neuroimaging Initiative
Lavanya Umapathy, Gloria Guzman Perez-Carillo, Blair Winegar, Srinivasan Vedantham, Maria Altbach, Ali Bilgin
Abstract: The relationship between cognition and white matter hyperintensities (WMH) volumes often depends on the accuracy of the lesion segmentation algorithm used. As such, accurate detection and quantification of WMH is of great interest. Here, we use a deep learning-based WMH segmentation algorithm, StackGen-Net, to detect and quantify WMH on 3D FLAIR volumes from ADNI. We used a subset of subjects (n=20) and obtained manual WMH segmentations by an experienced neuro-radiologist to demonstrate the accuracy of our algorithm. On a larger cohort of subjects (n=290), we observed that larger WMH volumes correlated with worse performance on executive function (P=.004), memory (P=.01), and language (P=.005).

44. Learning from Crowds by Modeling Common Confusions
Zhendong Chu, Jing Ma, Hongning Wang
Abstract: Crowdsourcing provides a practical way to obtain large amounts of labeled data at a low cost. However, the annotation quality of annotators varies considerably, which imposes new challenges in learning a high-quality model from the crowdsourced annotations. In this work, we provide a new perspective to decompose annotation noise into common noise and individual noise and differentiate the source of confusion based on instance difficulty and annotator expertise on a per-instance-annotator basis. We realize this new crowdsourcing model by an end-to-end learning solution with two types of noise adaptation layers: one is shared across annotators to capture their commonly shared confusions, and the other pertains to each annotator to capture individual confusion. To recognize the source of noise in each annotation, we use an auxiliary network to choose between the two noise adaptation layers with respect to both instances and annotators. Extensive experiments on both synthesized and real-world benchmarks demonstrate the effectiveness of our proposed common noise adaptation solution.

45. An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement
Dario Fuoli, Zhiwu Huang, Danda Pani Paudel, Luc Van Gool, Radu Timofte
Abstract: Video enhancement is a more challenging problem than enhancement of stills, mainly due to high computational cost, larger data volumes and the difficulty of achieving consistency in the spatio-temporal domain. In practice, these challenges are often coupled with the lack of example pairs, which inhibits the application of supervised learning strategies. To address these challenges, we propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples. In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information. The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high complexity networks. Our setting enables learning from unpaired videos in a cyclic adversarial manner, where the proposed recurrent units are employed in all architectures. Efficient training is accomplished by introducing a single discriminator that learns the joint distribution of source and target domain simultaneously. The enhancement results demonstrate clear superiority of the proposed video enhancer over the state-of-the-art methods in terms of visual quality, quantitative metrics, and inference speed. Notably, our video enhancer is capable of enhancing over 35 frames per second of FullHD video (1080x1920).

46. General Domain Adaptation Through Proportional Progressive Pseudo Labeling
Mohammad J. Hashemi, Eric Keller
Abstract: Domain adaptation helps transfer the knowledge gained from a labeled source domain to an unlabeled target domain. During the past few years, different domain adaptation techniques have been published. One common flaw of these approaches is that while they might work well on one input type, such as images, their performance drops when applied to others, such as text or time-series. In this paper, we introduce Proportional Progressive Pseudo Labeling (PPPL), a simple yet effective technique that can be implemented in a few lines of code to build a more general domain adaptation technique that can be applied to several different input types. At the beginning of the training phase, PPPL progressively reduces target domain classification error by training the model directly with pseudo-labeled target domain samples, while excluding samples whose pseudo-labels are more likely to be wrong from the training set and postponing training on such samples. Experiments on 6 different datasets that include tasks such as anomaly detection, text sentiment analysis and image classification demonstrate that PPPL can beat other baselines and generalize better.
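The selection step described in this abstract (train on pseudo-labeled target samples, but postpone those whose pseudo-labels are more likely wrong) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: using the top predicted probability as the confidence measure, the linear growth of the kept fraction, and the names `select_pseudo_labeled` and `keep_schedule` are all hypothetical choices.

```python
def select_pseudo_labeled(probs, keep_fraction):
    """Pick the most confident pseudo-labeled target samples.

    probs: list of per-sample class-probability lists predicted on
           unlabeled target data.
    keep_fraction: proportion of samples to keep in this round.
    Returns (sample_index, pseudo_label) pairs, most confident first.
    Assumption: confidence is the highest predicted class probability.
    """
    scored = []
    for idx, p in enumerate(probs):
        label = max(range(len(p)), key=lambda c: p[c])
        scored.append((p[label], idx, label))
    # Sort by descending confidence; break ties by sample index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [(idx, label) for _, idx, label in scored[:n_keep]]


def keep_schedule(round_idx, n_rounds, start=0.25):
    """Assumed schedule: grow the kept fraction linearly from `start` to 1.0."""
    if n_rounds <= 1:
        return 1.0
    return start + (1.0 - start) * round_idx / (n_rounds - 1)


# Hypothetical predictions for four target samples and two classes.
probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.5, 0.5]]
print(select_pseudo_labeled(probs, 0.5))
# -> [(0, 0), (2, 1)]
```

In a full training loop, the model would be retrained on the selected pairs each round and `probs` recomputed, so samples excluded early can re-enter later once the kept fraction grows.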

47. Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge
Riza Velioglu, Jewgeni Rose
Abstract: Memes on the Internet are often harmless and sometimes amusing. However, by using certain types of images, text, or combinations of both, the seemingly harmless meme becomes a multimodal type of hate speech -- a hateful meme. The Hateful Memes Challenge is a first-of-its-kind competition which focuses on detecting hate speech in multimodal memes and it proposes a new data set containing 10,000+ new examples of multimodal content. We utilize VisualBERT, which is meant to be the BERT of vision and language and was trained multimodally on images and captions, and apply Ensemble Learning. Our approach achieves 0.811 AUROC with an accuracy of 0.765 on the challenge test set and placed third out of 3,173 participants in the Hateful Memes Challenge.
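To make the reported metric concrete, the sketch below averages the scores of several classifiers and computes AUROC as the probability that a positive example is scored above a negative one. The abstract does not state how the ensemble members are combined, so simple score averaging is an assumption. The scores and labels are made-up toy values, not results from the paper.

```python
def ensemble_scores(per_model_scores):
    """Average per-example scores across models (assumed combination rule)."""
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of
    examples, which is fine for an illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data: 1 = hateful, 0 = not hateful (hypothetical values).
labels = [1, 0, 1, 0]
model_a = [0.9, 0.4, 0.3, 0.2]
model_b = [0.8, 0.2, 0.7, 0.1]

print(auroc(labels, model_a))                               # -> 0.75
print(auroc(labels, ensemble_scores([model_a, model_b])))   # -> 1.0
```

In this toy case the second model corrects the one pair that the first model ranks wrongly, which is the usual motivation for ensembling.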

48. Learning by Self-Explanation, with Application to Neural Architecture Search
Ramtin Hosseini, Pengtao Xie
Abstract: Learning by self-explanation, where students explain a learned topic to themselves to deepen their understanding of it, is a broadly used methodology in human learning and shows great effectiveness in improving learning outcomes. We are interested in investigating whether this powerful learning technique can be borrowed from humans to improve the learning abilities of machines. We propose a novel learning approach called learning by self-explanation (LeaSE). In our approach, an explainer model improves its learning ability by trying to clearly explain to an audience model how a prediction outcome is made. We propose a multi-level optimization framework to formulate LeaSE which involves four stages of learning: explainer learns; explainer explains; audience learns; explainer and audience validate themselves. We develop an efficient algorithm to solve the LeaSE problem. We apply our approach to neural architecture search on CIFAR-100, CIFAR-10, and ImageNet. The results demonstrate the effectiveness of our method.

Note: the Chinese text is machine-translated. The cover image is a word cloud of the paper titles.


[arXiv Papers] Computer Vision and Pattern Recognition 2020-12-25

Contents

Abstracts