Table of Contents
1. Find it if You Can: End-to-End Adversarial Erasing for Weakly-Supervised Semantic Segmentation [PDF] Abstract
3. Masked Face Image Classification with Sparse Representation based on Majority Voting Mechanism [PDF] Abstract
9. TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss [PDF] Abstract
11. SeasonDepth: Cross-Season Monocular Depth Prediction Dataset and Benchmark under Multiple Environments [PDF] Abstract
13. MAGNeto: An Efficient Deep Learning Method for the Extractive Tags Summarization Problem [PDF] Abstract
14. EfficientPose -- An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach [PDF] Abstract
19. Robust Visual Tracking via Statistical Positive Sample Generation and Gradient Aware Learning [PDF] Abstract
30. Image Clustering using an Augmented Generative Adversarial Network and Information Maximization [PDF] Abstract
33. Analysis of Dimensional Influence of Convolutional Neural Networks for Histopathological Cancer Classification [PDF] Abstract
34. Performance Analysis of Optimizers for Plant Disease Classification with Convolutional Neural Networks [PDF] Abstract
37. The quantization error in a Self-Organizing Map as a contrast and colour specific indicator of single-pixel change in large random patterns [PDF] Abstract
43. Channel Pruning Guided by Spatial and Channel Attention for DNNs in Intelligent Edge Computing [PDF] Abstract
45. Deep traffic light detection by overlaying synthetic context on arbitrary natural images [PDF] Abstract
46. On the spatial attention in Spatio-Temporal Graph Convolutional Networks for skeleton-based human action recognition [PDF] Abstract
47. Towards Resolving the Challenge of Long-tail Distribution in UAV Images for Object Detection [PDF] Abstract
53. A Multi-stream Convolutional Neural Network for Micro-expression Recognition Using Optical Flow and EVM [PDF] Abstract
55. Coarse- and Fine-grained Attention Network with Background-aware Loss for Crowd Density Map Estimation [PDF] Abstract
58. TB-Net: A Three-Stream Boundary-Aware Network for Fine-Grained Pavement Disease Segmentation [PDF] Abstract
59. Depthwise Multiception Convolution for Reducing Network Parameters without Sacrificing Accuracy [PDF] Abstract
62. ROBIN: a Graph-Theoretic Approach to Reject Outliers in Robust Estimation using Invariants [PDF] Abstract
63. Augmented Equivariant Attention Networks for Electron Microscopy Image Super-Resolution [PDF] Abstract
64. Efficient Robust Watermarking Based on Quaternion Singular Value Decomposition and Coefficient Pair Selection [PDF] Abstract
66. Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze [PDF] Abstract
68. Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts [PDF] Abstract
72. An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments [PDF] Abstract
77. Kimera-Multi: a System for Distributed Multi-Robot Metric-Semantic Simultaneous Localization and Mapping [PDF] Abstract
79. Real-time Surgical Environment Enhancement for Robot-Assisted Minimally Invasive Surgery Based on Super-Resolution [PDF] Abstract
80. Learning-based 3D Occupancy Prediction for Autonomous Navigation in Occluded Environments [PDF] Abstract
83. Autonomous Intruder Detection Using a ROS-Based Multi-Robot System Equipped with 2D-LiDAR Sensors [PDF] Abstract
85. Grading the Severity of Arteriolosclerosis from Retinal Arterio-venous Crossing Patterns [PDF] Abstract
88. Data-driven Image Restoration with Option-driven Learning for Big and Small Astronomical Image Datasets [PDF] Abstract
92. Unmasking Communication Partners: A Low-Cost AI Solution for Digitally Removing Head-Mounted Displays in VR-Based Telepresence [PDF] Abstract
94. Chest X-ray Image Phase Features for Improved Diagnosis of COVID-19 Using Convolutional Neural Network [PDF] Abstract
Abstracts
1. Find it if You Can: End-to-End Adversarial Erasing for Weakly-Supervised Semantic Segmentation [PDF] Back to Contents
Erik Stammes, Tom F.H. Runia, Michael Hofmann, Mohsen Ghafoorian
Abstract: Semantic segmentation is a task that traditionally requires a large dataset of pixel-level ground truth labels, which is time-consuming and expensive to obtain. Recent advancements in the weakly-supervised setting show that reasonable performance can be obtained by using only image-level labels. Classification is often used as a proxy task to train a deep neural network from which attention maps are extracted. However, the classification task needs only the minimum evidence to make predictions, hence it focuses on the most discriminative object regions. To overcome this problem, we propose a novel formulation of adversarial erasing of the attention maps. In contrast to previous adversarial erasing methods, we optimize two networks with opposing loss functions, which eliminates the requirement of certain suboptimal strategies; for instance, having multiple training steps that complicate the training process or a weight sharing policy between networks operating on different distributions that might be suboptimal for performance. The proposed solution does not require saliency masks, instead it uses a regularization loss to prevent the attention maps from spreading to less discriminative object regions. Our experiments on the Pascal VOC dataset demonstrate that our adversarial approach increases segmentation performance by 2.1 mIoU compared to our baseline and by 1.0 mIoU compared to previous adversarial erasing approaches.
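As a rough illustration of the two-network setup described above, the sketch below pits a localizer against a classifier with opposing objectives; the function names, the multiplicative erasing step, and the mean-attention regularizer are our assumptions, not the paper's exact formulation.

```python
import torch

def adversarial_erasing_losses(localizer, classifier, image, label,
                               criterion, reg_weight=0.1):
    # Localizer predicts an attention map in [0, 1]; the most
    # discriminative regions are erased from the input.
    attn = localizer(image)                 # (B, 1, H, W)
    erased = image * (1.0 - attn)
    # The classifier tries to still recognize the erased image...
    cls_loss = criterion(classifier(erased), label)
    # ...while the localizer is trained with the opposing objective,
    # plus a regularizer that keeps the map from spreading everywhere.
    loc_loss = -cls_loss + reg_weight * attn.mean()
    # In practice each loss would update its own network via a
    # separate optimizer.
    return cls_loss, loc_loss
```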
2. Fast Fourier Intrinsic Network [PDF] Back to Contents
Yanlin Qian, Miaojing Shi, Joni-Kristian Kämäräinen, Jiri Matas
Abstract: We address the problem of decomposing an image into albedo and shading. We propose the Fast Fourier Intrinsic Network, FFI-Net in short, that operates in the spectral domain, splitting the input into several spectral bands. Weights in FFI-Net are optimized in the spectral domain, allowing faster convergence to a lower error. FFI-Net is lightweight and does not need auxiliary networks for training. The network is trained end-to-end with a novel spectral loss which measures the global distance between the network prediction and corresponding ground truth. FFI-Net achieves state-of-the-art performance on MPI-Sintel, MIT Intrinsic, and IIW datasets.
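The abstract does not spell out the spectral loss, but a minimal sketch of a global Fourier-domain loss, assuming an L1 distance on complex spectra, could look like this:

```python
import torch

def spectral_loss(pred, target):
    # 2D Fourier transforms of prediction and ground truth; because
    # every Fourier coefficient depends on all pixels, the distance
    # is global rather than pixel-local. The paper's exact
    # formulation may differ.
    pred_f = torch.fft.fft2(pred)       # (B, C, H, W) complex spectrum
    target_f = torch.fft.fft2(target)
    return (pred_f - target_f).abs().mean()
```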
3. Masked Face Image Classification with Sparse Representation based on Majority Voting Mechanism [PDF] Back to Contents
Han Wang
Abstract: Sparse approximation is the problem of finding the sparsest linear combination of atoms from a redundant dictionary that represents a given signal, and it is widely applied in signal processing and compressed sensing. In this project, I implement the Orthogonal Matching Pursuit (OMP) algorithm and the Sparse Representation-based Classification (SRC) algorithm, then use them to perform masked face image classification with majority voting. The experiments were conducted on the AR dataset, and the results show the strength of the OMP algorithm combined with the SRC algorithm on masked face image classification, achieving an accuracy of 98.4%.
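Orthogonal Matching Pursuit is a standard greedy algorithm, so a textbook reference implementation can be given; variable names are illustrative and this is not necessarily the author's exact code.

```python
import numpy as np

def omp(D, y, sparsity):
    # D: (n, k) dictionary with unit-norm columns; y: (n,) signal.
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(sparsity):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        support.append(idx)
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x
```

Per the standard SRC recipe, a test face would then be assigned to the class whose training atoms yield the smallest reconstruction residual, with majority voting aggregating decisions across samples.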
4. MinkLoc3D: Point Cloud Based Large-Scale Place Recognition [PDF] Back to Contents
Jacek Komorowski
Abstract: The paper presents a learning-based method for computing a discriminative 3D point cloud descriptor for place recognition purposes. Existing methods, such as PointNetVLAD, are based on unordered point cloud representation. They use PointNet as the first processing step to extract local features, which are later aggregated into a global descriptor. The PointNet architecture is not well suited to capture local geometric structures. Thus, state-of-the-art methods enhance the vanilla PointNet architecture by adding different mechanisms to capture local contextual information, such as graph convolutional networks, or by using hand-crafted features. We present an alternative approach, dubbed MinkLoc3D, to compute a discriminative 3D point cloud descriptor, based on a sparse voxelized point cloud representation and sparse 3D convolutions. The proposed method has a simple and efficient architecture. Evaluation on standard benchmarks proves that MinkLoc3D outperforms the current state of the art. Our code is publicly available on the project website: this https URL
5. Fast Hybrid Cascade for Voxel-based 3D Object Classification [PDF] Back to Contents
Hui Cao, Jie Wang, Yuqi Liu, Siyu Zhang, Shen Cai
Abstract: Voxel-based 3D object classification has been frequently studied in recent years. Previous methods often directly convert the classic 2D convolution into a 3D form applied to an object with binary voxel representation. In this paper, we investigate the reason why binary voxel representation is not very suitable for 3D convolution and how to simultaneously improve performance in both accuracy and speed. We show that by giving each voxel a signed distance value, accuracy gains about 30% over binary voxel representation using a two-layer fully connected network. We then propose a fast fully connected and convolution hybrid cascade network for voxel-based 3D object classification. This three-stage cascade network can divide 3D models into three categories: easy, moderate and hard. Consequently, the mean inference time (0.3ms) achieves a speedup of about 5x and 2x compared with the state-of-the-art point cloud and voxel based methods respectively, while achieving the highest accuracy in the latter category of methods (92%). Experiments with ModelNet and MNIST verify the performance of the proposed hybrid cascade network.
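The signed-distance voxel representation credited with the roughly 30% accuracy gain can be produced from a binary occupancy grid with standard tools; the sign convention and the absence of normalization below are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def binary_to_signed_distance(occ):
    occ = occ.astype(bool)
    # Distance from each empty voxel to the nearest occupied voxel...
    outside = distance_transform_edt(~occ)
    # ...and from each occupied voxel to the nearest empty voxel.
    inside = distance_transform_edt(occ)
    # Positive outside the object, negative inside (sign convention
    # is an assumption; the paper may use the opposite or normalize).
    return outside - inside
```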
6. DynaVSR: Dynamic Adaptive Blind Video Super-Resolution [PDF] Back to Contents
Suyoung Lee, Myungsub Choi, Kyoung Mu Lee
Abstract: Most conventional supervised super-resolution (SR) algorithms assume that low-resolution (LR) data is obtained by downscaling high-resolution (HR) data with a fixed known kernel, but such an assumption often does not hold in real scenarios. Some recent blind SR algorithms have been proposed to estimate different downscaling kernels for each input LR image. However, they suffer from heavy computational overhead, making them infeasible for direct application to videos. In this work, we present DynaVSR, a novel meta-learning-based framework for real-world video SR that enables efficient downscaling model estimation and adaptation to the current input. Specifically, we train a multi-frame downscaling module with various types of synthetic blur kernels, which is seamlessly combined with a video SR network for input-aware adaptation. Experimental results show that DynaVSR consistently improves the performance of the state-of-the-art video SR models by a large margin, with an order of magnitude faster inference time compared to the existing blind SR approaches.
7. Deep Transfer Learning for Automated Diagnosis of Skin Lesions from Photographs [PDF] Back to Contents
Emma Rocheteau, Doyoon Kim
Abstract: Melanoma is the most common form of skin cancer worldwide. Currently, the disease is diagnosed by expert dermatologists, which is costly and requires timely access to medical treatment. Recent advances in deep learning have the potential to improve diagnostic performance, expedite urgent referrals and reduce burden on clinicians. Through smart phones, the technology could reach people who would not normally have access to such healthcare services, e.g. in remote parts of the world due to financial constraints, or, in 2020, due to COVID-19 cancellations. To this end, we have investigated various transfer learning approaches by leveraging model parameters pre-trained on ImageNet with finetuning on melanoma detection. We compare EfficientNet, MnasNet, MobileNet, DenseNet, SqueezeNet, ShuffleNet, GoogleNet, ResNet, ResNeXt, VGG and a simple CNN with and without transfer learning. We find that the mobile network, EfficientNet (with transfer learning), achieves the best mean performance with an area under the receiver operating characteristic curve (AUROC) of 0.931$\pm$0.005 and an area under the precision recall curve (AUPRC) of 0.840$\pm$0.010. This is significantly better than general practitioners (0.83$\pm$0.03 AUROC) and dermatologists (0.91$\pm$0.02 AUROC).
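A minimal version of the transfer-learning setup described, an ImageNet-pretrained EfficientNet with a replaced classification head fine-tuned on melanoma detection, might look like the following torchvision sketch; the two-class head and the learning rate are assumptions, since the abstract gives no training details.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained EfficientNet-B0 (torchvision >= 0.13 weights API).
weights = models.EfficientNet_B0_Weights.IMAGENET1K_V1
model = models.efficientnet_b0(weights=weights)

# Swap the 1000-way ImageNet head for a 2-way melanoma/benign head.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

# Fine-tune the whole network; the learning rate is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```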
8. Neural Architecture Search with an Efficient Multiobjective Evolutionary Framework [PDF] Back to Contents
Maria Baldeon Calisto, Susana Lai-Yuen
Abstract: Deep learning methods have become very successful at solving many complex tasks such as image classification and segmentation, speech recognition and machine translation. Nevertheless, manually designing a neural network for a specific problem is very difficult and time-consuming due to the massive hyperparameter search space, long training times, and lack of technical guidelines for the hyperparameter selection. Moreover, most networks are highly complex, task specific and over-parametrized. Recently, multiobjective neural architecture search (NAS) methods have been proposed to automate the design of accurate and efficient architectures. However, they only optimize either the macro- or micro-structure of the architecture requiring the unset hyperparameters to be manually defined, and do not use the information produced during the optimization process to increase the efficiency of the search. In this work, we propose EMONAS, an Efficient MultiObjective Neural Architecture Search framework for the automatic design of neural architectures while optimizing the network's accuracy and size. EMONAS is composed of a search space that considers both the macro- and micro-structure of the architecture, and a surrogate-assisted multiobjective evolutionary based algorithm that efficiently searches for the best hyperparameters using a Random Forest surrogate and guiding selection probabilities. EMONAS is evaluated on the task of 3D cardiac segmentation from the MICCAI ACDC challenge, which is crucial for disease diagnosis, risk evaluation, and therapy decision. The architecture found with EMONAS is ranked within the top 10 submissions of the challenge in all evaluation metrics, performing better or comparable to other approaches while reducing the search time by more than 50% and having considerably fewer number of parameters.
9. TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss [PDF] Back to Contents
Hyojin Park, Ganesh Venkatesh, Nojun Kwak
Abstract: Semi-supervised video object segmentation (semi-VOS) is widely used in many applications. The task is to track class-agnostic objects given a segmentation mask. For this purpose, various approaches have been developed based on optical flow, online learning, and memory networks. These methods show high accuracy but are hard to utilize in real-world applications due to slow inference time and tremendous complexity. To resolve this problem, template matching methods have been devised for fast processing speed, at the cost of considerable performance. We introduce a novel semi-VOS model based on a template matching method and a novel temporal consistency loss to reduce the performance gap from heavy models while greatly reducing inference time. Our template matching method consists of short-term and long-term matching. The short-term matching enhances target object localization, while long-term matching improves fine details and handles object shape changes through the newly proposed adaptive template attention module. However, the long-term matching causes error propagation due to the inflow of past estimated results when updating the template. To mitigate this problem, we also propose a temporal consistency loss for better temporal coherence between neighboring frames by adopting the concept of a transition matrix. Our model obtains a 79.5% J&F score at a speed of 73.8 FPS on the DAVIS16 benchmark.
10. FACEGAN: Facial Attribute Controllable rEenactment GAN [PDF] Back to Contents
Soumya Tripathy, Juho Kannala, Esa Rahtu
Abstract: Face reenactment is a popular facial animation method in which the person's identity is taken from the source image and the facial motion from the driving image. Recent works have demonstrated high-quality results by combining facial-landmark-based motion representations with generative adversarial networks. These models perform best if the source and driving images depict the same person or if the facial structures are otherwise very similar. However, if the identity differs, the driving facial structures leak into the output, distorting the reenactment result. We propose a novel Facial Attribute Controllable rEenactment GAN (FACEGAN), which transfers the facial motion from the driving face via the Action Unit (AU) representation. Unlike facial landmarks, the AUs are independent of the facial structure, preventing the identity leak. Moreover, AUs provide a human-interpretable way to control the reenactment. FACEGAN processes background and face regions separately for optimized output quality. The extensive quantitative and qualitative comparisons show a clear improvement over the state-of-the-art in a single source reenactment task. The results are best illustrated in the reenactment video provided in the supplementary material. The source code will be made available upon publication of the paper.
11. SeasonDepth: Cross-Season Monocular Depth Prediction Dataset and Benchmark under Multiple Environments [PDF] Back to Contents
Hanjiang Hu, Baoquan Yang, Weiang Shi, Zhijian Qiao, Hesheng Wang
Abstract: Monocular depth prediction has been well studied recently, but there are few works focused on depth prediction across multiple environments, e.g. changing illumination and seasons, owing to the lack of such real-world datasets and benchmarks. In this work, we derive a new cross-season scaleless monocular depth prediction dataset, SeasonDepth, from the CMU Visual Localization dataset through structure from motion. We then formulate several metrics to benchmark performance under different environments using recent state-of-the-art open-source pretrained depth prediction models from the KITTI benchmark. Through extensive zero-shot experimental evaluation on the proposed dataset, we show that long-term monocular depth prediction is far from solved and suggest promising directions for future work, including geometry-based or scale-invariant training. Moreover, multi-environment synthetic datasets and cross-dataset validation are beneficial to robustness against real-world environmental variance.
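For a scaleless depth dataset, scale-invariant evaluation is essential; one common choice is the scale-invariant log RMSE of Eigen et al. (2014), sketched below. Whether SeasonDepth uses exactly this metric is not stated in the abstract.

```python
import numpy as np

def scale_invariant_log_rmse(pred, gt, eps=1e-8):
    # Evaluate only where ground-truth depth is valid.
    mask = gt > 0
    d = np.log(pred[mask] + eps) - np.log(gt[mask] + eps)
    # Invariant to a global scale on `pred`: scaling pred by s only
    # shifts d by log(s), which the second term cancels out.
    return np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)
```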
12. EDEN: Multimodal Synthetic Dataset of Enclosed GarDEN Scenes [PDF] Back to Contents
Hoang-An Le, Thomas Mensink, Partha Das, Sezer Karaoglu, Theo Gevers
Abstract: Multimodal large-scale datasets for outdoor scenes are mostly designed for urban driving problems. The scenes are highly structured and semantically different from scenarios seen in nature-centered scenes such as gardens or parks. To promote machine learning methods for nature-oriented applications, such as agriculture and gardening, we propose the multimodal synthetic dataset for Enclosed garDEN scenes (EDEN). The dataset features more than 300K images captured from more than 100 garden models. Each image is annotated with various low/high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow. Experimental results on the state-of-the-art methods for semantic segmentation and monocular depth prediction, two important tasks in computer vision, show positive impact of pre-training deep networks on our dataset for unstructured natural scenes. The dataset and related materials will be available at this https URL.
13. MAGNeto: An Efficient Deep Learning Method for the Extractive Tags Summarization Problem [PDF] Back to Contents
Hieu Trong Phung, Anh Tuan Vu, Tung Dinh Nguyen, Lam Thanh Do, Giang Nam Ngo, Trung Thanh Tran, Ngoc C. Lê
Abstract: In this work, we study a new image annotation task named Extractive Tags Summarization (ETS). The goal is to extract the important tags from the context lying in an image and its corresponding tags. We adjust some state-of-the-art deep learning models to utilize both visual and textual information. Our proposed solution consists of different widely used blocks, such as convolutional and self-attention layers, together with a novel idea of combining auxiliary loss functions and a gating mechanism to glue and elevate these fundamental components into a unified architecture. Besides, we introduce a loss function that aims to reduce the imbalance of the training data, and a simple but effective data augmentation technique dedicated to alleviating the effect of outliers on the final results. Last but not least, we explore an unsupervised pre-training strategy to further boost the performance of the model by making use of the abundant amount of available unlabeled data. Our model shows good results, with a 90% $F_\text{1}$ score on the public NUS-WIDE benchmark and a 50% $F_\text{1}$ score on a noisy large-scale real-world private dataset. Source code for reproducing the experiments is publicly available at: this https URL
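The gating mechanism that glues the visual and textual branches is not detailed in the abstract; a generic gated fusion of the kind it suggests might look like this sketch (class name, dimensions and gate placement are all assumptions):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion of a visual and a textual feature."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, textual):      # both: (batch, dim)
        # The gate decides, per dimension, how much of each
        # modality flows into the fused representation.
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        return g * visual + (1 - g) * textual
```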
14. EfficientPose -- An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach [PDF] Back to Contents
Yannick Bukschat, Marcus Vetter
Abstract: In this paper we introduce EfficientPose, a new approach for 6D object pose estimation. Our method is highly accurate, efficient and scalable over a wide range of computational resources. Moreover, it can detect the 2D bounding boxes of multiple objects and instances as well as estimate their full 6D poses in a single shot. This eliminates the significant increase in runtime that other approaches suffer from when dealing with multiple objects. Those approaches aim to first detect 2D targets, e.g. keypoints, and afterwards solve a Perspective-n-Point problem for the 6D pose of each object. We also propose a novel augmentation method for direct 6D pose estimation approaches to improve performance and generalization, called 6D augmentation. Our approach achieves a new state-of-the-art accuracy of 97.35% in terms of the ADD(-S) metric on the widely-used 6D pose estimation benchmark dataset Linemod using RGB input, while still running end-to-end at over 27 FPS. Through the inherent handling of multiple objects and instances and the fused single-shot 2D object detection as well as 6D pose estimation, our approach runs even with multiple objects (eight) end-to-end at over 26 FPS, making it highly attractive to many real-world scenarios. Code will be made publicly available at this https URL.
15. Learning the Best Pooling Strategy for Visual Semantic Embedding [PDF] Back to Contents
Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang
Abstract: Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for different data modality and feature extractor is costly and tedious, especially when the size of features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model using this proposed GPO and denote it as VSE$\infty$. Without bells and whistles, VSE$\infty$ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE$\infty$ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models.
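The Generalized Pooling Operator can be pictured as a learned weighted sum over sorted feature values: a one-hot weight vector recovers max pooling, and uniform weights recover mean pooling. The simplified sketch below uses a fixed learnable weight vector, whereas the paper generates the weights with a small sequence model.

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Toy GPO: a learned weighted sum over sorted feature values."""
    def __init__(self, max_len):
        super().__init__()
        # Initialized to uniform weights, i.e. mean pooling.
        self.weights = nn.Parameter(torch.full((max_len,), 1.0 / max_len))

    def forward(self, x):                    # x: (batch, seq, dim)
        # Sort each feature dimension independently, largest first.
        sorted_x, _ = x.sort(dim=1, descending=True)
        w = torch.softmax(self.weights[: x.size(1)], dim=0)
        return (sorted_x * w.view(1, -1, 1)).sum(dim=1)
```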
16. Sketch-Inspector: a Deep Mixture Model for High-Quality Sketch Generation of Cats [PDF] Back to Contents
Yunkui Pang, Zhiqing Pan, Ruiyang Sun, Shuchong Wang
Abstract: With the involvement of artificial intelligence (AI), sketches can be automatically generated under certain topics. Even though breakthroughs have been made in previous studies in this area, a relatively high proportion of the generated figures are too abstract to recognize, which illustrates that AIs fail to learn the general pattern of the target object when drawing. This paper posits that supervising the process of stroke generation can lead to more accurate sketch interpretation. Based on that, a sketch generating system with an assistant convolutional neural network (CNN) predictor that suggests the shape of the next stroke is presented in this paper. In addition, a CNN-based discriminator is introduced to judge the recognizability of the end product. Since the baseline model is ineffective at generating multi-class sketches, we restrict the model to produce one category. Because the image of a cat is easy to identify, we consider cat sketches selected from the QuickDraw data set. This paper compares the proposed model with the original Sketch-RNN on 75K human-drawn cat sketches. The result indicates that our model produces sketches of higher quality than the human-drawn ones.
17. Closing the Generalization Gap in One-Shot Object Detection [PDF] 返回目录
Claudio Michaelis, Matthias Bethge, Alexander S. Ecker
Abstract: Despite substantial progress in object detection and few-shot learning, detecting objects based on a single example - one-shot object detection remains a challenge: trained models exhibit a substantial generalization gap, where object categories used during training are detected much more reliably than novel ones. Here we show that this generalization gap can be nearly closed by increasing the number of object categories used during training. Our results show that the models switch from memorizing individual categories to learning object similarity over the category distribution, enabling strong generalization at test time. Importantly, in this regime standard methods to improve object detection models like stronger backbones or longer training schedules also benefit novel categories, which was not the case for smaller datasets like COCO. Our results suggest that the key to strong few-shot detection models may not lie in sophisticated metric learning approaches, but instead in scaling the number of categories. Future data annotation efforts should therefore focus on wider datasets and annotate a larger number of categories rather than gathering more images or instances per category.
18. Unified Quality Assessment of In-the-Wild Videos with Mixed Datasets Training [PDF] 返回目录
Dingquan Li, Tingting Jiang, Ming Jiang
Abstract: Video quality assessment (VQA) is an important problem in computer vision. The videos in computer vision applications are usually captured in the wild. We focus on automatically assessing the quality of in-the-wild videos, which is a challenging problem due to the absence of reference videos, the complexity of distortions, and the diversity of video contents. Moreover, the video contents and distortions among existing datasets are quite different, which leads to poor performance of data-driven methods in the cross-dataset evaluation setting. To improve the performance of quality assessment models, we borrow intuitions from human perception, specifically, content dependency and temporal-memory effects of human visual system. To face the cross-dataset evaluation challenge, we explore a mixed datasets training strategy for training a single VQA model with multiple datasets. The proposed unified framework explicitly includes three stages: relative quality assessor, nonlinear mapping, and dataset-specific perceptual scale alignment, to jointly predict relative quality, perceptual quality, and subjective quality. Experiments are conducted on four publicly available datasets for VQA in the wild, i.e., LIVE-VQC, LIVE-Qualcomm, KoNViD-1k, and CVD2014. The experimental results verify the effectiveness of the mixed datasets training strategy and prove the superior performance of the unified model in comparison with the state-of-the-art models. For reproducible research, we make the PyTorch implementation of our method available at this https URL.
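As a rough illustration of the framework's later stages, the sketch below maps relative quality scores through a monotonic nonlinearity and then applies a dataset-specific linear rescaling. The 4-parameter logistic form is a standard choice for such a mapping stage, but the exact parameterization and names here are assumptions.

```python
import torch

def logistic_mapping(relative_q, a, b, c, d):
    """Monotonic nonlinear mapping from relative to perceptual quality.
    A 4-parameter logistic is one standard choice for this stage."""
    return (a - b) / (1 + torch.exp(-(relative_q - c) / d)) + b

def dataset_scale_align(perceptual_q, scale, shift):
    """Dataset-specific linear rescaling onto a dataset's subjective score range."""
    return scale * perceptual_q + shift
```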
19. Robust Visual Tracking via Statistical Positive Sample Generation and Gradient Aware Learning [PDF] 返回目录
Lijian Lin, Haosheng Chen, Yanjie Liang, Yan Yan, Hanzi Wang
Abstract: In recent years, Convolutional Neural Network (CNN) based trackers have achieved state-of-the-art performance on multiple benchmark datasets. Most of these trackers train a binary classifier to distinguish the target from its background. However, they suffer from two limitations. Firstly, these trackers cannot effectively handle significant appearance variations due to the limited number of positive samples. Secondly, there exists a significant imbalance of gradient contributions between easy and hard samples, where the easy samples usually dominate the computation of gradient. In this paper, we propose a robust tracking method via Statistical Positive sample generation and Gradient Aware learning (SPGA) to address the above two limitations. To enrich the diversity of positive samples, we present an effective and efficient statistical positive sample generation algorithm to generate positive samples in the feature space. Furthermore, to handle the issue of imbalance between easy and hard samples, we propose a gradient sensitive loss to harmonize the gradient contributions between easy and hard samples. Extensive experiments on three challenging benchmark datasets including OTB50, OTB100 and VOT2016 demonstrate that the proposed SPGA performs favorably against several state-of-the-art trackers.
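One common way to realize a loss that harmonizes gradient contributions between easy and hard samples is a focal-style modulation that shrinks the weight of well-classified samples. The sketch below shows that idea; the paper's exact gradient-sensitive loss may use a different modulation, and `gamma` is an assumed hyper-parameter.

```python
import torch
import torch.nn.functional as F

def gradient_aware_bce(logits, targets, gamma=2.0):
    """Focal-style weighting: easy (well-classified) samples get small
    weights so they no longer dominate the gradient.
    targets: float tensor with values in {0, 1}."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets > 0.5, p, 1.0 - p)  # probability of the true class
    weight = (1.0 - pt) ** gamma
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weight * loss).mean()
```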
20. Improved Soccer Action Spotting using both Audio and Video Streams [PDF] 返回目录
Bastien Vanderplaetse, Stéphane Dupont
Abstract: In this paper, we propose a study on multi-modal (audio and video) action spotting and classification in soccer videos. Action spotting and classification are the tasks that consist in finding the temporal anchors of events in a video and determine which event they are. This is an important application of general activity understanding. Here, we propose an experimental study on combining audio and video information at different stages of deep neural network architectures. We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues. Through this work, we evaluated several ways to integrate audio stream into video-only-based architectures. We observed an average absolute improvement of the mean Average Precision (mAP) metric of $7.43\%$ for the action classification task and of $4.19\%$ for the action spotting task.
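A minimal sketch of one of the fusion points such a study would compare: late fusion, where pooled per-stream features are concatenated before the classification head. The feature dimensions and class count below are placeholders.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Concatenates pooled per-stream features before classification."""
    def __init__(self, video_dim=512, audio_dim=128, num_classes=4):
        super().__init__()
        self.classifier = nn.Linear(video_dim + audio_dim, num_classes)

    def forward(self, video_feat, audio_feat):
        return self.classifier(torch.cat([video_feat, audio_feat], dim=-1))
```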
21. Real-time object detection method based on improved YOLOv4-tiny [PDF] 返回目录
Zicong Jiang, Liquan Zhao, Shuaiyang Li, Yanfei Jia
Abstract: The "You only look once v4"(YOLOv4) is one type of object detection methods in deep learning. YOLOv4-tiny is proposed based on YOLOv4 to simple the network structure and reduce parameters, which makes it be suitable for developing on the mobile and embedded devices. To improve the real-time of object detection, a fast object detection method is proposed based on YOLOv4-tiny. It firstly uses two ResBlock-D modules in ResNet-D network instead of two CSPBlock modules in Yolov4-tiny, which reduces the computation complexity. Secondly, it designs an auxiliary residual network block to extract more feature information of object to reduce detection error. In the design of auxiliary network, two consecutive 3x3 convolutions are used to obtain 5x5 receptive fields to extract global features, and channel attention and spatial attention are also used to extract more effective information. In the end, it merges the auxiliary network and backbone network to construct the whole network structure of improved YOLOv4-tiny. Simulation results show that the proposed method has faster object detection than YOLOv4-tiny and YOLOv3-tiny, and almost the same mean value of average precision as the YOLOv4-tiny. It is more suitable for real-time object detection.
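The auxiliary block described above (two stacked 3x3 convolutions for a 5x5 receptive field, plus attention) can be sketched as follows. This is an illustrative reading of the abstract: the spatial-attention branch is omitted for brevity and the layer sizes are assumed.

```python
import torch.nn as nn

class AuxResidualBlock(nn.Module):
    """Two stacked 3x3 convs (a 5x5 receptive field) followed by an
    SE-style channel-attention gate, merged through a residual path."""
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.convs(x)
        return x + y * self.channel_gate(y)  # residual merge
```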
22. Dual ResGCN for Balanced Scene Graph Generation [PDF] 返回目录
Jingyi Zhang, Yong Zhang, Baoyuan Wu, Yanbo Fan, Fumin Shen, Heng Tao Shen
Abstract: Visual scene graph generation is a challenging task. Previous works have achieved great progress, but most of them do not explicitly consider the class imbalance issue in scene graph generation. Models learned without considering the class imbalance tend to predict the majority classes, which leads to a good performance on trivial frequent predicates, but poor performance on informative infrequent predicates. However, predicates of minority classes often carry more semantic and precise information (e.g., 'on' vs. 'parked on'). To alleviate the influence of the class imbalance, we propose a novel model, dubbed dual ResGCN, which consists of an object residual graph convolutional network and a relation residual graph convolutional network. The two networks are complementary to each other. The former captures object-level context information, i.e., the connections among objects. We propose a novel ResGCN that enhances object features in a cross attention manner. Besides, we stack multiple contextual coefficients to alleviate the imbalance issue and enrich the prediction diversity. The latter is carefully designed to explicitly capture relation-level context information, i.e., the connections among relations. We propose to incorporate the prior about the co-occurrence of relation pairs into the graph to further help alleviate the class imbalance issue. Extensive evaluations of three tasks are performed on the large-scale database VG to demonstrate the superiority of the proposed method.
23. End-to-end Lane Shape Prediction with Transformers [PDF] 返回目录
Ruijin Liu, Zejian Yuan, Tie Liu, Zhiliang Xiong
Abstract: Lane detection, the process of identifying lane markings as approximated curves, is widely used for lane departure warning and adaptive cruise control in autonomous vehicles. The popular pipeline that solves it in two steps---feature extraction plus post-processing, while useful, is too inefficient and flawed in learning the global context and lanes' long and thin structures. To tackle these issues, we propose an end-to-end method that directly outputs parameters of a lane shape model, using a network built with a transformer to learn richer structures and context. The lane shape model is formulated based on road structures and camera pose, providing physical interpretation for parameters of network output. The transformer models non-local interactions with a self-attention mechanism to capture slender structures and global context. The proposed method is validated on the TuSimple benchmark and shows state-of-the-art accuracy with the most lightweight model size and fastest speed. Additionally, our method shows excellent adaptability to a challenging self-collected lane detection dataset, showing its powerful deployment potential in real applications. Codes are available at this https URL.
24. An improved helmet detection method for YOLOv3 on an unbalanced dataset [PDF] 返回目录
Rui Geng, Yixuan Ma, Wanhong Huang
Abstract: The YOLOv3 target detection algorithm is widely used in industry due to its high speed and high accuracy, but it has limitations, such as accuracy degradation on unbalanced datasets. This work applies a Gaussian fuzzy data augmentation approach to pre-process the dataset and thereby improve the YOLOv3 target detection algorithm. With this efficient pre-processing, the confidence level of YOLOv3 generally improves by 0.01-0.02 without changing its recognition speed, and the processed images also perform better in image localization owing to effective feature fusion, better meeting production requirements for recognition speed and accuracy.
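A Gaussian fuzzy (blur) augmentation of the kind described is a one-liner with OpenCV; the kernel size and sigma range below are assumed values, not the paper's settings.

```python
import random
import cv2

def gaussian_fuzzy_augment(image, max_sigma=2.0):
    """Gaussian-blur ('fuzzy') pre-processing with a randomly drawn sigma."""
    sigma = random.uniform(0.1, max_sigma)
    return cv2.GaussianBlur(image, ksize=(5, 5), sigmaX=sigma)
```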
25. Detecting Outliers with Foreign Patch Interpolation [PDF] 返回目录
Jeremy Tan, Benjamin Hou, James Batten, Huaqi Qiu, Bernhard Kainz
Abstract: In medical imaging, outliers can contain hypo/hyper-intensities, minor deformations, or completely altered anatomy. To detect these irregularities it is helpful to learn the features present in both normal and abnormal images. However this is difficult because of the wide range of possible abnormalities and also the number of ways that normal anatomy can vary naturally. As such, we leverage the natural variations in normal anatomy to create a range of synthetic abnormalities. Specifically, the same patch region is extracted from two independent samples and replaced with an interpolation between both patches. The interpolation factor, patch size, and patch location are randomly sampled from uniform distributions. A wide residual encoder decoder is trained to give a pixel-wise prediction of the patch and its interpolation factor. This encourages the network to learn what features to expect normally and to identify where foreign patterns have been introduced. The estimate of the interpolation factor lends itself nicely to the derivation of an outlier score. Meanwhile the pixel-wise output allows for pixel- and subject- level predictions using the same model.
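The patch-interpolation recipe in the abstract translates almost directly into code: take the same region from two normal samples and blend them with a random factor, which becomes the regression target. The sketch below makes assumptions about the patch-size ranges.

```python
import numpy as np

def foreign_patch_interpolation(img_a, img_b, rng=None):
    """Blend the same patch region of two normal samples to synthesize a
    subtle anomaly; alpha (the interpolation factor) is the training target."""
    rng = rng or np.random.default_rng()
    h, w = img_a.shape[:2]
    ph = int(rng.integers(h // 8, h // 2))  # patch height (assumed range)
    pw = int(rng.integers(w // 8, w // 2))  # patch width  (assumed range)
    y = int(rng.integers(0, h - ph))
    x = int(rng.integers(0, w - pw))
    alpha = float(rng.uniform())            # interpolation factor in [0, 1)
    out = img_a.astype(np.float32).copy()
    patch_b = img_b[y:y + ph, x:x + pw].astype(np.float32)
    out[y:y + ph, x:x + pw] = (1 - alpha) * out[y:y + ph, x:x + pw] + alpha * patch_b
    return out, alpha, (y, x, ph, pw)
```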
26. Two-Stream Appearance Transfer Network for Person Image Generation [PDF] 返回目录
Chengkang Shen, Peiyan Wang, Wei Tang
Abstract: Pose guided person image generation means to generate a photo-realistic person image conditioned on an input person image and a desired pose. This task requires spatial manipulation of the source image according to the target pose. However, the generative adversarial networks (GANs) widely used for image generation and translation rely on spatially local and translation equivariant operators, i.e., convolution, pooling and unpooling, which cannot handle large image deformation. This paper introduces a novel two-stream appearance transfer network (2s-ATN) to address this challenge. It is a multi-stage architecture consisting of a source stream and a target stream. Each stage features an appearance transfer module and several two-stream feature fusion modules. The former finds the dense correspondence between the two-stream feature maps and then transfers the appearance information from the source stream to the target stream. The latter exchange local information between the two streams and supplement the non-local appearance transfer. Both quantitative and qualitative results indicate the proposed 2s-ATN can effectively handle large spatial deformation and occlusion while retaining the appearance details. It outperforms prior states of the art on two widely used benchmarks.
27. Deep Learning based Monocular Depth Prediction: Datasets, Methods and Applications [PDF] 返回目录
Qing Li, Jiasong Zhu, Jun Liu, Rui Cao, Qingquan Li, Sen Jia, Guoping Qiu
Abstract: Estimating depth from RGB images can facilitate many computer vision tasks, such as indoor localization, height estimation, and simultaneous localization and mapping (SLAM). Recently, monocular depth estimation has obtained great progress owing to the rapid development of deep learning techniques. They surpass traditional machine learning-based methods by a large margin in terms of accuracy and speed. Despite the rapid progress in this topic, there are lacking of a comprehensive review, which is needed to summarize the current progress and provide the future directions. In this survey, we first introduce the datasets for depth estimation, and then give a comprehensive introduction of the methods from three perspectives: supervised learning-based methods, unsupervised learning-based methods, and sparse samples guidance-based methods. In addition, downstream applications that benefit from the progress have also been illustrated. Finally, we point out the future directions and conclude the paper.
28. Localising In Complex Scenes Using Balanced Adversarial Adaptation [PDF] 返回目录
Gil Avraham, Yan Zuo, Tom Drummond
Abstract: Domain adaptation and generative modelling have collectively mitigated the expensive nature of data collection and labelling by leveraging the rich abundance of accurate, labelled data in simulation environments. In this work, we study the performance gap that exists between representations optimised for localisation on simulation environments and the application of such representations in a real-world setting. Our method exploits the shared geometric similarities between simulation and real-world environments whilst maintaining invariance towards visual discrepancies. This is achieved by optimising a representation extractor to project both simulated and real representations into a shared representation space. Our method uses a symmetrical adversarial approach which encourages the representation extractor to conceal the domain that features are extracted from and simultaneously preserves robust attributes between source and target domains that are beneficial for localisation. We evaluate our method by adapting representations optimised for indoor Habitat simulated environments (Matterport3D and Replica) to a real-world indoor environment (Active Vision Dataset), showing that it compares favourably against fully-supervised approaches.
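The standard building block for training a feature extractor to "conceal the domain" from a discriminator is a gradient reversal layer: identity in the forward pass, negated gradient in the backward pass. The paper's symmetrical adversarial approach is more involved; this sketch shows only the core mechanism.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward, negated gradient backward, so the upstream
    extractor is optimized to fool the domain discriminator."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# usage: domain_logits = discriminator(GradReverse.apply(features))
```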
29. Distance-Based Anomaly Detection for Industrial Surfaces Using Triplet Networks [PDF] 返回目录
Tareq Tayeh, Sulaiman Aburakhia, Ryan Myers, Abdallah Shami
Abstract: Surface anomaly detection plays an important quality control role in many manufacturing industries to reduce scrap production. Machine-based visual inspections have been utilized in recent years to conduct this task instead of human experts. In particular, deep learning Convolutional Neural Networks (CNNs) have been at the forefront of these image processing-based solutions due to their predictive accuracy and efficiency. Training a CNN on a classification objective requires a sufficiently large amount of defective data, which is often not available. In this paper, we address that challenge by training the CNN on surface texture patches with a distance-based anomaly detection objective instead. A deep residual-based triplet network model is utilized, and defective training samples are synthesized exclusively from non-defective samples via random erasing techniques to directly learn a similarity metric between the same-class samples and out-of-class samples. Evaluation results demonstrate the approach's strength in detecting different types of anomalies, such as bent, broken, or cracked surfaces, for known surfaces that are part of the training data and unseen novel surfaces.
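A minimal sketch of the training signal: anchor and positive are non-defective patches, and the "defective" negative is synthesized from a non-defective patch by random erasing, as the abstract describes. `embed` stands for an assumed shared embedding network that accepts a single (C, H, W) tensor; the margin and transform settings are placeholders.

```python
import torch.nn.functional as F
from torchvision.transforms import RandomErasing

erase = RandomErasing(p=1.0)  # synthesizes a pseudo-defective patch

def triplet_step(embed, anchor, positive, margin=1.0):
    """anchor/positive: non-defective patch tensors of shape (C, H, W)."""
    negative = erase(positive)
    return F.triplet_margin_loss(embed(anchor), embed(positive),
                                 embed(negative), margin=margin)
```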
30. Image Clustering using an Augmented Generative Adversarial Network and Information Maximization [PDF] 返回目录
Foivos Ntelemis, Yaochu Jin, Spencer A. Thomas
Abstract: Image clustering has recently attracted significant attention due to the increased availability of unlabelled datasets. The efficiency of traditional clustering algorithms heavily depends on the distance functions used and the dimensionality of the features. Therefore, performance degradation is often observed when tackling either unprocessed images or high-dimensional features extracted from processed images. To deal with these challenges, we propose a deep clustering framework consisting of a modified generative adversarial network (GAN) and an auxiliary classifier. The modification employs Sobel operations prior to the discriminator of the GAN to enhance the separability of the learned features. The discriminator is then leveraged to generate representations as the input to an auxiliary classifier. An adaptive objective function is utilised to train the auxiliary classifier for clustering the representations, aiming to increase the robustness by minimizing the divergence of multiple representations generated by the discriminator. The auxiliary classifier is implemented with a group of multiple cluster-heads, where a tolerance hyper-parameter is used to tackle imbalanced data. Our results indicate that the proposed method significantly outperforms state-of-the-art clustering methods on CIFAR-10 and CIFAR-100, and is competitive on the STL10 and MNIST datasets.
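The fixed Sobel pre-processing placed before the discriminator can be sketched as a non-learned convolution; single-channel input and the gradient-magnitude combination below are assumptions.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])

def sobel_edges(gray):
    """Fixed (non-learned) Sobel filtering; gray: (batch, 1, H, W) floats."""
    kx = SOBEL_X.view(1, 1, 3, 3)
    ky = SOBEL_X.t().contiguous().view(1, 1, 3, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)  # edge magnitude map
```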
31. An HVS-Oriented Saliency Map Prediction Modeling [PDF] 返回目录
Qiang Li
Abstract: Visual attention is one of the most significant mechanisms for selecting and understanding the outside world. Natural scenes are complex and highly redundant, and because of the information bottleneck, human vision cannot process all of their information simultaneously. The visual system therefore focuses mainly on the dominant parts of a scene to reduce redundant visual input; predicting these regions is commonly known as visual attention prediction, or visual saliency mapping. This paper proposes a new saliency prediction architecture inspired by the function of the human low-level visual cortex. The model uses the opponent color channels, a wavelet energy map, and the contrast sensitivity function to extract image features, approximating real visual neural processing in the brain. The proposed model is evaluated on several datasets, including MIT1003, MIT300, TORONTO, and SID4VAM, to demonstrate its efficiency, and its results are quantitatively and qualitatively compared with other state-of-the-art saliency prediction models, which it outperforms.
32. AI on the Bog: Monitoring and Evaluating Cranberry Crop Risk [PDF] 返回目录
Peri Akiva, Benjamin Planche, Aditi Roy, Kristin Dana, Peter Oudemans, Michael Mars
Abstract: Machine vision for precision agriculture has attracted considerable research interest in recent years. The goal of this paper is to develop an end-to-end cranberry health monitoring system to enable and support real time cranberry over-heating assessment to facilitate informed decisions that may sustain the economic viability of the farm. Toward this goal, we propose two main deep learning-based modules for: 1) cranberry fruit segmentation to delineate the exact fruit regions in the cranberry field image that are exposed to sun, 2) prediction of cloud coverage conditions and sun irradiance to estimate the inner temperature of exposed cranberries. We develop drone-based field data and ground-based sky data collection systems to collect video imagery at multiple time points for use in crop health analysis. Extensive evaluation on the data set shows that it is possible to predict exposed fruit's inner temperature with high accuracy (0.02% MAPE). The sun irradiance prediction error was found to be 8.41-20.36% MAPE in the 5-20 minutes time horizon. With 62.54% mIoU for segmentation and 13.46 MAE for counting accuracies in exposed fruit identification, this system is capable of giving informed feedback to growers to take precautionary action (e.g. irrigation) in identified crop field regions with higher risk of sunburn in the near future. Though this novel system is applied for cranberry health monitoring, it represents a pioneering step forward for efficient farming and is useful in precision agriculture beyond the problem of cranberry overheating.
33. Analysis of Dimensional Influence of Convolutional Neural Networks for Histopathological Cancer Classification [PDF] 返回目录
Shreyas Rajesh Labhsetwar, Alistair Michael Baretto, Raj Sunil Salvi, Piyush Arvind Kolte, Veerasai Subramaniam Venkatesh
Abstract: Convolutional Neural Networks can be designed with different levels of complexity depending upon the task at hand. This paper analyzes the effect of dimensional changes to the CNN architecture on its performance on the task of Histopathological Cancer Classification. The research starts with a baseline 10-layer CNN model with (3 X 3) convolution filters. Thereafter, the baseline architecture is scaled in multiple dimensions including width, depth, resolution and a combination of all of these. Width scaling involves inculcating greater number of neurons per CNN layer, whereas depth scaling involves deepening the hierarchical layered structure. Resolution scaling is performed by increasing the dimensions of the input image, and compound scaling involves a hybrid combination of width, depth and resolution scaling. The results indicate that histopathological cancer scans are very complex in nature and hence require high resolution images fed to a large hierarchy of Convolution, MaxPooling, Dropout and Batch Normalization layers to extract all the intricacies and perform perfect classification. Since compound scaling the baseline model ensures that all the three dimensions: width, depth and resolution are scaled, the best performance is obtained with compound scaling. This research shows that better performance of CNN models is achieved by compound scaling of the baseline model for the task of Histopathological Cancer Classification.
34. Performance Analysis of Optimizers for Plant Disease Classification with Convolutional Neural Networks [PDF] [Back to contents]
Shreyas Rajesh Labhsetwar, Soumya Haridas, Riyali Panmand, Rutuja Deshpande, Piyush Arvind Kolte, Sandhya Pati
Abstract: Crop failure owing to pests and diseases is inherent within Indian agriculture, leading to annual losses of 15 to 25% of productivity and resulting in huge economic loss. This research analyzes the performance of various optimizers for predictive analysis of plant diseases with a deep learning approach. The research uses Convolutional Neural Networks for classification of farm or plant leaf samples of 3 crops into 15 classes. The optimizers used in this research include RMSprop, Adam and AMSgrad. Optimizer performance is visualised by plotting the Training and Validation Accuracy and Loss curves, ROC curves and the Confusion Matrix. The best performance is achieved using the Adam optimizer, with a maximum validation accuracy of 98%. This paper focuses on the research analysis proving that plant diseases can be predicted and pre-empted using deep learning methodology with the help of satellite-, drone- or mobile-based images, thereby reducing crop failure and agricultural losses.
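Because AMSGrad is exposed as a flag on Adam in common frameworks, the comparison described above can be scripted as a small harness. The PyTorch sketch below is generic, not the paper's setup; the stand-in classifier, learning rate, and data loaders are assumptions.

```python
import torch
import torch.nn as nn

def make_model():
    # Stand-in classifier for 15 disease classes; the paper's CNN is not specified here.
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128),
                         nn.ReLU(), nn.Linear(128, 15))

def accuracy(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total

def run(name, train_loader, val_loader, epochs=5):
    model, loss_fn = make_model(), nn.CrossEntropyLoss()
    opt = {
        "RMSprop": lambda p: torch.optim.RMSprop(p, lr=1e-3),
        "Adam":    lambda p: torch.optim.Adam(p, lr=1e-3),
        "AMSgrad": lambda p: torch.optim.Adam(p, lr=1e-3, amsgrad=True),
    }[name](model.parameters())
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return accuracy(model, val_loader)  # compare across the three optimizer names
```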
35. Predictive Analysis of Diabetic Retinopathy with Transfer Learning [PDF] [Back to contents]
Shreyas Rajesh Labhsetwar, Raj Sunil Salvi, Piyush Arvind Kolte, Veerasai Subramaniam venkatesh, Alistair Michael Baretto
Abstract: With the prevalence of Diabetes, Diabetes Mellitus Retinopathy (DR) is becoming a major health problem across the world. The long-term medical complications arising due to DR have a significant impact on the patient as well as on society, as the disease mostly affects individuals in their most productive years. Early detection and treatment can help reduce the extent of damage to patients. The rise of Convolutional Neural Networks for predictive analysis in the medical field paves the way for a robust solution to DR detection. This paper studies the performance of several highly efficient and scalable CNN architectures for Diabetic Retinopathy Classification with the help of Transfer Learning. The research focuses on the VGG16, ResNet50 V2 and EfficientNet B0 models. The classification performance is analyzed using several performance metrics including True Positive Rate, False Positive Rate, Accuracy, etc. Several performance graphs are also plotted for visualizing the architecture performance, including the Confusion Matrix, ROC Curve, etc. The results indicate that Transfer Learning with ImageNet weights using the VGG16 model demonstrates the best classification performance, with a best accuracy of 95%. It is closely followed by the ResNet50 V2 architecture, with a best accuracy of 93%. This paper shows that predictive analysis of DR from retinal images is achieved with Transfer Learning on Convolutional Neural Networks.
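A minimal version of that transfer-learning recipe with recent torchvision (a generic illustration, not the authors' exact configuration; the number of DR grades and the freezing policy are assumptions):

```python
import torch.nn as nn
from torchvision import models

def make_dr_model(num_classes=5, freeze_base=True):
    """VGG16 with ImageNet weights and a fresh classification head."""
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    if freeze_base:
        for p in model.features.parameters():  # keep ImageNet conv features fixed
            p.requires_grad = False
    # Swap the final 1000-way ImageNet layer for the DR classes.
    model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_classes)
    return model
```

The same pattern applies to the other backbones by replacing the final layer (`model.fc` for torchvision's ResNet variants, `model.classifier[1]` for EfficientNet-B0).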
36. Adaptive Linear Span Network for Object Skeleton Detection [PDF] [Back to contents]
Chang Liu, Yunjie Tian, Jianbin Jiao, Qixiang Ye
Abstract: Conventional networks for object skeleton detection are usually hand-crafted. Although effective, they require intensive prior knowledge to configure representative features for objects at different scales. In this paper, we propose the adaptive linear span network (AdaLSN), driven by neural architecture search (NAS), to automatically configure and integrate scale-aware features for object skeleton detection. AdaLSN is formulated with the theory of linear span, which provides one of the earliest explanations for multi-scale deep feature fusion. AdaLSN is materialized by defining a mixed unit-pyramid search space, which goes beyond many existing search spaces that use unit-level or pyramid-level features. Within the mixed space, we apply genetic architecture search to jointly optimize unit-level operations and pyramid-level connections for adaptive feature space expansion. AdaLSN substantiates its versatility by achieving a significantly better accuracy-latency trade-off compared with the state of the art. It also demonstrates general applicability to image-to-mask tasks such as edge detection and road extraction. Code is available at this https URL.
37. The quantization error in a Self-Organizing Map as a contrast and colour specific indicator of single-pixel change in large random patterns [PDF] [Back to contents]
John M Wandeto, Birgitta Dresp-Langley
Abstract: The quantization error in a fixed-size Self-Organizing Map (SOM) with unsupervised winner-take-all learning has previously been used successfully to detect, in minimal computation time, highly meaningful changes across images in medical time series and in time series of satellite images. Here, the functional properties of the quantization error in SOM are explored further to show that the metric is capable of reliably discriminating between the finest differences in local contrast intensities and contrast signs. While this capability of the QE is akin to functional characteristics of a specific class of retinal ganglion cells (the so-called Y-cells) in the visual systems of the primate and the cat, the sensitivity of the QE surpasses the capacity limits of human visual detection. Here, the quantization error in the SOM is found to reliably signal changes in contrast or colour when contrast information is removed from or added to the image, but not when the amount and relative weight of contrast information is constant and only the local spatial position of contrast elements in the pattern changes. While the RGB Mean reflects coarser changes in colour or contrast well enough, the SOM-QE is shown to outperform the RGB Mean in the detection of single-pixel changes in images with up to five million pixels. This could have important implications in the context of unsupervised image learning and computational building block approaches to large sets of image data (big data), including deep learning blocks, and automatic detection of contrast change at the nanoscale in Transmission or Scanning Electron Micrographs (TEM, SEM), or at the subpixel level in multispectral and hyper-spectral imaging data.
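For concreteness, the quantization error (QE) of a trained SOM is simply the mean distance between each input vector and its best-matching unit. A small numpy sketch follows; the codebook here is random purely for illustration, whereas in the paper the map is trained beforehand with winner-take-all learning.

```python
import numpy as np

def som_quantization_error(inputs, weights):
    """Mean distance from each input to its best-matching SOM unit.

    inputs:  (n_samples, n_features), e.g. image pixels as RGB vectors
    weights: (n_units, n_features), the trained SOM codebook
    """
    dists = np.linalg.norm(inputs[:, None, :] - weights[None, :, :], axis=2)
    return dists.min(axis=1).mean()  # distance to the best-matching unit, averaged

rng = np.random.default_rng(0)
codebook = rng.random((16, 3))   # a 4x4 map over RGB values, flattened
image_a = rng.random((1000, 3))  # 1000 pixels of a random pattern
image_b = image_a.copy()
image_b[0] = 1.0 - image_b[0]    # a single-pixel change
print(som_quantization_error(image_a, codebook),
      som_quantization_error(image_b, codebook))
```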
38. FlowCaps: Optical Flow Estimation with Capsule Networks For Action Recognition [PDF] [Back to contents]
Vinoj Jayasundara, Debaditya Roy, Basura Fernando
Abstract: Capsule networks (CapsNets) have recently shown promise to excel in most computer vision tasks, especially pertaining to scene understanding. In this paper, we explore CapsNet's capabilities in optical flow estimation, a task at which convolutional neural networks (CNNs) have already outperformed other approaches. We propose a CapsNet-based architecture, termed FlowCaps, which attempts to a) achieve better correspondence matching via finer-grained, motion-specific, and more-interpretable encoding crucial for optical flow estimation, b) perform better-generalizable optical flow estimation, c) utilize lesser ground truth data, and d) significantly reduce the computational complexity in achieving good performance, in comparison to its CNN-counterparts.
39. Right on Time: Multi-Temporal Convolutions for Human Action Recognition in Videos [PDF] [Back to contents]
Alexandros Stergiou, Ronald Poppe
Abstract: The variations in the temporal performance of human actions observed in videos present challenges for their extraction using fixed-sized convolution kernels in CNNs. We present an approach that is more flexible in terms of processing the input at multiple timescales. We introduce Multi-Temporal networks that model spatio-temporal patterns of different temporal durations at each layer. To this end, they employ novel 3D convolution (MTConv) blocks that consist of a short stream for local space-time features and a long stream for features spanning across longer times. By aligning features of each stream with respect to the global motion patterns using recurrent cells, we can discover temporally coherent spatio-temporal features with varying durations. We further introduce sub-streams within each of the block pathways to reduce the computation requirements. The proposed MTNet architectures outperform state-of-the-art 3D-CNNs on five action recognition benchmark datasets. Notably, we achieve 87.22% top-1 accuracy on HACS, and 58.39% top-1 on Kinetics-700. We further demonstrate the favorable computational requirements. Using sub-streams, we can further achieve a drastic reduction in parameters (~60%) and GFLOPs (~74%). Experiments using transfer learning finally verify the generalization capabilities of the multi-temporal features.
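The short/long two-stream idea can be pictured as a pair of 3D convolutions with different temporal extents whose outputs are fused. The PyTorch module below is an illustrative reconstruction under assumed kernel sizes and a simple additive fusion, not the authors' MTConv definition (which also aligns streams with recurrent cells).

```python
import torch
import torch.nn as nn

class MTConvSketch(nn.Module):
    """Two temporal streams over (B, C, T, H, W) clips, fused by summation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Short stream: local space-time features (small temporal kernel).
        self.short = nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3), padding=(1, 1, 1))
        # Long stream: features spanning longer times (large temporal kernel).
        self.long = nn.Conv3d(in_ch, out_ch, kernel_size=(7, 3, 3), padding=(3, 1, 1))
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.short(x) + self.long(x)))

clip = torch.randn(2, 3, 16, 56, 56)    # a batch of 16-frame RGB clips
print(MTConvSketch(3, 32)(clip).shape)  # torch.Size([2, 32, 16, 56, 56])
```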
40. PointTransformer for Shape Classification and Retrieval of 3D and ALS Roof PointClouds [PDF] [Back to contents]
Dimple A Shajahan, Mukund Varma T, Ramanathan Muthuganapathy
Abstract: Effective feature representation from Airborne Laser Scanning (ALS) point clouds used for urban modeling was challenging until the advent of deep learning and improved ALS techniques. Most deep learning techniques for 3-D point clouds utilize convolutions that assume a uniform input distribution and cannot learn long-range dependencies, leading to some limitations. Recent works have already shown that adding attention on top of these methods improves performance. This raises a question: can attention layers completely replace convolutions? We propose a fully attentional model-PointTransformer for deriving a rich point cloud representation. The model's shape classification and retrieval performance are evaluated on a large-scale urban dataset RoofN3D and a standard benchmark dataset ModelNet40. Also, the model is tested on various simulated point corruptions to analyze its effectiveness on real datasets. The proposed method outperforms other state-of-the-art models in the RoofN3D dataset, gives competitive results in the ModelNet40 benchmark, and showcases high robustness to multiple point corruptions. Furthermore, the model is both memory and space-efficient without compromising on performance.
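The appeal of attention here is that, unlike convolution, it assumes no grid and is permutation-equivariant, so it applies directly to unordered points. Below is a minimal single-head self-attention layer over per-point features, an illustration of the idea rather than the PointTransformer architecture.

```python
import torch
import torch.nn as nn

class PointSelfAttention(nn.Module):
    """Single-head self-attention over (B, N, C) point features."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats):
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, N, N)
        return feats + attn @ v  # residual connection keeps per-point identity

points = torch.randn(4, 1024, 64)            # e.g. embedded ALS roof points
print(PointSelfAttention(64)(points).shape)  # torch.Size([4, 1024, 64])
```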
41. Integrating Human Gaze into Attention for Egocentric Activity Recognition [PDF] [Back to contents]
Kyle Min, Jason J. Corso
Abstract: It is well known that human gaze carries significant information about visual attention. However, there are three main difficulties in incorporating the gaze data in an attention mechanism of deep neural networks: 1) the gaze fixation points are likely to have measurement errors due to blinking and rapid eye movements; 2) it is unclear when and how much the gaze data is correlated with visual attention; and 3) gaze data is not always available in many real-world situations. In this work, we introduce an effective probabilistic approach to integrate human gaze into spatiotemporal attention for egocentric activity recognition. Specifically, we represent the locations of gaze fixation points as structured discrete latent variables to model their uncertainties. In addition, we model the distribution of gaze fixations using a variational method. The gaze distribution is learned during the training process so that the ground-truth annotations of gaze locations are no longer needed in testing situations since they are predicted from the learned gaze distribution. The predicted gaze locations are used to provide informative attentional cues to improve the recognition performance. Our method outperforms all the previous state-of-the-art approaches on EGTEA, which is a large-scale dataset for egocentric activity recognition provided with gaze measurements. We also perform an ablation study and qualitative analysis to demonstrate that our attention mechanism is effective.
42. Faster object tracking pipeline for real time tracking [PDF] [Back to contents]
Parthesh Soni, Falak Shah, Nisarg Vyas
Abstract: Multi-object tracking (MOT) is a challenging practical problem for vision based applications. Most recent approaches for MOT use precomputed detections from models such as Faster RCNN, performing fine-tuning of bounding boxes and association in subsequent phases. However, this is not suitable for actual industrial applications due to unavailability of detections upfront. In their recent work, Wang et al. proposed a tracking pipeline that uses a joint detection and embedding model and performs target localization and association in realtime. Upon investigating the tracking-by-detection paradigm, we find that the tracking pipeline can be made faster by performing localization and association tasks in parallel with model prediction. This, along with other computational optimizations such as using a mixed precision model and performing batchwise detection, results in a speed-up of the tracking pipeline by 57.8% (19 FPS to 30 FPS) at FullHD resolution. Moreover, the speed is independent of the object density in the image sequence. The main contribution of this paper is showcasing a generic pipeline which can be used to speed up detection-based object tracking methods. We also reviewed different batch sizes for optimal performance, taking into consideration GPU memory usage and speed.
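The claimed speed-up comes from overlapping stages: while the GPU predicts detections for the next frame under mixed precision, the CPU runs localization and association for the current one. Below is a schematic sketch of that overlap; `detect` and `associate` are hypothetical stand-ins for the joint-detection model and the tracker.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def track_video(frames, detect, associate):
    """Overlap GPU detection of frame t+1 with CPU association of frame t."""
    def detect_amp(frame):
        # Mixed-precision inference (a no-op on CPU-only machines).
        with torch.autocast("cuda", enabled=torch.cuda.is_available()):
            return detect(frame)

    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(detect_amp, frames[0])
        for nxt in frames[1:]:
            dets = pending.result()
            pending = pool.submit(detect_amp, nxt)  # GPU starts on the next frame...
            results.append(associate(dets))         # ...while association runs here
        results.append(associate(pending.result()))
    return results
```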
43. Channel Pruning Guided by Spatial and Channel Attention for DNNs in Intelligent Edge Computing [PDF] [Back to contents]
Mengran Liu, Weiwei Fang, Xiaodong Ma, Wenyuan Xu, Naixue Xiong, Yi Ding
Abstract: Deep Neural Networks (DNNs) have achieved remarkable success in many computer vision tasks recently, but the huge number of parameters and the high computation overhead hinder their deployments on resource-constrained edge devices. It is worth noting that channel pruning is an effective approach for compressing DNN models. A critical challenge is to determine which channels are to be removed, so that the model accuracy will not be negatively affected. In this paper, we first propose Spatial and Channel Attention (SCA), a new attention module combining both spatial and channel attention that respectively focuses on "where" and "what" are the most informative parts. Guided by the scale values generated by SCA for measuring channel importance, we further propose a new channel pruning approach called Channel Pruning guided by Spatial and Channel Attention (CPSCA). Experimental results indicate that SCA achieves the best inference accuracy, while incurring negligibly extra resource consumption, compared to other state-of-the-art attention modules. Our evaluation on two benchmark datasets shows that, with the guidance of SCA, our CPSCA approach achieves higher inference accuracy than other state-of-the-art pruning methods under the same pruning ratios.
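The channel half of such an attention module already produces a per-channel importance score, which is exactly what pruning needs. The sketch below is a CBAM-style reconstruction rather than the CPSCA code; the reduction ratio and the top-k keep rule are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Avg- and max-pooled channel descriptors through a shared MLP (CBAM-style)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):               # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx)  # per-channel scales in (0, 1)

def channels_to_keep(scales, keep_ratio=0.5):
    """Rank channels by mean attention scale over a batch; keep the top fraction."""
    importance = scales.mean(dim=0)
    k = max(1, int(importance.numel() * keep_ratio))
    return importance.topk(k).indices

feats = torch.randn(8, 64, 14, 14)
scales = ChannelAttention(64)(feats)
print(channels_to_keep(scales))  # channel indices that guided pruning would retain
```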
44. Latent Neural Differential Equations for Video Generation [PDF] [Back to contents]
Cade Gordon, Natalie Parde
Abstract: Generative Adversarial Networks have recently shown promise for video generation, building off of the success of image generation while also addressing a new challenge: time. Although time was analyzed in some early work, the literature has not adequately grown with temporal modeling developments. We propose studying the effects of Neural Differential Equations to model the temporal dynamics of video generation. The paradigm of Neural Differential Equations presents many theoretical strengths including the first continuous representation of time within video generation. In order to address the effects of Neural Differential Equations, we will investigate how changes in temporal models affect generated video quality.
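Concretely, modeling temporal dynamics with a neural differential equation means each frame's latent code is the ODE solution at that frame's timestamp, so time is continuous by construction. A minimal generator sketch, assuming the third-party torchdiffeq package and a placeholder frame decoder:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # third-party: pip install torchdiffeq

class LatentDynamics(nn.Module):
    """dz/dt = f(t, z): learned continuous-time dynamics of the latent code."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, z):
        return self.f(z)

def generate_video(z0, dynamics, decoder, num_frames=16):
    # Frames are just sample points along one continuous latent trajectory.
    t = torch.linspace(0.0, 1.0, num_frames)
    zs = odeint(dynamics, z0, t)                  # (T, B, latent_dim)
    return torch.stack([decoder(z) for z in zs])  # (T, B, C, H, W)
```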
45. Deep traffic light detection by overlaying synthetic context on arbitrary natural images [PDF] [Back to contents]
Jean Pablo Vieira de Mello, Lucas Tabelini, Rodrigo F. Berriel, Thiago M. Paixão, Alberto F. de Souza, Claudine Badue, Nicu Sebe, Thiago Oliveira-Santos
Abstract: Deep neural networks come as an effective solution to many problems associated with autonomous driving. By providing real image samples with traffic context to the network, the model learns to detect and classify elements of interest, such as pedestrians, traffic signs, and traffic lights. However, acquiring and annotating real data can be extremely costly in terms of time and effort. In this context, we propose a method to generate artificial traffic-related training data for deep traffic light detectors. This data is generated using basic non-realistic computer graphics to blend fake traffic scenes on top of arbitrary image backgrounds that are not related to the traffic domain. Thus, a large amount of training data can be generated without annotation efforts. Furthermore, it also tackles the intrinsic data imbalance problem in traffic light datasets, caused mainly by the low amount of samples of the yellow state. Experiments show that it is possible to achieve results comparable to those obtained with real training data from the problem domain, yielding an average mAP and an average F1-score which are each nearly 4 p.p. higher than the respective metrics obtained with a real-world reference model.
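The data-generation step reduces to compositing: take a traffic-light sprite, paste it at a random position and scale onto an arbitrary background image, and emit the paste rectangle as the bounding-box label, so annotation comes for free. A small PIL sketch with hypothetical file paths:

```python
import random
from PIL import Image

def make_sample(background_path, sprite_path):
    """Paste a traffic-light sprite onto an unrelated background; the box is the label."""
    bg = Image.open(background_path).convert("RGB")
    sprite = Image.open(sprite_path).convert("RGBA")   # alpha channel for clean edges
    h = max(8, int(bg.height * random.uniform(0.05, 0.15)))
    w = max(4, int(sprite.width * h / sprite.height))  # keep the sprite's aspect ratio
    sprite = sprite.resize((w, h))
    x = random.randint(0, bg.width - w)
    y = random.randint(0, bg.height - h)
    bg.paste(sprite, (x, y), sprite)                   # alpha-composited paste
    return bg, (x, y, x + w, y + h)                    # image and bounding-box label

# Hypothetical paths: any photo serves as background, by design unrelated to traffic.
image, box = make_sample("backgrounds/any_photo.jpg", "sprites/green_light.png")
```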
46. On the spatial attention in Spatio-Temporal Graph Convolutional Networks for skeleton-based human action recognition [PDF] [Back to contents]
Negar Heidari, Alexandros Iosifidis
Abstract: Graph convolutional networks (GCNs) achieved promising performance in skeleton-based human action recognition by modeling a sequence of skeletons as a spatio-temporal graph. Most of the recently proposed GCN-based methods improve the performance by learning the graph structure at each layer of the network using a spatial attention applied on a predefined graph Adjacency matrix that is optimized jointly with model's parameters in an end-to-end manner. In this paper, we analyze the spatial attention used in spatio-temporal GCN layers and propose a symmetric spatial attention for better reflecting the symmetric property of the relative positions of the human body joints when executing actions. We also highlight the connection of spatio-temporal GCN layers employing additive spatial attention to bilinear layers, and we propose the spatio-temporal bilinear network (ST-BLN) which does not require the use of predefined Adjacency matrices and allows for more flexible design of the model. Experimental results show that the three models lead to effectively the same performance. Moreover, by exploiting the flexibility provided by the proposed ST-BLN, one can increase the efficiency of the model.
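The symmetry constraint itself is one line: the learned attention offset added to the adjacency matrix is replaced by its symmetric part, so joint i influences joint j exactly as j influences i. A minimal sketch of the constraint (a single graph layer, not the full ST-BLN):

```python
import torch
import torch.nn as nn

class SymmetricGraphLayer(nn.Module):
    """Graph convolution over joints with a learned, symmetrized attention offset."""
    def __init__(self, num_joints, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)  # predefined skeleton graph
        self.M = nn.Parameter(torch.zeros(num_joints, num_joints))  # learned attention
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                      # x: (B, J, in_dim)
        sym = 0.5 * (self.M + self.M.t())      # enforce symmetric attention
        return torch.relu((self.A + sym) @ self.proj(x))

A = torch.eye(25)  # placeholder adjacency for a 25-joint skeleton
layer = SymmetricGraphLayer(25, 3, 64, A)
print(layer(torch.randn(8, 25, 3)).shape)  # torch.Size([8, 25, 64])
```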
47. Towards Resolving the Challenge of Long-tail Distribution in UAV Images for Object Detection [PDF] [Back to contents]
Weiping Yu, Taojiannan Yang, Chen Chen
Abstract: Existing methods for object detection in UAV images ignore an important challenge - imbalanced class distribution in UAV images - which leads to poor performance on tail classes. We systematically investigate existing solutions to long-tail problems and unveil that re-balancing methods that are effective on natural image datasets cannot be trivially applied to UAV datasets. To this end, we rethink long-tailed object detection in UAV images and propose the Dual Sampler and Head detection Network (DSHNet), which is the first work that aims to resolve long-tail distribution in UAV images. The key components in DSHNet include Class-Biased Samplers (CBS) and Bilateral Box Heads (BBH), which are developed to cope with tail classes and head classes in a dual-path manner. Without bells and whistles, DSHNet significantly boosts the performance of tail classes on different detection frameworks. Moreover, DSHNet significantly outperforms base detectors and generic approaches for long-tail problems on the VisDrone and UAVDT datasets. It achieves new state-of-the-art performance when combined with image cropping methods. Code is available at this https URL
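A class-biased sampler builds on standard re-weighted sampling: alongside a uniform loader that naturally favors head classes, a second loader draws items with inverse-frequency weights so tail classes surface just as often. The sketch below is a generic single-label simplification, not the DSHNet sampler.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def tail_biased_loader(dataset, labels, batch_size=16):
    """Sample inversely to class frequency so tail classes appear as often as head ones.

    `labels` holds one class id per dataset item (a simplification: detection images
    carry several boxes, so a per-image class assignment is assumed here).
    """
    freq = Counter(labels)
    weights = torch.tensor([1.0 / freq[y] for y in labels], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# A plain uniform loader alongside it completes the dual-path setup:
# head_loader = DataLoader(dataset, batch_size=16, shuffle=True)
```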
48. Sim-to-Real Transfer for Vision-and-Language Navigation [PDF] [Back to contents]
Peter Anderson, Ayush Shrivastava, Joanne Truong, Arjun Majumdar, Devi Parikh, Dhruv Batra, Stefan Lee
Abstract: We study the challenging problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions. Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation. To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot. To bridge the gap between the high-level discrete action space learned by the VLN agent, and the robot's low-level continuous action space, we propose a subgoal model to identify nearby waypoints, and use domain randomization to mitigate visual domain differences. For accurate sim and real comparisons in parallel environments, we annotate a 325m² office space with 1.3km of navigation instructions, and create a digitized replica in simulation. We find that sim-to-real transfer to an environment not seen in training is successful if an occupancy map and navigation graph can be collected and annotated in advance (success rate of 46.8% vs. 55.9% in sim), but much more challenging in the hardest setting with no prior mapping at all (success rate of 22.5%).
49. Symmetric Parallax Attention for Stereo Image Super-Resolution [PDF] [Back to contents]
Yingqian Wang, Xinyi Ying, Longguang Wang, Jungang Yang, Wei An, Yulan Guo
Abstract: Although recent years have witnessed the great advances in stereo image super-resolution (SR), the beneficial information provided by binocular systems has not been fully used. Since stereo images are highly symmetric under epipolar constraint, in this paper, we improve the performance of stereo image SR by exploiting symmetry cues in stereo image pairs. Specifically, we propose a symmetric bi-directional parallax attention module (biPAM) and an inline occlusion handling scheme to effectively interact cross-view information. Then, we design a Siamese network equipped with a biPAM to super-resolve both sides of views in a highly symmetric manner. Finally, we design several illuminance-robust bilateral losses to enforce stereo consistency. Experiments on four public datasets have demonstrated the superiority of our method. As compared to PASSRnet, our method achieves notable performance improvements with a comparable model size. Source codes are available at this https URL.
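Under the rectified-stereo epipolar constraint, correspondences lie on the same image row, so cross-view attention only needs to run along the width axis. A stripped-down, one-direction parallax attention (geometry only; biPAM's symmetry and occlusion handling are omitted) can be written as:

```python
import torch

def parallax_attention(feat_left, feat_right):
    """Row-wise cross attention between rectified stereo features of shape (B, C, H, W).

    Every left-image position attends over all right-image positions in the
    same row, i.e. along its epipolar line.
    """
    B, C, H, W = feat_left.shape
    q = feat_left.permute(0, 2, 3, 1).reshape(B * H, W, C)   # queries, row by row
    k = feat_right.permute(0, 2, 1, 3).reshape(B * H, C, W)  # keys for the same rows
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)           # (B*H, W, W) parallax map
    v = feat_right.permute(0, 2, 3, 1).reshape(B * H, W, C)
    out = (attn @ v).reshape(B, H, W, C).permute(0, 3, 1, 2)
    return out, attn

left, right = torch.randn(2, 16, 32, 48), torch.randn(2, 16, 32, 48)
out, attn = parallax_attention(left, right)
print(out.shape, attn.shape)  # (2, 16, 32, 48) and (64, 48, 48)
```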
50. Deep Learning Analysis and Age Prediction from Shoeprints [PDF] [Back to contents]
Muhammad Hassan, Yan Wang, Di Wang, Daixi Li, Yanchun Liang, You Zhou, Dong Xu
Abstract: Human walking and gaits involve several complex body parts and are influenced by personality, mood, social and cultural traits, and aging. These factors are reflected in shoeprints, which in turn can be used to predict age, a problem not systematically addressed using any computational approach. We collected 100,000 shoeprints of subjects ranging from 7 to 80 years old and used the data to develop a deep learning end-to-end model ShoeNet to analyze age-related patterns and predict age. The model integrates various convolutional neural network models together using a skip mechanism to extract age-related features, especially in pressure and abrasion regions from pair-wise shoeprints. The results show that 40.23% of the subjects had prediction errors within 5-years of age and the prediction accuracy for gender classification reached 86.07%. Interestingly, the age-related features mostly reside in the asymmetric differences between left and right shoeprints. The analysis also reveals interesting age-related and gender-related patterns in the pressure distributions on shoeprints; in particular, the pressure forces spread from the middle of the toe toward outside regions over age with gender-specific variations on heel regions. Such statistics provide insight into new methods for forensic investigations, medical studies of gait-pattern disorders, biometrics, and sport studies.
51. Rapid Pose Label Generation through Sparse Representation of Unknown Objects [PDF] 返回目录
Rohan Pratap Singh, Mehdi Benallegue, Yusuke Yoshiyasu, Fumio Kanehiro
Abstract: Deep Convolutional Neural Networks (CNNs) have been successfully deployed on robots for 6-DoF object pose estimation through visual perception. However, obtaining labeled data on a scale required for the supervised training of CNNs is a difficult task - exacerbated if the object is novel and a 3D model is unavailable. To this end, this work presents an approach for rapidly generating real-world, pose-annotated RGB-D data for unknown objects. Our method not only circumvents the need for a prior 3D object model (textured or otherwise) but also bypasses complicated setups of fiducial markers, turntables, and sensors. With the help of a human user, we first source minimalistic labelings of an ordered set of arbitrarily chosen keypoints over a set of RGB-D videos. Then, by solving an optimization problem, we combine these labels under a world frame to recover a sparse, keypoint-based representation of the object. The sparse representation leads to the development of a dense model and the pose labels for each image frame in the set of scenes. We show that the sparse model can also be efficiently used for scaling to a large number of new scenes. We demonstrate the practicality of the generated labeled dataset by training a pipeline for 6-DoF object pose estimation and a pixel-wise segmentation network.
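The geometric core of this label-generation pipeline can be illustrated with a simplified sketch: assuming known camera intrinsics `K` and per-frame world-to-camera poses (an assumption for this example; the paper solves a joint optimization over all labels), a keypoint clicked at pixel locations across several frames can be recovered in 3D by linear (DLT) triangulation.

```python
# A minimal sketch of recovering one sparse 3D keypoint from several
# 2D annotations; `poses` and `pixels` are hypothetical inputs.
import numpy as np

def triangulate(K, poses, pixels):
    """K: 3x3 intrinsics; poses: list of (R, t); pixels: list of (u, v)."""
    rows = []
    for (R, t), (u, v) in zip(poses, pixels):
        P = K @ np.hstack([R, t.reshape(3, 1)])  # 3x4 projection matrix
        rows.append(u * P[2] - P[0])             # standard DLT constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)                  # least-squares null vector
    X = vt[-1]
    return X[:3] / X[3]                          # dehomogenized 3D point
```

Repeating this over the ordered keypoint set yields the sparse representation from which the dense model and per-frame pose labels are then derived.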
52. Text-to-Image Generation Grounded by Fine-Grained User Attention [PDF] 返回目录
Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang
Abstract: Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces that provide a sparse, fine-grained visual grounding for phrases. We propose TReCS, a sequential model that exploits this grounding to generate images. TReCS uses descriptions to retrieve segmentation masks and predict object labels aligned with mouse traces. These alignments are used to select and position masks to generate a fully covered segmentation canvas; the final image is produced by a segmentation-to-image generator using this canvas. This multi-step, retrieval-based approach outperforms existing direct text-to-image generation models on both automatic metrics and human evaluations: overall, its generated images are more photo-realistic and better match descriptions.
53. A Multi-stream Convolutional Neural Network for Micro-expression Recognition Using Optical Flow and EVM [PDF] 返回目录
Jinming Liu, Ke Li, Baolin Song, Li Zhao
Abstract: Micro-expression (ME) recognition plays a crucial role in a wide range of applications, particularly in public security and psychotherapy. Traditional methods rely excessively on machine learning design, and because MEs are short in duration and low in intensity, the recognition rate is not high enough for practical application. On the other hand, some methods based on deep learning also cannot achieve high accuracy due to problems such as the imbalance of databases. To address these problems, we design a multi-stream convolutional neural network (MSCNN) for ME recognition in this paper. Specifically, we employ EVM and optical flow to magnify and visualize subtle movement changes in MEs and extract masks from the optical flow images. We then feed the masks, optical flow images, and grayscale images into the MSCNN. After that, in order to overcome the imbalance of databases, we add a random over-sampler after the dense layer of the neural network. Finally, extensive experiments are conducted on two public ME databases: CASME II and SAMM. Compared with many recent state-of-the-art approaches, our method achieves more promising recognition results.
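As a rough, hypothetical sketch of the multi-stream layout (stream depths, channel counts, and the class head are assumptions, and the over-sampling step is omitted), each input modality gets its own convolutional stream before fusion:

```python
# A minimal sketch of a three-stream CNN over flow, mask, and grayscale inputs.
import torch
import torch.nn as nn

def make_stream(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class MultiStreamNet(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.flow_stream = make_stream(2)   # (u, v) optical flow
        self.mask_stream = make_stream(1)   # mask from the flow image
        self.gray_stream = make_stream(1)   # grayscale frame
        self.classifier = nn.Linear(3 * 32, num_classes)

    def forward(self, flow, mask, gray):
        fused = torch.cat([self.flow_stream(flow),
                           self.mask_stream(mask),
                           self.gray_stream(gray)], dim=1)
        return self.classifier(fused)
```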
54. A Strong Baseline for Crowd Counting and Unsupervised People Localization [PDF] 返回目录
Liangzi Rong, Chunping Li
Abstract: In this paper, we explore a strong baseline for crowd counting and an unsupervised people localization algorithm based on estimated density maps. Firstly, existing methods achieve state-of-the-art performance based on different backbones and various training tricks. We collect different backbones and training tricks, evaluate the impact of changing them, and develop an efficient pipeline for crowd counting which decreases MAE and RMSE significantly on multiple datasets. We also propose a clustering algorithm named isolated KMeans to locate the heads in density maps. This method can divide the density maps into subregions and find the centers under local count constraints without training any parameters, and it can be integrated with existing methods easily.
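The localization algorithm is concrete enough to sketch directly: threshold the density map into isolated blobs, take each blob's integrated density as its local count, and run KMeans with that many clusters inside the blob. The threshold value and density weighting below are illustrative assumptions rather than the paper's exact settings.

```python
# A minimal sketch of isolated, count-constrained KMeans on a density map.
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def localize_heads(density, thresh=1e-3):
    mask = density > thresh                       # isolate crowd blobs
    labels, n_regions = ndimage.label(mask)
    centers = []
    for r in range(1, n_regions + 1):
        ys, xs = np.nonzero(labels == r)
        k = int(round(density[ys, xs].sum()))     # local count constraint
        if k < 1:
            continue
        pts = np.stack([ys, xs], axis=1).astype(float)
        w = density[ys, xs]                       # weight pixels by density
        km = KMeans(n_clusters=min(k, len(pts)), n_init=10).fit(pts, sample_weight=w)
        centers.extend(km.cluster_centers_.tolist())
    return centers                                # (y, x) head locations
```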
55. Coarse- and Fine-grained Attention Network with Background-aware Loss for Crowd Density Map Estimation [PDF] 返回目录
Liangzi Rong, Chunping Li
Abstract: In this paper, we present a novel method Coarse- and Fine-grained Attention Network (CFANet) for generating high-quality crowd density maps and people count estimation by incorporating attention maps to better focus on the crowd area. We devise a from-coarse-to-fine progressive attention mechanism by integrating Crowd Region Recognizer (CRR) and Density Level Estimator (DLE) branch, which can suppress the influence of irrelevant background and assign attention weights according to the crowd density levels, because generating accurate fine-grained attention maps directly is normally difficult. We also employ a multi-level supervision mechanism to assist the backpropagation of gradient and reduce overfitting. Besides, we propose a Background-aware Structural Loss (BSL) to reduce the false recognition ratio while improving the structural similarity to groundtruth. Extensive experiments on commonly used datasets show that our method can not only outperform previous state-of-the-art methods in terms of count accuracy but also improve the image quality of density maps as well as reduce the false recognition ratio.
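The abstract does not give the exact form of BSL, but the underlying idea, penalizing density predicted on background pixels separately from the ordinary pixel loss, can be sketched as follows; the false-positive term and its weight are assumptions for illustration only.

```python
# A heavily hedged sketch of a background-aware loss: density predicted
# where the ground truth is empty is penalized extra, discouraging false
# recognition of background as crowd. Not the paper's exact BSL.
import torch
import torch.nn.functional as F

def background_aware_loss(pred, target, bg_weight=2.0, eps=1e-6):
    bg_mask = (target < eps).float()              # pixels with no annotated crowd
    pixel_loss = F.mse_loss(pred, target)
    false_pos = (pred.clamp(min=0) * bg_mask).mean()
    return pixel_loss + bg_weight * false_pos
```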
56. DeepCFL: Deep Contextual Features Learning from a Single Image [PDF] 返回目录
Indra Deep Mastan, Shanmuganathan Raman
Abstract: Recently, there has been vast interest in developing image feature learning methods that are independent of the training data, such as deep image prior, InGAN, SinGAN, and DCIL. These methods are unsupervised and are used to perform low-level vision tasks such as image restoration, image editing, and image synthesis. In this work, we propose a new training-data-independent framework, called Deep Contextual Features Learning (DeepCFL), to perform image synthesis and image restoration based on the semantics of the input image. The contextual features are simply the high-dimensional vectors representing the semantics of the given image. DeepCFL is a single-image GAN framework that learns the distribution of the context vectors from the input image. We show the performance of contextual learning in various challenging scenarios: outpainting, inpainting, and restoration of randomly removed pixels. DeepCFL is applicable even when the input source image and the generated target image are not aligned. We illustrate image synthesis using DeepCFL for the task of image resizing.
57. Blind Motion Deblurring through SinGAN Architecture [PDF] 返回目录
Harshil Jain, Rohit Patil, Indra Deep Mastan, Shanmuganathan Raman
Abstract: Blind motion deblurring involves reconstructing a sharp image from a blurry observation. It is an ill-posed problem that falls into the category of image restoration problems. Training-data-based methods for image deblurring mostly involve training models that take a lot of time. These models are data-hungry, i.e., they require a lot of training data to generate satisfactory results. Recently, various image feature learning methods have been developed that relieve us of the need for training data and perform image restoration and image synthesis, e.g., DIP, InGAN, and SinGAN. SinGAN is an unconditional generative model that can be learned from a single natural image. This model primarily captures the internal distribution of the patches present in the image and is capable of generating samples of varied diversity while preserving the visual content of the image. Images generated from the model are very much like real natural images. In this paper, we focus on blind motion deblurring through the SinGAN architecture.
58. TB-Net: A Three-Stream Boundary-Aware Network for Fine-Grained Pavement Disease Segmentation [PDF] 返回目录
Yujia Zhang, Qianzhong Li, Xiaoguang Zhao, Min Tan
Abstract: Regular pavement inspection plays a significant role in road maintenance for safety assurance. Existing methods mainly address the tasks of crack detection and segmentation that are only tailored for long-thin crack disease. However, there are many other types of diseases with a wider variety of sizes and patterns that are also essential to segment in practice, bringing more challenges towards fine-grained pavement inspection. In this paper, our goal is not only to automatically segment cracks, but also to segment other complex pavement diseases as well as typical landmarks (markings, runway lights, etc.) and commonly seen water/oil stains in a single model. To this end, we propose a three-stream boundary-aware network (TB-Net). It consists of three streams fusing the low-level spatial and the high-level contextual representations as well as the detailed boundary information. Specifically, the spatial stream captures rich spatial features. The context stream, where an attention mechanism is utilized, models the contextual relationships over local features. The boundary stream learns detailed boundaries using a global-gated convolution to further refine the segmentation outputs. The network is trained using a dual-task loss in an end-to-end manner, and experiments on a newly collected fine-grained pavement disease dataset show the effectiveness of our TB-Net.
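As a loose illustration of a gated refinement step such as a boundary stream might use (this is the generic gated-convolution idea, not the exact TB-Net block), a gate computed from context features can decide where boundary features are allowed to refine the output:

```python
# A minimal sketch of context-gated boundary refinement.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, boundary_feat, context_feat):
        g = self.gate(torch.cat([boundary_feat, context_feat], dim=1))
        return self.refine(boundary_feat * g)  # pass boundary cues where the gate is open
```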
59. Depthwise Multiception Convolution for Reducing Network Parameters without Sacrificing Accuracy [PDF] 返回目录
Guoqing Bao, Manuel B. Graeber, Xiuying Wang
Abstract: Deep convolutional neural networks have proven successful in multiple benchmark challenges in recent years. However, the performance improvements rely heavily on increasingly complex network architectures and high numbers of parameters, which require ever-increasing amounts of storage and memory capacity. Depthwise separable convolution (DSConv) can effectively reduce the number of required parameters by decoupling standard convolution into spatial and cross-channel convolution steps. However, the method causes a degradation of accuracy. To address this problem, we present depthwise multiception convolution, termed Multiception, which introduces layer-wise multiscale kernels to learn multiscale representations of all individual input channels simultaneously. We carried out experiments on four benchmark datasets, i.e., Cifar-10, Cifar-100, STL-10 and ImageNet32x32, using five popular CNN models; Multiception achieved accuracy improvements in all models and demonstrated higher accuracy than related works. Meanwhile, Multiception significantly reduces the number of parameters of standard convolution-based models by 32.48% on average while still preserving accuracy.
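The described operation maps naturally to code: run several depthwise convolutions with different kernel sizes over the same input, then mix channels with a pointwise convolution. The kernel-size set below is an assumption for illustration.

```python
# A minimal sketch of multi-kernel depthwise convolution followed by
# pointwise channel mixing, in the spirit of the described Multiception.
import torch
import torch.nn as nn

class MulticeptionConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)  # depthwise
            for k in kernel_sizes
        ])
        self.pointwise = nn.Conv2d(in_ch * len(kernel_sizes), out_ch, 1)

    def forward(self, x):
        multiscale = torch.cat([b(x) for b in self.branches], dim=1)
        return self.pointwise(multiscale)       # cross-channel mixing
```

Depthwise branches keep the per-scale filtering cheap, and the pointwise convolution handles cross-channel mixing, as in depthwise separable convolution.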
60. Domain-Aware Unsupervised Hyperspectral Reconstruction for Aerial Image Dehazing [PDF] 返回目录
Aditya Mehta, Harsh Sinha, Murari Mandal, Pratik Narang
Abstract: Haze removal in aerial images is a challenging problem due to considerable variation in spatial details and varying contrast. Changes in particulate matter density often lead to degradation in visibility. Therefore, several approaches utilize multi-spectral data as auxiliary information for haze removal. In this paper, we propose SkyGAN for haze removal in aerial images. SkyGAN consists of 1) a domain-aware hazy-to-hyperspectral (H2H) module, and 2) a conditional GAN (cGAN) based multi-cue image-to-image translation module (I2I) for dehazing. The proposed H2H module reconstructs several visual bands from RGB images in an unsupervised manner, which overcomes the lack of hazy hyperspectral aerial image datasets. The module utilizes task supervision and domain adaptation in order to create a "hyperspectral catalyst" for image dehazing. The I2I module uses the hyperspectral catalyst along with a 12-channel multi-cue input and performs effective image dehazing by utilizing the entire visual spectrum. In addition, this work introduces a new dataset, called Hazy Aerial-Image (HAI) dataset, that contains more than 65,000 pairs of hazy and ground truth aerial images with realistic, non-homogeneous haze of varying density. The performance of SkyGAN is evaluated on the recent SateHaze1k dataset as well as the HAI dataset. We also present a comprehensive evaluation of HAI dataset with a representative set of state-of-the-art techniques in terms of PSNR and SSIM.
61. Identifying Mislabeled Images in Supervised Learning Utilizing Autoencoder [PDF] 返回目录
Yunhao Yang
Abstract: Supervised learning is based on the assumption that the ground truth in the training data is accurate. However, this may not be guaranteed in real-world settings. Inaccurate training data will result in some unexpected predictions. In image classification, incorrect labels may cause the classification model to be inaccurate as well. In this paper, I apply unsupervised techniques to the training data before training the classification network. A convolutional autoencoder is applied to encode and reconstruct images. The encoder projects the image data onto a latent space. In the latent space, image features are preserved in a lower dimension. Samples with similar features are likely to have the same label. Noisy samples can be classified in the latent space. Image projections in the feature space are compared with their neighbors. Images whose labels differ from those of their neighbors are considered incorrectly labeled data. These incorrectly labeled data are visualized as outliers in the latent space. After incorrect labels are detected, I remove the samples with incorrect labels. The training data can then be directly used to train the supervised learning network.
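The detection step can be sketched concretely: encode all images with the trained encoder, then flag a sample whenever its label disagrees with the majority label among its latent-space nearest neighbors. The neighbor count and the majority rule are assumptions; the paper visualizes outliers in the latent space rather than prescribing this exact rule.

```python
# A minimal sketch of neighbor-based mislabel detection in latent space.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_mislabeled(latents, labels, k=10):
    """latents: (n, d) encoder outputs; labels: (n,) integer class labels."""
    nn_index = NearestNeighbors(n_neighbors=k + 1).fit(latents)
    _, idx = nn_index.kneighbors(latents)       # idx[:, 0] is the sample itself
    flags = []
    for i, neighbors in enumerate(idx[:, 1:]):
        majority = np.bincount(labels[neighbors]).argmax()
        flags.append(labels[i] != majority)     # outlier w.r.t. its neighborhood
    return np.array(flags)
```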
62. ROBIN: a Graph-Theoretic Approach to Reject Outliers in Robust Estimation using Invariants [PDF] 返回目录
Jingnan Shi, Heng Yang, Luca Carlone
Abstract: Many estimation problems in robotics, computer vision, and learning require estimating unknown quantities in the face of outliers. Outliers are typically the result of incorrect data association or feature matching, and it is common to have problems where more than 90% of the measurements used for estimation are outliers. While current approaches for robust estimation are able to deal with moderate amounts of outliers, they fail to produce accurate estimates in the presence of many outliers. This paper develops an approach to prune outliers. First, we develop a theory of invariance that allows us to quickly check if a subset of measurements are mutually compatible without explicitly solving the estimation problem. Second, we develop a graph-theoretic framework, where measurements are modeled as vertices and mutual compatibility is captured by edges. We generalize existing results showing that the inliers form a clique in this graph and typically belong to the maximum clique. We also show that in practice the maximum k-core of the compatibility graph provides an approximation of the maximum clique, while being faster to compute in large problems. These two contributions lead to ROBIN, our approach to Reject Outliers Based on INvariants, which allows us to quickly prune outliers in generic estimation problems. We demonstrate ROBIN in four geometric perception problems and show it boosts robustness of existing solvers while running in milliseconds in large problems.
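A minimal sketch of the graph-theoretic core, with `compatible` standing in for the paper's invariant-based pairwise check: measurements become vertices, compatible pairs become edges, and the maximum k-core is returned as a fast approximation of the maximum clique of mutual inliers.

```python
# A minimal sketch of outlier pruning via the maximum k-core of a
# compatibility graph; the pairwise predicate is a caller-supplied stand-in.
import networkx as nx

def prune_outliers(measurements, compatible):
    """`compatible(a, b)` returns True if two measurements can both be inliers."""
    g = nx.Graph()
    g.add_nodes_from(range(len(measurements)))
    for i in range(len(measurements)):
        for j in range(i + 1, len(measurements)):
            if compatible(measurements[i], measurements[j]):
                g.add_edge(i, j)
    core = nx.core_number(g)                  # largest k-core each node belongs to
    k_max = max(core.values())
    return [i for i, k in core.items() if k == k_max]   # max k-core members
```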
63. Augmented Equivariant Attention Networks for Electron Microscopy Image Super-Resolution [PDF] 返回目录
Yaochen Xie, Yu Ding, Shuiwang Ji
Abstract: Taking electron microscopy (EM) images at high resolution is time-consuming and expensive and could be detrimental to the integrity of the samples under observation. Advances in deep learning enable us to perform super-resolution computationally, so as to obtain high-resolution images from low-resolution ones. When training super-resolution models on pairs of experimentally acquired EM images, prior models suffer from performance loss while using the pooled-training strategy due to their inability to capture inter-image dependencies and common features shared among images. Although there exist methods that take advantage of shared features among input instances in image classification tasks, they in the current form cannot be applied to super-resolution tasks because they fail to preserve an essential property in image-to-image transformation problems, namely the equivariance property to spatial permutations. To address these limitations, we propose the augmented equivariant attention networks (AEANets) with better capability to capture inter-image dependencies and shared features, while preserving the equivariance to spatial permutations. The proposed AEANets capture inter-image dependencies and common features shared among images via two augmentations of the attention mechanism, namely the shared references and the batch-aware attention during training. We theoretically show the equivariance property of the proposed augmented attention model and experimentally show that AEANets consistently outperform the baselines in both quantitative and visual results.
64. Efficient Robust Watermarking Based on Quaternion Singular Value Decomposition and Coefficient Pair Selection [PDF] 返回目录
Yong Chen, Zhi-Gang Jia, Ya-Xin Peng, Yan Peng
Abstract: Quaternion singular value decomposition (QSVD) is a robust technique of digital watermarking which can extract high quality watermarks from watermarked images with low distortion. In this paper, QSVD technique is further investigated and an efficient robust watermarking scheme is proposed. The improved algebraic structure-preserving method is proposed to handle the problem of "explosion of complexity" occurred in the conventional QSVD design. Secret information is transmitted blindly by incorporating in QSVD two new strategies, namely, coefficient pair selection and adaptive embedding. Unlike conventional QSVD which embeds watermarks in a single imaginary unit, we propose to adaptively embed the watermark into the optimal hiding position using the Normalized Cross-Correlation (NC) method. This avoids the selection of coefficient pair with less correlation, and thus, it reduces embedding impact by decreasing the maximum modification of coefficient values. In this way, compared with conventional QSVD, the proposed watermarking strategy avoids more modifications to a single color image layer and a better visual quality of the watermarked image is observed. Meanwhile, adaptive QSVD resists some common geometric attacks, and it improves the robustness of conventional QSVD. With these improvements, our method outperforms conventional QSVD. Its superiority over other state-of-the-art methods is also demonstrated experimentally.
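One common definition of the normalized correlation (NC) score used to rank candidate hiding positions can be sketched as follows; how the score drives coefficient-pair selection is paraphrased from the abstract, and the quaternion SVD embedding itself is omitted.

```python
# A minimal sketch of an NC score between two equally sized arrays.
import numpy as np

def nc(a, b):
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Positions whose local content correlates most strongly with the reference would then be preferred for embedding, avoiding coefficient pairs with low correlation and so reducing the maximum modification needed.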
65. A Weakly Supervised Convolutional Network for Change Segmentation and Classification [PDF] 返回目录
Philipp Andermatt, Radu Timofte
Abstract: Fully supervised change detection methods require difficult to procure pixel-level labels, while weakly supervised approaches can be trained with image-level labels. However, most of these approaches require a combination of changed and unchanged image pairs for training. Thus, these methods can not directly be used for datasets where only changed image pairs are available. We present W-CDNet, a novel weakly supervised change detection network that can be trained with image-level semantic labels. Additionally, W-CDNet can be trained with two different types of datasets, either containing changed image pairs only or a mixture of changed and unchanged image pairs. Since we use image-level semantic labels for training, we simultaneously create a change mask and label the changed object for single-label images. W-CDNet employs a W-shaped siamese U-net to extract feature maps from an image pair which then get compared in order to create a raw change mask. The core part of our model, the Change Segmentation and Classification (CSC) module, learns an accurate change mask at a hidden layer by using a custom Remapping Block and then segmenting the current input image with the change mask. The segmented image is used to predict the image-level semantic label. The correct label can only be predicted if the change mask actually marks relevant change. This forces the model to learn an accurate change mask. We demonstrate the segmentation and classification performance of our approach and achieve top results on AICD and HRSCD, two public aerial imaging change detection datasets as well as on a Food Waste change detection dataset. Our code is available at this https URL .
66. Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze [PDF] 返回目录
Ece Takmaz, Sandro Pezzelle, Lisa Beinborn, Raquel Fernández
Abstract: When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.
67. MPRNet: Multi-Path Residual Network for Lightweight Image Super Resolution [PDF] 返回目录
Armin Mehri, Parichehr B.Ardakani, Angel D.Sappa
Abstract: Lightweight super-resolution networks are extremely important for real-world applications. In recent years, several deep learning SR approaches with outstanding achievements have been introduced, at the price of memory and computational cost. To overcome this problem, a novel lightweight super-resolution network is proposed, which improves the SOTA performance in lightweight SR while performing roughly on par with computationally expensive networks. The Multi-Path Residual Network is designed with a set of Residual Concatenation Blocks stacked with Adaptive Residual Blocks: (i) to adaptively extract informative features and learn more expressive spatial context information; (ii) to better leverage multi-level representations before the up-sampling stage; and (iii) to allow an efficient information and gradient flow within the network. The proposed architecture also contains a new attention mechanism, the Two-Fold Attention Module, to maximize the representation ability of the model. Extensive experiments show the superiority of our model against other SOTA SR approaches.
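As a rough sketch of an adaptive residual block in the spirit of the abstract (the exact block design is not specified there, so the details below are assumptions), the skip connection can carry a learnable scale so each block weights its own residual branch:

```python
# A minimal sketch of a residual block with a learned residual weight.
import torch
import torch.nn as nn

class AdaptiveResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.scale = nn.Parameter(torch.ones(1))   # learned residual weight

    def forward(self, x):
        return x + self.scale * self.body(x)
```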
68. Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts [PDF]
Ece Takmaz, Mario Giulianelli, Sandro Pezzelle, Arabella Sinclair, Raquel Fernández
Abstract: Dialogue participants often refer to entities or situations repeatedly within a conversation, which contributes to its cohesiveness. Subsequent references exploit the common ground accumulated by the interlocutors and hence have several interesting properties, namely, they tend to be shorter and reuse expressions that were effective in previous mentions. In this paper, we tackle the generation of first and subsequent references in visually grounded dialogue. We propose a generation model that produces referring utterances grounded in both the visual and the conversational context. To assess the referring effectiveness of its output, we also implement a reference resolution system. Our experiments and analyses show that the model produces better, more effective referring utterances than a model not grounded in the dialogue context, and generates subsequent references that exhibit linguistic patterns akin to humans.
69. Learning to Localize in New Environments from Synthetic Training Data [PDF]
Dominik Winkelbauer, Maximilian Denninger, Rudolph Triebel
Abstract: Most existing approaches for visual localization either need a detailed 3D model of the environment or, in the case of learning-based methods, must be retrained for each new scene. This can either be very expensive or simply impossible for large, unknown environments, for example in search-and-rescue scenarios. Although there are learning-based approaches that operate scene-agnostically, the generalization capability of these methods is still outperformed by classical approaches. In this paper, we present an approach that can generalize to new scenes by applying specific changes to the model architecture, including an extended regression part, the use of hierarchical correlation layers, and the exploitation of scale and uncertainty information. Our approach outperforms the 5-point algorithm using SIFT features on equally big images and additionally surpasses all previous learning-based approaches that were trained on different data. It is also superior to most of the approaches that were specifically trained on the respective scenes. We also evaluate our approach in a scenario where only very few reference images are available, showing that under such more realistic conditions our learning-based approach considerably exceeds both existing learning-based and classical methods.
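The hierarchical correlation layers mentioned above compare feature maps from two views within a local search window. Below is a minimal, self-contained sketch of such a correlation (cost-volume) layer in PyTorch; the window size and the mean-over-channels normalization are assumptions rather than the paper's settings.

```python
import torch

def correlation_layer(feat_a: torch.Tensor, feat_b: torch.Tensor, max_disp: int = 4) -> torch.Tensor:
    """Dense correlation between two (B, C, H, W) feature maps within a
    (2*max_disp+1)^2 search window, returning a cost volume."""
    b, c, h, w = feat_a.shape
    pad = max_disp
    feat_b = torch.nn.functional.pad(feat_b, (pad, pad, pad, pad))
    vols = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = feat_b[:, :, dy:dy + h, dx:dx + w]
            # channel-averaged dot product for one displacement hypothesis
            vols.append((feat_a * shifted).mean(dim=1, keepdim=True))
    return torch.cat(vols, dim=1)  # (B, (2d+1)^2, H, W)
```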
70. Towards a quantitative assessment of neurodegeneration in Alzheimer's disease [PDF]
Oleg Michailovich, Rinat Mukhometzianov
Abstract: Alzheimer's disease (AD) is an irreversible neurodegenerative disorder that progressively destroys memory and other cognitive domains of the brain. While effective therapeutic management of AD is still in development, it seems reasonable to expect their prospective outcomes to depend on the severity of baseline pathology. For this reason, substantial research efforts have been invested in the development of effective means of non-invasive diagnosis of AD at its earliest possible stages. In pursuit of the same objective, the present paper addresses the problem of the quantitative diagnosis of AD by means of Diffusion Magnetic Resonance Imaging (dMRI). In particular, the paper introduces the notion of a pathology specific imaging contrast (PSIC), which, in addition to supplying a valuable diagnostic score, can serve as a means of visual representation of the spatial extent of neurodegeneration. The values of PSIC are computed by a dedicated deep neural network (DNN), which has been specially adapted to the processing of dMRI signals. Once available, such values can be used for several important purposes, including stratification of study subjects. In particular, experiments confirm the DNN-based classification can outperform a wide range of alternative approaches in application to the basic problem of stratification of cognitively normal (CN) and AD subjects. Notwithstanding its preliminary nature, this result suggests a strong rationale for further extension and improvement of the explorative methodology described in this paper.
71. A Poisson multi-Bernoulli mixture filter for coexisting point and extended targets [PDF]
Ángel F. García-Fernández, Jason L. Williams, Lennart Svensson, Yuxuan Xia
Abstract: This paper proposes a Poisson multi-Bernoulli mixture (PMBM) filter for coexisting point and extended targets. The PMBM filter provides a recursion to compute the filtering posterior based on single-target predictions and updates. In this paper, we first derive the PMBM filter update for a generalised measurement model, which can include measurements originated from point and extended targets. Second, we propose a single-target space that accommodates both point and extended targets and derive the filtering recursion that propagates Gaussian densities for single targets and gamma Gaussian inverse Wishart densities for extended targets. As a computationally efficient approximation of the PMBM filter, we also develop a Poisson multi-Bernoulli (PMB) filter for coexisting point and extended targets. The resulting filters are analysed via numerical simulations.
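As a rough illustration of one building block of such a recursion, the sketch below shows the standard multi-Bernoulli prediction step, where the existence probability decays with the survival probability and the single-target density is pushed through the motion model. The data structures are hypothetical, and the full PMBM prediction and update are considerably more involved.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Bernoulli:
    r: float      # existence probability of this hypothesised target
    state: Any    # single-target density (e.g. Gaussian, or GGIW for extended targets)

def predict_bernoulli(comp: Bernoulli,
                      p_survival: float,
                      predict_density: Callable[[Any], Any]) -> Bernoulli:
    """Standard multi-Bernoulli prediction: existence decays by the survival
    probability while the single-target density follows the motion model."""
    return Bernoulli(r=comp.r * p_survival, state=predict_density(comp.state))
```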
72. An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments [PDF]
Shrishti Saha Shetu, Soumitro Chakrabarty, Emanuël A. P. Habets
Abstract: Audio-visual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement, and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately using different sub-networks, and then the learned features are fused to utilize the information from both modalities. There have been various studies on suitable audio input features and network architectures; however, to the best of our knowledge, no published study has investigated which visual features are best suited for this specific task. In this work, we perform an empirical study of the most commonly used visual features for DNN-based AVSE and the pre-processing requirements for each of them, and investigate their influence on performance. Our study shows that despite the overall better performance of embedding-based features, their computationally intensive pre-processing makes them difficult to use in low-resource systems. For such systems, optical flow or raw pixel-based features might be better suited.
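As an example of the kind of low-cost visual feature the study compares, the snippet below extracts dense optical flow with OpenCV's Farneback method; the parameter values are common defaults, not the paper's settings.

```python
import cv2
import numpy as np

def optical_flow_features(prev_gray: np.ndarray, cur_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow between two grayscale frames (e.g. mouth-region
    crops) as a cheap alternative to embedding-based visual features."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, cur_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # (H, W, 2) per-pixel motion vectors
```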
73. CapWAP: Captioning with a Purpose [PDF]
Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan H. Clark, Regina Barzilay
Abstract: The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with a Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rather than merely provide generic information about an image. In this task, we use question-answer (QA) pairs---a natural expression of information need---from users, instead of reference captions, for both training and post-inference evaluation. We show that it is possible to use reinforcement learning to directly optimize for the intended information need, by rewarding outputs that allow a question answering model to provide correct answers to sampled user questions. We convert several visual question answering datasets into CapWAP datasets, and demonstrate that under a variety of scenarios our purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.
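A minimal sketch of the reward signal described above: a sampled caption is scored by whether a QA model can answer user questions from the caption alone. The `qa_model` callable and the exact reward shaping are assumptions based on the abstract.

```python
from typing import Callable, Iterable, Tuple

def capwap_reward(caption: str,
                  qa_pairs: Iterable[Tuple[str, str]],
                  qa_model: Callable[..., str]) -> float:
    """Fraction of sampled user questions a QA model answers correctly when
    given only the caption as context; usable as a REINFORCE-style reward."""
    qa_pairs = list(qa_pairs)
    correct = sum(qa_model(question=q, context=caption) == a for q, a in qa_pairs)
    return correct / max(len(qa_pairs), 1)
```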
74. PAMS: Quantized Super-Resolution via Parameterized Max Scale [PDF]
Huixia Li, Chenqian Yan, Shaohui Lin, Xiawu Zheng, Yuchao Li, Baochang Zhang, Fan Yang, Rongrong Ji
Abstract: Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR). However, their heavy memory cost and computation overhead significantly restrict their practical deployment on resource-limited devices; these costs mainly arise from the floating-point storage and operations between weights and activations. Although previous endeavors mainly resort to fixed-point operations, quantizing both weights and activations with fixed coding lengths may cause a significant performance drop, especially at low bit-widths. Specifically, most state-of-the-art SR models without batch normalization have a large dynamic quantization range, which serves as another cause of performance drop. To address these two issues, we propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies a trainable truncation parameter to adaptively explore the upper bound of the quantization range. Finally, a structured knowledge transfer (SKT) loss is introduced to fine-tune the quantized network. Extensive experiments demonstrate that the proposed PAMS scheme can effectively compress and accelerate existing SR models such as EDSR and RDN. Notably, 8-bit PAMS-EDSR improves PSNR on the Set5 benchmark from 32.095 dB to 32.124 dB with a 2.42$\times$ compression ratio, achieving a new state-of-the-art.
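The core idea named in the abstract, a trainable upper bound on the quantization range, can be sketched as a fake-quantization module with a straight-through estimator. The initialization and gradient details below are assumptions, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

class MaxScaleQuantizer(nn.Module):
    """Uniformly fake-quantize activations, truncating at a *learned* max
    scale alpha instead of a fixed one (a sketch of the PAMS idea)."""
    def __init__(self, n_bits: int = 8, init_alpha: float = 10.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # trainable upper bound
        self.levels = 2 ** (n_bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = self.alpha.abs()
        x = torch.maximum(torch.minimum(x, alpha), -alpha)  # truncate to [-alpha, alpha]
        step = alpha / self.levels
        x_q = torch.round(x / step) * step                  # uniform quantization
        return x + (x_q - x).detach()                       # straight-through estimator
```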
75. Geometric Structure Aided Visual Inertial Localization [PDF]
Huaiyang Huang, Haoyang Ye, Jianhao Jiao, Yuxiang Sun, Ming Liu
Abstract: Visual localization is an essential component of autonomous navigation. Existing approaches are based either on the visual structure from SLAM/SfM or on the geometric structure from dense mapping. To take advantage of both, in this work, we present a complete visual inertial localization system based on a hybrid map representation that reduces the computational cost and increases the positioning accuracy. Specifically, we propose two modules, for data association and batch optimization respectively. To this end, we develop an efficient data association module to associate map components with local features, which takes only 2 ms to generate temporal landmarks. For batch optimization, instead of using visual factors, we develop a module that estimates a pose prior from the instant localization results to constrain poses. The experimental results on the EuRoC MAV dataset demonstrate a competitive performance compared to the state of the art. In particular, our system achieves an average position error of 1.7 cm with 100% recall. The timings show that the proposed modules reduce the computational cost by 20-30%. We will make our implementation open source at this http URL.
76. Fine Perceptive GANs for Brain MR Image Super-Resolution in Wavelet Domain [PDF]
Senrong You, Yong Liu, Baiying Lei, Shuqiang Wang
Abstract: Magnetic resonance imaging plays an important role in computer-aided diagnosis and brain exploration. However, limited by hardware, scanning time, and cost, it is challenging to acquire high-resolution (HR) magnetic resonance (MR) images clinically. In this paper, fine perceptive generative adversarial networks (FP-GANs) are proposed to produce HR MR images from low-resolution counterparts. They cope with the detail-insensitivity of existing super-resolution models in a divide-and-conquer manner. Specifically, FP-GANs first divide an MR image into a low-frequency global approximation and high-frequency anatomical texture in the wavelet domain. Each sub-band generative adversarial network (sub-band GAN) then handles the super-resolution of its own sub-band image. Meanwhile, sub-band attention is deployed to tune the focus between global and texture information. It attends to sub-band images instead of feature maps to further enhance the anatomical reconstruction ability of FP-GANs. In addition, the inverse discrete wavelet transform (IDWT) is integrated into the model so that the reconstruction of the whole image is taken into account. Experiments on the MultiRes_7T dataset demonstrate that FP-GANs outperform the competing methods quantitatively and qualitatively.
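The wavelet-domain split described above can be sketched with PyWavelets: decompose a slice into one approximation and three detail sub-bands, super-resolve each with its own GAN, then recombine with the inverse DWT. The Haar basis below is an assumption; the paper may use a different wavelet.

```python
import numpy as np
import pywt

def wavelet_split(img: np.ndarray):
    """Split an MR slice into a low-frequency approximation (cA) and
    high-frequency detail sub-bands (cH, cV, cD)."""
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    return cA, (cH, cV, cD)

def wavelet_merge(cA: np.ndarray, details) -> np.ndarray:
    """Inverse DWT recombines the per-sub-band (super-resolved) outputs
    into a whole image."""
    return pywt.idwt2((cA, details), "haar")
```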
77. Kimera-Multi: a System for Distributed Multi-Robot Metric-Semantic Simultaneous Localization and Mapping [PDF]
Yun Chang, Yulun Tian, Jonathan P. How, Luca Carlone
Abstract: We present the first fully distributed multi-robot system for dense metric-semantic Simultaneous Localization and Mapping (SLAM). Our system, dubbed Kimera-Multi, is implemented by a team of robots equipped with visual-inertial sensors, and builds a 3D mesh model of the environment in real-time, where each face of the mesh is annotated with a semantic label (e.g., building, road, objects). In Kimera-Multi, each robot builds a local trajectory estimate and a local mesh using Kimera. Then, when two robots are within communication range, they initiate a distributed place recognition and robust pose graph optimization protocol with a novel incremental maximum clique outlier rejection; the protocol allows the robots to improve their local trajectory estimates by leveraging inter-robot loop closures. Finally, each robot uses its improved trajectory estimate to correct the local mesh using mesh deformation techniques. We demonstrate Kimera-Multi in photo-realistic simulations and real data. Kimera-Multi (i) is able to build accurate 3D metric-semantic meshes, (ii) is robust to incorrect loop closures while requiring less computation than state-of-the-art distributed SLAM back-ends, and (iii) is efficient, both in terms of computation at each robot as well as communication bandwidth.
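The maximum clique outlier rejection mentioned above keeps the largest set of mutually consistent inter-robot loop closures. Below is a simplified, non-incremental sketch of that idea using networkx; the pairwise consistency test is left as a caller-supplied predicate, and the paper's incremental variant adds further machinery.

```python
from typing import Callable, List, Sequence
import networkx as nx

def consistent_inliers(measurements: Sequence,
                       is_pairwise_consistent: Callable) -> List:
    """Build a consistency graph over candidate loop closures and keep a
    maximum clique: the largest mutually consistent subset of measurements."""
    g = nx.Graph()
    g.add_nodes_from(range(len(measurements)))
    for i in range(len(measurements)):
        for j in range(i + 1, len(measurements)):
            if is_pairwise_consistent(measurements[i], measurements[j]):
                g.add_edge(i, j)
    best = max(nx.find_cliques(g), key=len, default=[])  # maximal cliques, pick largest
    return [measurements[i] for i in best]
```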
78. Long Range Arena: A Benchmark for Efficient Transformers [PDF]
Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler
Abstract: Transformers do not scale very well to long sequence lengths, largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers have been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, LRA, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. LRA paves the way towards a better understanding of this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at this https URL.
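A back-of-the-envelope illustration of why long sequences are hard for vanilla Transformers: each attention head materializes an $L \times L$ attention matrix, so memory grows quadratically across the $1K$ to $16K$ token range LRA covers (the head count below is illustrative, not taken from the paper).

```python
def attention_matrix_floats(seq_len: int, heads: int = 8) -> int:
    """Entries in the per-layer attention matrices of vanilla self-attention."""
    return heads * seq_len * seq_len

for n in (1_000, 4_000, 16_000):
    # fp32 bytes = 4 * entries; at 16K tokens this is already ~8 GB per layer
    print(f"{n:>6} tokens -> {4 * attention_matrix_floats(n) / 1e9:.2f} GB")
```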
79. Real-time Surgical Environment Enhancement for Robot-Assisted Minimally Invasive Surgery Based on Super-Resolution [PDF]
Ruoxi Wang, Dandan Zhang, Qingbiao Li, Xiao-Yun Zhou, Benny Lo
Abstract: In Robot-Assisted Minimally Invasive Surgery (RAMIS), a camera assistant is normally required to control the position and zooming ratio of the laparoscope, following the surgeon's instructions. However, moving the laparoscope frequently may lead to unstable and suboptimal views, while the adjustment of zooming ratio may interrupt the workflow of the surgical operation. To this end, we propose a multi-scale Generative Adversarial Network (GAN)-based video super-resolution method to construct a framework for automatic zooming ratio adjustment. It can provide automatic real-time zooming for high-quality visualization of the Region Of Interest (ROI) during the surgical operation. In the pipeline of the framework, the Kernel Correlation Filter (KCF) tracker is used for tracking the tips of the surgical tools, while the Semi-Global Block Matching (SGBM) based depth estimation and Recurrent Neural Network (RNN)-based context-awareness are developed to determine the upscaling ratio for zooming. The framework is validated with the JIGSAW dataset and Hamlyn Centre Laparoscopic/Endoscopic Video Datasets, with results demonstrating its practicability.
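For the tool-tip tracking stage, OpenCV ships a KCF tracker that can stand in as a minimal example. It requires opencv-contrib-python, and on some OpenCV 4.5+ builds the constructor lives under cv2.legacy instead; the file names and initial ROI below are placeholders, not the paper's detector output.

```python
import cv2

# Minimal KCF tracking sketch; frame files and the initial bounding box
# are placeholders standing in for the pipeline's detections.
tracker = cv2.TrackerKCF_create()  # cv2.legacy.TrackerKCF_create() on some builds
frame0 = cv2.imread("frame0.png")
tracker.init(frame0, (100, 100, 40, 40))            # (x, y, w, h) around a tool tip
ok, box = tracker.update(cv2.imread("frame1.png"))  # ok is False if the tip is lost
```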
80. Learning-based 3D Occupancy Prediction for Autonomous Navigation in Occluded Environments [PDF]
Lizi Wang, Hongkai Ye, Qianhao Wang, Yuman Gao, Chao Xu, Fei Gao
Abstract: In autonomous navigation of mobile robots, sensors suffer from massive occlusion in cluttered environments, leaving a significant amount of space unknown during planning. In practice, treating the unknown space in either an optimistic or a pessimistic way sets limitations on planning performance, so aggressiveness and safety cannot be satisfied at the same time. However, humans can infer the exact shape of obstacles from only partial observation and generate non-conservative trajectories that avoid possible collisions in occluded space. Mimicking human behavior, in this paper, we propose a method based on a deep neural network to reliably predict the occupancy distribution of unknown space. Specifically, the proposed method utilizes contextual information of environments and learns from prior knowledge to predict obstacle distributions in occluded space. We use unlabeled, ground-truth-free data to train our network and successfully apply it to real-time navigation in unseen environments without any refinement. Results show that our method leverages the performance of a kinodynamic planner by improving safety with no reduction of speed in cluttered environments.
81. Cross-Modal Self-Attention Distillation for Prostate Cancer Segmentation [PDF]
Guokai Zhang, Xiaoang Shen, Ye Luo, Jihao Luo, Zeju Wang, Weigang Wang, Binghui Zhao, Jianwei Lu
Abstract: Automatic segmentation of prostate cancer from multi-modal magnetic resonance images is of critical importance for the initial staging and prognosis of patients. However, how to use multi-modal image features more efficiently is still a challenging problem in the field of medical image segmentation. In this paper, we develop a cross-modal self-attention distillation network that fully exploits the encoded information of the intermediate layers from different modalities; the extracted attention maps of the different modalities enable the model to transfer significant spatial information in more detail. Moreover, a novel spatially correlated feature fusion module is further employed to learn more complementary correlation and non-linear information across the modality images. We evaluate our model with five-fold cross-validation on 358 biopsy-confirmed MRI scans. Extensive experiment results demonstrate that our proposed network achieves state-of-the-art performance.
82. Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [PDF]
Christopher Clark, Mark Yatskar, Luke Zettlemoyer
Abstract: Many datasets have been shown to contain incidental correlations created by idiosyncrasies in the data collection process. For example, sentence entailment datasets can have spurious word-class correlations if nearly all contradiction sentences contain the word "not", and image recognition datasets can have tell-tale object-background correlations if dogs are always indoors. In this paper, we propose a method that can automatically detect and ignore these kinds of dataset-specific patterns, which we call dataset biases. Our method trains a lower capacity model in an ensemble with a higher capacity model. During training, the lower capacity model learns to capture relatively shallow correlations, which we hypothesize are likely to reflect dataset bias. This frees the higher capacity model to focus on patterns that should generalize better. We ensure the models learn non-overlapping approaches by introducing a novel method to make them conditionally independent. Importantly, our approach does not require the bias to be known in advance. We evaluate performance on synthetic datasets, and four datasets built to penalize models that exploit known biases on textual entailment, visual question answering, and image recognition tasks. We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
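A minimal sketch of the training signal for such an ensemble: the low- and high-capacity models are scored jointly so that the shallow model absorbs the biased shortcuts, and only the high-capacity model is kept at test time. Summing log-probabilities is one standard way to combine the two; the paper's conditional-independence mechanism adds further machinery not shown here.

```python
import torch
import torch.nn.functional as F

def mixed_capacity_loss(logits_low: torch.Tensor,
                        logits_high: torch.Tensor,
                        labels: torch.Tensor) -> torch.Tensor:
    """Joint cross-entropy over the ensemble of a low- and a high-capacity
    model; the shallow model soaks up dataset bias during training, and only
    logits_high is used at inference."""
    joint = F.log_softmax(logits_low, dim=-1) + F.log_softmax(logits_high, dim=-1)
    return F.cross_entropy(joint, labels)  # cross_entropy renormalizes the scores
```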
83. Autonomous Intruder Detection Using a ROS-Based Multi-Robot System Equipped with 2D-LiDAR Sensors [PDF]
Mashnoon Islam, Touhid Ahmed, Abu Tammam Bin Nuruddin, Mashuda Islam, Shahnewaz Siddique
Abstract: The application of autonomous mobile robots in robotic security platforms is becoming a promising field of innovation due to their adaptive capability of responding to potential disturbances perceived through a wide range of sensors. Researchers have proposed systems that focus on utilizing either a single mobile robot or a system of multiple cooperative robots. However, very few of the proposed works, particularly in the field of multi-robot systems, are completely dependent on LiDAR sensors for achieving various tasks. This is essential when other sensors on a robot fail to provide peak performance in particular conditions, such as a camera operating in the absence of light. This paper proposes a multi-robot system that is developed using ROS (Robot Operating System) for intruder detection in a single-range-sensor-per-robot scenario, with centralized processing of detections from all robots by our central bot MIDNet (Multiple Intruder Detection Network). This work is aimed at providing an autonomous multi-robot security solution for a warehouse in the absence of human personnel.
84. Multiscale Point Cloud Geometry Compression [PDF]
Jianqiang Wang, Dandan Ding, Zhu Li, Zhan Ma
Abstract: Recent years have witnessed the growth of point-cloud-based applications because of the realistic and fine-grained representation of 3D objects and scenes that point clouds provide. However, compressing sparse, unstructured, and high-precision 3D points for efficient communication is a challenging problem. In this paper, leveraging the sparse nature of point clouds, we propose a multiscale end-to-end learning framework that hierarchically reconstructs the 3D Point Cloud Geometry (PCG) via progressive re-sampling. The framework is developed on top of a sparse-convolution-based autoencoder for point cloud compression and reconstruction. For the input PCG, which has only the binary occupancy attribute, our framework translates it into a downscaled point cloud at the bottleneck layer that possesses both geometry and associated feature attributes. The geometric occupancy is then losslessly compressed using an octree codec, and the feature attributes are lossily compressed using a learned probabilistic context model. Compared to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based PCC (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG), our method achieves more than 40% and 70% BD-Rate (Bjontegaard Delta Rate) reduction, respectively. Its encoding runtime is comparable to that of G-PCC, which is only 1.5% of V-PCC's.
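A toy stand-in for the multiscale occupancy representation: each 2x downscaling marks a coarse voxel occupied if any of its eight children is, which is the hierarchy the codec reconstructs by progressive re-sampling. The learned sparse-convolution autoencoder and entropy model are, of course, not shown.

```python
import numpy as np

def voxel_downscale(occupancy: np.ndarray) -> np.ndarray:
    """2x downscale a binary (D, H, W) occupancy grid: a parent voxel is
    occupied if any of its eight children is occupied."""
    d, h, w = occupancy.shape
    blocks = occupancy[:d // 2 * 2, :h // 2 * 2, :w // 2 * 2]
    blocks = blocks.reshape(d // 2, 2, h // 2, 2, w // 2, 2)
    return blocks.any(axis=(1, 3, 5)).astype(np.uint8)
```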
85. Grading the Severity of Arteriolosclerosis from Retinal Arterio-venous Crossing Patterns [PDF]
Liangzhi Li, Manisha Verma, Bowen Wang, Yuta Nakashima, Ryo Kawasaki, Hajime Nagahara
Abstract: The status of retinal arteriovenous crossings is of great significance for the clinical evaluation of arteriolosclerosis and systemic hypertension. As an ophthalmic diagnostic criterion, Scheie's classification has been used to grade the severity of arteriolosclerosis. In this paper, we propose a deep learning approach to support the diagnosis process, which, to the best of our knowledge, is one of the earliest attempts in medical imaging. The proposed pipeline is three-fold. First, we adopt segmentation and classification models to automatically obtain vessels in a retinal image with the corresponding artery/vein labels and find candidate arteriovenous crossing points. Second, we use a classification model to validate the true crossing points. Finally, the severity grade of each vessel crossing is classified. To better address the problems of label ambiguity and imbalanced label distribution, we propose a new model, named multi-diagnosis team network (MDTNet), in which sub-models with different structures or different loss functions provide different decisions. MDTNet unifies these diverse theories to give the final decision with high accuracy. Our severity grading method was able to validate crossing points with precision and recall of 96.3% and 96.3%, respectively. Among correctly detected crossing points, the kappa value for the agreement between the grading by a retina specialist and the estimated score was 0.85, with an accuracy of 0.92. The numerical results demonstrate that our method can achieve good performance in both the arteriovenous crossing validation and severity grading tasks. With the proposed models, we can build a pipeline that reproduces a retina specialist's subjective grading without explicit feature extraction. The code is available for reproducibility.
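The agreement figures quoted above (kappa, accuracy, precision, recall) are standard metrics; a small sketch of how they can be computed with scikit-learn, using invented label arrays in place of the paper's data:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_score, recall_score)

# Hypothetical severity grades from a retina specialist vs. the model
# (the paper reports kappa = 0.85 and accuracy = 0.92 on real data).
specialist = [0, 1, 2, 3, 1, 2, 0, 3, 2, 1]
model      = [0, 1, 2, 3, 1, 1, 0, 3, 2, 2]
print("kappa:", cohen_kappa_score(specialist, model))
print("accuracy:", accuracy_score(specialist, model))

# Crossing-point validation is a binary task, so precision/recall apply.
crossing_true = [1, 1, 0, 1, 0, 1]
crossing_pred = [1, 1, 0, 1, 1, 1]
print("precision:", precision_score(crossing_true, crossing_pred))
print("recall:", recall_score(crossing_true, crossing_pred))
```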
86. Robustness and Diversity Seeking Data-Free Knowledge Distillation [PDF]
Pengchao Han, Jihong Park, Shiqiang Wang, Yejun Liu
Abstract: Knowledge distillation (KD) has enabled remarkable progress in model compression and knowledge transfer. However, KD requires a large volume of original data, or their representation statistics, which are usually unavailable in practice. Data-free KD has recently been proposed to resolve this problem, wherein teacher and student models are fed by a synthetic sample generator trained from the teacher. Nonetheless, existing data-free KD methods rely on fine-tuning of weights to balance multiple losses and ignore the diversity of generated samples, resulting in limited accuracy and robustness. To overcome this challenge, we propose robustness- and diversity-seeking data-free KD (RDSKD) in this paper. The generator loss function is crafted to produce samples with high authenticity, class diversity, and inter-sample diversity. Without real data, the objectives of seeking high sample authenticity and high class diversity often conflict with each other, causing frequent loss fluctuations. We mitigate this by exponentially penalizing loss increments. On the MNIST, CIFAR-10, and SVHN datasets, our experiments show that RDSKD achieves higher accuracy and greater robustness across different hyperparameter settings than other data-free KD methods such as DAFL, MSKD, ZSKD, and DeepInversion.
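The abstract does not give the exact form of the exponential penalty on loss increments; one plausible reading, sketched in PyTorch with an assumed coefficient beta, penalizes only rises over the previous iteration's generator loss:

```python
import torch

def penalized_generator_loss(loss, prev_loss, beta=1.0):
    # Hedged reading of "exponentially penalizing loss increments": an
    # increase over the previous iteration's loss is penalized
    # exponentially; a decrease adds nothing. beta is an assumption.
    increment = torch.clamp(loss - prev_loss.detach(), min=0.0)
    return loss + torch.exp(beta * increment) - 1.0

# Toy usage inside a generator update step.
prev = torch.tensor(1.0)
gen_loss = torch.tensor(1.3, requires_grad=True)
penalized_generator_loss(gen_loss, prev).backward()
```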
87. Interventional Domain Adaptation [PDF]
Jun Wen, Changjian Shui, Kun Kuang, Junsong Yuan, Zenan Huang, Zhefeng Gong, Nenggan Zheng
Abstract: Domain adaptation (DA) aims to transfer discriminative features learned from a source domain to a target domain. Most DA methods focus on enhancing feature transferability through domain-invariance learning. However, the source-learned discriminability itself might be biased and unsafely transferable due to spurious correlations, i.e., some source-specific features are correlated with category labels. We find that standard domain-invariance learning suffers from such correlations and incorrectly transfers the source-specifics. To address this issue, we intervene in the learning of feature discriminability, using unlabeled target data to guide it to shed the domain-specific part and become safely transferable. Concretely, we generate counterfactual features that distinguish the domain-specifics from the domain-sharable part through a novel feature intervention strategy. To prevent domain-specifics from persisting, the feature discriminability is trained to be invariant to mutations in the domain-specifics of the counterfactual features. In experiments on typical one-to-one unsupervised domain adaptation and challenging domain-agnostic adaptation tasks, the consistent performance improvements of our method over state-of-the-art approaches validate that the learned discriminative features are more safely transferable and generalize well to novel domains.
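The feature intervention itself is only named in the abstract; a hedged sketch of one way to generate counterfactual features is to permute a presumed set of domain-specific channels across the batch while keeping the sharable ones (which channels are domain-specific is an assumption here, whereas the paper distinguishes them with a learned strategy):

```python
import torch

def counterfactual_features(feats, domain_channels):
    # Shuffle the presumed domain-specific channels across the batch so the
    # domain-specifics are mutated while the sharable part stays intact.
    perm = torch.randperm(feats.size(0))
    cf = feats.clone()
    cf[:, domain_channels] = feats[perm][:, domain_channels]
    return cf

feats = torch.randn(8, 256)  # a batch of pooled features
cf = counterfactual_features(feats, torch.arange(200, 256))
# Training discriminability to be invariant between feats and cf would then
# discourage reliance on the domain-specific channels.
```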
88. Data-driven Image Restoration with Option-driven Learning for Big and Small Astronomical Image Datasets [PDF]
Peng Jia, Ruiyu Ning, Ruiqi Sun, Xiaoshan Yang, Dongmei Cai
Abstract: Image restoration methods are commonly used to improve the quality of astronomical images. In recent years, the development of deep neural networks and the growing number of astronomical images have given rise to many data-driven image restoration methods. However, most of these methods are supervised learning algorithms, which require paired images, either from real observations or from simulated data, as a training set. For some applications it is hard to obtain enough paired images from real observations, and simulated images can differ considerably from real observed ones. In this paper, we propose a new data-driven image restoration method based on generative adversarial networks with option-driven learning. Our method uses several high-resolution images as references and applies different learning strategies depending on the number of reference images. For sky surveys with variable observation conditions, our method obtains very stable image restoration results regardless of the number of reference images.
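As a hedged reading of "option-driven learning", the training strategy could be dispatched on the number of available reference images; the thresholds and option names below are purely illustrative assumptions, not the paper's values:

```python
def choose_option(n_refs):
    # Illustrative dispatch only; the paper's actual options and
    # thresholds are not specified in the abstract.
    if n_refs == 0:
        return "no-reference option: restoration from generic priors"
    if n_refs < 20:
        return "few-reference option: light fine-tuning of the GAN"
    return "big-data option: full adversarial training on references"
```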
89. Deeply-Supervised Density Regression for Automatic Cell Counting in Microscopy Images [PDF]
Shenghua He, Kyaw Thu Minn, Lilianna Solnica-Krezel, Mark A. Anastasio, Hua Li
Abstract: Accurately counting the number of cells in microscopy images is required in many medical diagnoses and biological studies. This task is tedious, time-consuming, and prone to subjective errors. However, designing automatic counting methods remains challenging due to low image contrast, complex backgrounds, large variance in cell shapes and counts, and significant cell occlusions in two-dimensional microscopy images. In this study, we propose a new density regression-based method for automatically counting cells in microscopy images. The proposed method introduces two innovations compared to other state-of-the-art density regression-based methods. First, the density regression model (DRM) is designed as a concatenated fully convolutional regression network (C-FCRN) that employs multi-scale image features to estimate cell density maps from given images. Second, auxiliary convolutional neural networks (AuxCNNs) are employed to assist in the training of intermediate layers of the designed C-FCRN and improve the DRM's performance on unseen datasets. Experiments on four datasets demonstrate the superior performance of the proposed method.
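The defining property of density-regression counting is that the predicted map integrates to the cell count; a minimal PyTorch head illustrating this idea (not the paper's C-FCRN or its AuxCNN training scheme):

```python
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    """Maps backbone features to a non-negative density map; the count is
    the spatial integral (sum) of that map."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.ReLU())  # densities stay >= 0

    def forward(self, feats):
        density = self.head(feats)           # (N, 1, H, W)
        count = density.sum(dim=(1, 2, 3))   # per-image cell count
        return density, count

density, count = DensityHead()(torch.randn(2, 64, 32, 32))
```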
90. Strawberry Detection Using a Heterogeneous Multi-Processor Platform [PDF]
Samuel Brandenburg, Pedro Machado, Nikesh Lama, T.M. McGinnity
Abstract: Over the last few years, the number of precision farming projects has increased, particularly in harvesting robots, many of which have made continued progress from identifying crops to grasping the desired fruit or vegetable. One of the most common issues found in precision farming projects is that successful application depends heavily not just on identifying the fruit but also on ensuring that localisation allows for accurate navigation. These issues become significant factors when the robot is not operating in a prearranged environment, or when vegetation becomes too thick and covers the crop. Moreover, running a state-of-the-art deep learning algorithm on an embedded platform is also very challenging, most often resulting in low frame rates. This paper proposes using the You Only Look Once version 3 (YOLOv3) Convolutional Neural Network (CNN), in combination with image processing techniques, for precision farming robots targeting strawberry detection, accelerated on a heterogeneous multiprocessor platform. The results show a five-fold performance acceleration when the algorithm is implemented on a Field-Programmable Gate Array (FPGA) compared with the same algorithm running on the processor side, with an accuracy of 78.3% over a test set of 146 images.
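For the processor-side baseline, a YOLOv3 forward pass can be run with OpenCV's DNN module; the configuration and weight file names below are hypothetical stand-ins for a strawberry-trained model, and this sketch does not reflect the paper's FPGA implementation.

```python
import cv2

# Hedged sketch of a CPU-side YOLOv3 inference; file names are assumptions.
net = cv2.dnn.readNetFromDarknet("yolov3-strawberry.cfg",
                                 "yolov3-strawberry.weights")
img = cv2.imread("strawberry.jpg")
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
# Each output row holds box coordinates followed by class confidences.
outputs = net.forward(net.getUnconnectedOutLayersNames())
```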
91. Motion Prediction on Self-driving Cars: A Review [PDF]
Shahrokh Paravarzar, Belqes Mohammad
Abstract: The autonomous vehicle motion prediction literature is reviewed. Motion prediction is among the most challenging tasks for autonomous and self-driving cars, and these challenges are discussed. The state of the art is then reviewed based on the most recent literature, together with the current open problems. The state of the art comprises classical and physics-based methods, deep learning networks, and reinforcement learning. The pros and cons of these methods, and the gaps in the research, are presented in this review. Finally, the literature on object tracking and motion is surveyed. Overall, deep reinforcement learning emerges as the most promising candidate for tackling motion prediction in self-driving cars.
92. Unmasking Communication Partners: A Low-Cost AI Solution for Digitally Removing Head-Mounted Displays in VR-Based Telepresence [PDF]
Philipp Ladwig, Alexander Pech, Ralf Dörner, Christian Geiger
Abstract: Face-to-face conversation in Virtual Reality (VR) is a challenge when participants wear head-mounted displays (HMDs). A significant portion of a participant's face is hidden, and facial expressions are difficult to perceive. Past research has shown that high-fidelity face reconstruction with personal avatars in VR is possible under laboratory conditions with high-cost hardware. In this paper, we propose one of the first low-cost systems for this task, using only open-source free software and affordable hardware. Our approach is to track the user's face underneath the HMD with a Convolutional Neural Network (CNN) and to generate corresponding expressions with Generative Adversarial Networks (GANs) that produce RGBD images of the person's face. We use commodity hardware with low-cost extensions such as 3D-printed mounts and miniature cameras. Our approach learns end-to-end without manual intervention, runs in real time, and can be trained and executed on an ordinary gaming computer. We report evaluation results showing that our low-cost system does not achieve the same fidelity as research prototypes that use high-end hardware and closed-source software, but it is capable of creating individual facial avatars with person-specific characteristics in movements and expressions.
93. HDR Imaging with Quanta Image Sensors: Theoretical Limits and Optimal Reconstruction [PDF]
Abhiram Gnanasambandam, Stanley H. Chan
Abstract: High dynamic range (HDR) imaging is one of the biggest achievements in modern photography. Traditional solutions to HDR imaging are designed for, and applied to, CMOS image sensors (CIS). However, mainstream one-micron CIS cameras today generally have high read noise and low frame rates, which in turn limit acquisition speed and quality, making the cameras slow in HDR mode. In this paper, we propose a new computational photography technique for HDR imaging. Recognizing the limitations of CIS, we use the Quanta Image Sensor (QIS) to trade spatial-temporal resolution for bit-depth. QIS is a single-photon image sensor with a pixel pitch comparable to CIS but substantially lower dark current and read noise. We provide a complete theoretical characterization of the sensor in the context of HDR imaging by proving the fundamental limits of the dynamic range that QIS can offer and the trade-offs with noise and speed. In addition, we derive an optimal reconstruction algorithm for single-bit and multi-bit QIS. Our algorithm is theoretically optimal among all linear reconstruction schemes based on exposure bracketing. Experimental results confirm the validity of the theory and algorithm, based on synthetic and real QIS data.
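A toy simulation of the single-bit QIS measurement model, assuming Poisson photon arrivals and a threshold of one photon per jot; for this case the maximum-likelihood inversion is closed-form, which makes the wide dynamic range easy to see (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(0)
JOTS = 64  # assumed number of binary jots contributing to one output pixel

def qis_single_bit(flux):
    # Each jot sees Poisson arrivals with mean flux/JOTS and reports 1
    # when at least one photon arrives; return the fraction of 1-bits.
    photons = rng.poisson(flux[..., None] / JOTS, size=flux.shape + (JOTS,))
    return (photons >= 1).mean(axis=-1)

# For q = 1, P(bit = 1) = 1 - exp(-lambda), so the flux has a closed-form
# maximum-likelihood estimate from the observed bit density.
scene = np.array([2.0, 20.0, 60.0])              # hypothetical pixel fluxes
p1 = np.clip(qis_single_bit(scene), 1e-6, 1 - 1e-6)
flux_hat = -np.log(1.0 - p1) * JOTS              # per-pixel flux estimate
print(scene, flux_hat)
```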
94. Chest X-ray Image Phase Features for Improved Diagnosis of COVID-19 Using Convolutional Neural Network [PDF]
Xiao Qi, Lloyd Brown, David J. Foran, Ilker Hacihaliloglu
Abstract: Recently, the outbreak of the novel Coronavirus disease 2019 (COVID-19) pandemic has seriously endangered human health and life. Due to the limited availability of test kits, the need for auxiliary diagnostic approaches has increased. Recent research has shown that radiography of COVID-19 patients, such as CT and X-ray, contains salient information about the COVID-19 virus and could be used as an alternative diagnostic method. Chest X-ray (CXR), with its fast imaging time, wide availability, low cost, and portability, has gained much attention and is very promising. Computational methods with high accuracy and robustness are required for rapid triaging of patients and to aid radiologists in interpreting the collected data. In this study, we design a novel multi-feature convolutional neural network (CNN) architecture for improved multi-class classification of COVID-19 from CXR images. CXR images are enhanced using a local phase-based image enhancement method. The enhanced images, together with the original CXR data, are used as input to our proposed CNN architecture. Using ablation studies, we show the effectiveness of the enhanced images in improving the diagnostic accuracy. We provide quantitative evaluation on two datasets and qualitative results for visual inspection. Quantitative evaluation is performed on data consisting of 8,851 normal (healthy), 6,045 pneumonia, and 3,323 COVID-19 CXR scans. On Dataset-1, our model achieves 95.57% average accuracy for three-class classification, and 99% precision, recall, and F1-scores for COVID-19 cases. On Dataset-2, we obtain 94.44% average accuracy, and 95% precision, recall, and F1-scores for the detection of COVID-19. Conclusions: Our proposed multi-feature guided CNN achieves improved results compared to a single-feature CNN, proving the importance of the local phase-based CXR image enhancement.
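A hedged sketch of the multi-feature idea, with two small encoder streams for the original and the phase-enhanced CXR fused before a three-class head; this is illustrative only and not the paper's architecture:

```python
import torch
import torch.nn as nn

class TwoStreamCXR(nn.Module):
    """Encodes the raw and the enhanced CXR separately, then fuses them."""
    def __init__(self, n_classes=3):  # normal / pneumonia / COVID-19
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(8), nn.Flatten())
        self.raw, self.enhanced = stream(), stream()
        self.classifier = nn.Linear(2 * 16 * 8 * 8, n_classes)

    def forward(self, x_raw, x_enh):
        z = torch.cat([self.raw(x_raw), self.enhanced(x_enh)], dim=1)
        return self.classifier(z)

logits = TwoStreamCXR()(torch.randn(1, 1, 256, 256),
                        torch.randn(1, 1, 256, 256))
```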